Data race in native scheduler leads to `Scheduler has been terminated or reset`
A data race on the file `WORKERS_CPU_CORES_TMP_FILE`, which `get_workers_cpu_cores()` and `reserve_workers_cpu_cores()` from `global_config.py` open for reading and writing, leads to job corruption with the error `Scheduler has been terminated or reset`.
The log file `/opt/bin/klever-work/native-scheduler/info.log` shows the exception:
2023-10-18 17:50:21,852 (resource_scheduler.py:225) root INFO> Submit information about the workload to Bridge
2023-10-18 17:50:23,566 (__init__.py:399) root ERROR> An error occurred:
Traceback (most recent call last):
  File "/home/gitlab-runner/builds/3oezZwqq/0/verification/klever/venv/lib/python3.10/site-packages/klever/scheduler/schedulers/__init__.py", line 339, in launch
    tasks_to_start, jobs_to_start = self.runner.schedule(pending_tasks, pending_jobs)
  File "/home/gitlab-runner/builds/3oezZwqq/0/verification/klever/venv/lib/python3.10/site-packages/klever/scheduler/schedulers/native.py", line 162, in schedule
    new_tasks, new_jobs = self._manager.schedule(pending_tasks, pending_jobs)
  File "/home/gitlab-runner/builds/3oezZwqq/0/verification/klever/venv/lib/python3.10/site-packages/klever/scheduler/schedulers/resource_scheduler.py", line 275, in schedule
    cur_max_tasks = self.__max_tasks - get_workers_cpu_cores()
  File "/home/gitlab-runner/builds/3oezZwqq/0/verification/klever/venv/lib/python3.10/site-packages/klever/scheduler/schedulers/global_config.py", line 52, in get_workers_cpu_cores
    temp_prev_val = struct.unpack('b', fd.read(1))[0]
struct.error: unpack requires a buffer of 1 bytes
2023-10-18 17:50:23,575 (native.py:411) root INFO> Going to cancel execution of the job 25e7ee2d-8a25-4c4b-a7f5-574db723eada
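
The `struct.error` above is consistent with the reader seeing an empty file: `fd.read(1)` returns `b''` and `struct.unpack('b', ...)` fails. Below is a minimal sketch of that failure mode and of one possible guard using `fcntl.flock`. It is not the actual `global_config.py` code. The file path, the function names `read_cores_unsafe`, `write_cores_locked` and `read_cores_locked`, and the assumption that the writer truncates the file on open are all hypothetical and used only for illustration.

```python
import fcntl
import os
import struct
import tempfile

# Hypothetical stand-in for WORKERS_CPU_CORES_TMP_FILE; the real path is not shown in this report.
CORES_FILE = os.path.join(tempfile.mkdtemp(), "workers_cpu_cores")


def read_cores_unsafe(path):
    """Read one signed byte without any locking, as line 52 of the traceback does."""
    with open(path, "rb") as fd:
        return struct.unpack("b", fd.read(1))[0]


def write_cores_locked(path, value):
    """Write the counter under an exclusive lock, without truncating on open."""
    # O_CREAT without O_TRUNC: the file is never emptied before the lock is held.
    raw_fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(raw_fd, "r+b") as fd:
        fcntl.flock(fd, fcntl.LOCK_EX)
        fd.seek(0)
        fd.write(struct.pack("b", value))
        fd.truncate()  # cut the file at the current position (1 byte)
        fd.flush()
        # The lock is released when the file is closed.


def read_cores_locked(path):
    """Read the counter under a shared lock; treat an empty file as 0."""
    with open(path, "rb") as fd:
        fcntl.flock(fd, fcntl.LOCK_SH)
        data = fd.read(1)
    return struct.unpack("b", data)[0] if data else 0


# Simulate the race window (assumption): a writer has opened the file with "wb",
# which truncates it, but has not written the new value yet.
open(CORES_FILE, "wb").close()
try:
    read_cores_unsafe(CORES_FILE)
    race_error = None
except struct.error as exc:
    race_error = str(exc)  # same kind of error as in info.log

# With the locked variants the reader never observes a half-written file.
write_cores_locked(CORES_FILE, 3)
cores = read_cores_locked(CORES_FILE)
```

Advisory locks only help if every reader and writer of the file takes them. An alternative under the same assumptions is to write the new value to a temporary file and rename it over the target with `os.replace`, so readers see either the old or the new content.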