Integration with Slurm#
Overview#
Cerebras has streamlined its appliance workflow by integrating Slurm, allowing for a well-orchestrated division of responsibilities between Slurm and Kubernetes (k8s) for resource management. This integration introduces “surrogate jobs” to enhance job tracking and management.
Users can deploy their run.py script using two different Slurm commands, salloc and sbatch, each catering to different needs:
Using salloc#
When the user submits their job using salloc, it requests and obtains the necessary resource allocation from Slurm. After acquiring the allocation, the run.py script is executed. Importantly, relevant surrogate jobs are automatically created as part of the workflow when run.py is invoked through salloc. These surrogate jobs help with resource tracking and accounting.
Using sbatch#
Alternatively, the user can submit their job through sbatch, which invokes a bash script. This bash script, in turn, calls the run.py script. Similar to the salloc method, the relevant surrogate jobs are also automatically generated when run.py is invoked through sbatch. These surrogate jobs serve the same purpose of aiding in resource tracking and accounting.
It’s important to note that the surrogate jobs are a crucial part of the Cerebras appliance workflow and are automatically created when run.py is submitted via either salloc or sbatch. However, these surrogate jobs will not be generated if run.py is submitted using the srun command.
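The three submission paths can be summarized as follows. This is a minimal sketch; the full sbatch and salloc examples appear later in this section.
# Interactive allocation; run.py invoked inside it creates surrogate job steps.
salloc --cpus-per-task 40 --tasks 1

# Batch script that calls run.py; surrogate job steps are created as well.
sbatch sbatch.sh

# Direct srun invocation still runs run.py, but creates no surrogate jobs.
srun python run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 --model_dir model_dir/ --num_wgt_servers 2 --mode train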
Surrogate jobs#
A surrogate job, in the context of the Cerebras Wafer-Scale cluster and Slurm integration, is essentially a job step within the Slurm cluster that represents the execution of an appliance job on the Cerebras Wafer-Scale cluster. Here are key characteristics and behaviors of surrogate jobs:
- Representation of Appliance Jobs:
A surrogate job serves as a representation or surrogate for an appliance job that is running on the Cerebras cluster.
- Creation and Termination:
Surrogate jobs are automatically created in the Slurm cluster when the corresponding appliance job is in the “running” state on the Cerebras cluster.
- Completion States:
A surrogate job can conclude with one of two states:
  - COMPLETED: This state indicates that the associated appliance job completed successfully without errors.
  - FAILED: This state signifies that the associated appliance job encountered issues or errors during execution and did not complete successfully.
- Naming Convention:
Jobs that require the allocation of CS-Xs, which are Cerebras-specific hardware resources, may have a naming suffix of “-csxN,” where “N” represents the number of allocated CS-Xs. This naming convention helps identify the resource allocation associated with the appliance job.
The runtime of a surrogate job should match the runtime of its appliance job.
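You can verify this with sacct by comparing each surrogate step's elapsed time against the appliance job's runtime. A minimal sketch, where <slurm_job_id> is a placeholder for the top-level Slurm job ID:
# List each surrogate job step with its elapsed runtime and state.
sacct -j <slurm_job_id> --format="JobID,JobName%34,Elapsed,State"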
Submit jobs with sbatch#
As an example, we will train GPT2 small with Weight Streaming using the PyTorch implementation in Cerebras Model Zoo.
Once you clone the Cerebras Model Zoo, you will find the following contents in the folder <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/:
configs/
params_gpt2_small.yaml
input/
data.py
gpt2_model.py
model.py
run.py
...
Here is an example of an sbatch.sh script which invokes the run.py located at <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/. Assume that you have set up a Cerebras virtual environment at $HOME/venv_cerebras_pt. The run.py will start a compile and a train appliance job for a PyTorch GPT2 small test:
#!/bin/bash
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/run-%j.out
source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
python \
run.py CSX \
--params configs/params_gpt2_small.yaml \
--num_csx 1 \
--model_dir model_dir/ \
--num_wgt_servers 2 \
--mode train
Invoke this script from Slurm with the command sbatch sbatch.sh. While the script is running, the sacct command shows the surrogate jobs starting up:
sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
JobID JobName Account State CPUTime AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
65 gpt2-small-test RUNNING 01:09:20 40 1 0:0
65.batch batch RUNNING 01:09:20 40 1 0:0
65.0 wsjob-vpd7iavtknrtvgonhe7mog COMPLETED 00:02:40 40 1 0:0
65.1 wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 RUNNING 00:02:40 40 1 0:0
In this example,
- The gpt2-small-test job is the Slurm-specific top-level job.
- The batch job step is a Slurm-specific job step.
- The surrogate job step wsjob-vpd7iavtknrtvgonhe7mog is created for the compile appliance job, and it has completed. Since it is a compile job, no CS-Xs are required, so there is no -csxN suffix.
- The surrogate job step wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 is created for the train appliance job, and it is in the running state. The suffix -csx1 indicates that 1 CS-X is allocated to the job.
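You can also query a single surrogate step directly by its step ID. The following is a hedged example reusing job step 65.1 from the listing above; the sacct fields shown are standard.
# Inspect only the train surrogate step (step 1 of job 65 in the listing above).
sacct -j 65.1 --format="JobID,JobName%34,State,Elapsed,ExitCode"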
Submit jobs with salloc#
The above run.py can also be invoked using salloc:
salloc --cpus-per-task 40 --tasks 1
source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
python \
run.py CSX \
--params configs/params_gpt2_small.yaml \
--num_csx 1 \
--model_dir model_dir/ \
--num_wgt_servers 2 \
--mode train
While this is running, the sacct command shows the corresponding surrogate jobs:
sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
JobID JobName Account State CPUTime AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
68 interactive RUNNING 00:03:30 2 1 0:0
68.0 wsjob-lbrjprpjuj2dfsbfsebdq8 COMPLETED 00:00:08 2 1 0:0
68.1 wsjob-dazjdtytvfn4njtcbchsik-csx1 RUNNING 00:00:06 2 1 0:0
In this example,
- The interactive job is the Slurm-specific top-level job.
- The surrogate job step wsjob-lbrjprpjuj2dfsbfsebdq8 is created for the compile appliance job, and it has completed.
- The surrogate job step wsjob-dazjdtytvfn4njtcbchsik-csx1 is created for the train appliance job, and it is in the running state.
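Once run.py finishes, release the interactive allocation. A minimal sketch, assuming the allocation from the example above (job ID 68):
# Leave the shell created by salloc to release the allocation,
exit
# or cancel the allocation by its job ID from another terminal.
scancel 68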
Time limits#
We support two types of time limits:
Time limit in Slurm#
This time limit defines the runtime limit for the sbatch or salloc job. It includes the time the underlying appliance jobs spend in the appliance queue. An example of enabling this time limit is as follows.
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2/run-%j.out
#SBATCH --time=0:60
#SBATCH --signal=TERM@30
This sets a 60-second time limit for the sbatch job. The --signal=TERM@30 directive asks Slurm to send SIGTERM 30 seconds before the limit is reached, so the job can shut down cleanly.
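The same Slurm-level limit applies to interactive runs; a hedged sketch using standard salloc options:
# 60-second Slurm limit (including appliance queue time), with SIGTERM sent 30 seconds before it expires.
salloc --cpus-per-task 40 --tasks 1 --time=0:60 --signal=TERM@30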
Time limit in the appliance#
This time limit defines the runtime limit for all appliance jobs in a run.py invocation. It does not count the time the appliance jobs spend in the appliance queue. The limit can be specified through the run.py command-line argument --job_time_sec, as in the following example.
source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
python run.py \
CSX \
--params configs/params_gpt2_small.yaml \
--num_csx 1 \
--model_dir model_dir/ \
--num_wgt_servers 2 \
--mode train \
--job_time_sec 60
This sets a 60-second timeout on the appliance jobs for this run.py invocation.
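The two limits are independent and can be combined in one submission. Below is a hedged sketch of an sbatch script that applies both the Slurm-level limit (which includes appliance queue time) and the appliance-level limit (which excludes it); the limit values are illustrative.
#!/bin/bash
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --time=30:00              # Slurm limit: 30 minutes, including appliance queue time
#SBATCH --signal=TERM@30          # ask Slurm to send SIGTERM 30 seconds before the limit

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/models/nlp/gpt2
python run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train \
    --job_time_sec 1500           # appliance limit: 25 minutes, excluding queue time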
Conclusion#
The integration of Slurm with the Cerebras appliance workflow marks a significant advancement in resource management and job scheduling for deep learning tasks. By utilizing surrogate jobs, users can efficiently monitor and manage their workloads on the Cerebras Wafer-Scale Engine, ensuring that resources are optimally allocated and utilized. Whether through salloc or sbatch, the workflow allows for precise tracking and control over job execution, enhancing the overall efficiency and predictability of the training process. This integration not only streamlines the job submission process but also provides a robust framework for scaling and managing complex machine learning workloads in a high-performance computing environment.