Integration with Slurm#
In 1.8.0, we introduced a lightweight integration with Slurm in our appliance flow. At a high level, Slurm manages the user node resources, while Kubernetes (k8s) manages the appliance resources. The appliance jobs are exposed as additional job steps in Slurm to help with resource tracking and accounting. We call these job steps surrogate jobs.
You can submit your run.py through two different Slurm commands:
- salloc, which calls run.py after getting the resource allocation.
- sbatch, which invokes a bash script that calls run.py.
The relevant surrogate jobs are created automatically after run.py is invoked.
Surrogate jobs are not created when run.py is submitted with srun.
Surrogate jobs#
A surrogate job is a job step on the Slurm cluster that represents an appliance job running on a Cerebras cluster.
Currently, a surrogate job is added to the Slurm cluster when the appliance job enters the running state, and it ends in either the COMPLETED or FAILED state, depending on the appliance job's status. Jobs that require CS-2s have a name suffix of -csxN, where N is the number of CS-2s allocated.
The runtime of a surrogate job should match the runtime of its appliance job.
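Because surrogate jobs are ordinary Slurm job steps, you can inspect them with standard Slurm accounting tools. The following is a minimal sketch using only standard sacct options, assuming a Slurm job ID of 65 and assuming the surrogate step names keep the wsjob- prefix shown in the examples below:

# List all steps of Slurm job 65 and keep only the surrogate steps
# (their names start with the appliance job prefix "wsjob-").
$ sacct -j 65 --format="JobID,JobName%34,State,Elapsed,ExitCode" | grep wsjob-

Steps whose names end in -csxN correspond to appliance jobs that occupy N CS-2 systems; steps without the suffix (for example, compile jobs) do not occupy a CS-2.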
Submit jobs with sbatch#
For this example, we will train GPT2 small in Weight Streaming using the PyTorch implementation in the Cerebras Model Zoo. After you Clone Cerebras Model Zoo, the folder <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/ contains:
configs/
    params_gpt2_small.yaml
input/
data.py
gpt2_model.py
model.py
run.py
...
Here is an example of an sbatch.sh script that invokes the run.py located at <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/.
Assume that you have Set up Cerebras's virtual environment at $HOME/venv_cerebras_pt.
The run.py script starts a compile and a train appliance job for a PyTorch GPT2 small test:
#!/bin/bash
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out
source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
python \
    run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train
Submit this script to Slurm with sbatch sbatch.sh. While it is running, the sacct command shows the surrogate jobs starting up:
$ sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
JobID JobName Account State CPUTime AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
65 gpt2-small-test RUNNING 01:09:20 40 1 0:0
65.batch batch RUNNING 01:09:20 40 1 0:0
65.0 wsjob-vpd7iavtknrtvgonhe7mog COMPLETED 00:02:40 40 1 0:0
65.1 wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 RUNNING 00:02:40 40 1 0:0
In this example:
- gpt2-small-test is the Slurm-specific top-level job.
- The batch job step is a Slurm-specific job step.
- The surrogate job step wsjob-vpd7iavtknrtvgonhe7mog was created for the compile appliance job and has completed. Since it is a compile job, no CS-2s are required, so there is no -csxN suffix.
- The surrogate job step wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 was created for the train appliance job and is in the running state. The suffix -csx1 indicates that 1 CS-2 is allocated to the job.
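If you want the submission to block until the training run, and therefore all of its surrogate job steps, has finished, the standard --wait flag of sbatch can be used. This is one possible pattern built on standard Slurm commands, not part of the appliance integration itself; the job ID 65 is taken from the example output above:

# Submit and block until the batch script exits; sbatch then returns the
# script's exit code, so the shell can react to failures.
$ sbatch --wait sbatch.sh && echo "training script finished successfully"

# Afterwards, review the final state of every step, including the
# surrogate steps, in the accounting database.
$ sacct -j 65 --format="JobID,JobName%34,State,Elapsed,ExitCode"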
Submit jobs with salloc#
The above run.py can also be invoked through salloc:
$ salloc --cpus-per-task 40 --tasks 1
$ source $HOME/venv_cerebras_pt/bin/activate
(venv_cerebras_pt) $ cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
(venv_cerebras_pt) $ python \
    run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train
While this is running, the sacct command shows the corresponding surrogate jobs:
$ sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
JobID JobName Account State CPUTime AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
68 interactive RUNNING 00:03:30 2 1 0:0
68.0 wsjob-lbrjprpjuj2dfsbfsebdq8 COMPLETED 00:00:08 2 1 0:0
68.1 wsjob-dazjdtytvfn4njtcbchsik-csx1 RUNNING 00:00:06 2 1 0:0
In this example:
- The interactive job is the Slurm-specific top-level job.
- The surrogate job step wsjob-lbrjprpjuj2dfsbfsebdq8 was created for the compile appliance job and has completed.
- The surrogate job step wsjob-dazjdtytvfn4njtcbchsik-csx1 was created for the train appliance job and is in the running state.
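Because the salloc session holds the node allocation for as long as the interactive shell stays open, you may want to check on it from elsewhere and release it when you are done. The sketch below uses only standard Slurm commands; the job ID 68 is taken from the example output above:

# From another terminal: list your allocations and the steps of job 68.
$ squeue -u $USER
$ sacct -j 68 --format="JobID,JobName%34,State,Elapsed"

# When run.py has finished, leave the interactive shell to release the
# allocation (or cancel it outright with: scancel 68).
(venv_cerebras_pt) $ exit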
Time limits#
We support two types of time limits:
- Time limit in Slurm. This time limit defines the runtime limit for the sbatch or salloc job. It includes the time the underlying appliance jobs spend in the appliance queue. An example of enabling this time limit is as follows (see also the signal-handling sketch after this list):

      #SBATCH --job-name=gpt2-small-test
      #SBATCH --nodes=1
      #SBATCH --tasks=1
      #SBATCH --cpus-per-task 40
      #SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out
      #SBATCH --time=0:60
      #SBATCH --signal=TERM@30

  This sets a 60-second timeout for the sbatch job.
- Time limit in the appliance. This time limit defines the runtime limit for all appliance jobs in a run.py invocation. It does not count the time the appliance jobs spend in the appliance queue. The limit can be specified through the run.py command-line argument --job_time_sec, as in the following example:

      $ source $HOME/venv_cerebras_pt/bin/activate
      (venv_cerebras_pt) $ cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
      (venv_cerebras_pt) $ python run.py \
          CSX \
          --params configs/params_gpt2_small.yaml \
          --num_csx 1 \
          --model_dir model_dir/ \
          --num_wgt_servers 2 \
          --mode train \
          --job_time_sec 60

  This sets a 60-second timeout on the appliance jobs for this run.py invocation.
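The --signal=TERM@30 directive in the Slurm example asks Slurm to deliver SIGTERM 30 seconds before the time limit expires. If you want run.py itself to receive that signal and shut down before Slurm kills the job, one possible pattern is to run it in the background and relay the signal from the batch shell. This is a minimal sketch built on standard Slurm and bash behavior, not part of the appliance integration; note the B: prefix, which directs the signal to the batch shell instead of the job steps:

#!/bin/bash
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --time=0:60
#SBATCH --signal=B:TERM@30   # send SIGTERM to the batch shell 30s before the limit

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2

# Run run.py in the background so the shell can catch and relay the signal.
python run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --mode train &
PID=$!

trap 'kill -TERM "$PID"' TERM   # relay SIGTERM to run.py
wait "$PID"                     # returns early when the trap fires
wait "$PID"                     # wait for run.py to finish shutting down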