Integration with Slurm#
In 1.8.0, we introduced a lightweight integration with Slurm in our appliance flow. At a high level, Slurm manages the user node resources, while Kubernetes (k8s) manages the appliance resources. The appliance jobs are exposed as additional job steps in Slurm to help with resource tracking and accounting. We call these job steps surrogate jobs.
You can submit your run.py through two different Slurm commands:
- salloc, which calls run.py after getting the resource allocation.
- sbatch, which invokes a bash script that calls run.py.
The relevant surrogate jobs are created automatically after run.py is invoked.
Surrogate jobs are not created when run.py is submitted with srun.
Surrogate jobs#
A surrogate job is a job step on the Slurm cluster that represents an appliance job running on a Cerebras cluster.
Currently, a surrogate job is added to the Slurm cluster when the appliance job enters the running state, and it ends in either the COMPLETED or FAILED state, depending on the appliance job's status. Jobs that require CS-2s have a name suffix of -csxN, where N is the number of CS-2s allocated.
The runtime of a surrogate job should match the runtime of its appliance job.
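Because surrogate jobs are ordinary Slurm job steps, you can inspect them with standard Slurm accounting tools. The following is a minimal sketch using only standard sacct options, assuming a Slurm job ID of 65 and assuming the surrogate step names keep the wsjob- prefix shown in the examples below:

# List all steps of Slurm job 65 and keep only the surrogate steps
# (their names start with the appliance job prefix "wsjob-").
$ sacct -j 65 --format="JobID,JobName%34,State,Elapsed,ExitCode" | grep wsjob-

Steps whose names end in -csxN correspond to appliance jobs that occupy N CS-2 systems; steps without the suffix (for example, compile jobs) do not occupy a CS-2.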
Submit jobs with sbatch#
For this example, we will train GPT2 small in Weight Streaming using the PyTorch implementation in the Cerebras Model Zoo. After you Clone Cerebras Model Zoo, the folder <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/ contains:
configs/
    params_gpt2_small.yaml
input/
data.py
gpt2_model.py
model.py
run.py
...
Here is an example of an sbatch.sh script that invokes the run.py located at <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/.
Assume that you have Set up Cerebras's virtual environment at $HOME/venv_cerebras_pt.
The run.py script starts a compile and a train appliance job for a PyTorch GPT2 small test:
#!/bin/bash
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out
source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
python \
    run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train
Submit this script to Slurm with sbatch sbatch.sh. While it is running, the sacct command shows the surrogate jobs starting up:
$ sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
JobID JobName Account State CPUTime AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
65 gpt2-small-test RUNNING 01:09:20 40 1 0:0
65.batch batch RUNNING 01:09:20 40 1 0:0
65.0 wsjob-vpd7iavtknrtvgonhe7mog COMPLETED 00:02:40 40 1 0:0
65.1 wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 RUNNING 00:02:40 40 1 0:0
In this example:
- gpt2-small-test is the Slurm-specific top-level job.
- The batch job step is a Slurm-specific job step.
- The surrogate job step wsjob-vpd7iavtknrtvgonhe7mog was created for the compile appliance job and has completed. Since it is a compile job, no CS-2s are required, so there is no -csxN suffix.
- The surrogate job step wsjob-z3pbtbl8sgmfqtivpazrtn-csx1 was created for the train appliance job and is in the running state. The suffix -csx1 indicates that 1 CS-2 is allocated to the job.
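If you want the submission to block until the training run, and therefore all of its surrogate job steps, has finished, the standard --wait flag of sbatch can be used. This is one possible pattern built on standard Slurm commands, not part of the appliance integration itself; the job ID 65 is taken from the example output above:

# Submit and block until the batch script exits; sbatch then returns the
# script's exit code, so the shell can react to failures.
$ sbatch --wait sbatch.sh && echo "training script finished successfully"

# Afterwards, review the final state of every step, including the
# surrogate steps, in the accounting database.
$ sacct -j 65 --format="JobID,JobName%34,State,Elapsed,ExitCode"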
Submit jobs with salloc#
The above run.py can also be invoked through salloc:
$ salloc --cpus-per-task 40 --tasks 1
$ source $HOME/venv_cerebras_pt/bin/activate
(venv_cerebras_pt) $ cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
(venv_cerebras_pt) $ python \
    run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --num_wgt_servers 2 \
    --mode train
While this is running, the sacct command shows the corresponding surrogate jobs:
$ sacct --format="JobID,JobName%34,Account,State,CPUTime,AllocCPUS,AllocNodes,ExitCode"
JobID JobName Account State CPUTime AllocCPUS AllocNodes ExitCode
------------ ---------------------------------- ---------- ---------- ---------- ---------- ---------- --------
68 interactive RUNNING 00:03:30 2 1 0:0
68.0 wsjob-lbrjprpjuj2dfsbfsebdq8 COMPLETED 00:00:08 2 1 0:0
68.1 wsjob-dazjdtytvfn4njtcbchsik-csx1 RUNNING 00:00:06 2 1 0:0
In this example:
- The interactive job is the Slurm-specific top-level job.
- The surrogate job step wsjob-lbrjprpjuj2dfsbfsebdq8 was created for the compile appliance job and has completed.
- The surrogate job step wsjob-dazjdtytvfn4njtcbchsik-csx1 was created for the train appliance job and is in the running state.
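Because the salloc session holds the node allocation for as long as the interactive shell stays open, you may want to check on it from elsewhere and release it when you are done. The sketch below uses only standard Slurm commands; the job ID 68 is taken from the example output above:

# From another terminal: list your allocations and the steps of job 68.
$ squeue -u $USER
$ sacct -j 68 --format="JobID,JobName%34,State,Elapsed"

# When run.py has finished, leave the interactive shell to release the
# allocation (or cancel it outright with: scancel 68).
(venv_cerebras_pt) $ exit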
Time limits#
We support two types of time limits:
- Time limit in Slurm. This time limit defines the runtime limit for the sbatch or salloc job. It includes the time the underlying appliance jobs spend in the appliance queue. An example of enabling this time limit is as follows (see also the signal-handling sketch after this list):

      #SBATCH --job-name=gpt2-small-test
      #SBATCH --nodes=1
      #SBATCH --tasks=1
      #SBATCH --cpus-per-task 40
      #SBATCH --output <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2/run-%j.out
      #SBATCH --time=0:60
      #SBATCH --signal=TERM@30

  This sets a 60-second timeout for the sbatch job.
- Time limit in the appliance. This time limit defines the runtime limit for all appliance jobs in a run.py invocation. It does not count the time the appliance jobs spend in the appliance queue. The limit can be specified through the run.py command-line argument --job_time_sec, as in the following example:

      $ source $HOME/venv_cerebras_pt/bin/activate
      (venv_cerebras_pt) $ cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2
      (venv_cerebras_pt) $ python run.py \
          CSX \
          --params configs/params_gpt2_small.yaml \
          --num_csx 1 \
          --model_dir model_dir/ \
          --num_wgt_servers 2 \
          --mode train \
          --job_time_sec 60

  This sets a 60-second timeout on the appliance jobs for this run.py invocation.
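The --signal=TERM@30 directive in the Slurm example asks Slurm to deliver SIGTERM 30 seconds before the time limit expires. If you want run.py itself to receive that signal and shut down before Slurm kills the job, one possible pattern is to run it in the background and relay the signal from the batch shell. This is a minimal sketch built on standard Slurm and bash behavior, not part of the appliance integration; note the B: prefix, which directs the signal to the batch shell instead of the job steps:

#!/bin/bash
#SBATCH --job-name=gpt2-small-test
#SBATCH --nodes=1
#SBATCH --tasks=1
#SBATCH --cpus-per-task 40
#SBATCH --time=0:60
#SBATCH --signal=B:TERM@30   # send SIGTERM to the batch shell 30s before the limit

source $HOME/venv_cerebras_pt/bin/activate
cd <parent_dir_modelzoo>/modelzoo/modelzoo/transformers/pytorch/gpt2

# Run run.py in the background so the shell can catch and relay the signal.
python run.py CSX \
    --params configs/params_gpt2_small.yaml \
    --num_csx 1 \
    --model_dir model_dir/ \
    --mode train &
PID=$!

trap 'kill -TERM "$PID"' TERM   # relay SIGTERM to run.py
wait "$PID"                     # returns early when the trap fires
wait "$PID"                     # wait for run.py to finish shutting down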