Launch your job#
Running jobs in the Cerebras Wafer-Scale Cluster is as easy as running jobs on a single device. To start, you should have already followed the steps in Set up Cerebras’s virtual environment and Clone Cerebras Model Zoo.
1. Activate Cerebras virtual environment#
After you have Set up Cerebras’s virtual environment, activate this environment on the user node using:
$ source venv_cerebras_pt/bin/activate
Note
You will need to activate your virtual environment any time you run jobs on the Cerebras Wafer-Scale Cluster.
3. Prepare your datasets#
Each model in the Cerebras Model Zoo contains scripts to prepare your datasets. You can find general guidance in the Data Processing and Dataloaders section, and dataset examples for each model can be found in the README
file in the Cerebras Model Zoo. For example, the FC-MNIST model contains a prepare_data.py
script that downloads sample data. For language models, you can use one of the Cerebras Model Zoo scripts for data processing. An example can be found in the training and fine-tuning LLMs tutorial.
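For instance, a minimal sketch of preparing the FC-MNIST sample data might look like the following; the script’s exact location and arguments are assumptions here, so check the model’s README for the precise invocation:
(venv_cerebras_pt) $ cd modelzoo/fc_mnist/pytorch   # hypothetical path to the FC-MNIST model
(venv_cerebras_pt) $ python prepare_data.py         # downloads the sample dataset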
Here, we assume that you have already prepared the data. Next, set the data path as an absolute path in the configuration file inside the configs/
folder, which contains YAML files for different model sizes.
train_input:
data_dir: "/absolute/path/to/training/dataset"
...
eval_input:
data_dir: "/absolute/path/to/evaluation/dataset/"
4. Launch your job#
All models in the Cerebras Model Zoo contain the script run.py
. This script launches compilation, training, and evaluation of your model on the Cerebras cluster.
You will need to specify these flags:
| Flag | Mandatory | Description |
|---|---|---|
| CSX | Yes | Specifies that the target device for execution is a Cerebras cluster. |
| --params | Yes | Path to a YAML file containing model/run configuration options. |
| --mode | Yes | Whether to run in train, eval, train_and_eval, or eval_all mode. |
| --mount_dirs | Yes | List of paths to be mounted to the appliance containers. It should include parent paths for the Cerebras Model Zoo and other locations needed by the dataloader, including datasets and code. (Default: pulled from the path defined by an env variable.) |
| --python_paths | Yes | List of paths to be exported to PYTHONPATH in the appliance containers. |
| --credentials_path | No | Path to a TLS certificate used to authenticate the user against the Wafer-Scale Cluster. (Default: …) |
| --mgmt_address | No | Address of the Wafer-Scale Cluster management server. (Default: pulled from …) |
| --compile_only | No | Compile the model, including matching to Cerebras kernels and mapping to hardware; does not execute on the system. Upon success, compile artifacts are stored inside the Cerebras cluster, under the directory specified by --compile_dir. |
| --validate_only | No | Validate that the model can be matched to Cerebras kernels. This is a lightweight compilation; it does not map to hardware nor execute on the system. Mutually exclusive with --compile_only. (Default: …) |
| --model_dir | No | Path to store model checkpoints, TensorBoard event files, etc. (Default: …) |
| --compile_dir | No | Path to store the compile artifacts inside the Cerebras cluster. (Default: …) |
| --num_csx | No | Number of CS-X systems to use in training. (Default: …) |
4.1 (Optional) Compile your job#
To validate that your model implementation is compatible with the Cerebras Software Platform, you can use the --validate_only
flag. This flag allows you to quickly iterate and check compatibility without requiring full model execution.
(venv_cerebras_pt) $ python run.py \
CSX \
--params params.yaml \
--num_csx=1 \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used} \
--validate_only
You can also run a --compile_only
compilation to create the executables used to run your model on the Cerebras cluster. This compilation takes longer than --validate_only
(15 minutes to an hour, depending on the size and complexity of the model).
(venv_cerebras_pt) $ python run.py \
CSX \
--params params.yaml \
--num_csx=1 \
--model_dir model_dir \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used} \
--compile_only
Note
You can use precompiled artifacts obtained with --validate_only
and --compile_only
to speed up your training or evaluation runs. Use the same --compile_dir
during compilation and execution to reuse the precompiled artifacts.
Since train
and eval
modes require different fabric programming on the CS-2 system, you will obtain different compile artifacts when running with --mode train --compile_only
and with --mode eval --compile_only.
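For example, a sketch of compiling once and then reusing the artifacts in a training run, assuming the same params.yaml and paths as above and a hypothetical shared compile directory:
(venv_cerebras_pt) $ python run.py \
CSX \
--params params.yaml \
--num_csx=1 \
--mode train \
--compile_dir /path/to/shared_compile_dir \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used} \
--compile_only
(venv_cerebras_pt) $ python run.py \
CSX \
--params params.yaml \
--num_csx=1 \
--mode train \
--model_dir model_dir \
--compile_dir /path/to/shared_compile_dir \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used}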
4.2 Execute your job#
To execute your job, you need to provide the following information:
- The target device that you would like to execute on. To run on the Cerebras cluster, add CSX as the first positional argument on the command line. These scripts can also be run locally using CPU or GPU.
- Information about the Cerebras cluster where the job will be executed, using the flags --python_paths, --mount_dirs, and optionally --credentials_path and --mgmt_address. Please note that python_paths and mount_dirs can be omitted from the command line as long as they are specified in the runconfig section of params.yaml (a minimal sketch follows this list). They should both generally include paths to the directory in which the Cerebras Model Zoo resides. More information in Cerebras Cluster settings.
- Finally, the mode of execution {train, eval, eval_all, train_and_eval} and a path to the configuration file must be passed.
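For reference, a minimal sketch of setting these in the runconfig section of params.yaml instead of on the command line; the paths are placeholders and the exact keys accepted depend on your Model Zoo version:
runconfig:
  mount_dirs:
    - /absolute/path/to/parent/of/modelzoo
    - /absolute/path/to/datasets
  python_paths:
    - /absolute/path/to/parent/of/modelzoo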
(venv_cerebras_pt) $ python run.py \
CSX \
--params params.yaml \
--num_csx=1 \
--model_dir model_dir \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used}
Here is an example of a typical output log for a training job:
Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO: Finished sending initial weights
INFO: | Train Device=xla:0 Step=50 Loss=8.31250 Rate=69.37 GlobalRate=69.37
INFO: | Train Device=xla:0 Step=100 Loss=7.25000 Rate=68.41 GlobalRate=68.56
INFO: | Train Device=xla:0 Step=150 Loss=6.53125 Rate=68.31 GlobalRate=68.46
INFO: | Train Device=xla:0 Step=200 Loss=6.53125 Rate=68.54 GlobalRate=68.51
INFO: | Train Device=xla:0 Step=250 Loss=6.12500 Rate=68.84 GlobalRate=68.62
INFO: | Train Device=xla:0 Step=300 Loss=5.53125 Rate=68.74 GlobalRate=68.63
INFO: | Train Device=xla:0 Step=350 Loss=4.81250 Rate=68.01 GlobalRate=68.47
INFO: | Train Device=xla:0 Step=400 Loss=5.37500 Rate=68.44 GlobalRate=68.50
INFO: | Train Device=xla:0 Step=450 Loss=6.43750 Rate=68.43 GlobalRate=68.49
INFO: | Train Device=xla:0 Step=500 Loss=5.09375 Rate=66.71 GlobalRate=68.19
INFO: Training Complete. Completed 60500 sample(s) in 887.2672743797302 seconds.
Note
Cerebras only supports using a single CS-2 when running in eval mode.
Note
To scale to multiple CS-2 systems, simply add the --num_csx
flag specifying the number of CS-2 systems. For models from the Cerebras Model Zoo,
the batch size specified in the configuration YAML file is the global batch size; the global batch size divided by the number of CS-2 systems
is the effective batch size per device.
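For example, assuming the model’s YAML specifies a global batch size of 1024 (the exact key location varies by model):
train_input:
  batch_size: 1024   # global batch size across all CS-2 systems
With --num_csx=4, the effective batch size per CS-2 system is 1024 / 4 = 256.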
Note
Once you have submitted your job for execution on the Cerebras Wafer-Scale cluster, you can track its progress or kill it using the csctl tool. You can also monitor performance using a Grafana dashboard.
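For instance, job tracking with csctl might look like the following; the subcommands shown are an assumption here, so consult the csctl documentation for your release:
$ csctl get jobs                   # list submitted jobs and their status
$ csctl cancel job wsjob-example   # cancel (kill) a job, using a hypothetical job ID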
5. Explore output files and artifacts#
The model directory (specified by the --model_dir
flag) contains all the results and artifacts of the latest run, including:
- Checkpoints
- TensorBoard event files
- YAML files
Checkpoints#
Checkpoints are stored in <model_dir>/model-ckpt*.
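For example, you can list them from the user node (here model_dir stands for the directory passed via --model_dir):
$ ls model_dir/model-ckpt*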
TensorBoard event files#
TensorBoard event files are stored in the <model_dir> directory.
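To visualize them, you can point TensorBoard at the model directory, assuming TensorBoard is installed in your environment:
(venv_cerebras_pt) $ tensorboard --logdir model_dir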