Running PyTorch Models¶
Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance, meaning that running models of various sizes on the Cerebras Wafer-Scale Cluster is as easy as running them on a single device. For first-time user setup for PyTorch jobs, see PyTorch: Getting Started.
Activate your PyTorch environment¶
To run PyTorch jobs on the Wafer-Scale Cluster, you must first activate your PyTorch environment on the user node.
Enter the Python environment using the following command:
source venv_cerebras_pt/bin/activate
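As a quick sanity check after activation, you can confirm that the shell now resolves the Python interpreter inside the venv. This is a minimal sketch; the expected path is an assumption based on the activation command above:

# Check which Python interpreter is active; it should live inside venv_cerebras_pt
which python
# expected output (illustrative): /path/to/venv_cerebras_pt/bin/python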
Running the scripts to compile, train, or evaluate your model¶
The steps to train your model are as follows. We use the GPT-2 model available in the Cerebras Model Zoo git repository for this example. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository available with pre-installed datasets. Otherwise, you can clone this git repository on your user node yourself and follow the instructions in the README files in the repository on how to set up training datasets.
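If you need to clone the Model Zoo yourself, a minimal sketch is shown below. The GitHub URL is an assumption based on the Model Zoo's public location; confirm it for your environment:

# Clone the Cerebras Model Zoo onto the user node (URL assumed; verify for your setup)
git clone https://github.com/Cerebras/modelzoo.git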
In the Model Zoo, you can find run.py scripts for PyTorch models. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

cd modelzoo/transformers/pytorch/gpt2

Within this directory, run the following command, which performs the initial stage of compilation to get feedback about whether your model is compatible with the Cerebras Software Platform:
python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --validate_only --mode train --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>

This step can be skipped if you are confident in your code, but it is very convenient for fast iteration on your code, as it is considerably faster than a full compile.
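For instance, a hypothetical validate-only invocation with the placeholders filled in might look like the following; all paths and the management address below are illustrative only, following the pattern of the example commands in the Note further down:

python run.py --appliance --execution_strategy pipeline --params params.yaml --num_csx=1 --validate_only --mode train --model_dir model_dir --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo --python_paths /path/to/modelzoo --mgmt_address management_node_address_for_cluster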
The next step is to run the full compile. The artifacts from this run are used in the training run.
This compile can take longer, depending on the size and complexity of the model (15 minutes to an hour).
python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --compile_only --mode train --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>
This is the training step.
If you are running one CS-2, enter the following:
python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode train --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>

If you are running multiple CS-2s, which is only allowed in Weight Streaming execution, enter the following. Note that --num_csx=2 in the code block below refers to the number of CS-2 systems you are using. In this case, you are running a data-parallel job on two CS-2 systems within the Wafer-Scale Cluster.

python run.py --appliance --execution_strategy weight_streaming --params params.yaml --num_csx=2 --mode train --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data dir and paths to be mounted> --python_paths <paths to modelzoo and other python code if used> --mgmt_address <management node address for cluster>

The output logs are as follows:
Transferring weights to server: 100%|████| 983/983 [00:13<00:00, 72.77tensors/s]
2023-01-31 14:29:09,453 INFO: Finished sending initial weights
2023-01-31 14:35:14,468 INFO: | Train Device=xla:0, Step=100, Loss=6.60547, Rate=38096.59 samples/sec, GlobalRate=38092.92 samples/sec
2023-01-31 14:35:14,545 INFO: | Train Device=xla:0, Step=200, Loss=6.27734, Rate=40148.59 samples/sec, GlobalRate=39732.25 samples/sec
2023-01-31 14:35:14,619 INFO: | Train Device=xla:0, Step=300, Loss=5.96484, Rate=41927.08 samples/sec, GlobalRate=40798.72 samples/sec
2023-01-31 14:35:14,695 INFO: | Train Device=xla:0, Step=400, Loss=5.92578, Rate=42184.20 samples/sec, GlobalRate=41177.08 samples/sec
2023-01-31 14:35:14,769 INFO: | Train Device=xla:0, Step=500, Loss=5.56641, Rate=42517.85 samples/sec, GlobalRate=41480.51 samples/sec
......
2023-01-31 14:35:15,147 INFO: | Train Device=xla:0, Step=1000, Loss=4.96094, Rate=41571.75 samples/sec, GlobalRate=41921.10 samples/sec
2023-01-31 14:35:15,148 INFO: Training Complete. Completed 32000 sample(s) in 0.7639429569244385 seconds.
2023-01-31 14:35:30,443 INFO: Monitoring is over without any issue
To run an eval job, enter the following command:
python run.py --appliance --execution_strategy <pipeline or weight_streaming> --params params.yaml --num_csx=1 --mode eval --model_dir model_dir --credentials_path <path to tls certificate> --mount_dirs <paths to data and paths to be mounted> --python_paths <paths to modelzoo and other python code if used> --checkpoint_path <path to checkpoint to be evaluated> --mgmt_address <management node address for cluster>
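As a concrete illustration, a hypothetical eval invocation with the placeholders filled in might look like the following; the paths, checkpoint file name, and management address below are illustrative assumptions only:

python run.py --appliance --execution_strategy pipeline --params params.yaml --num_csx=1 --mode eval --model_dir model_dir --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo --python_paths /path/to/modelzoo --checkpoint_path /path/to/checkpoint/to/evaluate --mgmt_address management_node_address_for_cluster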
Note

For the --execution_strategy argument in the example commands above, as a rule of thumb, please specify --execution_strategy=pipeline for small to medium models with <1 billion parameters to run in Pipelined execution, and specify --execution_strategy=weight_streaming for large models with >=1 billion parameters to run in Weight Streaming execution.
Example command to train a 117M GPT2 in PyTorch in Pipelined execution:

python run.py --appliance --execution_strategy pipeline --params params_PT_GPT2_117M.yaml --num_csx=1 --num_workers_per_csx=8 --mode train --model_dir model_dir_PT_GPT2_117M --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/python/packages --mgmt_address management_node_address_for_cluster
Example command to train a 1.5B GPT2 in PyTorch in Weight Streaming execution:

python run.py --appliance --execution_strategy weight_streaming --params params_PT_GPT2_1p5B.yaml --num_csx=1 --mode train --model_dir model_dir_PT_GPT2_1p5B --credentials_path /path/to/certificate --mount_dirs /path/to/data /path/to/modelzoo /path/to/mount --python_paths /path/to/modelzoo /path/to/python/packages --mgmt_address management_node_address_for_cluster
Note
Cerebras only supports one CS-2 for eval mode or for Pipelined execution.
Contents of run.py¶
For your reference, the contents of run.py are shown in the Cerebras Model Zoo.
Output files and artifacts¶
The output files and artifacts of the model directory (model_dir) contain all the results and artifacts of the latest run, including:

- Checkpoints
- TensorBoard event files in train/ (see the TensorBoard example after this list)
- A copy of the yaml file for your run in train/
- Performance summary in performance/
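To inspect the training curves recorded in the event files, you can point TensorBoard at the train/ subdirectory of your model directory. This is a minimal sketch and assumes TensorBoard is installed in your Python environment:

# Launch TensorBoard on the event files written during training (assumes tensorboard is installed)
tensorboard --logdir model_dir/train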