Running Small to Medium Models (Pipelined Execution)¶
This reference guide is for the original workflow, which uses Slurm as the orchestrating software running on the CPU nodes to mediate communication between the CS system and the original Cerebras support cluster. Note that this is no longer the recommended workflow: Slurm/Singularity is supported only on the original Cerebras installation, while the latest Wafer-Scale Clusters support only a Kubernetes-based workflow. Refer to Running Small to Medium Models (Pipelined Execution) for more information.
This is a step-by-step guide to compiling a PyTorch FC-MNIST model (already ported to Cerebras) targeting your CS system.
Prerequisites¶
System Setup Confirmation¶
Before you start using the CS system, go over the following prerequisites with your system administrator.
The Singularity software is installed on all the nodes, including the chief and the worker nodes, and can launch the Cerebras container, which contains the Cerebras Graph Compiler (CGC) and other necessary libraries.
The Slurm orchestrator software is installed and running on all the nodes. The orchestrator coordinates communication between the CS system and the nodes in the CS cluster.
You have the hostnames of the chief and the worker nodes. You will log in to the chief node and perform all your work there. You need the hostnames of the worker nodes for debugging.
You have the IP address and the port number of the network-attached CS system accelerator. You pass this IP address and port number to the --cs_ip flag of your runtime scripts when compiling and running your models.
You know the steps to log in to the chief node of the CS system cluster. Logging in to the chief node is done by using ssh.
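For example, a login might look like the following; the username and hostname are placeholders for the values your system administrator provides.
# Placeholders: replace with your user name and the chief node hostname.
ssh <username>@<chief-node-hostname>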
Compile the model¶
To compile your model, perform the following steps.
Log in to the CS system cluster usernode.
Clone the Cerebras Model Zoo repository to your preferred location in your home directory and, from within the cloned repository, check out the original_cerebras_installation branch using the following commands:
git clone https://github.com/Cerebras/modelzoo.git
git checkout original_cerebras_installation
In the Model Zoo directory, there are the following PyTorch model examples:
A PyTorch version of FC-MNIST.
The PyTorch versions of BERT Base and BERT Large.
In this quick start, we use the FC-MNIST model. Navigate to the fc_mnist model directory using the following command.
cd cerebras/modelzoo/fc_mnist/pytorch
Compile the model targeting the CS system.
The csrun_cpu command below compiles the code in train mode for the CS system. Note that this step only compiles the code and does not run training on the CS system.
First, run the compilation in validate_only mode. This performs the initial stage of compilation to get feedback about whether your model can be lowered. This process should be considerably faster than a full compile.
csrun_cpu python-pt run.py --mode train \
    --validate_only \
    --params configs/<name-of-the-params-file.yaml> \
    --cs_ip <specify your CS_IP>:<port>
Then run the full compilation in compile_only mode. This step runs the full compilation through all stages of the Cerebras software stack to generate a CS system executable.
csrun_cpu python-pt run.py --mode train \
    --compile_only \
    --params configs/<name-of-the-params-file.yaml> \
    --cs_ip <specify your CS_IP>:<port>
Note
The parameters can also be set in the params.yaml file.
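As a sketch of that option (an assumption based on the note above, not a verified behavior of run.py): if cs_ip and other run settings are already specified in the params file, the corresponding command-line flag could be dropped. The file name below is a placeholder.
# Assumes cs_ip is set inside configs/params.yaml (hypothetical file name).
csrun_cpu python-pt run.py --mode train \
    --compile_only \
    --params configs/params.yaml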
Train on GPU¶
To train on a GPU, run the following command:
python run.py --mode train --params configs/<name-of-the-params-file.yaml>
Train on CS system¶
Execute the csrun_wse command to run the training on the CS system. See the command format below.
Note
For PyTorch models only, the cs_ip flag must include both the IP address and the port number of the CS system. The IP address alone, for example --cs_ip 192.168.1.1, is not sufficient. You must also include the port number, for example --cs_ip 192.168.1.1:9000.
csrun_wse python-pt run.py --mode train \
    --cs_ip <IP:port-number> \
    --params configs/<name-of-the-params-file.yaml>
Output files and artifacts¶
The output files and artifacts include a model directory (model_dir) that contains all the results and artifacts of the latest run, including:
Compile directory (cs_<checksum>)
performance.json file
Checkpoints
TensorBoard event files
A copy of the yaml file for the run
Compile dir – The directory containing cs_<checksum>¶
The compilation artifacts generated during and after compilation are stored in the <model_dir>/cs_<checksum> directory.
Compilation logs and intermediate outputs are helpful for debugging compilation issues.
The xla_service.log file should contain information about the status of the compilation and whether it passed or failed. In case of failure, an error message and stack trace are printed in xla_service.log.
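As a quick way to check for a failed compile, you can search that log; the path below follows the directory description above and may differ on your installation.
# Assumed location based on the compile directory description above.
grep -i error <model_dir>/cs_<checksum>/xla_service.log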
performance.json file and its parameters¶
The performance directory should contain the performance.json file at <model_dir>/performance/performance.json. This file contains the information listed below:
compile_time - The amount of time that it took to compile the model to generate the Cerebras executable.
est_samples_per_sec - The estimated performance in terms of samples per second based on the Cerebras compile. Note that this number is theoretical and actual performance may vary.
programming_time - The time taken to prepare the system and load it with the compiled model.
samples_per_sec - The actual performance of your run execution.
suspected_input_bottleneck - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
total_samples - The total gross samples that were iterated during the execution.
total_time - The total time it took to complete the total samples.
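To inspect these values after a run, you can pretty-print the file (assuming a standard Python interpreter is available on the node):
python -m json.tool <model_dir>/performance/performance.json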
Checkpoints¶
Checkpoints are stored in <model_dir>/checkpoint_*.mdl. They are saved with the frequency specified in the runconfig file.
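For example, to list the checkpoints a run produced (path taken from the description above):
ls <model_dir>/checkpoint_*.mdl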
TensorBoard event files¶
TensorBoard event files are stored in the <model_dir>/train/ directory.
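To visualize them, point TensorBoard at that directory (assuming TensorBoard is installed in your environment):
tensorboard --logdir <model_dir>/train/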
yaml file content after the run¶
The yaml file is stored in the train directory. This yaml file contains information about the specifics of the run, such as model-specific configuration (e.g., dropout, activation_fn); optimizer type and optimizer parameters; input data configuration, such as batch_size and shuffle; and run configuration, such as max_steps, checkpoint_steps, and num_epochs.
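To review these settings for a completed run, you can print the copied file; the exact file name is not specified in this guide, so the glob below is a guess.
cat <model_dir>/train/*.yaml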