Running Large Models (Weight Streaming Execution)

Users interact with the Cerebras Wafer-Scale Cluster as if it were an appliance, so running a large model on the cluster is as easy as running it on a single device. For first-time user setup for TensorFlow jobs, see TensorFlow: Getting Started.

Activate your TensorFlow environment

To run TensorFlow jobs on the Wafer-Scale Cluster, first activate your TensorFlow environment on the user node.

Enter the Python environment using the following command:

source venv_cerebras_tf/bin/activate
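Before launching a job, it can help to confirm that the environment is actually active in the current shell. The following is a minimal Python sketch, not part of the Cerebras software: the helper name `in_virtualenv` is illustrative, and the check relies only on the standard `sys.prefix` / `sys.base_prefix` comparison that `venv`-style environments expose.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix instead of sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; "
              "run 'source venv_cerebras_tf/bin/activate' first.")
```

Run it with the same `python` you intend to use for the job. If it reports that no environment is active, repeat the `source` command above.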

Running the scripts to compile, train, or evaluate your model

The steps to train your model are as follows. This example uses the GPT-2 model available in the Cerebras Model Zoo git repository. Check with your sysadmin whether your setup has a local copy of the Model Zoo repository with pre-installed datasets. Otherwise, clone the repository on your user node and follow the instructions in its readme files to set up the training datasets.

In the Model Zoo, you can find run-appliance.py scripts for the TensorFlow models supported in weight streaming execution mode. For the GPT-2 model, navigate to the following directory in your copy of the Model Zoo:

cd modelzoo/transformers/tf/gpt2

Within this directory, run the following command. It performs only the initial stage of compilation and reports whether your model is compatible with the Cerebras Software Platform.

python run-appliance.py --params params.yaml --num_csx=1 --model_dir model_dir --validate_only --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>

You can skip this step if you are confident in your code. It is useful for fast iteration, however, because it is considerably faster than a full compile.
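Because the same long command recurs with small variations, you may prefer to assemble it programmatically. The sketch below is an illustration only: `build_run_command` is a hypothetical helper, the flags are copied from the command above, and the path values are placeholders that you must replace. It assumes a single path for `--mount_dirs` and for `--python_paths`; how multiple paths are passed is not covered here.

```python
import shlex
from typing import List, Optional


def build_run_command(
    mode: str,
    num_csx: int,
    credentials_path: str,
    mount_dir: str,
    python_path: str,
    stage_flag: Optional[str] = None,
    params: str = "params.yaml",
    model_dir: str = "model_dir",
) -> List[str]:
    """Assemble the run-appliance.py argument list in the order used in this guide."""
    cmd = ["python", "run-appliance.py", "--params", params,
           f"--num_csx={num_csx}", "--model_dir", model_dir]
    if stage_flag is not None:
        # For example "--validate_only" for the quick compatibility check.
        cmd.append(stage_flag)
    cmd += ["--mode", mode,
            f"--credentials_path={credentials_path}",
            "--mount_dirs", mount_dir,
            "--python_paths", python_path]
    return cmd


if __name__ == "__main__":
    # All three paths below are placeholders, not real locations.
    cmd = build_run_command(
        mode="train",
        num_csx=1,
        credentials_path="/path/to/tls.crt",
        mount_dir="/path/to/data",
        python_path="/path/to/modelzoo",
        stage_flag="--validate_only",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping the command as a list of arguments, rather than one string, avoids quoting problems when a path contains spaces.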

The next step is to run the full compile. The artifacts from this run are used in the training run.

The full compile takes longer than the validation step, from 15 minutes to an hour depending on the size and complexity of the model.

python run-appliance.py --params params.yaml --num_csx=1 --model_dir model_dir --compile_only --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
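Since the full compile is the slow stage, one option is to gate it on the quick validation pass. The following sketch shows that idea under stated assumptions: `run_stages`, `stage_command`, `dry_run`, and `subprocess_runner` are hypothetical helpers, the path values are placeholders, and the sketch assumes run-appliance.py signals failure through a non-zero exit code, which this guide does not confirm.

```python
import shlex
import subprocess
from typing import Callable, List, Sequence


def stage_command(stage_flag: str) -> List[str]:
    """Return the command from this guide with the given stage flag (placeholder paths)."""
    return ["python", "run-appliance.py", "--params", "params.yaml",
            "--num_csx=1", "--model_dir", "model_dir", stage_flag,
            "--mode", "train",
            "--credentials_path=/path/to/tls.crt",  # placeholder
            "--mount_dirs", "/path/to/data",        # placeholder
            "--python_paths", "/path/to/modelzoo"]  # placeholder


def run_stages(commands: Sequence[List[str]],
               runner: Callable[[List[str]], int]) -> int:
    """Run commands in order, stop at the first failure, return how many succeeded."""
    completed = 0
    for cmd in commands:
        if runner(cmd) != 0:  # assumption: non-zero exit code means failure
            break
        completed += 1
    return completed


def dry_run(cmd: List[str]) -> int:
    """Print the command instead of executing it."""
    print("would run:", shlex.join(cmd))
    return 0


def subprocess_runner(cmd: List[str]) -> int:
    """Execute the command and return its exit code."""
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    stages = [stage_command("--validate_only"), stage_command("--compile_only")]
    # Swap dry_run for subprocess_runner to launch the jobs for real.
    done = run_stages(stages, dry_run)
    print(f"{done} of {len(stages)} stage(s) completed")
```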

The final step is training.

To train on a single CS-2 system, run the following:

python run-appliance.py --params params.yaml --num_csx=1 --model_dir=model_dir --max_steps=10 --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>

To train on more than one CS-2 system, change the --num_csx flag, which sets the number of CS-2 systems used. With --num_csx=2, the following command runs a distributed job on two CS-2 systems within the Wafer-Scale Cluster:

python run-appliance.py --params params.yaml --num_csx=2 --model_dir=model_dir --num_steps=10 --mode train --credentials_path=<path to tls certificate> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>

The output log is as follows:

INFO     root:start_utils.py:519 # 1. Start Coordinator on separate process
INFO     root:start_utils.py:534 # 2. Begin Run
INFO     root:start_utils.py:545 # 3. Start Workers on separate processes
INFO     root:start_utils.py:554 # 4. Start Chief on separate processes
INFO     root:start_utils.py:564 # 5. Start WS Runtime servers (i.e. ws-srv) on separate processes
INFO     root:cs_estimator_app.py:274 Loaded global step 0
INFO     root:cs_estimator_app.py:817 Output activation tensors: ['truediv_3_1']
INFO     root:cluster_client.py:217 Initiating a new compile wsjob against the cluster server.
INFO     root:cluster_client.py:220 Compile job initiated
INFO     root:appliance_manager.py:135 Creating a framework GRPC client: localhost:50065, None,
INFO     root:appliance_manager.py:359 Compile successfully written to cache directory: cs_10097974384330522877
INFO     root:cluster_client.py:243 Initiating a new execute wsjob against the cluster server.
INFO     root:cluster_client.py:246 Execute job initiated
INFO     root:appliance_manager.py:149 Removing a framework GRPC client
INFO     root:cs_estimator_app.py:940 final generation of weights: 9
INFO     cerebras_appliance.appliance_client:appliance_client.py:435 Input fn serialized: 80036374657374732e77732e6d696c6573746f6e655f6d6f64656c732e74662e646174610a746f795f696e7075745f666e0a71002e
INFO     root:appliance_manager.py:135 Creating a framework GRPC client: localhost:50066, None,
INFO     root:appliance_manager.py:282 About to send initial weights
INFO     root:tf_appliance_manager.py:85 Dropping tensor: 'good_steps'
INFO     root:appliance_manager.py:284 Finished sending initial weights
INFO     root:cs_estimator_app.py:482 global step 2: loss = 0.0 (0.37 steps/sec)
INFO     root:cs_estimator_app.py:482 global step 4: loss = 0.0 (0.74 steps/sec)
INFO     root:cs_estimator_app.py:388 Taking checkpoint at step: 5
INFO     root:cs_estimator_app.py:437 saving last set of weights: 9
INFO     root:cs_estimator_app.py:482 global step 6: loss = 0.0 (1.06 steps/sec)
INFO     root:cs_estimator_app.py:482 global step 8: loss = 0.0 (1.41 steps/sec)
INFO     root:cs_estimator_app.py:388 Taking checkpoint at step: 10
INFO     root:cs_estimator_app.py:391 Taking final checkpoint
INFO     root:cs_estimator_app.py:437 saving last set of weights: 9
INFO     root:cs_estimator_app.py:482 global step 10: loss = 0.0 (1.69 steps/sec)
INFO     root:cs_estimator_app.py:489 Training complete. Completed 640 sample(s) in 5.9104249477386475 seconds
INFO     root:start_utils.py:587 Wait for server completion
INFO     root:start_utils.py:599 Servers Completed
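To track progress across runs, you can extract the per-step loss, the step rate, and the overall throughput from a log like the one above. The sketch below is a minimal example: the regular expressions are written against the sample lines shown here, so treat the log format as an assumption that may not hold for other releases. The names `parse_training_log`, `STEP_RE`, and `DONE_RE` are illustrative.

```python
import re
from typing import List, Optional, Tuple

# Patterns matched against the sample log lines in this guide.
STEP_RE = re.compile(r"global step (\d+): loss = (\S+) \(([\d.]+) steps/sec\)")
DONE_RE = re.compile(r"Completed (\d+) sample\(s\) in ([\d.]+) seconds")


def parse_training_log(
    text: str,
) -> Tuple[List[Tuple[int, float, float]], Optional[float]]:
    """Return ([(step, loss, steps_per_sec), ...], samples_per_sec or None)."""
    steps = [(int(step), float(loss), float(rate))
             for step, loss, rate in STEP_RE.findall(text)]
    done = DONE_RE.search(text)
    # Overall throughput = total samples / total wall-clock seconds.
    throughput = int(done.group(1)) / float(done.group(2)) if done else None
    return steps, throughput


SAMPLE_LOG = """\
INFO     root:cs_estimator_app.py:482 global step 2: loss = 0.0 (0.37 steps/sec)
INFO     root:cs_estimator_app.py:482 global step 10: loss = 0.0 (1.69 steps/sec)
INFO     root:cs_estimator_app.py:489 Training complete. Completed 640 sample(s) in 5.9104249477386475 seconds
"""

if __name__ == "__main__":
    steps, throughput = parse_training_log(SAMPLE_LOG)
    for step, loss, rate in steps:
        print(f"step {step}: loss={loss} rate={rate} steps/sec")
    if throughput is not None:
        print(f"overall throughput: {throughput:.1f} samples/sec")
```

For the run shown above, 640 samples in about 5.91 seconds works out to roughly 108 samples per second.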

To run an eval job, run the following command:

python run-appliance.py --params params.yaml --num_csx=1 --model_dir=model_dir --mode eval --eval_steps=10 --credentials_path=<path to tls certificate> --checkpoint_path <path to trained checkpoint file> --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>

Note

Eval mode is supported on only one CS-2 system, so keep --num_csx=1 for eval jobs.
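If you launch jobs from a wrapper script, a small guard can catch an unsupported combination before the job is submitted. This is a sketch only: `check_num_csx` is a hypothetical helper that encodes the restriction stated in the note above, and it is not part of run-appliance.py.

```python
def check_num_csx(mode: str, num_csx: int) -> None:
    """Raise ValueError if the mode / system-count combination is not usable."""
    if num_csx < 1:
        raise ValueError("num_csx must be at least 1")
    # Per the note above, eval jobs run on exactly one CS-2 system.
    if mode == "eval" and num_csx != 1:
        raise ValueError(
            f"eval mode supports one CS-2 system, got num_csx={num_csx}"
        )


if __name__ == "__main__":
    check_num_csx("train", 2)  # accepted: distributed training
    check_num_csx("eval", 1)   # accepted: single-system eval
    try:
        check_num_csx("eval", 2)
    except ValueError as err:
        print(f"rejected: {err}")
```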

Contents of `run-appliance.py`

For reference, the contents of run-appliance.py are available in the Cerebras Model Zoo.

Output files and artifacts

The model directory (model_dir) contains all the results and artifacts of the latest run, including:

Checkpoints

TensorBoard event files

YAML files

Checkpoints

Checkpoints are stored in <model_dir>/model-ckpt*.

TensorBoard event files

TensorBoard event files are stored in the <model_dir> directory.
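To see at a glance what a run produced, you can group the files in the model directory by kind. The sketch below is an illustration under assumptions: the `model-ckpt*` pattern comes from this page, while the `events.out.tfevents*` prefix (the usual TensorFlow event-file naming) and the `*.yaml` extension are assumed rather than confirmed here. The helper name `summarize_model_dir` and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def summarize_model_dir(model_dir: str) -> Dict[str, List[str]]:
    """Group the top-level files of model_dir into the artifact kinds listed above."""
    root = Path(model_dir)
    return {
        # Pattern taken from this guide: <model_dir>/model-ckpt*
        "checkpoints": sorted(p.name for p in root.glob("model-ckpt*")),
        # Assumption: event files use TensorFlow's usual name prefix.
        "event_files": sorted(p.name for p in root.glob("events.out.tfevents*")),
        # Assumption: YAML files carry a .yaml extension.
        "yaml_files": sorted(p.name for p in root.glob("*.yaml")),
    }


if __name__ == "__main__":
    # Demo on a throwaway directory; the file names are invented for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("model-ckpt-10.index", "events.out.tfevents.0.host",
                     "params.yaml"):
            (Path(tmp) / name).touch()
        for kind, names in summarize_model_dir(tmp).items():
            print(f"{kind}: {names}")
```

Point the function at your own `model_dir` after a run to list the checkpoints and event files it contains.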

Software Documentation (Version 1.7.0)
