.. _cs-tf-pl-slurm-singularity:

Running Small to Medium Models (Pipelined Execution)
====================================================

This reference guide covers the original workflow, which uses Slurm as the orchestrating software on the CPU nodes to mediate communication between the CS system and the original Cerebras support cluster.
.. note::

   This is no longer the recommended workflow. Slurm/Singularity is supported only on the original Cerebras installation; the latest Cerebras Wafer-Scale Clusters support only a Kubernetes-based workflow. Refer to :ref:`cs-tf-pl-k8s` for more information.
Prerequisites
-------------

System Setup Confirmation
-------------------------

Before you can start using the CS system, check with your system administrator to confirm the following prerequisites.

1. The Singularity container software is installed on all the nodes, including the chief and the worker nodes, and can launch the Cerebras container, which contains the Cerebras Graph Compiler (CGC) and other necessary libraries.

2. The Slurm orchestrator software is installed and running on all the nodes. The orchestrator coordinates between the CS system and the nodes in the CS cluster.

3. You have the hostnames of the chief and the worker nodes. You log in to the chief node and perform all your work there. You need the hostnames of the worker nodes for debugging.

4. You have the IP address and the port number of the network-attached CS system accelerator. You pass these to the ``--cs_ip`` flag of your run scripts when compiling and running your models.

5. You know how to log in to the chief node of the CS system cluster. Logging in to the chief node is done with ``ssh``.
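The IP address and port number from the last step are what you pass to ``--cs_ip``. As an illustration only (the parsing helper below is hypothetical and not part of the Model Zoo run scripts), a run script could accept and split such a value like this:

.. code-block:: python

   import argparse

   def parse_cs_ip(value):
       """Split an IP or IP:port string into host and optional port.

       Hypothetical helper -- the real run.py handles --cs_ip internally.
       """
       host, _, port = value.partition(":")
       return host, int(port) if port else None

   parser = argparse.ArgumentParser()
   parser.add_argument("--cs_ip", help="IP[:port] of the CS system")
   args = parser.parse_args(["--cs_ip", "10.0.0.1:9000"])
   host, port = parse_cs_ip(args.cs_ip)
   print(host, port)  # -> 10.0.0.1 9000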

Clone the Model Zoo repository
------------------------------

1. Log in to your CS system cluster.

2. Clone the `Model Zoo repository <https://github.com/Cerebras/modelzoo>`_ to your preferred location in your home directory and check out the ``original_cerebras_installation`` branch:

   .. code-block:: bash

      git clone https://github.com/Cerebras/modelzoo.git
      git -C modelzoo checkout original_cerebras_installation

3. The Model Zoo directory contains several models for PyTorch and TensorFlow. This guide uses the FC-MNIST model.

4. Navigate to the ``fc_mnist`` model directory:

   .. code-block:: bash

      cd modelzoo/fc_mnist/tf/

Compile on CPU
--------------

Cerebras recommends that you first compile your model successfully on a support-cluster CPU node before running it on the CS system.

You can run in ``validate_only`` mode, which performs a fast, lightweight verification. In this mode, compilation runs only through the first few stages, up to and including kernel library matching.

After a successful ``validate_only`` run, you can run a full compilation in ``compile_only`` mode.

This section of the quick start shows how to execute these steps on a CPU node.

.. tip::

   The ``validate_only`` step is very fast, enabling you to iterate rapidly on your model code. Without needing access to the CS system Wafer-Scale Engine, the ``validate_only`` step tells you whether you are using any TensorFlow layer or functionality that is unsupported by either XLA or CGC.
Follow these steps:

1. Navigate to the model directory:

   .. code-block:: bash

      cd modelzoo/fc_mnist/tf/

2. Run the compilation in ``validate_only`` mode:

   .. code-block:: bash

      csrun_cpu python run.py --mode train --validate_only
      ...
      XLA Extraction Complete
      =============== Starting Cerebras Compilation ===============
      Cerebras compilation completed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02s, 1.23s/stages]
      =============== Cerebras Compilation Completed ===============

.. note::

   The ``validate_only`` mode checks the kernel compatibility of your model. When your model passes this mode, run the full compilation with ``compile_only`` to generate the CS system executable.
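The relationship between the two modes can be pictured as a staged pipeline: ``validate_only`` stops right after kernel matching, while ``compile_only`` runs every stage. A minimal sketch with invented stage names (the compiler's real internal stages are not enumerated in this guide):

.. code-block:: python

   # Illustrative sketch of the two compile modes; stage names are invented.
   STAGES = [
       "xla_extraction",
       "kernel_matching",   # validate_only stops after this stage
       "placement",
       "codegen",           # compile_only runs through here
   ]

   def compile_model(validate_only=False):
       """Return the list of stages that would run in the chosen mode."""
       if validate_only:
           stop = STAGES.index("kernel_matching") + 1
           return STAGES[:stop]
       return STAGES  # full compile: all stages, producing the executable

   print(compile_model(validate_only=True))  # first few stages only
   print(compile_model())                    # all stages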
3. Run the full compilation in ``compile_only`` mode.

   This step runs the full compilation through all stages of the Cerebras software stack to generate a CS system executable.

   .. code-block:: bash

      csrun_cpu python run.py --mode train --compile_only --cs_ip <IP-ADDRESS>
      ...
      XLA Extraction Complete
      =============== Starting Cerebras Compilation ===============
      Cerebras compilation completed: | | 17/? [00:18s, 1.09s/stages]
      =============== Cerebras Compilation Completed ===============

When the above compilation is successful, the model is guaranteed to run on the CS system. You can also use ``compile_only`` mode to pre-compile many different model configurations offline, so that you can more fully utilize your allotted CS system cluster time.
.. note::

   The compiler detects whether a binary already exists for a particular model config, and if it detects one during training, it skips compiling on the fly.
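The skip-if-cached behavior described in this note can be sketched as hashing the model configuration and checking for an existing artifact. This is a simplified illustration, not the actual Cerebras cache logic:

.. code-block:: python

   import hashlib
   import json

   def config_hash(params):
       """Deterministic hash of a model config dict -- illustrative only."""
       blob = json.dumps(params, sort_keys=True).encode()
       return hashlib.sha256(blob).hexdigest()[:16]

   def needs_compile(params, cache):
       """True when no cached binary exists for this exact config."""
       return config_hash(params) not in cache

   cache = set()
   params = {"batch_size": 256, "hidden_size": 500}
   if needs_compile(params, cache):
       cache.add(config_hash(params))   # compile once, then cache the result
   print(needs_compile(params, cache))  # -> False: cached, compile is skipped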
Train and evaluate on CPU
-------------------------

To train and evaluate the model on CPU, follow these steps:

1. Navigate to the model directory:

   .. code-block:: bash

      cd modelzoo/fc_mnist/tf/

2. Train and evaluate the model on the CPU:

   .. code-block:: bash

      # train on CPU
      csrun_cpu python run.py --mode train

      # run eval on CPU
      csrun_cpu python run.py --mode eval --eval_steps 1000
Run the model on the CS system
------------------------------

The ``csrun_wse`` command below compiles the code if no existing compile artifacts are found, and then runs the compiled executable on the CS system.

.. code-block:: bash

   csrun_wse python run.py --mode train \
       --cs_ip <IP-ADDRESS> \
       --max_steps 100000

.. note::

   The ``max_steps`` and other parameters, such as ``save_checkpoints_steps``, can also be set in the ``params.yaml`` file.
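Parameters such as ``max_steps`` can therefore come either from the command line or from ``params.yaml``. A minimal sketch of that precedence (CLI overrides the file); the config layout shown here is hypothetical and the real ``run.py`` may differ:

.. code-block:: python

   # Hypothetical illustration of CLI-over-YAML precedence.
   file_params = {
       "runconfig": {"max_steps": 50000, "save_checkpoints_steps": 1000}
   }
   cli_overrides = {"max_steps": 100000}  # e.g. from --max_steps 100000

   # Merge: values given on the command line win over the YAML file.
   runconfig = {**file_params["runconfig"], **cli_overrides}
   print(runconfig["max_steps"])               # CLI wins: 100000
   print(runconfig["save_checkpoints_steps"])  # falls back to YAML: 1000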
The above command trains the FC-MNIST model for 100,000 steps on the CS system at the IP address specified in the ``--cs_ip`` flag. When the command executes, you see output similar to the following:

.. code-block:: bash

   srun: job 5834 queued and waiting for resources
   srun: job 5834 has been allocated resources
   ...
   INFO:tensorflow:Graph was finalized.
   INFO:tensorflow:Running local_init_op.
   INFO:tensorflow:Done running local_init_op.
   INFO:tensorflow:Saving checkpoints for 0 into model_dir/model.ckpt.
   INFO:tensorflow:Programming CS system fabric. This may take a couple of minutes - please do not interrupt.
   INFO:tensorflow:Fabric programmed
   INFO:tensorflow:Coordinator fully up. Waiting for Streaming (using 0.97% out of 301600 cores on the fabric)
   INFO:tensorflow:Graph was finalized.
   INFO:tensorflow:Running local_init_op.
   INFO:tensorflow:Done running local_init_op.
   ...
   INFO:tensorflow:Training finished with 25600000 samples in 187.465 seconds, 136558.69 samples / second
   INFO:tensorflow:Saving checkpoints for 100000 into model_dir/model.ckpt.
   INFO:tensorflow:global step 100000: loss = 1.901388168334961e-05 (532.0 steps/sec)
   INFO:tensorflow:global step 100000: loss = 1.901388168334961e-05 (532.0 steps/sec)
   INFO:tensorflow:Loss for final step: 1.9e-05.
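The throughput figures in this log are internally consistent; you can sanity-check them from the reported totals (100,000 steps over 25,600,000 samples implies a batch size of 256):

.. code-block:: python

   # Cross-check the numbers reported in the training log above.
   total_samples = 25_600_000
   total_time = 187.465   # seconds, from the log
   steps = 100_000

   batch_size = total_samples // steps
   samples_per_sec = total_samples / total_time

   print(batch_size)                 # 256
   print(round(samples_per_sec, 2))  # close to the 136558.69 in the log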
Output files and artifacts
--------------------------

The output files and artifacts include a model directory (``model_dir``), which contains all the results and artifacts of the latest run, including:

- Compile directory (``cs_<hash>``)
- ``performance.json`` file
- Checkpoints
- TensorBoard event files
- ``yaml`` files

Compile dir – The ``cs_<hash>`` directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``cs_<hash>`` directory (also known as the cached compile directory) contains the ``.elf`` file, which is used to program the system.

The compilation output indicates whether the compile passed or failed; if it failed, the logs show at which stage compilation failed.

``performance.json`` file and its parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance directory contains the file ``<model_dir>/performance/performance.json``, which holds the following information:

- ``compile_time`` - The time it took to compile the model and generate the Cerebras executable.
- ``est_samples_per_sec`` - The estimated performance, in samples per second, based on the Cerebras compile. Note that this number is theoretical; actual performance may vary.
- ``programming_time`` - The time taken to prepare the system and load it with the compiled model.
- ``samples_per_sec`` - The actual performance of your run, i.e., the number of samples processed on the Wafer-Scale Engine per second.
- ``suspected_input_bottleneck`` - This is a beta feature. It indicates whether you are input-starved and need more input workers to feed the Cerebras system.
- ``total_samples`` - The total number of samples iterated during the execution.
- ``total_time`` - The total time it took to complete the total samples.
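A small sketch of reading these fields after a run. The directory layout follows the description above and the keys come from the list; the sample values here are invented for illustration:

.. code-block:: python

   import json
   import pathlib
   import tempfile

   # Write a toy performance.json so the example is self-contained;
   # a real run produces this file under <model_dir>/performance/.
   perf = {
       "compile_time": 142.0,
       "est_samples_per_sec": 140000.0,
       "samples_per_sec": 136558.69,
       "total_samples": 25600000,
       "total_time": 187.465,
   }
   perf_dir = pathlib.Path(tempfile.mkdtemp()) / "performance"
   perf_dir.mkdir()
   (perf_dir / "performance.json").write_text(json.dumps(perf))

   data = json.loads((perf_dir / "performance.json").read_text())
   # Ratio of achieved to estimated throughput
   efficiency = data["samples_per_sec"] / data["est_samples_per_sec"]
   print(f"achieved {data['samples_per_sec']:.0f} samples/s "
         f"({efficiency:.0%} of estimate)")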
Checkpoints
~~~~~~~~~~~

Checkpoints are stored in ``<model_dir>``; for example, ``<model_dir>/model-ckpt-0.index``, ``<model_dir>/model-ckpt-0.meta``, and ``<model_dir>/model-ckpt-1.data-00000-of-00001``. They are saved with the frequency specified in the ``runconfig`` file.
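The checkpoint step number can be recovered from filenames of this form. A hypothetical helper (not part of the Model Zoo) that finds the newest checkpoint:

.. code-block:: python

   import re

   # Filenames following the model-ckpt-<step>.* pattern shown above.
   files = [
       "model-ckpt-0.index",
       "model-ckpt-0.meta",
       "model-ckpt-1.data-00000-of-00001",
   ]

   def latest_step(names):
       """Return the highest checkpoint step found among the filenames."""
       steps = {int(m.group(1)) for n in names
                if (m := re.match(r"model-ckpt-(\d+)\.", n))}
       return max(steps)

   print(latest_step(files))  # -> 1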
TensorBoard event files
~~~~~~~~~~~~~~~~~~~~~~~

TensorBoard event files are also stored in ``<model_dir>``.
``yaml`` files content after the run
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``yaml`` file is stored in the train directory. It captures the specifics of the run: model-specific configuration (e.g., ``dropout``, ``activation_fn``), optimizer type and optimizer parameters, input data configuration (e.g., ``batch_size`` and shuffle), and run configuration (e.g., ``max_steps``, ``checkpoint_steps``, and ``num_epochs``).