.. _train-eval-predict:

Train, Eval, and Predict
========================

This section describes how to:

- Run training, eval or prediction on the Cerebras system.
- Without modifying your code, run training, eval or prediction on a CPU or a GPU.

Use the ``csrun_wse`` script to accomplish this, as shown in the examples below.

.. seealso::

    See :ref:`csrun-wse` for detailed documentation on how to use this Bash script.

.. _fig-csrun-wse:

.. figure:: ../../images/csrun-wse.png
    :align: center
    :width: 700 px

Examples
--------

Train
~~~~~

The following command executes the user command ``python run.py --mode=train --cs_ip=0.0.0.0``, which initiates model training on the CS system at the given ``cs_ip`` address.

.. code-block:: bash

    csrun_wse python run.py --mode=train --cs_ip=0.0.0.0

The following command executes the same user command, ``python run.py --mode=train --cs_ip=0.0.0.0``, which initiates model training on the CS system at the given ``cs_ip`` address. The Slurm orchestrator uses 3 nodes, 5 tasks per worker and 16 CPUs per task for this training run.

.. code-block:: bash

    csrun_wse --nodes=3 --tasks_per_worker=5 --cpus_per_task=16 python run.py --mode=train --cs_ip=0.0.0.0

The following command executes the user command ``python run.py --mode=train --params configs/your-params-file.yaml --model_dir your-model-dir --cs_ip=10.255.253.0``, which initiates training on the CS system at the IP address 10.255.253.0. If the Slurm variables are not passed to the ``csrun_wse`` script, then the values provided by the system administrator in the ``csrun_cpu`` script are used. See :ref:`config-csrun-cpu`.

.. code-block:: bash

    csrun_wse python run.py --mode=train \
        --params configs/your-params-file.yaml \
        --model_dir your-model-dir \
        --cs_ip=10.255.253.0

Eval
~~~~

.. code-block:: bash

    csrun_wse python run.py --mode=eval --cs_ip=0.0.0.0

- Executes the command ``python run.py --mode=eval --cs_ip=0.0.0.0``, which initiates model evaluation on the CS system at the given ``cs_ip`` address.

Predict
~~~~~~~

.. code-block:: bash

    csrun_wse python run.py --mode=predict --cs_ip=0.0.0.0

- Executes the command ``python run.py --mode=predict --cs_ip=0.0.0.0``, which initiates prediction using the trained model on the CS system at the given ``cs_ip`` address.

.. note::

    After you port your TensorFlow code to Cerebras, you can run training, eval or prediction on either the CS system or on CPU- and GPU-based systems without recoding your model.

Train on CPU, GPU
-----------------

On a CPU within the Cerebras cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To train on a CPU, do not use Slurm to invoke the standard Singularity container. Instead, launch the training command with ``csrun_cpu``. See the following example:

.. code-block:: bash

    csrun_cpu python run.py

.. seealso::

    See :ref:`csrun-cpu` for detailed documentation on how to use this Bash script.

On a CPU or GPU outside the Cerebras cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To run on a CPU or GPU outside of the CS system environment, you do not use the Cerebras container. However, you must use the version of ``CerebrasEstimator`` provided in the Cerebras Model Zoo repository. Follow these steps:

- Clone the Cerebras Model Zoo repository.
- Import the versions of ``CerebrasEstimator`` and ``RunConfig`` from the repo.
- Use these versions both within the Cerebras container and outside of the Cerebras environment to easily switch between the CS system and GPU.
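
The launch patterns in this section differ only in the wrapper placed in front of the user command. The sketch below is a dry-run illustration of that idea: it prints the command line for a chosen target rather than executing it. The function name ``build_run_cmd`` and the target labels ``wse``, ``cpu`` and ``external`` are hypothetical and not part of the Cerebras tooling. Passing ``--mode`` on the CPU paths is also an assumption, since the CPU example in this section shows only ``python run.py``.

```shell
#!/usr/bin/env bash
# Dry-run helper: prints a launch command instead of running it.
# build_run_cmd and its target labels are hypothetical, for illustration only.
build_run_cmd() {
  local target="$1"      # wse | cpu | external
  local mode="$2"        # train | eval | predict
  local cs_ip="${3:-}"   # only used for the wse target

  case "$target" in
    wse)
      # CS system: csrun_wse wraps the user command and needs --cs_ip.
      if [ -z "$cs_ip" ]; then
        echo "error: target 'wse' needs a CS system IP address" >&2
        return 1
      fi
      echo "csrun_wse python run.py --mode=${mode} --cs_ip=${cs_ip}"
      ;;
    cpu)
      # CPU inside the Cerebras cluster: csrun_cpu wraps the user command.
      # Assumption: run.py accepts the same --mode flag here.
      echo "csrun_cpu python run.py --mode=${mode}"
      ;;
    external)
      # CPU or GPU outside the cluster: no Cerebras wrapper script.
      echo "python run.py --mode=${mode}"
      ;;
    *)
      echo "error: unknown target '${target}'" >&2
      return 1
      ;;
  esac
}

# Example: print the command for training on the CS system.
build_run_cmd wse train 10.255.253.0
```

Because the helper only echoes the command, it is safe to try on a machine without the Cerebras scripts installed. Slurm options such as ``--nodes`` would go between ``csrun_wse`` and ``python``, as in the Train examples.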