Evaluate your model during training#

Overview#

The run.py script in the Cerebras Model Zoo can be used to evaluate your models through a forward pass. This script offers four modes, three of which perform evaluation, selectable using the --mode flag:

Table 7 Evaluation Modes in run.py#

| Flag | Description |
| --- | --- |
| train | Runs the training process for your model according to the specified parameters in the configuration file |
| train_and_eval | Evaluates a model at a fixed frequency during training. This is convenient for identifying issues early in long training runs |
| eval | Evaluates a specific checkpoint. The latest checkpoint will be used if you don’t provide the --checkpoint_path flag |
| eval_all | Evaluates all the checkpoints inside a model directory once the model has been trained |

Note

When evaluating a model with run.py, the latest saved checkpoint will be used by default. If no checkpoint exists, then weights will be initialized as stated in the YAML file, and the model will be evaluated using these weights. If you want to evaluate a previously trained model, make sure that the checkpoints are available in the model_dir or provide the --checkpoint_path flag.
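The fallback order described in this note can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the code run.py actually uses: the function name resolve_checkpoint, the checkpoint_*.mdl file pattern, and the use of file modification time to decide which checkpoint is "latest" are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint(model_dir: str,
                       checkpoint_path: Optional[str] = None,
                       pattern: str = "checkpoint_*.mdl") -> Optional[Path]:
    """Return the checkpoint to evaluate, or None if weights must be initialized.

    `pattern` is a hypothetical naming scheme used only for this sketch.
    """
    # 1. An explicitly provided checkpoint path always takes priority.
    if checkpoint_path is not None:
        return Path(checkpoint_path)

    # 2. Otherwise use the most recently modified checkpoint in model_dir
    #    (assumption: "latest" means newest modification time).
    candidates = list(Path(model_dir).glob(pattern))
    if candidates:
        return max(candidates, key=lambda p: p.stat().st_mtime)

    # 3. No checkpoint found: the caller would initialize weights
    #    as stated in the YAML file.
    return None
```

A return value of None corresponds to the case the note warns about: evaluation would run on freshly initialized weights rather than on a trained model.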

Train and Eval Mode#

This feature allows users to evaluate models periodically throughout long training runs, so that issues surface while the run is still in progress rather than after it finishes.

How to use this feature#

Within the runconfig portion of the config YAML:

  • Either num_epochs or num_steps must be defined (but not both)

  • If num_epochs is defined, train_and_eval trains for num_epochs epochs, with each epoch being followed by an evaluation

  • If num_steps is defined, the user must also define eval_frequency in the config. num_steps governs the total number of steps the model will train for and eval_frequency indicates how many steps will pass between each evaluation. For example, if num_steps is 100 and eval_frequency is 20, the model will train for 100 steps and will be evaluated after 20, 40, 60, 80, and 100 steps.
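The rules in this list can be checked with a small Python sketch that validates the three runconfig values and lists when evaluations would run. The helper name eval_points is hypothetical and not part of the Model Zoo. The guide does not say what happens when num_steps is not a multiple of eval_frequency; this sketch assumes no extra evaluation runs after the last full interval.

```python
from typing import List, Optional


def eval_points(num_steps: Optional[int] = None,
                eval_frequency: Optional[int] = None,
                num_epochs: Optional[int] = None) -> List[int]:
    """Validate the runconfig rules and list when evaluation runs.

    Returns step numbers if num_steps is set, epoch numbers if num_epochs is set.
    """
    # Exactly one of num_epochs or num_steps must be defined.
    if (num_steps is None) == (num_epochs is None):
        raise ValueError("define exactly one of num_epochs or num_steps")

    if num_epochs is not None:
        # Each epoch is followed by one evaluation.
        return list(range(1, num_epochs + 1))

    # num_steps requires eval_frequency.
    if eval_frequency is None or eval_frequency <= 0:
        raise ValueError("num_steps requires a positive eval_frequency")

    # Evaluate after every eval_frequency training steps
    # (assumption: a trailing partial interval gets no evaluation).
    return list(range(eval_frequency, num_steps + 1, eval_frequency))


print(eval_points(num_steps=100, eval_frequency=20))  # [20, 40, 60, 80, 100]
```

The printed list matches the example above: with num_steps set to 100 and eval_frequency set to 20, evaluation happens after steps 20, 40, 60, 80, and 100.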

When running your model, set the --mode flag to train_and_eval:

Example:

(venv_cerebras_pt) $ python run.py --mode=train_and_eval --model_dir=<path> --params=<config_path> ...<rest of the args>

Note

Training and evaluation require different fabric programming on the CS-2 system. As a result, using train_and_eval mode on the Cerebras Wafer-Scale cluster incurs additional overhead each time training is paused to perform an evaluation.

Eval All Mode#

This feature allows users to run evaluation on multiple model checkpoints within a provided model_dir. This permits users to evaluate models that have already been trained, though Train and Eval Mode may be more suitable for evaluating a model throughout a training run.

How to use this feature#

Provide eval_all as the argument for the --mode flag, and specify the directory containing the model checkpoints with the --model_dir flag.

Example:

(venv_cerebras_pt) $ python run.py --mode=eval_all --model_dir=<path> ...<rest of the args>
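When runs are launched from a script, the command can be assembled programmatically. The following Python sketch builds the argument list using only the flags mentioned in this guide (--mode, --model_dir, --params, --checkpoint_path). The helper build_run_command is hypothetical, and any additional arguments your model needs (the "rest of the args" in the example) are left out.

```python
from typing import List, Optional

# The four modes listed in this guide.
MODES = ("train", "train_and_eval", "eval", "eval_all")


def build_run_command(mode: str,
                      model_dir: str,
                      params: Optional[str] = None,
                      checkpoint_path: Optional[str] = None) -> List[str]:
    """Build an argument list for run.py (hypothetical helper for illustration)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")

    cmd = ["python", "run.py", f"--mode={mode}", f"--model_dir={model_dir}"]
    if params is not None:
        cmd.append(f"--params={params}")
    if checkpoint_path is not None:
        cmd.append(f"--checkpoint_path={checkpoint_path}")
    return cmd


# Placeholder path; substitute your own model directory.
print(" ".join(build_run_command("eval_all", "<path>")))
```

The resulting list can be passed to subprocess.run from inside the activated virtual environment. Rejecting unknown mode names before launch catches typos without starting a job.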

Conclusion#

Evaluating your model during training is crucial for identifying and addressing issues early, ensuring the effectiveness of your model. The run.py script in the Cerebras Model Zoo provides versatile evaluation modes through the --mode flag, allowing you to choose the most suitable method for your needs. Whether you are evaluating a single checkpoint, all checkpoints, or conducting evaluations at regular intervals during training, these features offer flexibility and control over the evaluation process.

With train_and_eval, you can track evaluation results at fixed intervals and adjust a run before it finishes. With eval_all, you can compare every saved checkpoint of a trained model and pick the one that performs best.