Deterministically restart a dataloader#
Overview#
When training a Large Language Model (LLM), you might have to pause the training run for various reasons, such as changing the batch size or a hyperparameter, addressing training instabilities, or training with more hardware.
To support this, you can save model checkpoints so that training can restart from the point where it was paused. This feature offers the same capability for the input-generating dataloader. This matters because a model trained on duplicate samples performs worse than one trained on deduplicated data ([1, 2]). Even if a dataset is deduplicated, a dataloader that is not restarted deterministically after training is paused can feed the model samples or batches it has already seen. This can lead to memorization and degrade the model's upstream and downstream performance.
To overcome this issue, Cerebras offers a feature that enables you to resume training deterministically, from the same point in the input-generating dataloader where the previous run was halted, ensuring the model is not trained on repeated data samples.
Note
This feature requires train_input.num_workers=1.
Resuming training deterministically#
For your original run starting from global step 0, set the parameter cerebras.save_iter_state_path in the config YAML file to the mounted directory where the data checkpoints will be written. Note that this directory must be visible to the worker nodes in Weight Streaming mode. Ensure this path is a mounted path specified under the --mount_dirs argument when launching the run (see the example command below). A new directory will be created if it does not already exist.
The following example shows how to set the cerebras.save_iter_state_path parameter to the mounted Model Zoo path in the YAML config file:
cerebras:
save_iter_state_path: </path/to/mounted/modelzoo/dir>
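The same directory must also be visible to the worker nodes, which is done by passing it to --mount_dirs when launching the run. The command below is a minimal sketch; the placeholder paths and <mode> are illustrative:
(venv_cerebras_pt) $ python run.py \
    CSX \
    -p </path/to/params> \
    -m <mode> \
    --model_dir </path/to/model/dir> \
    --mount_dirs </path/to/mounted/modelzoo/dir>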
Once the run starts, confirm that the dataloader saves its state under the provided path. You should see two types of files in </path/to/mounted/modelzoo/dir>:
data_iter_checkpoint_state_file_global
data_iter_state_file_worker_<worker_id>_step_<global_step>.txt
The data_iter_checkpoint_state_file_global file records the integer step at which the last weight checkpoint for a given run was captured. The data_iter_state_file_worker_<worker_id>_step_<global_step>.txt files record the iterator state at each of these global steps.
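For illustration, after a single-worker run that saved checkpoints at steps 1000 and 2000, the directory might contain something like the following listing; the worker ID and step values are hypothetical:
(venv_cerebras_pt) $ ls </path/to/mounted/modelzoo/dir>
data_iter_checkpoint_state_file_global
data_iter_state_file_worker_0_step_1000.txt
data_iter_state_file_worker_0_step_2000.txt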
Note
The number of individual worker checkpoints matches the number of weight checkpoints; that is, the dataloader state is saved at the same frequency at which the model checkpoints are captured.
Best practices#
When restarting, provide the same path for cerebras.save_iter_state_path in the YAML, and the input dataloader will automatically restart from the same state as the original run.
Try rewinding and restarting from a different step in the previous training run. To do this, modify data_iter_checkpoint_state_file_global and set it to the global step at which you want the run to restart, as shown in the example below. Note that this global step must be one of the steps at which a dataloader checkpoint was saved.
Try restarting the run with a different batch size; the deterministic data restart feature should work seamlessly.
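As a sketch of the rewind step above, assume the global state file contains only the integer step (as described earlier) and that a dataloader checkpoint was saved at step 1000; the step values shown are hypothetical:
(venv_cerebras_pt) $ cat </path/to/mounted/modelzoo/dir>/data_iter_checkpoint_state_file_global
2000
(venv_cerebras_pt) $ echo 1000 > </path/to/mounted/modelzoo/dir>/data_iter_checkpoint_state_file_global
After editing the file, restart the run with the same cerebras.save_iter_state_path, and the dataloader should resume from the rewound step.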
For Large Language Models, set the parameter runconfig.num_workers_per_csx in the training config file to 1:
runconfig:
num_workers_per_csx: 1
This argument can also be passed on the command line, as shown below:
(venv_cerebras_pt) $ python run.py \
    CSX \
    -p </path/to/params> \
    -m <mode> \
    --model_dir </path/to/model/dir> \
    --num_workers_per_csx 1
Note
To restart a run from a checkpoint other than the latest, you must manually modify the global state file.