Failing to save checkpoints using experimental PyTorch API
Observed Error
When checkpoint_steps is set to 1 or save_initial_checkpoint is set
to True in the runconfig section of the params.yaml file, the
resulting checkpoints at steps 0 and 1 are invalid and generally cannot be used.
e.g. params.yaml

    ...
    runconfig:
        ...
        experimental_api: True
        checkpoint_steps: 1  # not supported in 1.8 when using experimental API
        save_initial_checkpoint: True  # not supported in 1.8 when using experimental API
        ...
    ...
Explanation
A bug was found in the format in which these two checkpoints are saved. As a result, they do not fully conform to Cerebras’s H5 checkpointing format and may fail to load at all. The bug may also affect the state of the rest of the run.
Workaround
If checkpoints are desired, please ensure that checkpoint_steps is greater than 1 and that save_initial_checkpoint is not set to True.
e.g. params.yaml

    ...
    runconfig:
        ...
        experimental_api: True
        checkpoint_steps: 1000  # must be greater than 1
        save_initial_checkpoint: False  # must not be True
        ...
    ...
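
The two conditions in the workaround can also be checked before a run is launched. The following is a minimal sketch, not part of any Cerebras tooling: the helper name find_unsafe_checkpoint_settings is hypothetical, and it assumes the contents of params.yaml have already been loaded into a plain Python dict by whatever YAML loader you use.

```python
def find_unsafe_checkpoint_settings(params):
    """Return a list of problems with the runconfig checkpoint settings.

    Hypothetical helper: `params` is assumed to be the parsed contents of
    params.yaml as a plain dict. An empty list means no problem was found.
    """
    runconfig = params.get("runconfig", {})
    problems = []

    # The issue described here only applies when the experimental API is on.
    if not runconfig.get("experimental_api", False):
        return problems

    # checkpoint_steps: 1 produces an invalid checkpoint at step 1.
    if runconfig.get("checkpoint_steps") == 1:
        problems.append("checkpoint_steps must be greater than 1")

    # save_initial_checkpoint: True produces an invalid checkpoint at step 0.
    if runconfig.get("save_initial_checkpoint", False):
        problems.append("save_initial_checkpoint must not be True")

    return problems


if __name__ == "__main__":
    # Same values as the first params.yaml example on this page.
    example = {
        "runconfig": {
            "experimental_api": True,
            "checkpoint_steps": 1,
            "save_initial_checkpoint": True,
        }
    }
    for problem in find_unsafe_checkpoint_settings(example):
        print("unsupported setting:", problem)
```

Returning a list of messages rather than raising on the first problem lets both settings be reported in a single pass.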