Cerebras Model Zoo Callbacks#
This module contains the base Callback class, a number of core callbacks that the Trainer invokes directly, and optional callbacks that can be used to extend the functionality of the Trainer.
- class cerebras.modelzoo.trainer.callbacks.Callback[source]#
Base class for all callbacks.
- pre_setup(trainer)[source]#
Called before the trainer setup.
- Parameters
trainer (Trainer) – Trainer instance.
- setup(trainer)[source]#
Setup the callback using the trainer.
- Parameters
trainer (Trainer) – Trainer instance.
- on_enter_fit(trainer, stack, train_dataloader, val_dataloader, loop)[source]#
Hook that allows arbitrary context managers to be entered at the beginning of the fit method.
- Parameters
trainer (Trainer) – Trainer instance.
stack (ExitStack) – ExitStack object.
train_dataloader (cerebras.pytorch.utils.data.DataLoader) – Train dataloader.
val_dataloader (cerebras.pytorch.utils.data.DataLoader) – Validation dataloader.
loop (TrainingLoop) – TrainingLoop object.
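As a minimal sketch of how the stack argument can be used, the following registers a context manager on the ExitStack so that it stays open for the duration of fit. The names fit_scope, FitScopeCallback, and fake_fit are hypothetical, and fake_fit only imitates how a trainer might invoke the hook; it is not the Trainer's actual implementation.

```python
from contextlib import ExitStack, contextmanager

events = []


@contextmanager
def fit_scope(name):
    # Hypothetical resource that must stay open for the whole fit call.
    events.append(f"enter {name}")
    try:
        yield
    finally:
        events.append(f"exit {name}")


class FitScopeCallback:
    # Stand-in for a Callback subclass; only the documented hook is shown.
    def on_enter_fit(self, trainer, stack, train_dataloader, val_dataloader, loop):
        # Registering on the stack keeps the context open until the stack closes.
        stack.enter_context(fit_scope("fit"))


def fake_fit(callbacks):
    # Hypothetical driver imitating how a trainer might invoke the hook.
    with ExitStack() as stack:
        for callback in callbacks:
            callback.on_enter_fit(None, stack, None, None, None)
        events.append("training")


fake_fit([FitScopeCallback()])
print(events)
```

Because the context is entered through the stack rather than a with statement inside the hook, it is exited only when the enclosing ExitStack closes.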
- on_fit_start(trainer, train_dataloader, val_dataloader, loop)[source]#
Called at the beginning of the fit method.
- Parameters
trainer (Trainer) – Trainer instance.
train_dataloader (cerebras.pytorch.utils.data.DataLoader) – Train dataloader.
val_dataloader (cerebras.pytorch.utils.data.DataLoader) – Validation dataloader.
loop (TrainingLoop) – TrainingLoop object.
- on_fit_end(trainer, loop)[source]#
Called at the end of the fit method.
- Parameters
trainer (Trainer) – Trainer instance.
loop (TrainingLoop) – TrainingLoop object.
- on_fit_exception(trainer, exception)[source]#
Called if an exception is raised during fit.
- Parameters
trainer (Trainer) – Trainer instance.
exception (Exception) – Exception object.
- on_enter_train(trainer, stack, train_dataloader, loop, loop_idx)[source]#
Hook that allows arbitrary context managers to be entered at the beginning of every training iteration.
- Parameters
trainer (Trainer) – Trainer instance.
stack (ExitStack) – ExitStack object.
train_dataloader (cerebras.pytorch.utils.data.DataLoader) – Train dataloader.
loop (TrainingLoop) – TrainingLoop object.
loop_idx (int) – training loop index.
- on_train_start(trainer, model, train_dataloader, loop, loop_idx)[source]#
Called at the beginning of the train loop.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
train_dataloader (cerebras.pytorch.utils.data.DataLoader) – Train dataloader.
loop (TrainingLoop) – TrainingLoop object.
loop_idx (int) – training loop index.
- on_train_end(trainer, model, loop, loop_idx)[source]#
Called at the end of the train loop.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
loop (TrainingLoop) – TrainingLoop object.
loop_idx (int) – training loop index.
- on_train_exception(trainer, exception)[source]#
Called if an exception is raised during a training iteration.
- Parameters
trainer (Trainer) – Trainer instance.
exception (Exception) – Exception object.
- on_train_batch_start(trainer, model, batch, batch_idx)[source]#
Called at the beginning of every training iteration.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
batch (Any) – Batch data.
batch_idx (int) – Batch index.
- on_train_batch_end(trainer, model, outputs, batch, batch_idx)[source]#
Called at the end of every training iteration.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
outputs (Dict[str, Any]) – Model outputs.
batch (Any) – Batch data.
batch_idx (int) – Batch index.
- run_validation(trainer, loop_idx, is_last)[source]#
Perform a validation run.
Override this method to perform a custom validation run.
- Parameters
trainer (Trainer) – Trainer instance.
val_dataloader – Validation dataloader.
loop_idx (int) – Training loop index.
is_last (bool) – Whether the last training iteration just happened.
- on_enter_validate(trainer, stack, val_dataloader, loop)[source]#
Hook that allows arbitrary context managers to be entered at the beginning of every validation run.
- Parameters
trainer (Trainer) – Trainer instance.
stack (ExitStack) – ExitStack object.
val_dataloader (cerebras.pytorch.utils.data.DataLoader) – Validation dataloader.
loop (ValidationLoop) – ValidationLoop object.
- on_validate_start(trainer, model, val_dataloader, loop)[source]#
Called at the beginning of the validation loop.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
val_dataloader (cerebras.pytorch.utils.data.DataLoader) – Validation dataloader.
loop (ValidationLoop) – ValidationLoop object.
- on_validate_end(trainer, model, loop)[source]#
Called at the end of the validation loop.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
loop (ValidationLoop) – ValidationLoop object.
- on_validate_exception(trainer, exception)[source]#
Called if an exception is raised during validation.
- Parameters
trainer (Trainer) – Trainer instance.
exception (Exception) – Exception object.
- on_validate_batch_start(trainer, model, batch, batch_idx)[source]#
Called at the beginning of every validation iteration.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
batch (Any) – Batch data.
batch_idx (int) – Batch index.
- on_validate_batch_end(trainer, model, outputs, batch, batch_idx)[source]#
Called at the end of every validation iteration.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
outputs (Dict[str, Any]) – Model outputs.
batch (Any) – Batch data.
batch_idx (int) – Batch index.
- on_enter_validate_all(trainer, stack, val_dataloaders, loop)[source]#
Hook that allows arbitrary context managers to be entered at the beginning of every validate_all run.
- Parameters
trainer (Trainer) – Trainer instance.
stack (ExitStack) – ExitStack object.
val_dataloaders (cerebras.pytorch.utils.data.DataLoader) – Validation dataloaders.
loop (ValidationLoop) – ValidationLoop object.
- on_before_forward(trainer, model, batch, args, kwargs)[source]#
Called before the forward pass.
Callbacks may add entries to args and kwargs to pass additional arguments to the forward method.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
batch (Any) – Batch data.
args (List[Any]) – Forward pass arguments.
kwargs (dict) – Forward pass keyword arguments.
- on_after_forward(trainer, model, outputs, batch)[source]#
Called after the forward pass.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
outputs (Dict[str, Any]) – Model outputs.
batch (Any) – Batch data.
- on_before_backward(trainer, model, outputs)[source]#
Called before the backward pass.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
outputs (Dict[str, Any]) – Model outputs.
- on_after_backward(trainer, model, outputs)[source]#
Called after the backward pass.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
outputs (Dict[str, Any]) – Model outputs.
batch_idx – Batch index.
- on_before_optimizer_step(trainer, model, optimizer)[source]#
Called before the optimizer step.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
optimizer (cerebras.pytorch.optim.Optimizer) – Optimizer instance.
- on_after_optimizer_step(trainer, model, optimizer)[source]#
Called after the optimizer step.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
optimizer (cerebras.pytorch.optim.Optimizer) – Optimizer instance.
- on_before_optimizer_zero_grad(trainer, model, optimizer)[source]#
Called before the optimizer zero_grad.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
optimizer (cerebras.pytorch.optim.Optimizer) – Optimizer instance.
- on_after_optimizer_zero_grad(trainer, model, optimizer)[source]#
Called after the optimizer zero_grad.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
optimizer (cerebras.pytorch.optim.Optimizer) – Optimizer instance.
- on_before_scheduler_step(trainer, model, optimizer, scheduler)[source]#
Called before the scheduler step.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
optimizer (cerebras.pytorch.optim.Optimizer) – Optimizer instance.
scheduler (cerebras.pytorch.optim.scheduler.Scheduler) – A scheduler instance.
- on_after_scheduler_step(trainer, model, optimizer, scheduler)[source]#
Called after the scheduler step.
- Parameters
trainer (Trainer) – Trainer instance.
model (torch.nn.Module) – Model instance.
optimizer (cerebras.pytorch.optim.Optimizer) – Optimizer instance.
scheduler (cerebras.pytorch.optim.scheduler.Scheduler) – A scheduler instance.
- on_save_checkpoint(trainer, state_dict)[source]#
Called before saving the checkpoint.
Callbacks should override this method to add states to the checkpoint.
- Parameters
trainer (Trainer) – Trainer instance.
state_dict (dict) – Trainer state dictionary.
- postprocess_checkpoint(trainer, state_dict)[source]#
Called after constructing the checkpoint.
Callbacks should override this method to modify the checkpoint before saving.
- Parameters
trainer (Trainer) – Trainer instance.
state_dict (dict) – Trainer state dictionary.
- on_after_save_checkpoint(trainer, ckpt_path)[source]#
Called after saving the checkpoint.
- Parameters
trainer (Trainer) – Trainer instance.
ckpt_path (str) – Checkpoint path.
- on_before_load_checkpoint(trainer, ckpt_path)[source]#
Called before loading the checkpoint.
- Parameters
trainer (Trainer) – Trainer instance.
ckpt_path (str) – Checkpoint path.
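The hooks above can be combined in a user-defined callback. The following is a minimal sketch, assuming only the hook signatures documented above: it counts training batches and adds the count to the checkpoint through on_save_checkpoint. The class name BatchCounter and the state key "batch_counter" are hypothetical, and the fallback Callback class exists only so the sketch runs without the package installed.

```python
try:
    from cerebras.modelzoo.trainer.callbacks import Callback
except ImportError:
    class Callback:
        # Stand-in base class so the sketch runs without the package.
        pass


class BatchCounter(Callback):
    def __init__(self):
        super().__init__()
        self.batches_seen = 0

    def on_train_batch_end(self, trainer, model, outputs, batch, batch_idx):
        # Called at the end of every training iteration.
        self.batches_seen += 1

    def on_save_checkpoint(self, trainer, state_dict):
        # Add this callback's state under a hypothetical key.
        state_dict["batch_counter"] = {"batches_seen": self.batches_seen}


# Drive the hooks by hand; a real Trainer would call them during fit.
counter = BatchCounter()
for batch_idx in range(3):
    counter.on_train_batch_end(None, None, {}, None, batch_idx)

state = {}
counter.on_save_checkpoint(None, state)
print(state)
```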
- class cerebras.modelzoo.trainer.callbacks.ValidationCallback[source]#
A special type of callback that indicates to the trainer that it will perform some custom validation logic.
This is useful for callbacks that need to perform downstream validation logic that is not covered by the default validation loop.
All ValidationCallbacks must implement the following methods:
run_validation
Essentially, you are telling the trainer what to run at the end of each training run.
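A minimal sketch of a ValidationCallback subclass follows, assuming the run_validation(trainer, loop_idx, is_last) signature documented for the base Callback class. The class name FinalOnlyValidation and its behavior (running custom logic only after the last training iteration) are illustrative, and the fallback base class is a stand-in for environments without the package.

```python
try:
    from cerebras.modelzoo.trainer.callbacks import ValidationCallback
except ImportError:
    class ValidationCallback:
        # Stand-in base class so the sketch runs without the package.
        def run_validation(self, trainer, loop_idx, is_last):
            raise NotImplementedError


class FinalOnlyValidation(ValidationCallback):
    def __init__(self):
        super().__init__()
        self.runs = []

    def run_validation(self, trainer, loop_idx, is_last):
        # Skip every training run except the last one.
        if not is_last:
            return
        # Placeholder for custom downstream validation logic.
        self.runs.append(loop_idx)


# Drive the hook by hand; a real Trainer would call it after each training run.
cb = FinalOnlyValidation()
for loop_idx in range(3):
    cb.run_validation(None, loop_idx, is_last=(loop_idx == 2))
print(cb.runs)
```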
Core Callbacks#
The set of callbacks that implement core behaviour inside the Trainer.
ArtifactDirCallback
#
- class cerebras.modelzoo.trainer.callbacks.ArtifactDirCallback[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Sets up the artifact directory and writes metadata about the run to the executor artifact directory.
- loop#
The loop object from which to extract metadata.
BackendCallback
#
- class cerebras.modelzoo.trainer.callbacks.BackendCallback(backend, device)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback to set the backend for the trainer.
- Parameters
backend (cstorch.Backend or None) – The backend object to be used for the trainer. If None, the device argument must be provided. If both are provided, an error is raised.
device (str or None) – The device type to be used for the trainer. If None, the backend argument must be provided.
Checkpoint
#
- class cerebras.modelzoo.trainer.callbacks.Checkpoint(steps=None, autoload_last_checkpoint=True, disable_strict_checkpoint_loading=False, save_initial_checkpoint=False, checkpoint_name='checkpoint_{step}.mdl')[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
A callback that handles standard checkpointing logic.
- Parameters
steps (Optional[int]) – The frequency at which to save a checkpoint. If None, no checkpoints will be saved. Defaults to None.
autoload_last_checkpoint (bool) – Whether to autoload the last checkpoint in the model directory. Defaults to True.
disable_strict_checkpoint_loading (bool) – Whether to disable strict checkpoint loading. If True, the model will not raise an error if the checkpoint contains keys that are not present in the model. Defaults to False.
save_initial_checkpoint (bool) – Whether to save the initial checkpoint at the start of training. Defaults to False.
checkpoint_name (str) – The unformatted name of the checkpoint file. The string will be formatted with the following keys: step
- static check_compatibility(state_dict)[source]#
Checks that the checkpoint is compatible with the current version of modelzoo.
- get_checkpoint_path(ckpt_dir, step)[source]#
Construct a path to the checkpoint file.
If a checkpoint already exists inside the given checkpoint directory at the given step, append a timestamp to the filename.
- Parameters
ckpt_dir (str) – The directory where the checkpoint will be saved.
step (int) – The step at which the checkpoint is saved.
- Returns
A path to which the checkpoint can be saved
- Return type
pathlib.Path
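The behavior described for get_checkpoint_path can be sketched as follows. This is not the library's implementation: the function name sketch_checkpoint_path and the timestamp format are assumptions, and only the default checkpoint_name template comes from the signature above.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def sketch_checkpoint_path(ckpt_dir, step, checkpoint_name="checkpoint_{step}.mdl"):
    # Format the unformatted name with the step, as the docs describe.
    path = Path(ckpt_dir) / checkpoint_name.format(step=step)
    if path.exists():
        # Assumed timestamp format; the real one may differ.
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = path.with_name(f"{path.stem}_{stamp}{path.suffix}")
    return path


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = sketch_checkpoint_path(ckpt_dir, 100)
    first.touch()  # simulate a checkpoint already saved at this step
    second = sketch_checkpoint_path(ckpt_dir, 100)

print(first.name)
print(second.name)
```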
DataLoaderCallback
#
- class cerebras.modelzoo.trainer.callbacks.DataLoaderCallback[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback class that handles saving and loading dataloader state to the checkpoint.
- dataloader#
The training dataloader object to save to the checkpoint.
GradientAccumulationCallback
#
- class cerebras.modelzoo.trainer.callbacks.GradientAccumulationCallback[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback class to accumulate gradients.
- grad_accum_steps#
The number of steps to accumulate gradients for before stepping the optimizer.
- should_run_optimizer_step#
If True, run the optimizer step in the current step.
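One plausible rule behind should_run_optimizer_step is sketched below. The modulo condition and the free function are assumptions for illustration; the callback itself exposes the value as an attribute and may compute it differently.

```python
def should_run_optimizer_step(batch_idx, grad_accum_steps):
    # Step the optimizer once every grad_accum_steps batches (assumed rule).
    return (batch_idx + 1) % grad_accum_steps == 0


# With grad_accum_steps=4, gradients from batches 0-3 are accumulated
# and the optimizer steps after batch 3, then again after batch 7.
stepping_batches = [i for i in range(8) if should_run_optimizer_step(i, 4)]
print(stepping_batches)
```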
LoopCallback
#
- class cerebras.modelzoo.trainer.callbacks.LoopCallback[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback, abc.ABC
Base class for all loop callbacks.
This class should not be instantiated directly. Only subclasses of LoopCallback should be used.
The loop callback owns the global step and is responsible for incrementing it after each training step.
TrainingLoop
#
- class cerebras.modelzoo.trainer.callbacks.TrainingLoop(num_steps=None, max_steps=None, num_epochs=None, steps_per_epoch=None, eval_frequency=1.0, eval_steps=None, grad_accum_steps=1)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.loop.LoopCallback
Callback class that manages the training loop.
- Parameters
num_steps (Optional[int]) – The total number of training steps to perform. This will take precedence over max_steps.
max_steps (Optional[int]) – The maximum number of training steps to perform. If provided, max_steps takes the global step into account. That is, providing max_steps is equivalent to setting num_steps = max_steps - global_step.
num_epochs (Optional[int]) – The number of epochs to train for. This argument is mutually exclusive with num_steps.
steps_per_epoch (Optional[int]) – Number of steps to train for in each epoch.
eval_frequency (Optional[Union[int, float]]) – Frequency of evaluation during training. It can be:
- a positive integer, which specifies the number of training steps between evaluations.
- a float in the range [0.0, 1.0], which specifies the fraction of training steps between evaluations. For example, if eval_frequency=0.5, evaluation is performed once after half of the training steps have completed and once more at the end of training.
If None or zero, no evaluation is performed during training.
eval_steps (Optional[int]) – The number of validation steps to perform.
grad_accum_steps (int) – Number of steps to accumulate gradients before performing an optimizer step. This is only relevant for CPU/GPU runs.
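The interaction between num_steps, max_steps, and eval_frequency described above can be sketched as follows. Both helper functions are hypothetical; the clamp to zero and the rounding of fractional frequencies are assumptions, not documented behavior.

```python
import math


def resolve_num_steps(num_steps=None, max_steps=None, global_step=0):
    # num_steps takes precedence over max_steps.
    if num_steps is not None:
        return num_steps
    if max_steps is not None:
        # Equivalent to num_steps = max_steps - global_step (clamp is assumed).
        return max(max_steps - global_step, 0)
    raise ValueError("Provide num_steps or max_steps.")


def eval_interval(eval_frequency, num_steps):
    # None or zero disables evaluation during training.
    if not eval_frequency:
        return None
    if isinstance(eval_frequency, float):
        if not 0.0 <= eval_frequency <= 1.0:
            raise ValueError("Float eval_frequency must be in [0.0, 1.0].")
        # Fraction of the training steps; rounding up is an assumption.
        return max(1, math.ceil(eval_frequency * num_steps))
    return int(eval_frequency)


steps = resolve_num_steps(max_steps=1000, global_step=400)
print(steps, eval_interval(0.5, steps), eval_interval(100, steps))
```

Resuming at global step 400 with max_steps=1000 leaves 600 steps to run, so eval_frequency=0.5 corresponds to evaluating every 300 steps under these assumptions.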
ValidationLoop
#
- class cerebras.modelzoo.trainer.callbacks.ValidationLoop(eval_steps=None, hook='validate')[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.loop.LoopCallback
Callback class that manages the validation loop.
- Parameters
eval_steps (Optional[int]) – The number of validation steps to perform.
hook – The base name of the validation hooks to run. Default: “validate”.
- property eval_steps: int#
Returns the number of validation steps to perform.
Logging
#
- class cerebras.modelzoo.trainer.callbacks.Logging(log_steps, log_level='INFO', wsc_log_level=None, enable_act_frequency=False)#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback, cerebras.modelzoo.trainer.callbacks.logging.ClassLogger
Callback that handles setting up the Trainer’s logger as well as facilitates the cadence of logging.
- Parameters
log_steps (int) – Number of steps after which to log.
log_level (str) – Logging level for the Python logger.
wsc_log_level (Optional[dict]) – Specifies the logging level for particular Wafer-Scale Cluster servers or tasks.
enable_act_frequency (bool) – If True, set the activation steps to be the log steps.
- setup_logging()#
- setup_logging_excepthook()#
- set_wsc_log_level()#
- flush_logs()#
- is_log_step()#
ModelCallback
#
- class cerebras.modelzoo.trainer.callbacks.ModelCallback(model)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback class that handles setting up and compiling the model.
- Parameters
model (Union[Callable[[], torch.nn.Module], torch.nn.Module]) – The model to train. It must be one of the following:
- If a callable is passed, it is assumed to be a function that takes no arguments and returns a torch.nn.Module.
- If a torch.nn.Module is passed, it is used as is.
OptimizerCallback
#
- class cerebras.modelzoo.trainer.callbacks.OptimizerCallback(optimizer=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback to set up the optimizer for the Trainer.
- Parameters
optimizer (Optional[Union[cerebras.pytorch.optim.optimizer.Optimizer, Callable[[torch.nn.Module], cerebras.pytorch.optim.optimizer.Optimizer]]]) – Optimizer to be used for training. It can be an instance of cstorch.optim.Optimizer or a callable that takes a torch.nn.Module as input and returns an instance of cstorch.optim.Optimizer. If None, the optimizer will not be set up by this callback.
Precision
#
- class cerebras.modelzoo.trainer.callbacks.Precision[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback, abc.ABC
Base precision class for implementing custom backwards pass and optimization step to handle different precision types.
- abstract autocast_context_manager()[source]#
Returns the context manager that performs autocasting for the forward pass.
- abstract backward(loss)[source]#
Performs the backward pass.
- Parameters
loss (torch.Tensor) – Loss tensor.
- abstract clip_gradients(optimizer)[source]#
Clips the gradients before the optimization step.
- Parameters
optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – The optimizer to step.
- abstract optimizer_step(optimizer)[source]#
Performs the optimization step.
- Parameters
optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – The optimizer to step.
MixedPrecision
#
- class cerebras.modelzoo.trainer.callbacks.MixedPrecision(enabled=True, fp16_type='bfloat16', precision_opt_level=None, loss_scaling_factor=1.0, initial_loss_scale=None, steps_per_increase=2000, min_loss_scale=None, max_loss_scale=None, max_gradient_norm=None, max_gradient_value=None, log_loss_scale=False)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.precision.Precision
Callback class that handles mixed precision training.
- Parameters
enabled (bool) – If True, enables mixed precision training.
fp16_type (Literal['float16', 'bfloat16', 'cbfloat16']) – Half precision type. One of “float16”, “bfloat16”, “cbfloat16”.
precision_opt_level (Optional[Literal[0, 1, 2]]) – Precision optimization level. If not None, sets the global precision optimization level.
loss_scaling_factor (Union[float, Literal['dynamic']]) – Initial loss scaling factor.
initial_loss_scale (Optional[float]) – Initial loss scale.
steps_per_increase (int) – Number of steps before increasing the loss scale.
min_loss_scale (Optional[float]) – Minimum loss scale.
max_loss_scale (Optional[float]) – Maximum loss scale.
max_gradient_norm (Optional[float]) – Maximum gradient norm for gradient clipping.
max_gradient_value (Optional[float]) – Maximum gradient value for gradient clipping.
log_loss_scale (bool) – If True, log the gradient scaler’s loss scale.
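The parameters steps_per_increase, min_loss_scale, and max_loss_scale suggest a conventional dynamic loss scaling scheme. The sketch below illustrates that general technique only: the class name, the default values, and the factor-of-two growth and backoff are assumptions and are not taken from MixedPrecision.

```python
class DynamicLossScaleSketch:
    def __init__(
        self,
        initial_loss_scale=2.0**15,  # assumed default for illustration
        steps_per_increase=2000,
        min_loss_scale=1.0,  # assumed default for illustration
        max_loss_scale=2.0**31,  # assumed default for illustration
    ):
        self.scale = initial_loss_scale
        self.steps_per_increase = steps_per_increase
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self.good_steps = 0

    def update(self, grads_are_finite):
        """Return True if the optimizer step should be applied."""
        if not grads_are_finite:
            # Overflow: halve the scale (assumed factor) and skip the step.
            self.scale = max(self.scale / 2.0, self.min_loss_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.steps_per_increase:
            # Enough consecutive finite steps: double the scale (assumed factor).
            self.scale = min(self.scale * 2.0, self.max_loss_scale)
            self.good_steps = 0
        return True


scaler = DynamicLossScaleSketch(
    initial_loss_scale=8.0, steps_per_increase=2, max_loss_scale=16.0
)
scaler.update(True)
scaler.update(True)
scale_after_growth = scaler.scale
applied = scaler.update(False)
scale_after_overflow = scaler.scale
print(scale_after_growth, applied, scale_after_overflow)
```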
Reproducibility
#
- class cerebras.modelzoo.trainer.callbacks.Reproducibility(seed=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
A callback that facilitates reproducibility.
- Parameters
seed (Optional[int]) – If provided, sets the torch seed.
SchedulersCallback
#
- class cerebras.modelzoo.trainer.callbacks.SchedulersCallback(schedulers=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback that sets up all the schedulers for the Trainer.
- Parameters
schedulers (Union[Callable[[cerebras.pytorch.optim.optimizer.Optimizer], cerebras.pytorch.optim.scheduler.Scheduler], cerebras.pytorch.optim.scheduler.Scheduler, None, List[Optional[Union[Callable[[cerebras.pytorch.optim.optimizer.Optimizer], cerebras.pytorch.optim.scheduler.Scheduler], cerebras.pytorch.optim.scheduler.Scheduler]]]]) – The set of optimizer schedulers to be used. Common schedulers include LR schedulers. It must be a list of these items:
- If a cstorch.optim.scheduler.Scheduler is passed, it is used as is.
- A callable that is assumed to be a function that takes in a cstorch.optim.Optimizer and returns a cstorch.optim.scheduler.Scheduler.
- If None, there is no optimizer param group scheduling.
SparsityCallback
#
- class cerebras.modelzoo.trainer.callbacks.SparsityCallback(sparsity=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.CoreCallback
Callback class that applies sparsity to the model and optimizer.
- Parameters
sparsity (Optional[cerebras.pytorch.sparse.base.SparsityAlgorithm]) – Sparsity algorithm instance.
Add-on Callbacks#
A set of optional callbacks that can be used to enhance the Trainer.
CoreCallback
#
- class cerebras.modelzoo.trainer.callbacks.CoreCallback[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
A special type of callback that indicates to the trainer that it is a core callback. Core callbacks are used internally by the trainer and should not be removed or replaced by the user.
Note: User-defined callbacks should not subclass CoreCallback
LoadCheckpointStates
#
- class cerebras.modelzoo.trainer.callbacks.LoadCheckpointStates(load_checkpoint_states='all')[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback to load specific states of the model from the checkpoint.
- Parameters
load_checkpoint_states (Union[str, List[str]]) – The list of state names to load from the checkpoint.
KeepNCheckpoints
#
- class cerebras.modelzoo.trainer.callbacks.KeepNCheckpoints(n=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback to regulate the maximum number of checkpoints retained.
- Parameters
n (Optional[int]) – Number of checkpoint files to keep. If the number of checkpoint files saved exceeds this number, checkpoint files are deleted starting with the oldest one. Does not affect checkpoints taken from previous runs. If n is None, no checkpoints are deleted.
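The retention policy described above can be sketched with the on_after_save_checkpoint hook. The class KeepNewestSketch is a hypothetical stand-in rather than the KeepNCheckpoints implementation. It tracks only the paths it has seen in the current run, which is one way to leave checkpoints from previous runs untouched.

```python
import tempfile
from pathlib import Path


class KeepNewestSketch:
    def __init__(self, n=None):
        self.n = n
        self.saved = []  # checkpoints saved during this run, oldest first

    def on_after_save_checkpoint(self, trainer, ckpt_path):
        self.saved.append(Path(ckpt_path))
        if self.n is None:
            return  # no deletion policy
        while len(self.saved) > self.n:
            # Delete starting with the oldest checkpoint.
            self.saved.pop(0).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as ckpt_dir:
    keeper = KeepNewestSketch(n=2)
    for step in (100, 200, 300):
        path = Path(ckpt_dir) / f"checkpoint_{step}.mdl"
        path.touch()  # simulate the trainer saving a checkpoint
        keeper.on_after_save_checkpoint(None, str(path))
    remaining = sorted(p.name for p in Path(ckpt_dir).iterdir())

print(remaining)
```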
SaveCheckpointState
#
- class cerebras.modelzoo.trainer.callbacks.SaveCheckpointState(k, checkpoint_states='model', checkpoint_name='{checkpoint_states}_{ckpt_name}')[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback to save an alternative checkpoint file that contains a subset of states and is not affected by deletion policies.
- Parameters
k (int) – Cadence at which the alternative checkpoint is saved. Specifies how many checkpoints must be saved before an alternative checkpoint is saved. For example, if a full checkpoint is taken every 100 steps and k=5, then an alternative checkpoint is saved every 500 steps.
checkpoint_states (Union[str, List[str]]) – List of valid checkpoint states to save. Can be a single state or list of states or ‘all’ (all states).
checkpoint_name (str) – Prefix to add to the alternative checkpoint file name. The name will be formatted with the following keys:
- checkpoint_states: _ separated list of checkpoint states
- ckpt_name: original checkpoint file name
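The cadence and naming rules above can be sketched as follows. Both helper functions are hypothetical; only the default template and the meaning of k come from the parameter descriptions.

```python
def alt_checkpoint_name(
    ckpt_name,
    checkpoint_states="model",
    checkpoint_name="{checkpoint_states}_{ckpt_name}",
):
    # Accept a single state or a list of states.
    if isinstance(checkpoint_states, str):
        states = [checkpoint_states]
    else:
        states = list(checkpoint_states)
    return checkpoint_name.format(
        checkpoint_states="_".join(states), ckpt_name=ckpt_name
    )


def is_alt_checkpoint(num_saved, k):
    # Every k-th full checkpoint also produces an alternative checkpoint.
    return num_saved % k == 0


# Full checkpoints every 100 steps with k=5: alternatives at steps 500 and 1000.
alt_steps = [
    step
    for num_saved, step in enumerate(range(100, 1100, 100), start=1)
    if is_alt_checkpoint(num_saved, 5)
]
print(alt_steps)
print(alt_checkpoint_name("checkpoint_500.mdl"))
print(alt_checkpoint_name("checkpoint_500.mdl", ["model", "optimizer"]))
```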
Lora
#
- class cerebras.modelzoo.trainer.callbacks.Lora(lora_params)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that handles lorafying the model.
- Parameters
lora_params (Union[dict, List[dict], cerebras.modelzoo.common.utils.model.lora.LoraConfig, List[cerebras.modelzoo.common.utils.model.lora.LoraConfig]]) – The parameters to configure LoRA.
LogInputSummaries
#
- class cerebras.modelzoo.trainer.callbacks.LogInputSummaries[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that logs the batches produced by the dataloader.
LogOptimizerParamGroup
#
- class cerebras.modelzoo.trainer.callbacks.LogOptimizerParamGroup(keys)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Logs specific param group keys the optimizer used in the most recent step.
- Parameters
keys (Union[str, Iterable[str]]) – A string or an iterable of strings representing the keys in the param group to log.
LogSparsity
#
- class cerebras.modelzoo.trainer.callbacks.LogSparsity[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Log target and actual sparsity levels.
WeightCompression
#
- class cerebras.modelzoo.trainer.callbacks.WeightCompression(compressions)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class to apply weight compression to the model.
- Parameters
compressions (Union[dict, List[dict]]) – Compression configuration to apply to the model.
GlobalFlags
#
- class cerebras.modelzoo.trainer.callbacks.GlobalFlags(**flags)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback to set global perf/debug flags with no scoping.
This has side effects on all runs. To scope to a given run, use scoped flags instead.
- Parameters
flags – Dictionary of debug/performance flags to set. The keys must be the full path to the flag after cstorch.backends, e.g. “csx.debug.debug_args”.
ScopedTrainFlags
#
- class cerebras.modelzoo.trainer.callbacks.ScopedTrainFlags(**flags)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.flags._ScopedFlags
Callback to set global perf/debug flags within the training scope.
The overwritten flags are restored after training is complete.
- Parameters
flags – Dictionary of debug/performance flags to set. The keys must be the full path to the flag after cstorch.backends, e.g. “csx.debug.debug_args”.
ScopedValidateFlags
#
- class cerebras.modelzoo.trainer.callbacks.ScopedValidateFlags(**flags)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.flags._ScopedFlags
Callback to set global perf/debug flags within the validation scope.
The overwritten flags are restored after validation is complete.
- Parameters
flags – Dictionary of debug/performance flags to set. The keys must be the full path to the flag after cstorch.backends, e.g. “csx.debug.debug_args”.
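The difference between global and scoped flags comes down to whether the overwritten values are restored. The sketch below shows the set-then-restore pattern on a plain namespace object that stands in for cstorch.backends. The scoped_flags helper and the flag csx.debug.verbose are hypothetical.

```python
from contextlib import contextmanager
from types import SimpleNamespace


def _resolve(root, dotted_key):
    # Walk "a.b.c" down to the object owning attribute "c".
    *parents, leaf = dotted_key.split(".")
    owner = root
    for part in parents:
        owner = getattr(owner, part)
    return owner, leaf


@contextmanager
def scoped_flags(root, **flags):
    saved = {}
    try:
        for key, value in flags.items():
            owner, leaf = _resolve(root, key)
            saved[key] = getattr(owner, leaf)
            setattr(owner, leaf, value)
        yield root
    finally:
        # Restore the overwritten values when the scope ends.
        for key, old_value in saved.items():
            owner, leaf = _resolve(root, key)
            setattr(owner, leaf, old_value)


# Stand-in for cstorch.backends with a hypothetical flag.
backends = SimpleNamespace(csx=SimpleNamespace(debug=SimpleNamespace(verbose=False)))

# Dotted keys are not valid identifiers, so they are passed via ** unpacking.
with scoped_flags(backends, **{"csx.debug.verbose": True}):
    inside = backends.csx.debug.verbose
after = backends.csx.debug.verbose
print(inside, after)
```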
DebugArgsPath
#
- class cerebras.modelzoo.trainer.callbacks.DebugArgsPath(debug_args_path)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback to load debug args from a file.
- Parameters
debug_args_path (str) – Path to the debug args file.
CheckLoss
#
- class cerebras.modelzoo.trainer.callbacks.CheckLoss[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that checks for NaN or inf loss values.
It also checks whether the model output contains a scalar loss value.
- check_loss(loss)[source]#
Checks for NaN or inf loss values.
- Parameters
loss (torch.Tensor) – Scalar loss tensor.
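A minimal sketch of the kind of check described above, written against plain floats so that it runs without torch. The function name and error messages are hypothetical; float(loss) also accepts a single-element torch.Tensor.

```python
import math


def check_loss_sketch(loss):
    # Convert a scalar (float or single-element tensor) to a Python float.
    value = float(loss)
    if math.isnan(value):
        raise ValueError("NaN loss detected.")
    if math.isinf(value):
        raise ValueError("inf loss detected.")
    return value


print(check_loss_sketch(0.25))
```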
ModelZooParamsMetadata
#
- class cerebras.modelzoo.trainer.callbacks.ModelZooParamsMetadata(params=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that stores the model zoo parameters in the checkpoint metadata.
- Parameters
params (Optional[dict]) – Model zoo parameters.
ModelEvalMetrics
#
- class cerebras.modelzoo.trainer.callbacks.ModelEvalMetrics[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that logs all metrics attached to the model.
DumpAvailableTensorNames
#
SummaryTensorListener
#
- class cerebras.modelzoo.trainer.callbacks.SummaryTensorListener(listener_name, tensor_names)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.listener._ListenerCallback
Tensor listener that summarizes every tensor.
Constructs named tensor listener.
- Parameters
listener_name (str) – A listener name to be used in the summarized tensor name.
tensor_names (Union[str, List[str]]) – A tensor name or a list of tensor names to be captured. Glob patterns are also supported for matching groups of tensors. See https://docs.python.org/3/library/fnmatch.html for more details.
NormTensorListener
#
- class cerebras.modelzoo.trainer.callbacks.NormTensorListener(listener_name, tensor_names)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.listener._ListenerCallback
Tensor listener that computes tensor norms.
Constructs named tensor listener.
- Parameters
listener_name (str) – A listener name to be used in the summarized tensor name.
tensor_names (Union[str, List[str]]) – A tensor name or a list of tensor names to be captured. Glob patterns are also supported for matching groups of tensors. See https://docs.python.org/3/library/fnmatch.html for more details.
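Glob matching of tensor names with the fnmatch module can be sketched as follows. The helper select_tensors and the tensor names are hypothetical. fnmatchcase is used here for case-sensitive matching on every platform, which may differ from what the listeners do internally.

```python
from fnmatch import fnmatchcase


def select_tensors(available, tensor_names):
    # Accept a single pattern or a list of patterns.
    if isinstance(tensor_names, str):
        patterns = [tensor_names]
    else:
        patterns = list(tensor_names)
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


available = ["model.fc1.weight", "model.fc1.bias", "model.fc2.weight"]
weights = select_tensors(available, "model.*.weight")
fc1 = select_tensors(available, ["model.fc1.*"])
print(weights)
print(fc1)
```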
ComputeNorm
#
- class cerebras.modelzoo.trainer.callbacks.ComputeNorm[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that computes the model-wise and per-layer norms of the parameters.
DumpActivations
#
- class cerebras.modelzoo.trainer.callbacks.DumpActivations(outdir=None, buffer_steps=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback to dump activations for CPU/GPU runs.
- Parameters
outdir (Optional[str]) – The output directory in which to dump the activations.
buffer_steps (Optional[int]) – If given, flush to a new .npz file after this many steps.
Profiler
#
- class cerebras.modelzoo.trainer.callbacks.Profiler[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Base class for all Profiler callbacks.
- property perf_metrics: dict#
Returns the performance metrics collected by the profiler.
RateProfiler
#
- class cerebras.modelzoo.trainer.callbacks.RateProfiler[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.profiler.Profiler
Callback that tracks the rate of samples processed by the model measured by the client.
Sets up the rate tracker.
- property rate: float#
Smoothed samples/second of all the samples added since last queried.
This value is cached and recomputed only when the count is updated.
- property global_rate: float#
Non-smoothed samples/second since the rate tracker was initialized.
This value is cached and recomputed only when the count is updated.
- property elapsed_seconds: float#
Time (seconds) elapsed since the last reset.
This value is cached and recomputed only when the count is updated.
- property total_count: int#
Total number of samples processed since the last reset.
- property perf_metrics#
OpProfiler
#
- class cerebras.modelzoo.trainer.callbacks.OpProfiler(start_step=-1, end_step=-1, host_activities=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.profiler.Profiler
Callback class that profiles the model using the Cerebras Profiler.
- Parameters
start_step (int) – Start step for profiling.
end_step (int) – End step for profiling.
host_activities (Optional[List[str]]) – List of ACT/WGT/CSX numbers to profile.
SavePerformanceData
#
- class cerebras.modelzoo.trainer.callbacks.SavePerformanceData[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback that saves the performance metrics collected by all Profiler callbacks.
FlopUtilization
#
- class cerebras.modelzoo.trainer.callbacks.FlopUtilization[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.profiler.Profiler
Callback that computes the FLOP utilization of the model.
Initializes the FLOP utilization tracker.
- property flop_utilization: Optional[float]#
Returns the FLOP utilization of the model.
- property perf_metrics: dict#
SelectiveGrad#
- class cerebras.modelzoo.trainer.callbacks.SelectiveGrad(selective_grads)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback class that selectively applies gradient computation.
Constructs a SelectiveGrad instance.
- Parameters
selective_grads (Union[dict, List[dict]]) – Configuration for selective gradient computation. It may be initialized with a configuration dict or list of dicts.
CountParams#
- class cerebras.modelzoo.trainer.callbacks.CountParams(search_and_replace=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback that runs on model setup for counting the number of parameters in a network.
Along with printing the total number of parameters, it also prints out a table which shows the relative contribution (%) that each parameter has to the total count. Additionally, parameters can be grouped together to better see the relative contributions.
For example, the following groups parameters across layers together using regex-style search & replace:
callbacks:
  - CountParams:
      search_and_replace: [['.layers.\d+.', '.grouped_layers.']]
╒═══════════════════════════════════════════════════════════════════════════════╤══════════════╤═══════╕ │ Modules │ Parameters │ % │ ╞═══════════════════════════════════════════════════════════════════════════════╪══════════════╪═══════╡ │ model.embedding_layer.word_embeddings.weight │ 6,432,896 │ 93.96 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.embedding_layer.position_embeddings.embed.weight │ 16,384 │ 0.24 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.ln_f.weight │ 128 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.ln_f.bias │ 128 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_q_dense_layer.weight │ 32,768 │ 0.48 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_q_dense_layer.bias │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_k_dense_layer.weight │ 32,768 │ 0.48 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_k_dense_layer.bias │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_v_dense_layer.weight │ 32,768 │ 0.48 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_v_dense_layer.bias │ 256 │ 0.00 │ 
├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_output_dense_layer.weight │ 32,768 │ 0.48 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.self_attn.proj_output_dense_layer.bias │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.norm1.weight │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.norm1.bias │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.norm3.weight │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.norm3.bias │ 256 │ 0.00 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.ffn.ffn.0.linear_layer.weight │ 131,072 │ 1.91 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.ffn.ffn.0.linear_layer.bias │ 1,024 │ 0.01 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.ffn.ffn.1.linear_layer.weight │ 131,072 │ 1.91 │ ├───────────────────────────────────────────────────────────────────────────────┼──────────────┼───────┤ │ model.transformer_decoder.all_layers.ffn.ffn.1.linear_layer.bias │ 256 │ 0.00 │ ╘═══════════════════════════════════════════════════════════════════════════════╧══════════════╧═══════╛
- Parameters
search_and_replace (Optional[List[Tuple[str, str]]]) – An optional list of search & replace pairs to apply to parameter names. Each pair is a tuple containing a regex string for searching and a corresponding replacement string. For example, you can "group" parameters together across layers by using .layers.\d+. for search and replacing with "grouped_layers".
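The grouping behavior described above can be sketched in plain Python. This is an illustration under stated assumptions, not the CountParams source: the helper name group_param_counts is hypothetical, the parameter names and sizes are made up, and the sketch assumes each pair is applied with re.sub semantics. The dots in the pattern are escaped here so that they match literal dots.

```python
import re


def group_param_counts(param_counts, search_and_replace):
    """Rename parameters with each (pattern, replacement) pair, then sum
    the counts of parameters that end up sharing a name.

    Returns a dict mapping name -> (count, percent of total).
    """
    grouped = {}
    for name, count in param_counts.items():
        for pattern, replacement in search_and_replace:
            name = re.sub(pattern, replacement, name)
        grouped[name] = grouped.get(name, 0) + count

    total = sum(grouped.values())
    return {
        name: (count, 100.0 * count / total)
        for name, count in grouped.items()
    }


# Made-up parameter names and sizes, for illustration only.
counts = {
    "model.layers.0.attn.weight": 300,
    "model.layers.1.attn.weight": 300,
    "model.embed.weight": 400,
}

table = group_param_counts(
    counts, [(r"\.layers\.\d+\.", ".grouped_layers.")]
)

for name, (count, percent) in table.items():
    print(f"{name:<40} {count:>8,d} {percent:6.2f}")
```

Both per-layer attention weights collapse into a single model.grouped_layers.attn.weight row whose count is the sum across layers, which is how the table above ends up with one row per parameter kind instead of one row per layer.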
EmailNotification#
- class cerebras.modelzoo.trainer.callbacks.EmailNotification(mailto, notification_endpoint=None)[source]#
Bases:
cerebras.modelzoo.trainer.callbacks.callback.Callback
Callback for sending email notifications on certain events.
Currently, the notification system requires an external notification server, which actually sends the emails, to be up and running. To set up this server, contact support@cerebras.net.
Constructs an EmailNotification callback.
- Parameters
mailto (Union[str, List[str]]) – email address(es) to send notifications to.
notification_endpoint (Optional[str]) – A notification server that listens for requests which it then forwards to the recipients. If provided, this endpoint is used. Otherwise, its value is read from the CEREBRAS_NOTIFICATION_ENDPOINT environment variable.
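The precedence between the notification_endpoint argument and the environment variable can be sketched as follows. The helper names resolve_notification_endpoint and normalize_mailto are hypothetical, the URLs and addresses are placeholders, and what the real callback does when neither source is set is not documented here, so the sketch simply returns None in that case.

```python
import os


def resolve_notification_endpoint(notification_endpoint=None, environ=None):
    """Hypothetical helper: an explicit argument wins, otherwise fall back
    to the CEREBRAS_NOTIFICATION_ENDPOINT environment variable."""
    environ = os.environ if environ is None else environ
    if notification_endpoint is not None:
        return notification_endpoint
    # Assumption: None signals that no endpoint is configured.
    return environ.get("CEREBRAS_NOTIFICATION_ENDPOINT")


def normalize_mailto(mailto):
    """Hypothetical helper: accept a single address or a list of addresses."""
    return [mailto] if isinstance(mailto, str) else list(mailto)


# Placeholder values for illustration only.
fake_env = {"CEREBRAS_NOTIFICATION_ENDPOINT": "http://env.example.invalid"}

explicit = resolve_notification_endpoint("http://arg.example.invalid", fake_env)
from_env = resolve_notification_endpoint(None, fake_env)
missing = resolve_notification_endpoint(None, {})

single = normalize_mailto("a@example.invalid")
several = normalize_mailto(["a@example.invalid", "b@example.invalid"])
```

Passing the environment mapping as a parameter keeps the sketch easy to exercise without modifying the process environment.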
Utility Functions#
- cerebras.modelzoo.trainer.callbacks.register_global_callback(callback)[source]#
Register a global callback.
- Parameters
callback – the Callback to register. If a class is passed, an instance of the class is created. If an instance is passed, it is registered as is.
- Returns
A torch.utils.hooks.RemovableHandle object.