Cerebras Model Zoo Callbacks#

This module contains the base Callback class, the core callbacks directly invoked by the Trainer, and other optional callbacks that can be used to extend the Trainer's functionality.

class cerebras.modelzoo.trainer.callbacks.Callback[source]#

Base class for all callbacks.

pre_setup(trainer)[source]#

Called before the trainer setup.

Parameters

trainer (Trainer) – Trainer instance.

setup(trainer)[source]#

Setup the callback using the trainer.

Parameters

trainer (Trainer) – Trainer instance.

finalize()[source]#

Clean up the callback.

This method is called when the trainer is destructed.

on_enter_fit(trainer, stack, train_dataloader, val_dataloader, loop)[source]#

Hook that allows arbitrary context managers to be entered at the beginning of the fit method.

Parameters
  • trainer (Trainer) – Trainer instance.

  • stack (ExitStack) – Exit stack into which context managers can be entered.

  • train_dataloader – Train dataloader.

  • val_dataloader – Validation dataloader.

  • loop (TrainingLoop) – TrainingLoop object.

on_fit_start(trainer, train_dataloader, val_dataloader, loop)[source]#

Called at the beginning of the fit method.

Parameters
  • trainer (Trainer) – Trainer instance.

  • train_dataloader – Train dataloader.

  • val_dataloader – Validation dataloader.

  • loop (TrainingLoop) – TrainingLoop object.

on_fit_end(trainer, loop)[source]#

Called at the end of the fit method.

Parameters
  • trainer (Trainer) – Trainer instance.

  • loop (TrainingLoop) – TrainingLoop object.

on_fit_exception(trainer, exception)[source]#

Called if an exception is raised during fit.

Parameters
  • trainer (Trainer) – Trainer instance.

  • exception (Exception) – Exception object.

on_enter_train(trainer, stack, train_dataloader, loop, loop_idx)[source]#

Hook that allows arbitrary context managers to be entered at the beginning of every training iteration.

Parameters
  • trainer (Trainer) – Trainer instance.

  • stack (ExitStack) – Exit stack into which context managers can be entered.

  • train_dataloader – Train dataloader.

  • loop (TrainingLoop) – TrainingLoop object.

  • loop_idx (int) – Training loop index.

on_train_start(trainer, model, train_dataloader, loop, loop_idx)[source]#

Called at the beginning of the train loop.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • train_dataloader – Train dataloader.

  • loop (TrainingLoop) – TrainingLoop object.

  • loop_idx (int) – Training loop index.

on_train_end(trainer, model, loop, loop_idx)[source]#

Called at the end of the train loop.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • loop (TrainingLoop) – TrainingLoop object.

  • loop_idx (int) – training loop index.

on_train_exception(trainer, exception)[source]#

Called if an exception is raised during a training iteration.

Parameters
  • trainer (Trainer) – Trainer instance.

  • exception (Exception) – Exception object.

on_train_batch_start(trainer, model, batch, batch_idx)[source]#

Called at the beginning of every training iteration.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • batch (Any) – Batch data.

  • batch_idx (int) – Batch index.

on_train_batch_end(trainer, model, outputs, batch, batch_idx)[source]#

Called at the end of every training iteration.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • outputs (Dict[str, Any]) – Model outputs.

  • batch (Any) – Batch data.

  • batch_idx (int) – Batch index.

run_validation(trainer, loop_idx, is_last)[source]#

Perform a validation run.

Override this method to perform a custom validation run.

Parameters
  • trainer (Trainer) – Trainer instance.

  • loop_idx (int) – Training loop index.

  • is_last (bool) – Whether the last training iteration just happened.

on_enter_validate(trainer, stack, val_dataloader, loop)[source]#

Hook that allows arbitrary context managers to be entered at the beginning of every validation run.

Parameters
  • trainer (Trainer) – Trainer instance.

  • stack (ExitStack) – Exit stack into which context managers can be entered.

  • val_dataloader – Validation dataloader.

  • loop (ValidationLoop) – ValidationLoop object.

on_validate_start(trainer, model, val_dataloader, loop)[source]#

Called at the beginning of the validation loop.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • val_dataloader – Validation dataloader.

  • loop (ValidationLoop) – ValidationLoop object.

on_validate_end(trainer, model, loop)[source]#

Called at the end of the validation loop.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • loop (ValidationLoop) – ValidationLoop object.

on_validate_exception(trainer, exception)[source]#

Called if an exception is raised during validation.

Parameters
  • trainer (Trainer) – Trainer instance.

  • exception (Exception) – Exception object.

on_validate_batch_start(trainer, model, batch, batch_idx)[source]#

Called at the beginning of every validation iteration.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • batch (Any) – Batch data.

  • batch_idx (int) – Batch index.

on_validate_batch_end(trainer, model, outputs, batch, batch_idx)[source]#

Called at the end of every validation iteration.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • outputs (Dict[str, Any]) – Model outputs.

  • batch (Any) – Batch data.

  • batch_idx (int) – Batch index.

on_enter_validate_all(trainer, stack, val_dataloaders, loop)[source]#

Hook that allows arbitrary context managers to be entered at the beginning of every validate all run.

Parameters
  • trainer (Trainer) – Trainer instance.

  • stack (ExitStack) – Exit stack into which context managers can be entered.

  • val_dataloaders – List of validation dataloaders.

  • loop (ValidationLoop) – ValidationLoop object.

on_before_forward(trainer, model, batch, args, kwargs)[source]#

Called before the forward pass.

The args list and kwargs dict may be appended to in order to provide additional arguments to the forward method.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • batch (Any) – Batch data.

  • args (List[Any]) – Forward pass arguments.

  • kwargs (dict) – Forward pass keyword arguments.

on_after_forward(trainer, model, outputs, batch)[source]#

Called after the forward pass.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • outputs (Dict[str, Any]) – Model outputs.

  • batch (Any) – Batch data.

on_before_backward(trainer, model, outputs)[source]#

Called before the backward pass.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • outputs (Dict[str, Any]) – Model outputs.

on_after_backward(trainer, model, outputs)[source]#

Called after the backward pass.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • outputs (Dict[str, Any]) – Model outputs.

on_before_optimizer_step(trainer, model, optimizer)[source]#

Called before the optimizer step.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – Optimizer instance.

on_after_optimizer_step(trainer, model, optimizer)[source]#

Called after the optimizer step.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – Optimizer instance.

on_before_optimizer_zero_grad(trainer, model, optimizer)[source]#

Called before the optimizer zero_grad.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – Optimizer instance.

on_after_optimizer_zero_grad(trainer, model, optimizer)[source]#

Called after the optimizer zero_grad.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – Optimizer instance.

on_before_scheduler_step(trainer, model, optimizer, scheduler)[source]#

Called before the scheduler step.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – Optimizer instance.

  • scheduler (cerebras.pytorch.optim.scheduler.Scheduler) – Scheduler instance.

on_after_scheduler_step(trainer, model, optimizer, scheduler)[source]#

Called after the scheduler step.

Parameters
  • trainer (Trainer) – Trainer instance.

  • model (torch.nn.Module) – Model instance.

  • optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – Optimizer instance.

  • scheduler (cerebras.pytorch.optim.scheduler.Scheduler) – Scheduler instance.

on_save_checkpoint(trainer, state_dict)[source]#

Called before saving the checkpoint.

Callbacks should override this method to add states to the checkpoint.

Parameters
  • trainer (Trainer) – Trainer instance.

  • state_dict (dict) – Trainer state dictionary.

postprocess_checkpoint(trainer, state_dict)[source]#

Called after constructing the checkpoint.

Callbacks should override this method to modify the checkpoint before saving.

Parameters
  • trainer (Trainer) – Trainer instance.

  • state_dict (dict) – Trainer state dictionary.

on_after_save_checkpoint(trainer, ckpt_path)[source]#

Called after saving the checkpoint.

Parameters
  • trainer (Trainer) – Trainer instance.

  • ckpt_path (str) – Checkpoint path.

on_before_load_checkpoint(trainer, ckpt_path)[source]#

Called before loading the checkpoint.

Parameters
  • trainer (Trainer) – Trainer instance.

  • ckpt_path (str) – Checkpoint path.

preprocess_checkpoint(trainer, state_dict)[source]#

Called after loading the checkpoint.

Callbacks should override this method to modify the state_dict after loading.

Parameters
  • trainer (Trainer) – Trainer instance.

  • state_dict (dict) – Trainer state dictionary.

on_load_checkpoint(trainer, state_dict)[source]#

Called after loading the checkpoint.

Callbacks should override this method to load states from the checkpoint.

Parameters
  • trainer (Trainer) – Trainer instance.

  • state_dict (dict) – Trainer state dictionary.
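
As an illustrative sketch (not part of the API), a user-defined callback overrides only the hooks it needs and can persist its state through the checkpoint hooks. The LossHistory name and the "loss" output key below are assumptions for illustration:

    from cerebras.modelzoo.trainer.callbacks import Callback

    class LossHistory(Callback):
        """Hypothetical callback that records the loss after every training batch."""

        def __init__(self):
            self.losses = []

        def on_train_batch_end(self, trainer, model, outputs, batch, batch_idx):
            # `outputs` is the dict of model outputs; we assume it holds a "loss" entry.
            if "loss" in outputs:
                self.losses.append(outputs["loss"])

        def on_save_checkpoint(self, trainer, state_dict):
            # Add this callback's state to the trainer checkpoint.
            state_dict["loss_history"] = self.losses

        def on_load_checkpoint(self, trainer, state_dict):
            # Restore this callback's state from the checkpoint.
            self.losses = state_dict.get("loss_history", [])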

class cerebras.modelzoo.trainer.callbacks.ValidationCallback[source]#

A special type of callback that indicates to the trainer that it will perform some custom validation logic.

This is useful for callbacks that need to perform downstream validation logic that is not covered by the default validation loop.

All ValidationCallbacks must implement the following methods:

  • run_validation

Essentially, you are telling the trainer what to run at the end of each training run.

abstract run_validation(trainer, loop_idx, is_last)[source]#

Perform a custom validation run. See Callback.run_validation for the parameter descriptions.
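
For example, a minimal sketch of a downstream validation callback, assuming only the hook signature above; the class name and evaluation body are placeholders:

    from cerebras.modelzoo.trainer.callbacks import ValidationCallback

    class DownstreamEval(ValidationCallback):
        """Hypothetical callback that runs a custom downstream evaluation."""

        def run_validation(self, trainer, loop_idx, is_last):
            # Run the (potentially expensive) downstream evaluation only once,
            # after the final training iteration.
            if not is_last:
                return
            print(f"Running downstream eval after training loop {loop_idx}")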

Core Callbacks#

The set of callbacks that implement core behaviour inside the Trainer.

ArtifactDirCallback#

class cerebras.modelzoo.trainer.callbacks.ArtifactDirCallback[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Sets up the artifact directory and writes metadata about the run to the executor artifact directory.

loop#

The loop object from which to extract metadata.

BackendCallback#

class cerebras.modelzoo.trainer.callbacks.BackendCallback(backend, device)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback to set the backend for the trainer.

Parameters
  • backend (cstorch.Backend or None) – The backend object to be used for the trainer. If None, the device argument must be provided. If both are provided, an error is raised.

  • device (str or None) – The device type to be used for the trainer. If None, the backend argument must be provided.

Checkpoint#

class cerebras.modelzoo.trainer.callbacks.Checkpoint(steps=None, autoload_last_checkpoint=True, disable_strict_checkpoint_loading=False, save_initial_checkpoint=False, checkpoint_name='checkpoint_{step}.mdl')[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

A callback that handles standard checkpointing logic.

Parameters
  • steps (Optional[int]) – The frequency at which to save a checkpoint. If None, no checkpoints will be saved. Defaults to None.

  • autoload_last_checkpoint (bool) – Whether to autoload the last checkpoint in the model directory. Defaults to True.

  • disable_strict_checkpoint_loading (bool) – Whether to disable strict checkpoint loading. If True, the model will not raise an error if the checkpoint contains keys that are not present in the model. Defaults to False.

  • save_initial_checkpoint (bool) – Whether to save the initial checkpoint at the start of training. Defaults to False.

  • checkpoint_name (str) – The unformatted name of the checkpoint file. The string will be formatted with the following keys: step

static check_compatibility(state_dict)[source]#

Checks that the checkpoint is compatible with the current version of modelzoo.

get_checkpoint_path(ckpt_dir, step)[source]#

Construct a path to the checkpoint file.

If a checkpoint already exists inside the given checkpoint directory at the given step, append a timestamp to the filename.

Parameters
  • ckpt_dir (str) – The directory where the checkpoint will be saved.

  • step (int) – The step at which the checkpoint is saved.

Returns

A path to which the checkpoint can be saved.

Return type

pathlib.Path

get_latest_checkpoint(trainer)[source]#

Return the path to the latest checkpoint.

get_all_checkpoints(model_dir)[source]#

Return the path to all available checkpoints.

Parameters

model_dir (str) – The directory where the checkpoints are located.
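
A usage sketch based on the parameters above, assuming the callback is then handed to the Trainer alongside the other callbacks:

    from cerebras.modelzoo.trainer.callbacks import Checkpoint

    # Save a checkpoint every 1000 steps, plus one before training starts.
    # "{step}" in checkpoint_name is filled in at save time.
    checkpoint = Checkpoint(
        steps=1000,
        save_initial_checkpoint=True,
        checkpoint_name="checkpoint_{step}.mdl",
    )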

DataLoaderCallback#

class cerebras.modelzoo.trainer.callbacks.DataLoaderCallback[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback class that handles saving and loading dataloader state to the checkpoint.

dataloader#

The training dataloader object to save to the checkpoint.

GradientAccumulationCallback#

class cerebras.modelzoo.trainer.callbacks.GradientAccumulationCallback[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback class to accumulate gradients.

grad_accum_steps#

The number of steps to accumulate gradients for before stepping the optimizer.

should_run_optimizer_step#

If True, run the optimizer step in the current step.

LoopCallback#

class cerebras.modelzoo.trainer.callbacks.LoopCallback[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback, abc.ABC

Base class for all loop callbacks.

This class should not be instantiated directly. Only subclasses of LoopCallback should be used.

The loop callback owns the global step and is responsible for incrementing it after each training step.

TrainingLoop#

class cerebras.modelzoo.trainer.callbacks.TrainingLoop(num_steps=None, max_steps=None, num_epochs=None, steps_per_epoch=None, eval_frequency=1.0, eval_steps=None, grad_accum_steps=1)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.loop.LoopCallback

Callback class that manages the training loop.

Parameters
  • num_steps (Optional[int]) – The total number of training steps to perform. This will take precedence over max_steps.

  • max_steps (Optional[int]) – The maximum number of training steps to perform. max_steps if provided will take the global step into account. That is, providing max_steps is equivalent to setting num_steps = max_steps - global_step.

  • num_epochs (Optional[int]) – The number of epochs to train for. This argument is mutually exclusive with num_steps.

  • steps_per_epoch (Optional[int]) – Number of steps to train for in each epoch.

  • eval_frequency (Optional[Union[int, float]]) –

    Frequency of evaluation during training. It can be:

    • a positive integer, which specifies the number of training steps between evaluations.

    • a float in the range [0.0, 1.0], which specifies the fraction of training steps between evaluations. For example, if eval_frequency=0.5, evaluation is performed once after half of the training steps have completed and once more at the end of training.

    • None or zero, in which case no evaluation is performed during training.

  • eval_steps (Optional[int]) – The number of validation steps to perform.

  • grad_accum_steps (int) – Number of steps to accumulate gradients before performing an optimizer step. This is only relevant for CPU/GPU runs.
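
For instance, a loop configuration sketch with illustrative values:

    from cerebras.modelzoo.trainer.callbacks import TrainingLoop

    # Train for 10,000 steps, running validation after every 10% of training
    # (i.e. every 1,000 steps), with 500 validation steps per evaluation.
    loop = TrainingLoop(
        num_steps=10_000,
        eval_frequency=0.1,
        eval_steps=500,
    )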

ValidationLoop#

class cerebras.modelzoo.trainer.callbacks.ValidationLoop(eval_steps=None, hook='validate')[source]#

Bases: cerebras.modelzoo.trainer.callbacks.loop.LoopCallback

Callback class that manages the validation loop.

Parameters
  • eval_steps (Optional[int]) – The number of validation steps to perform.

  • hook – The base name of the validation hooks to run. Default: “validate”.

property eval_steps: int#

Returns the number of validation steps to perform.

Logging#

class cerebras.modelzoo.trainer.callbacks.Logging(log_steps, log_level='INFO', wsc_log_level=None, enable_act_frequency=False)#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback, cerebras.modelzoo.trainer.callbacks.logging.ClassLogger

Callback that handles setting up the Trainer’s logger as well as facilitates the cadence of logging.

Parameters
  • log_steps (int) – Number of steps after which to log.

  • log_level (str) – Logging level for the Python logger.

  • wsc_log_level (Optional[dict]) – Specifies the logging level for particular Wafer-Scale Cluster servers or tasks.

  • enable_act_frequency (bool) – If True, set the activation steps to be the log steps.

setup_logging()#
setup_logging_excepthook()#
set_wsc_log_level()#
flush_logs()#
is_log_step()#
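
A configuration sketch; the values are illustrative:

    from cerebras.modelzoo.trainer.callbacks import Logging

    # Log every 100 steps and align activation frequency with the log cadence.
    logging_cb = Logging(
        log_steps=100,
        log_level="INFO",
        enable_act_frequency=True,
    )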

ModelCallback#

class cerebras.modelzoo.trainer.callbacks.ModelCallback(model)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback class that handles setting up and compiling the model.

Parameters

model (Union[Callable[[], torch.nn.Module], torch.nn.Module]) –

The model to train. It must be one of the following:

  • If a callable is passed, it is assumed to be a function that takes no arguments and returns a torch.nn.Module.

  • If a torch.nn.Module is passed, it is used as is.

OptimizerCallback#

class cerebras.modelzoo.trainer.callbacks.OptimizerCallback(optimizer=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback to setup the optimizer for the Trainer.

Parameters

optimizer (Optional[Union[cerebras.pytorch.optim.optimizer.Optimizer, Callable[[torch.nn.Module], cerebras.pytorch.optim.optimizer.Optimizer]]]) – Optimizer to be used for training. It can be an instance of cstorch.optim.Optimizer or a callable that takes a torch.nn.Module as input and returns an instance of cstorch.optim.Optimizer. If None, the optimizer will not be set up by this callback.

Precision#

class cerebras.modelzoo.trainer.callbacks.Precision[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback, abc.ABC

Base precision class for implementing custom backwards pass and optimization step to handle different precision types.

abstract autocast_context_manager()[source]#

Returns the context manager that performs autocasting for the forward pass.

abstract backward(loss)[source]#

Performs the backward pass.

Parameters

loss (torch.Tensor) – Loss tensor.

abstract clip_gradients(optimizer)[source]#

Clips the gradients before the optimization step.

Parameters

optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – The optimizer to step.

abstract optimizer_step(optimizer)[source]#

Performs the optimization step.

Parameters

optimizer (cerebras.pytorch.optim.optimizer.Optimizer) – The optimizer to step.

MixedPrecision#

class cerebras.modelzoo.trainer.callbacks.MixedPrecision(enabled=True, fp16_type='bfloat16', precision_opt_level=None, loss_scaling_factor=1.0, initial_loss_scale=None, steps_per_increase=2000, min_loss_scale=None, max_loss_scale=None, max_gradient_norm=None, max_gradient_value=None, log_loss_scale=False)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.precision.Precision

Callback class that handles mixed precision training.

Parameters
  • enabled (bool) – If True, enables mixed precision training.

  • fp16_type (Literal['float16', 'bfloat16', 'cbfloat16']) – Half precision type. One of “float16”, “bfloat16”, “cbfloat16”.

  • precision_opt_level (Optional[Literal[0, 1, 2]]) – Precision optimization level. If not None, sets the global precision optimization level.

  • loss_scaling_factor (Union[float, Literal['dynamic']]) – Initial loss scaling factor.

  • initial_loss_scale (Optional[float]) – Initial loss scale.

  • steps_per_increase (int) – Number of steps before increasing the loss scale.

  • min_loss_scale (Optional[float]) – Minimum loss scale.

  • max_loss_scale (Optional[float]) – Maximum loss scale.

  • max_gradient_norm (Optional[float]) – Maximum gradient norm for gradient clipping.

  • max_gradient_value (Optional[float]) – Maximum gradient value for gradient clipping.

  • log_loss_scale (bool) – If True, log the gradient scaler’s loss scale.

on_before_optimizer_step(trainer, model, optimizer)[source]#

Unscales the gradients and performs gradient clipping.
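
A configuration sketch; the specific values are illustrative, not recommendations:

    from cerebras.modelzoo.trainer.callbacks import MixedPrecision

    # float16 training with dynamic loss scaling and gradient-norm clipping.
    precision = MixedPrecision(
        fp16_type="float16",
        loss_scaling_factor="dynamic",
        max_gradient_norm=1.0,
    )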

Reproducibility#

class cerebras.modelzoo.trainer.callbacks.Reproducibility(seed=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

A callback that facilitates reproducibility.

Parameters

seed (Optional[int]) – If provided, sets the torch seed.

SchedulersCallback#

class cerebras.modelzoo.trainer.callbacks.SchedulersCallback(schedulers=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback that sets up all the schedulers for the Trainer.

Parameters

schedulers (Union[Callable[[cerebras.pytorch.optim.optimizer.Optimizer], cerebras.pytorch.optim.scheduler.Scheduler], cerebras.pytorch.optim.scheduler.Scheduler, None, List[Optional[Union[Callable[[cerebras.pytorch.optim.optimizer.Optimizer], cerebras.pytorch.optim.scheduler.Scheduler], cerebras.pytorch.optim.scheduler.Scheduler]]]]) –

The set of optimizer schedulers to be used. Common schedulers include LR schedulers. It must be a list of these items:

  • If a cstorch.optim.scheduler.Scheduler is passed, it is used as is.

  • If a callable is passed, it is assumed to be a function that takes in a cstorch.optim.Optimizer and returns a cstorch.optim.scheduler.Scheduler.

  • If None, there is no optimizer param group scheduling.
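
A sketch of the callable form, which defers scheduler construction until the Trainer has built the optimizer. The LinearLR class and its argument names here are assumptions; substitute any cstorch.optim.scheduler.Scheduler subclass:

    import cerebras.pytorch as cstorch
    from cerebras.modelzoo.trainer.callbacks import SchedulersCallback

    schedulers = SchedulersCallback(
        schedulers=[
            # The callable receives the optimizer and returns a scheduler.
            lambda optimizer: cstorch.optim.lr_scheduler.LinearLR(
                optimizer,
                initial_learning_rate=1e-3,
                end_learning_rate=1e-5,
                total_iters=10_000,
            ),
        ],
    )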

SparsityCallback#

class cerebras.modelzoo.trainer.callbacks.SparsityCallback(sparsity=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.CoreCallback

Callback class that applies sparsity to the model and optimizer.

Parameters

sparsity (Optional[cerebras.pytorch.sparse.base.SparsityAlgorithm]) – Sparsity algorithm instance.

Add-on Callbacks#

A set of optional callbacks that can be used to enhance the Trainer.

CoreCallback#

class cerebras.modelzoo.trainer.callbacks.CoreCallback[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

A special type of callback that indicates to the trainer that it is a core callback. Core callbacks are used internally by the trainer and should not be removed or replaced by the user.

Note: User-defined callbacks should not subclass CoreCallback.

LoadCheckpointStates#

class cerebras.modelzoo.trainer.callbacks.LoadCheckpointStates(load_checkpoint_states='all')[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback to load specific states of the model from the checkpoint.

Parameters

load_checkpoint_states (Union[str, List[str]]) – The list of state names to load from the checkpoint.

preprocess_checkpoint(trainer, state_dict)[source]#

KeepNCheckpoints#

class cerebras.modelzoo.trainer.callbacks.KeepNCheckpoints(n=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback to regulate the maximum number of checkpoints retained.

Parameters

n (Optional[int]) – Number of checkpoint files to keep. If the number of checkpoint files saved exceeds this number, checkpoint files are deleted starting with the oldest one. Does not affect checkpoints taken from previous runs. If n is None, no checkpoints are deleted.

on_after_save_checkpoint(trainer, ckpt_path)[source]#

SaveCheckpointState#

class cerebras.modelzoo.trainer.callbacks.SaveCheckpointState(k, checkpoint_states='model', checkpoint_name='{checkpoint_states}_{ckpt_name}')[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback to save an alternative checkpoint file that contains a subset of states and is not affected by deletion policies.

Parameters
  • k (int) – Cadence at which the alternative checkpoint is saved. Specifies the number of saved checkpoints between alternative checkpoint saves. For example, if a full checkpoint is taken every 100 steps and k=5, then an alternative checkpoint is saved every 500 steps.

  • checkpoint_states (Union[str, List[str]]) – List of valid checkpoint states to save. Can be a single state or list of states or ‘all’ (all states).

  • checkpoint_name (str) – Prefix to add to the alternative checkpoint file name. The name will be formatted with the following keys:

    • checkpoint_states: underscore-separated list of checkpoint states.

    • ckpt_name: original checkpoint file name.

on_train_start(trainer, model, train_dataloader, loop, loop_idx)[source]#
on_train_batch_start(trainer, model, batch, batch_idx)[source]#
on_after_save_checkpoint(trainer, ckpt_path)[source]#
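
A sketch combining this callback with KeepNCheckpoints; the values are illustrative:

    from cerebras.modelzoo.trainer.callbacks import KeepNCheckpoints, SaveCheckpointState

    callbacks = [
        # Retain only the 3 most recent full checkpoints from this run.
        KeepNCheckpoints(n=3),
        # Every 5th saved checkpoint, also write a model-only checkpoint
        # that is exempt from the deletion policy above.
        SaveCheckpointState(k=5, checkpoint_states="model"),
    ]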

Lora#

class cerebras.modelzoo.trainer.callbacks.Lora(lora_params)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that handles lorafying the model.

Parameters

lora_params (Union[dict, List[dict], cerebras.modelzoo.common.utils.model.lora.LoraConfig, List[cerebras.modelzoo.common.utils.model.lora.LoraConfig]]) – The parameters to configure LoRA.

pre_setup(trainer)[source]#

LogInputSummaries#

class cerebras.modelzoo.trainer.callbacks.LogInputSummaries[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that logs the batches produced by the dataloader.

log_input_summaries(trainer, batch)[source]#

Logs the input summaries.

on_before_forward(trainer, model, batch, args, kwargs)[source]#

LogOptimizerParamGroup#

class cerebras.modelzoo.trainer.callbacks.LogOptimizerParamGroup(keys)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Logs specific param group keys the optimizer used in the most recent step.

Parameters

keys (Union[str, Iterable[str]]) – A string or an iterable of strings representing the keys in the param group to log.

setup(trainer)[source]#
on_after_optimizer_step(trainer, model, optimizer)[source]#
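
For example, assuming the optimizer's param groups carry the standard "lr" key:

    from cerebras.modelzoo.trainer.callbacks import LogOptimizerParamGroup

    # Log the learning rate used in the most recent optimizer step.
    log_lr = LogOptimizerParamGroup(keys="lr")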

LogSparsity#

class cerebras.modelzoo.trainer.callbacks.LogSparsity[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Log target and actual sparsity levels.

setup(trainer)[source]#

WeightCompression#

class cerebras.modelzoo.trainer.callbacks.WeightCompression(compressions)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class to apply weight compression to the model.

Parameters

compressions (Union[dict, List[dict]]) – Compression configuration to apply to the model.

setup(trainer)[source]#

GlobalFlags#

class cerebras.modelzoo.trainer.callbacks.GlobalFlags(**flags)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback to set global perf/debug flags with no scoping.

This has side effects on all runs. To scope to a given run, use scoped flags instead.

Parameters

flags – Dictionary of debug/performance flags to set. The keys must be the full path to the flag after cstorch.backends, e.g. “csx.debug.debug_args”.

pre_setup(trainer)[source]#

ScopedTrainFlags#

class cerebras.modelzoo.trainer.callbacks.ScopedTrainFlags(**flags)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.flags._ScopedFlags

Callback to set global perf/debug flags within the training scope.

The overwritten flags are restored after training is complete.

Parameters

flags – Dictionary of debug/performance flags to set. The keys must be the full path to the flag after cstorch.backends, e.g. “csx.debug.debug_args”.

on_train_start(trainer, model, train_dataloader, loop, loop_idx)[source]#
on_train_end(trainer, model, loop, loop_idx)[source]#

ScopedValidateFlags#

class cerebras.modelzoo.trainer.callbacks.ScopedValidateFlags(**flags)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.flags._ScopedFlags

Callback to set global perf/debug flags within the validation scope.

The overwritten flags are restored after validation is complete.

Parameters

flags – Dictionary of debug/performance flags to set. The keys must be the full path to the flag after cstorch.backends, e.g. “csx.debug.debug_args”.

on_validate_start(trainer, model, val_dataloader, loop)[source]#
on_validate_end(trainer, model, loop)[source]#
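
A sketch showing the three flag callbacks. Dotted flag paths are not valid Python identifiers, so they are passed via dictionary unpacking; the flag path is the example from above, and the value is a labeled placeholder:

    from cerebras.modelzoo.trainer.callbacks import (
        GlobalFlags,
        ScopedTrainFlags,
        ScopedValidateFlags,
    )

    debug_args = ...  # placeholder; supply a real debug args value here
    flags = {"csx.debug.debug_args": debug_args}

    unscoped = GlobalFlags(**flags)          # affects all runs
    train_only = ScopedTrainFlags(**flags)   # restored after training
    val_only = ScopedValidateFlags(**flags)  # restored after validation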

DebugArgsPath#

class cerebras.modelzoo.trainer.callbacks.DebugArgsPath(debug_args_path)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback to load debug args from a file.

Parameters

debug_args_path (str) – Path to the debug args file.

setup(trainer)[source]#

CheckLoss#

class cerebras.modelzoo.trainer.callbacks.CheckLoss[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that checks for NaN or inf loss values.

It also checks whether the model output contains a scalar loss value.

on_after_forward(trainer, model, outputs, batch)[source]#
check_loss(loss)[source]#

Checks for NaN or inf loss values.

Parameters

loss (torch.Tensor) – Scalar loss tensor.

on_train_batch_end(trainer, model, outputs, batch, batch_idx)[source]#
on_validate_batch_end(trainer, model, outputs, batch, batch_idx)[source]#

ModelZooParamsMetadata#

class cerebras.modelzoo.trainer.callbacks.ModelZooParamsMetadata(params=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that stores the model zoo parameters in the checkpoint metadata.

Parameters

params (Optional[dict]) – Model zoo parameters.

on_save_checkpoint(trainer, state_dict)[source]#
on_load_checkpoint(trainer, state_dict)[source]#

ModelEvalMetrics#

class cerebras.modelzoo.trainer.callbacks.ModelEvalMetrics[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that logs all metrics attached to the model.

on_validate_end(trainer, model, loop)[source]#

DumpAvailableTensorNames#

class cerebras.modelzoo.trainer.callbacks.DumpAvailableTensorNames[source]#

Bases: cerebras.modelzoo.trainer.callbacks.listener._ListenerCallback

traced_tensor_hook(tensor, name)[source]#
trace_fn_pre_hook()[source]#
trace_fn_post_hook()[source]#

SummaryTensorListener#

class cerebras.modelzoo.trainer.callbacks.SummaryTensorListener(listener_name, tensor_names)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.listener._ListenerCallback

Tensor listener that summarizes every tensor.

Constructs named tensor listener.

Parameters
  • listener_name (str) – A listener name to be used in the summarized tensor names.

  • tensor_names (Union[str, List[str]]) – A list of tensor names to be captured. Glob patterns are also supported for matching groups of tensors; see https://docs.python.org/3/library/fnmatch.html for details.

traced_tensor_hook(tensor, name)[source]#

NormTensorListener#

class cerebras.modelzoo.trainer.callbacks.NormTensorListener(listener_name, tensor_names)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.listener._ListenerCallback

Tensor listener that computes tensor norms.

Constructs named tensor listener.

Parameters
  • listener_name (str) – A listener name to be used in the summarized tensor names.

  • tensor_names (Union[str, List[str]]) – A list of tensor names to be captured. Glob patterns are also supported for matching groups of tensors; see https://docs.python.org/3/library/fnmatch.html for details.

traced_tensor_hook(tensor, name)[source]#
trace_fn_pre_hook()[source]#
trace_fn_post_hook()[source]#

ComputeNorm#

class cerebras.modelzoo.trainer.callbacks.ComputeNorm[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that computes the model-wise and per-layer norms of the parameters.

compute_param_norm(trainer, model)[source]#

Compute the model-wise norm of the parameters.

on_before_backward(trainer, model, outputs)[source]#
compute_grad_norm(trainer, model)[source]#

Compute the model-wise and per-layer norms of the gradients.

on_before_optimizer_step(trainer, model, optimizer)[source]#

DumpActivations#

class cerebras.modelzoo.trainer.callbacks.DumpActivations(outdir=None, buffer_steps=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback to dump activations for CPU/GPU runs.

Parameters
  • outdir (Optional[str]) – The output directory at which to dump the activations.

  • buffer_steps (Optional[int]) – If given, flush to a new .npz file after this many steps.

setup(trainer)[source]#
on_train_batch_start(trainer, model, batch, batch_idx)[source]#
on_train_batch_end(trainer, model, outputs, batch, batch_idx)[source]#
on_train_end(trainer, model, loop, loop_idx)[source]#
on_validate_batch_start(trainer, model, batch, batch_idx)[source]#
on_validate_batch_end(trainer, model, outputs, batch, batch_idx)[source]#
on_validate_end(trainer, model, loop)[source]#

Profiler#

class cerebras.modelzoo.trainer.callbacks.Profiler[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Base class for all Profiler callbacks.

property perf_metrics: dict#

Returns the performance metrics collected by the profiler.

RateProfiler#

class cerebras.modelzoo.trainer.callbacks.RateProfiler[source]#

Bases: cerebras.modelzoo.trainer.callbacks.profiler.Profiler

Callback that tracks the rate of samples processed by the model, as measured by the client.

Sets up the rate tracker.

property rate: float#

Smoothed samples/second of all the samples added since last queried.

This value is cached and recomputed only when the count is updated.

property global_rate: float#

Non-smoothed samples/second since the rate tracker was initialized.

This value is cached and recomputed only when the count is updated.

property elapsed_seconds: float#

Time (seconds) elapsed since the last reset.

This value is cached and recomputed only when the count is updated.

property total_count: int#

Total number of samples processed since the last reset.

clear_cache()[source]#

Clear all cached properties.

conditional_reset(trainer)[source]#

Reset the rate tracker if on first iteration.

update(count)[source]#

Update the rate tracker with the count of samples processed.

property perf_metrics#
on_before_forward(trainer, model, batch, args, kwargs)[source]#

OpProfiler#

class cerebras.modelzoo.trainer.callbacks.OpProfiler(start_step=-1, end_step=-1, host_activities=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.profiler.Profiler

Callback class that profiles the model using the Cerebras Profiler.

Parameters
  • start_step (int) – Start step for profiling.

  • end_step (int) – End step for profiling.

  • host_activities (Optional[List[str]]) – List of ACT/WGT/CSX numbers to profile.

setup_op_profiler(trainer)[source]#

Context manager to profile the model using the Cerebras Op Profiler.

on_enter_fit(trainer, stack, train_dataloader, val_dataloader, loop)[source]#
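
A configuration sketch; the step values are illustrative:

    from cerebras.modelzoo.trainer.callbacks import OpProfiler

    # Profile steps 100 through 110 of the run.
    op_profiler = OpProfiler(start_step=100, end_step=110)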

SavePerformanceData#

class cerebras.modelzoo.trainer.callbacks.SavePerformanceData[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback that saves the performance metrics collected by all Profiler callbacks.

save_perf_json(trainer)[source]#

Context manager to save the performance metrics to a JSON file.

on_enter_train(trainer, stack, train_dataloader, loop, loop_idx)[source]#
on_enter_validate(trainer, stack, val_dataloader, loop)[source]#

FlopUtilization#

class cerebras.modelzoo.trainer.callbacks.FlopUtilization[source]#

Bases: cerebras.modelzoo.trainer.callbacks.profiler.Profiler

Callback that computes the FLOP utilization of the model.

Initializes the FLOP utilization tracker.

setup(trainer)[source]#
property flop_utilization: Optional[float]#

Returns the FLOP utilization of the model.

property perf_metrics: dict#
get_flops(trainer)[source]#

Get the FLOPs of the model from the compile response.

on_enter_train(trainer, stack, train_dataloader, loop, loop_idx)[source]#
on_enter_validate(trainer, stack, val_dataloader, loop)[source]#
on_after_forward(trainer, model, outputs, batch)[source]#

SelectiveGrad#

class cerebras.modelzoo.trainer.callbacks.SelectiveGrad(selective_grads)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback class that selectively applies gradient computation.

Constructs a SelectiveGrad instance.

Parameters

selective_grads (Union[dict, List[dict]]) – Configuration for selective gradient computation. It may be initialized with a configuration dict or list of dicts.

setup(trainer)[source]#

CountParams#

class cerebras.modelzoo.trainer.callbacks.CountParams(search_and_replace=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback that runs on model setup for counting the number of parameters in a network.

Along with printing the total number of parameters, it also prints out a table which shows the relative contribution (%) that each parameter has to the total count. Additionally, parameters can be grouped together to better see the relative contributions.

For example, the following groups parameters across layers together using regex-style search & replace:

callbacks:
  - CountParams:
      search_and_replace: [[".layers.\d+.", ".grouped_layers."]]

Modules                                                                        | Parameters | %
model.embedding_layer.word_embeddings.weight                                   | 6,432,896  | 93.96
model.embedding_layer.position_embeddings.embed.weight                         | 16,384     | 0.24
model.ln_f.weight                                                              | 128        | 0.00
model.ln_f.bias                                                                | 128        | 0.00
model.transformer_decoder.all_layers.self_attn.proj_q_dense_layer.weight      | 32,768     | 0.48
model.transformer_decoder.all_layers.self_attn.proj_q_dense_layer.bias        | 256        | 0.00
model.transformer_decoder.all_layers.self_attn.proj_k_dense_layer.weight      | 32,768     | 0.48
model.transformer_decoder.all_layers.self_attn.proj_k_dense_layer.bias        | 256        | 0.00
model.transformer_decoder.all_layers.self_attn.proj_v_dense_layer.weight      | 32,768     | 0.48
model.transformer_decoder.all_layers.self_attn.proj_v_dense_layer.bias        | 256        | 0.00
model.transformer_decoder.all_layers.self_attn.proj_output_dense_layer.weight | 32,768     | 0.48
model.transformer_decoder.all_layers.self_attn.proj_output_dense_layer.bias   | 256        | 0.00
model.transformer_decoder.all_layers.norm1.weight                             | 256        | 0.00
model.transformer_decoder.all_layers.norm1.bias                               | 256        | 0.00
model.transformer_decoder.all_layers.norm3.weight                             | 256        | 0.00
model.transformer_decoder.all_layers.norm3.bias                               | 256        | 0.00
model.transformer_decoder.all_layers.ffn.ffn.0.linear_layer.weight            | 131,072    | 1.91
model.transformer_decoder.all_layers.ffn.ffn.0.linear_layer.bias              | 1,024      | 0.01
model.transformer_decoder.all_layers.ffn.ffn.1.linear_layer.weight            | 131,072    | 1.91
model.transformer_decoder.all_layers.ffn.ffn.1.linear_layer.bias              | 256        | 0.00

Parameters

search_and_replace (Optional[List[Tuple[str, str]]]) – An optional list of search & replace patterns to apply to parameter names. Each search & replace is a tuple containing a regex string for searching and a corresponding replacement string. For example, you can "group" parameters together across layers by searching for ".layers.\d+." and replacing it with ".grouped_layers.".

setup(trainer)[source]#
get_table(model)[source]#
get_parameter_counts(model, search_and_replace=None)[source]#

EmailNotification#

class cerebras.modelzoo.trainer.callbacks.EmailNotification(mailto, notification_endpoint=None)[source]#

Bases: cerebras.modelzoo.trainer.callbacks.callback.Callback

Callback for sending email notifications on certain events.

Currently, the notification system requires an external notification server, which actually sends the emails, to be up and running. Please contact support@cerebras.net to set up this server.

Constructs an EmailNotification callback.

Parameters
  • mailto (Union[str, List[str]]) – email address(es) to send notifications to.

  • notification_endpoint (Optional[str]) – A notification server that listens for requests which it then forwards to the recipients. If provided, this endpoint is used. Otherwise, its value is read from the CEREBRAS_NOTIFICATION_ENDPOINT environment variable.

on_train_exception(trainer, exception)[source]#
on_fit_end(trainer, loop)[source]#
on_validate_exception(trainer, exception)[source]#
send_email_notification(trainer, message)[source]#
get_formatted_message(message)[source]#

Utility Functions#

cerebras.modelzoo.trainer.callbacks.register_global_callback(callback)[source]#

Register a global callback.

Parameters

callback – the Callback to register. If a class is passed, an instance of the class is created. If an instance is passed, it is registered as is.

Returns

A torch.utils.hooks.RemovableHandle object.
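
A usage sketch; the returned handle deregisters the callback via its remove() method:

    from cerebras.modelzoo.trainer.callbacks import (
        CountParams,
        register_global_callback,
    )

    handle = register_global_callback(CountParams)  # a class is instantiated for you

    # ... construct and run Trainers; the callback applies globally ...

    handle.remove()  # deregister the callback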