cerebras_pytorch.experimental package#

Automatic mixed precision#

The following classes and subclasses are designed to facilitate automatic mixed precision on the Cerebras Wafer Scale Cluster.

GradScaler#

class experimental.amp.GradScaler[source]#

Facilitates mixed precision training, dynamic loss scaling (DLS), and DLS combined with global gradient clipping (DLS + GCC).

For more details, see the documentation for amp.initialize.

Parameters
  • loss_scale – If loss_scale == “dynamic”, then configure dynamic loss scaling. Otherwise, it is the loss scale value used in static loss scaling.

  • init_scale – The initial loss scale value if loss_scale == “dynamic”

  • steps_per_increase – The number of steps after which the loss scale is increased

  • min_loss_scale – The minimum loss scale value that can be chosen by dynamic loss scaling

  • max_loss_scale – The maximum loss scale value that can be chosen by dynamic loss scaling

  • overflow_tolerance – The maximum fraction of steps involving infinite or undefined values in the gradient we allow. We reduce the loss scale if the tolerance is exceeded

  • max_gradient_norm – The maximum gradient norm to use for global gradient clipping. Only applies in the DLS + GCC case. If GCC is not enabled, this parameter has no effect
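The interaction between init_scale, steps_per_increase, min_loss_scale, max_loss_scale and overflow_tolerance can be sketched in plain Python. This is an illustrative toy, not the Cerebras implementation: the class name ToyDynamicLossScaler, the default values, and the halving/doubling factors are all assumptions made for the example.

```python
import math


class ToyDynamicLossScaler:
    """Illustrative stand-in; NOT the Cerebras GradScaler implementation."""

    def __init__(self, init_scale=2.0**15, steps_per_increase=2000,
                 min_loss_scale=1.0, max_loss_scale=2.0**24,
                 overflow_tolerance=0.05):
        # All default values here are arbitrary choices for the sketch.
        self.scale = init_scale
        self.steps_per_increase = steps_per_increase
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self.overflow_tolerance = overflow_tolerance
        self._reset_counters()

    def _reset_counters(self):
        self.steps_since_rescale = 0
        self.overflows_since_rescale = 0

    def update(self, grads):
        """Adjust the scale after one step, given that step's gradients."""
        self.steps_since_rescale += 1
        if not all(math.isfinite(g) for g in grads):
            self.overflows_since_rescale += 1
        fraction = self.overflows_since_rescale / self.steps_since_rescale
        if fraction > self.overflow_tolerance:
            # Tolerance exceeded: reduce the scale (factor of 2 is assumed).
            self.scale = max(self.scale / 2.0, self.min_loss_scale)
            self._reset_counters()
        elif self.steps_since_rescale >= self.steps_per_increase:
            # A full window stayed within tolerance: increase the scale.
            self.scale = min(self.scale * 2.0, self.max_loss_scale)
            self._reset_counters()
        return self.scale
```

In this sketch the scale is always clamped to the range [min_loss_scale, max_loss_scale], so repeated overflows cannot drive it to zero and long stable runs cannot grow it without bound.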

__init__(loss_scale: Optional[Union[str, float]] = None, init_scale: Optional[float] = None, steps_per_increase: Optional[int] = None, min_loss_scale: Optional[float] = None, max_loss_scale: Optional[float] = None, overflow_tolerance: float = 0.05, max_gradient_norm: Optional[float] = None)[source]#
clip_gradients_and_return_isfinite(optimizers)[source]#

Clip the gradients of the optimizers’ parameters and return whether or not the gradient norm is finite

get_scale()[source]#

Return the loss scale

load_state_dict(state_dict)[source]#

Loads the state dictionary into the current params

scale(loss: torch.Tensor)[source]#

Scales the loss in preparation for the backward pass

state_dict(destination=None)[source]#

Returns a dictionary containing the state to be saved to a checkpoint

step(optimizer, *args, **kwargs)[source]#

Step carries out the following two operations:

  1. Internally invokes unscale_(optimizer) (unless unscale_ was explicitly called for optimizer earlier in the iteration). As part of the unscale_, gradients are checked for infs/NaNs.

  2. Invokes optimizer.step() using the unscaled gradients. Ensures that the previous optimizer state or params carry over if NaNs are encountered in the gradients.

*args and **kwargs are forwarded to optimizer.step().

Parameters
  • optimizer (cerebras_pytorch.optim.Optimizer) – Optimizer that applies the gradients.

  • args – Any arguments.

  • kwargs – Any keyword arguments.

Returns

The return value of optimizer.step(*args, **kwargs)

step_if_finite(optimizer, *args, **kwargs)[source]#

Calls optimizer.step(*args, **kwargs) only if this GradScaler detected finite gradients.

Parameters
  • optimizer (cerebras_pytorch.experimental.optim.Optimizer) – Optimizer that applies the gradients.

  • args – Any arguments.

  • kwargs – Any keyword arguments.

Returns

The result of optimizer.step()
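The unscale-then-conditionally-step behavior described for step and step_if_finite can be illustrated with a small pure-Python sketch. It is a conceptual toy under stated assumptions: plain floats stand in for tensors, a basic SGD update stands in for optimizer.step(), and the helper names unscale and toy_step_if_finite are hypothetical.

```python
import math


def unscale(scaled_grads, scale):
    """Divide each scaled gradient by the loss scale."""
    return [g / scale for g in scaled_grads]


def toy_step_if_finite(params, scaled_grads, scale, lr=0.1):
    """Apply a plain SGD update only when every unscaled gradient is finite.

    Returns (new_params, stepped). On inf/NaN the parameters are returned
    unchanged, mirroring "previous params carry over".
    """
    grads = unscale(scaled_grads, scale)
    if not all(math.isfinite(g) for g in grads):
        return list(params), False  # skip the update entirely
    return [p - lr * g for p, g in zip(params, grads)], True
```

Checking finiteness after unscaling matters because the optimizer must see gradients at their true magnitude; a skipped step leaves the parameters exactly as they were.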

unscale_(optimizer)[source]#

Unscales the gradients of the optimizer’s parameters in place

update(new_scale=None)[source]#

Update the gradient scaler after all optimizers have been stepped

update_scale(optimizers)[source]#

Update the scales of the optimizers

warned_unscaling_non_fp32_grad = False#

optimizer_step#

experimental.amp.optimizer_step(loss: torch.Tensor, optimizer: cerebras_pytorch.experimental.optim.Optimizer, grad_scaler: cerebras_pytorch.experimental.amp.GradScaler, max_gradient_norm: Optional[float] = None, max_gradient_value: Optional[float] = None)[source]#

Performs loss scaling, gradient scaling and optimizer step

Parameters
  • loss – The loss value to scale. loss.backward should be called before this function

  • optimizer – The optimizer to step

  • grad_scaler – The gradient scaler to use to scale the parameter gradients

  • max_gradient_norm – The maximum gradient norm to use for gradient clipping

  • max_gradient_value – The maximum gradient value to use for gradient clipping
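The two clipping parameters correspond to two different operations, sketched below in plain Python. The sketch assumes an L2 norm for norm-based clipping (the default in torch.nn.utils.clip_grad_norm_); the source does not state which norm optimizer_step uses, and the function names here are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_gradient_norm):
    """Rescale all gradients so their combined L2 norm is at most the limit."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_gradient_norm:
        return list(grads)  # already within the limit, nothing to do
    factor = max_gradient_norm / total_norm
    return [g * factor for g in grads]


def clip_by_value(grads, max_gradient_value):
    """Clamp each gradient to [-max_gradient_value, max_gradient_value]."""
    return [max(-max_gradient_value, min(max_gradient_value, g)) for g in grads]
```

Norm clipping preserves the direction of the gradient vector and only shortens it, whereas value clipping treats each element independently and can change the direction.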

Creation Ops#

Can be used to lazily initialize tensors with known shape, dtype and value, so that they do not take up memory unnecessarily.
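The idea behind lazy creation can be sketched as follows: a constant tensor is fully described by its shape, dtype and fill value, so the elements need not be allocated until they are used. This is a conceptual toy, not the Cerebras implementation; the LazyFilled class and the zeros and full helpers below are hypothetical stand-ins that use Python lists and types instead of tensors and torch dtypes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyFilled:
    """Hypothetical descriptor of a constant tensor: metadata only."""
    shape: Tuple[int, ...]
    dtype: type
    fill_value: float

    def numel(self) -> int:
        count = 1
        for dim in self.shape:
            count *= dim
        return count

    def materialize(self) -> list:
        """Allocate the flat element buffer only when it is needed."""
        return [self.dtype(self.fill_value)] * self.numel()


def full(shape, fill_value, dtype=float) -> LazyFilled:
    return LazyFilled(shape=tuple(shape), dtype=dtype, fill_value=fill_value)


def zeros(*shape, dtype=float) -> LazyFilled:
    return full(shape, 0, dtype=dtype)


big = zeros(1024, 1024)  # stores three fields, not 1024 * 1024 elements
```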

full#

full_like#

ones#

ones_like#

zeros#

zeros_like#

Checkpoint Saving/Loading utilities#

Data Utilities#

utils.data.DataLoader#

class experimental.utils.data.DataLoader[source]#

Wrapper around torch.utils.data.DataLoader that facilitates moving data generated by the dataloader to a Cerebras system

Parameters
  • input_fn – A callable that returns a torch.utils.data.DataLoader instance

  • *args, **kwargs – Any other positional or keyword arguments are passed into the input_fn when each worker instantiates its respective dataloader

__init__(input_fn: Callable[[...], torch.utils.data.DataLoader], *args, **kwargs)[source]#
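The reason for passing a callable rather than a dataloader instance is that construction is deferred until each worker needs its own loader. The pattern can be sketched without any Cerebras or PyTorch dependency; DeferredLoader and toy_input_fn are hypothetical names, and a list of batches stands in for torch.utils.data.DataLoader.

```python
from typing import Any, Callable, Iterable


class DeferredLoader:
    """Hypothetical stand-in that stores input_fn and its arguments."""

    def __init__(self, input_fn: Callable[..., Iterable], *args: Any, **kwargs: Any):
        self.input_fn = input_fn
        self.args = args
        self.kwargs = kwargs

    def build(self) -> Iterable:
        """Called once per worker so each worker creates its own loader."""
        return self.input_fn(*self.args, **self.kwargs)


def toy_input_fn(batch_size: int, num_batches: int = 2):
    """Toy loader factory: a list of batches replaces a real DataLoader."""
    return [[i] * batch_size for i in range(num_batches)]


loader = DeferredLoader(toy_input_fn, 4, num_batches=3)
worker_a = loader.build()  # each call produces an independent loader
worker_b = loader.build()
```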

utils.data.SyntheticDataset#

class experimental.utils.data.SyntheticDataset[source]#

A synthetic dataset that generates samples from a SampleSpec.

Constructs a SyntheticDataset instance.

A synthetic dataset can be used to generate samples on the fly with an expected dtype/shape but without needing to create a full-blown dataset. This is especially useful for compile validation.

Parameters
  • sample_spec

    Specification of the samples to generate. This can be a nested structure of one of the following types:

    • torch.Tensor: A tensor to be cloned.

    • Callable: A callable that takes the sample index and returns a tensor.

    Supported data structures for holding the above leaf nodes are list, tuple, dict, OrderedDict, and NamedTuple.

  • num_samples – Total size of the dataset. If None, the dataset will generate samples indefinitely.
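How a nested sample_spec might be turned into concrete samples can be sketched as a recursive walk over the structure. This is an illustrative approximation rather than the library's code: the function names resolve and synthetic_samples are hypothetical, and copy.deepcopy on plain Python values stands in for cloning a torch.Tensor.

```python
import copy
import itertools


def resolve(spec, index):
    """Recursively turn a sample spec into one concrete sample."""
    if callable(spec):
        return spec(index)  # callable leaf: generate from the sample index
    if isinstance(spec, tuple) and hasattr(spec, "_fields"):
        return type(spec)(*(resolve(s, index) for s in spec))  # NamedTuple
    if isinstance(spec, (list, tuple)):
        return type(spec)(resolve(s, index) for s in spec)
    if isinstance(spec, dict):  # also covers OrderedDict
        return type(spec)((k, resolve(v, index)) for k, v in spec.items())
    return copy.deepcopy(spec)  # value leaf: copied, like Tensor.clone()


def synthetic_samples(sample_spec, num_samples=None):
    """Yield num_samples samples, or run indefinitely when it is None."""
    indices = itertools.count() if num_samples is None else range(num_samples)
    for index in indices:
        yield resolve(sample_spec, index)


spec = {"input_ids": lambda i: [i, i + 1], "labels": [0, 1]}
first_two = list(synthetic_samples(spec, num_samples=2))
```

When num_samples is None the generator never terminates on its own, so the consumer must bound it, for example with itertools.islice.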

__init__(sample_spec: Union[torch.Tensor, Callable[[int], torch.Tensor], List[Union[torch.Tensor, Callable[[int], torch.Tensor], List[SampleSpecT], Tuple[SampleSpecT, ...], Dict[str, SampleSpecT], OrderedDict[str, SampleSpecT], NamedTuple]], Tuple[Union[torch.Tensor, Callable[[int], torch.Tensor], List[SampleSpecT], Tuple[SampleSpecT, ...], Dict[str, SampleSpecT], OrderedDict[str, SampleSpecT], NamedTuple], ...], Dict[str, Union[torch.Tensor, Callable[[int], torch.Tensor], List[SampleSpecT], Tuple[SampleSpecT, ...], Dict[str, SampleSpecT], OrderedDict[str, SampleSpecT], NamedTuple]], OrderedDict[str, Union[torch.Tensor, Callable[[int], torch.Tensor], List[SampleSpecT], Tuple[SampleSpecT, ...], Dict[str, SampleSpecT], OrderedDict[str, SampleSpecT], NamedTuple]], NamedTuple], num_samples: Optional[int] = None)[source]#

utils.data.DataExecutor#

class experimental.utils.data.DataExecutor#

Defines a single execution run on a Cerebras Wafer Scale Cluster

Parameters
  • dataloader – the dataloader to use for the run

  • num_steps – the number of steps to run. Defaults to 1 if the backend was configured for compile or validate only

  • checkpoint_steps – the interval at which to schedule fetching checkpoints from the cluster

  • cs_config – Optionally, a CSConfig object can be passed in to configure the Cerebras Wafer Scale Cluster. If none is provided, the default configuration values are used.

  • writer – The summary writer to be used to write any summarized scalars or tensors to tensorboard

  • profiler_activities – The list of activities to profile. By default, the client-side rate and global rate are tracked

__init__(*args: Any, **kwargs: Any) → None#

utils.CSConfig#

class experimental.utils.CSConfig#

Contains config details for the Cerebras Wafer Scale Cluster

Parameters
  • mgmt_address (Optional[str]) – Address to connect to appliance. If not provided, query the cluster management node for it. Default: None.

  • credentials_path (Optional[str]) – Credentials for connecting to appliance. If not provided, query the cluster management node for it. Default: None.

  • num_csx (int) – Number of Cerebras Systems to run on. Default: 1.

  • max_wgt_servers (int) – Number of weight servers to support run. Default: 24.

  • max_act_per_csx (int) – Number of activation servers per system. Default: 1.

  • num_workers_per_csx (int) – Number of streaming workers per system. Default: 1.

  • transfer_processes (int) – Number of processes to transfer data to/from appliance. Default: 5.

  • job_time_sec (int) – Time limit for the appliance jobs, not including the queue time. Default: None.

  • mount_dirs (List[str]) – Local storage to mount to the appliance (e.g., training data). Default: None.

  • python_paths (List[str]) – A list of paths that worker pods respect as PYTHONPATH in addition to the PYTHONPATH set in the container image. Default: None.

  • job_labels (List[str]) – A list of equal-sign-separated key-value pairs that get applied as part of job metadata. Default: None.

  • debug_args (DebugArgs) – Optional debugging arguments object. Default: None.

  • precision_opt_level (int) – The precision optimization level. Default: 1.
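The job_labels format of equal-sign-separated key-value pairs can be illustrated with a small parser. The helper parse_job_labels is hypothetical and is not part of CSConfig; it only shows what well-formed entries look like, assuming each string is split on its first equal sign.

```python
def parse_job_labels(job_labels):
    """Split each "key=value" string on the first equal sign."""
    parsed = {}
    for label in job_labels:
        key, sep, value = label.partition("=")
        if not sep or not key:
            raise ValueError(f"expected 'key=value', got {label!r}")
        parsed[key] = value
    return parsed


labels = parse_job_labels(["team=nlp", "experiment=run-7"])
```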

Metrics#

A collection of evaluation metrics that can be used to evaluate the performance of a trained model on the Cerebras Wafer Scale Cluster.

metrics.Metric#

class experimental.metrics.Metric[source]#

The abstract base metric class

__init__(name)[source]#
abstract compute() → float[source]#

Compute and return the final metric value

forward(*args, **kwargs)[source]#

Updates the metric value

register_output(name: str)[source]#

Create and register a new property with provided name that handles fetching the tensor value when assigning to the property

Note, this means that only tensors are allowed to be set for these properties

Parameters

name – the name of the property

register_state(name: str, value: torch.Tensor)[source]#

Registers a state variable to the module

registry = {}#
abstract reset()[source]#

Reset the metric state

abstract update(*args, **kwargs)[source]#

Update the metric value

metrics.AccuracyMetric#

class experimental.metrics.AccuracyMetric[source]#

Computes the accuracy of the model’s predictions

Parameters

name – Name of the metric

reset()[source]#
update(labels, predictions, weights=None, dtype=None)[source]#
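The reset/update/compute pattern shared by the metric classes can be sketched for accuracy as a running weighted count. This is a conceptual stand-in under stated assumptions: ToyAccuracy is a hypothetical class, Python lists replace tensors, and weights are assumed to be per-example multipliers.

```python
class ToyAccuracy:
    """Hypothetical stand-in following the reset/update/compute pattern."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        """Clear the accumulated state."""
        self.correct = 0.0
        self.total = 0.0

    def update(self, labels, predictions, weights=None):
        """Accumulate weighted correct and total counts for one batch."""
        if weights is None:
            weights = [1.0] * len(labels)
        for label, prediction, weight in zip(labels, predictions, weights):
            self.correct += weight * (label == prediction)
            self.total += weight

    def compute(self):
        """Return the running accuracy, or 0.0 before any update."""
        return self.correct / self.total if self.total else 0.0
```

Accumulating counts rather than per-batch accuracies keeps the final value correct when batches differ in size or weight.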

metrics.PerplexityMetric#

class experimental.metrics.PerplexityMetric[source]#

Computes the perplexity of the model’s predictions

Parameters

name – Name of the metric

reset()[source]#
update(labels, loss, weights=None, dtype=None)[source]#
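Perplexity is commonly defined as the exponential of the mean per-token loss, and a running version can be sketched as below. The sketch makes assumptions the source does not confirm: loss is taken to be the summed cross-entropy over the batch, weights are taken to be per-token multipliers whose sum is the token count, and ToyPerplexity is a hypothetical class.

```python
import math


class ToyPerplexity:
    """Hypothetical stand-in: exp(total loss / total token weight)."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        self.total_loss = 0.0
        self.total_weight = 0.0

    def update(self, labels, loss, weights=None):
        """Add one batch's summed loss and its (weighted) token count."""
        count = float(len(labels)) if weights is None else float(sum(weights))
        self.total_loss += loss
        self.total_weight += count

    def compute(self):
        return math.exp(self.total_loss / self.total_weight)
```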

metrics.compute_all_metrics#

experimental.metrics.compute_all_metrics()[source]#

Compute the floating point value of all registered metrics

Random Number Generation utilities#

numpy utilities#

from_numpy#

to_numpy#

Tensorboard utilities#

experimental.utils.tensorboard.SummaryWriter(*args, base_step: int = 1, **kwargs)#

Thin wrapper around torch.utils.tensorboard.SummaryWriter

Additional features include the ability to add a tensor summary

Parameters
  • base_step – The base step to use in summarize_{scalar,tensor} functions

  • *args, **kwargs – Any other positional and keyword arguments are forwarded directly to the base class

experimental.utils.tensorboard.SummaryReader(log_dir: str, **kwargs)#

Class for reading summaries saved using the SummaryWriter