`cerebras.pytorch.optim`#


Contains all Cerebras-compliant optimizer and learning rate scheduler classes.

class cerebras.pytorch.optim.Optimizer(params, defaults, enable_global_step=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.torch.optim.Optimizer, abc.ABC

The abstract Cerebras base optimizer class.

Enforces that the preinitialize method is implemented, in which the optimizer state is initialized ahead of time.

Parameters

params (Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]]) – Specifies what Tensors should be optimized.
defaults (Dict[str, Any]) – a dict containing default values of optimization options (used when a parameter group doesn’t specify them).
enable_global_step (bool) – If True, the optimizer will keep track of the global step for each parameter.

increment_global_step(p)[source]#: Increments the global step by 1 and returns the current value of the global step tensor as torch.float32.

state_dict(*args, **kwargs)[source]#

load_state_dict(state_dict)[source]#

register_zero_grad_pre_hook(hook)[source]#

Register an optimizer zero_grad pre hook which will be called before optimizer zero_grad. It should have the following signature:

hook(optimizer, args, kwargs) -> None or modified args and kwargs

The optimizer argument is the optimizer instance being used. If args and kwargs are modified by the pre-hook, then the transformed values are returned as a tuple containing the new_args and new_kwargs.

Parameters: hook (Callable) – The user defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle
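The pre-hook contract above (return None, or return the modified args and kwargs as a tuple) can be sketched in plain Python. This is only an illustration: `run_with_pre_hooks`, `log_only`, and `force_set_to_none` are hypothetical names, and the dispatcher is an assumption about how such hooks could be applied, not the library's implementation.

```python
def run_with_pre_hooks(optimizer, hooks, args, kwargs):
    """Hypothetical dispatcher: apply each pre-hook in registration order."""
    for hook in hooks:
        result = hook(optimizer, args, kwargs)
        if result is not None:
            # A hook that modifies its inputs returns (new_args, new_kwargs).
            args, kwargs = result
    return args, kwargs


def log_only(optimizer, args, kwargs):
    # Returns None, so args and kwargs pass through unchanged.
    print("zero_grad is about to run")


def force_set_to_none(optimizer, args, kwargs):
    # Returns a modified copy of kwargs (the key is illustrative).
    return args, {**kwargs, "set_to_none": True}


final_args, final_kwargs = run_with_pre_hooks(
    optimizer=None, hooks=[log_only, force_set_to_none], args=(), kwargs={}
)
```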

register_zero_grad_post_hook(hook)[source]#

Register an optimizer zero_grad post hook which will be called after optimizer zero_grad. It should have the following signature:

hook(optimizer, args, kwargs)

The optimizer argument is the optimizer instance being used.

Parameters: hook (Callable) – The user defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle

zero_grad(*args, **kwargs)[source]#: Runs the optimizer zero_grad method and calls any pre and post hooks.

apply(f)[source]#: Calls the function on self.

visit_state(fn)[source]#: Applies a lambda to each stateful value.

abstract preinitialize()[source]#: The optimizer state must be initialized ahead of time in order to capture the full compute graph in the first iteration. This method must be overridden to perform the state preinitialization.

abstract step(closure=None)[source]#: Perform the optimizer step itself. Note, there should be no new state being created in this function. All state must be created ahead of time in preinitialize and only updated in this method.
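The split between preinitialize (create all state) and step (only update it) can be sketched with a toy class in plain Python. `ToyOptimizer` and `ToyMomentumSGD` are hypothetical stand-ins that operate on floats; they are not the Cerebras classes and assume nothing about their internals beyond the contract described above.

```python
from abc import ABC, abstractmethod


class ToyOptimizer(ABC):
    """Hypothetical stand-in for the base class contract."""

    def __init__(self, params):
        # params: dict mapping a parameter name to a [value, grad] pair
        self.params = params
        self.state = {}
        # All state is created once, up front, before any step runs.
        self.preinitialize()

    @abstractmethod
    def preinitialize(self):
        """Create every piece of optimizer state ahead of time."""

    @abstractmethod
    def step(self):
        """Update existing state and parameters; create no new state."""


class ToyMomentumSGD(ToyOptimizer):
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        super().__init__(params)

    def preinitialize(self):
        # Allocate one momentum buffer per parameter.
        for name in self.params:
            self.state[name] = {"momentum_buffer": 0.0}

    def step(self):
        for name, (value, grad) in self.params.items():
            # Update the existing buffer; never allocate a new entry here.
            buf = self.momentum * self.state[name]["momentum_buffer"] + grad
            self.state[name]["momentum_buffer"] = buf
            self.params[name][0] = value - self.lr * buf


opt = ToyMomentumSGD({"w": [1.0, 0.5]})
state_keys_before = set(opt.state)
opt.step()
```

After the step, the set of state keys is the same as before it, which is the property the abstract methods are meant to guarantee.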

class cerebras.pytorch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0, maximize=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Adadelta optimizer implemented to perform the required pre-initialization of the optimizer state.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Adafactor(params, lr, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=False, warmup_init=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Adafactor optimizer implemented to conform to execution within the constraints of the Cerebras WSE.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-06, maximize=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Adagrad optimizer implemented to conform to execution within the constraints of the Cerebras WSE.

Parameters

params (Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
lr_decay (float) – learning rate decay
weight_decay (float) – weight decay (L2 penalty)
eps (float) – term added to the denominator to improve numerical stability
maximize (bool) – maximize the params based on the objective instead of minimizing

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization: http://jmlr.org/papers/v12/duchi11a.html

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Adamax(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, maximize=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Adamax optimizer implemented to perform the required pre-initialization of the optimizer state.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, amsgrad=False)[source]#

Bases: cerebras.pytorch.optim.AdamBase.AdamBase

Adam specific overrides to AdamBase.

handle_weight_decay(param_groups)[source]#

load_state_dict(state_dict)[source]#

Loads the optimizer state.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict.

Adds checkpoint compatibility with the Adam from PyTorch

class cerebras.pytorch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, correct_bias=True, amsgrad=False)[source]#

Bases: cerebras.pytorch.optim.AdamBase.AdamBase

AdamW specific overrides to AdamBase.

load_state_dict(state_dict)[source]#

Loads the optimizer state.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict.

Adds checkpoint compatibility with the AdamW from HuggingFace

class cerebras.pytorch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0, maximize=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

ASGD optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

For more details, see https://dl.acm.org/citation.cfm?id=131098

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, adam=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Implements the Lamb algorithm, proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Lion(params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Implements the Lion algorithm, as proposed in Symbolic Discovery of Optimization Algorithms.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-4)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.99))
weight_decay (float, optional) – weight decay coefficient (default: 0)

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Implements the NAdam algorithm to execute within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 2e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
momentum_decay (float, optional) – momentum decay (default: 4e-3)
foreach (bool, optional) – whether foreach implementation of optimizer is used (default: None)

For further details regarding the algorithm refer to Incorporating Nesterov Momentum into Adam: https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

RAdam optimizer implemented to conform to execution within the constraints of the Cerebras WSE.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

RMSprop optimizer implemented to perform the required pre-initialization of the optimizer state.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.Rprop(params, lr=0.001, etas=(0.5, 1.2), step_sizes=(1e-06, 50.0))[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

Rprop optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
etas (Tuple[float, float], optional) – step size multipliers
step_sizes (Tuple[float, float], optional) – Tuple of min, max step size values. Step size is clamped to be between these values.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

class cerebras.pytorch.optim.SGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False, maximize=False)[source]#

Bases: cerebras.pytorch.optim.optimizer.Optimizer

SGD optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

Parameters

params (Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]]) – Model parameters
lr (float) – The learning rate to use
momentum (float) – momentum factor
dampening (float) – dampening for momentum
weight_decay (float) – weight decay (L2 penalty)
nesterov (bool) – enables Nesterov momentum.

preinitialize()[source]#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

Generic Scheduler class in `cerebras.pytorch`#

optim.scheduler.Scheduler#

class cerebras.pytorch.optim.scheduler.Scheduler(optimizer, total_iters, last_epoch=-1, param_group_tags=None)[source]#

Generic scheduler class for various optimizer params.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
total_iters (int) – Number of steps to perform the decay
last_epoch (int) – the initial step to start at
param_group_tags (Optional[Union[str, List[str]]]) – param group tags to target update for

abstract _get_closed_form()[source]#

abstract property param_group_key#: Key of the param group value to modify. For example, ‘lr’ or ‘weight_decay’.

get()[source]#

state_dict()[source]#

load_state_dict(state_dict)[source]#

increment_last_epoch()[source]#: Increments the last epoch by 1.

step(*args, **kwargs)[source]#

Steps the scheduler and computes the latest value.

Only sets the last_epoch if running on CS

update_last_value()[source]#

update_groups(values)[source]#: Update the optimizer groups with the latest values.

get_last_value()[source]#: Return last computed value by current scheduler.
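How a closed-form value, a param group key, and a group update fit together can be sketched with a toy class. `ToyScheduler` is a hypothetical stand-in that works on a plain list of dicts; it borrows a few method names from the listing above for readability but makes no claim about the real implementation, and it ignores param_group_tags.

```python
class ToyScheduler:
    """Hypothetical stand-in operating on a list of param group dicts."""

    param_group_key = "lr"  # could equally be "weight_decay"

    def __init__(self, param_groups, total_iters, last_epoch=-1):
        self.param_groups = param_groups
        self.total_iters = total_iters
        self.last_epoch = last_epoch
        self._last_value = None

    def _get_closed_form(self):
        # Example closed form: linear ramp from 1.0 down to 0.0.
        progress = min(self.last_epoch, self.total_iters) / self.total_iters
        return 1.0 - progress

    def step(self):
        self.last_epoch += 1
        self._last_value = self._get_closed_form()
        # Write the new value into the targeted key of every group.
        for group in self.param_groups:
            group[self.param_group_key] = self._last_value

    def get_last_value(self):
        return self._last_value


groups = [{"lr": 1.0, "weight_decay": 0.1}]
sched = ToyScheduler(groups, total_iters=4)
for _ in range(3):
    sched.step()
```

Only the key named by `param_group_key` is touched; other entries in the group are left alone.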

Learning Rate Schedulers in `cerebras.pytorch`#

Available learning rate schedulers in the cerebras.pytorch package

ConstantLR

PolynomialLR

LinearLR

ExponentialLR

InverseExponentialTimeDecayLR

InverseSquareRootDecayLR

CosineDecayLR

SequentialLR

PiecewiseConstantLR

MultiStepLR

StepLR

CosineAnnealingLR

LambdaLR

CosineAnnealingWarmRestarts

MultiplicativeLR

ChainedScheduler

optim.lr_scheduler.LRScheduler#

class cerebras.pytorch.optim.lr_scheduler.LRScheduler(*args, **kwargs)[source]#

property param_group_key#

get_last_lr()[source]#

Return last computed learning rate by current scheduler.

get_lr()[source]#

optim.lr_scheduler.ConstantLR#

class cerebras.pytorch.optim.lr_scheduler.ConstantLR(*args, **kwargs)[source]#

Maintains a constant learning rate for each parameter group (no decaying).

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
val – The learning_rate value to maintain
total_iters (Optional[int]) – The number of steps to decay for

property val#

optim.lr_scheduler.PolynomialLR#

class cerebras.pytorch.optim.lr_scheduler.PolynomialLR(*args, **kwargs)[source]#

Decays the learning rate of each parameter group using a polynomial function in the given total_iters.

This class is similar to the PyTorch PolynomialLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
end_learning_rate (float) – The final learning rate
total_iters (int) – Number of steps to perform the decay
power (float) – Exponent applied to “x”, the ratio of step completion (1 for linear). Default: 1.0 (only linear is supported at the moment).
cycle (bool) – Whether to cycle

property initial_val#

property end_val#
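A minimal sketch of a polynomial decay, assuming the commonly used form in which the remaining fraction of steps is raised to `power`. The exact formula is not stated here, so treat `polynomial_decay` as a hypothetical helper; it also ignores the cycle option.

```python
def polynomial_decay(step, initial_lr, end_lr, total_iters, power=1.0):
    """Assumed form: (initial - end) * (1 - step / total) ** power + end."""
    step = min(step, total_iters)  # hold the final value after total_iters
    remaining = 1.0 - step / total_iters
    return (initial_lr - end_lr) * remaining ** power + end_lr


halfway = polynomial_decay(50, initial_lr=0.1, end_lr=0.0, total_iters=100)
```

With `power=1.0` this reduces to a straight line between the two rates.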

optim.lr_scheduler.LinearLR#

class cerebras.pytorch.optim.lr_scheduler.LinearLR(*args, **kwargs)[source]#

Alias for the PolynomialLR scheduler with a power of 1.

property initial_val#

property end_val#

optim.lr_scheduler.ExponentialLR#

class cerebras.pytorch.optim.lr_scheduler.ExponentialLR(*args, **kwargs)[source]#

Decays the learning rate of each parameter group by decay_rate every step.

This class is similar to the PyTorch ExponentialLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
total_iters (int) – Number of steps to perform the decay
decay_rate (float) – The decay rate
staircase (bool) – If True decay the learning rate at discrete intervals

property initial_val#
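A sketch of exponential decay with the staircase option, assuming the Keras-style form `initial * decay_rate ** (step / total_iters)`. The formula is an assumption, since it is not spelled out here, and `exponential_decay` is a hypothetical helper.

```python
import math


def exponential_decay(step, initial_lr, total_iters, decay_rate, staircase=False):
    """Assumed form: initial_lr * decay_rate ** (step / total_iters)."""
    exponent = step / total_iters
    if staircase:
        # Decay in discrete jumps, once per total_iters steps.
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


smooth = exponential_decay(150, 1.0, 100, 0.5)
stepped = exponential_decay(150, 1.0, 100, 0.5, staircase=True)
```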

optim.lr_scheduler.InverseExponentialTimeDecayLR#

class cerebras.pytorch.optim.lr_scheduler.InverseExponentialTimeDecayLR(*args, **kwargs)[source]#

Decays the learning rate inverse-exponentially over time, as described in the Keras InverseTimeDecay class.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
step_exponent (int) – Exponential value.
total_iters (int) – Number of steps to perform the decay.
decay_rate (float) – The decay rate.
staircase (bool) – If True decay the learning rate at discrete intervals.

property initial_val#
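For orientation, the Keras InverseTimeDecay formula divides the initial rate by `1 + decay_rate * step / decay_steps`. The sketch below implements only that Keras form, mapping decay_steps onto total_iters as an assumption; it leaves out step_exponent, because how that parameter enters the formula is not described here.

```python
import math


def inverse_time_decay(step, initial_lr, total_iters, decay_rate, staircase=False):
    """Keras-style form: initial_lr / (1 + decay_rate * step / total_iters)."""
    ratio = step / total_iters
    if staircase:
        # Decay in discrete jumps, once per total_iters steps.
        ratio = math.floor(ratio)
    return initial_lr / (1.0 + decay_rate * ratio)


after_one_period = inverse_time_decay(100, 1.0, 100, 1.0)
```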

optim.lr_scheduler.InverseSquareRootDecayLR#

class cerebras.pytorch.optim.lr_scheduler.InverseSquareRootDecayLR(*args, **kwargs)[source]#

Decays the learning rate in proportion to the inverse square root of the step, as described in the following equation:

\[\begin{aligned} lr_t & = \frac{\text{scale}}{\sqrt{\max\{t, \text{warmup_steps}\}}}. \end{aligned}\]

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
scale (float) – Multiplicative factor to scale the result.
warmup_steps (int) – use initial_learning_rate for the first warmup_steps.

property initial_val#
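The equation above translates directly into a few lines of Python. `inverse_sqrt_decay` is a hypothetical helper that follows the equation term by term; initial_learning_rate does not appear in the equation, so it is omitted from the sketch.

```python
import math


def inverse_sqrt_decay(step, scale=1.0, warmup_steps=1):
    """lr_t = scale / sqrt(max(t, warmup_steps))."""
    return scale / math.sqrt(max(step, warmup_steps))


during_warmup = inverse_sqrt_decay(4, scale=1.0, warmup_steps=16)
after_warmup = inverse_sqrt_decay(100, scale=1.0, warmup_steps=16)
```

Because of the `max`, the value stays flat until the step count passes `warmup_steps` and only then starts to fall.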

optim.lr_scheduler.CosineDecayLR#

class cerebras.pytorch.optim.lr_scheduler.CosineDecayLR(*args, **kwargs)[source]#

Applies the cosine decay schedule as described in the Keras CosineDecay class.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
end_learning_rate (float) – The final learning rate
total_iters (int) – Number of steps to perform the decay

property initial_val#

property end_val#
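A sketch of a cosine decay between two rates, assuming the Keras CosineDecay shape with the end rate playing the role of the floor. `cosine_decay` is a hypothetical helper, and the mapping of end_learning_rate onto the Keras `alpha` term is an assumption.

```python
import math


def cosine_decay(step, initial_lr, end_lr, total_iters):
    """Half-cosine interpolation from initial_lr down to end_lr."""
    step = min(step, total_iters)  # hold end_lr after total_iters
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_iters))
    return end_lr + (initial_lr - end_lr) * cosine


midpoint = cosine_decay(50, initial_lr=0.1, end_lr=0.01, total_iters=100)
```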

optim.lr_scheduler.SequentialLR#

class cerebras.pytorch.optim.lr_scheduler.SequentialLR(*args, **kwargs)[source]#

Receives a list of schedulers that are called sequentially during optimization, together with milestone points that define which scheduler is active at a given step.

This class is a wrapper around the PyTorch SequentialLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – Wrapped optimizer
schedulers (list) – List of chained schedulers.
milestones (list) – List of integers that reflects milestone points.
last_epoch (int) – The index of last epoch. Default: -1.
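The milestone dispatch can be sketched with plain functions standing in for schedulers. `sequential_value`, `warmup`, and `constant` are hypothetical names; restarting each schedule's local step at its milestone is an assumption about the intended behaviour.

```python
from bisect import bisect_right


def sequential_value(step, schedules, milestones):
    """Pick the schedule active at `step` and evaluate it at its local step."""
    idx = bisect_right(milestones, step)  # number of milestones already reached
    start = milestones[idx - 1] if idx > 0 else 0
    return schedules[idx](step - start)


def warmup(t):
    # Linear warmup to 0.1 over 10 steps.
    return 0.1 * (t + 1) / 10


def constant(t):
    return 0.1


in_warmup = sequential_value(4, [warmup, constant], milestones=[10])
after_warmup = sequential_value(10, [warmup, constant], milestones=[10])
```

Note that `schedules` needs one more entry than `milestones`.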

optim.lr_scheduler.PiecewiseConstantLR#

class cerebras.pytorch.optim.lr_scheduler.PiecewiseConstantLR(*args, **kwargs)[source]#

Adjusts the learning rate to a predefined constant at each milestone and holds this value until the next milestone. Notice that such adjustment can happen simultaneously with other changes to the learning rate from outside this scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
learning_rates (List[float]) – List of learning rates to maintain before/during each milestone.
milestones (List[int]) – List of step indices. Must be increasing.
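A piecewise constant schedule is a lookup into learning_rates keyed by how many milestones have been reached. In this sketch `piecewise_constant` is a hypothetical helper; switching to the next rate exactly at the milestone step (rather than one step later) is an assumption.

```python
from bisect import bisect_right


def piecewise_constant(step, learning_rates, milestones):
    """Return the rate for the interval that contains `step`."""
    # len(learning_rates) must equal len(milestones) + 1.
    return learning_rates[bisect_right(milestones, step)]


rates = [0.1, 0.01, 0.001]
marks = [100, 200]
values = [piecewise_constant(s, rates, marks) for s in (0, 99, 100, 250)]
```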

optim.lr_scheduler.MultiStepLR#

class cerebras.pytorch.optim.lr_scheduler.MultiStepLR(*args, **kwargs)[source]#

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

This class is similar to the PyTorch MultiStepLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
gamma (float) – Multiplicative factor of learning rate decay.
milestones (List[int]) – List of step indices. Must be increasing.

property initial_val#
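When no other component modifies the learning rate, the decay described above has a simple closed form: multiply by gamma once per milestone reached. `multistep` is a hypothetical helper written under that assumption.

```python
from bisect import bisect_right


def multistep(step, initial_lr, gamma, milestones):
    """initial_lr * gamma ** (number of milestones <= step)."""
    return initial_lr * gamma ** bisect_right(milestones, step)


before = multistep(99, 1.0, 0.1, [100, 200])
between = multistep(150, 1.0, 0.1, [100, 200])
after = multistep(200, 1.0, 0.1, [100, 200])
```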

optim.lr_scheduler.StepLR#

class cerebras.pytorch.optim.lr_scheduler.StepLR(*args, **kwargs)[source]#

Decays the learning rate of each parameter group by gamma every step_size. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

This class is similar to the PyTorch StepLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
step_size (int) – Period of decay.
gamma (float) – Multiplicative factor of decay.

property initial_val#
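Decaying by gamma every step_size steps also has a closed form when nothing else touches the learning rate. `step_lr` below is a hypothetical helper under that assumption.

```python
def step_lr(step, initial_lr, step_size, gamma):
    """initial_lr * gamma ** (number of completed periods)."""
    return initial_lr * gamma ** (step // step_size)


value = step_lr(25, initial_lr=1.0, step_size=10, gamma=0.5)
```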

optim.lr_scheduler.CosineAnnealingLR#

class cerebras.pytorch.optim.lr_scheduler.CosineAnnealingLR(*args, **kwargs)[source]#

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of steps since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

This class is similar to the PyTorch CosineAnnealingLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
T_max (int) – Maximum number of iterations.
eta_min (float) – Minimum learning rate.

property initial_val#
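The closed-form expression above, for the case where the learning rate is set solely by this scheduler, can be written as a small function. `cosine_annealing` is a hypothetical helper; it treats the step count as \(T_{cur}\) and the initial learning rate as \(\eta_{max}\).

```python
import math


def cosine_annealing(step, initial_lr, T_max, eta_min=0.0):
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_max))."""
    cosine = 1.0 + math.cos(math.pi * step / T_max)
    return eta_min + 0.5 * (initial_lr - eta_min) * cosine


start = cosine_annealing(0, initial_lr=0.1, T_max=100, eta_min=0.001)
end = cosine_annealing(100, initial_lr=0.1, T_max=100, eta_min=0.001)
```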

optim.lr_scheduler.LambdaLR#

class cerebras.pytorch.optim.lr_scheduler.LambdaLR(*args, **kwargs)[source]#

Sets the learning rate of each parameter group to the initial lr times a given function (which is specified by overriding set_value_lambda).

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.

property initial_val#

optim.lr_scheduler.CosineAnnealingWarmRestarts#

class cerebras.pytorch.optim.lr_scheduler.CosineAnnealingWarmRestarts(*args, **kwargs)[source]#

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of steps since the last restart and \(T_{i}\) is the number of steps between two warm restarts in SGDR:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)\]

When \(T_{cur}=T_{i}\), set \(\eta_t = \eta_{min}\). When \(T_{cur}=0\) after restart, set \(\eta_t=\eta_{max}\).

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts.

This class is similar to the PyTorch CosineAnnealingWarmRestarts scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
T_0 (int) – Number of iterations for the first restart.
T_mult (int) – The factor by which \(T_{i}\) increases after a restart. Currently T_mult must be set to 1.
eta_min (float) – Minimum learning rate.

property initial_val#
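With T_mult fixed at 1, every cycle has length T_0, so \(T_{cur}\) is the step count modulo T_0. The sketch below relies on that; `cosine_warm_restarts` is a hypothetical helper, and the modulo treatment of restarts is an assumption.

```python
import math


def cosine_warm_restarts(step, initial_lr, T_0, eta_min=0.0):
    """Cosine annealing that restarts every T_0 steps (T_mult == 1)."""
    T_cur = step % T_0  # steps since the last restart
    cosine = 1.0 + math.cos(math.pi * T_cur / T_0)
    return eta_min + 0.5 * (initial_lr - eta_min) * cosine


mid_cycle = cosine_warm_restarts(5, initial_lr=1.0, T_0=10)
at_restart = cosine_warm_restarts(10, initial_lr=1.0, T_0=10)
```

At a restart the value jumps back to the initial learning rate.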

optim.lr_scheduler.MultiplicativeLR#

class cerebras.pytorch.optim.lr_scheduler.MultiplicativeLR(*args, **kwargs)[source]#

Multiply the learning rate of each parameter group by the supplied coefficient.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – The initial learning rate.
coefficient (float) – Multiplicative factor of learning rate.

property initial_val#

optim.lr_scheduler.ChainedScheduler#

class cerebras.pytorch.optim.lr_scheduler.ChainedScheduler(*args, **kwargs)[source]#

optim.lr_scheduler.CyclicLR#

class cerebras.pytorch.optim.lr_scheduler.CyclicLR(*args, **kwargs)[source]#

Sets the learning rate of each parameter group according to cyclical learning rate policy (CLR). The policy cycles the learning rate between two boundaries with a constant frequency, as detailed in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

Cyclical learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This class has three built-in policies, as put forth in the paper:

“triangular”: A basic triangular cycle without amplitude scaling.
“triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.
“exp_range”: A cycle that scales initial amplitude by \(\text{gamma}^{\text{cycle iterations}}\) at each cycle iteration.

This class is similar to the PyTorch CyclicLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule.
base_lr (float) – Initial learning rate which is the lower boundary in the cycle.
max_lr (float) – Upper learning rate boundaries in the cycle.
step_size_up (int) – Number of training iterations in the increasing half of a cycle.
step_size_down (Optional[int]) – Number of training iterations in the decreasing half of a cycle.
mode (str) – One of {‘triangular’, ‘triangular2’, ‘exp_range’}.
gamma (float) – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations).
scale_mode (str) – One of {‘cycle’, ‘iterations’}. Defines whether scale_fn is evaluated on cycle number or cycle iterations.

property base_val#

property max_val#
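A sketch of the “triangular” policy only, following the way PyTorch's CyclicLR computes the position inside a cycle. `cyclic_triangular` is a hypothetical helper; that this class uses the same arithmetic is an assumption, and the “triangular2” and “exp_range” scalings are left out.

```python
import math


def cyclic_triangular(step, base_lr, max_lr, step_size_up, step_size_down=None):
    """Triangular cycle between base_lr and max_lr, no amplitude scaling."""
    if step_size_down is None:
        step_size_down = step_size_up
    total_size = step_size_up + step_size_down
    step_ratio = step_size_up / total_size
    cycle = math.floor(1 + step / total_size)
    x = 1.0 + step / total_size - cycle  # position within the cycle, in [0, 1)
    if x <= step_ratio:
        scale = x / step_ratio  # rising half
    else:
        scale = (x - 1.0) / (step_ratio - 1.0)  # falling half
    return base_lr + (max_lr - base_lr) * scale


cycle_values = [cyclic_triangular(s, 0.1, 1.0, step_size_up=4) for s in (0, 4, 6, 8)]
```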

optim.lr_scheduler.OneCycleLR#

class cerebras.pytorch.optim.lr_scheduler.OneCycleLR(*args, **kwargs)[source]#

Sets the learning rate of each parameter group according to the 1cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

This scheduler is not chainable.

This class is similar to the PyTorch OneCycleLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_learning_rate (float) – Initial learning rate. Compared with PyTorch, this is equivalent to max_lr / div_factor.
max_lr (float) – Upper learning rate boundaries in the cycle.
total_steps (int) – The total number of steps in the cycle.
pct_start (float) – The percentage of the cycle (in number of steps) spent increasing the learning rate.
final_div_factor (float) – Determines the minimum learning rate via min_lr = initial_lr/final_div_factor.
three_phase (bool) – If True, use a third phase of the schedule to annihilate the learning rate
anneal_strategy (str) – Specifies the annealing strategy: “cos” for cosine annealing, “linear” for linear annealing.

property initial_val#

property max_val#
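A sketch of the two-phase 1cycle shape with cosine annealing. `one_cycle` is a hypothetical helper: the phase boundaries follow PyTorch's OneCycleLR convention (peak at `pct_start * total_steps - 1`), which is an assumption for this class, and the three_phase and linear options are omitted.

```python
import math


def one_cycle(step, initial_lr, max_lr, total_steps, pct_start=0.3,
              final_div_factor=1e4):
    """Anneal initial_lr -> max_lr -> initial_lr / final_div_factor."""
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1
    last_step = total_steps - 1
    if step <= peak_step:
        start, end = initial_lr, max_lr
        pct = step / peak_step
    else:
        start, end = max_lr, min_lr
        pct = (step - peak_step) / (last_step - peak_step)
    # Cosine interpolation from start (pct = 0) to end (pct = 1).
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


first = one_cycle(0, initial_lr=0.01, max_lr=0.1, total_steps=100)
peak = one_cycle(29, initial_lr=0.01, max_lr=0.1, total_steps=100)
last = one_cycle(99, initial_lr=0.01, max_lr=0.1, total_steps=100)
```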

Weight Decay Schedulers in `cerebras.pytorch`#

Available weight decay schedulers in the cerebras.pytorch package

ConstantWD

PolynomialWD

LinearWD

ExponentialWD

InverseExponentialTimeDecayWD

InverseSquareRootDecayWD

CosineDecayWD

SequentialWD

PiecewiseConstantWD

MultiStepWD

StepWD

CosineAnnealingWD

LambdaWD

CosineAnnealingWarmRestartsWD

MultiplicativeWD

ChainedWD

optim.weight_decay_scheduler.WeightDecayScheduler#

class cerebras.pytorch.optim.weight_decay_scheduler.WeightDecayScheduler(optimizer, total_iters, last_epoch=-1, param_group_tags=None)[source]#

property param_group_key#

optim.weight_decay_scheduler.ConstantWD#

class cerebras.pytorch.optim.weight_decay_scheduler.ConstantWD(optimizer, val, total_iters=None, param_group_tags=None)[source]#

Maintains a constant weight decay for each parameter group (no decaying).

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
val (float) – The weight decay value to maintain
total_iters (Optional[int]) – The number of steps to decay for

optim.weight_decay_scheduler.PolynomialWD#

class cerebras.pytorch.optim.weight_decay_scheduler.PolynomialWD(optimizer, initial_val, end_val, total_iters, power=1.0, cycle=False, param_group_tags=None)[source]#

Decays the weight decay of each parameter group using a polynomial function in the given total_iters.

This class is similar to the PyTorch PolynomialLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay
end_val (float) – The final weight decay
total_iters (int) – Number of steps to perform the decay
power (float) – Exponent applied to “x”, the ratio of step completion (1 for linear). Default: 1.0 (only linear is supported at the moment).
cycle (bool) – Whether to cycle

optim.weight_decay_scheduler.LinearWD#

class cerebras.pytorch.optim.weight_decay_scheduler.LinearWD(optimizer, initial_val, end_val, total_iters, cycle=False, param_group_tags=None)[source]#

Alias for the PolynomialWD scheduler with a power of 1.

optim.weight_decay_scheduler.ExponentialWD#

class cerebras.pytorch.optim.weight_decay_scheduler.ExponentialWD(optimizer, initial_val, total_iters, decay_rate, staircase=False, param_group_tags=None)[source]#

Decays the weight decay of each parameter group by decay_rate every step.

This class is similar to the PyTorch ExponentialLR scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
total_iters (int) – Number of steps to perform the decay
decay_rate (float) – The decay rate
staircase (bool) – If True decay the weight decay at discrete intervals

optim.weight_decay_scheduler.InverseExponentialTimeDecayWD#

class cerebras.pytorch.optim.weight_decay_scheduler.InverseExponentialTimeDecayWD(optimizer, initial_val, step_exponent, total_iters, decay_rate, staircase=False, param_group_tags=None)[source]#

Decays the weight decay inverse-exponentially over time, as described in the Keras InverseTimeDecay class.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
step_exponent (int) – Exponential weight decay.
total_iters (int) – Number of steps to perform the decay.
decay_rate (float) – The decay rate.
staircase (bool) – If True decay the weight decay at discrete intervals.
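
For orientation, the sketch below implements the inverse time decay formula from the Keras InverseTimeDecay documentation, with total_iters playing the role of the Keras decay_steps argument. The helper name `inverse_time_decay_wd` is hypothetical, and step_exponent is deliberately left out because its exact role is not described here.

```python
import math


def inverse_time_decay_wd(step, initial_val, total_iters, decay_rate, staircase=False):
    """Hypothetical helper: Keras-style inverse time decay (no step_exponent)."""
    ratio = step / total_iters
    if staircase:
        # Decay only at discrete multiples of total_iters.
        ratio = math.floor(ratio)
    return initial_val / (1.0 + decay_rate * ratio)


for step in (0, 100, 200, 400):
    print(step, inverse_time_decay_wd(step, 0.1, total_iters=100, decay_rate=1.0))
```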

optim.weight_decay_scheduler.InverseSquareRootDecayWD#

class cerebras.pytorch.optim.weight_decay_scheduler.InverseSquareRootDecayWD(optimizer, initial_val=1.0, scale=1.0, warmup_steps=1.0, param_group_tags=None)[source]#

Decays the weight decay in proportion to the inverse square root of the step, as described in the following equation:

\[\begin{aligned} wd_t & = \frac{\text{scale}}{\sqrt{\max\{t, \text{warmup\_steps}\}}}. \end{aligned}\]

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
scale (float) – Multiplicative factor to scale the result.
warmup_steps (int) – Number of initial steps during which the value is held constant before the inverse square root decay begins.
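
The equation above can be evaluated directly. The following sketch is a plain Python transcription of that formula only; the helper name `inverse_sqrt_wd` is hypothetical, and how initial_val enters the actual scheduler is not modeled.

```python
import math


def inverse_sqrt_wd(step, scale=1.0, warmup_steps=1.0):
    """Hypothetical helper: scale / sqrt(max(step, warmup_steps))."""
    return scale / math.sqrt(max(step, warmup_steps))


# The value is flat for the first warmup_steps steps, then decays.
for step in (0, 50, 100, 400, 1600):
    print(step, inverse_sqrt_wd(step, scale=1.0, warmup_steps=100))
```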

optim.weight_decay_scheduler.CosineDecayWD#

class cerebras.pytorch.optim.weight_decay_scheduler.CosineDecayWD(optimizer, initial_val, end_val, total_iters, param_group_tags=None)[source]#

Applies the cosine decay schedule as described in the Keras CosineDecay class.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay
end_val (float) – The final weight decay
total_iters (int) – Number of steps to perform the decay
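
A minimal sketch of a cosine decay between initial_val and end_val follows, assuming the half-cosine shape used by the Keras CosineDecay schedule. The helper name `cosine_decay_wd` is hypothetical and this is not the Cerebras implementation.

```python
import math


def cosine_decay_wd(step, initial_val, end_val, total_iters):
    """Hypothetical helper: half-cosine decay from initial_val to end_val."""
    # Clamp so the value stays at end_val after total_iters steps.
    progress = min(step, total_iters) / total_iters
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_val + (initial_val - end_val) * cosine


for step in (0, 250, 500, 750, 1000):
    print(step, cosine_decay_wd(step, initial_val=0.1, end_val=0.01, total_iters=1000))
```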

optim.weight_decay_scheduler.SequentialWD#

class cerebras.pytorch.optim.weight_decay_scheduler.SequentialWD(optimizer, schedulers, milestones, last_epoch=- 1, param_group_tags=None)[source]#

This class is similar to the PyTorch SequentialLR learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – Wrapped optimizer
schedulers (list) – List of chained schedulers.
milestones (list) – List of integers that reflects milestone points.
last_epoch (int) – The index of last epoch. Default: -1.
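
The relationship between schedulers and milestones can be sketched with plain functions. The example assumes, as in PyTorch's SequentialLR, that len(schedulers) == len(milestones) + 1 and that each schedule restarts from its own step 0 at its milestone; `sequential_value` and the two lambda schedules are hypothetical stand-ins, not the Cerebras API.

```python
from bisect import bisect_right


def sequential_value(step, schedules, milestones):
    """Hypothetical helper: pick the active schedule and its local step."""
    idx = bisect_right(milestones, step)
    # Each schedule counts steps from the milestone at which it became active.
    start = milestones[idx - 1] if idx > 0 else 0
    return schedules[idx](step - start)


schedules = [
    lambda t: 0.1,             # constant phase
    lambda t: 0.1 * 0.5 ** t,  # halve on every step of the second phase
]
for step in (0, 9, 10, 11, 12):
    print(step, sequential_value(step, schedules, milestones=[10]))
```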

optim.weight_decay_scheduler.PiecewiseConstantWD#

class cerebras.pytorch.optim.weight_decay_scheduler.PiecewiseConstantWD(optimizer, vals, milestones, param_group_tags=None)[source]#

Adjusts the weight decay to a predefined constant at each milestone and holds this value until the next milestone. Notice that such adjustment can happen simultaneously with other changes to the weight decays from outside this scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
vals (List[float]) – List of weight decays to maintain before/during each milestone.
milestones (List[int]) – List of step indices. Must be increasing.
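
A piecewise constant lookup can be sketched as a binary search over milestones. The sketch assumes that vals has one more entry than milestones and that the new value takes effect at the milestone step itself; `piecewise_constant_wd` is a hypothetical helper, not the Cerebras implementation.

```python
from bisect import bisect_right


def piecewise_constant_wd(step, vals, milestones):
    """Hypothetical helper: return the constant that applies at this step."""
    if len(vals) != len(milestones) + 1:
        raise ValueError("expected len(vals) == len(milestones) + 1")
    return vals[bisect_right(milestones, step)]


for step in (0, 99, 100, 199, 200, 500):
    print(step, piecewise_constant_wd(step, [0.1, 0.05, 0.01], [100, 200]))
```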

optim.weight_decay_scheduler.MultiStepWD#

class cerebras.pytorch.optim.weight_decay_scheduler.MultiStepWD(optimizer, initial_val, gamma, milestones, param_group_tags=None)[source]#

Decays the weight decay of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the weight decay from outside this scheduler.

This class is similar to the PyTorch MultiStepLR learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
gamma (float) – Multiplicative factor by which the weight decay is decayed.
milestones (List[int]) – List of step indices. Must be increasing.
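
In closed form, and assuming nothing else modifies the weight decay, the value after a given step is initial_val multiplied by gamma once for every milestone already reached. The helper `multistep_wd` below is a hypothetical sketch of that arithmetic, not the Cerebras implementation.

```python
from bisect import bisect_right


def multistep_wd(step, initial_val, gamma, milestones):
    """Hypothetical helper: apply gamma once per milestone already reached."""
    reached = bisect_right(milestones, step)
    return initial_val * gamma ** reached


for step in (0, 99, 100, 200, 300):
    print(step, multistep_wd(step, initial_val=0.1, gamma=0.5, milestones=[100, 200]))
```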

optim.weight_decay_scheduler.StepWD#

class cerebras.pytorch.optim.weight_decay_scheduler.StepWD(optimizer, initial_val, step_size, gamma, param_group_tags=None)[source]#

Decays the weight decay of each parameter group by gamma every step_size. Notice that such decay can happen simultaneously with other changes to the weight decay from outside this scheduler.

This class is similar to the PyTorch StepLR learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
step_size (int) – Period of decay.
gamma (float) – Multiplicative factor of decay.
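
Assuming the weight decay is set solely by this scheduler, the schedule has a simple closed form: gamma is applied once per completed block of step_size steps. The helper `step_wd` is a hypothetical sketch, not the Cerebras implementation.

```python
def step_wd(step, initial_val, step_size, gamma):
    """Hypothetical helper: decay by gamma every step_size steps."""
    return initial_val * gamma ** (step // step_size)


for step in (0, 29, 30, 60, 90):
    print(step, step_wd(step, initial_val=0.1, step_size=30, gamma=0.5))
```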

optim.weight_decay_scheduler.CosineAnnealingWD#

class cerebras.pytorch.optim.weight_decay_scheduler.CosineAnnealingWD(optimizer, initial_val, T_max, eta_min=0.0, param_group_tags=None)[source]#

Sets the weight decay of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial weight decay and \(T_{cur}\) is the number of steps since the last restart in SGDR.

Notice that because the schedule is defined recursively, the weight decay can be simultaneously modified outside this scheduler by other operators. If the weight decay is set solely by this scheduler, the weight decay at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

This schedule was proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this class implements only the cosine annealing part of SGDR, not the restarts.

This class is similar to the PyTorch CosineAnnealingLR learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
T_max (int) – Maximum number of iterations.
eta_min (float) – Minimum weight decay.
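
The closed-form expression above translates directly into code. The sketch below takes \(\eta_{max}\) to be initial_val and \(T_{cur}\) to be the current step, and is only valid for steps up to T_max; `cosine_annealing_wd` is a hypothetical helper, not the Cerebras implementation.

```python
import math


def cosine_annealing_wd(step, initial_val, T_max, eta_min=0.0):
    """Hypothetical helper: closed-form cosine annealing without restarts."""
    cosine = 1.0 + math.cos(math.pi * step / T_max)
    return eta_min + 0.5 * (initial_val - eta_min) * cosine


for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealing_wd(step, initial_val=0.1, T_max=100))
```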

optim.weight_decay_scheduler.LambdaWD#

class cerebras.pytorch.optim.weight_decay_scheduler.LambdaWD(optimizer, initial_val, param_group_tags=None)[source]#

Sets the weight decay of each parameter group to the initial weight decay times a given function (which is specified by overriding set_value_lambda).

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.

optim.weight_decay_scheduler.CosineAnnealingWarmRestartsWD#

class cerebras.pytorch.optim.weight_decay_scheduler.CosineAnnealingWarmRestartsWD(optimizer, initial_val, T_0, T_mult=1, eta_min=0.0, param_group_tags=None)[source]#

Sets the weight decay of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial weight decay, \(T_{cur}\) is the number of steps since the last restart and \(T_{i}\) is the number of steps between two warm restarts in SGDR:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)\]

When \(T_{cur}=T_{i}\), set \(\eta_t = \eta_{min}\). When \(T_{cur}=0\) after restart, set \(\eta_t=\eta_{max}\).

This schedule was proposed in SGDR: Stochastic Gradient Descent with Warm Restarts.

This class is similar to the PyTorch CosineAnnealingWarmRestarts learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
T_0 (int) – Number of iterations for the first restart.
T_mult (int) – Factor by which \(T_{i}\) increases after a restart. Currently T_mult must be set to 1.
eta_min (float) – Minimum weight decay.
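
With T_mult fixed at 1, every cycle has length T_0, so \(T_{cur}\) can be taken as the step modulo T_0. The sketch below makes that assumption explicit; `cosine_warm_restarts_wd` is a hypothetical helper, not the Cerebras implementation.

```python
import math


def cosine_warm_restarts_wd(step, initial_val, T_0, eta_min=0.0):
    """Hypothetical helper: cosine annealing with restarts every T_0 steps."""
    # Assumes T_mult == 1, so all cycles have the same length T_0.
    t_cur = step % T_0
    cosine = 1.0 + math.cos(math.pi * t_cur / T_0)
    return eta_min + 0.5 * (initial_val - eta_min) * cosine


# The value restarts at initial_val at steps 0, 10, 20, ...
for step in (0, 5, 9, 10, 15, 20):
    print(step, cosine_warm_restarts_wd(step, initial_val=0.1, T_0=10))
```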

optim.weight_decay_scheduler.MultiplicativeWD#

class cerebras.pytorch.optim.weight_decay_scheduler.MultiplicativeWD(optimizer, initial_val, coefficient, param_group_tags=None)[source]#

Multiplies the weight decay of each parameter group by the supplied coefficient.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – The initial weight decay.
coefficient (float) – Multiplicative factor of weight decay.

optim.weight_decay_scheduler.ChainedWD#

class cerebras.pytorch.optim.weight_decay_scheduler.ChainedWD(schedulers, param_group_tags=None)[source]#

Chains a list of weight decay schedulers. It takes a list of chainable weight decay schedulers and calls their step() functions consecutively in a single call.

optim.weight_decay_scheduler.CyclicWD#

class cerebras.pytorch.optim.weight_decay_scheduler.CyclicWD(optimizer, base_val, max_val, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_mode='cycle', param_group_tags=None)[source]#

Sets the weight decay of each parameter group according to a cyclical policy adapted from the cyclical learning rate (CLR) policy. The policy cycles the weight decay between two boundaries with a constant frequency, as detailed for learning rates in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

Cyclical weight decay policy changes the weight decay after every batch. step should be called after a batch has been used for training.

This class has three built-in policies, as put forth in the paper:

“triangular”: A basic triangular cycle without amplitude scaling.
“triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.
“exp_range”: A cycle that scales initial amplitude by \(\text{gamma}^{\text{cycle iterations}}\) at each cycle iteration.

This class is similar to the PyTorch CyclicLR learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule.
base_val (float) – Initial weight decay which is the lower boundary in the cycle.
max_val (float) – Upper weight decay boundary in the cycle.
step_size_up (int) – Number of training iterations in the increasing half of a cycle.
step_size_down (int) – Number of training iterations in the decreasing half of a cycle.
mode (str) – One of {‘triangular’, ‘triangular2’, ‘exp_range’}.
gamma (float) – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations).
scale_mode (str) – {‘cycle’, ‘iterations’} Defines whether scale_fn is evaluated on cycle number or cycle iterations.
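
The three built-in policies can be sketched in plain Python. The arithmetic below follows the approach used by PyTorch's CyclicLR and is assumed, not confirmed, to carry over to weight decay; `cyclic_wd` is a hypothetical helper, and scale_mode is not modeled separately.

```python
import math


def cyclic_wd(step, base_val, max_val, step_size_up=2000,
              step_size_down=None, mode="triangular", gamma=1.0):
    """Hypothetical helper: cyclical schedule between base_val and max_val."""
    if step_size_down is None:
        step_size_down = step_size_up
    total_size = step_size_up + step_size_down
    step_ratio = step_size_up / total_size
    cycle = math.floor(1 + step / total_size)  # 1-based cycle index
    x = 1.0 + step / total_size - cycle        # position within the cycle, in [0, 1)
    if x <= step_ratio:
        scale_factor = x / step_ratio                # rising half
    else:
        scale_factor = (x - 1) / (step_ratio - 1)    # falling half
    if mode == "triangular":
        amplitude = 1.0
    elif mode == "triangular2":
        amplitude = 1.0 / 2.0 ** (cycle - 1)         # halve amplitude each cycle
    elif mode == "exp_range":
        amplitude = gamma ** step                    # shrink amplitude each iteration
    else:
        raise ValueError(f"unknown mode: {mode}")
    return base_val + (max_val - base_val) * scale_factor * amplitude


for step in (0, 1000, 2000, 3000, 4000, 6000):
    print(step, cyclic_wd(step, 0.0, 0.1), cyclic_wd(step, 0.0, 0.1, mode="triangular2"))
```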

optim.weight_decay_scheduler.OneCycleWD#

class cerebras.pytorch.optim.weight_decay_scheduler.OneCycleWD(optimizer, initial_val, max_val, total_steps=1000, pct_start=0.3, final_div_factor=10000.0, three_phase=False, anneal_strategy='cos', param_group_tags=None)[source]#

Sets the weight decay of each parameter group according to the 1cycle policy. The 1cycle policy anneals the weight decay from an initial value to some maximum value, and then from that maximum to some minimum value much lower than the initial one. This policy was initially described, for learning rates, in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

This scheduler is not chainable.

This class is similar to the PyTorch OneCycleLR learning rate scheduler.

Parameters

optimizer (torch.optim.Optimizer) – The optimizer to schedule
initial_val (float) – Initial weight decay. Compared with PyTorch, this is equivalent to max_val / div_factor.
max_val (float) – Upper weight decay boundary in the cycle.
total_steps (int) – The total number of steps in the cycle.
pct_start (float) – The percentage of the cycle (in number of steps) spent increasing the weight decay.
final_div_factor (float) – Determines the minimum weight decay via min_val = initial_val/final_div_factor.
three_phase (bool) – If True, use a third phase of the schedule to annihilate the weight decay
anneal_strategy (str) – Specifies the annealing strategy: “cos” for cosine annealing, “linear” for linear annealing.
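
To make the role of pct_start and final_div_factor concrete, here is a sketch of the two-phase cosine variant (three_phase=False, anneal_strategy="cos"). The phase boundaries follow PyTorch's OneCycleLR and are an assumption here; `one_cycle_wd` and `_cos_anneal` are hypothetical helpers, not the Cerebras implementation.

```python
import math


def _cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))


def one_cycle_wd(step, initial_val, max_val, total_steps=1000,
                 pct_start=0.3, final_div_factor=1e4):
    """Hypothetical helper: two-phase 1cycle schedule with cosine annealing."""
    min_val = initial_val / final_div_factor
    peak_step = pct_start * total_steps - 1   # last step of the rising phase
    last_step = total_steps - 1
    if step <= peak_step:
        return _cos_anneal(initial_val, max_val, step / peak_step)
    pct = (step - peak_step) / (last_step - peak_step)
    return _cos_anneal(max_val, min_val, pct)


for step in (0, 150, 299, 650, 999):
    print(step, one_cycle_wd(step, initial_val=0.01, max_val=0.1))
```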
