Cerebras Model Zoo YAML parameters#

Model parameters#

Common#

| Parameter Name | Data type | Description |
|---|---|---|
| mixed_precision | (bool, optional) Default: None | Whether to use mixed precision training or not |
| use_bfloat16 | (bool, optional) Default: False | Whether to use the bfloat16 data type instead of float32. See more |
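
For illustration, a minimal sketch of how these common fields might appear in the model section of a Model Zoo YAML config (the values are placeholders, not a tuned configuration):

```yaml
model:
  mixed_precision: true   # cast eligible ops to lower precision during training
  use_bfloat16: true      # use bfloat16 instead of float32 for those casts
```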

Transformer based models#

| Parameter Name | Data type | Description | Supported Models |
|---|---|---|---|
| attention_dropout_rate | (float, optional) Default: same as dropout | Dropout rate for the attention layer | All |
| attention_kernel | (str/None, optional) Default: None | Attention kernel to use. Accepted values: None (compiler selects the kernel), "default" (default implementation), "optimized_beta" (optimized implementation; beta feature with limited support) | All |
| attention_softmax_fp32 | (bool, optional) Default: True | Whether to use fp32 precision for attention softmax | All |
| attention_type | (str, optional) Default: "scaled_dot_product" | Type of attention. Accepted values: "dot_product", "scaled_dot_product" | All |
| d_ff | (int, optional) Default: 2048 | Size of the intermediate feed-forward layer in each T5Block | T5, Transformer |
| d_kv | (int, optional) Default: 64 | Size of the query/key/value projections per attention head. d_kv does not have to be equal to d_model // num_heads | T5, Transformer |
| d_model | (int, optional) Default: 512 | The number of expected features in the encoder/decoder inputs | All |
| decoder_nonlinearity | (str, optional) Default: "relu" | Type of nonlinearity to be used in the decoder | T5, Transformer |
| decoder_num_hidden_layers | (int, optional) | Number of hidden layers in the Transformer decoder. Uses the same value as num_layers if not set | T5, Transformer |
| disable_nsp | (bool, optional) Default: False | Whether to disable the next sentence prediction task | BERT (pre-training, fine-tuning) |
| dropout_rate | (float, optional) Default: 0.1 | The dropout probability for all fully connected layers | All |
| embedding_dropout_rate | (float, optional) Default: 0.1 | Dropout rate for embeddings | All |
| embedding_initializer | (str, optional) Default: "normal" | Initializer to use for embeddings. See supported initializers | GPT2, GPT3, GPTJ |
| encoder_nonlinearity | (str, optional) Default: varies per model | Type of nonlinearity to be used in the encoder | BERT (pre-training, fine-tuning), T5, Transformer |
| encoder_num_hidden_layers | (int, optional) Default: 6 | Number of hidden layers in the encoder | T5, Transformer |
| extra_ids | (int, optional) Default: 0 | The number of extra ids used for additional vocabulary items | T5, Transformer |
| filter_size | (int, optional) Default: 3072 | Dimensionality of the feed-forward layer in the Transformer block | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| hidden_size | (int, optional) Default: 768 | The size of the transformer hidden layers | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| initializer | (str, optional) Default: varies per model | The initializer to be used for all the initializers used in the model. See supported initializers | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| initializer_range | (float, optional) Default: 0.02 | The standard deviation of the truncated_normal_initializer used as the default initializer | BERT (pre-training), GPT2, GPT3, GPTJ |
| layer_norm_epsilon | (float, optional) Default: 1e-5 | The epsilon value used in layer normalization layers | All |
| lm_loss_weight | (float, optional) Default: 1.0 | Value that scales the loss by the mean number of predictions per sequence in the dataset. This number varies per dataset and can be calculated as the reciprocal of the average number of tokens per sequence in the training dataset. Only needed when setting loss scaling to "batch_size" | T5, Transformer |
| loss_scaling | (str, optional) Default: "num_tokens" | The scaling type used to calculate the loss. Accepts: "batch_size", "num_tokens". See more | GPT2, GPT3, GPTJ |
| loss_weight | (float, optional) Default: 1.0 | The weight for the loss scaling when loss_scaling: "batch_size"; generally set to 1/max_sequence_length | GPT2, GPT3, GPTJ |
| max_position_embeddings | (int, optional) Default: 1024 | The maximum sequence length that the model can handle | All |
| mlm_loss_scaling | (str, optional) Default: "batch_size" | A string specifying the scaling factor type used for the masked language modeling loss. Accepts one of: "num_masked" (uses the off-the-shelf loss scaling by the number of valid, non-padding tokens in the cross-entropy loss function), "precomputed_num_masked" (uses loss scaling from the number of valid masks computed in the data loader, when dynamic_loss_weight is enabled in the data loader params), "batch_size" (uses loss scaling by batch_size; lm_loss_weight should be provided when using "batch_size") | T5, Transformer |
| mlm_loss_weight | (float, optional) Default: 1.0 | The weight for the masked language modeling loss used when scaling the loss with "batch_size". This number varies per dataset and can be calculated as the reciprocal of the average number of masked tokens per sequence in the training dataset | BERT (pre-training) |
| nonlinearity | (str, optional) Default: varies per model | The non-linear activation function used in the feed-forward network in each transformer block. See the list of non-linearity functions here. Some may require autogen_policy: "medium" | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| num_heads | (int, optional) Default: varies per model | The number of attention heads in the multi-head attention layer | All |
| num_hidden_layers | (int, optional) Default: 12 | Number of hidden layers in the Transformer encoder/decoder | All |
| output_layer_initializer | (str, optional) Default: varies per model | The name of the initializer for the weights of the output layer. See supported initializers | GPT2, GPT3, GPTJ |
| position_embedding_type | (str, optional) Default: varies per model | The type of position embedding to use in the model. Can be one of: "fixed" (sinusoidal, from the original Transformer), "relative" (relative position embedding, to exploit pairwise relative positional information), "rotary" (a.k.a. RoPE), "learned" (learned embedding matrix), None | All |
| relu_dropout_rate | (float, optional) Default: varies per model | The dropout rate for the ReLU activation function | T5, Transformer |
| residual_dropout_rate | (float, optional) Default: 0.1 | The dropout rate for residual connections | GPTJ |
| rotary_dim | (int, optional) Default: None | The number of dimensions used for the rotary position encoding. Must be an even number | GPTJ |
| share_embedding_weights | (bool, optional) Default: True | Whether to share the embedding weights between the input and output embeddings | All |
| share_encoder_decoder_embedding | (bool, optional) Default: True | Whether to share the embedding weights between the encoder and decoder | T5, Transformer |
| src_vocab_size | (int, optional) Default: 32128 | The size of the source vocabulary. Max supported value: 512000 | T5, Transformer |
| tgt_vocab_size | (int, optional) Default: 32128 | The size of the target vocabulary. Max supported value: 512000 | T5, Transformer |
| use_bias_in_output | (bool, optional) Default: False | Whether to use bias in the final output layer | GPT2, GPT3, GPTJ |
| use_dropout_outside_residual_path | (bool, optional) Default: True for T5, False for Transformer | Whether to set dropout calculations outside of the residual path | T5, Transformer |
| use_ffn_bias | (bool, optional) Default: varies per model | Whether to use bias in the feed-forward network (FFN) | All |
| use_ffn_bias_in_attention | (bool, optional) Default: varies per model | Whether to include bias in the attention layer for the feed-forward network (FFN) | All |
| use_position_embedding | (bool, optional) Default: True | Whether to use position embedding in the model | GPT2, GPT3 |
| use_pre_encoder_decoder_dropout | (bool, optional) Default: False | Whether to use a dropout layer after the positional embedding layer and encoder/decoder | T5, Transformer |
| use_pre_encoder_decoder_layer_norm | (bool, optional) Default: True | Whether to use layer norm before passing input tensors into the encoder/decoder | T5, Transformer |
| use_projection_bias_in_attention | (bool, optional) Default: varies per model | Whether to include bias in the attention layer for projection | All |
| use_t5_layer_norm | (bool, optional) Default: False | Whether to use T5 layer norm (with no mean subtraction and bias correction) or the regular nn.LayerNorm module | T5, Transformer |
| use_transformer_initialization | (bool, optional) Default: False | The Transformer model tends to converge best with a scaled variant of Xavier uniform initialization for linear layers. This contrasts with the initialization used in the original T5 paper, which uses normal initialization for linear layers. Setting this flag to True switches the initialization to the Transformer-specific scaled Xavier initialization | T5, Transformer |
| use_untied_layer_norm | (bool, optional) Default: False | Whether to use untied layer normalization | GPTJ |
| vocab_size | (int, optional) Default: varies per model | The size of the vocabulary used in the model. Max supported value: 512000 | All |
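
A hedged sketch of how a few of these fields might be combined in the model section of a GPT-style config; the specific values below are illustrative only and not taken from a released configuration:

```yaml
model:
  hidden_size: 768                    # transformer hidden size
  num_hidden_layers: 12
  num_heads: 12
  filter_size: 3072                   # feed-forward dimensionality
  nonlinearity: "gelu"
  dropout_rate: 0.1
  attention_dropout_rate: 0.1
  max_position_embeddings: 1024
  position_embedding_type: "learned"
  vocab_size: 50257
  share_embedding_weights: true
  loss_scaling: "num_tokens"
  mixed_precision: true
```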

Computer Vision models#

| Parameter | Data type | Description | Supported Models |
|---|---|---|---|
| bias_initializer | (str, optional) Default: "zeros" | Initializer for the bias | UNet |
| convs_per_block | (List[str], required) | List of conv specifications for each conv in the block | UNet |
| decoder_filters | (List[str], required) | List of filter sizes for each block in the decoder | UNet |
| downscale_bottleneck | (bool, optional) Default: False | Whether to downsample the spatial dimensions in the UNet bottleneck block | UNet |
| downscale_encoder_blocks | (bool/List[bool], optional) Default: True | Determines whether each block in the encoder includes downsampling. The length of the list must correspond to the number of UNetBlocks in the encoder. If a single bool is provided, all blocks use this value | UNet |
| downscale_first_conv | (bool, optional) Default: False | If True, the first convolution operation in each UNetBlock is downscaled. If False, the last convolution in each UNetBlock is downscaled | UNet |
| downscale_method | (str, optional) Default: "max_pool" | Downscaling method at the end of each block. One of "max_pool" or "strided_conv" | UNet |
| enable_bias | (bool, optional) Default: bias is only included when no normalization is used after the convolution layers | Whether to include a bias operation following convolution layers | UNet |
| encoder_filters | (List[str], required) | List of filter sizes for each block in the encoder | UNet |
| eval_ignore_classes | (List[int], optional) | List of classes to ignore during model evaluation | UNet |
| eval_metrics | (List[str], optional) | List of evaluation metrics to use during training and validation. Available options are accuracy (Acc), mean IOU (mIOU), or Dice (DSC) | UNet |
| initializer | (str, required) | Initializer for the convolution weights. See supported initializers | UNet |
| input_channels | (int, required) | Number of channels in the input images to the model | UNet |
| loss | (str, required) | Loss type. Supported values: "bce", "multilabel_bce", "ssce" | UNet |
| nonlinearity | (str, required) | Activation function used in the model following convolutions in the encoder and decoder | UNet |
| norm_kwargs | (dict, optional) Default: None | Args to be passed to norm layers during initialization. For norm_type = group, norm_kwargs must include a num_groups key-value pair. For norm_type = layer, norm_kwargs must include a normalized_shape key-value pair | UNet |
| norm_layer | (str, optional) Default: "batchnorm2d" | Type of normalization to be used. See supported norm layers | UNet |
| residual_blocks | (bool, optional) Default: False | Flag for using residual connections at the end of each block | UNet |
| skip_connect | (bool, optional) Default: True | Flag for whether the model concatenates encoder outputs to decoder inputs | UNet |
| use_conv3d | (bool, optional) Default: False | Whether to use 3D convolutions in the model | UNet |
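
As a rough example, a UNet model section might combine these fields as follows. The filter sizes, conv specifications, and initializer names here are placeholders chosen for illustration (see the supported initializers page for valid names), not a validated configuration:

```yaml
model:
  input_channels: 1
  encoder_filters: [32, 64, 128, 256]   # one entry per encoder block
  decoder_filters: [128, 64, 32]        # one entry per decoder block
  convs_per_block: ["3x3_conv", "3x3_conv"]  # placeholder conv specifications
  nonlinearity: "ReLU"
  norm_layer: "batchnorm2d"
  downscale_method: "max_pool"
  skip_connect: true
  loss: "bce"
  eval_metrics: ["mIOU", "DSC"]
  initializer: "glorot_uniform"         # assumed initializer name
  bias_initializer: "zeros"
```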

Data loader parameters#

Common#

| Parameter Name | Data type | Description |
|---|---|---|
| batch_size | (int, required) | The global (effective) batch size, i.e., the batch size used to calculate the loss and update weights for a single step |
| data_dir | (str/List[str], required) | Path(s) to the data files to use |
| data_processor | (str, required) | Name of the data processor to be used |
| mixed_precision | (bool, optional) Default: None | Flag to cast input to fp16 |
| num_workers | (int, optional) Default: 0 | Number of workers to use in the dataloader. See more |
| persistent_workers | (bool, optional) Default: True | For a multi-worker dataloader, controls whether the workers are recreated at the end of each epoch (see PyTorch docs) |
| prefetch_factor | (int, optional) Default: 10 | Number of samples loaded in advance by each worker |
| shuffle | (bool, optional) Default: True | Flag to enable data shuffling |
| shuffle_buffer | (int, optional) Default: 10 * batch_size | Size of the shuffle buffer in samples |
| shuffle_seed | (int, optional) Default: None | Shuffle seed |
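
For example, these common data loader fields typically live under an input section of the config (commonly train_input or eval_input in Model Zoo configs); the processor name and path below are placeholders:

```yaml
train_input:
  data_processor: "MyDataProcessor"   # placeholder; use the processor for your model
  data_dir: "./path/to/train_data"
  batch_size: 256
  shuffle: true
  shuffle_seed: 1
  num_workers: 8
  prefetch_factor: 10
  persistent_workers: true
```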

Transformers#

| Parameter Name | Data type | Description | Supported Models |
|---|---|---|---|
| do_lower | (bool, optional) Default: False | Flag to lowercase the texts | BERT (pre-training, fine-tuning), T5, Transformer |
| dynamic_loss_weight | (bool, optional) Default: False | Flag to dynamically scale the loss. If set, divides the loss for a token by the length of the sequence that the token comes from. Use with "precomputed_num_tokens" loss scaling | T5, Transformer |
| dynamic_mlm_scale | (bool, optional) Default: False | Flag to dynamically scale the loss. If set, the MLM loss is scaled by the number of masked tokens in the current batch using the masked_lm_weights from the input data features | BERT (pre-training) |
| extra_ids | (int, optional) Default: 0 | Number of sentinel tokens for the T5 objective | T5, Transformer |
| masked_lm_prob | (float, optional) Default: 0.15 | Ratio of the masked tokens over the sequence length | BERT (pre-training) |
| max_predictions_per_seq | (int, required) | Maximum number of masked tokens per sequence | BERT (pre-training) |
| max_sequence_length | (int, optional) Default: varies per model | Maximum sequence length of the input data | All |
| src_data_dir | (str, required) | Path to the directory containing all the files of tokenized data for the source sequence | T5, Transformer |
| src_max_sequence_length | (int, required) | Largest possible sequence length for the input source sequence. Longer sequences are truncated; all other sequences are padded to this length | T5, Transformer |
| src_vocab_file | (str, required) | Path to the vocab file for the source input | T5, Transformer |
| tgt_data_dir | (str, required) | Path to the directory containing all the files of tokenized data for the target sequence | T5, Transformer |
| tgt_max_sequence_length | (int, required) | Largest possible sequence length for the input target sequence. Longer sequences are truncated; all other sequences are padded to this length | T5, Transformer |
| tgt_vocab_file | (str, required) | Path to the vocab file for the target input | T5, Transformer |
| vocab_file | (str, required) | Path to the vocab file | BERT (pre-training, fine-tuning) |
| vocab_size | (int, required) | The size of the vocabulary used in the model | BERT (pre-training, fine-tuning) |
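
As a sketch, a BERT pre-training input section might set the transformer-specific fields like this; the processor name, paths, and sizes are illustrative placeholders:

```yaml
train_input:
  data_processor: "BertPretrainProcessor"   # placeholder name
  data_dir: "./path/to/preprocessed_data"
  vocab_file: "./path/to/vocab.txt"
  vocab_size: 30522
  max_sequence_length: 128
  max_predictions_per_seq: 20
  masked_lm_prob: 0.15
  do_lower: true
  batch_size: 256
```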

Computer Vision#

| Parameter Name | Data type | Description | Supported Models |
|---|---|---|---|
| aggregate_cartilage | (bool, optional) Default: True | For the SKM-TEA dataset only. Combines the medial and lateral classes into a single class | UNet |
| augment_data | (bool, optional) Default: True | Apply data augmentation to the data | UNet |
| class_id | (int, optional) | For the Severstal dataset, sets which class id is considered the positive class. All other classes are considered negative examples | UNet |
| echo_type | (str, required) Default: echo1 | For the SKM-TEA dataset only. Specifies the training data configuration. Allowed options are: echo1, echo2, or root_sum_of_squares | UNet |
| image_shape | (List[int], required) | Expected shape of output images in format (H, W, C) | UNet |
| normalize_data_method | (str, required) | Strategy to normalize the input data. One of: "zero_centered", "zero_one", "standard_score" | UNet |
| num_classes | (int, required) | Number of classes in the training dataset | UNet |
| train_test_split | | Percentage of data to be used in the training dataset | UNet |
| use_fast_dataloader | (bool, optional) Default: False | If set to True, map-style datasets that use the UNetDataProcessor perform faster data processing | UNet |
| use_worker_cache | (bool, optional) Default: True | If set to True, data is read from local SSD memory on the individual worker nodes during training. If the data does not exist on the worker nodes, it is automatically copied from the host node. This causes a slowdown the first time this copy takes place | UNet |
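
A minimal sketch of a UNet input section using these fields; the dataset path, split, and image shape are illustrative assumptions:

```yaml
train_input:
  data_processor: "UNetDataProcessor"   # placeholder; see the model's reference config
  data_dir: "./path/to/dataset"
  image_shape: [256, 256, 1]            # (H, W, C)
  num_classes: 2
  normalize_data_method: "zero_centered"
  augment_data: true
  train_test_split: 0.9
  batch_size: 32
```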

Optimizer parameters#

| Parameter Name | Data type | Description |
|---|---|---|
| initial_loss_scale | (int, optional) Default: 2 ** 15 | Initial loss scale to be used in the grad scale |
| learning_rate | (dict, required) | Learning rate scheduler to be used. See supported LR schedulers |
| log_summaries | (bool, optional) Default: False | Flag to log per-layer gradient norms in TensorBoard |
| loss_scaling_factor | (float/str, optional) Default: 1.0 | Loss scaling factor for gradient calculation in the learning step |
| max_gradient_norm | (float, optional) Default: None | Max norm of the gradients for learnable parameters. Used for gradient clipping |
| min_loss_scale | (float, optional) Default: None | The minimum loss scale value that can be chosen by dynamic loss scaling |
| max_loss_scale | (float, optional) Default: None | The maximum loss scale value that can be chosen by dynamic loss scaling |
| optimizer_type | (str, required) | Optimizer to be used. See supported optimizers |
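
The optimizer fields above map onto an optimizer section roughly as follows. This is only a sketch: the AdamW name and the learning_rate sub-keys (scheduler, initial_learning_rate, and so on) are assumptions here, and the string form of loss_scaling_factor is assumed to enable dynamic loss scaling; consult the supported optimizers and LR schedulers pages for the exact names:

```yaml
optimizer:
  optimizer_type: "AdamW"            # assumed optimizer name
  learning_rate:                     # dict describing the LR scheduler
    scheduler: "Linear"              # assumed sub-keys; see supported LR schedulers
    initial_learning_rate: 0.0001
    end_learning_rate: 0.0
    total_iters: 10000
  max_gradient_norm: 1.0             # gradient clipping
  loss_scaling_factor: "dynamic"     # assumed string value enabling dynamic loss scaling
  initial_loss_scale: 32768          # 2 ** 15, used by dynamic loss scaling
  log_summaries: false
```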

Runconfig parameters#

| Key | Data type | Description | Supported mode |
|---|---|---|---|
| autogen_policy | (str, optional) Default: None | The autogen policy to be used for the given run. Can be one of: "default", "disabled", "mild", "medium", "aggressive". See more | CSX |
| autoload_last_checkpoint | (bool, optional) Default: True | Flag to automatically load the last checkpoint in the model_dir | All |
| check_loss_values | (bool, optional) Default: True | Flag to check whether loss values are NaN/inf | All |
| checkpoint_path | (str, optional) Default: None | The path to load checkpoints from during training | All |
| checkpoint_steps | (int, optional) Default: 0 | The number of steps between saving model checkpoints during training. 0 means no checkpoints are saved | All |
| compile_dir | (str, optional) Default: None | Compile directory where compile artifacts will be written | All |
| compile_only | (bool, optional) Default: False | Enables the compile-only workflow | All |
| credentials_path | (str, optional) Default: None | Credentials for cluster access. If None, the value from a pre-configured location will be used if available | CSX |
| debug_args_path | (str, optional) Default: None | Path to the debug args file | CSX |
| dist_addr | (str, optional) Default: localhost:8888 | Used to initialize master_addr and master_port for distributed training | GPU |
| dist_backend | (str, optional) Default: "nccl" | Distributed backend engine | GPU |
| enable_distributed | (bool, optional) Default: False | Flag to enable distributed training on GPU | GPU |
| enable_summaries | (bool, optional) Default: False | Enable summaries when running on CS-X hardware | CSX |
| eval_frequency | (int, optional) Default: None | Specifies the evaluation frequency during training. Only used for train_and_eval mode | All |
| eval_steps | (int, optional) Default: None | Specifies the number of steps to run the model evaluation | All |
| experimental_api | (bool, optional) Default: False | Flag to enable the experimental PyTorch API | CSX |
| init_method | (str, optional) Default: "env://" | URL specifying how to initialize the process group | GPU |
| is_pretrained_checkpoint | (bool, optional) Default: False | Flag used in conjunction with checkpoint_path to enforce resetting of optimizer states and training steps after loading a given checkpoint. When set, matching weights are initialized from the checkpoint provided by checkpoint_path, training starts from step 0, and optimizer states present in the checkpoint are ignored. Useful for fine-tuning runs on different tasks (e.g., classification, Q&A) where weights from a pre-trained model trained on language modeling (LM) tasks are loaded, or for fine-tuning on a different dataset on the same LM task | All |
| job_labels | (str, optional) Default: None | A list of equal-sign-separated key-value pairs that serve as job labels | CSX |
| log_steps | (int, optional) Default: None | Specifies the number of steps between logging during training. The same number controls the summary steps in TensorBoard | All |
| logging | (str, optional) Default: "INFO" | Specifies the logging level during training | All |
| max_steps | (int, required) | Specifies the maximum number of steps for training. max_steps is optional unless neither num_epochs nor num_steps is provided, in which case max_steps must be provided | All |
| mgmt_address | (str, optional) | The address of the management service used for coordinating the training job, as <host>:<port> | CSX |
| mode | (str, required) | The mode of the training job: one of "train", "eval", "eval_all", or "train_and_eval" | All |
| model_dir | (str, optional) Default: ./model_dir | The directory where model checkpoints and other metadata are saved during training | All |
| mount_dirs | (List[str], optional) Default: None | A list of paths to be mounted to the appliance containers. It should generally contain the path to the directory containing the Cerebras Model Zoo and the data dir | CSX |
| num_act_servers | (int, optional) Default: 1 | Number of activation servers per CS-X dedicated to streaming samples to the WSE. Input workers stream data to these activation servers, which hold and further stream the data to the WSE. For LLMs, we generally choose one because they are compute-bound. For CV models we choose a higher number; a crude rule of thumb is one activation server for every 4 workers (i.e., num_workers_per_csx // 4 if num_workers_per_csx > 4, else 1). It is suggested to keep the default value for this param when possible | CSX |
| num_csx | (int, optional) Default: 1 | The number of CS-X systems to use in the Cerebras WSE cluster | CSX |
| num_epochs | (int, optional) Default: None | The number of epochs to train for | All |
| num_steps | (int, optional) Default: None | The number of steps to train for | All |
| num_wgt_servers | (int, optional) Default: None | Upper bound on the number of MemoryX servers used for storing the model weights. Compilation may choose a smaller number depending on the model topology. A sensible upper bound (currently 24) is selected if a value is not provided | CSX |
| num_workers_per_csx | (int, optional) Default: 0 | Number of input workers, per CS-X, to use for streaming samples. This setting depends on whether the model is compute-bound or input-bound and how efficient the dataloader implementation is. For compute-bound models (e.g., LLM), even one input worker per CS-X is enough to saturate the input buffers on CS-X systems, but for smaller models a larger number may be used. We currently default to 1 worker per CS-X | CSX |
| precision_opt_level | (int, optional) Default: 1 | Setting to control the level of numerical precision used for training runs of large NLP models. See more | CSX |
| python_paths | (List[str], optional) Default: None | A list of paths to be exported into PYTHONPATH for worker containers. It should generally contain the path to the directory containing the Cerebras Model Zoo | CSX |
| save_initial_checkpoint | (bool, optional) Default: False | Whether to save an initial checkpoint before training starts | All |
| save_losses | (bool, optional) Default: True | Whether to save the loss values during training | All |
| seed | (int, optional) Default: None | The seed to use for random number generation, for reproducibility | All |
| steps_per_epoch | (int, optional) Default: None | The number of steps per epoch | All |
| sync_batchnorm | (bool, optional) Default: False | Whether to use synchronized batch normalization in a multi-GPU setup | GPU |
| target_device | (str, optional) Default: command line value | The target device to run the training on. One of: CPU, GPU, CSX. Required on the command line | All |
| use_cs_grad_accum | (bool, optional) Default: False | Whether to use gradient accumulation to support larger batch sizes | CSX |
| validate_only | (bool, optional) Default: False | Enables the validate-only workflow, which stops compilation at the kernel matching stage | CSX |
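
Putting a few of the runconfig keys together, a CS-X training run might look roughly like the following; directory paths and step counts are placeholders:

```yaml
runconfig:
  mode: "train"
  model_dir: "./model_dir"
  max_steps: 100000
  log_steps: 100
  checkpoint_steps: 10000
  save_initial_checkpoint: false
  seed: 1
  num_csx: 1
  num_workers_per_csx: 1
  mount_dirs:                # paths visible to the appliance containers
    - "/path/to/modelzoo"
    - "/path/to/data"
  python_paths:              # exported into PYTHONPATH for worker containers
    - "/path/to/modelzoo"
```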