Cerebras Model Zoo YAML parameters#

Model parameters#

Common#

| Parameter Name | Data type | Description |
|---|---|---|
| mixed_precision | (bool, optional) Default: None | Whether to use mixed precision training or not |
| use_bfloat16 | (bool, optional) Default: False | Whether to use the bfloat16 data type instead of float32. See more |
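
For illustration, a minimal sketch of how these common fields might appear in the model section of a Model Zoo YAML config (the values are placeholders, not a tuned configuration):

```yaml
model:
  mixed_precision: true   # cast eligible ops to lower precision during training
  use_bfloat16: true      # use bfloat16 instead of float32 for those casts
```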

Transformer based models#

| Parameter Name | Data type | Description | Supported Models |
|---|---|---|---|
| attention_dropout_rate | (float, optional) Default: same as dropout | Dropout rate for the attention layer | All |
| attention_kernel | (str/None, optional) Default: None | Attention kernel to use. Accepted values: None (compiler selects the kernel), "default" (default implementation), "optimized_beta" (optimized implementation; beta feature with limited support) | All |
| attention_softmax_fp32 | (bool, optional) Default: True | Whether to use fp32 precision for attention softmax | All |
| attention_type | (str, optional) Default: "scaled_dot_product" | Type of attention. Accepted values: "dot_product", "scaled_dot_product" | All |
| d_ff | (int, optional) Default: 2048 | Size of the intermediate feed-forward layer in each T5Block | T5, Transformer |
| d_kv | (int, optional) Default: 64 | Size of the query/key/value projections per attention head. d_kv does not have to be equal to d_model // num_heads | T5, Transformer |
| d_model | (int, optional) Default: 512 | The number of expected features in the encoder/decoder inputs | All |
| decoder_nonlinearity | (str, optional) Default: "relu" | Type of nonlinearity to be used in the decoder | T5, Transformer |
| decoder_num_hidden_layers | (int, optional) | Number of hidden layers in the Transformer decoder. Uses the same value as num_layers if not set | T5, Transformer |
| disable_nsp | (bool, optional) Default: False | Whether to disable the next sentence prediction task | BERT (pre-training, fine-tuning) |
| dropout_rate | (float, optional) Default: 0.1 | The dropout probability for all fully connected layers | All |
| embedding_dropout_rate | (float, optional) Default: 0.1 | Dropout rate for embeddings | All |
| embedding_initializer | (str, optional) Default: "normal" | Initializer to use for embeddings. See supported initializers | GPT2, GPT3, GPTJ |
| encoder_nonlinearity | (str, optional) Default: varies per model | Type of nonlinearity to be used in the encoder | BERT (pre-training, fine-tuning), T5, Transformer |
| encoder_num_hidden_layers | (int, optional) Default: 6 | Number of hidden layers in the encoder | T5, Transformer |
| extra_ids | (int, optional) Default: 0 | The number of extra ids used for additional vocabulary items | T5, Transformer |
| filter_size | (int, optional) Default: 3072 | Dimensionality of the feed-forward layer in the Transformer block | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| hidden_size | (int, optional) Default: 768 | The size of the transformer hidden layers | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| initializer | (str, optional) Default: varies per model | The initializer to be used for all the initializers used in the model. See supported initializers | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| initializer_range | (float, optional) Default: 0.02 | The standard deviation of the truncated_normal_initializer used as the default initializer | BERT (pre-training), GPT2, GPT3, GPTJ |
| layer_norm_epsilon | (float, optional) Default: 1e-5 | The epsilon value used in layer normalization layers | All |
| lm_loss_weight | (float, optional) Default: 1.0 | Value that scales the loss by the mean number of predictions per sequence in the dataset. This number varies per dataset and can be calculated as the reciprocal of the average number of tokens per sequence in the training dataset. Only needed when setting loss scaling to "batch_size" | T5, Transformer |
| loss_scaling | (str, optional) Default: "num_tokens" | The scaling type used to calculate the loss. Accepts: "batch_size", "num_tokens". See more | GPT2, GPT3, GPTJ |
| loss_weight | (float, optional) Default: 1.0 | The weight for the loss scaling when loss_scaling: "batch_size"; generally set to 1/max_sequence_length | GPT2, GPT3, GPTJ |
| max_position_embeddings | (int, optional) Default: 1024 | The maximum sequence length that the model can handle | All |
| mlm_loss_scaling | (str, optional) Default: "batch_size" | A string specifying the scaling factor type used for the masked language modeling loss. Accepts one of: "num_masked" (uses the off-the-shelf loss scaling by the number of valid, non-padding tokens in the cross-entropy loss function), "precomputed_num_masked" (uses loss scaling from the number of valid masks computed in the data loader, when dynamic_loss_weight is enabled in the data loader params), "batch_size" (uses loss scaling by batch_size; lm_loss_weight should be provided when using "batch_size") | T5, Transformer |
| mlm_loss_weight | (float, optional) Default: 1.0 | The weight for the masked language modeling loss used when scaling the loss with "batch_size". This number varies per dataset and can be calculated as the reciprocal of the average number of masked tokens per sequence in the training dataset | BERT (pre-training) |
| nonlinearity | (str, optional) Default: varies per model | The non-linear activation function used in the feed-forward network in each transformer block. See the list of non-linearity functions here. Some may require autogen_policy: "medium" | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| num_heads | (int, optional) Default: varies per model | The number of attention heads in the multi-head attention layer | All |
| num_hidden_layers | (int, optional) Default: 12 | Number of hidden layers in the Transformer encoder/decoder | All |
| output_layer_initializer | (str, optional) Default: varies per model | The name of the initializer for the weights of the output layer. See supported initializers | GPT2, GPT3, GPTJ |
| position_embedding_type | (str, optional) Default: varies per model | The type of position embedding to use in the model. Can be one of: "fixed" (sinusoidal, from the original Transformer), "relative" (relative position embedding, to exploit pairwise relative positional information), "rotary" (a.k.a. RoPE), "learned" (learned embedding matrix), None | All |
| relu_dropout_rate | (float, optional) Default: varies per model | The dropout rate for the ReLU activation function | T5, Transformer |
| residual_dropout_rate | (float, optional) Default: 0.1 | The dropout rate for residual connections | GPTJ |
| rotary_dim | (int, optional) Default: None | The number of dimensions used for the rotary position encoding. Must be an even number | GPTJ |
| share_embedding_weights | (bool, optional) Default: True | Whether to share the embedding weights between the input and output embeddings | All |
| share_encoder_decoder_embedding | (bool, optional) Default: True | Whether to share the embedding weights between the encoder and decoder | T5, Transformer |
| src_vocab_size | (int, optional) Default: 32128 | The size of the source vocabulary. Max supported value: 512000 | T5, Transformer |
| tgt_vocab_size | (int, optional) Default: 32128 | The size of the target vocabulary. Max supported value: 512000 | T5, Transformer |
| use_bias_in_output | (bool, optional) Default: False | Whether to use bias in the final output layer | GPT2, GPT3, GPTJ |
| use_dropout_outside_residual_path | (bool, optional) Default: True for T5, False for Transformer | Whether to set dropout calculations outside of the residual path | T5, Transformer |
| use_ffn_bias | (bool, optional) Default: varies per model | Whether to use bias in the feed-forward network (FFN) | All |
| use_ffn_bias_in_attention | (bool, optional) Default: varies per model | Whether to include bias in the attention layer for the feed-forward network (FFN) | All |
| use_position_embedding | (bool, optional) Default: True | Whether to use position embedding in the model | GPT2, GPT3 |
| use_pre_encoder_decoder_dropout | (bool, optional) Default: False | Whether to use a dropout layer after the positional embedding layer and encoder/decoder | T5, Transformer |
| use_pre_encoder_decoder_layer_norm | (bool, optional) Default: True | Whether to use layer norm before passing input tensors into the encoder/decoder | T5, Transformer |
| use_projection_bias_in_attention | (bool, optional) Default: varies per model | Whether to include bias in the attention layer for projection | All |
| use_t5_layer_norm | (bool, optional) Default: False | Whether to use T5 layer norm (with no mean subtraction and bias correction) or the regular nn.LayerNorm module | T5, Transformer |
| use_transformer_initialization | (bool, optional) Default: False | The Transformer model tends to converge best with a scaled variant of Xavier uniform initialization for linear layers. This contrasts with the initialization used in the original T5 paper, which uses normal initialization for linear layers. Setting this flag to True switches the initialization to the Transformer-specific scaled Xavier initialization | T5, Transformer |
| use_untied_layer_norm | (bool, optional) Default: False | Whether to use untied layer normalization | GPTJ |
| vocab_size | (int, optional) Default: varies per model | The size of the vocabulary used in the model. Max supported value: 512000 | All |
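
A hedged sketch of how a few of these fields might be combined in the model section of a GPT-style config; the specific values below are illustrative only and not taken from a released configuration:

```yaml
model:
  hidden_size: 768                    # transformer hidden size
  num_hidden_layers: 12
  num_heads: 12
  filter_size: 3072                   # feed-forward dimensionality
  nonlinearity: "gelu"
  dropout_rate: 0.1
  attention_dropout_rate: 0.1
  max_position_embeddings: 1024
  position_embedding_type: "learned"
  vocab_size: 50257
  share_embedding_weights: true
  loss_scaling: "num_tokens"
  mixed_precision: true
```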

Computer Vision models#

| Parameter | Data type | Description | Supported Models |
|---|---|---|---|
| bias_initializer | (str, optional) Default: "zeros" | Initializer for the bias | UNet |
| convs_per_block | (List[str], required) | List of conv specifications for each conv in the block | UNet |
| decoder_filters | (List[str], required) | List of filter sizes for each block in the decoder | UNet |
| downscale_bottleneck | (bool, optional) Default: False | Whether to downsample the spatial dimensions in the UNet bottleneck block | UNet |
| downscale_encoder_blocks | (bool/List[bool], optional) Default: True | Determines whether each block in the encoder includes downsampling. The length of the list must correspond to the number of UNetBlocks in the encoder. If a single bool is provided, all blocks use this value | UNet |
| downscale_first_conv | (bool, optional) Default: False | If True, the first convolution operation in each UNetBlock is downscaled. If False, the last convolution in each UNetBlock is downscaled | UNet |
| downscale_method | (str, optional) Default: "max_pool" | Downscaling method at the end of each block. One of "max_pool" or "strided_conv" | UNet |
| enable_bias | (bool, optional) Default: bias is only included when no normalization is used after the convolution layers | Whether to include a bias operation following convolution layers | UNet |
| encoder_filters | (List[str], required) | List of filter sizes for each block in the encoder | UNet |
| eval_ignore_classes | (List[int], optional) | List of classes to ignore during model evaluation | UNet |
| eval_metrics | (List[str], optional) | List of evaluation metrics to use during training and validation. Available options are accuracy (Acc), mean IOU (mIOU), or Dice (DSC) | UNet |
| initializer | (str, required) | Initializer for the convolution weights. See supported initializers | UNet |
| input_channels | (int, required) | Number of channels in the input images to the model | UNet |
| loss | (str, required) | Loss type. Supported values: "bce", "multilabel_bce", "ssce" | UNet |
| nonlinearity | (str, required) | Activation function used in the model following convolutions in the encoder and decoder | UNet |
| norm_kwargs | (dict, optional) Default: None | Args to be passed to norm layers during initialization. For norm_type = group, norm_kwargs must include a num_groups key-value pair. For norm_type = layer, norm_kwargs must include a normalized_shape key-value pair | UNet |
| norm_layer | (str, optional) Default: "batchnorm2d" | Type of normalization to be used. See supported norm layers | UNet |
| residual_blocks | (bool, optional) Default: False | Flag for using residual connections at the end of each block | UNet |
| skip_connect | (bool, optional) Default: True | Flag for whether the model concatenates encoder outputs to decoder inputs | UNet |
| use_conv3d | (bool, optional) Default: False | Whether to use 3D convolutions in the model | UNet |
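
As a rough example, a UNet model section might combine these fields as follows. The filter sizes, conv specifications, and initializer names here are placeholders chosen for illustration (see the supported initializers page for valid names), not a validated configuration:

```yaml
model:
  input_channels: 1
  encoder_filters: [32, 64, 128, 256]   # one entry per encoder block
  decoder_filters: [128, 64, 32]        # one entry per decoder block
  convs_per_block: ["3x3_conv", "3x3_conv"]  # placeholder conv specifications
  nonlinearity: "ReLU"
  norm_layer: "batchnorm2d"
  downscale_method: "max_pool"
  skip_connect: true
  loss: "bce"
  eval_metrics: ["mIOU", "DSC"]
  initializer: "glorot_uniform"         # assumed initializer name
  bias_initializer: "zeros"
```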

Data loader parameters#

Common#

| Parameter Name | Data type | Description |
|---|---|---|
| batch_size | (int, required) | The global (effective) batch size, i.e., the batch size used to calculate the loss and update weights for a single step |
| data_dir | (str/List[str], required) | Path(s) to the data files to use |
| data_processor | (str, required) | Name of the data processor to be used |
| mixed_precision | (bool, optional) Default: None | Flag to cast input to fp16 |
| num_workers | (int, optional) Default: 0 | Number of workers to use in the dataloader. See more |
| persistent_workers | (bool, optional) Default: True | For a multi-worker dataloader, controls whether the workers are recreated at the end of each epoch (see PyTorch docs) |
| prefetch_factor | (int, optional) Default: 10 | Number of samples loaded in advance by each worker |
| shuffle | (bool, optional) Default: True | Flag to enable data shuffling |
| shuffle_buffer | (int, optional) Default: 10 * batch_size | Size of the shuffle buffer in samples |
| shuffle_seed | (int, optional) Default: None | Shuffle seed |
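
For example, these common data loader fields typically live under an input section of the config (commonly train_input or eval_input in Model Zoo configs); the processor name and path below are placeholders:

```yaml
train_input:
  data_processor: "MyDataProcessor"   # placeholder; use the processor for your model
  data_dir: "./path/to/train_data"
  batch_size: 256
  shuffle: true
  shuffle_seed: 1
  num_workers: 8
  prefetch_factor: 10
  persistent_workers: true
```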

Transformers#

| Parameter Name | Data type | Description | Supported Models |
|---|---|---|---|
| do_lower | (bool, optional) Default: False | Flag to lowercase the texts | BERT (pre-training, fine-tuning), T5, Transformer |
| dynamic_loss_weight | (bool, optional) Default: False | Flag to dynamically scale the loss. If set, divides the loss for a token by the length of the sequence that the token comes from. Use with "precomputed_num_tokens" loss scaling | T5, Transformer |
| dynamic_mlm_scale | (bool, optional) Default: False | Flag to dynamically scale the loss. If set, the MLM loss is scaled by the number of masked tokens in the current batch using the masked_lm_weights from the input data features | BERT (pre-training) |
| extra_ids | (int, optional) Default: 0 | Number of sentinel tokens for the T5 objective | T5, Transformer |
| masked_lm_prob | (float, optional) Default: 0.15 | Ratio of the masked tokens over the sequence length | BERT (pre-training) |
| max_predictions_per_seq | (int, required) | Maximum number of masked tokens per sequence | BERT (pre-training) |
| max_sequence_length | (int, optional) Default: varies per model | Maximum sequence length of the input data | All |
| src_data_dir | (str, required) | Path to the directory containing all the files of tokenized data for the source sequence | T5, Transformer |
| src_max_sequence_length | (int, required) | Largest possible sequence length for the input source sequence. Longer sequences are truncated; all other sequences are padded to this length | T5, Transformer |
| src_vocab_file | (str, required) | Path to the vocab file for the source input | T5, Transformer |
| tgt_data_dir | (str, required) | Path to the directory containing all the files of tokenized data for the target sequence | T5, Transformer |
| tgt_max_sequence_length | (int, required) | Largest possible sequence length for the input target sequence. Longer sequences are truncated; all other sequences are padded to this length | T5, Transformer |
| tgt_vocab_file | (str, required) | Path to the vocab file for the target input | T5, Transformer |
| vocab_file | (str, required) | Path to the vocab file | BERT (pre-training, fine-tuning) |
| vocab_size | (int, required) | The size of the vocabulary used in the model | BERT (pre-training, fine-tuning) |
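
As a sketch, a BERT pre-training input section might set the transformer-specific fields like this; the processor name, paths, and sizes are illustrative placeholders:

```yaml
train_input:
  data_processor: "BertPretrainProcessor"   # placeholder name
  data_dir: "./path/to/preprocessed_data"
  vocab_file: "./path/to/vocab.txt"
  vocab_size: 30522
  max_sequence_length: 128
  max_predictions_per_seq: 20
  masked_lm_prob: 0.15
  do_lower: true
  batch_size: 256
```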

Computer Vision#

| Parameter Name | Data type | Description | Supported Models |
|---|---|---|---|
| aggregate_cartilage | (bool, optional) Default: True | For the SKM-TEA dataset only. Combines the medial and lateral classes into a single class | UNet |
| augment_data | (bool, optional) Default: True | Apply data augmentation to the data | UNet |
| class_id | (int, optional) | For the Severstal dataset, sets which class id is considered the positive class. All other classes are considered negative examples | UNet |
| echo_type | (str, required) Default: echo1 | For the SKM-TEA dataset only. Specifies the training data configuration. Allowed options are: echo1, echo2, or root_sum_of_squares | UNet |
| image_shape | (List[int], required) | Expected shape of output images in format (H, W, C) | UNet |
| normalize_data_method | (str, required) | Strategy to normalize the input data. One of: "zero_centered", "zero_one", "standard_score" | UNet |
| num_classes | (int, required) | Number of classes in the training dataset | UNet |
| train_test_split | | Percentage of data to be used in the training dataset | UNet |
| use_fast_dataloader | (bool, optional) Default: False | If set to True, map-style datasets that use the UNetDataProcessor perform faster data processing | UNet |
| use_worker_cache | (bool, optional) Default: True | If set to True, data is read from local SSD memory on the individual worker nodes during training. If the data does not exist on the worker nodes, it is automatically copied from the host node. This causes a slowdown the first time this copy takes place | UNet |
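
A minimal sketch of a UNet input section using these fields; the dataset path, split, and image shape are illustrative assumptions:

```yaml
train_input:
  data_processor: "UNetDataProcessor"   # placeholder; see the model's reference config
  data_dir: "./path/to/dataset"
  image_shape: [256, 256, 1]            # (H, W, C)
  num_classes: 2
  normalize_data_method: "zero_centered"
  augment_data: true
  train_test_split: 0.9
  batch_size: 32
```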

Optimizer parameters#

| Parameter Name | Data type | Description |
|---|---|---|
| initial_loss_scale | (int, optional) Default: 2 ** 15 | Initial loss scale to be used in the grad scale |
| learning_rate | (dict, required) | Learning rate scheduler to be used. See supported LR schedulers |
| log_summaries | (bool, optional) Default: False | Flag to log per-layer gradient norms in TensorBoard |
| loss_scaling_factor | (float/str, optional) Default: 1.0 | Loss scaling factor for gradient calculation in the learning step |
| max_gradient_norm | (float, optional) Default: None | Max norm of the gradients for learnable parameters. Used for gradient clipping |
| min_loss_scale | (float, optional) Default: None | The minimum loss scale value that can be chosen by dynamic loss scaling |
| max_loss_scale | (float, optional) Default: None | The maximum loss scale value that can be chosen by dynamic loss scaling |
| optimizer_type | (str, required) | Optimizer to be used. See supported optimizers |
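
The optimizer fields above map onto an optimizer section roughly as follows. This is only a sketch: the AdamW name and the learning_rate sub-keys (scheduler, initial_learning_rate, and so on) are assumptions here, and the string form of loss_scaling_factor is assumed to enable dynamic loss scaling; consult the supported optimizers and LR schedulers pages for the exact names:

```yaml
optimizer:
  optimizer_type: "AdamW"            # assumed optimizer name
  learning_rate:                     # dict describing the LR scheduler
    scheduler: "Linear"              # assumed sub-keys; see supported LR schedulers
    initial_learning_rate: 0.0001
    end_learning_rate: 0.0
    total_iters: 10000
  max_gradient_norm: 1.0             # gradient clipping
  loss_scaling_factor: "dynamic"     # assumed string value enabling dynamic loss scaling
  initial_loss_scale: 32768          # 2 ** 15, used by dynamic loss scaling
  log_summaries: false
```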

Runconfig parameters#

| Key | Data type | Description | Supported mode |
|---|---|---|---|
| autogen_policy | (str, optional) Default: None | The autogen policy to be used for the given run. Can be one of: "default", "disabled", "mild", "medium", "aggressive". See more | CSX |
| autoload_last_checkpoint | (bool, optional) Default: True | Flag to automatically load the last checkpoint in the model_dir | All |
| check_loss_values | (bool, optional) Default: True | Flag to check whether loss values are NaN/inf | All |
| checkpoint_path | (str, optional) Default: None | The path to load checkpoints from during training | All |
| checkpoint_steps | (int, optional) Default: 0 | The number of steps between saving model checkpoints during training. 0 means no checkpoints are saved | All |
| compile_dir | (str, optional) Default: None | Compile directory where compile artifacts will be written | All |
| compile_only | (bool, optional) Default: False | Enables the compile-only workflow | All |
| credentials_path | (str, optional) Default: None | Credentials for cluster access. If None, the value from a pre-configured location will be used if available | CSX |
| debug_args_path | (str, optional) Default: None | Path to the debug args file | CSX |
| dist_addr | (str, optional) Default: localhost:8888 | Used to initialize master_addr and master_port for distributed training | GPU |
| dist_backend | (str, optional) Default: "nccl" | Distributed backend engine | GPU |
| enable_distributed | (bool, optional) Default: False | Flag to enable distributed training on GPU | GPU |
| enable_summaries | (bool, optional) Default: False | Enable summaries when running on CS-X hardware | CSX |
| eval_frequency | (int, optional) Default: None | Specifies the evaluation frequency during training. Only used for train_and_eval mode | All |
| eval_steps | (int, optional) Default: None | Specifies the number of steps to run the model evaluation | All |
| experimental_api | (bool, optional) Default: False | Flag to enable the experimental PyTorch API | CSX |
| init_method | (str, optional) Default: "env://" | URL specifying how to initialize the process group | GPU |
| is_pretrained_checkpoint | (bool, optional) Default: False | Flag used in conjunction with checkpoint_path to enforce resetting of optimizer states and training steps after loading a given checkpoint. When set, matching weights are initialized from the checkpoint provided by checkpoint_path, training starts from step 0, and optimizer states present in the checkpoint are ignored. Useful for fine-tuning runs on different tasks (e.g., classification, Q&A) where weights from a pre-trained model trained on language modeling (LM) tasks are loaded, or for fine-tuning on a different dataset on the same LM task | All |
| job_labels | (str, optional) Default: None | A list of equal-sign-separated key-value pairs that serve as job labels | CSX |
| log_steps | (int, optional) Default: None | Specifies the number of steps between logging during training. The same number controls the summary steps in TensorBoard | All |
| logging | (str, optional) Default: "INFO" | Specifies the logging level during training | All |
| max_steps | (int, required) | Specifies the maximum number of steps for training. max_steps is optional unless neither num_epochs nor num_steps is provided, in which case max_steps must be provided | All |
| mgmt_address | (str, optional) | The address of the management service used for coordinating the training job, as <host>:<port> | CSX |
| mode | (str, required) | The mode of the training job: one of "train", "eval", "eval_all", or "train_and_eval" | All |
| model_dir | (str, optional) Default: ./model_dir | The directory where model checkpoints and other metadata are saved during training | All |
| mount_dirs | (List[str], optional) Default: None | A list of paths to be mounted to the appliance containers. It should generally contain the path to the directory containing the Cerebras Model Zoo and the data dir | CSX |
| num_act_servers | (int, optional) Default: 1 | Number of activation servers per CS-X dedicated to streaming samples to the WSE. Input workers stream data to these activation servers, which hold and further stream the data to the WSE. For LLMs, we generally choose one because they are compute-bound. For CV models we choose a higher number; a crude rule of thumb is one activation server for every 4 workers (i.e., num_workers_per_csx // 4 if num_workers_per_csx > 4, else 1). It is suggested to keep the default value for this param when possible | CSX |
| num_csx | (int, optional) Default: 1 | The number of CS-X systems to use in the Cerebras WSE cluster | CSX |
| num_epochs | (int, optional) Default: None | The number of epochs to train for | All |
| num_steps | (int, optional) Default: None | The number of steps to train for | All |
| num_wgt_servers | (int, optional) Default: None | Upper bound on the number of MemoryX servers used for storing the model weights. Compilation may choose a smaller number depending on the model topology. A sensible upper bound (currently 24) is selected if a value is not provided | CSX |
| num_workers_per_csx | (int, optional) Default: 0 | Number of input workers, per CS-X, to use for streaming samples. This setting depends on whether the model is compute-bound or input-bound and how efficient the dataloader implementation is. For compute-bound models (e.g., LLM), even one input worker per CS-X is enough to saturate the input buffers on CS-X systems, but for smaller models a larger number may be used. We currently default to 1 worker per CS-X | CSX |
| precision_opt_level | (int, optional) Default: 1 | Setting to control the level of numerical precision used for training runs of large NLP models. See more | CSX |
| python_paths | (List[str], optional) Default: None | A list of paths to be exported into PYTHONPATH for worker containers. It should generally contain the path to the directory containing the Cerebras Model Zoo | CSX |
| save_initial_checkpoint | (bool, optional) Default: False | Whether to save an initial checkpoint before training starts | All |
| save_losses | (bool, optional) Default: True | Whether to save the loss values during training | All |
| seed | (int, optional) Default: None | The seed to use for random number generation, for reproducibility | All |
| steps_per_epoch | (int, optional) Default: None | The number of steps per epoch | All |
| sync_batchnorm | (bool, optional) Default: False | Whether to use synchronized batch normalization in a multi-GPU setup | GPU |
| target_device | (str, optional) Default: command line value | The target device to run the training on. One of: CPU, GPU, CSX. Required on the command line | All |
| use_cs_grad_accum | (bool, optional) Default: False | Whether to use gradient accumulation to support larger batch sizes | CSX |
| validate_only | (bool, optional) Default: False | Enables the validate-only workflow, which stops compilation at the kernel matching stage | CSX |
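
Putting a few of the runconfig keys together, a CS-X training run might look roughly like the following; directory paths and step counts are placeholders:

```yaml
runconfig:
  mode: "train"
  model_dir: "./model_dir"
  max_steps: 100000
  log_steps: 100
  checkpoint_steps: 10000
  save_initial_checkpoint: false
  seed: 1
  num_csx: 1
  num_workers_per_csx: 1
  mount_dirs:                # paths visible to the appliance containers
    - "/path/to/modelzoo"
    - "/path/to/data"
  python_paths:              # exported into PYTHONPATH for worker containers
    - "/path/to/modelzoo"
```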