# Cerebras Model Zoo YAML parameters

## Model parameters

### Common
| Parameter Name | Data type | Description |
| --- | --- | --- |
| mixed_precision | (bool) | Whether to use mixed precision training or not |
| use_bfloat16 | (bool) | Whether to use bfloat16 data type instead of float32. See more |
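These common flags live under the `model` section of a Model Zoo params YAML file. The snippet below is a minimal sketch assuming that layout; the values are illustrative, not defaults:

```yaml
# Minimal sketch of the common precision flags; values are illustrative only.
model:
  mixed_precision: True   # train with mixed precision
  use_bfloat16: True      # use bfloat16 instead of float32
```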
### Transformer based models
| Parameter Name | Data type | Description | Supported Models |
| --- | --- | --- | --- |
| attention_dropout_rate | (float) | Dropout rate for attention layer | All |
| attention_kernel | (str) | Attention kernel to use. Accepted values: | All |
| attention_softmax_fp32 | (bool) | Whether to use fp32 precision for attention softmax | All |
| attention_type | (str) | Type of attention. Accepted values: | All |
| d_ff | (int) | Size of the intermediate feed-forward layer in each | T5, Transformer |
| d_kv | (int) | Size of the query/key/value projections per attention head | T5, Transformer |
| d_model | (int) | The number of expected features in the encoder/decoder inputs | All |
| decoder_nonlinearity | (str) | Type of nonlinearity to be used in decoder | T5, Transformer |
| decoder_num_hidden_layers | (int) | Number of hidden layers in the Transformer decoder. Will use the same value as | T5, Transformer |
| disable_nsp | (bool) | Whether to disable the next sentence prediction task | BERT (pre-training, fine-tuning) |
| dropout_rate | (float) | The dropout probability for all fully connected layers | All |
| embedding_dropout_rate | (float) | Dropout rate for embeddings | All |
| embedding_initializer | (str) | Initializer to use for embeddings. See supported initializers | GPT2, GPT3, GPTJ |
| encoder_nonlinearity | (str) | Type of nonlinearity to be used in encoder | BERT (pre-training, fine-tuning), T5, Transformer |
| encoder_num_hidden_layers | (int) | Number of hidden layers in the encoder | T5, Transformer |
| extra_ids | (int) | The number of extra ids used for additional vocabulary items | T5, Transformer |
| filter_size | (int) | Dimensionality of the feed-forward layer in the Transformer block | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| hidden_size | (int) | The size of the transformer hidden layers | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| initializer | (str) | The initializer to be used for all the initializers used in the model. See supported initializers | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| initializer_range | (float) | The standard deviation of the truncated_normal_initializer as the default initializer | BERT (pre-training), GPT2, GPT3, GPTJ |
| layer_norm_epsilon | (float) | The epsilon value used in layer normalization layers | All |
| lm_loss_weight | (float) | Value that scales loss by the mean number of predictions per sequence in the dataset. This number varies per dataset and can be calculated by taking the reciprocal of the average number of tokens per sequence in the training dataset. This is only needed when setting loss scaling to | T5, Transformer |
| loss_scaling | (str) | The scaling type used to calculate the loss. Accepts: | GPT2, GPT3, GPTJ |
| loss_weight | (float) | The weight for the loss scaling when | GPT2, GPT3, GPTJ |
| max_position_embeddings | (int) | The maximum sequence length that the model can handle | All |
| mlm_loss_scaling | (str) | A string specifying the scaling factor type used for the language modeling loss. Accepts one of: | T5, Transformer |
| mlm_loss_weight | (float) | The weight for the masked language modeling loss used when scaling the loss with | BERT (pre-training) |
| nonlinearity | (str) | The non-linear activation function used in the feed-forward network in each transformer block. See list of non-linearity functions here. Some may have to use | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| num_heads | (int) | The number of attention heads in the multi-head attention layer | All |
| num_hidden_layers | (int) | Number of hidden layers in the Transformer encoder/decoder | All |
| output_layer_initializer | (str, optional) Default: varies based on model | The name of the initializer for the weights of the output layer. See supported initializers | GPT2, GPT3, GPTJ |
| position_embedding_type | (str) | The type of position embedding to use in the model. Can be one of: | All |
| relu_dropout_rate | (float) | The dropout rate for the ReLU activation function | T5, Transformer |
| residual_dropout_rate | (float) | The dropout rate for residual connections | GPTJ |
| rotary_dim | (int) | The number of dimensions used for the rotary position encoding. Must be an even number | GPTJ |
| share_embedding_weights | (bool) | Whether to share the embedding weights between the input and output embedding | All |
| share_encoder_decoder_embedding | (bool) | Whether to share the embedding weights between the encoder and decoder | T5, Transformer |
| src_vocab_size | (int) | The size of the source vocabulary. Max supported value: | T5, Transformer |
| tgt_vocab_size | (int) | The size of the target vocabulary. Max supported value: | T5, Transformer |
| use_bias_in_output | (bool) | Whether to use bias in the final output layer | GPT2, GPT3, GPTJ |
| use_dropout_outside_residual_path | (bool) Default: | Whether to set dropout calculations outside of the residual path | |
| use_ffn_bias | (bool) | Whether to use bias in the feed-forward network (FFN) | All |
| use_ffn_bias_in_attention | (bool) | Whether to include bias in the attention layer for feed-forward network (FFN) | All |
| use_position_embedding | (bool) | Whether to use position embedding in the model | GPT2, GPT3 |
| use_pre_encoder_decoder_dropout | (bool) | Whether to use dropout layer after positional embedding layer and encoder/decoder | T5, Transformer |
| use_pre_encoder_decoder_layer_norm | (bool) | Whether to use layer norm before passing input tensors into encoder/decoder | T5, Transformer |
| use_projection_bias_in_attention | (bool) | Whether to include bias in the attention layer for projection | All |
| use_t5_layer_norm | (bool) | Whether to use T5 layer norm (with no mean subtraction and bias correction) or use the regular | T5, Transformer |
| use_transformer_initialization | (bool) | The Transformer model tends to converge best with a scaled variant of Xavier uniform initialization used for linear layers. This contrasts with the initialization used in the original T5 paper, which uses the normal initialization for linear layers. Setting this flag to | T5, Transformer |
| use_untied_layer_norm | (bool) | Whether to use untied layer normalization | GPTJ |
| vocab_size | (int) | The size of the vocabulary used in the model. Max supported value: | All |
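To show how several of these fields fit together, here is a hedged sketch of a GPT-2-style `model` section. The sizes and values below are placeholders chosen for readability, not recommended or default settings:

```yaml
# Hypothetical GPT-2-style model section; all values are placeholders.
model:
  vocab_size: 50257
  hidden_size: 768
  num_hidden_layers: 12
  num_heads: 12
  filter_size: 3072               # feed-forward dimensionality per block
  nonlinearity: "gelu"            # assumed activation name
  max_position_embeddings: 1024
  dropout_rate: 0.1
  attention_dropout_rate: 0.1
  layer_norm_epsilon: 1.0e-5
  share_embedding_weights: True
  use_bias_in_output: False
  mixed_precision: True
```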
### Computer Vision models
| Parameter | Data type | Description | Supported Models |
| --- | --- | --- | --- |
| bias_initializer | (str) | Initializer for the bias | UNet |
| convs_per_block | (list) | List of conv specifications for each conv in the block | UNet |
| decoder_filters | (list) | List of filter sizes for each block in the decoder | UNet |
| downscale_bottleneck | (bool) | Whether to downsample the spatial dimensions in the UNet bottleneck block | UNet |
| downscale_encoder_blocks | (bool or list of bool) | Determines whether each block in the Encoder includes downsampling. The length of the list must correspond to the number of UNetBlocks in the Encoder. If a single bool is provided, all blocks will use this value | UNet |
| downscale_first_conv | (bool) | If True, the first convolution operation in each UNetBlock will be downscaled. If False, the last convolution in each UNetBlock will be downscaled | UNet |
| downscale_method | (str) | Downscaling method at the end of each block. One of | UNet |
| enable_bias | (bool) By default, bias will only be included when no normalization is used after the convolution layers | Whether to include a bias operation following convolution layers | UNet |
| encoder_filters | (list) | List of filter sizes for each block in the encoder | UNet |
| eval_ignore_classes | (list) | List of classes to ignore during evaluation of the model | UNet |
| eval_metrics | (list) | List of evaluation metrics to use during training and validation. Available options are accuracy ( | UNet |
| initializer | (str) | Initializer for the convolution weights. See supported initializers | UNet |
| input_channels | (int) | Number of channels in the input images to the model | UNet |
| loss | (str) | Loss type. Supported values: | UNet |
| nonlinearity | (str) | Activation function used in the model following convolutions in the encoder and decoder | UNet |
| norm_kwargs | (dict) | Args to be passed to norm layers during initialization. For | UNet |
| norm_layer | (str) | Type of normalization to be used. See supported norm layers | UNet |
| residual_blocks | (bool) | Flag for using residual connections at the end of each block | UNet |
| skip_connect | (bool) | Flag for whether the model concatenates encoder outputs to decoder inputs | UNet |
| use_conv3d | (bool) | Whether to use 3D convolutions in the model | UNet |
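As a structural illustration, the sketch below shows how some of the UNet parameters above might appear in a `model` section. Filter counts and flags are placeholders under the assumption of a small 2D UNet, not tuned values:

```yaml
# Hypothetical UNet model section; sizes and flags are placeholders.
model:
  input_channels: 1
  encoder_filters: [32, 64, 128, 256]   # one entry per encoder block
  decoder_filters: [128, 64, 32]        # one entry per decoder block
  downscale_encoder_blocks: True        # a single bool applies to all blocks
  skip_connect: True
  residual_blocks: False
  use_conv3d: False
  nonlinearity: "relu"
```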
## Data loader parameters

### Common
| Parameter Name | Data type | Description |
| --- | --- | --- |
| batch_size | (int) | The global (effective) batch size, i.e. the batch size used to calculate the loss and update weights for a single step |
| data_dir | (str) | Path(s) to the data files to use |
| data_processor | (str) | Name of the data processor to be used |
| mixed_precision | (bool) | Flag to cast input to fp16 |
| num_workers | (int) | Number of workers to use in the dataloader. See more |
| persistent_workers | (bool) | For a multi-worker dataloader, controls whether the workers are recreated at the end of each epoch (see PyTorch docs) |
| prefetch_factor | (int) | Number of samples loaded in advance by each worker |
| shuffle | (bool) | Flag to enable data shuffling |
| shuffle_buffer | (int) | Size of shuffle buffer in samples |
| shuffle_seed | (int) | Shuffle seed |
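These fields typically appear under the `train_input` (or `eval_input`) section of a params YAML. The following is a minimal sketch assuming that layout; the processor name and path are hypothetical placeholders:

```yaml
# Minimal sketch of a train_input section; processor name and path are placeholders.
train_input:
  data_processor: "MyDataProcessor"   # placeholder processor name
  data_dir: "./path/to/dataset"       # placeholder path
  batch_size: 256                     # global (effective) batch size
  shuffle: True
  shuffle_seed: 1
  num_workers: 8
  prefetch_factor: 10
  persistent_workers: True
```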
### Transformers
| Parameter Name | Data type | Description | Supported Models |
| --- | --- | --- | --- |
| do_lower | (bool) | Flag to lowercase the text | BERT (pre-training, fine-tuning), T5, Transformer |
| dynamic_loss_weight | (bool) | Flag to dynamically scale the loss. If set, it will divide the loss for a token by the length of the sequence that the token comes from. Use with | T5, Transformer |
| dynamic_mlm_scale | (bool) | Flag to dynamically scale the loss. If set, MLM loss is scaled by the number of masked tokens in the current batch using the | BERT (pre-training) |
| extra_ids | (int) | Number of sentinel tokens for the T5 objective | T5, Transformer |
| masked_lm_prob | (float) | Ratio of the masked tokens over the sequence length | BERT (pre-training) |
| max_predictions_per_seq | (int) | Maximum number of masked tokens per sequence | BERT (pre-training) |
| max_sequence_length | (int) | Maximum sequence length of the input data | All |
| src_data_dir | (str) | Path to directory containing all the files of tokenized data for the source sequence | T5, Transformer |
| src_max_sequence_length | (int) | Largest possible sequence length for the input source sequence. Longer sequences are truncated; all other sequences are padded to this length | T5, Transformer |
| src_vocab_file | (str) | Path to vocab file for source input | T5, Transformer |
| tgt_data_dir | (str) | Path to directory containing all the files of tokenized data for the target sequence | T5, Transformer |
| tgt_max_sequence_length | (int) | Largest possible sequence length for the input target sequence. Longer sequences are truncated; all other sequences are padded to this length | T5, Transformer |
| tgt_vocab_file | (str) | Path to vocab file for target input | T5, Transformer |
| vocab_file | (str) | Path to vocab file | BERT (pre-training, fine-tuning) |
| vocab_size | (int) | The size of the vocabulary used in the model | BERT (pre-training, fine-tuning) |
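For example, a BERT pre-training `train_input` section might combine these fields as sketched below. The data processor name, paths, and values are hypothetical placeholders, not a reference configuration:

```yaml
# Hypothetical BERT pre-training train_input section; names, paths, and values are placeholders.
train_input:
  data_processor: "BertDataProcessor"   # placeholder processor name
  data_dir: "./bert/train_data"         # placeholder path
  vocab_file: "./bert/vocab.txt"        # placeholder path
  vocab_size: 30522
  max_sequence_length: 512
  max_predictions_per_seq: 80
  masked_lm_prob: 0.15
  do_lower: True
  dynamic_mlm_scale: True
  batch_size: 256
  shuffle: True
```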
### Computer Vision
| Parameter Name | Data type | Description | Supported Models |
| --- | --- | --- | --- |
| aggregate_cartilage | (bool) | For the SKM-TEA dataset only. Combines medial and lateral classes into a single class | UNet |
| augment_data | (bool) | Apply data augmentation to the data | UNet |
| class_id | (int) | For the Severstal dataset, sets which class id is considered the positive class. All other classes are treated as negative examples | UNet |
| echo_type | (str) | For the SKM-TEA dataset only. Specifies the training data configuration. Allowed options are: | UNet |
| image_shape | (list) | Expected shape of output images in format (H, W, C) | UNet |
| normalize_data_method | (str) | Specifies the strategy to normalize the input data. One of: | UNet |
| num_classes | (int) | Number of classes in the training dataset | UNet |
| train_test_split | (float) | Percentage of data to be used in the training dataset | UNet |
| use_fast_dataloader | (bool) | If set to True, map-style datasets that use the UNetDataProcessor perform faster data processing | UNet |
| use_worker_cache | (bool) | If set to True, data will be read from local SSD memory on the individual worker nodes during training. If the data does not exist on the worker nodes, it will be automatically copied from the host node. This will cause a slowdown the first time this copy takes place | UNet |
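As an illustration, a UNet `train_input` section could look like the sketch below. The processor name, paths, and option values are assumptions shown only to convey the shape of the section; check the tables above for the accepted values:

```yaml
# Hypothetical UNet train_input section; processor name, paths, and option values are placeholders.
train_input:
  data_processor: "UNetDataProcessor"     # placeholder processor name
  data_dir: "./segmentation/data"         # placeholder path
  image_shape: [256, 256, 1]              # (H, W, C)
  num_classes: 2
  train_test_split: 0.9
  augment_data: True
  normalize_data_method: "zero_centered"  # placeholder strategy name
  batch_size: 32
  shuffle: True
  use_worker_cache: False
```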
## Optimizer parameters
| Parameter Name | Data type | Description |
| --- | --- | --- |
| initial_loss_scale | (float) | Initial loss scale to be used for gradient scaling |
| learning_rate | ( | Learning rate scheduler to be used. See supported LR schedulers |
| log_summaries | (bool) | Flag to log per-layer gradient norms in TensorBoard |
| loss_scaling_factor | ( | Loss scaling factor for gradient calculation in learning step |
| max_gradient_norm | (float) | Max norm of the gradients for learnable parameters. Used for gradient clipping |
| min_loss_scale | (float) | The minimum loss scale value that can be chosen by dynamic loss scaling |
| max_loss_scale | (float) | The maximum loss scale value that can be chosen by dynamic loss scaling |
| optimizer_type | (str) | Optimizer to be used. See supported optimizers |
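These parameters live under the `optimizer` section. The sketch below assumes a simple setup with a constant learning rate; the optimizer name and numeric values are illustrative placeholders, and `learning_rate` may instead take a scheduler specification:

```yaml
# Hypothetical optimizer section; optimizer name and values are placeholders.
optimizer:
  optimizer_type: "AdamW"   # placeholder; see supported optimizers
  learning_rate: 1.0e-4     # shown as a constant; may instead describe an LR scheduler
  max_gradient_norm: 1.0    # gradient clipping
  log_summaries: True       # log per-layer gradient norms to TensorBoard
```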
## Runconfig parameters
| Key | Data type | Description | Supported mode |
| --- | --- | --- | --- |
| autogen_policy | (str) | The autogen policy to be used for the given run | CSX |
| autoload_last_checkpoint | (bool) | Flag to automatically load the last checkpoint in the | All |
| check_loss_values | (bool) | Flag to check the loss values to see if it is | All |
| checkpoint_path | (str) | The path to load checkpoints from during training | All |
| checkpoint_steps | (int) | The number of steps between saving model checkpoints during training | All |
| compile_dir | (str) | Compile directory where compile artifacts will be written | All |
| compile_only | (bool) | Enables the compile-only workflow | All |
| credentials_path | (str) | Credentials for cluster access. If | CSX |
| debug_args_path | (str) | Path to debug args file | CSX |
| dist_addr | (str) | To initialize master_addr and master_port of distributed | GPU |
| dist_backend | (str) | Distributed backend engine | GPU |
| enable_distributed | (bool) | Flag to enable distributed training on GPU | GPU |
| enable_summaries | (bool) | Enable summaries when running on CS-X hardware | CSX |
| eval_frequency | (int) | Specifies the evaluation frequency during training. Only used for | All |
| eval_steps | (int) | Specifies the number of steps to run the model evaluation | All |
| experimental_api | (bool) | Flag to enable the experimental PyTorch API | CSX |
| init_method | (str) | URL specifying how to initialize the process group | GPU |
| is_pretrained_checkpoint | (bool) | Flag used in conjunction with | All |
| job_labels | (list) | A list of equal-sign-separated key-value pairs that serve as job labels | CSX |
| log_steps | (int) | Specifies the number of steps between logging during training. The same number controls the summary steps in TensorBoard | All |
| logging | (str) | Specifies the logging level during training | All |
| max_steps | (int) | Specifies the maximum number of steps for training | All |
| mgmt_address | (str) | The address of the management service used for coordinating the training job as | CSX |
| mode | (str) | The mode of the training job, either | All |
| model_dir | (str) | The directory where the model checkpoints and other metadata will be saved during training | All |
| mount_dirs | (list) | A list of paths to be mounted to the appliance containers. It should generally contain the path to the directory containing the Cerebras Model Zoo and the data dir | CSX |
| num_act_servers | (int) | Number of activation servers per CS-X dedicated to streaming samples to the WSE. Input workers stream data to these activation servers, and the activation servers hold and further stream the data to the WSE. For LLMs, we generally choose one because they are compute-bound. For CV models we choose a higher number; a crude rule of thumb is to have one activation server for every 4 workers (i.e. | CSX |
| num_csx | (int) | The number of CS-X systems to use in the Cerebras WSE cluster | CSX |
| num_epochs | (int) | The number of epochs to train for | All |
| num_steps | (int) | The number of steps to train for | All |
| num_wgt_servers | (int) | Upper bound on the number of MemoryX servers used for storing the model weights. Compilation may choose a smaller number depending on the model topology. A sensible upper bound (currently 24) is selected if a value is not provided | CSX |
| num_workers_per_csx | (int) | Number of input workers, per CS-X, to use for streaming samples. This setting depends on whether the model is compute-bound or input-bound and how efficient the dataloader implementation is. For compute-bound models (e.g., LLMs), even one input worker per CS-X is enough to saturate the input buffers on CS-X systems, but for smaller models a larger number may be used. We currently default to 1 worker per CS-X | CSX |
| precision_opt_level | (int) | Setting to control the level of numerical precision used for training runs for large NLP models. See more | CSX |
| python_paths | (list) | A list of paths to be exported into | CSX |
| save_initial_checkpoint | (bool) | Whether to save an initial checkpoint before training starts | All |
| save_losses | (bool) | Whether to save the loss values during training | All |
| seed | (int) | The seed to use for random number generation for reproducibility | All |
| steps_per_epoch | (int) | The number of steps per epoch | All |
| sync_batchnorm | (bool) | Whether to use synchronized batch normalization in a multi-GPU setup | GPU |
| target_device | (str) | The target device to run the training on. One of: | All |
| use_cs_grad_accum | (bool) | Whether to use gradient accumulation to support larger batch sizes | CSX |
| validate_only | (bool) | Enables the validate-only workflow, which stops compilation at the kernel matching stage | CSX |
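Finally, a `runconfig` section for a CS-X run might be structured as sketched below. The mode string, device name, paths, and step counts are assumptions chosen for illustration; consult the table above and the release documentation for the exact accepted values:

```yaml
# Hypothetical runconfig section for a CS-X run; values and paths are placeholders.
runconfig:
  mode: "train"            # assumed mode string
  target_device: "CSX"     # assumed device name
  num_csx: 1
  max_steps: 10000
  log_steps: 100
  checkpoint_steps: 1000
  save_initial_checkpoint: True
  model_dir: "./model_dir"
  mount_dirs: ["/path/to/modelzoo", "/path/to/data"]   # placeholder mount paths
  python_paths: ["/path/to/modelzoo"]                  # placeholder paths
  seed: 1
```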