Control numerical precision level#
Overview#
The Cerebras cluster currently supports a limited set of data types for model training. Specifically, it supports training neural network models with float16, cbfloat16, or bfloat16 input data types, while arithmetic operations use float32 precision. Using mixed precision (float16/cbfloat16/bfloat16 inputs with float32 operations) increases the computational efficiency of training, especially when combined with dynamic loss scaling. When bfloat16 is the selected input data type, dynamic loss scaling is unnecessary. In summary, to maximize the performance of models trained on the Cerebras cluster, enable float16, cbfloat16, or bfloat16 inputs alongside float32 arithmetic operations.
Users can control the numerical precision settings used when training large natural language processing (NLP) models on this system. Specifically, parameters specify the precision of the floating-point representation used for model weights, activations, and gradients during optimization. Collectively, these settings are referred to as the Precision and Optimization Level (POL).
Setting the numerical precision#
Configure the precision_opt_level flag within the YAML configuration for all models integrated into the Cerebras Model Zoo. You can locate the precision_opt_level flag within the runconfig section:
runconfig:
  precision_opt_level: ...
The precision_opt_level flag allows users to configure the numerical precision of operations during model training to achieve different precision and performance tradeoffs.
Setting precision_opt_level to 0 utilizes single-precision (float32) accumulations to ensure higher precision.
A precision_opt_level of 1 represents an optimal tradeoff between precision and performance. It employs a combination of float32 and bfloat16/cbfloat16/float16 reductions in matrix multiplication and attention layers to maximize performance while maintaining adequate precision to ensure model convergence.
A precision_opt_level of 2 is a high-performance setting, combining float32 and bfloat16/float16 reductions in attention and matrix multiplication kernels, along with bfloat16/float16 softmax in attention, delivering the best performance.
The default configuration for large language models in the Cerebras Model Zoo is precision_opt_level: 1 with the use of cbfloat16, which ensures convergence with high performance. For models whose input activations are in 16-bit, the default is to use bfloat16.
Note
A large number of Cerebras models have been trained using “precision_opt_level: 1”. This setting is recommended to obtain the best training performance while retaining good convergence behavior. In the event of significant numeric issues during training, consider the more precise “precision_opt_level: 0” setting.
Additionally, it is worth noting that “precision_opt_level: 2” is primarily used as an internal tool by Cerebras to evaluate the future performance of its hardware.
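As an illustration of the note above, a minimal runconfig sketch that sets the precision level explicitly is shown below. The value 1 is the recommended default, and all other runconfig keys are omitted here:

runconfig:
  precision_opt_level: 1   # switch to 0 if significant numeric issues arise during training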
Supported precisions on Cerebras systems#
The CS system offers support for the following data formats:
32-bit Floating-point format: This format is IEEE single-precision - commonly known as FP32
16-bit Floating-point format: This format is IEEE half-precision - commonly known as FP16
Cerebras 16-bit floating point: Specific to the Cerebras Wafer-Scale engine - commonly known as cbfloat16
Custom 16-bit floating point: bfloat16 has eight exponent bits and is specifically designed for deep-learning applications
Note that 16-bit arithmetic in the CS system uses 16-bit words and is always aligned to a 16-bit boundary. On the other hand, single-precision arithmetic uses even-aligned register pairs for register operands and requires 32-bit aligned addresses for memory operands.
Note
In the CS system, memory is 16-bit word addressable. It is not byte addressable, so you cannot directly access individual bytes within a word.
FP32 Single-Precision#
FP32, as used in the CS system, is equivalent to IEEE binary32, also known as single-precision floating-point format. In this format, 8 bits are allocated for the exponent and 23 bits for the explicit mantissa.
Sign: 1 | Exponent: 8 | Mantissa: 23
FP16#
The FP16 implementation in the CS system adheres to the IEEE standard for binary16, commonly known as half-precision floating-point format. In this format, there are 5 bits reserved for the exponent and 10 bits for the explicit mantissa.
Sign: 1 | Exponent: 5 | Mantissa: 10
cbfloat16#
cbfloat16 is a custom 16-bit floating-point format available on the Cerebras Wafer-Scale engine that further optimizes the tradeoff between range and precision for neural network training. It uses 9 mantissa (fraction) bits and 6 exponent bits. It has double the range of the FP16 format and significantly more precision (two additional bits) than the BF16 format.
Sign: 1 | Exponent: 6 | Mantissa: 9
Note
There is currently no support for cbfloat16 in Python or PyTorch. For this reason, we use float16 as a “proxy type” in these higher-level representations, which is then converted to cbfloat16 as part of the compilation process. Using float16 as the proxy type avoids any loss of precision since it is more precise than cbfloat16; however, we lose some range representation. Extensive testing has shown this not to be an issue.
bfloat16#
The bfloat16 type is a custom 16-bit floating-point format for deep learning, comprising a sign bit, 8 exponent bits, and 7 mantissa bits. This is different from the industry-standard IEEE 16-bit floating point, which was not designed with deep learning applications in mind. bfloat16 has a range similar to the FP32 format, but with much less precision due to the constraints of the 16-bit format.
Sign: 1 | Exponent: 8 | Mantissa: 7
Automatic Mixed Precision#
Automatic mixed precision is a mode that enables the training of deep learning models using a combination of single-precision floating-point (float32) and half-precision floating-point formats, such as FP16, cbfloat16, or bfloat16.
The primary advantages of the mixed precision mode are centered on performance. It is an optimization technique that allows for faster network training without sacrificing quality. This efficiency stems from the fact that some layers within neural networks, such as convolutional or linear layers, can be computed with lower precision. These layers have been shown to be significantly faster when executed with FP16, cbfloat16, or bfloat16. However, certain operations, like reductions, often require a higher precision level to maintain the same level of quality in results.
This trade-off between casting certain operations to half-precision and maintaining others in single precision is part of the “automatic mixed precision algorithm.” In essence, this algorithm assesses the network’s performance in its default precision, then strategically introduces castings to execute the same network with mixed precision settings to optimize performance without compromising accuracy.
It is important to note that mixed precision does not mandate the use of a specific half-precision floating-point format. However, there are tradeoffs to consider in this choice, which we discuss below.
Automatic Mixed Precision and cbfloat16#
The use of the cbfloat16 half-precision format further improves performance over bfloat16, as the additional precision allows it to be used for certain types of reductions.
Performance: cbfloat16 is approximately 19% faster than bfloat16.
Weight Growth: cbfloat16 shows similar weight growth as bfloat16.
Evaluation Scores: cbfloat16 demonstrates similar evaluation scores as bfloat16.
The use of the cbfloat16 half-precision format provides similar benefits as bfloat16 at higher performance. However, due to its restricted range compared to bfloat16, it is recommended that loss scaling be enabled to ensure that the format captures the numeric range of interest on the backward pass.
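In Model Zoo YAML configurations, loss scaling is typically enabled from the optimizer section. The following is a minimal sketch assuming the loss_scaling_factor key used by Model Zoo reference configurations; verify the exact key against the reference config for your model:

optimizer:
  loss_scaling_factor: dynamic   # dynamic loss scaling covers cbfloat16's narrower range on the backward pass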
Automatic Mixed Precision and bfloat16#
Our experiments with a variety of deep learning networks comparing bfloat16 and FP16 modes yielded valuable insights. It is clear that bfloat16 offers several advantages:
Performance: bfloat16 is approximately 18% faster than FP16.
Weight Growth: bfloat16 is significantly less prone to weight growth.
Evaluation Scores: bfloat16 demonstrates improved evaluation scores.
These findings underscore the benefits of choosing bfloat16 over pure float32 or a mixed version with float16. bfloat16 enhances training efficiency, conserves memory, and preserves the same level of accuracy. This is primarily because deep learning models are generally more sensitive to changes in the exponent than in the mantissa of floating-point numbers.
Furthermore, training with the bfloat16 setting proves to be more robust and less susceptible to issues like underflows, overflows, or other numerical instabilities during training, much like training with the pure float32 dtype. This enhanced stability is attributed to the fact that the exponent size of bfloat16 matches that of float32, providing a balance between precision and performance.
How to Enable FP16, cbfloat16, and bfloat16#
To enable FP16, cbfloat16, or bfloat16 in the mixed precision mode, you can make the following changes in the configuration file:
model:
  fp16_type: one of {float16, bfloat16, cbfloat16}
  mixed_precision: True
When using fp16_type: bfloat16, loss scaling is not necessary: the format has the same range as single precision and behaves identically to single precision with respect to underflows, overflows, or any other numeric instability during training. However, when using fp16_type: float16 or fp16_type: cbfloat16, loss scaling is recommended to ensure that the values in the backward pass are captured correctly within the range specific to those formats. The Cerebras stack will throw an error if fp16_type: cbfloat16 is used and loss scaling is not enabled.
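Putting the pieces together, a minimal sketch of a cbfloat16 configuration with dynamic loss scaling might look like the following. As above, the loss_scaling_factor key reflects the convention used in Model Zoo reference configurations and should be checked against your model's config:

model:
  fp16_type: cbfloat16
  mixed_precision: True
optimizer:
  loss_scaling_factor: dynamic   # required for cbfloat16; not needed for bfloat16

With fp16_type: bfloat16, the loss scaling line can be omitted, since bfloat16 shares float32's exponent range.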
To experiment with networks using this setting, you can refer to the specific references provided for the gpt2, gpt3, and gptj models in the Model Zoo.
Conclusion#
Controlling the numerical precision level during training on the Cerebras Wafer-Scale cluster is a critical aspect of optimizing performance and efficiency. The support for float16, cbfloat16, and bfloat16 data types, in conjunction with float32 arithmetic operations, enables users to leverage mixed precision training effectively. By configuring the precision_opt_level and selecting the appropriate floating-point representation, users can achieve a balance between computational speed and model accuracy. The introduction of cbfloat16 and bfloat16 offers a nuanced choice between precision and performance, with cbfloat16 providing a unique advantage in terms of speed and precision balance. Through careful configuration and understanding of each precision type's characteristics, users can tailor their training process to suit their specific needs, leading to faster training times without compromising the quality of the model outputs.