Train an LLM using Maximal Update Parameterization#
Commonly, GPT-style models are trained using Standard Parametrization (SP). In SP, model weights are initialized from normal distributions with either a constant standard deviation or a standard deviation based on the shape of each layer. However, as models scale to larger parameter counts, SP does not account for potential inter-layer interactions, which can make training unstable. This instability can cause costly restarts. In addition, hyperparameters do not transfer well as models grow in size.
μP (Maximal Update Parameterization) addresses these challenges by enabling:
Stable training dynamics at large scale, by controlling the initialization, the activation magnitudes, and the layer-wise adaptive learning rates independently of model width
Zero-shot hyperparameter transfer from a smaller model to a larger model. Essentially, μP makes the model's hyperparameters invariant to width.
For example, the Cerebras-GPT family of models was trained using both SP and μP parametrizations. For the SP configurations, the hyperparameters were chosen from the best results found in the literature. For the μP configurations, μTransfer enabled scaling hyperparameters from a 40M parameter model to models up to 2.7B parameters, resulting in lower loss. These results highlight how μP can improve large-model scaling, accuracy, and hyperparameter predictability at scale.
Note
Appendix G in the Cerebras-GPT paper contains the learnings from the Cerebras team while using the μP parametrization, including implementation differences relative to the SP parametrization, μP hyperparameter search details, and advice for practitioners regarding the critical batch size.
In particular, we suggest looking at Table 14 (Cheat Sheet: All implementation details required to compare SP and μP) to identify the differences between the μP and SP parametrizations.
How to enable μP when training models with Cerebras#
Both SP and μP parametrizations are available in the Cerebras Model Zoo. Currently, only GPT-2 and GPT-3 style models support μP. All configuration files for the GPT-style models are SP parametrized unless named <name_config>_mup.yaml. Both types of config files can be executed using the run.py script of the preferred GPT-style model. Specify the path to the config file using the --config flag. For more information on launching a Cerebras job, visit our quickstart.
There are two main differences between the SP and μP configuration YAML files:
Structure: The μP parametrization has additional hyperparameters that scale internal layers, including element-wise scaling of activation tensors and layer-wise learning rate scaling for certain layers. These hyperparameters are specified in the config file as follows:
model:
    ...
    # muP
    scale_qk_dot_by_d: True
    output_logits_scale: ...
    embeddings_scale: ...

optimizer:
    ...
    adjust_learning_rate:
        decoder_kernel: ...
Initialization values: The μP parametrization requires adjusting the initializers for the affected layers. In addition, these hyperparameters are intended to be derived using the μTransfer approach instead of reproducing results found in the literature. In the next section, you will learn more about computing values with the μTransfer approach.
Note
To learn more about the mathematical differences between μP and SP, we suggest Table 14 (Cheat Sheet: All implementation details required to compare SP and μP) in Appendix G of the Cerebras-GPT paper.
You can find examples of SP and μP parametrizations in the Cerebras-GPT configurations for models of the same size, including the 2.7B parameter Cerebras-GPT model in both SP and μP parametrizations.
Transferring hyperparameters with μTransfer#
μTransfer is a hyperparameter transfer paradigm that makes zero-shot transfer of near-optimal hyperparameters possible from a small version of the model to a large model via μP (Maximal Update Parameterization). The “small model” is called the proxy-model for which the hyperparameters are tuned, and the “large model” is referred to as the target-model.
First, you will tune the nearly optimal hyperparameters of the proxy-model. In particular, the optimal learning rate, the initialization standard deviation, and the embedding multiplier are relevant for deriving scaling formulas for the target-model. As an example, the Cerebras-GPT family used a 200-sample random hyperparameter search on a 40M parameter proxy-model trained on 600M tokens with a batch size of 131k tokens. You can find results and details on the optimal hyperparameters in Appendix G of the paper.
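To make this step concrete, here is a minimal sketch of a random hyperparameter search over the three hyperparameters that are later used for μTransfer. The search ranges and selection logic are hypothetical illustrations, not the ones used for Cerebras-GPT; the actual sweep and its results are described in Appendix G of the paper.

import random

# Hypothetical search ranges for the three hyperparameters that muTransfer relies on.
def sample_trial():
    return {
        "learning_rate": 10 ** random.uniform(-3.5, -1.5),  # log-uniform learning rate
        "initial_std": random.uniform(0.02, 0.2),            # initialization standard deviation
        "embeddings_scale": random.uniform(1.0, 20.0),       # embedding multiplier
    }

# Draw 200 random configurations, as in the Cerebras-GPT sweep; each one would be used
# to train the 40M parameter proxy-model, and the configuration with the lowest
# validation loss would be kept for muTransfer.
trials = [sample_trial() for _ in range(200)]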
Second, you will scale up the hyperparameters based on the size of the target-model. For this, it is helpful to identify the different types of hyperparameters in a GPT-style model:
μTransferred Across | μTransferable | Not μTransferable
---|---|---
They define the training scale | They can transfer from the small to the large model | They do not work with μTransfer, because they depend on model size and data size
Theoretically demonstrated: width. Empirically demonstrated: depth, batch size, training time, sequence length | Optimization-related (learning rate (LR), momentum, Adam beta, LR schedule, etc.), initialization (per-layer init. variance), parameter multipliers (multiplicative constants after weights/biases) | Regularization (dropout, weight decay, etc.)
Scaling the hyperparameter values from the proxy-model to the target-model is defined by a series of equations that depend on the layer width multiplier \(m_{width}=d_{target}/d_{proxy}\), where \(d\) stands for the width of each layer. For example, let's use the 40M parameter Cerebras-GPT model as the proxy-model and the 2.7B parameter Cerebras-GPT model as the target-model. The 40M model has a hidden size of 256, and the 2.7B model has a hidden size of 2560. Thus \(d_{proxy}=256\), \(d_{target}=2560\), and \(m_{width}=d_{target}/d_{proxy}=10\). These equations are found in Table 14, Appendix G of the Cerebras-GPT paper.
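To make the width-multiplier arithmetic concrete, the sketch below computes \(m_{width}\) for this 40M to 2.7B example and applies the commonly cited μP width-scaling rules (layer-wise learning rates and initialization variance shrink with width, while output logits are scaled down). Treat these formulas as an illustration of the general μP recipe; the authoritative versions used for Cerebras-GPT are the ones in Table 14.

import math

# Widths from the example above: 40M proxy-model vs. 2.7B target-model.
d_proxy, d_target = 256, 2560
m_width = d_target / d_proxy  # = 10.0

# Hyperparameters tuned on the proxy-model (the defaults reported for Cerebras-GPT).
base_lr = 6e-3
base_init_std = 0.08
embeddings_scale = 10.0  # transferred as-is; it does not depend on width

# Illustrative muP scaling for matrix-like (decoder kernel) parameters:
hidden_lr = base_lr / m_width                         # layer-wise LR scales as 1/m_width
hidden_init_std = base_init_std / math.sqrt(m_width)  # init variance scales as 1/m_width
output_logits_scale = 1.0 / m_width                   # output logits are damped by 1/m_width

print(m_width, hidden_lr, hidden_init_std, output_logits_scale)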
For convenience, Cerebras Model Zoo contains the script convert_config_to_mup.py to scale GPT-3 configurations based on a target-model configuration YAML file. This script has the following parameters:
Parameter | Description
---|---
--input_yaml | [Required] Configuration YAML file of the target-model. This can be a standard configuration file, like the ones found in Cerebras Model Zoo
 | [Optional] Proxy-model's width, defaults to 256
 | [Optional] Proxy-model's learning rate determined by the hyperparameter sweep, defaults to 6e-3. Currently, we support config generation for sequential Linear learning rate schedules: the first LR scheduler should perform linear warm-up and the second should perform linear decay.
 | [Optional] Proxy-model's initial standard deviation, defaults to 0.08
 | [Optional] Proxy-model's embeddings multiplier, defaults to 10.0
 | [Optional] Output YAML file to save the μP config. If not provided, the config will be stored under the same path as the input but with a _mup suffix
The default values of these parameters are based on the findings in the Cerebras-GPT paper, with a proxy model of 40M parameters.
For example, you can use this script with a configuration file of a GPT-3 model with 2.7B params, called params_gpt3_2p7b.yaml, and all of the default arguments as follows:
$ python convert_config_to_mup.py --input_yaml </path/to/config>/params_gpt3_2p7b.yaml
muP config saved to </path/to/config>/params_gpt3_2p7b_mup.yaml
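As a quick, illustrative sanity check, you can load the generated YAML and print the μP-specific fields discussed earlier. This assumes PyYAML is available and that the generated file follows the model/optimizer layout shown above; the exact structure of the generated config may differ.

import yaml

# Config produced in the example above.
with open("params_gpt3_2p7b_mup.yaml") as f:
    cfg = yaml.safe_load(f)

# muP-specific model hyperparameters.
for key in ("scale_qk_dot_by_d", "output_logits_scale", "embeddings_scale"):
    print(key, "=", cfg["model"].get(key))

# Layer-wise learning-rate scaling for the decoder kernels.
print("adjust_learning_rate =", cfg["optimizer"].get("adjust_learning_rate"))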