Running autoregressive inference#
Overview#
The Cerebras Wafer-Scale Engine cluster enables autoregressive inference for these large language models from the Model Zoo: GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, and Santacoder. This allows you to generate text continuations for multiple prompt inputs in a batch, facilitating model evaluation downstream. Autoregressive generation is performed greedily, picking the highest probability token at each step.
Preparing input data for autoregressive inference#
To enable batched autoregressive inference, input prompts must be:
- Tokenized into IDs based on the model vocabulary
- Saved in a single .h5 file containing one data tensor called data; this tensor should have shape (num_samples, max_seq_len)

In addition:
- Select a start_token ID not used in any prompt text (the token ID may be outside the model’s vocabulary size)
- Append start_token after each prompt sequence
- Tokens after start_token can be arbitrary padding (using additional start_token IDs is recommended); see the sketch below
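As a concrete illustration, the sketch below builds such an input file with h5py. The tokenizer choice, the prompts, the start_token ID, and the sequence length are assumptions for illustration only; substitute values that match your model.

import h5py
import numpy as np
from transformers import AutoTokenizer  # assumption: a Hugging Face tokenizer matching your model

MAX_SEQ_LEN = 128    # assumption: the sequence length expected by your model
START_TOKEN = 50257  # assumption: an ID that never appears in any tokenized prompt (may be outside the vocabulary)

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # hypothetical tokenizer
prompts = ["Hello, my name is", "The capital of France is"]  # hypothetical prompts

# Fill the tensor with start_token so that the position right after each prompt,
# and all remaining padding positions, hold start_token.
data = np.full((len(prompts), MAX_SEQ_LEN), START_TOKEN, dtype=np.int32)
for i, prompt in enumerate(prompts):
    ids = tokenizer.encode(prompt)[: MAX_SEQ_LEN - 1]  # leave room for start_token after the prompt
    data[i, : len(ids)] = ids

# Single .h5 file with one tensor called data, shaped (num_samples, max_seq_len)
with h5py.File("inference_input.h5", "w") as f:
    f.create_dataset("data", data=data)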
Setting Inference Parameters#
Data Location and Batch Size#
In addition to preparing the input data, the inference configuration requires specifying the following in a section called inference_input:
- data_processor: Must be set to GptHDF5MapDataProcessor
- data_dir: Path to the directory containing the .h5 input file
- batch_size: Number of samples to process simultaneously
To run autoregressive inference, add the following section to your model’s params.yaml file:

inference_input:
    data_processor: "GptHDF5MapDataProcessor"
    data_dir: "./path/to/your/data/directory"
    batch_size: 60
The batch size does not need to evenly divide the total number of samples. The last batch will be padded with dummy samples so that every real input sample is processed. For example, if the batch size is 100 and you are inferring for 597 samples, the last batch will contain 97 real samples and three dummy padding samples.
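As a quick sketch of that arithmetic:

num_samples = 597
batch_size = 100
num_batches = -(-num_samples // batch_size)         # ceiling division: 6 batches
num_dummy = num_batches * batch_size - num_samples  # 3 dummy padding samples in the last batch
print(num_batches, num_dummy)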
When configuring autoregressive inference, it is recommended to start with the same batch size (and, if applicable, micro-batch size) that was used during evaluation. This provides a baseline for performance and resource utilization. Because the optimal batch size varies with the model and hardware configuration, you may need to experiment to find the batch size and micro-batch size that maximize performance while fitting within your device’s memory constraints. For a more systematic approach, refer to the automatic batch exploration guide, which walks through the steps for determining an effective batch configuration.
Model Parameters for Inference#
Additional keys must be added to the model section of params.yaml for autoregressive inference (a sketch showing these keys follows the list below):
- start_token - ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID; in that case, the model will start inference at the first token that matches any one of the provided IDs, and will pad inferred predictions with the first ID in the list.
- stop_sequences - List of sequences, each one a list of token IDs. If any one of these sequences is emitted by the model, inference will stop for that sample. For example, suppose you would like to stop inferring after either a newline character (e.g. token ID 1), or a combination of a period (e.g. token ID 2) followed by a space (e.g. token ID 3). In this case, set stop_sequences to [[1], [2, 3]]. To stop inferring after a newline character only, set stop_sequences to [[1]]. To disable this feature, set stop_sequences to an empty list [].

Additionally, the following optional parameters may be set:
- max_tokens - Maximum number of tokens to infer for each sample
- loop_dim - Indicates the sequence dimension in the input and output data. The default value is 1. If set to 0, it indicates that both input and output data are transposed (i.e. sequence X samples instead of samples X sequence).
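As a sketch of how these keys fit into the model section, the snippet below updates params.yaml with PyYAML; the token IDs and limits shown are hypothetical and must be replaced with values from your own tokenizer and use case.

import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# All values below are hypothetical illustrations, not defaults.
params["model"].update({
    "start_token": 50257,             # or a list of IDs, e.g. [50257, 50258]
    "stop_sequences": [[1], [2, 3]],  # stop on token 1, or on token 2 followed by token 3
    "max_tokens": 256,                # optional: cap on tokens inferred per sample
    "loop_dim": 1,                    # optional: sequence dimension (1 is the default)
})

with open("params.yaml", "w") as f:
    yaml.safe_dump(params, f)

Editing params.yaml by hand to add the same keys works just as well; the snippet only shows where the keys live.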
Running Autoregressive Inference#
To launch an autoregressive inference run, use the run_gpt_inference.py script. It is very similar to the run.py script normally used to launch training or evaluation, and supports similar parameters, for example:
python modelzoo/common/run_gpt_inference.py CSX \
--params params.yaml \
--model_dir model_dir \
--mount_dirs {paths to modelzoo and data} \
--python_paths {paths to modelzoo and other python code if used} \
--checkpoint_path {path to checkpoint of trained weights}
A few important points to note:
- The mode parameter should not be provided.
- The optional inference_steps parameter is supported. If it is provided, inference will run only for the number of batches specified by inference_steps, rather than for your entire dataset. This is useful for validating your flow before starting a large inference job.
- The compile_only parameter is supported as usual, and may be useful when picking the best batch size for the job, to validate that compilation with a given batch size fits on the device.
Accessing Inference Predictions#
Model predictions (i.e. the output of inference) will be available in the artifacts directory under your model directory. They will appear in a directory called predictions, in NumPy files named predictions_X.npz, where X is the 1-based batch number (i.e. the global step counter).
These files appear gradually during the run, so there is no need to wait until inference for the entire dataset has completed in order to access predictions for batches that have already been inferred.
Each .npz file contains:
- global_step: Scalar batch counter
- predictions: Batch outputs of shape (batch_size, max_seq_len)
Each sample in the batch will contain the prompt, immediately followed by the results of inference (without start_token). Note that for each sample, inference stops when any of the following occurs:
- The maximum sequence length is reached
- Any one of the specified stop_sequences is emitted by the model (in which case it will be present in the output data)
- The max_tokens limit is reached for the given sample
The model will pad any tokens after the end of the inferred text with the start_token ID (or, if you provide multiple IDs, with the first ID in the list). In the case of dummy samples for partial batches (e.g. the last three samples when inferring 597 samples with batch size 100), the dummy samples will consist solely of start_token in the output.
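As an illustrative sketch, a finished batch could be loaded and decoded as follows; the tokenizer, start_token ID, and file path are assumptions carried over from the earlier examples and should be adjusted to your setup.

import numpy as np
from transformers import AutoTokenizer  # assumption: the same tokenizer used to prepare the input

START_TOKEN = 50257  # assumption: the start_token ID (or the first ID in the list) from params.yaml
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer

# Hypothetical path: locate the predictions directory under your model directory's artifacts.
pred_file = "model_dir/predictions/predictions_1.npz"
batch = np.load(pred_file)
print("global_step:", batch["global_step"])

for row in batch["predictions"]:  # shape (batch_size, max_seq_len)
    token_ids = [t for t in row.tolist() if t != START_TOKEN]  # drop start_token padding
    print(tokenizer.decode(token_ids))  # prompt followed by the inferred continuation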
Conclusion#
The Cerebras Wafer-Scale Engine cluster provides a powerful platform for running autoregressive inference on a selection of large language models, enabling efficient text generation and model evaluation. The process involves preparing input data in the expected HDF5 format and configuring the inference parameters in the model’s params.yaml file so that batched prompts are processed correctly, with batch size and other settings tuned to your hardware. Because predictions become available incrementally during the run, you can inspect results early and iterate on your models quickly. This streamlined approach to autoregressive inference on the Cerebras system underscores its capability to handle complex, large-scale language models, providing a robust toolset for advanced text generation tasks.