.. _sw-release-notes:

Software Release Notes
======================

The following are the release notes for the Cerebras software.

.. _v1-7-0:

Release 1.7.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Computer vision models
^^^^^^^^^^^^^^^^^^^^^^

* Added support for UNet 2D in Weight Streaming execution, enabling large input sizes of up to 5k x 5k resolution. Both training and eval workflows are supported. Single CS-2 system support only. Single-channel, single-class segmentation only in this release.

Large language models
^^^^^^^^^^^^^^^^^^^^^

* Added support for GPT-J-style models up to 20B parameters in PyTorch, e.g., GPT-J 6B and GPT-NeoX 20B. More example configs can be found in the `Cerebras Model Zoo <https://github.com/Cerebras/modelzoo>`_.
* Added new GPT-3 variants up to 20B parameters, in both TensorFlow and PyTorch.
* Improved performance (throughput) for GPT-style models by up to 1.5x compared to the previous release.

Other features
^^^^^^^^^^^^^^

* Added support for the bfloat16 data type for large language models in PyTorch and enabled precision knobs for performance optimization (see "Precision optimization level" in :ref:`performance-optimization` for more details).
* Added support for training language models with sparse weights in PyTorch; training with static sparsity masks is now enabled, and scripts to convert a dense PyTorch checkpoint into a sparse version are available (see the sketch below).
* Improved error messages for easy-to-understand, user-actionable errors.
* Expanded support for PyTorch learning rate schedules and loss functions.
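To illustrate the idea behind the dense-to-sparse checkpoint conversion mentioned above, here is a minimal sketch in plain PyTorch. It is not the Cerebras conversion script; the function name and the magnitude-pruning policy are illustrative assumptions:

.. code-block:: python

    import torch

    def sparsify_checkpoint(dense_path, sparse_path, sparsity=0.9):
        """Illustrative only: zero out the smallest-magnitude weights so the
        checkpoint carries a static sparsity mask (90% zeros by default)."""
        state_dict = torch.load(dense_path)
        for name, tensor in state_dict.items():
            if not torch.is_floating_point(tensor) or tensor.dim() < 2:
                continue  # skip biases, norm params, and non-float entries
            k = max(1, int(tensor.numel() * sparsity))
            threshold = tensor.abs().flatten().kthvalue(k).values
            state_dict[name] = torch.where(
                tensor.abs() > threshold, tensor, torch.zeros_like(tensor)
            )
        torch.save(state_dict, sparse_path)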
Known issues
~~~~~~~~~~~~

Running eval with UNet in PyTorch, Weight Streaming execution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Running UNet eval with the mIOU and DSC metrics on images of size 4096 x 4096 pixels fails; this is a known issue. All default configurations and image sizes published in the Model Zoo have been tested and are expected to work without issues. The issue may manifest with other, untested image sizes. The expected error is as follows:

  .. code-block:: bash

      status = StatusCode.INTERNAL
      details = "KM to RT IR Translation Failed"
      debug_error_string = "UNKNOWN:Error received from peer ipv4:10.254.104.16:443 {grpc_message:"KM to RT IR Translation Failed", grpc_status:13, created_time:"2022-12-15T00:27:47.134671257-08:00"}"

  Please contact Cerebras support if you run into failures.

.. _v1-6-1:

Release 1.6.1
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weight Streaming models in PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Early access support for GPT model variants, up to 1.5B parameters, in PyTorch on single CS-2 systems.

Security-related updates and bug fixes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Weight streaming jobs are now restricted to run with the respective uid/gid security context only; running as the root user is disallowed.
* Creation of volume mounts is now limited to worker nodes only, to avoid security issues from broader access.
* Added a new "KeepAlive" feature to the coordinator node in the appliance, which tracks client activity and prevents jobs from hanging indefinitely.
* The pipeline workflow in the appliance now uses the scheduler to choose the CS system automatically and no longer requires the user to specify ``cs_ip`` explicitly.

.. _v1-6-0:

Release 1.6.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weight streaming models
^^^^^^^^^^^^^^^^^^^^^^^

* Up to 2x performance improvement for GPT-style models with weight streaming training on CS-2 systems.
* Early access support for additional GPT model variants, up to 20B parameters, on single CS-2 systems.

Cerebras Wafer-Scale Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* This is the first production release of software support for weight streaming on the Cerebras Wafer-Scale Cluster. It includes a simple workflow that allows users to easily submit and scale large training jobs to a Cerebras cluster. Refer to our TensorFlow and PyTorch getting started guides on how to run your pipeline and weight streaming models on the cluster.

Multi-node CS-2 support
^^^^^^^^^^^^^^^^^^^^^^^

* Expands the Cerebras Wafer-Scale Cluster to support 8x CS-2 systems on GPT-style models, achieving near-linear scaling performance.

PyTorch support
^^^^^^^^^^^^^^^

* Introduced the Cerebras PyTorch Layer API, which implements a subset of the PyTorch APIs with Cerebras custom implementations that take advantage of our high-performance kernels. It provides a drop-in replacement for the native PyTorch versions; see the sketch below.
* Includes a demo GPT-2 model implemented with the Layer API.
* Added support for label-smoothed cross entropy for pipelined models in PyTorch.
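The drop-in intent of the Layer API is illustrated by the following sketch. The import path and layer name are assumptions for illustration only; consult the Model Zoo for the actual module layout:

.. code-block:: python

    # Native PyTorch:
    #     from torch.nn import MultiheadAttention
    # Cerebras Layer API equivalent (hypothetical path), designed to keep the
    # same constructor and call signature while dispatching to Cerebras kernels:
    from cerebras.pytorch.layers import MultiheadAttention  # illustrative import

    attn = MultiheadAttention(embed_dim=768, num_heads=12)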
Cerebras Model Zoo
^^^^^^^^^^^^^^^^^^

* Launched a public version of Cerebras sample model implementations through the `Cerebras Model Zoo GitHub repository <https://github.com/Cerebras/modelzoo>`_.

.. _v1-5-0:

Release 1.5.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weight streaming
^^^^^^^^^^^^^^^^

* Significant performance boost in the evaluation of accuracy for large-scale GPT models. Evaluation performance is now on par with training performance.
* Added support for vocab sizes up to 2\ :sup:`31` for GPT-style models when run with weight streaming execution.

Multi-node CS-2 support
^^^^^^^^^^^^^^^^^^^^^^^

* Expands the Cerebras Wafer-Scale Cluster to support 4x CS-2 systems on GPT-style models.
* Cerebras Wafer-Scale Clusters are composed of CS-2 systems, MemoryX and SwarmX nodes, input pre-processing servers, and associated internal network switches. The end-user workflow is supported via the appliance model, where the user submits a job to the cluster as if it were a single device.

.. note::

   To learn more and get a demo of our cluster capabilities and appliance workflow, contact Cerebras support by sending email to support@cerebras.net.

.. note::

   CSoft 1.5 deprecates some of our experimental models that were brittle and not suitable to run with many variants of model implementation. The deprecated models are RNN, GNN, CTR, Audio, and RevBERT, which were previously supported in demo mode.

.. _v1-4-0:

Release 1.4.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weight Streaming
^^^^^^^^^^^^^^^^

- A single CS-2 system now supports training of multi-billion-parameter NLP models, including GPT-3 XL 1.3 billion, GPT-J 6 billion, GPT-3 13 billion, and more, via the weight streaming mode of execution.
- Training of GPT-J (6B parameters) is now supported for longer sequences, up to 30K tokens, on a single CS-2 system.
- Switching between these extreme-scale models can be achieved with just a few changes in the config file.

PyTorch models
^^^^^^^^^^^^^^

- Performance improvements for small- to medium-sized PyTorch models through support for Multi-Replica (MR) + Variable Tensor Shape (VTS):

  - Multi-replica improves fabric utilization and throughput performance.
  - VTS improves performance for datasets with variable-length sequences.
  - The following models now support MR + VTS: BERT, Transformer (AIAYN), and T5.

- Added Adafactor optimizer support to the T5 model in PyTorch to achieve robust convergence.

TensorFlow models
^^^^^^^^^^^^^^^^^

- Added multi-replica mode and variable sequence length support to the T5 model in pipeline execution to further boost performance on the CS-2 system.

Multi-node CS-2 support
^^^^^^^^^^^^^^^^^^^^^^^

- This release introduces a 2x CS-2 cluster in demo mode that leverages a weight streaming support cluster (composed of CS-2 systems, MemoryX and SwarmX systems, worker servers, and associated internal network switches) to efficiently run extreme-scale NLP models.
- The end-to-end user workflow can be thought of, and handled, as a single network-attached service appliance. In this appliance model, the user submits a job to the weight streaming cluster as if it were a single device.

.. note::

   To learn more and get a demo of our cluster capabilities and appliance workflow, contact Cerebras support by sending email to support@cerebras.net.

Known issues
~~~~~~~~~~~~

Running eval on Cerebras system
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running in ``eval`` mode on a CS-1 system, if the ``--nodes`` and ``--tasks_per_node`` value pairs are **not** set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

1. ``--nodes=1 --tasks_per_node=2``, or
2. ``--nodes=2 --tasks_per_node=1``

- **Workaround**: Make sure that you use one of the above settings for ``--nodes`` and ``--tasks_per_node``. For example:

  .. code-block:: bash

      --nodes=1 --tasks_per_node=2

  The ``eval`` performance is not affected by these Slurm resource settings. See the example command below:

  .. code-block:: bash

      csrun_wse python run.py --mode=eval \
          --nodes=1 --tasks_per_node=2 \
          --params configs/your-params-file.yaml \
          --model_dir your-model-dir \
          --cs_ip=10.255.253.0

.. _v1-3-0:

Release 1.3.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch models
^^^^^^^^^^^^^^

- Supports Variable Tensor Shapes (VTS) for the Transformer, T5, and BERT models, which boosts performance significantly.
- Added support for BERT fine-tuning tasks: SQuAD (Q&A), Classifier (SST), and Summarization (SUM).
- Supports fixed positional embeddings.
- Upgraded to PyTorch 1.11.

Weight streaming mode
^^^^^^^^^^^^^^^^^^^^^

- The GPT-J 6B-parameter model in TensorFlow is supported for pretraining on a single CS-2 system.
- The abstractive summarization fine-tuning task is supported for GPT-J (6B parameters).
- Eval metrics are supported for GPT-2, GPT-3 variants, and GPT-J. Metrics include perplexity, accuracy, and BPB, BPC, BPW (see the sketch below for how these relate to the eval loss).

.. note::

   If you are interested in these models, contact Cerebras support by sending email to ``support@cerebras.net``.
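As background, these eval metrics are standard functions of the per-token cross-entropy loss. A plain-Python sketch of the usual definitions (not the Cerebras implementation):

.. code-block:: python

    import math

    def eval_metrics(loss_nats_per_token, tokens, chars, words, bytes_):
        """Standard conversions from per-token cross-entropy (in nats)."""
        bits_total = loss_nats_per_token * tokens / math.log(2)
        return {
            "perplexity": math.exp(loss_nats_per_token),
            "bpc": bits_total / chars,   # bits per character
            "bpw": bits_total / words,   # bits per word
            "bpb": bits_total / bytes_,  # bits per byte
        }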
Multi-replica mode
^^^^^^^^^^^^^^^^^^

- Multi-replica mode is now supported across the Transformer and BERT TensorFlow models.
- Multi-replica mode also adds Variable Tensor Shape support to further boost performance for these models.

Known issues
~~~~~~~~~~~~

GPT-J (6B parameters) model
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- There is a non-determinism on the GPU side that we are currently debugging, so in order to match the GPU reference, the CS-2 run should start from the same initial checkpoint.
- There is an unexplained shuffle happening when the input function runs out of data and needs to repeat the dataset. So, in order to get an exact match, the reference run should use fewer steps than the dataset provides, or the dataset needs to be extended so that the repeat doesn't happen.
- When running the GPT-J 6B model, each weight streaming server should be configured with 512 GB of total memory. It is recommended to have at least 128 GB of physical memory and any remainder as swap space.

Running eval on Cerebras system
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running in ``eval`` mode on a CS-1 system, if the ``--nodes`` and ``--tasks_per_node`` value pairs are **not** set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

1. ``--nodes=1 --tasks_per_node=2``, or
2. ``--nodes=2 --tasks_per_node=1``

- **Workaround**: Make sure that you use one of the above settings for ``--nodes`` and ``--tasks_per_node``. For example:

  .. code-block:: bash

      --nodes=1 --tasks_per_node=2

  The ``eval`` performance is not affected by these Slurm resource settings. See the example command below:

  .. code-block:: bash

      csrun_wse python run.py --mode=eval \
          --nodes=1 --tasks_per_node=2 \
          --params configs/your-params-file.yaml \
          --model_dir your-model-dir \
          --cs_ip=10.255.253.0

----

.. _v1-2-0:

Release 1.2.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch models
^^^^^^^^^^^^^^

- Train and eval modes are now supported for PyTorch BERT Base with sequences up to 4k tokens and BERT Large with sequences up to 2k tokens. Includes support for common eval metrics (eval loss, MLM accuracy, NSP accuracy, perplexity).
- Train and eval modes are now supported for the RoBERTa configuration in PyTorch BERT.
- Added support for BERT-NER fine-tuning.
- Train and eval modes are now supported for the PyTorch Transformer (Attention Is All You Need) model.
- Train and eval modes are now supported for the PyTorch T5 model with configurations up to ~500M parameters, e.g., T5-Small 60M and T5-Base 220M.
- Train and eval modes are now supported for the PyTorch GPT-2 model with configurations up to ~770M parameters, e.g., GPT-2 Small 117M, GPT-2 Medium 345M, and GPT-2 Large 774M.

Weight streaming execution mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- A new execution mode, called **weight streaming** mode, for running extremely large models, is introduced as an early release. See :ref:`cerebras-execution-modes` for a detailed explanation of the weight streaming concept.
- In weight streaming mode, support is added for eval on GPU.
- In weight streaming mode, support is added to store checkpoints and resume training from checkpoints.
- Support is added in weight streaming mode to track training runs with TensorBoard.

Weight streaming models
^^^^^^^^^^^^^^^^^^^^^^^

The following models support weight streaming mode. These models are in early beta.

- GPT-3 XL (1.3 billion total parameters) running on a single CS-2 system.

.. note::

   If you are interested in these models, contact Cerebras support by sending email to ``support@cerebras.net``.

.. _relnote-slurm-analyzer:

Input analyzer for Slurm resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The ``cs_input_analyzer`` is a new Bash script that recommends the Slurm resource settings you need to run on the Cerebras system. These recommendations are generated for a given ``input_fn`` and model. To use this tool, run it manually. See :ref:`cs-input-analyzer`.

Known issues
~~~~~~~~~~~~

Running eval on Cerebras system
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running in ``eval`` mode on a CS-1 system, if the ``--nodes`` and ``--tasks_per_node`` value pairs are **not** set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

1. ``--nodes=1 --tasks_per_node=2``, or
2. ``--nodes=2 --tasks_per_node=1``

- **Workaround**: Make sure that you use one of the above settings for ``--nodes`` and ``--tasks_per_node``. For example:

  .. code-block:: bash

      --nodes=1 --tasks_per_node=2

  The ``eval`` performance is not affected by these Slurm resource settings. See the example command below:

  .. code-block:: bash

      csrun_wse python run.py --mode=eval \
          --nodes=1 --tasks_per_node=2 \
          --params configs/your-params-file.yaml \
          --model_dir your-model-dir \
          --cs_ip=10.255.253.0
.. _v1-1-0:

Release 1.1.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch
^^^^^^^

- The PyTorch support is enhanced. Key changes include, but are not limited to:

  - Support for ``eval`` mode is added for the BERT and FC-MNIST PyTorch models. These models now support both ``train`` and ``eval`` modes.
  - Simplified :ref:`cbtorch-session`.
  - Enhanced the flexibility in specifying ``cerebras.framework.torch.initialize()``.
  - Use of the ``cbfloat16`` data format (see :ref:`cb16`) is now supported.
  - Made the mixed precision interface more intuitive, via ``GradScaler`` (see :ref:`pytorch-dynamic-loss-scaling`); a usage sketch follows below.
  - Fixed several bugs in the areas of numerics, convergence, and performance.
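A minimal sketch of the ``GradScaler`` flow, modeled on the familiar ``torch.cuda.amp`` pattern; the exact import path and constructor arguments are assumptions here, so see :ref:`pytorch-dynamic-loss-scaling` for the authoritative usage:

.. code-block:: python

    from cerebras.framework.torch import GradScaler  # path is an assumption

    scaler = GradScaler()  # dynamic loss scaling
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = model(inputs, targets)
        scaler.scale(loss).backward()  # scale the loss before backward
        scaler.step(optimizer)         # unscale grads, then optimizer step
        scaler.update()                # adjust the loss scale dynamically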
"Convergence is not guaranteed. Remove this exception to proceed." ) - When you train the TensorFlow Transformer model on Cerebras system, you will see a modest increase in loss volatility, compared to the runs on GPUs. This is due to numerical differences. The pre-training eval accuracy is expected to be within a few percent of the equivalent model trained on a GPU. .. note:: If you are interested in these models, contact Cerebras support by sending email to ``support@cerebras.net``. Running eval on Cerebras system ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When running in ``eval`` mode on CS-1 system, if ``--nodes`` and ``--tasks_per_node`` value pairs are **not** set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch. 1. ``--nodes==1 --tasks_per_node=2``, or 2. ``--nodes==2 --tasks_per_node=1`` - **Workaound**: Make sure that you use one of the above settings for ``--nodes`` and ``--tasks_per_node``. For example: .. code-block:: bash --nodes=1 --tasks_per_node=2 The ``eval`` performance is not affected by these Slurm resource settings. See the example command below: .. code-block:: bash csrun_wse python run.py --mode=eval \ --nodes=1 --tasks_per_node=2 \ --params configs/your-params-file.yaml \ --model_dir your-model-dir \ --cs_ip=10.255.253.0 Multi-replica data parallel training ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Dynamic loss scaling is not yet supported with :ref:`multi-replica-data-parallel-training`. - Eval on Cerebras system is not yet supported for multi-replica data parallel trained models. You can run eval on CPU or GPU for these models. PyTorch ^^^^^^^ - For PyTorch, when you are targeting GPU, the following warning will be displayed. This can be safely ignored. This issue does not exist when you target Cerebras system for your acceleration. .. code-block:: text UserWarning: Detected call of ``lr_scheduler.step()`` before ``optimizer.step()``. In PyTorch 1.1.0 and later, you should call them in the opposite order: ``optimizer.step()`` before ``lr_scheduler.step()``. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. - For PyTorch models only, to run the training on the Cerebras system, the ``cs_ip`` flag must include both the IP address and the port number of the CS system. Only the IP address, for example: ``--cs_ip 192.168.1.1``, will not be sufficient. You must also include the port number, for example: ``--cs_ip 192.168.1.1:9000``. See :ref:`pt-qs-train-on-cs` in the PyTorch quickstart document. ---- .. _v1-0-0: Release 1.0.0 ------------- New features and enhancements ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PyTorch (BETA) ^^^^^^^^^^^^^^ Support is added, in beta phase only, for the PyTorch framework. The models and quickstart provided are strictly intended as advanced information only. - A PyTorch version of FC-MNIST is added as a part of PyTorch (BETA) support. This version only supports compiling on a CPU node with the ``train`` mode. To train this model on the Cerebras System at your own risk, edit the ``run.py`` file and comment out the entire ``raise ValueError()`` function, as shown below: .. code-block:: python elif runconfig_params["mode"] == TRAIN: # raise ValueError( # "Training PyTorch models on the Cerebras System is in beta " # "and is only validated with the default config provided in the " # "Model Zoo. Remove this exception and use the provided config to" # "proceed." 
----

.. _v1-0-0:

Release 1.0.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch (BETA)
^^^^^^^^^^^^^^

Support is added, in beta phase only, for the PyTorch framework. The models and quickstart provided are strictly intended as advanced information only.

- A PyTorch version of FC-MNIST is added as a part of PyTorch (BETA) support. This version only supports compiling on a CPU node with the ``train`` mode. To train this model on the Cerebras system at your own risk, edit the ``run.py`` file and comment out the entire ``raise ValueError()`` call, as shown below:

  .. code-block:: python

      elif runconfig_params["mode"] == TRAIN:
          # raise ValueError(
          #     "Training PyTorch models on the Cerebras System is in beta "
          #     "and is only validated with the default config provided in the "
          #     "Model Zoo. Remove this exception and use the provided config to "
          #     "proceed."
          # )
          runner.train(train_loader)

- The PyTorch versions of BERT Base and BERT Large are added as a part of PyTorch (BETA) support. These versions only support compiling on a CPU node with the ``train`` mode. To train these models on the Cerebras system at your own risk, edit the ``run.py`` file and comment out the entire ``raise ValueError()`` call, as shown below:

  .. code-block:: python

      elif runconfig_params["mode"] == TRAIN:
          # raise ValueError(
          #     "Training PyTorch models on the Cerebras System is in beta "
          #     "and is only validated with the default configs provided in the "
          #     "Model Zoo. Remove this exception and use one of the provided "
          #     "configs to proceed."
          # )
          runner.train(train_loader)

RevBERT
^^^^^^^

A new TensorFlow model, the RevBERT, is introduced. The RevBERT is a Cerebras-specific BERT model that improves BERT performance on the Cerebras accelerator. Using the RevBERT model you can run up to 20x larger batch sizes and 2.7x larger models on the Cerebras system. This version of RevBERT is only supported with TensorFlow and only supports the ``train`` mode.

.. note::

   If you are interested in these models, contact Cerebras support by sending email to ``support@cerebras.net``.

Transformer (Attention Is All You Need)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Support is added in the ``train`` mode for Variable Sequence Length (VSL) on the CS system.

T5 model
^^^^^^^^

- Support is enhanced from loss-only eval to full eval metrics.
- Support is added in the ``train`` mode for Variable Sequence Length (VSL) on the CS system.

GPT-2
^^^^^

Support is added in the ``train`` mode for Variable Sequence Length (VSL) on the CS system.

----

.. _v0-9-0:

Release 0.9.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Improved Slurm wrapper scripts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two new Slurm wrapper scripts are introduced to make it easy to run on the CS system and on the CPU. These scripts replace ``srun_train`` and ``salloc_node``. See below:

- The ``csrun_wse`` script can be used to execute training, evaluation, and prediction on the CS system. See :ref:`train-eval-predict`.
- The ``csrun_cpu`` script can be used to launch a given user command on a CPU, within the Cerebras Singularity container. See :ref:`validate-and-compile-on-cpu` for more on this.

Transformer (Attention Is All You Need)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Support is added for the `Transformer (Attention Is All You Need) <https://arxiv.org/abs/1706.03762>`_, with the following capabilities:

- Example dataset and preprocessing scripts for English-to-German translation included.
- On the CS system: training and eval (loss only).
- On GPU: train, eval (``eval`` and ``eval_all``).

T5 model
^^^^^^^^

Support is added for the following `T5 <https://arxiv.org/abs/1910.10683>`_ family of models:

- Small model:

  - d\ :sub:`model` \ = 512
  - d\ :sub:`ff` \ = 2,048
  - 8-headed attention
  - 6 layers each in the encoder and decoder
  - About 60 million parameters

- Base model:

  - BERT Base-sized encoder and decoder
  - About 220 million parameters

- Large model:

  - BERT Large-sized encoder and decoder
  - d\ :sub:`model` \ = 1,024
  - d\ :sub:`ff` \ = 4,096
  - d\ :sub:`kv` \ = 64
  - 16-headed attention
  - 24 layers each in the encoder and decoder
  - Around 770 million parameters

- Sample dataset: `Colossal Clean Crawled Corpus (C4) dataset <https://www.tensorflow.org/datasets/catalog/c4>`_.
- On the CS system: pre-training, eval (loss only).
- On GPU: train, eval (``eval`` and ``eval_all``).

Variable Sequence Length
^^^^^^^^^^^^^^^^^^^^^^^^

The variable sequence length (VSL) performance of BERT-style encoder-decoder models is enhanced. Previously, a sequence shorter than the predefined maximum sequence length was padded up to the maximum length, and compute and memory were spent processing these padding tokens, resulting in a significant loss of performance. With this enhancement, the padding tokens are not processed, taking advantage of sparsity and improving performance on variable-length sequences.
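The padding overhead that VSL eliminates can be seen with a small plain-Python example (no Cerebras API involved; token values are arbitrary):

.. code-block:: python

    # With fixed-shape batches, every sequence is padded to max_seq_len, so
    # short sequences waste compute on padding tokens.
    batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
    max_seq_len = 12

    padded = [seq + [0] * (max_seq_len - len(seq)) for seq in batch]
    real = sum(len(seq) for seq in batch)
    total = len(batch) * max_seq_len
    print(f"useful tokens: {real}/{total} ({100 * real / total:.0f}%)")
    # Prints: useful tokens: 10/24 (42%). With VSL, only the real tokens
    # are processed.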
VSL-enhanced models
^^^^^^^^^^^^^^^^^^^

The performance-optimized variable sequence length is now available for the following models on the CS system:

- BERT Pre-training (training only).
- RNN Language Model (LM) (training only).
- RNN Sentiment (training only).

Enhanced BERT- and GPT-style models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Performance is enhanced for long sequences (MSL up to 8K for smaller models) for BERT- and GPT-style models. This is accomplished by making use of sparse attention to reduce memory requirements.

Known issues
~~~~~~~~~~~~

- When you use the AdamW optimizer and both of the following conditions are true:

  - The parameter ``weight_decay`` is set to a non-zero value, and
  - The parameter ``loss_scaling_factor`` is not set to "dynamic",

  then the execution will stop with the following error message (see the sketch at the end of this section for a params-file fix):

  .. admonition:: Error

     "When using the AdamW optimizer with weight decay, set the ``loss_scaling_factor`` to dynamic."

- For the T5 and Transformer (Attention Is All You Need) models, the performance in samples per second is optimal when the source ``max_seq_len`` and the target ``max_seq_len`` are equal.

- When running evaluation with a BERT model, if the ``max_predictions_per_seq`` parameter is set to an odd value and the following conditions are true:

  - The tensor is multi-dimensional (>1D).
  - The inner dimension is an odd value.
  - The datatype is < 4 bytes, i.e., FP16, INT16, or UINT16.

  then this leads to a compile failure in 0.9.0 and an execution failure in 0.8.0.

  **Workaround**: Set the ``max_predictions_per_seq`` parameter to an even value.

.. note::

   If you are interested in these models, contact Cerebras support by sending email to ``support@cerebras.net``.
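For the AdamW issue above, the fix is a one-line change in the model's params file. A sketch of the relevant optimizer settings, shown here as the equivalent Python dict (key names are illustrative; the exact YAML schema varies by model):

.. code-block:: python

    # Illustrative optimizer section of a params file: with a non-zero
    # weight_decay, loss_scaling_factor must be "dynamic", not a constant.
    optimizer_params = {
        "optimizer_type": "adamw",         # key names are illustrative
        "weight_decay_rate": 0.01,         # non-zero weight decay...
        "loss_scaling_factor": "dynamic",  # ...requires dynamic loss scaling
    }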
----

.. _v0-8-0:

Release 0.8.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. Retracted from 0.8.0

   BERT fine-tuning
   ^^^^^^^^^^^^^^^^

   - Support for BERT fine-tuning is added. See the ModelZoo section `Fine-Tuning`_. Also see the new document :ref:`model-support-matrix`.

Inference support
^^^^^^^^^^^^^^^^^

- Inference is now supported for the following models:

  - Graph Convolutional Network
  - Graph Attention Network

.. note::

   If you are interested in these models, contact Cerebras support by sending email to ``support@cerebras.net``.

.. Retracted from 0.8.0

   - `BERT Classifier (SST) (BERT Fine-tuning)`_
   - `BERT Token Classifier (NER) (BERT Fine-tuning)`_

Multi-model inference
^^^^^^^^^^^^^^^^^^^^^

- A new feature, multi-model inference, is introduced. Using this you can run multiple neural network models on the CS system, send inference requests to these models, and receive prediction responses. See :ref:`multi-model-inference`.

Early stopping
^^^^^^^^^^^^^^

- Early stopping is now supported using a custom hook called ``CerebrasEarlyStoppingHook``. Using this hook, you can terminate a neural network training run early based on some logic. See :ref:`early-stopping`.

.. Retracted from 0.8.0

   Debug in single-task mode
   ^^^^^^^^^^^^^^^^^^^^^^^^^

   - For a better debugging experience, support is added in the CS system workflow for single-task mode. See :ref:`debug-single-task`.

.. Retracted from 0.8.0

   Input analyzer for Slurm resources
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   - The ``cs_input_analyzer`` is a new shell script that provides performance estimates and recommended Slurm resource settings for a given ``input_fn`` and model. To use this tool, run it manually. See :ref:`input-fn-perf-analyzer`.

.. Retracted from 0.8.0

   New script for train, eval and predict
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   - A new shell script, named ``cs_run``, will be provided to make it easy to execute training, evaluation, and prediction on the CS system. The ``cs_run`` script is intended to eventually replace and expand upon the function of the ``srun_train`` script. However, the existing scripts based on ``srun_train`` will continue to work. The ``cs_run`` script will be made available by your system administrator in the next few weeks.

Launch Python or Bash in Cerebras container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- A new shell script, named :ref:`cs-cpu`, is provided to launch Python or Bash in the Cerebras container. The ``cs_cpu`` script can be used to perform validate-only or compile-only tasks on a CPU in the Cerebras Singularity environment. Use the ``cs_cpu`` script to manage the mount directories, the SIF image, and default flags.

----

.. _v0-7-1:

Release 0.7.1
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Support for BERT evaluation and prediction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Evaluation and prediction are now supported on the CS system for BERT networks. While executing ``run.py``, you can run evaluation or prediction with your network as follows:

  - **Evaluation**: Use ``--mode eval`` to use the evaluation feature.
  - **Prediction**: Use ``--mode predict`` to use the prediction feature.

  See the following for additional documentation:

  - :ref:`run-py-template` for a description of the ``mode`` parameter using an example BERT ``run.py`` script.
  - :ref:`train-eval-predict` for usage examples.

----

.. _v0-7-0:

Release 0.7.0
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. Not for 0.7.0

   Support for evaluation and prediction
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   - Evaluation and prediction are now supported with the ``CerebrasEstimator``. While executing ``run.py``, you can run evaluation or prediction with your network as follows:

     - **Evaluation**: The following modes are supported with the ``--mode`` option of ``run.py``:

       - ``eval``
       - ``eval_all`` (not yet supported on the CS system)

       See :ref:`run-py-template`, :ref:`train-eval-predict`, and :ref:`cs-1-training-eval-exec-flow` for documentation.

     - **Prediction**: Use ``--mode predict`` with ``run.py`` to use the prediction feature. See :ref:`run-py-template`, :ref:`train-eval-predict`, and :ref:`cs-1-inference-exec-flow` for documentation.

   .. note::

      Prediction is a new feature in v0.7.0 and hence the support for it is limited.

Enhanced BERT Large performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Performance is improved for BERT Large models with MSL 512. This is accomplished by making a tradeoff that mitigates the need for large buffer memory.

Support for combined loss
^^^^^^^^^^^^^^^^^^^^^^^^^

- Support is added for combined Dice loss and Softmax Cross-Entropy (CE) loss.

Summary ops
^^^^^^^^^^^

- Support for TensorFlow summaries is added. See :ref:`using-summary-ops`.

New data type
^^^^^^^^^^^^^

- A new datatype called CB16 is introduced. The CB16 is Cerebras' 16-bit format, also referred to as ``cbfloat16``. The CB16 is a floating-point format with a 6-bit exponent and a 9-bit explicit mantissa. This allows for double the dynamic range of FP16. See :ref:`data-formats`.
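A back-of-the-envelope comparison of the exponent ranges, assuming an IEEE-like layout (1 sign / 6 exponent / 9 mantissa bits for CB16 versus 1/5/10 for FP16); this is a sketch, not a specification of the format:

.. code-block:: python

    def max_normal(exp_bits, frac_bits):
        """Largest normal value for an IEEE-like float layout."""
        bias = 2 ** (exp_bits - 1) - 1
        return (2 - 2 ** -frac_bits) * 2.0 ** bias

    print(f"FP16 max normal ~ {max_normal(5, 10):.4g}")  # 6.55e+04
    print(f"CB16 max normal ~ {max_normal(6, 9):.4g}")   # 4.291e+09
    # One extra exponent bit roughly doubles the dynamic range in powers of
    # two, at the cost of one mantissa bit of precision.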
Compile report
^^^^^^^^^^^^^^

- A new feature that projects the performance of your network is added to the Cerebras Graph Compiler (CGC). Now when your compile is successful, the generated report includes projections on how your network might perform on the CS system. See :ref:`compile-report`.

Incremental compile
^^^^^^^^^^^^^^^^^^^

- A new feature called incremental compile is added to the Cerebras Graph Compiler (CGC). After you compile your model the first time, the incremental compile feature of CGC automatically speeds up subsequent compile runs of your model by reusing, wherever possible, the optimizations already performed. See :ref:`incremental-compile`.

Enhanced input function analyzer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The input function analyzer is enhanced. Now called ``analyze_input_fn_compile``, this tool provides a detailed log identifying any missing functions and provides recommendations on parameter values to enhance the training performance on the CS system. See :ref:`input-function-report`.

CS_AUTOTUNE
^^^^^^^^^^^

- Introduced a new method called Cerebras AUTOTUNE (``CS_AUTOTUNE``), which is similar to the TensorFlow ``tf.data.AUTOTUNE``. When you are targeting the CS system, using ``CS_AUTOTUNE`` instead of ``tf.data.AUTOTUNE`` will result in a better specification of parameters such as:

  - ``num_parallel_calls``
  - ``cycle_length``
  - ``num_parallel_reads``

  See :ref:`cerebras-autotune`, and the sketch below.
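As a sketch of how ``CS_AUTOTUNE`` slots into an ``input_fn`` in place of ``tf.data.AUTOTUNE`` (the import path is an assumption; see :ref:`cerebras-autotune` for the actual one):

.. code-block:: python

    import tensorflow as tf
    from cerebras.tf import CS_AUTOTUNE  # import path is an assumption

    def input_fn(params):
        def parse_fn(record):
            # Placeholder parser; replace with your TFRecord feature spec.
            return record

        files = tf.data.Dataset.list_files("train-*.tfrecord")
        dataset = files.interleave(
            tf.data.TFRecordDataset,
            cycle_length=CS_AUTOTUNE,        # instead of tf.data.AUTOTUNE
            num_parallel_calls=CS_AUTOTUNE,
        )
        dataset = dataset.map(parse_fn, num_parallel_calls=CS_AUTOTUNE)
        return dataset.batch(params["batch_size"], drop_remainder=True)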
Keras model to CerebrasEstimator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- A new function, ``KerasModelToCerebrasEstimator``, is provided to convert a Keras model so the model can be run using the ``CerebrasEstimator``. See :ref:`keras-model-to-cerebras-estimator`.

Simplified Slurm cluster resolver
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- While setting the runtime configuration options, in v0.6.3 and earlier versions you were required to add the following code for the Slurm cluster resolver:

  .. code-block:: python

      from cerebras.tf.cs_slurm_cluster_resolver import CSSlurmClusterResolver
      slurm_cluster_resolver = CSSlurmClusterResolver()
      cluster_spec = slurm_cluster_resolver.cluster_spec()
      task_type, task_id = slurm_cluster_resolver.get_task_info()
      os.environ['TF_CONFIG'] = json.dumps({
          'cluster': cluster_spec.as_dict(),
          'task': {'type': task_type, 'index': task_id}
      })

  Now this is done automatically. This means that your Slurm-orchestrated TensorFlow code that contains the above statements should be edited as follows:

  .. code-block:: python

      # Do not remove the following import statement.
      from cerebras.tf.cs_slurm_cluster_resolver import CSSlurmClusterResolver

      # Remove the following lines starting CGC v0.7.0.
      slurm_cluster_resolver = CSSlurmClusterResolver()
      cluster_spec = slurm_cluster_resolver.cluster_spec()
      task_type, task_id = slurm_cluster_resolver.get_task_info()
      os.environ['TF_CONFIG'] = json.dumps({
          'cluster': cluster_spec.as_dict(),
          'task': {'type': task_type, 'index': task_id}
      })

  See the examples in :ref:`step4-example-walk-through-cs-estimator` and :ref:`run-py-template`.

.. _breaking-changes-v0-7-0:

Breaking changes
~~~~~~~~~~~~~~~~

- The ``use_cs`` parameter in the :ref:`interface-cerebras-estimator` is removed and will result in a compiler error if used in this API. The target hardware is now determined automatically from a combination of the runtime configuration parameter ``cs_ip`` and the ``use_cs`` parameter setting in the method definitions for ``train``.
- The format of the YAML config files for all the models is changed as follows:

  - All the training-related parameters have been moved to the ``runconfig`` section.
  - The ``max_steps`` parameter is added as a default parameter to control the duration of training.

Known issues
~~~~~~~~~~~~

Incremental compile
^^^^^^^^^^^^^^^^^^^

- For BERT, a change in the ``max_gradient_norm`` hyperparameter value will not result in reduced incremental compile times.

  .. seealso:: :ref:`incremental-compile`.

Loss scaling factor
^^^^^^^^^^^^^^^^^^^

- In v0.6.3, in some cases, when you enable dynamic loss scaling and an arbitrary operation is performed on the computed loss, the Cerebras compiler may give an error and fail to compile.

  **Workaround**: In v0.7.0 you can work around this error by disabling dynamic loss scaling, i.e., by setting ``loss_scaling_factor`` to a constant value either equal to or greater than 1.0.

----

.. _v0-6-3:

Release 0.6.3
-------------

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- The overall performance is improved for BERT max sequence length 128 (MSL128) variants. This improvement varies based on the fabric and model configuration. Enable the following custom Cerebras configuration flag, only for BERT MSL128 variants, to see this performance improvement:

  .. code-block:: bash

      config.matching.kernel.no_dcache_spill_splits = True

  .. tip::

     The Cerebras implementation sets this flag by default for BERT runs with MSL128.

- The kernel matching phase of the Cerebras Graph Compiler (CGC) is enhanced to significantly reduce kernel matching time and improve flexibility. With this enhancement, the kernel matching phase completes within 60 seconds in a majority of cases. As a result, the overall compile time is reduced in these cases.

Resolved issues
~~~~~~~~~~~~~~~

- Resolved a kernel matching issue with 1D convolution models with embeddings, i.e., when a Conv1D layer is stacked on top of an embedding layer.