Resource requirements for parallel training and compilation#
The Cerebras Wafer-Scale Cluster supports parallel compiles and trains for maximum utilization of the CS-2 systems in the cluster, depending on the cluster's configuration. It supports explicit resource management with strict limits on memory and CPU requests. You have the option of overriding these limits to run a model that needs more resources.
Note
Swap has been removed from all the Cerebras Wafer-Scale Cluster CPU nodes.
CPU Requirements#
The number of CPU cores used is set through Kubernetes (K8s) resource requests. Additional memory requirements, if needed, are handled by the cluster management software.
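For illustration only (the actual pod specifications are generated by the cluster management software, and the values below are hypothetical), a Kubernetes CPU and memory resource request generally takes the following form:

resources:
    requests:
        # Hypothetical values; real requests are set by cluster management.
        cpu: "16"
        memory: 64Gi
    limits:
        cpu: "16"
        memory: 64Gi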
Memory Requirements#
Management Node#
R1.9 supports a pool of management nodes, which allows for a greater number of parallel compiles and trains on a cluster. The default train and compile coordinator memory limits for management nodes can optionally be overridden at cluster deployment time, based on the memory of the management node. The default limits are tailored for a management node with 128Gi of memory, whereas overrides can provide additional buffer room for management nodes with more memory. Some typical examples are as follows:
125Gi single management node: 67Gi for Compile and 32Gi for Train. This allows for 1 compile and 1 train in parallel.
503Gi single management node: 90Gi for Compile and 90Gi for Train. This allows for 2 compiles and 2 trains in parallel.
Some outlier models may need more memory, in which case an out-of-memory (OOM) message corresponding to the compile or train job is propagated to the client-side logs. These errors are also visible in the software errors section of the wsjob dashboard.
You can override the default limits in the runconfig section of the yaml configuration file using the compile_crd_memory_gi and execute_crd_memory_gi options, as follows:

runconfig:
    compile_crd_memory_gi: 100
    execute_crd_memory_gi: 120
Maximum override value#
Increasing the memory limits for management nodes may reduce the number of parallel compiles and trains. Note that not all physical memory available on a management node is available for compile/train coordinator consumption. The cluster management overhead should be subtracted from the total physical memory before determining a possible maximum override value. Below is estimated guidance on the cluster management overhead:
Cluster with a single management node: 28Gi
Cluster with multiple management nodes: 45Gi
It is recommended to subtract a small additional buffer on top of this overhead before determining the maximum possible override.
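For example (an estimate based on the figures above, not a guaranteed value): on a 503Gi single management node, roughly 503Gi - 28Gi ≈ 475Gi remains for compile and train coordinators. After leaving a small buffer, the combined memory limits of all parallel compiles and trains should stay below that figure.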
MemoryX Nodes#
The servers running on MemoryX nodes set their memory limits dynamically, based on estimated memory usage derived from compile artifacts. The cluster management software automatically handles job scheduling when more memory is requested, so no overrides are required for these nodes.
Worker Nodes#
Worker servers have a peak memory limit of 60Gi. Worker code is user-defined, so memory usage is outside of Cerebras' control. As a reference, no worker server has exceeded this limit when running Cerebras internal tests.
In the scenario where a worker requires an exceptional amount of memory, the memory limit can be changed manually in the runconfig section of the yaml configuration file:

runconfig:
    wrk_memory_gi: 80
SwarmX Nodes#
The SwarmX servers have a peak memory limit of 60Gi. No known outliers exist.
User Nodes#
A 128GB user node can support up to 2 compiles and 2 trains in parallel, as long as the cluster's capacity supports it. If more jobs are launched than the cluster can support, an error message is reported.
Supported number of parallel compilations and trainings in a cluster#
The number of parallel compilations and trainings that can be supported in a cluster depends on:
Number of CS-2 systems
Number of management nodes
Number of fully populated MemoryX racks
Memory and CPU capacity of the management nodes
Number of fully populated MemoryX node groups in the cluster
Largest model size that needs to be supported
Model characteristics and parameters that drive up memory usage
The table below shows the number of parallel compilations and trainings supported for different management node configurations.
| Number of management nodes | 1x128GB | 2x128GB | 3x128GB | 1x512GB | 4x512GB |
|---|---|---|---|---|---|
| Number of CS-2 systems | 1 | 1 | 2 | 2 | 16 |
| Compile, Train memory limit | 67Gi, 32Gi | 75Gi, 75Gi | 75Gi, 75Gi | 75Gi, 75Gi | 75Gi, 75Gi |
| Parallel compiles (C) and trains (T) | 1C, 1T | 1C, 1T | 1C, 2T | 2C, 2T | 8C, 8T |
Note
The cluster management overhead per management node is ~45Gi for a multiple management node setup and ~28Gi for a single management node setup; this overhead is taken into account in the table above.
Monitoring Cluster Resource Usage#
The maximum CPU and memory used by a specific run can be viewed in the Grafana wsjob dashboard.
Known Outliers#
Compile memory outliers#
T5 model variants with a batch size greater than or equal to 700 are not supported.
Train memory outliers#
In general, the train memory requirement on the appliance increases with the maximum tensor size. A maximum tensor size larger than 3.5Gi can drive the peak memory requirement on a management node above the 32Gi limit. With the limit overridden to 75Gi, the maximum tensor size that can be supported is approximately 6.8Gi. For example, a model with a hidden size of ~6k and a vocabulary size of ~150k results in a tensor size of ~3.7Gi; if the limit was set to 32Gi, it should be overridden to 75Gi for execution.
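As a rough way to estimate this (assuming the largest tensor is the hidden-by-vocabulary embedding or projection weight stored as 32-bit values, which is an assumption rather than a documented rule): tensor size ≈ hidden_size × vocab_size × 4 bytes, so ~6,000 × ~150,000 × 4 bytes ≈ 3.4Gi, on the order of the ~3.7Gi figure above. The exact value depends on the model's precise dimensions and data types.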
The following is a list of known outliers, assuming a 75Gi limit, along with their peak memory usage:
GPT-2 model with 1.3B parameters, a vocabulary size of 1 million, and a hidden size of 2280: requires 117Gi
Impact of checkpoint frequency on memory#
Checkpoint frequency has an impact on memory usage. For models with 20B parameters and above, it is recommended to take checkpoints no more often than once every 200 steps.
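As a sketch, assuming the Model Zoo convention of a checkpoint_steps field in the runconfig section (the exact parameter name may differ by model and release), a 200-step checkpoint interval would be configured as:

runconfig:
    # Take a checkpoint every 200 steps (hypothetical field name; verify against your model's config).
    checkpoint_steps: 200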
Note
To learn more about how to troubleshoot issues with system resources, visit the Out of memory errors and system resources section.