Limitations of the CerebrasEstimator¶
Though the CerebrasEstimator inherits from the TensorFlow Estimator, it does not yet support the full breadth of features provided by the TensorFlow Estimator. The differences and limitations are described below.
Important
All the feature limitations listed below, such as lack of support for user hooks, apply only when training on the CS system. The CerebrasEstimator supports all TensorFlow Estimator features when training on GPU or CPU.
Model function limitations¶
- Hooks
The CerebrasEstimator currently does not support user-defined hooks, which provide a way to ‘hook into’ certain points of the CerebrasEstimator execution. If such user-defined hooks are present, the CerebrasEstimator fails with an error about accessing a tensor that is not initialized.
- Eval_metrics
The CerebrasEstimator does not currently support the TensorFlow Metrics API when running on the CS system. This means that during a training run on the CS system, you cannot run eval_metrics operations such as accuracy.
If you would like to use eval_metrics for debugging, the CerebrasEstimator supports eval_metrics on CPU or GPU.
If the parameter eval_metric_ops is set in the EstimatorSpec returned by the model function, then running the Estimator produces the following error:
[Unsupported TensorFlow] Detected unsupported eval_metric_ops in CS system training run.
Input function differences¶
Dataset repeating¶
Instead of requiring you to specify the number of epochs you would like to train for, the Estimator requires that you:
- Explicitly set the number of steps you want to train for in the Estimator train function, and
- Use the default parameter of the repeat function (count=None) provided by the Dataset API to ensure that the input_fn keeps providing samples to the CS system until the number of training steps set in the train function is complete.
See the code example below:
dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)
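If you are used to thinking in epochs, the step count can be derived from the dataset size and the batch size. The following is a minimal sketch in plain Python. The helper name steps_for_epochs and the example numbers are hypothetical and are not part of the CerebrasEstimator API. It assumes the pipeline order used in this section, with repeat() applied before batch(), so batches may span epoch boundaries and only whole batches are counted.

```python
def steps_for_epochs(num_examples: int, batch_size: int, epochs: int) -> int:
    """Return the number of full batches contained in `epochs` passes over the data.

    Assumes repeat() is applied before batch(), so the sample stream is
    continuous across epoch boundaries and only whole batches are counted.
    """
    if num_examples <= 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("num_examples, batch_size and epochs must be positive")
    return (num_examples * epochs) // batch_size


# Hypothetical numbers: 50,000 examples, batch size 128, 10 epochs.
train_steps = steps_for_epochs(num_examples=50_000, batch_size=128, epochs=10)
print(train_steps)  # 3906
```

The resulting value would then be used as the number of steps passed to the train function.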
Multiple input workers¶
To utilize the full computational capabilities of the CS system, multiple input workers send the training data to the CS system simultaneously.
Note
This means that each worker node must shuffle its data differently.
In the simplest setup, the same input data is replicated across every input worker. Because the dataset is large, the CerebrasEstimator can approximate distributed training with dataset shards by ensuring that each worker shuffles its data differently.
In other words, make sure that you do not provide a fixed random seed when shuffling, since a fixed seed would make every worker produce the same order.
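The effect of a fixed seed can be illustrated with a small plain-Python model. This sketch does not use the Dataset API: the function worker_order is hypothetical, and Python's random module only stands in for the shuffle step of each input worker. It shows that a fixed seed gives every worker the same sample order, while an unset seed lets each worker order the replicated data independently.

```python
import random


def worker_order(sample_ids, seed=None):
    """Return one worker's shuffled view of the replicated sample IDs.

    seed=None lets Python seed the generator from OS entropy, so each
    call gets its own ordering; a fixed seed makes every call identical.
    """
    rng = random.Random(seed)
    order = list(sample_ids)
    rng.shuffle(order)
    return order


sample_ids = range(1000)

# Fixed seed: every worker streams the samples in exactly the same order.
fixed = [worker_order(sample_ids, seed=42) for _ in range(4)]
print(all(order == fixed[0] for order in fixed))  # True

# No seed: orders differ between workers (with overwhelming probability),
# while each worker still covers the full dataset.
unseeded = [worker_order(sample_ids) for _ in range(4)]
print(all(sorted(order) == list(sample_ids) for order in unseeded))  # True
```

In a Dataset pipeline, the analogous choice is to leave the seed argument of shuffle unset.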
Return Dataset¶
The CerebrasEstimator requires that your input function returns a Dataset (tf.data.Dataset). Each element of the Dataset must be structured to consist of features and labels. See <link to features and labels discussion>
Note
Features must be a tensor. Labels can be a tensor or None.
If the input function does not return a Dataset, the CerebrasEstimator fails with the following error:
[Unsupported TensorFlow] Input function must return a tf.data.Dataset.
Single dictionary input¶
The input function in CerebrasEstimator only takes a single dictionary parameter, params, as input. This will be passed in through the Estimator constructor.
Input function limitations¶
- Drop remainder to enforce fixed batch size
The CerebrasEstimator requires that your input function outputs batches of a fixed batch_size across all steps. To enforce this, you must set the drop_remainder parameter provided by the Dataset API to True when batching the Dataset. See the TensorFlow documentation for batch.
dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)
If you do not provide a fixed batch_size, the CerebrasEstimator will error out with the following error:
[Unsupported TensorFlow] Inconsistent batch sizes detected. To ensure a fixed batch size across all steps, set `drop_remainder=True` when batching your Dataset in the input function.
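To see why inconsistent batch sizes can arise, consider batching a finite dataset whose size is not a multiple of batch_size. The following plain-Python sketch models that behavior. The function make_batches is hypothetical and only imitates the effect of the drop_remainder flag; it is not the Dataset API itself.

```python
def make_batches(samples, batch_size, drop_remainder):
    """Group samples into consecutive batches, optionally dropping a short final one."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


samples = list(range(10))

# Without drop_remainder the last batch is smaller than the others.
print([len(b) for b in make_batches(samples, 4, drop_remainder=False)])  # [4, 4, 2]

# With drop_remainder every batch has exactly batch_size elements.
print([len(b) for b in make_batches(samples, 4, drop_remainder=True)])  # [4, 4]
```

Dropping the short final batch discards a few samples in exchange for a batch size that is the same on every step.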
Config differences¶
- Lower bound on save_checkpoint_steps
Because the CS system trains faster than alternative systems, saving checkpoints too frequently can have a significant overall performance impact.
- TF Env Config
This environment variable (see section 1 under Configuration) must be specified when training on the CS system. A default is already provided in our example scripts. Ensure that it is set during training.
- Parameters not supported
  - save_checkpoint_secs
  - train_distribute
  - device_fn
  - protocol
  - eval_distribute
  - experimental_distribute
  - experimental_max_worker_delay_secs
Compilation differences¶
Like most high performance compute devices, the CS system requires application compilation before execution. In a typical training run, this is handled automatically by CerebrasEstimator.
However, because this process can take many minutes, and takes longer as the complexity of your model increases, Cerebras makes available a standalone CerebrasEstimator.compile() function. This function allows you to quickly validate your model code and perform full batched precompiles without connecting to the CS system. Note, however, that a model compiled on one CS system cannot be run on another CS system.