cerebras.modelzoo.data.common.HDF5IterableDataset.HDF5IterableDatasetConfig

class cerebras.modelzoo.data.common.HDF5IterableDataset.HDF5IterableDatasetConfig(*args, **kwargs)

Methods

Attributes

`batch_size`	Batch size.
`data_dir`	Path to the dataset's HDF5 files.
`drop_last`	If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
`features_list`	List of features to include in the batch.
`model_config`
`num_workers`	How many subprocesses to use for data loading.
`shuffle`	Flag to enable data shuffling.
`shuffle_seed`	Shuffle seed.
`use_vsl`	Flag to enable variable sequence length training.

drop_last = True

If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
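The effect of `drop_last` on the number of batches can be illustrated with a little arithmetic. The sketch below is plain Python and does not call into the library; the helper name `num_batches` is hypothetical, and it only models the rule stated above.

```python
def num_batches(num_samples: int, batch_size: int, drop_last: bool = True) -> int:
    """Count batches produced from num_samples items (hypothetical helper)."""
    full, remainder = divmod(num_samples, batch_size)
    # A trailing partial batch is kept only when drop_last is False.
    if remainder and not drop_last:
        return full + 1
    return full


# 10 samples with batch_size=4: two full batches plus 2 leftover samples.
print(num_batches(10, 4, drop_last=True))   # leftover samples are discarded
print(num_batches(10, 4, drop_last=False))  # leftover samples form a short batch
```

With the default `drop_last = True`, every batch has exactly `batch_size` samples, at the cost of discarding up to `batch_size - 1` samples.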

use_vsl = False

Flag to enable variable sequence length training. It requires the dataset to have two extra features: the attention_span of keys and the position_ids of tokens.

features_list = ['input_ids', 'attention_mask', 'labels']

List of features to include in the batch.
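As a rough illustration of what "features to include in the batch" means, the sketch below filters a sample dictionary down to the names in `features_list`. It is a minimal stand-in and not the library's implementation: the function `select_features` and the sample contents are assumptions made for this example.

```python
# Default value documented above.
FEATURES_LIST = ["input_ids", "attention_mask", "labels"]


def select_features(sample: dict, features_list=FEATURES_LIST) -> dict:
    """Keep only the requested features, in order (hypothetical helper)."""
    missing = [name for name in features_list if name not in sample]
    if missing:
        raise KeyError(f"sample is missing features: {missing}")
    return {name: sample[name] for name in features_list}


# A made-up sample carrying one feature that is not in features_list.
sample = {
    "input_ids": [101, 2023, 102],
    "attention_mask": [1, 1, 1],
    "labels": [2023, 102, 0],
    "position_ids": [0, 1, 2],
}
batch_item = select_features(sample)
print(list(batch_item))
```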

__call__(**kwargs)

Construct the original class with the current config.

By original class, we mean the class that this config class is associated with.
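The config-constructs-its-class pattern described here can be sketched with a toy pair of classes. Everything in the block is hypothetical (`ToyDataset`, `ToyDatasetConfig`, and the choice to let keyword arguments override config fields); it shows the general idea only, and how the real `__call__` combines `**kwargs` with the config may differ.

```python
from dataclasses import asdict, dataclass


class ToyDataset:
    """Stand-in for the class a config is associated with (hypothetical)."""

    def __init__(self, batch_size: int, shuffle: bool, drop_last: bool):
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last


@dataclass
class ToyDatasetConfig:
    """Stand-in config holding the constructor arguments (hypothetical)."""

    batch_size: int = 8
    shuffle: bool = False
    drop_last: bool = True

    def __call__(self, **kwargs) -> ToyDataset:
        # Assumption for this sketch: explicit kwargs override config fields.
        return ToyDataset(**{**asdict(self), **kwargs})


config = ToyDatasetConfig(batch_size=4)
dataset = config()                # built from the config alone
shuffled = config(shuffle=True)   # built with one overridden field
print(type(dataset).__name__, dataset.batch_size, shuffled.shuffle)
```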
