cerebras.modelzoo.data.common.HDF5IterableDataset.HDF5IterableDatasetConfig

class cerebras.modelzoo.data.common.HDF5IterableDataset.HDF5IterableDatasetConfig(*args, **kwargs)

Bases: cerebras.modelzoo.config.base_config.BaseConfig

Methods

check_for_deprecated_fields

copy

get_orig_class

get_orig_class_args

model_copy

model_post_init

post_init

Attributes

batch_size

Batch size.

data_dir

Path to the dataset HDF5 files.

drop_last

If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

features_list

List of features to include in the batch.

model_config

num_workers

How many subprocesses to use for data loading.

shuffle

Flag to enable data shuffling.

shuffle_seed

Shuffle seed.

use_vsl

Flag to enable variable sequence length training.

data_dir = Ellipsis

Path to the dataset HDF5 files. The Ellipsis default indicates a required field with no default value.

batch_size = Ellipsis

Batch size. The Ellipsis default indicates a required field with no default value.

shuffle = False

Flag to enable data shuffling.

shuffle_seed = None

Seed used for data shuffling.

num_workers = 0

How many subprocesses to use for data loading.

drop_last = True

If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
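The effect of drop_last on the number of batches can be sketched with plain arithmetic. This is a minimal illustration, not the library's code: the helper name num_batches is hypothetical, and the sketch ignores any per-worker or per-system sharding the real dataset may apply.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = True) -> int:
    """Count the batches produced from num_samples samples (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The trailing incomplete batch is discarded.
        return num_samples // batch_size
    # The trailing incomplete batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


print(num_batches(10, 4, drop_last=True))   # 2
print(num_batches(10, 4, drop_last=False))  # 3
```

With 10 samples and a batch size of 4, the default drop_last = True discards the final 2 samples in each pass over the data.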

use_vsl = False

Flag to enable variable sequence length training. It requires the dataset to have two extra features: the attention_span of keys and the position_ids of tokens.

features_list = ['input_ids', 'attention_mask', 'labels']

List of features to include in the batch.
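As a rough sketch of how features_list and use_vsl might interact, the snippet below filters a sample down to the listed features and, when use_vsl is set, also expects the two extra features named in the use_vsl description. The helper select_features and the idea that the extra features are appended automatically are assumptions made for illustration; the actual dataset class may handle this differently.

```python
from typing import Any, Dict, List

# Feature names taken from the use_vsl description above.
VSL_FEATURES = ["attention_span", "position_ids"]


def select_features(
    sample: Dict[str, Any], features_list: List[str], use_vsl: bool = False
) -> Dict[str, Any]:
    """Keep only the requested features of one sample (hypothetical helper)."""
    wanted = list(features_list)
    if use_vsl:
        # Assumption: VSL training needs these two features in every batch.
        wanted += [name for name in VSL_FEATURES if name not in wanted]
    missing = [name for name in wanted if name not in sample]
    if missing:
        raise KeyError(f"sample is missing features: {missing}")
    return {name: sample[name] for name in wanted}


sample = {
    "input_ids": [5, 6, 7],
    "attention_mask": [1, 1, 1],
    "labels": [6, 7, 8],
    "unused_feature": [0, 0, 0],
}
batch_features = select_features(sample, ["input_ids", "attention_mask", "labels"])
print(list(batch_features))  # ['input_ids', 'attention_mask', 'labels']
```

Calling the same helper with use_vsl=True on this sample raises KeyError, since the sample carries neither attention_span nor position_ids.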

__call__(**kwargs)

Construct the original class with the current config.

By original class, we mean the class that this config class is associated with.
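The pattern of a config object that constructs its associated class when called can be illustrated with a toy stand-in. Everything in this sketch (ToyDataset, ToyDatasetConfig, and the choice to pass the config object itself to the constructor) is hypothetical and does not reproduce the BaseConfig implementation; it only shows the general idea.

```python
from dataclasses import dataclass


class ToyDataset:
    """Stand-in for the 'original class' that a config is associated with."""

    def __init__(self, config: "ToyDatasetConfig", **kwargs):
        self.data_dir = config.data_dir
        self.batch_size = config.batch_size
        self.shuffle = config.shuffle
        self.extra = kwargs  # any additional keyword arguments given to __call__


@dataclass
class ToyDatasetConfig:
    """Stand-in config: two required fields and one field with a default."""

    data_dir: str
    batch_size: int
    shuffle: bool = False

    def __call__(self, **kwargs) -> ToyDataset:
        # Assumption: the config hands itself to the associated class.
        return ToyDataset(self, **kwargs)


config = ToyDatasetConfig(data_dir="./hypothetical_hdf5_dir", batch_size=8)
dataset = config()  # constructs ToyDataset from the current config
print(type(dataset).__name__, dataset.batch_size)  # ToyDataset 8
```

Keeping construction behind the config's call operator lets one validated object both describe the settings and build the class that uses them.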