cerebras.modelzoo.data.common.HDF5IterableDataProcessor.HDF5IterableDataProcessorConfig

class cerebras.modelzoo.data.common.HDF5IterableDataProcessor.HDF5IterableDataProcessorConfig(*args, **kwargs)

Bases: cerebras.modelzoo.data.common.config.GenericDataProcessorConfig, cerebras.modelzoo.data.common.HDF5IterableDataset.HDF5IterableDatasetConfig

Methods

check_for_deprecated_fields

check_literal_discriminator_field

copy

get_orig_class

get_orig_class_args

model_copy

model_post_init

post_init

Attributes

batch_size

Batch size.

data_dir

Path to dataset HDF5 files.

discriminator

discriminator_value

drop_last

Similar to the PyTorch drop_last setting, except that when set to True, samples that would have been dropped at the end of one epoch are yielded at the start of the next epoch so that there is no data loss.

features_list

List of features to include in the batch.

model_config

num_workers

How many subprocesses to use for data loading.

persistent_workers

If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

prefetch_factor

Number of batches loaded in advance by each worker.

shuffle

Flag to enable data shuffling.

shuffle_buffer

Size of shuffle buffer in samples.

shuffle_seed

Shuffle seed.

use_vsl

Flag to enable variable sequence length training.

vocab_size

data_processor
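
The carry-over behavior described for drop_last can be illustrated with a minimal plain-Python sketch. This is not the Model Zoo implementation: the function name batches_with_carry_over and the toy sample data are hypothetical, and the sketch only assumes that incomplete batches are held back and prepended to the next epoch, as the attribute description states.

```python
from typing import Iterable, Iterator, List


def batches_with_carry_over(
    epochs: Iterable[List[int]], batch_size: int
) -> Iterator[List[int]]:
    """Yield only full batches; leftover samples roll into the next epoch.

    Hypothetical helper for illustration, not part of cerebras.modelzoo.
    """
    leftover: List[int] = []
    for samples in epochs:
        # Prepend samples that did not fill a batch in the previous epoch.
        pending = leftover + list(samples)
        # Number of samples that fit into complete batches.
        full = len(pending) - len(pending) % batch_size
        for start in range(0, full, batch_size):
            yield pending[start : start + batch_size]
        # Hold back the incomplete remainder instead of discarding it.
        leftover = pending[full:]


# Two epochs over the same five samples with batch_size=2.
epoch = [0, 1, 2, 3, 4]
batches = list(batches_with_carry_over([epoch, epoch], batch_size=2))
print(batches)
```

In this toy run, sample 4 does not fit into a full batch in the first epoch, so it appears at the start of the second epoch instead of being lost. With the standard PyTorch drop_last=True behavior, that sample would be dropped from the first epoch.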