cerebras.modelzoo.data.multimodal.llava.LlavaHDF5MapDataProcessor.LlavaHDF5MapDataProcessorConfig
- class cerebras.modelzoo.data.multimodal.llava.LlavaHDF5MapDataProcessor.LlavaHDF5MapDataProcessorConfig(*args, **kwargs)
Bases: cerebras.modelzoo.data.common.h5_map_dataset.dataset.MultiModalHDF5DatasetConfig, cerebras.modelzoo.config.data_config.DataConfig

Methods

- check_for_deprecated_fields
- check_literal_discriminator_field
- check_mutual_exclusivity
- copy
- get_orig_class
- get_orig_class_args
- model_copy
- model_post_init
- post_init
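Several of these methods (copy, model_copy, model_post_init) follow pydantic-style config semantics. The snippet below is a minimal, illustrative sketch, assuming pydantic v2 behavior and using placeholder field values; the full set of fields and their defaults is documented under Attributes.

```python
# Illustrative sketch only -- assumes pydantic v2 style semantics (model_copy)
# and placeholder field values; see the Attributes section for the full field list.
from cerebras.modelzoo.data.multimodal.llava.LlavaHDF5MapDataProcessor import (
    LlavaHDF5MapDataProcessorConfig,
)

config = LlavaHDF5MapDataProcessorConfig(
    data_dir="/path/to/preprocessed_hdf5",  # path to the HDF5 files
    img_data_dir="/path/to/images",         # directory containing the images
    image_data_size=[3, 336, 336],          # final C x H x W shape of each image
    batch_size=32,
)

# model_copy(update=...) returns a modified copy without mutating the original config.
eval_config = config.model_copy(update={"batch_size": 8, "shuffle": False})
```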
Attributes

- batch_size: The batch size.
- bos_token_id
- data_dir: The path to the HDF5 files.
- data_subset: An optional specification to only consider a subset of the full dataset, useful for sequence length scheduling and multi-epoch testing.
- dataset_map_fn
- discriminator
- discriminator_value
- drop_last: Similar to the PyTorch drop_last setting except that, when set to True, samples that would have been dropped at the end of one epoch are yielded at the start of the next epoch so that there is no data loss.
- image_data_size: The final C x H x W shape of the image.
- img_data_dir: The path to the directory containing the images.
- max_sequence_length: The sequence length of samples produced by the dataloader.
- mixed_precision
- mixture: An optional specification of multiple datasets to mix over to create one single weighted combination.
- model_config
- num_samples: The number of samples to shuffle over (if shuffling is enabled).
- num_workers: The number of PyTorch processes used in the dataloader.
- pad_last: Flag to enable padding of the last batch so that the last batch has the same batch size as the rest of the batches.
- persistent_workers: Whether or not to keep workers persistent between epochs.
- pos_token_id
- prefetch_factor: The number of batches to prefetch in the dataloader.
- shuffle: Whether or not to shuffle the dataset.
- shuffle_seed: The seed used for deterministic shuffling.
- sort_files: Whether or not the reader should sort the input files.
- transforms: A specification of the torchvision transforms.
- use_vsl: Flag to enable variable sequence length training.
- use_worker_cache: Whether or not to copy data to storage that is directly attached to each individual worker node.
- vocab_size
- data_processor

- num_workers = 0
The number of PyTorch processes used in the dataloader.
- prefetch_factor = 10
The number of batches to prefetch in the dataloader.
- persistent_workers = True
Whether or not to keep workers persistent between epochs.
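These defaults correspond to standard torch.utils.data.DataLoader arguments. The sketch below is a rough, non-authoritative illustration of how such settings typically map onto a DataLoader; DummyMapDataset is a hypothetical stand-in for the HDF5 map-style dataset, and the actual Model Zoo dataloader handles the wiring internally (including the constraint that prefetch_factor and persistent_workers only take effect when num_workers > 0).

```python
# Rough illustration of how these config knobs map onto torch.utils.data.DataLoader;
# DummyMapDataset is a hypothetical stand-in for the HDF5 map-style dataset.
import torch
from torch.utils.data import DataLoader, Dataset


class DummyMapDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return {"input_ids": torch.zeros(128, dtype=torch.long)}


loader = DataLoader(
    DummyMapDataset(),
    batch_size=32,            # batch_size
    shuffle=True,             # shuffle (shuffle_seed would seed a torch.Generator)
    drop_last=True,           # drop_last (the config also carries dropped samples into the next epoch)
    num_workers=2,            # num_workers; 0 means loading happens in the main process
    prefetch_factor=10,       # prefetch_factor; only valid when num_workers > 0
    persistent_workers=True,  # persistent_workers; only valid when num_workers > 0
)
```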