cerebras.modelzoo.data.common.h5_map_dataset.dataset.HDF5Dataset#

class cerebras.modelzoo.data.common.h5_map_dataset.dataset.HDF5Dataset(*args, **kwargs)[source]#

Bases: torch.utils.data.Dataset

Dynamically reads samples from disk for use with map-style data loading.

It supports two different data formats on disk. The first is data stored in an H5 file in the shape (num_tokens,), i.e. a series of documents tokenized and concatenated together. We call this the ‘corpus’ format. The second is H5 data of shape (num_sequences, …), i.e. data that has already been tokenized and split into sequences. We call this the ‘sample’ format.
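The difference between the two layouts can be sketched with plain Python containers (illustrative only; token IDs below are arbitrary placeholders, not values from the library):

```python
# 'corpus' format: one flat array of shape (num_tokens,) --
# documents tokenized and concatenated end to end.
corpus = [101, 7, 42, 102, 101, 9, 8, 3, 102, 101, 5, 102]

# 'sample' format: shape (num_sequences, sequence_length) --
# the data has already been split into fixed-length sequences.
samples = [
    [101, 7, 42, 102],
    [101, 9, 8, 3],
]

print(len(corpus))                    # num_tokens in the corpus layout
print(len(samples), len(samples[0]))  # num_sequences, sequence length
```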

The corpus format supports a flexible choice of maximum sequence length (MSL) backed by a single copy of the data on disk. Both formats support deterministic restart and a data order that is independent of the configuration of the cluster you are running on. That is, you can pause a run, increase or decrease the number of systems you are running on, and restart the run with no change in data order.
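The reason a single on-disk copy can serve any MSL is that sequences in the corpus format can be derived purely by index arithmetic at read time. A minimal sketch of that idea (the `sequence` helper is hypothetical, not part of the library):

```python
# Stand-in for a flat (num_tokens,) token array in the 'corpus' format.
corpus = list(range(20))

def sequence(corpus, index, msl):
    """Return the `index`-th length-`msl` sequence by index arithmetic.

    Hypothetical helper illustrating why no rewrite of the on-disk data
    is needed when the maximum sequence length changes.
    """
    start = index * msl
    return corpus[start:start + msl]

print(sequence(corpus, 0, 4))  # first sequence at MSL=4
print(sequence(corpus, 1, 5))  # second sequence at MSL=5, same data on disk
```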

When used in combination with shuffling, this implementation relies on random-access disk reads to dynamically split samples into sequences and shuffle them. Users with unusually slow storage should watch for data-loading bottlenecks and may consider setting use_worker_cache=True if disk access proves to be a bottleneck.

Parameters

config (cerebras.modelzoo.data.common.h5_map_dataset.dataset.HDF5DatasetConfig) – The configuration for the dataset

Methods

generate_sample

Generates an empty tensor with the same shape and dtype as a sample from its dataset.

load_state_dict

map

state_dict

Attributes

by_sample

generate_sample()[source]#

Generates an empty tensor with the same shape and dtype as a sample from its dataset.
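The idea behind such a method can be sketched with NumPy (a hedged illustration, not the library's implementation; `reference_sample` is a hypothetical stand-in for one real sample from the dataset):

```python
import numpy as np

# Hypothetical stand-in for a single sample drawn from the dataset.
reference_sample = np.arange(8, dtype=np.int32)

# Produce a tensor with the same shape and dtype but placeholder contents,
# mirroring what a generate_sample-style method returns.
empty = np.zeros_like(reference_sample)
print(empty.shape, empty.dtype)
```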