cerebras.modelzoo.data.common.h5_map_dataset.dataset.HDF5Dataset#
- class cerebras.modelzoo.data.common.h5_map_dataset.dataset.HDF5Dataset(*args, **kwargs)[source]#
Bases:
torch.utils.data.Dataset
Dynamically reads samples from disk for use in a map-style (random access) data loading paradigm.
It supports two different data formats on disk. The first is data stored in an H5 file in the shape (num_tokens,), i.e. a series of documents tokenized and concatenated together; we call this the ‘corpus’ format. The second is H5 data of shape (num_sequences, …), i.e. data that has already been tokenized and split into sequences; we call this the ‘sample’ format.
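For concreteness, the sketch below writes a tiny file in each of the two layouts with h5py. The dataset key name ("data"), dtype, and shapes are assumptions chosen for illustration only; the real files are produced by the ModelZoo preprocessing tooling.

```python
# Illustration only: the dataset key name ("data"), dtype, and shapes below are
# assumptions, not the exact layout written by the ModelZoo preprocessing scripts.
import h5py
import numpy as np

# 'corpus' format: one concatenated stream of token ids, shape (num_tokens,).
with h5py.File("corpus_format.h5", "w") as f:
    tokens = np.random.randint(0, 50257, size=(10_000,), dtype=np.int32)
    f.create_dataset("data", data=tokens)

# 'sample' format: data already tokenized and split into sequences,
# shape (num_sequences, ...); here each row holds one fixed-length sequence.
with h5py.File("sample_format.h5", "w") as f:
    sequences = np.random.randint(0, 50257, size=(128, 2048), dtype=np.int32)
    f.create_dataset("data", data=sequences)
```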
The corpus format supports a flexible choice of maximum sequence length (MSL) backed by a single copy of the data on disk. Both formats support deterministic restart and a data order that is independent of the configuration of the cluster you are running on: you can pause a run, increase or decrease the number of systems, and restart the run with no change in data order.
When used in combination with shuffling, this implementation relies on random access disk reads to dynamically split samples into sequences and shuffle. Users with unusually slow storage should watch for data loading bottlenecks and may consider setting use_worker_cache=True if disk access is indeed a bottleneck.
- Parameters
config (cerebras.modelzoo.data.common.h5_map_dataset.dataset.HDF5DatasetConfig) – The configuration object describing the dataset.
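A minimal construction sketch is shown below. The config field names are assumptions inferred from the docstring above (MSL, shuffling, use_worker_cache); consult HDF5DatasetConfig for the authoritative schema.

```python
# Field names below are assumptions inferred from the docstring (MSL, shuffling,
# use_worker_cache); consult HDF5DatasetConfig for the authoritative schema.
from cerebras.modelzoo.data.common.h5_map_dataset.dataset import (
    HDF5Dataset,
    HDF5DatasetConfig,
)

config = HDF5DatasetConfig(
    data_dir="/path/to/preprocessed_h5",  # corpus- or sample-format H5 files
    max_sequence_length=2048,             # MSL; flexible only for corpus-format data
    shuffle=True,
    shuffle_seed=1,
    use_worker_cache=False,               # set True if disk reads are the bottleneck
)

dataset = HDF5Dataset(config)
```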
Methods
generate_sample
Generates an empty tensor with the same shape and dtype as a sample from its dataset.
load_state_dict
map
state_dict
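Below is a sketch of how the listed methods might be used for per-sample preprocessing and deterministic restart. Only the method names come from this page; the exact signatures of map, state_dict, and load_state_dict, and the structure of the returned state, are assumptions.

```python
# dataset = HDF5Dataset(config)  # as constructed in the sketch above

def split_inputs_and_labels(sample):
    # Hypothetical per-sample transform applied lazily at read time.
    return {"input_ids": sample[:-1], "labels": sample[1:]}

# Assumption: map registers a callable applied to each sample on access.
dataset.map(split_inputs_and_labels)

# Deterministic restart: capture the data-order state, pause the run (possibly
# changing the number of systems), then restore the state and continue with an
# unchanged data order. The contents of the returned state are an assumption.
state = dataset.state_dict()
# ... checkpoint, pause, resize the cluster ...
dataset.load_state_dict(state)
```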
Attributes
by_sample
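The by_sample attribute presumably indicates which of the two on-disk formats backs the dataset; this reading is an assumption based on the format names above. A sketch of how calling code might branch on it:

```python
# Assumption: by_sample is True for pre-split 'sample'-format data and False for
# 'corpus'-format data whose sequences are split dynamically to the configured MSL.
if dataset.by_sample:
    print("Sample format: sequence length is fixed by the files on disk.")
else:
    print("Corpus format: sequences are split dynamically to the configured MSL.")
```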