data_processing.h5_map_dataset package#

Submodules#

data_processing.h5_map_dataset.dataset module#

class data_processing.h5_map_dataset.dataset.HDF5Dataset[source]#

Bases: torch.utils.data.Dataset

Dynamically read samples from disk for use with map-style data loading paradigms.

It supports two different data formats on disk. The first is data stored in an H5 file in the shape (num_tokens,), i.e. a series of documents tokenized and concatenated together. We call this format the ‘corpus’ format. The second format is H5 data of shape (num_sequences, …), i.e. data that has already been tokenized and split into sequences. We call this format the ‘sample’ format.

The corpus format supports a flexible choice of maximum sequence length (MSL) backed by a single copy of the data on disk. Both formats support deterministic restart and a data order that is independent of the configuration of the cluster you are running on. That is, you can pause a run, increase or decrease the number of systems you are running on, and restart the run with no change in data order.

When used in combination with shuffling, this implementation relies on random access reads to disk to dynamically split samples into sequences and shuffle. Users with unusually slow storage should look out for data loading bottlenecks and might consider using use_worker_cache=True if disk access is indeed a bottleneck.

Parameters

params (dict) –

a dictionary containing the following fields:

  • “data_dir” (str or list[str]): the path to the HDF5 files. Exactly one of “data_dir” or “mixture” must be specified.

  • “batch_size” (int): batch size.

  • “shuffle” (bool): whether or not to shuffle the dataset. Defaults to True.

  • “shuffle_seed” (int): seed used for deterministic shuffling. Defaults to 0.

  • “use_worker_cache” (bool): whether or not to copy data to storage that is directly attached to each individual worker node. Useful when your network storage is unusually slow, but otherwise discouraged.

  • “max_sequence_length” (int): the sequence length of samples produced by the dataloader. When using the ‘corpus’ data format, the same preprocessed data will work with any max sequence length, so this may be set at runtime. When using the ‘sample’ format this must be set to None.

  • “data_subset” (str): an optional specification to only consider a subset of the full dataset, useful for sequence length scheduling and multi-epoch testing. Expected to be a comma-separated list of ranges, e.g. ‘0.0-0.5’ or ‘0.1-0.3,0.7-1.0’. Specifying ‘0.0-0.5’ creates a dataset from the first half of the data on disk and disregards the second half.

  • “mixture” (list[dict]): an optional specification of multiple datasets to mix over to create one single weighted combination. Each element must be a dictionary containing the keys data_dir and weight. data_dir serves the same purpose as above, and weight defines the probability with which this dataset should be sampled. Weights are normalized to sum to 1. Optionally, the dictionary may also contain a data_subset field which functions the same as the data_subset argument above.

  • “drop_last” (bool): similar to the PyTorch drop_last setting, except that when set to True, samples that would have been dropped at the end of one epoch are instead yielded at the start of the next epoch so that there is no data loss. This is necessary for a data ordering that is independent of the distributed setup being used.

__init__(params)[source]#
property by_sample#
map(fn)[source]#
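To make the expected configuration concrete, here is a minimal sketch of a params dictionary using only the fields documented above; the paths and values are hypothetical placeholders.

```python
from data_processing.h5_map_dataset.dataset import HDF5Dataset

# Hypothetical configuration; adjust paths and sizes for your data.
params = {
    "data_dir": "/path/to/preprocessed/h5",  # or a list of directories
    "batch_size": 16,
    "shuffle": True,
    "shuffle_seed": 0,
    "max_sequence_length": 2048,  # 'corpus' format only; must be None for 'sample' format
    "data_subset": "0.0-0.5",     # optional: use only the first half of the data
    "drop_last": True,
}

dataset = HDF5Dataset(params)
```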

data_processing.h5_map_dataset.preprocess_pile module#

Preprocess a dataset saved in the Eleuther lm_dataformat format, such as Pile, for use in a data processor such as the GptHDF5MapDataProcessor, which is backed by an H5Reader.

The basic logic in this script is to convert each input file to a single H5 output file by applying unicode normalization, tokenizing, and concatenating documents with an end-of-document token in between.

This script is meant to be run in parallel across several nodes using a tool such as sbatch. For example, to preprocess Pile from the raw artifacts downloaded from https://the-eye.eu/public/AI/pile/, run the following slurm script using sbatch --array 0-29:

```bash
#!/bin/bash
python preprocess_pile.py \
    --input_path /path/to/raw/pile/train/*.jsonl.zst \
    --output_dir /path/to/output/dir \
    --tokenizer /path/to/gpt2/tokenizer.json \
    --eos_id 50256 \
    --normalizer NFC \
    --rank $SLURM_ARRAY_TASK_ID \
    --world_size $SLURM_ARRAY_TASK_COUNT
```

The files provided are automatically sharded between workers based on the provided rank and world size, which results in each worker processing a single file. The script also works, although with less parallelism, if you reduce the worker pool (potentially to only a single worker) and let each worker process multiple files. The only change needed would be to the --array sbatch argument.
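The sharding of input files across workers described above is handled inside the script; conceptually it amounts to a round-robin split by rank, roughly as in the sketch below (the helper shown here is illustrative, not the script's actual internals).

```python
import glob

def shard_files(pattern, rank, world_size):
    """Illustrative round-robin assignment of input files to workers."""
    files = sorted(glob.glob(pattern))
    # Worker `rank` takes every `world_size`-th file starting at index `rank`.
    return files[rank::world_size]

# With 30 input files and an sbatch array of 30 workers (ranks 0-29),
# each worker is assigned exactly one file to preprocess.
```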

This script assumes that the documents in the source dataset are already shuffled, which is the case for the typical Pile download.

data_processing.h5_map_dataset.preprocess_pile.main()[source]#
data_processing.h5_map_dataset.preprocess_pile.parse_args()[source]#
data_processing.h5_map_dataset.preprocess_pile.save_run_info(args)[source]#

data_processing.h5_map_dataset.readers module#

class data_processing.h5_map_dataset.readers.H5Reader[source]#

Bases: object

An abstraction for reading individual sequences from h5 files on disk.

Supports 2 formats of data on disk. The first is a rank-1 tensor of concatenated tokenized documents. The second is a rank > 1 tensor of preprocessed samples where the 0th index of the data on disk indexes the data by sample.

Creates a reader for an h5 corpus

Parameters
  • data_dirs (list[str]) – Directories containing h5 files to read from

  • sequence_length (int) – The number of tokens per sample if reading from a corpus. Must be None if the data has already been preprocessed into samples.

  • read_extra_token (bool) – Whether to read and return one extra token after the end of the sequence. This can be useful for language modeling tasks where you want to construct the labels as a shifted version of the inputs. Setting this to True differs from increasing sequence_length by one in that the extra token returned due to this flag will be included in some other sequence as the first token. Will be ignored if sequence_length is None.

  • data_subset (str) – A string specifying the subset of the corpus to consider. E.g. if data_subset=”0.0-0.75” is specified, only samples in the first 3/4 of the dataset will be considered and the last 1/4 of the dataset will be completely untouched. The self-reported length will be the length of the valid portion of the dataset (e.g. the first 3/4), and any attempt to access an element beyond this length will result in an exception.

__init__(data_dirs, sequence_length=None, read_extra_token=False, data_subset=None)[source]#

property by_sample#
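A brief usage sketch, assuming corpus-format data at a placeholder path; the reader exposes __len__ and element access by index, as the description above implies.

```python
from data_processing.h5_map_dataset.readers import H5Reader

# Hypothetical corpus-format data; each element is a sequence of 2048 tokens
# (plus one extra token for constructing shifted labels).
reader = H5Reader(
    data_dirs=["/path/to/corpus/h5"],
    sequence_length=2048,
    read_extra_token=True,
    data_subset="0.0-0.75",
)

print(len(reader))    # number of sequences in the first 3/4 of the corpus
sequence = reader[0]  # tokens of the first sequence
```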
class data_processing.h5_map_dataset.readers.Mixture[source]#

Bases: object

Mix several map-style datasets according to provided weights.

Parameters
  • datasets – a list of objects implementing __len__ and __getitem__

  • weights – a list of weights associated with each dataset. weights must have the same length as datasets and contain only nonnegative values. All weights will be normalized to sum to 1.

  • interleave – whether or not samples of different datasets should be interleaved together. If all the datasets are preprocessed into sequences and shuffled before being written to disk, then setting this flag will allow you to avoid doing any shuffling at run time while still having samples from the different datasets intermingled, which may be desirable for enabling sequential disk reads. This is implemented in a way that samples within a dataset are not shuffled in relation to each other, i.e. sample 0 of dataset 0 will always have a smaller index than sample 1 of dataset 0.

  • seed – the random seed used for interleaving. Ignored if interleave is False.

__init__(datasets, weights, interleave=False, seed=0)[source]#
property by_sample#
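A short sketch of mixing two datasets with unequal weights; the directory paths are placeholders, and any objects implementing __len__ and __getitem__ could be used in place of the readers.

```python
from data_processing.h5_map_dataset.readers import H5Reader, Mixture

# Two hypothetical corpora read with the same sequence length.
pile = H5Reader(data_dirs=["/path/to/pile/h5"], sequence_length=2048)
code = H5Reader(data_dirs=["/path/to/code/h5"], sequence_length=2048)

# Roughly 70% of samples come from the first dataset and 30% from the second;
# the weights are normalized to sum to 1.
mixture = Mixture(
    datasets=[pile, code],
    weights=[0.7, 0.3],
    interleave=False,
    seed=0,
)
```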
data_processing.h5_map_dataset.readers.trivial_context_manager(f)[source]#

data_processing.h5_map_dataset.samplers module#

class data_processing.h5_map_dataset.samplers.BaseSampler[source]#

Bases: torch.utils.data.Sampler

Handle shuffling and skipping

__init__(data_source, num_samples=None, shuffle=True, seed=None, start_index=0)[source]#
property num_samples#
class data_processing.h5_map_dataset.samplers.BatchAccumulator[source]#

Bases: torch.utils.data.Sampler

Accumulate neighboring batches into one single larger batch. This is the inverse operation to the splitting of batches into microbatches that happens when using multiple CSX systems.

Assumes data_source is an iterator of batches where each batch has the same length (i.e. drop_last=True).

__init__(data_source, n_accum)[source]#

class data_processing.h5_map_dataset.samplers.BatchSampler[source]#

Bases: torch.utils.data.Sampler

A slight modification of the PyTorch batch sampler such that any samples not yielded at the end of an epoch when drop_last=True will be yielded at the start of the next epoch. This is necessary for shard-invariance.

Adapted from the PyTorch batch sampler

__init__(sampler, batch_size, drop_last)[source]#
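As a conceptual illustration of the carry-over behaviour described above (this is not the class's actual implementation): with drop_last=True, a plain PyTorch batch sampler silently drops the trailing partial batch each epoch, whereas here those samples are yielded at the start of the next epoch.

```python
# Conceptual sketch only: leftover samples roll over into the next epoch.
def batches_with_carry_over(num_samples, batch_size, num_epochs):
    leftover = []
    for _ in range(num_epochs):
        indices = leftover + list(range(num_samples))
        while len(indices) >= batch_size:
            yield indices[:batch_size]
            indices = indices[batch_size:]
        leftover = indices  # carried over instead of being dropped

# With 10 samples and batch_size=4, epoch 1 yields [0,1,2,3] and [4,5,6,7];
# samples 8 and 9 are not dropped but start epoch 2 as [8, 9, 0, 1].
print(list(batches_with_carry_over(10, 4, 2)))
```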
class data_processing.h5_map_dataset.samplers.CBSampler[source]#

Bases: torch.utils.data.Sampler

A sampler to handle sharding, batching, and skipping of map-style datasets intended for use on CSX. Sharding is performed in such a way that data order is independent of the number of systems being used and the number of workers per system.

Create a sampler to handle shuffling in a deterministic and restartable way as well as sharding.

Parameters
  • data_source (torch.utils.data.Dataset) – dataset to sample from

  • shuffle (bool) – whether or not to shuffle the dataset

  • seed (int) – The seed used to make shuffling deterministic

  • start_index (int) – The index of the first sample to yield

  • shard (bool) – Whether or not to shard the dataset across Cerebras data streamer nodes

  • batch_size (int) – The batch size to use to compute sharded indices and group samples into batches. If None, no batching will be performed. This is the global batch size visible to the dataset rather than the microbatch size.

__init__(data_source, shuffle=True, seed=None, start_index=0, shard=True, batch_size=None, drop_last=True)[source]#
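A minimal usage sketch based on the documented constructor; the list below is a stand-in for a real map-style dataset such as HDF5Dataset, and the values are placeholders.

```python
from data_processing.h5_map_dataset.samplers import CBSampler

# Stand-in map-style dataset; in practice this would be an HDF5Dataset.
dataset = list(range(100))

sampler = CBSampler(
    dataset,
    shuffle=True,
    seed=0,
    start_index=0,    # set to a later index to resume a paused run
    shard=True,       # shard across Cerebras data streamer nodes
    batch_size=16,    # the global batch size, not the per-system microbatch size
    drop_last=True,
)

# Since batch_size groups samples into batches of indices, the sampler can
# presumably be passed to a DataLoader via its batch_sampler argument.
```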

class data_processing.h5_map_dataset.samplers.Sharder[source]#

Bases: torch.utils.data.Sampler

__init__(data_source)[source]#

Module contents#