data_processing.scripts.hdf5_preprocessing package#
Submodules#
data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 module#
- data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5(dataset: Union[torch.utils.data.IterableDataset, torch.utils.data.Dataset], output_dir='./hdf5_dataset/', name='dataset-partition', samples_per_file=2000, num_workers=8, batch_size=64, data_collator=None, dtype='i4', compression='gzip')[source]#
Iterates over a PyTorch dataset and writes the data to HDF5 files.
- Parameters
dataset (IterableDataset, Dataset) – PyTorch dataset to fetch the data from.
output_dir (string) – directory where the HDF5 files will be stored. Defaults to ‘./hdf5_dataset/’
name (string) – name of the dataset; i.e. prefix to use for HDF5 file names. Defaults to ‘dataset-partition’
samples_per_file (int) – number of samples written to each HDF5 file (the last file can have fewer samples if the dataset isn't evenly divisible). Defaults to 2000
num_workers (int) – number of Python processes to use for generating data. Defaults to 8
batch_size (int) – The batch size to use when fetching the data. Defaults to 64
data_collator (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
dtype (string) – Data type for the HDF5 dataset.
compression (string) – Compression strategy.
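Below is a minimal usage sketch of this function on a toy map-style dataset. The toy dataset class is hypothetical, and the import path mirrors this page's module path; in the full repository the package may live under a different root (e.g. modelzoo.transformers), so adjust the import to match your checkout.

```python
import torch
from torch.utils.data import Dataset

from data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)


class ToyTokenDataset(Dataset):
    """Hypothetical map-style dataset yielding fixed-length rows of int32 token IDs."""

    def __init__(self, n_samples=10_000, seq_len=128, vocab_size=50257):
        self.data = torch.randint(
            0, vocab_size, (n_samples, seq_len), dtype=torch.int32
        )

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


convert_dataset_to_HDF5(
    dataset=ToyTokenDataset(),
    output_dir="./hdf5_dataset/",
    name="dataset-partition",
    samples_per_file=2000,  # last file may contain fewer samples
    num_workers=8,
    batch_size=64,
    dtype="i4",  # 32-bit integers, matching the tensors above
    compression="gzip",
)
```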
- data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5.write_hdf5_file(file_path, data, n_examples, chunks, dtype='i4', compression='gzip')[source]#
Write data to HDF5 file.
- Parameters
file_path (string) – HDF5 file path.
data (numpy array) – Input features and labels that will be written to HDF5.
n_examples (int) – Number of examples that will be written in the file.
chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.
dtype (string) – Data type for the HDF5 dataset.
compression (string) – Compression strategy.
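As a rough illustration, the following h5py sketch shows what writing one such file might look like. The dataset name "data" and the n_examples attribute are assumptions for illustration, not the module's confirmed on-disk layout.

```python
import h5py


def write_hdf5_file_sketch(file_path, data, n_examples, chunks,
                           dtype="i4", compression="gzip"):
    # Write one shard: a single HDF5 dataset holding the samples, plus a
    # bookkeeping attribute. The dataset name "data" is an assumed convention.
    with h5py.File(file_path, mode="w") as h5file:
        h5file.create_dataset(
            "data",
            data=data,
            dtype=dtype,
            chunks=chunks,  # tuple for an explicit chunk shape, True for auto-chunking
            compression=compression,
        )
        h5file.attrs["n_examples"] = n_examples
```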
data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset module#
Script that generates a dataset in HDF5 format for GPT Models.
data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor module#
- class data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor[source]#
Bases:
abc.ABC
This class defines how to process a dataset, tokenize it, and write it into HDF5 format.
- Parameters
params (Dict) – Dictionary containing the parameters that configure the processing of the dataset.
- add_token(token)[source]#
Add token to the tokenizer.
- Parameters
token (str) – token to be added to the tokenizer.
- create_dataset(params)[source]#
Creates HDF5 dataset from given parameters.
- Parameters
files (list) – List of files to process.
process_no (int) – Process ID.
- Returns
- Dictionary containing results of execution: the number of processed, discarded, and successful files, as well as the number of examples.
- abstract file_read_generator(file)[source]#
Read a file and generate its content.
- Parameters
file (str) – path to data file.
- Returns
a tuple of intermediate results read from files
- Return type
docs_read (tuple)
- get_vocab_size()[source]#
Get the tokenizer vocabulary size.
- Returns
size of the tokenizer vocabulary
- Return type
vocab_size (int)
- abstract preprocessing_generator(*doc_read_results)[source]#
Takes in content read from files and generates samples.
- Parameters
docs_read (tuple) – return results of the function file_read_generator.
- Returns
one or multiple training samples
- Return type
sample (np.array)
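To make the two abstract methods concrete, here is a minimal hypothetical subclass that reads plain-text files and emits fixed-length token arrays. The self.tokenizer attribute and the max_seq_length params key are assumptions for illustration; see LMDataPreprocessor below for the real concrete implementation.

```python
import numpy as np

from data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor import (
    HDF5BasePreprocessor,
)


class PlainTextPreprocessor(HDF5BasePreprocessor):
    def file_read_generator(self, file):
        # Yield one intermediate result (here, a raw text line) per document.
        with open(file, "r", encoding="utf-8") as f:
            for line in f:
                yield (line.strip(),)

    def preprocessing_generator(self, *docs_read):
        # Turn intermediate results into fixed-length training samples.
        (text,) = docs_read
        token_ids = self.tokenizer.encode(text)  # assumed tokenizer attribute
        max_len = self.params["max_seq_length"]  # assumed params key
        for start in range(0, len(token_ids), max_len):
            chunk = token_ids[start : start + max_len]
            if len(chunk) == max_len:
                yield np.array(chunk, dtype=np.int32)
```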
- seed_runs(rank=0)[source]#
Set seed for run based on user provided seed and rank.
- Parameters
rank (int) – Rank to set, based on process number for execution. Defaults to 0.
- Returns
Object of type random.Random, with seed set.
- write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')[source]#
Write data to HDF5 file.
- Parameters
file_path (string) – HDF5 file path.
files (sequence) – List of lists containing tokenized data to write.
rng (random.Random obj) – Instance of random object, with states set.
n_examples (int) – Number of examples that will be written in the file.
chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.
dtype (string) – Data type for the HDF5 dataset.
compression (string) – Compression strategy.
- write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)[source]#
Writes a list of files to HDF5.
- Parameters
files (sequence) – List of lists containing tokenized data to write.
start_number (int) – Continual count of HDF5 files written out.
write_remainder (bool) – Write out remaining data from files if the files-per-record count is not met. Defaults to False.
process_number (int) – Process number for execution. Defaults to None.
rng (random.Random obj) – Instance of random object, with states set. Defaults to new instance created for write.
- Returns
Continual count of HDF5 files written out.
remainder (list): Remaining sequences not written out, if the number of files to write is greater than the files per record.
- Return type
start_number (int)
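A hedged usage sketch of the buffer-and-carry pattern these return values suggest: feed sequences in, let full files be written, and carry the remainder into the next call. This assumes the method returns the pair (start_number, remainder) described above; `preprocessor` is any concrete HDF5BasePreprocessor subclass (e.g. the sketch above) and `tokenized_batches` is a hypothetical iterable of tokenized sequences.

```python
preprocessor = PlainTextPreprocessor(params)  # any concrete subclass
tokenized_batches = [...]                     # hypothetical: lists of tokenized sequences

file_count, remainder = 0, []
for batch in tokenized_batches:
    file_count, remainder = preprocessor.write_hdf5_files(
        remainder + batch, start_number=file_count
    )
# Flush any leftover sequences that do not fill a complete file.
file_count, _ = preprocessor.write_hdf5_files(
    remainder, start_number=file_count, write_remainder=True
)
```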
data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors module#
- class data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors.LMDataPreprocessor[source]#
Bases:
modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor
data_processing.scripts.hdf5_preprocessing.utils module#
- data_processing.scripts.hdf5_preprocessing.utils.add_common_args(parser)[source]#
Adds common command-line arguments to each subcommand parser, so that argparse can parse arguments for subcommands.
- data_processing.scripts.hdf5_preprocessing.utils.dump_args(args, json_params_file)[source]#
Write the input params to file.
- data_processing.scripts.hdf5_preprocessing.utils.dump_result(results, json_params_file, eos_id=None, pad_id=None, vocab_size=None)[source]#
Write the outputs of execution to file.
- data_processing.scripts.hdf5_preprocessing.utils.get_params(desc)[source]#
Retrieve configuration parameters.
- Returns
Dictionary containing the parameters used to configure the data processing.
- Return type
params (Dict)
- data_processing.scripts.hdf5_preprocessing.utils.get_parser(desc)[source]#
Argparse definition for command-line arguments from the user.
- Returns
Argparse namespace object with command line arguments.
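The utilities above are presumably combined in a driver script such as create_hdf5_dataset.py; the following sketch shows one plausible wiring, with the description string and output path hypothetical.

```python
from data_processing.scripts.hdf5_preprocessing.utils import dump_args, get_params

# Parse the command line and resolve it into a configuration dictionary.
params = get_params(desc="Create HDF5 dataset for GPT models")

# Record the run's parameters next to the output dataset; passing the
# resolved params here (rather than a raw argparse namespace) is an assumption.
dump_args(params, "./hdf5_dataset/data_params.json")
```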
- data_processing.scripts.hdf5_preprocessing.utils.process_dataset(files, dataset_processor, processes)[source]#
Process a dataset and write it into HDF5 format.
- Parameters
files (list) – List of files to process.
dataset_processor – Class containing methods that specify how the dataset will be processed and written into HDF5 files.
processes (int) – Number of processes to use.
- Returns
- Dictionary containing results of execution: the number of processed, discarded, and successful files, as well as the number of examples from all processes.
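A sketch of driving the multi-process pipeline end to end. The params keys and file names are hypothetical, and the assumption that LMDataPreprocessor takes the params dict (as the base class does) is illustrative.

```python
from data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors import (
    LMDataPreprocessor,
)
from data_processing.scripts.hdf5_preprocessing.utils import process_dataset

params = {"max_seq_length": 2048}  # hypothetical configuration
preprocessor = LMDataPreprocessor(params)
files = ["data/part-000.jsonl", "data/part-001.jsonl"]  # hypothetical inputs

results = process_dataset(files, preprocessor, processes=8)
# Aggregated counts of processed / discarded / successful files and examples.
print(results)
```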
- data_processing.scripts.hdf5_preprocessing.utils.update_params(params, args)[source]#
Update config parameters with CLI arguments.
- data_processing.scripts.hdf5_preprocessing.utils.verify_saved_hdf5_files(params)[source]#
This function runs sanity checks at the end of HDF5 file creation. It loads every generated .h5 file and checks:
the data type,
the shape of the dataset,
and that labels and inputs are as expected.
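For illustration, a standalone check along the same lines might look like the following; the dataset name "data" and the expected sequence length are assumptions, and the real function takes its expectations from the params dictionary.

```python
import glob

import h5py
import numpy as np


def verify_hdf5_dir(output_dir, expected_dtype="i4", seq_len=2048):
    # Open every generated .h5 file and check dtype and shape.
    for path in sorted(glob.glob(f"{output_dir}/*.h5")):
        with h5py.File(path, mode="r") as h5file:
            data = h5file["data"]  # assumed dataset name
            assert data.dtype == np.dtype(expected_dtype), f"bad dtype in {path}"
            assert data.shape[-1] == seq_len, f"bad sequence length in {path}"
            # Checking that labels match inputs (e.g. inputs shifted by one
            # token for LM data) depends on the on-disk layout.
```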