data_processing.scripts.hdf5_preprocessing package#

Submodules#

data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 module#

data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5(dataset: Union[torch.utils.data.IterableDataset, torch.utils.data.Dataset], output_dir='./hdf5_dataset/', name='dataset-partition', samples_per_file=2000, num_workers=8, batch_size=64, data_collator=None, dtype='i4', compression='gzip')[source]#

Iterates over a PyTorch dataset and writes the data to HDF5 files; a usage sketch follows the parameter list below.

Parameters
  • dataset (IterableDataset, Dataset) – PyTorch dataset to fetch the data from.

  • output_dir (string) – directory where the HDF5 files will be stored. Defaults to ‘./hdf5_dataset/’

  • name (string) – name of the dataset; i.e. prefix to use for HDF5 file names. Defaults to ‘dataset-partition’

  • samples_per_file (int) – number of samples written to each HDF5 file (the last file can have fewer samples if the dataset size isn't evenly divisible). Defaults to 2000

  • num_workers (int) – number of Python processes to use for generating data. Defaults to 8

  • batch_size (int) – The batch size to use when fetching the data. Defaults to 64

  • data_collator (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.
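
A minimal usage sketch, assuming a map-style dataset of pre-tokenized sequences. The TokenDataset class below is hypothetical and stands in for any PyTorch dataset:

```python
import numpy as np
from torch.utils.data import Dataset

from data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)


class TokenDataset(Dataset):
    """Hypothetical map-style dataset yielding fixed-length rows of token IDs."""

    def __init__(self, num_samples=10000, seq_len=128):
        rng = np.random.default_rng(0)
        self.data = rng.integers(0, 50257, size=(num_samples, seq_len), dtype=np.int32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


# Write 10,000 samples as five 2000-sample files (per the defaults above).
convert_dataset_to_HDF5(
    dataset=TokenDataset(),
    output_dir="./hdf5_dataset/",
    name="dataset-partition",
    samples_per_file=2000,
    num_workers=8,
    batch_size=64,
    dtype="i4",
    compression="gzip",
)
```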

data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5.write_hdf5_file(file_path, data, n_examples, chunks, dtype='i4', compression='gzip')[source]#

Write data to HDF5 file.

Parameters
  • file_path (string) – HDF5 file path.

  • data (numpy array) – Input features and labels that will be written to HDF5.

  • n_examples (int) – Number of examples that will be written in the file.

  • chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.
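
A short sketch of calling write_hdf5_file directly; the array shape and file name here are illustrative assumptions, not a required layout:

```python
import numpy as np

from data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    write_hdf5_file,
)

# Illustrative batch: 2000 examples of already-tokenized features and labels.
data = np.zeros((2000, 3, 128), dtype=np.int32)

write_hdf5_file(
    file_path="./hdf5_dataset/dataset-partition-0.h5",
    data=data,
    n_examples=data.shape[0],
    chunks=True,  # True enables auto-chunking, per the parameter docs
    dtype="i4",
    compression="gzip",
)
```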

data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset module#

Script that generates a dataset in HDF5 format for GPT Models.

data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset.main()[source]#

Main function for execution.

data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor module#

class data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor[source]#

Bases: abc.ABC

This class defines how to process a dataset, tokenize it, and write it into HDF5 format; a subclass sketch follows the method listing below.

Parameters

params (Dict) – Dictionary containing the parameters that configure the processing of the dataset.

__init__(params)[source]#
add_token(token)[source]#

Add a token to the tokenizer.

Parameters

token (str) – token to be added to the tokenizer.

create_dataset(params)[source]#

Creates HDF5 dataset from given parameters.

Parameters
  • files (list) – List of files to process.

  • process_no (int) – process id

Returns

Dictionary containing results of execution: specifically, the number of processed, discarded, and successful files, as well as the number of examples.

abstract file_read_generator(file)[source]#

Read a file and generate its content.

Parameters

file (str) – path to data file.

Returns

a tuple of intermediate results read from files

Return type

docs_read (tuple)

generate_sample(file)[source]#
get_vocab_size()[source]#

Get the tokenizer vocabulary size.

Returns

the size of the tokenizer vocabulary

Return type

vocab_size (int)

abstract preprocessing_generator(*doc_read_results)[source]#

Takes in content read from files and generates samples.

Parameters

doc_read_results (tuple) – return results of the function file_read_generator.

Returns

one or multiple training samples

Return type

sample (np.array)

seed_runs(rank=0)[source]#

Set the seed for the run based on the user-provided seed and rank.

Parameters

rank (int) – Rank to set, based on process number for execution. Defaults to 0.

Returns

Object of type random.Random, with seed set.

write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')[source]#

Write data to HDF5 file.

Parameters
  • file_path (string) – HDF5 file path.

  • files (sequence) – List of lists containing tokenized data to write.

  • rng (random.Random obj) – Instance of random object, with states set.

  • n_examples (int) – Number of examples that will be written in the file.

  • chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.

write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)[source]#

Writes a list of files to HDF5.

Parameters
  • files (sequence) – List of lists containing tokenized data to write.

  • start_number (int) – Continual count of HDF5 files written out.

  • write_remainder (bool) – Write out remaining data from files if the files-per-record count is not met. Defaults to False.

  • process_number (int) – Process number for execution. Defaults to None.

  • rng (random.Random obj) – Instance of random object, with states set. Defaults to new instance created for write.

Returns

start_number (int): Continual count of HDF5 files written out.

remainder (list): Remaining sequences not written out, if the length of files to write is greater than the files per record.
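
A hedged sketch of a concrete subclass, showing the two abstract methods every preprocessor must implement. The line-per-document file format and the stand-in tokenizer below are assumptions, not the behavior of the shipped preprocessors:

```python
import numpy as np

from data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor import (
    HDF5BasePreprocessor,
)


class PlainTextPreprocessor(HDF5BasePreprocessor):
    """Hypothetical preprocessor: one document per line, fixed-length samples."""

    def file_read_generator(self, file):
        # Yield a tuple of intermediate results per document read from the
        # file, as the file_read_generator contract describes.
        with open(file, "r", encoding="utf-8") as f:
            for line in f:
                yield (line.strip(),)

    def preprocessing_generator(self, *doc_read_results):
        # Turn each tuple from file_read_generator into one or more samples.
        (text,) = doc_read_results
        token_ids = [ord(c) % 1000 for c in text]  # stand-in for a tokenizer
        yield np.array(token_ids[:128], dtype=np.int32)
```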

data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors module#

class data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors.LMDataPreprocessor[source]#

Bases: modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor

__init__(params)[source]#
file_read_generator(file)[source]#
preprocessing_generator(doc)[source]#
tokenize_text_auto_lm(text)[source]#
class data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors.SummarizationPreprocessor[source]#

Bases: modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor

__init__(params)[source]#
file_read_generator(file)[source]#
preprocessing_generator(doc)[source]#

data_processing.scripts.hdf5_preprocessing.utils module#

data_processing.scripts.hdf5_preprocessing.utils.add_common_args(parser)[source]#

For argparse to parse arguments for subcommands, common command-line arguments are added to each subcommand parser here.
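
A minimal sketch of that subparser pattern; the stand-in add_common_args body and the subcommand names are assumptions:

```python
import argparse


def add_common_args(parser):
    # Hypothetical stand-in for utils.add_common_args: register arguments
    # shared by every subcommand on the given subparser.
    parser.add_argument("--params", type=str, help="Path to a config file.")
    parser.add_argument("--output_dir", type=str, help="Output directory.")


parser = argparse.ArgumentParser(description="HDF5 preprocessing")
subparsers = parser.add_subparsers(dest="mode")
for mode in ("LMData", "Summarization"):  # assumed subcommand names
    add_common_args(subparsers.add_parser(mode))

args = parser.parse_args(["LMData", "--output_dir", "./hdf5_dataset/"])
```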

data_processing.scripts.hdf5_preprocessing.utils.dump_args(args, json_params_file)[source]#

Write the input params to file.

data_processing.scripts.hdf5_preprocessing.utils.dump_result(results, json_params_file, eos_id=None, pad_id=None, vocab_size=None)[source]#

Write the outputs of execution to file.

data_processing.scripts.hdf5_preprocessing.utils.get_params(desc)[source]#

Retrieve configuration parameters.

Returns

Dictionary containing the parameters used to configure the data processing.

Return type

params (Dict)

data_processing.scripts.hdf5_preprocessing.utils.get_parser(desc)[source]#

Argparser definition for command-line arguments from the user.

Returns

Argparse namespace object with command line arguments.

data_processing.scripts.hdf5_preprocessing.utils.get_verification_args(params)[source]#
data_processing.scripts.hdf5_preprocessing.utils.process_dataset(files, dataset_processor, processes)[source]#

Process a dataset and write it into HDF5 format.

Parameters
  • files (list) – List of files to process.

  • dataset_processor – Class containing methods that specify how the dataset will be processed and written into HDF5 files.

  • processes (int) – Number of processes to use.

Returns

Dictionary containing results of execution: specifically, the number of processed, discarded, and successful files, as well as the number of examples from all processes.
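
A minimal usage sketch, reusing the hypothetical PlainTextPreprocessor subclass sketched under hdf5_base_preprocessor above; the file paths and empty params dict are illustrative:

```python
from data_processing.scripts.hdf5_preprocessing.utils import process_dataset

params = {}  # stand-in; real runs pass the full processing config dict
files = ["data/shard-00.txt", "data/shard-01.txt"]  # illustrative paths

results = process_dataset(
    files=files,
    dataset_processor=PlainTextPreprocessor(params),
    processes=8,
)
print(results)  # counts of processed/discarded/successful files and examples
```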

data_processing.scripts.hdf5_preprocessing.utils.update_params(params, args)[source]#

Update config parameters with CLI arguments.

data_processing.scripts.hdf5_preprocessing.utils.verify_saved_hdf5_files(params)[source]#

This function runs sanity checks at the end of HDF5 file creation. It loads every generated .h5 file and checks:

  1. The data type

  2. The shape of the dataset

  3. That labels and inputs are as expected
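
An independent spot check along those lines can be sketched with h5py; the dataset key "data" is an assumption about the file layout, and the shipped verifier performs the fuller checks listed above:

```python
import glob

import h5py
import numpy as np

for path in sorted(glob.glob("./hdf5_dataset/*.h5")):
    with h5py.File(path, "r") as f:
        ds = f["data"]  # "data" is an assumed dataset key
        assert ds.dtype == np.dtype("i4"), f"{path}: unexpected dtype {ds.dtype}"
        print(path, ds.shape)  # eyeball shapes for consistency across files
```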

data_processing.scripts.hdf5_preprocessing.utils.verify_saved_hdf5_files_mp(files, args)[source]#

Verify the generated HDF5 dataset.

Parameters
  • files (list) – List of files to process.

  • args (argparse namespace) – Arguments for verifying the HDF5 dataset.

Module contents#