data_processing.scripts.hdf5_preprocessing package#

Submodules#

data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 module#

data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5.convert_dataset_to_HDF5(dataset: Union[torch.utils.data.IterableDataset, torch.utils.data.Dataset], output_dir='./hdf5_dataset/', name='dataset-partition', samples_per_file=2000, num_workers=8, batch_size=64, data_collator=None, dtype='i4', compression='gzip')[source]#

Iterates over a PyTorch dataset and writes the data to HDF5 files; a usage sketch follows the parameter list below.

Parameters
  • dataset (IterableDataset, Dataset) – PyTorch dataset to fetch the data from.

  • output_dir (string) – directory where the HDF5 files will be stored. Defaults to ‘./hdf5_dataset/’

  • name (string) – name of the dataset; i.e. prefix to use for HDF5 file names. Defaults to ‘dataset-partition’

  • samples_per_file (int) – number of samples written to each HDF5 file (the last file can have fewer samples if the dataset size isn't evenly divisible). Defaults to 2000

  • num_workers (int) – number of Python processes to use for generating data. Defaults to 8

  • batch_size (int) – The batch size to use when fetching the data. Defaults to 64

  • data_collator (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.
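
A minimal usage sketch, assuming a map-style dataset of pre-tokenized sequences. The TokenDataset class below is hypothetical and stands in for any PyTorch dataset:

```python
import numpy as np
from torch.utils.data import Dataset

from data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)


class TokenDataset(Dataset):
    """Hypothetical map-style dataset yielding fixed-length rows of token IDs."""

    def __init__(self, num_samples=10000, seq_len=128):
        rng = np.random.default_rng(0)
        self.data = rng.integers(0, 50257, size=(num_samples, seq_len), dtype=np.int32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


# Write 10,000 samples as five 2000-sample files (per the defaults above).
convert_dataset_to_HDF5(
    dataset=TokenDataset(),
    output_dir="./hdf5_dataset/",
    name="dataset-partition",
    samples_per_file=2000,
    num_workers=8,
    batch_size=64,
    dtype="i4",
    compression="gzip",
)
```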

data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5.write_hdf5_file(file_path, data, n_examples, chunks, dtype='i4', compression='gzip')[source]#

Write data to HDF5 file.

Parameters
  • file_path (string) – HDF5 file path.

  • data (numpy array) – Input features and labels that will be written to HDF5.

  • n_examples (int) – Number of examples that will be written in the file.

  • chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.
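
A short sketch of calling write_hdf5_file directly; the array shape and file name here are illustrative assumptions, not a required layout:

```python
import numpy as np

from data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    write_hdf5_file,
)

# Illustrative batch: 2000 examples of already-tokenized features and labels.
data = np.zeros((2000, 3, 128), dtype=np.int32)

write_hdf5_file(
    file_path="./hdf5_dataset/dataset-partition-0.h5",
    data=data,
    n_examples=data.shape[0],
    chunks=True,  # True enables auto-chunking, per the parameter docs
    dtype="i4",
    compression="gzip",
)
```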

data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset module#

Script that generates a dataset in HDF5 format for GPT Models.

data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset.main()[source]#

Main function for execution.

data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor module#

class data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor[source]#

Bases: abc.ABC

This class defines how to process a dataset, tokenize it, and write it into HDF5 format; a subclass sketch follows the method listing below.

Parameters

params (Dict) – Dictionary containing the parameters that configure the processing of the dataset.

__init__(params)[source]#
add_token(token)[source]#

Add a token to the tokenizer.

Parameters

token (str) – token to be added to the tokenizer.

create_dataset(params)[source]#

Creates HDF5 dataset from given parameters.

Parameters
  • files (list) – List of files to process.

  • process_no (int) – process id

Returns

Dictionary containing results of execution: specifically, the number of processed, discarded, and successful files, as well as the number of examples.

abstract file_read_generator(file)[source]#

Read a file and generate its content.

Parameters

file (str) – path to data file.

Returns

a tuple of intermediate results read from files

Return type

docs_read (tuple)

generate_sample(file)[source]#
get_vocab_size()[source]#

Get the tokenizer vocabulary size.

Returns

the size of the tokenizer vocabulary

Return type

vocab_size (int)

abstract preprocessing_generator(*doc_read_results)[source]#

Takes in content read from files and generates samples.

Parameters

doc_read_results (tuple) – return results of the function file_read_generator.

Returns

one or multiple training samples

Return type

sample (np.array)

seed_runs(rank=0)[source]#

Set the seed for the run based on the user-provided seed and rank.

Parameters

rank (int) – Rank to set, based on process number for execution. Defaults to 0.

Returns

Object of type random.Random, with seed set.

write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')[source]#

Write data to HDF5 file.

Parameters
  • file_path (string) – HDF5 file path.

  • files (sequence) – List of lists containing tokenized data to write.

  • rng (random.Random obj) – Instance of random object, with states set.

  • n_examples (int) – Number of examples that will be written in the file.

  • chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.

write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)[source]#

Writes a list of files to HDF5.

Parameters
  • files (sequence) – List of lists containing tokenized data to write.

  • start_number (int) – Continual count of HDF5 files written out.

  • write_remainder (bool) – Write out remaining data from files if the files-per-record count is not met. Defaults to False.

  • process_number (int) – Process number for execution. Defaults to None.

  • rng (random.Random obj) – Instance of random object, with states set. Defaults to new instance created for write.

Returns

start_number (int): Continual count of HDF5 files written out.

remainder (list): Remaining sequences not written out, if the length of files to write is greater than the files per record.
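
A hedged sketch of a concrete subclass, showing the two abstract methods every preprocessor must implement. The line-per-document file format and the stand-in tokenizer below are assumptions, not the behavior of the shipped preprocessors:

```python
import numpy as np

from data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor import (
    HDF5BasePreprocessor,
)


class PlainTextPreprocessor(HDF5BasePreprocessor):
    """Hypothetical preprocessor: one document per line, fixed-length samples."""

    def file_read_generator(self, file):
        # Yield a tuple of intermediate results per document read from the
        # file, as the file_read_generator contract describes.
        with open(file, "r", encoding="utf-8") as f:
            for line in f:
                yield (line.strip(),)

    def preprocessing_generator(self, *doc_read_results):
        # Turn each tuple from file_read_generator into one or more samples.
        (text,) = doc_read_results
        token_ids = [ord(c) % 1000 for c in text]  # stand-in for a tokenizer
        yield np.array(token_ids[:128], dtype=np.int32)
```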

data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors module#

class data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors.LMDataPreprocessor[source]#

Bases: modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor

__init__(params)[source]#
file_read_generator(file)[source]#
preprocessing_generator(doc)[source]#
tokenize_text_auto_lm(text)[source]#
class data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors.SummarizationPreprocessor[source]#

Bases: modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor.HDF5BasePreprocessor

__init__(params)[source]#
file_read_generator(file)[source]#
preprocessing_generator(doc)[source]#

data_processing.scripts.hdf5_preprocessing.utils module#

data_processing.scripts.hdf5_preprocessing.utils.add_common_args(parser)[source]#

For argparse to parse arguments for subcommands, common command-line arguments are added to each subcommand parser here.
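
A minimal sketch of that subparser pattern; the stand-in add_common_args body and the subcommand names are assumptions:

```python
import argparse


def add_common_args(parser):
    # Hypothetical stand-in for utils.add_common_args: register arguments
    # shared by every subcommand on the given subparser.
    parser.add_argument("--params", type=str, help="Path to a config file.")
    parser.add_argument("--output_dir", type=str, help="Output directory.")


parser = argparse.ArgumentParser(description="HDF5 preprocessing")
subparsers = parser.add_subparsers(dest="mode")
for mode in ("LMData", "Summarization"):  # assumed subcommand names
    add_common_args(subparsers.add_parser(mode))

args = parser.parse_args(["LMData", "--output_dir", "./hdf5_dataset/"])
```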

data_processing.scripts.hdf5_preprocessing.utils.dump_args(args, json_params_file)[source]#

Write the input params to file.

data_processing.scripts.hdf5_preprocessing.utils.dump_result(results, json_params_file, eos_id=None, pad_id=None, vocab_size=None)[source]#

Write the outputs of execution to file.

data_processing.scripts.hdf5_preprocessing.utils.get_params(desc)[source]#

Retrieve configuration parameters.

Returns

Dictionary containing the parameters used to configure the data processing.

Return type

params (Dict)

data_processing.scripts.hdf5_preprocessing.utils.get_parser(desc)[source]#

Argparser definition for command-line arguments from the user.

Returns

Argparse namespace object with command line arguments.

data_processing.scripts.hdf5_preprocessing.utils.get_verification_args(params)[source]#
data_processing.scripts.hdf5_preprocessing.utils.process_dataset(files, dataset_processor, processes)[source]#

Process a dataset and write it into HDF5 format.

Parameters
  • files (list) – List of files to process.

  • dataset_processor – Class containing methods that specify how the dataset will be processed and written into HDF5 files.

  • processes (int) – Number of processes to use.

Returns

Dictionary containing results of execution: specifically, the number of processed, discarded, and successful files, as well as the number of examples from all processes.
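
A minimal usage sketch, reusing the hypothetical PlainTextPreprocessor subclass sketched under hdf5_base_preprocessor above; the file paths and empty params dict are illustrative:

```python
from data_processing.scripts.hdf5_preprocessing.utils import process_dataset

params = {}  # stand-in; real runs pass the full processing config dict
files = ["data/shard-00.txt", "data/shard-01.txt"]  # illustrative paths

results = process_dataset(
    files=files,
    dataset_processor=PlainTextPreprocessor(params),
    processes=8,
)
print(results)  # counts of processed/discarded/successful files and examples
```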

data_processing.scripts.hdf5_preprocessing.utils.update_params(params, args)[source]#

Update config parameters with CLI arguments.

data_processing.scripts.hdf5_preprocessing.utils.verify_saved_hdf5_files(params)[source]#

This function runs sanity checks at the end of HDF5 file creation. It loads every generated .h5 file and checks:

  1. The data type

  2. The shape of the dataset

  3. That labels and inputs are as expected
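
An independent spot check along those lines can be sketched with h5py; the dataset key "data" is an assumption about the file layout, and the shipped verifier performs the fuller checks listed above:

```python
import glob

import h5py
import numpy as np

for path in sorted(glob.glob("./hdf5_dataset/*.h5")):
    with h5py.File(path, "r") as f:
        ds = f["data"]  # "data" is an assumed dataset key
        assert ds.dtype == np.dtype("i4"), f"{path}: unexpected dtype {ds.dtype}"
        print(path, ds.shape)  # eyeball shapes for consistency across files
```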

data_processing.scripts.hdf5_preprocessing.utils.verify_saved_hdf5_files_mp(files, args)[source]#

Verify the generated HDF5 dataset.

Parameters
  • files (list) – List of files to process.

  • args (argparse namespace) – Arguments for verifying the HDF5 dataset.

Module contents#