data_processing.scripts package

Subpackages

Submodules

data_processing.scripts.utils module

Common utility functions (utils.py) shared by the scripts in any of the subfolders.

data_processing.scripts.utils.archive_to_tokens(f, tokenizer, args, prefix=[])

Generator that yields the contents of the files in an archive. If data_to_prepend is not None, data_to_prepend followed by an EOS separator is prepended to the encoded data.

Parameters

f (file) – Archive file to read.
tokenizer (BPETokenizer obj) – Tokenizer used to encode raw data.
args (argparse namespace) – Arguments for writing out tfrecords/HDF5.
prefix (list) – Data to prepend before splitting to the given context length. Used to add the remainder data from the previous iteration of data reads. Defaults to [], i.e., an empty list.

Yields

A list of lists with tokenized data + EOS separator appended at the end.
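
The prefix and EOS handling described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `toy_encode`, `EOS_ID`, and `docs_to_token_chunks` are hypothetical names, a word-length tokenizer stands in for the BPE tokenizer, and the sketch assumes the leftover tokens are carried forward inside the generator rather than passed back in by the caller.

```python
from typing import Iterable, Iterator, List, Optional

EOS_ID = 0  # hypothetical end-of-sequence id, chosen for illustration only


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: map each whitespace-separated word to its length."""
    return [len(word) for word in text.split()]


def docs_to_token_chunks(
    docs: Iterable[str], chunk_len: int, prefix: Optional[List[int]] = None
) -> Iterator[List[List[int]]]:
    """Yield, for each document, a list of fixed-length chunks of token ids."""
    carry = list(prefix or [])  # remainder left over from earlier data
    for doc in docs:
        # Prepend the remainder, then close the document with an EOS separator.
        tokens = carry + toy_encode(doc) + [EOS_ID]
        n_full = len(tokens) // chunk_len
        chunks = [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]
        # Tokens that do not fill a whole chunk wait for the next document.
        carry = tokens[n_full * chunk_len:]
        yield chunks
```

Each yielded item is a list of lists, matching the shape described under Yields; tokens that do not fill a complete chunk are held back and placed in front of the next document.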

data_processing.scripts.utils.create_features_auto_lm(token_ids, max_sequence_length, short_seq_prob=0, inverted_mask=False, pad_id=0, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32', rng=None)

Given a list of token_ids, generate input sequence and labels.

Parameters

token_ids (sequence) – List containing token ids for creating features, labels and input mask from.
max_sequence_length (int) – Maximum sequence length for data writes.
short_seq_prob (float) – Probability of generating short sequences from data. Defaults to 0.
inverted_mask (bool) – Invert mask if specified for runtime execution. Defaults to False.
pad_id (int) – Id for pad token. Defaults to 0.
input_ids_dtype (str) – Dtype as string for input ids. Defaults to int32.
input_mask_dtype (str) – Dtype as string for input mask. Defaults to int32.
labels_dtype (str) – Dtype as string for labels. Defaults to int32.
rng (random.Random obj) – Instance of random object, with states set. Defaults to None.

Returns

Tuple containing features and labels
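
As a rough illustration of what "input sequence and labels" can look like for autoregressive language modeling, the sketch below builds three equal-length rows from a list of token ids. It is an assumption-laden stand-in, not the library's code: `sketch_auto_lm_features` is a hypothetical name, labels are assumed to be the inputs shifted by one position, and plain Python lists are used instead of typed arrays.

```python
from typing import List, Sequence


def sketch_auto_lm_features(
    token_ids: Sequence[int],
    max_sequence_length: int,
    pad_id: int = 0,
    inverted_mask: bool = False,
) -> List[List[int]]:
    """Return [input_ids, input_mask, labels], each of length max_sequence_length.

    Assumption: labels are the input ids shifted left by one position.
    """
    # One extra token is kept so that every input position has a next-token label.
    token_ids = list(token_ids)[: max_sequence_length + 1]
    input_ids = token_ids[:-1]
    labels = token_ids[1:]
    n_real = len(input_ids)
    n_pad = max_sequence_length - n_real

    # 1 marks real tokens and 0 marks padding, unless the mask is inverted.
    real, pad = (0, 1) if inverted_mask else (1, 0)
    input_mask = [real] * n_real + [pad] * n_pad
    input_ids = input_ids + [pad_id] * n_pad
    labels = labels + [pad_id] * n_pad
    return [input_ids, input_mask, labels]
```

For example, `sketch_auto_lm_features([5, 6, 7, 8], 5)` pads three real positions out to length five, and passing `inverted_mask=True` flips the zeros and ones in the mask row only.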

data_processing.scripts.utils.create_features_labels(token_ids, max_sequence_length, short_seq_prob=0, inverted_mask=False, pad_id=0, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32', rng=None)

Given a list of token_ids, generate input sequence and labels.

Parameters

token_ids (sequence) – List containing token ids for creating features, labels and input mask from.
max_sequence_length (int) – Maximum sequence length for data writes.
short_seq_prob (float) – Probability of generating short sequences from data. Defaults to 0.
inverted_mask (bool) – Invert mask if specified for runtime execution. Defaults to False.
pad_id (int) – Id for pad token. Defaults to 0.
input_ids_dtype (str) – Dtype as string for input ids. Defaults to int32.
input_mask_dtype (str) – Dtype as string for input mask. Defaults to int32.
labels_dtype (str) – Dtype as string for labels. Defaults to int32.
rng (random.Random obj) – Instance of random object, with states set. Defaults to None.

Returns

Tuple containing features and labels
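
The interaction between short_seq_prob and rng can be sketched as follows. This is only one plausible reading of those two parameters, not the documented behavior: `sketch_target_length` is a hypothetical helper, and the choice of a uniform length between 2 and the maximum is an assumption.

```python
import random


def sketch_target_length(
    max_sequence_length: int, short_seq_prob: float, rng: random.Random
) -> int:
    """Pick a target length: usually the maximum, occasionally something shorter."""
    if rng.random() < short_seq_prob:
        # Assumption: a short sequence has a uniformly chosen length of at least 2.
        return rng.randint(2, max_sequence_length)
    return max_sequence_length


# A seeded random.Random instance makes the choices reproducible across runs.
rng = random.Random(1234)
lengths = [sketch_target_length(128, 0.1, rng) for _ in range(5)]
```

With the default short_seq_prob of 0 the branch is never taken, so every sequence uses the full max_sequence_length.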

data_processing.scripts.utils.create_features_summarization(prompt_ids, completion_ids, max_sequence_length, eos_id, sep_id, pad_id=0, inverted_mask=False, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32')

data_processing.scripts.utils.get_files(input_dir=None, filetypes=None, metadata_files=None)

Get all files of given filetypes from input directory.

Parameters

input_dir (str) – Input directory to read files from.
filetypes (sequence) – File types to fetch from the given input directory. Defaults to None.
metadata_files (str) – Comma separated string of metadata files.

Returns

List of lists containing all file paths as strings
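
Collecting files by type from a directory might look like the sketch below. It is a simplified illustration under stated assumptions: `sketch_get_files` is a hypothetical name, the search is assumed to be recursive, the result is a flat sorted list rather than a list of lists, and metadata_files is not handled at all.

```python
from pathlib import Path
from typing import List, Sequence


def sketch_get_files(input_dir: str, filetypes: Sequence[str]) -> List[str]:
    """Recursively collect paths under input_dir whose names end with a given suffix."""
    found = []
    for path in Path(input_dir).rglob("*"):
        # Keep regular files only, and only those matching one of the suffixes.
        if path.is_file() and any(path.name.endswith(ft) for ft in filetypes):
            found.append(str(path))
    # Sorting gives a stable processing order regardless of filesystem order.
    return sorted(found)
```

A call such as `sketch_get_files("data", [".jsonl", ".txt"])` would return the matching paths as strings.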

data_processing.scripts.utils.get_single_example(tokens, args, rng)

Create features, labels from tokens for HDF5.

Parameters

tokens (list) – List containing tokenized data to write.
args (argparse namespace) – Arguments for writing out HDF5 dataset.
rng (random.Random obj) – Instance of random object, with states set.

Returns

Numpy array containing features for a single example (shape [3, max_sequence_length]).

data_processing.scripts.utils.read_checkpoint(checkpoint_path, resume_from_checkpoint=True)

Checkpoint reader for execution.

Parameters

checkpoint_path (str) – Path to read checkpoint data from
resume_from_checkpoint (bool) – Resume from checkpoint for execution. Defaults to True.

Returns

Tuple containing the number of files processed and the count of tfrecords/HDF5 files written to the output directory.
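
The resume logic implied by the parameters and return value can be sketched as below. The on-disk checkpoint format is not documented here, so the sketch assumes a text file holding two comma-separated integers; `sketch_read_checkpoint` is a hypothetical name, and starting from (0, 0) when no checkpoint is used is likewise an assumption.

```python
import os
from typing import Tuple


def sketch_read_checkpoint(
    checkpoint_path: str, resume_from_checkpoint: bool = True
) -> Tuple[int, int]:
    """Return (files_processed, output_files_written), or (0, 0) to start fresh."""
    if resume_from_checkpoint and os.path.isfile(checkpoint_path):
        with open(checkpoint_path, "r") as fh:
            # Assumed format: two comma-separated integers, e.g. "12, 3".
            files_processed, files_written = (int(x) for x in fh.read().split(","))
        return files_processed, files_written
    # No checkpoint requested or none found: begin from the first file.
    return 0, 0
```

A caller could then skip the first files_processed input files and continue numbering output files from files_written.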

data_processing.scripts.utils.wikitext_detokenizer(string)

Detokenizer for wikitext. Used for special handling of data for substrings.

Parameters: string (str) – String to detokenize before tokenization.
Returns: Detokenized string
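
To give a sense of what detokenizing WikiText involves, the sketch below reverses a few of the dataset's tokenization conventions. It is an illustrative subset only: `sketch_wikitext_detokenize` is a hypothetical name, and the actual function may apply more rules or different ones.

```python
import re


def sketch_wikitext_detokenize(text: str) -> str:
    """Undo a few WikiText tokenization conventions (illustrative subset only)."""
    # WikiText writes intra-word hyphens and number separators
    # as " @-@ ", " @,@ " and " @.@ ".
    text = text.replace(" @-@ ", "-")
    text = text.replace(" @,@ ", ",")
    text = text.replace(" @.@ ", ".")
    # Re-attach punctuation that was split off with a leading space.
    text = re.sub(r" ([.,;:!?])", r"\1", text)
    return text


print(sketch_wikitext_detokenize("The well @-@ known city has 1 @,@ 234 people ."))
# prints: The well-known city has 1,234 people.
```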

Module contents
