data_processing.scripts package#
Subpackages#
- data_processing.scripts.hdf5_preprocessing package
  - Submodules
    - data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 module
    - data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset module
    - data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor module
    - data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors module
    - data_processing.scripts.hdf5_preprocessing.utils module
  - Module contents
- data_processing.scripts.pubmed package
Submodules#
data_processing.scripts.utils module#
Common utils.py module providing utility functions shared by the scripts in any of the subfolders.
- data_processing.scripts.utils.archive_to_tokens(f, tokenizer, args, prefix=[])[source]#
Generator that yields the tokenized contents of the files in an archive. If prefix is non-empty, its data plus an EOS separator are prepended to the encoded data.
- Parameters
f (file) – Archive file to read.
tokenizer (BPETokenizer obj) – Tokenizer used to encode raw data.
args (argparse namespace) – Arguments for writing out tfrecords/HDF5.
prefix (list) – Data to prepend before splitting into the given context length; used to carry over remainder data from the previous iteration of data reads. Defaults to [], i.e., an empty list.
- Yields
A list of lists with tokenized data + EOS separator appended at the end.
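A minimal usage sketch, assuming a pre-built tokenizer and an argument namespace with the fields the calling scripts usually parse (both are placeholders here, not part of this module):

```python
from types import SimpleNamespace

from data_processing.scripts.utils import archive_to_tokens

# Placeholder tokenizer: the real scripts construct a BPETokenizer instance.
tokenizer = ...

# Placeholder args: the real scripts pass their parsed argparse namespace;
# the field names below are assumptions for illustration only.
args = SimpleNamespace(max_seq_length=2048, eos_id=0, pad_id=0)

prefix = []  # remainder tokens carried over from a previous archive read
with open("data/shard_00.jsonl.zst", "rb") as f:  # hypothetical archive path
    for token_lists in archive_to_tokens(f, tokenizer, args, prefix=prefix):
        ...  # each yield is a list of token lists with the EOS separator appended
```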
- data_processing.scripts.utils.create_features_auto_lm(token_ids, max_sequence_length, short_seq_prob=0, inverted_mask=False, pad_id=0, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32', rng=None)[source]#
Given a list of token_ids, generate input sequence and labels.
- Parameters
token_ids (sequence) – List of token ids from which features, labels, and the input mask are created.
max_sequence_length (int) – Maximum sequence length for data writes.
short_seq_prob (float) – Probability of generating short sequences from data. Defaults to 0.
inverted_mask (bool) – Invert mask if specified for runtime execution. Defaults to False.
pad_id (int) – Id for pad token. Defaults to 0.
input_ids_dtype (str) – Dtype as string for input ids. Defaults to int32.
input_mask_dtype (str) – Dtype as string for input mask. Defaults to int32.
labels_dtype (str) – Dtype as string for labels. Defaults to int32.
rng (random.Random obj) – Random number generator instance with its state set. Defaults to None.
- Returns
Tuple containing features and labels
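A hedged call sketch, assuming the documented tuple return; the token ids and parameter values are illustrative:

```python
import random

from data_processing.scripts.utils import create_features_auto_lm

rng = random.Random(42)  # seeded so short-sequence sampling is reproducible
token_ids = list(range(1, 130))  # illustrative token ids for one context window

features, labels = create_features_auto_lm(
    token_ids,
    max_sequence_length=128,
    short_seq_prob=0.0,  # never sample a shortened sequence
    pad_id=0,
    rng=rng,
)
```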
- data_processing.scripts.utils.create_features_labels(token_ids, max_sequence_length, short_seq_prob=0, inverted_mask=False, pad_id=0, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32', rng=None)[source]#
Given a list of token_ids, generate input sequence and labels.
- Parameters
token_ids (sequence) – List of token ids from which features, labels, and the input mask are created.
max_sequence_length (int) – Maximum sequence length for data writes.
short_seq_prob (float) – Probability of generating short sequences from data. Defaults to 0.
inverted_mask (bool) – Invert mask if specified for runtime execution. Defaults to False.
pad_id (int) – Id for pad token. Defaults to 0.
input_ids_dtype (str) – Dtype as string for input ids. Defaults to int32.
input_mask_dtype (str) – Dtype as string for input mask. Defaults to int32.
labels_dtype (str) – Dtype as string for labels. Defaults to int32.
rng (random.Random obj) – Random number generator instance with its state set. Defaults to None.
- Returns
Tuple containing features and labels
- data_processing.scripts.utils.create_features_summarization(prompt_ids, completion_ids, max_sequence_length, eos_id, sep_id, pad_id=0, inverted_mask=False, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32')[source]#
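This function carries no docstring here; judging from its signature, it presumably joins a prompt/completion pair with the separator token, terminates it with EOS, and pads to the context length. A call sketch under that assumption (all token ids are illustrative):

```python
from data_processing.scripts.utils import create_features_summarization

# Illustrative token ids; real values come from the tokenizer's vocabulary.
prompt_ids = [11, 12, 13, 14]
completion_ids = [21, 22, 23]

features = create_features_summarization(
    prompt_ids,
    completion_ids,
    max_sequence_length=32,
    eos_id=2,
    sep_id=3,
    pad_id=0,
)
# The return structure is undocumented above; by analogy with the sibling
# functions it likely carries input ids, input mask, and labels.
```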
- data_processing.scripts.utils.get_files(input_dir=None, filetypes=None, metadata_files=None)[source]#
Get all files of given filetypes from input directory.
- Parameters
input_dir (str) – Input directory to read files from.
filetypes (sequence) – File types to fetch from the given input directory. Defaults to None.
metadata_files (str) – Comma-separated string of metadata files. Defaults to None.
- Returns
List of lists containing all file paths as strings
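A usage sketch; the directory and file names are hypothetical:

```python
from data_processing.scripts.utils import get_files

# Collect every .txt and .jsonl.zst file under a raw-data directory.
files = get_files(input_dir="data/raw", filetypes=[".txt", ".jsonl.zst"])

# Or resolve paths listed in metadata files instead of walking a directory.
files = get_files(metadata_files="train_meta.txt,val_meta.txt")
```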
- data_processing.scripts.utils.get_single_example(tokens, args, rng)[source]#
Create features and labels from tokens for writing to HDF5.
- Parameters
tokens (list) – List containing tokenized data to write.
args (argparse namespace) – Arguments for writing out the HDF5 dataset.
rng (random.Random obj) – Random number generator instance with its state set.
- Returns
Numpy array containing the features for a single example, of shape [3, max_sequence_length].
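A sketch of turning one tokenized document into an HDF5-ready example; the fields on args are assumptions standing in for the script's parsed namespace:

```python
import random
from types import SimpleNamespace

from data_processing.scripts.utils import get_single_example

rng = random.Random(0)
# Assumed argument fields; the real caller passes the argparse namespace
# produced by the HDF5 dataset-writing script.
args = SimpleNamespace(max_seq_length=128, short_seq_prob=0.0, pad_id=0)

tokens = list(range(1, 130))  # illustrative tokenized document
example = get_single_example(tokens, args, rng)
print(example.shape)  # expected: (3, max_sequence_length)
```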
- data_processing.scripts.utils.read_checkpoint(checkpoint_path, resume_from_checkpoint=True)[source]#
Checkpoint reader for execution.
- Parameters
checkpoint_path (str) – Path to read checkpoint data from
resume_from_checkpoint (bool) – Resume from checkpoint for execution. Defaults to True.
- Returns
Tuple containing the number of files processed and the count of tfrecords/HDF5 files written to the output directory.
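A resume sketch, assuming the documented two-element return; the checkpoint path is illustrative:

```python
from data_processing.scripts.utils import read_checkpoint

# Pick up where a previous preprocessing run stopped.
files_processed, files_written = read_checkpoint(
    "output/checkpoint.txt",  # hypothetical checkpoint location
    resume_from_checkpoint=True,
)
print(f"{files_processed} input files already processed; "
      f"{files_written} tfrecords/HDF5 files already written")
```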