cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils
Functions
| Adds common command-line arguments to each subcommand parser, so that argparse can parse the arguments for every subcommand. | |
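A minimal sketch of this pattern — the helper name and the specific flags below are hypothetical illustrations, not the module's actual API:

```python
import argparse

def add_common_args(parser):
    # Hypothetical shared flags attached to every subcommand parser.
    parser.add_argument("--input_dir", type=str, help="Directory with raw input files.")
    parser.add_argument("--output_dir", type=str, default="./hdf5_dataset",
                        help="Where HDF5 shards are written.")

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="mode")
for mode in ("LMData", "Summarization"):
    sub = subparsers.add_parser(mode)
    add_common_args(sub)  # each subcommand re-uses the common flags

args = parser.parse_args(["LMData", "--input_dir", "raw/"])
```

Defining the flags once and attaching them to each subparser keeps the subcommands' interfaces consistent without duplicating argument definitions.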
| The language-modeling format is common enough (FIM is very similar) that its arguments can be re-used. | |
| Since FIM is performed at the character level, we need to detokenize, determine split boundaries, and re-tokenize after splitting. | |
| Collect statistics of the dataset. | |
| Given a list of token_ids, generate input sequence and labels. | |
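For a standard autoregressive setup, deriving the input sequence and labels from token ids is a one-token shift. A sketch under that assumption (function name hypothetical):

```python
def create_lm_features(token_ids):
    # Next-token prediction: the input drops the last token, and the
    # labels are the same sequence shifted left by one position.
    input_ids = token_ids[:-1]
    labels = token_ids[1:]
    return input_ids, labels

features, labels = create_lm_features([10, 11, 12, 13])
# features == [10, 11, 12], labels == [11, 12, 13]
```

The actual utilities also emit an attention mask and handle padding; this shows only the shift.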
| Given a list of VSL sequences, generate input features and labels. | |
| Given a list of VSL sequences, generate input features and labels. | |
| Given a list of VSL sequences, generate input features and labels. | |
| Given a list of prompt_ids and completion_ids, generate input sequence and labels. | |
| Given a list of VSL sequences, generate input features and labels. | |
| Write the input params to file. | |
| Write the outputs of execution. | |
| Takes an array of input_ids, masks, and labels, and performs the FIM operation to re-arrange them into PSM or SPM format with some probability. | |
| Takes lists of prefix/middle/suffix token lists, along with their respective FIM (or AR) formats. | |
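The PSM/SPM re-arrangement can be sketched as follows. The sentinel token strings and the exact ordering conventions are assumptions — they vary between implementations — not the module's actual constants:

```python
import random

PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def format_fim(prefix, middle, suffix, spm_prob=0.5, rng=random):
    # Re-arrange a (prefix, middle, suffix) split into PSM or SPM order.
    # In both formats the middle comes last, so the model learns to
    # generate it conditioned on the surrounding context.
    if rng.random() < spm_prob:
        # SPM: suffix before prefix (one common convention)
        return [SUF] + suffix + [PRE] + prefix + [MID] + middle
    # PSM: prefix before suffix
    return [PRE] + prefix + [SUF] + suffix + [MID] + middle
```

Passing `spm_prob=0.0` or `1.0` forces one format deterministically, which is convenient for testing.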
| Get all files of given filetypes from input directory. | |
| Retrieve configuration parameters. | |
| Argparser definition for command-line arguments from the user. | |
| Get arguments for verifying an HDF5 dataset, given params (a dict of verification parameters) and data_processor (a class whose methods specify how the dataset is processed and written into HDF5 files). | |
| When performing FIM, we tokenize each chunk again after splitting. | |
| Helper for padding. | |
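The padding helper presumably reduces to something like the following sketch (signature hypothetical):

```python
def pad_sequence(token_ids, max_len, pad_id=0):
    # Right-pad with pad_id up to max_len, truncating anything longer.
    padded = token_ids + [pad_id] * (max_len - len(token_ids))
    return padded[:max_len]

pad_sequence([5, 6, 7], max_len=5)  # -> [5, 6, 7, 0, 0]
```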
| Process a dataset and write it into HDF5 format. | |
| Checkpoint reader for execution. | |
| Split the text into smaller sequences of length max_tok_len and then tokenize each of the smaller sequences. | |
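A sketch of the split-then-tokenize approach, with a toy character-level tokenizer standing in for the real one:

```python
def split_and_tokenize(text, tokenizer, max_len):
    # Tokenizing a very long document in one call can be slow and
    # memory-hungry, so cut the raw text into max_len-character chunks
    # and tokenize each chunk independently.
    token_ids = []
    for i in range(0, len(text), max_len):
        token_ids.extend(tokenizer(text[i:i + max_len]))
    return token_ids

# Toy tokenizer: one id (the code point) per character.
ids = split_and_tokenize("abcdef", lambda s: [ord(c) for c in s], max_len=4)
# ids == [97, 98, 99, 100, 101, 102]
```

Note that with a subword tokenizer, chunk boundaries can yield different tokens than tokenizing the whole text at once, which is why the surrounding utilities are careful about boundaries when re-tokenizing.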
| The goal of our truncation scheme is to avoid removing tokens from the middle section. | |
| Since we perform FIM at the character level, we may split a word in the middle. | |
| Update config parameters with CLI arguments. | |
| Performs sanity checks after the HDF5 files are created: every generated .h5 file is loaded and checked for 1. the data type, 2. the shape of the dataset, and 3. that labels and inputs are as expected. | |
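Those three checks might look like the following sketch. The (num_sequences, 3, max_sequence_length) layout holding input ids, attention mask, and labels is an assumption about the shard format:

```python
import numpy as np

def verify_shard(data, msl, expected_dtype=np.int32):
    # data: the array stored in one generated .h5 shard (assumed layout:
    # num_sequences x [input_ids, attention_mask, labels] x msl).
    assert data.dtype == expected_dtype                    # 1. data type
    assert data.ndim == 3 and data.shape[1:] == (3, msl)   # 2. shape
    for input_ids, _, labels in data:
        # 3. labels are the input ids shifted left by one position
        assert np.array_equal(labels[:-1], input_ids[1:])
    return True

shard = np.array([[[1, 2, 3, 4],      # input_ids
                   [1, 1, 1, 1],      # attention mask
                   [2, 3, 4, 0]]],    # labels (shifted, padded)
                 dtype=np.int32)
verify_shard(shard, msl=4)  # -> True
```

In the real utility the array would be read from disk with h5py before being checked.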
| Verify the generated HDF5 dataset. | |
| Detokenizer for wikitext. | |
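The tokenized WikiText distribution escapes intra-word punctuation with @ markers and spaces out punctuation; a minimal sketch of undoing that (the real utility handles more cases):

```python
def wikitext_detokenize(text):
    # Undo WikiText's @-escaped punctuation (e.g. "7 @-@ day" -> "7-day"),
    # then remove the extra space before common punctuation.
    for src, tgt in ((" @-@ ", "-"), (" @,@ ", ","), (" @.@ ", ".")):
        text = text.replace(src, tgt)
    return text.replace(" ,", ",").replace(" .", ".")

wikitext_detokenize("a 7 @-@ day trip , costing 1 @,@ 000 .")
# -> "a 7-day trip, costing 1,000."
```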
Classes