cerebras.modelzoo.data_preparation.data_preprocessing.utils
Functions
Add arguments to the data preprocessing parser.

Process data preprocessing CLI arguments into parameters.

Since we do character-level FIM, we need to detokenize, determine the boundaries to split at, and re-tokenize after splitting.

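A minimal sketch of that flow, assuming a HuggingFace-style tokenizer with `decode`/`encode`; the helper name and the random boundary choice are illustrative, not the module's actual API:

```python
import random

def fim_split_and_retokenize(token_ids, tokenizer):
    """Detokenize, pick two character boundaries, re-tokenize each span."""
    text = tokenizer.decode(token_ids)
    # Pick two distinct character positions; everything before the first is
    # the prefix, between them the middle, after the second the suffix.
    lo, hi = sorted(random.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # Each span is tokenized independently after the character-level split.
    return (
        tokenizer.encode(prefix),
        tokenizer.encode(middle),
        tokenizer.encode(suffix),
    )
```
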
Clean the provided text using ftfy normalization and wikitext detokenization.

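A minimal sketch of such a cleaning step; `ftfy.fix_text` is the real ftfy entry point, while the signature and the optional `detokenizer` hook (e.g. the wikitext detokenizer sketched at the end of this page) are assumptions:

```python
import ftfy

def clean_text(text, use_ftfy=True, detokenizer=None):
    # Repair mojibake and normalize Unicode.
    if use_ftfy:
        text = ftfy.fix_text(text)
    # Optionally undo dataset-specific tokenization artifacts.
    if detokenizer is not None:
        text = detokenizer(text)
    return text
```
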
Write the input params to a file.

Write the outputs of execution.

Take in an array of input_ids, mask, and labels, and perform the FIM operation to rearrange them into PSM or SPM format with some probability.

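A sketch of the rearrangement on plain token lists, assuming hypothetical sentinel ids and rate parameters; the real utility also rearranges the mask and labels alongside the input_ids, and SPM has more than one variant in the literature:

```python
import random

# Hypothetical sentinel token ids; real values come from the tokenizer vocab.
PRE, SUF, MID = 50253, 50254, 50255

def apply_fim(prefix, middle, suffix, fim_rate=0.9, spm_rate=0.5):
    """With probability fim_rate, rearrange into PSM or SPM; else keep AR."""
    if random.random() >= fim_rate:
        return prefix + middle + suffix  # AR: leave the document untouched
    if random.random() < spm_rate:
        # SPM (joint variant): the suffix is presented before the prefix.
        return [PRE, SUF] + suffix + [MID] + prefix + middle
    # PSM: prefix, then suffix, with the middle moved to the end.
    return [PRE] + prefix + [SUF] + suffix + [MID] + middle
```
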
Take in a list of prefix/middle/suffix token lists, along with the respective FIM (or AR) formats.

Get data statistics from the sample.

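For instance, a sketch of the kind of per-sample counts such a helper might report; the field names and the loss_mask convention are assumptions:

```python
import numpy as np

def get_data_stats(input_ids, loss_mask, pad_id):
    input_ids = np.asarray(input_ids)
    loss_mask = np.asarray(loss_mask)
    return {
        "num_tokens": int(input_ids.size),
        "non_pad_tokens": int((input_ids != pad_id).sum()),
        "loss_valid_tokens": int(loss_mask.sum()),  # positions counted in the loss
    }
```
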
Get all files of the given filetypes from the input directory.

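A sketch using pathlib; the default extension list is an assumption:

```python
from pathlib import Path

def get_files(input_dir, filetypes=(".jsonl", ".json", ".txt", ".parquet")):
    # Walk the directory recursively and keep files with a matching suffix.
    root = Path(input_dir)
    return sorted(
        str(p) for p in root.rglob("*") if p.is_file() and p.suffix in filetypes
    )
```
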
Retrieve configuration parameters.

Define the argument parser for command-line arguments from the user.

Recursively find the size of an object.

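A sketch following the common sys.getsizeof recursion recipe (the real utility may cover more container types):

```python
import sys

def get_size(obj, seen=None):
    """Sum sys.getsizeof over an object and everything it references."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0  # avoid double-counting shared or cyclic references
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(get_size(k, seen) + get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(get_size(item, seen) for item in obj)
    elif hasattr(obj, "__dict__"):
        size += get_size(vars(obj), seen)
    return size
```
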
When performing FIM, we tokenize each chunk again after splitting.

Helper for padding.

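For example, a sketch that right-pads a token list to a fixed length; the names are illustrative:

```python
def pad_helper(token_ids, max_seq_length, pad_id):
    # Append pad_id until the sequence reaches max_seq_length; longer
    # sequences are returned unchanged (truncation is handled elsewhere).
    return token_ids + [pad_id] * max(max_seq_length - len(token_ids), 0)
```
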
Set up logging to log warnings to a file in the specified output directory.

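A sketch with the standard logging module; the log file name is an assumption:

```python
import logging
import os

def setup_warning_logging(output_dir, logger_name="data_preprocessing"):
    os.makedirs(output_dir, exist_ok=True)
    handler = logging.FileHandler(os.path.join(output_dir, "warnings.log"))
    handler.setLevel(logging.WARNING)  # only warnings and above reach the file
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    return logger
```
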
Split the text into smaller sequences of length max_tok_len and then tokenize each of the smaller sequences.

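A sketch assuming the split is done on whitespace-delimited words; the actual chunking unit may differ:

```python
def split_text_and_tokenize(text, tokenizer, max_tok_len=2000):
    # Tokenize long documents chunk by chunk so no single encode() call
    # receives an enormous string, then concatenate the resulting ids.
    words = text.split()
    token_ids = []
    for i in range(0, len(words), max_tok_len):
        chunk = " ".join(words[i : i + max_tok_len])
        token_ids.extend(tokenizer.encode(chunk))
    return token_ids
```
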
The goal of our truncation scheme is to avoid removing tokens from the middle section.

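A sketch of one way to realize that goal, trimming the suffix and prefix before ever touching the middle; the function name and trimming order are assumptions:

```python
def truncate_fim(prefix, middle, suffix, max_seq_len):
    overflow = len(prefix) + len(middle) + len(suffix) - max_seq_len
    if overflow <= 0:
        return prefix, middle, suffix
    # Drop tokens from the end of the suffix first...
    cut = min(overflow, len(suffix))
    suffix, overflow = suffix[: len(suffix) - cut], overflow - cut
    # ...then from the start of the prefix (keeping tokens near the middle)...
    cut = min(overflow, len(prefix))
    prefix, overflow = prefix[cut:], overflow - cut
    # ...and only touch the middle as a last resort.
    if overflow > 0:
        middle = middle[: len(middle) - overflow]
    return prefix, middle, suffix
```
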
Since we perform FIM at the character level, the split boundaries can land in the middle of a word.

Truncate token sequences to fit within a specified maximum sequence length (MSL), parameterized by max_turn_length.

Update eos_id and pad_id in data_params.

Update config parameters with CLI arguments.

Detokenizer for wikitext.

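An abridged version of the detokenization recipe commonly used for wikitext (undoing the extra spaces around punctuation, the '@' number separators, and the spaced heading markers); a representative implementation, not necessarily the exact one in this module:

```python
import re

def wikitext_detokenizer(string):
    # Number separators: " @-@ " -> "-", " @,@ " -> ",", " @.@ " -> "."
    string = string.replace(" @-@ ", "-")
    string = string.replace(" @,@ ", ",")
    string = string.replace(" @.@ ", ".")
    # Spaces before punctuation
    for punct in [":", ";", ".", "!", "?", ","]:
        string = string.replace(f" {punct} ", f"{punct} ")
    # Spaces just inside brackets and quotes
    string = re.sub(r"\(\s*([^)]*?)\s*\)", r"(\1)", string)
    string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string)
    string = re.sub(r'"\s*([^"]*?)\s*"', r'"\1"', string)
    # Spaced-out section-heading markers
    string = string.replace("= = = =", "====")
    string = string.replace("= = =", "===")
    string = string.replace("= =", "==")
    # Contractions
    string = string.replace(" 's", "'s")
    return string
```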