cerebras.modelzoo.data_preparation.data_preprocessing.utils

Functions

add_preprocess_args

Add arguments to the data preprocessing parser.

append_eos_to_multiple_semantic_regions

args_to_params

Process data preprocessing CLI arguments into parameters.

check_fim_special_tokens

chunk

Since we do character-level FIM, we need to detokenize, determine boundaries to split, and re-tokenize after splitting.

clean_text

Clean the provided text using ftfy normalization and wikitext detokenization.
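A minimal sketch of this cleaning step, assuming boolean flags for each sub-step (ftfy.fix_text is the real ftfy entry point; wikitext_detokenizer is this module's own helper, documented below):

    import ftfy
    from cerebras.modelzoo.data_preparation.data_preprocessing.utils import (
        wikitext_detokenizer,
    )

    def clean_text_sketch(text, use_ftfy=True, use_wikitext_detok=False):
        if use_ftfy:
            text = ftfy.fix_text(text)         # repair mojibake, normalize unicode
        if use_wikitext_detok:
            text = wikitext_detokenizer(text)  # undo wikitext token spacing
        return text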

dump_args

Write the input params to file.
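For reproducibility this amounts to serializing the resolved parameters next to the processed outputs; a sketch, with the file name as an assumption:

    import json
    import os

    def dump_args_sketch(params, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "data_params.json"), "w") as f:
            json.dump(params, f, indent=2, sort_keys=True)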

dump_result

Write the outputs of execution.

fim

Takes in an array of input_ids, masks, and labels, and performs the FIM operation to rearrange them into PSM or SPM format with some probability.
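The rearrangement at the heart of this operation (Bavarian et al., 2022) can be sketched as below; only the PSM branch is shown, and the sentinel token ids, the fim_rate parameter, and the plain-list representation are assumptions for illustration:

    import random

    PRE_ID, SUF_ID, MID_ID = 50281, 50282, 50283  # hypothetical sentinel ids

    def fim_psm_sketch(text, tokenize, fim_rate=0.9):
        """Rearrange one document into PSM order with probability fim_rate."""
        if random.random() > fim_rate or len(text) < 2:
            return tokenize(text)  # keep the sample in ordinary AR order
        # pick two character boundaries, then split into prefix/middle/suffix
        lo, hi = sorted(random.sample(range(len(text) + 1), 2))
        prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
        # PSM order: <PRE> prefix <SUF> suffix <MID> middle
        return ([PRE_ID] + tokenize(prefix) +
                [SUF_ID] + tokenize(suffix) +
                [MID_ID] + tokenize(middle))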

find_region_in_formatted_string

find_token_range

format_fim

Takes in a list of prefix/middle/suffix token lists, along with the respective FIM (or AR) formats.

get_data_stats

Get data statistics from the sample.
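The exact statistics tracked per sample are not spelled out here; a plausible sketch, with every field name an assumption:

    def get_data_stats_sketch(input_ids, loss_mask, pad_id):
        num_pad = sum(1 for t in input_ids if t == pad_id)
        return {
            "num_tokens": len(input_ids),
            "num_pad_tokens": num_pad,
            "non_pad_tokens": len(input_ids) - num_pad,
            "loss_valid_tokens": sum(loss_mask),  # positions contributing to loss
        }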

get_files

Get all files of given filetypes from input directory.
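A hedged sketch of the directory scan; the default filetype list and the recursive traversal are assumptions:

    from pathlib import Path

    def get_files_sketch(input_dir,
                         filetypes=(".jsonl", ".json", ".txt", ".parquet")):
        root = Path(input_dir)
        return sorted(str(p) for p in root.rglob("*")
                      if p.is_file() and p.suffix in filetypes)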

get_params

Retrieve configuration parameters.

get_parser

Argument-parser definition for command-line arguments from the user.

get_size

Recursively finds the size of objects.
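This is the classic recursive sys.getsizeof traversal; a sketch that may differ from the module's exact rules (e.g. how shared references are deduplicated):

    import sys

    def get_size_sketch(obj, seen=None):
        seen = set() if seen is None else seen
        if id(obj) in seen:
            return 0  # count shared objects only once
        seen.add(id(obj))
        size = sys.getsizeof(obj)
        if isinstance(obj, dict):
            size += sum(get_size_sketch(k, seen) + get_size_sketch(v, seen)
                        for k, v in obj.items())
        elif isinstance(obj, (list, tuple, set, frozenset)):
            size += sum(get_size_sketch(i, seen) for i in obj)
        return size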

get_tokenizer_vocab

handle_bos_token_default

When performing FIM, we tokenize each chunk again after splitting; this helper determines whether the tokenizer adds a BOS token by default so that it can be handled consistently across re-tokenizations.

has_valid_extension

listdir_or_file

pad_helper

Helper for padding.
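In its simplest form, padding appends pad_id until the sequence reaches the maximum sequence length; the signature below is an assumption:

    def pad_helper_sketch(token_ids, max_seq_len, pad_id):
        n_pad = max_seq_len - len(token_ids)
        assert n_pad >= 0, "sequence already exceeds max_seq_len"
        return token_ids + [pad_id] * n_pad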

setup_warning_logging

Set up logging to log warnings to a file in the specified output directory.
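A sketch using the standard logging module; the logger name and log file name are assumptions:

    import logging
    import os

    def setup_warning_logging_sketch(output_dir):
        os.makedirs(output_dir, exist_ok=True)
        logger = logging.getLogger("preprocess_warnings")
        logger.setLevel(logging.WARNING)
        handler = logging.FileHandler(os.path.join(output_dir, "warnings.log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        return logger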

split_text_and_tokenize

Split the text into smaller sequences of length max_tok_len and then tokenize each of the smaller sequences.
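One way to realize this is to split on whitespace first so no single tokenizer call sees an overlong input; the word-chunk heuristic below is an assumption:

    def split_text_and_tokenize_sketch(text, tokenize, max_tok_len=2048,
                                       chunk_len_in_words=1000):
        words = text.split()
        token_chunks = []
        for i in range(0, len(words), chunk_len_in_words):
            ids = tokenize(" ".join(words[i:i + chunk_len_in_words]))
            # re-split any chunk that still tokenizes past the limit
            token_chunks.extend(ids[j:j + max_tok_len]
                                for j in range(0, len(ids), max_tok_len))
        return token_chunks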

truncate_helper

The goal of our truncation scheme is to avoid removing tokens from the middle section.
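Concretely, that means trimming from the outer ends of the prefix and suffix before ever touching the middle; a sketch under that assumption:

    def truncate_helper_sketch(prefix, middle, suffix, max_seq_len):
        excess = len(prefix) + len(middle) + len(suffix) - max_seq_len
        if excess <= 0:
            return prefix, middle, suffix
        # drop from the start of the prefix and the end of the suffix first;
        # a real implementation must also handle the case where the prefix and
        # suffix alone cannot absorb the excess
        cut_p = min((excess + 1) // 2, len(prefix))
        cut_s = min(excess - cut_p, len(suffix))
        return prefix[cut_p:], middle, suffix[:len(suffix) - cut_s]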

truncate_or_pad_helper

Since we perform FIM at the character level, we may split in the middle of a word.

truncate_sequence

Truncates token sequences to fit within a specified maximum sequence length (MSL), parameterized by max_turn_length.

update_args

Update eos_id and pad_id in data_params.

update_params

Update config parameters with CLI arguments.
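A minimal sketch of overlaying CLI values onto the loaded config; the flat, non-None-wins merge is an assumption about the real precedence rules:

    def update_params_sketch(params, cli_args):
        for key, value in vars(cli_args).items():
            if value is not None:
                params[key] = value  # CLI value overrides the config file
        return params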

wikitext_detokenizer

Detokenizer for wikitext.
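The wikitext detokenization rules in wide circulation (e.g. in Megatron-style preprocessing) look like the sketch below; the module's version may differ in detail:

    import re

    def wikitext_detokenizer_sketch(s):
        s = s.replace("s '", "s'")
        s = s.replace(" @-@ ", "-").replace(" @,@ ", ",").replace(" @.@ ", ".")
        for p in (":", ";", ".", "!", "?", ","):
            s = s.replace(f" {p} ", f"{p} ")           # re-attach punctuation
        s = re.sub(r"\(\s*([^)]*?)\s*\)", r"(\1)", s)  # tighten parentheses
        s = re.sub(r'"\s*([^"]*?)\s*"', r'"\1"', s)    # tighten quotes
        return s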