cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils

Functions

`add_common_args`	Add the command-line arguments shared by all subcommands to each subcommand parser.
`add_dpo_args`
`add_llava_common_args`
`add_llava_phase_1_args`
`add_llava_phase_2_args`
`add_lm_args`	The language-modeling format is common enough (FIM is very similar) that its arguments can be reused.
`add_mlm_args`
`add_summarization_args`
`add_summarization_vsl_args`
`check_fim_special_tokens`
`chunk`	Since we do character-level FIM we need to detokenize, determine boundaries to split, and re-tokenize after splitting.
`collect_stats`	Collect statistics of the dataset.
`create_features_auto_lm`	Given a list of token_ids, generate input sequence and labels.
`create_features_auto_lm_vsl`	Given a list of VSL sequences, generate input features and labels.
`create_features_llava_phase1`	Given a list of VSL sequences, generate input features and labels.
`create_features_llava_phase2`	Given a list of VSL sequences, generate input features and labels.
`create_features_summarization`	Given a list of prompt_ids and completion_ids, generate input sequence and labels.
`create_features_summarization_vsl`	Given a list of VSL sequences, generate input features and labels.
`dump_args`	Write the input params to file.
`dump_result`	Write the outputs of execution.
`fim`	Takes in an array of input_ids, mask, and labels, and performs the FIM operation to rearrange them into PSM or SPM format with some probability.
`format_fim`	Takes in list of prefix/middle/suffix token lists, along with respective FIM (or AR) formats.
`get_files`	Get all files of given filetypes from input directory.
`get_params`	Retrieve configuration parameters.
`get_parser`	Argparser definition for command line arguments from user.
`get_tokenizer_vocab`
`get_verification_args`	Get arguments for verifying an HDF5 dataset. Parameters: `params` (dict), a dictionary containing parameters for verifying the HDF5 dataset; `data_processor`, the class containing methods that specify how the dataset will be processed and written into HDF5 files.
`handle_bos_token_default`	When performing FIM, we tokenize each chunk again after splitting.
`handle_jsonl`
`has_valid_extension`
`listdir_or_file`
`multimodal_add_image_patch_start_idx`
`pad_helper`	Helper for padding.
`process_dataset`	Process a dataset and write it into HDF5 format.
`read_checkpoint`	Checkpoint reader for execution.
`set_defaults`
`split_text_and_tokenize`	Function to split the text into smaller sequences of length max_tok_len and then tokenize each of the smaller sequences.
`truncate_helper`	The goal of our truncation scheme is to avoid removing tokens from the middle section.
`truncate_or_pad_helper`	Since we perform FIM at character-level, we potentially split characters in the middle of a word.
`update_params`	Update config parameters with CLI arguments.
`validate_tokens`
`verify_saved_hdf5_files`	Run sanity checks after the HDF5 files have been created. Loads every generated .h5 file and checks (1) the data type, (2) the shape of the dataset, and (3) that the labels and inputs are as expected.
`verify_saved_hdf5_files_mp`	Verify the generated HDF5 dataset.
`wikitext_detokenizer`	Detokenizer for wikitext.
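
The summaries for `chunk`, `fim`, and `format_fim` describe character-level fill-in-the-middle (FIM): split a document at character boundaries into prefix, middle, and suffix, then rearrange the pieces into PSM or SPM order with some probability. The sketch below illustrates that idea on plain strings only. It is not the module's implementation: the function name `fim_sketch`, the sentinel strings, the `fim_rate` and `spm_rate` parameters, and the exact SPM layout are all assumptions made for illustration, and the tokenization steps are omitted.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own FIM special tokens.
PRE, MID, SUF = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"


def fim_sketch(text, rng, fim_rate=0.9, spm_rate=0.5):
    """Return `text` unchanged, or rearranged into PSM or SPM order.

    fim_rate: probability of applying FIM at all (assumed parameter).
    spm_rate: probability of choosing SPM over PSM when FIM is applied.
    """
    if rng.random() >= fim_rate:
        return text  # keep the plain autoregressive ordering

    # Pick two character boundaries and split into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]

    if rng.random() < spm_rate:
        # SPM: suffix first, then prefix, then middle (one common layout).
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
    # PSM: prefix first, then suffix, then middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


if __name__ == "__main__":
    print(fim_sketch("def add(a, b):\n    return a + b\n", random.Random(0)))
```

Because the split happens on characters rather than tokens, a real pipeline would detokenize before splitting and re-tokenize each piece afterwards, which is why the listing also mentions helpers for BOS handling, truncation, and padding.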

Classes

`DatasetStats`	DatasetStats(num_sequences: int, num_tokens: int, detokenized_bytes: int, detokenized_chars: int, non_pad_tokens: int, loss_valid_tokens: int)
`DocObject`
`Reader`
`VerificationArgs`	VerificationArgs(processes: int, files_per_record: int, max_seq_length: int, tokenizer_obj: object, eos_id: int, pad_id: int, num_features: int)
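
The `DatasetStats` signature above lists six integer counters. As a rough illustration of what such counters could hold for a single padded sequence, the sketch below defines a stand-in dataclass with the same field names and fills it in. The class name `DatasetStatsSketch`, the helper `stats_for_sequence`, and the meaning assigned to each field (in particular, `loss_valid_tokens` as the number of positions with a nonzero loss mask) are assumptions, not taken from the module.

```python
from dataclasses import dataclass


@dataclass
class DatasetStatsSketch:
    # Field names mirror the documented DatasetStats signature.
    num_sequences: int
    num_tokens: int
    detokenized_bytes: int
    detokenized_chars: int
    non_pad_tokens: int
    loss_valid_tokens: int


def stats_for_sequence(input_ids, loss_mask, text, pad_id):
    """Compute the counters for one padded sequence (hypothetical helper)."""
    return DatasetStatsSketch(
        num_sequences=1,
        num_tokens=len(input_ids),
        detokenized_bytes=len(text.encode("utf-8")),
        detokenized_chars=len(text),
        non_pad_tokens=sum(1 for tok in input_ids if tok != pad_id),
        loss_valid_tokens=sum(1 for m in loss_mask if m),
    )


if __name__ == "__main__":
    print(stats_for_sequence([5, 6, 7, 0, 0], [1, 1, 0, 0, 0], "héllo", pad_id=0))
```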
