cerebras.modelzoo.data_preparation.utils
Functions
Converts a string (e.g. from parsing CSV) of the form …
Converts text to unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
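The conversion above can be sketched as follows; the function name and exact error handling are illustrative assumptions, not necessarily what this module exports:

```python
def convert_to_unicode(text):
    """Return `text` as a unicode `str`, decoding bytes as UTF-8."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        # "ignore" drops undecodable byte sequences rather than raising
        return text.decode("utf-8", "ignore")
    raise TypeError(f"Unsupported string type: {type(text)}")
```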
Counts the total number of documents in metadata_files.

:param str or list[str] metadata_files: Path or list of paths to metadata files.
:returns: Number of documents whose paths are contained in the metadata files.
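A minimal sketch of this counting, assuming each metadata file lists one document path per line (the function name is illustrative):

```python
def count_total_documents(metadata_files):
    """Count document paths listed (one per line) across metadata files."""
    if isinstance(metadata_files, str):
        metadata_files = [metadata_files]
    total = 0
    for metadata_file in metadata_files:
        with open(metadata_file, "r") as fin:
            # Skip blank lines so trailing newlines are not counted
            total += sum(1 for line in fin if line.strip())
    return total
```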
Creates the predictions for the masked LM objective.

:param list tokens: List of tokens to process.
:param list vocab_words: List of all words present in the vocabulary.
:param bool mask_whole_word: If True, mask all the subtokens of a word.
:param int max_predictions_per_seq: Maximum number of masked LM predictions per sequence.
:param float masked_lm_prob: Masked LM probability.
:param rng: random.Random object with a shuffle function.
:param Optional[list] exclude_from_masking: List of tokens to exclude from masking.
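A sketch of BERT-style masking under these parameters. This is an assumed implementation for illustration, not the module's verbatim code: roughly masked_lm_prob of the tokens are selected, and each selected token becomes [MASK] 80% of the time, a random vocabulary word 10% of the time, and is left unchanged 10% of the time.

```python
import collections

maskedLmInstance = collections.namedtuple("maskedLmInstance", ["index", "label"])

def create_masked_lm_predictions(
    tokens, vocab_words, mask_whole_word, max_predictions_per_seq,
    masked_lm_prob, rng, exclude_from_masking=None,
):
    """Select positions to mask and return (output_tokens, positions, labels)."""
    exclude = set(exclude_from_masking or ["[CLS]", "[SEP]"])
    # Group candidate positions; with whole-word masking, subtokens
    # (prefixed "##") are grouped with the word they belong to.
    cand_indexes = []
    for i, token in enumerate(tokens):
        if token in exclude:
            continue
        if mask_whole_word and cand_indexes and token.startswith("##"):
            cand_indexes[-1].append(i)
        else:
            cand_indexes.append([i])
    rng.shuffle(cand_indexes)

    num_to_predict = min(
        max_predictions_per_seq,
        max(1, int(round(len(tokens) * masked_lm_prob))),
    )
    output_tokens = list(tokens)
    masked_lms = []
    for index_set in cand_indexes:
        if len(masked_lms) + len(index_set) > num_to_predict:
            continue
        for index in index_set:
            if rng.random() < 0.8:
                masked_token = "[MASK]"
            elif rng.random() < 0.5:
                masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
            else:
                masked_token = tokens[index]
            masked_lms.append(maskedLmInstance(index=index, label=tokens[index]))
            output_tokens[index] = masked_token
        if len(masked_lms) >= num_to_predict:
            break
    masked_lms.sort(key=lambda x: x.index)
    positions = [m.index for m in masked_lms]
    labels = [m.label for m in masked_lms]
    return output_tokens, positions, labels
```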
Reads the files listed in the metadata file provided as input to the data generation scripts.
Loads the label-id mapping between output labels and ids.

:param str label_vocab_file: Path to the label vocab file.
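A minimal sketch of such a loader, assuming the label vocab file lists one label per line and that a label's id is its line index (both the file layout and the function name are assumptions):

```python
def get_label_id_map(label_vocab_file):
    """Map each output label (one per line) to its line index as the id."""
    label_to_id = {}
    with open(label_vocab_file, "r") as fin:
        for idx, line in enumerate(fin):
            label = line.strip()
            if label:
                label_to_id[label] = idx
    return label_to_id
```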
Generates the vocabulary from the provided vocab_file_path.
Splits a list or string into chunks of size n.
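Such chunking can be sketched in a few lines (the name `chunk` is illustrative); the final chunk may be shorter than n when the length is not a multiple of n:

```python
def chunk(data, n):
    """Yield successive n-sized chunks from a list or string."""
    for i in range(0, len(data), n):
        yield data[i : i + n]
```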
Converts the input data into tokens.

:param str data: Contains data read from a text file.
:param tokenizer: Tokenizer object with functions to convert words to tokens.
:param bool multiple_docs_in_single_file: Indicates whether there are multiple documents in the given data string.
:param str multiple_docs_separator: String used to separate documents if there are multiple documents in data. The separator can be anything: a blank line or a special string such as "-----". There can only be one separator string for all the documents.
:param bool single_sentence_per_line: Indicates whether the data contains one sentence per line.
:param spacy_nlp: spaCy nlp module loaded with spacy.load(), used to segment a string into sentences.
:return List[List[List]] documents: Contains the tokens corresponding to sentences in documents. List of list of lists, e.g. [[[],[]], [[],[],[]]]; documents[i][j] is the list of tokens in document i, sentence j.
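The document/sentence/token nesting described above can be sketched as follows. The function name is illustrative, and the tokenizer is assumed to expose a `tokenize(str) -> list` method:

```python
def text_to_tokenized_documents(
    data, tokenizer, multiple_docs_in_single_file,
    multiple_docs_separator, single_sentence_per_line, spacy_nlp,
):
    """Split `data` into documents, segment each into sentences, tokenize."""
    if multiple_docs_in_single_file:
        raw_docs = data.split(multiple_docs_separator)
    else:
        raw_docs = [data]
    documents = []
    for raw_doc in raw_docs:
        raw_doc = raw_doc.strip()
        if not raw_doc:
            continue
        if single_sentence_per_line:
            sentences = [l.strip() for l in raw_doc.splitlines() if l.strip()]
        else:
            # Fall back to spaCy's sentence segmentation for free-form text
            sentences = [s.text.strip() for s in spacy_nlp(raw_doc).sents]
        documents.append([tokenizer.tokenize(s) for s in sentences if s])
    return documents
```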
Splits a piece of text based on whitespace characters.
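A minimal sketch of whitespace tokenization (name illustrative): splitting on runs of whitespace after stripping the ends, so empty input yields an empty list:

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace, dropping leading/trailing blanks."""
    text = text.strip()
    if not text:
        return []
    return text.split()
```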
Classes
maskedLmInstance(index, label)