cerebras.modelzoo.data_preparation.utils
Functions
- Converts a string (e.g. from parsing CSV) of the form …
- Converts text to Unicode, assuming UTF-8 input. Returns text encoded in a way suitable for printing or tf.compat.v1.logging.
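A minimal sketch of the Unicode conversion described above; the name `to_unicode` is illustrative, not necessarily the module's actual symbol:

```python
def to_unicode(text):
    """Return `text` as a str, decoding UTF-8 bytes if necessary."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        # Assumed behavior: ignore undecodable byte sequences.
        return text.decode("utf-8", "ignore")
    raise TypeError(f"Unsupported type: {type(text)}")
```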
- Counts the total number of documents in metadata_files.
  Parameters:
    metadata_files (str or list[str]) – path or list of paths to metadata files.
  Returns:
    Number of documents whose paths are contained in the metadata files.
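A sketch of the counting logic, under the assumption that each metadata file lists one document path per line; the helper name `count_documents` is hypothetical:

```python
def count_documents(metadata_files):
    """Count document paths listed across one or more metadata files."""
    if isinstance(metadata_files, str):
        metadata_files = [metadata_files]
    total = 0
    for path in metadata_files:
        with open(path) as f:
            # Skip blank lines; every non-empty line is one document path.
            total += sum(1 for line in f if line.strip())
    return total
```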
- Creates the predictions for the masked LM objective.
  Parameters:
    tokens (list) – list of tokens to process.
    vocab_words (list) – list of all words present in the vocabulary.
    mask_whole_word (bool) – if True, mask all the subtokens of a word.
    max_predictions_per_seq (int) – maximum number of masked LM predictions per sequence.
    masked_lm_prob (float) – masked LM probability.
    rng – random.Random object with a shuffle function.
    exclude_from_masking (Optional[list]) – list of tokens to exclude from masking.
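A simplified sketch of BERT-style masked-LM target selection, showing how `masked_lm_prob`, `max_predictions_per_seq`, and `rng` interact. Whole-word masking and `exclude_from_masking` are omitted, and the function name and return shape are illustrative, not the module's exact API:

```python
def create_masked_lm_predictions(tokens, vocab_words, masked_lm_prob,
                                 max_predictions_per_seq, rng):
    """Pick positions to mask and record their original labels."""
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(tokens) * masked_lm_prob))))
    output = list(tokens)
    positions, labels = [], []
    for i in sorted(candidates[:num_to_mask]):
        labels.append(tokens[i])
        r = rng.random()
        if r < 0.8:                       # 80%: replace with [MASK]
            output[i] = "[MASK]"
        elif r < 0.9:                     # 10%: replace with a random vocab word
            output[i] = rng.choice(vocab_words)
        # remaining 10%: keep the original token
        positions.append(i)
    return output, positions, labels
```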
- Reads the files listed in the metadata file provided as input to the data-generation scripts.
- Loads the label-ID mapping (mapping between output labels and IDs).
  Parameters:
    label_vocab_file (str) – path to the label vocab file.
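A sketch of loading such a mapping, assuming the label vocab file holds one label per line and the ID is the line index; `load_label_id_map` is an illustrative name:

```python
def load_label_id_map(label_vocab_file):
    """Map each label (one per line) to its line index."""
    with open(label_vocab_file) as f:
        return {line.strip(): i for i, line in enumerate(f) if line.strip()}
```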
- Generates a vocabulary from the provided vocab_file_path.
- Splits a list or string into n-sized chunks.
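The n-sized chunking above can be sketched with slicing, which works uniformly for lists and strings (the name `split_into_chunks` is illustrative):

```python
def split_into_chunks(seq, n):
    """Split a list or string into consecutive chunks of length n (last may be shorter)."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]
```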
- Converts the input data into tokens.
  Parameters:
    data (str) – data read from a text file.
    tokenizer – tokenizer object with functions to convert words to tokens.
    multiple_docs_in_single_file (bool) – whether the given data string contains multiple documents.
    multiple_docs_separator (str) – string used to separate documents when there are multiple documents in data. The separator can be anything: a blank line or a special string such as "-----". There can be only one separator string for all the documents.
    single_sentence_per_line (bool) – whether the data contains one sentence per line.
    spacy_nlp – spaCy NLP module loaded with spacy.load(); used to segment a string into sentences.
  Returns:
    documents (List[List[List]]) – tokens corresponding to the sentences in each document, e.g. [[[], []], [[], [], []]]; documents[i][j] is the list of tokens in sentence j of document i.
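A simplified sketch of the document/sentence/token nesting described above, restricted to the one-sentence-per-line case (the real utility uses spaCy for sentence segmentation otherwise); the function name and the `tokenize` callable are assumptions:

```python
def text_to_token_docs(data, tokenize, multiple_docs_in_single_file=False,
                       multiple_docs_separator="\n"):
    """Return documents[i][j] = tokens of sentence j in document i."""
    docs = (data.split(multiple_docs_separator)
            if multiple_docs_in_single_file else [data])
    documents = []
    for doc in docs:
        # One sentence per non-empty line (single_sentence_per_line case).
        sentences = [s for s in doc.split("\n") if s.strip()]
        documents.append([tokenize(s) for s in sentences])
    return documents
```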
- Splits a piece of text on whitespace characters.
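Whitespace splitting as described is essentially a strip-then-split; a minimal sketch:

```python
def whitespace_tokenize(text):
    """Strip surrounding whitespace, then split on any run of whitespace."""
    text = text.strip()
    return text.split() if text else []
```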
Classes
- maskedLmInstance(index, label)
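Given the (index, label) signature shown, `maskedLmInstance` pairs a masked position with its original token; modeling it as a namedtuple (an assumption about the actual implementation) captures that shape:

```python
from collections import namedtuple

# A masked-LM instance: the position that was masked and the token that
# originally stood there (the prediction target).
maskedLmInstance = namedtuple("maskedLmInstance", ["index", "label"])
```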