Dataloaders and Processing scripts in Cerebras Model Zoo
data_processing.GenericDataProcessor module
PyTorch Generic Dataloader
- class data_processing.GenericDataProcessor.GenericDataProcessor
A Generic PyTorch Data Processor.
- Parameters
params (dict) –
dict containing training input parameters for creating dataset.
Expects the following fields:
"batch_size" (int): Batch size.
"shuffle" (bool): Flag to enable data shuffling.
"shuffle_seed" (int): Shuffle seed.
"shuffle_buffer" (int): Size of shuffle buffer in samples.
"num_workers" (int): How many subprocesses to use for data loading.
"drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
"prefetch_factor" (int): Number of batches loaded in advance by each worker.
"persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.
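As an illustration, a params dict with the fields listed above might look like the following. The values are arbitrary placeholders rather than documented defaults, and `batches_per_epoch` is a hypothetical helper (not part of the Model Zoo) that only shows how `drop_last` changes the number of batches.

```python
import math

# Example values only; none of these are defaults documented by the Model Zoo.
params = {
    "batch_size": 32,
    "shuffle": True,
    "shuffle_seed": 1234,
    "shuffle_buffer": 320,
    "num_workers": 4,
    "drop_last": True,
    "prefetch_factor": 2,
    "persistent_workers": True,
}


def batches_per_epoch(num_samples, batch_size, drop_last):
    """Hypothetical helper: number of batches one pass over the data yields."""
    if drop_last:
        # The trailing incomplete batch is discarded.
        return num_samples // batch_size
    # The trailing incomplete batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


print(batches_per_epoch(1000, params["batch_size"], params["drop_last"]))  # 31
print(batches_per_epoch(1000, params["batch_size"], False))  # 32
```

With 1000 samples and a batch size of 32, the final 8 samples form an incomplete batch, which is why the two counts differ by one.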
data_processing.HDF5IterableDataProcessor module
PyTorch HDF5 Dataloader
- class data_processing.HDF5IterableDataProcessor.HDF5IterableDataProcessor
An HDF5 dataset processor. Loads data from HDF5 files.
- Parameters
params (dict) –
dict containing training input parameters for creating dataset.
Expects the following fields:
"batch_size" (int): Batch size.
"shuffle" (bool): Flag to enable data shuffling.
"shuffle_seed" (int): Shuffle seed.
"shuffle_buffer" (int): Size of shuffle buffer in samples.
"num_workers" (int): How many subprocesses to use for data loading.
"drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
"prefetch_factor" (int): Number of batches loaded in advance by each worker.
"persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.
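The "shuffle_buffer" field suggests buffer-based shuffling of a sample stream. The sketch below shows one common way to do that in plain Python; it is an assumption about the general technique, not the Model Zoo implementation, and `buffered_shuffle` is a hypothetical name. It assumes `buffer_size` is at least 1.

```python
import random


def buffered_shuffle(samples, buffer_size, seed):
    """Yield samples in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(sample)
            continue
        # Buffer is full: emit a random element and put the new sample in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample
    # Drain what is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(sorted(shuffled) == list(range(10)))  # True: every sample appears exactly once
```

A larger buffer gives an ordering closer to a full shuffle at the cost of holding more samples in memory, and a fixed seed makes the order reproducible.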
data_processing.HDF5IterableDataset module
PyTorch HDF5 Dataset
- class data_processing.HDF5IterableDataset.HDF5IterableDataset
An HDF5 dataset processor. Loads data from HDF5 files.
- Parameters
params (dict) –
dict containing training input parameters for creating dataset.
Expects the following fields:
"data_dir" (str or list of str): Path to dataset HDF5 files.
"batch_size" (int): Batch size.
"shuffle" (bool): Flag to enable data shuffling.
"shuffle_seed" (int): Shuffle seed.
"num_workers" (int): How many subprocesses to use for data loading.
"drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
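Because "data_dir" accepts either a single path or a list of paths, a caller typically normalizes it to a list before collecting files. The following is a minimal sketch of that step; `list_hdf5_files` is a hypothetical helper, and the `.h5` extension and flat directory layout are assumptions.

```python
import glob
import os


def list_hdf5_files(data_dir):
    """Hypothetical helper: collect HDF5 files from one directory or several."""
    dirs = [data_dir] if isinstance(data_dir, str) else list(data_dir)
    files = []
    for d in dirs:
        # Assumption: files use the ".h5" extension and sit directly in the directory.
        files.extend(sorted(glob.glob(os.path.join(d, "*.h5"))))
    return files
```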
data_processing.utils module
- data_processing.utils.convert_str_to_int_list(s)
Converts a string (e.g. from parsing CSV) of the form "[1, 5, 7, 2]" to a list of integers.
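The conversion described above can be sketched as follows. This is an illustrative stand-in, not the Model Zoo source; `str_to_int_list` is a hypothetical name.

```python
def str_to_int_list(s):
    """Parse a string such as "[1, 5, 7, 2]" into a list of ints."""
    inner = s.strip().strip("[]")
    if not inner.strip():
        # "[]" or an empty string maps to an empty list.
        return []
    # int() tolerates the whitespace left around each item after splitting.
    return [int(item) for item in inner.split(",")]


print(str_to_int_list("[1, 5, 7, 2]"))  # [1, 5, 7, 2]
```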
- data_processing.utils.convert_to_unicode(text)
Converts text to Unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
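A minimal sketch of this kind of conversion in Python 3 is shown below. The function name `to_unicode` and the choice to ignore undecodable bytes are assumptions made for illustration.

```python
def to_unicode(text):
    """Return `text` as a str, decoding bytes as UTF-8."""
    if isinstance(text, str):
        # Already Unicode in Python 3.
        return text
    if isinstance(text, bytes):
        # Assumption: undecodable bytes are dropped rather than raising.
        return text.decode("utf-8", "ignore")
    raise ValueError(f"Unsupported string type: {type(text)}")


print(to_unicode(b"caf\xc3\xa9"))  # café
```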
- data_processing.utils.count_total_documents(metadata_files)
Counts total number of documents in metadata_files.
- Parameters
metadata_files (str or list[str]) – Path or list of paths to metadata files.
- Returns
Number of documents whose paths are contained in the metadata files.
- data_processing.utils.create_masked_lm_predictions(tokens, vocab_words, mask_whole_word, max_predictions_per_seq, masked_lm_prob, rng, exclude_from_masking=None)
Creates the predictions for the masked LM objective
- Parameters
tokens (list) – List of tokens to process
vocab_words (list) – List of all words present in the vocabulary
mask_whole_word (bool) – If true, mask all the subtokens of a word
max_predictions_per_seq (int) – Maximum number of masked LM predictions per sequence
masked_lm_prob (float) – Masked LM probability
rng – random.Random object with shuffle function
exclude_from_masking (Optional[list]) – List of tokens to exclude from masking. Defaults to [“[CLS]”, “[SEP]”]
- Returns
Tuple containing the tokens with masking applied, the positions of the masked tokens, and the corresponding labels for training.
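To make the masked LM objective concrete, here is a simplified sketch of BERT-style masking that returns the same three pieces described above. It is not the Model Zoo implementation: `mask_tokens` is a hypothetical name, whole-word masking is omitted, and the 80/10/10 split between "[MASK]", the original token, and a random vocabulary word is the convention from the original BERT recipe, assumed here rather than confirmed by this page.

```python
import random


def mask_tokens(tokens, vocab_words, max_predictions_per_seq, masked_lm_prob, rng,
                exclude_from_masking=("[CLS]", "[SEP]")):
    """Sketch of BERT-style token masking (no whole-word handling)."""
    # Positions that are allowed to be masked.
    candidates = [i for i, tok in enumerate(tokens) if tok not in exclude_from_masking]
    rng.shuffle(candidates)

    # Mask roughly masked_lm_prob of the tokens: at least one, capped at the maximum.
    num_to_predict = min(
        max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))
    )

    output_tokens = list(tokens)
    masked = []  # (position, original token) pairs
    for index in candidates[:num_to_predict]:
        roll = rng.random()
        if roll < 0.8:
            replacement = "[MASK]"  # assumed 80%: replace with the mask token
        elif roll < 0.9:
            replacement = tokens[index]  # assumed 10%: keep the original token
        else:
            replacement = rng.choice(vocab_words)  # assumed 10%: random vocab word
        output_tokens[index] = replacement
        masked.append((index, tokens[index]))

    masked.sort()  # report positions in ascending order
    positions = [index for index, _ in masked]
    labels = [label for _, label in masked]
    return output_tokens, positions, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
out, positions, labels = mask_tokens(tokens, vocab, 3, 0.3, random.Random(0))
print(out, positions, labels)
```

The labels always hold the original tokens at the chosen positions, so the model is trained to recover them regardless of which replacement was applied.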
- data_processing.utils.get_files_in_metadata(metadata_filepaths)
Reads the file paths listed in the metadata files provided as input to the data generation scripts.
- Parameters
metadata_filepaths – Path or paths to metadata files.
- Returns List input_files
Contents of metadata files.
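Both get_files_in_metadata and count_total_documents operate on metadata files. The sketch below assumes each metadata file is plain text with one path per line, which this page does not confirm; `read_metadata` is a hypothetical helper. If each listed path corresponds to one document, the document count is simply the length of the returned list.

```python
import os
import tempfile


def read_metadata(metadata_filepaths):
    """Hypothetical reader: return every non-empty line from the metadata file(s)."""
    if isinstance(metadata_filepaths, str):
        metadata_filepaths = [metadata_filepaths]
    input_files = []
    for path in metadata_filepaths:
        with open(path, "r", encoding="utf-8") as f:
            # Assumption: one file path per line; blank lines are skipped.
            input_files.extend(line.strip() for line in f if line.strip())
    return input_files


# Build two small metadata files in a temporary directory to exercise the reader.
with tempfile.TemporaryDirectory() as tmp:
    meta_a = os.path.join(tmp, "meta_a.txt")
    meta_b = os.path.join(tmp, "meta_b.txt")
    with open(meta_a, "w", encoding="utf-8") as f:
        f.write("docs/part-0.txt\ndocs/part-1.txt\n")
    with open(meta_b, "w", encoding="utf-8") as f:
        f.write("docs/part-2.txt\n")
    files = read_metadata([meta_a, meta_b])

print(files)       # ['docs/part-0.txt', 'docs/part-1.txt', 'docs/part-2.txt']
print(len(files))  # 3
```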
- data_processing.utils.get_label_id_map(label_vocab_file)
Loads the label-id mapping, i.e. the mapping between output labels and their ids.
- Parameters
label_vocab_file (str) – Path to the label vocab file
- data_processing.utils.get_output_type_shapes(max_seq_length, max_predictions_per_seq, mlm_only=False)
- data_processing.utils.get_vocab(vocab_file_path, do_lower)
Function to generate vocab from provided vocab_file_path.
- Parameters
vocab_file_path (str) – Path to vocab file
do_lower (bool) – If True, convert vocab words to lower case.
- Returns List[str]
list containing vocab words.
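A vocab loader of this shape can be sketched as below, assuming the vocab file stores one token per line (a common convention, but an assumption here). `load_vocab` is a hypothetical name, not the Model Zoo function.

```python
import os
import tempfile


def load_vocab(vocab_file_path, do_lower):
    """Hypothetical loader: read one vocab word per line, optionally lower-cased."""
    vocab = []
    with open(vocab_file_path, "r", encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if not word:
                continue  # skip blank lines
            vocab.append(word.lower() if do_lower else word)
    return vocab


# Write a tiny vocab file to a temporary directory and load it both ways.
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("Hello\nWorld\n")
    vocab_cased = load_vocab(vocab_path, do_lower=False)
    vocab_lower = load_vocab(vocab_path, do_lower=True)

print(vocab_cased)  # ['Hello', 'World']
print(vocab_lower)  # ['hello', 'world']
```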
- class data_processing.utils.maskedLmInstance
maskedLmInstance(index, label)
Create new instance of maskedLmInstance(index, label)
- static __new__(_cls, index, label)
Create new instance of maskedLmInstance(index, label)
- index
Alias for field number 0
- label
Alias for field number 1
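The "Alias for field number" entries are the documentation that Python generates for a namedtuple, so an equivalent type can be sketched with `collections.namedtuple`. The type below is defined locally for illustration (named `MaskedLmInstance` to keep it distinct from the Model Zoo class), and the field values are made up.

```python
import collections

# Two-field record: position of a masked token and its original label.
MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])

inst = MaskedLmInstance(index=3, label="cat")
print(inst.index, inst.label)  # 3 cat
print(inst[0], inst[1])        # 3 cat (fields are also reachable by position)
```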
- data_processing.utils.pad_input_sequence(input_sequence, padding=0, max_sequence_length=512)
- data_processing.utils.pad_instance_to_max_seq_length(instance, mlm_only, tokenizer, max_seq_length, max_predictions_per_seq, output_type_shapes, inverted_mask)
- data_processing.utils.split_list(l, n)
Splits list/string into n-sized chunks.
- Parameters
l (List[str]) – List or string to split.
n (int) – Size of each chunk.
- Returns List[List]
List of lists containing split list/string.
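Reading n as the chunk size, as the summary line "n-sized chunks" suggests, the splitting can be sketched with slicing. This is an assumption about the intended behavior, and `chunk` is a hypothetical name.

```python
def chunk(l, n):
    """Split a list or string into consecutive pieces of length n (the last may be shorter)."""
    return [l[i : i + n] for i in range(0, len(l), n)]


print(chunk([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
print(chunk("abcdef", 4))         # ['abcd', 'ef']
```

Slicing works the same way on lists and strings, which is why a single implementation covers both input types.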
- data_processing.utils.text_to_tokenized_documents(data, tokenizer, multiple_docs_in_single_file, multiple_docs_separator, single_sentence_per_line, spacy_nlp)
Convert the input data into tokens
- Parameters
data (str) – Contains data read from a text file
tokenizer – Tokenizer object which contains functions to convert words to tokens
multiple_docs_in_single_file (bool) – Indicates whether there are multiple documents in the given data string
multiple_docs_separator (str) – String used to separate documents if there are multiple documents in data. The separator can be anything, such as a blank line or a special string like "-----". Only one separator string can be used for all the documents.
single_sentence_per_line (bool) – Indicates whether the data contains one sentence in each line
spacy_nlp – spaCy nlp module loaded with spacy.load(). Used in segmenting a string into sentences.
- Returns List[List[List]] documents
Contains the tokens corresponding to sentences in documents, as a list of lists of lists, e.g. [[[],[]], [[],[],[]]], where documents[i][j] is the list of tokens in sentence j of document i.
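The nested return structure is easiest to see on a toy input. The sketch below is a deliberately simplified stand-in: it assumes one sentence per line, uses whitespace splitting in place of a real tokenizer, and does not use spaCy. `tokenize_documents` is a hypothetical name.

```python
def tokenize_documents(data, multiple_docs_separator="\n\n"):
    """Hypothetical stand-in: one sentence per line, whitespace tokenization."""
    documents = []
    for doc in data.split(multiple_docs_separator):
        # Each non-empty line is treated as one sentence.
        sentences = [line.split() for line in doc.splitlines() if line.strip()]
        if sentences:
            documents.append(sentences)
    return documents


data = "The cat sat.\nIt purred.\n\nDogs bark."
documents = tokenize_documents(data)
print(len(documents))   # 2
print(documents[0][1])  # ['It', 'purred.']
```

Here documents[0][1] is the token list for the second sentence of the first document, matching the documents[i][j] indexing described above.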
- data_processing.bert package
- data_processing.h5_map_dataset package
- data_processing.huggingface package
- Submodules
- data_processing.huggingface.CSDataCollatorForLanguageModeling module
- data_processing.huggingface.HF_converter_example_BookCorpus module
- data_processing.huggingface.HF_converter_example_Eli5 module
- data_processing.huggingface.HuggingFaceDataProcessor module
- data_processing.huggingface.HuggingFace_BookCorpus module
- data_processing.huggingface.HuggingFace_Eli5 module
- Module contents
- data_processing.scripts package
- Subpackages
- data_processing.scripts.hdf5_preprocessing package
- Submodules
- data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 module
- data_processing.scripts.hdf5_preprocessing.create_hdf5_dataset module
- data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor module
- data_processing.scripts.hdf5_preprocessing.hdf5_dataset_preprocessors module
- data_processing.scripts.hdf5_preprocessing.utils module
- Module contents
- data_processing.scripts.pubmed package
- data_processing.scripts.hdf5_preprocessing package
- Submodules
- data_processing.scripts.utils module
- Module contents
- Subpackages
- data_processing.tokenizers package