Dataloaders and Processing scripts in Cerebras Model Zoo

data_processing.GenericDataProcessor module

PyTorch Generic Dataloader

class data_processing.GenericDataProcessor.GenericDataProcessor

Bases: object

A Generic PyTorch Data Processor.


params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "shuffle_buffer" (int): Size of shuffle buffer in samples.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • "prefetch_factor" (int): Number of batches loaded in advance by each worker.

  • "persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.


Classmethod to create the dataloader object.
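As an illustration of the fields listed above, a `params` dict might look like the following. The values and the helper `dataloader_kwargs` are hypothetical and not part of the Model Zoo; the sketch only shows how the fields that share names with `torch.utils.data.DataLoader` keyword arguments could be picked out, and it makes no claim about how the class actually builds its dataloader.

```python
# Hypothetical example values; none of these are documented defaults.
params = {
    "batch_size": 32,
    "shuffle": True,
    "shuffle_seed": 1234,
    "shuffle_buffer": 10_000,
    "num_workers": 4,
    "drop_last": True,
    "prefetch_factor": 2,
    "persistent_workers": True,
}

# A subset of fields that share names with DataLoader keyword arguments.
DATALOADER_FIELDS = (
    "batch_size",
    "num_workers",
    "drop_last",
    "prefetch_factor",
    "persistent_workers",
)


def dataloader_kwargs(params):
    """Select DataLoader-style fields from params (hypothetical helper)."""
    kwargs = {name: params[name] for name in DATALOADER_FIELDS if name in params}
    # These two options only apply to worker processes; DataLoader can raise
    # ValueError if they are set while num_workers is 0, so drop them here.
    if kwargs.get("num_workers", 0) == 0:
        kwargs.pop("prefetch_factor", None)
        kwargs.pop("persistent_workers", None)
    return kwargs


print(sorted(dataloader_kwargs(params)))
```

The shuffle-related fields are left out of the selection on the assumption that shuffling is handled on the dataset side.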

data_processing.HDF5IterableDataProcessor module

PyTorch HDF5 Dataloader

class data_processing.HDF5IterableDataProcessor.HDF5IterableDataProcessor

Bases: object

An HDF5 dataset processor. Loads data from HDF5 files.


params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "shuffle_buffer" (int): Size of shuffle buffer in samples.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • "prefetch_factor" (int): Number of batches loaded in advance by each worker.

  • "persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.


Classmethod to create the dataloader object.
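The "shuffle_buffer" field suggests buffer-based shuffling, which is a common approach when samples are streamed from files and cannot be indexed at random. The sketch below shows that general technique using only the standard library; the function name `shuffle_buffered` is hypothetical, and this is not taken from the Model Zoo implementation.

```python
import random


def shuffle_buffered(samples, buffer_size, seed=None):
    """Yield samples in approximately shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(sample)
            continue
        # Buffer is full: emit a random element and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample
    # Drain the remaining samples in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffered(range(20), buffer_size=5, seed=0))
print(shuffled)
```

A larger buffer gives an order closer to a full shuffle at the cost of holding more samples in memory; the same seed reproduces the same order.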

data_processing.HDF5IterableDataset module

PyTorch HDF5 Dataset

class data_processing.HDF5IterableDataset.HDF5IterableDataset


An HDF5 dataset processor. Loads data from HDF5 files.


params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "data_dir" (str or list of str): Path to dataset HDF5 files.

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.


data_processing.utils module

Converts a string (e.g. from parsing CSV) of the form

"[1, 5, 7, 2]"

to a list of integers.
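A minimal sketch of this conversion, using `ast.literal_eval` from the standard library. The function's name is not shown above, so `parse_int_list` is a hypothetical name, and the Model Zoo helper may parse the string differently.

```python
import ast


def parse_int_list(text):
    """Parse a string such as "[1, 5, 7, 2]" into a list of ints."""
    value = ast.literal_eval(text.strip())
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"expected a list literal, got {text!r}")
    return [int(item) for item in value]


print(parse_int_list("[1, 5, 7, 2]"))  # [1, 5, 7, 2]
```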


Converts text to Unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
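One way such a conversion could look is sketched below. The name `to_unicode` is hypothetical, and the choice to ignore undecodable bytes is an assumption rather than documented behavior.

```python
def to_unicode(text):
    """Return text as str, decoding bytes as UTF-8 (hypothetical helper)."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        # Assumption: bytes that are not valid UTF-8 are silently dropped.
        return text.decode("utf-8", "ignore")
    raise ValueError(f"Unsupported string type: {type(text)}")


print(to_unicode(b"caf\xc3\xa9"))
```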


Counts total number of documents in metadata_files.


metadata_files (str or list[str]) – Path or list of paths to metadata files.


Number of documents whose paths are contained in the metadata files.
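The counting described here can be sketched as follows, assuming each metadata file lists one document path per non-empty line. That file layout and the name `count_documents` are assumptions made for illustration; the actual metadata format is not specified above.

```python
import os
import tempfile


def count_documents(metadata_files):
    """Count non-empty lines (assumed: one document path each) across files."""
    if isinstance(metadata_files, str):
        metadata_files = [metadata_files]
    total = 0
    for path in metadata_files:
        with open(path, "r", encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total


# Usage with two throwaway metadata files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "meta_a.txt")
    second = os.path.join(tmp, "meta_b.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("docs/a.txt\ndocs/b.txt\n")
    with open(second, "w", encoding="utf-8") as f:
        f.write("docs/c.txt\n")
    total = count_documents([first, second])
    single = count_documents(first)

print(total, single)  # 3 2
```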

data_processing.utils.create_masked_lm_predictions(tokens, vocab_words, mask_whole_word, max_predictions_per_seq, masked_lm_prob, rng, exclude_from_masking=None)

Creates the predictions for the masked LM objective.

  • tokens (list) – List of tokens to process

  • vocab_words (list) – List of all words present in the vocabulary

  • mask_whole_word (bool) – If true, mask all the subtokens of a word

  • max_predictions_per_seq (int) – Maximum number of masked LM predictions per sequence

  • masked_lm_prob (float) – Masked LM probability

  • rng – random.Random object with shuffle function

  • exclude_from_masking (Optional[list]) – List of tokens to exclude from masking. Defaults to ["[CLS]", "[SEP]"].


Tuple of the tokens (including the masked tokens), the corresponding positions of the masked tokens, and the corresponding labels for training.
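To make the inputs and the returned tuple concrete, here is a simplified sketch of token masking. It assumes the 80/10/10 replacement scheme from the original BERT pre-processing (mask, keep, random word) and omits whole-word masking; both points are assumptions, and `mask_tokens` is a hypothetical name, not the Model Zoo function.

```python
import collections
import random

MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])


def mask_tokens(tokens, vocab_words, max_predictions_per_seq, masked_lm_prob,
                rng, exclude_from_masking=None):
    """Simplified per-token masking (no whole-word masking)."""
    if exclude_from_masking is None:
        exclude_from_masking = ["[CLS]", "[SEP]"]
    # Positions that are allowed to be masked, in random order.
    candidates = [i for i, tok in enumerate(tokens)
                  if tok not in exclude_from_masking]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output_tokens = list(tokens)
    masked = []
    for index in candidates[:num_to_predict]:
        roll = rng.random()
        if roll < 0.8:
            replacement = "[MASK]"              # assumed: 80% mask token
        elif roll < 0.9:
            replacement = tokens[index]         # assumed: 10% keep original
        else:
            replacement = rng.choice(vocab_words)  # assumed: 10% random word
        output_tokens[index] = replacement
        masked.append(MaskedLmInstance(index=index, label=tokens[index]))
    masked.sort(key=lambda item: item.index)
    positions = [item.index for item in masked]
    labels = [item.label for item in masked]
    return output_tokens, positions, labels


rng = random.Random(12345)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mask_tokens(tokens, ["the", "cat", "dog"], 3, 0.5, rng))
```

The labels always hold the original tokens at the returned positions, so the training target can be recovered regardless of which replacement was drawn.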


Reads the files listed in the metadata files provided as input to the data generation scripts.


metadata_filepaths – Path or paths to metadata files.

Returns List input_files

Contents of metadata files.


Loads the label-id mapping, i.e. the mapping between output labels and their IDs.


label_vocab_file (str) – Path to the label vocab file

data_processing.utils.get_output_type_shapes(max_seq_length, max_predictions_per_seq, mlm_only=False)

data_processing.utils.get_vocab(vocab_file_path, do_lower)

Function to generate vocab from provided vocab_file_path.

  • vocab_file_path (str) – Path to vocab file

  • do_lower (bool) – If True, convert vocab words to lower case.

Returns List[str]

list containing vocab words.
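A minimal sketch of vocab loading, assuming the vocab file stores one word per line. That layout is an assumption (the file format is not described above), and `load_vocab` is a hypothetical stand-in for the documented function.

```python
import os
import tempfile


def load_vocab(vocab_file_path, do_lower):
    """Read one vocab word per line, optionally lower-casing each word."""
    with open(vocab_file_path, "r", encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    if do_lower:
        words = [word.lower() for word in words]
    return words


# Usage with a throwaway vocab file.
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[CLS]\nHello\nworld\n")
    cased = load_vocab(vocab_path, do_lower=False)
    lowered = load_vocab(vocab_path, do_lower=True)

print(cased)    # ['[PAD]', '[CLS]', 'Hello', 'world']
print(lowered)  # ['[pad]', '[cls]', 'hello', 'world']
```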

class data_processing.utils.maskedLmInstance

Bases: tuple

maskedLmInstance(index, label)

static __new__(_cls, index, label)

Create new instance of maskedLmInstance(index, label)


index

Alias for field number 0


label

Alias for field number 1
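The entries above (a tuple base class and fields described as aliases for positions 0 and 1) match what `collections.namedtuple` generates. Assuming the class is defined that way, it behaves as in this sketch:

```python
import collections

# Assumed definition, reconstructed from the generated docs above.
maskedLmInstance = collections.namedtuple("maskedLmInstance", ["index", "label"])

instance = maskedLmInstance(index=3, label="cat")
print(instance.index, instance[0])  # 3 3
print(instance.label, instance[1])  # cat cat

# Being a tuple, it also unpacks positionally.
index, label = instance
```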

data_processing.utils.pad_input_sequence(input_sequence, padding=0, max_sequence_length=512)

data_processing.utils.pad_instance_to_max_seq_length(instance, mlm_only, tokenizer, max_seq_length, max_predictions_per_seq, output_type_shapes, inverted_mask)

data_processing.utils.split_list(l, n)

Splits a list or string into chunks of size n.

  • l (List[str]) – List or string to split.

  • n (int) – Size of each chunk.

Returns List[List]

List of lists containing split list/string.
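Reading the summary as "chunks of n items each", the behavior can be sketched with a slicing comprehension. This is an interpretation of the description above rather than a copy of the Model Zoo source.

```python
def split_list(l, n):
    """Split a list or string into consecutive chunks of at most n items."""
    return [l[i : i + n] for i in range(0, len(l), n)]


print(split_list([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
print(split_list("abcdef", 4))         # ['abcd', 'ef']
```

The final chunk is shorter than n when the input length is not a multiple of n.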

data_processing.utils.text_to_tokenized_documents(data, tokenizer, multiple_docs_in_single_file, multiple_docs_separator, single_sentence_per_line, spacy_nlp)

Converts the input data into tokens.

  • data (str) – Contains data read from a text file

  • tokenizer – Tokenizer object which contains functions to convert words to tokens

  • multiple_docs_in_single_file (bool) – Indicates whether there are multiple documents in the given data string

  • multiple_docs_separator (str) – String used to separate documents when data contains multiple documents. The separator can be anything, such as a blank line or a special string like "-----". Only one separator string can be used for all the documents.

  • single_sentence_per_line (bool) – Indicates whether the data contains one sentence in each line

  • spacy_nlp – spaCy nlp module loaded with spacy.load(). Used to segment a string into sentences.

Returns List[List[List]] documents

Tokens corresponding to the sentences in each document, as a list of lists of lists, e.g. [[[], []], [[], [], []]], where documents[i][j] is the list of tokens in sentence j of document i.
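The nested return structure can be illustrated with a simplified sketch. It assumes one sentence per line, a blank line as the document separator, and plain whitespace splitting in place of a real tokenizer and spaCy; `tokenize_documents` is a hypothetical name used only for this illustration.

```python
def tokenize_documents(data, multiple_docs_separator="\n\n"):
    """Build documents[i][j] -> tokens of sentence j in document i."""
    documents = []
    for raw_doc in data.split(multiple_docs_separator):
        # Assumption: each non-empty line holds exactly one sentence.
        sentences = [line.split() for line in raw_doc.splitlines() if line.strip()]
        if sentences:
            documents.append(sentences)
    return documents


data = "The cat sat.\nIt purred.\n\nDogs bark."
documents = tokenize_documents(data)
print(documents[0][1])                    # ['It', 'purred.']
print(len(documents), len(documents[1]))  # 2 1
```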

data_processing.utils.whitespace_tokenize(text, lower=False)

Splits a piece of text based on whitespace characters.
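A sketch of what the signature and description suggest, assuming `lower=True` lower-cases the text before splitting. This is an illustration, not the Model Zoo source.

```python
def whitespace_tokenize(text, lower=False):
    """Split text on runs of whitespace, optionally lower-casing it first."""
    text = text.strip()
    if lower:
        text = text.lower()
    if not text:
        return []
    return text.split()


print(whitespace_tokenize("  Hello   World\n"))        # ['Hello', 'World']
print(whitespace_tokenize("Hello World", lower=True))  # ['hello', 'world']
```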
