Dataloaders and Processing scripts in Cerebras Model Zoo#

data_processing.GenericDataProcessor module#

PyTorch Generic Dataloader

class data_processing.GenericDataProcessor.GenericDataProcessor[source]#

Bases: object

A Generic PyTorch Data Processor.

Parameters

params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "shuffle_buffer" (int): Size of shuffle buffer in samples.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • "prefetch_factor" (int): Number of batches loaded in advance by each worker.

  • "persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

__init__(params)[source]#
create_dataloader()[source]#

Creates the dataloader object.
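
Example (a minimal sketch; the parameter values are hypothetical and the import path may need to be adjusted for your installation):

    from data_processing.GenericDataProcessor import GenericDataProcessor

    # Hypothetical values for the fields documented above.
    params = {
        "batch_size": 32,
        "shuffle": True,
        "shuffle_seed": 1,
        "shuffle_buffer": 16384,
        "num_workers": 4,
        "drop_last": True,
        "prefetch_factor": 10,
        "persistent_workers": True,
    }

    dataloader = GenericDataProcessor(params).create_dataloader()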

data_processing.HDF5IterableDataProcessor module#

PyTorch HDF5 Dataloader

class data_processing.HDF5IterableDataProcessor.HDF5IterableDataProcessor[source]#

Bases: object

An HDF5 dataset processor. Loads data from HDF5 files.

Parameters

params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "shuffle_buffer" (int): Size of shuffle buffer in samples.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • "prefetch_factor" (int): Number of batches loaded in advance by each worker.

  • "persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

__init__(params)[source]#
create_dataloader()[source]#

Creates the dataloader object.
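
Example (a minimal sketch; values are hypothetical, and "data_dir" is an assumed field forwarded to the underlying HDF5IterableDataset documented below):

    from data_processing.HDF5IterableDataProcessor import HDF5IterableDataProcessor

    params = {
        "data_dir": "/path/to/hdf5/train",  # assumed dataset location field
        "batch_size": 32,
        "shuffle": True,
        "shuffle_seed": 1,
        "shuffle_buffer": 16384,
        "num_workers": 4,
        "drop_last": True,
        "prefetch_factor": 10,
        "persistent_workers": True,
    }

    dataloader = HDF5IterableDataProcessor(params).create_dataloader()
    for batch in dataloader:
        ...  # consume batches during training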

data_processing.HDF5IterableDataset module#

PyTorch HDF5 Dataset

class data_processing.HDF5IterableDataset.HDF5IterableDataset[source]#

Bases: torch.utils.data.IterableDataset

An HDF5 dataset processor. Loads data from HDF5 files.

Parameters

params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "data_dir" (str or list of str): Path to dataset HDF5 files.

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

__init__(params)[source]#
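
Example (a minimal sketch; paths are hypothetical, and wrapping the dataset in a standard torch DataLoader is shown only to illustrate that it is an IterableDataset; in practice the data processor classes above build the dataloader):

    import torch
    from data_processing.HDF5IterableDataset import HDF5IterableDataset

    # Hypothetical values for the fields documented above.
    params = {
        "data_dir": ["/path/to/train_0.h5", "/path/to/train_1.h5"],
        "batch_size": 32,
        "shuffle": True,
        "shuffle_seed": 1,
        "num_workers": 0,
        "drop_last": True,
    }

    dataset = HDF5IterableDataset(params)
    # batch_size=None disables automatic batching, assuming the dataset
    # already yields samples in its own format.
    loader = torch.utils.data.DataLoader(dataset, batch_size=None)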

data_processing.utils module#

data_processing.utils.convert_str_to_int_list(s)[source]#

Converts a string of the form "[1, 5, 7, 2]" (e.g. from parsing a CSV file) to a list of integers.
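
Example (a minimal sketch, assuming the import path shown in the signature above):

    from data_processing.utils import convert_str_to_int_list

    ids = convert_str_to_int_list("[1, 5, 7, 2]")
    print(ids)  # [1, 5, 7, 2]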

data_processing.utils.convert_to_unicode(text)[source]#

Converts text to unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
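
Example (a minimal sketch; passing UTF-8 encoded bytes is an assumption based on the description above):

    from data_processing.utils import convert_to_unicode

    text = convert_to_unicode(b"caf\xc3\xa9")  # UTF-8 bytes for "café"
    print(text)  # café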

data_processing.utils.count_total_documents(metadata_files)[source]#

Counts total number of documents in metadata_files.

Parameters

metadata_files (str or list[str]) – Path or list of paths to metadata files.

Returns

Number of documents whose paths are contained in the metadata files.
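
Example (a minimal sketch; the metadata file names are hypothetical):

    from data_processing.utils import count_total_documents

    num_docs = count_total_documents(["train_metadata.txt", "val_metadata.txt"])
    print(f"Total documents: {num_docs}")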

data_processing.utils.create_masked_lm_predictions(tokens, vocab_words, mask_whole_word, max_predictions_per_seq, masked_lm_prob, rng, exclude_from_masking=None)[source]#

Creates the predictions for the masked LM objective.

Parameters
  • tokens (list) – List of tokens to process

  • vocab_words (list) – List of all words present in the vocabulary

  • mask_whole_word (bool) – If True, mask all the subtokens of a word

  • max_predictions_per_seq (int) – Maximum number of masked LM predictions per sequence

  • masked_lm_prob (float) – Masked LM probability

  • rng – random.Random object with shuffle function

  • exclude_from_masking (Optional[list]) – List of tokens to exclude from masking. Defaults to ["[CLS]", "[SEP]"].

Returns

A tuple of the tokens with masking applied, the corresponding positions of the masked tokens, and the corresponding labels for training.
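
Example (a minimal sketch; the token sequence and vocabulary are toy values, and the three-way unpacking follows the Returns description above):

    import random

    from data_processing.utils import create_masked_lm_predictions

    tokens = ["[CLS]", "the", "quick", "brown", "fox", "[SEP]"]
    vocab_words = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "quick", "brown", "fox", "jumps"]

    masked_tokens, masked_positions, masked_labels = create_masked_lm_predictions(
        tokens=tokens,
        vocab_words=vocab_words,
        mask_whole_word=False,
        max_predictions_per_seq=20,
        masked_lm_prob=0.15,
        rng=random.Random(0),
    )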

data_processing.utils.get_files_in_metadata(metadata_filepaths)[source]#

Function to read the input files listed in the metadata files provided to the data generation scripts.

Parameters

metadata_filepaths – Path or list of paths to metadata files.

Returns List input_files

Contents of the metadata files, i.e. the input file paths they list.
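
Example (a minimal sketch; "train_metadata.txt" is a hypothetical metadata file listing input files):

    from data_processing.utils import get_files_in_metadata

    input_files = get_files_in_metadata("train_metadata.txt")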

data_processing.utils.get_label_id_map(label_vocab_file)[source]#

Loads the label-id mapping: the mapping between output labels and ids.

Parameters

label_vocab_file (str) – Path to the label vocab file
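
Example (a minimal sketch; "label_vocab.txt" is a hypothetical path, and the expected file format is defined by the Model Zoo data generation scripts):

    from data_processing.utils import get_label_id_map

    label_id_map = get_label_id_map("label_vocab.txt")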

data_processing.utils.get_output_type_shapes(max_seq_length, max_predictions_per_seq, mlm_only=False)[source]#
data_processing.utils.get_vocab(vocab_file_path, do_lower)[source]#

Function to generate vocab from provided vocab_file_path.

Parameters
  • vocab_file_path (str) – Path to vocab file

  • do_lower (bool) – If True, convert vocab words to lower case.

Returns List[str]

List containing the vocab words.
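
Example (a minimal sketch; the vocab path is hypothetical):

    from data_processing.utils import get_vocab

    vocab_words = get_vocab("/path/to/vocab.txt", do_lower=True)
    print(len(vocab_words))  # number of (lower-cased) vocab words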

class data_processing.utils.maskedLmInstance#

Bases: tuple

maskedLmInstance(index, label)

Create new instance of maskedLmInstance(index, label)

static __new__(_cls, index, label)#

Create new instance of maskedLmInstance(index, label)

index#

Alias for field number 0

label#

Alias for field number 1
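
Example (a minimal sketch of the named tuple fields described above):

    from data_processing.utils import maskedLmInstance

    instance = maskedLmInstance(index=3, label="fox")
    print(instance.index, instance.label)  # 3 fox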

data_processing.utils.pad_input_sequence(input_sequence, padding=0, max_sequence_length=512)[source]#
data_processing.utils.pad_instance_to_max_seq_length(instance, mlm_only, tokenizer, max_seq_length, max_predictions_per_seq, output_type_shapes, inverted_mask)[source]#
data_processing.utils.split_list(l, n)[source]#

Splits a list/string into n-sized chunks.

Parameters
  • l (List[str]) – List or string to split.

  • n (int) – Size of each chunk.

Returns List[List]

List of lists containing the chunks of the split list/string.
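
Example (a minimal sketch; the comment assumes n is the chunk size, as in the summary above):

    from data_processing.utils import split_list

    chunks = split_list(["a", "b", "c", "d", "e"], 2)
    # chunks == [["a", "b"], ["c", "d"], ["e"]]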

data_processing.utils.text_to_tokenized_documents(data, tokenizer, multiple_docs_in_single_file, multiple_docs_separator, single_sentence_per_line, spacy_nlp)[source]#

Converts the input data into tokens.

Parameters
  • data (str) – Contains data read from a text file

  • tokenizer – Tokenizer object which contains functions to convert words to tokens

  • multiple_docs_in_single_file (bool) – Indicates whether there are multiple documents in the given data string

  • multiple_docs_separator (str) – String used to separate documents when there are multiple documents in data. The separator can be anything, for example a blank line or a special string such as "-----". There can only be one separator string for all the documents.

  • single_sentence_per_line (bool) – Indicates whether the data contains one sentence in each line

  • spacy_nlp – spaCy nlp object loaded with spacy.load(). Used to segment a string into sentences.

Returns List[List[List]] documents

Contains the tokens corresponding to sentences in documents, as a list of lists of lists (e.g. [[[], []], [[], [], []]]); documents[i][j] is the list of tokens in sentence j of document i.
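
Example (a minimal sketch; the spaCy model name and input file are hypothetical, and the tokenizer is left as a placeholder because its concrete class is not documented on this page):

    import spacy

    from data_processing.utils import text_to_tokenized_documents

    spacy_nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with sentence segmentation
    tokenizer = ...  # a tokenizer object as expected by the Model Zoo scripts

    with open("corpus.txt", "r") as f:  # hypothetical input text file
        data = f.read()

    documents = text_to_tokenized_documents(
        data=data,
        tokenizer=tokenizer,
        multiple_docs_in_single_file=True,
        multiple_docs_separator="-----",
        single_sentence_per_line=False,
        spacy_nlp=spacy_nlp,
    )
    # documents[i][j] is the list of tokens for sentence j of document i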

data_processing.utils.whitespace_tokenize(text, lower=False)[source]#

Splits a piece of text based on whitespace characters.
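
Example (a minimal sketch; the effect of lower=True is assumed to be lower-casing the text before splitting):

    from data_processing.utils import whitespace_tokenize

    tokens = whitespace_tokenize("  Hello   World  ", lower=True)
    # tokens == ["hello", "world"]  (assuming lower=True lower-cases the text)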

Subpackages#