Dataloaders and Processing scripts in Cerebras Model Zoo#

data_processing.GenericDataProcessor module#

PyTorch Generic Dataloader

class data_processing.GenericDataProcessor.GenericDataProcessor[source]#

Bases: object

A Generic PyTorch Data Processor.

Parameters

params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "shuffle_buffer" (int): Size of shuffle buffer in samples.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • "prefetch_factor" (int): Number of batches loaded in advance by each worker.

  • "persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

__init__(params)[source]#
create_dataloader()[source]#

Creates the dataloader object.
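
Example (a minimal sketch; the parameter values are hypothetical and the import path may need to be adjusted for your installation):

    from data_processing.GenericDataProcessor import GenericDataProcessor

    # Hypothetical values for the fields documented above.
    params = {
        "batch_size": 32,
        "shuffle": True,
        "shuffle_seed": 1,
        "shuffle_buffer": 16384,
        "num_workers": 4,
        "drop_last": True,
        "prefetch_factor": 10,
        "persistent_workers": True,
    }

    dataloader = GenericDataProcessor(params).create_dataloader()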

data_processing.HDF5IterableDataProcessor module#

PyTorch HDF5 Dataloader

class data_processing.HDF5IterableDataProcessor.HDF5IterableDataProcessor[source]#

Bases: object

An HDF5 dataset processor. Loads data from HDF5 files.

Parameters

params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "shuffle_buffer" (int): Size of shuffle buffer in samples.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

  • "prefetch_factor" (int): Number of batches loaded in advance by each worker.

  • "persistent_workers" (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.

__init__(params)[source]#
create_dataloader()[source]#

Creates the dataloader object.
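
Example (a minimal sketch; values are hypothetical, and "data_dir" is an assumed field forwarded to the underlying HDF5IterableDataset documented below):

    from data_processing.HDF5IterableDataProcessor import HDF5IterableDataProcessor

    params = {
        "data_dir": "/path/to/hdf5/train",  # assumed dataset location field
        "batch_size": 32,
        "shuffle": True,
        "shuffle_seed": 1,
        "shuffle_buffer": 16384,
        "num_workers": 4,
        "drop_last": True,
        "prefetch_factor": 10,
        "persistent_workers": True,
    }

    dataloader = HDF5IterableDataProcessor(params).create_dataloader()
    for batch in dataloader:
        ...  # consume batches during training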

data_processing.HDF5IterableDataset module#

PyTorch HDF5 Dataset

class data_processing.HDF5IterableDataset.HDF5IterableDataset[source]#

Bases: torch.utils.data.IterableDataset

An HDF5 dataset processor. Loads data from HDF5 files.

Parameters

params (dict) –

dict containing training input parameters for creating dataset.

Expects the following fields:

  • "data_dir" (str or list of str): Path to dataset HDF5 files.

  • "batch_size" (int): Batch size.

  • "shuffle" (bool): Flag to enable data shuffling.

  • "shuffle_seed" (int): Shuffle seed.

  • "num_workers" (int): How many subprocesses to use for data loading.

  • "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.

__init__(params)[source]#
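
Example (a minimal sketch; paths are hypothetical, and wrapping the dataset in a standard torch DataLoader is shown only to illustrate that it is an IterableDataset; in practice the data processor classes above build the dataloader):

    import torch
    from data_processing.HDF5IterableDataset import HDF5IterableDataset

    # Hypothetical values for the fields documented above.
    params = {
        "data_dir": ["/path/to/train_0.h5", "/path/to/train_1.h5"],
        "batch_size": 32,
        "shuffle": True,
        "shuffle_seed": 1,
        "num_workers": 0,
        "drop_last": True,
    }

    dataset = HDF5IterableDataset(params)
    # batch_size=None disables automatic batching, assuming the dataset
    # already yields samples in its own format.
    loader = torch.utils.data.DataLoader(dataset, batch_size=None)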

data_processing.utils module#

data_processing.utils.convert_str_to_int_list(s)[source]#

Converts a string of the form "[1, 5, 7, 2]" (e.g. from parsing a CSV file) to a list of integers.
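
Example (a minimal sketch, assuming the import path shown in the signature above):

    from data_processing.utils import convert_str_to_int_list

    ids = convert_str_to_int_list("[1, 5, 7, 2]")
    print(ids)  # [1, 5, 7, 2]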

data_processing.utils.convert_to_unicode(text)[source]#

Converts text to unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
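
Example (a minimal sketch; passing UTF-8 encoded bytes is an assumption based on the description above):

    from data_processing.utils import convert_to_unicode

    text = convert_to_unicode(b"caf\xc3\xa9")  # UTF-8 bytes for "café"
    print(text)  # café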

data_processing.utils.count_total_documents(metadata_files)[source]#

Counts total number of documents in metadata_files.

Parameters

metadata_files (str or list[str]) – Path or list of paths to metadata files.

Returns

Number of documents whose paths are contained in the metadata files.
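
Example (a minimal sketch; the metadata file names are hypothetical):

    from data_processing.utils import count_total_documents

    num_docs = count_total_documents(["train_metadata.txt", "val_metadata.txt"])
    print(f"Total documents: {num_docs}")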

data_processing.utils.create_masked_lm_predictions(tokens, vocab_words, mask_whole_word, max_predictions_per_seq, masked_lm_prob, rng, exclude_from_masking=None)[source]#

Creates the predictions for the masked LM objective.

Parameters
  • tokens (list) – List of tokens to process

  • vocab_words (list) – List of all words present in the vocabulary

  • mask_whole_word (bool) – If True, mask all the subtokens of a word

  • max_predictions_per_seq (int) – Maximum number of masked LM predictions per sequence

  • masked_lm_prob (float) – Masked LM probability

  • rng – random.Random object with shuffle function

  • exclude_from_masking (Optional[list]) – List of tokens to exclude from masking. Defaults to ["[CLS]", "[SEP]"].

Returns

A tuple of the tokens with masking applied, the corresponding positions of the masked tokens, and the corresponding labels for training.
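
Example (a minimal sketch; the token sequence and vocabulary are toy values, and the three-way unpacking follows the Returns description above):

    import random

    from data_processing.utils import create_masked_lm_predictions

    tokens = ["[CLS]", "the", "quick", "brown", "fox", "[SEP]"]
    vocab_words = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "quick", "brown", "fox", "jumps"]

    masked_tokens, masked_positions, masked_labels = create_masked_lm_predictions(
        tokens=tokens,
        vocab_words=vocab_words,
        mask_whole_word=False,
        max_predictions_per_seq=20,
        masked_lm_prob=0.15,
        rng=random.Random(0),
    )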

data_processing.utils.get_files_in_metadata(metadata_filepaths)[source]#

Function to read the input files listed in the metadata files provided to the data generation scripts.

Parameters

metadata_filepaths – Path or list of paths to metadata files.

Returns List input_files

Contents of the metadata files, i.e. the input file paths they list.
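
Example (a minimal sketch; "train_metadata.txt" is a hypothetical metadata file listing input files):

    from data_processing.utils import get_files_in_metadata

    input_files = get_files_in_metadata("train_metadata.txt")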

data_processing.utils.get_label_id_map(label_vocab_file)[source]#

Loads the label-id mapping: the mapping between output labels and ids.

Parameters

label_vocab_file (str) – Path to the label vocab file
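
Example (a minimal sketch; "label_vocab.txt" is a hypothetical path, and the expected file format is defined by the Model Zoo data generation scripts):

    from data_processing.utils import get_label_id_map

    label_id_map = get_label_id_map("label_vocab.txt")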

data_processing.utils.get_output_type_shapes(max_seq_length, max_predictions_per_seq, mlm_only=False)[source]#
data_processing.utils.get_vocab(vocab_file_path, do_lower)[source]#

Function to generate vocab from provided vocab_file_path.

Parameters
  • vocab_file_path (str) – Path to vocab file

  • do_lower (bool) – If True, convert vocab words to lower case.

Returns List[str]

List containing the vocab words.
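
Example (a minimal sketch; the vocab path is hypothetical):

    from data_processing.utils import get_vocab

    vocab_words = get_vocab("/path/to/vocab.txt", do_lower=True)
    print(len(vocab_words))  # number of (lower-cased) vocab words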

class data_processing.utils.maskedLmInstance#

Bases: tuple

maskedLmInstance(index, label)

Create new instance of maskedLmInstance(index, label)

static __new__(_cls, index, label)#

Create new instance of maskedLmInstance(index, label)

index#

Alias for field number 0

label#

Alias for field number 1
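
Example (a minimal sketch of the named tuple fields described above):

    from data_processing.utils import maskedLmInstance

    instance = maskedLmInstance(index=3, label="fox")
    print(instance.index, instance.label)  # 3 fox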

data_processing.utils.pad_input_sequence(input_sequence, padding=0, max_sequence_length=512)[source]#
data_processing.utils.pad_instance_to_max_seq_length(instance, mlm_only, tokenizer, max_seq_length, max_predictions_per_seq, output_type_shapes, inverted_mask)[source]#
data_processing.utils.split_list(l, n)[source]#

Splits a list/string into n-sized chunks.

Parameters
  • l (List[str]) – List or string to split.

  • n (int) – Size of each chunk.

Returns List[List]

List of lists containing the chunks of the split list/string.
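
Example (a minimal sketch; the comment assumes n is the chunk size, as in the summary above):

    from data_processing.utils import split_list

    chunks = split_list(["a", "b", "c", "d", "e"], 2)
    # chunks == [["a", "b"], ["c", "d"], ["e"]]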

data_processing.utils.text_to_tokenized_documents(data, tokenizer, multiple_docs_in_single_file, multiple_docs_separator, single_sentence_per_line, spacy_nlp)[source]#

Converts the input data into tokens.

Parameters
  • data (str) – Contains data read from a text file

  • tokenizer – Tokenizer object which contains functions to convert words to tokens

  • multiple_docs_in_single_file (bool) – Indicates whether there are multiple documents in the given data string

  • multiple_docs_separator (str) – String used to separate documents when there are multiple documents in data. The separator can be anything, for example a blank line or a special string such as "-----". There can only be one separator string for all the documents.

  • single_sentence_per_line (bool) – Indicates whether the data contains one sentence in each line

  • spacy_nlp – spaCy nlp object loaded with spacy.load(). Used to segment a string into sentences.

Returns List[List[List]] documents

Contains the tokens corresponding to sentences in documents, as a list of lists of lists (e.g. [[[], []], [[], [], []]]); documents[i][j] is the list of tokens in sentence j of document i.
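
Example (a minimal sketch; the spaCy model name and input file are hypothetical, and the tokenizer is left as a placeholder because its concrete class is not documented on this page):

    import spacy

    from data_processing.utils import text_to_tokenized_documents

    spacy_nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with sentence segmentation
    tokenizer = ...  # a tokenizer object as expected by the Model Zoo scripts

    with open("corpus.txt", "r") as f:  # hypothetical input text file
        data = f.read()

    documents = text_to_tokenized_documents(
        data=data,
        tokenizer=tokenizer,
        multiple_docs_in_single_file=True,
        multiple_docs_separator="-----",
        single_sentence_per_line=False,
        spacy_nlp=spacy_nlp,
    )
    # documents[i][j] is the list of tokens for sentence j of document i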

data_processing.utils.whitespace_tokenize(text, lower=False)[source]#

Splits a piece of text based on whitespace characters.
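
Example (a minimal sketch; the effect of lower=True is assumed to be lower-casing the text before splitting):

    from data_processing.utils import whitespace_tokenize

    tokens = whitespace_tokenize("  Hello   World  ", lower=True)
    # tokens == ["hello", "world"]  (assuming lower=True lower-cases the text)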

Subpackages#