cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessor

class cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessor(*args, **kwargs)

Bases: torch.utils.data.IterableDataset

Reads CSV files containing the input text tokens and MLM features.

Methods

create_dataloader

Classmethod to create the dataloader object.

get_single_item

Iterates over the data to construct input features.

load_buffer

Generator to read the data in chunks of size of data_buffer.

load_buffer()

Generator to read the data in chunks of size of data_buffer.

Returns

Yields the data stored in the data_buffer.
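The buffered-read pattern described above can be illustrated with a minimal sketch. This is not the Cerebras implementation: the function name `load_in_chunks`, the `buffer_size` parameter, and the sample column names are all hypothetical, chosen only to show how a generator might read CSV rows in fixed-size chunks and yield them one at a time.

```python
import csv
import io
from itertools import islice


def load_in_chunks(csv_file, buffer_size):
    """Yield rows from an open CSV file, buffering up to buffer_size rows at a time.

    Hypothetical helper for illustration; not part of the documented class.
    """
    reader = csv.DictReader(csv_file)
    while True:
        # Pull at most buffer_size rows into memory.
        buffer = list(islice(reader, buffer_size))
        if not buffer:
            break
        # Hand the buffered rows to the caller one by one.
        yield from buffer


# Usage with an in-memory file; the column names are made up for the example.
sample = io.StringIO("tokens,label\nhello world,0\nfoo bar,1\nbaz qux,0\n")
rows = list(load_in_chunks(sample, buffer_size=2))
```

Buffering this way keeps memory use bounded by the chunk size rather than the file size, which is presumably the motivation for a data_buffer in the documented class.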

get_single_item()

Iterates over the data to construct input features.

Returns

A tuple with training features:

  • np.array[int32] input_ids: Numpy array with input token indices.

    Shape: (max_sequence_length).

  • np.array[int32] labels: Numpy array with labels.

    Shape: (max_sequence_length).

  • np.array[int32] attention_mask

    Shape: (max_sequence_length).

  • np.array[int32] token_type_ids: Numpy array with segment indices.

    Shape: (max_sequence_length).

  • np.array[int32] next_sentence_label: Numpy array with labels for NSP task.

    Shape: (1).

  • np.array[int32] masked_lm_mask: Numpy array with a mask of predicted tokens.

    Shape: (max_predictions). 0 indicates a non-masked token, and 1 indicates a masked token.
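To make the shapes and dtypes listed above concrete, here is a small sketch that builds a placeholder tuple in the same order as the list. The helper name `make_placeholder_features` and the sizes used are assumptions for illustration only; the arrays are zero-filled and carry no real data.

```python
import numpy as np


def make_placeholder_features(max_sequence_length, max_predictions):
    """Return zero-filled int32 arrays with the documented shapes.

    Hypothetical helper; the tuple order mirrors the list in this section.
    """
    def seq():
        return np.zeros(max_sequence_length, dtype=np.int32)

    return (
        seq(),                                    # input_ids
        seq(),                                    # labels
        seq(),                                    # attention_mask
        seq(),                                    # token_type_ids
        np.zeros(1, dtype=np.int32),              # next_sentence_label
        np.zeros(max_predictions, dtype=np.int32),  # masked_lm_mask
    )


# Example sizes are arbitrary.
features = make_placeholder_features(max_sequence_length=8, max_predictions=3)
shapes = [f.shape for f in features]
```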

create_dataloader()

Classmethod to create the dataloader object.