cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessor
- class cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessor(*args, **kwargs)
Bases: torch.utils.data.IterableDataset
Reads CSV files containing the input text tokens and adds MLM features on the fly.
Methods
Classmethod to create the dataloader object.
get_single_item(): Iterates over the data to construct input features.
load_buffer(): Generator that reads the data in chunks of size data_buffer.
- load_buffer()
Generator that reads the data in chunks of size data_buffer.
- Returns
Yields the data stored in the data_buffer.
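The chunked reading described for load_buffer() can be illustrated with a small generator. This is a minimal sketch under stated assumptions, not the processor's actual implementation: the function name `load_buffer_sketch`, the `buffer_size` parameter, and the CSV column name `tokens` are all hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, List


def load_buffer_sketch(rows: Iterable[dict], buffer_size: int) -> Iterator[dict]:
    """Yield rows one at a time, refilling a fixed-size buffer as it drains."""
    data_buffer: List[dict] = []
    for row in rows:
        data_buffer.append(row)
        if len(data_buffer) == buffer_size:
            # Buffer is full: hand out its contents, then start a new chunk.
            yield from data_buffer
            data_buffer = []
    # Flush whatever is left once the input is exhausted.
    yield from data_buffer


# Hypothetical CSV content; the column name "tokens" is an assumption.
csv_text = "tokens\nthe cat sat\na dog ran\nbirds fly\n"
reader = csv.DictReader(io.StringIO(csv_text))
rows_out = list(load_buffer_sketch(reader, buffer_size=2))
print([r["tokens"] for r in rows_out])
```

Buffering this way keeps at most `buffer_size` rows in memory at once, which is the usual reason to read large CSV inputs in chunks rather than loading whole files.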
- get_single_item()
Iterates over the data to construct input features.
- Returns
A tuple with training features:
- np.array[int32] input_ids: Numpy array with input token indices.
Shape: (max_sequence_length).
- np.array[int32] labels: Numpy array with labels.
Shape: (max_sequence_length).
- np.array[int32] attention_mask
Shape: (max_sequence_length).
- np.array[int32] token_type_ids: Numpy array with segment indices.
Shape: (max_sequence_length).
- np.array[int32] next_sentence_label: Numpy array with labels for NSP task.
Shape: (1).
- np.array[int32] masked_lm_mask: Numpy array with a mask of predicted tokens.
Shape: (max_predictions). 0 indicates a non-masked token, and 1 indicates a masked token.
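To make the idea of adding MLM features on the fly more concrete, the sketch below masks a token sequence at call time and pads the result to a fixed length. It is an illustration under several assumptions and does not reproduce the processor's code: the function `dynamic_mask_sketch` and all of its parameters are hypothetical, plain Python lists stand in for the NumPy arrays, selected tokens are always replaced by the mask id (a simplification of the usual BERT replacement scheme), `attention_mask` is assumed to be 1 for real tokens and 0 for padding, and `is_masked` is a per-token flag list rather than the documented masked_lm_mask of shape (max_predictions).

```python
import random
from typing import Dict, List, Optional


def dynamic_mask_sketch(
    token_ids: List[int],
    mask_id: int,
    max_sequence_length: int,
    max_predictions: int,
    masked_lm_prob: float = 0.15,
    pad_id: int = 0,
    rng: Optional[random.Random] = None,
) -> Dict[str, List[int]]:
    """Mask a random subset of tokens and pad everything to a fixed length."""
    rng = rng or random.Random()
    token_ids = token_ids[:max_sequence_length]
    num_tokens = len(token_ids)

    # Mask roughly masked_lm_prob of the tokens, at least one when possible,
    # and never more than max_predictions.
    num_to_mask = min(
        max_predictions, num_tokens, max(1, round(num_tokens * masked_lm_prob))
    )
    masked_positions = set(rng.sample(range(num_tokens), num_to_mask))

    pad = max_sequence_length - num_tokens
    input_ids = [
        mask_id if i in masked_positions else tok
        for i, tok in enumerate(token_ids)
    ] + [pad_id] * pad
    # Assumption: labels hold the original ids at every position.
    labels = list(token_ids) + [pad_id] * pad
    # Assumption: 1 marks a real token, 0 marks padding.
    attention_mask = [1] * num_tokens + [0] * pad
    is_masked = [
        1 if i in masked_positions else 0 for i in range(num_tokens)
    ] + [0] * pad

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "is_masked": is_masked,
    }


features = dynamic_mask_sketch(
    [11, 12, 13, 14, 15, 16, 17, 18],
    mask_id=103,
    max_sequence_length=10,
    max_predictions=2,
    rng=random.Random(0),
)
print(features)
```

Because the masked positions are drawn each time the function is called, the same input row can yield a different masking pattern on every pass over the data, which is what distinguishes dynamic masking from masks precomputed during preprocessing.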