cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessor#
- class cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessor[source]#
- Bases: torch.utils.data.IterableDataset
- Reads CSV files containing the input text tokens and MLM features.
- Parameters: params (dict) – dict containing input parameters for creating the dataset. Expects the following fields (a usage sketch follows this list):
- “data_dir” (string): Path to the data files to use. 
- “batch_size” (int): Batch size. 
- “shuffle” (bool): Flag to enable data shuffling. 
- “shuffle_seed” (int): Shuffle seed. 
- “shuffle_buffer” (int): Shuffle buffer size. 
- “dynamic_mlm_scale” (bool): Flag to dynamically scale the loss. 
- “num_workers” (int): How many subprocesses to use for data loading. 
- “drop_last” (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped. 
- “prefetch_factor” (int): Number of samples loaded in advance by each worker. 
- “persistent_workers” (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once. 
- “mixed_precision” (bool): Casts input mask to fp16 if set to True. Otherwise, the generated mask is float32. 
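A minimal construction sketch, assuming the constructor accepts the params dict described above. Every value below is a placeholder rather than a library default, and wrapping the processor in a standard torch DataLoader is shown only because the class is an IterableDataset; in practice the dataloader-creation classmethod listed under Methods below would typically be used instead.

```python
import torch
from cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor import (
    BertCSVDataProcessor,
)

# Placeholder parameter values; the data_dir path is hypothetical.
params = {
    "data_dir": "/path/to/preprocessed/csv",
    "batch_size": 256,
    "shuffle": True,
    "shuffle_seed": 1,
    "shuffle_buffer": 16384,
    "dynamic_mlm_scale": True,
    "num_workers": 4,
    "drop_last": True,
    "prefetch_factor": 2,
    "persistent_workers": True,
    "mixed_precision": True,
}

dataset = BertCSVDataProcessor(params)

# Because the processor is a torch.utils.data.IterableDataset, it can be
# iterated via a standard DataLoader (illustrative only; not the library's
# recommended entry point).
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=params["batch_size"],
    num_workers=params["num_workers"],
    drop_last=params["drop_last"],
    prefetch_factor=params["prefetch_factor"],
    persistent_workers=params["persistent_workers"],
)
```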
 - Methods:
  - Classmethod to create the dataloader object.
  - get_single_item(): Iterating over the data to construct input features.
  - load_buffer(): Generator to read the data in chunks of size data_buffer.

 - load_buffer()[source]#
- Generator to read the data in chunks of size data_buffer.
  - Returns: Yields the data stored in the data_buffer (an illustrative sketch of this buffering pattern follows).

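The buffered reading behavior described above can be pictured with the following sketch. This is not the library's implementation; the names load_buffer_sketch, csv_files, and buffer_size are illustrative assumptions.

```python
import csv

def load_buffer_sketch(csv_files, buffer_size=1024):
    # Accumulate CSV rows into a fixed-size buffer, then yield them one at a
    # time before starting the next chunk.
    data_buffer = []
    for path in csv_files:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                data_buffer.append(row)
                if len(data_buffer) >= buffer_size:
                    yield from data_buffer
                    data_buffer = []
    # Flush any remaining rows in the final, possibly smaller, chunk.
    yield from data_buffer
```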
 - get_single_item()[source]#
- Iterating over the data to construct input features. 
  - Returns: A tuple with training features (a sketch for inspecting them follows this list):
    - np.array[int32] input_ids: Numpy array with input token indices. Shape: (max_sequence_length).
    - np.array[int32] labels: Numpy array with labels. Shape: (max_sequence_length).
    - np.array[int32] attention_mask. Shape: (max_sequence_length).
    - np.array[int32] token_type_ids: Numpy array with segment indices. Shape: (max_sequence_length).
    - np.array[int32] next_sentence_label: Numpy array with labels for the NSP task. Shape: (1).
    - np.array[int32] masked_lm_mask: Numpy array with a mask of predicted tokens. Shape: (max_predictions). 0 indicates a non-masked token, and 1 indicates a masked token.

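A small sketch for inspecting the features listed above, assuming they arrive as a tuple in the order given. FEATURE_NAMES and describe_features are hypothetical helpers, and dataset refers to a constructed BertCSVDataProcessor instance.

```python
import numpy as np

# Assumed ordering, matching the Returns list above.
FEATURE_NAMES = (
    "input_ids",
    "labels",
    "attention_mask",
    "token_type_ids",
    "next_sentence_label",
    "masked_lm_mask",
)

def describe_features(sample):
    # Print the shape and dtype of each feature array in a yielded sample.
    for name, arr in zip(FEATURE_NAMES, sample):
        arr = np.asarray(arr)
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")

# Example (assumes `dataset` was constructed as in the earlier sketch):
# describe_features(next(iter(dataset)))
```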
 
 - __call__(*args: Any, **kwargs: Any) → Any#
- Call self as a function. 
 - static __new__(cls, *args: Any, **kwargs: Any) → Any#