cerebras.modelzoo.data.nlp.bert.BertTokenClassifierDataProcessor.BertTokenClassifierDataProcessor
- class cerebras.modelzoo.data.nlp.bert.BertTokenClassifierDataProcessor.BertTokenClassifierDataProcessor[source]
Bases: torch.utils.data.IterableDataset
Reads CSV files containing the input token ids and label ids, and creates attention_masks and segment_ids on the fly.
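For intuition, the on-the-fly construction amounts to something like the sketch below; the pad token id of 0 and the single-segment assumption are illustrative guesses, not details taken from the class.

    # Illustrative sketch only; pad id and inputs are assumed, not from the class.
    pad_id = 0
    input_ids = [101, 7592, 2088, 102, 0, 0]  # an example padded sequence
    attention_mask = [0 if tok == pad_id else 1 for tok in input_ids]  # 1 on real tokens
    segment_ids = [0] * len(input_ids)  # single-segment input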
- Param params (dict): Input parameters for creating the dataset. Expects the following fields (see the configuration sketch after this list):
“vocab_file” (str): Path to the vocab file.
“label_vocab_file” (str): Path to json file with class name to class index.
“data_dir” (str): Path to directory containing the CSV files.
“batch_size” (int): Batch size.
“max_sequence_length” (int): Maximum length of the sequence.
“do_lower” (bool): Flag to lowercase the text.
“shuffle” (bool): Flag to enable data shuffling.
“shuffle_seed” (int): Shuffle seed.
“shuffle_buffer” (int): Shuffle buffer size.
“num_workers” (int): How many subprocesses to use for data loading.
“drop_last” (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
“prefetch_factor” (int): Number of batches loaded in advance by each worker.
“persistent_workers” (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.
- Param model_params (dict): Model parameters for creating the dataset. Expects the following to be defined:
“include_padding_in_loss” (bool): If set to True, a loss mask will be generated so that padding tokens are included in the loss calculation.
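Putting the two dictionaries together, construction might look like the sketch below. The paths and values are placeholders, and the constructor/create_dataloader call pattern is an assumption based on the method summaries that follow, not a verified signature.

    from cerebras.modelzoo.data.nlp.bert.BertTokenClassifierDataProcessor import (
        BertTokenClassifierDataProcessor,
    )

    # All paths and values are placeholders for illustration.
    params = {
        "vocab_file": "/path/to/vocab.txt",
        "label_vocab_file": "/path/to/label_map.json",
        "data_dir": "/path/to/csv_dir",
        "batch_size": 32,
        "max_sequence_length": 128,
        "do_lower": True,
        "shuffle": True,
        "shuffle_seed": 1337,
        "shuffle_buffer": 320,
        "num_workers": 4,
        "drop_last": True,
        "prefetch_factor": 2,
        "persistent_workers": True,
    }
    model_params = {"include_padding_in_loss": False}

    # Assumed usage pattern; verify the actual __init__ signature before relying on it.
    processor = BertTokenClassifierDataProcessor(params, model_params)
    dataloader = processor.create_dataloader()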
Methods

create_dataloader: Classmethod to create the dataloader object.
load_buffer: Generator to read the data in chunks of size data_buffer.
- load_buffer()[source]
Generator to read the data in chunks of size data_buffer.
- Returns
Yields the data stored in the data_buffer.
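A buffered generator of this kind typically follows the pattern sketched below; this is an assumed illustration of the chunked-read idea, not the actual implementation, and the function signature is hypothetical.

    # Hypothetical sketch of the chunked-read pattern; not the real implementation.
    def load_buffer(raw_samples, data_buffer_size):
        data_buffer = []
        for sample in raw_samples:
            data_buffer.append(sample)
            if len(data_buffer) == data_buffer_size:
                yield from data_buffer   # emit one full chunk
                data_buffer = []
        yield from data_buffer           # emit any remaining samples

    # Example: yields 0..9 in chunks of 4, 4, and 2.
    for sample in load_buffer(range(10), 4):
        print(sample)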
- __call__(*args: Any, **kwargs: Any) → Any
Call self as a function.
- static __new__(cls, *args: Any, **kwargs: Any) → Any