cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessorConfig#
- class cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessorConfig(*args, **kwargs)[source]#
Bases:
cerebras.modelzoo.config.data_config.DataConfig
Methods
check_for_deprecated_fields
check_literal_discriminator_field
copy
get_disable_nsp
get_orig_class
get_orig_class_args
get_vocab_file
model_copy
model_post_init
post_init
Attributes
attn_mask_pad_id
batch_size: The batch size.
buckets: A list of bucket boundaries.
data_dir: Path to the data files to use.
disable_nsp: Whether the Next Sentence Prediction (NSP) objective is disabled.
discriminator
discriminator_value
do_lower: Flag to lowercase the text.
document_separator_token: Separator token.
drop_last: Whether to drop the last batch of an epoch if it is incomplete.
dynamic_mlm_scale: Whether to dynamically scale the loss.
exclude_from_masking: Tokens that should be excluded from being masked.
gather_mlm_labels
input_pad_id
labels_pad_id
mask_token: Mask token.
mask_whole_word: Flag indicating whether to mask the entire word.
masked_lm_prob
max_predictions_per_seq
max_sequence_length
mixed_precision
model_config
num_examples
num_workers: The number of PyTorch processes used in the dataloader.
oov_token: Out of vocabulary token.
persistent_workers: Whether or not to keep workers persistent between epochs.
prefetch_factor: The number of batches to prefetch in the dataloader.
segment_pad_id
shuffle: Whether or not to shuffle the dataset.
shuffle_buffer: Buffer size to shuffle samples across.
shuffle_seed: The seed used for deterministic shuffling.
steps
vocab_file: Path to the vocabulary file.
vocab_size
whole_word_masking
data_processor
- data_dir = Ellipsis#
Path to the data files to use.
- batch_size = Ellipsis#
The batch size.
- disable_nsp = False#
Whether the Next Sentence Prediction (NSP) objective is disabled.
- dynamic_mlm_scale = False#
Whether to dynamically scale the loss.
- buckets = None#
A list of bucket boundaries. If set to None, no bucketing happens and data is batched normally. If set to a list, data is grouped into len(buckets) + 1 buckets. A sample s goes into bucket i if buckets[i-1] <= element_length_fn(s) < buckets[i], where 0 and inf are the implied lowest and highest boundaries, respectively. buckets must be sorted and all elements must be non-zero.
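For intuition, the bucketing rule above can be read as a right bisection over the boundary list. A minimal sketch follows; element_length_fn is abstracted away to a plain integer length, and the example boundary values are hypothetical, not defaults of this config.

import bisect

def bucket_index(sample_length, buckets):
    # Sample goes into bucket i if buckets[i-1] <= length < buckets[i],
    # with 0 and inf as the implied outer boundaries, i.e. bisect_right.
    return bisect.bisect_right(buckets, sample_length)

# With buckets = [128, 256, 384] there are len(buckets) + 1 = 4 buckets:
#   length 64 -> 0, length 128 -> 1, length 300 -> 2, length 500 -> 3
assert [bucket_index(n, [128, 256, 384]) for n in (64, 128, 300, 500)] == [0, 1, 2, 3]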
- mask_whole_word = False#
Flag indicating whether to mask the entire word.
- do_lower = False#
Flag to lowercase the text.
- vocab_file = Ellipsis#
Path to the vocabulary file.
- oov_token = '[UNK]'#
Out of vocabulary token.
- mask_token = '[MASK]'#
Mask token.
- document_separator_token = '[SEP]'#
Separator token.
- exclude_from_masking = ['[CLS]', '[SEP]', '[PAD]', '[MASK]']#
Tokens that should be excluded from being masked.
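For intuition only, here is a rough, hypothetical sketch of what dynamic masking means in terms of the fields above: mask positions are re-drawn every time a sample is served, tokens listed in exclude_from_masking are never selected, and the number of predictions is capped by max_predictions_per_seq. The 0.15 probability and the cap of 20 are assumed example values, and real BERT-style masking also replaces some selected positions with random tokens or leaves them unchanged; this is not the processor's actual implementation.

import random

def dynamically_mask(tokens, mask_token='[MASK]',
                     exclude=('[CLS]', '[SEP]', '[PAD]', '[MASK]'),
                     masked_lm_prob=0.15, max_predictions_per_seq=20):
    # Candidate positions exclude the special tokens above.
    candidates = [i for i, tok in enumerate(tokens) if tok not in exclude]
    # The number of masked positions is bounded by max_predictions_per_seq.
    num_to_mask = min(max_predictions_per_seq,
                      max(1, round(len(candidates) * masked_lm_prob)))
    # Positions are sampled anew on every call, so each epoch sees a
    # different masking of the same sequence ("dynamic" masking).
    picked = sorted(random.sample(candidates, min(num_to_mask, len(candidates))))
    masked, labels = list(tokens), {}
    for i in picked:
        labels[i] = masked[i]
        masked[i] = mask_token
    return masked, labels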
- shuffle = True#
Whether or not to shuffle the dataset.
- shuffle_seed = None#
The seed used for deterministic shuffling.
- shuffle_buffer = None#
Buffer size to shuffle samples across. If None and shuffle is enabled, 10*batch_size is used.
- num_workers = 0#
The number of PyTorch processes used in the dataloader.
- prefetch_factor = 10#
The number of batches to prefetch in the dataloader.
- persistent_workers = True#
Whether or not to keep workers persistent between epochs.
- drop_last = True#
Whether to drop the last batch of an epoch if it is incomplete.
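Putting the fields together, here is a minimal sketch of constructing this config in Python, assuming it accepts its documented fields as keyword arguments (as DataConfig subclasses typically do); the paths and values below are hypothetical.

from cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor import (
    BertCSVDynamicMaskDataProcessorConfig,
)

# Fields whose default is Ellipsis above (data_dir, batch_size, vocab_file)
# must be supplied; everything else falls back to the documented defaults.
config = BertCSVDynamicMaskDataProcessorConfig(
    data_dir="./bert_csv_data/",      # hypothetical path
    batch_size=256,
    vocab_file="./vocab/vocab.txt",   # hypothetical path
    shuffle=True,
    shuffle_seed=1,
    num_workers=4,
)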