cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessorConfig#

class cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessorConfig(*args, **kwargs)[source]#

Bases: cerebras.modelzoo.config.data_config.DataConfig

Methods

check_for_deprecated_fields

check_literal_discriminator_field

copy

get_disable_nsp

get_orig_class

get_orig_class_args

get_vocab_file

model_copy

model_post_init

post_init

Attributes

attn_mask_pad_id

batch_size

The batch size.

buckets

A list of bucket boundaries.

data_dir

Path to the data files to use.

disable_nsp

Whether Next Sentence Prediction (NSP) objective is disabled.

discriminator

discriminator_value

do_lower

Whether to lowercase the text.

document_separator_token

Separator token.

drop_last

Whether to drop the last batch of an epoch if it is incomplete.

dynamic_mlm_scale

Whether to dynamically scale the loss.

exclude_from_masking

Tokens that should be excluded from being masked.

gather_mlm_labels

input_pad_id

labels_pad_id

mask_token

Mask token.

mask_whole_word

Whether to mask entire words rather than individual subword tokens.

masked_lm_prob

max_predictions_per_seq

max_sequence_length

mixed_precision

model_config

num_examples

num_workers

The number of PyTorch processes used in the dataloader.

oov_token

Out of vocabulary token.

persistent_workers

Whether or not to keep workers persistent between epochs.

prefetch_factor

The number of batches to prefetch in the dataloader.

segment_pad_id

shuffle

Whether or not to shuffle the dataset.

shuffle_buffer

Buffer size to shuffle samples across.

shuffle_seed

The seed used for deterministic shuffling.

steps

vocab_file

Path to the vocabulary file.

vocab_size

whole_word_masking

data_processor

data_dir = Ellipsis#

Path to the data files to use.

batch_size = Ellipsis#

The batch size.

disable_nsp = False#

Whether Next Sentence Prediction (NSP) objective is disabled.

dynamic_mlm_scale = False#

Whether to dynamically scale the loss.

buckets = None#

A list of bucket boundaries. If set to None, no bucketing happens and data is batched normally. If set to a list, data is grouped into len(buckets) + 1 buckets. A sample s goes into bucket i if buckets[i-1] <= element_length_fn(s) < buckets[i], where 0 and inf are the implied lowest and highest boundaries, respectively. buckets must be sorted and all elements must be non-zero.
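The boundary rule above is equivalent to a right-bisect into the sorted boundary list. A minimal sketch of that assignment (illustrative only; `bucket_index` is a hypothetical helper, not part of this class's API):

```python
import bisect

def bucket_index(length, buckets):
    """Return the bucket index for a sample of the given length.

    buckets is a sorted list of boundaries. A length l falls into
    bucket i when buckets[i-1] <= l < buckets[i], with 0 and infinity
    as the implied outer boundaries, giving len(buckets) + 1 buckets.
    """
    # bisect_right places a length equal to a boundary into the
    # higher bucket, matching the inclusive lower bound buckets[i-1] <= l.
    return bisect.bisect_right(buckets, length)
```

For example, with `buckets = [128, 256]` there are three buckets: lengths below 128, lengths in [128, 256), and lengths of 256 or more.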

mask_whole_word = False#

Whether to mask entire words rather than individual subword tokens.

do_lower = False#

Whether to lowercase the text.

vocab_file = Ellipsis#

Path to the vocabulary file.

oov_token = '[UNK]'#

Out of vocabulary token.

mask_token = '[MASK]'#

Mask token.

document_separator_token = '[SEP]'#

Separator token.

exclude_from_masking = ['[CLS]', '[SEP]', '[PAD]', '[MASK]']#

Tokens that should be excluded from being masked.

shuffle = True#

Whether or not to shuffle the dataset.

shuffle_seed = None#

The seed used for deterministic shuffling.

shuffle_buffer = None#

Buffer size to shuffle samples across. If None and shuffle is enabled, 10*batch_size is used.
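The documented fallback can be expressed as a one-line rule (a sketch of the stated behavior, not the class's internal code; `effective_shuffle_buffer` is a hypothetical helper name):

```python
def effective_shuffle_buffer(shuffle_buffer, batch_size, shuffle=True):
    # Per the docs: if shuffle_buffer is None and shuffling is
    # enabled, fall back to 10 * batch_size.
    if shuffle_buffer is None and shuffle:
        return 10 * batch_size
    return shuffle_buffer
```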

num_workers = 0#

The number of PyTorch processes used in the dataloader.

prefetch_factor = 10#

The number of batches to prefetch in the dataloader.

persistent_workers = True#

Whether or not to keep workers persistent between epochs.

drop_last = True#

Whether to drop the last batch of an epoch if it is incomplete.
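The optional defaults documented on this page can be collected into a plain dict, e.g. as a starting point for a YAML params file. This is illustrative only and makes no claim about the class's constructor signature; note that `data_dir`, `batch_size`, and `vocab_file` are required fields with no default (shown as Ellipsis above) and must always be supplied:

```python
# Defaults as documented above for BertCSVDynamicMaskDataProcessorConfig.
bert_csv_dynamic_mask_defaults = {
    "disable_nsp": False,
    "dynamic_mlm_scale": False,
    "buckets": None,
    "mask_whole_word": False,
    "do_lower": False,
    "oov_token": "[UNK]",
    "mask_token": "[MASK]",
    "document_separator_token": "[SEP]",
    "exclude_from_masking": ["[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    "shuffle": True,
    "shuffle_seed": None,
    "shuffle_buffer": None,  # None => 10 * batch_size when shuffle is on
    "num_workers": 0,
    "prefetch_factor": 10,
    "persistent_workers": True,
    "drop_last": True,
}
```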