cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessorConfig#
- class cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessorConfig(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.config.data_config.DataConfig

Methods

check_for_deprecated_fields, check_literal_discriminator_field, copy, get_disable_nsp, get_orig_class, get_orig_class_args, model_copy, model_post_init, post_init

Attributes

batch_size: The batch size.
buckets: A list of bucket boundaries.
data_dir: Path to the data files to use.
disable_nsp: Whether the Next Sentence Prediction (NSP) objective is disabled.
discriminator
discriminator_value
do_lower
drop_last: Whether to drop the last batch of an epoch if it is an incomplete batch.
dynamic_mlm_scale: Whether to dynamically scale the loss.
masked_lm_prob
max_position_embeddings
max_predictions_per_seq
max_sequence_length
mixed_precision
model_config
num_workers: The number of PyTorch processes used in the dataloader.
persistent_workers: Whether or not to keep workers persistent between epochs.
prefetch_factor: The number of batches to prefetch in the dataloader.
shuffle: Whether or not to shuffle the dataset.
shuffle_buffer: Buffer size to shuffle samples across.
shuffle_seed: The seed used for deterministic shuffling.
vocab_file
vocab_size
whole_word_masking
data_processor

- data_dir = Ellipsis#
Path to the data files to use.
- batch_size = Ellipsis#
The batch size.
- disable_nsp = False#
Whether the Next Sentence Prediction (NSP) objective is disabled.
- dynamic_mlm_scale = False#
Whether to dynamically scale the loss.
- buckets = None#
A list of bucket boundaries. If set to None, no bucketing happens and data is batched normally. If set to a list, data is grouped into len(buckets) + 1 buckets. A sample s goes into bucket i if buckets[i-1] <= element_length_fn(s) < buckets[i], where 0 and inf are the implied lowest and highest boundaries, respectively. buckets must be sorted and all elements must be non-zero. See the sketch below for a concrete mapping from sample lengths to bucket indices.
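The following sketch illustrates the rule above. It is not taken from the library: the boundary values are placeholders and the explicit length argument stands in for whatever element_length_fn the dataloader applies internally.

```python
import bisect

# With buckets = [128, 256], samples fall into len(buckets) + 1 = 3 groups.
buckets = [128, 256]

def bucket_index(sample_length: int) -> int:
    # Bucket i holds samples with buckets[i-1] <= length < buckets[i],
    # where 0 and inf are the implied outer boundaries.
    return bisect.bisect_right(buckets, sample_length)

assert bucket_index(100) == 0   # 0   <= 100 < 128
assert bucket_index(128) == 1   # 128 <= 128 < 256
assert bucket_index(300) == 2   # 256 <= 300 < inf
```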
- shuffle = False#
Whether or not to shuffle the dataset.
- shuffle_seed = None#
The seed used for deterministic shuffling.
- shuffle_buffer = None#
Buffer size to shuffle samples across. If None and shuffle is enabled, 10*batch_size is used.
- num_workers = 0#
The number of PyTorch processes used in the dataloader.
- prefetch_factor = 2#
The number of batches to prefetch in the dataloader.
- persistent_workers = False#
Whether or not to keep workers persistent between epochs.
- drop_last = True#
Whether to drop the last batch of an epoch if it is an incomplete batch.
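Putting the fields together, here is a minimal sketch of a parameter set for this processor. The keys mirror the attributes documented above; the path and values are placeholders, and how the settings are consumed (directly as keyword arguments or via the input section of a params YAML) is not shown on this page.

```python
# Minimal sketch, not an official example: keys mirror the documented
# attributes; the path and values are placeholders.
bert_csv_input_params = {
    "data_dir": "/path/to/preprocessed/csv",  # required (no default)
    "batch_size": 256,                        # required (no default)
    "disable_nsp": False,
    "shuffle": True,
    "shuffle_seed": 1,
    "shuffle_buffer": None,  # falls back to 10 * batch_size when shuffling
    "num_workers": 4,
    "prefetch_factor": 2,
    "persistent_workers": True,
    "drop_last": True,
}
```

Constructing the config object itself, e.g. BertCSVDataProcessorConfig(**bert_csv_input_params), should follow from the pydantic-style construction implied by methods such as model_copy and model_post_init, but that detail (including the expected value of the data_processor discriminator field) is an assumption rather than something documented here.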