cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig

- class cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig(*args, **kwargs)

Bases: cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessorConfig

Methods

- check_for_deprecated_fields
- check_literal_discriminator_field
- copy
- get_orig_class
- get_orig_class_args
- model_copy
- model_post_init
- post_init
- validate_max_sequence_length

Attributes
- batch_size: Number of sequences per batch.
- buckets: A list of boundaries for sequence lengths to bucket together in order to speed up VTS/VSL.
- discriminator
- discriminator_value
- do_lower: If True, will lowercase all tokens in the vocabulary.
- drop_last: If the last batch is not the full size, i.e. the dataset could not be divided evenly into batches of the requested size, do not use the last batch.
- dynamic_loss_weight: If set, will divide the loss for a token by the length of the sequence that the token comes from.
- eos_token: Token for end-of-sequence.
- extra_ids: Number of sentinel tokens for the T5 objective.
- fp16_type
- input_pad_id: Can set specific padding for inputs.
- labels_pad_id: Can set specific padding for labels.
- mixed_precision
- model_config
- num_documents_to_concatenate: Specifies how many documents to pack together.
- num_workers: Number of processes that move data to the accelerator system, so that the system doesn't process data faster than it receives it.
- oov_token: Token for out-of-vocabulary words/sub-words.
- pack_sequences: If set, will concatenate sequences so that computation is performed on real data rather than padding.
- pad_token: Token for padding.
- persistent_workers: If set, workers will not be shut down after going through the dataset once.
- prefetch_factor: Number of batches loaded in advance by each worker.
- shuffle: If true, the data will be shuffled before being passed into the model.
- shuffle_buffer: Size of the buffer used to store data before shuffling.
- shuffle_seed: Sets the random seed for the order of data shuffling.
- sos_token: Token for start-of-sequence.
- src_data_dir: Path to the directory containing the output of preprocess.sh, with all the files of tokenized data.
- src_max_sequence_length: Largest possible sequence length for the input.
- src_vocab_file: Path to the file containing the tokens of the vocabulary, one token per line.
- tgt_data_dir
- tgt_max_sequence_length: Largest possible sequence length for the labels.
- tgt_vocab_file
- vocab_size
- data_processor
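Since this config inherits pydantic-style helpers such as model_copy and model_post_init, it can presumably be constructed with the documented fields as keyword arguments. The sketch below is illustrative only, assuming keyword initialization: the paths and values shown are hypothetical placeholders, and any fields not listed keep their defaults.

```python
# A minimal sketch, assuming the config accepts its documented fields as
# keyword arguments (implied by the pydantic-style methods listed above).
# All paths and values here are illustrative placeholders.
from cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor import (
    TransformerDynamicDataProcessorConfig,
)

config = TransformerDynamicDataProcessorConfig(
    src_vocab_file="/path/to/vocab.txt",        # one token per line
    src_data_dir="/path/to/preprocessed/data",  # output of preprocess.sh
    src_max_sequence_length=256,                # largest input sequence length
    tgt_max_sequence_length=256,                # largest label sequence length
    batch_size=32,                              # sequences per batch
    shuffle=True,
    shuffle_seed=1,
    num_workers=4,                              # data-loading processes
    prefetch_factor=2,                          # batches prefetched per worker
    persistent_workers=True,                    # keep workers alive across epochs
)
```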