cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig
class cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig(*args, **kwargs)
Bases: cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessorConfig
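A minimal construction sketch, assuming the config accepts its documented attributes as keyword arguments (typical for the Model Zoo's pydantic-style configs). All values and paths below are hypothetical; the field names come from the Attributes list that follows:

```python
from cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor import (
    TransformerDynamicDataProcessorConfig,
)

config = TransformerDynamicDataProcessorConfig(
    src_vocab_file="/data/wmt16/vocab.src",   # hypothetical path
    tgt_vocab_file="/data/wmt16/vocab.tgt",   # hypothetical path
    src_data_dir="/data/wmt16/train.src",     # hypothetical path
    tgt_data_dir="/data/wmt16/train.tgt",     # hypothetical path
    src_max_sequence_length=512,
    tgt_max_sequence_length=512,
    batch_size=256,
    shuffle=True,
    shuffle_seed=1,
    num_workers=8,
)
```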
Methods
check_for_deprecated_fields
check_literal_discriminator_field
copy
get_orig_class
get_orig_class_args
model_copy
model_post_init
post_init
validate_max_sequence_length
Attributes
batch_size
Number of sequences per batch.
buckets
A list of boundaries for sequence lengths to bucket together in order to speed up VTS/VSL.
discriminator
discriminator_value
do_lower
If True, lowercases all tokens in the vocabulary.
drop_last
If True, drops the last batch when the dataset does not divide evenly by the batch size.
dynamic_loss_weight
If set, divides the loss for each token by the length of the sequence that the token comes from (illustrated after this list).
eos_token
Token for end-of-sequence.
extra_ids
Number of sentinel tokens for the T5 objective.
fp16_type
input_pad_id
Allows setting a specific padding token ID for inputs.
labels_pad_id
Allows setting a specific padding token ID for labels.
mixed_precision
model_config
num_documents_to_concatenate
Specifies how many documents to pack together.
num_workers
Number of worker processes that move data to the accelerator system, so that the system is not starved for input (see the DataLoader sketch after this list).
oov_token
Token for out-of-vocabulary words/sub-words.
pack_sequences
If set, concatenates sequences so that computation is performed on real data rather than padding (a simplified packing sketch follows this list).
pad_token
Token for padding.
persistent_workers
If set, workers are not shut down after iterating through the dataset once.
prefetch_factor
Number of batches loaded in advance by each worker.
shuffle
If True, the data is shuffled before being passed to the model.
shuffle_buffer
Size of the buffer used to store data before shuffling.
shuffle_seed
Sets the random seed used for data shuffling.
sos_token
Token for start-of-sequence.
src_data_dir
Path to the directory containing the output of preprocess.sh, i.e. all files of tokenized source data.
src_max_sequence_length
Largest possible sequence length for the input.
src_vocab_file
Path to the file containing the source vocabulary tokens, one token per line.
tgt_data_dir
Path to the directory containing the tokenized target data.
tgt_max_sequence_length
Largest possible sequence length for the labels.
tgt_vocab_file
Path to the file containing the target vocabulary tokens, one token per line.
vocab_size
data_processor
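Several of these fields (batch_size, shuffle, drop_last, num_workers, prefetch_factor, persistent_workers) mirror the standard torch.utils.data.DataLoader arguments of the same names. The sketch below illustrates only those stock PyTorch semantics, not the Model Zoo's internal loader construction; the dataset and values are hypothetical:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1000))  # stand-in dataset for illustration

loader = DataLoader(
    dataset,
    batch_size=256,           # sequences per batch
    shuffle=True,             # reshuffle the data each epoch
    drop_last=True,           # drop the final, smaller batch
    num_workers=8,            # worker processes feeding the device
    prefetch_factor=10,       # batches loaded in advance by each worker
    persistent_workers=True,  # keep workers alive across epochs
)
```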
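pack_sequences concatenates short sequences so that compute is spent on real tokens rather than padding. Below is a simplified, hypothetical sketch of the idea only; a real packed implementation also needs attention and loss masking across sequence boundaries, and the processor additionally controls packing via num_documents_to_concatenate:

```python
from typing import List

def pack_sequences(seqs: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily concatenate token sequences up to max_len (illustration only)."""
    packed: List[List[int]] = []
    current: List[int] = []
    for seq in seqs:
        if current and len(current) + len(seq) > max_len:
            packed.append(current)
            current = []
        current.extend(seq)
    if current:
        packed.append(current)
    return packed

# Three short sequences become one packed row instead of three padded rows.
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=10))
# -> [[1, 2, 3, 4, 5, 6, 7, 8, 9]]
```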
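dynamic_loss_weight scales each token's loss by the reciprocal of its sequence length, so long sequences do not dominate the loss simply by contributing more tokens. A hypothetical tensor-level illustration of that arithmetic:

```python
import torch

per_token_loss = torch.tensor([[0.9, 1.2, 0.3],
                               [0.5, 0.7, 0.0]])  # (batch, seq); trailing 0.0 is padding
seq_lengths = torch.tensor([3.0, 2.0])            # unpadded length of each sequence

# Divide each token's loss by the length of the sequence it comes from.
weighted = per_token_loss / seq_lengths.unsqueeze(1)
print(weighted.sum(dim=1))  # tensor([0.8000, 0.6000])
```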