cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig

- class cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig(*args, **kwargs)

Bases: cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessorConfig

Methods

- check_for_deprecated_fields
- check_literal_discriminator_field
- copy
- get_orig_class
- get_orig_class_args
- model_copy
- model_post_init
- post_init
- validate_max_sequence_length

Attributes
- batch_size: Number of sequences per batch.
- buckets: A list of boundaries for sequence lengths to bucket together in order to speed up VTS/VSL.
- discriminator
- discriminator_value
- do_lower: If True, will lowercase all tokens in the vocabulary.
- drop_last: If the last batch is not the full size, i.e. the dataset could not be divided evenly into batches of the requested size, do not use the last batch.
- dynamic_loss_weight: If set, will divide the loss for a token by the length of the sequence that the token comes from.
- eos_token: Token for end-of-sequence.
- extra_ids: Number of sentinel tokens for the T5 objective.
- fp16_type
- input_pad_id: Can set specific padding for inputs.
- labels_pad_id: Can set specific padding for labels.
- mixed_precision
- model_config
- num_documents_to_concatenate: Specifies how many documents to pack together.
- num_workers: Number of processes that move data to the accelerator system, so that the system doesn't process data faster than it receives it.
- oov_token: Token for out-of-vocabulary words/sub-words.
- pack_sequences: If set, will concatenate sequences so that computation is performed on real data rather than padding.
- pad_token: Token for padding.
- persistent_workers: If set, workers will not be shut down after going through the dataset once.
- prefetch_factor: Number of batches loaded in advance by each worker.
- shuffle: If true, the data will be shuffled before being passed into the model.
- shuffle_buffer: Size of the buffer used to store data before shuffling.
- shuffle_seed: Sets the random seed for the order of data shuffling.
- sos_token: Token for start-of-sequence.
- src_data_dir: Path to the directory containing the output of preprocess.sh, with all the files of tokenized data.
- src_max_sequence_length: Largest possible sequence length for the input.
- src_vocab_file: Path to the file containing the tokens of the vocabulary, one token per line.
- tgt_data_dir
- tgt_max_sequence_length: Largest possible sequence length for the labels.
- tgt_vocab_file
- vocab_size
- data_processor
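Since this config inherits pydantic-style helpers such as model_copy and model_post_init, it can presumably be constructed with the documented fields as keyword arguments. The sketch below is illustrative only, assuming keyword initialization: the paths and values shown are hypothetical placeholders, and any fields not listed keep their defaults.

```python
# A minimal sketch, assuming the config accepts its documented fields as
# keyword arguments (implied by the pydantic-style methods listed above).
# All paths and values here are illustrative placeholders.
from cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor import (
    TransformerDynamicDataProcessorConfig,
)

config = TransformerDynamicDataProcessorConfig(
    src_vocab_file="/path/to/vocab.txt",        # one token per line
    src_data_dir="/path/to/preprocessed/data",  # output of preprocess.sh
    src_max_sequence_length=256,                # largest input sequence length
    tgt_max_sequence_length=256,                # largest label sequence length
    batch_size=32,                              # sequences per batch
    shuffle=True,
    shuffle_seed=1,
    num_workers=4,                              # data-loading processes
    prefetch_factor=2,                          # batches prefetched per worker
    persistent_workers=True,                    # keep workers alive across epochs
)
```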