cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor#
- class cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor(*args, **kwargs)[source]#
Bases:
torch.utils.data.IterableDataset
Reads text files containing the input text tokens and adds extra ids for the language modeling task on the fly.
- Parameters
config (cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessorConfig) – Configuration for the data processor
Methods
- Classmethod to create the dataloader object.
- Takes a single sample and returns the sequence length of that sample, to be used for VTS bucketing.
- get_meta_data: Read data from meta files.
- get_single_item: Iterate over the data to construct input features.
- load_buffer: Generator to read samples of data.
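The flow of these methods can be sketched with a minimal stand-in. The real class derives from torch.utils.data.IterableDataset and is driven by its config; the toy class below is a self-contained assumption-laden illustration (fabricated data, simplified signatures) of how get_meta_data, load_buffer, and get_single_item compose in iteration:

```python
class TinyDynamicDataProcessor:
    """Toy stand-in mimicking the method flow of T5DynamicDataProcessor:
    get_meta_data -> load_buffer -> get_single_item -> __iter__.
    All details here are illustrative assumptions, not the real API."""

    def __init__(self, files):
        # files: {filename: list of token-id sequences} (fabricated data)
        self.files = files

    def get_meta_data(self):
        # The real method reads meta files from a data directory;
        # here we just count examples per "file".
        return {name: len(samples) for name, samples in self.files.items()}

    def load_buffer(self):
        # Generator to read samples of data, one at a time.
        for samples in self.files.values():
            yield from samples

    def get_single_item(self, sample):
        # Construct input features from one raw sample (the real code
        # builds encoder/decoder ids, masks, and labels).
        return {"input_ids": sample, "attention_mask": [1] * len(sample)}

    def __iter__(self):
        for sample in self.load_buffer():
            yield self.get_single_item(sample)


files = {"part0.txt": [[1, 2], [3, 4, 5]], "part1.txt": [[6]]}
proc = TinyDynamicDataProcessor(files)
features = list(proc)
```

Iterating the processor yields one feature dict per sample, which a DataLoader can then batch.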
- get_meta_data(data_dir)[source]#
Read data from meta files.
- Parameters
data_dir (str) – Path to the input directory.
- Returns
Processed meta data.
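A sketch of what reading meta files might look like. The `<filename> <example_count>` line format and the `.meta` extension are assumptions for illustration, not the documented on-disk format:

```python
import os
import tempfile


def get_meta_data(data_dir):
    """Sketch: scan data_dir for *.meta files and return a dict mapping
    each data file to its example count. The line format used here is
    an illustrative assumption."""
    meta = {}
    for entry in sorted(os.listdir(data_dir)):
        if not entry.endswith(".meta"):
            continue
        with open(os.path.join(data_dir, entry)) as f:
            for line in f:
                name, count = line.split()
                meta[name] = int(count)
    return meta


# Build a throwaway directory with one fabricated meta file.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "train.meta"), "w") as f:
    f.write("shard0.txt 100\nshard1.txt 250\n")
meta = get_meta_data(tmp)
```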
- load_buffer()[source]#
Generator to read samples of data.
- Returns
Yields data samples, one at a time.
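One common way to yield samples one at a time while still randomizing order is a bounded shuffle buffer. Whether the real load_buffer shuffles this way is an assumption; the sketch below only illustrates the memory-bounded, one-sample-at-a-time generator pattern:

```python
import random


def load_buffer(sample_stream, buffer_size=8, seed=123):
    """Sketch of a shuffle-buffer generator (an assumption about the real
    load_buffer): hold at most buffer_size samples in memory and yield
    one randomly chosen sample at a time."""
    rng = random.Random(seed)
    buffer = []
    for sample in sample_stream:
        buffer.append(sample)
        if len(buffer) == buffer_size:
            # Yield a random element, freeing a slot for the next sample.
            yield buffer.pop(rng.randrange(buffer_size))
    # Drain whatever remains once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


out = list(load_buffer(iter(range(20)), buffer_size=4))
```

Because the buffer is bounded, this works even when the underlying text files do not fit in memory.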
- get_single_item()[source]#
Iterates over the data to construct input features.
- Returns
A dict with training features:
- np.array[int32] input_ids: Numpy array with encoder input token indices. Shape: (src_max_sequence_length).
- np.array[int32] decoder_input_ids: Numpy array with decoder input token indices. Shape: (tgt_max_sequence_length).
- np.array[int32] attention_mask: Numpy array with attention mask for encoder. Shape: (src_max_sequence_length).
- np.array[int32] decoder_attention_mask: Numpy array with attention mask for decoder. Shape: (tgt_max_sequence_length).
- np.array[int32] labels: Numpy array with labels for teacher forcing mode. Shape: (tgt_max_sequence_length).
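The feature dict above can be sketched with NumPy. The key names, dtypes, and shapes follow the documentation; the right-padding scheme, pad id, and the shift-left construction of labels are assumptions for illustration:

```python
import numpy as np


def build_features(src_ids, tgt_ids, src_max_len=8, tgt_max_len=6, pad_id=0):
    """Sketch of the dict returned by get_single_item. Padding and the
    labels construction are illustrative assumptions."""

    def pad(ids, max_len):
        # Right-pad token ids with pad_id up to max_len (assumed scheme).
        arr = np.full(max_len, pad_id, dtype=np.int32)
        arr[: len(ids)] = ids
        return arr

    def mask(ids, max_len):
        # 1 over real tokens, 0 over padding.
        m = np.zeros(max_len, dtype=np.int32)
        m[: len(ids)] = 1
        return m

    return {
        "input_ids": pad(src_ids, src_max_len),
        "attention_mask": mask(src_ids, src_max_len),
        "decoder_input_ids": pad(tgt_ids, tgt_max_len),
        "decoder_attention_mask": mask(tgt_ids, tgt_max_len),
        # Labels for teacher forcing: decoder inputs shifted left by one
        # (another illustrative assumption).
        "labels": pad(tgt_ids[1:], tgt_max_len),
    }


feats = build_features([5, 6, 7], [2, 5, 6, 7])
```

Each array is a fixed-length int32 vector, so samples of different raw lengths batch cleanly.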