cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessor

class cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessor(*args, **kwargs)

Bases: cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor

Reads text files containing the input text tokens.

Parameters

config (cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig) – The configuration object for the processor.

Methods

create_dataloader

Classmethod to create the dataloader object.

element_length_fn

Takes a single sample and returns the sequence length of that sample to be used for VTS bucketing.

get_meta_data

Read data from meta files.

get_single_item

Iterates over the data to construct input features.

load_buffer

Generator that reads the data in chunks the size of data_buffer.

get_meta_data(data_dir)

Read data from meta files.

Parameters

data_dir (str) – Path to the input directory.

Returns

Processed meta data.

load_buffer()

Generator that reads the data in chunks the size of data_buffer. Data is read from both the source and target input datasets to prepare features for the side-by-side translation task.

Returns

Yields the data stored in the data_buffer.
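The buffered, side-by-side reading described above can be sketched as follows. This is only an illustration under stated assumptions, not the processor's actual implementation: the function name load_buffer_sketch, its parameters, and the use of in-memory lists in place of dataset files are all hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def load_buffer_sketch(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    buffer_size: int,
) -> Iterator[Tuple[str, str]]:
    """Yield aligned (source, target) pairs, filling one buffer at a time.

    Hypothetical helper: the real load_buffer takes no arguments and reads
    from the files configured on the processor.
    """
    # zip keeps the source and target streams aligned line by line.
    pairs = zip(src_lines, tgt_lines)
    while True:
        # Pull at most buffer_size pairs into memory.
        data_buffer: List[Tuple[str, str]] = list(islice(pairs, buffer_size))
        if not data_buffer:
            return
        # Hand out the buffered pairs one at a time.
        for pair in data_buffer:
            yield pair


src = ["ein Haus", "ein Hund", "eine Katze"]
tgt = ["a house", "a dog", "a cat"]
pairs = list(load_buffer_sketch(src, tgt, buffer_size=2))
```

Reading in bounded chunks keeps memory use independent of the dataset size while still keeping each source sentence paired with its translation.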

get_single_item()

Iterates over the data to construct input features.

Returns

A dict with training features:

  • np.array[int32] input_ids: Numpy array with encoder input token indices.

    Shape: (src_max_sequence_length).

  • np.array[int32] decoder_input_ids: Numpy array with decoder input token indices.

    Shape: (tgt_max_sequence_length).

  • np.array[int32] attention_mask: Numpy array with attention mask for encoder.

    Shape: (src_max_sequence_length).

  • np.array[int32] decoder_attention_mask: Numpy array with attention mask for decoder.

    Shape: (tgt_max_sequence_length).

  • np.array[int32] labels: Numpy array with labels for teacher forcing mode.

    Shape: (tgt_max_sequence_length).
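To make the shapes and dtypes of the returned features concrete, here is a minimal sketch that builds such a dict from already-tokenized ids. Everything beyond the five key names and their shapes is an assumption for illustration: the helper name build_features_sketch, the pad, start and end token ids, the truncation rule, and the convention that the masks hold 1 for real tokens and 0 for padding.

```python
import numpy as np


def build_features_sketch(
    src_ids,
    tgt_ids,
    src_max_sequence_length,
    tgt_max_sequence_length,
    pad_id=0,  # assumed padding id
    sos_id=1,  # assumed start-of-sequence id
    eos_id=2,  # assumed end-of-sequence id
):
    """Hypothetical helper that returns a feature dict with the documented keys."""
    # Truncate to the maximum lengths; the decoder input is the target shifted
    # right by a start token, and the labels are the target followed by an end token.
    src = list(src_ids)[:src_max_sequence_length]
    decoder_in = ([sos_id] + list(tgt_ids))[:tgt_max_sequence_length]
    labels = (list(tgt_ids) + [eos_id])[:tgt_max_sequence_length]

    def pad(seq, length):
        # Right-pad with pad_id up to the fixed length.
        return np.array(seq + [pad_id] * (length - len(seq)), dtype=np.int32)

    def mask(seq, length):
        # 1 marks a real token, 0 marks padding (assumed convention).
        return np.array([1] * len(seq) + [0] * (length - len(seq)), dtype=np.int32)

    return {
        "input_ids": pad(src, src_max_sequence_length),
        "decoder_input_ids": pad(decoder_in, tgt_max_sequence_length),
        "attention_mask": mask(src, src_max_sequence_length),
        "decoder_attention_mask": mask(decoder_in, tgt_max_sequence_length),
        "labels": pad(labels, tgt_max_sequence_length),
    }


features = build_features_sketch(
    src_ids=[5, 6, 7],
    tgt_ids=[8, 9],
    src_max_sequence_length=5,
    tgt_max_sequence_length=4,
)
```

Each array has a fixed length set by the corresponding maximum sequence length, so samples can be stacked into batches without further padding.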

element_length_fn(features)

Takes a single sample and returns the sequence length of that sample to be used for variable tensor shape (VTS) bucketing.
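One plausible way to derive a per-sample length for bucketing is sketched below. It assumes, without confirmation from this page, that the masks hold 1 for real tokens and that the length of interest is the longer of the encoder and decoder sequences; the function name element_length_sketch is hypothetical.

```python
import numpy as np


def element_length_sketch(features):
    """Return the longer of the unpadded encoder and decoder lengths.

    Assumption: summing a 0/1 attention mask counts the real tokens.
    """
    encoder_len = int(np.sum(features["attention_mask"]))
    decoder_len = int(np.sum(features["decoder_attention_mask"]))
    return max(encoder_len, decoder_len)


sample = {
    "attention_mask": np.array([1, 1, 1, 0, 0], dtype=np.int32),
    "decoder_attention_mask": np.array([1, 1, 1, 1, 0, 0], dtype=np.int32),
}
length = element_length_sketch(sample)
```

Bucketing on a single scalar per sample lets samples of similar length be grouped together, which reduces the amount of padding in each batch.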

create_dataloader()

Classmethod to create the dataloader object.