cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor#

class cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor(*args, **kwargs)[source]#

Bases: torch.utils.data.IterableDataset

Reads text files containing the input text tokens and adds extra ("sentinel") IDs for the language modeling task on the fly.
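T5-style denoising replaces each corrupted span of the input with a single sentinel ("extra ID") token, and the targets enumerate the removed spans. The sketch below is a hypothetical helper illustrating that idea, not this processor's actual implementation; the function name, span representation, and `extra_id_base` vocabulary offset are all assumptions.

```python
def add_extra_ids(tokens, span_starts, span_lengths, extra_id_base=32000):
    """Replace each masked span with one sentinel ("extra ID") token,
    mimicking T5-style span corruption.

    Hypothetical helper for illustration; spans are given as parallel
    lists of start indices and lengths, assumed non-overlapping.
    """
    out = []
    i = 0
    sentinel = extra_id_base
    for start, length in sorted(zip(span_starts, span_lengths)):
        out.extend(tokens[i:start])   # copy tokens up to the span
        out.append(sentinel)          # one sentinel stands in for the span
        sentinel += 1                 # next span gets the next extra ID
        i = start + length            # skip over the masked span
    out.extend(tokens[i:])            # copy the tail
    return out
```

Each successive span receives the next sentinel ID, so the decoder can reconstruct which span goes where.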

Parameters

config (cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessorConfig) – Configuration for the data processor

Methods

create_dataloader

Creates the dataloader object.

element_length_fn

Takes a single sample and returns its sequence length, used for variable tensor shape (VTS) bucketing.

get_meta_data

Read data from meta files.

get_single_item

Iterates over the data to construct input features.

load_buffer

Generator to read samples of data.

get_meta_data(data_dir)[source]#

Read data from meta files.

Parameters

data_dir (str) – Path to the input directory.

Returns

Processed meta data.

load_buffer()[source]#

Generator to read samples of data.

Returns

Yields data samples, one at a time.
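A buffer-based generator of this kind typically accumulates samples into a fixed-size buffer, shuffles it, and yields the samples one at a time, giving approximate global shuffling with bounded memory. The following is an illustrative sketch under that assumption, not the processor's real buffering logic.

```python
import random

def load_buffer(sample_stream, buffer_size=1000, rng=None):
    """Generator that yields samples one at a time through a shuffle
    buffer. Illustrative sketch only; names and behavior are assumed.
    """
    rng = rng or random.Random(0)
    buffer = []
    for sample in sample_stream:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)   # shuffle within the filled buffer
            yield from buffer     # emit samples one at a time
            buffer = []
    rng.shuffle(buffer)           # flush the final partial buffer
    yield from buffer
```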

get_single_item()[source]#

Iterates over the data to construct input features.

Returns

A dict with training features:

  • np.array[int32] input_ids: Numpy array with encoder input token indices.

    Shape: (src_max_sequence_length).

  • np.array[int32] decoder_input_ids: Numpy array with decoder input token indices.

    Shape: (tgt_max_sequence_length).

  • np.array[int32] attention_mask: Numpy array with the attention mask for the encoder.

    Shape: (src_max_sequence_length).

  • np.array[int32] decoder_attention_mask: Numpy array with the attention mask for the decoder.

    Shape: (tgt_max_sequence_length).

  • np.array[int32] labels: Numpy array with labels for teacher-forcing mode.

    Shape: (tgt_max_sequence_length).
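A feature dict of this shape is typically built by padding the token IDs to the fixed maximum lengths and deriving 0/1 attention masks from the unpadded lengths. The sketch below assembles such a dict under those assumptions; it is not the processor's actual code, and in particular it sets `labels` equal to the padded decoder IDs rather than applying any shifting the real implementation may do.

```python
import numpy as np

def build_features(enc_ids, dec_ids, src_max_len, tgt_max_len, pad_id=0):
    """Assemble a training-feature dict of padded int32 token IDs plus
    0/1 attention masks. Hypothetical sketch, not the processor's code.
    """
    def pad(ids, max_len):
        arr = np.full(max_len, pad_id, dtype=np.int32)
        arr[: len(ids)] = ids          # real tokens, then padding
        return arr

    def mask(ids, max_len):
        m = np.zeros(max_len, dtype=np.int32)
        m[: len(ids)] = 1              # 1 over real tokens, 0 over padding
        return m

    return {
        "input_ids": pad(enc_ids, src_max_len),
        "attention_mask": mask(enc_ids, src_max_len),
        "decoder_input_ids": pad(dec_ids, tgt_max_len),
        "decoder_attention_mask": mask(dec_ids, tgt_max_len),
        # labels for teacher forcing; shown unshifted for simplicity
        "labels": pad(dec_ids, tgt_max_len),
    }
```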

element_length_fn(features)[source]#

Takes a single sample and returns its sequence length, used for variable tensor shape (VTS) bucketing.
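One natural way to compute such a length from the features above is to count the non-padding positions in the encoder attention mask. This is a sketch under that assumption; the real function may measure length differently.

```python
import numpy as np

def element_length_fn(features):
    """Return the encoder sequence length of one sample for bucketing,
    taken here as the number of non-padding positions in the attention
    mask. Sketch under assumed feature names; the real function may differ.
    """
    return int(np.sum(features["attention_mask"]))
```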

create_dataloader()[source]#

Creates the dataloader object.
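Since the processor is an `IterableDataset`, the dataloader it creates cannot use a sampler or shuffling; the dataset itself controls ordering, and `torch.utils.data.DataLoader` only batches what it yields. The sketch below shows that pattern with a toy stand-in dataset; the class, function, and parameters are illustrative, not the processor's actual API.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class TokenStream(IterableDataset):
    """Toy IterableDataset standing in for the processor (illustration only)."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield {"input_ids": torch.tensor([i, i + 1])}

def create_dataloader(dataset, batch_size=4, num_workers=0):
    # An IterableDataset supplies its own ordering, so no sampler or
    # shuffle argument is passed; the default collate stacks dict fields.
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
```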