cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessor

class cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessor(*args, **kwargs)

Reads text files containing the input text tokens.

Parameters: config (cerebras.modelzoo.data.nlp.transformer.TransformerDynamicDataProcessor.TransformerDynamicDataProcessorConfig) – The configuration object for the processor.

Methods

`create_dataloader`	Classmethod to create the dataloader object.
`element_length_fn`	Takes a single sample and returns the sequence length of that sample to be used for VTS bucketing.
`get_meta_data`	Read data from meta files.
`get_single_item`	Iterates over the data to construct input features.
`load_buffer`	Generator that reads the data in chunks of size data_buffer.

get_meta_data(data_dir)

Read data from meta files.

Parameters: data_dir (str) – Path to the input directory.

Returns: Processed metadata.

load_buffer()

Generator that reads the data in chunks of size data_buffer. Data is read from both the source and target input datasets to prepare features for the side-by-side translation task.

Returns: Yields the data stored in the data_buffer.
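The chunked, side-by-side reading described for load_buffer can be illustrated with a minimal sketch. It is not the Model Zoo implementation: the function name `load_buffer_sketch`, the `buffer_size` parameter, and the use of in-memory iterables of whitespace-separated token strings in place of files are all assumptions made for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def load_buffer_sketch(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    buffer_size: int,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield aligned (source, target) token lists, buffer_size pairs at a time."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be positive")
    # Pair each source line with its target line so the two stay aligned.
    pairs = zip(src_lines, tgt_lines)
    while True:
        # Fill the buffer with up to buffer_size aligned pairs.
        data_buffer = list(islice(pairs, buffer_size))
        if not data_buffer:
            break  # no aligned pairs left
        # Drain the buffer one sample at a time.
        for src, tgt in data_buffer:
            yield src.split(), tgt.split()
```

Because this is a generator, only one buffer of samples is held in memory at a time, which is the usual motivation for reading in chunks rather than loading both datasets whole.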

get_single_item()

Iterates over the data to construct input features.

Returns

A dict with training features:

* np.array[int32] input_ids: Numpy array with encoder input token indices. Shape: (src_max_sequence_length).
* np.array[int32] decoder_input_ids: Numpy array with decoder input token indices. Shape: (tgt_max_sequence_length).
* np.array[int32] attention_mask: Numpy array with attention mask for encoder. Shape: (src_max_sequence_length).
* np.array[int32] decoder_attention_mask: Numpy array with attention mask for decoder. Shape: (tgt_max_sequence_length).
* np.array[int32] labels: Numpy array with labels for teacher forcing mode. Shape: (tgt_max_sequence_length).
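To make the shapes of these features concrete, the sketch below builds a dict with the same five keys from already-tokenized source and target ID lists. It is an illustration only, not the Model Zoo implementation: the function `build_features_sketch`, the `pad_id`, `sos_id`, and `eos_id` values, the shift-by-one teacher-forcing layout, the mask convention (1 for real tokens, 0 for padding), and the use of plain Python lists where the processor returns NumPy int32 arrays are all assumptions.

```python
from typing import Dict, List


def _pad(ids: List[int], max_len: int, pad_id: int) -> List[int]:
    """Truncate to max_len, then right-pad with pad_id."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def build_features_sketch(
    src_ids: List[int],
    tgt_ids: List[int],
    src_max_sequence_length: int,
    tgt_max_sequence_length: int,
    pad_id: int = 0,  # hypothetical special-token IDs
    sos_id: int = 1,
    eos_id: int = 2,
) -> Dict[str, List[int]]:
    # Assumed teacher-forcing layout: the decoder reads <sos> + target
    # and is trained to predict target + <eos> (shifted by one position).
    decoder_in = [sos_id] + tgt_ids
    labels = tgt_ids + [eos_id]

    # Number of real (non-padding) tokens after truncation.
    src_len = min(len(src_ids), src_max_sequence_length)
    tgt_len = min(len(decoder_in), tgt_max_sequence_length)

    return {
        "input_ids": _pad(src_ids, src_max_sequence_length, pad_id),
        "decoder_input_ids": _pad(decoder_in, tgt_max_sequence_length, pad_id),
        # Assumed mask convention: 1 for real tokens, 0 for padding.
        "attention_mask": _pad([1] * src_len, src_max_sequence_length, 0),
        "decoder_attention_mask": _pad([1] * tgt_len, tgt_max_sequence_length, 0),
        "labels": _pad(labels, tgt_max_sequence_length, pad_id),
    }
```

Every encoder-side value has length src_max_sequence_length and every decoder-side value has length tgt_max_sequence_length, matching the shapes listed for the returned dict.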

element_length_fn(features)

Takes a single sample and returns the sequence length of that sample to be used for VTS bucketing.
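One plausible way to compute such a length is to count the non-padding positions in the sample's attention masks and then map that length to a bucket. The sketch below is a guess at the idea, not the Model Zoo implementation: the names `element_length_sketch` and `bucket_index`, the choice of taking the longer of the encoder and decoder lengths, the mask convention (1 for real tokens, 0 for padding), and the boundary-list bucketing scheme are all assumptions.

```python
from typing import Dict, Sequence


def element_length_sketch(features: Dict[str, Sequence[int]]) -> int:
    """Return the longer of the encoder and decoder unpadded lengths."""
    # Assumes masks use 1 for real tokens and 0 for padding.
    src_len = sum(1 for m in features["attention_mask"] if m != 0)
    tgt_len = sum(1 for m in features["decoder_attention_mask"] if m != 0)
    return max(src_len, tgt_len)


def bucket_index(length: int, boundaries: Sequence[int]) -> int:
    """Return the index of the first bucket whose upper boundary exceeds length."""
    for i, bound in enumerate(boundaries):
        if length < bound:
            return i
    # Longer than every boundary: falls in the final, open-ended bucket.
    return len(boundaries)
```

Grouping samples of similar length into the same bucket keeps the amount of padding within a batch small, which is the general purpose of length-based bucketing.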
