cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor#
- class cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor(*args, **kwargs)[source]#
Bases:
torch.utils.data.IterableDataset
Reads text files containing the input text tokens and adds extra ids for the language modeling task on the fly.
- Parameters
config (cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessorConfig) – Configuration for the data processor
Methods
- Classmethod to create the dataloader object.
- Takes a single sample and returns the sequence length of that sample, to be used for VTS bucketing.
- get_meta_data: Read data from meta files.
- get_single_item: Iterate over the data to construct input features.
- load_buffer: Generator to read samples of data.
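The flow of these methods can be sketched with a minimal stand-in. The real class derives from torch.utils.data.IterableDataset and is driven by its config; the toy class below is a self-contained assumption-laden illustration (fabricated data, simplified signatures) of how get_meta_data, load_buffer, and get_single_item compose in iteration:

```python
class TinyDynamicDataProcessor:
    """Toy stand-in mimicking the method flow of T5DynamicDataProcessor:
    get_meta_data -> load_buffer -> get_single_item -> __iter__.
    All details here are illustrative assumptions, not the real API."""

    def __init__(self, files):
        # files: {filename: list of token-id sequences} (fabricated data)
        self.files = files

    def get_meta_data(self):
        # The real method reads meta files from a data directory;
        # here we just count examples per "file".
        return {name: len(samples) for name, samples in self.files.items()}

    def load_buffer(self):
        # Generator to read samples of data, one at a time.
        for samples in self.files.values():
            yield from samples

    def get_single_item(self, sample):
        # Construct input features from one raw sample (the real code
        # builds encoder/decoder ids, masks, and labels).
        return {"input_ids": sample, "attention_mask": [1] * len(sample)}

    def __iter__(self):
        for sample in self.load_buffer():
            yield self.get_single_item(sample)


files = {"part0.txt": [[1, 2], [3, 4, 5]], "part1.txt": [[6]]}
proc = TinyDynamicDataProcessor(files)
features = list(proc)
```

Iterating the processor yields one feature dict per sample, which a DataLoader can then batch.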
- get_meta_data(data_dir)[source]#
Read data from meta files.
- Parameters
data_dir (str) – Path to the input directory.
- Returns
Processed meta data.
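A sketch of what reading meta files might look like. The `<filename> <example_count>` line format and the `.meta` extension are assumptions for illustration, not the documented on-disk format:

```python
import os
import tempfile


def get_meta_data(data_dir):
    """Sketch: scan data_dir for *.meta files and return a dict mapping
    each data file to its example count. The line format used here is
    an illustrative assumption."""
    meta = {}
    for entry in sorted(os.listdir(data_dir)):
        if not entry.endswith(".meta"):
            continue
        with open(os.path.join(data_dir, entry)) as f:
            for line in f:
                name, count = line.split()
                meta[name] = int(count)
    return meta


# Build a throwaway directory with one fabricated meta file.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "train.meta"), "w") as f:
    f.write("shard0.txt 100\nshard1.txt 250\n")
meta = get_meta_data(tmp)
```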
- load_buffer()[source]#
Generator to read samples of data.
- Returns
Yields data samples, one at a time.
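One common way to yield samples one at a time while still randomizing order is a bounded shuffle buffer. Whether the real load_buffer shuffles this way is an assumption; the sketch below only illustrates the memory-bounded, one-sample-at-a-time generator pattern:

```python
import random


def load_buffer(sample_stream, buffer_size=8, seed=123):
    """Sketch of a shuffle-buffer generator (an assumption about the real
    load_buffer): hold at most buffer_size samples in memory and yield
    one randomly chosen sample at a time."""
    rng = random.Random(seed)
    buffer = []
    for sample in sample_stream:
        buffer.append(sample)
        if len(buffer) == buffer_size:
            # Yield a random element, freeing a slot for the next sample.
            yield buffer.pop(rng.randrange(buffer_size))
    # Drain whatever remains once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


out = list(load_buffer(iter(range(20)), buffer_size=4))
```

Because the buffer is bounded, this works even when the underlying text files do not fit in memory.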
- get_single_item()[source]#
Iterates over the data to construct input features.
- Returns
A dict with training features:
- np.array[int32] input_ids: Numpy array with encoder input token indices. Shape: (src_max_sequence_length).
- np.array[int32] decoder_input_ids: Numpy array with decoder input token indices. Shape: (tgt_max_sequence_length).
- np.array[int32] attention_mask: Numpy array with attention mask for encoder. Shape: (src_max_sequence_length).
- np.array[int32] decoder_attention_mask: Numpy array with attention mask for decoder. Shape: (tgt_max_sequence_length).
- np.array[int32] labels: Numpy array with labels for teacher forcing mode. Shape: (tgt_max_sequence_length).
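The feature dict above can be sketched with NumPy. The key names, dtypes, and shapes follow the documentation; the right-padding scheme, pad id, and the shift-left construction of labels are assumptions for illustration:

```python
import numpy as np


def build_features(src_ids, tgt_ids, src_max_len=8, tgt_max_len=6, pad_id=0):
    """Sketch of the dict returned by get_single_item. Padding and the
    labels construction are illustrative assumptions."""

    def pad(ids, max_len):
        # Right-pad token ids with pad_id up to max_len (assumed scheme).
        arr = np.full(max_len, pad_id, dtype=np.int32)
        arr[: len(ids)] = ids
        return arr

    def mask(ids, max_len):
        # 1 over real tokens, 0 over padding.
        m = np.zeros(max_len, dtype=np.int32)
        m[: len(ids)] = 1
        return m

    return {
        "input_ids": pad(src_ids, src_max_len),
        "attention_mask": mask(src_ids, src_max_len),
        "decoder_input_ids": pad(tgt_ids, tgt_max_len),
        "decoder_attention_mask": mask(tgt_ids, tgt_max_len),
        # Labels for teacher forcing: decoder inputs shifted left by one
        # (another illustrative assumption).
        "labels": pad(tgt_ids[1:], tgt_max_len),
    }


feats = build_features([5, 6, 7], [2, 5, 6, 7])
```

Each array is a fixed-length int32 vector, so samples of different raw lengths batch cleanly.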