cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessorBCEH#

class cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessorBCEH(params, samples_file_list, dataset_size, max_input_len, inf_start_token=None, stop_sequence_shape=None)[source]#

Bases: cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessor

Subclass for processing BigCode Evaluation Harness data, i.e. bigcode_eh requests.

Methods

create_dataloader

Classmethod to create the dataloader object.

from_request_type

gen_data_samples

Preprocess raw text requests, as fetched from the EEH script, into data samples consumable by the GPT2 model, and dump these to a numpy file.

classmethod create_dataloader()#

Classmethod to create the dataloader object.
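
For orientation, a minimal construction sketch follows. This is a sketch under assumptions: the keys inside params and all concrete values are illustrative placeholders, not documented defaults; the samples file list and dataset size would typically come from gen_data_samples (documented below).

from cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor import (
    InferenceDataProcessorBCEH,
)

params = {"batch_size": 32, "num_workers": 0}  # illustrative keys, assumed

samples_files = ["samples_0.npy"]  # e.g. files dumped by gen_data_samples
dataset_size = 128                 # e.g. total sample count from gen_data_samples

processor = InferenceDataProcessorBCEH(
    params=params,
    samples_file_list=samples_files,
    dataset_size=dataset_size,
    max_input_len=512,  # maximum tokenized input length (illustrative)
)
dataloader = processor.create_dataloader()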

static gen_data_samples(requests, batch_size, max_sequence_length, tokenizer, samples_saver, request_type, max_input_len=0, inf_start_token=None, max_gen_tokens=None, stop_words_cache=None, stop_sequence_shape=None)#

Preprocess raw text requests, as fetched from the EEH script, into data samples consumable by the GPT2 model, and dump these to a numpy file.

Parameters
  • requests (List) – List of EEH’s Instance dataclass objects holding raw text data

  • batch_size (int) – The batch size

  • max_sequence_length (int) – The maximum length of each sample

  • tokenizer (transformers.PreTrainedTokenizerBase) – The tokenizer used to tokenize raw text data

  • samples_saver (cerebras.modelzoo.common.utils.input.utils.SamplesSaver) – SamplesSaver object to manage the saving of data samples to file.

  • request_type (cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.RequestType) – The type of request for which the data sample is to be created

  • max_input_len (int) – The maximum length of the tokenized input

  • inf_start_token (Optional[int]) – (generative tasks only) The start token for generative inference

  • max_gen_tokens (Optional[int]) – (generative tasks only) The maximum number of tokens to generate

  • stop_words_cache (Optional[Dict[str, List[List[int]]]]) – (generative tasks only) Dict used to cache the tokenized stop sequences

  • stop_sequence_shape (Optional[Tuple[int, int]]) – (generative tasks only) Tuple caching (num_stop_sequences, max_stop_seq_len) for the stop sequences

Returns

A tuple of:

  • List[str] of file paths where the samples are dumped;

  • int representing the size of the dataset (total number of samples);

  • request and dataset metadata needed for EEH postprocessing.

Return type

Tuple[List[str], int, List[Dict[str, Any]]]
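
A hedged sketch of the full preprocessing call is shown below. The RequestType member name and the SamplesSaver constructor arguments are assumptions (this page does not document them), and requests would be populated from the EEH script in a real run.

from transformers import AutoTokenizer

from cerebras.modelzoo.common.utils.input.utils import SamplesSaver
from cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor import (
    InferenceDataProcessorBCEH,
    RequestType,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any PreTrainedTokenizerBase

# SamplesSaver manages dumping samples to numpy files; its constructor
# arguments are not documented on this page, so they are elided here.
samples_saver = SamplesSaver(...)

requests = []  # EEH Instance objects, fetched from the EEH script upstream

samples_files, dataset_size, metadata = InferenceDataProcessorBCEH.gen_data_samples(
    requests=requests,
    batch_size=32,                        # illustrative
    max_sequence_length=2048,             # illustrative
    tokenizer=tokenizer,
    samples_saver=samples_saver,
    request_type=RequestType.bigcode_eh,  # assumed member name for bigcode_eh requests
    max_gen_tokens=256,                   # generative tasks only
)

The returned samples_files and dataset_size can then feed the InferenceDataProcessorBCEH constructor, as in the sketch under create_dataloader above.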