cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessorLL

class cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessorLL
Bases: cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessor

Subclass for processing EEH loglikelihood requests.

Methods
- create_dataloader – Classmethod to create the dataloader object.
- from_request_type
- gen_data_samples – Preprocess raw text requests, as fetched from the EEH script, into data samples consumable by the GPT2 model and dump these to a numpy file.

__call__(*args: Any, **kwargs: Any) → Any
Call self as a function.
static __new__(cls, *args: Any, **kwargs: Any) → Any
create_dataloader()
Classmethod to create the dataloader object.
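A minimal usage sketch. The `processor` instance below is assumed to have been constructed elsewhere (e.g. via the from_request_type classmethod listed above, whose arguments are not documented on this page); only the no-argument create_dataloader() call comes from this reference.

```python
# Sketch only: `processor` is assumed to be an already-constructed
# InferenceDataProcessorLL instance; its construction is not shown here
# because the constructor/from_request_type arguments are not documented
# on this page.
dataloader = processor.create_dataloader()

for batch in dataloader:
    ...  # feed each batch to the GPT2 model for loglikelihood scoring
```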
static gen_data_samples(requests: List, batch_size: int, max_sequence_length: int, tokenizer: Union[tokenizers.Tokenizer, transformers.PreTrainedTokenizerBase], eos_token_id: int, samples_saver: cerebras.modelzoo.common.utils.input.utils.SamplesSaver, request_type: cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.RequestType, inf_start_token: Optional[int] = None, max_gen_tokens: Optional[int] = None) → Tuple[List[str], int, Tuple[int, int]]
Preprocess raw text requests, as fetched from the EEH script, into data samples consumable by the GPT2 model and dump these to a numpy file.

Parameters
- requests – List of EEH’s Instance dataclass objects holding raw text data 
- batch_size – The batch size 
- max_sequence_length – The maximum length of each sample 
- tokenizer – The tokenizer used to tokenize raw text data 
- eos_token_id – int representing the end-of-sentence token 
- samples_saver – SamplesSaver object to manage the saving of data samples to file. 
- request_type – The type of request for which the data sample is to be created 
- inf_start_token – (generative tasks only) int representing the start token for generative inference 
- max_gen_tokens – (generative tasks only) The max number of tokens to generate 
 
Returns
(List[str], int, tuple) tuple of:
- list of file paths where the samples are dumped;
- int representing the size of the dataset (total no. of samples);
- tuple of request metadata needed for EEH postprocessing.
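A hedged end-to-end sketch of calling gen_data_samples for loglikelihood requests. The module paths and the gen_data_samples signature come from this page; the SamplesSaver constructor arguments, the RequestType member name, and the tokenizer/output-directory choices are assumptions for illustration only.

```python
# Sketch under stated assumptions; not a verified recipe.
from transformers import AutoTokenizer

from cerebras.modelzoo.common.utils.input.utils import SamplesSaver
from cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor import (
    InferenceDataProcessorLL,
    RequestType,
)

# Any transformers.PreTrainedTokenizerBase is accepted per the signature above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# SamplesSaver manages dumping the preprocessed samples to numpy files;
# the keyword arguments shown here are hypothetical, not from this page.
samples_saver = SamplesSaver(
    data_dir="/tmp/eeh_samples",        # hypothetical output directory
    max_file_size=1024 ** 3,            # hypothetical per-file size cap
    filename_prefix="eeh_llh_samples",  # hypothetical filename prefix
)

# `requests` would be the list of EEH Instance dataclass objects handed
# over by the evaluation harness; shown here only as a placeholder.
requests = [...]

sample_files, dataset_size, request_metadata = InferenceDataProcessorLL.gen_data_samples(
    requests=requests,
    batch_size=8,
    max_sequence_length=2048,
    tokenizer=tokenizer,
    eos_token_id=tokenizer.eos_token_id,
    samples_saver=samples_saver,
    request_type=RequestType.eeh_loglikelihood,  # assumed enum member name
)
```

For loglikelihood requests, inf_start_token and max_gen_tokens are left at their None defaults; they apply only to generative tasks, as noted in the parameter list above.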