cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator

class cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator(params, tokenizer, eos_id, pad_id)

Bases: object

Methods

create_features_finetuning

create_features_multimodal

encode

Tokenize and encode the document for text summarization.

get_tokenized_semantic_regions

pad_to_msl

parse_semantic_data_array

tokenize_data

encode(semantic_data_array)

Tokenize and encode the document for text summarization.

Parameters

semantic_data_array (Dict) – Contains a semantic data dict returned from a format hook

Returns

Tuple of encoded features for text summarization and dataset stats

Return type

Tuple[List[np.ndarray], Dict]