cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator
- class cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator(params, tokenizer, eos_id, pad_id)
Bases: object

Methods
clean_text: Clean the provided text.
Tokenize and encode the doc for text summarization.
Get data statistics from the sample.
get_tokenized_semantic_regions
parse_semantic_data_array
tokenize_data

- clean_text(data)
Clean the provided text.
- Parameters
data (str) – Text to clean.
- Returns
Cleaned text.
- Return type
str
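
The entry above documents only the contract of clean_text: it takes a str and returns a cleaned str. It does not say which cleaning steps are applied. As a purely illustrative sketch of that contract, and not the library's implementation, the hypothetical standalone function below (clean_text_sketch, a name invented for this example) normalizes Unicode and collapses whitespace. The actual behavior of FinetuningTokenGenerator.clean_text may differ and may depend on params.

```python
import re
import unicodedata


def clean_text_sketch(data: str) -> str:
    """Hypothetical stand-in with the same str -> str contract as clean_text.

    The cleaning steps here are assumptions chosen for illustration only;
    they are not taken from the Cerebras Model Zoo source.
    """
    # Normalize to Unicode NFC so equivalent character sequences compare equal.
    text = unicodedata.normalize("NFC", data)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace.
    return text.strip()
```

For instance, clean_text_sketch("  Hello \t world  ") returns "Hello world".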