cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator
- class cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator.FinetuningTokenGenerator(params, tokenizer, eos_id, pad_id)[source]
Bases: object
Methods
- clean_text: Clean the provided text.
- encode: Tokenize and encode the doc for text summarization.
- get_data_ranges: Get data ranges for the conversation data.
- get_data_stats: Get data statistics from the sample.
- get_tokenized_semantic_regions
- parse_semantic_data_array
- tokenize_data
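Example (illustrative only): a minimal construction sketch. It assumes a Hugging Face tokenizer and a nested params dict with dataset and processing sections; those keys are assumptions about the preprocessing config, not requirements documented on this page.

```python
# Hypothetical usage sketch; the structure of `params` (dataset/processing keys)
# is an assumption, not documented on this page.
from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.data_preprocessing.finetuning_token_generator import (
    FinetuningTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

params = {
    "dataset": {},  # dataset-level options (assumed shape)
    "processing": {"max_seq_length": 2048},  # processing options (assumed shape)
}

generator = FinetuningTokenGenerator(
    params=params,
    tokenizer=tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token; EOS is reused here
)
```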
- clean_text(data)[source]
Clean the provided text.
- Parameters
data (str) – Text to clean.
- Returns
Cleaned text.
- Return type
str
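Example (illustrative only): calling clean_text on the generator constructed above. The exact normalization rules are not specified here, so the result is only indicative.

```python
raw = "  Hello   world!\n\nThis  is  an  example.  "
cleaned = generator.clean_text(raw)
# `cleaned` is a normalized string; the precise whitespace/character handling
# is defined by the implementation, not by this reference page.
print(cleaned)
```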
- get_data_ranges(semantic_regions, formatted_data)[source]
Get data ranges for the conversation data.
- Parameters
semantic_regions (List[Dict[str, str]]) – List of semantic regions of the conversation data.
formatted_data (str) – Formatted conversation data.
- Returns
Ranges for system, user, and assistant data.
- Return type
Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]
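Example (illustrative only): a toy sketch, not the library implementation, showing how per-role character ranges of this shape could be derived from semantic regions and the formatted string. The region keys "role" and "content" are assumptions.

```python
from typing import Dict, List, Tuple

def toy_data_ranges(
    semantic_regions: List[Dict[str, str]], formatted_data: str
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Toy re-creation of the idea: locate each region's text in the formatted
    string and bucket the (start, end) offsets by role. Not the library code."""
    system_ranges: List[Tuple[int, int]] = []
    user_ranges: List[Tuple[int, int]] = []
    assistant_ranges: List[Tuple[int, int]] = []
    cursor = 0
    for region in semantic_regions:
        start = formatted_data.find(region["content"], cursor)
        if start == -1:
            continue  # region text not found; skip it in this toy version
        end = start + len(region["content"])
        cursor = end
        if region["role"] == "system":
            system_ranges.append((start, end))
        elif region["role"] == "user":
            user_ranges.append((start, end))
        else:
            assistant_ranges.append((start, end))
    return system_ranges, user_ranges, assistant_ranges

# Usage with hypothetical tags in the formatted string:
ranges = toy_data_ranges(
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    "<user>Hi</user><assistant>Hello!</assistant>",
)
# ranges == ([], [(6, 8)], [(26, 32)])
```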