cerebras.modelzoo.data_preparation.data_preprocessing.utils.truncate_sequence
- cerebras.modelzoo.data_preparation.data_preprocessing.utils.truncate_sequence(token_ids, tokenized_semantic_region_list, max_sequence_length, max_turn_length, prompt_truncation_mode)
Truncates a token sequence to fit within a specified maximum sequence length (MSL), with the length of each individual segment limited by max_turn_length.
- Parameters
token_ids (list) – List of token IDs representing the entire sequence.
tokenized_semantic_region_list (list) – List of tokenized semantic regions.
max_sequence_length (int) – Maximum allowed length of the sequence after truncation.
max_turn_length (int) – Maximum length that any single segment may have after truncation.
prompt_truncation_mode (str) – Truncation mode for the prompt (user) part of a chat. Can be ‘keep_start’ or ‘keep_end’.
- Returns
tokenized_semantic_region_list (list) – The tokenized semantic regions, with their indices updated to reflect the truncation.
list – The truncated sequence of token IDs that fits within the max_sequence_length constraint.
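The two values of prompt_truncation_mode can be illustrated with a small standalone sketch. It assumes that, as the names suggest, ‘keep_start’ retains the leading tokens of a segment and ‘keep_end’ retains the trailing ones. The helper truncate_tokens below is hypothetical and is not part of the library; it does not reproduce how truncate_sequence handles semantic regions or updates their indices.

```python
def truncate_tokens(tokens, max_len, mode):
    """Hypothetical helper (not the library's code): trim `tokens` to at
    most `max_len` items, keeping either the start or the end."""
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(tokens) <= max_len:
        # Already short enough: return an unchanged copy.
        return list(tokens)
    if mode == "keep_start":
        # Assumed meaning: keep the first `max_len` tokens.
        return list(tokens[:max_len])
    if mode == "keep_end":
        # Assumed meaning: keep the last `max_len` tokens.
        return list(tokens[len(tokens) - max_len:])
    raise ValueError(f"unknown truncation mode: {mode!r}")


prompt = [101, 102, 103, 104, 105]
kept_start = truncate_tokens(prompt, 3, "keep_start")
kept_end = truncate_tokens(prompt, 3, "keep_end")
print(kept_start)
print(kept_end)
```

Under these assumptions, ‘keep_start’ would suit prompts whose instructions come first, while ‘keep_end’ would suit prompts where the most recent context matters most.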