cerebras.modelzoo.data_preparation.data_preprocessing.utils.truncate_sequence

cerebras.modelzoo.data_preparation.data_preprocessing.utils.truncate_sequence(token_ids, tokenized_semantic_region_list, max_sequence_length, max_turn_length, prompt_truncation_mode)

Truncates a token sequence to fit within a specified maximum sequence length (MSL), with max_turn_length bounding the length of any single segment.

Parameters
  • token_ids (list) – List of token IDs representing the entire sequence.

  • tokenized_semantic_region_list (list) – List of tokenized semantic regions.

  • max_sequence_length (int) – Maximum allowed length of the sequence after truncation.

  • max_turn_length (int) – Maximum length of any single segment after truncation.

  • prompt_truncation_mode (str) – Mode of truncation for prompt/user part of chat. Can be ‘keep_start’ or ‘keep_end’.
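
To make the two prompt_truncation_mode values concrete, here is a minimal standalone sketch of what keeping the start versus keeping the end of a single segment means. The helper truncate_segment is hypothetical and written only for illustration; it is not part of the Model Zoo API and does not reproduce how truncate_sequence distributes the length budget across segments.

```python
def truncate_segment(tokens, max_len, mode):
    """Toy illustration of 'keep_start' vs 'keep_end' on one segment.

    Hypothetical helper for this page only; not the Model Zoo implementation.
    """
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    if len(tokens) <= max_len:
        # Already short enough: return an unchanged copy.
        return list(tokens)
    if mode == "keep_start":
        # Keep the first max_len tokens and drop the tail.
        return list(tokens[:max_len])
    if mode == "keep_end":
        # Keep the last max_len tokens and drop the head.
        # Slicing from len - max_len also handles max_len == 0 correctly.
        return list(tokens[len(tokens) - max_len:])
    raise ValueError(f"unknown truncation mode: {mode!r}")


prompt = [101, 102, 103, 104, 105]
kept_start = truncate_segment(prompt, 3, "keep_start")  # [101, 102, 103]
kept_end = truncate_segment(prompt, 3, "keep_end")      # [103, 104, 105]
```

Which mode suits a dataset depends on where the important part of the prompt sits: keep_start preserves the opening of the prompt, while keep_end preserves the text closest to the response.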

Returns

  • tokenized_semantic_region_list (list) – The semantic region list, with each region's indices updated to reflect the truncation.

  • list – The truncated sequence of token IDs, which fits within the max_sequence_length constraint.

Return type

(list, list)
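
The returned region list has its indices updated after truncation. The sketch below shows why such an update is needed: once a region loses tokens, every later region shifts left. It assumes, purely for illustration, that each region is described by a half-open (start, end) span; the actual structure of the entries in tokenized_semantic_region_list is not documented on this page, and reindex_regions is a hypothetical helper.

```python
def reindex_regions(region_lengths):
    """Assign contiguous half-open (start, end) spans to regions.

    Hypothetical helper; the (start, end) tuple layout is an assumption,
    not the documented format of tokenized_semantic_region_list.
    """
    spans = []
    cursor = 0
    for length in region_lengths:
        spans.append((cursor, cursor + length))
        cursor += length
    return spans


# Two regions of 5 and 4 tokens; the first is truncated to 3 tokens.
spans_before = reindex_regions([5, 4])  # [(0, 5), (5, 9)]
spans_after = reindex_regions([3, 4])   # [(0, 3), (3, 7)]
```

In this toy case the second region keeps all 4 of its tokens, but its span moves from (5, 9) to (3, 7) because the first region shrank by 2 tokens.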