cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.tokenize_stop_words#

cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.tokenize_stop_words(stop_words, tokenizer, stop_words_cache=None, max_stop_seq_len=None)[source]#

Helper to construct a list of stop token sequences from the given list of stop words using the specified tokenizer.

For stop words that tokenize to a single token, we iterate over the tokenizer’s vocabulary and add every token id that detokenizes to the stop word. This handles the case where multiple token ids map to the same stop word, since RT stops inference on stop tokens, not stop words.

For stop words that tokenize to a multi-token sequence, we add the tokenized sequence directly, as sketched below.
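
The following is a minimal sketch of the behavior described above, not the Model Zoo implementation. It assumes a Hugging Face tokenizer exposing get_vocab(), encode(), and decode():

    from typing import List

    def build_stop_sequences(stop_words: List[str], tokenizer) -> List[List[int]]:
        stop_sequences = []
        for word in stop_words:
            token_ids = tokenizer.encode(word, add_special_tokens=False)
            if len(token_ids) == 1:
                # Single-token stop word: scan the vocab for every token id
                # that decodes back to this word, since multiple ids can map
                # to the same surface string.
                for token_id in tokenizer.get_vocab().values():
                    if tokenizer.decode([token_id]) == word:
                        stop_sequences.append([token_id])
            else:
                # Multi-token stop word: add the tokenized sequence directly.
                stop_sequences.append(token_ids)
        # Sort by first token id, matching the documented return value.
        return sorted(stop_sequences, key=lambda seq: seq[0])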

Parameters
  • stop_words (List[str]) – List of stop words to convert into stop token sequences.

  • tokenizer (PreTrainedTokenizerBase) – Tokenizer class from huggingface transformers library.

  • stop_words_cache (Dict) – (Optional) Dict used to record and retrieve the list of stop sequences per stop word across calls. If not provided, each stop word is tokenized anew. Defaults to None.

  • max_stop_seq_len (int) – (Optional) Running maximum length of a stop sequence, updated and returned by this helper. Defaults to None.

Returns

Sorted (by first token id) list of stop token sequences; (Optional) Updated maximum stop sequence length

Return type

Tuple[List[List[int]], int]
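
A usage sketch follows. The import assumes tokenize_stop_words is importable at the path shown in the page header (it may instead be a method on the InferenceDataProcessor class), and the stop words shown are arbitrary examples:

    from transformers import AutoTokenizer

    from cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor import (
        tokenize_stop_words,
    )

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    cache = {}  # reused across calls so repeated stop words are not re-tokenized

    stop_seqs, max_len = tokenize_stop_words(
        stop_words=["###", "\n\n"],
        tokenizer=tokenizer,
        stop_words_cache=cache,
        max_stop_seq_len=0,  # running maximum, updated by the helper
    )
    print(stop_seqs, max_len)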