cerebras.modelzoo.data_preparation.data_preprocessing.utils.handle_bos_token_default
- cerebras.modelzoo.data_preparation.data_preprocessing.utils.handle_bos_token_default(tokenizer)
When performing fill-in-the-middle (FIM) preprocessing, each chunk is tokenized again after the sequence is split. If the tokenizer adds a BOS token by default, this produces extra BOS tokens in the middle of the sequence. This function sets the tokenizer's BOS default to False and returns a flag indicating whether a BOS token needs to be added in the final FIM formatting function.
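The behavior described above can be illustrated with a minimal sketch. It is not the library's implementation: `ToyTokenizer` and `handle_bos_token_default_sketch` are hypothetical names, and the sketch assumes the BOS default is controlled by a boolean `add_bos_token` attribute on the tokenizer, which this page does not confirm.

```python
class ToyTokenizer:
    """Hypothetical stand-in tokenizer; not part of cerebras.modelzoo."""

    bos_token_id = 1

    def __init__(self, add_bos_token=True):
        # Assumption: a boolean attribute controls whether BOS is prepended.
        self.add_bos_token = add_bos_token

    def encode(self, text):
        # Toy encoding: one id per character, with an optional leading BOS id.
        ids = [ord(ch) for ch in text]
        return [self.bos_token_id] + ids if self.add_bos_token else ids


def handle_bos_token_default_sketch(tokenizer):
    """Turn off the BOS default and report whether it was previously on."""
    if getattr(tokenizer, "add_bos_token", False):
        tokenizer.add_bos_token = False
        return True
    return False


tokenizer = ToyTokenizer(add_bos_token=True)
add_bos = handle_bos_token_default_sketch(tokenizer)

# Each chunk is tokenized separately, as in FIM after splitting.
chunks = ["ab", "cd", "ef"]
ids = [tok for chunk in chunks for tok in tokenizer.encode(chunk)]

# The final formatting step adds a single BOS only if the flag says so.
if add_bos:
    ids = [tokenizer.bos_token_id] + ids
```

With the default disabled before the chunks are re-tokenized, the BOS id appears only once, at the start of the assembled sequence, rather than once per chunk.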