cerebras.modelzoo.data_preparation.data_preprocessing.utils.handle_bos_token_default

cerebras.modelzoo.data_preparation.data_preprocessing.utils.handle_bos_token_default(tokenizer)

When performing fill-in-the-middle (FIM) preprocessing, each chunk is tokenized again after the sequence is split. If the tokenizer adds a BOS token by default, every re-tokenized chunk gets its own BOS token, which leaves extra BOS tokens in the middle of the sequence. This function sets the tokenizer's default BOS behavior to False and returns a flag indicating whether a BOS token needs to be added in the final FIM formatting function.
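The behavior described above can be illustrated with a minimal sketch. It is not the Model Zoo source: `ToyTokenizer` and `handle_bos_token_default_sketch` are hypothetical names, and the sketch assumes the tokenizer exposes a boolean `add_bos_token` attribute (as some Hugging Face tokenizers do). The actual function may detect and disable the default differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer; not part of the Model Zoo."""

    add_bos_token: bool = True  # assumed attribute name
    bos_token_id: int = 1

    def encode(self, text: str) -> List[int]:
        # Toy scheme: one id per character, offset so no id collides with BOS.
        ids = [ord(ch) + 100 for ch in text]
        return [self.bos_token_id] + ids if self.add_bos_token else ids


def handle_bos_token_default_sketch(tokenizer) -> bool:
    """Turn off automatic BOS insertion; return True if it was previously on."""
    if getattr(tokenizer, "add_bos_token", False):
        tokenizer.add_bos_token = False
        return True
    return False


tok = ToyTokenizer()
chunks = ["def f", "(x):", " return x"]  # pieces produced by a FIM split

# Before the fix: each re-tokenized chunk carries its own BOS token.
naive = [tid for chunk in chunks for tid in tok.encode(chunk)]

# After the fix: chunks are BOS-free, and one BOS is added during final formatting.
add_bos = handle_bos_token_default_sketch(tok)
fixed = [tid for chunk in chunks for tid in tok.encode(chunk)]
if add_bos:
    fixed = [tok.bos_token_id] + fixed
```

In this sketch the returned flag is what lets the final formatting step decide whether to prepend a single BOS token, so a tokenizer that never added BOS by default is left unchanged.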
