cerebras.modelzoo.data_preparation.data_preprocessing.utils.clean_text
- cerebras.modelzoo.data_preparation.data_preprocessing.utils.clean_text(data, use_ftfy, wikitext_detokenize, ftfy_normalizer)
Clean the provided text by optionally applying ftfy normalization and wikitext detokenization.
- Parameters
data (str) – The text to be cleaned.
use_ftfy (bool) – Whether to use the ftfy library to fix text encoding issues.
wikitext_detokenize (bool) – Whether to apply wikitext detokenization to the text.
ftfy_normalizer (str) – The normalization method for ftfy to apply; used only when use_ftfy is True.
- Returns
The cleaned text after applying the specified operations.
- Return type
str
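
The flag-driven flow described above can be illustrated with a minimal, standard-library-only sketch. This is not the modelzoo implementation: `clean_text_sketch` and `wikitext_detokenize_sketch` are hypothetical names, `unicodedata.normalize` stands in for ftfy (it applies Unicode normalization only, with none of ftfy's encoding repair), and the detokenization rules cover only a small, assumed subset of WikiText artifacts.

```python
import re
import unicodedata


def wikitext_detokenize_sketch(text: str) -> str:
    """Undo a few common WikiText tokenization artifacts (illustrative subset)."""
    # WikiText marks in-word hyphens and in-number separators with "@" tokens,
    # e.g. "well @-@ known", "1 @,@ 000", "3 @.@ 5".
    text = text.replace(" @-@ ", "-")
    text = text.replace(" @,@ ", ",")
    text = text.replace(" @.@ ", ".")
    # Reattach punctuation that tokenization separated from the preceding word.
    return re.sub(r" ([.,;:!?])(?=\s|$)", r"\1", text)


def clean_text_sketch(data: str, use_ftfy: bool, wikitext_detokenize: bool,
                      ftfy_normalizer: str) -> str:
    """Hypothetical stand-in that mirrors the documented parameters."""
    if use_ftfy:
        # Assumption: Unicode normalization only, as a substitute for ftfy.
        data = unicodedata.normalize(ftfy_normalizer, data)
    if wikitext_detokenize:
        data = wikitext_detokenize_sketch(data)
    return data


raw = "The well @-@ known total was 1 @,@ 000 , or about 3 @.@ 5 percent ."
cleaned = clean_text_sketch(raw, use_ftfy=True, wikitext_detokenize=True,
                            ftfy_normalizer="NFC")
print(cleaned)  # The well-known total was 1,000, or about 3.5 percent.
```

With both flags set to False, the sketch returns the input unchanged, which matches the reading that each cleaning step is opt-in.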