cerebras.modelzoo.data_preparation.data_preprocessing.utils.clean_text

cerebras.modelzoo.data_preparation.data_preprocessing.utils.clean_text(data, use_ftfy, wikitext_detokenize, ftfy_normalizer)

Clean the provided text by optionally applying ftfy normalization and wikitext detokenization. Each step runs only when its corresponding flag is set.

Parameters
  • data (str) – The text to be cleaned.

  • use_ftfy (bool) – Whether to use the ftfy library to fix text encoding issues.

  • wikitext_detokenize (bool) – Whether to apply wikitext detokenization to the text.

  • ftfy_normalizer (str) – The normalization method to use with ftfy. Only applies when use_ftfy is True.

Returns

The cleaned text after applying the specified operations.

Return type

str
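
Example

The following is a minimal, standard-library-only sketch of how the two flags might gate the cleaning steps. It is not the library's implementation. The names clean_text_sketch and wikitext_detokenize_sketch are hypothetical, unicodedata.normalize stands in for ftfy, the replacement table is a small illustrative subset of wikitext detokenization rules, and the order of the two steps is an assumption.

```python
import unicodedata


def wikitext_detokenize_sketch(text: str) -> str:
    """Undo a few WikiText tokenization artifacts (illustrative subset only)."""
    replacements = [
        (" @-@ ", "-"),  # "well @-@ known" -> "well-known"
        (" @,@ ", ","),  # "1 @,@ 200"      -> "1,200"
        (" @.@ ", "."),  # "3 @.@ 5"        -> "3.5"
        (" , ", ", "),   # remove the space before a comma
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


def clean_text_sketch(
    data: str,
    use_ftfy: bool,
    wikitext_detokenize: bool,
    ftfy_normalizer: str,
) -> str:
    """Hypothetical stand-in mirroring the documented signature of clean_text."""
    if use_ftfy:
        # Assumption: ftfy_normalizer names a Unicode normalization form
        # such as "NFC". unicodedata replaces ftfy here so the sketch runs
        # without third-party packages; ftfy itself does more than this.
        data = unicodedata.normalize(ftfy_normalizer, data)
    if wikitext_detokenize:
        data = wikitext_detokenize_sketch(data)
    return data


raw = "The well @-@ known city , founded in 1 @,@ 200"
cleaned = clean_text_sketch(
    raw, use_ftfy=True, wikitext_detokenize=True, ftfy_normalizer="NFC"
)
print(cleaned)  # The well-known city, founded in 1,200
```

With both flags set to False, the sketch returns the input unchanged, which matches the documented role of the flags as independent switches.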