data_processing.tokenizers package#
Submodules#
data_processing.tokenizers.BPETokenizer module#
Byte pair encoding/decoding utilities
Modified from the GPT-2 codebase: https://github.com/openai/gpt-2
- data_processing.tokenizers.BPETokenizer.bytes_to_unicode()[source]#
Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible BPE codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. At something like a 10B token dataset you end up needing around 5K for decent coverage, which is a significant percentage of a normal, say, 32K BPE vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings. This also avoids mapping to the whitespace/control characters that the BPE code chokes on.
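A minimal sketch of the byte-to-unicode lookup table this function describes, following the public GPT-2 reference implementation linked above; the helper name and surrounding usage are illustrative and may differ from this module's exact code (the reference returns a byte-to-character mapping):

```python
def bytes_to_unicode_sketch():
    # Printable bytes keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (whitespace/control characters) are remapped to
            # unused code points above 255 so the BPE merges never see them raw.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode_sketch()
byte_decoder = {v: k for k, v in byte_encoder.items()}

# Map raw utf-8 bytes to the printable unicode string the BPE merges operate on.
bpe_input = "".join(byte_encoder[b] for b in "héllo".encode("utf-8"))
```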
data_processing.tokenizers.HFTokenizer module#
- class data_processing.tokenizers.HFTokenizer.HFTokenizer[source]#
Bases: object
Designed to integrate with the Hugging Face Tokenizers library.
Parameters:
- vocab_file (str) – A vocabulary file to create the tokenizer from.
- special_tokens (list or str) – The special tokens to be added to the tokenizer.
- property eos#
- property pad#
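A hypothetical usage sketch, assuming only the documented constructor parameters and properties; the tokenizer file path and the special-token strings are placeholders:

```python
from data_processing.tokenizers.HFTokenizer import HFTokenizer

# Construct the tokenizer from a serialized tokenizer/vocabulary file.
# "tokenizer.json" and the special-token strings are placeholders.
tokenizer = HFTokenizer(
    vocab_file="tokenizer.json",
    special_tokens=["<|endoftext|>", "<pad>"],
)

# The documented properties expose the end-of-sequence and padding tokens.
print(tokenizer.eos)
print(tokenizer.pad)
```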
data_processing.tokenizers.Tokenization module#
Tokenization classes and functions
- class data_processing.tokenizers.Tokenization.BaseTokenizer[source]#
Bases: object
Class for base tokenization of a piece of text. Handles grammar operations such as stripping accents, checking for Chinese characters in the text, and splitting on punctuation and control characters. Also creates the tokenizer for converting tokens to ids and ids to tokens, and stores the vocabulary for the dataset.
Parameters:
- vocab_file (str) – File containing the vocabulary, one token per line.
- do_lower (bool) – Specifies whether to convert to lower case for data processing.
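A hypothetical construction sketch based only on the documented parameters; the vocabulary file name and its tokens are placeholders (the one-token-per-line format follows the docstring):

```python
from data_processing.tokenizers.Tokenization import BaseTokenizer

# Write a tiny vocabulary file, one token per line as the docstring describes.
with open("tiny_vocab.txt", "w", encoding="utf-8") as f:
    f.write("[UNK]\nhello\nworld\n")

# Constructor arguments follow the documented parameter names.
base_tokenizer = BaseTokenizer(vocab_file="tiny_vocab.txt", do_lower=True)
```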
- class data_processing.tokenizers.Tokenization.FullTokenizer[source]#
Bases: object
Class for full tokenization of a piece of text. Calls BaseTokenizer and WordPieceTokenizer to perform basic grammar operations and WordPiece splits.
Parameters:
- vocab_file (str) – File containing the vocabulary, one token per line.
- do_lower (bool) – Specifies whether to convert to lower case for data processing.
- convert_ids_to_tokens(text)[source]#
Converts a list of ids to a list of tokens. All inputs are shifted by 1 because the id-to-token dictionary formed by the Keras Tokenizer starts at index 1 instead of 0.
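A hypothetical usage sketch assuming only the documented constructor parameters and the convert_ids_to_tokens method; the vocabulary path and the id values are placeholders:

```python
from data_processing.tokenizers.Tokenization import FullTokenizer

# "vocab.txt" is a placeholder; the file holds one token per line.
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower=True)

# convert_ids_to_tokens() shifts each id by 1 internally because the
# Keras Tokenizer vocabulary is indexed from 1 rather than 0.
ids = [5, 12, 7]  # illustrative id values only
print(tokenizer.convert_ids_to_tokens(ids))
```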
- class data_processing.tokenizers.Tokenization.WordPieceTokenizer[source]#
Bases: data_processing.tokenizers.Tokenization.BaseTokenizer
Class for tokenization of a piece of text into its word pieces.
Parameters:
- vocab_file (str) – File containing the vocabulary, one token per line.
- unknown_token (str) – Token for words not in the vocabulary.
- max_input_chars_per_word (int) – Maximum length of a word eligible for splitting.
- do_lower (bool) – Specifies whether to convert to lower case for data processing.
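For reference, a standalone sketch of the standard greedy longest-match-first WordPiece split that a tokenizer like this performs; it mirrors the common algorithm, not necessarily this class's internals, and the function name and toy vocabulary are illustrative:

```python
def wordpiece_split(word, vocab, unknown_token="[UNK]", max_input_chars_per_word=200):
    # Words longer than the limit are mapped to the unknown token outright.
    if len(word) > max_input_chars_per_word:
        return [unknown_token]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Greedily take the longest vocabulary entry matching the remainder.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unknown_token]  # no prefix of the remainder is in the vocab
        pieces.append(cur)
        start = end
    return pieces


vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece_split("unaffable", vocab))  # ['un', '##aff', '##able']
```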