data_processing.tokenizers package#

Submodules#

data_processing.tokenizers.BPETokenizer module#

Byte pair encoding/decoding utilities

Modified from the GPT-2 codebase: https://github.com/openai/gpt-2

class data_processing.tokenizers.BPETokenizer.BPETokenizer[source]#

Bases: object

__init__(vocab_file, encoder_file, errors='replace', special_tokens=None)[source]#
add_token(token)[source]#
bpe(token)[source]#
decode(tokens)[source]#
encode(text)[source]#
get_token_id(token)[source]#
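
A minimal usage sketch based on the signatures above; the file paths are placeholders and the exact return types of encode/decode are assumptions, not confirmed by this reference.

from data_processing.tokenizers.BPETokenizer import BPETokenizer

# Placeholder paths: vocab_file is assumed to hold the BPE merges and
# encoder_file the token -> id mapping, following the GPT-2 convention.
tokenizer = BPETokenizer(
    vocab_file="bpe_vocab.bpe",
    encoder_file="encoder.json",
    special_tokens=["<|endoftext|>"],
)

ids = tokenizer.encode("Hello world")   # assumed to return a list of token ids
text = tokenizer.decode(ids)            # assumed to return the decoded string
print(ids, text)
print(tokenizer.get_token_id("<|endoftext|>"))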
data_processing.tokenizers.BPETokenizer.bytes_to_unicode()[source]#

Returns a list of utf-8 bytes and a corresponding list of unicode strings. The reversible bpe codes work on unicode strings. This means you need a large number of unicode characters in your vocab if you want to avoid UNKs. When you're at something like a 10B token dataset, you end up needing around 5K for decent coverage, which is a significant percentage of a normal, say, 32K bpe vocab. To avoid that, we want lookup tables between utf-8 bytes and unicode strings that avoid mapping to the whitespace/control characters the bpe code barfs on.
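
An illustrative sketch of how such a lookup table is typically used; it assumes the function returns a byte-value to unicode-character mapping, as in the GPT-2 original, which this reference does not spell out.

from data_processing.tokenizers.BPETokenizer import bytes_to_unicode

# Assumed to behave like the GPT-2 original: a mapping from all 256 byte
# values to printable unicode characters, skipping whitespace/control chars.
byte_encoder = bytes_to_unicode()
byte_decoder = {c: b for b, c in byte_encoder.items()}

text = "café"
mapped = "".join(byte_encoder[b] for b in text.encode("utf-8"))       # bytes -> unicode symbols
restored = bytearray(byte_decoder[c] for c in mapped).decode("utf-8")  # symbols -> bytes -> text
assert restored == text   # the mapping is reversible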

data_processing.tokenizers.BPETokenizer.get_pairs(word)[source]#

Return set of symbol pairs in a word.

Word is represented as tuple of symbols (symbols being variable-length strings).
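
An illustrative call; the expected output shown in the comment is inferred from the description above, not taken from the source.

from data_processing.tokenizers.BPETokenizer import get_pairs

word = ("h", "e", "ll", "o")        # symbols may be multi-character strings
print(get_pairs(word))              # expected: {("h", "e"), ("e", "ll"), ("ll", "o")}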

data_processing.tokenizers.HFTokenizer module#

class data_processing.tokenizers.HFTokenizer.HFTokenizer[source]#

Bases: object

Designed to integrate with the Hugging Face Tokenizers library.

Parameters:

vocab_file (str) – A vocabulary file to create the tokenizer from.

special_tokens (list or str) – The special tokens that are to be added to the tokenizer.

__init__(vocab_file, special_tokens=None)[source]#
add_special_tokens(special_tokens)[source]#
add_token(token)[source]#
decode(token_ids)[source]#
encode(text)[source]#
property eos#
get_token(id)[source]#
get_token_id(token)[source]#
property pad#
set_eos_pad_tokens()[source]#
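
A minimal usage sketch; the tokenizer file path is a placeholder and the return types (token ids from encode, text from decode) are assumptions.

from data_processing.tokenizers.HFTokenizer import HFTokenizer

# "tokenizer.json" is a placeholder for a Hugging Face tokenizers vocabulary file.
tokenizer = HFTokenizer("tokenizer.json", special_tokens=["<pad>", "<eos>"])

ids = tokenizer.encode("Hello world")        # assumed to return token ids
print(tokenizer.decode(ids))                 # assumed to return the original text
print(tokenizer.eos, tokenizer.pad)          # eos/pad values exposed as properties
print(tokenizer.get_token(ids[0]))           # id -> token lookup
print(tokenizer.get_token_id("<eos>"))       # token -> id lookup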

data_processing.tokenizers.Tokenization module#

Tokenization classes and functions

class data_processing.tokenizers.Tokenization.BaseTokenizer[source]#

Bases: object

Class for base tokenization of a piece of text. Handles grammar operations like stripping accents, checking for Chinese characters in text, and splitting on punctuation and control characters. Also handles creating the tokenizer for converting tokens->ids and ids->tokens, and storing the vocabulary for the dataset.

Parameters:

vocab_file (str) – File containing the vocabulary, one token per line.

do_lower_case (bool) – Specifies whether to convert to lower case for data processing.

__init__(vocab_file, do_lower_case=True)[source]#
tokenize(text)[source]#

Tokenizes a piece of text. Does not convert to ids
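
A small usage sketch; the vocabulary path is a placeholder, and the token output shown in the comment is an assumption based on the description above.

from data_processing.tokenizers.Tokenization import BaseTokenizer

# "vocab.txt" is a placeholder: one token per line, as described above.
base = BaseTokenizer("vocab.txt", do_lower_case=True)

tokens = base.tokenize("Héllo, World!")
print(tokens)   # e.g. ["hello", ",", "world", "!"] after accent stripping and lower-casing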

class data_processing.tokenizers.Tokenization.FullTokenizer[source]#

Bases: object

Class for full tokenization of a piece of text. Calls BaseTokenizer and the WordPiece tokenizer to perform basic grammar operations and wordpiece splits.

Parameters:

vocab_file (str) – File containing the vocabulary, one token per line.

do_lower_case (bool) – Specifies whether to convert to lower case for data processing.

__init__(vocab_file, do_lower_case=True)[source]#
convert_ids_to_tokens(text)[source]#

Converts a list of ids to a list of tokens. We shift all inputs by 1 because the ids->token dictionary formed by the Keras Tokenizer starts with index 1 instead of 0.

convert_tokens_to_ids(text)[source]#

Converts a list of tokens to a list of ids. We shift all outputs by 1 because the dictionary formed by the Keras Tokenizer starts with index 1 instead of 0.

get_vocab_words()[source]#

Returns a list of the words in the vocab

tokenize(text)[source]#

Perform basic tokenization followed by wordpiece tokenization on a piece of text. Does not convert to ids.
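
A hedged round-trip sketch of the methods documented above; the internal +1 shift is handled by the converters, so callers should see tokens and ids that round-trip. Paths and exact outputs are assumptions.

from data_processing.tokenizers.Tokenization import FullTokenizer

full = FullTokenizer("vocab.txt", do_lower_case=True)   # placeholder vocab path

tokens = full.tokenize("unaffable behaviour")           # basic + wordpiece splits
ids = full.convert_tokens_to_ids(tokens)                # shifted by 1 internally, per the docs
assert full.convert_ids_to_tokens(ids) == tokens        # assumed round-trip behaviour
print(len(full.get_vocab_words()))                      # number of words in the vocab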

class data_processing.tokenizers.Tokenization.WordPieceTokenizer[source]#

Bases: data_processing.tokenizers.Tokenization.BaseTokenizer

Class for tokenization of a piece of text into its word pieces.

Parameters:

vocab_file (str) – File containing the vocabulary, one token per line.

unknown_token (str) – Token for words not in the vocabulary.

max_input_chars_per_word (int) – Max length of a word for splitting.

do_lower_case (bool) – Specifies whether to convert to lower case for data processing.

__init__(vocab_file, unknown_token='[UNK]', max_input_chars_per_word=200, do_lower_case=True)[source]#
tokenize(text)[source]#

Tokenize a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Does not convert to ids.
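
The greedy longest-match-first idea can be sketched independently of the class, purely as an illustration of the algorithm described above; this is not the class's own implementation.

# Standalone sketch of greedy longest-match-first WordPiece splitting.
def wordpiece_split(word, vocab, unknown_token="[UNK]", max_input_chars_per_word=200):
    if len(word) > max_input_chars_per_word:
        return [unknown_token]
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while start < end:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unknown_token]   # no prefix of the remainder is in the vocab
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_split("unaffable", {"un", "##aff", "##able"}))   # ['un', '##aff', '##able']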

Module contents#