data_processing.bert package#

Submodules#

data_processing.bert.bertsum_data_processor module#

Common pre-processing functions for BERTSUM data processing

class data_processing.bert.bertsum_data_processor.BertData[source]#

Bases: object

Converts input into BERT format.

Parameters

params (dict) – BertData configuration parameters.

__init__(params)[source]#

Converts input into BERT format.

Parameters

params (dict) – BertData configuration parameters.

process(source, target, oracle_ids)[source]#
class data_processing.bert.bertsum_data_processor.JsonConverter[source]#

Bases: object

JsonConverter simplifies the input and converts it into JSON file format with source and target (summarized) texts. Splits the input into train, test, and valid parts based on the map_path.

Parameters

params (dict) – JsonConverter configuration parameters.

__init__(params)[source]#

JsonConverter simplifies the input and converts it into JSON file format with source and target (summarized) texts. Splits the input into train, test, and valid parts based on the map_path.

Parameters

params (dict) – JsonConverter configuration parameters.

process()[source]#
class data_processing.bert.bertsum_data_processor.RougeBasedLabelsFormatter[source]#

Bases: object

Based on the reference n-grams, RougeBasedLabelsFormatter selects the sentences from the input that have the highest ROUGE score against the reference. This is needed because we solve an extractive summarization task, where the target summary is a subset of the input sentences, in contrast to abstractive summarization, where the summarized text is generated by the system without relying on the input text.

__init__()[source]#

Based on the reference n-grams, RougeBasedLabelsFormatter selects the sentences from the input that have the highest ROUGE score against the reference. This is needed because we solve an extractive summarization task, where the target summary is a subset of the input sentences, in contrast to abstractive summarization, where the summarized text is generated by the system without relying on the input text.

process(document_sentences, abstract_sentences, summary_size)[source]#
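The process(document_sentences, abstract_sentences, summary_size) method performs this selection greedily. Below is a minimal, self-contained sketch of the idea; the helper names and the simplified unigram/bigram recall score are illustrative assumptions, not the module's exact implementation.

def _ngrams(tokens, n):
    # Set of n-grams in a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _recall(candidate_tokens, reference_ngrams, n):
    # Simplified ROUGE-N recall: fraction of reference n-grams covered.
    if not reference_ngrams:
        return 0.0
    return len(_ngrams(candidate_tokens, n) & reference_ngrams) / len(reference_ngrams)

def greedy_oracle_selection(document_sentences, abstract_sentences, summary_size):
    # Greedily pick up to summary_size sentence indices whose combined
    # unigram + bigram recall against the abstract is highest.
    reference = [tok for sent in abstract_sentences for tok in sent]
    ref_1grams, ref_2grams = _ngrams(reference, 1), _ngrams(reference, 2)

    selected, current_score = [], 0.0
    for _ in range(summary_size):
        best_idx, best_score = None, current_score
        for idx in range(len(document_sentences)):
            if idx in selected:
                continue
            candidate = [tok for i in sorted(selected + [idx])
                         for tok in document_sentences[i]]
            score = (_recall(candidate, ref_1grams, 1)
                     + _recall(candidate, ref_2grams, 2))
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is None:          # no remaining sentence improves the score
            break
        selected.append(best_idx)
        current_score = best_score
    return sorted(selected)

# Example: the first and third sentences best cover the reference summary.
doc = [["the", "cat", "sat"], ["dogs", "bark", "loudly"], ["the", "cat", "slept"]]
abstract = [["the", "cat", "sat", "and", "slept"]]
print(greedy_oracle_selection(doc, abstract, summary_size=2))   # -> [0, 2]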
class data_processing.bert.bertsum_data_processor.Tokenizer[source]#

Bases: object

Tokenizes files from the input path and writes the results to the output path. Stanford CoreNLP is used for tokenization.

Parameters

params (dict) – Tokenizer configuration parameters.

__init__(params)[source]#

Tokenizes files from the input path and writes the results to the output path. Stanford CoreNLP is used for tokenization.

Parameters

params (dict) – Tokenizer configuration parameters.

process()[source]#
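For orientation, the sketch below shows one way such a CoreNLP tokenization pass can be driven from Python; the mapping-file name, the directory layout, and the assumption that CoreNLP is already on the Java classpath are illustrative, not the module's exact behavior.

import os
import subprocess

def tokenize_with_corenlp(input_path, output_path):
    # Collect one flat-text document per file from input_path.
    stories = [f for f in os.listdir(input_path) if not f.startswith(".")]

    # CoreNLP can read the list of files to process from a manifest file.
    with open("mapping.txt", "w") as mapping:
        for story in stories:
            mapping.write(os.path.join(input_path, story) + "\n")

    command = [
        "java", "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit",
        "-filelist", "mapping.txt",
        "-outputFormat", "json",
        "-outputDirectory", output_path,
    ]
    # Writes one tokenized, sentence-split JSON file per input document.
    subprocess.run(command, check=True)
    os.remove("mapping.txt")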
data_processing.bert.bertsum_data_processor.check_output(input_path, output_path)[source]#
data_processing.bert.bertsum_data_processor.convert_to_json_files(params)[source]#

Formats the input tokenized files into simpler JSON files. Takes params.input_path, converts it to JSON format, and stores the result under params.output_path.

data_processing.bert.bertsum_data_processor.create_parser()[source]#
data_processing.bert.bertsum_data_processor.tokenize(params)[source]#

Splits sentences and performs tokenization. Takes params.input_path, tokenizes it, and stores the result under params.output_path.
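A hedged sketch of how these two stages might be chained. Only input_path and output_path are documented above, so the SimpleNamespace stand-in for params (normally built via create_parser()) and any further attributes a real run requires are assumptions.

from types import SimpleNamespace

from data_processing.bert.bertsum_data_processor import (
    convert_to_json_files,
    tokenize,
)

# Stage 1: sentence-split and tokenize the raw stories with CoreNLP.
tokenize_params = SimpleNamespace(
    input_path="raw_stories/",    # flat-text documents
    output_path="tokenized/",     # tokenized CoreNLP JSON output
)
tokenize(tokenize_params)

# Stage 2: simplify the tokenized output into source/target JSON files.
json_params = SimpleNamespace(
    input_path="tokenized/",
    output_path="json_data/",
)
convert_to_json_files(json_params)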

data_processing.bert.dynamic_processor module#

class data_processing.bert.dynamic_processor.PreprocessInstance[source]#

Bases: object

A single training (sentence-pair) instance.

Parameters
  • tokens (list) – List of tokens for sentence pair

  • segment_ids (list) – List of segment ids for sentence pair

  • is_random_next (bool) – Specifies whether the second element in the pair is random

__init__(tokens, segment_ids, is_random_next)[source]#
to_dict()[source]#
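For orientation, a minimal stand-in showing the kind of record such an instance holds and what a to_dict-style serialization can look like; everything beyond the three documented fields is an assumption.

from dataclasses import dataclass, asdict
from typing import List

@dataclass
class SentencePairExample:
    # Illustrative stand-in for a single [CLS] a [SEP] b [SEP] instance.
    tokens: List[str]        # WordPiece tokens of the full pair
    segment_ids: List[int]   # 0 for tokens-a (incl. [CLS]/first [SEP]), 1 for tokens-b
    is_random_next: bool     # True if tokens-b came from a different document

    def to_dict(self):
        return asdict(self)

example = SentencePairExample(
    tokens=["[CLS]", "the", "cat", "[SEP]", "it", "slept", "[SEP]"],
    segment_ids=[0, 0, 0, 0, 1, 1, 1],
    is_random_next=False,
)
print(example.to_dict())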
data_processing.bert.dynamic_processor.data_generator(metadata_files, vocab_file, do_lower, split_num, max_seq_length, short_seq_prob, mask_whole_word, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, min_short_seq_length=None, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, inverted_mask=False, seed=None, spacy_model='en_core_web_sm', input_files_prefix='', sop_labels=False)[source]#

Generator function used to create the input dataset for MLM + NSP pre-training.

1. Generate raw examples by concatenating two parts, tokens-a and tokens-b, as follows: [CLS] <tokens-a> [SEP] <tokens-b> [SEP], where:

  • tokens-a: a list of tokens taken from the current document, of random length (less than the maximum sequence length, msl).

  • tokens-b: a list of tokens chosen based on the randomly set next_sentence_labels, of length msl - len(<tokens-a>) - 3 (to account for the [CLS] and [SEP] tokens). If next_sentence_labels is 1 (set to 1 with probability 0.5), tokens-b is a list of tokens from sentences chosen randomly from a different document; otherwise, tokens-b is a continuation of tokens-a taken from the same document.

The number of raw tokens also depends on short_seq_prob. A minimal sketch of this pairing scheme follows this entry.

Parameters
  • metadata_files (str or list[str]) – A string or strings list each pointing to a metadata file. A metadata file contains file paths for flat text cleaned documents. It has one file path per line.

  • vocab_file (str) – Vocabulary file, to build tokenization from

  • do_lower (bool) – Boolean value indicating if words should be converted to lowercase or not

  • split_num (int) – Number of input files to read at a given time for processing.

  • max_seq_length (int) – Maximum length of the sequence to generate

  • short_seq_prob (int) – Probability of a short sequence. Defaults to 0. Sometimes we want to use shorter sequences to minimize the mismatch between pre-training and fine-tuning.

  • mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.

  • max_predictions_per_seq (int) – Maximum number of Masked tokens in a sequence

  • masked_lm_prob (float) – Proportion of tokens to be masked

  • dupe_factor (int) – Number of times to duplicate the dataset with different static masks

  • min_short_seq_length (int) – When short_seq_prob > 0, this number indicates the least number of tokens that each example should have, i.e., the number of tokens (excluding padding) would be in the range [min_short_seq_length, MSL]

  • output_type_shapes (dict) – Dictionary indicating the shapes of different outputs

  • multiple_docs_in_single_file (bool) – True, when a single text file contains multiple documents separated by <multiple_docs_separator>

  • multiple_docs_separator (str) – String which separates multiple documents in a single text file.

  • single_sentence_per_line – True when the document is already split into sentences, with one sentence per line, and no further sentence segmentation of the document is required

  • inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.

  • seed (int) – Random seed.

  • spacy_model – spaCy model to load, i.e. shortcut link, package name or path. Used to segment text into sentences.

  • input_files_prefix (str) – Prefix to be added to the paths of the input files.

  • sop_labels (bool) – If True, negative examples of the dataset will be two consecutive sentences in reversed order. Otherwise, uses regular NSP labels (where negative examples come from different documents).

Returns

yields training examples (feature, label)

where label refers to the next_sentence_prediction label
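A minimal sketch of the pairing scheme described above. The function name, the flattening of sentences into one token list, and the simple length handling are illustrative assumptions, not the generator's exact implementation.

import random

def build_sentence_pair(current_doc, other_docs, max_seq_length, rng=random):
    # Assemble one [CLS] tokens-a [SEP] tokens-b [SEP] example.
    # current_doc / other_docs are lists of token lists (one per sentence).
    # Returns (tokens, segment_ids, next_sentence_label) following the
    # convention above: label 1 means tokens-b comes from a different document.
    max_num_tokens = max_seq_length - 3          # room for [CLS] and two [SEP]
    len_a = rng.randint(1, max_num_tokens - 1)   # random split point

    flat_doc = [tok for sent in current_doc for tok in sent]
    tokens_a = flat_doc[:len_a]

    next_sentence_label = 1 if rng.random() < 0.5 else 0
    len_b = max_num_tokens - len(tokens_a)
    if next_sentence_label == 1:
        # tokens-b from a randomly chosen different document
        random_doc = rng.choice(other_docs)
        tokens_b = [tok for sent in random_doc for tok in sent][:len_b]
    else:
        # tokens-b continues the current document right after tokens-a
        tokens_b = flat_doc[len_a:len_a + len_b]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids, next_sentence_label

docs = [
    [["the", "cat", "sat"], ["it", "slept", "all", "day"]],
    [["stocks", "fell", "sharply"], ["markets", "reacted"]],
]
tokens, segment_ids, label = build_sentence_pair(docs[0], docs[1:], max_seq_length=16)
print(label, tokens)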

data_processing.bert.mlm_only_processor module#

class data_processing.bert.mlm_only_processor.MLMOnlyInstance[source]#

Bases: object

A single training MLMOnly instance.

Parameters
  • tokens (list) – List of tokens for the MLM example

  • masked_lm_positions (list) – List of masked lm positions for the example

  • masked_lm_labels (list) – List of masked lm labels for the example

__init__(tokens, masked_lm_positions, masked_lm_labels)[source]#
data_processing.bert.mlm_only_processor.create_masked_lm_features(example, vocab_words, max_seq_length, mask_whole_word, max_predictions_per_seq, masked_lm_prob, document_separator_token, rng, tokenizer, output_type_shapes, inverted_mask)[source]#
data_processing.bert.mlm_only_processor.data_generator(metadata_files, vocab_file, do_lower, disable_masking, mask_whole_word, max_seq_length, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, buffer_size=1000000.0, min_short_seq_length=None, overlap_size=None, short_seq_prob=0, spacy_model='en_core_web_sm', inverted_mask=False, allow_cross_document_examples=True, document_separator_token='[SEP]', seed=None, input_files_prefix='')[source]#

Generator function used to create the input dataset for MLM-only pre-training.

1. Generate raw examples with tokens based on overlap_size, max_sequence_length, allow_cross_document_examples, and document_separator_token, using a sliding-window approach (a sketch of this windowing follows this entry). The exact steps are detailed in the _create_examples_from_document function.

2. Mask the raw examples based on max_predictions_per_seq.

3. Pad the masked examples to max_sequence_length if shorter than the msl.

Parameters
  • metadata_files (str or list[str]) – A string or strings list each pointing to a metadata file. A metadata file contains file paths for flat text cleaned documents. It has one file path per line.

  • vocab_file (str) – Vocabulary file, to build tokenization from

  • do_lower (bool) – Boolean value indicating if words should be converted to lowercase or not

  • disable_masking (bool) – whether masking should be disabled

  • mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.

  • max_seq_length (int) – Maximum length of the sequence to generate

  • max_predictions_per_seq (int) – Maximum number of Masked tokens in a sequence

  • masked_lm_prob (float) – Proportion of tokens to be masked

  • dupe_factor (int) – Number of times to duplicate the dataset with different static masks

  • output_type_shapes (dict) – Dictionary indicating the shapes of different outputs

  • multiple_docs_in_single_file (bool) – True, when a single text file contains multiple documents separated by <multiple_docs_separator>

  • multiple_docs_separator (str) – String which separates multiple documents in a single text file.

  • single_sentence_per_line – True when the document is already split into sentences, with one sentence per line, and no further sentence segmentation of the document is required

  • buffer_size (int) – Number of tokens to be processed at a time

  • min_short_seq_length (int) – When short_seq_prob > 0, this number indicates the least number of tokens that each example should have, i.e., the number of tokens (excluding padding) would be in the range [min_short_seq_length, MSL]

  • overlap_size (int) – Number of tokens that overlap with the previous example when processing the buffer with a sliding-window approach. If None, the overlap defaults to max_seq_len/4.

  • short_seq_prob (int) – Probability of a short sequence. Defaults to 0. Sometimes we want to use shorter sequences to minimize the mismatch between pre-training and fine-tuning.

  • spacy_model – spaCy model to load, i.e. shortcut link, package name or path. Used to segment text into sentences.

  • inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.

  • allow_cross_document_examples (bool) – If True, the sequences can contain tokens from the next document.

  • document_separator_token (str) – String to separate tokens from one document and the next when sequences span documents

  • seed (int) – Random seed.

  • input_files_prefix (str) – Prefix to be added to the paths of the input files.

Returns

yields training examples (feature, [])
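A minimal sketch of the sliding-window step described above, under the assumption that cross-document examples are allowed and document boundaries are marked with document_separator_token; the function name and the simple stride handling are illustrative, not the module's exact code.

def sliding_window_examples(documents, max_seq_length, overlap_size=None,
                            document_separator_token="[SEP]"):
    # Yield overlapping token windows over a buffer built from `documents`
    # (a list of token lists). When examples may cross document boundaries,
    # documents are joined with document_separator_token so the model can
    # still see where one document ends and the next begins.
    if overlap_size is None:
        overlap_size = max_seq_length // 4   # default noted in the docs above

    # Flatten all documents into one buffer, marking the boundaries.
    buffer = []
    for i, doc in enumerate(documents):
        if i > 0:
            buffer.append(document_separator_token)
        buffer.extend(doc)

    stride = max_seq_length - overlap_size
    start = 0
    while start < len(buffer):
        yield buffer[start:start + max_seq_length]   # masking/padding happen later
        if start + max_seq_length >= len(buffer):
            break
        start += stride

# Windows of length 4 that each share 1 token with the previous window.
docs = [[f"tok{i}" for i in range(6)], [f"tok{i}" for i in range(6, 10)]]
for window in sliding_window_examples(docs, max_seq_length=4, overlap_size=1):
    print(window)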

data_processing.bert.ner_data_processor module#

Common pre-processing functions taken from: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_ner.py with minor modifications

class data_processing.bert.ner_data_processor.InputExample[source]#

Bases: object

A single training/test example for simple sequence classification.

Constructs an InputExample.

Parameters
  • guid – Unique id for the example.

  • text – string. The untokenized text of the first sequence. For single-sequence tasks, only this sequence must be specified.

  • label – (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.

__init__(guid, text, label=None)[source]#

Constructs an InputExample.

Parameters
  • guid – Unique id for the example.

  • text – string. The untokenized text of the first sequence. For single-sequence tasks, only this sequence must be specified.

  • label – (Optional) string. The label of the example. This should be specified for train and dev examples, but not for test examples.

class data_processing.bert.ner_data_processor.NERProcessor[source]#

Bases: object

get_dev_examples(data_dir, file_name='dev.tsv')[source]#
get_labels(data_split_type=None)[source]#
get_test_examples(data_dir, file_name='test.tsv')[source]#
get_train_examples(data_dir, file_name='train.tsv')[source]#
data_processing.bert.ner_data_processor.create_parser()[source]#

Parse command-line arguments.

data_processing.bert.ner_data_processor.get_tokens_and_labels(example, tokenizer, max_seq_length)[source]#
data_processing.bert.ner_data_processor.write_label_map_files(label_list, out_dir)[source]#
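A hedged usage sketch for the processor above. It assumes NERProcessor takes no constructor arguments (none are documented) and that data_dir contains the default train.tsv / dev.tsv / test.tsv files named in the signatures; the directory path is illustrative.

from data_processing.bert.ner_data_processor import NERProcessor

# Assumed no-argument construction; file names follow the documented defaults.
processor = NERProcessor()

train_examples = processor.get_train_examples("./ner_data")
dev_examples = processor.get_dev_examples("./ner_data", file_name="dev.tsv")
labels = processor.get_labels()

print(len(train_examples), "training examples")
print("label set:", labels)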

data_processing.bert.sentence_pair_processor module#

class data_processing.bert.sentence_pair_processor.SentencePairInstance[source]#

Bases: object

A single training (sentence-pair) instance.

Parameters
  • tokens (list) – List of tokens for sentence pair

  • segment_ids (list) – List of segment ids for sentence pair

  • masked_lm_positions (list) – List of masked lm positions for sentence pair

  • masked_lm_labels (list) – List of masked lm labels for sentence pair

  • is_random_next (bool) – Specifies whether the second element in the pair is random

__init__(tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next)[source]#
data_processing.bert.sentence_pair_processor.data_generator(metadata_files, vocab_file, do_lower, split_num, max_seq_length, short_seq_prob, mask_whole_word, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, min_short_seq_length=None, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, inverted_mask=False, seed=None, spacy_model='en_core_web_sm', input_files_prefix='', sop_labels=False)[source]#

Generator function used to create the input dataset for MLM + NSP pre-training.

1. Generate raw examples by concatenating two parts, tokens-a and tokens-b, as follows: [CLS] <tokens-a> [SEP] <tokens-b> [SEP], where:

  • tokens-a: a list of tokens taken from the current document, of random length (less than the maximum sequence length, msl).

  • tokens-b: a list of tokens chosen based on the randomly set next_sentence_labels, of length msl - len(<tokens-a>) - 3 (to account for the [CLS] and [SEP] tokens). If next_sentence_labels is 1 (set to 1 with probability 0.5), tokens-b is a list of tokens from sentences chosen randomly from a different document; otherwise, tokens-b is a continuation of tokens-a taken from the same document.

  The number of raw tokens also depends on short_seq_prob.

2. Mask the raw examples based on max_predictions_per_seq (a whole-word masking sketch follows this entry).

3. Pad the masked examples to max_sequence_length if shorter than the msl.

Parameters
  • metadata_files (str or list[str]) – A string or strings list each pointing to a metadata file. A metadata file contains file paths for flat text cleaned documents. It has one file path per line.

  • vocab_file (str) – Vocabulary file, to build tokenization from

  • do_lower (bool) – Boolean value indicating if words should be converted to lowercase or not

  • split_num (int) – Number of input files to read at a given time for processing.

  • max_seq_length (int) – Maximum length of the sequence to generate

  • short_seq_prob (int) – Probability of a short sequence. Defaults to 0. Sometimes we want to use shorter sequences to minimize the mismatch between pre-training and fine-tuning.

  • mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.

  • max_predictions_per_seq (int) – Maximum number of Masked tokens in a sequence

  • masked_lm_prob (float) – Proportion of tokens to be masked

  • dupe_factor (int) – Number of times to duplicate the dataset with different static masks

  • min_short_seq_length (int) – When short_seq_prob > 0, this number indicates the least number of tokens that each example should have, i.e., the number of tokens (excluding padding) would be in the range [min_short_seq_length, MSL]

  • output_type_shapes (dict) – Dictionary indicating the shapes of different outputs

  • multiple_docs_in_single_file (bool) – True, when a single text file contains multiple documents separated by <multiple_docs_separator>

  • multiple_docs_separator (str) – String which separates multiple documents in a single text file.

  • single_sentence_per_line – True when the document is already split into sentences, with one sentence per line, and no further sentence segmentation of the document is required

  • inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.

  • seed (int) – Random seed.

  • spacy_model – spaCy model to load, i.e. shortcut link, package name or path. Used to segment text into sentences.

  • input_files_prefix (str) – Prefix to be added to the paths of the input files.

  • sop_labels (bool) – If True, negative examples of the dataset will be two consecutive sentences in reversed order. Otherwise, uses regular NSP labels (where negative examples come from different documents).

Returns

yields training examples (feature, label)

where label refers to the next_sentence_prediction label
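Since this generator shares its masking options (masked_lm_prob, max_predictions_per_seq, mask_whole_word) with the other BERT processors, here is a minimal sketch of how whole-word masking candidates can be chosen. It relies on the WordPiece convention that continuation subtokens start with "##"; the function name and the selection policy are simplified illustrations, not the module's exact code.

import random

def choose_masked_positions(tokens, masked_lm_prob, max_predictions_per_seq,
                            mask_whole_word=True, rng=random):
    # Pick token positions to mask, optionally at whole-word granularity.
    # Group positions into candidate units: a whole word is a token plus
    # any following "##" continuation pieces.
    candidates = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if mask_whole_word and tok.startswith("##") and candidates:
            candidates[-1].append(i)
        else:
            candidates.append([i])

    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(tokens) * masked_lm_prob))))

    rng.shuffle(candidates)
    masked_positions = []
    for unit in candidates:
        if len(masked_positions) + len(unit) > num_to_mask:
            continue           # masking this unit would exceed the budget
        masked_positions.extend(unit)
        if len(masked_positions) >= num_to_mask:
            break
    return sorted(masked_positions)

tokens = ["[CLS]", "un", "##believ", "##able", "story", "[SEP]"]
print(choose_masked_positions(tokens, masked_lm_prob=0.15,
                              max_predictions_per_seq=20))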

Module contents#