data_processing.scripts.pubmed.preprocess package#

Submodules#

data_processing.scripts.pubmed.preprocess.Downloader module#

Wrapper script to download PubMed datasets.

Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT

class data_processing.scripts.pubmed.preprocess.Downloader.Downloader[source]#

Bases: object

Parameters
  • save_path – Location to download and extract the dataset

  • dataset – One of “pubmed_baseline”, “pubmed_daily_update”, “pubmed_fulltext”, “pubmed_open_access”

Extracts to save_path/extracted

__init__(dataset, save_path)[source]#
Parameters
  • save_path – Location to download and extract the dataset

  • dataset – One of “pubmed_baseline”, “pubmed_daily_update”, “pubmed_fulltext”, “pubmed_open_access”

Extracts to save_path/extracted

download()[source]#
download_files(url, dataset)[source]#
extract_files(dataset)[source]#
data_processing.scripts.pubmed.preprocess.Downloader.parse_args()[source]#
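The download-then-extract flow documented above (fetch compressed archives into save_path, then decompress them into save_path/extracted) can be sketched in a self-contained way. The helper name and file names below are illustrative, not the module's actual implementation, and the demo uses a locally created archive so no network access is needed:

```python
import gzip
import os
import shutil
import tempfile
from pathlib import Path

def extract_gz_files(save_path):
    """Decompress every .gz file in save_path into save_path/extracted,
    mirroring the 'Extracts to save_path/extracted' behaviour above."""
    extracted_dir = os.path.join(save_path, "extracted")
    os.makedirs(extracted_dir, exist_ok=True)
    for name in os.listdir(save_path):
        if name.endswith(".gz"):
            src = os.path.join(save_path, name)
            dst = os.path.join(extracted_dir, name[:-3])  # strip ".gz"
            with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)

# Demo with a locally created archive (hypothetical file name):
save_path = tempfile.mkdtemp()
with gzip.open(os.path.join(save_path, "pubmed21n0001.xml.gz"), "wb") as f:
    f.write(b"<PubmedArticleSet/>")
extract_gz_files(save_path)
extracted_text = Path(save_path, "extracted", "pubmed21n0001.xml").read_text()
```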

data_processing.scripts.pubmed.preprocess.TextFormatting module#

Script to format the PubMed fulltext (commercial-use collection) files and the abstracts from the PubMed Baseline and Update files

Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT

class data_processing.scripts.pubmed.preprocess.TextFormatting.TextFormatting[source]#

Bases: object

Parameters
  • pubmed_path (str) – Path to the folder containing PubMed files

  • output_filename (str) – Path of the txt file to be written

  • filesize_limit (Optional[int]) – Maximum size of each output text file

  • recursive (Optional[bool]) – If True, searches for nxml/xml files recursively within subfolders

__init__(pubmed_path, output_filename, filesize_limit=5000000000, recursive=False)[source]#
Parameters
  • pubmed_path (str) – Path to the folder containing PubMed files

  • output_filename (str) – Path of the txt file to be written

  • filesize_limit (Optional[int]) – Maximum size of each output text file

  • recursive (Optional[bool]) – If True, searches for nxml/xml files recursively within subfolders

merge(dataset_name)[source]#
merge_abstracts()[source]#
merge_fulltext()[source]#
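The filesize_limit parameter implies a rotation policy when merging many input files into output text files. A minimal, self-contained sketch of that logic, assuming the merger concatenates inputs into numbered parts and starts a new part once the current one would exceed the limit (the part-naming scheme is an assumption, not the class's actual behaviour):

```python
import os
import tempfile

def merge_with_limit(input_files, output_prefix, filesize_limit):
    """Concatenate input text files into numbered output parts,
    starting a new part once the current one would exceed
    filesize_limit bytes (simplified rollover policy)."""
    part, written = 0, 0
    out = open(f"{output_prefix}.{part}.txt", "w")
    for path in input_files:
        with open(path) as f:
            text = f.read()
        if written and written + len(text) > filesize_limit:
            out.close()
            part += 1
            written = 0
            out = open(f"{output_prefix}.{part}.txt", "w")
        out.write(text)
        written += len(text)
    out.close()
    return part + 1  # number of parts produced

# Demo: three 40-character "articles" with a 100-character limit.
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, f"in{i}.txt")
    with open(p, "w") as f:
        f.write("x" * 40)
    paths.append(p)
n_parts = merge_with_limit(paths, os.path.join(tmp, "merged"), 100)
```

With these sizes the first two articles fit in part 0 (80 characters) and the third rolls over into part 1.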

data_processing.scripts.pubmed.preprocess.TextSharding module#

Script to shard the dataset into separate train and test files

Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT

class data_processing.scripts.pubmed.preprocess.TextSharding.NLTKSegmenter[source]#

Bases: object

__init__()[source]#
segment_string(article)[source]#
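NLTKSegmenter presumably wraps NLTK's sentence tokenizer. A dependency-free stand-in with the same segment_string(article) shape can illustrate the interface; the regex split below is a naive assumption and far less robust than NLTK's trained punkt model:

```python
import re

class NaiveSegmenter:
    """Toy stand-in for NLTKSegmenter: split an article into sentences
    at sentence-ending punctuation followed by whitespace."""
    def segment_string(self, article):
        return [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]

seg = NaiveSegmenter()
sents = seg.segment_string("BERT was pretrained on PubMed. It works well! Does it scale?")
```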
class data_processing.scripts.pubmed.preprocess.TextSharding.Sharding[source]#

Bases: object

__init__(input_files, output_name_prefix, n_training_shards, n_test_shards, fraction_test_set)[source]#
distribute_articles_over_shards()[source]#
get_sentences_per_shard(shard)[source]#
init_output_files()[source]#
load_articles()[source]#
segment_articles_into_sentences(segmenter)[source]#
write_shards_to_disk()[source]#
write_single_shard(shard_name, shard, split)[source]#
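The Sharding constructor's parameters suggest the overall flow: hold back fraction_test_set of the articles for test, then distribute each pool over its shards. A self-contained sketch under that assumption; the round-robin dealing below is a simplification, and the real class may instead balance shards by sentence count:

```python
def split_into_shards(articles, n_training_shards, n_test_shards, fraction_test_set):
    """Split articles into a test pool and a training pool, then deal
    each pool round-robin over its shards (simplifying assumption)."""
    n_test = int(len(articles) * fraction_test_set)
    test_pool, train_pool = articles[:n_test], articles[n_test:]
    train_shards = [train_pool[i::n_training_shards] for i in range(n_training_shards)]
    test_shards = [test_pool[i::n_test_shards] for i in range(n_test_shards)]
    return train_shards, test_shards

articles = [f"article-{i}" for i in range(100)]
train, test = split_into_shards(
    articles, n_training_shards=8, n_test_shards=2, fraction_test_set=0.1
)
```

Every article lands in exactly one shard, and the shard counts match the constructor arguments.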

Module contents#