data_processing.scripts.pubmed.preprocess package#
Submodules#
data_processing.scripts.pubmed.preprocess.Downloader module#
Wrapper script to download PubMed datasets.
Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT
- class data_processing.scripts.pubmed.preprocess.Downloader.Downloader[source]#
Bases:
object
- Parameters
save_path – Location to download and extract the dataset
dataset – One of “pubmed_baseline”, “pubmed_daily_update”, “pubmed_fulltext”, “pubmed_open_access”
Extracts to save_path/extracted
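The documented layout (archives downloaded to save_path, contents extracted to save_path/extracted) and the four accepted dataset names can be sketched as follows. The `plan_download` helper is hypothetical, for illustration only; it is not part of the module's API:

```python
import os

# Dataset names accepted by the Downloader (from the docs above).
DATASETS = {
    "pubmed_baseline",
    "pubmed_daily_update",
    "pubmed_fulltext",
    "pubmed_open_access",
}


def plan_download(save_path, dataset):
    """Validate the dataset name and return the directory layout
    described above: archives land in save_path, and contents are
    extracted to save_path/extracted.

    Hypothetical helper, not the module's actual API.
    """
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return {
        "download_dir": save_path,
        "extract_dir": os.path.join(save_path, "extracted"),
    }
```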
data_processing.scripts.pubmed.preprocess.TextFormatting module#
Script to format PubMed fulltext (commercial) files and PubMed baseline and update abstracts
Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT
- class data_processing.scripts.pubmed.preprocess.TextFormatting.TextFormatting[source]#
Bases:
object
- Parameters
pubmed_path (str) – Path to folder containing PubMed files
output_filename (str) – Path of the output text file to be written
filesize_limit (Optional[int]) – Maximum size of each text file
recursive (Optional[bool]) – If True, searches for nxml/xml files recursively within subfolders
- __init__(pubmed_path, output_filename, filesize_limit=5000000000, recursive=False)[source]#
- Parameters
pubmed_path (str) – Path to folder containing PubMed files
output_filename (str) – Path of the output text file to be written
filesize_limit (Optional[int]) – Maximum size of each text file
recursive (Optional[bool]) – If True, searches for nxml/xml files recursively within subfolders
data_processing.scripts.pubmed.preprocess.TextSharding module#
Script to shard the corpus into separate train and test dataset files
Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT
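The train/test sharding described above can be sketched as a shuffled split over documents. The `shard_train_test` helper and its parameters are hypothetical illustrations; the actual TextSharding API is not shown here:

```python
import random


def shard_train_test(documents, test_fraction=0.1, seed=0):
    """Shuffle the documents with a fixed seed and split them into
    disjoint train and test sets.

    Hypothetical helper illustrating a train/test shard split, not
    the module's actual API.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    docs = list(documents)
    rng.shuffle(docs)
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]
```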