cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.SummarizationPreprocessor
- class cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.SummarizationPreprocessor
Methods
- add_token – Adds a token to the tokenizer.
- check_valid_doc
- clean_text
- create_dataset – Creates an HDF5 dataset from the given parameters.
- file_read_generator – Reads a file and generates its content.
- generate_sample
- get_tokenizable_columns
- get_vocab_size – Gets the tokenizer vocabulary size.
- parse_doc
- preprocessing_generator – Takes content read from files and generates samples.
- seed_runs – Sets the seed for a run based on the user-provided seed and rank.
- write_hdf5_file – Writes data to an HDF5 file.
- write_hdf5_files – Writes a list of files to HDF5.
Attributes
- num_features

 - file_read_generator(file)
- Reads a file and generates its content.
- Parameters
- file (str) – Path to the data file. 
- Returns
- A tuple of intermediate results read from the files. 
- Return type
- docs_read (tuple) 
 
 - preprocessing_generator(doc_obj)
- Takes content read from files and generates samples.
- Parameters
- doc_obj (tuple) – Result returned by file_read_generator (docs_read). 
- Returns
- One or more training samples. 
- Return type
- sample (np.array) 
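The two generators above are meant to be chained: each tuple produced by file_read_generator is passed to preprocessing_generator, which yields one or more samples. The following is a minimal sketch of that flow using standalone stand-in functions, not the Cerebras implementation. The JSON Lines input with prompt and completion keys, the toy word-length "tokenizer", the collect_samples helper, and the use of plain lists in place of np.array samples are all assumptions made for illustration.

```python
import json
from typing import Iterator, List, Tuple


def file_read_generator(path: str) -> Iterator[Tuple[str, str]]:
    """Yield one (prompt, completion) tuple per non-empty JSON line.

    Assumption: the input layout (JSON Lines with "prompt" and
    "completion" keys) is hypothetical and chosen only for this sketch.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield record["prompt"], record["completion"]


def preprocessing_generator(doc_obj: Tuple[str, str]) -> Iterator[List[int]]:
    """Turn one tuple from file_read_generator into zero or more samples.

    Assumption: a toy "tokenizer" that maps each whitespace-separated
    word to its length stands in for a real tokenizer.
    """
    prompt, completion = doc_obj
    tokens = [len(word) for word in f"{prompt} {completion}".split()]
    if tokens:
        yield tokens


def collect_samples(path: str) -> List[List[int]]:
    """Hypothetical helper: chain both generators over a single file."""
    samples: List[List[int]] = []
    for doc_obj in file_read_generator(path):
        samples.extend(preprocessing_generator(doc_obj))
    return samples
```

Keeping the reading stage and the sample-building stage as separate generators means a file never has to be held in memory in full, and either stage can be replaced independently.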
 
 - add_token(token)
- Adds a token to the tokenizer.
- Parameters
- token (str) – Token to be added to the tokenizer. 
 - create_dataset(params)
- Creates an HDF5 dataset from the given parameters.
- Parameters
- files (list) – List of files to process. 
- process_no (int) – Process ID. 
 
- Returns
- Dictionary containing the results of execution: the number of processed, discarded, and successful files, as well as the number of examples. 
 
 
 - get_vocab_size()
- Gets the tokenizer vocabulary size.
- Returns
- The size of the tokenizer vocabulary. 
- Return type
- vocab_size (int) 
 - seed_runs(rank=0)
- Sets the seed for a run based on the user-provided seed and rank.
- Parameters
- rank (int) – Rank to set, based on process number for execution. Defaults to 0. 
- Returns
- Object of type random.Random, with seed set. 
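As an illustration of per-rank seeding, here is a minimal sketch. The source states only that the seed is based on the user-provided seed and the rank. Adding the two values together is an assumption of this sketch, and make_rank_rng is a hypothetical helper name, not part of the class.

```python
import random


def make_rank_rng(seed: int, rank: int = 0) -> random.Random:
    """Return a random.Random instance seeded from a base seed and a rank.

    Assumption: the per-rank seed is the base seed offset by the rank,
    so every process gets a reproducible but distinct random stream.
    """
    rng = random.Random()
    rng.seed(seed + rank)
    return rng


# Same seed and rank reproduce the same stream.
first = make_rank_rng(1234, rank=1)
second = make_rank_rng(1234, rank=1)
same_stream = [first.random() for _ in range(3)] == [second.random() for _ in range(3)]
```

Returning a dedicated random.Random object, rather than seeding the module-level generator, keeps each process's shuffling independent of any other code that uses the random module.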
 
 - write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')
- Writes data to an HDF5 file.
- Parameters
- file_path (string) – HDF5 file path. 
- files (sequence) – List of lists containing tokenized data to write. 
- rng (random.Random obj) – Instance of random object, with states set. 
- n_examples (int) – Number of examples that will be written in the file. 
- chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking. 
- dtype (string) – Data type for the HDF5 dataset. 
- compression (string) – Compression strategy. 
 
 
 - write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)
- Writes a list of files to HDF5.
- Parameters
- files (sequence) – List of lists containing tokenized data to write. 
- start_number (int) – Continual count of HDF5 files written out. 
- write_remainder (bool) – Write out remaining data from files, if files per record is not met. Defaults to False. 
- process_number (int) – Process number for execution. Defaults to None. 
- rng (random.Random obj) – Instance of random object, with states set. Defaults to new instance created for write. 
 
- Returns
- start_number (int) – Continual count of HDF5 files written out. 
- remainder (list) – Remaining sequences not written out, if the length of files to write is greater than the files per record. 
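The interplay of start_number, write_remainder, and the returned remainder can be sketched in plain Python. This is not the library's code: plan_hdf5_writes is a hypothetical helper that only plans which sequences would go into which numbered output file, and files_per_record is an assumed parameter name for the "files per record" limit mentioned above. No HDF5 file is written.

```python
from typing import List, Sequence, Tuple

Batch = Tuple[int, List[list]]


def plan_hdf5_writes(
    files: Sequence[list],
    start_number: int,
    files_per_record: int,
    write_remainder: bool = False,
) -> Tuple[List[Batch], int, List[list]]:
    """Group tokenized sequences into numbered batches (hypothetical helper).

    Returns (batches, next_start_number, remainder), where each batch is
    (output file number, sequences that would go into that file).
    """
    if files_per_record <= 0:
        raise ValueError("files_per_record must be positive")

    batches: List[Batch] = []
    # Index up to which the sequences fill complete batches.
    full = len(files) - len(files) % files_per_record
    for i in range(0, full, files_per_record):
        batches.append((start_number, list(files[i : i + files_per_record])))
        start_number += 1

    remainder = list(files[full:])
    if write_remainder and remainder:
        # Flush the leftover sequences into one final, smaller batch.
        batches.append((start_number, remainder))
        start_number += 1
        remainder = []
    return batches, start_number, remainder
```

In this sketch, a caller would pass the returned count back in as start_number on the next call and prepend the returned remainder to the next list of sequences, setting write_remainder=True only for the final call.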