cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.VSLSummarizationPreprocessor

class cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.VSLSummarizationPreprocessor

Bases: cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.SummarizationPreprocessor

self.chunk_lengths stores a List(List(Tuple)): the outer list holds the chunks, each inner list holds the sequences of one chunk, and each tuple is a prompt + completion pair.
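As an illustration of that nesting, the following is a minimal sketch with made-up values. It assumes each tuple holds a (prompt length, completion length) pair measured in tokens, which the description above suggests but does not state outright; the `ChunkLengths` alias and the `tokens_per_chunk` helper are hypothetical and not part of the class.

```python
from typing import List, Tuple

# Assumption: each tuple is (prompt_length, completion_length) in tokens.
ChunkLengths = List[List[Tuple[int, int]]]

# Hypothetical data: two chunks, the first packing two sequences.
chunk_lengths: ChunkLengths = [
    [(12, 4), (7, 9)],  # chunk 0 holds two sequences
    [(20, 11)],         # chunk 1 holds a single sequence
]


def tokens_per_chunk(lengths: ChunkLengths) -> List[int]:
    """Sum prompt and completion lengths over every sequence in each chunk."""
    return [sum(p + c for p, c in chunk) for chunk in lengths]


totals = tokens_per_chunk(chunk_lengths)
print(totals)
```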

Methods

add_token

Add token to the tokenizer.

check_valid_doc

clean_text

create_dataset

Creates HDF5 dataset from given parameters.

file_read_generator

Read file and generate content.

generate_sample

get_tokenizable_columns

get_vocab_size

Get tokenizer vocabulary size.

init_prefix_toks

parse_doc

preprocessing_generator

Takes in content read from files and generates samples.

process_default_bos_token

process_doc

seed_runs

Set seed for run based on user provided seed and rank.

vsl_pack

Handles the packing of sequences together based on their length.

vsl_sample_generator

write_hdf5_file

Write data to HDF5 file.

write_hdf5_files

Writes a list of files to HDF5.

Attributes

num_features

use_vsl

__init__(params)

vsl_pack(doc_obj)

Handles the packing of sequences together based on their length. Relies on self.process_doc to calculate the lengths.
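The general idea of length-based packing can be sketched with a first-fit strategy. This is only an illustration under stated assumptions, not the algorithm vsl_pack actually uses: the function `pack_by_length`, its `max_seq_length` parameter, and the choice to skip oversized pairs are all hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # assumed to be (prompt_length, completion_length)


def pack_by_length(lengths: List[Pair], max_seq_length: int) -> List[List[Pair]]:
    """First-fit packing: place each pair in the first chunk with enough room."""
    chunks: List[List[Pair]] = []
    free: List[int] = []  # remaining token budget of each chunk
    for pair in lengths:
        size = pair[0] + pair[1]
        if size > max_seq_length:
            continue  # cannot fit in any chunk; skipped in this sketch
        for i, room in enumerate(free):
            if size <= room:
                chunks[i].append(pair)
                free[i] -= size
                break
        else:
            # No existing chunk had room, so open a new one.
            chunks.append([pair])
            free.append(max_seq_length - size)
    return chunks


packed = pack_by_length([(5, 3), (2, 2), (6, 1), (9, 4)], max_seq_length=12)
print(packed)
```

The result has the same outer-chunk, inner-sequence, tuple shape described for self.chunk_lengths, which is why first-fit was chosen for the illustration.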

preprocessing_generator(doc)

Takes in content read from files and generates samples.

Parameters

doc (tuple) – Return value of file_read_generator.

Returns

one or multiple training samples

Return type

sample (np.array)

add_token(token)

Add token to the tokenizer.

Parameters

token (str) – Token to be added to the tokenizer.

create_dataset(params)

Creates HDF5 dataset from given parameters.

Parameters
  • files (list) – List of files to process.

  • process_no (int) – Process ID.

Returns

Dictionary containing the results of execution, specifically the number of processed, discarded, and successful files, as well as the number of examples.

file_read_generator(file)

Read file and generate content.

Parameters

file (str) – Path to data file.

Returns

a tuple of intermediate results read from files

Return type

docs_read (tuple)

get_vocab_size()

Get tokenizer vocabulary size.

Returns

Size of the tokenizer vocabulary.

Return type

vocab_size (int)

seed_runs(rank=0)

Set seed for run based on user provided seed and rank.

Parameters

rank (int) – Rank to set, based on process number for execution. Defaults to 0.

Returns

Object of type random.Random, with seed set.
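A per-rank seeded generator of this kind might look like the sketch below. How seed_runs actually combines the user seed with the rank is not documented here, so the `seed + rank` offset and the helper name `make_rank_rng` are assumptions made for illustration only.

```python
import random
from typing import Optional


def make_rank_rng(seed: Optional[int], rank: int = 0) -> random.Random:
    """Return a random.Random seeded from a base seed offset by rank (assumed scheme)."""
    rng = random.Random()
    if seed is not None:
        # Assumption: offsetting by rank gives each process its own stream.
        rng.seed(seed + rank)
    return rng


a = make_rank_rng(seed=1234, rank=0)
b = make_rank_rng(seed=1234, rank=0)
c = make_rank_rng(seed=1234, rank=1)

# Two generators built from the same seed and rank produce the same stream.
same = a.random() == b.random()
print(same)
```

Returning a dedicated random.Random instance, rather than seeding the global module state, keeps each process's shuffling reproducible without affecting other users of the random module.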

write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')

Write data to HDF5 file.

Parameters
  • file_path (string) – HDF5 file path.

  • files (sequence) – List of lists containing tokenized data to write.

  • rng (random.Random obj) – Instance of random object, with states set.

  • n_examples (int) – Number of examples that will be written in the file.

  • chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.

  • dtype (string) – Data type for the HDF5 dataset.

  • compression (string) – Compression strategy.

write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)

Writes a list of files to HDF5.

Parameters
  • files (sequence) – List of lists containing tokenized data to write.

  • start_number (int) – Continual count of HDF5 files written out.

  • write_remainder (bool) – Write out remaining data from files, if files per record is not met. Defaults to False.

  • process_number (int) – Process number for execution. Defaults to None.

  • rng (random.Random obj) – Instance of random object, with states set. Defaults to new instance created for write.

Returns

  • start_number (int) – Continual count of HDF5 files written out.

  • remainder (list) – Remaining sequences not written out, if the length of files to write is greater than the files per record.
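The interplay between the running file count, write_remainder, and the returned remainder can be sketched in plain Python. This is a simplified model, not the method's implementation: no HDF5 file is written, each "record" is just a list, and the names `split_into_records` and `per_record` are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_into_records(
    items: Sequence,
    per_record: int,
    start_number: int,
    write_remainder: bool = False,
) -> Tuple[int, list, List[list]]:
    """Group items into full records; return (next_number, remainder, records)."""
    # Number of items that fill complete records.
    full = len(items) - len(items) % per_record
    records = [list(items[i:i + per_record]) for i in range(0, full, per_record)]
    remainder = list(items[full:])
    if write_remainder and remainder:
        # Flush the leftover items as one final, smaller record.
        records.append(remainder)
        remainder = []
    # Each record stands in for one output file, so the counter advances by one per record.
    return start_number + len(records), remainder, records


next_number, remainder, records = split_into_records(
    list(range(7)), per_record=3, start_number=10
)
print(next_number, remainder, records)
```

In this model, a caller would pass the returned remainder back in with the next batch of items, and set write_remainder only on the final call so that no data is dropped.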