cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.LlavaPhaseOnePreprocessor#
- class cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors.LlavaPhaseOnePreprocessor[source]#
Methods
add_token – Add token to the tokenizer.
check_valid_doc
clean_text
collect_image_patch_start_idx
create_dataset – Creates HDF5 dataset from given parameters.
file_read_generator – Reads a file and generates its content.
generate_sample
get_tokenizable_columns
get_vocab_size – Get the tokenizer vocabulary size.
parse_doc
preprocessing_generator – Takes in content read from files and generates samples.
process_default_bos_token
process_doc
seed_runs – Set the seed for the run based on the user-provided seed and rank.
write_hdf5_file – Write data to an HDF5 file.
write_hdf5_files – Writes a list of files to HDF5.
Attributes
num_features
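Taken together, these methods form a read–tokenize–write pipeline. A minimal usage sketch of the generator pair follows; the constructor argument, params contents, and file path are illustrative assumptions rather than a verified schema:

from cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_dataset_preprocessors import (
    LlavaPhaseOnePreprocessor,
)

params = {}  # dataset/tokenizer configuration; the exact schema is assumed, not verified
preprocessor = LlavaPhaseOnePreprocessor(params)

# file_read_generator yields one tuple per document read from the file;
# preprocessing_generator turns each tuple into one or more np.array samples.
for doc_obj in preprocessor.file_read_generator("data/shard_00.jsonl"):  # hypothetical path
    for sample in preprocessor.preprocessing_generator(doc_obj):
        print(sample.shape)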
- preprocessing_generator(doc_obj)[source]#
Takes in content read from files and generates samples.
- Parameters
doc_obj (tuple) – results returned by file_read_generator.
- Returns
one or multiple training samples
- Return type
sample (np.array)
- add_token(token)#
Add token to the tokenizer.
- Parameters
token (str) – token to be added to the tokenizer.
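Continuing the sketch above, a hedged example of growing the vocabulary with a special token (the token string is illustrative, not a documented constant):

before = preprocessor.get_vocab_size()
preprocessor.add_token("<image>")  # hypothetical placeholder token
print(f"vocab size: {before} -> {preprocessor.get_vocab_size()}")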
- create_dataset(params)#
Creates HDF5 dataset from given parameters.
- Parameters
files (list) – List of files to process.
process_no (int) – Process ID.
- Returns
Dictionary containing the results of execution, specifically the number of processed, discarded, and successful files, as well as the number of examples.
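Because the documented parameters are a files list plus a process id while the signature takes a single params argument, params is presumably the packed per-worker work item; the tuple packing below is an assumption. Continuing the sketch above:

import glob

files = sorted(glob.glob("raw_data/*.jsonl"))  # hypothetical input shards
stats = preprocessor.create_dataset((files, 0))  # process id 0; packing is assumed
print(stats)  # processed/discarded/successful file counts plus example count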
- file_read_generator(file)#
Reads a file and generates its content.
- Parameters
file (str) – path to the data file.
- Returns
a tuple of intermediate results read from files
- Return type
docs_read (tuple)
- get_vocab_size()#
Get the tokenizer vocabulary size.
- Returns
Size of the tokenizer vocabulary.
- Return type
vocab_size (int)
- seed_runs(rank=0)#
Set the seed for the run based on the user-provided seed and rank.
- Parameters
rank (int) – Rank to set, based on process number for execution. Defaults to 0.
- Returns
Object of type random.Random, with seed set.
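Continuing the sketch above: seeding by rank gives each worker a reproducible but distinct random stream:

rng = preprocessor.seed_runs(rank=0)  # returns a seeded random.Random instance
print(rng.random())  # deterministic for a fixed user seed and rank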
- write_hdf5_file(file_path, files, rng, n_examples, chunks, dtype='i4', compression='gzip')#
Write data to an HDF5 file.
- Parameters
file_path (string) – HDF5 file path.
files (sequence) – List of lists containing tokenized data to write.
rng (random.Random obj) – Instance of a random.Random object, with its state set.
n_examples (int) – Number of examples that will be written in the file.
chunks (tuple or bool) – Chunk shape, or True to enable auto-chunking.
dtype (string) – Data type for the HDF5 dataset.
compression (string) – Compression strategy.
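For orientation, these parameters map directly onto h5py's create_dataset call. The sketch below shows the equivalent raw h5py pattern rather than this method's actual implementation; the dataset name, shape, and output path are assumptions:

import h5py
import numpy as np

examples = np.zeros((8, 2, 2048), dtype="i4")  # placeholder tokenized data
with h5py.File("out/data-0.h5", "w") as h5f:
    h5f.create_dataset(
        "data",               # dataset name is an assumption
        data=examples,
        chunks=(1, 2, 2048),  # chunk shape; pass True for auto-chunking
        dtype="i4",
        compression="gzip",
    )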
- write_hdf5_files(files, start_number, write_remainder=False, process_number=None, rng=<random.Random object>)#
Writes a list of files to HDF5.
- Parameters
files (sequence) – List of lists containing tokenized data to write.
start_number (int) – Continual count of HDF5 files written out.
write_remainder (bool) – Write out remaining data from files if the files-per-record count is not met. Defaults to False.
process_number (int) – Process number for execution. Defaults to None.
rng (random.Random obj) – Instance of a random.Random object, with its state set. Defaults to a new instance created for the write.
- Returns
Continual count of HDF5 files written out.
remainder (list): Remaining sequences not written out, if the number of files to write is greater than the files per record.
- Return type
start_number (int)
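A hedged sketch of the batching contract: carry the running file count across calls and flush leftovers with write_remainder on the final call. The buffer contents are placeholders, and whether the remainder list is also returned (as the Returns section suggests) is not verified here. Continuing the sketch above:

rng = preprocessor.seed_runs(rank=0)
buffered = [[[101, 7592, 2088, 102]]]  # placeholder tokenized sequences
start_number = preprocessor.write_hdf5_files(buffered, 0, rng=rng)
# After the final batch, flush sequences that don't fill a full record:
start_number = preprocessor.write_hdf5_files(
    [], start_number, write_remainder=True, rng=rng
)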