modelzoo.transformers.data_processing.slimpajama.preprocessing.datasets.RedPajamaGithubDataset
- class modelzoo.transformers.data_processing.slimpajama.preprocessing.datasets.RedPajamaGithubDataset
 Bases: modelzoo.transformers.data_processing.slimpajama.preprocessing.datasets.DatasetMethods
already_shuffled: Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
Path to the directory
documents: A generator producing all documents in the dataset.
Human-readable name of the dataset
num_docs
num_duplicate_docs
num_short_docs
short_documents_path: Path to the file with short documents
Return an estimate of the dataset size.
size_duplicate_docs
size_short_docs
stem_dir_path
- already_shuffled()
 Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
- documents(process_id, n_process, dup_sh, short_sh)
 A generator producing all documents in the dataset.
- short_documents_path()
 Path to the file with short documents
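
To make the relationship between `already_shuffled()` and `documents(process_id, n_process, dup_sh, short_sh)` more concrete, here is a minimal, self-contained sketch. It does not use the modelzoo implementation: `ToyDataset` and `collect` are hypothetical names, the in-memory `files` mapping stands in for files on disk, and the interpretation of `dup_sh` and `short_sh` as mappings from a file name to the set of document indices to skip is an assumption, since this page does not define those parameters. The round-robin split of files across processes is likewise only one plausible way to use `process_id` and `n_process`.

```python
import random
from typing import Dict, Iterator, List, Set


class ToyDataset:
    """Hypothetical stand-in for a DatasetMethods-style class (not the modelzoo code)."""

    def __init__(self, files: Dict[str, List[str]], preshuffled: bool = False):
        # Maps a file name to the documents it holds (kept in memory for the sketch).
        self._files = files
        self._preshuffled = preshuffled

    def already_shuffled(self) -> bool:
        # Return True when the source is already shuffled, so callers skip shuffling.
        return self._preshuffled

    def documents(
        self,
        process_id: int,
        n_process: int,
        dup_sh: Dict[str, Set[int]],
        short_sh: Dict[str, Set[int]],
    ) -> Iterator[str]:
        # Assumption: dup_sh / short_sh map file name -> document indices to skip.
        for i, name in enumerate(sorted(self._files)):
            # Assumption: files are dealt round-robin to the worker processes.
            if i % n_process != process_id:
                continue
            skip = dup_sh.get(name, set()) | short_sh.get(name, set())
            for doc_id, text in enumerate(self._files[name]):
                if doc_id not in skip:
                    yield text


def collect(dataset: ToyDataset, n_process: int, dup_sh, short_sh, seed: int = 0) -> List[str]:
    """Gather every worker's documents, shuffling only if the source is not pre-shuffled."""
    docs: List[str] = []
    for pid in range(n_process):
        docs.extend(dataset.documents(pid, n_process, dup_sh, short_sh))
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs


files = {"a.jsonl": ["a0", "a1", "a2"], "b.jsonl": ["b0", "b1"], "c.jsonl": ["c0"]}
dup_sh = {"a.jsonl": {1}}    # treat document 1 of a.jsonl as a duplicate
short_sh = {"b.jsonl": {0}}  # treat document 0 of b.jsonl as too short
preshuffled = ToyDataset(files, preshuffled=True)
print(collect(preshuffled, n_process=2, dup_sh=dup_sh, short_sh=short_sh))
```

Because `already_shuffled()` returns True for `preshuffled`, `collect` keeps the documents in the order the workers produced them. With `preshuffled=False` the same documents would come back in a seeded random order.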