cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaC4Dataset
- class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaC4Dataset(input_dir)
Bases:
cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.DatasetMethods
Methods:
  already_shuffled: Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
  dir_path
  documents: A generator producing all documents in the dataset.
  name
  num_docs
  num_duplicate_docs
  num_short_docs
  short_documents_path: Path to the file with short documents.
  size
  size_duplicate_docs
  size_short_docs
  stem_dir_path
- already_shuffled()
Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
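The override pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: `DatasetMethodsSketch`, `PreShuffledDataset`, and `load_documents` are hypothetical stand-ins, and the assumption that the base class returns False by default is inferred from the description rather than confirmed by the source.

```python
import random


class DatasetMethodsSketch:
    """Hypothetical stand-in for the DatasetMethods base class."""

    def already_shuffled(self):
        # Assumed default: the source is not pre-shuffled.
        return False

    def documents(self):
        raise NotImplementedError


class PreShuffledDataset(DatasetMethodsSketch):
    """Hypothetical dataset whose source is already shuffled."""

    def __init__(self, docs):
        self._docs = list(docs)

    def already_shuffled(self):
        # Override so that callers skip the extra shuffle.
        return True

    def documents(self):
        yield from self._docs


def load_documents(dataset, seed=0):
    """Collect documents, shuffling only if the dataset asks for it."""
    docs = list(dataset.documents())
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs


# The pre-shuffled dataset keeps its original order.
ordered = load_documents(PreShuffledDataset(["a", "b", "c"]))
```

Keeping the decision in a method on the dataset lets the calling pipeline treat every dataset uniformly while each source declares its own shuffling state.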
- documents(process_id, n_process, dup_sh, short_sh)
A generator producing all documents in the dataset.
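The parameter names `process_id` and `n_process` suggest that each worker process consumes only its own share of the documents. The page does not document this, so the sketch below is an assumption: it shows one common way to shard a generator round-robin across workers. `shard_documents` is a hypothetical helper, and `dup_sh` and `short_sh` are left out because their meaning is not described here.

```python
def shard_documents(all_docs, process_id, n_process):
    """Yield only the items assigned to this worker (round-robin).

    Assumed semantics: worker `process_id` out of `n_process` workers
    takes every n_process-th document, starting at its own index.
    """
    for index, doc in enumerate(all_docs):
        if index % n_process == process_id:
            yield doc


docs = ["doc0", "doc1", "doc2", "doc3", "doc4"]

# Two workers split the five documents between them without overlap.
shards = [list(shard_documents(docs, pid, 2)) for pid in range(2)]
```

Because the result is a generator, each worker can stream its share without holding the full dataset in memory.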
- short_documents_path()
Path to the file with short documents.