cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication
- class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication(datasets, duplicates, short_docs)
Bases:
cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.DatasetMethods
already_shuffled: Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
dir_path: Path to the directory
documents
name
num_docs: Return an estimate of the number of documents in the dataset.
sample_documents
short_documents_path: Path to the file with short documents
size
- num_docs()
Return an estimate of the number of documents in the dataset. Implementations may use a faster, less accurate estimate.
- already_shuffled()
Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
- dir_path()
Path to the directory
- short_documents_path()
Path to the file with short documents
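The override contract described for num_docs() and already_shuffled() can be illustrated with a minimal, self-contained sketch. It does not import or reproduce the modelzoo classes: DatasetSketch, PreShuffledSketch, and maybe_shuffle are hypothetical names invented for this example, and the assumption that documents() yields the dataset's documents one at a time is not confirmed by this page.

```python
import random
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Hypothetical stand-in for the documented interface (not the modelzoo class)."""

    @abstractmethod
    def documents(self):
        """Yield the dataset's documents (assumed behavior)."""

    def num_docs(self):
        # Exact count by default; subclasses may return a faster,
        # less accurate estimate instead.
        return sum(1 for _ in self.documents())

    def already_shuffled(self):
        # Default: the source is not shuffled, so callers should shuffle it.
        return False


class PreShuffledSketch(DatasetSketch):
    """Hypothetical dataset whose source is already shuffled."""

    def __init__(self, docs, approx_count):
        self._docs = docs
        self._approx_count = approx_count

    def documents(self):
        yield from self._docs

    def num_docs(self):
        # Cheap estimate supplied up front instead of iterating every document.
        return self._approx_count

    def already_shuffled(self):
        # Override so that the data isn't shuffled again.
        return True


def maybe_shuffle(dataset, rng):
    """Hypothetical caller: shuffle only when the dataset reports it is needed."""
    docs = list(dataset.documents())
    if not dataset.already_shuffled():
        rng.shuffle(docs)
    return docs


ds = PreShuffledSketch(["a", "b", "c"], approx_count=3)
order = maybe_shuffle(ds, random.Random(0))  # original order is preserved
```

In this sketch the caller consults already_shuffled() before doing any work, which is one plausible way a preprocessing pipeline could honor the flag; how the real pipeline uses it is not stated here.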