cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication
- class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication
- Bases: cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset
- Methods:
   - already_shuffled: Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
   - dir_path: Path to the directory
   - A generator producing all documents in the dataset.
   - Human-readable name of the dataset
   - num_docs: Return an estimate of the dataset number of documents.
   - sample_documents
   - short_documents_path: Path to the file with short documents
   - size: Return an estimate of the dataset size.
 - size()
- Return an estimate of the dataset size. Implementations may use a faster, less accurate estimate. 
 - num_docs()
- Return an estimate of the dataset number of documents. Implementations may use a faster, less accurate estimate. 
 - already_shuffled()
- Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again. 
 - dir_path()
- Path to the directory 
 - short_documents_path()
- Path to the file with short documents
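
The method contracts above can be illustrated with a minimal, self-contained sketch. It does not import the Cerebras package: `SketchDataset` is a hypothetical stand-in for the `Dataset` base class, and `PreShuffledDataset`, `AVG_DOC_BYTES`, and `needs_shuffle` are names invented for this example. The byte-based `size()` and the division-based `num_docs()` are assumptions about what a "faster, less accurate estimate" might look like, not the library's actual implementation.

```python
import os
import tempfile


class SketchDataset:
    """Hypothetical stand-in for the Dataset base class (not the real one)."""

    def already_shuffled(self):
        # Default: the source is not shuffled, so a pipeline would shuffle it.
        return False

    def size(self):
        raise NotImplementedError

    def num_docs(self):
        raise NotImplementedError


class PreShuffledDataset(SketchDataset):
    """Illustrative subclass whose source data is assumed to be shuffled."""

    # Assumed average document size in bytes, used only for this sketch.
    AVG_DOC_BYTES = 100

    def __init__(self, dir_path):
        self._dir_path = dir_path

    def dir_path(self):
        # Path to the directory holding the dataset files.
        return self._dir_path

    def already_shuffled(self):
        # Override to return True so the data is not shuffled again.
        return True

    def size(self):
        # Cheap estimate: total bytes of the regular files in the directory.
        total = 0
        for name in os.listdir(self._dir_path):
            path = os.path.join(self._dir_path, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
        return total

    def num_docs(self):
        # Faster, less accurate estimate: bytes divided by an assumed average.
        return self.size() // self.AVG_DOC_BYTES


def needs_shuffle(dataset):
    # A caller can consult already_shuffled() to skip a redundant shuffle.
    return not dataset.already_shuffled()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "part-0.jsonl"), "wb") as handle:
            handle.write(b"x" * 1000)
        dataset = PreShuffledDataset(tmp)
        print(dataset.size(), dataset.num_docs(), needs_shuffle(dataset))
```

Estimating from file sizes avoids reading every document, which is the kind of trade-off the `size()` and `num_docs()` descriptions allow; a real subclass could return exact counts instead if they are known in advance.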