cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaCommonCrawlDataset
- class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaCommonCrawlDataset
Bases:
cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset
Methods
Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
Path to the directory
A generator producing all documents in the dataset.
Human-readable name of the dataset
num_docs
num_duplicate_docs
num_short_docs
Path to the file with short documents
Return an estimate of the dataset size.
size_duplicate_docs
size_short_docs
stem_dir_path
- size()
Return an estimate of the dataset size. Implementations may use a faster, less accurate estimate.
- already_shuffled()
Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
- documents(process_id, n_process, dup_sh, short_sh)
A generator producing all documents in the dataset.
- short_documents_path()
Path to the file with short documents
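The methods above describe an interface: a size estimate, a flag saying whether the source is already shuffled, and a generator that yields documents for one worker process. The following is a minimal sketch of that shape, not the Cerebras implementation. The class `ToyJsonlDataset`, the `.jsonl` layout with a `text` field, and the round-robin assignment of files to processes are all assumptions made for illustration. The `dup_sh` and `short_sh` parameters are not described here, so the sketch leaves them out.

```python
import json
import os
import tempfile
from typing import Iterator, List


class ToyJsonlDataset:
    """Hypothetical stand-in; NOT part of cerebras.modelzoo."""

    def __init__(self, dir_path: str):
        self._dir = dir_path

    def _files(self) -> List[str]:
        # Sort so every process sees the same file order.
        return sorted(
            os.path.join(self._dir, f)
            for f in os.listdir(self._dir)
            if f.endswith(".jsonl")
        )

    def already_shuffled(self) -> bool:
        # Return True only if the source is known to be pre-shuffled.
        return False

    def size(self) -> int:
        # Fast estimate: bytes on disk rather than decoded text length.
        return sum(os.path.getsize(p) for p in self._files())

    def documents(self, process_id: int, n_process: int) -> Iterator[str]:
        # Assumed sharding scheme: file i goes to process i % n_process.
        for i, path in enumerate(self._files()):
            if i % n_process != process_id:
                continue
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        yield json.loads(line)["text"]


def demo() -> dict:
    # Build four one-document files and split them across two workers.
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(4):
            path = os.path.join(tmp, f"part{i}.jsonl")
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(json.dumps({"text": f"doc {i}"}) + "\n")
        ds = ToyJsonlDataset(tmp)
        return {
            "worker0": list(ds.documents(0, 2)),
            "worker1": list(ds.documents(1, 2)),
            "size_positive": ds.size() > 0,
            "shuffled": ds.already_shuffled(),
        }


if __name__ == "__main__":
    print(demo())
```

With two workers, each file is read by exactly one of them, so the union of both generators covers the dataset without overlap. Estimating size from file sizes avoids reading any document, which matches the note that implementations may use a faster, less accurate estimate.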