cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaWikipediaDataset
- class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaWikipediaDataset
- Bases: cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset
- Methods
   - already_shuffled: Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
   - Path to the directory
   - documents: A generator producing all documents in the dataset.
   - Human-readable name of the dataset
   - num_docs
   - num_duplicate_docs
   - num_short_docs
   - short_documents_path: Path to the file with short documents
   - size: Return an estimate of the dataset size.
   - size_duplicate_docs
   - size_short_docs
   - stem_dir_path
 - size()
- Return an estimate of the dataset size. Implementations may use a faster, less accurate estimate. 
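One way to picture a "faster, less accurate estimate" is to sum on-disk file sizes instead of reading every document. The sketch below is only an illustration of that trade-off: the helper name `estimate_size_bytes` and the file layout are hypothetical and are not taken from the Cerebras implementation.

```python
import os
import tempfile


def estimate_size_bytes(dir_path):
    """Hypothetical fast estimate: sum the on-disk sizes of all files under dir_path."""
    total = 0
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Small demonstration on a temporary directory with two files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.jsonl"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(tmp, "b.jsonl"), "wb") as f:
        f.write(b"y" * 5)
    demo_size = estimate_size_bytes(tmp)
```

An estimate like this counts stored bytes (including any compression or markup), so it can differ from the size of the decoded text.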
 - already_shuffled()
- Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again. 
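The override pattern described here can be sketched as follows. `BaseDataset`, `PreShuffledDataset`, and `load_in_training_order` are hypothetical stand-ins written for this example; they are not the real `Dataset` class or any function from the Cerebras code, and the real `documents` method takes additional arguments.

```python
import random


class BaseDataset:
    """Hypothetical stand-in for a dataset base class."""

    def already_shuffled(self):
        # Default: assume the source still needs shuffling.
        return False

    def documents(self):
        raise NotImplementedError


class PreShuffledDataset(BaseDataset):
    """Hypothetical dataset whose source is shuffled upstream."""

    def __init__(self, docs):
        self._docs = list(docs)

    def already_shuffled(self):
        # Override so that consumers skip a second shuffle.
        return True

    def documents(self):
        yield from self._docs


def load_in_training_order(dataset, seed=0):
    """Shuffle documents only when the dataset has not done so already."""
    docs = list(dataset.documents())
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs
```

With this arrangement, a dataset that returns True keeps its original document order.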
 - documents(process_id, n_process, dup_sh, short_sh)
- A generator producing all documents in the dataset. 
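The page does not describe the parameters. As a rough sketch only, assuming `process_id` is a zero-based worker index and `n_process` is the total number of workers, a generator might split documents round-robin as shown below. The function `sharded_documents` is hypothetical, and `dup_sh` and `short_sh` are left out because their meaning is not stated here.

```python
def sharded_documents(all_docs, process_id, n_process):
    """Hypothetical sketch: yield only the documents assigned to one worker.

    Assumes process_id is in range(n_process); documents are assigned
    round-robin by their position in all_docs.
    """
    for index, doc in enumerate(all_docs):
        if index % n_process == process_id:
            yield doc
```

Under this assumption every document is produced by exactly one worker, so the shards together cover the whole dataset without overlap.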
 - short_documents_path()
- Path to the file with short documents.