Preprocessing Scripts
- Using Hugging Face datasets for auto-regressive LM
 - Creating HDF5 dataset for GPT models
 - Generating HDF5 data for GPT-style models using data chunk preprocessing
 - Data pre-processing pipeline
 - Online Shuffling in HDF5 File Storage
 - Output files structure
 - Implementation notes
 - Shuffling Samples for HDF5 dataset of GPT Models
 - Step by step guide to pre-process SlimPajama