Data preprocessing scripts
- Using Hugging Face datasets for auto-regressive LM
- Creating HDF5 dataset for GPT models using chunk data preprocessing
- Data pre-processing pipeline
- Online Shuffling in HDF5 File Storage
- Output files structure
- Implementation notes
- Creating HDF5 dataset for GPT models
- Shuffling Samples for HDF5 dataset of GPT models
- Optimizing SlimPajama dataset pre-processing