Creating HDF5 dataset for GPT models#
Overview#
We provide two methods to generate Hierarchical Data Format (HDF5) files (.h5) that you can use in the input pipeline to implement an efficient data loader for GPT-style models.
If you have a PyTorch dataset and need to convert it to HDF5 format, follow the section Converting a PyTorch dataset to an HDF5 format.
If you have raw data and want to convert it to an HDF5 dataset, follow the section Generating HDF5 data from raw data of this document.
Converting a PyTorch dataset to an HDF5 format#
Suppose you have a PyTorch dataset for GPT models (from sources such as HuggingFace, or a map-style or iterable dataset). In that case, you can easily write the samples of your dataset in HDF5 format for use with the Cerebras-optimized HDF5 DataProcessor. To do so, call the function convert_dataset_to_HDF5(), which is defined in convert_dataset_to_HDF5.py.
The function convert_dataset_to_HDF5() uses a PyTorch DataLoader to fetch samples from the specified dataset and writes those samples to .h5 files. The following table explains the arguments to the convert_dataset_to_HDF5() function:
Table 1: convert_dataset_to_HDF5 Arguments#
Argument | Default Value | Description
---|---|---
dataset | N/A | PyTorch dataset to fetch the data from (IterableDataset or Dataset).
output_dir | ./hdf5_dataset/ | Directory where the HDF5 files will be stored.
name | dataset-partition | Name of the dataset; i.e., prefix to use for HDF5 file names.
samples_per_file | 2000 | Number of samples written to each HDF5 file.
num_workers | 8 | Number of Python processes to use for generating data.
batch_size | 64 | The batch size to use when fetching the data.
data_collator | N/A | Merges a list of samples to form a mini-batch of Tensor(s).
dtype | i4 | Data type for the HDF5 dataset.
compression | gzip | HDF5 compression strategy.
While the function convert_dataset_to_HDF5() is generic and can be used with all transformer models, note that the PyTorch dataset features dictionary should have the following key/value pairs for GPT models:
- input_ids: Input token IDs, padded with 0 to max_sequence_length.
  - Shape: (batch_size, max_sequence_length)
  - Type: torch.int32
- attention_mask: Mask for padded positions. Has values 0 on the padded positions and 1 elsewhere.
  - Shape: (batch_size, max_sequence_length)
  - Type: torch.int32
- labels: Labels for the language modeling pre-training task, padded with 0 to max_sequence_length.
  - Shape: (batch_size, max_sequence_length)
  - Type: torch.int32
NOTE: For more information on using HuggingFace datasets, refer to Using HuggingFace datasets for auto-regressive LM.
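For illustration, a minimal sketch of such a map-style dataset, followed by a call to convert_dataset_to_HDF5, might look like the following. The ToyGPTDataset class, its pre-tokenized token_id_lists input, and the output path are hypothetical placeholders, and data_collator is omitted on the assumption that PyTorch's default collation of fixed-length tensors is sufficient for this toy case:

import torch
from torch.utils.data import Dataset

from modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
    convert_dataset_to_HDF5,
)


class ToyGPTDataset(Dataset):
    """Hypothetical map-style dataset emitting the three required features."""

    def __init__(self, token_id_lists, max_sequence_length=128):
        self.token_id_lists = token_id_lists
        self.msl = max_sequence_length

    def __len__(self):
        return len(self.token_id_lists)

    def __getitem__(self, idx):
        ids = self.token_id_lists[idx][: self.msl]
        pad = self.msl - len(ids)
        return {
            # Input token IDs, padded with 0 to max_sequence_length.
            "input_ids": torch.tensor(ids + [0] * pad, dtype=torch.int32),
            # 1 on real tokens, 0 on padded positions.
            "attention_mask": torch.tensor([1] * len(ids) + [0] * pad, dtype=torch.int32),
            # Labels for the LM task, also padded with 0; how they are derived
            # from input_ids (e.g. shifting) depends on your training setup.
            "labels": torch.tensor(ids + [0] * pad, dtype=torch.int32),
        }


# token_id_lists would normally come from your own tokenization step.
dataset = ToyGPTDataset(token_id_lists=[[5, 17, 42], [8, 3]], max_sequence_length=128)
convert_dataset_to_HDF5(
    dataset=dataset,
    output_dir="./toy_hdf5_dataset/",
    num_workers=2,
)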
Generating HDF5 data from raw data#
We currently offer three modes for generating HDF5 files:
- LMData: for processing language modeling datasets in .jsonl or .txt format.
- Summarization: for processing fine-tuning datasets in .txt format.
- Customize: for any dataset format, but requires a module specifying how to read the raw dataset files.
Run the create_hdf5_dataset.py script with the appropriate subcommand (mode) to generate the .h5 files for GPT-style models. Each subcommand takes a set of arguments, which are described below in the Generating HDF5 files section.
Set up environment#
The following setup is needed to enable a clean run of the script.
NOTE: This assumes commands are run from the following directory:
<modelzoo_path>/modelzoo/transformers/data_processing/scripts/hdf5_preprocessing
For setting up a Python virtual environment:
python -m venv data_env
source ./data_env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Ignore error messages such as:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed
as they shouldn’t affect the rest of the steps.
Input files format#
Ensure the input text documents are in a specific file format before utilizing the provided script, except for the Customize
mode. The acceptable file formats are '.jsonl', '.json.gz', '.jsonl.zst', '.jsonl.zst.tar', '.txt'
. These files should have the data in a specific structure described in data format section.
To process the files optimally, we recommend that each file in any of the above formats besides .txt contain enough text; the recommended size for each file is on the order of GBs. Conversely, if you are processing smaller files in .txt format, provide a metadata file containing a list of paths to these files to better leverage multiprocessing.
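A metadata file is assumed here to be a plain text file listing one input file path per line, for example (hypothetical paths):

/path/to/dataset/docs_part_00.txt
/path/to/dataset/docs_part_01.txt
/path/to/dataset/docs_part_02.txt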
Input data format#
As mentioned above, the preprocessing script accepts two primary input files: .json
based or .txt
based. The input data must follow a specific structure for each type to be accurately converted into hdf5
files.
Format for jsonl files#
The raw text and metadata for generation should be represented in the jsonl-based files as:
{"text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}
For the jsonl files, as shown above, the raw text is extracted from the key text in the input files by default. If your input files do not contain a text key, determine the key corresponding to the text you need to extract, and pass it through the command line argument --jsonl_key=<your key name>.
For example, if your jsonl files have content such as the following:
{"idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}
then you’d need to pass --jsonl_key=idea in the command line arguments.
Format for txt based files in LMData mode#
Always represent raw text for generation in .txt-based files as:
Any text excerpt from the dataset of interest...
The new lines may not be represented as a newline character.
Note that there are no special tags or anchors in the above. If such tags exist, they will all be treated as part of a single document rather than as markup, and may not represent natural language. For example, the following text would be tokenized entirely as is, tags included:
<DOC>
<DOCNO>TRC2-2008-01-01-0000</DOCNO>
<BLOGS08DAY>-13</BLOGS08DAY>
<CONTENT>
Example content in the format that may be outside of the
</DOCS>
Definition of vocab_file and encoder_file#
The script supports two kinds of tokenizers: GPT2Tokenizer and NeoXTokenizer.
Supply the correct vocab_file and encoder_file when using the desired tokenizer:
- For GPT2Tokenizer, vocab_file=gpt2-vocab.bpe and encoder_file=gpt2-encoder.json
- For NeoXTokenizer, encoder_file=/neox-encoder.json

These files can be found here.
Note: For the GPT2Tokenizer, we follow the nomenclature used by OpenAI in their implementation, which is slightly different from Hugging Face’s nomenclature, where the vocab_file is called merges_file and the encoder_file is called vocab_file. However, the content of the files is the same. For NeoXTokenizer, we use the same nomenclature to avoid confusion.
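For example, an LMData run with the GPT2Tokenizer could look like the following sketch, where the input directory and the locations of the tokenizer files are hypothetical:

python create_hdf5_dataset.py LMData --input_dir /path/to/data --tokenizer_type GPT2Tokenizer --vocab_file /path/to/gpt2-vocab.bpe --encoder_file /path/to/gpt2-encoder.json --max_seq_length 2048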
Generating HDF5 files#
Once you have a text dataset that meets the above requirements, you can generate HDF5 files using the create_hdf5_dataset.py script:
python create_hdf5_dataset.py [mode] [--arguments]
As mentioned before, the mode can be one of {LMData, Summarization, Customize}. The modes share the same setup and processing arguments but differ in their dataset arguments, as detailed below:
Table 2: Setup Arguments#
Argument | Default Value | Description
---|---|---
params | N/A | Path to YAML config file for setting dataset preprocessing parameters. Optional alternative to providing the command line arguments.
input_dir | N/A | Directory where raw data is stored. Supports only the formats: .jsonl, .json.gz, .jsonl.zst, .jsonl.zst.tar, .txt.
metadata_files | N/A | Path to a text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by commas.
output_dir |  | Directory where HDF5 files will be stored.
processes | cpu count | Number of processes to use.
module | N/A | Python file name containing the custom dataset processor for Customize mode.
dataset_processor | N/A | Name of the custom dataset processor for Customize mode.
Note: You have to provide either the input_dir or metadata_files argument. If you provide both, only the files referenced in metadata_files will be processed.
Table 3: Processing Arguments#
Argument | Default Value | Description
---|---|---
tokenizer_type | required arg | Type of tokenizer to use for HDF5 dataset generation. Can be one of GPT2Tokenizer or NeoXTokenizer.
vocab_file | N/A | Path to the vocabulary file.
encoder_file | N/A | Path to the encoder file.
max_seq_length |  | Maximum sequence length.
short_seq_prob |  | Probability of creating sequences which are shorter than the maximum sequence length.
output_name |  | Name of the dataset; i.e., prefix to use for HDF5 file names.
files_per_record |  | Text files to write per HDF5 file.
write_in_batch |  | Whether to write the samples in batch for the HDF5 format; setting this to False will save memory but is a bit slower.
write_remainder |  | Write the remainder files when data is left over from processing.
resume_from_checkpoint |  | Resume record writing from a given checkpoint.
display_pbar |  | Display progress while running.
seed |  | Random seed.
Table 4: Dataset Arguments (LMData mode)#
Argument | Default Value | Description
---|---|---
ftfy |  | Fix text with ftfy.
ftfy_normalizer | NFC | Choose what kind of unicode normalization is applied. Usually, we apply NFC.
wikitext_detokenize |  | Use wikitext detokenizer to fix text.
jsonl_key | text | The key name in input jsonl files from which the raw text will be extracted in order to further process it.
pack_sequences |  | Concatenate a document smaller than the maximum sequence length with other documents, instead of filling it with padding tokens.
Table 5: Dataset Arguments (Summarization mode)#
Argument | Default Value | Description
---|---|---
ftfy |  | Fix text with ftfy.
ftfy_normalizer | NFC | Choose what kind of unicode normalization is applied. Usually, we apply NFC.
wikitext_detokenize |  | Use wikitext detokenizer to fix text.
sep_token |  | Token added between prompt and completion in preprocessed sequences.
prompt_key | required arg | Json key for the prompt.
completion_key | required arg | Json key for the completion.
Usage of create_hdf5_dataset.py file#
You can provide the above arguments either as command line arguments or through a YAML config file:
Command line
python create_hdf5_dataset.py LMData --input_dir /path/to/data --tokenizer_type NeoXTokenizer --encoder_file /path/to/encoder --max_seq_length 4096 --ftfy True --pack_sequences False
YAML config file
python create_hdf5_dataset.py LMData --params ./configs/autoregressive_lm_preprocessing.yaml
Sample YAML files for LMData and Summarization are located in the Cerebras Model Zoo.
Note: You can use both, but command line arguments will override any common arguments provided in the YAML configuration file.
Customize mode steps#
1. Create a Python file, or add your code under ./hdf5_dataset_preprocessors.py.
2. Import the module HDF5Preprocessor in the file you created, as follows:
from modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_preprocessor import HDF5Preprocessor
3. Create a class that inherits from HDF5Preprocessor (e.g., CustomDataset).
4. Implement __init__ so that it takes as input a dictionary containing the dataset parameters needed by HDF5Preprocessor.
5. Implement the methods file_read_generator and preprocessing_generator, following Write customized preprocessor.
6. Run the create_hdf5_dataset.py script.
Write customized preprocessor#
You can create customized preprocessors for various datasets or objectives. We provide two references in hdf5_dataset_preprocessors.py:

- LMDataPreprocessor: the preprocessor for autoregressive language modeling tasks
- SummarizationPreprocessor: the preprocessor for summarization tasks

They both inherit from HDF5BasePreprocessor in hdf5_base_preprocessor.py, with two functions that can be overridden to customize for various cases:

- file_read_generator() takes a file path, reads from the file, and yields the corresponding text documents. You can customize how you want the file to be read based on its format (e.g., csv, zip, etc.). Our default preprocessors use the lm_dataformat reader with specific JSON keys.
- preprocessing_generator() takes in the output of file_read_generator(), performs tokenization and other preprocessing, and yields the data samples in np.array format.
For example, in the autoregressive language modeling task, file_read_generator yields a str object, and preprocessing_generator produces an np.array with shape [3, max_sequence_length], with the following three features concatenated along the first dimension:

- input_ids: Input token IDs, padded with 0's to max_sequence_length.
- input_mask: Loss mask for the sequence. It has 0's on padded positions, such as prompts or padding tokens, and 1's elsewhere.
- labels: input_ids shifted to the right by one position as the target labels.
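As an illustration, a minimal sketch of a custom preprocessor for a hypothetical CSV dataset is shown below. Only the two overridden generators and their inputs and outputs follow the description above; the import path, the self.tokenizer and self.max_seq_length attributes, the params["dataset"] access, and the CSV column handling are assumptions made for this sketch rather than the exact Model Zoo API:

import csv

import numpy as np

from modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.hdf5_base_preprocessor import (
    HDF5BasePreprocessor,
)


class CustomCSVPreprocessor(HDF5BasePreprocessor):
    """Reads a single text column from plain CSV files (hypothetical format)."""

    def __init__(self, params):
        super().__init__(params)
        # Hypothetical dataset parameter naming the CSV column that holds the text.
        self.text_column = params["dataset"].get("text_column", 0)

    def file_read_generator(self, file):
        # Yield one raw text document per CSV row.
        with open(file, "r", newline="") as fh:
            for row in csv.reader(fh):
                yield row[self.text_column]

    def preprocessing_generator(self, doc):
        # Tokenize, truncate, and pad to max_seq_length, then emit the three
        # features stacked on the first dimension: input_ids, input_mask, labels.
        token_ids = self.tokenizer.encode(doc)[: self.max_seq_length + 1]
        input_ids = token_ids[:-1]
        labels = token_ids[1:]
        pad = self.max_seq_length - len(input_ids)
        input_mask = [1] * len(input_ids) + [0] * pad
        input_ids = input_ids + [0] * pad
        labels = labels + [0] * pad
        # Shape: [3, max_seq_length]
        yield np.stack([input_ids, input_mask, labels]).astype("int32")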
Best practices#
- It is recommended to use the ftfy module to fix the datasets. Enable it with the --ftfy argument.
- The NeoXTokenizer uses the HuggingFace library's inbuilt tokenizer and handles NFC normalization independently. When using this tokenizer_type, set the --ftfy_normalizer argument to None. For the GPT2Tokenizer, use the default NFC value for the normalizer.
- To process HDF5 for training, we recommend using multiprocessing. Moreover, we suggest using several input files, such that the total number of input files is greater than or equal to the number of processes provided by --processes. Note that this requires a high-spec CPU server that can handle the concurrently running processes in RAM and the I/O for reads and writes. If the I/O of the server is slow, the processes may appear to hang for a very long while.
- For very large datasets (with several files, each on the order of GBs), the recommendation is to split the data into smaller subsets and write out each subset separately. You can then mix all the HDF5 files in a common folder for use by the data pipeline, or just provide the locations of each subset in a list. The overall time to write out the HDF5 files can depend on the CPU server used.
- It is better to split the input dataset into multiple files of similar sizes to leverage the full potential of parallel processing.
- For CodeGen model processing, please use the GPT2Tokenizer along with updated vocab files, such that the vocabulary of GPT-2 is extended by special tokens representing repeating tokens of tabs and white spaces.
Output files structure#
The output directory will contain many h5 files, as shown below (with two processes):
<path/to/output_dir>
├── checkpoint_0.txt
├── checkpoint_1.txt
├── data_params.json
├── examples_0_0.h5
├── examples_0_1.h5
├── examples_1_0.h5
├── examples_1_1.h5
├── examples_2_0.h5
├── examples_2_1.h5
├── examples_3_0.h5
├── examples_3_1.h5
├── examples_4_0.h5
├── examples_4_1.h5
├── examples_5_0.h5
├── examples_6_0.h5
├── examples_7_0.h5
└── examples_8_0.h5
Here, data_params.json is the file that stores the parameters used for generating this set of files. The checkpoint_*.txt files can be used to resume the processing in case the run script gets killed for some reason; there is one checkpoint_*.txt file for each process. To use these files, re-run the previous command with the additional command line argument --resume_from_checkpoint.
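For example, if the original run used the YAML invocation shown earlier, resuming might look like the following; whether the flag takes an explicit value depends on the script, so treat this as a sketch:

python create_hdf5_dataset.py LMData --params ./configs/autoregressive_lm_preprocessing.yaml --resume_from_checkpoint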
Example for HuggingFace Eli5 dataset#
The following example shows conversion of the HuggingFace Eli5 dataset to HDF5:
from modelzoo.transformers.data_processing.huggingface.HuggingFace_Eli5 import (
HuggingFace_Eli5,
)
from modelzoo.transformers.data_processing.scripts.hdf5_preprocessing.convert_dataset_to_HDF5 import (
convert_dataset_to_HDF5,
)
dataset, data_collator = HuggingFace_Eli5(split="train", num_workers=8)
convert_dataset_to_HDF5(
dataset=dataset,
data_collator=data_collator,
output_dir="./eli5_hdf5_dataset/",
num_workers=8,
)