Creating HDF5 dataset for GPT models#
IMPORTANT: This preprocessing script for creating HDF5 dataset files is soon going to be officially DEPRECATED. To ensure optimal performance and compatibility, we strongly recommend transitioning to the script provided in the Creating HDF5 dataset for GPT models using chunk data preprocessing document.
Overview#
We provide two methods to generate Hierarchical Data Formats (HDF) files (.h5
) that you can use in the input pipeline for GPT style models to implement data loader for GPT style models efficiently.
If you have a PyTorch dataset and need to convert it to an HDF5 format, follow section Converting a PyTorch dataset to HDF5 format.
If you have raw data and want to convert it to an HDF5 dataset, follow section Generating HDF5 files of this document.
Converting a PyTorch dataset to HDF5 format#
If you have a PyTorch dataset for GPT models (from any source such HuggingFace, Map-Style or Iterable), you can easily write the samples of that dataset in HDF5 format to use with Cerebras optimized HDF5 DataProcessor. This can be done by calling the function convert_dataset_to_HDF5()
which is defined in convert_dataset_to_HDF5.py. The following example shows conversion of HuggingFace Eli5 dataset to HDF5:
from cerebras.modelzoo.data_preparation.huggingface.HuggingFace_Eli5 import (
HuggingFace_Eli5,
)
from cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5 import (
convert_dataset_to_HDF5,
)
dataset, data_collator = HuggingFace_Eli5(
split="train", num_workers=8, sequence_length=128
)
convert_dataset_to_HDF5(
dataset=dataset,
data_collator=data_collator,
output_dir="./eli5_hdf5_dataset/",
num_workers=8,
)
The function convert_dataset_to_HDF5()
uses a PyTorch Dataloader to fetch samples from the specified dataset and writes those samples in h5
files. The following table explains the arguments to the convert_dataset_to_HDF5()
function:
Table 1: convert_dataset_to_HDF5 Arguments#
Argument |
Default Value |
Description |
---|---|---|
|
N/A |
PyTorch dataset to fetch the data from (IterableDataset or Dataset ). |
|
./hdf5_dataset/ |
Directory where HDF5 will be stored. |
|
dataset-partition |
Name of the dataset; i.e. prefix to use for HDF5 file names. |
|
2000 |
Number of samples written to each HDF5 file |
|
8 |
Number of Python processes to use for generating data. |
|
64 |
The batch size to use fetching the data. |
|
N/A |
Merges a list of samples to form a mini-batch of Tensor(s). |
|
i4 |
Data type for the HDF5 dataset. |
|
gzip |
HDF5 Compression strategy. |
While the function convert_dataset_to_HDF5()
is generic and can be used with all transformer models, note that PyTorch dataset features dictionary should have the the following key/values GPT models:
input_ids
: Input token IDs, padded with0
tomax_sequence_length
.Shape:
(batch_size, max_sequence_length)
Type:
torch.int32
attention_mask
: Mask for padded positions. Has values0
on the padded positions and1
elsewhere.Shape:
(batch_size, max_sequence_length)
Type:
torch.int32
labels
: Labels for language modeling pre-training task, padded with0
tomax_sequence_length
.Shape:
(batch_size, max_sequence_length)
Type:
torch.int32
There are two extra key/values needed to train GPT models on variable sequence length (VSL) samples that are packed into fixed length sequence:
“attention_span”, “position_ids”
attention_span
: Specifies the attention span for each attention key to prevent attending to out-of-sample queries, padded with0
tomax_sequence_length
.Shape:
(batch_size, max_sequence_length)
Type:
torch.int32
position_ids
: Token position index relative to the data sample, padded with0
tomax_sequence_length
.Shape:
(batch_size, max_sequence_length)
Type:
torch.int32
NOTE:
More information on using of HuggingFace datasets can be found in this document: Using HuggingFace datasets for auto-regressive LM].
attention_mask
here actually represents loss mask and is used as loss mask by our gpt-style models. It can mask out components like padding tokens or prompt tokens that shouldn’t be in the loss calculation. For example: \
If we do autoregressive language modeling with the input in format [
input_ids
,padding_tokens
] (LMData),attention_mask
will look something like [1, 1, …, 1, 0, 0, 0, …, 0] where the 1’s corresponds toinput_ids
and 0’s topadding_tokens
.If we do prompted generation like instruction tuning with input in format [
prompt_ids
,input_ids
,padding_tokens
(optional)] (Summarization),attention_mask
will look something like [0, 0, …, 0, … 1, 1, …, 1, 0, 0, 0, …, 0] where the first chunk of 0’s correspond toprompt_ids
, the 1’s correspond toinput_ids
and the second chunk of 0’s correspond topadding_tokens
.
Converting raw data to HDF5 data#
We currently offer eight modes for generating HDF5 files:
LMData
: for processing language modelling datasets that are in .jsonl, .jsonl.zst, .parquet or .txt format.Summarization
: for processing fine-tuning datasets that are in .jsonl, .jsonl.zst, .parquet or .txt format.FIM
: for processing language modeling data to support Fill-in-the-Middle objective, also requires datasets that are in .jsonl, .jsonl.zst, .parquet or .txt format.LMData_VSL
: for processing language modelling datasets that are in .jsonl, .jsonl.zst, .parquet or .txt format for variable sequence length (VSL) training.Summarization_VSL
: for processing fine-tuning datasets that are in .jsonl, .jsonl.zst, .parquet or .txt format for variable sequence length (VSL) training.Customize
: for any dataset format, but requires supplying a module that specifies how to read the raw dataset files.LlavaPhaseOne
: for preprocessing LLaVA phase 1 datasets that have a image in .jpg or .png format and text in .jsonl, .jsonl.zst, .parquet or .txt format.LlavaPhaseTwo
: for preprocessing LLaVA phase 2 datasets that have a image in .jpg or .png format and text in .jsonl, .jsonl.zst, .parquet or .txt format.
Run the create_hdf5_dataset.py script to process your data with the appropriate subcommand (modes) to generate the .h5
files for GPT style models. Each sub-commands takes in a set of arguments described below in Generating HDF5 files section.
Before doing that, you need to setup a python virtual environment through conda as described below.
Set up environment#
NOTE: Model Zoo environment setup is required for a clean run of the data preprocessing script.
Set up the Model Zoo environment as described in PYTHON-SETUP.md.
Input files format#
Ensure the input text documents are in a specific file format before utilizing the provided script, except for the Customize
mode. The acceptable file formats are '.jsonl', '.json.gz', '.jsonl.zst', '.jsonl.zst.tar', '.txt'
. These files should have the data in a specific structure described in data format section.
To optimally process the files, we recommend that all files with any of the above formats besides .txt
contain enough text in a single file. The recommended size for each file is in the order of GB.
On the contrary, if you are processing smaller files with .txt
format, input a metadata
file containing a list of paths to these files to leverage multi-processing better.
Input data format#
As mentioned above, the preprocessing script accepts two primary input files: .json
based or .txt
based. The input data must follow a specific structure for each type to be accurately converted into hdf5
files.
Format for jsonl
files#
The raw text and meta data for generation should be represented in the .jsonl
based files as:
{"text": "Any text excerpt from the dataset of interest...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}
For the jsonl
files, as shown above, the raw text is extracted from key=text
in the input files by default. If your input files do not contain a text
key, you should know the key
corresponding to the text you need to extract. Then, extract the text from the command line argument --jsonl_key=<your key name>
.
For example, if your jsonl
files have the content such as the following:
{"idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
"meta": {"info": "any other metadata dict"}}
then you’d need to pass --jsonl_key=idea
in the command line arguments.
Format for txt
based files in LMData
mode#
Always represent raw text for generation in a .txt
based files as:
Any text excerpt from the dataset of interest...
The new lines may not be represented as a newline character.
Note that there are no special tags or anchors in the above. If they exist, all these will be treated as a single document and may not represent the natural language.
For example, the following text gets tokenized be entirely as:
<DOC>
<DOCNO>TRC2-2008-01-01-0000</DOCNO>
<BLOGS08DAY>-13</BLOGS08DAY>
<CONTENT>
Example content in the format that may be outside of the
</DOCS>
Format for PARQUET
based files in LMData
mode#
Column Name = “text”: “Any text excerpt from the dataset of interest…\nThe new lines should be represented as a newline character.”, Column Name = “abc”: “….” etc
For the `parquet` files, as shown above, by default, the raw text is extracted from the column with name `text` in the input files. If your input files do not contain a column with name `text` as key then you should know the `key` corresponding to the text you need to extract. Then, you can use the command line argument `--jsonl_key=<your key name>` to extract the text.
For example, if your parquet files have the content as below and you want to extract the value from column name `idea`:
```parquet
Column Name = "idea": "Any text excerpt from the dataset of interest, with custom key: 'idea'...\nThe new lines should be represented as a newline character.",
Column Name = "abc": "...."
etc
then you’d need to pass --jsonl_key=idea
in the command line arguments.
Definition of vocab_file and encoder_file#
The script supports two kinds of tokenizers: GPT2Tokenizer
and NeoXTokenizer
.
Supply the correct vocab_file
and encoder_file
when using the desired tokenizer.
For
GPT2Tokenizer
,vocab_file=gpt2-vocab.bpe
andencoder_file=gpt2-encoder.json
For
NeoXTokenizer
,encoder_file=/neox-encoder.json
These files can be found here.
Note: For the GPT2Tokenizer
, we follow the nomenclature used by OpenAI in their implementation which is slightly different from Hugging Face’s nomenclature where they call the vocab_file
as merges_file
and encoder_file
as vocab_file
. However, the content of the files is the same. For NeoXTokenizer
, we use the same nomenclature to avoid confusion.
Generating HDF5 files#
Once you have a text dataset that meets the above requirement, you can generate HDF5 files using the create_hdf5_dataset.py
script:
python create_hdf5_dataset.py [mode] [--arguments]
As we mentioned before, the mode can be one of {LMData
, Summarization
,}. The four modes share the same setup and processing arguments but differ in their dataset arguments, as detailed below:
Table 2: Setup Arguments#
Argument |
Default Value |
Description |
---|---|---|
|
N/A |
Path to YAML config file for setting dataset preprocessing parameters. Optional alternative for providing command line arguments. |
|
N/A |
Directory where raw data is stored. Supports only the formats: [ |
|
N/A |
Path to text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by comma. |
|
|
Directory where HDF5 files will be stored. |
|
cpu count |
Number of processes to use. |
|
N/A |
Python file name contains the custom dataset processor for |
|
N/A |
Name of the custom dataset processor for |
Note: You have to provide either the
input_dir
ormetadata_files
argument. Only files referenced in themetadata_files
will be processed if you provided both.
Table 3: Processing Arguments#
Argument |
Default Value |
Description |
---|---|---|
|
required arg |
Type of tokenizer to use for HDF5 dataset generation. Can be one of |
|
N/A |
Path to the vocabulary file. |
|
N/A |
Path to the encoder file. |
|
|
Maximum sequence length. |
|
|
Probability of creating sequences which are shorter than the maximum sequence length. |
|
|
Name of the dataset; i.e. prefix to use for HDF5 file names. |
|
|
Text files to write per HDF5 file. |
|
|
Whether to write the samples in batch for the HDF5 format, setting to false will save memory but a bit slower. |
|
|
Write the remainder files when data is left over from processing. |
|
|
Resume record writing from a given checkpoint. |
|
|
Display progress while runs. |
|
|
Random seed. |
Table 4: Dataset Arguments (LMData
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
Fix text with ftfy. |
|
|
Choose what kind of unicode normalization is applied. Usually, we apply |
|
|
Use wikitext detokenizer to fix text. |
|
|
The key name in input jsonl files from which the raw text will be extracted in order to further process it. |
|
|
Concatenate a document smaller than maximum sequence length with other documents, instead of filling it with Padding token. |
|
|
Minimum token length to skip the sample. |
|
|
dtype of processed input_ids. |
|
|
dtype of processed input loss masks. |
|
|
If False, 0 represents masked positions. If True 1 represents masked positions. |
|
|
Whether to split the text into smaller chunks before tokenization. This is helpful for very long documents with tokenizers such as Llama tokenizer which performs quadratically in the text length. |
|
|
Length of the text chunks to split the text into before tokenization for slower tokenizers. Could be optionally used with the above flag |
|
|
Whether to remove the BOS token from the beginning of the chunks. Set this to |
Table 5: Dataset Arguments (Summarization
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
Fix text with ftfy. |
|
|
Choose what kind of unicode normalization is applied. Usually, we apply |
|
|
Use wikitext detokenizer to fix text. |
|
|
Minimum token length to skip the sample. |
|
|
Token added between prompt and completion in preprocessed sequences. If supplied with a non-None value, the tokenizer will add the token to the vocab size and modify the vocab size. This may not be advisable for doing fine tuning on a pre-trained model on the types of models that do not provision for extra tokens. |
|
required arg |
Json key for the prompt. |
|
required arg |
Json key for the completion. |
|
|
dtype of processed input_ids. |
|
|
dtype of processed input loss masks. |
|
|
If False, 0 represents masked positions. If True 1 represents masked positions. |
Table 6: Dataset Arguments (FIM
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
Float specifying percentage of data to apply FIM transformation, instead of leaving as auto-regressive. |
|
|
Float specifying percentage of FIM transformation to convert to prefix-suffix-middle (PSM) vs suffix-prefix-middle (SPM) formats. |
The FIM
mode is very similar to the LMData
mode, and uses all the same other arguments as listed in the LMData
table. These additional parameters determine whether what percentage of samples have the FIM transformation applied, and what percent of these end up in PSM (prefix, suffix, middle) or SPM format.
Note: For CodeLlama, to follow the note here specify the EOT token as the EOS token in the config.
Table 7: Dataset Arguments (LMData_VSL
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
Fold documents larger than |
|
|
dtype of token position ids. |
The LMData_VSL
mode inherits all the other arguments from the LMData
mode as listed in Table 4.
Table 8: Dataset Arguments (Summarization_VSL
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
dtype of token position ids. |
|
|
If specified, this will be added before the prompt in every sequence. Example usage is to add |
|
|
Similar to |
|
|
Some current chat templates will include an EOS token after the end of the user input in a multi-turn dialogue. If this flag is specified, there will be EOS tokens after all prompts. |
|
|
If specified, this replaces the |
|
|
If the data stored at |
The Summarization_VSL
mode inherits all the other arguments from the Summarization
mode as listed in Table 5.
Table 9: Dataset Arguments (LlavaPhaseOne
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
Some current chat templates will include an EOS token after the end of the user input in a multi-turn dialogue. If this flag is specified, there will be EOS tokens after all prompts. |
|
|
If specified, this replaces the |
|
|
If the data stored at |
|
|
Image key of the LLaVA dataset. For example a jsonl file might have the image path contained at the key, “image”. |
|
|
Some examples in LLaVA training are text-only and have no images, so that the model does not forget how to answer text-only questions while it is learning multi-modality. These examples will not have the |
|
|
String that represents where in the text the image patches will be inserted. For example, the original LLaVA dataset contained the string “ |
|
|
Number of patches to represent an image. This is determined by the patch-size (in pixels) of the image-encoder, and the pixel count of the input images. |
|
|
Absolute path of image directory. Used along with the relative path under the |
The LlavaPhaseOne
mode inherits all the other arguments from the Summarization
mode as listed in Table 8. Note that the LLaVA phase-one training removes the prompt text, and trains on the image + completion pair.
Also note that both preprocessors for LLaVA currently only support tokenizers based on the Llama and Mistral models.
Table 10: Dataset Arguments (LlavaPhaseTwo
mode)#
Argument |
Default Value |
Description |
---|---|---|
|
|
If specified, this will be added before the prompt in every sequence. Example usage is to add |
|
|
Similar to |
|
|
Key to obtain the system prompt used for the LLM backbone within LLaVA. The currently supported keys are |
The LlavaPhaseTwo
mode inherits all the other arguments from LlavaPhaseOne
mode as listed in Table 9. LLaVA phase-two training does not remove the prompt text as phase-one does.
Usage of create_hdf5_dataset.py file#
You can provide the above arguments either as command line arguments or as YAML config file:
Command line
python create_hdf5_dataset.py LMData --input_dir /path/to/data --tokenizer_type NeoXTokenizer --encoder_file /path/to/encoder --max_seq_length 4096 --ftfy True --pack_sequences False
YAML config file
python create_hdf5_dataset.py LMData --params ./configs/autoregressive_lm_preprocessing.yaml
Example of sample YAML files for LMData and Summarization are located on Cerebras Model Zoo.
Note: You can use both, but command-line arguments will override any common arguments with the YAML configuration file.
Customize mode steps#
Create a python file or put under
./hdf5_dataset_preprocessors.py
Import the module
HDF5Preprocessor
in the file you created as follows:
from modelzoo.data_preparation.nlp.hdf5_preprocessing.hdf5_preprocessor import HDF5Preprocessor
Create a class that inherits from
HDF5Preprocessor
. (e.gCustomDataset
)Implements init takes as input a dictionary contains the dataset parameters that is needed for
HDF5Preprocessor
.Implements the method
file_read_generator
andpreprocessing_generator
following Write Customized PreprocessorRun
create_hdf5_dataset.py
script.
Write customized preprocessor#
You can create customized preprocessors for various datasets or objectives. We provide two references at hdf5_dataset_preprocessors.py
where:
LMDataPreprocessor: the preprocessor for autoregressive language modeling tasks
SummarizationPreprocessor: the preprocessor for summarization tasks
They both inherit from the HDF5BasePreprocessor
at hdf5_base_preprocessor.py
with two functions that can be overridden to customize for various cases:
file_read_generator() takes a file path, reads from the file, and yields the corresponding text documents. You can customize how you want the file to be read based on its format (ex. csv, zip, etc.). Our default preprocessors use
lm_dataformat
reader with specific JSON keys.preprocessing_generator(), This function takes in the output of file_read_generator(), performs tokenization and other preprocessing techniques, and yields the data samples in np.array format.
For example, in the autoregressive language modeling task, file_read_generator yields an str object, and the preprocessing_generator produces an np array with shape [3, max_sequence_length]
with the following three features concatenated on the first dimension:
input_ids
: Input token ids, padded with 0’s to max_sequence_length.input_mask
: Loss mask for the sequence. It has 0’s padded positions like prompts or padding tokens, and 1’s elsewhere.labels
: input_ids shifted to the right by one position as the target labels.
NOTE: To avoid tedious setup of arguments specific to your customized preprocessor, we recommend running with a YAML file config.
Best practices#
It is recommended to use the
ftfy
module to fix the datasets. Enable with the--ftfy
argument.The NeoXTokenizer uses the HuggingFace library’s inbuilt tokenizer and handles NFC normalization independently. When using this tokenizer_type, set the
--ftfy_normalizer
argument toNone
. For theGPT2Tokenizer
, use the defaultNFC
value for the normalizer.To process HDF5 for training, we recommend using multi-processing. Moreover, we suggest using several input files such that the totalnum,ber of input files are greater than or equal to the number of processes provided by
--processes
. Note that this requires a high-spec CPU server, which can handle the concurrent running processes in RAM and the I/O for reads and writes. If the I/O of the server is slow, the processes may appear to be hung for a very long while.The recommendation is to split the data into smaller subsets and write out each subset for very large datasets (with several files, with each file in the order of GBs). You can then mix all HDF5 in a common folder for use by the data pipeline or just provide the locations of each subset in a list. The overall time to write out HDF5 can depend on the CPU server used.
It is better to split the input dataset into multiple files with similar sizes to leverage the full potential of parallel processing.
For CodeGen models processing, please use
GPT2Tokenizer
along with the updated vocab files such that the vocabulary of GPT-2 is extended by special tokens representing repeating tokens of tabs and white spaces.
Output files structure#
The output directory will contain many h5
files, as shown below (with two processes):
<path/to/output_dir>
├── checkpoint_0.txt
├── checkpoint_1.txt
├── data_params.json
├── examples_0_0.h5
├── examples_0_1.h5
├── examples_1_0.h5
├── examples_1_1.h5
├── examples_2_0.h5
├── examples_2_1.h5
├── examples_3_0.h5
├── examples_3_1.h5
├── examples_4_0.h5
├── examples_4_1.h5
├── examples_5_0.h5
├── examples_6_0.h5
├── examples_7_0.h5
└── examples_8_0.h5
Here data_params.json
is the file that stores the parameters used for generating this set of files. checkpoint_*.txt
can be used for resuming the processing in case the run script gets killed for some reason. There is one checkpoint_*.txt
file for each process. To use this file, resume the previous command that you ran along with the additional command line argument --resume_from_checkpoint
.
Example for HuggingFace Eli5 dataset#
The example shows conversion of HuggingFace Eli5 dataset to HDF5:
from modelzoo.data_preparation.huggingface.HuggingFace_Eli5 import (
HuggingFace_Eli5,
)
from modelzoo.data_preparation.nlp.hdf5_preprocessing.convert_dataset_to_HDF5 import (
convert_dataset_to_HDF5,
)
dataset, data_collator = HuggingFace_Eli5(split="train", num_workers=8)
convert_dataset_to_HDF5(
dataset=dataset,
data_collator=data_collator,
output_dir="./eli5_hdf5_dataset/",
num_workers=8,
)