cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader
This module contains helper functions and classes that read data in several formats, process it, and save it in HDF5 format. It supports JSONL, gzipped JSON, Parquet, ZST-compressed JSONL, and TAR archives of ZST-compressed JSONL files.
- Classes:
  - DataFrame: An object that holds and processes data and can serialize itself into HDF5 format.
  - Reader: Provides a mechanism to read data from multiple file formats, process it, and yield it in manageable chunks.
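The chunked-reading idea described for Reader can be illustrated with a minimal sketch. This is not the module's implementation: the names `iter_records` and `iter_text_chunks`, the `key` and `max_chars` parameters, and the choice to size chunks by character count are all assumptions made for illustration, and only plain and gzipped JSONL are handled, using the standard library.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, List


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file.

    Hypothetical helper: files ending in .gz are opened with gzip,
    everything else is opened as plain text.
    """
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_text_chunks(
    paths: List[Path], key: str = "text", max_chars: int = 1000
) -> Iterator[List[str]]:
    """Group the `key` field of successive records into chunks.

    A chunk is emitted before it would exceed `max_chars` characters
    (an assumed size measure). A single oversized record still forms
    its own chunk rather than being dropped.
    """
    chunk: List[str] = []
    size = 0
    for path in paths:
        for record in iter_records(path):
            text = record.get(key, "")
            if chunk and size + len(text) > max_chars:
                yield chunk
                chunk, size = [], 0
            chunk.append(text)
            size += len(text)
    if chunk:
        yield chunk
```

Because both functions are generators, only one chunk is held in memory at a time, which is the property that makes chunked reading useful for large corpora.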
Functions
- Find the last end of a paragraph (denoted by '
- Compute the size of the given data.
- This is used to set metadata for a given dataframe.
- Split a large entry into chunks by sentence or paragraph end.
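Splitting a large entry at a paragraph or sentence end can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the names `find_last_break` and `split_text` are hypothetical, and the delimiters (a blank line for a paragraph end, a period followed by a space for a sentence end) are assumed for the example.

```python
from typing import List


def find_last_break(buffer: str) -> int:
    """Return the index just past the last natural boundary, or -1.

    Assumed delimiters: a paragraph end is a blank line ("\n\n"),
    a sentence end is ". ". Paragraph ends take priority.
    """
    para = buffer.rfind("\n\n")
    if para != -1:
        return para + 2
    sent = buffer.rfind(". ")
    if sent != -1:
        return sent + 2
    return -1


def split_text(text: str, max_chars: int) -> List[str]:
    """Split text into pieces of at most max_chars characters.

    Each cut is placed at the last boundary inside the window; if the
    window contains no boundary, the text is cut at max_chars.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces: List[str] = []
    while len(text) > max_chars:
        cut = find_last_break(text[:max_chars])
        if cut <= 0:
            cut = max_chars  # no boundary found: fall back to a hard cut
        pieces.append(text[:cut])
        text = text[cut:]
    if text:
        pieces.append(text)
    return pieces
```

The delimiter stays attached to the piece it ends, so joining the pieces reproduces the original text exactly.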
Classes
- Initialize the DataFrame object.
- Initialize the Reader instance.