modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.Reader
- class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.data_reader.Reader
 Bases: object
Initialize the Reader instance.
- Parameters
 file_list (List[str]) – List of file paths to be read.
max_chunk_size (int) – Maximum chunk size for accumulated data.
logger (logging.Logger) – Logger instance.
keys (Optional[List[str]]) – List of keys to filter data. Defaults to ['text'].
Methods
accumulate_and_yield – Accumulate data and yield in chunks.
handle_jsonl – Handle JSONL data and yield processed entries.
read_jsongz – Read and process gzipped JSON file.
read_jsonl – Read and process JSONL file.
read_jsonl_tar – Read and process TAR archive containing ZST compressed JSONL files.
read_jsonl_zst – Read and process ZST compressed JSONL file.
read_parquet – Read and process Parquet file.
read_txt – Read and process text file.
stream_data – Stream and process data from multiple file formats.
- __init__(file_list: List[str], max_chunk_size: int, logger: logging.Logger, keys: Optional[List[str]] = None) → None
 Initialize the Reader instance.
- Parameters
 file_list (List[str]) – List of file paths to be read.
max_chunk_size (int) – Maximum chunk size for accumulated data.
logger (logging.Logger) – Logger instance.
keys (Optional[List[str]]) – List of keys to filter data. Defaults to ['text'].
- accumulate_and_yield(data_gen: Iterator[Dict[str, Any]], file_idx) → Iterator[Any]
 Accumulate data and yield in chunks.
- Parameters
 data_gen (Iterator[Dict[str, Any]]) – Generator yielding data entries.
file_idx (int) – Index of the current file.
- Returns
 Yields accumulated data chunks.
- Return type
 Iterator[Any]
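The accumulate-then-yield pattern described above can be sketched as a plain generator. This is a minimal illustration, not the library's implementation: the function name `accumulate_in_chunks` is hypothetical, and it assumes chunk size is measured as the total length of the string values, since the unit of `max_chunk_size` is not stated here.

```python
from typing import Any, Dict, Iterator, List


def accumulate_in_chunks(
    data_gen: Iterator[Dict[str, Any]],
    max_chunk_size: int,
) -> Iterator[List[Dict[str, Any]]]:
    """Hypothetical stand-in for the accumulate-then-yield pattern."""
    chunk: List[Dict[str, Any]] = []
    size = 0
    for entry in data_gen:
        chunk.append(entry)
        # Assumption: size is the total length of all string values.
        size += sum(len(v) for v in entry.values() if isinstance(v, str))
        if size >= max_chunk_size:
            yield chunk
            chunk, size = [], 0
    if chunk:
        # Flush whatever is left once the generator is exhausted.
        yield chunk


docs = [{"text": "aaaa"}, {"text": "bbbb"}, {"text": "cc"}]
chunks = list(accumulate_in_chunks(iter(docs), max_chunk_size=8))
```

With these inputs the first two entries fill one chunk and the remaining entry is flushed as a second, smaller chunk.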
- handle_jsonl(jsonl_reader: Any, start_doc_idx: int, get_meta: bool, autojoin_paragraphs: bool, para_joiner: str) → Iterator[Dict[str, Any]]
 Handle JSONL data and yield processed entries.
- Parameters
 jsonl_reader (Any) – The JSONL reader object.
start_doc_idx (int) – Document index to start from.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to automatically join paragraphs.
para_joiner (str) – String used to join paragraphs.
- Returns
 Yields processed data entries.
- Return type
 Iterator[Dict[str, Any]]
- read_jsongz(file: str, checkpoint_args: tuple) → Iterator[Any]
 Read and process gzipped JSON file.
- Parameters
 file (str) – Path to the .json.gz file.
checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
- Returns
 Yields processed data entries.
- Return type
 Iterator[Any]
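As a rough illustration of resuming from a document index in a gzipped JSON file, the sketch below uses only the standard library. The helper `read_json_gz` is hypothetical, and it assumes the file holds one JSON object per line and that documents before the starting index are simply skipped; the actual method may differ on both points.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List, Optional


def read_json_gz(
    path: str,
    start_doc_idx: int = 0,
    keys: Optional[List[str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield filtered records from a .json.gz file."""
    keys = keys or ["text"]
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for idx, line in enumerate(fh):
            if idx < start_doc_idx:
                continue  # already processed before the checkpoint
            record = json.loads(line)
            # Keep only the requested keys that are present in the record.
            yield {k: record[k] for k in keys if k in record}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for i in range(3):
            fh.write(json.dumps({"text": f"doc {i}", "id": i}) + "\n")
    # Resume from the second document (index 1).
    resumed = list(read_json_gz(path, start_doc_idx=1))
```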
- read_jsonl(file: str, checkpoint_args: tuple, get_meta: bool = False, autojoin_paragraphs: bool = True, para_joiner: str = '\n\n') → Iterator[Any]
 Read and process JSONL file.
- Parameters
 file (str) – Path to the .jsonl file.
checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to automatically join paragraphs.
para_joiner (str) – String used to join paragraphs.
- Returns
 Yields processed data entries.
- Return type
 Iterator[Any]
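The interaction between `autojoin_paragraphs` and `para_joiner` might look like the following sketch. It assumes that a document's text may be stored as a list of paragraphs and that joining applies only in that case; the helper `iter_jsonl_text` and its `key` parameter are hypothetical and not part of the documented API.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_jsonl_text(
    stream: TextIO,
    autojoin_paragraphs: bool = True,
    para_joiner: str = "\n\n",
    key: str = "text",
) -> Iterator[Any]:
    """Hypothetical helper mirroring the documented flags."""
    for line in stream:
        if not line.strip():
            continue  # ignore blank lines between records
        text = json.loads(line)[key]
        # Assumption: only list-valued text is joined into one string.
        if autojoin_paragraphs and isinstance(text, list):
            text = para_joiner.join(text)
        yield text


raw = '{"text": ["first", "second"]}\n{"text": "single"}\n'
joined = list(iter_jsonl_text(io.StringIO(raw)))
unjoined = list(iter_jsonl_text(io.StringIO(raw), autojoin_paragraphs=False))
```

With joining enabled, the paragraph list becomes a single string separated by `para_joiner`; with it disabled, the list is passed through unchanged.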
- read_jsonl_tar(file: str, checkpoint_args: tuple, get_meta: bool = False, autojoin_paragraphs: bool = True, para_joiner: str = '\n\n') → Iterator[Any]
 Read and process TAR archive containing ZST compressed JSONL files.
- Parameters
 file (str) – Path to the .jsonl.zst.tar file.
checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to automatically join paragraphs.
para_joiner (str) – String used to join paragraphs.
- Returns
 Yields processed data entries.
- Return type
 Iterator[Any]
- read_jsonl_zst(file: str, checkpoint_args: tuple, get_meta: bool = False, autojoin_paragraphs: bool = True, para_joiner: str = '\n\n') → Iterator[Any]
 Read and process ZST compressed JSONL file.
- Parameters
 file (str) – Path to the .jsonl.zst file.
checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to automatically join paragraphs.
para_joiner (str) – String used to join paragraphs.
- Returns
 Yields processed data entries.
- Return type
 Iterator[Any]
- read_parquet(file: str, checkpoint_args: tuple) → Iterator[Any]
 Read and process Parquet file.
- Parameters
 file (str) – Path to the .parquet file.
checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
- Returns
 Yields processed data rows.
- Return type
 Iterator[Any]
- read_txt(file: str, checkpoint_args: tuple) → Iterator[Any]
 Read and process text file.
- Parameters
 file (str) – Path to the .txt file.
checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
- Returns
 Yields processed data lines.
- Return type
 Iterator[Any]
- stream_data(checkpoint_args, get_meta: bool = False) → Iterator[Any]
 Stream and process data from multiple file formats.
- Parameters
 checkpoint_args (tuple) – Tuple containing the current file starting index and the current document starting index.
get_meta (bool) – Flag to determine if metadata should be extracted.
- Returns
 Yields processed data chunks.
- Return type
 Iterator[Any]
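One way to picture streaming across several file formats with a checkpoint is a dispatcher that picks a reader by file suffix. The sketch below is an assumption-laden illustration: `stream_files`, `fake_reader`, and the `readers` mapping are hypothetical names, and the choices to restart at document 0 for files after the checkpointed one and to raise on unknown suffixes are guesses rather than documented behavior.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def stream_files(
    file_list: List[str],
    checkpoint_args: Tuple[int, int],
    readers: Dict[str, Callable[[str, int], Iterator[str]]],
) -> Iterator[str]:
    """Hypothetical dispatcher: pick a reader by file suffix and resume."""
    start_file_idx, start_doc_idx = checkpoint_args
    for file_idx, path in enumerate(file_list):
        if file_idx < start_file_idx:
            continue  # files before the checkpoint are already processed
        # Only the checkpointed file starts mid-way; later files start at 0.
        doc_idx = start_doc_idx if file_idx == start_file_idx else 0
        for suffix, reader in readers.items():
            if path.endswith(suffix):
                yield from reader(path, doc_idx)
                break
        else:
            raise ValueError(f"Unsupported file format: {path}")


def fake_reader(path: str, start_doc_idx: int) -> Iterator[str]:
    # Stand-in reader: pretends every file holds three documents.
    for doc_idx in range(start_doc_idx, 3):
        yield f"{path}:{doc_idx}"


readers = {".txt": fake_reader, ".jsonl": fake_reader}
out = list(stream_files(["a.txt", "b.jsonl", "c.txt"], (1, 2), readers))
```

Here the checkpoint `(1, 2)` skips the first file entirely, resumes the second file at its third document, and reads the third file from the beginning.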