cerebras.modelzoo.data_preparation.data_preprocessing.data_reader.Reader

class cerebras.modelzoo.data_preparation.data_preprocessing.data_reader.Reader(file_list, max_chunk_size, keys, read_hook_fn, **kwargs)

Bases: object

Initialize the Reader instance.

Parameters
  • file_list (List[str]) – List of file paths to be read.

  • max_chunk_size (int) – Maximum chunk size for accumulated data.

  • keys (Optional[Dict]) – Dictionary containing the type of each key and its name.

Methods

accumulate_and_yield

Accumulate data and yield in chunks.

handle_jsonl

Handle JSONL data and yield processed entries.

read_fasta

Read and process Fasta file without using BioPython.

read_jsongz

Read and process gzipped JSON file.

read_jsonl

Read and process JSONL file.

read_jsonl_tar

Read and process TAR archive containing ZST compressed JSONL files.

read_jsonl_zst

Read and process ZST compressed JSONL file.

read_parquet

Read and process Parquet file.

read_txt

Read and process text file.

stream_data

Stream and process data from multiple file formats.

handle_jsonl(jsonl_reader, start_doc_idx, get_meta, autojoin_paragraphs, para_joiner)

Handle JSONL data and yield processed entries.

Parameters
  • jsonl_reader (Any) – The JSONL reader object.

  • start_doc_idx (int) – The current document starting index.

  • get_meta (bool) – Flag to determine if metadata should be extracted.

  • autojoin_paragraphs (bool) – Flag to auto join paragraphs.

  • para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Dict[str, Any]]
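The parameters above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the function name handle_entries_sketch, the text_key argument, and the output keys "text", "doc_idx", and "meta" are all hypothetical, and it assumes that paragraphs arrive as a list of strings.

```python
from typing import Any, Dict, Iterable, Iterator


def handle_entries_sketch(
    entries: Iterable[Dict[str, Any]],
    start_doc_idx: int,
    get_meta: bool = False,
    autojoin_paragraphs: bool = True,
    para_joiner: str = "\n\n",
    text_key: str = "text",  # hypothetical: the real key names come from `keys`
) -> Iterator[Dict[str, Any]]:
    """Yield processed entries, skipping documents before start_doc_idx."""
    for doc_idx, entry in enumerate(entries):
        if doc_idx < start_doc_idx:
            continue  # already consumed before the checkpoint
        text = entry[text_key]
        # Assumption: a list of paragraphs is joined into one string.
        if autojoin_paragraphs and isinstance(text, list):
            text = para_joiner.join(text)
        out: Dict[str, Any] = {"text": text, "doc_idx": doc_idx}
        if get_meta:
            out["meta"] = entry.get("meta", {})
        yield out
```

Calling it with start_doc_idx=1 on two entries skips the first and yields only the second, with its paragraphs joined by para_joiner.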

accumulate_and_yield(data_gen, file_idx)

Accumulate data and yield in chunks.

Parameters
  • data_gen (Iterator[Dict[str, Any]]) – Generator yielding data entries.

  • file_idx (int) – Current file index

Returns

Yields accumulated data chunks.

Return type

Iterator[Any]
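A rough sketch of chunked accumulation is given here for orientation. It assumes that entry size is measured as the total character count of an entry's values; the library may measure size differently (for example in bytes), and the file_idx bookkeeping is omitted. The name accumulate_sketch is hypothetical.

```python
from typing import Any, Dict, Iterator, List


def accumulate_sketch(
    data_gen: Iterator[Dict[str, Any]], max_chunk_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group entries into chunks whose estimated size stays within max_chunk_size."""
    chunk: List[Dict[str, Any]] = []
    chunk_size = 0
    for entry in data_gen:
        # Assumption: size is approximated by character count of the values.
        entry_size = sum(len(str(value)) for value in entry.values())
        if chunk and chunk_size + entry_size > max_chunk_size:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(entry)
        chunk_size += entry_size
    if chunk:
        yield chunk  # flush the final, possibly partial, chunk
```

An entry larger than max_chunk_size still forms a chunk of its own in this sketch, so no data is dropped.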

read_txt(file, checkpoint_args)

Read and process text file.

Parameters
  • file (str) – Path to the .txt file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

Returns

Yields processed data lines.

Return type

Iterator[Any]

read_jsongz(file, checkpoint_args)

Read and process gzipped JSON file.

Parameters
  • file (str) – Path to the .json.gz file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

Returns

Yields processed data entries.

Return type

Iterator[Any]
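For orientation, a gzipped JSON file can be streamed with the standard library alone. The sketch below assumes the file holds one JSON object per line, which this reference does not state. The name read_jsongz_sketch and its start_doc_idx argument are hypothetical stand-ins for the method and the document index carried in checkpoint_args.

```python
import gzip
import json
from typing import Any, Iterator


def read_jsongz_sketch(path: str, start_doc_idx: int = 0) -> Iterator[Any]:
    """Yield parsed JSON objects from a gzipped file, one object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for doc_idx, line in enumerate(fh):
            # Skip lines before the resume point, and skip blank lines.
            if doc_idx < start_doc_idx or not line.strip():
                continue
            yield json.loads(line)
```

Because the file is read lazily, only one decompressed line is held in memory at a time.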

read_jsonl(file, checkpoint_args, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')

Read and process JSONL file.

Parameters
  • file (str) – Path to the .jsonl file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

  • get_meta (bool) – Flag to determine if metadata should be extracted.

  • autojoin_paragraphs (bool) – Flag to auto join paragraphs.

  • para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Any]

read_jsonl_zst(file, checkpoint_args, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')

Read and process ZST compressed JSONL file.

Parameters
  • file (str) – Path to the .jsonl.zst file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

  • get_meta (bool) – Flag to determine if metadata should be extracted.

  • autojoin_paragraphs (bool) – Flag to auto join paragraphs.

  • para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Any]

read_jsonl_tar(file, checkpoint_args, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')

Read and process TAR archive containing ZST compressed JSONL files.

Parameters
  • file (str) – Path to the .jsonl.zst.tar file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

  • get_meta (bool) – Flag to determine if metadata should be extracted.

  • autojoin_paragraphs (bool) – Flag to auto join paragraphs.

  • para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Any]

read_parquet(file, checkpoint_args)

Read and process Parquet file.

Parameters
  • file (str) – Path to the .parquet file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

Returns

Yields processed data rows.

Return type

Iterator[Any]

read_fasta(file, checkpoint_args)

Read and process Fasta file without using BioPython.

Parameters
  • file (str) – Path to the .fasta file.

  • checkpoint_args (tuple) – Contains the current file starting index and the current document starting index.

Returns

Yields processed data rows.

Return type

Iterator[Dict[str, Any]]
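Parsing FASTA without BioPython amounts to splitting on header lines that start with ">". The following is a minimal sketch under assumptions: the function name read_fasta_sketch, the start_doc_idx argument, and the output keys "id" and "sequence" are hypothetical and may not match what the library yields.

```python
from typing import Dict, Iterator, List, Optional


def read_fasta_sketch(path: str, start_doc_idx: int = 0) -> Iterator[Dict[str, str]]:
    """Yield one record per FASTA entry, skipping records before start_doc_idx."""
    header: Optional[str] = None
    seq_parts: List[str] = []
    doc_idx = -1  # index of the record currently being collected
    with open(path, "r", encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None and doc_idx >= start_doc_idx:
                    yield {"id": header, "sequence": "".join(seq_parts)}
                header = line[1:]
                seq_parts = []
                doc_idx += 1
            elif header is not None:
                seq_parts.append(line)  # sequences may span several lines
    if header is not None and doc_idx >= start_doc_idx:
        yield {"id": header, "sequence": "".join(seq_parts)}
```

Multi-line sequences are concatenated, and any text before the first header is ignored.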

stream_data(checkpoint_args, get_meta=False)

Stream and process data from multiple file formats.

Parameters
  • get_meta (bool) – Flag to determine if metadata should be extracted.

  • checkpoint_args (tuple) – Contains the current file starting index, current document starting index.

Returns

Yields processed data chunks.

Return type

Iterator[Any]
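One plausible way to combine per-format readers with checkpointing is to dispatch on the file suffix and resume from the stored indices. The sketch below only plans the work and reads nothing. The reader names and suffixes are taken from the method descriptions on this page, but the dispatch logic, the helper names pick_reader_name and files_to_resume, and the treatment of checkpoint_args as a two-element tuple are assumptions.

```python
from typing import Iterator, List, Tuple

# Longest suffixes first so ".jsonl.zst.tar" is not mistaken for a shorter match.
SUFFIX_TO_READER = [
    (".jsonl.zst.tar", "read_jsonl_tar"),
    (".jsonl.zst", "read_jsonl_zst"),
    (".json.gz", "read_jsongz"),
    (".jsonl", "read_jsonl"),
    (".parquet", "read_parquet"),
    (".fasta", "read_fasta"),
    (".txt", "read_txt"),
]


def pick_reader_name(path: str) -> str:
    """Return the name of the reader method that matches the file suffix."""
    for suffix, reader_name in SUFFIX_TO_READER:
        if path.endswith(suffix):
            return reader_name
    raise ValueError(f"Unsupported file format: {path}")


def files_to_resume(
    file_list: List[str], checkpoint_args: Tuple[int, int]
) -> Iterator[Tuple[int, str, str, int]]:
    """Yield (file_idx, path, reader_name, start_doc_idx) for the remaining files."""
    file_start_idx, doc_start_idx = checkpoint_args
    for file_idx, path in enumerate(file_list):
        if file_idx < file_start_idx:
            continue  # fully processed before the checkpoint
        # Only the first resumed file starts mid-way; later files start at 0.
        start_doc = doc_start_idx if file_idx == file_start_idx else 0
        yield file_idx, path, pick_reader_name(path), start_doc
```

With checkpoint_args of (1, 5), the first file is skipped, the second resumes at document 5, and every later file starts at document 0.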