cerebras.modelzoo.data_preparation.raw_dataset_processor.utils.Reader

class cerebras.modelzoo.data_preparation.raw_dataset_processor.utils.Reader(file_list, keys, read_hook_fn)

Bases: object

Initialize the Reader instance.

Parameters

file_list (List[str]) – List of file paths to be read.
keys (Optional[Dict]) – Dictionary containing the type of key and its name.

Methods

`handle_jsonl`	Handle JSONL data and yield processed entries.
`read_fasta`	Read and process FASTA file without using BioPython.
`read_jsongz`	Read and process gzipped JSON file.
`read_jsonl`	Read and process JSONL file.
`read_jsonl_tar`	Read and process TAR archive containing ZST compressed JSONL files.
`read_jsonl_zst`	Read and process ZST compressed JSONL file.
`read_parquet`	Read and process Parquet file.
`read_txt`	Read and process text file.
`stream_data`	Stream and process data from multiple file formats.

handle_jsonl(jsonl_reader, get_meta, autojoin_paragraphs, para_joiner)

Handle JSONL data and yield processed entries.

Parameters

jsonl_reader (Any) – The JSONL reader object.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to auto join paragraphs.
para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Dict[str, Any]]
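The interplay of the three flags can be illustrated with a minimal sketch. It is not the library's implementation: the function name `iter_jsonl_entries` is hypothetical, and the `"text"` and `"meta"` field names are assumptions made for illustration only.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl_entries(
    records: Iterable[Dict[str, Any]],
    get_meta: bool = False,
    autojoin_paragraphs: bool = True,
    para_joiner: str = "\n\n",
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for handle_jsonl; not the library's code."""
    for record in records:
        text = record.get("text")  # the "text" field name is an assumption
        # Join a list of paragraphs into one string when requested.
        if autojoin_paragraphs and isinstance(text, list):
            text = para_joiner.join(text)
        entry = {"text": text}
        if get_meta:
            # The "meta" field name is an assumption as well.
            entry["meta"] = record.get("meta", {})
        yield entry


lines = [
    '{"text": ["First paragraph.", "Second paragraph."], "meta": {"id": 1}}',
    '{"text": "Already a single string."}',
]
parsed = (json.loads(line) for line in lines)
entries = list(iter_jsonl_entries(parsed, get_meta=True))
```

Here the first entry's paragraphs are joined with `para_joiner`, and each entry carries a `meta` value only because `get_meta=True` was passed.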

read_txt(file)

Read and process text file.

Parameters: file (str) – Path to the .txt file.
Returns: Yields processed data lines.
Return type: Iterator[Any]

read_jsongz(file)

Read and process gzipped JSON file.

Parameters: file (str) – Path to the .json.gz file.
Returns: Yields processed data entries.
Return type: Iterator[Any]
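As a rough illustration of streaming a gzipped JSON file, the sketch below uses only the standard library. It assumes the file holds one JSON document per line, which this page does not confirm; `iter_json_gz` is a hypothetical name, not part of `Reader`.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Iterator


def iter_json_gz(path: str) -> Iterator[Any]:
    """Yield one parsed object per non-empty line of a gzipped file."""
    # Text mode ("rt") lets gzip handle decompression and decoding together.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a small sample file, then read it back.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8") as handle:
    handle.write('{"text": "hello"}\n{"text": "world"}\n')

rows = list(iter_json_gz(sample_path))
```

Because the function is a generator, the file is decompressed lazily and only one line is held in memory at a time.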

read_jsonl(file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')

Read and process JSONL file.

Parameters

file (str) – Path to the .jsonl file.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to auto join paragraphs.
para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Any]

read_jsonl_zst(file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')

Read and process ZST compressed JSONL file.

Parameters

file (str) – Path to the .jsonl.zst file.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to auto join paragraphs.
para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Any]

read_jsonl_tar(file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')

Read and process TAR archive containing ZST compressed JSONL files.

Parameters

file (str) – Path to the .jsonl.zst.tar file.
get_meta (bool) – Flag to determine if metadata should be extracted.
autojoin_paragraphs (bool) – Flag to auto join paragraphs.
para_joiner (str) – Paragraph joiner string.

Returns

Yields processed data entries.

Return type

Iterator[Any]
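The member-by-member traversal of a TAR archive can be sketched with the standard `tarfile` module. To stay within the standard library, this sketch stores plain JSONL members and omits the ZST decompression step that real `.jsonl.zst` members would need; `iter_jsonl_in_tar` and `build_sample_tar` are hypothetical helpers, not library functions.

```python
import io
import json
import tarfile
from typing import Any, Dict, Iterator


def iter_jsonl_in_tar(tar_bytes: bytes) -> Iterator[Dict[str, Any]]:
    """Yield parsed JSONL records from every regular file in a TAR archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file members
            extracted = archive.extractfile(member)
            # A real .jsonl.zst member would need zstd decompression here.
            for raw_line in extracted:
                if raw_line.strip():
                    yield json.loads(raw_line)


def build_sample_tar() -> bytes:
    """Build a small in-memory archive with two JSONL members."""
    payloads = [
        ("a.jsonl", b'{"text": "one"}\n'),
        ("b.jsonl", b'{"text": "two"}\n{"text": "three"}\n'),
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in payloads:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


docs = list(iter_jsonl_in_tar(build_sample_tar()))
```

Records are yielded in archive order, so entries from `a.jsonl` appear before those from `b.jsonl`.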

read_parquet(file)

Read and process Parquet file.

Parameters: file (str) – Path to the .parquet file.
Returns: Yields processed data rows.
Return type: Iterator[Any]

read_fasta(file)

Read and process FASTA file without using BioPython.

Parameters: file (str) – Path to the .fasta file.
Returns: Yields processed data rows.
Return type: Iterator[Dict[str, Any]]
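Parsing FASTA without BioPython comes down to splitting on header lines that start with `>` and concatenating the sequence lines that follow. The sketch below shows one way to do that; the function name `iter_fasta` and the `"id"` and `"sequence"` keys are assumptions, and the dictionaries that `read_fasta` actually yields may be shaped differently.

```python
from typing import Dict, Iterable, Iterator


def iter_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per FASTA header, with its sequence lines joined."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield {"id": header, "sequence": "".join(chunks)}
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield {"id": header, "sequence": "".join(chunks)}


sample = """>seq1 example protein
MKTAYIAK
QRQISFVK
>seq2
ACGT
"""
records = list(iter_fasta(sample.splitlines()))
```

Passing an open file handle instead of `sample.splitlines()` works the same way, since the function only iterates over lines.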

stream_data(get_meta=False)

Stream and process data from multiple file formats.

Parameters: get_meta (bool) – Flag to determine if metadata should be extracted.
Returns: Yields processed data chunks.
Return type: Iterator[Any]
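One plausible way to stream from several formats is to route each path to a reader by its file suffix. The sketch below assumes that approach; this page does not state how `stream_data` selects a reader, and `stream_files` along with the stub readers are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List


def stream_files(
    file_list: List[str],
    readers: Dict[str, Callable[[str], Iterable[Any]]],
) -> Iterator[Any]:
    """Yield items from each file using the reader registered for its suffix."""
    # Try longer suffixes first so a more specific suffix wins if two overlap.
    suffixes = sorted(readers, key=len, reverse=True)
    for path in file_list:
        for suffix in suffixes:
            if path.endswith(suffix):
                yield from readers[suffix](path)
                break
        else:
            raise ValueError(f"No reader registered for {path!r}")


# Stub readers that only report which branch handled the file.
readers = {
    ".txt": lambda p: [("txt", p)],
    ".jsonl": lambda p: [("jsonl", p)],
    ".jsonl.zst": lambda p: [("jsonl_zst", p)],
    ".jsonl.zst.tar": lambda p: [("jsonl_tar", p)],
}
routed = list(
    stream_files(["a.txt", "b.jsonl.zst", "c.jsonl.zst.tar", "d.jsonl"], readers)
)
```

Keeping the suffix-to-reader mapping in a dictionary makes it straightforward to add a format without touching the dispatch loop.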
