cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame#
- class cerebras.modelzoo.data_preparation.nlp.chunk_data_processing.data_reader.DataFrame[source]#
Bases:
object
Initialize the DataFrame object.
- Parameters
keys (Dict) – Keys for the data entries.
Methods
Add an entry to the DataFrame.
Appends the different examples in a DataFrame object to different HDF5 files.
Checks whether the document is corrupted in the case of summarization tasks.
Clear the raw data after tokenizing.
Save the processed tokenized data to a CSV file.
Save the DataFrame object to an HDF5 file.
Tokenize the data values.
- __init__(keys: Optional[Dict] = None)[source]#
Initialize the DataFrame object.
- Parameters
keys (Dict) – Keys for the data entries.
- save_to_hdf5(h5file: Any, write_in_batch: bool, dtype: str = 'i4', compression: str = 'gzip') None [source]#
Save the DataFrame object to an HDF5 file.
- Parameters
h5file – An HDF5 file handle.
write_in_batch (bool) – Whether to write all examples in a single batched write.
dtype (str) – Data type for the saved token arrays.
compression (str) – Compression filter for the HDF5 datasets.
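As a minimal sketch of the batched-save path, the function below stands in for `save_to_hdf5`; a plain dict (`fake_h5`) substitutes for a real `h5py.File` handle, and the `"data"` key and row layout are illustrative assumptions, not the library's actual on-disk schema.

```python
def save_to_hdf5(h5file, tokenized_data, write_in_batch=True, dtype="i4", compression="gzip"):
    """Stand-in for DataFrame.save_to_hdf5: store token-ID rows under a 'data' key.

    With a real h5py handle this would be roughly
    h5file.create_dataset("data", data=rows, dtype=dtype, compression=compression).
    """
    rows = [list(example) for example in tokenized_data]
    if write_in_batch:
        h5file["data"] = rows  # one write covering every example
    else:
        h5file["data"] = []
        for row in rows:  # one write per example
            h5file["data"].append(row)
    return len(rows)


# Usage, with a dict standing in for an HDF5 file handle:
fake_h5 = {}
n = save_to_hdf5(fake_h5, [[101, 7592, 102], [101, 2088, 102]])
```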
- save_mlm_data_to_csv(csv_file_path)[source]#
Save the processed tokenized data to a CSV file.
- Parameters
csv_file_path (str) – Path to the CSV file to write.
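A stdlib-only sketch of what writing tokenized MLM data to CSV might look like; the column names (`input_ids`, `labels`), the space-separated serialization, and the row structure are assumptions for illustration, not the exact schema the class emits.

```python
import csv
import io


def save_mlm_data_to_csv(csv_file, rows, header=("input_ids", "labels")):
    # Stand-in for DataFrame.save_mlm_data_to_csv: one CSV row per example,
    # with token-ID lists serialized as space-separated strings.
    writer = csv.writer(csv_file)
    writer.writerow(header)
    for input_ids, labels in rows:
        writer.writerow([" ".join(map(str, input_ids)), " ".join(map(str, labels))])


# Usage with an in-memory buffer instead of a file path:
buf = io.StringIO()
save_mlm_data_to_csv(buf, [([101, 103, 102], [-100, 7592, -100])])
```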
- append_to_hdf5(output_dir, total_chunks, pid, chunk_locks, dtype='i4', compression='gzip')[source]#
Appends the different examples in a DataFrame object to different HDF5 files. This API is called when online shuffling is used.
- Parameters
output_dir – Output directory where the HDF5 data is written.
total_chunks – Total number of estimated output chunks.
pid – Process ID of the writer process.
chunk_locks – The list of file-specific chunk locks used while appending to an output file.
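The per-file locking pattern behind `chunk_locks` can be sketched as follows. The chunk-selection rule (round-robin over `total_chunks`) and the list-append target are illustrative assumptions, and `threading.Lock` is used to keep the sketch self-contained where the real writer processes would use multiprocessing locks; only the lock-per-output-file idea mirrors the documented parameter.

```python
import threading


def append_examples(output_files, chunk_locks, examples, total_chunks):
    # Stand-in for DataFrame.append_to_hdf5: scatter examples across
    # `total_chunks` output buffers, holding the matching per-file lock so
    # concurrent writers never append to the same output file at once.
    written = 0
    for i, example in enumerate(examples):
        chunk_id = i % total_chunks      # illustrative online-shuffling rule
        with chunk_locks[chunk_id]:      # file-specific chunk lock
            output_files[chunk_id].append(example)
        written += 1
    return written


# Usage: three output "files" (lists), one lock per file.
total_chunks = 3
files = [[] for _ in range(total_chunks)]
locks = [threading.Lock() for _ in range(total_chunks)]
n = append_examples(files, locks, [[1], [2], [3], [4]], total_chunks)
```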
- add(value: Dict[str, Any]) None [source]#
Add an entry to the DataFrame.
- Parameters
value (Dict[str, Any]) – Entry to be added.
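Putting the methods together, the typical lifecycle is: construct with `keys`, `add` raw entries, tokenize, then clear the raw data. The toy class below is a stdlib-only sketch of that flow; the method names (`tokenize`, `clear`), the whitespace "tokenizer", and the single `text` key are assumptions for illustration, not the ModelZoo implementation.

```python
class MiniDataFrame:
    # Toy stand-in for the DataFrame class, illustrating the
    # add / tokenize / clear-raw-data lifecycle on one text key.
    def __init__(self, keys=None):
        self.keys = keys or ["text"]
        self.raw_data = {k: [] for k in self.keys}
        self.tokenized_data = []

    def add(self, value):
        # `value` is a dict mapping each configured key to one entry.
        for k in self.keys:
            self.raw_data[k].append(value[k])

    def tokenize(self, vocab):
        # Whitespace "tokenizer": map each word to an ID, 0 if unknown.
        for text in self.raw_data["text"]:
            self.tokenized_data.append([vocab.get(w, 0) for w in text.split()])

    def clear(self):
        # Drop raw text once token IDs exist, freeing memory.
        self.raw_data = {k: [] for k in self.keys}


df = MiniDataFrame(keys=["text"])
df.add({"text": "hello world"})
df.tokenize({"hello": 1, "world": 2})
df.clear()
```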