cerebras.modelzoo.data_preparation.data_preprocessing.utils.get_data_stats

cerebras.modelzoo.data_preparation.data_preprocessing.utils.get_data_stats(sample, pad_id, eos_id, max_seq_length, loss_valid_tokens=None)

Get data statistics from the sample.

Parameters
  • sample (np.ndarray) – Tokenized sample in the form of a NumPy array.

  • pad_id (int) – The ID used for padding tokens.

  • eos_id (int) – The ID used for end-of-sequence tokens.

  • max_seq_length (int) – The maximum sequence length.

  • loss_valid_tokens (Optional[int]) – The number of valid tokens for loss computation. If not provided, it will be calculated from the sample.

Returns

A dictionary containing the following data statistics:
  • "num_pad_tokens": Number of padding tokens in the sample.

  • "non_pad_tokens": Number of tokens that are neither padding nor end-of-sequence tokens.

  • "num_tokens": Total number of tokens in the sample.

  • "loss_valid_tokens": Number of valid tokens for loss computation.

  • "num_masked_tokens": Number of masked tokens, based on the maximum sequence length.

Return type

Dict[str, int]
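
Example usage (a minimal sketch based on the signature above; the token IDs and the sample array are illustrative placeholders, and the exact layout expected for sample depends on the preprocessing pipeline that produced it):

    import numpy as np

    from cerebras.modelzoo.data_preparation.data_preprocessing.utils import (
        get_data_stats,
    )

    # Illustrative values -- real IDs come from your tokenizer/config.
    PAD_ID = 0
    EOS_ID = 2
    MAX_SEQ_LENGTH = 8

    # A tokenized sample padded out to MAX_SEQ_LENGTH (this 1-D layout is an
    # assumption for illustration; pass the arrays emitted by your
    # preprocessing step).
    sample = np.array([11, 12, 13, 14, 15, EOS_ID, PAD_ID, PAD_ID])

    stats = get_data_stats(
        sample=sample,
        pad_id=PAD_ID,
        eos_id=EOS_ID,
        max_seq_length=MAX_SEQ_LENGTH,
    )

    # stats is a Dict[str, int] with the keys documented above, e.g.
    # stats["num_pad_tokens"], stats["loss_valid_tokens"], ...

Since loss_valid_tokens is omitted here, it is calculated from the sample, as noted in the parameter description.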