cerebras.modelzoo.data_preparation.data_preprocessing.utils.get_data_stats

cerebras.modelzoo.data_preparation.data_preprocessing.utils.get_data_stats(sample, pad_id, eos_id, max_seq_length, loss_valid_tokens=None)

Get data statistics from the sample.

Parameters
  • sample (np.ndarray) – Tokenized sample in the form of a NumPy array.

  • pad_id (int) – The ID used for padding tokens.

  • eos_id (int) – The ID used for end-of-sequence tokens.

  • max_seq_length (int) – The maximum sequence length.

  • loss_valid_tokens (Optional[int]) – The number of valid tokens for loss computation. If not provided, it will be calculated from the sample.

Returns

A dictionary containing the following data statistics:
  • "num_pad_tokens": Number of padding tokens in the sample.

  • "non_pad_tokens": Number of tokens that are neither padding nor end-of-sequence tokens.

  • "num_tokens": Total number of tokens in the sample.

  • "loss_valid_tokens": Number of valid tokens for loss computation.

  • "num_masked_tokens": Number of masked tokens, based on the maximum sequence length.

Return type

Dict[str, int]
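
Example usage (a minimal sketch based on the signature above; the token IDs and the sample array are illustrative placeholders, and the exact layout expected for sample depends on the preprocessing pipeline that produced it):

    import numpy as np

    from cerebras.modelzoo.data_preparation.data_preprocessing.utils import (
        get_data_stats,
    )

    # Illustrative values -- real IDs come from your tokenizer/config.
    PAD_ID = 0
    EOS_ID = 2
    MAX_SEQ_LENGTH = 8

    # A tokenized sample padded out to MAX_SEQ_LENGTH (this 1-D layout is an
    # assumption for illustration; pass the arrays emitted by your
    # preprocessing step).
    sample = np.array([11, 12, 13, 14, 15, EOS_ID, PAD_ID, PAD_ID])

    stats = get_data_stats(
        sample=sample,
        pad_id=PAD_ID,
        eos_id=EOS_ID,
        max_seq_length=MAX_SEQ_LENGTH,
    )

    # stats is a Dict[str, int] with the keys documented above, e.g.
    # stats["num_pad_tokens"], stats["loss_valid_tokens"], ...

Since loss_valid_tokens is omitted here, it is calculated from the sample, as noted in the parameter description.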