Running Eleuther AI’s Evaluation Harness#
Overview#
We provide support for running EleutherAI’s Evaluation Harness (EEH) on the Cerebras Wafer-Scale cluster. EEH is a popular framework for evaluating large language models across a wide variety of datasets and tasks.
Running Evaluation Harness on CS-X#
To run EEH tasks on CS-X, use the modelzoo/common/run_cstorch_eval_harness.py
script from the Cerebras Model Zoo. This script is similar to our other run scripts normally used to launch train/eval jobs, and it accepts the following command-line arguments:
python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX [-h] [--tasks TASKS] [--num_fewshot NUM_FEWSHOT] [--output_path = [dir/file.jsonl] [DIR]] [--limit LIMIT] [--use_cache USE_CACHE]
[--check_integrity] [--write_out] [--log_samples] [--show_config] [--include_path INCLUDE_PATH] [--hf_cache_dir HF_CACHE_DIR] [--keep_data_dir]
-p PARAMS [-m {eval}] [-o MODEL_DIR]
[--checkpoint_path CHECKPOINT_PATH] [--disable_strict_checkpoint_loading] [--load_checkpoint_states LOAD_CHECKPOINT_STATES] [--logging LOGGING]
[--wsc_log_level WSC_LOG_LEVEL [WSC_LOG_LEVEL ...]] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--config CONFIG] [--compile_only | --validate_only]
[--num_workers_per_csx NUM_WORKERS_PER_CSX] [-c COMPILE_DIR] [--job_labels JOB_LABELS [JOB_LABELS ...]] [--job_priority {p1,p2,p3}]
[--debug_args_path DEBUG_ARGS_PATH] [--mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]] [--python_paths PYTHON_PATHS [PYTHON_PATHS ...]]
[--credentials_path CREDENTIALS_PATH] [--mgmt_address MGMT_ADDRESS] [--job_time_sec JOB_TIME_SEC] [--disable_version_check] [--num_csx NUM_CSX]
[--num_wgt_servers NUM_WGT_SERVERS] [--num_act_servers NUM_ACT_SERVERS] [--debug_args [DEBUG_ARGS [DEBUG_ARGS ...]]] [--ini [INI [INI ...]]]
[--transfer_processes TRANSFER_PROCESSES]
| Eleuther Eval Harness Arguments | Description |
|---|---|
| `--tasks TASKS` | Comma-separated string specifying Eleuther Eval Harness tasks. To get the full list of tasks, use the command `lm-eval --tasks list`. |
| `--num_fewshot NUM_FEWSHOT` | Number of examples to be added to the fewshot context string. Defaults to 0. |
| `--output_path OUTPUT_PATH` | The path to the output file where the result metrics will be saved. If the path is a directory and `--log_samples` is passed, the results will be saved in that directory; otherwise its parent directory will be used. |
| `--limit LIMIT` | Accepts an integer, or a float between 0.0 and 1.0. Limits the number of documents to evaluate per task to the first X documents (if an integer) or the first X% of documents (if a float). Useful for debugging. |
| `--use_cache USE_CACHE` | A path to a SQLite database file for caching model responses. `None` if not caching. |
| `--check_integrity` | Whether to run the relevant part of the test suite for the tasks. |
| `--write_out` | Prints the prompt for the first few documents. Defaults to False. |
| `--log_samples` | If True, writes out all model outputs and documents for per-sample measurement and post-hoc analysis. Defaults to False. |
| `--show_config` | If True, shows the full config of all tasks at the end of the evaluation. Defaults to False. |
| `--include_path INCLUDE_PATH` | Additional path to include if there are external tasks to include. |
| `--hf_cache_dir HF_CACHE_DIR` | Path to a directory for caching data downloaded from Hugging Face. |
| `--keep_data_dir` | Specifies whether dumped data samples should be kept for reuse. Defaults to False, i.e. data samples are deleted after the run. |
Note

In run_cstorch_eval_harness.py, we expose only a subset of Eleuther’s command-line interface (CLI) arguments, listed above. For a detailed description of these supported arguments, see https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.0/docs/interface.md.

Note that the CSX-specific CLI arguments are exactly the same as in our training and evaluation flows using run.py scripts. Refer to the PyTorch params documentation for a detailed description of the runconfig parameters that are part of the CSX CLI.
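For example, a quick debugging run that evaluates only the first 10 documents of a task and logs per-sample outputs might combine several of these flags as follows (a minimal sketch; all paths are placeholders):

python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "hellaswag" \
--limit 10 \
--log_samples \
--output_path <path_to_output_dir> \
--checkpoint_path <path_to_checkpoint_file>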
The following runconfig arguments are important for running EEH on CS-X:
| Runconfig Arguments | Description |
|---|---|
| `--params PARAMS` | This argument specifies the path to the YAML params file that configures the model and the run. |
| `--checkpoint_path CHECKPOINT_PATH` | This argument specifies the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, we support checkpoint autoloading in this flow, such that the latest checkpoint file is picked up from the specified model directory (`--model_dir`). |
| `--keep_data_dir` | This option, if set, preserves the preprocessed data samples generated by the DataLoader for the EEH task data, i.e. the directory specified under `data_dir` in `eval_input` is not deleted after the run. |
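For example, to rely on checkpoint autoloading, omit `--checkpoint_path` and point the run at a model directory from a previous run (a sketch, assuming that directory already contains checkpoints):

python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "winogrande" \
--model_dir <model_dir_containing_checkpoints>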
Additionally, note that the following settings are important for input data preprocessing and must be specified under the `eval_input` section of your YAML file:
| Input data preprocessing | Description |
|---|---|
| `data_dir` | This setting must be specified. Provide a path to the mounted directory, visible to the worker containers, where EEH task data samples are dumped after preprocessing. Use the `--mount_dirs` argument to make this directory visible to the appliance containers. |
| `tokenizer_file_path` | Path to a custom tokenizer (JSON) file for models that do not use the default GPT-2 tokenizer. |
| `eos_id` | This setting is required if a custom tokenizer is specified; otherwise we use the default GPT-2 tokenizer's eos id. |
| `max_sequence_length` | This setting is required for preprocessing input data samples from the specified eval harness tasks. We recommend aligning the `max_sequence_length` with the model's maximum context length. |
| `num_workers` | This setting, if specified, must be at most 1, since multi-process dataloading is not supported in this flow. |
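Tying these settings together, a minimal `eval_input` section might look as follows (values are illustrative; `eos_id` and `max_sequence_length` depend on your tokenizer and model):

eval_input:
    data_dir: <path_to_mounted_directory_for_data_dumps>  # must be visible to worker containers; see --mount_dirs
    tokenizer_file_path: <path_to_custom_tokenizer.json>  # omit to use the default tokenizer
    eos_id: 2                                             # eos token id of the custom tokenizer
    max_sequence_length: 4096                             # align with the model's maximum context length
    num_workers: 1                                        # must be at most 1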
Example#
Here are sample runs for the Llama2 7B model with 4K context length, using a pretrained checkpoint from Hugging Face (HF).
Note that we first convert the HF checkpoint to a CS checkpoint using our checkpoint conversion script. See Convert checkpoints and model configs for more details on how to do this.
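As a rough sketch of that conversion step, the invocation might look like the following; the script path, subcommand, format identifiers, and flags shown here are assumptions for illustration, so consult the conversion guide above for the authoritative usage:

python <path_to_modelzoo>/common/model_utils/convert_checkpoint.py convert \
<path_to_hf_checkpoint> \
--model llama \
--src-fmt hf \
--tgt-fmt cs-<release_version> \
--config <path_to_hf_config> \
--output-dir <path_to_converted_checkpoint_dir>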
The following setup is for running the multiple-choice eval task winogrande:
python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "winogrande" \
--num_fewshot 0 \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
--logging "info"

with the `eval_input` section of the params YAML configured as follows:
eval_input:
    micro_batch_size: null
    batch_size: 21
    data_dir: <path_to_mounted_directory_for_data_dumps>
    eos_id: 2
    max_sequence_length: 4096
    num_workers: 1
    shuffle: false
    shuffle_seed: 1
    tokenizer_file_path: <path_to_llama_tokenizer>
The output logs should be as follows:
...
2024-02-29 11:57:43,888 INFO: | Eval Device=CSX, Step=100, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-02-29 11:57:56,100 INFO: | Eval Device=CSX, Step=120, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-02-29 11:57:56,712 INFO: | Eval Device=CSX, Step=121, Rate=34.35 samples/sec, GlobalRate=34.35 samples/sec
<cerebras.modelzoo.common.eval_harness_impl.CSEvalHarnessAdapter object at 0x7f06cfdc42e0> (None), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: None
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|-------|------|-----:|------|-----:|---|-----:|
|winogrande|Yaml |none | 0|acc |0.6882|± |0.0130|
To run more tasks, update the command as follows:
python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "arc_challenge,arc_easy,hellaswag,openbookqa,piqa,winogrande" \
--num_fewshot 0 \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
--logging "info"
The output logs should be:
...
2024-02-29 12:40:44,896 INFO: | Eval Device=CSX, Step=2960, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
2024-02-29 12:40:57,106 INFO: | Eval Device=CSX, Step=2980, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
<cerebras.modelzoo.common.eval_harness_impl.CSEvalHarnessAdapter object at 0x7fd7400ece50> (None), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: None
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|-------------|-------|------|-----:|--------|-----:|---|-----:|
|arc_challenge|Yaml |none | 0|acc |0.4334|± |0.0145|
| | |none | 0|acc_norm|0.4625|± |0.0146|
|arc_easy |Yaml |none | 0|acc |0.7630|± |0.0087|
| | |none | 0|acc_norm|0.7462|± |0.0089|
|hellaswag |Yaml |none | 0|acc |0.5716|± |0.0049|
| | |none | 0|acc_norm|0.7597|± |0.0043|
|openbookqa |Yaml |none | 0|acc |0.3140|± |0.0207|
| | |none | 0|acc_norm|0.4420|± |0.0223|
|piqa |Yaml |none | 0|acc |0.7797|± |0.0097|
| | |none | 0|acc_norm|0.7899|± |0.0095|
|winogrande |Yaml |none | 0|acc |0.6882|± |0.0130|
In addition to these logs, the directory specified via the `--output_path` command-line argument contains the dumped output metrics as well.
Supported Models#
Evaluation Harness on CS-X is supported for several Model Zoo models including GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, StarCoder, and SantaCoder. Use the params YAML file to specify the desired model architecture.
Supported Eval Harness Tasks#
Non-Autoregressive Eval Harness Tasks

- In Release 2.2.0, we support all tasks that specify `output_type: loglikelihood` or `output_type: multiple_choice`; you may run multiple such tasks in one go, as shown in the example run above.
- Generative eval harness tasks with `output_type: generate_until` are not yet supported.
- Tasks specifying `output_type: loglikelihood_rolling` are not yet supported; for the supported EEH version (v0.4.0), these are the tasks `pile` and `wikitext`.
Adding New Tasks#
To add new tasks, please refer to Eleuther’s new task implementation guide at https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.0/docs/new_task_guide.md.
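Once a new task config is written, you can point the harness at it via the `--include_path` argument described above. The following is a minimal sketch of such a task YAML; the task name, dataset path, and field values are hypothetical, so follow Eleuther’s guide for the authoritative schema. Note the `output_type` field, which determines whether the task is runnable on CS-X (see Supported Eval Harness Tasks above):

# my_new_task.yaml (hypothetical)
task: my_new_task
dataset_path: <hf_dataset_path>
output_type: multiple_choice  # must be loglikelihood or multiple_choice to run on CS-X in Release 2.2.0
validation_split: validation
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{label}}"
doc_to_choice: "{{choices}}"
metric_list:
  - metric: acc

You can then run it by passing, e.g., `--include_path <dir_containing_my_new_task.yaml> --tasks "my_new_task"` to run_cstorch_eval_harness.py.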
Release Notes#
In Release 2.2.0, we provide support for EEH version v0.4.0.