Running Eleuther AI’s Evaluation Harness#
Overview#
We provide support for running EleutherAI’s Evaluation Harness (EEH) on the Cerebras Wafer-Scale cluster. EEH is a popular framework for evaluating large language models across a wide variety of datasets and tasks.
Running Evaluation Harness on CS-X#
To run EEH tasks on CS-X, use the modelzoo/common/run_cstorch_eval_harness.py
script from the Cerebras Model Zoo. This script is similar to our other run scripts normally used to launch train/eval jobs, and it accepts the following command-line arguments:
python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX [-h] [--tasks TASKS] [--num_fewshot NUM_FEWSHOT] [--output_path = [dir/file.jsonl] [DIR]] [--limit LIMIT] [--use_cache USE_CACHE]
[--check_integrity] [--write_out] [--log_samples] [--show_config] [--include_path INCLUDE_PATH] [--hf_cache_dir HF_CACHE_DIR] [--keep_data_dir]
-p PARAMS [-m {eval}] [-o MODEL_DIR]
[--checkpoint_path CHECKPOINT_PATH] [--disable_strict_checkpoint_loading] [--load_checkpoint_states LOAD_CHECKPOINT_STATES] [--logging LOGGING]
[--wsc_log_level WSC_LOG_LEVEL [WSC_LOG_LEVEL ...]] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--config CONFIG] [--compile_only | --validate_only]
[--num_workers_per_csx NUM_WORKERS_PER_CSX] [-c COMPILE_DIR] [--job_labels JOB_LABELS [JOB_LABELS ...]] [--job_priority {p1,p2,p3}]
[--debug_args_path DEBUG_ARGS_PATH] [--mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]] [--python_paths PYTHON_PATHS [PYTHON_PATHS ...]]
[--credentials_path CREDENTIALS_PATH] [--mgmt_address MGMT_ADDRESS] [--job_time_sec JOB_TIME_SEC] [--disable_version_check] [--num_csx NUM_CSX]
[--num_wgt_servers NUM_WGT_SERVERS] [--num_act_servers NUM_ACT_SERVERS] [--debug_args [DEBUG_ARGS [DEBUG_ARGS ...]]] [--ini [INI [INI ...]]]
[--transfer_processes TRANSFER_PROCESSES]
| Eleuther Eval Harness Arguments | Description |
|---|---|
| `--tasks TASKS` | Comma-separated string specifying Eleuther Eval Harness tasks. To get the full list of tasks, use the command `lm-eval --tasks list`. |
| `--num_fewshot NUM_FEWSHOT` | Number of examples to be added to the fewshot context string. Defaults to 0. |
| `--output_path OUTPUT_PATH` | The path to the output file where the result metrics will be saved. If the path is a directory and `--log_samples` is passed, the results will be saved in that directory; otherwise its parent directory will be used. |
| `--limit LIMIT` | Accepts an integer, or a float between 0.0 and 1.0. Limits the number of documents to evaluate per task to the first X documents (if an integer) or the first X% of documents (if a float). Useful for debugging. |
| `--use_cache USE_CACHE` | A path to a SQLite database file for caching model responses. `None` if not caching. |
| `--check_integrity` | Whether to run the relevant part of the test suite for the tasks. |
| `--write_out` | Prints the prompt for the first few documents. Defaults to False. |
| `--log_samples` | If True, writes out all model outputs and documents for per-sample measurement and post-hoc analysis. Defaults to False. |
| `--show_config` | If True, shows the full config of all tasks at the end of the evaluation. Defaults to False. |
| `--include_path INCLUDE_PATH` | Additional path to include if there are external tasks to include. |
| `--hf_cache_dir HF_CACHE_DIR` | Path to a directory for caching data downloaded from Hugging Face. |
| `--keep_data_dir` | Specifies whether dumped data samples should be kept for reuse. Defaults to False, i.e. data samples are deleted after the run. |
Note

In run_cstorch_eval_harness.py, we expose only a subset of Eleuther’s command-line interface (CLI) arguments, listed above. For a detailed description of these supported arguments, see https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.0/docs/interface.md.

Note that the CSX-specific CLI arguments are exactly the same as in our training and evaluation flows using run.py scripts. Refer to the PyTorch params documentation for a detailed description of the runconfig parameters that are part of the CSX CLI.
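For example, a quick debugging run that evaluates only the first 10 documents of a task and logs per-sample outputs might combine several of these flags as follows (a minimal sketch; all paths are placeholders):

python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "hellaswag" \
--limit 10 \
--log_samples \
--output_path <path_to_output_dir> \
--checkpoint_path <path_to_checkpoint_file>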
The following runconfig arguments are important for running EEH on CS-X:
| Runconfig Arguments | Description |
|---|---|
| `--params PARAMS` | This argument specifies the path to the YAML params file that configures the model and the run. |
| `--checkpoint_path CHECKPOINT_PATH` | This argument specifies the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, we support checkpoint autoloading in this flow, such that the latest checkpoint file is picked up from the specified model directory (`--model_dir`). |
| `--keep_data_dir` | This option, if set, preserves the preprocessed data samples generated by the DataLoader for the EEH task data, i.e. the directory specified under `data_dir` in `eval_input` is not deleted after the run. |
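For example, to rely on checkpoint autoloading, omit `--checkpoint_path` and point the run at a model directory from a previous run (a sketch, assuming that directory already contains checkpoints):

python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "winogrande" \
--model_dir <model_dir_containing_checkpoints>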
Additionally, note that the following settings are important for input data preprocessing and must be specified under the `eval_input` section of your YAML file:
| Input data preprocessing | Description |
|---|---|
| `data_dir` | This setting must be specified. Provide a path to the mounted directory, visible to the worker containers, where EEH task data samples are dumped after preprocessing. Use the `--mount_dirs` argument to make this directory visible to the appliance containers. |
| `tokenizer_file_path` | Path to a custom tokenizer (JSON) file for models that do not use the default GPT-2 tokenizer. |
| `eos_id` | This setting is required if a custom tokenizer is specified; otherwise we use the default GPT-2 tokenizer's eos id. |
| `max_sequence_length` | This setting is required for preprocessing input data samples from the specified eval harness tasks. We recommend aligning the `max_sequence_length` with the model's maximum context length. |
| `num_workers` | This setting, if specified, must be at most 1, since multi-process dataloading is not supported in this flow. |
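Tying these settings together, a minimal `eval_input` section might look as follows (values are illustrative; `eos_id` and `max_sequence_length` depend on your tokenizer and model):

eval_input:
    data_dir: <path_to_mounted_directory_for_data_dumps>  # must be visible to worker containers; see --mount_dirs
    tokenizer_file_path: <path_to_custom_tokenizer.json>  # omit to use the default tokenizer
    eos_id: 2                                             # eos token id of the custom tokenizer
    max_sequence_length: 4096                             # align with the model's maximum context length
    num_workers: 1                                        # must be at most 1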
Example#
Here are sample runs for the Llama2 7B model with 4K context length, using a pretrained checkpoint from Hugging Face (HF).
Note that we first convert the HF checkpoint to a CS checkpoint using our checkpoint conversion script. See Convert checkpoints and model configs for more details on how to do this.
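As a rough sketch of that conversion step, the invocation might look like the following; the script path, subcommand, format identifiers, and flags shown here are assumptions for illustration, so consult the conversion guide above for the authoritative usage:

python <path_to_modelzoo>/common/model_utils/convert_checkpoint.py convert \
<path_to_hf_checkpoint> \
--model llama \
--src-fmt hf \
--tgt-fmt cs-<release_version> \
--config <path_to_hf_config> \
--output-dir <path_to_converted_checkpoint_dir>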
The following setup is for running the multiple-choice eval task winogrande:
python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "winogrande" \
--num_fewshot 0 \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
--logging "info"

with the `eval_input` section of the params YAML configured as follows:
eval_input:
    micro_batch_size: null
    batch_size: 21
    data_dir: <path_to_mounted_directory_for_data_dumps>
    eos_id: 2
    max_sequence_length: 4096
    num_workers: 1
    shuffle: false
    shuffle_seed: 1
    tokenizer_file_path: <path_to_llama_tokenizer>
The output logs should be as follows:
...
2024-02-29 11:57:43,888 INFO: | Eval Device=CSX, Step=100, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-02-29 11:57:56,100 INFO: | Eval Device=CSX, Step=120, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-02-29 11:57:56,712 INFO: | Eval Device=CSX, Step=121, Rate=34.35 samples/sec, GlobalRate=34.35 samples/sec
<cerebras.modelzoo.common.eval_harness_impl.CSEvalHarnessAdapter object at 0x7f06cfdc42e0> (None), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: None
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|-------|------|-----:|------|-----:|---|-----:|
|winogrande|Yaml |none | 0|acc |0.6882|± |0.0130|
To run more tasks, update the command as follows:
python <path_to_modelzoo>/common/run_cstorch_eval_harness.py CSX \
--params <path_to_params.yaml> \
--tasks "arc_challenge,arc_easy,hellaswag,openbookqa,piqa,winogrande" \
--num_fewshot 0 \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
--logging "info"
The output logs should be:
...
2024-02-29 12:40:44,896 INFO: | Eval Device=CSX, Step=2960, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
2024-02-29 12:40:57,106 INFO: | Eval Device=CSX, Step=2980, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
<cerebras.modelzoo.common.eval_harness_impl.CSEvalHarnessAdapter object at 0x7fd7400ece50> (None), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: None
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|-------------|-------|------|-----:|--------|-----:|---|-----:|
|arc_challenge|Yaml |none | 0|acc |0.4334|± |0.0145|
| | |none | 0|acc_norm|0.4625|± |0.0146|
|arc_easy |Yaml |none | 0|acc |0.7630|± |0.0087|
| | |none | 0|acc_norm|0.7462|± |0.0089|
|hellaswag |Yaml |none | 0|acc |0.5716|± |0.0049|
| | |none | 0|acc_norm|0.7597|± |0.0043|
|openbookqa |Yaml |none | 0|acc |0.3140|± |0.0207|
| | |none | 0|acc_norm|0.4420|± |0.0223|
|piqa |Yaml |none | 0|acc |0.7797|± |0.0097|
| | |none | 0|acc_norm|0.7899|± |0.0095|
|winogrande |Yaml |none | 0|acc |0.6882|± |0.0130|
In addition to these logs, the directory specified via the `--output_path` command-line argument contains the dumped output metrics as well.
Supported Models#
Evaluation Harness on CS-X is supported for several Model Zoo models including GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, StarCoder, and SantaCoder. Use the params YAML file to specify the desired model architecture.
Supported Eval Harness Tasks#
Non-Autoregressive Eval Harness Tasks

- In Release 2.2.0, we support all tasks that specify `output_type: loglikelihood` or `output_type: multiple_choice`; you may run multiple such tasks in one go, as shown in the example run above.
- Generative eval harness tasks with `output_type: generate_until` are not yet supported.
- Tasks specifying `output_type: loglikelihood_rolling` are not yet supported; for the supported EEH version (v0.4.0), these are the tasks `pile` and `wikitext`.
Adding New Tasks#
To add new tasks, please refer to Eleuther’s new task implementation guide at https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.0/docs/new_task_guide.md.
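Once a new task config is written, you can point the harness at it via the `--include_path` argument described above. The following is a minimal sketch of such a task YAML; the task name, dataset path, and field values are hypothetical, so follow Eleuther’s guide for the authoritative schema. Note the `output_type` field, which determines whether the task is runnable on CS-X (see Supported Eval Harness Tasks above):

# my_new_task.yaml (hypothetical)
task: my_new_task
dataset_path: <hf_dataset_path>
output_type: multiple_choice  # must be loglikelihood or multiple_choice to run on CS-X in Release 2.2.0
validation_split: validation
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{label}}"
doc_to_choice: "{{choices}}"
metric_list:
  - metric: acc

You can then run it by passing, e.g., `--include_path <dir_containing_my_new_task.yaml> --tasks "my_new_task"` to run_cstorch_eval_harness.py.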
Release Notes#
In Release 2.2.0, we provide support for EEH version v0.4.0.