Downstream Validation using Eleuther Eval Harness#
This page walks through the steps for performing downstream validation on the Cerebras Wafer-Scale cluster using EleutherAI’s Evaluation Harness (EEH). EEH is a popular framework for evaluating large language models across various different datasets and tasks.
While you can configure EEH as part of your training (see Pretraining with Downstream Validation) run, this guide is primarily for running EEH standalone, without also configuring any training, using the run script src/cerebras/modelzoo/common/run_eleuther_eval_harness.py
from the Cerebras Model Zoo.
The examples in this guide will perform downstream validation on LLaMA3 8B.
By the end of this guide, you will be able to leverage the EEH framework to perform standalone downstream validation on your models on CS-X.
To skip ahead to the examples covered in this guide without following the entire walk through, please follow the links:
Prerequisites#
Please ensure that you have installed the Cerebras Model Zoo package by going through the installation guide. Note that EEH version tested and packaged in the Cerebras Model Zoo is the official release v0.4.2.
Please also read through the Trainer Overview and Trainer Configuration Overview, as these guides will help understand how to configure running EEH standalone.
Note
This guide configures the downstream EEH run using V2 YAML. While release 2.3 includes support for legacy V1 YAML, please convert the configuration to V2 using convert_legacy_params_to_trainer_params
from the script src/cerebras/modelzoo/trainer/utils.py
.
Configure the Run#
This section covers the required steps for setting up an EEH run to perform standalone downstream validation on various tasks.
In particular, you will need to write a YAML configuration file to configure an instance of the Trainer
class
with the EleutherEvalHarness
callback.
The example in this section configures evaluation for LLaMA3 8B
via the multiple choice (non-generative) eval harness task winogrande
using a single CSX.
If you aren’t interested in seeing the break down of the configuration, feel free to skip ahead to the Putting it All Together section to see the full YAML configuration.
Configure the CSX Backend#
The first step is to specify the CSX backend and resources required for the run.
Please create a YAML configuration file with the following cluster config:
trainer:
init:
backend:
backend_type: CSX
cluster_config:
num_csx: 1
This example uses a single CSX, but you can readily update num_csx
to run EEH on multiple CSXs for improved performance.
Configure the Model#
Next, please add the following model configuration in the YAML for LLaMA3 8B with 8K context length:
trainer:
init:
backend: # CSX
...
model:
name: llama # This setting is required
# Embedding
vocab_size: 128256
hidden_size: 4096
position_embedding_type: "rotary"
pos_scaling_factor: 1.0
rope_theta: 500000.0
rotary_dim: 128
share_embedding_weights: false
max_position_embeddings: 8192
embedding_dropout_rate: 0.0
embedding_layer_norm: false
# Decoder
num_hidden_layers: 32
dropout_rate: 0.0
layer_norm_epsilon: 1.0e-5
norm_type: "rmsnorm"
# Decoder - Attention
num_heads: 32
attention_type: "scaled_dot_product"
attention_module: "multiquery_attention"
attention_dropout_rate: 0.0
use_projection_bias_in_attention: false
use_ffn_bias_in_attention: false
extra_attention_params:
num_kv_groups: 8
# Decoder - ffn
filter_size: 14336
nonlinearity: "swiglu"
use_ffn_bias: false
# Task-specific
use_bias_in_output: false
loss_scaling: "num_tokens"
loss_weight: 1.0
# Initializer
initializer_range: 0.02
Note
Starting from release 2.4.0, specifying model_name is now deprecated and replaced with name.
Note
To run downstream validation harness, you must specify the name
setting in the model configuration. Valid names corresponding to the supported models include:
btlm
bloom
gpt2
gptj
falcon
gpt3
gpt-neox
llama
mistral
mpt
jais
santacoder
starcoder
Configure the EEH Callback#
EEH is implemented as an extension to the Trainer
class that you will configure via the EleutherEvalHarness
callback.
Add the following section in the YAML to set up the EleutherEvalHarness
callback:
trainer:
init:
backend: # CSX
...
model: # Llama3-8B
name: llama # This setting is required
...
callbacks:
- EleutherEvalHarness:
# Eleuther Eval Harness settings (also exposed via CLI)
eeh_args:
tasks: winogrande
num_fewshot: 0
# CSX-specific eval harness settings (also exposed via CLI)
keep_data_dir: false
# Dataloader settings
batch_size: 4
shuffle: false
max_sequence_length: 8192
num_workers: 1
data_dir: <path_to_mounted_dir>
tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
eos_id: 128001
pretrained_model_name_or_path: null
# Eval Harness Flags
flags:
csx.performance.micro_batch_size: null
The eeh_args
section exposes the following settings to configure the EEH run:
Eleuther Eval Harness CLI Arguments |
Description |
---|---|
|
Comma separated string specifying Eleuther Eval Harness tasks. To get full list of tasks, use the command |
|
Number of examples to be added to the fewshot context string. Defaults to 0 |
|
The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used. |
|
Accepts an integer, or a float between 0.0 and 1.0. This limits the number of documents to evaluate per task to the first X documents (if an integer) or first X% of documents. This is useful for debugging. |
|
A path to a sqlite db file for caching model responses. None if not caching. |
|
Speed up evaluation by caching the building of dataset requests. None if not caching. |
|
Whether to run the relevant part of the test suite for the tasks. |
|
Prints the prompt for the first few documents. Defaults to False. |
|
If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Defaults to False. |
|
If True, shows the the full config of all tasks at the end of the evaluation. Defaults to False. |
|
Additional path to include if there are external tasks to include. |
|
Use with –log_samples. Only model outputs will be saved and metrics will not be evaluated. |
|
Set seed for python’s random, numpy and torch. |
|
Sampling temperature used for generation (autoregressive, generate_until tasks only). |
|
Top-p parameter used for nucleus sampling (autoregressive, generate_until tasks only). |
|
Top-k parameter used for generation (autoregressive, generate_until tasks only). |
You can either specify the settings here or pass them via CLI arguments to the standalone EEH run script.
The callback configuration also accepts dataloader settings that you must specify in the YAML to set up input data preprocessing for the run:
DataLoader Settings |
Description |
---|---|
|
This setting is required. Provide a path to the mounted directory visible to the worker containers where eval harness task data samples are dumped after preprocessing. Use the |
|
Path to a custom tokenizer (JSON) file. If you provide a custom tokenizer, then you must also specify |
|
Hugging Face (HF) pretrained model name or path. This setting is required if you do not specify |
|
End-of-sentence (eos) token ID to signal the termination of a sequence. This setting is required if you specify a custom tokenizer in |
|
Maximum length of the input sequence. This setting is required for preprocessing input data samples from the specified eval harness tasks. You should align the |
Additionally, you may optionally specify the following, CSX-specific eval harness setting:
keep_data_dir
: Use this to preserve the preprocessed eval harness task data samples, i.e. the directory specified underdata_dir
. Defaults to False, i.e. data samples are deleted after the run.
(Optional) Configure HuggingFace (HF) Cache Directory#
EEH utilizes HF’s APIs to download task data and other configurations. This data is by default cached under $HOME/.cache/huggingface
.
However, you may choose to specify a different directory for this cached data via the HFCacheDir
callback:
trainer:
init:
backend: # CSX
...
model: # Llama3-8B
...
callbacks:
- EleutherEvalHarness:
...
- HFCacheDir:
cache_dir: <path_to_directory_for_caching_HF_data>
Putting it All Together#
Here’s what the full YAML configuration looks like once you follow this guide for configuring the individual pieces:
trainer:
init:
backend:
backend_type: CSX
cluster_config:
num_csx: 1
mount_dirs: <path(s)_to_mount_to_appliance_containers>
model:
name: llama
# Embedding
vocab_size: 128256
hidden_size: 4096
position_embedding_type: "rotary"
pos_scaling_factor: 1.0
rope_theta: 500000.0
rotary_dim: 128
share_embedding_weights: false
max_position_embeddings: 8192
embedding_dropout_rate: 0.0
embedding_layer_norm: false
# Decoder
num_hidden_layers: 32
dropout_rate: 0.0
layer_norm_epsilon: 1.0e-5
norm_type: "rmsnorm"
# Decoder - Attention
num_heads: 32
attention_type: "scaled_dot_product"
attention_module: "multiquery_attention"
attention_dropout_rate: 0.0
use_projection_bias_in_attention: false
use_ffn_bias_in_attention: false
extra_attention_params:
num_kv_groups: 8
# Decoder - ffn
filter_size: 14336
nonlinearity: "swiglu"
use_ffn_bias: false
# Task-specific
use_bias_in_output: false
loss_scaling: "num_tokens"
loss_weight: 1.0
# Initializer
initializer_range: 0.02
callbacks:
- EleutherEvalHarness:
eeh_args:
tasks: winogrande
num_fewshot: 0
keep_data_dir: false
# Dataloader settings
batch_size: 4
shuffle: false
max_sequence_length: 8192
num_workers: 1
data_dir: <path_to_mounted_dir>
tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
eos_id: 128001
pretrained_model_name_or_path: null
# Eval Harness Flags
flags:
csx.performance.micro_batch_size: null
Running EEH on CS-X#
Now that the the YAML configuration is complete, you will use the run script src/cerebras/modelzoo/common/run_eleuther_eval_harness.py
from the Cerebras Model Zoo to run EEH on various tasks.
This script accepts the following command line interface (CLI) arguments:
python run_eleuther_eval_harness.py CSX [-h] [--tasks task1,task2] [--num_fewshot N] [--output_path DIR|DIR/file.json] [--limit N|0<N<1] [--use_cache DIR]
[--cache_requests {true,refresh,delete}] [--check_integrity] [--write_out] [--log_samples] [--show_config] [--include_path DIR]
[--predict_only] [--seed SEED] [--trust_remote_code] [--temperature TEMPERATURE] [--top_p TOP_P] [--top_k TOP_K]
[--keep_data_dir]
-p PARAMS [-m {eval}] [-o MODEL_DIR] [--checkpoint_path CHECKPOINT_PATH]
[--disable_strict_checkpoint_loading] [--load_checkpoint_states LOAD_CHECKPOINT_STATES] [--logging LOGGING]
[--wsc_log_level WSC_LOG_LEVEL [WSC_LOG_LEVEL ...]] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--config CONFIG]
[--compile_only | --validate_only] [--num_workers_per_csx NUM_WORKERS_PER_CSX] [-c COMPILE_DIR]
[--job_labels JOB_LABELS [JOB_LABELS ...]] [--job_priority {p1,p2,p3}] [--debug_args_path DEBUG_ARGS_PATH]
[--mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]] [--python_paths PYTHON_PATHS [PYTHON_PATHS ...]]
[--credentials_path CREDENTIALS_PATH] [--mgmt_address MGMT_ADDRESS] [--job_time_sec JOB_TIME_SEC] [--disable_version_check]
[--num_csx NUM_CSX] [--num_wgt_servers NUM_WGT_SERVERS] [--num_act_servers NUM_ACT_SERVERS]
[--debug_args [DEBUG_ARGS [DEBUG_ARGS ...]]] [--ini [INI [INI ...]]] [--transfer_processes TRANSFER_PROCESSES]
Note
We support a subset of Eleuther’s command line interface (CLI) arguments above. For a more detailed descrition of these supported arguments, see https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.0/docs/interface.md.
You may also specify these arguments in the YAML under the
eeh_args
key of theEleutherEvalHarness
configuration, but please note that the CLI setting will override the settings in the YAML.The
--params
CLI argument is required. Use it to specify the path to the YAML configuration file. Note that while we do support the old V1 YAML specification, it will soon be deprecated so we recommend using the new V2 YAML.Use the
--checkpoint_path
CLI argument to specify the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, we support checkpoint autoloading in this flow such that the latest checkpoint file will be picked up from the specifiedmodel_dir
.
Example (Single Non-generative Task)#
Let’s assume that the YAML configuration file above is written to ./llama3_8B_eeh.yaml
. Then, to run evaluation for task winogrande
, please set up a bash script as follows:
python <path_to_cerebras_modelzoo>/common/run_eleuther_eval_harness.py CSX \
--params ./llama3_8B_eeh.yaml \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \ # Must include the path to the "data_dir" eval harness config in the YAML
The output logs are as follows:
...
2024-06-24 11:06:49,596 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=632, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-06-24 11:06:49,625 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=633, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-06-24 11:07:04,333 INFO: <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7ff89ae3ceb0> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-24 11:07:04,696 INFO:
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande| 1|none | 0|acc |0.7284|± |0.0141|
Example (Multiple Non-generative Tasks)#
To run evaluation on more non-generative tasks, you may update the run script to the following:
python <path_to_cerebras_modelzoo>/common/run_eleuther_eval_harness.py CSX \
--params ./llama3_8B_eeh.yaml \
--tasks "arc_challenge,hellaswag,openbookqa,piqa,winogrande" \ # Overrides the EEH "tasks" config in the YAML
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
The output logs are as follows:
...
2024-06-24 12:40:44,896 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=2979, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
2024-06-24 12:40:57,106 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=2980, Rate=34.40 samples/sec, GlobalRate=34.39 samples/sec
2024-06-24 12:41:10,333 INFO: <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7ff89ae3ceb0> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-24 12:41:10,700 INFO:
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|-------------|-------|------|-----:|--------|-----:|---|-----:|
|arc_challenge|Yaml |none | 0|acc |0.5334|± |0.0145|
| | |none | 0|acc_norm|0.5300|± |0.0146|
|hellaswag |Yaml |none | 0|acc |0.5716|± |0.0049|
| | |none | 0|acc_norm|0.7917|± |0.0043|
|openbookqa |Yaml |none | 0|acc |0.3210|± |0.0207|
| | |none | 0|acc_norm|0.4520|± |0.0223|
|piqa |Yaml |none | 0|acc |0.7997|± |0.0097|
| | |none | 0|acc_norm|0.8090|± |0.0095|
|winogrande |Yaml |none | 0|acc |0.7284|± |0.0130|
Example (Generative task)#
The EEH flow also supports running generative (autoregressive) eval harness tasks, i.e. tasks that specify output_type: generate_until
in the task specification, such as triviaqa
or drop
. Refer to task triviqa
for an example specification from the official EEH repository.
In order to run generative inference on CSX, you must specify the following inference settings in the model config of YAML file:
start_token
- ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID. If you do, the model will start inference at the first token that matches any one of the provided IDs. The model will pad inferred predictions with the first ID in the list.stop_sequences
- List of sequences (each one being a list of token IDs). If any one of these sequences is emitted by the model, inference will stop for that sample. For example, suppose you would like to stop inferring after either a newline character (e.g. token id 1), or a combination of a period (e.g. token id 2) followed by a space (e.g. token id 3). In this case, set stop_sequences to [[1], [2, 3]]. To stop inferring after seeing a newline character only, set stop_sequences to [[1]]. To disable this feature, setstop_sequences
to an empty list []. Additionally, the following optional parameters may be set:max_tokens
- Maximum tokens to infer for each sample.loop_dim
- Indicates the sequence dimension in the input and output data. Default value is 1. If set to 0, indicates that both input and output data is transposed (i.e.sequence X samples
instead ofsamples X sequence
).
For your LLaMA3 8B example, please update the YAML file ./llama3_8B_eeh.yaml
to add these inference settings:
trainer:
init:
backend: # CSX
...
model: # Llama3-8B
name: llama
...
# Inference Settings
start_token: 128256 # Set to `vocab_size`
stop_sequences: [[198], [13], [11]] # Respective tokens for "\n", "." and ","
max_tokens: 256 # Default from HF implementations
loop_dim: 1
callbacks:
- EleutherEvalHarness:
# Eleuther Eval Harness settings
eeh_args:
tasks: triviaqa
num_fewshot: 0
...
...
Note
For
start_token
, it is ideal to choose a value that’s not going to be generated by the model, i.e.vocab_size
in the example above.The generative task itself defines
stop_sequences
under settinggeneration_kwargs.until
of the task spec. For instance, triviqa specifies"\n"
,"."
and","
as the stop tokens. The EEH flow will internally override thestop_sequences
config with the value from the task, so you can also specify an arbitrary value in the YAML.
Finally, please update the bash script as follows:
python <path_to_cerebras_modelzoo>/common/run_eleuther_eval_harness.py CSX \
--params ./llama3_8B_eeh.yaml \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
The output logs are as follows:
...
2024-06-26 11:22:47,530 INFO: | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=4480, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 11:22:47,553 INFO: | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=4485, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 11:38:15,868 INFO: <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7f9e305fa730> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-26 11:38:16,232 INFO:
| Tasks |Version| Filter |n-shot| Metric |Value | |Stderr|
|--------|------:|-----------------|-----:|-----------|-----:|---|-----:|
|triviaqa| 3|remove_whitespace| 0|exact_match|0.1120|± |0.0001|
By default, the model will perform greedy sampling of the inferred tokens, i.e. for all of the model’s outputs, pick the token with the highest probability.
In order to perform non-greedy sampling, you can pass in temperature
, top_k
or top_p
to either the bash script or under eeh_args
of the YAML. For example:
python <path_to_cerebras_modelzoo>/common/run_eleuther_eval_harness.py CSX \
--params ./llama3_8B_eeh.yaml \
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
--temperature 0.7 \
--top_k 10 \
--top_p 0.95 \
Example (Non-generative and Generative Tasks)#
You can combine evaluation for generative and non-generative eval harness tasks simply via updating the bash script as follows:
python <path_to_cerebras_modelzoo>/common/run_eleuther_eval_harness.py CSX \
--params ./llama3_8B_eeh.yaml \
--tasks winogrande,drop \ # Runs evaluation for non-generative task "winogrande" and generative task "drop"
--checkpoint_path <path_to_checkpoint_file> \
--python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
--mount_dirs <path(s)_to_mount_to_appliance_containers> \
The output logs are as follows:
...
2024-06-26 13:57:52,177 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=632, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-06-26 13:57:52,206 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=633, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-06-26 13:57:52,333 INFO: Running generate_until requests
...
2024-06-26 14:00:03,499 INFO: | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=2382, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 14:00:03,519 INFO: | EleutherAI Generative Eval Device=CSX, GlobalStep=1, Batch=2383, Rate=0.38 samples/sec, GlobalRate=0.38 samples/sec
2024-06-26 14:00:31,890 INFO: <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7f156522c490> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-26 14:00:32,351 INFO:
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande| 1|none | 0|acc |0.7284|± |0.0141|
|drop | 3|none | 0|em |0.0072|± |0.0008|
| | |none | 0|f1 |0.0371|± |0.0013|
Note
Since inference settings are baked into the model configuration and that inference requires different resources, there is a separate compile and execution for running downstream validation on generative tasks.
You may specify at most one generative task at a time (per callback).
Run Multiple Generative Tasks#
You can run multiple generative tasks by configuring multiple EleutherEvalHarness
callbacks (one per task) in the YAML.
For example, you may update the YAML config as such to run downstream validation on generative tasks gsm8k
and drop
:
trainer:
init:
backend: # CSX
...
model: # Llama3-8B
name: llama # This setting is required
...
# Inference Settings
start_token: 128256 # Set to `vocab_size`
stop_sequences: [] # Arbitrary value that will be overridden from the task spec
max_tokens: 256 # Default from HF implementations
loop_dim: 1
callbacks:
- EleutherEvalHarness:
# Eleuther Eval Harness settings (also exposed via CLI)
eeh_args:
tasks: gsm8k
num_fewshot: 0
# CSX-specific eval harness settings (also exposed via CLI)
keep_data_dir: false
# Dataloader settings
batch_size: 4
shuffle: false
max_sequence_length: 8192
num_workers: 1
data_dir: <path_to_mounted_dir>
tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
eos_id: 128001
pretrained_model_name_or_path: null
# Eval Harness Flags
flags:
csx.performance.micro_batch_size: null
- EleutherEvalHarness:
# Eleuther Eval Harness settings (also exposed via CLI)
eeh_args:
tasks: drop
num_fewshot: 0
# CSX-specific eval harness settings (also exposed via CLI)
keep_data_dir: false
# Dataloader settings
batch_size: 4
shuffle: false
max_sequence_length: 8192
num_workers: 1
data_dir: <path_to_mounted_dir>
tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
eos_id: 128001
pretrained_model_name_or_path: null
# Eval Harness Flags
flags:
csx.performance.micro_batch_size: null
Supported Tasks#
You may perform downstream validation on all EEH tasks with
output_type: loglikelihood
oroutput_type: multiple_choice
in the task specification. See asdiv and arc_easy for respective examples. You may specify each of these types of tasks separately or together in a singleEleutherEvalHarness
callback.We currently do not support eval harness tasks with
output_type: loglikelihood_rolling
. These unsupported tasks arepile
andwikitext
for the supported EEH version 0.4.2.Some EEH tasks group multiple generative tasks together. Since you may only specify one generative task in the YAML per
EleutherEvalHarness
callback, you should not specify the following such task groups in a single callback:agieval
babi
bbh
bbh_cot_fewshot
bbh_cot_zeroshot
bbh_fewshot
bbh_zeroshot
bigbench
codexglue_code2text
math_word_problems
mgsm_direct
mgsm_cot_native
mmlu_flan_cot_fewshot
mmlu_flan_cot_zeroshot
mmlu_flan_n_shot_generative
polemo2
super-glue-lm-eval-v1
unscramble
In order to run downstream validation on these tasks groups, please individually specify each generative subtask using a separate
EleutherEvalHarness
callback.
Adding New Tasks#
Please refer to Eleuther’s new task implementation guide here to add new tasks.
Limitations#
We currently do not support running multiple generative eval harness tasks in the same callback.
EEH task groups, such as agieval, comprise multiple generative sub tasks that you will have to configure in the YAML via separate callbacks.
Conclusion#
In summary, by following this guide you have run standalone downstream validation for the Llama3-8B model on EEH’s multiple different evaluation datasets / tasks.
You should now be comfortable in configuring more downstream EEH runs on your model of choice on even more eval harness tasks.
What’s next?#
To run downstream validation on code generation tasks, please see check out:
You can also perform downstream validation using EEH as part of your pretraining runs with upstream validation. Check out the following guide: