Default Benchmark
LAMM-Benchmark
Notes: LAMM-Benchmark has now been fully re-implemented with ChEF, and we strongly recommend using the latest ChEF evaluation pipeline for benchmarking in your work. ChEF supports evaluation of LAMM's common 2D and 3D tasks as well as its locating tasks. Please note that the GPT-rank metric in LAMM is no longer supported.
To evaluate LAMM/Octavius on the 2D common tasks in LAMM-Benchmark, use a pre-defined model config (src/config/ChEF/models/lamm.yaml or src/config/ChEF/models/octavius_2d+3d.yaml) together with the pre-defined recipe configs in src/config/ChEF/scenario_recipes/LAMM/.
For example, to evaluate LAMM on ScienceQA, run:
```shell
python eval.py --model_cfg config/ChEF/models/lamm.yaml --recipe_cfg config/ChEF/scenario_recipes/LAMM/ScienceQA.yaml
```
To run all the evaluations automatically and sequentially, use:

```shell
sh tools/LAMM/eval_lamm2d.sh
sh tools/LAMM/eval_lamm3d.sh
```
To evaluate Octavius on ScanNet Detection, run:
```shell
sh tools/Octavius/octavius_ChEF.sh
```
ChEF
Download Evaluated MLLMs
| MLLM | Vision Encoder | Language Model | Link |
|---|---|---|---|
| InstructBLIP | EVA-G | Vicuna 7B | instruct_blip_vicuna7b_trimmed |
| Kosmos2 | CLIP ViT-L/14 | Decoder 1.3B | kosmos-2.pt |
| LAMM | CLIP ViT-L/14 | Vicuna 13B | lamm_13b_lora32_186k |
| LLaMA-Adapter-v2 | CLIP ViT-L/14 | LLaMA 7B | LORA-BIAS-7B |
| LLaVA | CLIP ViT-L/14 | MPT 7B | LLaVA-Lightning-MPT-7B |
| MiniGPT-4 | EVA-G | Vicuna 7B | MiniGPT-4 |
| mPLUG-Owl | CLIP ViT-L/14 | LLaMA 7B | mplug-owl-llama-7b |
| Otter | CLIP ViT-L/14 | LLaMA 7B | OTTER-9B-LA-InContext |
| Shikra | CLIP ViT-L/14 | LLaMA 7B | shikra-7b |
Organize them as below:

```
...
ckpt
├── epcl_vit-L_256tokens
├── ...
│   ├── lamm_2d   # saved checkpoints in training
│   └── ...
└── ...
```
Visual Performance Evaluation
We provide several recipe and model configs in src/config/ChEF.
For example, to evaluate LAMM on CIFAR10 using the default recipe, run:
```shell
python tools/eval.py --model_cfg config/ChEF/models/lamm.yaml --recipe_cfg config/ChEF/scenario_recipes/CIFAR10/default.yaml
```
Besides, if you would like to conduct evaluation with your custom model, dataset, or metric, please refer to Custom ChEF Evaluation.
Desiderata
ChEF sets up several new evaluations to quantify the desiderata (desired capabilities) that a competent MLLM should possess as a reliable agent capable of real-world multimodal interaction.
Calibration
Calibration evaluates how well the uncertainty about each MLLM's prediction aligns with its accuracy, as highlighted by HELM. ChEF provides calibration evaluation on MMBench (src/config/ChEF/desiderata_recipes/Calibration/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/Calibration/ScienceQA.yaml).
```shell
python tools/ChEF/eval_calibration.py --model_cfg model_cfg --recipe_cfg recipe_cfg
```
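Calibration is commonly summarized by an expected calibration error (ECE)-style statistic: bin predictions by stated confidence and compare per-bin confidence with per-bin accuracy. A minimal sketch of that idea (illustrative only; the metric ChEF actually reports may differ):

```python
# Minimal Expected Calibration Error (ECE) sketch: bin predictions by
# confidence and average the |accuracy - confidence| gap per bin,
# weighted by bin size. Illustrative, not ChEF's exact implementation.

def ece(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(acc - avg_conf)
    return err

# Perfectly calibrated toy example: 80% confidence, 80% accuracy.
print(round(ece([0.8] * 10, [True] * 8 + [False] * 2), 4))  # 0.0
```

A perfectly calibrated model scores 0; an overconfident one (e.g. 100% confidence at 50% accuracy) scores the full gap.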
In-context Learning
This evaluation measures the crucial in-context learning (ICL) ability of an MLLM. ChEF provides ICL evaluation on MMBench (src/config/ChEF/desiderata_recipes/ICL/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/ICL/ScienceQA.yaml).
```shell
python tools/ChEF/eval_icl.py --model_cfg model_cfg --recipe_cfg recipe_cfg
```
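ICL evaluation hinges on prepending k in-context exemplars to each query. A toy k-shot prompt builder (the template below is an assumption for illustration, not ChEF's actual prompt format, which is defined by its recipe configs):

```python
# Toy k-shot prompt builder: prepend k (question, answer) exemplars to
# the test question. The template is hypothetical and for illustration.

def build_icl_prompt(exemplars, query, k=2):
    shots = exemplars[:k]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

demo = [("What color is the sky?", "Blue"),
        ("How many legs does a cat have?", "Four")]
prompt = build_icl_prompt(demo, "What shape is a ball?")
print(prompt)
```

Comparing accuracy at k = 0 versus k > 0 then indicates whether the model actually benefits from the exemplars.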
Instruction Following
Instruction following evaluates how closely the MLLM adheres to the given instructions. ChEF provides instruction-following evaluation on MMBench (src/config/ChEF/desiderata_recipes/Insfollow/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/Insfollow/ScienceQA.yaml).
```shell
python tools/ChEF/eval_insfollow.py --model_cfg model_cfg --recipe_cfg recipe_cfg
```
Language Performance
Language performance evaluates the quality of the generated sentences, using a GPT-based metric. Before evaluating language performance, please first finish inference on MMBench and ScienceQA using the default recipes: MMBench_recipe (src/config/ChEF/scenario_recipes/MMBench/default.yaml) and ScienceQA_recipe (src/config/ChEF/scenario_recipes/ScienceQA/default.yaml).
```shell
python tools/desiderata/eval_langperf.py --base-data-path dataset_path --answer-path results_path --response-dir output_path
```
Robustness
Robustness measures how robust an MLLM is to corruptions in the multimodal inputs. ChEF provides robustness evaluation on MMBench (src/config/ChEF/desiderata_recipes/Robust/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/Robust/ScienceQA.yaml).
```shell
python tools/ChEF/eval_robust.py --model_cfg model_cfg --recipe_cfg recipe_cfg
```
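A common way to summarize robustness is the relative accuracy drop between clean and corrupted inputs. A minimal sketch of that summary statistic (illustrative; ChEF's reported robustness score may be defined differently):

```python
# Sketch: relative accuracy drop under input corruption, normalized by
# clean accuracy. 0 means fully robust; 1 means accuracy collapsed.
# Illustrative only, not ChEF's exact metric.

def relative_drop(acc_clean, acc_corrupt):
    return (acc_clean - acc_corrupt) / acc_clean

# Toy numbers: 80% clean accuracy falling to 60% under corruption.
print(round(relative_drop(0.80, 0.60), 3))  # 0.25
```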
Hallucination
Hallucination evaluates how well an MLLM avoids mentioning visual objects that do not exist in the image. ChEF uses POPE (src/config/ChEF/desiderata_recipes/Hallucination) for hallucination evaluation.
```shell
python tools/ChEF/eval_hallucination.py --model_cfg model_cfg --recipe_cfg recipe_cfg
```
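POPE frames hallucination as binary object-presence questions ("Is there a &lt;object&gt; in the image?") and scores the yes/no answers. A sketch of the standard POPE-style statistics (illustrative; the exact numbers ChEF reports may differ):

```python
# Sketch of POPE-style scoring: accuracy, precision, recall, F1, and the
# "yes" ratio over binary object-presence answers. A high yes ratio with
# low precision indicates the model hallucinates absent objects.

def pope_scores(preds, labels):
    tp = sum(p == "yes" and gt == "yes" for p, gt in zip(preds, labels))
    fp = sum(p == "yes" and gt == "no" for p, gt in zip(preds, labels))
    fn = sum(p == "no" and gt == "yes" for p, gt in zip(preds, labels))
    tn = sum(p == "no" and gt == "no" for p, gt in zip(preds, labels))
    acc = (tp + tn) / len(preds)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"acc": acc, "precision": prec, "recall": rec,
            "f1": f1, "yes_ratio": (tp + fp) / len(preds)}

# Toy answers vs. ground truth for four presence questions.
scores = pope_scores(["yes", "yes", "no", "no"],
                     ["yes", "no", "no", "yes"])
print(scores)
```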