
LAMM Benchmark

Notes: LAMM-Benchmark has now been fully implemented with ChEF, and we highly recommend using the latest ChEF evaluation pipeline for benchmarking in your work. ChEF supports evaluation of the common 2D and 3D tasks as well as the locating tasks in LAMM. Please note that the GPT-rank metric in LAMM is no longer applicable.

To evaluate LAMM/Octavius on LAMM-Benchmark in 2D common tasks, use the pre-defined model config (src/config/ChEF/models/lamm.yaml or src/config/ChEF/models/octavius_2d+3d.yaml) and the pre-defined recipe configs (src/config/ChEF/scenario_recipes/LAMM/).

For example, to evaluate LAMM on ScienceQA, run:

python eval.py --model_cfg config/ChEF/models/lamm.yaml  --recipe_cfg config/ChEF/scenario_recipes/LAMM/ScienceQA.yaml

If you want to run all the evaluations sequentially, you can run:

sh tools/LAMM/eval_lamm2d.sh
sh tools/LAMM/eval_lamm3d.sh
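If you prefer a Python driver, the sketch below loops over the pre-defined LAMM recipe configs and launches eval.py for each one, mirroring what the shell scripts do; the paths follow the configs listed above, and the loop itself is an assumption rather than part of the repository.

```python
# Hypothetical batch runner (not part of the repo): iterate over the pre-defined
# LAMM recipe configs and invoke eval.py for each, similar to eval_lamm2d.sh.
import glob
import subprocess

MODEL_CFG = "config/ChEF/models/lamm.yaml"          # pre-defined LAMM model config
RECIPE_DIR = "config/ChEF/scenario_recipes/LAMM"    # pre-defined LAMM recipes

for recipe_cfg in sorted(glob.glob(f"{RECIPE_DIR}/*.yaml")):
    print(f"Evaluating with recipe: {recipe_cfg}")
    subprocess.run(
        ["python", "eval.py", "--model_cfg", MODEL_CFG, "--recipe_cfg", recipe_cfg],
        check=True,  # stop if any evaluation fails
    )
```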

To evaluate Octavius on ScanNet Detection, run:

sh tools/Octavius/octavius_ChEF.sh

ChEF Benchmark

Download Evaluated MLLMs

| LLM | Vision Encoder | Language Model | Link |
| --- | --- | --- | --- |
| InstructBLIP | EVA-G | Vicuna 7B | instruct_blip_vicuna7b_trimmed |
| Kosmos2 | CLIP ViT-L/14 | Decoder 1.3B | kosmos-2.pt |
| LAMM | CLIP ViT-L/14 | Vicuna 13B | lamm_13b_lora32_186k |
| LLaMA-Adapter-v2 | CLIP ViT-L/14 | LLaMA 7B | LORA-BIAS-7B |
| LLaVA | CLIP ViT-L/14 | MPT 7B | LLaVA-Lightning-MPT-7B |
| MiniGPT-4 | EVA-G | Vicuna 7B | MiniGPT-4 |
| mPLUG-Owl | CLIP ViT-L/14 | LLaMA 7B | mplug-owl-llama-7b |
| Otter | CLIP ViT-L/14 | LLaMA 7B | OTTER-9B-LA-InContext |
| Shikra | CLIP ViT-L/14 | LLaMA 7B | shikra-7b |

Organize them as below:

...
ckpt
├── epcl_vit-L_256tokens
├── ...
│   ├── lamm_2d    # saved checkpoints in training
│   └── ...
└── ...
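Before launching an evaluation, it can be worth verifying that this layout is in place. The snippet below is a small sanity-check sketch; only ckpt/epcl_vit-L_256tokens comes from the layout above, and everything else under ckpt/ depends on which checkpoints you downloaded.

```python
# Sanity-check sketch for the checkpoint layout above. Only
# ckpt/epcl_vit-L_256tokens is taken from the docs; other subdirectories depend
# on which evaluated MLLM checkpoints you downloaded.
from pathlib import Path

ckpt_root = Path("ckpt")
required = [ckpt_root / "epcl_vit-L_256tokens"]

for path in required:
    print(("ok     " if path.is_dir() else "MISSING"), path)

if ckpt_root.is_dir():
    for sub in sorted(p for p in ckpt_root.iterdir() if p.is_dir()):
        print("found  ", sub)  # whatever else is organized under ckpt/
```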

Visual Performance Evaluation

We provide several recipes and model configs in src/config/ChEF.

For example, to evaluate LAMM on CIFAR10 using the default recipe, run:

python tools/eval.py --model_cfg config/ChEF/models/lamm.yaml --recipe_cfg config/ChEF/scenario_recipes/CIFAR10/default.yaml

Besides, if you would like to conduct evaluation with your custom model, dataset, or metric, please refer to Custom ChEF Evaluation.
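Before a run, it can also help to peek at what a recipe bundles. The sketch below only assumes that the recipe is a YAML mapping and uses PyYAML to print its top-level sections; the actual keys are defined by ChEF's recipe schema.

```python
# Minimal sketch: load a recipe config and print its top-level sections before
# launching an evaluation. Assumes only that the file is a YAML mapping; the
# actual keys are defined by ChEF's recipe schema.
import yaml

recipe_path = "config/ChEF/scenario_recipes/CIFAR10/default.yaml"
with open(recipe_path) as f:
    recipe = yaml.safe_load(f)

for section, value in recipe.items():
    print(f"{section}: {value}")
```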

Desiderata

ChEF sets up several new evaluations to quantify the desiderata (desired capabilities) that a competent MLLM should possess as a reliable agent capable of real-world multimodal interactions.

Calibration

Calibration evaluates how the uncertainty about each MLLM’s prediction is aligned with its accuracy, as highlighted by HELM. ChEF provides the calibration evaluation on MMBench (src/config/ChEF/desiderata_recipes/Calibration/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/Calibration/ScienceQA.yaml).

python tools/ChEF/eval_calibration.py --model_cfg model_cfg --recipe_cfg recipe_cfg
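For intuition, calibration is commonly summarized with the expected calibration error (ECE), which bins predictions by confidence and averages the gap between per-bin accuracy and confidence. The sketch below illustrates the idea and is not ChEF's exact implementation.

```python
# Expected calibration error (ECE) sketch: bin predictions by confidence and
# average |accuracy - confidence| per bin, weighted by bin size.
# Illustrative only; ChEF's calibration metric may differ in details.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy usage: per-question answer confidences and whether each answer was correct.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```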

In-context Learning

In-context learning evaluates the crucial in-context learning (ICL) ability of an MLLM. ChEF provides the in-context learning evaluation on MMBench (src/config/ChEF/desiderata_recipes/ICL/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/ICL/ScienceQA.yaml).

python tools/ChEF/eval_icl.py --model_cfg model_cfg --recipe_cfg recipe_cfg
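As a rough illustration of what an ICL query looks like, the sketch below prepends k solved question–answer exemplars to the test question; the prompt template is hypothetical, and ChEF's ICL recipes define the real format.

```python
# Hypothetical k-shot prompt builder: prepend k solved exemplars to the query.
# The template is illustrative only; ChEF's ICL recipes define the real format.
def build_icl_prompt(exemplars, query, k=2):
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars[:k]]
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

exemplars = [
    ("What color is the sky in the image?", "Blue"),
    ("How many dogs are in the image?", "Two"),
]
print(build_icl_prompt(exemplars, "What is the person holding?"))
```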

Instruction Following

Instruction following evaluates how exactly the MLLM relies on the given instructions. ChEF provides the instruction following evaluation on MMBench (src/config/ChEF/desiderata_recipes/Insfollow/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/Insfollow/ScienceQA.yaml).

python tools/ChEF/eval_insfollow.py --model_cfg model_cfg --recipe_cfg recipe_cfg
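One simple way to quantify instruction following is to check how often the model's choice stays the same when the instruction is perturbed (for example, reworded or with shuffled options). The sketch below computes such a match rate and is an illustration only, not ChEF's exact metric.

```python
# Sketch of a simple instruction-following score: the fraction of questions where
# the model's answer under a perturbed instruction maps to the same underlying
# choice as under the original instruction. Illustrative; not ChEF's exact metric.
def follow_rate(original_answers, perturbed_answers):
    assert len(original_answers) == len(perturbed_answers)
    matches = sum(o == p for o, p in zip(original_answers, perturbed_answers))
    return matches / len(original_answers)

print(follow_rate(["A", "B", "C", "A"], ["A", "B", "D", "A"]))  # 0.75
```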

Language Performance

Language performance evaluates the quality of the generated sentences. ChEF uses a GPT-based metric. Before evaluating language performance, please first finish inference on MMBench and ScienceQA using the default recipes: the MMBench recipe (src/config/ChEF/scenario_recipes/MMBench/default.yaml) and the ScienceQA recipe (src/config/ChEF/scenario_recipes/ScienceQA/default.yaml).

python tools/desiderata/eval_langperf.py --base-data-path dataset_path --answer-path results_path --response-dir output_path
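If you want to score both datasets in one go, a small wrapper such as the sketch below can call the script once per dataset; the paths are placeholders for your own dataset roots, inference results, and output directories.

```python
# Wrapper sketch: call the language-performance script once per dataset.
# All paths are placeholders; replace them with your own directories.
import subprocess

runs = [
    # (dataset annotations, inference results, where to write the GPT-based scores)
    ("<mmbench_dataset_path>", "<mmbench_results_path>", "<mmbench_output_path>"),
    ("<scienceqa_dataset_path>", "<scienceqa_results_path>", "<scienceqa_output_path>"),
]
for base_data_path, answer_path, response_dir in runs:
    subprocess.run(
        ["python", "tools/desiderata/eval_langperf.py",
         "--base-data-path", base_data_path,
         "--answer-path", answer_path,
         "--response-dir", response_dir],
        check=True,
    )
```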

Robustness

Robustness measures how robust an MLLM is to corruption in the multimodal inputs. ChEF provides the robustness evaluation on MMBench (src/config/ChEF/desiderata_recipes/Robust/MMBench.yaml) and ScienceQA (src/config/ChEF/desiderata_recipes/Robust/ScienceQA.yaml).

python tools/ChEF/eval_robust.py --model_cfg model_cfg --recipe_cfg recipe_cfg
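To make "corruption in the multimodal inputs" concrete, the standalone sketch below applies additive Gaussian noise to an image, one typical image corruption; ChEF's robustness recipes define the actual set of corruptions that are evaluated.

```python
# Example image corruption: additive Gaussian noise, a typical perturbation in
# robustness benchmarks. ChEF's robustness recipes define the actual corruptions.
import numpy as np
from PIL import Image

def gaussian_noise(image: Image.Image, sigma: float = 25.0) -> Image.Image:
    arr = np.asarray(image).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

# Usage (path is a placeholder):
# corrupted = gaussian_noise(Image.open("example.jpg"))
# corrupted.save("example_noisy.jpg")
```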

Hallucination

Hallucination evaluates how an MLLM avoids mentioning visual objects that do not exist in the images. ChEF uses POPE (src/config/ChEF/desiderata_recipes/Hallucination) for hallucination evaluation.

python tools/ChEF/eval_hallucination.py --model_cfg model_cfg --recipe_cfg recipe_cfg
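POPE poses yes/no questions about whether specific objects appear in an image, and its standard metrics are accuracy, precision, recall, F1, and the ratio of "yes" answers. The self-contained sketch below computes them; answer parsing and other details may differ from ChEF's implementation.

```python
# POPE-style scoring sketch: yes/no object-existence questions, scored with
# accuracy / precision / recall / F1 plus the "yes" ratio (a bias indicator).
# Details such as answer parsing may differ from ChEF's implementation.
def pope_scores(predictions, labels):
    # predictions and labels are lists of "yes"/"no" strings.
    tp = sum(p == "yes" and gt == "yes" for p, gt in zip(predictions, labels))
    fp = sum(p == "yes" and gt == "no" for p, gt in zip(predictions, labels))
    fn = sum(p == "no" and gt == "yes" for p, gt in zip(predictions, labels))
    tn = sum(p == "no" and gt == "no" for p, gt in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),
    }

print(pope_scores(["yes", "no", "yes", "no"], ["yes", "yes", "no", "no"]))
```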
