
Leaderboards

Visual performance of MLLMs on different Scenarios

For each Scenario, we run experiments with diverse Recipes and select the Recipe that behaves most reliably (i.e., is most stable under Instruction variations) as the default setting for evaluating the visual performance of all MLLMs.
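The selection rule can be illustrated with a minimal sketch (not the official ChEF implementation): for one Scenario, score each candidate Recipe under several Instruction variations and keep the Recipe whose scores vary the least. The `evaluate` callback is a hypothetical stand-in for running one (Recipe, Instruction) pair with a fixed model and Scenario.

```python
from statistics import pstdev
from typing import Callable, Sequence

def select_default_recipe(
    recipes: Sequence[str],
    instructions: Sequence[str],
    evaluate: Callable[[str, str], float],  # hypothetical: (recipe, instruction) -> score
) -> str:
    """Return the Recipe whose score is most stable across Instruction variations."""
    stability = {}
    for recipe in recipes:
        scores = [evaluate(recipe, instruction) for instruction in instructions]
        stability[recipe] = pstdev(scores)  # lower spread = more reliable Recipe
    return min(stability, key=stability.get)
```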

| Model \ Scenario | CIFAR | Flickr | VOC | Omni | FSC | SQA | MM | SEED | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA | 89.40 | 80.80 | 26.01 | 26.62 | 24.11 | 46.55 | 43.13 | 46.45 | 50.17 |
| LAMM | 80.70 | 72.50 | 29.58 | 22.54 | 19.33 | 52.75 | 44.47 | 47.03 | 55.82 |
| MiniGPT-4 | 80.80 | 71.50 | 26.51 | 30.60 | 22.52 | 47.0 | 54.34 | 46.48 | 57.12 |
| mPLUG-owl | 79.67 | 79.20 | 28.50 | 30.70 | 20.92 | 48.44 | 49.57 | 42.81 | 71.59 |
| Otter | 81.34 | 71.30 | 27.15 | 26.41 | 20.00 | 50.22 | 53.91 | 36.40 | 63.78 |
| LLaMA-Adapter2 | 70.17 | 79.50 | 31.60 | 32.00 | 21.26 | 54.34 | 57.06 | 35.41 | 69.90 |
| InstructBLIP | 84.27 | 79.40 | 27.65 | 30.75 | 25.04 | 55.18 | 65.73 | 50.81 | 72.0 |
| Shikra | 68.71 | 94.70 | 55.23 | 22.89 | 22.43 | 45.21 | 63.26 | 49.79 | 70.28 |
| Kosmos-2 | 88.87 | 85.70 | 54.55 | 21.34 | 21.93 | 34.60 | 32.82 | 46.38 | 52.95 |
| Random Choice | 10.0 | 25.00 | 25.00 | 10.94 | 20.00 | 35.80 | 27.57 | 24.27 | 50.00 |

*CIFAR denotes CIFAR-10, Flickr denotes Flickr30k, VOC denotes VOC2012, Omni denotes Omnibenchmark, FSC denotes FSC147, SQA denotes ScienceQA, MM denotes MMBench, and SEED denotes SEED-Bench.

Results of Desiderata

We employ specialized Recipes to assess the six dimensions of desiderata. All dimensions except language performance and hallucination are evaluated on MMBench and ScienceQA. Language performance is evaluated on 250 samples randomly drawn from ScienceQA and MMBench. Following POPE, hallucination is specifically assessed on the MSCOCO dataset.
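For context, a POPE-style hallucination check asks the model yes/no questions about object presence in MSCOCO images and compares the answers against ground-truth annotations. The sketch below is illustrative only; the question wording, sample fields, and `ask_model` helper are assumptions, not the exact ChEF Recipe.

```python
def pope_accuracy(samples, ask_model):
    """samples: list of dicts with 'image', 'object', and 'present' (bool) keys.
    ask_model: hypothetical callback (image, question) -> model answer string."""
    correct = 0
    for sample in samples:
        question = f"Is there a {sample['object']} in the image? Answer yes or no."
        answer = ask_model(sample["image"], question).strip().lower()
        predicted_present = answer.startswith("yes")
        correct += int(predicted_present == sample["present"])
    return 100.0 * correct / len(samples)  # accuracy in percent
```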

| Model \ Desiderata | Calibration | ICL | Ins. Follow. | Lang. Perf. | Hallucination | Robustness |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA | 90.11 | 5.15 | 44.23 | 84.82 | 50.51 | 63.36 |
| LAMM | 76.36 | 40.17 | 40.01 | 79.08 | 57.42 | 57.98 |
| MiniGPT-4 | 84.73 | 36.85 | 43.73 | 76.00 | 71.30 | 60.40 |
| mPLUG-owl | 84.15 | 33.45 | 36.73 | 88.44 | 52.26 | 51.05 |
| Otter | 82.80 | 48.31 | 38.40 | 74.05 | 54.54 | 57.16 |
| LLaMA-Adapter2 | 89.61 | 36.52 | 38.76 | 90.85 | 63.83 | 65.37 |
| InstructBLIP | 91.25 | 46.14 | 44.59 | 80.01 | 84.81 | 72.85 |
| Shikra | 88.35 | 30.21 | 36.21 | 66.67 | 83.78 | 47.91 |
| Kosmos-2 | 89.19 | 10.72 | 17.62 | 45.86 | 50.50 | 22.69 |

*ICL denotes In-context learning, Ins. Follow. denotes Instruction Following, and Lang. Perf. denotes Language Performance.

Evaluation on GPT-4V and Bard

We evaluate GPT-4V(ision) and Bard on the MMBench and ScienceQA scenarios, as well as on the desiderata of in-context learning, instruction following, hallucination, and robustness. We extract 30 samples each from ScienceQA and MMBench for the scenario evaluations and for each desideratum evaluation. We compare these two API-only models with three open-source MLLMs (LLaVA, Otter, and mPLUG-Owl) on the same data samples.
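A minimal sketch of this subset construction, assuming each benchmark can be loaded as a list of samples: draw a fixed, reproducible set of 30 samples per benchmark so that the API-only and open-source models are compared on identical data. The loader names and seed below are hypothetical.

```python
import random

def draw_subset(dataset, k=30, seed=0):
    """Return k samples chosen reproducibly from `dataset` (a sequence)."""
    rng = random.Random(seed)
    return rng.sample(list(dataset), k)

# scienceqa_30 = draw_subset(load_scienceqa(), k=30)  # hypothetical loader
# mmbench_30   = draw_subset(load_mmbench(), k=30)    # hypothetical loader
```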

| MLLM | ScienceQA | MMBench | ICL | Ins. Follow. | Robustness | Hallucination |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4V | 96.67 | 93.80 | 43.98 | 97.69 | 82.16 | 96.00 |
| Bard | 90.00 | 71.43 | 39.61 | 71.41 | 71.05 | 88.88 |
| LLaVA | 50.00 | 43.33 | 47.99 | 36.67 | 34.18 | 36.67 |
| Otter | 63.33 | 50.00 | 47.91 | 44.44 | 37.35 | 80.00 |
| mPLUG-Owl | 53.33 | 46.67 | 42.14 | 41.67 | 63.46 | 36.67 |

*ICL denotes In-context learning, and Ins. Follow. denotes Instruction Following.

Quantitative results of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

We evaluate GPT-4, Gemini, and six open-source LLMs and MLLMs on three properties (i.e. generalizability, trustworthiness, and causality) through four modalities (i.e. text, code, image, and video) to assess the reliability of MLLMs in supporting various downstream applications. Please refer to From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities for more details.

Text

| Models \ Properties | Generalization Capability | Trustworthiness | Causality |
| --- | --- | --- | --- |
| Gemini Pro | 59.09 | 14.29 | 31.11 |
| GPT-4 | 83.33 | 80.95 | 82.22 |
| Mixtral | 33.33 | 54.76 | 44.44 |
| Llama-2 | 29.55 | 95.24 | 37.78 |

Code

| Models \ Properties | Generalization Capability | Trustworthiness | Causality |
| --- | --- | --- | --- |
| Gemini Pro | 56.86 | 38.88 | 75.00 |
| GPT-4 | 88.24 | 58.33 | 91.67 |
| Mixtral | 33.33 | 50 | 75.00 |
| Llama-2 | 21.57 | 61.11 | 58.33 |

Image

| Models \ Properties | Generalization Capability | Trustworthiness | Causality |
| --- | --- | --- | --- |
| Gemini Pro | 87.71 | 73.33 | 56.25 |
| GPT-4 | 94.52 | 93.33 | 81.25 |
| LLaVA | 66.86 | 80.61 | 50 |
| LAMM | 70.57 | 81.82 | 43.75 |
| Qwen-VL | 67.25 | 81.21 | 46.88 |

Video

| Models \ Properties | Generalization Capability | Trustworthiness | Causality |
| --- | --- | --- | --- |
| Gemini Pro | 66.67 | 53 | 44.33 |
| GPT-4 | 52.08 | 100 | 50 |
| LLaVA | 62.50 | 58 | 44.33 |
| VideoChat | 78.13 | 53 | 50 |