Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, for Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, alignment with human values remains largely unexplored, given the complexity of defining hhh dimensions in the visual world and the difficulty of collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy supporting assessment across various scenarios and from different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field.
We integrate the hhh criteria for assessing alignment with human values and propose a three-level hierarchy of dimensions. These dimensions focus on effectiveness in addressing queries and visual content (helpful), transparency about confidence and limitations within visual scenarios (honest), and the avoidance of offensive or discriminatory outputs in the visual world (harmless). This taxonomy forms the basis of our comprehensive evaluation, offering a structured methodology to assess MLLMs' alignment with essential human-centric characteristics.
The taxonomy, built around the hhh criteria, systematically outlines 4/3/5 domains and 22/7/17 tasks for the helpful/honest/harmless dimensions, respectively. Details of the domains and tasks are illustrated below.
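To make the hierarchy concrete, the taxonomy can be pictured as a nested mapping from each h-dimension to its domains and their tasks, as in the rough Python sketch below; the domain and task names are placeholders, and only the per-dimension counts (4/3/5 domains, 22/7/17 tasks) come from the taxonomy itself.

# Hypothetical sketch of the three-level taxonomy: h-dimension -> domains -> tasks.
# Domain/task names are placeholders; only the per-dimension counts are from the taxonomy.
taxonomy = {
    "helpful":  {"domain_1": ["task_1", "task_2"], "...": []},   # 4 domains, 22 tasks overall
    "honest":   {"domain_1": ["task_1"], "...": []},             # 3 domains, 7 tasks overall
    "harmless": {"domain_1": ["task_1"], "...": []},             # 5 domains, 17 tasks overall
}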
Based on the defined taxonomy, the Ch3Ef dataset is meticulously crafted to closely emulate real-world scenarios. We establish several principles to faithfully replicate conversations between humans and MLLMs, incorporating Human-Machine Synergy by utilizing responses from several prominent MLLMs during the data creation process.
To ensure the dataset closely aligns with real-world application scenarios, we adhere to several principles. First, we strive for diversity in images, encompassing both single and multiple images, with variations in visual content. Images are sourced from a wide range of scenarios, covering a vast array of application contexts. Second, the formulation of questions and answers aims to mirror human behavior and preferences as closely as possible while remaining consistent with the actual potential outputs of MLLMs. For harmlessness, unlike prior works that evaluate with special images or prompts diverging from real-world usage scenarios, the Ch3Ef dataset ensures that images and questions closely resemble practical applications, fostering a more authentic representation.
Data samples from the Ch3Ef dataset are illustrated below. Each sample comprises one or more images, accompanied by a meticulously human-annotated question and several options. The correct option is indicated in bold.
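For concreteness, a single sample could be represented along the following lines; the field names here are illustrative assumptions rather than the dataset's actual schema, but they reflect the components described above (images, a human-annotated question, candidate options, and the correct option).

# Hypothetical record layout for one Ch3Ef-style sample; field names are assumptions.
sample = {
    "images": ["image_1.jpg", "image_2.jpg"],          # one or more images
    "question": "Which option best describes the scene?",
    "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
    "answer": "B",                                      # the correct option
    "dimension": "helpful",                             # helpful / honest / harmless
    "domain": "placeholder_domain",
    "task": "placeholder_task",
}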
The Ch3Ef evaluation strategy comprises three compatible modules, i.e., Instruction, Inferencer, and Metric, enabling different Recipes (specific selections of each module) to facilitate evaluations from different perspectives across various scenarios spanning the A1-A3 spectrum. The right side shows different Recipes for evaluating different dimensions, including location (Locat.), QA performance (QAPerf.), in-context learning performance (ICLPerf.), calibration (Calib.), and alignment with human values (Human-value).
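A minimal sketch of how such a Recipe could be composed in code is given below, assuming simple callable interfaces for the three modules; the names and signatures are illustrative, not the actual Ch3Ef API.

# Minimal sketch, assuming callable interfaces; not the actual Ch3Ef implementation.
from dataclasses import dataclass
from typing import Callable, Sequence

Instruction = Callable[[dict], str]                        # sample -> prompt text
Inferencer = Callable[[str, Sequence[str]], str]           # (prompt, image paths) -> model output
Metric = Callable[[Sequence[str], Sequence[str]], float]   # (predictions, references) -> score

@dataclass
class Recipe:
    """A specific selection of the three modules, e.g. for QA performance or calibration."""
    instruction: Instruction
    inferencer: Inferencer
    metric: Metric

    def evaluate(self, samples: Sequence[dict]) -> float:
        predictions, references = [], []
        for sample in samples:
            prompt = self.instruction(sample)
            predictions.append(self.inferencer(prompt, sample["images"]))
            references.append(sample["answer"])
        return self.metric(predictions, references)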
The main results can be found on the Ch3Ef leaderboard.
(a) Pearson correlation matrix within A3 (CDU: cross-domain understanding; MRC: machine reading comprehension). (b) Pearson correlation matrix across A1-A3. Cooler colors indicate higher correlations.
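Each cell in these matrices is simply the Pearson correlation between two score vectors, e.g. per-model results on two tasks or levels; a minimal NumPy sketch with made-up numbers (not results from the paper) is shown below.

import numpy as np

# Rows = models, columns = tasks or levels; the values are made up for illustration only.
scores = np.array([
    [0.72, 0.65, 0.58],
    [0.81, 0.70, 0.66],
    [0.60, 0.55, 0.49],
    [0.77, 0.69, 0.71],
])

# Pearson correlation matrix across columns, as visualized in the figure.
correlations = np.corrcoef(scores, rowvar=False)
print(np.round(correlations, 2))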
The left shows experimental results on MMBench with ICE as the Instruction under different retriever settings. The retriever methodologies employed encompass Random, Fixed, Top-k Text, and Top-k Image. The right shows the results for Honest and Calibration. The calibration score is calculated as (1-ECE)×100%.
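For reference, the Expected Calibration Error (ECE) behind that score is usually computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size; the sketch below assumes the standard equal-width binning, which may differ from the paper's exact setup.

import numpy as np

def calibration_score(confidences, correct, n_bins=10):
    """Return (1 - ECE) * 100, with ECE computed over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return (1.0 - ece) * 100.0

# Toy usage with made-up confidences and correctness labels.
print(calibration_score([0.9, 0.8, 0.6, 0.3], [1, 1, 1, 0]))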
We illustrate the accuracy of each MLLM on each task within the Ch3Ef dataset.
@misc{shi2024assessment,
      title={Assessment of Multimodal Large Language Models in Alignment with Human Values},
      author={Zhelun Shi and Zhipin Wang and Hongxing Fan and Zaibin Zhang and Lijun Li and Yongting Zhang and Zhenfei Yin and Lu Sheng and Yu Qiao and Jing Shao},
      year={2024},
      eprint={2403.17830},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}