Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, for Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, alignment with human values remains largely unexplored, given the complexity of defining hhh dimensions in the visual world and the difficulty of collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy supporting assessment across various scenarios and from different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field.
We integrate the hhh criteria for assessing alignment with human values and propose a three-level hierarchy of dimensions. These dimensions focus on effectiveness in addressing queries and visual content (helpful), transparency about confidence and limitations within visual scenarios (honest), and the avoidance of offensive or discriminatory outputs in the visual world (harmless). This taxonomy forms the basis of our comprehensive evaluation, offering a structured methodology to assess MLLMs' alignment with essential human-centric characteristics.
The taxonomy, built around the hhh criteria, systematically outlines 4/3/5 domains and 22/7/17 tasks for the helpful/honest/harmless dimensions, respectively. Details of the domains and tasks are illustrated below.
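To make the hierarchy concrete, the taxonomy can be pictured as a nested mapping from each h-dimension to its domains and their tasks, as in the rough Python sketch below; the domain and task names are placeholders, and only the per-dimension counts (4/3/5 domains, 22/7/17 tasks) come from the taxonomy itself.

# Hypothetical sketch of the three-level taxonomy: h-dimension -> domains -> tasks.
# Domain/task names are placeholders; only the per-dimension counts are from the taxonomy.
taxonomy = {
    "helpful":  {"domain_1": ["task_1", "task_2"], "...": []},   # 4 domains, 22 tasks overall
    "honest":   {"domain_1": ["task_1"], "...": []},             # 3 domains, 7 tasks overall
    "harmless": {"domain_1": ["task_1"], "...": []},             # 5 domains, 17 tasks overall
}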
Based on the defined taxonomy, the Ch3Ef dataset is meticulously crafted to closely emulate real-world scenarios. We establish several principles to faithfully replicate conversations between humans and MLLMs, incorporating Human-Machine Synergy by utilizing responses from several prominent MLLMs during the data creation process.
To ensure the dataset closely aligns with real-world application scenarios, we adhere to several principles. First, we strive for diversity in images, encompassing both single and multiple images, with variations in visual content. Images are sourced from a wide range of scenarios, covering a vast array of application contexts. Second, the formulation of questions and answers aims to mirror human behavior and preferences as closely as possible while remaining consistent with the actual potential outputs of MLLMs. For harmlessness, unlike prior works that evaluate with special images or prompts diverging from real-world usage scenarios, the Ch3Ef dataset ensures that images and questions closely resemble practical applications, fostering a more authentic representation.
Data samples from the Ch3Ef dataset are illustrated below. Each sample comprises one or more images, accompanied by a meticulously human-annotated question and several options. The correct option is indicated in bold.
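For concreteness, a single sample could be represented along the following lines; the field names here are illustrative assumptions rather than the dataset's actual schema, but they reflect the components described above (images, a human-annotated question, candidate options, and the correct option).

# Hypothetical record layout for one Ch3Ef-style sample; field names are assumptions.
sample = {
    "images": ["image_1.jpg", "image_2.jpg"],          # one or more images
    "question": "Which option best describes the scene?",
    "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
    "answer": "B",                                      # the correct option
    "dimension": "helpful",                             # helpful / honest / harmless
    "domain": "placeholder_domain",
    "task": "placeholder_task",
}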
The Ch3Ef evaluation strategy comprises three compatible modules, i.e., Instruction, Inferencer, and Metric, enabling different Recipes (specific selections of each module) to facilitate evaluations from different perspectives across various scenarios spanning the A1-A3 spectrum. The right side shows different Recipes for evaluating different dimensions, including location (Locat.), QA performance (QAPerf.), in-context learning performance (ICLPerf.), calibration (Calib.), and alignment with human values (Human-value).
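A minimal sketch of how such a Recipe could be composed in code is given below, assuming simple callable interfaces for the three modules; the names and signatures are illustrative, not the actual Ch3Ef API.

# Minimal sketch, assuming callable interfaces; not the actual Ch3Ef implementation.
from dataclasses import dataclass
from typing import Callable, Sequence

Instruction = Callable[[dict], str]                        # sample -> prompt text
Inferencer = Callable[[str, Sequence[str]], str]           # (prompt, image paths) -> model output
Metric = Callable[[Sequence[str], Sequence[str]], float]   # (predictions, references) -> score

@dataclass
class Recipe:
    """A specific selection of the three modules, e.g. for QA performance or calibration."""
    instruction: Instruction
    inferencer: Inferencer
    metric: Metric

    def evaluate(self, samples: Sequence[dict]) -> float:
        predictions, references = [], []
        for sample in samples:
            prompt = self.instruction(sample)
            predictions.append(self.inferencer(prompt, sample["images"]))
            references.append(sample["answer"])
        return self.metric(predictions, references)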
The main results can be found on the Ch3Ef leaderboard.
(a) Pearson correlation matrix within A3 (CDU: cross-domain understanding; MRC: machine reading comprehension). (b) Pearson correlation matrix across A1-A3. Cooler colors indicate higher correlations.
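Each cell in these matrices is simply the Pearson correlation between two score vectors, e.g. per-model results on two tasks or levels; a minimal NumPy sketch with made-up numbers (not results from the paper) is shown below.

import numpy as np

# Rows = models, columns = tasks or levels; the values are made up for illustration only.
scores = np.array([
    [0.72, 0.65, 0.58],
    [0.81, 0.70, 0.66],
    [0.60, 0.55, 0.49],
    [0.77, 0.69, 0.71],
])

# Pearson correlation matrix across columns, as visualized in the figure.
correlations = np.corrcoef(scores, rowvar=False)
print(np.round(correlations, 2))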
The left shows experimental results on MMBench with ICE as the Instruction under different retriever settings. The retriever methodologies employed encompass Random, Fixed, Top-k Text, and Top-k Image. The right shows the results for Honest and Calibration. The calibration score is calculated as (1-ECE)×100%.
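For reference, the Expected Calibration Error (ECE) behind that score is usually computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size; the sketch below assumes the standard equal-width binning, which may differ from the paper's exact setup.

import numpy as np

def calibration_score(confidences, correct, n_bins=10):
    """Return (1 - ECE) * 100, with ECE computed over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return (1.0 - ece) * 100.0

# Toy usage with made-up confidences and correctness labels.
print(calibration_score([0.9, 0.8, 0.6, 0.3], [1, 1, 1, 0]))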
We illustrate the accuracy of each MLLM on each task within the Ch3Ef dataset.
@misc{shi2024assessment,
      title={Assessment of Multimodal Large Language Models in Alignment with Human Values},
      author={Zhelun Shi and Zhipin Wang and Hongxing Fan and Zaibin Zhang and Lijun Li and Yongting Zhang and Zhenfei Yin and Lu Sheng and Yu Qiao and Jing Shao},
      year={2024},
      eprint={2403.17830},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}