II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang

TL;DR
II-Bench is a new benchmark designed to evaluate the higher-order perceptual understanding of images by multimodal large language models, revealing significant gaps compared to human performance and highlighting areas for improvement.
Contribution
The paper introduces II-Bench, a comprehensive benchmark for assessing higher-order image understanding in MLLMs, addressing a gap in evaluating complex perceptual capabilities.
Findings
The best-performing MLLM reaches 74.8% accuracy, while humans average 90% and peak at 98%.
MLLMs struggle with abstract and complex images.
Incorporating sentiment hints improves model accuracy.
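The gap in the first finding can be made concrete with a small sketch. It assumes answers are scored by exact match against a reference label (the summary does not state the scoring protocol), and the names `accuracy` and `gap_in_points` are hypothetical helpers, not part of any II-Bench release.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def gap_in_points(model_acc, human_acc):
    """Percentage-point gap between human and model accuracy, rounded to one decimal."""
    return round(human_acc - model_acc, 1)


# Figures taken from the summary: best MLLM 74.8%, human average 90%, human best 98%.
print(gap_in_points(74.8, 90.0))  # 15.2
print(gap_in_points(74.8, 98.0))  # 23.2
```

Under these figures, the best model trails the human average by 15.2 percentage points and the best human score by 23.2.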
Abstract
The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
