II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Ziqiang Liu; Feiteng Fang; Xi Feng; Xinrun Du; Chenhao Zhang; Zekun Wang; Yuelin Bai; Qixuan Zhao; Liyang Fan; Chengguang Gan; Hongquan Lin; Jiaming Li; Yuansheng Ni; Haihong Wu; Yaswanth Narsupalli; Zhigang Zheng; Chengming Li; Xiping Hu; Ruifeng Xu; Xiaojun Chen; Min Yang; Jiaheng Liu; Ruibo Liu; Wenhao Huang; Ge Zhang; Shiwen Ni

arXiv:2406.05862·cs.CL·January 14, 2025

TL;DR

II-Bench is a new benchmark designed to evaluate the higher-order perceptual understanding of images by multimodal large language models, revealing significant gaps compared to human performance and highlighting areas for improvement.

Contribution

The paper introduces II-Bench, a comprehensive benchmark for assessing higher-order image understanding in MLLMs, addressing a gap in evaluating complex perceptual capabilities.

Findings

01 The best MLLM reaches 74.8% accuracy, while human accuracy averages 90% and peaks at 98%.

02 MLLMs struggle with abstract and complex images.

03 Incorporating sentiment hints improves model accuracy.
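
The gap in the first finding can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each benchmark item is scored as simply right or wrong by exact match, and the function names `accuracy` and `gap_in_points` are hypothetical, not part of any II-Bench tooling. Only the three percentages come from the paper summary.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def gap_in_points(model_acc: float, human_acc: float) -> float:
    """Human-minus-model difference, in percentage points."""
    return human_acc - model_acc


# Figures reported in the summary above.
BEST_MLLM_ACC = 74.8
HUMAN_AVG_ACC = 90.0
HUMAN_PEAK_ACC = 98.0

avg_gap = gap_in_points(BEST_MLLM_ACC, HUMAN_AVG_ACC)
peak_gap = gap_in_points(BEST_MLLM_ACC, HUMAN_PEAK_ACC)
print(f"gap to average human: {avg_gap:.1f} points")
print(f"gap to best human: {peak_gap:.1f} points")

# Toy usage of accuracy() on made-up labels (not II-Bench data).
toy_acc = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
```

Under these assumptions the best model trails the average human by about 15 percentage points and the best human by about 23.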

Abstract

The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and…

Code & Models

Datasets

m-a-p/II-Bench
dataset · 152 downloads

Videos

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems