InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models
Xiaofei Yin, Yijie Hong, Ya Guo, Yi Tu, Weiqiang Wang, Gongshen Liu, and Huijia Zhu

TL;DR
This paper introduces InsightVision, a new Chinese-based benchmark for evaluating the understanding of implicit visual semantics in large vision language models, revealing current models' limitations compared to humans.
Contribution
It presents a comprehensive, multi-level benchmark for implicit visual semantics in Chinese, along with a semi-automatic dataset construction method and evaluation of 15 LVLMs and GPT-4o.
Findings
Models lag nearly 14% behind human performance in implicit understanding.
Current LVLMs struggle with nuanced visual semantics.
The benchmark covers four levels of implicit meaning comprehension.
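As a rough illustration of how a per-level comparison against human performance could be tabulated, here is a minimal Python sketch. It is not the authors' evaluation code. The four level names are taken from the abstract; the record format (level, predicted answer, gold answer), the exact-match scoring, the function names, and the toy data are all hypothetical assumptions made for illustration.

```python
from collections import defaultdict

# The four subtask names come from the abstract; everything else here is hypothetical.
LEVELS = (
    "surface-level content understanding",
    "symbolic meaning interpretation",
    "background knowledge comprehension",
    "implicit meaning comprehension",
)


def accuracy_by_level(records):
    """Return {level: accuracy} for an iterable of (level, predicted, gold) tuples.

    Assumes answers can be scored by exact match (e.g. multiple-choice labels).
    Levels with no records are omitted from the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, gold in records:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        total[level] += 1
        correct[level] += int(predicted == gold)
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


def gap_to_human(model_acc, human_acc):
    """Percentage-point gap (human minus model) for levels present in both dicts."""
    return {
        lvl: 100.0 * (human_acc[lvl] - model_acc[lvl])
        for lvl in model_acc
        if lvl in human_acc
    }


# Toy, made-up answers purely to exercise the functions; not benchmark data.
toy_records = [
    (LEVELS[0], "A", "A"),
    (LEVELS[0], "B", "B"),
    (LEVELS[3], "C", "D"),
    (LEVELS[3], "A", "A"),
]
model_acc = accuracy_by_level(toy_records)
toy_gap = gap_to_human(model_acc, {LEVELS[3]: 0.75})
```

Reporting the gap per level, rather than as one aggregate number, makes it easier to see whether a model falls short on surface content or only on the deeper levels.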
Abstract
In the evolving landscape of multimodal language models, understanding the nuanced meanings conveyed through visual cues - such as satire, insult, or critique - remains a significant challenge. Existing evaluation benchmarks primarily focus on direct tasks like image captioning or are limited to a narrow set of categories, such as humor or satire, for deep semantic understanding. To address this gap, we introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark designed specifically for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. We propose an innovative semi-automatic method for constructing datasets, adhering to established construction…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Focus
