InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models

Xiaofei Yin; Yijie Hong; Ya Guo; Yi Tu; Weiqiang Wang; Gongshen Liu; Huijia Zhu

arXiv:2502.15812 · cs.LG · February 25, 2025

TL;DR

This paper introduces InsightVision, a new Chinese-based benchmark for evaluating the understanding of implicit visual semantics in large vision language models, revealing current models' limitations compared to humans.

Contribution

It presents a comprehensive, multi-level benchmark for implicit visual semantics in Chinese, along with a semi-automatic dataset construction method and evaluation of 15 LVLMs and GPT-4o.

Findings

01. Models lag nearly 14% behind human performance in implicit understanding.

02. Current LVLMs struggle with nuanced visual semantics.

03. The benchmark covers four levels of implicit meaning comprehension.

Abstract

In the evolving landscape of multimodal language models, understanding the nuanced meanings conveyed through visual cues - such as satire, insult, or critique - remains a significant challenge. Existing evaluation benchmarks primarily focus on direct tasks like image captioning or are limited to a narrow set of categories, such as humor or satire, for deep semantic understanding. To address this gap, we introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark designed specifically for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. We propose an innovative semi-automatic method for constructing datasets, adhering to established construction…

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Focus