More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage
Wei He

TL;DR
This paper introduces DIVA, a benchmark and metric for evaluating how vision-language models handle abstract idiomatic meanings versus literal interpretations. It reveals a bias toward literal grounding, especially at high visual fidelity.
Contribution
The paper proposes DIVA and the Semantic Alignment Gap as tools to measure and analyze the semiotic gap in VLMs, highlighting the impact of visual fidelity on symbolic understanding.
Findings
Models show a literal superiority bias regardless of scale.
Increased visual fidelity weakens symbolic alignment.
Iconographic abstraction improves compositional understanding.
Abstract
Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings. We further propose the Semantic Alignment Gap, an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding. We additionally introduce a directional signed bias to separately measure the direction and strength of literal preference. Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and…
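The abstract does not state the formula for the Semantic Alignment Gap or the signed bias, so the following is only an illustrative sketch and not the paper's definition. It assumes, hypothetically, that each noun compound comes with two alignment scores (one against its literal visualization, one against its idiomatic visualization), and shows one simple way a magnitude-style gap can be kept separate from a signed, directional bias. The function name `alignment_gap` and its inputs are invented for this example.

```python
from statistics import mean


def alignment_gap(literal_scores, idiomatic_scores):
    """Return (gap, signed_bias) for paired alignment scores.

    ASSUMPTION: this is a hypothetical stand-in, not the metric from the paper.
    - gap: mean absolute difference between literal and idiomatic scores,
      i.e. how far apart the two groundings are, ignoring direction.
    - signed_bias: mean signed difference (literal minus idiomatic);
      positive values indicate a preference for the literal reading.
    """
    if not literal_scores or len(literal_scores) != len(idiomatic_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    # Per-compound difference: literal score minus idiomatic score.
    diffs = [lit - idi for lit, idi in zip(literal_scores, idiomatic_scores)]

    gap = mean(abs(d) for d in diffs)
    signed_bias = mean(diffs)
    return gap, signed_bias


if __name__ == "__main__":
    # Made-up scores for two compounds, for illustration only.
    gap, bias = alignment_gap([0.5, 0.75], [0.25, 1.0])
    print(f"gap={gap}, signed_bias={bias}")
```

Under these assumptions the two quantities can disagree: a model may show a large gap on individual compounds while the signed bias averages out near zero, which is why reporting direction separately from magnitude can be informative.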
