Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger

TL;DR
This paper investigates how vision-language models process visual and linguistic information, revealing that they encode visual evidence strongly but often fail to base answers on it, and explores causal interventions to improve grounding.
Contribution
It introduces a detailed analysis of visual-linguistic arbitration in VLMs, identifying encoding-grounding dissociation and demonstrating effective activation interventions.
Findings
Visual attributes are linearly decodable from early layers (AUC > 0.86).
The final-layer logit gap, not encoding strength, predicts grounding success (correlation of 0.847).
Activation patching and intervention can significantly improve visual grounding.
Abstract
When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem one of perception or of arbitration? We explore this question in ten VLMs of various sizes and reveal an Encoding-Grounding Dissociation: models that fail to report what they see (and thus give a wrong answer) still encode the visual evidence as strongly as models that give the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes are linearly decodable from early layers (AUC > 0.86), and that decoding accuracy remains nearly identical for successful and failed samples. However, the final-layer logit gap, not the strength of encoding, better predicts grounding outcomes, with a correlation of 0.847. After having…
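The layer-by-layer Logit Lens probing mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy hidden states, and the definition of "crossover" used here (the first layer from which the visually grounded token stays ahead of the prior token through the final layer) are assumptions made for illustration. A real Logit Lens would also typically apply the model's final normalization before unembedding, which is omitted here.

```python
import numpy as np

def logit_gap_by_layer(hidden_states, unembed, visual_id, prior_id):
    """Project each layer's hidden state through the unembedding matrix
    and return logit(visual token) - logit(prior token) per layer.

    hidden_states: (num_layers, d_model) array, one vector per layer
    unembed:       (vocab_size, d_model) array
    Assumption: no final layer norm is applied (a simplification).
    """
    logits = hidden_states @ unembed.T  # shape: (num_layers, vocab_size)
    return logits[:, visual_id] - logits[:, prior_id]

def crossover_layer(gaps):
    """Hypothetical crossover definition: the first layer from which the
    gap stays positive through the final layer, or None if the prior
    token still wins at the final layer."""
    positive = gaps > 0
    if not positive[-1]:
        return None
    non_positive = np.flatnonzero(~positive)
    return 0 if non_positive.size == 0 else int(non_positive[-1]) + 1

# Toy example: 4 layers, d_model = 2, vocabulary of 3 tokens.
# Token 0 plays the role of "blue" (visual evidence),
# token 1 plays the role of "yellow" (language prior).
unembed = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])
hidden_states = np.array([[0.0, 1.0],
                          [0.5, 1.0],
                          [1.0, 0.5],
                          [2.0, 0.0]])

gaps = logit_gap_by_layer(hidden_states, unembed, visual_id=0, prior_id=1)
layer = crossover_layer(gaps)
final_gap = gaps[-1]  # the final-layer logit gap
```

In this toy setup the prior token leads in the first two layers and the visual token takes over from the third layer onward, so the final-layer gap is positive. A failed sample in this framing would be one whose final-layer gap is negative even though the visual attribute remains decodable from earlier layers.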
