When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
Francesco Ortu, Zhijing Jin, Diego Doimo, Alberto Cazzaniga

TL;DR
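This paper investigates how vision-language models (VLMs) resolve conflicts between their internal knowledge and the visual input. It introduces a dataset of counterfactual queries and identifies a small set of attention heads that mediate the conflict and can be intervened on to steer predictions.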
Contribution
The paper introduces WHOOPS-AHA!, a dataset of multimodal counterfactual queries that deliberately contradict commonsense knowledge, and identifies a small set of attention heads that mediate knowledge conflicts, enabling targeted interventions.
Findings
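A small set of attention heads, identified through logit inspection, mediates the conflict between internal knowledge and visual input.
Attention patterns on these heads locate the image regions that drive visual overrides, giving more precise attribution than gradient-based methods.
Intervening on these heads can steer the model towards either its internal parametric knowledge or the visual information.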
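To make the attention-based localization finding concrete, here is a minimal sketch of how attention weights from a few selected heads could be pooled into a patch-level heatmap. It is an illustration only, not the paper's code: the function name `patch_heatmap`, the array layout, and the assumption that image tokens occupy one contiguous span arranged as a row-major grid are all hypothetical.

```python
import numpy as np

def patch_heatmap(attn, head_ids, image_span, grid_shape):
    """Pool attention over image-patch tokens into a 2D heatmap.

    attn:       (n_layers, n_heads, seq_len) attention weights taken from the
                final query position (assumed layout, hypothetical).
    head_ids:   list of (layer, head) pairs to pool over.
    image_span: (start, stop) indices of the image tokens in the sequence.
    grid_shape: (rows, cols) of the patch grid, assumed row-major.
    """
    rows, cols = grid_shape
    start, stop = image_span
    if stop - start != rows * cols:
        raise ValueError("image span does not match the patch grid size")

    # Sum the attention that each selected head puts on every image patch.
    scores = np.zeros(rows * cols)
    for layer, head in head_ids:
        scores += attn[layer, head, start:stop]

    # Normalize so the map sums to 1 (left as zeros if nothing was attended).
    total = scores.sum()
    if total > 0:
        scores = scores / total
    return scores.reshape(rows, cols)
```

The cell with the highest value marks the patch that the selected heads attend to most; in a real model the `attn` array would come from the model's attention outputs rather than being built by hand.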
Abstract
Vision-language models (VLMs) increasingly combine visual and textual information to perform complex tasks. However, conflicts between their internal knowledge and external visual input can lead to hallucinations and unreliable predictions. In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by introducing WHOOPS-AHA!, a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. Through logit inspection, we identify a small set of attention heads that mediate this conflict. By intervening in these heads, we can steer the model towards its internal parametric knowledge or the visual information. Our results show that attention patterns on these heads effectively locate image regions that influence visual overrides, providing a more precise attribution compared to gradient-based methods.
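The logit inspection and head intervention described in the abstract can be sketched on toy arrays. The sketch below assumes that each head's contribution to the final residual stream is available as a vector and that projecting it through the unembedding matrix gives its direct effect on the logits; it ignores the final layer norm, and every name (`head_logit_diff`, `top_heads`, `scale_heads`, `W_U`) is hypothetical. The paper's actual procedure may differ.

```python
import numpy as np

def head_logit_diff(head_out, W_U, visual_id, factual_id):
    """Per-head direct contribution to logit(visual) - logit(factual).

    head_out: (n_layers, n_heads, d_model) head outputs at the last position.
    W_U:      (d_model, vocab_size) unembedding matrix.
    """
    direction = W_U[:, visual_id] - W_U[:, factual_id]
    return head_out @ direction  # shape (n_layers, n_heads)

def top_heads(scores, k):
    """Return the k (layer, head) pairs with the largest absolute score."""
    flat = np.argsort(-np.abs(scores), axis=None)[:k]
    return [tuple(int(i) for i in np.unravel_index(f, scores.shape)) for f in flat]

def scale_heads(head_out, heads, factor):
    """Return a copy of head_out with the chosen heads scaled by factor."""
    out = head_out.copy()
    for layer, head in heads:
        out[layer, head] *= factor
    return out
```

Under these assumptions, a positive score means the head pushes towards the token supported by the image, and a negative score means it pushes towards the token supported by internal knowledge. Scaling a head by a factor below 1 weakens its push and a factor above 1 strengthens it, which is one simple way to bias the prediction in either direction.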
