TL;DR
This study reveals that Vision Language Models (VLMs) struggle to recall factual knowledge from visual references compared to textual ones, and introduces internal state probes to predict and improve their reliability.
Contribution
The paper identifies a systematic deficiency in VLMs' ability to link visual references with factual knowledge and proposes internal state probes to detect and mitigate this failure.
Findings
VLMs' factual recall is halved when an entity is referenced by an image rather than by text.
Internal state probes can predict model failures with over 92% accuracy.
Probes improve visual question answering coverage and reduce error risk.
Abstract
Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an…
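The idea of probing internal states to decide when to trust an answer can be illustrated with a minimal sketch. This is not the paper's implementation: the excerpt does not say which layer, probe architecture, or threshold the authors use, so the logistic-regression probe, the synthetic 8-dimensional "hidden states", and the helper names (`make_synthetic_states`, `train_probe`, `coverage_and_risk`) are all assumptions made for illustration. The sketch trains a probe to predict failure, then answers only when the predicted failure probability is below a threshold, and reports coverage (fraction answered) and risk (error rate among answered).

```python
import math
import random


def make_synthetic_states(n, dim, rng):
    """Hypothetical stand-in for VLM hidden states: failures are shifted along dim 0."""
    states, failed = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        shift = 1.5 if y else -1.5
        states.append([rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)])
        failed.append(y)
    return states, failed


def predict_failure(w, b, x):
    """Probe output: estimated probability that the model's answer is wrong."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, failed, lr=0.5, epochs=100):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, failed):
            err = predict_failure(w, b, x) - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def coverage_and_risk(w, b, states, failed, threshold=0.5):
    """Answer only when predicted failure probability is below the threshold."""
    answered = [f for x, f in zip(states, failed) if predict_failure(w, b, x) < threshold]
    coverage = len(answered) / len(states)
    risk = sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


rng = random.Random(0)
train_x, train_y = make_synthetic_states(200, 8, rng)
test_x, test_y = make_synthetic_states(200, 8, rng)

w, b = train_probe(train_x, train_y)
base_rate = sum(test_y) / len(test_y)  # error rate if every question is answered
coverage, risk = coverage_and_risk(w, b, test_x, test_y)
print(f"base error rate={base_rate:.2f} coverage={coverage:.2f} risk={risk:.2f}")
```

Lowering `threshold` makes the system abstain more often, trading coverage for lower risk. With real models, the synthetic states would be replaced by activations extracted from the VLM and the labels by whether each answer was correct.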