Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

Logan Mann; Ajit Saravanan; Ishan Dave; Shikhar Shiromani; Saadullah Ismail; Yi Xia; Emily Huang

arXiv:2605.08200·cs.AI·May 12, 2026

TL;DR

This study challenges the common belief that sharp attention maps indicate trustworthy vision-language models, showing instead that reliability is better assessed through hidden-state geometry and late-layer circuits.

Contribution

The paper introduces the VLM Reliability Probe (VRP), a unified pipeline to analyze attention, hidden states, and causal circuits across multiple open-weight VLMs.

Findings

01. Attention maps are poor predictors of correctness.

02. Reliability signals emerge later in the model computation.

03. Different architectures distribute reliability differently across layers.

Abstract

A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline -- the VLM Reliability Probe (VRP) -- that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability…
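The correlations in result (i) relate a continuous attention statistic to a binary correctness label, which is the setting of a point-biserial correlation. The sketch below shows one way such a coefficient and an approximate 95% interval could be computed. It is an illustration under stated assumptions, not the paper's procedure: the function names `point_biserial` and `fisher_ci` are hypothetical, the data are synthetic, and the Fisher z-transform interval is an assumed choice, since the abstract does not say how its intervals were obtained.

```python
import math

import numpy as np


def point_biserial(score, correct):
    """Pearson correlation between a continuous score and a 0/1 label.

    With a binary second variable this is the point-biserial coefficient.
    """
    score = np.asarray(score, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.corrcoef(score, correct)[0, 1])


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% interval for a correlation via the Fisher z-transform.

    Assumption: this interval type is chosen for illustration only.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    # Synthetic stand-ins: a hypothetical attention-concentration score and
    # correctness labels drawn independently of it.
    rng = np.random.default_rng(0)
    n = 3090
    score = rng.random(n)
    correct = rng.integers(0, 2, size=n)

    r = point_biserial(score, correct)
    lo, hi = fisher_ci(r, n)
    print(f"r_pb = {r:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With n=3,090 the half-width of such an interval is roughly 0.035 around a near-zero coefficient, so an interval that straddles zero this tightly indicates that any linear association, if present, is very small.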
