VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding
Rongcan Pei, Huan Li, Fang Guo, Qi Zhu

TL;DR
This paper identifies specific attention heads in vision-language models that are crucial for locating visual cues in long-context reasoning tasks, and introduces VERA, a framework that enhances model performance by explicitly verbalizing visual evidence.
Contribution
The paper uncovers and leverages Visual Evidence Retrieval (VER) heads in VLMs, proposing VERA to improve long-context understanding without additional training.
Findings
VER heads are causal to model performance
VERA improves accuracy by over 20% on multiple benchmarks
Masking VER heads degrades model reasoning ability
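The masking finding can be illustrated with a toy head ablation. This is only a sketch under stated assumptions: the summary does not say how the authors mask heads, so zero-ablation is assumed here, and the names `mask_heads` and `combine_heads` are hypothetical, not taken from the paper.

```python
def mask_heads(head_outputs, masked_heads):
    """Return a copy of the per-head outputs with the selected heads zeroed.

    head_outputs: list of per-head output vectors (lists of floats).
    masked_heads: iterable of head indices to ablate.
    Assumption: zero-ablation; the paper may use a different masking scheme.
    """
    masked = set(masked_heads)
    return [
        [0.0] * len(vec) if idx in masked else list(vec)
        for idx, vec in enumerate(head_outputs)
    ]


def combine_heads(head_outputs):
    """Concatenate per-head vectors, as multi-head attention does
    before its output projection."""
    return [x for vec in head_outputs for x in vec]


if __name__ == "__main__":
    outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    ablated = mask_heads(outputs, masked_heads=[1])
    print(combine_heads(ablated))
```

In a causal test of this kind, one would compare task accuracy with and without the ablated heads; a large drop for the candidate heads, relative to ablating the same number of random heads, is the usual evidence that they matter.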
Abstract
While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive…
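The entropy trigger described in the abstract can be sketched as follows. This is a minimal illustration, assuming uncertainty is measured as the Shannon entropy of the next-token distribution and compared against a fixed threshold; the function names and the threshold value of 1.0 nats are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_verbalize(logits, threshold=1.0):
    """Return True when the model looks uncertain enough to trigger
    verbalization of visual evidence.

    Assumption: a fixed entropy threshold; the actual trigger rule
    in VERA may differ.
    """
    return token_entropy(logits) > threshold


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]   # one token dominates
    uncertain = [0.0, 0.0, 0.0, 0.0]    # uniform over four tokens
    print(should_verbalize(confident), should_verbalize(uncertain))
```

A uniform distribution over four tokens has entropy ln 4, which exceeds the assumed threshold, while a sharply peaked distribution falls well below it. In practice the threshold would have to be tuned per model and task.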
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
