VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

Rongcan Pei; Huan Li; Fang Guo; Qi Zhu

arXiv:2602.10146·cs.CV·February 12, 2026

TL;DR

This paper identifies specific attention heads in vision-language models that are crucial for locating visual cues in long-context reasoning tasks, and introduces VERA, a framework that enhances model performance by explicitly verbalizing visual evidence.

Contribution

The paper uncovers and leverages Visual Evidence Retrieval (VER) heads in VLMs, proposing VERA to improve long-context understanding without additional training.

Findings

01 VER heads are causal to model performance

02 VERA improves accuracy by over 20% on multiple benchmarks

03 Masking VER heads degrades model reasoning ability

Abstract

While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive…
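
The entropy trigger described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold value, the top-k selection, and the averaging of attention across heads are all assumptions made for illustration, and the attention weights are toy numbers rather than outputs of a real VLM.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_verbalize(probs, threshold=1.0):
    """Return True when entropy exceeds the threshold.

    The threshold of 1.0 nat is an arbitrary illustrative value,
    not one taken from the paper.
    """
    return token_entropy(probs) > threshold


def top_attended_patches(head_attention, ver_heads, k=2):
    """Average attention over the given heads and return the k patch
    indices with the highest mean weight.

    head_attention: dict mapping head id -> list of weights over image patches.
    ver_heads: head ids assumed to be already identified as VER heads.
    """
    n_patches = len(head_attention[ver_heads[0]])
    mean = [
        sum(head_attention[h][i] for h in ver_heads) / len(ver_heads)
        for i in range(n_patches)
    ]
    return sorted(range(n_patches), key=lambda i: mean[i], reverse=True)[:k]


# Toy example: three heads attending over three image patches.
attention = {
    0: [0.1, 0.7, 0.2],
    1: [0.2, 0.5, 0.3],
    2: [0.9, 0.05, 0.05],
}
next_token_probs = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 tokens

if should_verbalize(next_token_probs):
    # Hypothetical follow-up step: describe these patches in text before answering.
    evidence_patches = top_attended_patches(attention, ver_heads=[0, 1], k=2)
```

In this sketch the trigger fires only when the next-token distribution is flat enough, so a confident prediction skips the extra verbalization step. How the paper selects VER heads and turns the attended regions into text is not covered by the truncated abstract above.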

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications