VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering

Zihu Wang; Boxun Xu; Yuxuan Xia; Peng Li

arXiv:2512.12089·cs.CV·December 16, 2025

Open Access

TL;DR

VEGAS is an inference-time method that reduces hallucinations in large vision-language models by injecting the vision encoder's attention maps into the language model's middle layers, which keeps generated text more consistent with the visual evidence.

Contribution

This work introduces VEGAS, a simple technique that adaptively steers language model decoding using vision encoder attention, significantly mitigating hallucinations in vision-language tasks.

Findings

01. VEGAS achieves state-of-the-art hallucination reduction across benchmarks.

02. Integrating vision encoder attention maps effectively suppresses hallucinations.

03. Analysis shows hallucinations correlate with misfocused visual attention.

Abstract

Large vision-language models (LVLMs) exhibit impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder's own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model's middle layers. Injecting the…
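
The abstract is truncated before it describes the injection step, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes attention over image patches can be treated as a probability distribution, uses Shannon entropy as a stand-in measure of how concentrated a map is, and blends the two maps with a fixed weight. The function names, the `alpha` parameter, and the blending rule are all hypothetical; the adaptive rule VEGAS actually uses is not reproduced here.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def entropy(dist):
    """Shannon entropy in nats; lower values mean more concentrated attention."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def steer_visual_attention(lm_attn, encoder_attn, alpha):
    """Hypothetical steering rule (an assumption, not taken from the paper).

    Blends the language model's attention over image patches with the
    vision encoder's attention over the same patches, then renormalizes.
    alpha = 0 keeps the language model's map; alpha = 1 uses the encoder's.
    """
    if len(lm_attn) != len(encoder_attn):
        raise ValueError("both maps must cover the same patches")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    lm = normalize(lm_attn)
    enc = normalize(encoder_attn)
    return normalize([(1 - alpha) * a + alpha * b for a, b in zip(lm, enc)])


# Toy example with four patches: a diffuse language-model map and a
# vision-encoder map concentrated on the first patch.
lm_map = [0.25, 0.25, 0.25, 0.25]
encoder_map = [0.7, 0.1, 0.1, 0.1]
steered = steer_visual_attention(lm_map, encoder_map, alpha=0.5)
print(entropy(lm_map), entropy(steered))
```

In this toy case the steered map has lower entropy than the diffuse language-model map, which mirrors the abstract's observation that more concentrated visual attention accompanies fewer hallucinations. In a real LVLM such a rule would be applied to the visual-token attention inside selected middle layers, and the blend weight would presumably vary during decoding rather than stay fixed.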

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Hallucinations in medical conditions