CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

Qiming Li; Zekai Ye; Xiaocheng Feng; Weihong Zhong; Libo Qin; Ruihan Chen; Lei Huang; Baohang Li; Kui Jiang; Yaowei Wang; Ting Liu; Bing Qin

arXiv:2605.04641·cs.CV·May 7, 2026

TL;DR

CAST is a training-free method that reduces object hallucination in large vision-language models by steering visual attention toward the activation pattern observed for caption queries, improving visual perception with no extra training and minimal inference overhead.

Contribution

We introduce CAST, an attention-steering technique that mitigates object hallucination in LVLMs without additional training by leveraging the attention patterns elicited by caption queries.

Findings

01. CAST reduces object hallucination by an average of 6.03% across models and benchmarks.

02. It achieves state-of-the-art hallucination mitigation with minimal inference overhead.

03. The method enhances LVLMs' visual perception capabilities effectively.

Abstract

Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from the visual information, leading to object hallucination. To tackle this, recent works mostly rely either on expensive manual annotation and training or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs attend to visual information significantly more strongly when answering caption queries than when answering non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly…
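The abstract is cut off before it describes the procedure, so the following is not the authors' implementation. It is only a generic sketch of what "probe attention heads, then steer them" can look like, run on synthetic data. The function names, the mean-difference probe, the top-k selection rule, and the steering strength `alpha` are all assumptions made for illustration.

```python
import numpy as np

def probe_heads(caption_acts, other_acts):
    """Score each head with a simple linear probe (hypothetical choice).

    Inputs have shape (n_samples, n_heads, head_dim). The probe projects
    activations onto the per-head mean-difference direction and thresholds
    at the midpoint between the two class means.
    """
    mu_c = caption_acts.mean(axis=0)            # (n_heads, head_dim)
    mu_o = other_acts.mean(axis=0)
    direction = mu_c - mu_o                     # caption minus non-caption
    midpoint = 0.5 * (mu_c + mu_o)
    proj_c = np.einsum("nhd,hd->nh", caption_acts - midpoint, direction)
    proj_o = np.einsum("nhd,hd->nh", other_acts - midpoint, direction)
    # Balanced in-sample accuracy per head.
    acc = 0.5 * ((proj_c > 0).mean(axis=0) + (proj_o <= 0).mean(axis=0))
    return acc, direction

def select_heads(acc, k):
    """Return the indices of the k heads with the highest probe accuracy."""
    return np.argsort(acc)[::-1][:k]

def steer(head_out, direction, heads, alpha=1.0):
    """Shift the selected heads' outputs along their unit caption direction.

    head_out has shape (n_heads, head_dim); other heads are left unchanged.
    """
    steered = head_out.copy()
    unit = direction[heads] / np.linalg.norm(direction[heads], axis=-1, keepdims=True)
    steered[heads] += alpha * unit
    return steered

# Synthetic demo: heads 1 and 4 respond differently to "caption" queries.
rng = np.random.default_rng(0)
n_samples, n_heads, head_dim = 200, 6, 8
other_acts = rng.normal(size=(n_samples, n_heads, head_dim))
caption_acts = rng.normal(size=(n_samples, n_heads, head_dim))
caption_acts[:, [1, 4], :] += 3.0

acc, direction = probe_heads(caption_acts, other_acts)
top_heads = select_heads(acc, k=2)
head_out = rng.normal(size=(n_heads, head_dim))
steered = steer(head_out, direction, top_heads, alpha=0.5)
print("selected heads:", sorted(top_heads.tolist()))
```

In a real LVLM the activations would come from forward hooks on the attention layers rather than from a random generator, and the probe would normally be evaluated on held-out data.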
