CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Qiming Li; Zekai Ye; Xiaocheng Feng; Weihong Zhong; Libo Qin; Ruihan Chen; Baohang Li; Kui Jiang; Yaowei Wang; Ting Liu; Bing Qin

arXiv:2506.23590·cs.CV·July 1, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces CAI, a training-free method that uses attention patterns to reduce object hallucination in large vision-language models, improving accuracy with minimal inference overhead.

Contribution

The paper proposes a novel, plug-and-play attention intervention method that mitigates hallucination without additional training or significant inference costs.

Findings

01

Achieves state-of-the-art hallucination mitigation across four benchmarks.

02

Effective for both discriminative and generative vision-language tasks.

03

Minimal additional inference cost required.

Abstract

Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative…
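The observation in the abstract (stronger attention to visual information for caption queries than for non-caption queries) can be illustrated with a minimal sketch. It assumes that, for each attention head, the attention weights of the final query token over all key positions are available, and that the positions of the image tokens are known. The helper names `visual_attention_share`, `caption_gap`, and `top_heads`, as well as the toy numbers, are hypothetical and are not taken from the paper; this is not the authors' head-selection procedure.

```python
from typing import List, Sequence


def visual_attention_share(attn_row: Sequence[float],
                           image_positions: Sequence[int]) -> float:
    """Fraction of one head's attention mass that lands on image tokens."""
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[i] for i in image_positions) / total


def caption_gap(caption_attn: Sequence[Sequence[float]],
                other_attn: Sequence[Sequence[float]],
                image_positions: Sequence[int]) -> List[float]:
    """Per-head gap in visual attention share: caption query minus non-caption query."""
    return [
        visual_attention_share(c, image_positions)
        - visual_attention_share(o, image_positions)
        for c, o in zip(caption_attn, other_attn)
    ]


def top_heads(gaps: Sequence[float], k: int) -> List[int]:
    """Indices of the k heads with the largest gap (hypothetical 'caption-sensitive' heads)."""
    return sorted(range(len(gaps)), key=lambda h: gaps[h], reverse=True)[:k]


# Toy data (made up): 3 heads, 5 key positions, positions 0..2 are image tokens.
image_positions = [0, 1, 2]
caption_attn = [
    [0.30, 0.30, 0.20, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.35, 0.35],
    [0.25, 0.25, 0.25, 0.15, 0.10],
]
other_attn = [
    [0.10, 0.10, 0.10, 0.35, 0.35],
    [0.10, 0.10, 0.10, 0.35, 0.35],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]

gaps = caption_gap(caption_attn, other_attn, image_positions)
selected = top_heads(gaps, 2)
print(gaps, selected)
```

In this toy setup, heads 0 and 2 attend more to image tokens under the caption query and are ranked first; head 1 shows no gap. With a real model, the attention rows would come from the model's attention outputs rather than hand-written lists.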

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- This paper is the first to explicitly reveal that caption queries uniquely enhance visual attention in specific LVLM attention heads. This discovery is substantiated by quantitative evidence and layer-wise analysis, demonstrating that the attention amplification correlates with reduced object hallucination and provides valuable insights into fine-grained visual perception mechanisms.
- The proposed method demonstrates state-of-the-art performance across different benchmarks while maintaining s

Weaknesses

- The study does not adequately address how variations in non-caption queries affect the proposed method. The probing methodology relies on a limited set of non-caption queries, potentially leading to inaccurate perception-refined vectors when handling diverse real-world queries. This limitation may compromise the method's robustness in practical applications.
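One way to probe this concern is a split-half stability check. The sketch below assumes, purely for illustration, that a perception-refined vector is estimated as the mean difference between head activations under caption queries and under non-caption queries; the review excerpt does not define the vector, so `shift_vector`, `split_half_stability`, and the toy activations are hypothetical.

```python
import math
import random
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def shift_vector(caption_acts, other_acts) -> List[float]:
    """Assumed estimator: mean caption activation minus mean non-caption activation."""
    mc, mo = mean_vector(caption_acts), mean_vector(other_acts)
    return [a - b for a, b in zip(mc, mo)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def split_half_stability(caption_acts, other_acts, trials: int = 20, seed: int = 0) -> float:
    """Average cosine similarity between vectors estimated from two disjoint
    halves of the non-caption activations. Values near 1 suggest the estimate
    is insensitive to which non-caption queries were sampled."""
    if len(other_acts) < 2:
        raise ValueError("need at least two non-caption activations")
    rng = random.Random(seed)
    sims = []
    for _ in range(trials):
        shuffled = list(other_acts)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        v1 = shift_vector(caption_acts, shuffled[:half])
        v2 = shift_vector(caption_acts, shuffled[half:])
        sims.append(cosine(v1, v2))
    return sum(sims) / len(sims)


# Toy activations (made up), 2-dimensional for readability.
caption_acts = [[1.0, 0.0], [1.0, 0.2]]
other_acts = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
stability = split_half_stability(caption_acts, other_acts)
print(round(stability, 4))
```

A low score on real activations would support the reviewer's point that a small probing set yields unreliable vectors; a high score would suggest the estimate is robust to the choice of non-caption queries, at least within the sampled distribution.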

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Well-written method section.
- Improvements across models and datasets.

Weaknesses

My concern regarding the presented method is its practicality: the method works well for object recognition but might degrade other tasks that require textual understanding (like text translation in Figure 4). I suppose this is due to the design of the process for selecting the intervention heads, which masks the attention towards textual tokens when calculating the modified attention output. This implies that the binary classifiers are trained on synthetic representations and selecting

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* The paper is well-written and easy to follow.
* This paper addresses object hallucination, which remains a critical problem in LVLMs.
* The idea of adjusting visual attention for queries where the models originally fail to provide strong visual focus seems interesting.
* Addressing hallucination through intervention at inference is an efficient choice, compared to techniques that require heavy training.

Weaknesses

* In Fig. 1: "Are there both a helmet and a motorcycle in the image?" The reason LLaVA 1.5 answered "Yes" is because (most likely) the model heavily suffered from a "yes" bias. This issue has been highlighted in prior works, e.g., *Mitigating Object Hallucination in LVLMs via Data-augmented Phrase-level Alignment, ICLR 2025*; you can do a quick study, check the confusion matrix with and without CAI -- you will most likely see that the improvements mainly stem from YES to NO.
* The experimental
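The quick study suggested in the first point above could look like the following minimal sketch. It assumes yes/no ground-truth labels and per-question predictions from the same model with and without the intervention; the function names and the six toy examples are hypothetical and are not results from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion(labels: Sequence[str], preds: Sequence[str]) -> Dict[str, int]:
    """Binary confusion matrix with 'yes' as the positive class."""
    c = Counter(zip(labels, preds))
    return {
        "TP": c[("yes", "yes")],
        "FN": c[("yes", "no")],
        "FP": c[("no", "yes")],
        "TN": c[("no", "no")],
    }


def flips(base_preds: Sequence[str], new_preds: Sequence[str]) -> Dict[str, int]:
    """Count answers that changed between the two runs, by direction."""
    c = Counter((b, n) for b, n in zip(base_preds, new_preds) if b != n)
    return {"yes->no": c[("yes", "no")], "no->yes": c[("no", "yes")]}


def yes_ratio(preds: Sequence[str]) -> float:
    """Share of 'yes' answers; a value far above the label prior hints at a yes bias."""
    return preds.count("yes") / len(preds)


# Toy data (made up for illustration only).
labels = ["yes", "no", "no", "yes", "no", "no"]
base = ["yes", "yes", "yes", "yes", "yes", "no"]  # baseline model
cai = ["yes", "no", "yes", "yes", "no", "no"]     # same model with intervention

base_cm = confusion(labels, base)
cai_cm = confusion(labels, cai)
changed = flips(base, cai)
print(base_cm, cai_cm, changed, yes_ratio(base), yes_ratio(cai))
```

If nearly all changed answers are yes->no and false positives drop while true positives stay flat, the gain is consistent with a reduced "yes" bias, which is the pattern the reviewer expects; a meaningful number of no->yes corrections would point to a broader effect.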

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis