Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs
Liu Yu, Zhonghao Chen, Ping Kuang, Zhikun Feng, Fan Zhou, Lan Wang, Gillian Dobbie

TL;DR
This paper introduces Owl, a causally-grounded framework for reducing object hallucinations in LVLMs. It models the interaction between visual and textual attention, quantifies each modality's contribution, and intervenes on attention dynamically during decoding; the authors report state-of-the-art results.
Contribution
The paper proposes a causally-grounded attention intervention framework that combines a new metric, VTACR, with a dual-path decoding strategy to mitigate hallucinations in LVLMs.
Findings
Owl significantly reduces hallucinations on POPE and CHAIR benchmarks.
VTACR correlates with hallucination likelihood, guiding interventions.
Dual-path contrastive decoding improves faithfulness without sacrificing understanding.
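To make the contrastive idea in the last finding concrete, here is a minimal sketch of the generic contrastive-decoding combination rule. It is not Owl's implementation: the two inputs (`logits_grounded`, `logits_prior`), the weight `alpha`, and the function name are all hypothetical, and the summary does not say how Owl constructs its two paths.

```python
import math


def contrastive_logits(logits_grounded, logits_prior, alpha=1.0):
    """Generic contrastive combination (assumed form, not taken from the paper).

    Boosts tokens that the first set of logits prefers over the second:
    (1 + alpha) * grounded - alpha * prior, applied per token.
    """
    if len(logits_grounded) != len(logits_prior):
        raise ValueError("both logit vectors must cover the same vocabulary")
    return [(1 + alpha) * g - alpha * p
            for g, p in zip(logits_grounded, logits_prior)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary: the first path prefers token 0, the second token 1.
combined = contrastive_logits([2.0, 1.0], [1.0, 2.0], alpha=1.0)
probs = softmax(combined)
```

With `alpha = 0` the rule reduces to ordinary decoding from the first set of logits; larger values penalize tokens that the second set favors more strongly.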
Abstract
Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate…
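The abstract names VTACR but this excerpt does not give its formula. As an illustration only, the sketch below assumes one plausible reading: the ratio of the mean attention weight on visual tokens to the mean attention weight on textual tokens for a single decoding step. The function name, the `is_visual` mask, and the per-token averaging are all assumptions, not the paper's definition.

```python
def vtacr(attn_row, is_visual, eps=1e-12):
    """Assumed visual-to-textual attention ratio for one decoding step.

    attn_row:  attention weights from the current token to each context token.
    is_visual: parallel list of booleans, True where the context token is visual.
    Returns mean visual attention divided by mean textual attention.
    """
    if len(attn_row) != len(is_visual):
        raise ValueError("attn_row and is_visual must have the same length")
    visual = [a for a, v in zip(attn_row, is_visual) if v]
    textual = [a for a, v in zip(attn_row, is_visual) if not v]
    if not visual or not textual:
        raise ValueError("need at least one visual and one textual token")
    mean_visual = sum(visual) / len(visual)
    mean_textual = sum(textual) / len(textual)
    # eps guards against division by zero when textual attention vanishes.
    return mean_visual / (mean_textual + eps)


# Toy step: two visual tokens receive little attention, two text tokens a lot.
ratio = vtacr([0.1, 0.1, 0.4, 0.4], [True, True, False, False])
```

Under this reading, a ratio well below 1 corresponds to the low-VTACR regime the abstract describes, where textual context receives most of the attention.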
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
