Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang

TL;DR
This paper introduces PND, a training-free inference method that reduces object hallucination in vision-language models by enforcing visual fidelity through a dual-path contrast mechanism.
Contribution
The paper presents PND, a novel inference framework that mitigates hallucination in VLMs without retraining, by correcting attention deficits and contrasting visual evidence during decoding.
Findings
PND improves accuracy by up to 6.5% on benchmarks.
It substantially reduces object hallucination in VLMs.
PND enhances descriptive detail without retraining models.
Abstract
Vision-Language Models (VLMs) are frequently undermined by object hallucination (generating content that contradicts visual reality) due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: the positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs…
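The dual-path contrast described in the abstract can be illustrated with a small sketch. The abstract is truncated before it states how the two paths are combined, so the rule below is the generic contrastive-decoding form, (1 + alpha) * positive - alpha * negative on log-probabilities, used here as an assumption and not as the paper's formula. The function names, the alpha parameter, and the three-word toy vocabulary are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrast_scores(pos_logits, neg_logits, alpha=1.0):
    """Combine the two paths (ASSUMED generic contrastive form, not from the paper).

    Tokens that stay likely after the core object is degraded (negative path)
    are treated as prior-driven and are penalized.
    """
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    return [(1 + alpha) * p - alpha * n for p, n in zip(pos, neg)]


def pick_token(scores):
    """Greedy choice: index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy vocabulary and next-token logits.
vocab = ["dog", "cat", "frisbee"]
pos_logits = [2.0, 0.5, 1.8]  # positive path: visual evidence amplified
neg_logits = [2.0, 0.4, 0.2]  # negative path: core object's features degraded

scores = contrast_scores(pos_logits, neg_logits, alpha=1.0)
greedy_choice = vocab[pick_token(pos_logits)]
contrast_choice = vocab[pick_token(scores)]
print(greedy_choice, contrast_choice)
```

In this toy setup, "dog" keeps a high score even when the object is degraded, which marks it as prior-driven, while "frisbee" loses most of its score, which marks it as visually grounded. The contrast therefore shifts the choice toward the grounded token.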
