Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, and Haopeng Zhang

TL;DR
This paper presents PND, a training-free inference method that improves vision-language model outputs by balancing visual evidence and linguistic priors, reducing hallucinations.
Contribution
Introduces PND, an inference framework that enforces visual fidelity in VLMs without retraining by contrasting a positive decoding path with a negative one.
Findings
PND achieves state-of-the-art results on the POPE, MME, and CHAIR benchmarks.
PND reduces object hallucination in vision-language models.
PND operates without additional training or fine-tuning.
Abstract
Vision-Language Models (VLMs) are frequently undermined by object hallucination, generating content that contradicts visual reality, due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our finding of an attention imbalance in VLMs, where visual features are under-weighted. Our framework introduces a dual-path contrast: a positive path that amplifies visual evidence and a negative path that constructs counterfactuals to penalize prior-dominant generation. By contrasting outputs from both paths during decoding, PND steers generation toward visually grounded results. Experiments on POPE, MME, and CHAIR demonstrate state-of-the-art performance without retraining.
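The abstract describes contrasting a positive path against a negative path at each decoding step but does not give the combination rule. The sketch below is only an illustration of that idea, assuming the generic contrastive-decoding form used in earlier work: scale up the positive-path log-probabilities, subtract the negative-path ones, and restrict the choice to tokens the positive path already finds plausible. The function names, the `alpha` and `beta` parameters, and the toy logits are all hypothetical and are not taken from the paper; how PND actually builds its two paths is not shown here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_step(pos_logits, neg_logits, alpha=1.0, beta=0.1):
    """Combine one step's logits from two paths (assumed rule, not PND's own).

    pos_logits: logits from a path with amplified visual evidence (hypothetical input)
    neg_logits: logits from a prior-dominant, counterfactual path (hypothetical input)
    alpha:      contrast strength; alpha=0 falls back to the positive path
    beta:       plausibility cutoff relative to the best positive-path token
    """
    pos_lp = log_softmax(pos_logits)
    neg_lp = log_softmax(neg_logits)
    # Keep only tokens whose positive-path probability is >= beta * max probability.
    cutoff = max(pos_lp) + math.log(beta)
    scores = []
    for p, n in zip(pos_lp, neg_lp):
        if p >= cutoff:
            scores.append((1 + alpha) * p - alpha * n)
        else:
            scores.append(float("-inf"))
    return scores

def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy three-token vocabulary: the image shows a dog, the language prior favors "cat".
vocab = ["dog", "cat", "frisbee"]
pos = [1.9, 2.0, 0.1]   # positive path still leans slightly toward "cat"
neg = [0.5, 2.5, 0.1]   # negative path strongly prefers the prior-driven "cat"

print(vocab[greedy_pick(contrast_step(pos, neg, alpha=0.0))])
print(vocab[greedy_pick(contrast_step(pos, neg, alpha=1.0))])
```

In this toy case, tokens that the negative path rates highly are penalized, so a choice driven mainly by the language prior can be overturned even when the positive path alone would have selected it. The `beta` cutoff keeps the subtraction from promoting tokens that the positive path considers implausible.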
