SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
Yuxuan Xia, Siheng Wang, Peng Li

TL;DR
This paper introduces SDCD, a training-free decoding method that reduces object hallucinations in large vision-language models by disrupting visual structure during decoding, leading to more accurate multimodal understanding.
Contribution
The paper proposes a novel, training-free contrastive decoding algorithm called SDCD that mitigates hallucinations by penalizing texture-driven biases in visual encoding.
Findings
SDCD significantly reduces hallucinations across multiple benchmarks.
SDCD improves the multimodal reasoning capabilities of LVLMs.
The method is training-free and easy to integrate into existing systems.
Abstract
Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, it often overlooks the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of vision encoders under weak structural supervision, acts as a contributing factor to object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of…
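The abstract is cut off before it describes the calibration step, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes that "structure disruption" can be approximated by shuffling image patches (which keeps local texture but destroys global layout), and that the contrast takes the form commonly used in contrastive decoding, (1 + alpha) * clean - alpha * disrupted. The names `shuffle_patches`, `contrastive_logits`, and `alpha` are hypothetical and do not come from the paper.

```python
import random


def shuffle_patches(image, patch, seed=0):
    """Permute non-overlapping patch x patch tiles of a 2D image (list of rows).

    Assumption: patch shuffling stands in for "structure disruption".
    Pixels inside each tile are untouched, so local texture survives
    while the global arrangement of tiles is scrambled.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    targets = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    sources = targets[:]
    random.Random(seed).shuffle(sources)
    out = [[None] * w for _ in range(h)]
    for (tr, tc), (sr, sc) in zip(targets, sources):
        for i in range(patch):
            for j in range(patch):
                out[tr + i][tc + j] = image[sr + i][sc + j]
    return out


def contrastive_logits(logits_clean, logits_disrupted, alpha=1.0):
    """Combine next-token logits from the clean and the disrupted view.

    Assumption: a generic contrastive-decoding form, not the paper's
    exact formula. Tokens whose score does not drop when structure is
    destroyed gain nothing, so they lose rank against tokens that
    actually depend on the intact layout.
    """
    if len(logits_clean) != len(logits_disrupted):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * c - alpha * d
            for c, d in zip(logits_clean, logits_disrupted)]


# Toy example with made-up numbers: token 0 relies on structure (its logit
# falls under disruption), token 1 is texture-driven (its logit is unchanged).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
disrupted = shuffle_patches(image, patch=2, seed=0)
calibrated = contrastive_logits([2.0, 1.5, 0.1], [0.5, 1.5, 0.1], alpha=1.0)
```

In a real pipeline the two logit vectors would come from two forward passes of the same LVLM, one on the original image and one on the disrupted image; here they are hard-coded purely for illustration.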
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
