Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
Boqi Chen, Xudong Liu, Jianing Qiu

TL;DR
This paper improves visual contrastive decoding for multimodal large language models by contrasting against an object-aligned auxiliary view, in which the most salient visual evidence is removed. This reduces object hallucinations at the cost of a single cacheable forward pass.
Contribution
The authors propose an object-aligned auxiliary view for visual contrastive decoding that mitigates object hallucinations in MLLMs and is compatible with existing models and VCD pipelines.
Findings
Consistent gains on two popular object hallucination benchmarks across two MLLMs.
The method is prompt-agnostic and model-agnostic, and plugs into the existing VCD pipeline.
The added computation is small: a single cacheable forward pass.
Abstract
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
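The two steps the abstract describes, building an auxiliary view by removing the most salient evidence and then contrasting against it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a patch-level saliency map has already been computed (for example from a self-supervised ViT), and it uses the contrast rule commonly associated with VCD, (1 + alpha) * logits_orig - alpha * logits_aux with a plausibility cutoff beta. The function names, the patch size, the masking fraction `frac`, and the default values of `alpha` and `beta` are all illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def mask_salient_patches(image, attn, patch=16, frac=0.3, fill=0.0):
    """Blank out the `frac` most salient patches of `image`.

    image: (H, W, C) float array.
    attn:  (H // patch, W // patch) saliency scores, assumed precomputed.
    All parameter defaults are hypothetical, not taken from the paper.
    """
    _, grid_w = attn.shape
    k = max(1, int(round(frac * attn.size)))
    # Flat indices of the k highest-scoring patches.
    top = np.argsort(attn.ravel())[-k:]
    out = image.copy()
    for idx in top:
        r, c = divmod(int(idx), grid_w)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = fill
    return out


def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def contrastive_logits(logits_orig, logits_aux, alpha=1.0, beta=0.1):
    """VCD-style contrast between original-view and auxiliary-view logits.

    Tokens whose original-view probability falls below beta * max
    probability are excluded (set to -inf) before the contrast is applied.
    """
    probs = softmax(logits_orig)
    keep = probs >= beta * probs.max()
    scores = (1.0 + alpha) * logits_orig - alpha * logits_aux
    return np.where(keep, scores, -np.inf)


# Toy usage: a 4x4 single-channel image split into 2x2 patches.
image = np.ones((4, 4, 1))
attn = np.array([[0.1, 0.9], [0.2, 0.3]])
aux_view = mask_salient_patches(image, attn, patch=2, frac=0.25)

# Toy next-token logits from the original and auxiliary views.
logits_orig = np.array([2.0, 1.0, -5.0])
logits_aux = np.array([2.0, 0.0, -5.0])
adjusted = contrastive_logits(logits_orig, logits_aux)
```

In this toy example the token whose logit drops when the salient region is masked (index 1) is boosted relative to the token whose logit is unchanged (index 0), which is the intended effect of the contrast: tokens that do not depend on the removed evidence gain nothing. Because the saliency map depends only on the image, it can be computed once and cached across prompts and decoding steps.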
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
