Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

Boqi Chen; Xudong Liu; Jianing Qiu

arXiv:2602.11737·cs.CV·February 13, 2026

TL;DR

This paper improves visual contrastive decoding (VCD) by contrasting against an object-aligned auxiliary view, which reduces object hallucinations in multimodal large language models (MLLMs) at the cost of a single cacheable forward pass.

Contribution

The authors propose an object-aligned auxiliary view for visual contrastive decoding that mitigates object hallucinations in MLLMs and plugs into existing models and VCD pipelines without modification.

Findings

01. Consistent gains on two popular object hallucination benchmarks across two MLLMs.
02. The method is prompt-agnostic and model-agnostic, so it plugs into an existing VCD pipeline.
03. The extra computation is a single cacheable forward pass.

Abstract

We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
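The two steps the abstract describes, removing the most salient visual evidence and then contrasting the two views at decoding time, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the function names, the `mask_ratio`, `fill`, `alpha`, and `beta` parameters, and the toy numbers are hypothetical, and the contrast rule is the commonly used VCD form `(1 + alpha) * logits_orig - alpha * logits_aux` with a plausibility cutoff, which the abstract itself does not spell out. The attention scores stand in for the object-centric attention of a self-supervised Vision Transformer.

```python
import math


def mask_salient_patches(attention, patches, mask_ratio=0.3, fill=0.0):
    """Build an auxiliary view by replacing the most-attended patches with `fill`.

    `attention` holds one saliency score per patch (assumed to come from a
    self-supervised ViT); `mask_ratio` and `fill` are illustrative choices.
    Returns the masked patch list and the sorted indices that were masked.
    """
    n = len(attention)
    k = max(1, int(round(mask_ratio * n)))
    top = sorted(range(n), key=lambda i: attention[i], reverse=True)[:k]
    masked = list(patches)
    for i in top:
        masked[i] = fill
    return masked, sorted(top)


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_logits(logits_orig, logits_aux, alpha=1.0, beta=0.1):
    """Combine next-token logits from the original and auxiliary views.

    Tokens whose probability under the original view is below `beta` times
    the top probability are ruled out (-inf); the remaining tokens get
    (1 + alpha) * original - alpha * auxiliary.
    """
    logp_orig = log_softmax(logits_orig)
    cutoff = max(logp_orig) + math.log(beta)
    combined = []
    for lo, la, lp in zip(logits_orig, logits_aux, logp_orig):
        if lp < cutoff:
            combined.append(float("-inf"))
        else:
            combined.append((1 + alpha) * lo - alpha * la)
    return combined


# Toy example: four patches, three candidate tokens ("dog", "cat", "table").
aux_view, masked_idx = mask_salient_patches(
    [0.10, 0.50, 0.05, 0.35], ["p0", "p1", "p2", "p3"], mask_ratio=0.5
)
# "dog" loses support once the salient patches are masked; "cat" does not.
scores = contrastive_logits([2.0, 1.9, -3.0], [0.5, 1.9, -3.0])
```

In this toy case the token whose logit drops when the salient patches are hidden ("dog") is boosted relative to the token whose logit is unchanged ("cat"), which is the sense in which the auxiliary view yields a contrast signal. Because the saliency scores depend only on the image, the masked view can be computed once per image and reused across prompts, consistent with the abstract's description of a single cacheable forward pass.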

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications