MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Sangyun Chung, Se Yeon Kim, Youngchae Chee, and Yong Man Ro

TL;DR
This paper introduces MAD, a training-free, modality-adaptive decoding method that reduces cross-modal hallucinations in multimodal large language models by dynamically weighting modality-specific branches based on task relevance.
Contribution
MAD is a novel, training-free approach that leverages model self-assessment to adaptively weight modality-specific decoding, improving multimodal reasoning robustness.
Findings
MAD significantly reduces cross-modal hallucinations in experiments.
It improves performance on CMM and AVHBench benchmarks.
The approach helps the model focus on the modalities relevant to each task.
Abstract
Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple…
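The weighting step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the abstract does not give the combination rule, so the additive contrastive form, the `alpha` scale, and the names `modality_weights` and `adaptive_contrastive_logits` are all hypothetical. The sketch assumes the self-assessment query yields one relevance score per modality, and that logits are available both for the full input and for inputs with one modality removed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_weights(relevance_scores):
    """Turn per-modality relevance scores into probabilities.

    relevance_scores: dict mapping a modality name (e.g. "audio") to the
    score the model assigned it when asked which modalities the task needs.
    How those scores are read out of the model is an assumption here.
    """
    names = list(relevance_scores)
    probs = softmax([relevance_scores[n] for n in names])
    return dict(zip(names, probs))


def adaptive_contrastive_logits(full_logits, ablated_logits, weights, alpha=1.0):
    """Combine contrastive branches, each scaled by its modality weight.

    full_logits: next-token logits with all modalities present.
    ablated_logits: dict mapping modality name to the logits obtained with
        that modality removed.
    The contrast (full - ablated) isolates what a modality contributes;
    relevant modalities get a larger share of that boost (assumed form).
    """
    combined = list(full_logits)
    for modality, ablated in ablated_logits.items():
        w = weights[modality]
        for i, (f, a) in enumerate(zip(full_logits, ablated)):
            combined[i] += alpha * w * (f - a)
    return combined


# Toy usage with a two-token vocabulary and made-up numbers.
weights = modality_weights({"audio": 0.0, "video": 0.0})
logits = adaptive_contrastive_logits(
    full_logits=[1.0, 2.0],
    ablated_logits={"audio": [1.0, 1.0], "video": [0.0, 2.0]},
    weights=weights,
)
```

In this toy form, a modality judged irrelevant receives a weight near zero, so its contrastive branch barely changes the final logits, which is one way to read the abstract's "suppressing cross-modal interference".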
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
