AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
Chaeyoung Jung, Youngjoon Jang, Joon Son Chung

TL;DR
This paper introduces AVCD, a novel decoding method for audio-visual large language models that reduces hallucinations by adaptively masking less dominant modalities using attention, improving accuracy and robustness.
Contribution
AVCD is a training-free, modality-aware decoding framework that models trimodal interactions and employs entropy-guided adaptive decoding to mitigate hallucinations in AV-LLMs.
Findings
AVCD outperforms existing methods on AVHBench with 2-7% accuracy improvements.
AVCD effectively suppresses hallucinations in multimodal outputs.
AVCD demonstrates strong robustness and generalizability across datasets.
Abstract
Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less…
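To make the contrastive decoding idea in the abstract concrete, here is a minimal sketch of one decoding step. It uses the generic two-branch formulation common in VLM contrastive decoding, (1 + alpha) * original - alpha * perturbed, and is not AVCD's trimodal formulation or its attention-based masking, which this summary does not spell out. The entropy gate is an assumption about what "entropy-guided" could mean (skip the contrast when the model is already confident). The function names, `alpha`, and `entropy_threshold` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def contrastive_logits(original, perturbed, alpha=1.0):
    """Generic contrastive decoding: amplify the original logits and
    subtract the logits obtained from a perturbed (e.g. masked) input."""
    return [(1.0 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def decode_step(original, perturbed, alpha=1.0, entropy_threshold=0.5):
    """Pick the next token id greedily.

    Assumption (not from the paper summary): when the original
    distribution has low entropy, the model is treated as confident
    and the contrast is skipped.
    """
    if entropy(softmax(original)) < entropy_threshold:
        scores = original
    else:
        scores = contrastive_logits(original, perturbed, alpha)
    return max(range(len(scores)), key=scores.__getitem__)
```

With `original = [2.0, 1.9, 0.1]` and `perturbed = [2.5, 0.5, 0.1]`, plain greedy decoding would pick token 0, but token 0 is even more likely under the perturbed input, so the contrast moves the choice to token 1. With a confident `original = [5.0, 0.0, 0.0]`, the gate leaves the original choice of token 0 untouched regardless of the perturbed branch.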
Taxonomy
Topics: Digital Media Forensic Detection · Music and Audio Processing · Hearing Loss and Rehabilitation
