Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Ami Baid, Zihui Xue, Kristen Grauman

TL;DR
This paper introduces ACPO, a training framework for Audio-Visual Language Models that reduces video-driven audio hallucination and improves audio grounding. It penalizes visual descriptions presented as audio facts, and it penalizes outputs that stay the same when the audio track is swapped.
Contribution
ACPO is a dual-axis preference learning method, combining an output-contrastive and an input-contrastive objective, that explicitly discourages visual dominance and enhances audio fidelity in AVLMs.
Findings
ACPO mitigates video-driven audio hallucination in AVLMs.
Models trained with ACPO show improved audio grounding accuracy.
ACPO maintains overall multimodal performance while enhancing audio fidelity.
Abstract
While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising…
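The abstract names two contrastive axes but does not give the loss, so the following is only a minimal sketch of how such a dual objective could be combined, assuming a DPO-style log-sigmoid preference loss on sequence log-probabilities. The function names, the `beta` and `weight` parameters, and the choice of a DPO-style form are all illustrative assumptions, not the authors' formulation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(logp_preferred: float, logp_rejected: float,
                    ref_preferred: float = 0.0, ref_rejected: float = 0.0,
                    beta: float = 0.1) -> float:
    """DPO-style loss (assumed form): small when the preferred case has a
    higher reference-adjusted log-probability than the rejected one."""
    margin = (logp_preferred - ref_preferred) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def dual_contrastive_loss(logp_audio_answer: float, logp_visual_answer: float,
                          logp_true_audio: float, logp_swapped_audio: float,
                          weight: float = 1.0, beta: float = 0.1) -> float:
    """Hypothetical combination of the two axes described in the abstract.

    Output axis: same input, prefer the audio-grounded answer over a visual
    description presented as an audio fact.
    Input axis: same answer, prefer it under the true audio track over a
    swapped track, so audio-invariant generation is penalized.
    """
    output_term = preference_loss(logp_audio_answer, logp_visual_answer, beta=beta)
    input_term = preference_loss(logp_true_audio, logp_swapped_audio, beta=beta)
    return output_term + weight * input_term
```

Under these assumptions, a model whose answer is equally likely with the true and the swapped audio gets no reduction from the input term (it stays at log 2), while a model whose answer becomes more likely under the true audio is rewarded with a lower loss.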
