Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

Le Xu; Chenxing Li; Yong Ren; Yujie Chen; Yu Gu; Ruibo Fu; Shan Yang; Dong Yu

arXiv:2505.22045·cs.MM·May 29, 2025

Open Access

TL;DR

This paper introduces an entropy-aware gated fusion framework with audiovisual shuffling to improve visual-guide audio captioning, especially under audiovisual misalignment, achieving better accuracy and faster inference.

Contribution

We propose a novel entropy-aware gated fusion method combined with audiovisual shuffling to enhance robustness against audiovisual mismatch in captioning systems.
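The audiovisual shuffling idea can be illustrated with a small sketch. The abstract describes it as a batch-wise technique that creates synthetic mismatched training pairs; everything else here is an assumption made for illustration, including the function name `shuffle_visual_in_batch`, the `mismatch_prob` parameter, and the choice to swap in a visual item from a different position in the same batch. The paper's actual sampling scheme and mismatch ratio are not given on this page.

```python
import random


def shuffle_visual_in_batch(audio, visual, mismatch_prob=0.5, seed=None):
    """Pair some audio items with visual items taken from other batch positions.

    Hypothetical helper, not the paper's code. Returns (audio, new_visual,
    is_matched), where is_matched[i] is False if visual item i was swapped.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual batches must have the same length")

    rng = random.Random(seed)
    n = len(audio)
    new_visual = list(visual)
    is_matched = [True] * n

    # A batch of one item has no other item to swap in.
    if n < 2:
        return list(audio), new_visual, is_matched

    for i in range(n):
        if rng.random() < mismatch_prob:
            # Draw an index j != i uniformly from the other n - 1 positions.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            new_visual[i] = visual[j]
            is_matched[i] = False

    return list(audio), new_visual, is_matched


if __name__ == "__main__":
    audio_batch = ["a0", "a1", "a2", "a3"]
    visual_batch = ["v0", "v1", "v2", "v3"]
    print(shuffle_visual_in_batch(audio_batch, visual_batch, seed=0))
```

Training on a mix of matched and deliberately mismatched pairs gives the model examples where the visual stream should be ignored, which is the situation the entropy-aware gate is meant to detect.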

Findings

01 Outperforms existing methods on the AudioCaps benchmark.

02 Significantly improves robustness to audiovisual misalignment.

03 Achieves approximately 6x faster inference.

Abstract

Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution…
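As a rough illustration of entropy-aware gating, the sketch below computes the entropy of one row of cross-attention weights and turns it into a scalar gate on the visual contribution. The abstract only says that attention entropy is used to identify and suppress misleading visual cues; the specific choices here are assumptions: that diffuse (high-entropy) attention signals an unreliable visual stream, the gate formula `1 - normalized entropy`, the additive fusion, and all function names. The paper's gate may well be learned rather than fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(weights):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(n)."""
    n = len(weights)
    if n < 2:
        return 0.0
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(n)


def entropy_gate(scores):
    """Assumed gate: near 1 for peaked attention, near 0 for uniform attention."""
    return 1.0 - normalized_entropy(softmax(scores))


def gated_fusion(audio_feat, visual_ctx, scores):
    """Assumed fusion: add the visual context to the audio feature, scaled by the gate."""
    g = entropy_gate(scores)
    return [a + g * v for a, v in zip(audio_feat, visual_ctx)]


if __name__ == "__main__":
    audio_feat = [1.0, 2.0]
    visual_ctx = [0.5, -0.5]
    # Peaked attention: the visual context passes through almost unchanged.
    print(gated_fusion(audio_feat, visual_ctx, [20.0, 0.0, 0.0, 0.0]))
    # Uniform attention: the visual context is almost fully suppressed.
    print(gated_fusion(audio_feat, visual_ctx, [0.0, 0.0, 0.0, 0.0]))
```

Normalizing by `log(n)` keeps the gate comparable across different numbers of visual tokens, which is a design choice of this sketch rather than something stated in the abstract.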

Taxonomy

Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings