CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment
Yunzuo Hu, Wen Li, Jing Zhang

TL;DR
This paper introduces CAE-AV, a novel framework that enhances audio-visual learning by addressing modality misalignment through cross-modal enrichment modules and semantic guidance, leading to state-of-the-art results.
Contribution
The paper proposes a new CAE-AV framework with CASTE and CASE modules, introducing lightweight objectives to improve robustness against audio-visual misalignment.
Findings
Achieves state-of-the-art performance on AVE, AVVP, AVS, and AVQA benchmarks.
Effectively alleviates audio-visual misalignment issues.
Demonstrates robustness through qualitative analyses.
Abstract
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules to mitigate audio-visual misalignment: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level…
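The abstract describes CASTE only at a high level, so the following is a rough illustrative sketch of the general idea of agreement-guided balancing, not the authors' implementation. It assumes that frame-level agreement is the cosine similarity between per-frame audio and visual features, that temporal context is the mean of the preceding and subsequent frames, and that a sigmoid gate blends the two. The function names (`frame_agreement`, `neighbor_context`, `agreement_gated_mix`) and the `temperature` parameter are hypothetical.

```python
import numpy as np


def frame_agreement(audio, visual, eps=1e-8):
    """Per-frame cosine similarity between (T, D) audio and visual features.

    Assumption: cosine similarity stands in for the paper's agreement score.
    """
    a = audio / (np.linalg.norm(audio, axis=-1, keepdims=True) + eps)
    v = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + eps)
    return np.sum(a * v, axis=-1)  # shape (T,)


def neighbor_context(features):
    """Average of the preceding and subsequent frame for each of T frames.

    Edge frames reuse themselves as the missing neighbor (edge padding).
    """
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[:-2] + padded[2:])  # shape (T, D)


def agreement_gated_mix(current, context, agreement, temperature=1.0):
    """Blend current-frame and neighboring-frame features per frame.

    Assumption: high agreement favors the current frame, low agreement
    shifts weight toward the neighboring frames.
    """
    gate = 1.0 / (1.0 + np.exp(-agreement / temperature))  # shape (T,)
    gate = gate[:, None]  # broadcast over the feature dimension
    return gate * current + (1.0 - gate) * context


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(10, 16))   # 10 frames, 16-dim features
    visual = rng.normal(size=(10, 16))
    agreement = frame_agreement(audio, visual)
    enriched = agreement_gated_mix(visual, neighbor_context(visual), agreement)
    print(enriched.shape)
```

In this sketch, frames whose audio and visual features disagree draw more heavily on their temporal neighbors, which is one plausible reading of "captured from both preceding and subsequent frames under misalignment"; the actual CASTE module may compute and use agreement quite differently.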
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and Audio Processing
