CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

Yunzuo Hu; Wen Li; Jing Zhang

arXiv:2602.08309 · cs.CV · February 10, 2026

TL;DR

This paper introduces CAE-AV, a framework that improves audio-visual learning by addressing modality misalignment through cross-modal enrichment modules and caption-based semantic guidance, reporting state-of-the-art results on the AVE, AVVP, AVS, and AVQA benchmarks.

Contribution

The paper proposes a new CAE-AV framework with CASTE and CASE modules, introducing lightweight objectives to improve robustness against audio-visual misalignment.

Findings

01 Achieves state-of-the-art performance on AVE, AVVP, AVS, and AVQA benchmarks.

02 Effectively alleviates audio-visual misalignment issues.

03 Demonstrates robustness through qualitative analyses.

Abstract

Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level…
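To make the idea of frame-level agreement more concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame audio and visual feature vectors of equal dimension, uses cosine similarity as the agreement score, and blends each frame's own visual features with the average of its preceding and subsequent frames. The function names (`cosine`, `agreement_gate`, `neighbor_context`, `enrich`), the sigmoid gate, and the `temperature` parameter are all hypothetical choices made for illustration; the abstract does not specify how CASTE computes or applies agreement.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors (assumed agreement score)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + eps)


def agreement_gate(score, temperature=0.1):
    """Map an agreement score in [-1, 1] to a gate in (0, 1) with a sigmoid.

    The temperature is a hypothetical knob: smaller values make the gate sharper.
    """
    return 1.0 / (1.0 + math.exp(-score / temperature))


def neighbor_context(frames, t):
    """Average of the preceding and subsequent frame (edges reuse the frame itself)."""
    prev_frame = frames[max(t - 1, 0)]
    next_frame = frames[min(t + 1, len(frames) - 1)]
    return [0.5 * (p + n) for p, n in zip(prev_frame, next_frame)]


def enrich(audio, visual):
    """Blend each visual frame with its temporal context, weighted by agreement.

    High agreement keeps the frame's own features; low agreement leans on
    the neighbouring frames instead. This weighting rule is an assumption.
    """
    enriched = []
    for t, (a, v) in enumerate(zip(audio, visual)):
        gate = agreement_gate(cosine(a, v))
        context = neighbor_context(visual, t)
        enriched.append([gate * x + (1.0 - gate) * c for x, c in zip(v, context)])
    return enriched


if __name__ == "__main__":
    # Toy features: frame 1 has audio pointing the opposite way to its visual feature.
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    audio = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
    for t, frame in enumerate(enrich(audio, visual)):
        print(t, [round(x, 3) for x in frame])
```

In this toy run the misaligned middle frame is replaced almost entirely by the average of its neighbours, while the aligned frames stay close to their original features. A learned model would presumably replace the fixed cosine score and sigmoid gate with trainable components.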

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and Audio Processing