GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
Shentong Mo, Zehua Chen, Jun Zhu

TL;DR
GMS-CAVP enhances video-audio understanding by integrating multi-scale contrastive learning with diffusion-based generative pretraining, leading to improved cross-modal retrieval and generation performance.
Contribution
GMS-CAVP introduces a pretraining framework that combines multi-scale contrastive learning with diffusion-based generative objectives to model the dense, multi-scale correspondences between video and audio (V-A).
Findings
Outperforms previous methods on VGGSound, AudioSet, and Panda70M datasets.
Enhances cross-modal retrieval accuracy.
Enables high-fidelity audio-visual generation.
Abstract
Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy…
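The abstract is cut off before it details the multi-scale contrastive strategy, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that video and audio features are already projected into a shared space with the same number of time steps, that "scales" are obtained by average-pooling over non-overlapping temporal windows, and that a symmetric InfoNCE loss is applied at each scale and then averaged. All function names, the window sizes, and the temperature are hypothetical choices made for this example.

```python
import math

def avg_pool(seq, window):
    """Average-pool a list of feature vectors over non-overlapping windows."""
    pooled = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        pooled.append([sum(col) / window for col in zip(*chunk)])
    return pooled

def normalize(vec):
    """L2-normalise a vector (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def similarity(video_segs, audio_segs):
    """Mean cosine similarity over time-aligned segments of one V-A pair."""
    sims = [sum(p * q for p, q in zip(normalize(v), normalize(a)))
            for v, a in zip(video_segs, audio_segs)]
    return sum(sims) / len(sims)

def info_nce(sim, temperature):
    """Symmetric InfoNCE on a square similarity matrix (matches on the diagonal)."""
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(cols))

def multi_scale_contrastive_loss(video, audio, windows=(1, 2, 4), temperature=0.07):
    """Average InfoNCE over several temporal scales (assumed design, see lead-in).

    video, audio: lists of clips; each clip is a list of T feature vectors.
    """
    losses = []
    for w in windows:
        v_pooled = [avg_pool(clip, w) for clip in video]
        a_pooled = [avg_pool(clip, w) for clip in audio]
        sim = [[similarity(v, a) for a in a_pooled] for v in v_pooled]
        losses.append(info_nce(sim, temperature))
    return sum(losses) / len(losses)

# Toy batch: two clips, four time steps, two feature dimensions.
video = [[[1.0, 0.0]] * 4, [[0.0, 1.0]] * 4]
aligned_loss = multi_scale_contrastive_loss(video, video)
swapped_loss = multi_scale_contrastive_loss(video, video[::-1])
```

In this toy batch the loss is close to zero when each video is paired with its own audio features and large when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward. Small windows emphasise fine-grained temporal agreement, while larger windows compare coarser summaries of the same clips.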
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
