GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

Shentong Mo; Zehua Chen; Jun Zhu

arXiv:2601.19606 · cs.CV · January 28, 2026

TL;DR

GMS-CAVP enhances video-audio understanding by integrating multi-scale contrastive learning with diffusion-based generative pretraining, leading to improved cross-modal retrieval and generation performance.

Contribution

GMS-CAVP introduces a multi-scale contrastive and generative pretraining framework that models the dense, multi-scale correspondences between video and audio (V-A) more fully than prior contrastive-only approaches such as CAVP.

Findings

01 Outperforms previous methods on the VGGSound, AudioSet, and Panda70M datasets.

02 Enhances cross-modal retrieval accuracy.

03 Enables high-fidelity audio-visual generation.

Abstract

Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy…
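
The general idea of contrastive alignment at several temporal scales can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it describes the actual objective, so everything below is an assumption made for illustration, including the function names (`temporal_pool`, `info_nce`, `multi_scale_contrastive_loss`), the choice of average pooling to form coarser scales, the scale set `(1, 2, 4)`, and the temperature of 0.07. The sketch pools per-frame video and audio features into segments of increasing length, applies a symmetric InfoNCE loss to matching segments at each scale, and averages the losses.

```python
import numpy as np


def temporal_pool(x, scale):
    """Average-pool (batch, time, dim) features over non-overlapping windows of `scale` steps."""
    b, t, d = x.shape
    t_trim = (t // scale) * scale  # drop trailing steps that do not fill a window
    return x[:, :t_trim].reshape(b, t_trim // scale, scale, d).mean(axis=2)


def log_softmax(z, axis):
    """Numerically stable log-softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def info_nce(v, a, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings of shape (n, dim); row i of v matches row i of a."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    logits = v @ a.T / temperature
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> audio
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)


def multi_scale_contrastive_loss(video, audio, scales=(1, 2, 4)):
    """Average the symmetric InfoNCE loss over several temporal scales (hypothetical scale set)."""
    losses = []
    for s in scales:
        v = temporal_pool(video, s)
        a = temporal_pool(audio, s)
        dim = v.shape[-1]
        # Each (clip, segment) pair is one sample: the positive is the same segment
        # of the same clip; other segments and other clips act as negatives.
        losses.append(info_nce(v.reshape(-1, dim), a.reshape(-1, dim)))
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_feats = rng.normal(size=(4, 8, 16))  # (batch, time, dim), synthetic
    audio_feats = rng.normal(size=(4, 8, 16))
    print(multi_scale_contrastive_loss(video_feats, audio_feats))
```

In this sketch, small scales compare short segments (fine-grained temporal alignment) while larger scales compare pooled, clip-level summaries (coarse semantic alignment). A real system would compute the features with learned encoders and would likely weight the scales rather than average them uniformly.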

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis