CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne

TL;DR
CAV-MAE Sync enhances self-supervised audio-visual learning by aligning fine-grained temporal features, separating objectives for better optimization, and improving spatial localization, leading to state-of-the-art results across multiple datasets.
Contribution
The paper introduces CAV-MAE Sync, an extension of CAV-MAE that addresses three issues in self-supervised audio-visual learning: the granularity mismatch between modalities, conflicting optimization objectives, and spatial localization.
Findings
Achieves state-of-the-art performance on AudioSet, VGG Sound, and ADE20K Sound datasets.
Outperforms more complex architectures in zero-shot retrieval, classification, and localization.
Effectively aligns fine-grained temporal features across modalities.
Abstract
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated…
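To make the first point of the abstract concrete, here is a minimal sketch of treating audio as a sequence of segments aligned with video frames and contrasting each segment with its frame. It is not the authors' implementation. The function names (`split_audio_by_frames`, `info_nce`), the even split of the audio time axis, the symmetric InfoNCE form, and the temperature value are all assumptions made for illustration; the embeddings are stand-ins for whatever the audio and visual encoders would produce.

```python
import math


def split_audio_by_frames(num_audio_steps, num_video_frames):
    """Split the audio time axis into one contiguous (start, end) range
    per video frame. Hypothetical helper: an even split is assumed."""
    bounds = [round(i * num_audio_steps / num_video_frames)
              for i in range(num_video_frames + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diag(sim):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def info_nce(audio_embs, video_embs, temperature=0.07):
    """Symmetric InfoNCE loss over paired (audio segment, video frame)
    embeddings. Pair i is the positive; all other pairs are negatives."""
    a = [_normalize(v) for v in audio_embs]
    v = [_normalize(u) for u in video_embs]
    # Cosine similarity matrix scaled by the temperature (an assumed value).
    sim = [[sum(x * y for x, y in zip(ai, vj)) / temperature for vj in v]
           for ai in a]
    sim_t = [list(col) for col in zip(*sim)]  # video-to-audio direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))


# Example: 1000 spectrogram time steps aligned with 10 sampled frames.
segments = split_audio_by_frames(1000, 10)

# Toy embeddings: the loss is low when segment i matches frame i.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With a single global audio embedding per clip, every frame of a video would share one audio target. Pairing each frame with its own audio segment gives the contrastive loss a distinct positive per frame, which is the kind of fine-grained temporal correspondence the abstract refers to.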
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods
Methods: Max Pooling · Dropout · Softmax · Convolution · Dense Connections
