CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Edson Araujo; Andrew Rouditchenko; Yuan Gong; Saurabhchand Bhati; Samuel Thomas; Brian Kingsbury; Leonid Karlinsky; Rogerio Feris; James R. Glass; Hilde Kuehne

arXiv:2505.01237 · cs.MM · May 22, 2025

TL;DR

CAV-MAE Sync enhances self-supervised audio-visual learning by aligning fine-grained temporal features, separating objectives for better optimization, and improving spatial localization, leading to state-of-the-art results across multiple datasets.

Contribution

The paper extends CAV-MAE to address three problems in audio-visual self-supervised learning: the granularity mismatch between modalities, conflicting optimization objectives, and weak spatial localization.

Findings

01. Achieves state-of-the-art performance on AudioSet, VGG Sound, and ADE20K Sound datasets.

02. Outperforms more complex architectures in zero-shot retrieval, classification, and localization.

03. Effectively aligns fine-grained temporal features across modalities.

Abstract

Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated…
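The first point of the abstract, treating audio as a temporal sequence aligned with video frames rather than as one global representation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the mean-pooling stand-in for an encoder, the segment count, and the temperature value are all assumptions chosen for illustration, and the negatives here come from other segments of the same clip only.

```python
import numpy as np


def split_audio_into_segments(spectrogram, num_frames):
    """Split a (time, mel) spectrogram into num_frames equal time segments."""
    time_steps, num_mels = spectrogram.shape
    seg_len = time_steps // num_frames
    trimmed = spectrogram[: seg_len * num_frames]  # drop any remainder
    return trimmed.reshape(num_frames, seg_len, num_mels)


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax_rows(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(audio_emb, video_emb, temperature=0.07):
    """Contrastive loss where row i of each input is a matching pair."""
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature  # (N, N) cosine similarities, scaled
    loss_a2v = -np.mean(np.diag(log_softmax_rows(logits)))
    loss_v2a = -np.mean(np.diag(log_softmax_rows(logits.T)))
    return 0.5 * (loss_a2v + loss_v2a)


# Hypothetical data: a 1000-step, 128-bin spectrogram and 10 video frames.
rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((1000, 128))
segments = split_audio_into_segments(spectrogram, num_frames=10)

# Stand-in "encoder": mean-pool each segment over time (assumption).
audio_emb = segments.mean(axis=1)
# Stand-in frame embeddings: noisy copies of the matching audio segment.
frame_emb = audio_emb + 0.01 * rng.standard_normal(audio_emb.shape)

aligned_loss = symmetric_info_nce(audio_emb, frame_emb)
misaligned_loss = symmetric_info_nce(audio_emb, np.roll(frame_emb, 1, axis=0))
```

In this toy setup the loss is lower when each audio segment is paired with its own frame than when the frames are shifted by one position, which is the temporal sensitivity a single global audio embedding cannot express.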

Code & Models

Repositories

edsonroteia/cav-mae-sync (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods

Methods: Max Pooling · Dropout · Softmax · Convolution · Dense Connections