EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

Jongsuk Kim; Hyeongkeun Lee; Kyeongha Rho; Junmo Kim; Joon Son Chung

arXiv:2403.09502·cs.LG·June 21, 2024·1 cites

TL;DR

EquiAV introduces an equivariance-based framework for audio-visual contrastive learning, enhancing robustness and performance by effectively leveraging data augmentations without disrupting input correspondence.

Contribution

The paper extends equivariance to audio-visual contrastive learning using a shared attention-based transformation predictor, which aggregates features from diverse augmentations into a representative embedding and provides robust supervision with minimal computational overhead.

Findings

01. Outperforms previous methods on multiple benchmarks
02. Effective with minimal computational overhead
03. Validated through extensive ablation studies

Abstract

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and…
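The two ideas named in the abstract, contrastive alignment of audio and visual embeddings and equivariance to augmentations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, and the toy linear `predict_augmented` (a stand-in for the attention-based transformation predictor) are all assumptions made for illustration. The sketch assumes equivariance means that the embedding of an augmented input can be predicted from the embedding of the original input and the augmentation parameters.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match targets[i] and no other target."""
    anchors = [normalize(a) for a in anchors]
    targets = [normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

def symmetric_av_loss(audio, visual, temperature=0.1):
    """Audio-to-visual and visual-to-audio InfoNCE, averaged."""
    return 0.5 * (info_nce(audio, visual, temperature)
                  + info_nce(visual, audio, temperature))

def predict_augmented(embedding, aug_params, weights):
    """Hypothetical toy predictor: shift the embedding by a linear function of
    the augmentation parameters. The paper describes an attention-based
    predictor; this linear map is only a placeholder."""
    return [z + dot(row, aug_params) for z, row in zip(embedding, weights)]

def equivariance_loss(orig_embs, aug_params, aug_embs, weights, temperature=0.1):
    """Contrast predicted embeddings of augmented inputs with the actual ones."""
    predicted = [predict_augmented(z, t, weights)
                 for z, t in zip(orig_embs, aug_params)]
    return info_nce(predicted, aug_embs, temperature)

# Toy data: two audio clips and their paired video frames (2-D embeddings).
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.9, 0.1], [0.2, 0.8]]
aligned_loss = symmetric_av_loss(audio, visual)
misaligned_loss = symmetric_av_loss(audio, [visual[1], visual[0]])
```

In this toy setup the correctly paired batch yields a lower loss than the batch with the pairs swapped, which is the behavior a contrastive objective is meant to reward. How the original, augmented, and predicted embeddings are combined across the intra-modal and inter-modal objectives is a design decision of the paper that this sketch does not attempt to reproduce.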

Code & Models

Repositories

jongsuk1/equiav (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation