Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Humam Alwassel; Dhruv Mahajan; Bruno Korbar; Lorenzo Torresani; Bernard Ghanem; Du Tran

arXiv:1911.12667 · cs.CV · October 27, 2020 · 251 citations

TL;DR

This paper introduces Cross-Modal Deep Clustering (XDC), a self-supervised learning method that leverages audio-visual correlations and differences to improve video and audio representations, outperforming supervised pretraining on action recognition tasks.

Contribution

XDC is the first self-supervised approach to outperform large-scale supervised pretraining for action recognition using the same architecture.

Findings

01. XDC outperforms single-modality clustering and other multi-modal methods.

02. XDC achieves state-of-the-art accuracy on multiple benchmarks.

03. Video models pretrained with XDC surpass fully supervised models on HMDB51 and UCF101.

Abstract

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among…
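The cross-modal supervision idea in the abstract can be illustrated with a minimal sketch: cluster the features of one modality, then use the resulting cluster ids as classification targets for the other modality. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the toy 2-D feature vectors, the plain k-means routine with a deterministic first-k initialisation, and the helper names `kmeans` and `cross_modal_pseudo_labels`. Encoder training on these targets is omitted.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means over equal-length float tuples.

    Centroids start at the first k points (an assumption, chosen so the
    toy example is deterministic). Returns one cluster id per point.
    """
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments


def cross_modal_pseudo_labels(audio_feats, video_feats, k):
    """Swap cluster assignments across modalities.

    Audio clusters become targets for the video model, and video clusters
    become targets for the audio model. Index i refers to the same clip
    in both feature lists.
    """
    audio_clusters = kmeans(audio_feats, k)
    video_clusters = kmeans(video_feats, k)
    return {"video_targets": audio_clusters, "audio_targets": video_clusters}


# Hypothetical features for four clips (two underlying semantic groups).
audio_feats = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
video_feats = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

targets = cross_modal_pseudo_labels(audio_feats, video_feats, k=2)
print(targets)
```

In this toy setup, clips 0 and 1 share a target and clips 2 and 3 share a different one in both directions, so each modality would be trained to reproduce the grouping discovered in the other.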

Code & Models

Repositories

HumamAlwassel/XDC
PyTorch · Official

Videos

Self-Supervised Learning by Cross-Modal Audio-Video Clustering · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization