Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

TL;DR
This paper introduces Cross-Modal Deep Clustering (XDC), a self-supervised learning method that leverages audio-visual correlations and differences to improve video and audio representations, outperforming supervised pretraining on action recognition tasks.
Contribution
XDC is the first self-supervised approach to outperform large-scale supervised pretraining for action recognition using the same architecture.
Findings
XDC outperforms single-modality clustering and other multi-modal variants.
XDC achieves state-of-the-art accuracy on multiple benchmarks.
Video models pretrained with XDC surpass fully-supervised models on HMDB51 and UCF101.
Abstract
Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among…
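The core idea in the abstract, clustering one modality and using the cluster assignments as labels for the other, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes synthetic feature vectors in place of real clips, plain k-means in place of whatever clustering the paper uses, and a linear softmax classifier in place of a deep video encoder. It also shows only one direction (audio clusters supervising video) and a single round. All function names below are hypothetical.

```python
import numpy as np


def make_toy_clips(n=300, k=3, seed=0):
    """Synthetic 'clips': a hidden semantic class drives both modalities."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, k, size=n)               # hidden class, never used for training
    audio_proto = rng.normal(size=(k, 8)) * 5.0  # per-class audio prototype
    video_proto = rng.normal(size=(k, 16)) * 5.0 # per-class video prototype
    audio = audio_proto[z] + rng.normal(size=(n, 8))
    video = video_proto[z] + rng.normal(size=(n, 16))
    return audio, video


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:                 # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def train_linear_classifier(x, y, k, lr=0.1, epochs=300):
    """Softmax regression fitted with full-batch gradient descent."""
    n, d = x.shape
    w, b = np.zeros((d, k)), np.zeros(k)
    onehot = np.eye(k)[y]
    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


def cross_modal_demo(k=3):
    audio, video = make_toy_clips(k=k)
    # Step 1: cluster the AUDIO features to obtain pseudo-labels.
    pseudo_labels = kmeans(audio, k)
    # Step 2: train the VIDEO model to predict those audio-derived labels.
    video = (video - video.mean(axis=0)) / video.std(axis=0)
    w, b = train_linear_classifier(video, pseudo_labels, k)
    predictions = (video @ w + b).argmax(axis=1)
    agreement = float((predictions == pseudo_labels).mean())
    return pseudo_labels, agreement


if __name__ == "__main__":
    labels, agreement = cross_modal_demo()
    print("video model agreement with audio pseudo-labels:", agreement)
```

In this toy setting the video classifier can only match the audio clusters because both modalities share a hidden semantic class, which mirrors the correlation the abstract relies on. Swapping the roles of `audio` and `video` in `cross_modal_demo` would give the opposite direction of supervision.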
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
