Audio-Visual Contrastive Learning with Temporal Self-Supervision
Simon Jenni, Alexander Black, John Collomosse

TL;DR
This paper introduces a self-supervised learning method for videos that jointly learns audio and visual representations by combining temporal self-supervision with multi-modal contrastive objectives. It reports state-of-the-art results in action recognition and retrieval, and is also effective for audio classification and robust video fingerprinting.
Contribution
It extends temporal self-supervision to audio-visual data and proposes a novel contrastive loss with sample-dependent positives and negatives for improved representation learning.
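To make the idea of a contrastive loss with sample-dependent positives concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a cross-modal InfoNCE-style loss in which each video anchor's positive set is its own audio clip plus the `k_pos` clips whose audio features are currently closest to it. The function name, the `k_pos` and `temperature` parameters, and the nearest-neighbour rule are all illustrative assumptions. Every other clip in the batch acts as a negative; the mining of additional negatives mentioned in the abstract is not modelled here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(video, audio, k_pos=1, temperature=0.1):
    """Cross-modal contrastive loss with extra, sample-dependent positives.

    video, audio: lists of N feature vectors; entry i of each comes from
    the same clip. Hypothetical sketch, not the paper's exact objective.
    """
    v = [normalize(x) for x in video]
    a = [normalize(x) for x in audio]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Assumption: extra positives are the k_pos nearest neighbours of
        # clip i in the current audio feature space.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: -dot(a[i], a[j]),
        )
        positives = {i, *others[:k_pos]}

        # Similarity of video anchor i to every audio clip in the batch.
        logits = [dot(v[i], a[j]) / temperature for j in range(n)]

        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))

        # Average negative log-likelihood over the positive set.
        total += -sum(logits[j] - log_z for j in positives) / len(positives)
    return total / n
```

With `k_pos=0` this reduces to a standard paired InfoNCE loss, so the extra positives can be switched on and off to compare the two behaviours.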
Findings
Achieves state-of-the-art results in action recognition and retrieval.
Effective in audio classification and robust video fingerprinting.
Demonstrates the benefit of multi-modal temporal self-supervision.
Abstract
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between…
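The playback speed and direction recognition tasks mentioned in the abstract can be illustrated with a small sampling routine. The sketch below is an assumption-laden illustration, not the paper's pipeline: the speed set `SPEEDS`, the clip length, and the function name `sample_temporal_clip` are hypothetical. It picks frame indices for one training clip and returns the labels a classifier would be asked to predict. The same indexing could in principle be applied to audio frames, though how the paper transforms audio is not stated in the text above.

```python
import random

# Assumed set of playback-speed classes (frame-skip factors).
SPEEDS = (1, 2, 4, 8)


def sample_temporal_clip(num_frames, clip_len=16, rng=random):
    """Sample frame indices plus speed and direction labels for one clip.

    Returns (indices, speed_label, direction_label), where direction_label
    is 1 if the clip is played backwards and 0 otherwise.
    """
    speed_label = rng.randrange(len(SPEEDS))
    step = SPEEDS[speed_label]

    # Number of source frames covered by the clip at this speed.
    span = (clip_len - 1) * step + 1
    if span > num_frames:
        raise ValueError("video too short for the sampled speed")

    start = rng.randrange(num_frames - span + 1)
    indices = list(range(start, start + span, step))

    # Reverse playback with probability 0.5.
    reverse = rng.random() < 0.5
    if reverse:
        indices.reverse()
    return indices, speed_label, int(reverse)
```

A model trained on such clips would receive the frames at `indices` and be asked to classify `speed_label` and `direction_label`, which requires no human annotation.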
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
