Audio-Visual Contrastive Learning with Temporal Self-Supervision

Simon Jenni; Alexander Black; John Collomosse

arXiv:2302.07702 · cs.CV · February 16, 2023

Open Access

TL;DR

This paper introduces a self-supervised learning method for videos that jointly learns audio and visual representations by combining temporal self-supervision with multi-modal contrastive objectives. It achieves state-of-the-art results in action recognition and retrieval and is also effective for audio classification and robust video fingerprinting.

Contribution

It extends temporal self-supervision to audio-visual data and proposes a novel contrastive loss with sample-dependent positives and negatives for improved representation learning.
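For orientation, the baseline that such a loss builds on can be sketched as a standard cross-modal InfoNCE objective, where each video clip's positive is its own audio track and the other audio tracks in the batch act as negatives. This is a minimal illustration and not the paper's loss: it omits the sample-dependent positives and negatives, and the function names, the temperature value, and the list-of-lists embedding format are assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Mean video-to-audio InfoNCE loss with in-batch negatives.

    video_emb[i] and audio_emb[i] are assumed to come from the same clip.
    Hypothetical helper for illustration; not the authors' implementation.
    """
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    losses = []
    for i, vi in enumerate(v):
        # Cosine similarity of clip i's video to every audio track in the batch.
        logits = [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching audio track.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each video embedding is closest to its own audio embedding and grows when the pairing is scrambled. A symmetric version would add the audio-to-video direction.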

Findings

01. Achieves state-of-the-art results in action recognition and retrieval.

02. Effective in audio classification and robust video fingerprinting.

03. Demonstrates the benefit of multi-modal temporal self-supervision.

Abstract

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between…
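The playback speed and direction recognition tasks mentioned in the abstract can be illustrated by how a training sample might be constructed: frames are subsampled at a random stride, optionally reversed, and the stride and direction become pseudo-labels for a classifier. This is a rough sketch under assumed settings; the function name, the clip length, and the set of speeds are hypothetical and are not taken from the paper.

```python
import random


def make_temporal_pretext_clip(num_frames, clip_len=8, speeds=(1, 2, 4), rng=random):
    """Sample frame indices plus (speed, direction) pseudo-labels.

    Hypothetical helper for illustration; clip_len and speeds are assumed values.
    """
    speed_label = rng.randrange(len(speeds))
    step = speeds[speed_label]
    # Number of source frames covered by the clip at this playback speed.
    span = (clip_len - 1) * step + 1
    if span > num_frames:
        raise ValueError("video too short for the requested speed")
    start = rng.randrange(num_frames - span + 1)
    indices = [start + i * step for i in range(clip_len)]
    # Direction label: 0 = forward playback, 1 = reversed playback.
    reverse_label = rng.randrange(2)
    if reverse_label == 1:
        indices.reverse()
    return indices, speed_label, reverse_label
```

A model trained on such clips would predict `speed_label` and `reverse_label` from the frames alone. The same index transformation could in principle be applied to the audio track so that both modalities share the pseudo-labels.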

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
