Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Bruno Korbar; Du Tran; Lorenzo Torresani

arXiv:1807.00230 · cs.CV · November 13, 2018 · 187 cites

TL;DR

This paper introduces a self-supervised approach that learns audio and video models from the natural temporal synchronization of the two streams. Pretraining needs no manual labels, and the learned models improve both audio classification and action recognition.

Contribution

The paper presents a self-supervised training scheme that combines temporal synchronization cues, a contrastive loss, and curriculum learning to learn multi-sensory representations for audio and video tasks.

Findings

01

Without further finetuning, the audio features perform better than or comparably to the state of the art on the DCASE2014 and ESC-50 benchmarks.

02

Initializing from the self-supervised visual subnet improves the accuracy of video-based action recognition models.

03

A contrastive loss, a careful choice of negative examples, and a calibrated curriculum are all critical to learning strong representations from synchronization.

Abstract

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from…
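As a rough illustration of the contrastive loss mentioned in the abstract, the sketch below implements the standard margin-based contrastive loss over audio-video embedding pairs: synchronized pairs are pulled together, and out-of-sync pairs are pushed apart until they are at least a margin away. The abstract does not give the exact formulation, so the function name, the `margin` parameter, and the use of Euclidean distance are assumptions made for illustration, not the authors' implementation.

```python
import math


def contrastive_sync_loss(video_feats, audio_feats, in_sync, margin=1.0):
    """Margin-based contrastive loss over a batch of audio-video pairs.

    video_feats, audio_feats: lists of equal-length float vectors (embeddings).
    in_sync: list of labels, 1 for a synchronized pair, 0 for an out-of-sync pair.
    margin: hypothetical hyperparameter; out-of-sync pairs closer than this
            distance are penalized.
    """
    total = 0.0
    for v, a, y in zip(video_feats, audio_feats, in_sync):
        # Euclidean distance between the video and audio embeddings.
        dist = math.sqrt(sum((vi - ai) ** 2 for vi, ai in zip(v, a)))
        if y == 1:
            # Synchronized pair: penalize any distance between embeddings.
            total += dist ** 2
        else:
            # Out-of-sync pair: penalize only if closer than the margin.
            total += max(margin - dist, 0.0) ** 2
    return total / len(in_sync)


# One synchronized pair that is far apart and one out-of-sync pair that
# coincides; both contribute to the loss.
video = [[0.0, 0.0], [0.0, 0.0]]
audio = [[3.0, 4.0], [0.0, 0.0]]
loss = contrastive_sync_loss(video, audio, in_sync=[1, 0], margin=1.0)
```

How the out-of-sync pairs are chosen, and in what order they are introduced during training, is the negative-selection and curriculum question the abstract highlights; this sketch takes the labels as given.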

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies