SoundNet: Learning Sound Representations from Unlabeled Video

Yusuf Aytar; Carl Vondrick; Antonio Torralba

arXiv:1610.09001 · cs.CV · October 31, 2016 · 233 cites

TL;DR

SoundNet leverages unlabeled videos and a student-teacher training approach to learn rich sound representations, achieving significant improvements on classification benchmarks and revealing emergent high-level semantics without explicit labels.

Contribution

Introduces a novel method using synchronized video data and student-teacher training to learn sound representations without labeled data.

Findings

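01. Significant performance improvements over state-of-the-art results on standard acoustic scene/object classification benchmarks.

02. High-level semantic features emerge in the sound network, even though it is trained without explicit labels.

03. Unlabeled video, which can be acquired cheaply at massive scale, is an effective training signal for sound representation learning.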

Abstract

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without…
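The student-teacher transfer described in the abstract can be illustrated with a small sketch. It assumes that a pretrained visual model (the teacher) produces a class distribution for a video's frames, and that the sound network (the student) is trained to reproduce that distribution from the audio track alone by minimizing a KL divergence. The choice of loss, the function names, and the numbers are illustrative assumptions, not details taken from the text above.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): how far the student's distribution is from the teacher's."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

def distillation_loss(teacher_logits, student_logits):
    """Loss for one video: teacher sees the frames, student sees only the audio."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# Hypothetical scores over three visual categories for one unlabeled video.
teacher_logits = [2.0, 0.5, -1.0]  # assumed output of a visual recognition model

# A student that agrees with the teacher incurs no loss ...
matched = distillation_loss(teacher_logits, [2.0, 0.5, -1.0])
# ... while a student that ranks the categories in reverse is penalized.
mismatched = distillation_loss(teacher_logits, [-1.0, 0.5, 2.0])

print(matched, mismatched)
```

No human labels appear anywhere in this loop: the only supervision is the teacher's prediction on the synchronized frames, which is what lets unlabeled video act as the bridge between modalities.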

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Animal Vocal Communication and Behavior