Cross-modal supervised learning for better acoustic representations

Shaoyong Jia; Xin Shu; Yang Yang; Dawei Liang; Qiyue Liu; Junhui Liu

arXiv:1911.07917 · cs.CV · January 3, 2020 · 1 citation

Open Access

TL;DR

This paper leverages large-scale video data with machine-generated labels to improve acoustic representations, achieving significant performance gains on standard audio classification benchmarks.

Contribution

It introduces a method to utilize synchronized vision-audio data with machine labels for training improved acoustic models, surpassing state-of-the-art results.

Findings

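01 Achieved significant performance improvements on three external standard benchmarks for audio classification.

02 Collected 15 million video samples, each annotated automatically by publicly available visual and audio classification models.

03 Made several improvements to VGGish that yield better results.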

Abstract

Obtaining large-scale human-labeled datasets to train acoustic representation models is a very challenging task. In contrast, data with machine-generated labels is easy to collect. In this work, we propose to exploit machine-generated labels to learn better acoustic representations, based on the synchronization between vision and audio. First, we collect a large-scale video dataset of 15 million samples totaling 16,320 hours. Each video is 3 to 5 seconds long and is annotated automatically by publicly available visual and audio classification models. Second, we train various classical convolutional neural networks (CNNs), including VGGish, ResNet-50 and MobileNet v2. We also make several improvements to VGGish and achieve better results. Finally, we transfer our models to three external standard benchmarks for audio classification, and achieve significant…
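As a quick consistency check on the dataset figures quoted in the abstract, the short Python sketch below computes the mean clip length implied by 15 million samples and 16,320 hours. The function name `mean_clip_seconds` is hypothetical and used only for illustration; the two input numbers are taken from the abstract, and 15 million is treated as an exact count, which is an assumption.

```python
def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Return the average clip duration in seconds implied by the totals."""
    return total_hours * 3600.0 / num_clips


# Figures from the abstract; 15 million is assumed to be an exact count.
TOTAL_HOURS = 16_320
NUM_CLIPS = 15_000_000

avg = mean_clip_seconds(TOTAL_HOURS, NUM_CLIPS)
print(f"mean clip length: {avg:.2f} s")  # mean clip length: 3.92 s

# The abstract states each video is 3 to 5 seconds long,
# so the implied mean should fall inside that range.
assert 3.0 <= avg <= 5.0
```

The implied mean of roughly 3.9 seconds sits inside the stated 3 to 5 second range, so the sample count, total duration, and clip length in the abstract are mutually consistent.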

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection