Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Jongpil Lee; Taejun Kim; Jiyoung Park; Juhan Nam

arXiv:1712.00866 · cs.SD · December 5, 2017 · 45 cites

TL;DR

This paper explores sample-level deep CNN architectures for raw waveform audio classification, demonstrating state-of-the-art performance across music, speech, and acoustic scene sounds, and analyzing learned filter characteristics.

Contribution

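It presents two sample-level CNN models: a basic model built from convolution and pooling layers, and an improved model that adds residual connections, squeeze-and-excitation modules, and multi-level concatenation. Both reach state-of-the-art performance levels on music, speech, and acoustic scene sounds, and visualizing their filters layer by layer gives insight into the learned features.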

Findings

01 Sample-level CNNs reach state-of-the-art accuracy

02 Residual and squeeze-and-excitation modules improve performance

03 Visualization reveals characteristics of learned filters

Abstract

Music, speech, and acoustic scene sounds are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain has grown rapidly through versatile image classification models, it is necessary to study extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. Also, we visualize the filters along layers and compare the characteristics…
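
The two model types named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The filter length of 3, the pooling size of 3, the placement of the squeeze-and-excitation gate before the residual addition, and all function names (`basic_block`, `res_se_block`, `multi_level_features`) are assumptions made for illustration. Batch normalization, dropout, and training are left out.

```python
import math


def conv1d_same(x, w, b):
    """1D convolution with zero padding. x: [C_in][T], w: [C_out][C_in][K], b: [C_out]."""
    t = len(x[0])
    k = len(w[0][0])
    pad = k // 2  # keeps the output length equal to T for odd K
    out = []
    for o in range(len(w)):
        row = []
        for n in range(t):
            acc = b[o]
            for i in range(len(x)):
                for j in range(k):
                    idx = n + j - pad
                    if 0 <= idx < t:
                        acc += w[o][i][j] * x[i][idx]
            row.append(acc)
        out.append(row)
    return out


def relu(x):
    return [[max(0.0, v) for v in ch] for ch in x]


def max_pool(x, size):
    """Non-overlapping max pooling along time; trailing samples that do not fill a window are dropped."""
    return [[max(ch[s:s + size]) for s in range(0, len(ch) - size + 1, size)] for ch in x]


def squeeze_excite(x, w1, w2):
    """Channel gating. w1: [C_mid][C], w2: [C][C_mid] (hypothetical bias-free layers)."""
    z = [sum(ch) / len(ch) for ch in x]  # squeeze: average over time
    h = [max(0.0, sum(a * v for a, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(a * v for a, v in zip(row, h)))) for row in w2]
    return [[s[c] * v for v in ch] for c, ch in enumerate(x)]  # excite: rescale channels


def basic_block(x, w, b, pool=3):
    """Basic model unit: convolution, ReLU, max pooling."""
    return max_pool(relu(conv1d_same(x, w, b)), pool)


def res_se_block(x, w, b, w1, w2, pool=3):
    """Improved model unit (assumed ordering): conv, ReLU, SE gate, skip connection, pooling."""
    y = squeeze_excite(relu(conv1d_same(x, w, b)), w1, w2)
    y = [[a + r for a, r in zip(yc, xc)] for yc, xc in zip(y, x)]  # needs C_in == C_out
    return max_pool(y, pool)


def multi_level_features(block_outputs):
    """Multi-level concatenation: global max pool each block's output, then join the vectors."""
    return [max(ch) for out in block_outputs for ch in out]


if __name__ == "__main__":
    wave = [[math.sin(0.3 * n) for n in range(27)]]  # one channel, 27 samples
    w = [[[0.5, 1.0, 0.5]]]
    b = [0.0]
    h1 = basic_block(wave, w, b)                     # 27 samples -> 9 frames
    h2 = res_se_block(h1, w, b, [[1.0]], [[1.0]])    # 9 frames -> 3 frames
    print(multi_level_features([h1, h2]))
```

Each block shortens the time axis by the pooling factor, so stacking such blocks takes the network from individual samples to a clip-level summary. The final helper joins features from several depths rather than using only the last block's output.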

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis