Raw Waveform-based Audio Classification Using Sample-level CNN Architectures
Jongpil Lee, Taejun Kim, Jiyoung Park, Juhan Nam

TL;DR
This paper explores sample-level deep CNN architectures for raw waveform audio classification, demonstrating state-of-the-art performance across music, speech, and acoustic scene sounds, and analyzing learned filter characteristics.
Contribution
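It presents two sample-level CNN models: a basic model built from convolution and pooling layers, and an improved model that adds residual connections, squeeze-and-excitation modules, and multi-level concatenation. Both reach state-of-the-art performance levels, and the paper visualizes their learned filters layer by layer.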
Findings
Sample-level CNNs reach state-of-the-art accuracy
Residual and squeeze-and-excitation modules improve performance
Visualization reveals characteristics of learned filters
Abstract
Music, speech, and acoustic scene sound are often handled separately in the audio domain because of their different signal characteristics. However, as versatile image classification models drive rapid progress in the image domain, it is necessary to study extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. Also, we visualize the filters along layers and compare the characteristics…
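To make the two model types in the abstract more concrete, here is a minimal NumPy sketch of one basic block (convolution, then pooling) and one improved block (residual connection plus squeeze-and-excitation), followed by a multi-level concatenation step. It is an illustration under stated assumptions, not the authors' implementation: the filter size of 3, the pooling size of 3, the layer ordering inside the residual block, the omission of batch normalization, the use of global max pooling for concatenation, and all function names are assumptions that the abstract does not specify.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, w, b):
    """1-D convolution (cross-correlation) with zero 'same' padding, stride 1.
    x: (C_in, T), w: (C_out, C_in, K) with odd K, b: (C_out,) -> (C_out, T)."""
    k = w.shape[2]
    pad = k // 2
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # windows[i, t, k] = xp[i, t + k]
    windows = np.stack([xp[:, i:i + t] for i in range(k)], axis=-1)
    return np.einsum("oik,itk->ot", w, windows) + b[:, None]

def max_pool1d(x, size):
    """Non-overlapping max pooling along time; trailing samples are dropped."""
    c, t = x.shape
    t_out = t // size
    return x[:, :t_out * size].reshape(c, t_out, size).max(axis=2)

def squeeze_excite(x, w1, w2):
    """Squeeze: average over time. Excite: two small dense layers -> per-channel gate."""
    squeezed = x.mean(axis=1)                  # (C,)
    gate = sigmoid(w2 @ relu(w1 @ squeezed))   # (C,), values in (0, 1)
    return x * gate[:, None]

def basic_block(x, w, b, pool=3):
    """Basic model building block: conv -> ReLU -> max pool (assumed sizes)."""
    return max_pool1d(relu(conv1d_same(x, w, b)), pool)

def res_se_block(x, wa, ba, wb, bb, w1, w2, pool=3):
    """Improved block (assumed layout): two convs, SE gating, skip connection, pool.
    Requires the input and output channel counts to match for the skip addition."""
    h = relu(conv1d_same(x, wa, ba))
    h = conv1d_same(h, wb, bb)
    h = squeeze_excite(h, w1, w2)
    return max_pool1d(relu(h + x), pool)

def multi_level_features(feature_maps):
    """Multi-level concatenation (assumed form): global max pool each level, then concat."""
    return np.concatenate([f.max(axis=1) for f in feature_maps])

# Toy usage with random weights; 729 = 3**6 samples of a mono waveform.
rng = np.random.default_rng(0)
wave = rng.standard_normal((1, 729))
h1 = basic_block(wave, rng.standard_normal((8, 1, 3)) * 0.1, np.zeros(8))
h2 = res_se_block(
    h1,
    rng.standard_normal((8, 8, 3)) * 0.1, np.zeros(8),
    rng.standard_normal((8, 8, 3)) * 0.1, np.zeros(8),
    rng.standard_normal((2, 8)) * 0.1,   # SE bottleneck: 8 -> 2 channels
    rng.standard_normal((8, 2)) * 0.1,   # SE expansion: 2 -> 8 channels
)
features = multi_level_features([h1, h2])
```

With these assumed sizes, each block shortens the time axis by a factor of 3, so stacking blocks lets small filters cover progressively longer stretches of the raw waveform.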
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
