Learning Environmental Sounds with Multi-scale Convolutional Neural Network
Boqing Zhu, Changjian Wang, Feng Liu, Jin Lei, Zengquan Lu, Yuxing Peng

TL;DR
This paper introduces WaveMsNet, an end-to-end neural network utilizing multi-scale convolution and a two-phase feature fusion method to improve environmental sound classification accuracy from raw waveforms and spectrograms.
Contribution
The paper proposes a novel multi-scale convolution operation and a two-phase feature fusion approach within an end-to-end network for environmental sound recognition.
Findings
Achieves 93.75% accuracy on the ESC-10 dataset.
Achieves 79.10% accuracy on the ESC-50 dataset.
Reported to significantly outperform previous methods on these benchmarks.
Abstract
Deep learning has dramatically improved the performance of sound recognition. However, learning acoustic models directly from the raw waveform is still challenging. Current waveform-based models generally use time-domain convolutional layers to extract features, but features extracted by filters of a single size are insufficient for building a discriminative audio representation. In this paper, we propose a multi-scale convolution operation, which yields a better audio representation by improving the frequency resolution and learning filters across all frequency areas. To leverage waveform-based and spectrogram-based features in a single model, we introduce a two-phase method for fusing the different features. Finally, we propose a novel end-to-end network called WaveMsNet, based on the multi-scale convolution operation and the two-phase method. On the environmental sounds classification…
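The multi-scale idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`conv1d`, `max_pool_to_length`, `multi_scale_conv`), the kernel sizes, the frame count, and the random filters standing in for learned ones are all assumptions chosen for illustration. The sketch runs the same waveform through several time-domain filters of different lengths, pools each response to a common number of frames, and stacks the results so that later layers could see all scales at once.

```python
import math
import random


def conv1d(signal, kernel, stride):
    """Valid 1-D cross-correlation of `signal` with `kernel` at the given stride."""
    k = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def max_pool_to_length(feature, target_len):
    """Max-pool `feature` into exactly `target_len` roughly equal bins."""
    n = len(feature)
    pooled = []
    for b in range(target_len):
        lo = (b * n) // target_len
        hi = max(lo + 1, ((b + 1) * n) // target_len)
        pooled.append(max(feature[lo:hi]))
    return pooled


def multi_scale_conv(signal, kernel_sizes=(11, 51, 101), frames=32, seed=0):
    """Apply one filter per scale and align all branches to `frames` time steps.

    The kernel sizes are illustrative assumptions, and the filters are random
    stand-ins for weights that a real network would learn.
    """
    rng = random.Random(seed)
    branches = []
    for k in kernel_sizes:
        kernel = [rng.uniform(-1.0, 1.0) / k for _ in range(k)]
        response = conv1d(signal, kernel, stride=1)
        rectified = [max(0.0, v) for v in response]  # ReLU non-linearity
        branches.append(max_pool_to_length(rectified, frames))
    return branches  # one row per scale, each of length `frames`


# Usage: a 440 Hz tone sampled at a hypothetical 8 kHz rate.
sample_rate = 8000
wave = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(2000)]
features = multi_scale_conv(wave)
print(len(features), len(features[0]))
```

Short filters respond to fine temporal detail while long filters cover more of each low-frequency period, which is one way to read the abstract's claim about frequency resolution. Pooling every branch to the same length is what allows the scales to be stacked into a single feature map.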
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
