TL;DR
This paper presents a CNN-based framework for urban sound tagging that combines a pre-trained image classification model with data augmentation. It placed first on the DCASE 2019 leaderboard in a low-data setting (fewer than 100 labeled examples per class).
Contribution
It introduces a modified MobileNetV2 model, trained with Mixup, random erasing, scaling, and shifting as data augmentation, that classifies coarse and fine urban sound tags jointly in a low-data setting.
Findings
Achieved first place on the DCASE 2019 leaderboard
Micro-AUPRC of 0.751 for fine tags
Micro-AUPRC of 0.860 for coarse tags
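For readers unfamiliar with the metric, the following is a minimal sketch of a micro-averaged precision-recall summary. It assumes binary multi-label targets and pools every (clip, tag) pair into one ranking before computing step-wise average precision. The function name `micro_average_precision` is hypothetical. This is not the challenge's official evaluation code, which may differ, for example in how the curve is integrated or how incompletely annotated tags are handled.

```python
import numpy as np

def micro_average_precision(y_true, y_score):
    """Step-wise average precision over all (clip, tag) pairs pooled together.

    y_true:  array of shape (n_clips, n_tags) with 0/1 labels.
    y_score: array of the same shape with predicted scores.
    Ties between scores are broken by input order (an assumption of this sketch).
    """
    t = np.asarray(y_true, dtype=float).ravel()
    s = np.asarray(y_score, dtype=float).ravel()
    order = np.argsort(-s, kind="stable")          # highest score first
    t = t[order]
    n_pos = t.sum()
    if n_pos == 0:
        return float("nan")                        # undefined without positives
    tp = np.cumsum(t)                              # true positives at each rank
    precision = tp / np.arange(1, len(t) + 1)      # precision at each rank
    return float((precision * t).sum() / n_pos)    # mean precision at positives

# Toy usage with two clips and two tags.
labels = np.array([[1, 0], [0, 1]])
scores = np.array([[0.9, 0.1], [0.2, 0.8]])
print(micro_average_precision(labels, scores))
```

Pooling all pairs before ranking is what makes the average "micro": frequent tags contribute more than rare ones, unlike a per-tag (macro) average.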
Abstract
In this paper, we propose a framework for environmental sound classification in a low-data context (fewer than 100 labeled examples per class). We show that using pre-trained image classification models together with data augmentation techniques yields higher performance than alternative approaches. We applied this system to the Urban Sound Tagging task, part of DCASE 2019. The objective was to label different sources of noise from raw audio data. A modified form of MobileNetV2, a convolutional neural network (CNN) model, was trained to classify both coarse and fine tags jointly. The proposed model uses log-scaled Mel-spectrograms as the representation format for the audio data. Mixup, random erasing, scaling, and shifting are used as data augmentation techniques. A second model that uses scaled labels was built to account for human errors in the annotations. The…
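Two of the augmentations named in the abstract, Mixup and random erasing, can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the `alpha` and `max_frac` defaults, and the choice to fill the erased patch with uniform noise are all assumptions made here.

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the same batch.

    x: batch of spectrograms, shape (batch, n_mels, n_frames).
    y: multi-label targets, shape (batch, n_tags).
    alpha: Beta distribution parameter (assumed value, not from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)               # mixing weight in [0, 1]
    perm = rng.permutation(len(x))             # partner index for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]    # labels are blended the same way
    return x_mix, y_mix, lam

def random_erase(spec, max_frac=0.3, rng=None):
    """Overwrite one random rectangle of a (n_mels, n_frames) spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    n_mels, n_frames = spec.shape
    h = rng.integers(1, max(2, int(n_mels * max_frac) + 1))    # patch height
    w = rng.integers(1, max(2, int(n_frames * max_frac) + 1))  # patch width
    top = rng.integers(0, n_mels - h + 1)
    left = rng.integers(0, n_frames - w + 1)
    out = spec.copy()                          # leave the input untouched
    out[top:top + h, left:left + w] = rng.uniform(spec.min(), spec.max(), size=(h, w))
    return out

# Toy usage on random data standing in for log-scaled Mel-spectrograms.
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 64, 100))
tags = rng.integers(0, 2, size=(4, 8)).astype(float)
mixed_x, mixed_y, lam = mixup(batch, tags, rng=rng)
erased = random_erase(batch[0], rng=rng)
```

Because Mixup blends the targets as well as the inputs, the training labels become soft values between 0 and 1, which suits a multi-label tagging loss such as binary cross-entropy.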
Taxonomy
Methods: Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · 1x1 Convolution · Batch Normalization · Inverted Residual Block · Convolution · Average Pooling · Mixup
