Maximizing Audio Event Detection Model Performance on Small Datasets Through Knowledge Transfer, Data Augmentation, And Pretraining: An Ablation Study
Daniel Tompkins, Kshitiz Kumar, Jian Wu

TL;DR
This paper investigates how knowledge transfer, data augmentation, and pretraining improve audio event detection on small datasets, demonstrating their individual contributions and proposing a smaller model that nearly achieves state-of-the-art results.
Contribution
It provides an ablation study analyzing the impact of different components on performance and introduces a compact model with competitive accuracy.
Findings
Knowledge transfer from ImageNet improves accuracy.
Pretraining on AudioSet enhances performance.
A smaller Xception model achieves near-SOTA results with almost a third of the parameters.
Abstract
An Xception model reaches state-of-the-art (SOTA) accuracy on the ESC-50 dataset for audio event detection through knowledge transfer from ImageNet weights, pretraining on AudioSet, and an on-the-fly data augmentation pipeline. This paper presents an ablation study that analyzes how each of these components contributes to the gain in accuracy and how it affects training time. A smaller Xception model is also presented that approaches SOTA performance with almost a third of the parameters.
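As a rough illustration of what "on-the-fly" augmentation means, the sketch below applies fresh random transforms each time an example is drawn, so no augmented copies are stored on disk and every epoch sees different variants. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not list the transforms used, so the circular time shift and random gain are illustrative choices, and the names `augment` and `batches` are hypothetical.

```python
import random


def augment(waveform, rng, max_shift_fraction=0.25, gain_range=(0.8, 1.2)):
    """Return a randomly time-shifted and gain-scaled copy of a waveform.

    Assumption: these two transforms are placeholders; the source does not
    specify which augmentations the paper uses.
    """
    n = len(waveform)
    max_shift = int(n * max_shift_fraction)
    shift = rng.randint(-max_shift, max_shift)
    # Circular shift: samples pushed past one end wrap around to the other.
    shifted = [waveform[(i - shift) % n] for i in range(n)]
    gain = rng.uniform(*gain_range)
    return [gain * sample for sample in shifted]


def batches(dataset, batch_size, epochs, seed=0):
    """Yield (waveforms, labels) batches, augmenting each example when drawn.

    `dataset` is a list of (waveform, label) pairs held in memory.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(range(len(dataset)))
        rng.shuffle(order)  # new example order every epoch
        for start in range(0, len(order), batch_size):
            indices = order[start:start + batch_size]
            waveforms = [augment(dataset[i][0], rng) for i in indices]
            labels = [dataset[i][1] for i in indices]
            yield waveforms, labels
```

Because the transforms are sampled at draw time, a small dataset such as ESC-50 never presents exactly the same input twice, at the cost of extra computation per batch.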
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Pointwise Convolution · Residual Connection · Depthwise Convolution · 1x1 Convolution · Average Pooling · Softmax · Global Average Pooling · Depthwise Separable Convolution · Max Pooling · Dense Connections
