Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
Naoya Takahashi, Michael Gygli, Beat Pfister, Luc Van Gool

TL;DR
This paper introduces a deep CNN with a large input field and a novel data augmentation technique for acoustic event detection, significantly outperforming previous methods in accuracy.
Contribution
The paper presents a deep CNN architecture, inspired by VGGNet, for end-to-end acoustic event detection (AED), along with a new data augmentation method to improve model generalization.
Findings
Achieved a 16% absolute improvement over state-of-the-art methods.
Demonstrated the effectiveness of large-input CNNs for long-time acoustic analysis.
Validated that the proposed data augmentation improves model performance.
Abstract
We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this makes it possible to train audio event detection end-to-end. Our architecture is inspired by the success of VGGNet and uses small, 3x3 convolutions, but more depth than previous methods in AED. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly…
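The abstract's point about small 3x3 convolutions and added depth can be illustrated with generic receptive-field arithmetic: each stacked layer widens the span of input that one output unit sees. The sketch below is a minimal illustration, not the paper's network. The function name `receptive_field` and the layer stacks are hypothetical, chosen only to show how stacked 3x3 convolutions and 2x2 pooling grow the receptive field along one axis of a time-frequency input.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single unit at the input sees exactly one sample
    jump = 1  # spacing, in input samples, between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k - 1) jumps
        jump *= stride             # striding spreads later units further apart
    return rf


# Hypothetical layer types for illustration (not the paper's exact configuration).
conv3 = (3, 1)  # 3x3 convolution, stride 1
pool2 = (2, 2)  # 2x2 pooling, stride 2

two_convs = receptive_field([conv3, conv3])
three_convs = receptive_field([conv3, conv3, conv3])
vgg_like = receptive_field([conv3, conv3, pool2, conv3, conv3, pool2])

print(two_convs, three_convs, vgg_like)
```

Two stacked 3x3 layers cover the same span as a single 5x5 layer, and three cover that of a 7x7 layer, which is the usual argument for VGG-style stacks: comparable coverage with fewer weights and more nonlinearities. Interleaved pooling makes the covered span grow faster with depth.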
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
