Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection
Shao-Yen Tseng, Juncheng Li, Yun Wang, Joseph Szurley, Florian Metze, Samarjit Das

TL;DR
This paper introduces a small-footprint multiple instance learning (MIL) framework for weakly supervised audio event detection (AED). By using pre-trained audio embeddings as input features, a simple DNN reaches performance comparable to recurrent neural networks, keeping model complexity low enough for resource-limited applications.
Contribution
The paper presents a novel MIL framework using audio embeddings and simple DNNs, improving AED performance on weakly labeled data while maintaining low computational complexity.
Findings
F1 score of the baseline audio tagging system improved by 17% when combined with late fusion
Audio embeddings significantly boost MIL model performance
Framework suitable for resource-constrained environments
Abstract
State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine-resolution annotations are too expensive to obtain. In this paper, we propose a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framework uses audio embeddings extracted from a pre-trained convolutional neural network as input features. We show that by using audio embeddings the MIL framework can be implemented using a simple DNN with performance comparable to recurrent neural networks. We evaluate our approach by training an audio tagging system using a subset of AudioSet, which is a large collection of weakly labeled YouTube video excerpts. Combined with a late-fusion approach, we improve the F1 score of a…
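To make the MIL idea in the abstract concrete, here is a minimal sketch of how clip-level (weak) labels can supervise segment-level predictions. It is not the authors' implementation. It assumes that each clip is a "bag" of per-segment embedding vectors, that a single dense layer with a sigmoid stands in for the "simple DNN", and that max pooling aggregates segment scores into a clip score (one common MIL choice; the summary above does not state which pooling the paper uses). All function names, shapes, and weights are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def instance_scores(embedding, weights, biases):
    """Per-class probabilities for one segment embedding.

    A single dense layer stands in for the instance-level DNN (assumption).
    `weights` holds one weight vector per class; `biases` one bias per class.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(class_w, embedding)) + b)
        for class_w, b in zip(weights, biases)
    ]


def bag_scores(bag, weights, biases):
    """Clip-level probabilities: per-class max over all segments in the bag.

    Max pooling encodes the standard MIL assumption that a clip contains an
    event if at least one of its segments does.
    """
    per_instance = [instance_scores(e, weights, biases) for e in bag]
    return [max(column) for column in zip(*per_instance)]


def bag_loss(bag_probs, weak_labels, eps=1e-7):
    """Binary cross-entropy against clip-level labels, averaged over classes."""
    total = 0.0
    for p, y in zip(bag_probs, weak_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(bag_probs)


if __name__ == "__main__":
    random.seed(0)
    num_segments, dim, num_classes = 10, 128, 3  # illustrative sizes only
    bag = [[random.random() for _ in range(dim)] for _ in range(num_segments)]
    weights = [[random.gauss(0.0, 0.1) for _ in range(dim)]
               for _ in range(num_classes)]
    biases = [0.0] * num_classes
    probs = bag_scores(bag, weights, biases)
    print("clip-level probabilities:", probs)
    print("loss vs. weak labels [1, 0, 0]:", bag_loss(probs, [1, 0, 0]))
```

Only the clip-level loss needs labels, which is why weak annotations suffice for training; at inference time the per-segment scores from `instance_scores` can still indicate where in the clip an event is likely to occur.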
