Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

Shao-Yen Tseng; Juncheng Li; Yun Wang; Joseph Szurley; Florian Metze; Samarjit Das

arXiv:1712.09673·cs.SD·March 28, 2018

TL;DR

This paper introduces a small-footprint multiple instance learning framework for weakly supervised audio event detection that leverages pre-trained audio embeddings, achieving high performance with reduced model complexity suitable for resource-limited applications.

Contribution

The paper presents a multiple instance learning (MIL) framework that feeds pre-trained audio embeddings into simple DNNs, improving audio event detection (AED) performance on weakly labeled data while keeping computational complexity low.

Findings

01. F1 score improved by 17% over baseline

02. Audio embeddings significantly boost MIL model performance

03. Framework suitable for resource-constrained environments

Abstract

State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine resolution annotations are too expensive to obtain. In this paper, we propose a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framework uses audio embeddings extracted from a pre-trained convolutional neural network as input features. We show that by using audio embeddings the MIL framework can be implemented using a simple DNN with performance comparable to recurrent neural networks. We evaluate our approach by training an audio tagging system using a subset of AudioSet, which is a large collection of weakly labeled YouTube video excerpts. Combined with a late-fusion approach, we improve the F1 score of a…
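
To make the MIL idea in the abstract concrete, here is a minimal sketch of how a clip-level (bag) prediction can be formed from per-segment (instance) embeddings using a small DNN. It is illustrative only, not the authors' implementation: the max-pooling aggregation, the two-layer network, the layer sizes, the 10 segments of 128-dimensional embeddings per clip, and all function names are assumptions, and the weights are random rather than trained.

```python
import numpy as np

def instance_scores(embeddings, w1, b1, w2, b2):
    """Score each instance (segment embedding) with a small two-layer DNN.

    embeddings: (n_instances, emb_dim)
    returns:    (n_instances, n_classes), each value in (0, 1)
    """
    hidden = np.maximum(0.0, embeddings @ w1 + b1)  # ReLU hidden layer
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))            # independent sigmoid per class

def bag_scores(inst):
    """Aggregate instance scores into one clip-level score per class.

    Max pooling is an assumed aggregation: a clip is positive for a class
    if at least one of its segments is.
    """
    return inst.max(axis=0)

def bag_loss(bag, labels, eps=1e-7):
    """Binary cross-entropy between bag scores and weak clip-level labels."""
    p = np.clip(bag, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))

# Hypothetical sizes: 10 segments per clip, 128-dim embeddings, 5 classes.
rng = np.random.default_rng(0)
n_instances, emb_dim, n_hidden, n_classes = 10, 128, 32, 5

clip = rng.normal(size=(n_instances, emb_dim))       # stand-in for CNN embeddings
w1 = rng.normal(scale=0.1, size=(emb_dim, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

weak_labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])    # clip-level tags only

inst = instance_scores(clip, w1, b1, w2, b2)
bag = bag_scores(inst)
loss = bag_loss(bag, weak_labels)
print(inst.shape, bag.shape)
```

Only the clip-level labels enter the loss, so no segment-level annotation is needed during training, while the per-instance scores remain available at inference time to indicate which segments triggered each class.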
