Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?

Huy Phan; Oliver Y. Chén; Philipp Koch; Lam Pham; Ian McLoughlin; Alfred Mertins; Maarten De Vos

arXiv:1811.01095 · cs.SD · May 10, 2019 · 6 cites

TL;DR

This paper investigates the minimal temporal duration needed for reliable audio scene recognition using deep learning models and examines the effectiveness of model fusion at different signal lengths.

Contribution

It introduces a study on variable scene recognition times and evaluates the necessity of model fusion depending on signal duration.

Findings

01. Some scenes can be recognized within a few seconds.

02. Longer durations are needed for certain scenes.

03. Model fusion benefits short signal durations.

Abstract

Due to the variability in the characteristics of audio scenes, some scenes can naturally be recognized earlier than others. In this work, rather than using equal-length snippets for all scene categories, as is common in the literature, we study to what temporal extent an audio scene can be reliably recognized given state-of-the-art models. Moreover, as model fusion with deep network ensembles is prevalent in audio scene classification, we further study whether, and if so when, model fusion is necessary for this task. To achieve these goals, we employ two single-network systems, one relying on a convolutional neural network and one on a recurrent neural network for classification, as well as early fusion and late fusion of these networks. Experimental results on the LITIS-Rouen dataset show that some scenes can be reliably recognized within a few seconds while other scenes require significantly longer…
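To make the distinction between early and late fusion concrete, here is a minimal generic sketch. It is not the paper's implementation: the abstract does not say how the two networks are combined, so the averaging rule, the function names (`late_fusion`, `early_fusion`, `predict`), and the toy numbers below are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def late_fusion(prob_a: Sequence[float], prob_b: Sequence[float]) -> List[float]:
    """Late fusion (assumed rule): average two models' class probabilities."""
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same set of classes")
    return [(a + b) / 2.0 for a, b in zip(prob_a, prob_b)]


def early_fusion(feat_a: Sequence[float], feat_b: Sequence[float]) -> List[float]:
    """Early fusion (assumed rule): concatenate feature vectors so that a
    single downstream classifier sees both representations at once."""
    return list(feat_a) + list(feat_b)


def predict(probs: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Toy, made-up scores over three hypothetical scene classes.
cnn_probs = [0.5, 0.25, 0.25]   # hypothetical CNN output
rnn_probs = [0.25, 0.75, 0.0]   # hypothetical RNN output

fused = late_fusion(cnn_probs, rnn_probs)
print(predict(cnn_probs), predict(rnn_probs), predict(fused))
```

In this toy case the two single models disagree and the averaged scores settle the decision, which is the kind of situation where fusion can matter. When both models already agree with high confidence, fusion changes nothing.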

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization