Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?
Huy Phan, Oliver Y. Chén, Philipp Koch, Lam Pham, Ian McLoughlin, Alfred Mertins, Maarten De Vos

TL;DR
This paper investigates the minimal temporal duration needed for reliable audio scene recognition using deep learning models and examines the effectiveness of model fusion at different signal lengths.
Contribution
It introduces a study on variable scene recognition times and evaluates the necessity of model fusion depending on signal duration.
Findings
Some scenes can be reliably recognized within a few seconds.
Other scenes require significantly longer durations.
Model fusion benefits short signal durations.
Abstract
Due to the variability in characteristics of audio scenes, some scenes can naturally be recognized earlier than others. In this work, rather than using equal-length snippets for all scene categories, as is common in the literature, we study to which temporal extent an audio scene can be reliably recognized given state-of-the-art models. Moreover, as model fusion with deep network ensemble is prevalent in audio scene classification, we further study whether, and if so, when model fusion is necessary for this task. To achieve these goals, we employ two single-network systems relying on a convolutional neural network and a recurrent neural network for classification as well as early fusion and late fusion of these networks. Experimental results on the LITIS-Rouen dataset show that some scenes can be reliably recognized with a few seconds while other scenes require significantly longer…
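To make the idea of late fusion concrete, here is a minimal sketch that averages the class probabilities of two classifiers for one audio snippet. It is an illustration only: the abstract does not say how the authors combine their networks, so the helper names (`late_fusion`, `predict`), the equal weighting, and the three-class probability values are all assumptions made up for this example.

```python
def late_fusion(cnn_probs, rnn_probs, weight=0.5):
    """Hypothetical helper: weighted average of two class-probability vectors.

    `weight` is the share given to the first model; 0.5 is an assumed default,
    not a value taken from the paper.
    """
    if len(cnn_probs) != len(rnn_probs):
        raise ValueError("probability vectors must have the same length")
    return [weight * c + (1.0 - weight) * r for c, r in zip(cnn_probs, rnn_probs)]


def predict(probs):
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Made-up posteriors over three scene classes for a single snippet.
cnn_probs = [0.6, 0.3, 0.1]  # the CNN alone would pick class 0
rnn_probs = [0.2, 0.7, 0.1]  # the RNN alone would pick class 1

fused = late_fusion(cnn_probs, rnn_probs)
label = predict(fused)  # decision after combining both models
```

Early fusion, by contrast, would combine the two networks before the final decision, for example by feeding both networks' learned features to a single classifier. That variant is not sketched here.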
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
