Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments
Pasi Pertilä, Emre Cakir, Aapo Hakala, Eemi Fagerlund, Tuomas Virtanen, Archontis Politis, Antti Eronen

TL;DR
This paper presents a two-stage hierarchical convolutional recurrent neural network approach for joint speech detection and localization in diverse everyday environments using a mobile phone microphone array, improving accuracy over flat models.
Contribution
It introduces a novel hierarchical system for joint sound event detection and localization tailored for mobile device scenarios, evaluated on real-world data.
Findings
Good detection accuracy achieved
Effective localization in varied environments
Outperforms non-hierarchical models
Abstract
Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants. For example, an automatic audio focus for video capture on a mobile phone requires robust detection of relevant acoustic events around the device and their direction. Existing SELD approaches have been evaluated using material produced in controlled indoor environments, or the audio is simulated by mixing isolated sounds to different spatial locations. This paper studies SELD of speech in diverse everyday environments, where the audio corresponds to typical usage scenarios of handheld mobile devices. In order to allow weighting the relative importance of localization vs. detection, we propose a two-stage hierarchical system, where the first stage is to detect the target events, and the second stage…
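The detect-then-localize idea can be illustrated with a minimal sketch. This is not the authors' CRNN; it only shows how per-frame outputs of two stages might be combined at inference time. The function name `two_stage_seld`, the 0.5 threshold, and the use of a single azimuth angle per frame are assumptions made for illustration.

```python
import numpy as np

def two_stage_seld(speech_prob, azimuth_deg, threshold=0.5):
    """Combine per-frame detection and localization outputs.

    speech_prob: stage-1 speech probabilities, shape (frames,)
    azimuth_deg: stage-2 direction estimates in degrees, shape (frames,)
    threshold:   hypothetical decision threshold for speech activity

    Returns (active, doa): a boolean activity mask and the azimuths,
    with NaN in frames where no speech was detected.
    """
    speech_prob = np.asarray(speech_prob, dtype=float)
    azimuth_deg = np.asarray(azimuth_deg, dtype=float)
    # Stage 1: decide which frames contain the target event (speech).
    active = speech_prob >= threshold
    # Stage 2: keep a direction estimate only for detected frames.
    doa = np.where(active, azimuth_deg, np.nan)
    return active, doa

# Three frames: speech, no speech, speech.
active, doa = two_stage_seld([0.9, 0.2, 0.7], [30.0, -120.0, 45.0])
```

Because the direction estimate is only reported for frames that pass detection, the detection threshold can be tuned separately from the localizer, which is one way to trade detection against localization performance.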
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
