End-to-End Direction-Aware Keyword Spotting with Spatial Priors in Noisy Environments
Rui Wang, Zhifei Zhang, Yu Gao, Xiaofeng Mou, Yi Xu

TL;DR
This paper introduces an end-to-end multi-channel keyword spotting system that leverages spatial cues and directional priors to enhance noise robustness in challenging acoustic environments.
Contribution
It proposes a framework that combines spatial encoding and directional priors within a single end-to-end model to improve keyword spotting performance in noisy environments.
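To make "directional prior" concrete, here is a minimal sketch of one common hand-crafted way to tie a known target direction to inter-channel cues. It is not the paper's method: the paper learns its spatial features and embedding, whereas this sketch compares an observed inter-channel phase difference (IPD) with the IPD expected for a given direction. The two-microphone far-field geometry, the constant `SPEED_OF_SOUND`, and the function names `expected_ipd`, `observed_ipd`, and `direction_score` are all assumptions made for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def expected_ipd(freq_hz, mic_distance_m, doa_rad):
    """Phase lead of mic 1 over mic 2 for a far-field source at angle doa_rad.

    The angle is measured from the array axis, with 0 pointing from mic 2
    toward mic 1 (an assumed convention for this sketch).
    """
    delay_s = mic_distance_m * math.cos(doa_rad) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay_s


def observed_ipd(x1, x2):
    """Phase of x1 relative to x2 for one complex time-frequency bin."""
    return cmath.phase(x1 * x2.conjugate())


def direction_score(x1, x2, freq_hz, mic_distance_m, doa_rad):
    """Cosine similarity between observed and expected IPD, in [-1, 1]."""
    diff = observed_ipd(x1, x2) - expected_ipd(freq_hz, mic_distance_m, doa_rad)
    return math.cos(diff)


# Hypothetical example: one 1 kHz bin, 5 cm spacing, source on the array axis.
FREQ_HZ, SPACING_M, TRUE_DOA = 1000.0, 0.05, 0.0
x1 = complex(1.0, 0.0)
# Mic 2 receives the same bin delayed, so its phase lags by the expected IPD.
x2 = x1 * cmath.exp(-1j * expected_ipd(FREQ_HZ, SPACING_M, TRUE_DOA))

score_match = direction_score(x1, x2, FREQ_HZ, SPACING_M, TRUE_DOA)
score_mismatch = direction_score(x1, x2, FREQ_HZ, SPACING_M, math.pi)
```

In this toy setup the score is highest when the assumed direction matches the one used to generate the bin. An end-to-end model such as the one described here would presumably replace this fixed feature with learned representations.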
Findings
Spatial modeling and directional priors each improve baseline performance.
Combining spatial cues and priors yields the best results in noisy conditions.
The approach demonstrates strong potential for target-speaker detection in complex scenarios.
Abstract
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimization, inherently limiting performance. We present an end-to-end multi-channel KWS framework that exploits spatial cues to improve noise robustness. A spatial encoder learns inter-channel features, while a spatial embedding injects directional priors; the fused representation is processed by a streaming backbone. Experiments in simulated noisy conditions across multiple signal-to-noise ratios (SNRs) show that spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving the best results. These findings validate end-to-end multi-channel spatial modeling,…
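The abstract reports results across multiple SNRs in simulated noise. As a rough illustration of what that means, the sketch below scales a noise signal so that the mixture reaches a requested SNR. The function name `mix_at_snr` and the toy signals are hypothetical, and the paper's actual simulation setup is not described in this summary.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling it to the requested SNR in dB.

    Returns the mixture and the gain applied to the noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(speech_power / (gain^2 * noise_power)), solved for gain.
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Toy signals (assumed for illustration only).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

mixture_0db, gain_0db = mix_at_snr(speech, noise, 0.0)
mixture_20db, gain_20db = mix_at_snr(speech, noise, 20.0)
```

For multi-channel data the same gain would normally be applied to every channel of the noise recording, so that the inter-channel relationships carrying the spatial cues are preserved.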
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
