End-to-End Direction-Aware Keyword Spotting with Spatial Priors in Noisy Environments

Rui Wang; Zhifei Zhang; Yu Gao; Xiaofeng Mou; Yi Xu

arXiv:2603.09505 · eess.AS · March 11, 2026

Open Access

TL;DR

This paper introduces an end-to-end multi-channel keyword spotting system that leverages spatial cues and directional priors to enhance noise robustness in challenging acoustic environments.

Contribution

The paper proposes an end-to-end multi-channel framework that combines a spatial encoder with directional priors to improve keyword spotting in noisy environments.
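
As a rough illustration of what "spatial encoding plus a directional prior" can look like at the feature level, the sketch below is offered under explicit assumptions and is not the authors' design. The paper's spatial encoder is learned; here a hand-crafted inter-channel phase difference (IPD) feature stands in for it. The sinusoidal azimuth code and the fusion by concatenation are likewise hypothetical choices, as are all function names.

```python
import cmath
import math


def ipd_features(ch_a, ch_b):
    """Cos/sin of the inter-channel phase difference for each frequency bin.

    ch_a, ch_b: equal-length lists of complex STFT values for one frame.
    Hand-crafted stand-in for a learned spatial encoder (assumption).
    """
    feats = []
    for a, b in zip(ch_a, ch_b):
        # phase(a * conj(b)) equals phase(a) - phase(b), wrapped to (-pi, pi]
        ipd = cmath.phase(a * b.conjugate())
        feats.extend([math.cos(ipd), math.sin(ipd)])
    return feats


def direction_embedding(azimuth_deg, n_harmonics=2):
    """Fixed sinusoidal code for a target azimuth in degrees.

    Hypothetical: a learned lookup table over angle bins is another option.
    """
    theta = math.radians(azimuth_deg)
    emb = []
    for k in range(1, n_harmonics + 1):
        emb.extend([math.cos(k * theta), math.sin(k * theta)])
    return emb


def fuse(spatial_feats, direction_emb):
    """Fuse by simple concatenation (assumption; other fusions are possible)."""
    return spatial_feats + direction_emb
```

For a frame with 3 frequency bins and `n_harmonics=2`, the fused vector has 2 × 3 + 4 = 10 entries. A streaming backbone would then consume one such vector per frame.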

Findings

01 Spatial modeling and directional priors each improve baseline performance.

02 Combining spatial cues and priors yields the best results in noisy conditions.

03 The approach demonstrates strong potential for target-speaker detection in complex scenarios.

Abstract

Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimization, inherently limiting performance. We present an end-to-end multi-channel KWS framework that exploits spatial cues to improve noise robustness. A spatial encoder learns inter-channel features, while a spatial embedding injects directional priors; the fused representation is processed by a streaming backbone. Experiments in simulated noisy conditions across multiple signal-to-noise ratios (SNRs) show that spatial modeling and directional priors each yield clear gains over baselines, with their combination achieving the best results. These findings validate end-to-end multi-channel spatial modeling,…
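
The abstract does not say how the noisy conditions were simulated. A common recipe, sketched here as an assumption rather than as the paper's procedure, is to scale a noise recording so that the speech-to-noise power ratio matches a target SNR in dB and then add it to the clean speech. The function names are hypothetical, and the example is single-channel for brevity.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to the requested SNR relative to `speech`, then add.

    Returns (mixture, scaled_noise). Uses SNR_dB = 10 * log10(P_s / P_n).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_s, p_n = power(speech), power(noise)
    if p_n == 0.0:
        raise ValueError("noise has zero power")
    # Solve P_s / (gain^2 * P_n) = 10^(snr_db / 10) for gain
    gain = math.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def snr_db(speech, noise):
    """Measured SNR in dB between a speech signal and a noise signal."""
    return 10 * math.log10(power(speech) / power(noise))
```

In a multi-channel setting, one would typically compute the gain on a reference channel and apply the same gain to every channel, so that the inter-channel differences carrying the spatial cues are preserved.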

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing