End-to-End Multi-Look Keyword Spotting
Meng Yu, Xuan Ji, Bo Wu, Dan Su, Dong Yu

TL;DR
This paper introduces an end-to-end multi-look neural network for keyword spotting that enhances speech from multiple directions simultaneously, improving accuracy in noisy, far-field conditions by dynamically focusing on reliable sources.
Contribution
The paper presents a multi-look neural network for speech enhancement that is trained jointly with the KWS model. An attention mechanism integrates the enhanced signals from the multiple look directions, which improves performance in noisy, far-field conditions.
Findings
Significant reduction in false alarms and false rejects in noisy, far-field conditions.
Outperforms baseline KWS and beamformer-based multi-beam systems on large evaluation sets.
Demonstrates that multi-look enhancement combined with attention is robust on large noisy and far-field evaluation sets.
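To make the two error types in the findings concrete, here is a minimal sketch of how false rejects and false alarms might be counted for a KWS system. The function name `kws_error_rates`, the per-utterance boolean format, and the "false alarms per hour" normalization are assumptions for illustration. This summary does not state the paper's exact metric definitions, and the numbers below are toy data, not results from the paper.

```python
def kws_error_rates(labels, detections, audio_hours):
    """Count KWS errors over a set of utterances (illustrative only).

    labels[i]     -- True if utterance i really contains the keyword
    detections[i] -- True if the system fired on utterance i
    audio_hours   -- hours of audio used to normalize false alarms (assumed convention)
    """
    positives = sum(labels)
    # False reject: keyword present, system stayed silent.
    false_rejects = sum(1 for y, d in zip(labels, detections) if y and not d)
    # False alarm: system fired, keyword absent.
    false_alarms = sum(1 for y, d in zip(labels, detections) if d and not y)
    false_reject_rate = false_rejects / positives if positives else 0.0
    false_alarms_per_hour = false_alarms / audio_hours
    return false_reject_rate, false_alarms_per_hour


# Toy data: 4 keyword utterances, 4 non-keyword utterances.
labels = [True, True, True, True, False, False, False, False]
detections = [True, True, True, False, False, True, False, False]
frr, fa_per_hour = kws_error_rates(labels, detections, audio_hours=2.0)
```

In this toy case one of four keyword utterances is missed and one non-keyword utterance triggers the detector, so both error types are nonzero. A real evaluation would sweep the detection threshold to trade one against the other.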
Abstract
The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. In this paper, we propose a multi-look neural network model for speech enhancement that simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model, which integrates the enhanced signals from multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves KWS performance over the baseline KWS system and a recent beamformer-based multi-beam KWS system.
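The attention step described in the abstract can be illustrated with a minimal sketch: each look direction yields one enhanced feature vector, a score is computed per direction, and a softmax turns the scores into weights used to fuse the directions. This is not the authors' architecture. The scaled dot-product scoring, the `query` vector (standing in for a learned parameter), and the function name `attend_over_looks` are all assumptions chosen to keep the example self-contained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_looks(look_features, query):
    """Fuse per-direction features with attention weights (illustrative only).

    look_features -- list of K feature vectors, one per look direction
    query         -- vector of the same length (hypothetical learned parameter)
    """
    dim = len(query)
    # Scaled dot-product score for each look direction (assumed scoring function).
    scores = [
        sum(f * q for f, q in zip(feat, query)) / math.sqrt(dim)
        for feat in look_features
    ]
    weights = softmax(scores)
    # Weighted sum across directions, computed per feature dimension.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, look_features))
        for d in range(dim)
    ]
    return weights, fused


# Toy input: 4 look directions, 2-dimensional features.
# Direction 1 carries the strongest response to the query.
looks = [[0.1, 0.0], [2.0, 1.5], [0.2, -0.1], [0.0, 0.3]]
query = [1.0, 1.0]
weights, fused = attend_over_looks(looks, query)
```

Here the weights sum to one and the largest weight falls on the direction whose features align best with the query, which is the sense in which attention "tunes" the model toward the more reliable source. In a jointly trained system the scoring parameters would be learned together with the KWS objective rather than fixed by hand.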
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
