Streaming Noise Context Aware Enhancement For Automatic Speech Recognition in Multi-Talker Environments
Joe Caroselli, Arun Narayanan, Yiteng Huang

TL;DR
This paper introduces two streaming, noise context-aware speech enhancement algorithms for multi-talker environments, improving automatic speech recognition accuracy on smart devices by effectively handling interfering speech.
Contribution
It presents novel multi-microphone algorithms that leverage noise context and hotword detection, with an adaptive selection mechanism for enhanced speech recognition in multi-talker scenarios.
Findings
Achieves 55% relative WER reduction at -12 dB SNR
Achieves 43% relative WER reduction at 12 dB SNR
Algorithms are complementary and effective in real-time multi-talker environments
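For readers unfamiliar with the metric, a relative WER reduction compares the word error rate with enhancement against the baseline without it. A minimal sketch of the standard definition follows; the function name and the example WER values are made up for illustration and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, enhanced_wer: float) -> float:
    """Fraction of the baseline word error rate removed by enhancement."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - enhanced_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper: a drop from 40% to 18% WER.
example = relative_wer_reduction(40.0, 18.0)
```

Under this definition, a 55% relative reduction means the enhanced system makes a bit less than half as many word errors as the baseline, whatever the absolute WER happens to be.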
Abstract
One of the most challenging scenarios for smart speakers is the multi-talker case, in which target speech from the desired speaker is mixed with interfering speech from one or more other speakers. A smart assistant needs to determine which voice to recognize and which to ignore, and it needs to do so in a streaming, low-latency manner. This work presents two multi-microphone speech enhancement algorithms targeted at this scenario. Targeting on-device use cases, we assume that the algorithm has access to the signal before the hotword, which is referred to as the noise context. The first is the Context Aware Beamformer, which uses the noise context and the detected hotword to determine how to target the desired speaker. The second is an adaptive noise cancellation algorithm called Speech Cleaner, which trains a filter using the noise context. It is demonstrated that the two algorithms are complementary in the…
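The idea of training a cancellation filter on the noise context can be illustrated with a toy example. The sketch below is not the authors' Speech Cleaner implementation, whose details are not given in this summary. It assumes a two-microphone, time-domain setup in which a short FIR filter is fitted by ridge-regularised least squares on pre-hotword samples, then frozen and used to subtract the predicted interference afterwards. All function names, the filter length, and the synthetic signals are hypothetical.

```python
import numpy as np


def fit_cancellation_filter(ref_ctx, aux_ctx, taps=4, ridge=1e-6):
    """Fit FIR weights w so that ref[n] ~= sum_k w[k] * aux[n - k],
    using only noise-context samples (assumed to contain no target speech)."""
    n = len(aux_ctx)
    # Column k holds aux delayed by k samples, aligned with ref_ctx[taps - 1:].
    A = np.stack([aux_ctx[taps - 1 - k : n - k] for k in range(taps)], axis=1)
    b = ref_ctx[taps - 1 :]
    # Ridge-regularised normal equations keep the solve well conditioned.
    return np.linalg.solve(A.T @ A + ridge * np.eye(taps), A.T @ b)


def apply_cancellation(ref, aux, w):
    """Subtract the interference predicted from the auxiliary mic."""
    predicted = np.convolve(aux, w)[: len(ref)]  # causal FIR output
    return ref - predicted


# Synthetic scene (all values are assumptions made for illustration).
rng = np.random.default_rng(0)
n_ctx, n_query = 4000, 4000
interferer = rng.standard_normal(n_ctx + n_query)
target = np.zeros(n_ctx + n_query)
target[n_ctx:] = rng.standard_normal(n_query)  # target is silent during the context

h = np.array([0.6, -0.3, 0.1])  # hypothetical inter-mic path of the interferer
ref_interf = np.convolve(interferer, h)[: len(interferer)]
ref = ref_interf + target           # reference mic: interference plus target
aux = interferer + 0.2 * target     # auxiliary mic: mostly interference

# Train on the noise context only, then apply with the filter frozen.
w = fit_cancellation_filter(ref[:n_ctx], aux[:n_ctx], taps=4)
enhanced = apply_cancellation(ref, aux, w)[n_ctx:]
```

Because the filter is fitted while only the interferer is active, it models the interferer's path between the microphones and not the target's. In this toy setup the target leaks into the auxiliary microphone, so the output keeps a slightly filtered copy of the target while the interference is cancelled. A practical system would more likely operate per frequency bin on streaming frames.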
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis
Methods: Attentive Walk-Aggregating Graph Neural Network
