Does Single-channel Speech Enhancement Improve Keyword Spotting Accuracy? A Case Study

Avamarie Brueggeman; Takuya Higuchi; Masood Delfarah; Stephen Shum; Vineet Garg

arXiv:2309.16060·eess.AS·February 23, 2024·1 cites

Open Access

TL;DR

This study examines whether single-channel speech enhancement improves keyword spotting accuracy in noisy conditions, finding it effective only when the backend model is trained on clean speech, not noisy speech.

Contribution

The paper provides a comprehensive analysis of how single-channel speech enhancement affects keyword spotting, covering joint training of the SE frontend and KWS backend as well as audio injection.

Findings

01

Speech enhancement (SE) improves keyword spotting (KWS) accuracy in noisy environments when the KWS model is trained on clean speech

02

Joint training of SE and KWS models is explored but shows limited gains

03

Audio injection with optimized weighting can reduce distortions in enhanced speech

Abstract

Noise robustness is a key aspect of successful speech applications. Speech enhancement (SE) has been investigated to improve automatic speech recognition accuracy; however, its effectiveness for keyword spotting (KWS) is still under-investigated. In this paper, we conduct a comprehensive study on single-channel speech enhancement for keyword spotting on the Google Speech Command (GSC) dataset. To investigate robustness to noise, the GSC dataset is augmented with noise signals from the WSJ0 Hipster Ambient Mixtures (WHAM!) noise dataset. Our investigation includes not only applying SE before KWS but also performing joint training of the SE frontend and KWS backend models. Moreover, we explore audio injection, a common approach to reduce distortions by using a weighted average of the enhanced and original signals. Audio injection is then further optimized by using another model that…
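The abstract describes audio injection as a weighted average of the enhanced and original signals. A minimal sketch of that idea follows. The function name `inject_audio`, the parameter `weight`, and the convention that `weight` applies to the enhanced signal are assumptions for illustration; the listing does not give the paper's exact formulation.

```python
from typing import List, Sequence


def inject_audio(
    original: Sequence[float],
    enhanced: Sequence[float],
    weight: float,
) -> List[float]:
    """Blend an enhanced signal with the original (noisy) signal.

    Assumed convention (not taken from the paper): `weight` is the share
    given to the enhanced signal, so weight=1.0 returns the enhanced
    signal and weight=0.0 returns the original signal unchanged.
    """
    if len(original) != len(enhanced):
        raise ValueError("original and enhanced must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Sample-wise weighted average of the two waveforms.
    return [weight * e + (1.0 - weight) * o for o, e in zip(original, enhanced)]


# Hypothetical two-sample waveforms, used only to show the call.
mixed = inject_audio([1.0, 0.0], [0.0, 1.0], weight=0.25)
```

With a fixed `weight`, the blend trades off residual noise against enhancement artifacts: a smaller weight keeps more of the original signal and therefore less of any distortion introduced by the SE frontend.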

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing