Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting
Iván López-Espejo, Ram C. M. C. Shekar, Zheng-Hua Tan, Jesper Jensen, John H. L. Hansen

TL;DR
This paper shows that learnable filterbank features with fewer channels can outperform traditional handcrafted features in noise-robust small-footprint keyword spotting, reducing energy consumption significantly.
Contribution
It demonstrates that filterbank learning with fewer channels improves noise robustness and energy efficiency in KWS compared to traditional features.
Findings
Learnable filterbanks outperform handcrafted features at low channel counts.
Switching from 40-channel log-Mel features to 8-channel learned features causes a relative accuracy loss of only 3.5%.
The reduced-channel learned features cut energy consumption by a factor of 6.3.
Abstract
In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels may cause some drop in KWS performance, but it also yields a substantial reduction in energy consumption, which is key when deploying always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5%…
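To make the channel reduction concrete, here is a minimal sketch of how a filterbank maps a power spectrogram to log features, and how the feature dimension shrinks from 40 to 8 channels. It is an illustration only, not the authors' implementation: the function name `log_filterbank_features`, the non-negativity clamp, the 257-bin spectrum, and the random weights are all assumptions. In filterbank learning, the weight matrix would be trained jointly with the KWS model rather than fixed to a Mel design.

```python
import numpy as np

def log_filterbank_features(power_spec, weights, eps=1e-6):
    """Apply a filterbank to a power spectrogram and take the log.

    power_spec: (frames, bins) non-negative power spectrum
    weights:    (channels, bins) filterbank matrix (hypothetical values here)
    returns:    (frames, channels) log filterbank features
    """
    # Assumed constraint: clamp filter gains to be non-negative.
    w = np.maximum(weights, 0.0)
    # eps avoids log(0) when a channel receives no energy.
    return np.log(power_spec @ w.T + eps)

rng = np.random.default_rng(0)
n_frames, n_bins = 100, 257  # hypothetical sizes (e.g. a 512-point FFT)
power_spec = rng.random((n_frames, n_bins))

# Random stand-ins for a 40-channel and an 8-channel filterbank.
w40 = rng.random((40, n_bins))
w8 = rng.random((8, n_bins))

feats40 = log_filterbank_features(power_spec, w40)
feats8 = log_filterbank_features(power_spec, w8)
print(feats40.shape, feats8.shape)  # (100, 40) (100, 8)
```

With 8 channels, each frame carries one fifth as many feature values into the classifier, which is where the smaller downstream cost comes from in this sketch.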
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Dropout
