Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution
Ximin Li, Xiaodong Wei, Xiaowei Qin

TL;DR
This paper introduces a multi-scale temporal convolution approach for small-footprint keyword spotting, achieving high accuracy with minimal parameters suitable for on-device applications.
Contribution
It proposes the MTConv module and the TENet architecture, enabling efficient, high-accuracy keyword spotting without adding computational cost at inference.
Findings
Achieves 96.8% accuracy on the Google Speech Command Dataset
Model with 100K parameters outperforms many existing methods
MTConv can be converted to standard convolution during inference
Abstract
Keyword Spotting (KWS) plays a vital role in human-computer interaction for smart on-device terminals and service robots. It remains challenging to achieve a good trade-off between small footprint and high accuracy for the KWS task. In this paper, we explore the application of multi-scale temporal modeling to the small-footprint keyword spotting task. We propose a multi-branch temporal convolution module (MTConv), a CNN block consisting of multiple temporal convolution filters with different kernel sizes, which enriches the temporal feature space. Besides, taking advantage of temporal and depthwise convolution, a temporal efficient neural network (TENet) is designed for KWS systems. Based on the proposed model, we replace standard temporal convolution layers with MTConvs that can be trained for better performance. While at the inference stage, the MTConv can be equivalently converted to the base…
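The conversion of a multi-branch block into a single convolution follows from the linearity of convolution. The sketch below is a minimal, single-channel illustration of that idea and not the authors' implementation: it assumes odd kernel sizes, zero "same" padding, branch outputs that are simply summed, and no normalization layers. The function names `conv1d_same` and `merge_kernels`, the kernel sizes 3, 5 and 7, and all numeric values are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Cross-correlate a 1-D signal with an odd-length kernel, zero-padded to keep the length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def merge_kernels(kernels):
    """Center each odd-length kernel inside the largest kernel size and sum them elementwise."""
    largest = max(len(k) for k in kernels)
    merged = [0.0] * largest
    for k in kernels:
        offset = (largest - len(k)) // 2  # zero-pad equally on both sides
        for j, weight in enumerate(k):
            merged[offset + j] += weight
    return merged


# Hypothetical input frames and three temporal branches (sizes 3, 5, 7).
signal = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
branches = [
    [0.2, -0.1, 0.4],
    [0.1, 0.0, -0.3, 0.2, 0.05],
    [0.3, -0.2, 0.1, 0.0, 0.25, -0.15, 0.05],
]

# Training-time view: run every branch and sum the outputs.
branch_outputs = [conv1d_same(signal, k) for k in branches]
multi_branch = [sum(values) for values in zip(*branch_outputs)]

# Inference-time view: one convolution with the merged kernel.
single = conv1d_same(signal, merge_kernels(branches))

# The two views should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(multi_branch, single))
print(f"max difference between the two views: {max_diff:.2e}")
```

Under these assumptions the merged kernel has the size of the largest branch, so the single convolution costs the same as a standard temporal convolution of that size, regardless of how many branches were used during training.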
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
Methods: Convolution
