Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution

Ximin Li; Xiaodong Wei; Xiaowei Qin

arXiv:2010.09960 · eess.AS · October 21, 2020 · 1 citation

TL;DR

This paper introduces a multi-scale temporal convolution approach for small-footprint keyword spotting that reaches high accuracy with few enough parameters for on-device use.

Contribution

The paper proposes the MTConv module and the TENet architecture. MTConv is used during training to improve accuracy and is converted to a standard convolution for inference, so the gain comes without added computational cost at inference time.

Findings

01. Achieved 96.8% accuracy on the Google Speech Command Dataset

02. Model with 100K parameters outperforms many existing methods

03. MTConv can be converted to standard convolution during inference

Abstract

Keyword Spotting (KWS) plays a vital role in human-computer interaction for smart on-device terminals and service robots. It remains challenging to achieve a good trade-off between small footprint and high accuracy for the KWS task. In this paper, we explore the application of multi-scale temporal modeling to the small-footprint keyword spotting task. We propose a multi-branch temporal convolution module (MTConv), a CNN block consisting of multiple temporal convolution filters with different kernel sizes, which enriches the temporal feature space. In addition, taking advantage of temporal and depthwise convolution, a temporal efficient neural network (TENet) is designed for KWS systems. Based on the proposed model, we replace standard temporal convolution layers with MTConvs that can be trained for better performance. At the inference stage, the MTConv can be equivalently converted to the base…
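The train-time/inference-time equivalence mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the branch outputs are summed, that all kernels have odd length with stride 1 and zero "same" padding, and it leaves out any per-branch normalization, which would have to be folded into the kernels first. The helper names `conv1d_same` and `fuse_branches`, the channel counts, and the kernel sizes (3, 5, 9) are hypothetical choices made for illustration.

```python
import numpy as np


def conv1d_same(x, w):
    """Stride-1 temporal convolution (cross-correlation) with zero 'same' padding.

    x: (C_in, T) input; w: (C_out, C_in, K) kernel with odd K.
    """
    c_out, _, k = w.shape
    pad = k // 2
    t = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, t))
    for i in range(k):
        # Tap i mixes channels at time offset (i - pad).
        out += w[:, :, i] @ xp[:, i:i + t]
    return out


def fuse_branches(kernels):
    """Zero-pad each kernel (centered) to the largest size and sum them.

    Assumption: branches are combined by addition, so by linearity the
    summed kernel reproduces the summed branch outputs.
    """
    k_max = max(w.shape[-1] for w in kernels)
    fused = np.zeros(kernels[0].shape[:2] + (k_max,))
    for w in kernels:
        k = w.shape[-1]
        off = (k_max - k) // 2
        fused[:, :, off:off + k] += w
    return fused


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 50))  # 4 input channels, 50 time steps
kernels = [rng.standard_normal((8, 4, k)) for k in (3, 5, 9)]

multi_branch = sum(conv1d_same(x, w) for w in kernels)   # training-time form
single_kernel = conv1d_same(x, fuse_branches(kernels))   # inference-time form
print(np.allclose(multi_branch, single_kernel))
```

Under these assumptions the two outputs agree up to floating-point rounding, which is why a multi-branch block of this kind costs no more at inference than a single convolution with the largest kernel size.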

Code & Models

Repositories

Interlagos/TENet-kws · tf · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling

Methods: Convolution