A Separable Temporal Convolution Neural Network with Attention for Small-Footprint Keyword Spotting

Shenghua Hu; Jing Wang; Yujun Wang; Lidong Yang; Wenjing Yang

arXiv:2109.00260 · cs.SD · September 2, 2021 · 1 cites

Open Access

TL;DR

This paper introduces a lightweight separable temporal convolution neural network with attention for keyword spotting on mobile devices, achieving high accuracy with significantly fewer parameters than existing models.

Contribution

The paper presents a novel small-footprint model combining separable temporal convolutions and attention, maintaining high performance with only 32.2K parameters.

Findings

01 Achieves 95.7% accuracy on the Google Speech Commands dataset

02 Uses only 32.2K parameters, far fewer than the 239K of Res15

03 Comes close to the accuracy of Res15, the state-of-the-art KWS model at the time, with roughly one-seventh of its parameters

Abstract

Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still rely on a large number of parameters to ensure good performance. To address this problem, this paper proposes a separable temporal convolution neural network with attention that has a small number of parameters. By combining temporal convolution with an attention mechanism, the model needs only 32.2K parameters while maintaining high performance. The proposed model achieves 95.7% accuracy on the Google Speech Commands dataset, which is close to the performance of Res15 (239K parameters), the state-of-the-art KWS model at the time of writing.
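The abstract does not give layer shapes, so the following is only a rough sketch of why separable temporal convolutions tend to shrink a model, not the authors' architecture. It counts the weights of a standard 1-D convolution and of a depthwise-separable one (a per-channel depthwise filter followed by a 1x1 pointwise mix). The function names and the layer shape (64 input channels, 64 output channels, kernel length 9) are hypothetical values chosen for illustration.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameter count of a standard 1-D (temporal) convolution."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameter count of a depthwise-separable 1-D convolution.

    Depthwise step: one length-`kernel` filter per input channel.
    Pointwise step: a 1x1 convolution mixing c_in channels into c_out.
    """
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer shape, NOT taken from the paper.
C_IN, C_OUT, KERNEL = 64, 64, 9

standard = conv1d_params(C_IN, C_OUT, KERNEL)
separable = separable_conv1d_params(C_IN, C_OUT, KERNEL)

print(f"standard conv:  {standard} parameters")   # 36928
print(f"separable conv: {separable} parameters")  # 4800
print(f"reduction:      {standard / separable:.1f}x")
```

Under these assumed shapes, a single layer shrinks by a factor of about 7.7. The whole-model gap reported in the abstract (32.2K versus 239K for Res15) depends on the full architecture, which this sketch does not attempt to reproduce.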

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing