Separable Temporal Convolution plus Temporally Pooled Attention for Lightweight High-performance Keyword Spotting

Shenghua Hu; Jing Wang; Yujun Wang; Wenjing Yang

arXiv:2108.12146·cs.SD·August 30, 2021

TL;DR

This paper introduces ST-AttNet, a lightweight neural network for keyword spotting that combines separable temporal convolution with temporally pooled attention. On the Google Speech Commands dataset V1 it reaches 96.6% accuracy with 48K parameters, about 1/6 of the 305K in TC-ResNet14-1.5.

Contribution

The paper proposes a novel neural network architecture combining separable temporal convolution and temporally pooled attention for efficient keyword spotting.

Findings

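01. The model has 48K parameters, about 1/6 of the 305K in the state-of-the-art TC-ResNet14-1.5, with similar accuracy.

02. It achieves 96.6% accuracy on the Google Speech Commands dataset V1.

03. Depthwise separable and temporal convolutions reduce the number of parameters and calculations while maintaining performance.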
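To make the parameter saving concrete, here is a minimal sketch that counts the weights in a standard 1D temporal convolution and in its depthwise separable counterpart. The layer sizes (kernel size 9, 64 input and output channels) are hypothetical values chosen for illustration; they are not taken from the paper, and the function names are assumptions as well.

```python
def conv1d_params(kernel_size, c_in, c_out):
    """Weights in a standard 1D (temporal) convolution, bias omitted."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(kernel_size, c_in, c_out):
    """Weights in a depthwise separable 1D convolution, bias omitted.

    Depthwise step: one length-k filter per input channel.
    Pointwise step: a 1x1 convolution that mixes channels.
    """
    depthwise = kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only (not from the paper).
k, c_in, c_out = 9, 64, 64
standard = conv1d_params(k, c_in, c_out)
separable = separable_conv1d_params(k, c_in, c_out)
print("standard:", standard)
print("separable:", separable)
print("reduction factor:", round(standard / separable, 2))
```

The saving grows with the kernel size and the number of output channels, since the separable form replaces the product k * c_in * c_out with the sum k * c_in + c_in * c_out.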

Abstract

Keyword spotting (KWS) on mobile devices generally requires a small memory footprint. However, most current models still maintain a large number of parameters in order to ensure good performance. In this paper, we propose a temporally pooled attention module which can capture global features better than average pooling. In addition, we design a separable temporal convolution network which leverages depthwise separable and temporal convolution to reduce the number of parameters and calculations. Finally, taking advantage of separable temporal convolution and temporally pooled attention, an efficient neural network (ST-AttNet) is designed for KWS systems. We evaluate the models on the publicly available Google Speech Commands dataset V1. The number of parameters of the proposed model (48K) is 1/6 of that of the state-of-the-art TC-ResNet14-1.5 model (305K). The proposed model achieves a 96.6% accuracy, which…
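The contrast between average pooling and attention-based pooling over time can be sketched as follows. This is a generic softmax attention pooling written in plain Python, offered only as an illustration of the idea; the abstract does not specify the exact form of the paper's temporally pooled attention module, and the scoring vector `w` and both function names are assumptions.

```python
import math


def average_pool(frames):
    """Mean over time of T feature vectors, each of length C."""
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]


def attention_pool(frames, w):
    """Softmax-weighted sum over time.

    frames: list of T feature vectors, each of length C.
    w: hypothetical scoring vector of length C (learned in a real model).
    Returns the pooled vector and the per-frame attention weights.
    """
    # One scalar score per frame: dot product with the scoring vector.
    scores = [sum(x * wi for x, wi in zip(f, w)) for f in frames]
    # Numerically stable softmax over the time axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of frames, channel by channel.
    channels = len(frames[0])
    pooled = [sum(a * f[c] for a, f in zip(weights, frames))
              for c in range(channels)]
    return pooled, weights


# Two frames with two channels each; values are made up for illustration.
frames = [[1.0, 0.0], [3.0, 2.0]]
pooled, weights = attention_pool(frames, w=[1.0, 0.0])
print(average_pool(frames), pooled, weights)
```

With an all-zero scoring vector every frame receives the same weight and the result reduces to average pooling; a non-zero vector lets the pooling emphasize the frames that score highest.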

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling