Efficient keyword spotting using time delay neural networks
Samuel Myer, Vikrant Singh Tomar

TL;DR
This paper introduces a two-stage time delay neural network (TDNN) for live keyword spotting, trained with transfer learning, that lowers false accept and false reject rates in both clean and noisy environments while cutting computation by up to 89%.
Contribution
The paper presents a novel two-stage time delay neural network (TDNN), trained with transfer learning, for efficient and accurate live keyword spotting; it reduces computation by up to 89% relative to recently published work.
Findings
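Significant reductions in false accept and false reject rates compared with previously known techniques.
Up to 89% savings in computational complexity, measured in multiplications per second of audio, compared with recently published work.
Improvements hold in both clean and noisy environments.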
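To make the two error rates in the findings concrete, here is a minimal sketch of how false accept and false reject rates are commonly computed from detector scores at a fixed threshold. The function names, the example scores, and the threshold are illustrative assumptions; this summary does not give the paper's exact metric definitions, and keyword spotting work often reports false accepts per hour of audio instead of as a fraction.

```python
from typing import Sequence


def false_accept_rate(scores: Sequence[float], is_keyword: Sequence[bool],
                      threshold: float) -> float:
    """Fraction of non-keyword examples whose score reaches the threshold."""
    negatives = [s for s, kw in zip(scores, is_keyword) if not kw]
    if not negatives:
        raise ValueError("need at least one non-keyword example")
    return sum(s >= threshold for s in negatives) / len(negatives)


def false_reject_rate(scores: Sequence[float], is_keyword: Sequence[bool],
                      threshold: float) -> float:
    """Fraction of keyword examples whose score falls below the threshold."""
    positives = [s for s, kw in zip(scores, is_keyword) if kw]
    if not positives:
        raise ValueError("need at least one keyword example")
    return sum(s < threshold for s in positives) / len(positives)


# Hypothetical detector scores and ground-truth labels (not from the paper).
example_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
example_labels = [True, True, True, False, False, False]

far = false_accept_rate(example_scores, example_labels, threshold=0.5)
frr = false_reject_rate(example_scores, example_labels, threshold=0.5)
```

Raising the threshold trades false accepts for false rejects, which is why both rates are usually reported together.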
Abstract
This paper describes a novel method of live keyword spotting using a two-stage time delay neural network. The model is trained using transfer learning: initial training with phone targets from a large speech corpus is followed by training with keyword targets from a smaller data set. The accuracy of the system is evaluated on two separate tasks. The first is the freely available Google Speech Commands dataset. The second is an in-house task specifically developed for keyword spotting. The results show significant improvements in false accept and false reject rates in both clean and noisy environments when compared with previously known techniques. Furthermore, we investigate various techniques to reduce computation in terms of multiplications per second of audio. Compared to recently published work, the proposed system provides up to 89% savings on computational complexity.
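The abstract measures cost in multiplications per second of audio. A rough sketch of that accounting follows, treating each TDNN layer as a 1-D convolution over time. The layer sizes, the frame rates, and the use of temporal subsampling as the cost-reduction technique are all hypothetical choices for illustration; the abstract does not say which techniques the authors used, and these numbers are not the paper's.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TdnnLayer:
    """One time-delay layer, viewed as a 1-D convolution over time."""
    input_dim: int             # features per input frame
    context: int               # number of input frames spliced together
    output_dim: int            # units in the layer
    frames_per_second: float   # how often the layer is evaluated

    def multiplies_per_second(self) -> float:
        # Each output frame needs input_dim * context * output_dim multiplies.
        per_frame = self.input_dim * self.context * self.output_dim
        return per_frame * self.frames_per_second


def total_multiplies_per_second(layers: Iterable[TdnnLayer]) -> float:
    return sum(layer.multiplies_per_second() for layer in layers)


def savings_percent(baseline: float, reduced: float) -> float:
    """Relative reduction in cost, as a percentage of the baseline."""
    return 100.0 * (1.0 - reduced / baseline)


# Hypothetical network evaluated at every 10 ms frame (100 frames per second).
dense = [
    TdnnLayer(40, 5, 128, 100.0),
    TdnnLayer(128, 3, 128, 100.0),
    TdnnLayer(128, 3, 128, 100.0),
]

# Same layers, but deeper layers are evaluated less often (assumed subsampling).
subsampled = [
    TdnnLayer(40, 5, 128, 100.0),
    TdnnLayer(128, 3, 128, 50.0),
    TdnnLayer(128, 3, 128, 25.0),
]

dense_cost = total_multiplies_per_second(dense)
subsampled_cost = total_multiplies_per_second(subsampled)
saving = savings_percent(dense_cost, subsampled_cost)
```

The count ignores additions, nonlinearities, and feature extraction, so it is only a first-order comparison between architectures.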
