Text Anchor Based Metric Learning for Small-footprint Keyword Spotting
Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou

TL;DR
This paper introduces a text anchor based metric learning approach and a novel LG-Net model to improve small-footprint keyword spotting accuracy, stability, and long-term feature modeling, achieving state-of-the-art results.
Contribution
It proposes using text anchors for more stable metric learning and designs LG-Net to better capture long-term acoustic features in KWS.
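The text-anchor idea can be illustrated with a minimal sketch. This summary does not give the paper's loss function or text encoder, so the hinge-style triplet loss, the hardest-negative selection, the `margin` value, and the toy 3-dimensional embeddings below are all assumptions for illustration, not the authors' method. The point is only that each keyword has one fixed anchor derived from its text, and speech embeddings are pulled toward the matching anchor and pushed away from the others.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def text_anchor_triplet_loss(speech_emb, label, text_anchors, margin=0.5):
    """Hinge-style triplet loss with one text anchor per keyword (assumed form).

    The positive distance is to the anchor of the true keyword; the negative
    distance is to the closest anchor of any other keyword.
    """
    speech = l2_normalize(speech_emb)
    pos = euclidean(speech, l2_normalize(text_anchors[label]))
    neg = min(
        euclidean(speech, l2_normalize(anchor))
        for keyword, anchor in text_anchors.items()
        if keyword != label
    )
    return max(0.0, pos - neg + margin)


def classify(speech_emb, text_anchors):
    """Predict the keyword whose text anchor is nearest to the speech embedding."""
    speech = l2_normalize(speech_emb)
    return min(
        text_anchors,
        key=lambda keyword: euclidean(speech, l2_normalize(text_anchors[keyword])),
    )


# Hypothetical text anchors; in practice these would come from a text encoder.
text_anchors = {"yes": [1.0, 0.0, 0.0], "no": [0.0, 1.0, 0.0]}

near_yes = [0.9, 0.1, 0.0]  # speech embedding close to the "yes" anchor
near_no = [0.1, 0.9, 0.0]   # speech embedding close to the "no" anchor

loss_good = text_anchor_triplet_loss(near_yes, "yes", text_anchors)
loss_bad = text_anchor_triplet_loss(near_no, "yes", text_anchors)
```

Because the anchors do not depend on any recorded utterance, they stay the same across speakers and acoustic conditions, which is the stability property the contribution refers to.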
Findings
Text anchor based metric learning outperforms speech anchor methods.
LG-Net achieves SOTA accuracy of 97.67% and 96.79% on the two Google Speech Commands datasets (versions 1 and 2).
Lightweight LG-Net with 74k parameters attains high accuracy.
Abstract
Keyword spotting (KWS) still struggles to balance a small footprint against high accuracy. Recently proposed metric learning approaches have improved the generalizability of KWS models, and 1D-CNN based KWS models have achieved state-of-the-art (SOTA) results in terms of model size. However, in metric learning, data limitations make the speech anchor highly susceptible to the acoustic environment and the speaker. We also note that 1D-CNN models have limited capability to capture long-term temporal acoustic features. To address these problems, we propose using text anchors to improve anchor stability. Furthermore, a new type of model (LG-Net) is designed to promote long- and short-term acoustic feature modeling based on 1D-CNN and self-attention. Experiments are conducted on Google Speech Commands Dataset version 1 (GSCDv1) and 2…
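To make the combination of 1D convolution and self-attention concrete, here is a generic sketch of a block that pairs the two. It is not the LG-Net architecture, whose details are not given in this summary. The shared temporal kernel, the identity query/key/value projections, the residual sum, and the names `conv1d_same`, `self_attention`, and `local_global_block` are simplifying assumptions chosen for illustration.

```python
import math


def conv1d_same(frames, kernel):
    """Temporal convolution with zero padding so the output keeps the input length.

    The same 1D kernel is applied to every feature channel (a simplification).
    """
    num_frames, dim = len(frames), len(frames[0])
    pad = len(kernel) // 2
    out = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for j, weight in enumerate(kernel):
            s = t + j - pad
            if 0 <= s < num_frames:  # positions outside the sequence count as zeros
                for d in range(dim):
                    acc[d] += weight * frames[s][d]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for query in frames:
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        )
        out.append(
            [sum(w * value[d] for w, value in zip(weights, frames)) for d in range(dim)]
        )
    return out


def local_global_block(frames, kernel):
    """Short-range features from the convolution plus long-range context from attention."""
    local = conv1d_same(frames, kernel)
    global_ctx = self_attention(local)
    return [
        [l + g for l, g in zip(local_row, global_row)]
        for local_row, global_row in zip(local, global_ctx)
    ]


# Toy input: 4 time frames with 2 features each (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
out = local_global_block(frames, kernel=[0.25, 0.5, 0.25])
```

The convolution only sees a three-frame window here, while every attention output mixes information from all frames, which is the kind of long-term modeling the abstract says plain 1D-CNN models lack.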
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
