Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data

Seunghan Yang; Byeonggeun Kim; Kyuhong Shim; Simyung Chang

arXiv:2309.00647·eess.AS·September 6, 2023

TL;DR

This paper introduces a novel framework for few-shot keyword spotting that leverages automatically annotated auxiliary speech data and multi-task learning to improve performance while maintaining a small model footprint.

Contribution

It proposes a method to utilize unlabeled reading speech data with automatic annotation and filtering, enabling supervision for small FS-KWS models.
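
As a rough illustration of the annotate-and-filter idea, the sketch below assumes word-level timestamps are already available (for example from a forced aligner) and keeps only segments that look keyword-like. The `WordSegment` type, the `build_keyword_like_set` helper, and the duration and count thresholds are all hypothetical; the summary does not state the actual filtering criteria.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSegment:
    """One automatically annotated word (hypothetical structure)."""
    utterance_id: str
    word: str
    start: float  # seconds
    end: float    # seconds


def build_keyword_like_set(segments, min_dur=0.3, max_dur=1.0, min_count=2):
    """Group word segments by word, keeping only keyword-like ones.

    Assumed criteria (not taken from the paper): a segment must last
    between min_dur and max_dur seconds, and its word must occur at
    least min_count times among the segments that pass the duration check.
    """
    in_range = [s for s in segments if min_dur <= s.end - s.start <= max_dur]
    counts = Counter(s.word.lower() for s in in_range)
    kept = {}
    for s in in_range:
        word = s.word.lower()
        if counts[word] >= min_count:
            kept.setdefault(word, []).append(s)
    return kept
```

Each key of the returned dictionary can then serve as a class label, with its segments cropped from the source audio as training examples.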

Findings

1. Significant performance improvements over existing methods.
2. Effective use of auxiliary data enhances model generalization.
3. Multi-task learning boosts representation quality.

Abstract

Few-shot keyword spotting (FS-KWS) models usually require large-scale annotated datasets to generalize to unseen target keywords. However, existing KWS datasets are limited in scale, and gathering keyword-like labeled data is a costly undertaking. To mitigate this issue, we propose a framework that uses easily collectible, unlabeled reading speech data as an auxiliary source. Self-supervised learning has been widely adopted for learning representations from unlabeled data; however, it is known to be suitable for large models with enough capacity and is not practical for training a small footprint FS-KWS model. Instead, we automatically annotate and filter the data to construct a keyword-like dataset, LibriWord, enabling supervision on auxiliary data. We then adopt multi-task learning that helps the model enhance its representation power using out-of-domain auxiliary data. Our method…
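
To make the multi-task idea concrete, here is a minimal sketch of one common formulation: a loss on the keyword data plus a weighted loss on the auxiliary data. Using cross-entropy for both tasks and the `aux_weight` parameter are assumptions made for illustration; the abstract excerpt does not specify the paper's actual objectives or weighting.

```python
import math


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(kws_logits, kws_target, aux_logits, aux_target, aux_weight=0.5):
    """Keyword-task loss plus a weighted auxiliary-task loss.

    aux_weight is a hypothetical hyperparameter that controls how strongly
    the auxiliary (for example LibriWord) supervision influences training.
    """
    kws_loss = cross_entropy(kws_logits, kws_target)
    aux_loss = cross_entropy(aux_logits, aux_target)
    return kws_loss + aux_weight * aux_loss
```

In a real training loop both sets of logits would typically come from a shared encoder with separate heads, so gradients from the auxiliary term also shape the shared representation.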

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling