Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data
Seunghan Yang, Byeonggeun Kim, Kyuhong Shim, Simyung Chang

TL;DR
This paper introduces a novel framework for few-shot keyword spotting that leverages automatically annotated auxiliary speech data and multi-task learning to improve performance while maintaining a small model footprint.
Contribution
It proposes a method to utilize unlabeled reading speech data with automatic annotation and filtering, enabling supervision for small FS-KWS models.
Findings
The proposed framework significantly improves performance over existing methods.
Effective use of auxiliary data enhances model generalization.
Multi-task learning boosts representation quality.
Abstract
Few-shot keyword spotting (FS-KWS) models usually require large-scale annotated datasets to generalize to unseen target keywords. However, existing KWS datasets are limited in scale, and gathering keyword-like labeled data is a costly undertaking. To mitigate this issue, we propose a framework that uses easily collectible, unlabeled reading speech data as an auxiliary source. Self-supervised learning has been widely adopted for learning representations from unlabeled data; however, it is known to be suitable for large models with enough capacity and is not practical for training a small footprint FS-KWS model. Instead, we automatically annotate and filter the data to construct a keyword-like dataset, LibriWord, enabling supervision on auxiliary data. We then adopt multi-task learning that helps the model to enhance the representation power from out-of-domain auxiliary data. Our method…
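The abstract does not say how the annotation and filtering are carried out, so the following is only a minimal sketch of the general idea and not the authors' pipeline. It assumes word-level timestamps already exist for the reading speech (for example, from a forced aligner). The `WordSegment` type, the `build_keyword_like_set` function, and all thresholds are hypothetical names and values chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSegment:
    """One automatically annotated word occurrence (hypothetical structure)."""
    utterance_id: str
    word: str
    start: float  # seconds
    end: float    # seconds


def build_keyword_like_set(segments, min_chars=3, min_dur=0.3,
                           max_dur=1.0, min_count=2):
    """Group word segments into keyword-like classes.

    All thresholds are illustrative assumptions, not values from the paper.
    A segment is kept when its word is long enough, its duration falls in
    [min_dur, max_dur], and the word still has at least min_count surviving
    examples after those two checks.
    """
    kept = [
        s for s in segments
        if len(s.word) >= min_chars and min_dur <= (s.end - s.start) <= max_dur
    ]
    counts = Counter(s.word for s in kept)

    by_word = {}
    for s in kept:
        if counts[s.word] >= min_count:
            by_word.setdefault(s.word, []).append(s)
    return by_word


# Toy input standing in for aligner output.
segments = [
    WordSegment("u1", "keyword", 0.0, 0.5),
    WordSegment("u2", "keyword", 1.0, 1.6),
    WordSegment("u1", "a", 0.5, 0.6),         # too short a word
    WordSegment("u3", "spotting", 0.0, 2.0),  # too long a segment
    WordSegment("u3", "speech", 0.0, 0.4),    # only one example
]
keyword_like = build_keyword_like_set(segments)
```

Each surviving word becomes one class of short audio clips, so a small model can be trained on it with ordinary supervised classification.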
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
