Few-Shot Keyword Spotting from Mixed Speech
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla

TL;DR
This paper explores combining Mix-Training with large-scale SSL pre-training to improve few-shot keyword spotting in mixed speech, and reports that the approach is effective on the LibriSpeech and Google Speech Command corpora.
Contribution
The paper applies Mix-Training to the few-shot setting for mixed-speech keyword spotting and strengthens it with SSL pre-training methods such as HuBERT.
Findings
Mix-Training significantly improves few-shot mixed speech KWS
SSL pre-training with HuBERT enhances detection accuracy
Combined approach achieves strong results across datasets
Abstract
Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting (simultaneously detecting multiple keywords blended in one utterance), which is crucial in real-world applications. Previous research has proposed a Mix-Training (MT) approach to solve the problem; however, it has never been tested in the few-shot scenario. In this paper, we investigate the possibility of using MT and other relevant methods to solve the two practical challenges together: few-shot and mixed speech. Experiments conducted on the LibriSpeech and Google Speech Command corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining…
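The abstract does not spell out how Mix-Training builds its training examples. As a minimal sketch, assuming MT overlays two utterances and trains against a multi-hot keyword target (one positive entry per keyword present in either source), the data construction might look like the following. The function name `mix_utterances`, the `gain_b` parameter, and the toy waveforms are hypothetical and used only for illustration; they are not taken from the paper.

```python
def mix_utterances(wave_a, wave_b, labels_a, labels_b, num_keywords, gain_b=1.0):
    """Overlay two waveforms and build a multi-hot keyword target.

    Assumption: this mirrors the general idea of training on mixed speech,
    not the authors' exact recipe.
    """
    length = max(len(wave_a), len(wave_b))
    # Zero-pad the shorter waveform so both cover the same time span.
    pad_a = list(wave_a) + [0.0] * (length - len(wave_a))
    pad_b = list(wave_b) + [0.0] * (length - len(wave_b))
    # Sample-wise sum; gain_b scales the second speaker (hypothetical knob).
    mixed = [a + gain_b * b for a, b in zip(pad_a, pad_b)]
    # Multi-hot target: every keyword in either source is marked positive.
    target = [0.0] * num_keywords
    for k in set(labels_a) | set(labels_b):
        target[k] = 1.0
    return mixed, target


# Toy usage: utterance A contains keyword 0, utterance B contains keyword 2.
wave_a = [0.1, -0.2, 0.3]
wave_b = [0.05, 0.05]
mixed, target = mix_utterances(wave_a, wave_b, [0], [2], num_keywords=4)
print(target)
```

A multi-hot target paired with a per-keyword sigmoid output lets the model flag several keywords in the same utterance, which a single softmax over keywords cannot do.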
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
