Fully Unsupervised Training of Few-shot Keyword Spotting
Dongjune Lee, Minchan Kim, Sung Hwan Mun, Min Hyun Han, Nam Soo Kim

TL;DR
This paper introduces an unsupervised, synthetic-data-based approach for few-shot keyword spotting that leverages metric learning and speech synthesis to eliminate the need for labeled datasets.
Contribution
It presents a fully unsupervised FS-KWS system trained solely on synthetic speech data using metric learning and speech synthesis with pseudo phonemes.
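To make the metric-learning idea concrete, here is a minimal sketch of a multi-view contrastive objective. It assumes an InfoNCE-style loss over cosine similarities; the source does not state which loss the authors use, so the function name `multiview_contrastive_loss`, the `temperature` parameter, and the toy embeddings are all hypothetical. Two synthetic utterances generated from the same pseudo-phoneme sequence are treated as a positive pair, and every other pairing in the batch as a negative, so no labels are needed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multiview_contrastive_loss(views_a, views_b, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated objective).

    views_a[i] and views_b[i] are embeddings of two synthetic views of the
    same pseudo-phoneme sequence; views_b[j] with j != i act as negatives.
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching view as the correct class.
        total += log_denominator - logits[i]
    return total / n

# Toy embeddings (hypothetical): matched views should score a lower loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = multiview_contrastive_loss(anchors, matched)
loss_swapped = multiview_contrastive_loss(anchors, swapped)
```

The loss is small when each view sits closest to its own counterpart and large when the pairing is wrong, which is the signal that pulls same-semantics samples together in the embedding space.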
Findings
Competitive performance on real datasets without labeled data
Effective use of synthetic multi-view samples for training
Elimination of the need for large labeled datasets
Abstract
For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset covering a massive number of target keywords has been known to be essential for generalizing to arbitrary target keywords with only a few enrollment samples. To alleviate the cost of data collection and labeling, in this paper we propose a novel FS-KWS system trained only on synthetic data. The proposed system is based on metric learning, which enables target keywords to be detected using distance metrics. By exploiting a speech synthesis model that generates speech from pseudo phonemes instead of text, we easily obtain a large collection of multi-view samples with the same semantics. These samples are sufficient for training, given that metric learning does not intrinsically require labeled data. None of the components in our framework requires any supervision, making our method unsupervised. Experimental results on…
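The distance-based detection described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes cosine distance, a prototype formed by averaging the enrollment embeddings, and a fixed threshold. The names `detect_keyword` and `threshold`, along with the example vectors, are hypothetical; in practice the embeddings would come from the trained encoder.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity; 0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def detect_keyword(query_embedding, enrollment_embeddings, threshold=0.3):
    """Accept the query if it lies close to the enrollment prototype.

    The averaging step and the threshold value are assumptions made
    for this sketch; the source does not specify either.
    """
    prototype = mean_vector(enrollment_embeddings)
    distance = cosine_distance(query_embedding, prototype)
    return distance <= threshold, distance

# A few enrollment samples of one keyword (hypothetical embeddings).
enrollment = [[1.0, 0.1], [0.9, 0.0]]
accepted, near_distance = detect_keyword([1.0, 0.05], enrollment)
rejected, far_distance = detect_keyword([0.0, 1.0], enrollment)
```

Because detection reduces to a distance comparison, enrolling a new keyword only requires embedding a handful of samples; the encoder itself is not retrained.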
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing
