Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models
Alican Gok, Oguzhan Buyuksolak, Osman Erman Okman, Murat Saraclar

TL;DR
This paper introduces a novel training scheme using self-supervised speech models and knowledge distillation to significantly improve few-shot keyword spotting accuracy on edge devices.
Contribution
It proposes a new training approach leveraging Wav2Vec 2.0 and attention-based dimensionality reduction for enhanced FS-KWS performance.
Findings
10-shot classification accuracy improved from 33.4% to 74.1% on the Google Speech Commands (GSC) dataset
Enhanced inter-class separability and intra-class compactness with Sub-center ArcFace loss
Effective deployment on resource-constrained edge devices
Abstract
Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. Few-Shot Keyword Spotting (FS-KWS) addresses the scalability and adaptability challenges of traditional systems by enabling recognition of custom keywords with only a few examples. However, existing FS-KWS systems achieve subpar accuracy at desirable false acceptance rates, particularly in resource-constrained edge environments. To address these issues, we propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. The teacher model, based on Wav2Vec 2.0, is trained using Sub-center ArcFace loss, which enhances inter-class separability and intra-class compactness. To enable efficient deployment on edge devices, we introduce attention-based dimensionality reduction and train a standard…
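To make the loss named in the abstract concrete, here is a minimal NumPy sketch of the general Sub-center ArcFace formulation: each class keeps several sub-centers, the cosine to the closest sub-center is taken per class, and an additive angular margin is applied to the target class only. This is not the authors' code. The function names, the tensor shapes, and the scale and margin values (s=30.0, m=0.5) are illustrative assumptions, not settings reported in the paper.

```python
import numpy as np

def subcenter_arcface_logits(emb, weights, labels, s=30.0, m=0.5):
    """Hypothetical helper: Sub-center ArcFace logits.

    emb:     (B, D) embeddings
    weights: (C, K, D) K sub-centers for each of C classes
    labels:  (B,) integer class ids
    s, m:    scale and angular margin (assumed values, not from the paper)
    """
    # L2-normalise so dot products are cosines.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=2, keepdims=True)

    # Cosine to every sub-center, then keep the closest one per class.
    cos_all = np.einsum("bd,ckd->bck", emb, w)      # (B, C, K)
    cos = np.clip(cos_all.max(axis=2), -1.0, 1.0)   # (B, C)

    # Add the angular margin m to the target-class angle only.
    # Note: this sketch does not guard the case theta + m > pi,
    # which production implementations usually handle explicitly.
    theta = np.arccos(cos)
    onehot = np.eye(w.shape[0])[labels]             # (B, C)
    return s * np.cos(theta + m * onehot)

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))         # 4 utterance embeddings
    centers = rng.normal(size=(5, 3, 8))  # 5 keywords, 3 sub-centers each
    labels = np.array([0, 1, 2, 3])
    loss = cross_entropy(subcenter_arcface_logits(emb, centers, labels), labels)
    print(f"loss = {loss:.4f}")
```

The margin makes the target class harder to satisfy during training, which pushes embeddings of the same keyword together and those of different keywords apart; the extra sub-centers give noisy or atypical samples somewhere to attach without distorting the dominant center.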
Taxonomy
Methods: Additive Angular Margin Loss
