Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining
Holger Severin Bovbjerg, Zheng-Hua Tan

TL;DR
This paper shows that self-supervised pretraining with Data2Vec improves the accuracy of small keyword spotting models when labelled data is scarce.
Contribution
It shows that self-supervised learning with Data2Vec enhances the performance of small KWS models in low-label conditions, a setting that earlier SSL work on speech, focused mainly on large models, had not studied well.
Findings
Data2Vec pretraining improves accuracy by 8.22% to 11.18% in label-deficient scenarios.
Self-supervised learning benefits small KWS models more than large models in low-label settings.
Pretraining with Data2Vec is effective for compact transformer-based KWS models.
Abstract
Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data. Most SSL methods for speech have primarily been studied for large models, whereas this is not ideal, as compact KWS models are generally required. This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce. We pretrain three compact transformer-based KWS models using Data2Vec, and fine-tune them on a label-deficient setup of the Google Speech Commands data set. It is found that…
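As a rough illustration of the pretraining idea the abstract refers to, the sketch below shows two ingredients commonly associated with Data2Vec-style training: a teacher whose weights are an exponential moving average (EMA) of the student's weights, and a regression loss computed only at masked time steps. This is a minimal sketch in plain Python, not the authors' implementation. The function names, the use of squared error, and all numeric values are assumptions made for illustration.

```python
# Sketch of two Data2Vec-style ingredients in plain Python (no deep learning framework).
# All names and numeric values are illustrative assumptions, not taken from the paper.

def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student weight: t <- tau*t + (1 - tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def masked_regression_loss(predictions, targets, mask):
    """Mean squared error between student predictions and teacher targets,
    averaged over masked time steps only (squared error is an assumption here)."""
    total, count = 0.0, 0
    for pred, target, is_masked in zip(predictions, targets, mask):
        if not is_masked:
            continue  # unmasked steps do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        count += 1
    return total / count if count else 0.0


# Toy weights: the teacher drifts halfway toward the student when tau = 0.5.
teacher = ema_update([0.0, 1.0], [1.0, 3.0], tau=0.5)

# Toy sequence of three time steps with two features each; steps 0 and 2 are masked.
preds = [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
targs = [[1.0, 3.0], [5.0, 5.0], [2.0, 0.0]]
mask = [True, False, True]
loss = masked_regression_loss(preds, targs, mask)
```

In an actual pretraining run the teacher would process the unmasked input and the student the masked input, with tau typically kept close to 1 so the teacher changes slowly. Those details are omitted here.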
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Residual Connection
