Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining

Holger Severin Bovbjerg; Zheng-Hua Tan

arXiv:2210.01703 · cs.SD · May 25, 2023

TL;DR

This paper demonstrates that self-supervised pretraining with Data2Vec significantly improves the accuracy of small keyword spotting models in label-scarce scenarios, making them more effective without extensive labeled data.

Contribution

It shows that self-supervised pretraining with Data2Vec improves the performance of small KWS models when labelled data is scarce. Prior SSL work on speech has focused mostly on large models, so this setting had not been well studied.

Findings

01. Data2Vec pretraining improves accuracy by 8.22% to 11.18% (absolute) in label-deficient scenarios.

02. Self-supervised learning benefits small KWS models more than large models in low-label settings.

03. Pretraining with Data2Vec is effective for compact transformer-based KWS models.

Abstract

Keyword Spotting (KWS) models are becoming increasingly integrated into various systems, e.g. voice assistants. To achieve satisfactory performance, these models typically rely on a large amount of labelled data, limiting their applications only to situations where such data is available. Self-supervised Learning (SSL) methods can mitigate such a reliance by leveraging readily-available unlabelled data. Most SSL methods for speech have primarily been studied for large models, whereas this is not ideal, as compact KWS models are generally required. This paper explores the effectiveness of SSL on small models for KWS and establishes that SSL can enhance the performance of small KWS models when labelled data is scarce. We pretrain three compact transformer-based KWS models using Data2Vec, and fine-tune them on a label-deficient setup of the Google Speech Commands data set. It is found that…
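The label-deficient setup mentioned in the abstract can be pictured as keeping labels for only a small part of the training data and treating the rest as unlabelled audio for pretraining. The sketch below is illustrative only: the function name `split_label_deficient`, the `labelled_fraction` value, and the clip identifiers are assumptions made for this example and are not taken from the paper.

```python
import random


def split_label_deficient(examples, labelled_fraction=0.2, seed=0):
    """Split (audio_id, label) pairs into a small labelled set and an unlabelled pool.

    labelled_fraction is a hypothetical value chosen for illustration,
    not a setting reported in the paper.
    """
    if not 0.0 < labelled_fraction < 1.0:
        raise ValueError("labelled_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n_labelled = max(1, int(len(shuffled) * labelled_fraction))

    # Labelled subset: used for supervised fine-tuning.
    labelled = shuffled[:n_labelled]
    # Remainder: labels are dropped, only the audio is kept for pretraining.
    unlabelled = [audio_id for audio_id, _ in shuffled[n_labelled:]]
    return labelled, unlabelled


# Toy stand-in for a keyword data set: 100 clips with two keyword classes.
examples = [(f"clip_{i:03d}", "yes" if i % 2 else "no") for i in range(100)]
labelled, unlabelled = split_label_deficient(examples)
print(len(labelled), len(unlabelled))
```

In such a setup, the unlabelled pool would feed self-supervised pretraining and the labelled subset would be used for fine-tuning afterwards.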

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization · Residual Connection