Self-supervised speech representation learning for keyword-spotting with light-weight transformers

Chenyang Gao, Yue Gu, Francesco Caliva, and Yuzong Liu

arXiv:2303.04255 · cs.SD · March 9, 2023 · 1 citation

TL;DR

This paper shows that self-supervised speech representation learning (S3RL) with light-weight transformers of 330k parameters improves keyword spotting over training from scratch, which makes S3RL practical for memory- and compute-constrained devices.

Contribution

The study introduces a mechanism to enhance utterance-wise distinction in lightweight transformers for S3RL, improving keyword-spotting performance on constrained hardware.

Findings

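1. On the Google Speech Commands v2 dataset, the proposed method applied to Auto-Regressive Predictive Coding improved accuracy by 1.2% compared to training from scratch.

2. On an in-house keyword-spotting dataset with four keywords, it gave a 6% to 23.7% relative false accept improvement at a fixed false reject rate.

3. The results support S3RL as effective for light-weight models in resource-limited settings.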
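The "relative" improvement in finding 2 is measured against the baseline's own false accept count, not in absolute percentage points. A minimal sketch of that arithmetic follows; the function name and the counts are hypothetical and are not taken from the paper, and both systems are assumed to have been thresholded to the same false reject rate beforehand.

```python
def relative_fa_improvement(baseline_fa: float, candidate_fa: float) -> float:
    """Fraction by which false accepts drop relative to the baseline.

    Both counts are assumed to be measured at the same false reject rate.
    """
    if baseline_fa <= 0:
        raise ValueError("baseline_fa must be positive")
    return (baseline_fa - candidate_fa) / baseline_fa


# Hypothetical false accept counts (illustration only, not from the paper).
baseline_fa = 200.0    # model trained from scratch
pretrained_fa = 170.0  # model with self-supervised pretraining

improvement = relative_fa_improvement(baseline_fa, pretrained_fa)
print(f"Relative false accept improvement: {improvement:.1%}")
```

With these made-up counts, 30 fewer false accepts out of 200 is a 15% relative improvement.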

Abstract

Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL-related studies typically use large models, we employ light-weight networks to comply with the tight memory of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters and propose a mechanism to enhance utterance-wise distinction, which proves crucial for improving performance on classification tasks. On the Google Speech Commands v2 dataset, the proposed method applied to the Auto-Regressive Predictive Coding S3RL led to a 1.2% accuracy improvement compared to training from scratch. On an in-house KS dataset with four different keywords, it provided 6% to 23.7% relative false accept improvement at fixed false reject rate. We argue this demonstrates the…
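For readers unfamiliar with the Auto-Regressive Predictive Coding objective named in the abstract, the sketch below shows its usual form: a model reads frames up to time t and predicts the frame a fixed number of steps ahead, scored with an L1 distance. This is an assumption about the general technique, not the authors' code. The function name, the `shift` parameter, and the toy frames are hypothetical, and the paper's utterance-wise distinction mechanism is not represented here.

```python
def apc_l1_loss(frames, predictions, shift):
    """Mean absolute error between predictions[t] and frames[t + shift].

    frames:      list of feature vectors, one per time step
    predictions: model outputs, one per time step, same shape as frames
    shift:       how many steps ahead each prediction targets (assumed >= 1)
    """
    if not 1 <= shift < len(frames):
        raise ValueError("shift must be in [1, len(frames) - 1]")
    total, count = 0.0, 0
    # Only time steps that still have a target `shift` frames ahead contribute.
    for t in range(len(frames) - shift):
        target = frames[t + shift]
        for predicted, actual in zip(predictions[t], target):
            total += abs(predicted - actual)
            count += 1
    return total / count


# Toy 2-dimensional "frames" (illustration only).
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

# A trivial stand-in predictor that just copies the current frame.
predictions = [list(frame) for frame in frames]

print(apc_l1_loss(frames, predictions, shift=1))
```

In a real setup the predictions would come from the light-weight transformer, and the loss would be minimized on unlabeled audio before fine-tuning on the keyword-spotting task.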

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing