Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology

Weinan Dai; Yifeng Jiang; Yuanjing Liu; Jinkun Chen; Xin Sun; Jinglei Tao

arXiv:2409.00356·cs.SD·September 4, 2024

Open Access

TL;DR

This paper introduces a novel unsupervised contrastive learning method with augmentation techniques for keyword spotting in speech, reducing labeled data needs and improving robustness.

Contribution

It proposes a contrastive learning framework combined with speech augmentation and a compressed convolutional architecture for improved unsupervised keyword spotting.

Findings

01. Achieves strong performance on Google Speech Commands V2 Dataset.

02. Reduces reliance on labeled data for keyword spotting.

03. Enhances robustness to variations in speech speed and volume.

Abstract

This paper addresses the persistent challenge in Keyword Spotting (KWS), a fundamental component in speech technology, regarding the acquisition of substantial labeled data for training. Given the difficulty in obtaining large quantities of positive samples and the laborious process of collecting new target samples when the keyword changes, we introduce a novel approach combining unsupervised contrastive learning and a unique augmentation-based technique. Our method allows the neural network to train on unlabeled data sets, potentially improving performance in downstream tasks with limited labeled data sets. We also propose that similar high-level feature representations should be employed for speech utterances with the same keyword despite variations in speed or volume. To achieve this, we present a speech augmentation-based unsupervised learning method that utilizes the similarity…
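The idea in the abstract, that two versions of the same utterance differing only in speed or volume should map to similar representations, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the helper names (`change_volume`, `change_speed`, `toy_encoder`, `info_nce`), the linear-interpolation resampling, the FFT-based placeholder encoder, and the InfoNCE-style loss are stand-ins, since the excerpt above does not specify the paper's actual augmentation parameters, network, or objective.

```python
import numpy as np

def change_volume(wave, gain):
    """Scale amplitude by a constant gain and clip to [-1, 1]."""
    return np.clip(wave * gain, -1.0, 1.0)

def change_speed(wave, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the clip."""
    n_out = max(1, int(round(len(wave) / rate)))
    src_positions = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(src_positions, np.arange(len(wave)), wave)

def toy_encoder(wave, n_bins=16):
    """Hypothetical placeholder encoder (not the paper's network):
    pooled log-magnitude spectrum, L2-normalised."""
    spectrum = np.abs(np.fft.rfft(wave))
    chunks = np.array_split(spectrum, n_bins)
    feats = np.log1p(np.array([c.mean() for c in chunks]))
    return feats / (np.linalg.norm(feats) + 1e-8)

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: row i of z_a is the positive for row i of z_b,
    and all other rows of z_b act as negatives."""
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch of unlabeled "utterances": random noise standing in for 1 s of 16 kHz audio.
rng = np.random.default_rng(0)
batch = [rng.uniform(-0.5, 0.5, 16000) for _ in range(4)]

# Two augmented views of each utterance: one volume-scaled, one sped up.
view_a = np.stack([toy_encoder(change_volume(w, 0.8)) for w in batch])
view_b = np.stack([toy_encoder(change_speed(w, 1.1)) for w in batch])

loss = info_nce(view_a, view_b)
print(f"contrastive loss on toy batch: {loss:.4f}")
```

In a real setup the placeholder encoder would be a trainable network and the loss would be minimised over unlabeled audio, pulling the two views of each utterance together while pushing apart views of different utterances. No labels are used at any point in the sketch, which is the property the abstract emphasises.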

Taxonomy

Topics: Speech and dialogue systems · Advanced Text Analysis Techniques

Methods: Focus · Contrastive Learning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings