Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology
Weinan Dai, Yifeng Jiang, Yuanjing Liu, Jinkun Chen, Xin Sun, Jinglei Tao

TL;DR
This paper introduces a novel unsupervised contrastive learning method with augmentation techniques for keyword spotting in speech, reducing labeled data needs and improving robustness.
Contribution
It proposes a contrastive learning framework combined with speech augmentation and a compressed convolutional architecture for improved unsupervised keyword spotting.
Findings
Achieves strong performance on the Google Speech Commands V2 dataset.
Reduces reliance on labeled data for keyword spotting.
Enhances robustness to variations in speech speed and volume.
Abstract
This paper addresses the persistent challenge in Keyword Spotting (KWS), a fundamental component in speech technology, regarding the acquisition of substantial labeled data for training. Given the difficulty in obtaining large quantities of positive samples and the laborious process of collecting new target samples when the keyword changes, we introduce a novel approach combining unsupervised contrastive learning and a unique augmentation-based technique. Our method allows the neural network to train on unlabeled data sets, potentially improving performance in downstream tasks with limited labeled data sets. We also propose that similar high-level feature representations should be employed for speech utterances with the same keyword despite variations in speed or volume. To achieve this, we present a speech augmentation-based unsupervised learning method that utilizes the similarity…
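The abstract is cut off before it specifies the training objective, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an InfoNCE-style contrastive loss over cosine similarities, which is a common choice and may differ from the paper's loss. The helpers `change_volume`, `change_speed`, `toy_encoder`, and `info_nce` are hypothetical names introduced for illustration, and `toy_encoder` is a hand-written stand-in for the compressed convolutional network.

```python
import math
import random


def change_volume(wave, gain):
    """Volume augmentation: scale every sample by a constant gain."""
    return [gain * x for x in wave]


def change_speed(wave, rate):
    """Speed augmentation by linear-interpolation resampling.

    rate > 1 shortens the signal (faster speech), rate < 1 lengthens it.
    """
    n_out = max(2, int(len(wave) / rate))
    out = []
    for i in range(n_out):
        pos = i * (len(wave) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append((1 - frac) * wave[lo] + frac * wave[hi])
    return out


def toy_encoder(wave, n_bins=8):
    """Hypothetical stand-in for the network (assumes len(wave) >= n_bins).

    Returns the mean absolute amplitude in n_bins equal time slices.
    """
    size = len(wave) / n_bins
    feats = []
    for b in range(n_bins):
        start = int(b * size)
        end = max(start + 1, int((b + 1) * size))
        chunk = wave[start:end]
        feats.append(sum(abs(x) for x in chunk) / len(chunk))
    return feats


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """Assumed loss: positives[i] is the augmented view of anchors[i];
    every other row in the batch acts as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    random.seed(0)
    # Unlabeled batch of synthetic "utterances" (random noise, for shape only).
    batch = [[random.uniform(-1, 1) for _ in range(400)] for _ in range(4)]
    augmented = [change_volume(change_speed(w, 1.2), 0.6) for w in batch]
    z = [toy_encoder(w) for w in batch]
    z_aug = [toy_encoder(w) for w in augmented]
    print("contrastive loss:", info_nce(z, z_aug))
```

In a real system the loss would be minimized with respect to the encoder's parameters, pulling an utterance and its speed- or volume-shifted copy toward similar representations without using any keyword labels.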
Taxonomy
Topics: Speech and dialogue systems · Advanced Text Analysis Techniques
Methods: Focus · Contrastive Learning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
