Contrastive Speech Mixup for Low-resource Keyword Spotting
Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Eng Siong Chng, Bin Ma

TL;DR
This paper introduces CosMix, a contrastive speech mixup technique that adds an auxiliary contrastive loss to mixup augmentation so that low-resource keyword spotting models learn better speech representations. It is especially effective when training data is minimal.
Contribution
The paper proposes a novel contrastive speech mixup (CosMix) method that combines mixup augmentation with contrastive loss to improve low-resource keyword spotting performance.
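The mixup step that CosMix builds on can be sketched as follows. This is a minimal illustration of standard mixup applied to two waveforms and their one-hot labels, not the authors' implementation; the function name `mixup_pair`, the Beta-distribution parameter `alpha=0.5`, and the use of raw waveforms rather than spectral features are all assumptions made for the example.

```python
import numpy as np

def mixup_pair(x_a, x_b, y_a, y_b, alpha=0.5, rng=None):
    """Blend two equal-length waveforms and their one-hot labels (standard mixup).

    alpha is a hypothetical choice for the Beta(alpha, alpha) distribution
    from which the mixing coefficient is drawn.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    x_mix = lam * x_a + (1.0 - lam) * x_b   # "noisy mixed" utterance
    y_mix = lam * y_a + (1.0 - lam) * y_b   # soft label with the same weights
    return x_mix, y_mix, lam
```

The clean inputs `x_a` and `x_b` are the "pre-mixed" views that the contrastive term later compares against `x_mix`.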
Findings
Improves performance consistently across the models evaluated.
Remains effective with as little as 2.5 minutes of training data per keyword.
Enhances speech representations in low-resource scenarios.
Abstract
Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss to the existing mixup augmentation technique to maximize the relative similarity between the original pre-mixed samples and the augmented samples. The goal is to inject enhancing constraints to guide the model towards simpler but richer content-based speech representations from two augmented views (i.e. noisy mixed and clean pre-mixed utterances). We conduct our experiments on the Google Speech Command dataset, where we trim the size…
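The abstract describes the auxiliary loss only at a high level, so the following is a hedged sketch rather than the paper's formulation. It swaps in an InfoNCE-style objective in which each mixed-utterance embedding is pulled toward the embeddings of its two clean source utterances, weighted by the mixing coefficient. The function name `mix_contrastive_loss`, the `temperature` value, and the lambda-weighted positives are assumptions introduced for illustration.

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def mix_contrastive_loss(z_mix, z_clean, idx_a, idx_b, lam, temperature=0.1):
    """Illustrative contrastive loss between mixed and pre-mixed views.

    z_mix:   (N, D) embeddings of the mixed utterances
    z_clean: (M, D) embeddings of the clean pre-mixed utterances
    idx_a, idx_b: (N,) row indices into z_clean of each mix's two sources
    lam:     (N,) mixing coefficients used to build each mixed utterance
    """
    # Cosine similarity of every mixed view to every clean view.
    sim = l2_normalize(z_mix) @ l2_normalize(z_clean).T / temperature
    logp = log_softmax(sim)
    rows = np.arange(len(z_mix))
    # Both sources are positives, weighted like the mixup labels.
    loss = -(lam * logp[rows, idx_a] + (1.0 - lam) * logp[rows, idx_b])
    return float(loss.mean())
```

In training, a term like this would be added to the usual classification loss on the mixed samples; the relative weighting of the two terms is not stated in this summary.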
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
