Speech Augmentation Based Unsupervised Learning for Keyword Spotting
Jian Luo, Jianzong Wang, Ning Cheng, Haobin Tang, Jing Xiao

TL;DR
This paper presents a speech augmentation based unsupervised learning approach for keyword spotting, utilizing a CNN-Attention model and augmentation techniques to improve robustness and accuracy without relying heavily on labeled data.
Contribution
It introduces an unsupervised learning method with speech augmentation for KWS, combining a CNN-Attention architecture with a similarity-based loss to improve performance.
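To make the similarity-based loss concrete, here is a minimal sketch, not the authors' implementation. It assumes mean squared error as the distance for both the similarity term (original vs. augmented features) and the reconstruction term mentioned in the abstract; the function names and the weights `alpha` and `beta` are hypothetical and do not come from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def unsupervised_loss(feat_orig, feat_aug, recon, target, alpha=1.0, beta=1.0):
    """Weighted sum of a similarity term and a reconstruction term.

    feat_orig / feat_aug: encoder features of the original and augmented clip.
    recon / target: reconstructed audio features and the features they should match.
    alpha / beta: hypothetical weights (assumption, not values from the paper).
    """
    similarity_term = mse(feat_orig, feat_aug)   # pull the two views together
    reconstruction_term = mse(recon, target)     # keep the audio content recoverable
    return alpha * similarity_term + beta * reconstruction_term
```

With identical features and a perfect reconstruction the loss is zero; it grows as the augmented view drifts from the original or as the reconstruction degrades.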
Findings
The CNN-Attention model achieves competitive results on Google Speech Commands V2.
Augmentation-based unsupervised learning improves KWS accuracy.
The proposed method outperforms other unsupervised methods such as CPC, APC, and MPC.
Abstract
In this paper, we investigated a speech augmentation based unsupervised learning approach for the keyword spotting (KWS) task. KWS is a useful speech application, yet it depends heavily on labeled data. We designed a CNN-Attention architecture to conduct the KWS task. CNN layers focus on local acoustic features, and attention layers model long-range dependencies. To improve the robustness of the KWS model, we also proposed an unsupervised learning method. The unsupervised loss is based on the similarity between the original and augmented speech features, as well as audio reconstruction information. Two speech augmentation methods are explored in the unsupervised learning: speed and intensity. The experiments on the Google Speech Commands V2 dataset demonstrated that our CNN-Attention model achieves competitive results. Moreover, the augmentation based unsupervised learning could further…
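The two augmentations named in the abstract, speed and intensity, can be illustrated with a small sketch. This is an assumption-laden toy version, not the paper's pipeline: speed is changed by plain linear-interpolation resampling (which also shifts pitch), intensity by a clipped gain, and the function names, the 16 kHz sample rate, and the example factors are all hypothetical.

```python
import math


def change_intensity(samples, gain):
    """Scale amplitude by `gain`, clipping the result to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the clip."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / rate)))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * rate                 # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# Toy input: one second of a 440 Hz tone at an assumed 16 kHz sample rate.
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

faster = change_speed(tone, 1.25)      # hypothetical speed factor
louder = change_intensity(tone, 1.5)   # hypothetical gain
```

In an augmentation-based setup like the one described, the original clip and one of these perturbed copies would both be passed through the encoder, and their features compared by the unsupervised loss.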
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
