Speech Augmentation Based Unsupervised Learning for Keyword Spotting

Jian Luo; Jianzong Wang; Ning Cheng; Haobin Tang; Jing Xiao

arXiv:2205.14329 · cs.SD · May 31, 2022

TL;DR

This paper presents a speech augmentation based unsupervised learning approach for keyword spotting, utilizing a CNN-Attention model and augmentation techniques to improve robustness and accuracy without relying heavily on labeled data.

Contribution

It introduces an unsupervised learning method with speech augmentation for KWS, combining CNN-Attention architecture and similarity-based loss to enhance performance.
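As a rough illustration of the similarity-based loss mentioned above, here is a minimal sketch. It assumes the encoder features of the original and augmented utterances have already been aligned to the same shape, uses mean squared distance as the similarity measure, and adds the reconstruction term described in the abstract with a hypothetical weight `recon_weight`. None of these choices are confirmed by this page; the paper's exact formulation may differ.

```python
import numpy as np


def similarity_loss(feat_orig: np.ndarray, feat_aug: np.ndarray) -> float:
    """Mean squared distance between original and augmented features.

    Assumption: both arrays have the same shape (frames x feature_dim).
    """
    return float(np.mean((feat_orig - feat_aug) ** 2))


def reconstruction_loss(reconstructed: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between reconstructed and target audio features."""
    return float(np.mean((reconstructed - target) ** 2))


def unsupervised_loss(
    feat_orig: np.ndarray,
    feat_aug: np.ndarray,
    reconstructed: np.ndarray,
    target: np.ndarray,
    recon_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Similarity term plus weighted reconstruction term."""
    return similarity_loss(feat_orig, feat_aug) + recon_weight * reconstruction_loss(
        reconstructed, target
    )
```

With identical features and a perfect reconstruction the loss is zero; it grows as the augmented features drift from the original ones or as the reconstruction degrades.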

Findings

01 The CNN-Attention model achieves competitive results on Google Speech Commands V2.

02 Augmentation-based unsupervised learning improves KWS accuracy.

03 The approach outperforms other unsupervised methods such as CPC, APC, and MPC.

Abstract

In this paper, we investigate a speech augmentation based unsupervised learning approach for the keyword spotting (KWS) task. KWS is a useful speech application, yet it depends heavily on labeled data. We design a CNN-Attention architecture for the KWS task: the CNN layers focus on local acoustic features, and the attention layers model long-term dependencies. To improve the robustness of the KWS model, we also propose an unsupervised learning method. The unsupervised loss is based on the similarity between the original and augmented speech features, as well as on audio reconstruction information. Two speech augmentation methods are explored in the unsupervised learning: speed and intensity. Experiments on the Google Speech Commands V2 dataset demonstrate that our CNN-Attention model achieves competitive results. Moreover, the augmentation based unsupervised learning could further…
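The two augmentations named in the abstract, speed and intensity, can be sketched on a raw waveform as follows. This is an illustrative assumption rather than the authors' implementation: speed is changed by linear-interpolation resampling, intensity by a simple gain with clipping, and the function names and the [-1, 1] sample range are hypothetical.

```python
import numpy as np


def change_speed(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform by linear interpolation.

    factor > 1 shortens the signal (faster speech); factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(wave) / factor)))
    # Positions in the original signal to sample for each output index.
    src_pos = np.linspace(0.0, len(wave) - 1, num=n_out)
    return np.interp(src_pos, np.arange(len(wave)), wave)


def change_intensity(wave: np.ndarray, gain: float) -> np.ndarray:
    """Scale amplitude by `gain`, clipping to the assumed [-1, 1] range."""
    return np.clip(wave * gain, -1.0, 1.0)


# One second of a 440 Hz tone at an assumed 16 kHz sampling rate.
sample_rate = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
faster = change_speed(wave, 1.1)
quieter = change_intensity(wave, 0.5)
```

Interpolation-based resampling changes pitch along with duration; a production pipeline would more likely use a dedicated audio library, but the sketch is enough to produce original and augmented pairs for a similarity objective.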

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
