Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data

Chen Chen; Nana Hou; Yuchen Hu; Shashank Shirol; Eng Siong Chng

arXiv:2203.15321 · cs.SD · March 30, 2022

TL;DR

This paper introduces an approach to noise-robust speech recognition that uses only 10 minutes of unparalleled in-domain noisy speech to train a generative adversarial network, which simulates additional noisy spectra from clean speech and thereby improves recognition performance.
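Simu-GAN operates on spectra rather than raw waveforms. This page does not say which spectral features the paper uses, so the sketch below assumes log-magnitude short-time Fourier transform (STFT) frames with a Hann window, computed with NumPy. The 16 kHz sample rate, 400-sample window, 160-sample hop, and the function name `log_magnitude_spectrum` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def log_magnitude_spectrum(wave, n_fft=400, hop=160, eps=1e-8):
    """Frame a mono waveform, apply a Hann window, and return log |STFT|.

    Output shape: (num_frames, n_fft // 2 + 1). The defaults assume 16 kHz
    audio with 25 ms windows and 10 ms hops (assumed, not from the paper).
    """
    if len(wave) < n_fft:
        raise ValueError("waveform shorter than one analysis window")
    num_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    # Index matrix that gathers overlapping frames into shape (num_frames, n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = wave[idx] * window
    # The real FFT keeps the n_fft // 2 + 1 non-negative frequency bins.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + eps)

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
feats = log_magnitude_spectrum(clean)
print(feats.shape)  # (98, 201)
```

With a 40 Hz bin spacing (16000 / 400), the 440 Hz tone peaks in bin 11. A clean-to-noisy generator would consume and emit frames shaped like `feats`.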

Contribution

The paper proposes Simu-GAN, a GAN-based method that generates noisy speech spectra from clean speech spectra using only 10 minutes of unparalleled in-domain noisy data, and a dual-path recognition system that improves robustness under noisy conditions.
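Training on unparalleled data means clean inputs and in-domain noisy targets are never paired frame by frame. The generator only has to make its outputs indistinguishable, in distribution, from the pool of real noisy spectra. The sketch below computes the losses for one adversarial step under loudly stated assumptions: a least-squares GAN objective, a toy per-bin affine generator, a linear discriminator, and random arrays standing in for real spectra. None of these choices or names (`generator`, `discriminator`, `lsgan_losses`) come from the paper. A real system would use neural networks and alternate gradient updates between the two.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(spec, w, b):
    """Hypothetical toy discriminator: one linear score per spectral frame."""
    return spec @ w + b

def generator(clean, gain, offset):
    """Hypothetical toy generator: per-bin affine map from clean to noisy-looking log spectra."""
    return clean * gain + offset

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives (assumed): D pushes real->1 and fake->0; G pushes fake->1."""
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

n_bins = 201
# Two independent pools of random stand-in data: clean frames and a smaller set
# of in-domain noisy frames. No frame in one pool is paired with a frame in the other.
clean_pool = rng.normal(0.0, 1.0, size=(5000, n_bins))
noisy_pool = rng.normal(0.5, 1.2, size=(600, n_bins))

# Each step samples the two batches independently of each other (unpaired).
clean_batch = clean_pool[rng.integers(0, len(clean_pool), size=32)]
noisy_batch = noisy_pool[rng.integers(0, len(noisy_pool), size=32)]

w = rng.normal(0.0, 0.01, size=n_bins)
# Identity initialization: the untrained generator passes clean frames through.
fake_batch = generator(clean_batch, gain=np.ones(n_bins), offset=np.zeros(n_bins))
d_loss, g_loss = lsgan_losses(discriminator(noisy_batch, w, 0.0),
                              discriminator(fake_batch, w, 0.0))
print(round(d_loss, 3), round(g_loss, 3))
```

The key point is that no loss term compares a generated frame with a specific noisy frame. Only the discriminator's judgment of the two distributions links them, which is what makes unparalleled data usable.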

Findings

01

Achieved a 7.3% absolute WER reduction over the best baseline using noisy data simulated by Simu-GAN.

02

Showed that Simu-GAN needs only 10 minutes of unparalleled in-domain noisy speech as training labels.

03

Validated the proposed dual-path recognition system experimentally under noisy conditions.
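To make the 10-minute budget concrete, the helper below counts how many spectral frames it yields. It assumes 16 kHz audio, 25 ms windows, 10 ms hops, and no padding, and it treats the 10 minutes as one continuous recording. Splitting the audio into separate utterances gives slightly fewer frames. These settings and the name `stft_frame_count` are assumptions for illustration, not taken from the paper.

```python
def stft_frame_count(duration_s, sample_rate=16000, win_ms=25.0, hop_ms=10.0):
    """Number of full analysis frames in one clip without padding (assumed STFT settings)."""
    n_samples = int(round(duration_s * sample_rate))
    win = int(round(win_ms * sample_rate / 1000))  # 400 samples at 16 kHz
    hop = int(round(hop_ms * sample_rate / 1000))  # 160 samples at 16 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

frames = stft_frame_count(10 * 60)
print(frames)  # 59998
```

Under these assumptions, 10 minutes yields roughly 60,000 noisy frames for the discriminator to learn from, while the clean side can be drawn from a much larger corpus.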

Abstract

Noise-robust speech recognition systems require large amounts of training data, including noisy speech and the corresponding transcripts, to achieve state-of-the-art performance across diverse practical environments. However, such abundant in-domain data is not always available in real-world settings. In this paper, we propose a generative adversarial network that simulates noisy spectra from clean spectra (Simu-GAN) and requires only 10 minutes of unparalleled in-domain noisy speech data as labels. Furthermore, we propose a dual-path speech recognition system to improve robustness under noisy conditions. Experimental results show that, using noisy data simulated by Simu-GAN, the proposed speech recognition system achieves a 7.3% absolute improvement in word error rate (WER) over the best baseline.
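WER is the minimum number of word substitutions, deletions, and insertions needed to align the hypothesis with the reference, divided by the number of reference words. An "absolute" improvement is a difference in percentage points, which is distinct from a relative reduction. The plain-Python sketch below computes both. The function names, example sentences, and the 20% and 15% figures are hypothetical and are not results from the paper.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum substitutions + deletions + insertions (word-level Levenshtein distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # cost of aligning an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,              # deletion of reference word r
                         cur[j - 1] + 1,           # insertion of hypothesis word h
                         prev[j - 1] + (r != h))   # substitution, or free match
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: errors divided by the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(reference, hypothesis) / n_ref

def absolute_and_relative_reduction(baseline_wer: float, new_wer: float):
    """Absolute reduction in percentage points; relative reduction as a percent of the baseline."""
    absolute = baseline_wer - new_wer
    return absolute, 100.0 * absolute / baseline_wer

print(wer("turn the lights off", "turn lights of"))  # 50.0
print(absolute_and_relative_reduction(20.0, 15.0))   # (5.0, 25.0)
```

In the first example, one deletion ("the") and one substitution ("off" to "of") over four reference words give 50% WER. In the second, a 5-point absolute reduction corresponds to a 25% relative reduction. A reported absolute figure such as 7.3% should therefore be read in percentage points.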

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing