Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data
Chen Chen, Nana Hou, Yuchen Hu, Shashank Shirol, Eng Siong Chng

TL;DR
This paper introduces an approach to noise-robust speech recognition in which a generative adversarial network, trained with a small amount of in-domain noisy data, simulates additional noisy spectra that are then used to train the recognizer.
Contribution
The paper proposes Simu-GAN, a GAN-based method to generate noisy speech spectra from clean speech using only 10 minutes of in-domain noisy data, and a dual-path recognition system for improved robustness.
Findings
Achieved a 7.3% absolute WER reduction over the best baseline when training with noisy data simulated by Simu-GAN.
Showed that 10 minutes of unparalleled in-domain noisy speech is enough to train the noise simulator.
Reported experiments in which the proposed dual-path recognition system, trained with Simu-GAN data, outperforms the best baseline in WER.
Abstract
Noise-robust speech recognition systems require large amounts of training data, including noisy speech and the corresponding transcripts, to achieve state-of-the-art performance across varied practical environments. However, such abundant in-domain data is not always available in practice. In this paper, we propose a generative adversarial network that simulates a noisy spectrum from a clean spectrum (Simu-GAN) and requires only 10 minutes of unparalleled in-domain noisy speech data as labels. Furthermore, we propose a dual-path speech recognition system to improve robustness under noisy conditions. Experimental results show that the proposed speech recognition system, trained with noisy data simulated by Simu-GAN, achieves a 7.3% absolute improvement in word error rate (WER) over the best baseline.
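The headline result is stated as an absolute WER improvement, so it may help to see how WER is usually computed and how an absolute improvement differs from a relative one. The following is a minimal sketch, not the paper's scoring pipeline: the function names, the whitespace tokenization, and the example sentences are assumptions made here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative helper (not from the paper); words are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_improvement(baseline_wer: float, system_wer: float) -> float:
    """Difference in WER, e.g. 0.30 -> 0.25 is 5 points absolute."""
    return baseline_wer - system_wer


def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Difference in WER as a fraction of the baseline WER."""
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up sentences: one deleted word and one substituted word.
    print(word_error_rate("turn the volume down please", "turn volume down peas"))
    # Hypothetical WER values, not figures from the paper.
    print(absolute_improvement(0.30, 0.25), relative_improvement(0.30, 0.25))
```

Under this reading, a "7.3% absolute improvement" means the WER itself drops by 7.3 percentage points; the corresponding relative improvement depends on the baseline WER, which this summary does not give.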
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
