Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition
Chris Donahue, Bo Li, Rohit Prabhavalkar

TL;DR
This paper explores applying GANs to log-Mel spectra for speech enhancement, with the goal of improving the noise robustness of ASR. GAN enhancement improves a clean-trained ASR system on noisy speech but still falls short of multi-style training.
Contribution
It proposes operating GANs on log-Mel filterbank spectra instead of raw waveforms for speech enhancement in ASR, an approach that requires less computation and is more robust to reverberant noise than waveform-based enhancement.
Findings
GANs on log-Mel spectra improve ASR noise robustness
Appending GAN-enhanced features yields 7% WER reduction
GAN enhancement outperforms waveform-based methods in noisy conditions
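The findings above are stated in terms of word error rate (WER). As a minimal sketch of how that metric is usually computed (word-level edit distance divided by the number of reference words), the following may help. The function names and example numbers are hypothetical and are not taken from the paper. The summary does not say whether the 7% figure is an absolute or a relative reduction, so the helper below only illustrates the relative case.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (hypothetical helper)."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative sentences only; not data from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_wer_reduction(0.20, 0.186))
```

Under this definition, a system that goes from a hypothetical 20.0% WER to 18.6% WER shows a relative reduction of about 7%, while the absolute reduction is only 1.4 percentage points.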
Abstract
We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing, we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by…
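To make the log-Mel filterbank representation mentioned in the abstract concrete, here is a minimal NumPy sketch of one common way to compute it. All parameters (16 kHz sample rate, 512-point FFT, 160-sample hop, 40 Mel bins) and all function names are assumptions chosen for illustration; the abstract does not give the paper's actual front-end configuration.

```python
import numpy as np


def hz_to_mel(freq_hz):
    # A widely used Mel-scale formula; other variants exist.
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(waveform, sample_rate=16000, n_fft=512, hop=160,
                        n_mels=40, eps=1e-6):
    """Return log-Mel features of shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[t * hop:t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + eps)  # eps avoids log(0) on silent frames


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    one_second = rng.standard_normal(16000)  # synthetic noise, not real speech
    features = log_mel_spectrogram(one_second)
    print(one_second.shape, features.shape)
```

With these assumed settings, one second of audio (16,000 samples) becomes a 97 x 40 feature matrix, which gives a rough sense of why a model operating on log-Mel features has far fewer input values to process than one operating on the raw waveform.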
