Robust Speech Recognition Using Generative Adversarial Networks
Anuroop Sriram, Heewoo Jun, Yashesh Gaur, Sanjeev Satheesh

TL;DR
This paper introduces a scalable, end-to-end GAN-based framework for robust speech recognition. It improves the encoder's invariance to noise without relying on domain-specific assumptions, and it improves recognition accuracy on simulated far-field speech.
Contribution
The paper presents a novel end-to-end GAN-based approach that directly encourages robustness in speech recognition models without domain expertise or simplifying assumptions.
Findings
Improved accuracy in simulated far-field speech recognition
Encoders learn to map noisy audio to the same embedding space as clean audio, which improves noise invariance
No need for specialized front-ends or preprocessing
Abstract
This paper describes a general, scalable, end-to-end framework that uses the generative adversarial network (GAN) objective to enable robust speech recognition. Encoders trained with the proposed approach enjoy improved invariance by learning to map noisy audio to the same embedding space as that of clean audio. Unlike previous methods, the new framework does not rely on domain expertise or simplifying assumptions as are often needed in signal processing, and directly encourages robustness in a data-driven way. We show the new approach improves simulated far-field speech recognition of vanilla sequence-to-sequence models without specialized front-ends or preprocessing.
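The idea of pushing noisy-audio embeddings toward the clean-audio embedding space with a GAN objective can be illustrated with a toy example. The sketch below is not the authors' implementation: the abstract does not specify the loss formulation, so it assumes the standard non-saturating GAN loss, and the linear-tanh encoder, the logistic discriminator, and all function and variable names are hypothetical. In a full system, the encoder term would presumably be added to the usual sequence-to-sequence recognition loss with some weighting.

```python
import math
import random


def encode(weights, frame):
    """Toy encoder (assumption): one linear layer followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]


def discriminator(d_weights, embedding):
    """Toy logistic discriminator (assumption): P(embedding came from clean audio)."""
    score = sum(w * e for w, e in zip(d_weights, embedding))
    return 1.0 / (1.0 + math.exp(-score))


def gan_losses(weights, d_weights, clean_frame, noisy_frame, eps=1e-12):
    """Return (discriminator_loss, encoder_loss) for one clean/noisy pair.

    Assumes the standard non-saturating GAN formulation; the source only
    says that "the GAN objective" is used.
    """
    z_clean = encode(weights, clean_frame)
    z_noisy = encode(weights, noisy_frame)
    p_clean = discriminator(d_weights, z_clean)
    p_noisy = discriminator(d_weights, z_noisy)
    # Discriminator tries to label clean embeddings 1 and noisy embeddings 0.
    d_loss = -(math.log(p_clean + eps) + math.log(1.0 - p_noisy + eps))
    # Encoder tries to make noisy embeddings look clean to the discriminator.
    enc_loss = -math.log(p_noisy + eps)
    return d_loss, enc_loss


if __name__ == "__main__":
    random.seed(0)
    dim_in, dim_emb = 4, 3
    weights = [[random.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_emb)]
    d_weights = [random.uniform(-1, 1) for _ in range(dim_emb)]
    clean = [0.5, -0.2, 0.1, 0.9]
    # Additive Gaussian noise stands in for far-field distortion (assumption).
    noisy = [c + random.gauss(0.0, 0.3) for c in clean]
    d_loss, enc_loss = gan_losses(weights, d_weights, clean, noisy)
    print("discriminator loss:", d_loss)
    print("encoder loss:", enc_loss)
```

When the encoder maps a clean frame and its noisy counterpart to the same embedding, the discriminator cannot tell them apart, and its loss cannot fall below 2·ln 2. That is the sense in which this kind of objective rewards invariant embeddings.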
