Pushing the Limits of Non-Autoregressive Speech Recognition
Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan

TL;DR
This paper combines recent end-to-end speech recognition techniques to set non-autoregressive state-of-the-art results on LibriSpeech, Fisher+Switchboard, and Wall Street Journal, all without a language model.
Contribution
The recipe applies CTC to giant Conformer models, together with SpecAugment and wav2vec2 pre-training, to improve non-autoregressive speech recognition.
Findings
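Achieved 1.8%/3.6% WER on the LibriSpeech test/test-other sets
Achieved 5.1%/9.8% WER on Switchboard
Achieved 3.4% WER on Wall Street Journal
Pushed non-autoregressive state-of-the-art results on these datasets without a language model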
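The findings above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's own scoring pipeline and text normalization are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which real scoring tools usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a 1.8% WER corresponds to roughly 1.8 word errors per 100 reference words.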
Abstract
We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
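To illustrate why a CTC model can decode without a language model or an autoregressive loop, here is a minimal sketch of greedy CTC decoding: take the best label in each frame independently, merge consecutive repeats, and drop the blank symbol. The blank index of 0, the function name `ctc_greedy_decode`, and the toy scores are assumptions for illustration; this is not the authors' implementation, and the abstract does not specify the decoding procedure.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding over per-frame label scores.

    Each frame is decoded independently of the others, which is what
    makes the procedure non-autoregressive.
    """
    # 1. Pick the highest-scoring label in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge runs of the same label into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks, which only serve to separate repeated labels.
    return [label for label in collapsed if label != blank]


if __name__ == "__main__":
    # Toy scores for 5 frames over a 3-symbol vocabulary (0 = blank).
    scores = [
        [0.1, 0.8, 0.1],
        [0.1, 0.7, 0.2],
        [0.9, 0.05, 0.05],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
    ]
    print(ctc_greedy_decode(scores))
```

The blank between the two runs of label 1 is what allows the same label to appear twice in a row in the output; without it, the repeats would be merged into one.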
