Pushing the Limits of Non-Autoregressive Speech Recognition

Edwin G. Ng; Chung-Cheng Chiu; Yu Zhang; William Chan

arXiv:2104.03416·eess.AS·September 14, 2021

TL;DR

This paper advances non-autoregressive speech recognition by combining recent end-to-end techniques, reaching non-autoregressive state-of-the-art results on LibriSpeech, Fisher+Switchboard, and Wall Street Journal without using a language model.

Contribution

It introduces a novel combination of CTC, large Conformer models, SpecAugment, and wav2vec2 pre-training to significantly improve non-autoregressive speech recognition performance.

Findings

01

Achieved 1.8%/3.6% WER on the LibriSpeech test/test-other sets

02

Achieved 5.1%/9.8% WER on Switchboard and 3.4% WER on Wall Street Journal

03

Set new non-autoregressive state-of-the-art results without a language model
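
The findings above are reported as word error rate (WER): the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch of that standard definition follows. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the scoring pipeline used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a 1.8% WER corresponds to roughly 18 word errors per 1,000 reference words.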

Abstract

We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
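
To illustrate why a CTC model is non-autoregressive and can be used without a language model, here is a minimal sketch of CTC greedy decoding: each frame's label is picked independently, consecutive repeats are merged, and blank symbols are dropped. The function name `ctc_greedy_decode`, the plain list-of-lists input, and the blank index of 0 are assumptions for illustration. The abstract does not specify the paper's decoding procedure.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Collapse per-frame argmax labels into an output label sequence.

    frame_scores: one row of per-label scores for each frame (hypothetical input format).
    blank: index of the CTC blank symbol (assumed to be 0 here).
    """
    # Each frame is decided on its own; no prediction depends on earlier outputs.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]

    decoded: List[int] = []
    previous = None
    for label in best_path:
        # Merge consecutive repeats, then drop blanks.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

A blank between two identical labels keeps them as separate outputs, so a best path of `1 1 0 1 2 2 0` decodes to `1 1 2`.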
