Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition
Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

TL;DR
This paper introduces non-autoregressive transformer models for speech recognition that achieve a 7x inference speedup while matching state-of-the-art autoregressive transformers and outperforming the Kaldi ASR system on the Aishell benchmark.
Contribution
The paper proposes two non-autoregressive transformer frameworks for ASR, demonstrating effective training and inference strategies with significant speed improvements.
Findings
Outperforms Kaldi ASR on Aishell benchmark
Matches state-of-the-art autoregressive transformer performance
Achieves 7x inference speedup
Abstract
Recently, very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, inference computation cost remains a serious concern when putting them into production. In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both the unmasked context and the input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional…
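The masking and iterative filling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MASK` symbol, the function names, the toy predictor standing in for the network, and the rule of filling the most confident positions first are all assumptions made for illustration.

```python
import random

MASK = "<MASK>"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, rng):
    """Training-side corruption: replace a random subset of tokens with MASK."""
    n_mask = rng.randint(1, len(tokens))  # mask at least one position
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def iterative_decode(predict, length, iterations):
    """Start from all MASK tokens and fill in positions over several passes.

    `predict(partial)` stands in for the network: it returns one
    (token, confidence) pair per position given the partial result.
    Filling the most confident positions first is an assumed strategy.
    """
    output = [MASK] * length
    for step in range(iterations):
        masked = [i for i, t in enumerate(output) if t == MASK]
        if not masked:
            break
        guesses = predict(output)
        remaining_passes = iterations - step
        # Ceiling division, so the final pass fills every remaining position.
        n_fill = -(-len(masked) // remaining_passes)
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_fill]
        for i in best:
            output[i] = guesses[i][0]
    return output


def make_toy_predictor(reference):
    """Toy stand-in for a trained model: always proposes the reference token,
    with confidence decreasing by position."""
    def predict(partial):
        return [(tok, 1.0 / (1 + i)) for i, tok in enumerate(reference)]
    return predict


if __name__ == "__main__":
    rng = random.Random(0)
    print(mask_tokens(["h", "e", "l", "l", "o"], rng))
    print(iterative_decode(make_toy_predictor(["h", "e", "l", "l", "o"]), 5, 3))
```

Because every pass predicts all masked positions in parallel, the number of decoder calls is set by `iterations` rather than by the output length, which is where a speedup over token-by-token decoding would come from.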
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
