Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

Nanxin Chen; Shinji Watanabe; Jesús Villalba; Najim Dehak

arXiv:1911.04908 · eess.AS · April 27, 2021 · 53 cites

TL;DR

This paper introduces non-autoregressive transformer models for speech recognition that decode faster than autoregressive models while, on the Aishell benchmark, outperforming the Kaldi ASR system and matching a state-of-the-art autoregressive transformer.

Contribution

The paper proposes two non-autoregressive transformer frameworks for ASR, demonstrating effective training and inference strategies with significant speed improvements.

Findings

01  Outperforms Kaldi ASR on the Aishell benchmark

02  Matches state-of-the-art autoregressive transformer performance

03  Achieves 7x inference speedup

Abstract

Recently, very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, inference computation cost remains a serious concern when putting them into production. In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both the unmasked context and the input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional…
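
As an illustrative sketch only (not the authors' implementation), the training-time masking and the iterative fill-in at inference described in the abstract could look roughly like this in Python. The names `mask_tokens`, `iterative_fill`, and `toy_predict` are hypothetical. Filling the most confident positions first and spreading the fills evenly across iterations are assumptions made for the example, and `toy_predict` merely stands in for a decoder conditioned on the input speech.

```python
import random

MASK = "<mask>"  # hypothetical name for the special mask token


def mask_tokens(tokens, mask_ratio, rng):
    """Training side: replace a random subset of decoder inputs with MASK.

    Returns the masked inputs and the positions the network must predict.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return inputs, sorted(positions)


def iterative_fill(predict, length, n_iters):
    """Inference side: start from all masks and fill them in over n_iters passes.

    `predict(tokens)` must return one (token, confidence) pair per position.
    Assumption: each pass commits the most confident masked positions.
    """
    tokens = [MASK] * length
    for it in range(n_iters):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Ceiling division so every mask is filled by the last pass.
        k = -(-len(masked) // (n_iters - it))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens, reference=("h", "e", "l", "l", "o")):
    """Stand-in for the real decoder: always proposes the reference token,
    with confidence that grows as more context is unmasked."""
    context = sum(1 for t in tokens if t != MASK)
    return [(ref, 0.5 + 0.1 * context - 0.01 * i)
            for i, ref in enumerate(reference)]


if __name__ == "__main__":
    print(mask_tokens(list("hello"), 0.4, random.Random(0)))
    print(iterative_fill(toy_predict, length=5, n_iters=3))
```

With `n_iters=1` the loop fills every position in a single pass, which is the fastest setting; more iterations let later predictions condition on earlier ones.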

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax