Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition
Chuan-Fei Zhang, Yan Liu, Tian-Hao Zhang, Song-Lu Chen, Feng Chen, Xu-Cheng Yin

TL;DR
This paper introduces NAT-UBD, a non-autoregressive transformer with a unified bidirectional decoder for speech recognition, effectively utilizing both left-to-right and right-to-left contexts without increasing model complexity, and achieving state-of-the-art results.
Contribution
The paper proposes a novel NAT-UBD model with a unified bidirectional decoder and a special attention mask to prevent information leakage, enhancing speech recognition accuracy and speed.
Findings
Achieves 5.0%/5.5% CER on Aishell1 dev/test sets.
Runs 49.8x faster than the autoregressive transformer.
Outperforms previous NAR transformer models.
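For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between the hypothesis and the reference, divided by the reference length. The sketch below follows that standard definition; it is not the paper's scoring tool (which this summary does not name), and the function names `edit_distance` and `cer` are illustrative only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four gives a CER of 0.25 (25%).
    print(cer("abcd", "abxd"))
```

Under this definition, a 5.0% CER corresponds to roughly one character error per 20 reference characters.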
Abstract
Non-autoregressive (NAR) transformer models have been studied intensively in automatic speech recognition (ASR), and many of these models use the causal mask to limit token dependencies. However, the causal mask is designed for the left-to-right decoding process of the non-parallel autoregressive (AR) transformer, which is inappropriate for the parallel NAR transformer since it ignores the right-to-left contexts. Some models have been proposed that utilize right-to-left contexts with an extra decoder, but these methods increase the model complexity. To tackle the above problems, we propose a new non-autoregressive transformer with a unified bidirectional decoder (NAT-UBD), which can simultaneously utilize left-to-right and right-to-left contexts. However, direct use of bidirectional contexts will cause information leakage, which means the decoder output can be…
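To make the masking discussion concrete, here is a minimal sketch of the two attention patterns the abstract contrasts: the standard left-to-right causal mask and its right-to-left mirror. This is not the special mask proposed in the paper, whose construction is not described in the text above; the helper names `causal_mask`, `reverse_causal_mask`, and `show` are hypothetical, and the convention that `True` means "may attend" is an assumption of this example.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Left-to-right mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def reverse_causal_mask(n: int) -> list[list[bool]]:
    """Right-to-left mask: position i may attend to positions j >= i."""
    return [[j >= i for j in range(n)] for i in range(n)]


def show(mask: list[list[bool]]) -> str:
    """Render a mask as rows of 1 (may attend) and 0 (blocked)."""
    return "\n".join("".join("1" if a else "0" for a in row) for row in mask)


if __name__ == "__main__":
    n = 4
    print("left-to-right:")
    print(show(causal_mask(n)))
    print("right-to-left:")
    print(show(reverse_causal_mask(n)))
```

Taking the element-wise OR of the two masks allows every position to attend to every position, so a bidirectional decoder needs some additional restriction beyond simply combining them.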
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
