Streaming Align-Refine for Non-autoregressive Deliberation
Weiran Wang, Ke Hu, Tara N. Sainath

TL;DR
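This paper introduces a streaming non-autoregressive (non-AR) decoding algorithm that deliberates on the hypothesis alignment of a streaming RNN-T model. It decodes greedily and emits a result at each frame using only limited right context, and with reasonable right context it matches the performance of its offline counterpart.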
Contribution
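It converts the offline Align-Refine algorithm into a streaming-compatible form, using a transformer decoder with local self-attention over both text and audio and a time-aligned cross-attention at each layer. It also trains the model discriminatively with the minimum word error rate (MWER) criterion, which the authors state had not previously been done in the non-AR decoding literature.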
Findings
The streaming model matches offline performance with limited right context.
Discriminative (MWER) training improves WER, especially for small-capacity models.
Decoding is efficient and low-latency, making it suitable for real-time speech recognition.
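For readers unfamiliar with the MWER criterion behind the discriminative training result, the following is a minimal sketch of the commonly used N-best approximation: the expected number of word errors over a set of hypotheses, with probabilities renormalized over that set and the mean error subtracted as a baseline. It is an illustration only, not the paper's training code. The function names `word_errors` and `mwer_loss` and the toy inputs are hypothetical, and how the paper obtains hypotheses and scores from a non-AR model is not specified in this summary.

```python
import math

def word_errors(hyp, ref):
    """Word-level edit distance between a hypothesis and a reference."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,               # drop hypothesis word h
                           cur[j - 1] + 1,            # add reference word r
                           prev[j - 1] + (h != r)))   # substitute or match
        prev = cur
    return prev[-1]

def mwer_loss(log_probs, hyps, ref):
    """Expected relative word errors over an N-best list (assumed formulation)."""
    # Renormalize the hypothesis probabilities over the N-best list only.
    top = max(log_probs)
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Subtract the mean error count as a variance-reducing baseline.
    errs = [word_errors(h, ref) for h in hyps]
    mean_err = sum(errs) / len(errs)
    return sum(p * (e - mean_err) for p, e in zip(probs, errs))

# Toy usage with made-up scores: more mass on the correct hypothesis
# gives a lower loss.
ref = ["play", "some", "music"]
hyps = [["play", "some", "music"], ["play", "sum", "music"]]
print(mwer_loss([math.log(0.9), math.log(0.1)], hyps, ref))
```

Minimizing this quantity shifts probability toward hypotheses with fewer word errors than the average of the list, which is what makes the objective discriminative rather than purely likelihood-based.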
Abstract
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model. Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context, thus enjoying both high efficiency and low latency. These advantages are achieved by converting the offline Align-Refine algorithm to be streaming-compatible, with a novel transformer decoder architecture that performs local self-attentions for both text and audio, and a time-aligned cross-attention at each layer. Furthermore, we perform discriminative training of our model with the minimum word error rate (MWER) criterion, which has not been done in the non-AR decoding literature. Experiments on voice search datasets and Librispeech show that with reasonable right context, our…
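The local attention with limited right context described in the abstract can be pictured as a banded attention mask. The sketch below is a simplified illustration under stated assumptions, not the authors' architecture: it assumes the text alignment and the audio features are indexed by the same frame positions, so one band can serve self-attention and time-aligned cross-attention alike. The names `local_attention_mask` and `lookahead_frames` and the window sizes are hypothetical.

```python
def local_attention_mask(num_frames, left_context, right_context):
    """mask[t][s] is True when query frame t may attend to key frame s."""
    return [
        [t - left_context <= s <= t + right_context for s in range(num_frames)]
        for t in range(num_frames)
    ]

def lookahead_frames(right_context, num_layers):
    """Total future frames needed when every layer uses the same band.

    Each stacked layer extends the receptive field by right_context frames,
    so the lookahead grows linearly with depth.
    """
    return right_context * num_layers

# Toy usage: 5 frames, 2 frames of left context, 1 frame of right context.
for row in local_attention_mask(5, left_context=2, right_context=1):
    print("".join("x" if allowed else "." for allowed in row))
print(lookahead_frames(right_context=1, num_layers=4))
```

Under these assumptions, the output at frame t depends only on a bounded number of future frames, which is what lets a decoder of this kind emit a result at each frame instead of waiting for the end of the utterance.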
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
