Alignment Restricted Streaming Recurrent Neural Network Transducer
Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer

TL;DR
This paper introduces Alignment Restricted RNN-T (Ar-RNN-T), a modified loss function for streaming speech recognition that improves latency control, accuracy, and training efficiency by utilizing audio-text alignment information.
Contribution
The paper proposes Ar-RNN-T, a modification of the RNN-T loss that incorporates audio-text alignment information to improve streaming ASR performance and training throughput.
Findings
Ar-RNN-T improves latency and Word Error Rate trade-offs.
It guarantees token emissions within specified latency ranges.
It enables faster training with higher throughput.
Abstract
There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short-term memory (LSTM) encoders tend to wait for longer spans of input audio before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token…
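The idea of using alignment information to restrict where tokens may be emitted can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name emission_mask, the parameters left_buffer and right_buffer, and the example alignment frames are all hypothetical. It assumes each token has a reference alignment frame and that emission is permitted only within a window around that frame; a loss built on such a mask would presumably exclude the disallowed lattice positions from the computation.

```python
def emission_mask(align_frames, num_frames, left_buffer, right_buffer):
    """Build a frame-by-token mask for an alignment-restricted lattice.

    mask[t][u] is True when token u may be emitted at frame t, i.e. when
    t lies within [align_frames[u] - left_buffer, align_frames[u] + right_buffer].
    All names and the window rule are assumptions made for illustration.
    """
    mask = []
    for t in range(num_frames):
        row = [a - left_buffer <= t <= a + right_buffer for a in align_frames]
        mask.append(row)
    return mask


# Hypothetical example: 3 tokens aligned to frames 2, 5, 7 of a 10-frame utterance.
mask = emission_mask([2, 5, 7], num_frames=10, left_buffer=1, right_buffer=2)

# For each token, list the frames at which emission is allowed.
allowed = [[t for t in range(10) if mask[t][u]] for u in range(3)]
print(allowed)
```

Under these assumptions, a narrower window bounds how late a token can be emitted and also leaves fewer lattice positions to evaluate, which is one intuitive reading of the latency and throughput findings above.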
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
