Alignment Restricted Streaming Recurrent Neural Network Transducer

Jay Mahadeokar; Yuan Shangguan; Duc Le; Gil Keren; Hang Su; Thong Le; Ching-Feng Yeh; Christian Fuegen; Michael L. Seltzer

arXiv:2011.03072·cs.CL·November 20, 2020


TL;DR

This paper introduces Alignment Restricted RNN-T (Ar-RNN-T), a modified loss function for streaming speech recognition that improves latency control, accuracy, and training efficiency by utilizing audio-text alignment information.

Contribution

The paper proposes Ar-RNN-T, a modification of the RNN-T loss that incorporates audio-text alignment information to improve streaming ASR performance and training throughput.

Findings

01

Ar-RNN-T improves the trade-off between latency and word error rate.

02

It guarantees token emissions within specified latency ranges.

03

It enables faster training with higher throughput.
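
The second finding can be illustrated with a minimal sketch. It is not the authors' implementation; it assumes each token has a reference alignment frame and that the token may only be emitted within a window around that frame. The names `alignment_mask`, `restricted_log_likelihood`, `left_buffer`, and `right_buffer` are hypothetical and chosen for illustration. The sketch runs the standard RNN-T forward recursion, but skips any emission transition that falls outside the allowed window.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def alignment_mask(align_frames, num_frames, left_buffer, right_buffer):
    """mask[t][u] is True when token u may be emitted at frame t.

    align_frames[u] is the (assumed) reference frame of token u; the allowed
    window is [align_frames[u] - left_buffer, align_frames[u] + right_buffer].
    """
    return [
        [a - left_buffer <= t <= a + right_buffer for a in align_frames]
        for t in range(num_frames)
    ]


def restricted_log_likelihood(log_blank, log_emit, mask):
    """RNN-T forward recursion with emissions limited to the masked region.

    log_blank[t][u]: log-probability of blank at lattice node (t, u), shape T x (U+1).
    log_emit[t][u]:  log-probability of emitting token u at frame t, shape T x U.
    """
    T, U = len(log_emit), len(log_emit[0])
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                # Blank transition: advance one frame, same number of tokens.
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u]
                )
            if u > 0 and mask[t][u - 1]:
                # Emission transition, kept only inside the allowed window.
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t][u - 1] + log_emit[t][u - 1]
                )
    # A final blank closes the sequence after the last frame.
    return alpha[T - 1][U] + log_blank[T - 1][U]


# Toy usage: 8 frames, 2 tokens aligned to frames 2 and 5, one frame of slack.
T, U = 8, 2
log_blank = [[math.log(0.5)] * (U + 1) for _ in range(T)]
log_emit = [[math.log(0.5)] * U for _ in range(T)]
mask = alignment_mask([2, 5], T, left_buffer=1, right_buffer=1)
ll = restricted_log_likelihood(log_blank, log_emit, mask)
```

In this toy setup every lattice path has the same probability, so the restriction only changes how many paths contribute to the sum. Because paths outside the window contribute nothing, the corresponding lattice entries never need to be computed, which is one plausible reading of why a restricted loss can also train faster.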

Abstract

There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short-term memory (LSTM) encoders tend to wait for longer spans of input audio before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token…


Taxonomy

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory