Improving RNN Transducer Modeling for End-to-End Speech Recognition

Jinyu Li, Rui Zhao, Hu Hu, and Yifan Gong

arXiv:1909.12415 · cs.CL · September 30, 2019 · 5 cites

TL;DR

This paper enhances RNN Transducer models for end-to-end speech recognition by optimizing training algorithms and model structures, resulting in smaller models with significantly improved accuracy on large-scale data.

Contribution

The paper introduces optimized training methods and improved model architectures for RNN-T, enabling faster training and smaller, more accurate models.

Findings

01 Achieved up to 11.8% relative word error rate (WER) reduction over the baseline RNN-T model.

02 Developed smaller RNN-T models with accuracy comparable to larger models.

03 Achieved lower WER than device hybrid models of similar size.

Abstract

In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among these, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not have CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm of RNN-T to reduce memory consumption, so that we can use larger training minibatches for faster training. Second, we propose better model structures so that we obtain RNN-T models with very good accuracy but a small footprint. Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, the best RNN-T model with even smaller model…
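The memory pressure mentioned in the abstract can be illustrated with some simple arithmetic. In the standard RNN-T formulation, the joint network produces one output distribution over the vocabulary for every (acoustic frame, label position) pair, so a padded minibatch is sized by its longest utterance and longest transcript. The sketch below is only an illustration under assumed sizes, not the paper's algorithm: it compares the element count of a dense padded tensor with a layout in which each utterance keeps only its own grid. The function names, sequence lengths, and vocabulary size are all hypothetical.

```python
from typing import Sequence


def padded_joint_elements(
    frame_lens: Sequence[int], label_lens: Sequence[int], vocab_size: int
) -> int:
    """Elements in a dense (B, T_max, U_max + 1, V) joint-network output."""
    batch = len(frame_lens)
    return batch * max(frame_lens) * (max(label_lens) + 1) * vocab_size


def packed_joint_elements(
    frame_lens: Sequence[int], label_lens: Sequence[int], vocab_size: int
) -> int:
    """Elements when each utterance stores only its own T_i x (U_i + 1) grid."""
    cells = sum(t * (u + 1) for t, u in zip(frame_lens, label_lens))
    return cells * vocab_size


# Hypothetical minibatch: three utterances of very different lengths.
frame_lens = [100, 60, 20]   # acoustic frames per utterance (assumed)
label_lens = [20, 10, 4]     # output labels per utterance (assumed)
vocab_size = 1000            # output units including blank (assumed)

padded = padded_joint_elements(frame_lens, label_lens, vocab_size)
packed = packed_joint_elements(frame_lens, label_lens, vocab_size)

print(f"padded: {padded} elements")
print(f"packed: {packed} elements")
print(f"ratio:  {padded / packed:.2f}x")
```

The gap between the two counts grows with the spread of utterance lengths inside a minibatch, which is one reason memory use can limit the minibatch size during RNN-T training.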

Code & Models

Repositories

csukuangfj/optimized_transducer (PyTorch)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques