Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition

Hu Hu; Rui Zhao; Jinyu Li; Liang Lu; Yifan Gong

arXiv:2005.00572 · cs.CL · May 5, 2020 · 1 citation

Open Access

TL;DR

This paper investigates using external alignments to pre-train RNN Transducer models for end-to-end speech recognition, and reports significant accuracy gains and lower latency on large-scale data.

Contribution

The paper introduces two pre-training methods for RNN-T that use external alignments, encoder pre-training and whole-network pre-training, which improve accuracy and reduce latency compared with traditional initialization strategies.

Findings

01 Encoder pre-training achieves a 10% relative WER reduction over random initialization.

02 Pre-training significantly reduces model latency.

03 The proposed methods outperform CTC+RNNLM initialization on large-scale data.

Abstract

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research because it supports online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common way to ease RNN-T training is to use a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of anonymized Microsoft production data with personally identifiable information removed, our proposed methods can obtain significant…
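
The truncated abstract does not spell out the pre-training objective. One plausible reading of "seeding with external alignments" is that the encoder is first trained as a frame classifier, with each acoustic frame paired with the label the alignment assigns to it. The sketch below computes that frame-level cross-entropy in plain Python. The function name `frame_cross_entropy` and the input layout are assumptions made for illustration, not details taken from the paper.

```python
import math


def frame_cross_entropy(logits, alignment):
    """Mean cross-entropy between per-frame logits and aligned labels.

    logits:    list of T frames, each a list of V unnormalized scores
               (a stand-in for encoder outputs; assumed layout).
    alignment: list of T integer label ids, one per frame
               (a stand-in for an external forced alignment; assumed layout).
    """
    if not alignment or len(logits) != len(alignment):
        raise ValueError("need exactly one aligned label per frame")
    total = 0.0
    for frame_logits, label in zip(logits, alignment):
        # Numerically stable log-sum-exp over the label vocabulary.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        # Negative log-probability of the aligned label for this frame.
        total += log_norm - frame_logits[label]
    return total / len(alignment)


# Two frames over a 4-label vocabulary, aligned to labels 0 and 3.
loss = frame_cross_entropy([[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 2.0]], [0, 3])
```

Under this reading, an encoder trained to minimize such a loss would then supply the initial encoder weights for RNN-T training, in place of random or CTC-based initialization.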

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing