Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition

Jinxi Guo; Gautam Tiwari; Jasha Droppo; Maarten Van Segbroeck; Che-Wei Huang; Andreas Stolcke; Roland Maas

arXiv:2007.13802·eess.AS·July 29, 2020·1 cites

Open Access

TL;DR

This paper introduces an efficient MWER training method for RNN-Transducer in end-to-end speech recognition, significantly speeding up training while maintaining or improving WER performance.

Contribution

It proposes a semi-on-the-fly MWER training approach that decouples decoding and training, enabling faster training and improved accuracy over baseline models.

Findings

01

Speeds up MWER training 6 times relative to the on-the-fly method used in previous work.

02

Achieves 3.6% WER reduction on a speech recognition benchmark.

03

Reduces high-deletion errors and improves recognition in various domains.

Abstract

In this work, we propose a novel and efficient minimum word error rate (MWER) training method for RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, in our proposed method, we re-calculate and sum scores of all the possible alignments for each hypothesis in N-best lists. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method allows us to decouple the decoding and training processes, and thus we can perform offline parallel-decoding and MWER training for each subset iteratively. Experimental results show that this proposed semi-on-the-fly method can speed up the on-the-fly method by 6 times and result in a similar WER improvement (3.6%) over a…
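The scoring idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes per-hypothesis log-probability lattices are already available (the names `blank_lp` and `label_lp`, and all function names, are hypothetical), sums over all alignments with the standard RNN-T forward recursion, and then computes the expected number of word errors over an N-best list. Renormalizing the hypothesis scores over the N-best list is an assumption borrowed from common MWER practice; the backward pass used for gradients is omitted.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def rnnt_log_prob(blank_lp, label_lp):
    """Log-probability of one hypothesis, summed over all alignments.

    Hypothetical inputs for a hypothesis with U labels and T frames:
      blank_lp[t][u]: log-prob of blank at frame t after u labels, T x (U+1)
      label_lp[t][u]: log-prob of the (u+1)-th label at frame t,   T x U
    """
    T = len(blank_lp)
    U = len(blank_lp[0]) - 1
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            terms = []
            if t > 0:  # arrived by emitting blank at the previous frame
                terms.append(alpha[t - 1][u] + blank_lp[t - 1][u])
            if u > 0:  # arrived by emitting the u-th label at this frame
                terms.append(alpha[t][u - 1] + label_lp[t][u - 1])
            alpha[t][u] = logsumexp(terms)
    # Every alignment ends with a final blank at the last frame.
    return alpha[T - 1][U] + blank_lp[T - 1][U]


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def expected_word_errors(hyp_log_probs, errors):
    """Expected word errors under scores renormalized over the N-best list
    (the renormalization is an assumption, not taken from the abstract)."""
    norm = logsumexp(hyp_log_probs)
    return sum(math.exp(lp - norm) * e for lp, e in zip(hyp_log_probs, errors))


# Toy usage: two frames, one label, every transition has probability 0.5.
half = math.log(0.5)
lp = rnnt_log_prob([[half, half], [half, half]], [[half], [half]])
errs = [word_errors("a b c".split(), h.split()) for h in ("a b c", "a c")]
loss = expected_word_errors([lp, lp], errs)
```

In an actual training setup the lattices would come from the RNN-T joint network evaluated on each N-best hypothesis, and an automatic-differentiation framework or an explicit backward recursion would supply the gradients.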

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing