Less Is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath

TL;DR
This paper demonstrates that limiting label context to four previous word-piece labels in RNN-T models maintains accuracy while significantly improving decoding efficiency through a novel path-merging scheme that reduces redundant computations.
Contribution
The study introduces a method to limit RNN-T label context to four previous labels without accuracy loss and proposes a path-merging scheme to enhance decoding efficiency and reduce model evaluations.
Findings
Limiting label context to four previous word-piece labels during training does not degrade WER relative to the full-context baseline.
Path merging improves oracle WER by up to 36%.
Decoding efficiency is improved with up to 5.3% fewer model evaluations.
Abstract
End-to-end models that condition the output label sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since unique label histories correspond to distinct model states, such models are decoded using an approximate beam-search process which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding process. We find that we can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline. Limiting context also provides opportunities to improve the efficiency of the beam-search process during decoding by removing redundant paths from the active…
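To make the path-merging idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes that when the model only sees the last four labels, two beam hypotheses ending in the same four labels have the same model state, so only one of them needs to stay active. The names `Hyp`, `context_key`, and `merge_paths` are hypothetical, and keeping the higher-scoring path is an illustrative choice. In a real RNN-T decoder the merge would be applied among hypotheses at the same time frame.

```python
from collections import namedtuple

# Hypothetical hypothesis record: a label sequence and its log-probability.
Hyp = namedtuple("Hyp", ["labels", "log_prob"])


def context_key(labels, context_size=4):
    """Return the part of the history the limited-context model can see."""
    if context_size <= 0:
        return ()
    return tuple(labels[-context_size:])


def merge_paths(hyps, context_size=4):
    """Keep one hypothesis per distinct limited-context key.

    Assumption: hypotheses sharing the last `context_size` labels are
    equivalent for all future model evaluations, so the lower-scoring
    ones are redundant in the active beam. Here we simply keep the best.
    """
    merged = {}
    for hyp in hyps:
        key = context_key(hyp.labels, context_size)
        best = merged.get(key)
        if best is None or hyp.log_prob > best.log_prob:
            merged[key] = hyp
    # Return survivors ordered from most to least likely.
    return sorted(merged.values(), key=lambda h: h.log_prob, reverse=True)


beam = [
    Hyp(("a", "b", "c", "d", "e"), -1.0),
    Hyp(("x", "b", "c", "d", "e"), -2.0),  # same last four labels as above
    Hyp(("a", "b", "c", "d", "f"), -3.0),
]
active = merge_paths(beam)
```

In this toy beam the first two hypotheses differ only in a label that falls outside the four-label window, so they collapse into one active path and the model is evaluated for two states instead of three.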
