Tied & Reduced RNN-T Decoder
Rami Botros (1), Tara N. Sainath (1), Robert David (1), Emmanuel Guzman (1), Wei Li (1), Yanzhang He (1) ((1) Google Inc. USA)

TL;DR
This paper introduces a simplified, smaller RNN-T decoder that combines weight tying with Edit-based Minimum Bayes Risk (EMBR) training. The decoder shrinks by roughly 90% (from 23M to 2M parameters) with no loss in recognition accuracy, which makes it suitable for on-device speech recognition.
Contribution
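The paper proposes a lightweight RNN-T decoder whose prediction network takes a weighted average of the input embeddings and shares its embedding matrix with the joint network's output layer (weight tying). Combined with EMBR training, this design drastically reduces decoder size while maintaining recognition accuracy.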
Findings
Decoder size reduced from 23M to 2M parameters
Recognition accuracy remains unchanged with the new design
Efficient on-device speech recognition enabled by smaller model
Abstract
Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have shown that, under some conditions, it is possible to simplify its prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context size of previous labels and/or using a simpler architecture for its layers instead of LSTMs. The benefits of such changes include reduction in model size, faster inference and power savings, which are all useful for on-device applications. In this work, we study ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degradation in recognition performance. Our prediction network performs a simple weighted averaging of the input embeddings, and shares its embedding matrix weights with the joint network's output layer (a.k.a. weight tying,…
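The two ideas named in the abstract, a prediction network that averages input embeddings and an output layer that reuses the same embedding matrix, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sizes (`VOCAB`, `DIM`, `CONTEXT`), the per-position softmax weights (`position_logits`), and the additive `tanh` joint are all assumptions chosen for illustration, since the abstract does not specify how the averaging weights are computed or how the joint network combines its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
VOCAB = 16    # number of output labels
DIM = 8       # embedding size (also the joint hidden size here)
CONTEXT = 2   # number of previous labels the prediction network sees

# A single matrix serves as both the input embedding table and the
# output projection. This sharing is what "weight tying" refers to.
embedding = rng.normal(size=(VOCAB, DIM))

# Assumption: one learnable scalar per context position, normalized
# with a softmax. The paper's actual weighting scheme may differ.
position_logits = rng.normal(size=(CONTEXT,))


def prediction_network(prev_labels):
    """Return a weighted average of the previous labels' embeddings."""
    embs = embedding[prev_labels]                      # (CONTEXT, DIM)
    w = np.exp(position_logits - position_logits.max())
    w = w / w.sum()                                    # weights sum to 1
    return w @ embs                                    # (DIM,)


def joint_network(encoder_frame, pred_out):
    """Combine encoder and prediction outputs, then project to label logits."""
    hidden = np.tanh(encoder_frame + pred_out)         # (DIM,), assumed combiner
    return hidden @ embedding.T                        # (VOCAB,), tied output layer


prev = np.array([3, 5])                 # last two emitted labels (illustrative)
frame = rng.normal(size=(DIM,))         # stand-in for one encoder output frame
logits = joint_network(frame, prediction_network(prev))
print(logits.shape)
```

In this sketch the only decoder parameters are the `VOCAB x DIM` embedding matrix and `CONTEXT` scalars, which gives a sense of why tying and averaging can shrink a decoder compared with an LSTM prediction network plus a separate output layer.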
