Efficient Knowledge Distillation for RNN-Transducer Models
Sankaran Panchapagesan, Daniel S. Park, Chung-Cheng Chiu, Yuan Shangguan, Qiao Liang, Alexander Gruenstein

TL;DR
This paper introduces an efficient knowledge distillation method for RNN-Transducer models, improving speech recognition accuracy, especially for sparse models, with simple loss functions and broad applicability across datasets.
Contribution
The paper proposes a novel, simple distillation loss for RNN-T models that enhances accuracy of sparse models and is effective across multiple speech recognition datasets.
Findings
WER reductions of 4.3% and 12.1% on noisy data from distilling 60% and 90% sparse multi-domain RNN-T models, respectively
4.8% relative WER reduction on LibriSpeech test-other
Distillation is effective for both pruned (sparse) models and small models
Abstract
Knowledge Distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role for on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities from the RNN-T output probability lattice. We study the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation. With distillation of 60% and 90% sparse multi-domain RNN-T models, we obtain WER reductions of 4.3% and 12.1% respectively, on…
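The abstract says only that the loss uses the "y" and "blank" posteriors from the RNN-T output lattice; it does not give the exact form. The sketch below assumes one plausible reading: at each lattice node (t, u), the full vocabulary posterior is collapsed into three masses (the correct next label y, blank, and everything else), and the loss is the KL divergence between the collapsed teacher and student distributions, summed over the lattice. The function names, the blank index of 0, the nested-list lattice layout, and the toy numbers are all illustrative assumptions, not the authors' implementation.

```python
import math

def collapse(probs, label, blank=0):
    """Collapse a vocabulary posterior into (p_y, p_blank, p_rest).

    label is the index of the correct next label, or None at the last
    row of the lattice, where no further label can be emitted.
    """
    p_y = probs[label] if label is not None else 0.0
    p_blank = probs[blank]
    p_rest = max(1.0 - p_y - p_blank, 0.0)
    return (p_y, p_blank, p_rest)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two small discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher, student, targets, blank=0):
    """Sum of collapsed KL divergences over all lattice nodes.

    teacher, student: lattice[t][u] is a list of vocabulary probabilities,
    with u ranging over 0..len(targets).
    targets: label indices y_1..y_U (assumed never equal to blank).
    """
    total = 0.0
    for t in range(len(teacher)):
        for u in range(len(teacher[t])):
            label = targets[u] if u < len(targets) else None
            p = collapse(teacher[t][u], label, blank)
            q = collapse(student[t][u], label, blank)
            total += kl_divergence(p, q)
    return total

# Toy lattice: T = 2 frames, one target label (index 2), vocabulary of 4.
targets = [2]
teacher = [
    [[0.1, 0.1, 0.7, 0.1], [0.90, 0.05, 0.03, 0.02]],
    [[0.2, 0.1, 0.6, 0.1], [0.95, 0.02, 0.02, 0.01]],
]
uniform = [0.25, 0.25, 0.25, 0.25]
student = [[uniform, uniform], [uniform, uniform]]

loss_same = distillation_loss(teacher, teacher, targets)
loss_diff = distillation_loss(teacher, student, targets)
```

Under this assumption, each node needs only three numbers per model instead of the full vocabulary distribution, which is consistent with the abstract's description of the loss as simple and efficient. In the toy example, `loss_same` is zero because the two lattices match, and `loss_diff` is positive.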
