Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR
Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman

TL;DR
This paper compares soft and hard target knowledge distillation for large-scale RNN-T models in ASR, showing that hard targets work better when the teacher and student architectures differ and soft targets work better in self-training, and achieving state-of-the-art results on LibriSpeech.
Contribution
It provides a comprehensive comparison of soft and hard target distillation for RNN-T models and establishes new state-of-the-art results with soft distillation on LibriSpeech.
Findings
Hard targets are more effective with architecture mismatch.
Soft targets perform better in self-training scenarios.
Achieved 8% relative WER improvement on LibriSpeech dev-other.
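For readers unfamiliar with the term, a relative WER improvement is measured against the baseline error rate rather than in absolute percentage points. The definition below is the standard one; the symbols are introduced here for illustration and the summary does not give the underlying WER values.

```latex
\[
\text{relative improvement}
  = \frac{\mathrm{WER}_{\text{baseline}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{baseline}}}
\]
% An 8% relative improvement therefore means
% WER_new = (1 - 0.08) * WER_baseline = 0.92 * WER_baseline.
```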
Abstract
Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8%…
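The difference between the two distillation styles can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common formulation in which soft distillation sums a KL divergence between teacher and student token posteriors over every node (t, u) of the RNN-T output lattice, while hard distillation trains the student on the teacher's decoded transcripts as pseudo-labels. The function names and the `teacher_decode` callable are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_distillation_loss(teacher_logits, student_logits):
    """Sum over lattice nodes (t, u) of KL(teacher || student) over the vocabulary.

    Both arguments are nested lists shaped [T][U][V] of joint-network logits
    (assumed layout, chosen for illustration).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        for t_node, s_node in zip(t_row, s_row):
            t_logp = log_softmax(t_node)
            s_logp = log_softmax(s_node)
            # KL(p || q) = sum_k p_k * (log p_k - log q_k)
            total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total


def hard_targets(teacher_decode, unlabeled_audio):
    """Pair each utterance with the teacher's decoded transcript (pseudo-label).

    `teacher_decode` is a hypothetical callable standing in for teacher inference;
    the student would then be trained on these pairs with the usual RNN-T loss.
    """
    return [(audio, teacher_decode(audio)) for audio in unlabeled_audio]


# Tiny lattice: T=1 frame, U=2 label positions, V=3 vocabulary entries.
teacher = [[[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]]
student = [[[1.0, 1.0, 1.0], [0.1, 0.2, 0.3]]]

loss_same = soft_distillation_loss(teacher, teacher)  # identical posteriors
loss_diff = soft_distillation_loss(teacher, student)  # student disagrees at one node
pseudo = hard_targets(lambda audio: audio.upper(), ["utt a", "utt b"])
```

The soft loss needs the teacher and student lattices to line up node by node, whereas the hard targets only pass a transcript between the models, which is one plausible reading of why hard targets tolerate architecture mismatch better.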
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
Methods: Stochastic Depth · Knowledge Distillation · RandAugment · Dropout · Noisy Student
