Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models
Xiaoyu Yang, Qiujia Li, Philip C. Woodland

TL;DR
This paper introduces a knowledge distillation method for neural transducers that lets small student models approach the speech recognition performance of large self-supervised pre-trained teacher models, substantially reducing word error rates.
Contribution
The paper proposes a simple KD loss that focuses on the one-best path in the transducer's output probability lattice, transferring knowledge from a large pre-trained teacher model to a much smaller student model for ASR.
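To make the idea of a loss restricted to the one-best path concrete, here is a minimal sketch, not the authors' implementation. It assumes the lattice is indexed by (t, u) nodes, that the one-best path is already available as a list of such nodes, and that the per-node penalty is the KL divergence from the teacher's output distribution to the student's. The function names and the choice of KL divergence are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def one_best_path_kd_loss(teacher_logits, student_logits, path):
    """Sum KL(teacher || student) over the lattice nodes on one path.

    teacher_logits[t][u] and student_logits[t][u] are lists of vocabulary
    logits at lattice node (t, u); path is a list of (t, u) tuples.
    Hypothetical helper: the paper's exact formulation may differ.
    """
    total = 0.0
    for t, u in path:
        log_p = log_softmax(teacher_logits[t][u])  # teacher distribution
        log_q = log_softmax(student_logits[t][u])  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total


# Toy lattice: 2 frames x 2 label positions x 3 output symbols.
teacher = [
    [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]],
    [[1.0, 1.0, 1.0], [0.0, 3.0, -2.0]],
]
student = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
path = [(0, 0), (0, 1), (1, 1)]  # assumed one-best path through the lattice
loss = one_best_path_kd_loss(teacher, student, path)
```

Only the nodes on the path contribute, so the cost grows with the path length rather than with the full size of the lattice.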
Findings
Student models 10 times smaller than the teacher achieve relative WER reductions (WERRs) of up to 48.2% when unlabelled data is used.
The proposed KD loss yields a WERR of up to 11.5% on test-other.
Using language model shallow fusion further enhances student model performance.
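The findings above are stated as relative WER reductions and mention language model shallow fusion. The sketch below shows how a WERR is computed and gives a simplified view of shallow fusion as a log-linear combination of ASR and LM scores. All numbers, hypothesis strings, and the LM weight are hypothetical and are not taken from the paper. In practice shallow fusion is applied inside beam search, whereas this example only rescores complete hypotheses.

```python
def werr(baseline_wer, new_wer):
    """Relative WER reduction in percent: (baseline - new) / baseline * 100."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0


def shallow_fusion_best(candidates, lm_weight):
    """Pick the hypothesis maximising asr_logp + lm_weight * lm_logp.

    candidates is a list of (text, asr_logp, lm_logp) tuples.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]


# Hypothetical WERs: a drop from 10.0% to 8.85% is roughly an 11.5% WERR.
example_werr = werr(10.0, 8.85)

# Hypothetical hypotheses with made-up ASR and LM log-probabilities.
candidates = [
    ("recognise speech", -1.2, -2.0),
    ("wreck a nice beach", -1.0, -5.0),
]
without_lm = shallow_fusion_best(candidates, lm_weight=0.0)
with_lm = shallow_fusion_best(candidates, lm_weight=0.3)
```

With the LM weight set to zero the choice depends on the ASR score alone. A positive weight lets the language model overturn a small ASR preference for an implausible word sequence.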
Abstract
Self-supervised pre-training is an effective approach to leveraging a large amount of unlabelled data to reduce word error rates (WERs) of automatic speech recognition (ASR) systems. Since it is impractical to use large pre-trained models for many real-world ASR applications, it is desirable to have a much smaller model while retaining the performance of the pre-trained model. In this paper, we propose a simple knowledge distillation (KD) loss function for neural transducers that focuses on the one-best path in the output probability lattice under both streaming and non-streaming setups, which allows a small student model to approach the performance of the large pre-trained teacher model. Experiments on the LibriSpeech dataset show that despite being 10 times smaller than the teacher model, the proposed loss results in relative WER reductions (WERRs) of 11.5% and 6.8% on the test-other…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Knowledge Distillation
