Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition
Xiaoyu Yang, Qiujia Li, Chao Zhang, Philip C. Woodland

TL;DR
This paper introduces a two-stage knowledge distillation framework that transfers knowledge from multiple large speech foundation models into a smaller student model, significantly improving speech recognition accuracy.
Contribution
It proposes a novel multi-teacher knowledge distillation method with a two-stage training process for end-to-end speech recognition models.
Findings
Distilling from multiple teachers yields larger word error rate reductions (WERRs) than distilling from a single teacher.
Pre-training with embeddings from multiple teachers enhances student model performance.
Incorporating unlabelled data during pre-training further improves results.
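The findings above are stated in terms of WERR. Assuming this denotes the usual relative word error rate reduction, it can be computed as in the minimal sketch below. The function name `werr` and the example numbers are hypothetical and are not results from the paper.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction, as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline at 10.0% WER and a distilled student at 8.0% WER.
example = werr(baseline_wer=10.0, new_wer=8.0)
```

A positive value means the new system makes fewer word errors than the baseline; multiply by 100 to express it as a percentage.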
Abstract
Although large foundation models pre-trained by self-supervised learning have achieved state-of-the-art performance in many tasks including automatic speech recognition (ASR), knowledge distillation (KD) is often required in practice to transfer the knowledge learned by large teacher models into much smaller student models with affordable computation and memory costs. This paper proposes a novel two-stage KD framework to distil the knowledge from multiple speech foundation models as teachers into a single student neural transducer model for ASR. In the first stage, the student model encoder is pre-trained using the embeddings extracted from multiple teacher models. In the second stage, the student encoder is fine-tuned with the audio-text pairs based on the ASR task. Experiments on the LibriSpeech 100-hour subset show that the proposed KD framework improves the performance of both…
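The first stage described in the abstract, pre-training the student encoder on embeddings from several teachers, can be pictured roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss or how teachers are combined, so the L1 regression loss, the per-teacher linear projection heads, the equal weighting of teachers, and all names (`project`, `l1_distance`, `multi_teacher_loss`) are illustrative choices. It also assumes student and teacher embeddings are already aligned frame by frame.

```python
from typing import Dict, List

Frame = List[float]        # one embedding vector
Sequence = List[Frame]     # one embedding per frame
Matrix = List[List[float]] # projection weights, one row per output dimension


def project(frame: Frame, weight: Matrix) -> Frame:
    """Linearly map a student frame into one teacher's embedding space."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


def l1_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def multi_teacher_loss(
    student_seq: Sequence,
    teacher_seqs: Dict[str, Sequence],
    heads: Dict[str, Matrix],
) -> float:
    """Average the per-teacher L1 embedding losses (assumed equal weighting)."""
    total = 0.0
    for name, teacher_seq in teacher_seqs.items():
        # Compare each projected student frame with the matching teacher frame.
        per_frame = [
            l1_distance(project(s, heads[name]), t)
            for s, t in zip(student_seq, teacher_seq)
        ]
        total += sum(per_frame) / len(per_frame)
    return total / len(teacher_seqs)


# Toy example: one student frame of size 2 and two hypothetical teachers.
student = [[1.0, 2.0]]
teachers = {"A": [[1.0, 2.0]], "B": [[1.0, 2.0, 5.0]]}
heads = {
    "A": [[1.0, 0.0], [0.0, 1.0]],
    "B": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
}
loss = multi_teacher_loss(student, teachers, heads)
```

Because this stage needs only audio and teacher embeddings, it can in principle use unlabelled data. The second stage, fine-tuning on audio-text pairs with the ASR objective, is not shown here.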
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Knowledge Distillation
