Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

Xiaoyu Yang; Qiujia Li; Chao Zhang; Philip C. Woodland

arXiv:2303.10917 · eess.AS · March 21, 2023 · 1 cites

Open Access

TL;DR

This paper introduces a two-stage knowledge distillation framework that transfers knowledge from multiple large speech foundation models into a smaller student model, significantly improving speech recognition accuracy.

Contribution

It proposes a novel multi-teacher knowledge distillation method with a two-stage training process for end-to-end speech recognition models.

Findings

01

Using multiple teachers yields higher relative word error rate reductions (WERRs) than using a single teacher.

02

Pre-training with embeddings from multiple teachers enhances student model performance.

03

Incorporating unlabelled data during pre-training further improves results.
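
The findings above are stated in terms of WERR, the relative word error rate reduction against a baseline. A minimal sketch of that calculation follows; the function name `werr` and the numbers in the usage line are illustrative assumptions, not results from the paper.

```python
def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction of new_wer against baseline_wer.

    Both arguments are WERs in the same unit (for example, percent).
    A positive result means the new system makes fewer errors.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative numbers only (not taken from the paper):
# a baseline at 10.0% WER and a distilled student at 8.0% WER.
example = werr(10.0, 8.0)
print(f"WERR = {example:.1%}")
```

Because WERR is relative, the same absolute WER gain counts for more when the baseline WER is already low.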

Abstract

Although large foundation models pre-trained by self-supervised learning have achieved state-of-the-art performance in many tasks including automatic speech recognition (ASR), knowledge distillation (KD) is often required in practice to transfer the knowledge learned by large teacher models into much smaller student models with affordable computation and memory costs. This paper proposes a novel two-stage KD framework to distil the knowledge from multiple speech foundation models as teachers into a single student neural transducer model for ASR. In the first stage, the student model encoder is pre-trained using the embeddings extracted from multiple teacher models. In the second stage, the student encoder is fine-tuned with the audio-text pairs based on the ASR task. Experiments on the LibriSpeech 100-hour subset show that the proposed KD framework improves the performance of both…
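
The first stage described in the abstract, pre-training the student encoder on embeddings from several teachers, could be sketched roughly as below. The truncated abstract does not give the loss or how the teachers are combined, so everything specific here is an assumption for illustration: one hypothetical linear projection head per teacher, an L1 distance, an unweighted sum over teachers, and teacher and student frames already aligned in time.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows are frames, columns are embedding dimensions


def project(frames: Matrix, weight: Matrix) -> Matrix:
    """Linear projection of (T x d_in) frames by a (d_in x d_out) weight matrix."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(frame[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for frame in frames
    ]


def l1_distance(a: Matrix, b: Matrix) -> float:
    """Mean absolute difference over all frames and dimensions."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def multi_teacher_loss(
    student: Matrix, teachers: Sequence[Matrix], heads: Sequence[Matrix]
) -> float:
    """Assumed stage-1 objective: sum of per-teacher L1 losses.

    heads[k] is a hypothetical projection mapping the student encoder output
    into the embedding space of teachers[k].
    """
    return sum(
        l1_distance(project(student, head), teacher)
        for head, teacher in zip(heads, teachers)
    )


# Toy example: 2 frames of a 2-dimensional student encoder output.
student_out = [[1.0, 2.0], [3.0, 4.0]]
teacher_a = [[1.0, 2.0], [3.0, 4.0]]   # 2-dimensional teacher embeddings
teacher_b = [[2.0], [7.0]]             # 1-dimensional teacher embeddings
head_a = [[1.0, 0.0], [0.0, 1.0]]      # identity projection
head_b = [[1.0], [1.0]]                # sums the two student dimensions
loss = multi_teacher_loss(student_out, [teacher_a, teacher_b], [head_a, head_b])
print(loss)
```

In a real system the projection heads and the student encoder would be trained jointly by gradient descent on this loss, and the heads would be discarded before the second stage, where the encoder is fine-tuned on audio-text pairs.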

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Knowledge Distillation