Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models

Xiaoyu Yang; Qiujia Li; Philip C. Woodland

arXiv:2110.03334 · eess.AS · March 3, 2022 · ICASSP · 1 cites

TL;DR

This paper introduces a knowledge distillation method for neural transducers that enables small models to closely match the performance of large self-supervised pre-trained models in speech recognition, significantly reducing word error rates.

Contribution

It proposes a simple KD loss focusing on the one-best path, effectively transferring knowledge from large pre-trained models to smaller models in ASR tasks.

Findings

01 Student models 10 times smaller than the teacher achieve relative WER reductions (WERRs) of up to 48.2% with unlabelled data.

02 The proposed KD loss yields relative WER reductions of up to 11.5% on test-other.

03 Using language model shallow fusion further enhances student model performance.
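The findings above are stated as relative WER reductions rather than absolute WER differences. A minimal sketch of that arithmetic follows, assuming the standard definition of WERR as the reduction in WER divided by the baseline WER. The function name and the numbers in the usage note are illustrative and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) in percent.

    Both arguments are WERs in percent. The baseline must be positive,
    since the reduction is measured relative to it.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

For example, with made-up values, a baseline WER of 20.0% and a new WER of 10.0% give `relative_wer_reduction(20.0, 10.0) == 50.0`, that is, a 50% WERR for a 10-point absolute reduction.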

Abstract

Self-supervised pre-training is an effective approach to leveraging a large amount of unlabelled data to reduce word error rates (WERs) of automatic speech recognition (ASR) systems. Since it is impractical to use large pre-trained models for many real-world ASR applications, it is desirable to have a much smaller model while retaining the performance of the pre-trained model. In this paper, we propose a simple knowledge distillation (KD) loss function for neural transducers that focuses on the one-best path in the output probability lattice under both streaming and non-streaming setups, which allows a small student model to approach the performance of the large pre-trained teacher model. Experiments on the LibriSpeech dataset show that despite being 10 times smaller than the teacher model, the proposed loss results in relative WER reductions (WERRs) of 11.5% and 6.8% on the test-other…
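To make the idea of a loss that "focuses on the one-best path in the output probability lattice" more concrete, here is a rough sketch rather than the authors' implementation. It assumes, for illustration only, that the teacher and student lattices share the same shape, that `lattice[t][u]` holds a probability distribution over the vocabulary (including blank), that the one-best path is supplied as a list of `(t, u)` nodes, and that the per-node distance is a KL divergence. All function and variable names are hypothetical.

```python
import math


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two distributions over the same vocabulary."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def one_best_path_kd_loss(teacher_lattice, student_lattice, path):
    """Sum the per-node KL divergence over the nodes on one path only.

    teacher_lattice, student_lattice: nested lists indexed as [t][u][v]
        (assumed to have identical shapes).
    path: list of (t, u) lattice nodes, assumed here to be given.
    Nodes that are not on the path do not contribute to the loss.
    """
    return sum(
        kl_divergence(teacher_lattice[t][u], student_lattice[t][u])
        for t, u in path
    )


# Toy lattice with 2 frames, 2 label positions and a vocabulary of size 2.
teacher = [[[0.9, 0.1], [0.5, 0.5]], [[0.2, 0.8], [0.6, 0.4]]]
student = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
path = [(0, 0), (0, 1), (1, 1)]  # hypothetical one-best path
loss = one_best_path_kd_loss(teacher, student, path)
```

Restricting the sum to the path means the cost grows with the path length rather than with the full size of the lattice, which is one plausible reason for preferring such a loss over one that covers every lattice node.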

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Knowledge Distillation