Essence Knowledge Distillation for Speech Recognition
Zhenchuan Yang, Chun Zhang, Weibin Zhang, Jianxiu Jin, Dongpeng Chen

TL;DR
This paper introduces a novel knowledge distillation method for speech recognition that selectively uses ensemble outputs and combines them with hard labels, resulting in a more efficient and accurate single model.
Contribution
It proposes a selective distillation approach that filters ensemble outputs and employs multitask learning, improving speech recognition accuracy over traditional methods.
Findings
The method outperforms single models trained only on hard labels.
The student model surpasses the teacher model in accuracy.
Selective distillation reduces computational costs while maintaining high performance.
Abstract
It is well known that a speech recognition system that combines multiple acoustic models trained on the same data significantly outperforms a single-model system. Unfortunately, real-time speech recognition using a whole ensemble of models is too computationally expensive. In this paper, we propose to distill the essential knowledge in an ensemble of models (i.e., the teacher model) into a single model (i.e., the student model) that needs much less computation to deploy. Previously, all the softened outputs of the teacher model were used to optimize the student model. We argue that not all outputs of the ensemble need to be distilled. Some of the outputs may even contain noisy information that is useless or even harmful to the training of the student model. In addition, we propose to train the student model with a multitask learning approach by utilizing both the softened outputs…
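The idea described above, distilling only some of the teacher's outputs and combining them with hard labels, can be illustrated with a minimal per-frame loss sketch. This is not the authors' implementation. The abstract here is truncated and does not state the selection rule, so the filter below (keep the teacher output only when its top class matches the hard label) is an assumption made purely for illustration. The function name `selective_kd_loss` and the weight `alpha` are likewise hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def selective_kd_loss(student_logits, teacher_probs, hard_label, alpha=0.5):
    """Hypothetical per-frame loss mixing soft and hard targets.

    ASSUMPTION: the teacher output is kept only when its most probable
    class equals the hard label. The source does not specify the actual
    selection criterion.
    """
    student_probs = softmax(student_logits)

    # Hard-label term: cross-entropy against a one-hot target.
    one_hot = [1.0 if i == hard_label else 0.0
               for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, student_probs)

    # Assumed filter: discard teacher outputs that disagree with the label.
    teacher_top = max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    if teacher_top != hard_label:
        return hard_loss

    # Soft-label term: cross-entropy against the teacher distribution.
    soft_loss = cross_entropy(teacher_probs, student_probs)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    print(selective_kd_loss(logits, [0.7, 0.2, 0.1], hard_label=0))
    print(selective_kd_loss(logits, [0.1, 0.8, 0.1], hard_label=0))
```

In this sketch a filtered frame falls back to the plain hard-label loss rather than being dropped, so every frame still contributes a training signal. Other choices, such as confidence thresholds or keeping only the top few teacher posteriors, would fit the same structure.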
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
