Application of Knowledge Distillation to Multi-task Speech Representation Learning
Mine Kerpicci, Van Nguyen, Shuhua Zhang, Erik Visser

TL;DR
This paper explores applying knowledge distillation to large self-supervised speech models to significantly reduce their size while maintaining high performance on downstream tasks like keyword spotting and speaker verification.
Contribution
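It demonstrates that knowledge distillation followed by joint multi-task fine-tuning can shrink a speech representation model by nearly 75% while losing only 0.1% accuracy and 0.9% equal error rate on two downstream tasks, which makes deployment on edge devices more practical.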
Findings
Nearly 75% model size reduction with only 0.1% accuracy and 0.9% equal error rate degradation compared to the full-size model
Fine-tuning improves downstream task performance
Distilled models perform comparably to full-size models
Abstract
Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models use a large number of parameters, the smallest version of which has 95 million parameters. This constitutes a challenge for edge AI device deployments. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach results in nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that…
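The two stages named in the abstract, distillation and joint fine-tuning, can be illustrated with a minimal sketch. The abstract does not state the paper's actual loss functions, so everything here is an assumption made for illustration: an L1 feature-matching loss between student and teacher representations, softmax cross-entropy for both downstream tasks (with speaker verification treated as speaker classification during training), and the hypothetical weights `w_kws` and `w_sv`.

```python
import math


def l1_distill_loss(student, teacher):
    """Mean absolute difference between student and teacher features.

    Both arguments are lists of frames, each frame a list of floats.
    Assumption: feature-level L1 matching; the paper may use another loss.
    """
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student, teacher):
        for s, t in zip(s_frame, t_frame):
            total += abs(s - t)
            count += 1
    return total / count


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def joint_loss(kws_logits, kws_label, sv_logits, sv_label,
               w_kws=0.5, w_sv=0.5):
    """Weighted sum of the two task losses used for joint fine-tuning.

    Assumption: the weights are hypothetical and not taken from the paper.
    """
    return (w_kws * cross_entropy(kws_logits, kws_label)
            + w_sv * cross_entropy(sv_logits, sv_label))


# Stage 1: the student is trained to match the teacher's representations.
student_feats = [[1.0, 2.0], [0.0, 0.5]]
teacher_feats = [[1.0, 4.0], [0.0, 0.5]]
distill = l1_distill_loss(student_feats, teacher_feats)

# Stage 2: the distilled student is fine-tuned on both tasks at once.
joint = joint_loss([2.0, 0.1, -1.0], 0, [0.3, 1.5], 1)
```

In a real setup the features and logits would come from the teacher, the student, and the task heads; the sketch only shows how the two objectives are formed.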
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Knowledge Distillation
