Application of Knowledge Distillation to Multi-task Speech Representation Learning

Mine Kerpicci; Van Nguyen; Shuhua Zhang; Erik Visser

arXiv:2210.16611 · eess.AS · May 22, 2023

Open Access

TL;DR

This paper explores applying knowledge distillation to large self-supervised speech models to significantly reduce their size while maintaining high performance on downstream tasks like keyword spotting and speaker verification.

Contribution

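The paper shows that knowledge distillation followed by joint fine-tuning on multiple downstream tasks compresses speech representation models by nearly 75%, at a cost of only 0.1% accuracy and 0.9% equal error rate relative to the full-size model. This makes deployment on edge devices more practical.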

Findings

01

Nearly 75% model size reduction with only 0.1% accuracy and 0.9% equal error rate degradation

02

Fine-tuning improves downstream task performance

03

Distilled models perform comparably to full-size models

Abstract

Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker verification, they provide state-of-the-art performance. However, these models use a large number of parameters, the smallest version of which has 95 million parameters. This constitutes a challenge for edge AI device deployments. In this paper, we investigate the application of knowledge distillation to speech representation learning (SRL) models followed by joint fine-tuning with multiple downstream voice-activated tasks. In our experiments on two such tasks, our approach results in nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation compared to the full-size model. In addition, we show that…
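The pipeline described in the abstract (distill a large teacher into a smaller student, then fine-tune jointly on several downstream tasks) can be illustrated with a minimal sketch. The abstract does not state the paper's actual distillation objective, task losses, or loss weights, so everything below is an assumption made for illustration: a plain mean-squared-error feature-matching loss, softmax cross-entropy as a stand-in for each task loss, equal task weights, and all function names and tensor shapes. The last helper works through the size arithmetic using the two figures the abstract does give (95 million parameters, nearly 75% reduction).

```python
import numpy as np


def distillation_loss(student_repr, teacher_repr):
    """Mean squared error between student and teacher representations.

    Assumption: a plain feature-matching objective; the paper's actual
    distillation loss is not given in the abstract.
    """
    return float(np.mean((student_repr - teacher_repr) ** 2))


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def joint_loss(task_losses, task_weights):
    """Weighted sum of per-task losses (the weights are hypothetical)."""
    return float(sum(w * l for w, l in zip(task_weights, task_losses)))


def reduced_param_count(n_params, reduction):
    """Parameters remaining after shrinking a model by the given fraction."""
    return n_params * (1.0 - reduction)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Illustrative shapes only: (batch, frames, feature dimension).
    teacher = rng.normal(size=(2, 50, 768))
    student = teacher + 0.1 * rng.normal(size=teacher.shape)
    print("distillation loss:", distillation_loss(student, teacher))

    # Hypothetical heads: 12 keyword classes and 100 training speakers.
    kws_loss = cross_entropy(rng.normal(size=(2, 12)), np.array([3, 7]))
    sv_loss = cross_entropy(rng.normal(size=(2, 100)), np.array([10, 42]))
    print("joint loss:", joint_loss([kws_loss, sv_loss], [0.5, 0.5]))

    # Size arithmetic from the abstract's figures.
    print("params after a 75% cut:", reduced_param_count(95e6, 0.75))
```

If the reduction is measured against the 95-million-parameter model, a 75% cut leaves about 24 million parameters. In a real training loop the two stages would run in sequence: the distillation loss trains the student against the frozen teacher, and the joint loss then updates the student together with the task heads.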

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Knowledge Distillation