Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation

Danilo de Oliveira; Timo Gerkmann

arXiv:2309.09920 · eess.AS · September 19, 2023

Open Access

TL;DR

This paper shows that knowledge distillation and its decoupled extension can compress HuBERT into an LSTM-based model that has fewer parameters than DistilHuBERT and performs better in automatic speech recognition.

Contribution

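The paper distills HuBERT's Transformer layers into an LSTM-based model using knowledge distillation and decoupled knowledge distillation. Because these methods do not focus on distilling internal features, the architecture of the compressed model can be chosen more freely.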

Findings

01 The LSTM-based distilled model outperforms DistilHuBERT in automatic speech recognition.

02 Its parameter count is lower than that of DistilHuBERT.

03 Recognition accuracy improves even though the model has fewer parameters.

Abstract

Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming. In this work, we show that the original method of knowledge distillation (and its more recently proposed extension, decoupled knowledge distillation) can be applied to the task of distilling HuBERT. In contrast to methods that focus on distilling internal features, this allows for more freedom in the network architecture of the compressed model. We thus propose to distill HuBERT's Transformer layers into an LSTM-based distilled model that reduces the number of parameters even below DistilHuBERT and at the same time shows improved performance in automatic speech recognition.
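
The two losses named in the abstract can be sketched in a few lines. This is a minimal illustration in plain Python of the general formulation of decoupled knowledge distillation, in which the classical distillation loss is split into a target-class term and a non-target-class term that are weighted separately. It is not the authors' implementation: the function names, the weights alpha and beta, the temperature, and the example logits are all assumptions made for illustration, and this page does not state how the paper chooses targets or weights for HuBERT.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Classical knowledge distillation: KL between softened outputs."""
    return kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))


def dkd_terms(teacher_logits, student_logits, target, temperature=1.0):
    """Return (target-class term, non-target-class term, teacher target prob)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Target-class term: KL between binary (target vs. rest) distributions.
    bin_teacher = [p_teacher[target], 1.0 - p_teacher[target]]
    bin_student = [p_student[target], 1.0 - p_student[target]]
    tckd = kl_divergence(bin_teacher, bin_student)

    # Non-target-class term: KL between distributions renormalised
    # over all classes except the target.
    rest_teacher = [p / (1.0 - p_teacher[target])
                    for i, p in enumerate(p_teacher) if i != target]
    rest_student = [p / (1.0 - p_student[target])
                    for i, p in enumerate(p_student) if i != target]
    nckd = kl_divergence(rest_teacher, rest_student)

    return tckd, nckd, p_teacher[target]


def dkd_loss(teacher_logits, student_logits, target,
             alpha=1.0, beta=1.0, temperature=1.0):
    """Decoupled loss: alpha * target term + beta * non-target term.

    alpha and beta are illustrative defaults, not values from the paper.
    """
    tckd, nckd, _ = dkd_terms(teacher_logits, student_logits,
                              target, temperature)
    return alpha * tckd + beta * nckd


if __name__ == "__main__":
    # Hypothetical logits for one frame over four classes.
    teacher = [2.0, 0.5, -1.0, 0.3]
    student = [1.0, 0.7, -0.2, 0.1]
    print("KD loss: ", kd_loss(teacher, student, temperature=2.0))
    print("DKD loss:", dkd_loss(teacher, student, target=0,
                                alpha=1.0, beta=2.0, temperature=2.0))
```

With alpha = 1 and beta equal to one minus the teacher's probability for the target class, the decoupled loss reduces to the classical distillation loss. Setting beta independently removes that coupling, which is the extension the abstract refers to. Both losses compare only output distributions, so nothing in them constrains the student to share the teacher's internal layer structure.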

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attention Is All You Need · Softmax · Dense Connections · Absolute Position Encodings · Focus · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Knowledge Distillation