Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Yujin Wang; Changli Tang; Ziyang Ma; Zhisheng Zheng; Xie Chen; Wei-Qiang Zhang

arXiv:2210.15631·eess.AS·May 8, 2025

TL;DR

This paper investigates how to distill large self-supervised speech models into smaller, more efficient models for automatic speech recognition, introducing a discriminative loss and an input distillation technique to improve performance and efficiency.

Contribution

The paper presents a comprehensive study of student model structures, introduces a discriminative loss that improves distillation, and proposes an input distillation algorithm that reduces parameters by 17% and doubles inference speed.

Findings

01. Discriminative loss improves distillation in low-resource scenarios.

02. Input distillation reduces parameters by 17% and doubles inference speed.

03. Performance degradation is marginal with the proposed methods.

Abstract

Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing. An SSL model is normally pre-trained on a great variety of unlabelled data, and a large model size is preferred to increase the modeling capacity. However, this might limit its potential applications due to the expensive computation and memory costs introduced by the oversized model. Miniaturization of SSL models has therefore become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR). First, in order to establish a strong baseline, a comprehensive study on different student model structures is conducted. On top of this, as a supplement to the regression loss widely adopted in previous works, a discriminative loss is introduced for HuBERT to enhance the distillation performance,…
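The abstract contrasts a regression loss with a supplementary discriminative loss, but this page does not give either formula. The sketch below is only an illustration of how such a pair of objectives could be combined, under stated assumptions: the regression term is taken to be a frame-wise L1 distance plus one minus cosine similarity between student and teacher features, and the discriminative term is taken to be a cross-entropy against discrete per-frame targets. The function names, the `weight` parameter, and both loss forms are hypothetical and are not taken from the paper.

```python
import numpy as np


def regression_loss(student, teacher, eps=1e-8):
    """Assumed regression term: L1 distance plus (1 - cosine similarity).

    student, teacher: arrays of shape (frames, dim).
    Returns the mean over frames as a Python float.
    """
    l1 = np.abs(student - teacher).mean(axis=-1)
    cos = (student * teacher).sum(axis=-1) / (
        np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1) + eps
    )
    return float((l1 + (1.0 - cos)).mean())


def discriminative_loss(student_logits, target_ids):
    """Assumed discriminative term: cross-entropy against discrete targets.

    student_logits: (frames, classes) unnormalized scores.
    target_ids: (frames,) integer class labels, one per frame.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    frame_idx = np.arange(target_ids.shape[0])
    return float(-log_probs[frame_idx, target_ids].mean())


def distillation_loss(student, teacher, student_logits, target_ids, weight=1.0):
    """Hypothetical combined objective: regression + weight * discriminative."""
    return regression_loss(student, teacher) + weight * discriminative_loss(
        student_logits, target_ids
    )
```

In this sketch, setting `weight=0.0` recovers the regression-only objective, which makes it easy to compare the two settings on the same student outputs.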

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing