LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

TL;DR
LightHuBERT introduces a flexible, highly compressed speech representation model that maintains high performance across tasks while significantly reducing model size through a novel architecture search and distillation strategy.
Contribution
It proposes a once-for-all Transformer compression framework for speech models, enabling automatic architecture search and substantial parameter reduction.
Findings
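Outperforms the original HuBERT on ASR and five SUPERB tasks at the HuBERT size.
Achieves a 3.5x compression ratio on three SUPERB tasks (automatic speaker verification, keyword spotting, and intent classification) with a slight accuracy loss.
Achieves performance comparable to DistilHuBERT on most tasks with 29% fewer parameters.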
Abstract
Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pruning structured parameters. More precisely, we create a Transformer-based supernet that is nested with thousands of weight-sharing subnets and design a two-stage distillation strategy to leverage the contextualized latent representations from HuBERT. Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over 10^9 architectures concerning the embedding dimension, attention dimension, head number, feed-forward network ratio, and…
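The supernet idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the search-space values, the fixed per-head width, and the helper names (`sample_subnet`, `count_subnets`, `slice_linear`) are all hypothetical, chosen only to show how subnets that vary in embedding dimension, head number, feed-forward ratio, and depth might be sampled, and how a subnet could reuse a slice of the supernet's weights instead of storing its own.

```python
import random

# Hypothetical search space. These choices are illustrative assumptions,
# not the values used in the paper.
SEARCH_SPACE = {
    "embed_dim": [512, 640, 768],
    "num_heads": [8, 10, 12],
    "ffn_ratio": [3.0, 3.5, 4.0],
    "depth": [10, 11, 12],
}
HEAD_DIM = 64  # assumed fixed per-head width: attention dim = num_heads * 64


def sample_subnet(rng):
    """Sample one architecture: global embed dim and depth, per-layer heads and FFN ratio."""
    depth = rng.choice(SEARCH_SPACE["depth"])
    layers = []
    for _ in range(depth):
        heads = rng.choice(SEARCH_SPACE["num_heads"])
        layers.append({
            "num_heads": heads,
            "attn_dim": heads * HEAD_DIM,
            "ffn_ratio": rng.choice(SEARCH_SPACE["ffn_ratio"]),
        })
    return {
        "embed_dim": rng.choice(SEARCH_SPACE["embed_dim"]),
        "depth": depth,
        "layers": layers,
    }


def count_subnets():
    """Count the distinct architectures in this toy search space."""
    per_layer = len(SEARCH_SPACE["num_heads"]) * len(SEARCH_SPACE["ffn_ratio"])
    total_layer_combos = sum(per_layer ** d for d in SEARCH_SPACE["depth"])
    return len(SEARCH_SPACE["embed_dim"]) * total_layer_combos


def slice_linear(weight, out_dim, in_dim):
    """Weight sharing: a subnet uses the top-left block of a supernet weight matrix."""
    return [row[:in_dim] for row in weight[:out_dim]]


if __name__ == "__main__":
    rng = random.Random(0)
    subnet = sample_subnet(rng)
    print("sampled depth:", subnet["depth"], "embed_dim:", subnet["embed_dim"])
    print("architectures in toy space:", count_subnets())

    # A 3x3 "supernet" matrix; a smaller subnet reads only its 2x2 corner.
    supernet_weight = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(slice_linear(supernet_weight, 2, 2))
```

Because per-layer choices multiply across layers, even a handful of options per dimension produces a very large number of candidate architectures, which is why they are trained jointly through shared weights rather than one by one.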
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Attention Is All You Need · Pruning · Linear Layer · Residual Connection · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing
