Unfolding A Few Structures for The Many: Memory-Efficient Compression of Conformer and Speech Foundation Models

Zhaoqing Li; Haoning Xu; Xurong Xie; Zengrui Jin; Tianzi Wang; Xunying Liu

arXiv:2505.21237 · cs.SD · May 28, 2025

TL;DR

This paper introduces a memory-efficient compression method for Conformer ASR and speech foundation models. A compact seed model is trained and repeatedly unfolded to emulate larger models of different logical depths, giving ASR performance comparable to individually constructed models with fewer parameters.

Contribution

A novel 'small-to-large' unfolding approach that jointly trains seed and expanded models, reducing memory and storage while maintaining performance.

Findings

01. Achieves 35% parameter reduction for Conformer models.

02. Maintains comparable ASR performance with fewer parameters.

03. Requires minimal memory and storage during training and inference.

Abstract

This paper presents a novel memory-efficient model compression approach for Conformer ASR and speech foundation systems. Our approach features a unique "small-to-large" design. A compact "seed" model containing a few Conformer or Transformer blocks is trained and unfolded many times to emulate the performance of larger uncompressed models with different logical depths. The seed model and many unfolded paths are jointly trained within a single unfolding cycle. The KL-divergence between the largest unfolded and smallest seed models is used in a self-distillation process to minimize their performance disparity. Experimental results show that our foldable model produces ASR performance comparable to individually constructed Conformer and wav2vec2/HuBERT speech foundation models under various depth configurations, while requiring only minimal memory and storage. Conformer and wav2vec2 models…
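The unfolding idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the residual `tanh` block merely stands in for a Conformer or Transformer block, and the names `make_block`, `unfold`, and `kl_divergence`, the dimensions, the repeat counts, and the direction of the KL term (largest path as teacher, seed as student) are all assumptions made for illustration. It shows only the forward pass and the distillation term, not joint training.

```python
import math
import random


def make_block(dim, rng):
    """Toy residual block x + tanh(W x); a stand-in for a Conformer/Transformer block."""
    w = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]

    def block(x):
        h = [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in w]
        return [x_i + h_i for x_i, h_i in zip(x, h)]

    return block


def unfold(seed_blocks, repeats):
    """Run the seed blocks `repeats` times in sequence, reusing the same parameters."""

    def model(x):
        for _ in range(repeats):
            for block in seed_blocks:
                x = block(x)
        return x

    return model


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))


rng = random.Random(0)
seed_blocks = [make_block(4, rng) for _ in range(2)]  # 2 physical blocks (assumed size)
x = [0.5, -1.0, 0.25, 2.0]  # toy input features

# Three paths sharing one set of weights: logical depths 2, 4, and 8.
paths = {r: unfold(seed_blocks, r) for r in (1, 2, 4)}
posteriors = {r: softmax(paths[r](x)) for r in paths}

# Self-distillation term (assumed direction): largest path teaches the seed path.
distill_loss = kl_divergence(posteriors[4], posteriors[1])
print(f"KL(largest || seed) = {distill_loss:.4f}")
```

Every path stores only the two seed blocks, so the parameter count stays fixed while the logical depth grows with the repeat count. In a real system the distillation term would presumably be added to the ASR losses of the jointly trained paths; how those are weighted is not stated in the abstract.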

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems