Keep what you need: extracting efficient subnetworks from large audio representation models
David Genova, Philippe Esling, Tom Hurlin

TL;DR
This paper proposes a method to extract lightweight, task-specific subnetworks from large pretrained audio models using learnable binary masks and a sparsity loss, enabling efficient deployment without retraining the entire model.
Contribution
The paper introduces learnable binary masks, trained with a sparsity loss, to extract compact, specialized subnetworks from large audio foundation models, maintaining performance while reducing size.
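To make the idea of a learnable binary mask with a sparsity loss concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the mask parameterization (one logit per unit, thresholded at zero), the straight-through gradient, the sigmoid-sum sparsity penalty, and every name and hyperparameter (`learn_mask`, `sparsity_weight`, the four-unit "backbone") are assumptions chosen for illustration. The frozen weights are never updated; only the mask logits are trained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hard_mask(logits):
    """Binarize mask logits: a unit is kept when sigmoid(logit) > 0.5."""
    return [1.0 if s > 0.0 else 0.0 for s in logits]

def masked_forward(x, weights, mask):
    """Frozen linear 'backbone' whose units are gated by a binary mask."""
    return sum(m * w * xi for m, w, xi in zip(mask, weights, x))

def learn_mask(inputs, targets, weights, sparsity_weight=0.05, lr=0.5, steps=200):
    """Train only the mask logits; `weights` stay frozen throughout."""
    logits = [2.0] * len(weights)  # assumption: start with every unit kept
    for _ in range(steps):
        mask = hard_mask(logits)
        grads = [0.0] * len(logits)
        # Task loss: mean squared error of the masked model.
        for x, y in zip(inputs, targets):
            err = masked_forward(x, weights, mask) - y
            for j, (w, xi) in enumerate(zip(weights, x)):
                # Straight-through estimator (assumed): treat
                # d(mask_j)/d(logit_j) as the sigmoid derivative p * (1 - p).
                p = sigmoid(logits[j])
                grads[j] += 2.0 * err * w * xi * p * (1.0 - p) / len(inputs)
        for j in range(len(logits)):
            p = sigmoid(logits[j])
            # Sparsity loss (assumed form): sparsity_weight * sum_j sigmoid(logit_j).
            grads[j] += sparsity_weight * p * (1.0 - p)
            logits[j] -= lr * grads[j]
    return logits

# Toy task: four frozen units, but the target depends only on units 0 and 2.
weights = [1.0, 1.0, 1.0, 1.0]
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
targets = [1.0, 0.0, 1.0, 0.0]

logits = learn_mask(inputs, targets, weights)
final_mask = hard_mask(logits)
kept_fraction = sum(final_mask) / len(final_mask)
```

In this toy setting the units the task does not need (1 and 3) are pushed below the threshold by the task error and the sparsity term together, while the needed units (0 and 2) feel only the weak sparsity pressure and stay active, so `final_mask` ends up as `[1.0, 0.0, 1.0, 0.0]` and half of the units can be dropped. A real setup would mask structures of a pretrained network and rely on automatic differentiation rather than hand-written gradients.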
Findings
Effective across different backbone architectures
Reduces model size significantly
Maintains performance on various audio tasks
Abstract
Recently, research on audio foundation models has witnessed notable advances, as illustrated by the ever-improving results on complex downstream tasks. Subsequently, those pretrained networks have quickly been used for various audio applications. These improvements have, however, resulted in a considerable increase in both the size and complexity of these models. Along with the environmental concerns this issue raises, this prevents the deployment of such networks on consumer-level devices and precludes their use for real-time applications. Moreover, this appears contradictory with the specificity of the tasks for which these models are used, which are often simpler than extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple, yet effective method to extract lightweight specialist subnetworks from large foundation…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
