Keep what you need: extracting efficient subnetworks from large audio representation models

David Genova; Philippe Esling; Tom Hurlin

arXiv:2502.12925·cs.SD·February 19, 2025

TL;DR

This paper proposes a method for extracting lightweight, task-specific subnetworks from large pretrained audio models using learnable binary masks and a sparsity loss, enabling efficient deployment without retraining the entire model.

Contribution

The paper introduces learnable binary masks, trained with a sparsity loss, that carve compact, specialized subnetworks out of large audio foundation models, reducing model size while maintaining performance.
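To make the idea of a learnable mask with a sparsity loss concrete, here is a minimal JAX sketch on a toy linear layer. It is an illustration under stated assumptions, not the authors' implementation: the sigmoid relaxation of the mask, the final thresholding at zero, the mean-of-mask sparsity term, and every name and hyperparameter (`soft_mask`, `hard_mask`, `train_mask`, `lam`, `lr`, the toy weights) are hypothetical choices made for this example. The pretrained weights stay frozen; only one logit per unit is trained.

```python
import jax
import jax.numpy as jnp

def soft_mask(logits):
    """Relaxed mask in (0, 1); one learnable logit per maskable unit (assumed form)."""
    return jax.nn.sigmoid(logits)

def hard_mask(logits):
    """Binary mask obtained by thresholding, used to extract the final subnetwork."""
    return (logits > 0.0).astype(jnp.float32)

def loss_fn(logits, frozen_w, x, y, lam):
    mask = soft_mask(logits)
    pred = x @ (frozen_w * mask)        # frozen weights, gated unit by unit
    task_loss = jnp.mean((pred - y) ** 2)
    sparsity_loss = jnp.mean(mask)      # relaxed fraction of units kept
    return task_loss + lam * sparsity_loss

def train_mask(frozen_w, x, y, lam=0.5, lr=2.0, steps=500):
    logits = jnp.full(frozen_w.shape, 2.0)   # start with every unit mostly on
    grad_fn = jax.jit(jax.grad(loss_fn))     # gradient w.r.t. the logits only
    for _ in range(steps):
        logits = logits - lr * grad_fn(logits, frozen_w, x, y, lam)
    return logits

# Toy setup: a "pretrained" layer with four units, of which the
# downstream task only needs the first two.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 4))
frozen_w = jnp.array([2.0, -1.5, 1.0, 0.5])
y = x @ jnp.array([2.0, -1.5, 0.0, 0.0])

logits = train_mask(frozen_w, x, y)
mask = hard_mask(logits)
kept = int(mask.sum())
print("binary mask:", mask, "units kept:", kept)
```

In this toy setting the weight `lam` sets the trade-off: a larger value penalizes kept units more heavily and yields a smaller subnetwork, at the risk of dropping units the task relies on.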

Findings

01. Effective across different backbone architectures
02. Reduces model size significantly
03. Maintains performance on various audio tasks

Abstract

Recently, research on audio foundation models has witnessed notable advances, as illustrated by the ever-improving results on complex downstream tasks. Subsequently, those pretrained networks have quickly been used for various audio applications. These improvements have, however, resulted in a considerable increase in both the size and complexity of these models. Along with the environmental concerns this issue raises, this prevents the deployment of such networks on consumer-level devices and precludes their use for real-time applications. Moreover, this appears to contradict the specificity of the tasks for which these models are used, which are often simpler than extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple yet effective method to extract lightweight specialist subnetworks from large foundation…

Code & Models

Repositories

gnvircam/audio-representation-trimming (JAX, official)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing