TL;DR
S-SONDO is a self-supervised knowledge distillation framework that compresses large audio models into smaller ones using only the teacher's output embeddings, so it also applies to teachers that expose embeddings but no class logits.
Contribution
It introduces the first embedding-only distillation method for general audio models, enabling architecture-agnostic compression without relying on logits or layer features.
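To make the idea of embedding-only distillation concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the cosine-distance objective, the linear projection from student to teacher dimension, and all names and sizes are hypothetical choices, since the summary does not say which loss S-SONDO uses.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_distillation_loss(student_emb, teacher_emb, projection):
    """Mean cosine distance between projected student and teacher embeddings.

    Assumed objective for illustration only; only output embeddings are used,
    no logits and no intermediate-layer features.
    """
    projected = student_emb @ projection      # (batch, teacher_dim)
    s = l2_normalize(projected)
    t = l2_normalize(teacher_emb)
    cos_sim = np.sum(s * t, axis=-1)          # one similarity per example
    return float(np.mean(1.0 - cos_sim))      # 0 when directions match

# Hypothetical sizes and random stand-ins for real model outputs.
rng = np.random.default_rng(0)
batch, student_dim, teacher_dim = 4, 8, 16
teacher_emb = rng.normal(size=(batch, teacher_dim))
student_emb = rng.normal(size=(batch, student_dim))
projection = rng.normal(size=(student_dim, teacher_dim))

loss = cosine_distillation_loss(student_emb, teacher_emb, projection)
print(f"distillation loss: {loss:.4f}")
```

Because the loss reads nothing but the two embedding matrices, the student and teacher can have unrelated architectures; in training, the student weights and the projection would be updated to reduce this value.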
Findings
Distilled student models are up to 61 times smaller than their teachers while retaining 96% of the teacher's performance.
S-SONDO is architecture-agnostic and applicable to embedding-based teachers.
The paper provides practical insights on loss functions and data sampling strategies for embedding-only distillation.
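The size and performance figures above are ratios between student and teacher. The short sketch below shows how such ratios are typically computed. The parameter counts and scores are hypothetical placeholders chosen for easy arithmetic; they are not values from the paper.

```python
def compression_ratio(teacher_params, student_params):
    """How many times smaller the student is than the teacher."""
    return teacher_params / student_params

def retained_fraction(student_score, teacher_score):
    """Fraction of the teacher's score that the student keeps."""
    return student_score / teacher_score

# Hypothetical placeholder values, not results from the paper.
teacher_params = 90_000_000
student_params = 3_000_000
teacher_score = 0.80
student_score = 0.72

ratio = compression_ratio(teacher_params, student_params)
retention = retained_fraction(student_score, teacher_score)
print(f"{ratio:.0f}x smaller, {retention:.0%} of teacher performance retained")
```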
Abstract
General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is…
