Multi-Distillation from Speech and Music Representation Models

Jui-Chiang Wei; Yi-Cheng Lin; Fabian Ritter-Gutierrez; Hung-yi Lee

arXiv:2506.07237 · eess.AS · June 12, 2025

TL;DR

This paper presents a multi-teacher distillation framework that unifies speech and music models into a single, efficient model. The resulting model performs on par with domain-specific models and outperforms them in few-shot learning scenarios.

Contribution

The paper introduces a cross-domain distillation method that combines speech and music teacher models into a single student model, reducing model size while maintaining performance across tasks.

Findings

01. Model matches domain-specific performance
02. Outperforms in few-shot learning
03. Effective cross-domain knowledge transfer

Abstract

Real-world audio often mixes speech and music, yet models typically handle only one domain. This paper introduces a multi-teacher distillation framework that unifies speech and music models into a single one while significantly reducing model size. Our approach leverages the strengths of domain-specific teacher models, such as HuBERT for speech and MERT for music, and explores various strategies to balance both domains. Experiments across diverse tasks demonstrate that our model matches the performance of domain-specific models, showing the effectiveness of cross-domain distillation. Additionally, we conduct few-shot learning experiments, highlighting the need for general models in real-world scenarios where labeled data is limited. Our results show that our model not only performs on par with specialized models but also outperforms them in few-shot scenarios, proving that a…
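The multi-teacher objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss, so everything here is an assumption: the per-teacher term (L1 distance minus log-sigmoid of cosine similarity) is a form commonly used in speech representation distillation, not necessarily the authors' choice, and the fixed per-teacher weights stand in for the paper's unspecified "strategies to balance both domains". The names `feature_loss` and `multi_teacher_loss` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero


def feature_loss(student, teacher):
    """Assumed per-teacher loss, averaged over frames.

    Each frame contributes its mean absolute error minus the
    log-sigmoid of the cosine similarity to the teacher frame.
    """
    total = 0.0
    for s, t in zip(student, teacher):
        l1 = sum(abs(x - y) for x, y in zip(s, t)) / len(s)
        cos = cosine_similarity(s, t)
        log_sigmoid = -math.log(1.0 + math.exp(-cos))
        total += l1 - log_sigmoid
    return total / len(student)


def multi_teacher_loss(student_preds, teacher_feats, weights):
    """Weighted sum of per-teacher losses.

    student_preds: dict mapping teacher name -> student prediction of
                   that teacher's features (list of frames).
    teacher_feats: dict mapping teacher name -> target features.
    weights:       dict mapping teacher name -> scalar weight
                   (an assumed, fixed balancing scheme).
    """
    return sum(
        weights[name] * feature_loss(student_preds[name], teacher_feats[name])
        for name in teacher_feats
    )
```

In this sketch the student would carry one prediction head per teacher (for example one for HuBERT and one for MERT), and the weights decide how much each domain contributes to the update. Setting a weight to zero reduces the objective to single-teacher distillation.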

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning