Distilling a speech and music encoder with task arithmetic

Fabian Ritter-Gutierrez; Yi-Cheng Lin; Jui-Chiang Wei; Jeremy H.M Wong; Eng Siong Chng; Nancy F. Chen; Hung-yi Lee

arXiv:2505.13270·cs.SD·May 20, 2025

Open Access

TL;DR

This paper builds a unified speech and music encoder by distilling a speech SSL model and a music SSL model separately into task vectors, then linearly interpolating those vectors. The interpolation weights control how strongly the merged model favors each domain, and the authors report improved performance on speech and music benchmarks.

Contribution

The paper proposes unifying speech and music models through task vector interpolation. Because each teacher is distilled on its own, training is simpler than ensemble distillation, and the authors report better performance than that baseline.
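The interpolation step can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common task-arithmetic convention in which both distilled students start from a shared initialization and a task vector is the difference between a student's weights and that initialization. The names `theta_init`, `task_vector`, `merge`, and `lam` are hypothetical, and random arrays stand in for real distilled weights.

```python
import numpy as np


def task_vector(theta_student, theta_init):
    """Task vector: per-parameter difference between a student and its initialization."""
    return {name: theta_student[name] - theta_init[name] for name in theta_init}


def merge(theta_init, tau_speech, tau_music, lam):
    """Add a convex combination of the two task vectors to the shared initialization.

    lam = 1.0 recovers the speech student, lam = 0.0 the music student.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {
        name: theta_init[name] + lam * tau_speech[name] + (1.0 - lam) * tau_music[name]
        for name in theta_init
    }


rng = np.random.default_rng(0)

# Hypothetical shared initialization (assumption: both students start here).
theta_init = {
    "encoder.weight": rng.normal(size=(4, 4)),
    "encoder.bias": np.zeros(4),
}

# Random perturbations stand in for weights obtained by distilling each teacher.
theta_speech = {k: v + 0.1 * rng.normal(size=v.shape) for k, v in theta_init.items()}
theta_music = {k: v + 0.1 * rng.normal(size=v.shape) for k, v in theta_init.items()}

tau_speech = task_vector(theta_speech, theta_init)
tau_music = task_vector(theta_music, theta_init)

# Equal emphasis on both domains; raise lam to favor speech, lower it to favor music.
unified = merge(theta_init, tau_speech, tau_music, lam=0.5)
```

Under these assumptions, changing the domain emphasis only requires calling `merge` again with a different `lam`; neither student has to be retrained.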

Findings

01 Superior performance on speech and music benchmarks

02 Flexible domain emphasis via adjustable weights

03 Simpler training process compared to ensemble distillation

Abstract

Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g., audio large language models. Nonetheless, directly training a general model for speech and music is computationally expensive. Knowledge distillation from teacher ensembles may be a natural solution, but we posit that decoupling the distillation of the speech and music SSL models allows for more flexibility. Thus, we propose to learn distilled task vectors and then linearly interpolate them to form a unified speech+music model. This strategy enables flexible domain emphasis through adjustable weights and is also simpler to train. Experiments on speech and music benchmarks demonstrate that our method yields superior overall…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing