Distilling a speech and music encoder with task arithmetic
Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H.M Wong, Eng Siong Chng, Nancy F. Chen, Hung-yi Lee

TL;DR
This paper introduces a novel method for creating a unified speech and music encoder by distilling separate SSL models into task vectors and linearly interpolating them, enabling flexible domain emphasis and improved performance.
Contribution
It proposes a new approach to unify speech and music models through task vector interpolation, simplifying training and enhancing performance over traditional ensemble distillation.
Findings
Superior overall performance on speech and music benchmarks compared with ensemble distillation
Flexible domain emphasis via adjustable weights
Simpler training process compared to ensemble distillation
Abstract
Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g., audio large language models. Nonetheless, directly training a general model for speech and music is computationally expensive. Knowledge distillation of teacher ensembles may be a natural solution, but we posit that decoupling the distillation of the speech and music SSL models allows for more flexibility. Thus, we propose to learn distilled task vectors and then linearly interpolate them to form a unified speech+music model. This strategy enables flexible domain emphasis through adjustable weights and is also simpler to train. Experiments on speech and music benchmarks demonstrate that our method yields superior overall…
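The interpolation step described in the abstract can be illustrated with a small sketch. It assumes the common task-arithmetic convention that a task vector is the difference between a distilled student's weights and a shared initialization, and that the two task vectors are combined with weights lam and 1 - lam. The abstract does not give the exact formula, so the function names, the single weight lam, and the toy parameter values below are all hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for model parameters: name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(student: Params, init: Params) -> Params:
    """Task vector = distilled student weights minus the shared initialization (assumed convention)."""
    return {k: [s - i for s, i in zip(student[k], init[k])] for k in init}


def merge(init: Params, tau_speech: Params, tau_music: Params, lam: float) -> Params:
    """Add a linear interpolation of the two task vectors back onto the initialization.

    lam = 1.0 emphasizes speech only, lam = 0.0 emphasizes music only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {
        k: [i + lam * s + (1.0 - lam) * m
            for i, s, m in zip(init[k], tau_speech[k], tau_music[k])]
        for k in init
    }


# Hypothetical toy weights; real students would be distilled from speech and music SSL teachers.
init = {"w": [0.0, 1.0]}
speech_student = {"w": [1.0, 1.0]}
music_student = {"w": [0.0, 3.0]}

tau_s = task_vector(speech_student, init)
tau_m = task_vector(music_student, init)
merged = merge(init, tau_s, tau_m, lam=0.5)
```

Because each student is distilled separately in this sketch, changing lam shifts the domain emphasis without any retraining, which is the flexibility the abstract attributes to decoupled distillation.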
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
