Music Foundation Model as Generic Booster for Music Downstream Tasks

WeiHsiang Liao; Yuhta Takida; Yukara Ikemiya; Zhi Zhong; Chieh-Hsin Lai; Giorgio Fabbro; Kazuki Shimada; Keisuke Toyama; Kinwai Cheuk; Marco A. Martínez-Ramírez; Shusuke Takahashi; Stefan Uhlich; Taketo Akama; Woosung Choi; Yuichiro Koyama; Yuki Mitsufuji

arXiv:2411.01135 · cs.SD · May 28, 2025

Open Access

TL;DR

This paper introduces SoniDo, a music foundation model that extracts hierarchical features to significantly improve performance across various music understanding and generation tasks, especially in data-scarce scenarios.

Contribution

The paper presents SoniDo, a novel music foundation model that leverages hierarchical intermediate representations to enhance multiple downstream music tasks.

Findings

1. Improved accuracy in music tagging, transcription, source separation, and mixing.

2. Hierarchical features from SoniDo boost downstream task performance.

3. Effective in scenarios with limited training data.

Abstract

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not…
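The general idea of feeding multi-resolution features from a frozen model into a downstream task can be sketched as follows. This is only an illustration under stated assumptions, not SoniDo's actual architecture or API: `fake_hierarchical_features` is a hypothetical stand-in that uses fixed random projections at three hop sizes, and `align_to` and `boost_input` are likewise invented names. The sketch shows one plausible way to time-align coarse-to-fine feature sequences and concatenate them with a task's own input features.

```python
import numpy as np

def fake_hierarchical_features(audio, hops=(8, 32, 128), dim=4, seed=0):
    """Hypothetical stand-in for a frozen foundation model.

    Returns one feature sequence per level, from fine (small hop)
    to coarse (large hop). A real model would use learned encoders;
    here each level is a fixed random projection of audio frames.
    """
    rng = np.random.default_rng(seed)
    feats = []
    for hop in hops:
        n_frames = len(audio) // hop
        frames = audio[: n_frames * hop].reshape(n_frames, hop)
        proj = rng.standard_normal((hop, dim))  # fixed, not learned
        feats.append(frames @ proj)             # shape: (n_frames, dim)
    return feats

def align_to(feat, n_target):
    """Nearest-neighbor resampling along time to n_target frames."""
    idx = np.floor(np.arange(n_target) * feat.shape[0] / n_target).astype(int)
    return feat[idx]

def boost_input(task_feat, hier_feats):
    """Concatenate time-aligned hierarchical features to the task input."""
    n = task_feat.shape[0]
    aligned = [align_to(f, n) for f in hier_feats]
    return np.concatenate([task_feat] + aligned, axis=1)

# Toy usage: 1024 samples, a task front end producing 64 frames of 16 dims.
audio = np.random.default_rng(1).standard_normal(1024)
task_feat = np.abs(audio.reshape(64, 16))
hier = fake_hierarchical_features(audio)
boosted = boost_input(task_feat, hier)
print(boosted.shape)  # (64, 28): 16 task dims + 3 levels x 4 dims
```

In this toy setup the downstream model would then be trained on `boosted` instead of `task_feat`; concatenation is just one simple choice of conditioning, assumed here for brevity.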

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies