Advancing the Foundation Model for Music Understanding

Yi Jiang; Wei Wang; Xianwen Guo; Huiyun Liu; Hanrui Wang; Youri Xu; Haoqi Gu; Zhongqian Xie; Chuanjiang Luo

arXiv:2508.01178·cs.SD·August 5, 2025


Open Access · 4 Models · 5 Datasets

TL;DR

This paper introduces MuFun, a unified foundation model for comprehensive music understanding that jointly processes instrumental and lyrical content, trained on diverse tasks, and evaluated on a new benchmark, MuCUE.

Contribution

The paper presents MuFun, a novel architecture and training approach for a unified music understanding model, along with a new benchmark MuCUE for evaluation.

Findings

01. MuFun significantly outperforms existing audio large language models across the MuCUE tasks.

02. The model demonstrates strong generalization across diverse music tasks.

03. MuFun achieves state-of-the-art results on the MuCUE music understanding benchmark.

Abstract

The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.
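As an illustration of what evaluating one model "across tasks" can look like in practice, the following is a minimal sketch that computes per-task accuracy and an unweighted (macro) average over tasks. It is not MuCUE's actual scoring procedure, which this page does not describe. The task names are borrowed from the task types listed in the abstract; the record format, the exact-match criterion, and the function names `per_task_accuracy` and `macro_average` are assumptions made for the example.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for an iterable of (task, prediction, reference).

    Assumption: a prediction counts as correct only on exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, reference in records:
        total[task] += 1
        if prediction == reference:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(scores):
    """Unweighted mean over tasks, so small tasks count as much as large ones."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy records; not real benchmark data or model outputs.
records = [
    ("genre_classification", "rock", "rock"),
    ("genre_classification", "jazz", "pop"),
    ("music_tagging", "guitar", "guitar"),
    ("question_answering", "B", "B"),
    ("question_answering", "A", "C"),
    ("question_answering", "D", "D"),
]

scores = per_task_accuracy(records)
overall = macro_average(scores)
```

A macro average is one reasonable choice when tasks differ greatly in size, because it prevents a single large task from dominating the overall score; a benchmark may instead report each task separately or use task-specific metrics.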


Taxonomy

Topics: Music and Audio Processing · Topic Modeling · Speech Recognition and Synthesis