OpenMU: Your Swiss Army Knife for Music Understanding
Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji

TL;DR
OpenMU-Bench is a large-scale benchmark suite that addresses data scarcity in training multimodal language models for music understanding and extends the task scope to lyrics understanding and music tool usage. OpenMU, the model trained on it, outperforms baselines such as MU-Llama.
Contribution
We introduce OpenMU-Bench, a large-scale benchmark for multimodal music understanding, and develop OpenMU, a model that outperforms baselines. Both are open-sourced for future research.
Findings
OpenMU outperforms baseline models such as MU-Llama.
OpenMU-Bench broadens the scope of music understanding to include lyrics understanding and music tool usage.
OpenMU and OpenMU-Bench are open-sourced.
Abstract
We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
This paper has two strengths: 1. The authors make significant efforts in collecting and annotating music datasets using an LLM. This contribution provides valuable resources for the music community, supporting further development of LM-based music understanding models. 2. The evaluation of the proposed OpenMU is thorough, with comparisons to baseline models such as MU-Llama, MusiLingo, and M2UGen. The paper also includes ablation studies examining the effects of token size/length and various lo
The paper contains two weaknesses: 1. **Limited Novelty**: The contributions of this paper are somewhat constrained. It should focus on either advancing the benchmark for music understanding or improving LM-based music understanding models. While the authors have invested considerable effort in data collection, task formulation, and initial evaluation of the benchmark, there are no new designs for model architecture or evaluation metrics except a new task termed "tool using". Furthermore, it sh
1. Standardizing the evaluation metrics for text-generation tasks in OpenMU-Bench is a smart step. It enhances the consistency and fairness of benchmarking, allowing for more effective comparisons between various music-understanding models. 2. The thorough exploration of key factors in training OpenMU is useful. For instance, examining how the number of music tokens impacts training efficiency and model convergence provides reusable insights for future research.
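To illustrate what a standardized text-generation metric provides, the sketch below scores every model's outputs with one shared function, so results are directly comparable. Token-overlap F1 is a stand-in chosen for brevity and is not claimed to be a metric used by OpenMU-Bench; the function names, model names, and strings are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (illustrative stand-in metric)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_models(outputs_by_model: dict, references: list) -> dict:
    """Average the same metric over the same references for every model."""
    scores = {}
    for name, preds in outputs_by_model.items():
        if len(preds) != len(references):
            raise ValueError(f"{name}: expected {len(references)} outputs")
        total = sum(token_f1(p, r) for p, r in zip(preds, references))
        scores[name] = total / len(references) if references else 0.0
    return scores


if __name__ == "__main__":
    # Hypothetical references and model outputs, for illustration only.
    refs = ["a slow piano ballad in a minor key"]
    outputs = {
        "model_a": ["a slow piano ballad"],
        "model_b": ["fast electronic dance track"],
    }
    print(score_models(outputs, refs))
```

Because every model passes through the same `token_f1` call on the same references, differences in the reported scores reflect the outputs rather than differences in the scoring setup.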
1. The proposed model and dataset primarily rely on established techniques and common practices in the field, resulting in a lack of novelty: 1) Although the paper utilizes existing datasets and employs GPT-3.5 to generate new annotations, the underlying data sources are mainly based on pre-existing music-related datasets, with no introduction of new music data or innovative data-collection methods. 2) Regarding the model architecture, the use of AudioMAE for encoding music clips, Llama 3 as the
The paper's strengths lie in its novel contribution to the field of music information retrieval (MIR) through the creation of OpenMU-Bench, a large-scale benchmark that significantly expands the scope of music understanding tasks. The benchmark's comprehensiveness is a notable advantage, as it covers various aspects of music understanding, which is crucial for developing well-rounded multimodal language models. Additionally, the paper demonstrates OpenMU's superior performance over existing mode
1. LLark has published its source code at https://github.com/spotify-research/llark, contrary to the paper's claim that it has not open-sourced its models and datasets. 2. OpenMU lacks innovation, being derived from previous works with limited novelty in training. 3. OpenMU-Bench lacks discussion on its construction, including data handling and annotation, limiting its practical application value. 4. The paper does not clarify the overlap between training and testing sets in OpenMU-Bench or e
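The train/test overlap concern in point 4 above can be checked mechanically when each example records its source track. The following is a minimal sketch, assuming a per-example track identifier is available; the source does not describe the data schema, and the function name and IDs are hypothetical.

```python
def split_overlap(train_ids, test_ids):
    """Return the track IDs shared by both splits and the fraction of
    distinct test tracks that also appear in training."""
    train, test = set(train_ids), set(test_ids)
    shared = train & test
    rate = len(shared) / len(test) if test else 0.0
    return sorted(shared), rate


if __name__ == "__main__":
    # Hypothetical track identifiers, for illustration only.
    shared, rate = split_overlap(["t1", "t2", "t3"], ["t3", "t4"])
    print(shared, rate)
```

A nonzero rate would indicate that some test tracks were also seen during training, which is the kind of leakage the reviewer asks the authors to rule out.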
This paper proposes OpenMU, a multimodal large language model that handles a comprehensive range of MIR tasks and outperforms the existing MU-Llama model. In addition, the paper establishes OpenMU-Bench, a publicly available dataset that is large-scale and comprehensive.
The primary weakness of this paper is that its content and experimental results do not convincingly support its claimed contributions. Here are the specific issues: 1. Ambiguity in Contribution between Benchmark and Model: The relationship between the benchmark (including the dataset) and the model is unclear. In the abstract, the benchmark appears to be the main contribution, with the model serving to demonstrate the dataset’s capabilities and provide an example usage. However, in the introduc
Taxonomy
Topics: Music Technology and Sound Studies · Diverse Musicological Studies · Music and Audio Processing
