Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Bo Zhang, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun

TL;DR
The paper introduces MSU-Bench, a benchmark for evaluating how well large language models and vision-language models understand complete musical scores in textual and visual formats. It exposes the limitations of current models and shows that fine-tuning improves their performance.
Contribution
It presents a new multimodal benchmark with 1,800 questions on musical scores, evaluates state-of-the-art models, and demonstrates the benefits of fine-tuning for improved musical understanding.
Findings
Models show modality gaps and unstable performance across difficulty levels.
Fine-tuning significantly enhances model accuracy and consistency.
MSU-Bench provides a foundation for future multimodal musical reasoning research.
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness.…
