Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores

Congren Dai; Yue Yang; Krinos Li; Huichi Zhou; Shijie Liang; Bo Zhang; Enyang Liu; Ge Jin; Hongran An; Haosen Zhang; Peiyuan Jing; Kinhei Lee; Zhenxuan Zhang; Xiaobing Li; Maosong Sun

arXiv:2511.20697 · cs.SD · April 24, 2026

TL;DR

The paper introduces MSU-Bench, a benchmark for evaluating how well large language models and vision-language models understand complete musical scores in textual (ABC notation) and visual (PDF) formats. It documents the models' current limitations and the gains obtained through fine-tuning.

Contribution

It presents a new multimodal benchmark of 1,800 generative question-answer pairs on musical scores, evaluates more than fifteen state-of-the-art models in zero-shot and fine-tuned settings, and demonstrates that fine-tuning improves musical understanding.

Findings

01 Models show modality gaps and unstable performance across difficulty levels.

02 Fine-tuning significantly enhances model accuracy and consistency.

03 MSU-Bench provides a foundation for future multimodal musical reasoning research.

Abstract

Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness.…

Code & Models

Repositories

Congren-Dai/MSU-Bench
github

Datasets

Krinos/MSU-Bench
dataset · 61 downloads
