The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs
Brandon James Carone, Iran R. Roman, and Pablo Ripollés

TL;DR
The MUSE Benchmark evaluates music perception and relational reasoning in audio large language models, revealing significant gaps compared to human performance and exposing perceptual deficits in current models.
Contribution
We introduce the MUSE Benchmark with 10 tasks to systematically assess fundamental music perception skills in audio LLMs, highlighting their limitations.
Findings
Wide variance in SOTA model capabilities
Persistent gap between models and human experts
Severe perceptual deficits in some models
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen2.5-Omni and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find that Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical…
Taxonomy
Topics: Music and Audio Processing · Neuroscience and Music Perception · Music Technology and Sound Studies
