The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs

Brandon James Carone, Iran R. Roman, and Pablo Ripollés

arXiv:2510.19055·cs.AI·October 23, 2025

TL;DR

The MUSE Benchmark evaluates music perception and relational reasoning in audio large language models, revealing significant gaps compared to human performance and exposing perceptual deficits in current models.

Contribution

We introduce the MUSE Benchmark, a set of 10 tasks that systematically assess fundamental music perception skills in audio LLMs and highlight their limitations.

Findings

01. Wide variance in SOTA model capabilities

02. Persistent gap between models and human experts

03. Severe perceptual deficits in some models

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical…

Taxonomy

Topics: Music and Audio Processing · Neuroscience and Music Perception · Music Technology and Sound Studies