Evaluating Multimodal Large Language Models on Core Music Perception Tasks
Brandon James Carone, Iran R. Roman, Pablo Ripollés

TL;DR
This paper benchmarks state-of-the-art multimodal large language models on core music perception tasks, revealing a perceptual gap between MIDI and audio inputs and highlighting the need for more robust audio understanding in music AI systems.
Contribution
It introduces a benchmark and dataset for evaluating music perception in multimodal LLMs, separating perceptual limitations (audio vs. MIDI input) from the effects of few-shot examples and reasoning strategy.
Findings
Models perform near ceiling on MIDI inputs.
Accuracy drops significantly on audio inputs.
Reasoning strategies offer minimal improvements.
Abstract
Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches…
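The experimental factors named in the abstract can be laid out as a grid of conditions. The sketch below is illustrative only: it assumes the factors are fully crossed, which the abstract does not state, and the names `build_conditions` and the dictionary keys are hypothetical, not taken from the paper's code.

```python
from itertools import product

# Factor levels as listed in the abstract.
MODELS = ["Gemini 2.5 Pro", "Gemini 2.5 Flash", "Qwen2.5-Omni"]
TASKS = [
    "Syncopation Scoring",
    "Transposition Detection",
    "Chord Quality Identification",
]
MODALITIES = ["audio", "MIDI"]                 # (i) perceptual limitations
SHOTS = ["zero-shot", "few-shot"]              # (ii) exposure to examples
REASONING = ["Standalone", "CoT", "LogicLM"]   # (iii) reasoning strategies


def build_conditions():
    """Enumerate every combination of factors.

    Assumption: a fully crossed design. The paper may have run
    only a subset of these cells.
    """
    return [
        {
            "model": model,
            "task": task,
            "modality": modality,
            "shots": shots,
            "reasoning": reasoning,
        }
        for model, task, modality, shots, reasoning in product(
            MODELS, TASKS, MODALITIES, SHOTS, REASONING
        )
    ]


conditions = build_conditions()
print(len(conditions))  # 3 * 3 * 2 * 2 * 3 = 108
```

Holding every factor but modality fixed and comparing the paired audio and MIDI cells is one way to isolate the perceptual gap the abstract describes.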
