CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Yinghao Ma; Siyou Li; Juntao Yu; Emmanouil Benetos; Akira Maezawa

arXiv:2506.12285 · eess.AS · July 1, 2025

TL;DR

CMI-Bench introduces a comprehensive benchmark for evaluating music instruction-following capabilities of audio-text LLMs across diverse MIR tasks, highlighting current model limitations and biases.

Contribution

This paper presents CMI-Bench, a standardized benchmark for assessing music instruction following in audio-text LLMs across multiple MIR tasks, enabling fair comparison between models.

Findings

01. Significant performance gaps between LLMs and supervised models.

02. Identification of cultural, chronological, and gender biases in models.

03. Benchmark supports multiple open-source audio-text LLMs.

Abstract

Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional music information retrieval (MIR) annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of MIR tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking: reflecting core…
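As a rough illustration of what reinterpreting an MIR annotation as an instruction-following example could look like, here is a minimal Python sketch for genre classification. Everything in it is an assumption made for illustration: the `InstructionExample` type, the prompt template, the helper names, and the exact-match scorer are hypothetical and are not taken from the CMI-Bench code or prompts.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class InstructionExample:
    """One instruction-following item (hypothetical structure)."""
    audio_path: str
    instruction: str
    reference: str


# Hypothetical prompt template; the benchmark's real prompts may differ.
GENRE_TEMPLATE = (
    "Listen to the audio clip and name its genre. "
    "Choose one of: {labels}."
)


def genre_annotation_to_instruction(
    audio_path: str, genre: str, label_set: Iterable[str]
) -> InstructionExample:
    """Wrap a (clip, genre label) annotation as an instruction example."""
    labels = sorted(set(label_set))
    if genre not in labels:
        raise ValueError(f"unknown genre label: {genre!r}")
    instruction = GENRE_TEMPLATE.format(labels=", ".join(labels))
    return InstructionExample(audio_path, instruction, genre)


def exact_match_accuracy(
    predictions: List[str], examples: List[InstructionExample]
) -> float:
    """Fraction of predictions equal to the reference label,
    ignoring case and surrounding whitespace (an assumed scoring rule)."""
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples differ in length")
    if not examples:
        return 0.0
    hits = sum(
        pred.strip().lower() == ex.reference.strip().lower()
        for pred, ex in zip(predictions, examples)
    )
    return hits / len(examples)
```

Tasks with structured outputs, such as (down)beat tracking or melody extraction, would need the model's free-text answer parsed into timestamps or pitch values before scoring; that parsing step is not shown here.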

Code & Models

Datasets

nicolaus625/CMI-bench
dataset · 253 dl

Taxonomy

Topics: Diverse Music Education Insights

Methods: Sparse Evolutionary Training