CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma; Haiwen Xia; Hewei Gao; Weixiong Chen; Yuxin Ye; Yuchen Yang; Sungkyun Chang; Mingshuo Ding; Yizhi Li; Ruibin Yuan; Simon Dixon; Emmanouil Benetos

arXiv:2603.00610 · cs.SD · March 5, 2026

Open Access

TL;DR

This paper introduces a comprehensive ecosystem for evaluating music reward models in multimodal settings, including new datasets, a benchmark, and reward models that align well with human judgments.

Contribution

It presents CMI-RewardBench, a unified benchmark for music reward evaluation, and develops CMI reward models capable of processing diverse multimodal inputs.

Findings

01. CMI-RM correlates strongly with human judgments.

02. The datasets enable fine-grained alignment evaluation.

03. Reward models support effective inference-time scaling.
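
The inference-time scaling finding can be illustrated with best-of-N selection, one common way to use a reward model at inference: generate several candidates and keep the one the reward model scores highest. This page does not say which procedure the authors use, so the sketch below is an assumption. The names `best_of_n` and `toy_scores` are hypothetical, and the scores are made up rather than produced by CMI-RM.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], reward_fn: Callable[[T], float]) -> T:
    """Return the candidate that reward_fn scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


# Toy stand-in for a reward model: a lookup table of invented scores.
# A real setup would score generated audio against its instruction.
toy_scores = {"take_a": 0.31, "take_b": 0.87, "take_c": 0.55}

best = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(best)  # take_b
```

Under this scheme, a larger N gives the reward model more candidates to choose from, at the cost of more generation compute.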

Abstract

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward…
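
A benchmark built on preference pairs is often scored by pairwise accuracy: the fraction of pairs where the reward model ranks the human-preferred sample above the rejected one. The truncated abstract does not state the metric CMI-RewardBench uses, so the following is a minimal sketch under that assumption. `pairwise_accuracy`, the sample labels, and the scores are all hypothetical.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def pairwise_accuracy(
    pairs: Iterable[Tuple[T, T]], reward_fn: Callable[[T], float]
) -> float:
    """Fraction of (chosen, rejected) pairs where chosen gets the higher score."""
    total = 0
    correct = 0
    for chosen, rejected in pairs:
        total += 1
        if reward_fn(chosen) > reward_fn(rejected):
            correct += 1
    if total == 0:
        raise ValueError("need at least one preference pair")
    return correct / total


# Invented scores standing in for a reward model's outputs.
toy_scores = {"a": 0.9, "b": 0.4, "c": 0.2, "d": 0.1}

# Each tuple is (human-preferred sample, rejected sample).
toy_pairs = [("a", "b"), ("c", "b"), ("b", "d")]

acc = pairwise_accuracy(toy_pairs, toy_scores.__getitem__)
print(f"{acc:.3f}")  # 0.667
```

Ties count as errors in this sketch; a real benchmark may handle them differently.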

Taxonomy

Topics: Music and Audio Processing · Topic Modeling · Multimodal Machine Learning Applications