MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Wenzhe Li; Shujian Zhang; Wenxuan Zhou; John Lambert; Chi Jin; Andrew Hard; Rajiv Mathews; Lun Wang

arXiv:2512.24693·cs.CL·January 1, 2026

TL;DR

This paper introduces MUSIC, an unsupervised data augmentation method that strengthens multi-turn reward models by adding contrastive signals that span multiple turns, improving automated evaluation of multi-turn conversations with large language models.

Contribution

The paper proposes MUSIC, a novel unsupervised augmentation strategy that synthesizes multi-turn contrastive conversation pairs, significantly improving multi-turn reward model performance.

Findings

01 MUSIC-augmented RM outperforms baseline methods in multi-turn conversation evaluation.

02 The approach maintains performance on single-turn benchmarks.

03 Incorporating multi-turn contrasts is crucial for robust multi-turn reward modeling.

Abstract

Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction…
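
To make the distinction drawn in the abstract concrete, the following Python sketch compares a preference pair whose chosen and rejected conversations differ only at the final assistant turn with one whose differences span several turns. The `Turn` and `PreferencePair` types, the `contrast_turns` and `conversation` helpers, and the example dialogues are all hypothetical illustrations. This is not the MUSIC augmentation procedure, whose details are not given in the truncated abstract above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    content: str


@dataclass
class PreferencePair:
    chosen: List[Turn]
    rejected: List[Turn]


def contrast_turns(pair: PreferencePair) -> List[int]:
    """Indices of assistant turns where chosen and rejected differ."""
    return [
        i
        for i, (c, r) in enumerate(zip(pair.chosen, pair.rejected))
        if c.role == "assistant" and c.content != r.content
    ]


def conversation(*contents: str) -> List[Turn]:
    """Build a conversation that alternates user/assistant, starting with user."""
    roles = ("user", "assistant")
    return [Turn(roles[i % 2], text) for i, text in enumerate(contents)]


# Hypothetical pair: shared context, contrast only at the final turn.
final_turn_pair = PreferencePair(
    chosen=conversation("Plan a trip.", "Where to?", "Rome.", "Here is a 3-day Rome plan."),
    rejected=conversation("Plan a trip.", "Where to?", "Rome.", "I cannot help."),
)

# Hypothetical pair: assistant turns differ at more than one position.
multi_turn_pair = PreferencePair(
    chosen=conversation("Plan a trip.", "Where to?", "Rome.", "Here is a 3-day Rome plan."),
    rejected=conversation("Plan a trip.", "Trips are fun.", "Rome.", "I cannot help."),
)

print(contrast_turns(final_turn_pair))  # [3]
print(contrast_turns(multi_turn_pair))  # [1, 3]
```

In the first pair a reward model only ever sees a quality difference at the last turn, which is the limitation the abstract attributes to standard preference datasets. The second pair carries contrast at an earlier turn as well.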

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Speech Recognition and Synthesis