CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
Xinge Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, Yongbin Li

TL;DR
This paper introduces Comparative Policy Optimization (CPO), a reinforcement learning approach that improves role-playing dialogue by scoring groups of responses comparatively instead of scoring each sample independently. It also introduces CharacterArena, an evaluation framework intended to give more robust and fair performance assessments.
Contribution
The paper proposes CPO, a new reward evaluation method based on comparative judgments, and introduces CharacterArena, a framework for more reliable dialogue evaluation, addressing reward ambiguity in subjective tasks.
Findings
CPO reduces reward ambiguity in role-playing dialogue tasks.
CharacterArena provides more robust and fair performance evaluation.
Empirical results show improved dialogue quality with CPO.
Abstract
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level…
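The shift from sample-wise to comparative group-wise scoring described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how comparisons are aggregated, so the round-robin win rate, the mean/std normalization, and the names `groupwise_rewards`, `normalize`, and `toy_judge` are all assumptions made for illustration. In practice the judge would presumably be an LLM or a human rater; here it is a placeholder that prefers the longer response.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, List

# Hypothetical judge signature: returns 1.0 if the first response is better,
# 0.0 if the second is better, and 0.5 for a tie.
Judge = Callable[[str, str], float]


def groupwise_rewards(responses: List[str], judge: Judge) -> List[float]:
    """Score each response by its win rate against the rest of its group.

    Assumption: a round-robin of pairwise comparisons stands in for the
    comparative group-wise scoring mentioned in the abstract.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        p = judge(responses[i], responses[j])
        wins[i] += p
        wins[j] += 1.0 - p
    # Each response takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def normalize(rewards: List[float]) -> List[float]:
    """Center and scale rewards within the group (an assumed post-processing step)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def toy_judge(a: str, b: str) -> float:
    """Placeholder judge: prefers the longer response. Purely illustrative."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) > len(b) else 0.0


group = ["a", "bbb", "cc"]
rewards = groupwise_rewards(group, toy_judge)
advantages = normalize(rewards)
```

Because every score here is defined relative to the other responses in the same group, a judge only needs to decide which of two responses is better, not assign each one an absolute number.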
Taxonomy
Topics: Speech and dialogue systems · Reinforcement Learning in Robotics · Topic Modeling
