URPO: A Unified Reward & Policy Optimization Framework for Large Language Models

Songshuo Lu; Hua Wang; Zhi Chen; Yaohua Tang

arXiv:2507.17515·cs.CV·July 24, 2025

TL;DR

URPO introduces a unified framework that combines reward modeling and policy optimization in a single model and training phase, improving alignment and performance of large language models.

Contribution

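The paper proposes URPO, a unified training method that folds reward modeling and policy optimization into a single model and a single training phase, removing the separately trained, frozen reward model from the alignment pipeline and improving performance over a baseline that keeps one.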

Findings

01. Outperforms a baseline that uses a separate reward model on instruction-following and reasoning tasks.

02. Raises the instruction-following score on AlpacaEval from 42.24 to 44.84.

03. Achieves a RewardBench score of 85.15, surpassing dedicated reward models.

Abstract

Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data (including preference pairs, verifiable reasoning, and open-ended instructions) into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the…
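
To make the "group-relative" part of the GRPO loop concrete, here is a minimal sketch of how rewards for a group of sampled responses to one prompt are commonly turned into advantages: each reward is centered on the group mean and scaled by the group standard deviation. This follows the standard GRPO formulation, not necessarily the exact variant used in URPO, which this page does not spell out. The function names, the exact-match reward rule, and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Standard GRPO-style normalization: (r - group mean) / group std.
    `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def verifiable_reward(answer, ground_truth):
    """Hypothetical rule-based reward for a verifiable reasoning task:
    1.0 on an exact match after trimming whitespace, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


# Four sampled answers to one prompt whose ground-truth answer is "42".
group = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative, with the group itself acting as the baseline. For open-ended instructions, the abstract indicates that the reward would come from the model's own judgments rather than from a rule like `verifiable_reward`.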

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling