Value-Free Policy Optimization via Reward Partitioning

Bilal Faye; Hanane Azzag; Mustapha Lebbah

arXiv:2506.13702·cs.LG·December 23, 2025

TL;DR

This paper introduces Reward Partitioning Optimization (RPO), a novel reinforcement learning method that normalizes rewards directly from data, eliminating the need for value function modeling and improving stability and simplicity in scalar-feedback tasks.

Contribution

RPO is a new approach that removes the need for value function approximation in single-trajectory RL, providing direct policy supervision through reward normalization.
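
As a rough illustration of what "reward normalization" over (prompt, response, reward) triplets can look like, the sketch below groups samples by prompt and applies a softmax to the rewards within each group. This is an assumption made for illustration only: the function name `partition_rewards`, the temperature `beta`, and the per-prompt softmax itself are hypothetical and are not taken from the paper or from the official repository, whose estimator may differ.

```python
import math
from collections import defaultdict


def partition_rewards(triplets, beta=1.0):
    """Hypothetical helper: per-prompt softmax over scalar rewards.

    triplets: list of (prompt, response, reward) tuples.
    beta: assumed temperature; larger values flatten the weights.
    Returns one weight per triplet, in input order. Weights of
    samples that share a prompt sum to 1.
    """
    # Group sample indices and rewards by prompt.
    groups = defaultdict(list)
    for idx, (prompt, _response, reward) in enumerate(triplets):
        groups[prompt].append((idx, reward))

    weights = [0.0] * len(triplets)
    for items in groups.values():
        # Subtract the group maximum for numerical stability.
        max_reward = max(r for _, r in items)
        exps = [(i, math.exp((r - max_reward) / beta)) for i, r in items]
        normalizer = sum(e for _, e in exps)  # empirical partition term
        for i, e in exps:
            weights[i] = e / normalizer
    return weights


if __name__ == "__main__":
    data = [
        ("prompt A", "good answer", 1.0),
        ("prompt A", "bad answer", 0.0),
        ("prompt B", "only answer", 0.5),
    ]
    print(partition_rewards(data))
```

In this sketch no value function is fitted; the normalizer is computed from the observed rewards alone, which is the general idea the summary describes. A prompt with a single response always receives weight 1, which is a limitation of this simplified version.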

Findings

01. RPO outperforms DRO and KTO on language modeling tasks.

02. Because it does not model a value function, RPO is simpler, more stable, and easier to implement.

03. RPO is theoretically grounded and effective in practice.

Abstract

Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it mirrors real-world human feedback, such as thumbs-up/down signals, and avoids the need for structured preference annotations. In contrast, pairwise preference-based methods like Direct Preference Optimization (DPO) rely on datasets with both preferred and dispreferred responses, which are harder to construct and less natural to collect. Among single-trajectory approaches, Direct Reward Optimization (DRO) has shown strong empirical performance due to its simplicity and stability. However, DRO requires approximating a value function, which introduces several limitations: high off-policy variance, coupling between policy and value learning, and a lack of…

Code & Models

Repositories

b-faye/rpo (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Adversarial Robustness in Machine Learning

Methods: Flan-T5