Value-Free Policy Optimization via Reward Partitioning
Bilal Faye, Hanane Azzag, Mustapha Lebbah

TL;DR
This paper introduces Reward Partitioning Optimization (RPO), a novel reinforcement learning method that normalizes rewards directly from data, eliminating the need for value function modeling and improving stability and simplicity in scalar-feedback tasks.
Contribution
RPO is a new approach that removes the need for value function approximation in single-trajectory RL, providing direct policy supervision through reward normalization.
Findings
RPO outperforms DRO and KTO on language modeling tasks.
RPO is simpler, more stable, and easier to implement than methods that require a learned value function, such as DRO.
RPO is theoretically grounded and effective in practice.
Abstract
Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it mirrors real-world human feedback, such as thumbs-up/down signals, and avoids the need for structured preference annotations. In contrast, pairwise preference-based methods like Direct Preference Optimization (DPO) rely on datasets with both preferred and dispreferred responses, which are harder to construct and less natural to collect. Among single-trajectory approaches, Direct Reward Optimization (DRO) has shown strong empirical performance due to its simplicity and stability. However, DRO requires approximating a value function, which introduces several limitations: high off-policy variance, coupling between policy and value learning, and a lack of…
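As a rough illustration of what normalizing rewards directly from (prompt, response, reward) triplets could look like, the sketch below subtracts a per-prompt soft baseline from each reward. The baseline used here, beta * log mean exp(r / beta) over the rewards observed for the same prompt, is an assumption chosen for illustration; the summary above does not state the paper's exact formula. The function name `partition_normalized_rewards` and the parameter `beta` are likewise hypothetical.

```python
import math
from collections import defaultdict


def partition_normalized_rewards(triplets, beta=1.0):
    """Return triplets whose rewards have a per-prompt soft baseline subtracted.

    ASSUMPTION: the baseline is beta * log(mean(exp(r / beta))) over all
    rewards recorded for the same prompt. This is an illustrative choice,
    not a formula confirmed by the summary above.
    """
    # Group the observed rewards by prompt.
    rewards_by_prompt = defaultdict(list)
    for prompt, _response, reward in triplets:
        rewards_by_prompt[prompt].append(reward)

    # Estimate one baseline per prompt, with no learned value function.
    baseline = {}
    for prompt, rewards in rewards_by_prompt.items():
        shift = max(rewards)  # shift by the max for numerical stability
        mean_exp = sum(math.exp((r - shift) / beta) for r in rewards) / len(rewards)
        baseline[prompt] = shift + beta * math.log(mean_exp)

    # Subtract the prompt's baseline from each reward.
    return [(p, y, r - baseline[p]) for p, y, r in triplets]


if __name__ == "__main__":
    data = [
        ("prompt A", "good answer", 1.0),
        ("prompt A", "bad answer", 0.0),
        ("prompt B", "only answer", 0.5),
    ]
    for prompt, response, reward in partition_normalized_rewards(data):
        print(prompt, response, round(reward, 4))
```

Under this assumed baseline, a prompt with a single recorded response receives a normalized reward of zero, so the sketch is only informative when several responses per prompt are available.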
Taxonomy
Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Adversarial Robustness in Machine Learning
Methods: Flan-T5
