TL;DR
This paper introduces Bounded Ratio Reinforcement Learning (BRRL), a framework that unifies trust region methods and PPO, with a monotonic-improvement guarantee and improved empirical performance on MuJoCo, Atari, and IsaacLab environments.
Contribution
The paper develops the BRRL framework with an analytical optimal solution, introduces Bounded Policy Optimization (BPO), and extends it to LLM fine-tuning, connecting trust region methods with the Cross-Entropy Method.
Findings
BPO outperforms PPO in stability and final performance across MuJoCo, Atari, and IsaacLab environments.
BRRL provides a theoretical foundation that explains PPO's success and guarantees monotonic improvement.
GBPO, the extension of BPO to LLM fine-tuning, matches or surpasses existing methods.
Abstract
Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on…
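For context, the "heuristic clipped objective" mentioned in the abstract is the standard PPO surrogate, which takes the minimum of the unclipped and clipped ratio-weighted advantage. The sketch below illustrates that standard objective only. It is not the BRRL solution or the BPO loss, whose exact forms are not given in this summary. The function name, argument names, and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import math


def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of the standard PPO clipped surrogate over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s)
        ratio = math.exp(lp_new - lp_old)
        # Clip the ratio to the interval [1 - eps, 1 + eps]
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice between unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Two samples: ratio 1.5 with positive advantage, ratio 0.5 with negative advantage
example = ppo_clipped_surrogate(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

In this example both samples fall outside the clipping interval, so the first term is capped at 1.2 and the second is held at -0.8. This is the bounded-ratio behavior that PPO enforces heuristically.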
