PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization
Ben Rahman

TL;DR
PPO-BR introduces an adaptive trust region mechanism that dynamically balances exploration and convergence in reinforcement learning, leading to faster, more stable training suitable for safety-critical applications.
Contribution
It proposes a theoretically grounded dual-signal entropy-reward adaptation for PPO, unifying exploration and convergence control in a single trust region.
Findings
29.1% faster convergence on benchmarks
2.3x lower reward variance than PPO
Less than 1.8% runtime overhead
Abstract
Despite Proximal Policy Optimization (PPO) dominating policy gradient methods -- from robotic control to game AI -- its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region -- a theoretically grounded innovation that outperforms five SOTA baselines with less than 2% overhead. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery within a single adaptive mechanism. PPO-BR achieves 29.1% faster convergence by combining: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six…
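The abstract is truncated here and does not give the exact update rule, so the following is only a minimal sketch of how the two signals it names could be fused into one bounded clipping threshold. Everything in it is an assumption for illustration rather than the paper's formula: the function names, the coefficients `lambda_entropy` and `lambda_reward`, the `tanh` squashing of reward progress, and the bounds `eps_min` and `eps_max`.

```python
import math


def adaptive_epsilon(entropy, max_entropy, reward_delta,
                     eps_base=0.2, lambda_entropy=0.5, lambda_reward=0.3,
                     eps_min=0.05, eps_max=0.4):
    """Hypothetical dual-signal clipping threshold (not the paper's exact rule).

    entropy / max_entropy : policy entropy and its upper bound (exploration signal)
    reward_delta          : recent change in smoothed return (convergence signal)
    """
    if max_entropy <= 0:
        raise ValueError("max_entropy must be positive")

    # Normalise entropy to [0, 1]; high uncertainty pushes epsilon up.
    h = min(max(entropy / max_entropy, 0.0), 1.0)

    # Squash positive reward progress to [0, 1); improvement pulls epsilon down.
    r = math.tanh(max(reward_delta, 0.0))

    eps = eps_base * (1.0 + lambda_entropy * h - lambda_reward * r)

    # Keep the trust region bounded regardless of the two signals.
    return min(max(eps, eps_min), eps_max)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample, using the given epsilon."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Early training: high entropy, no reward progress yet -> wider clip range.
    early = adaptive_epsilon(entropy=1.0, max_entropy=1.0, reward_delta=0.0)
    # Late training: low entropy, steady reward gains -> narrower clip range.
    late = adaptive_epsilon(entropy=0.0, max_entropy=1.0, reward_delta=10.0)
    print(f"early eps = {early:.3f}, late eps = {late:.3f}")
    print(clipped_surrogate(ratio=1.5, advantage=1.0, eps=early))
```

In this sketch the threshold would be recomputed once per policy update and passed to the usual clipped surrogate, so the rest of a PPO training loop would stay unchanged.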
