PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

Ben Rahman

arXiv:2505.17714 · cs.LG · May 26, 2025

TL;DR

PPO-BR introduces an adaptive trust region mechanism that dynamically balances exploration and convergence in reinforcement learning, leading to faster, more stable training suitable for safety-critical applications.

Contribution

It proposes a theoretically grounded dual-signal entropy-reward adaptation for PPO, unifying exploration and convergence control in a single trust region.

Findings

01. 29.1% faster convergence on benchmarks

02. 2.3x lower reward variance than PPO

03. Less than 1.8% runtime overhead

Abstract

Despite Proximal Policy Optimization (PPO) dominating policy gradient methods -- from robotic control to game AI -- its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region -- a theoretically grounded innovation that outperforms five SOTA baselines with less than 2% overhead. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery within a single adaptive mechanism. PPO-BR achieves 29.1% faster convergence by combining: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six…
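To make the dual-signal idea concrete, here is a minimal Python sketch of a clipping threshold that expands with policy entropy and contracts with reward progress, and is then clamped so the trust region stays bounded. The abstract shown here does not give the paper's update rule, so the functional form, the function names (`adaptive_clip_epsilon`, `clipped_surrogate`), and every coefficient (`eps_base`, `lam_expand`, `lam_contract`, `eps_min`, `eps_max`) are illustrative assumptions rather than the author's formulation. Only the clipped surrogate itself is the standard PPO objective.

```python
import math


def adaptive_clip_epsilon(entropy, max_entropy, reward_delta,
                          eps_base=0.2, lam_expand=0.5, lam_contract=0.3,
                          eps_min=0.05, eps_max=0.3):
    """Hypothetical dual-signal clip threshold (not the paper's exact rule).

    entropy      -- current policy entropy
    max_entropy  -- largest possible entropy, used to normalise the signal
    reward_delta -- recent change in smoothed return (positive = improving)
    """
    # Entropy signal in [0, 1]: close to 1 when the policy is still uncertain.
    h = min(max(entropy / max_entropy, 0.0), 1.0)
    # Reward signal in [0, 1): only improvements contract the trust region.
    r = math.tanh(max(reward_delta, 0.0))
    # Expand with entropy, contract with reward progress.
    eps = eps_base * (1.0 + lam_expand * h - lam_contract * r)
    # Clamp so the trust region stays bounded.
    return min(max(eps, eps_min), eps_max)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample, with a supplied eps."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Early training: high entropy, no reward progress -> wider trust region.
eps_explore = adaptive_clip_epsilon(entropy=1.0, max_entropy=1.0, reward_delta=0.0)
# Late training: low entropy, steady reward gains -> tighter trust region.
eps_converge = adaptive_clip_epsilon(entropy=0.0, max_entropy=1.0, reward_delta=10.0)
```

Under these assumed coefficients the threshold sits above the base value of 0.2 in the exploratory case and below it in the converging case, which matches the qualitative behaviour the abstract describes. Any real implementation should take its coefficients and signal definitions from the paper itself.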
