Stabilizing Policy Gradient Methods via Reward Profiling
Shihab Ahmed, El Houcine Bergou, Aritra Dutta, Yue Wang

TL;DR
This paper introduces a reward profiling framework that stabilizes policy gradient methods: a policy update is applied only when a high-confidence performance estimate supports it, which reduces return variance and speeds up convergence.
Contribution
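The authors propose a universal reward profiling technique that integrates with any policy gradient algorithm. They show theoretically that it does not slow the convergence of the baseline method and that, with high probability, it yields stable and monotonic performance improvements.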
Findings
Up to 1.5x faster convergence on eight continuous-control benchmarks
Up to 1.75x reduction in return variance
Evaluated on Box2D and MuJoCo/PyBullet continuous-control environments
Abstract
Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performance is often unsatisfactory, suffering from unreliable reward improvements and slow convergence due to high variance in gradient estimates. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimates. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods and, with high probability, will result in stable and monotonic improvements in their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster…
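The selective-update idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the acceptance rule (keep a candidate only if its averaged return is at least the current one), the rollout count, and all names (`estimate_return`, `profiled_step`, `noisy_return`, `noisy_gradient_step`) are assumptions chosen for illustration, and a 1-D toy objective stands in for a real environment.

```python
import random
import statistics


def estimate_return(evaluate, params, n_rollouts, rng):
    """Average n_rollouts noisy return samples for the given parameters."""
    return statistics.fmean(evaluate(params, rng) for _ in range(n_rollouts))


def profiled_step(params, propose_update, evaluate, n_rollouts, rng):
    """Propose an update and keep it only if its estimated return is not worse.

    Assumed acceptance rule (hypothetical, not taken from the paper):
    compare the averaged returns of the candidate and the current parameters.
    Returns (parameters to keep, whether the candidate was accepted).
    """
    candidate = propose_update(params, rng)
    current_score = estimate_return(evaluate, params, n_rollouts, rng)
    candidate_score = estimate_return(evaluate, candidate, n_rollouts, rng)
    if candidate_score >= current_score:
        return candidate, True
    return params, False


# Toy stand-ins for an environment and a policy gradient update (assumptions).
def noisy_return(theta, rng):
    """Noisy return whose expectation is maximized at theta = 3."""
    return -(theta - 3.0) ** 2 + rng.gauss(0.0, 1.0)


def noisy_gradient_step(theta, rng, lr=0.1):
    """One gradient-ascent step using a high-variance gradient estimate."""
    grad = -2.0 * (theta - 3.0) + rng.gauss(0.0, 5.0)
    return theta + lr * grad


if __name__ == "__main__":
    rng = random.Random(0)
    theta = 0.0
    accepted = 0
    for _ in range(50):
        theta, ok = profiled_step(theta, noisy_gradient_step, noisy_return, 16, rng)
        accepted += ok
    print(f"final theta: {theta:.3f}, accepted updates: {accepted}/50")
```

Averaging more rollouts per comparison makes the accept/reject decision more reliable at the cost of extra samples; how the paper sets that trade-off and what confidence criterion it uses are not stated in the text above.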
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
