Stabilizing Policy Gradient Methods via Reward Profiling

Shihab Ahmed; El Houcine Bergou; Aritra Dutta; Yue Wang

arXiv:2511.16629 · cs.LG · January 27, 2026

TL;DR

This paper introduces a reward profiling framework for policy gradient methods that updates the policy only when high-confidence performance estimates support the change, reducing return variance and making convergence faster and more stable.

Contribution

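The authors propose a universal reward profiling technique that integrates with any policy gradient algorithm. They argue theoretically that it does not slow the convergence of the baseline method and that, with high probability, it yields stable and monotonic performance improvements.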

Findings

01. Up to 1.5x faster convergence on benchmarks

02. Up to 1.75x reduction in return variance

03. Applicable to various continuous-control environments

Abstract

Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster…
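The abstract describes selective updating only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a simple acceptance rule (the candidate's mean return must beat the current policy's by more than `z` standard errors) and a hypothetical one-parameter task with a noisy return and a noisy gradient. The names `accept_update`, `noisy_return`, `noisy_gradient`, and `train`, along with every constant, are illustrative assumptions.

```python
import math
import random
import statistics


def accept_update(current_returns, candidate_returns, z=1.0):
    """Assumed gate: accept the candidate only if its mean return exceeds
    the current policy's by more than z standard errors of the difference."""
    n_cur, n_cand = len(current_returns), len(candidate_returns)
    gap = statistics.fmean(candidate_returns) - statistics.fmean(current_returns)
    stderr = math.sqrt(
        statistics.variance(current_returns) / n_cur
        + statistics.variance(candidate_returns) / n_cand
    )
    return gap > z * stderr


def noisy_return(theta, rng):
    # Hypothetical task: the true return peaks at theta = 3.
    return -(theta - 3.0) ** 2 + rng.gauss(0.0, 1.0)


def noisy_gradient(theta, rng):
    # Stand-in for a high-variance policy gradient estimate.
    return -2.0 * (theta - 3.0) + rng.gauss(0.0, 4.0)


def train(steps=200, lr=0.05, n_eval=16, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        # Propose an update from the noisy gradient.
        candidate = theta + lr * noisy_gradient(theta, rng)
        # Profile both parameter values with fresh evaluation rollouts.
        cur = [noisy_return(theta, rng) for _ in range(n_eval)]
        cand = [noisy_return(candidate, rng) for _ in range(n_eval)]
        # Keep the current parameters unless the gate is passed.
        if accept_update(cur, cand):
            theta = candidate
    return theta
```

Calling `train()` runs the gated loop on the toy task. Raising `z` or `n_eval` makes the gate more conservative, at the cost of more evaluation rollouts per accepted update.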

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques