Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator
Haoyu Han, Heng Yang

TL;DR
This paper analyzes the noise-to-signal ratio in policy-gradient estimators within reinforcement learning, revealing its non-uniform behavior and potential to cause training instability as policies improve.
Contribution
It provides exact characterizations of the REINFORCE estimator's NSR for finite-horizon linear and polynomial systems, plus a general upper bound on the estimator variance for nonlinear dynamics and expressive policies, clarifying the training dynamics of policy-gradient methods.
Findings
The NSR landscape is highly non-uniform across policy parameters.
NSR tends to increase and can blow up near optima, risking training instability.
The NSR can be characterized exactly for specific systems, either in closed form or via numerical moment-evaluation algorithms.
Abstract
Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly, either in closed form or via numerical moment-evaluation algorithms, without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of…
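To make the NSR definition in the abstract concrete, the sketch below estimates it by Monte Carlo for a toy problem. It is not the paper's exact characterization: the scalar system x_{t+1} = a x_t + b u_t, the quadratic cost, the single-trajectory REINFORCE estimator without a baseline, and all names and default values (`reinforce_samples`, `empirical_nsr`, `a`, `b`, `sigma`, `q`, `r`, `x0`) are assumptions chosen for illustration. The true gradient is approximated by the sample mean, so the resulting ratio is itself noisy.

```python
import numpy as np

def reinforce_samples(k, n_traj=20000, horizon=5, a=0.9, b=0.5,
                      sigma=0.5, q=1.0, r=0.1, x0=1.0, seed=0):
    """Draw single-trajectory REINFORCE gradient samples w.r.t. the gain k.

    Hypothetical setup: x_{t+1} = a*x_t + b*u_t, u_t ~ N(k*x_t, sigma^2),
    cost = sum_t (q*x_t^2 + r*u_t^2) + q*x_T^2.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, x0, dtype=float)
    score = np.zeros(n_traj)   # sum_t d/dk log pi(u_t | x_t)
    cost = np.zeros(n_traj)    # total trajectory cost
    for _ in range(horizon):
        eps = rng.standard_normal(n_traj)
        u = k * x + sigma * eps
        # d/dk log N(u; k*x, sigma^2) = (u - k*x) * x / sigma^2 = eps * x / sigma
        score += eps * x / sigma
        cost += q * x**2 + r * u**2
        x = a * x + b * u
    cost += q * x**2           # terminal cost
    return score * cost        # one gradient sample per trajectory

def empirical_nsr(k, **kwargs):
    """Estimator variance divided by the squared (estimated) true gradient."""
    g = reinforce_samples(k, **kwargs)
    noise = g.var(ddof=1)      # sample variance of the estimator
    signal = g.mean() ** 2     # squared Monte Carlo estimate of the gradient
    return noise / signal

if __name__ == "__main__":
    # Scan a few gains to see how the ratio varies across policy parameters.
    for k in (-1.5, -1.0, -0.5, 0.0):
        print(f"k = {k:+.1f}  empirical NSR = {empirical_nsr(k):.3g}")
```

Scanning `k` over a finer grid gives a rough picture of the NSR landscape for this toy system. Where the true gradient is close to zero, the sample-mean estimate of the signal becomes unreliable, so more trajectories are needed there.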
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing
