Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

Haoyu Han; Heng Yang

arXiv:2602.01460·math.OC·February 10, 2026

Open Access

TL;DR

This paper analyzes the noise-to-signal ratio in policy-gradient estimators within reinforcement learning, revealing its non-uniform behavior and potential to cause training instability as policies improve.

Contribution

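The paper gives exact characterizations of the NSR of the REINFORCE estimator for finite-horizon linear and polynomial systems with Gaussian policies, and a general upper bound on the estimator variance for nonlinear dynamics and expressive (including neural) policies. Together these clarify the training dynamics of policy-gradient methods.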

Findings

01. The NSR landscape is highly non-uniform across policy parameters.

02. The NSR tends to increase as the policy improves and can blow up near optima, risking training instability.

03. For finite-horizon linear and polynomial systems, the NSR can be characterized exactly, either in closed form or via numerical moment evaluation.

Abstract

Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly, either in closed form or via numerical moment-evaluation algorithms, without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of…
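To make the NSR definition concrete (estimator variance divided by the squared norm of the true gradient), here is a minimal Monte Carlo sketch. It is not the paper's method or code. The one-step scalar linear system, the quadratic cost, the parameter values (`a`, `b`, `q`, `r`, `sigma`, `x0`), and the function name `reinforce_nsr` are all assumptions chosen for illustration. The policy is Gaussian with linear state feedback, u ~ N(k·x0, sigma²), and the true gradient is available in closed form for this toy problem.

```python
import numpy as np


def reinforce_nsr(k, n_samples=200_000, a=1.0, b=1.0, q=1.0, r=0.1,
                  sigma=0.5, x0=1.0, seed=0):
    """Estimate the NSR of the single-sample REINFORCE gradient w.r.t. gain k.

    Toy setup (an assumption, not from the paper):
        x1 = a*x0 + b*u,   u = k*x0 + sigma*eps,   eps ~ N(0, 1)
        cost = q*x1**2 + r*u**2
    Returns (nsr, monte_carlo_gradient, true_gradient).
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)

    u = k * x0 + sigma * eps           # sampled actions
    x1 = a * x0 + b * u                # next state
    cost = q * x1**2 + r * u**2        # one-step cost

    # Score function: d/dk log N(u; k*x0, sigma^2) = (u - k*x0) * x0 / sigma^2
    score = (u - k * x0) * x0 / sigma**2
    g = cost * score                   # per-sample REINFORCE estimates

    # Closed-form gradient of J(k) = E[cost] for this toy problem:
    # J(k) = q*(a + b*k)^2*x0^2 + r*k^2*x0^2 + (q*b^2 + r)*sigma^2
    true_grad = 2.0 * x0**2 * (q * b * (a + b * k) + r * k)

    nsr = g.var() / true_grad**2       # noise (variance) over signal (squared gradient)
    return nsr, g.mean(), true_grad


if __name__ == "__main__":
    # With the defaults, the optimal gain is k* = -q*a*b / (q*b^2 + r) = -1/1.1.
    for k in (0.5, 0.0, -0.5, -0.8, -0.9):
        nsr, g_mean, g_true = reinforce_nsr(k)
        print(f"k={k:+.2f}  true grad={g_true:+.4f}  MC grad={g_mean:+.4f}  NSR={nsr:.3g}")
```

In this toy problem the estimator variance stays bounded away from zero while the true gradient vanishes as `k` approaches the optimal gain, so the printed NSR grows sharply near the optimum. This is only a one-dimensional illustration of the behavior described in the findings.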

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing