On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
Amrit Singh Bedi, Souradip Chakraborty, Anjaly Parayil, Brian Sadler, Pratap Tokekar, Alec Koppel

TL;DR
This paper investigates biases in policy gradient methods for continuous actions, proposing a stable heavy-tailed parameterization with mirror ascent updates that converges reliably and improves reward outcomes.
Contribution
It introduces a convergence analysis for heavy-tailed policy parameterizations using mirror ascent and gradient tracking, enabling stable learning with fixed step and batch sizes.
Findings
Convergence is achieved with constant step and batch sizes.
Heavy-tailed policies improve reward accumulation.
The method demonstrates stability and better performance across benchmarks.
Abstract
We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work, we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of…
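The contrast between unbounded and bounded score functions drawn in the abstract can be illustrated with a minimal sketch. It assumes a scalar action, a policy parameterized only by its location `mu` with fixed scale `sigma`, and a Cauchy distribution as the heavy-tailed example; the abstract does not name a specific heavy-tailed family, so the Cauchy choice and all function names here are illustrative assumptions, not the authors' implementation.

```python
import math

def gaussian_score_mean(a, mu, sigma):
    """d/d(mu) of log N(a | mu, sigma^2) = (a - mu) / sigma^2.
    Grows linearly in |a - mu|, so it is unbounded over the action space."""
    return (a - mu) / sigma**2

def cauchy_score_loc(a, mu, sigma):
    """d/d(mu) of log Cauchy(a | mu, sigma) = 2 (a - mu) / (sigma^2 + (a - mu)^2).
    Its magnitude never exceeds 1 / sigma (attained at |a - mu| = sigma)."""
    d = a - mu
    return 2.0 * d / (sigma**2 + d**2)

# Hypothetical actions increasingly far from the policy's location mu = 0.
actions = [1.0, 10.0, 100.0, 1000.0]
gauss = [abs(gaussian_score_mean(a, 0.0, 1.0)) for a in actions]
cauchy = [abs(cauchy_score_loc(a, 0.0, 1.0)) for a in actions]

for a, g, c in zip(actions, gauss, cauchy):
    print(f"a={a:8.1f}  |gaussian score|={g:10.3f}  |cauchy score|={c:.6f}")
```

Under these assumptions, the Gaussian score magnitude scales with the distance of the action from the mean, while the Cauchy score stays within 1/sigma and decays for far-away actions. This is the sense in which a heavy-tailed parameterization can satisfy a bounded-score assumption without restricting the action region.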
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning
