Moments Matter: Stabilizing Policy Optimization using Return Distributions
Dennis Jabs, Aditya Mohan, Marius Lindauer

TL;DR
This paper introduces a moment-based regularization method for policy optimization in deep RL that leverages return distribution moments to improve stability without expensive distribution estimation.
Contribution
It proposes a novel approach using higher-order moments of return distributions to stabilize policy updates in continuous control tasks.
Findings
Achieves up to 75% stability improvement in Walker2D.
Maintains comparable evaluation returns to standard PPO.
Effectively reduces policy instability caused by noisy updates.
Abstract
Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods, and that the spread of the post-update return distribution, obtained by repeatedly sampling minibatches, updating the policy parameters, and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow post-update return distribution can improve stability, directly estimating this distribution is computationally expensive in…
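The sampling procedure the abstract describes (draw a minibatch, apply one update, measure the return, repeat) can be illustrated with a toy sketch. This is not the paper's method or code. The one-dimensional return landscape, the Gaussian noise standing in for minibatch selection, and all function names (`toy_return`, `noisy_update`, `post_update_returns`, `moments`) are assumptions made purely for illustration.

```python
import random


def toy_return(theta):
    # Hypothetical return landscape with its peak at theta = 1.0.
    return -(theta - 1.0) ** 2


def noisy_update(theta, lr, noise_scale, rng):
    # One gradient-ascent step; the Gaussian term is an assumed stand-in
    # for the randomness introduced by minibatch selection.
    grad = -2.0 * (theta - 1.0) + rng.gauss(0.0, noise_scale)
    return theta + lr * grad


def post_update_returns(theta, n_samples=1000, lr=0.1, noise_scale=1.0, seed=0):
    # Repeat the update from the SAME starting parameters and record the
    # return after each update, giving samples of the post-update return.
    rng = random.Random(seed)
    return [
        toy_return(noisy_update(theta, lr, noise_scale, rng))
        for _ in range(n_samples)
    ]


def moments(samples):
    # Mean, variance, skewness, and kurtosis (population estimates).
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0.0:
        return mean, 0.0, 0.0, 0.0
    std = var ** 0.5
    skew = sum(((x - mean) / std) ** 3 for x in samples) / n
    kurt = sum(((x - mean) / std) ** 4 for x in samples) / n
    return mean, var, skew, kurt


if __name__ == "__main__":
    for scale in (0.5, 2.0):
        returns = post_update_returns(theta=0.0, noise_scale=scale)
        mean, var, skew, kurt = moments(returns)
        print(f"noise={scale}: mean={mean:.4f} var={var:.4f} "
              f"skew={skew:.4f} kurt={kurt:.4f}")
```

In this toy setting, a larger update noise widens the spread of post-update returns, which is the quantity the abstract treats as an indicator of instability. In a real agent each sample would need a full policy update and evaluation rollout, which hints at why direct estimation is costly.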
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Domain Adaptation and Few-Shot Learning
