Moments Matter: Stabilizing Policy Optimization using Return Distributions

Dennis Jabs; Aditya Mohan; Marius Lindauer

arXiv:2601.01803 · cs.LG · January 6, 2026

Open Access

TL;DR

This paper introduces a moment-based regularization method for policy optimization in deep RL that leverages return distribution moments to improve stability without expensive distribution estimation.

Contribution

It proposes a novel approach using higher-order moments of return distributions to stabilize policy updates in continuous control tasks.

Findings

01 Achieves up to 75% stability improvement in Walker2D.

02 Maintains evaluation returns comparable to standard PPO.

03 Effectively reduces policy instability caused by noisy updates.

Abstract

Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods, and that the spread of the post-update return distribution $R(\theta)$, obtained by repeatedly sampling minibatches, updating $\theta$, and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow $R(\theta)$ can improve stability, directly estimating $R(\theta)$ is computationally expensive in…
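
The sampling procedure described in the abstract (sample a minibatch, update the parameters, measure the return, repeat) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy objective (return equal to the negative squared norm of the parameters), the Gaussian gradient noise standing in for minibatch selection, and the helper names `post_update_returns`, `moments`, and `make_toy_problem` are all hypothetical. Which moments the paper actually uses is not stated in this excerpt; mean, variance, and skewness are shown only as examples.

```python
import numpy as np


def post_update_returns(theta, grad_fn, return_fn, lr, n_samples, rng):
    """Empirically sample R(theta): returns measured after one noisy update."""
    samples = np.empty(n_samples)
    for i in range(n_samples):
        grad = grad_fn(theta, rng)              # gradient from a freshly sampled "minibatch"
        theta_new = theta + lr * grad           # one ascent step, always from the same theta
        samples[i] = return_fn(theta_new, rng)  # return of the updated parameters
    return samples


def moments(x):
    """Mean, variance, and skewness of a 1-D sample (population estimators)."""
    mean = x.mean()
    centered = x - mean
    var = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / var ** 1.5 if var > 0 else 0.0
    return {"mean": float(mean), "var": float(var), "skew": float(skew)}


def make_toy_problem(grad_noise, return_noise=0.0):
    """Hypothetical stand-in for an RL task: the return is -||theta||^2."""
    def grad_fn(theta, rng):
        # Exact gradient of the toy return plus Gaussian noise (assumed noise model).
        return -2.0 * theta + grad_noise * rng.standard_normal(theta.shape)

    def return_fn(theta, rng):
        return -float(np.sum(theta ** 2)) + return_noise * rng.standard_normal()

    return grad_fn, return_fn


rng = np.random.default_rng(0)
theta0 = np.array([1.0, 1.0])

# Same starting point, two levels of update noise.
quiet = post_update_returns(theta0, *make_toy_problem(0.0), lr=0.1, n_samples=200, rng=rng)
noisy = post_update_returns(theta0, *make_toy_problem(1.0), lr=0.1, n_samples=200, rng=rng)

print("quiet:", moments(quiet))
print("noisy:", moments(noisy))
```

In this toy setting, noisier updates widen the sampled distribution, which is the sense in which the spread of $R(\theta)$ indicates a noisy neighborhood. In a real RL setup each sample would require a full policy update followed by evaluation rollouts, which is why estimating the distribution directly is costly.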

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Domain Adaptation and Few-Shot Learning