One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry

Weisong Zhao; Tong Wang; Zichang Tan; Te Yang; Siran Peng; Haoyuan Zhang; Tianshuo Zhang; Haichao Shi; Meng Meng; Yang Yang; Xiangyu Zhu; Zhen Lei; Xiao-Yu Zhang; Xu Zhou

arXiv:2601.22521 · cs.CL · February 2, 2026

Open Access

TL;DR

This paper introduces Power-Mean Policy Optimization (PMPO), a unified framework for group-based reinforcement learning that adaptively adjusts the aggregation geometry to improve stability and performance across diverse trajectories.

Contribution

The paper unifies existing group-based RL methods under a flexible power-mean framework and proposes an adaptive mechanism to optimize the aggregation geometry dynamically.

Findings

01. PMPO outperforms strong baselines on mathematical reasoning benchmarks.

02. Adaptive adjustment of the aggregation geometry improves stability and learning efficiency.

03. Theoretical analysis shows how the parameter p influences gradient concentration and trajectory weighting.
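
To make the third finding concrete, here is a short worked derivation for the textbook power mean. It is a sketch under stated assumptions, not the paper's exact objective: the positive quantities x_1, ..., x_n (for example, per-token importance ratios) and the symbols M_p and w_i are illustrative choices that do not appear on this page.

```latex
% Assumption: x_i > 0 are generic positive per-token quantities; M_p is the standard power mean.
\[
M_p(x) = \Bigl(\frac{1}{n}\sum_{i=1}^{n} x_i^{\,p}\Bigr)^{1/p},
\qquad
\lim_{p \to 0} M_p(x) = \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n}.
\]
% Sensitivity of log M_p to log x_i gives a normalized weight per token.
\[
w_i(p) \;=\; \frac{\partial \log M_p(x)}{\partial \log x_i}
\;=\; \frac{x_i^{\,p}}{\sum_{j=1}^{n} x_j^{\,p}},
\qquad
\sum_{i=1}^{n} w_i(p) = 1.
\]
% Special cases: uniform weights at p -> 0 (geometric mean), ratio-proportional weights at p = 1 (arithmetic mean).
\[
w_i(0) = \frac{1}{n},
\qquad
w_i(1) = \frac{x_i}{\sum_{j=1}^{n} x_j}.
\]
```

Under these assumptions, larger p concentrates the weights on the largest x_i, while p near 0 spreads them evenly, which is one way to read "gradient concentration" in the summary above.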

Abstract

Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by constraining a conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean geometry exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine p adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a…
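
The claim that the arithmetic and geometric means are special cases of one power-mean family can be checked numerically. The following is a minimal sketch of the standard power mean only; the function name `power_mean`, the `eps` threshold, and the sample `ratios` are hypothetical, and the code does not reproduce the paper's objective, clipping, or ESS mechanism.

```python
import math


def power_mean(values, p, eps=1e-12):
    """Standard power mean of positive values with exponent p.

    Assumption: this is the textbook definition, used here for illustration.
    As p approaches 0 the power mean converges to the geometric mean, so that
    case is computed in log space to avoid dividing by p.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if abs(p) < eps:
        # Geometric mean: exp of the average log.
        return math.exp(sum(math.log(v) for v in values) / n)
    # General case: (mean of x^p)^(1/p).
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical positive per-token quantities, e.g. importance ratios.
ratios = [0.5, 1.0, 2.0]

arithmetic = power_mean(ratios, 1.0)   # p = 1: arithmetic mean
geometric = power_mean(ratios, 0.0)    # p -> 0: geometric mean
near_zero = power_mean(ratios, 1e-6)   # small p approaches the geometric mean
```

For positive inputs the power mean is non-decreasing in p, so the geometric value never exceeds the arithmetic one; this is consistent with the abstract's description of the geometric-mean objective as the more conservative choice.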

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Adversarial Robustness in Machine Learning