Hölder Policy Optimisation

Yuxiang Chen; Dingli Liang; Yihang Chen; Ziqin Gong; Chenyang Le; Zhaokai Wang; Jiachen Zhu; Lingyu Yang; Jianghao Lin; Weinan Zhang; Jun Wang

arXiv:2605.12058·cs.LG·May 22, 2026

TL;DR

HölderPO introduces a flexible policy optimization framework using the Hölder mean to balance gradient concentration and variance, improving training stability and performance in large language models.

Contribution

It unifies token-level probability aggregation via the Hölder mean and employs a dynamic schedule for the parameter p, enhancing training stability and effectiveness.
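
For reference, the standard Hölder (power) mean of positive values $x_1, \dots, x_n$ is written out below. This is the textbook definition only; reading the $x_i$ as the token-level probabilities of one sequence is an assumption made here for illustration, and the exact objective used by HölderPO is not given in this summary.

```latex
\[
M_p(x_1,\dots,x_n) \;=\; \left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\,p}\right)^{1/p},
\qquad p \neq 0,\; x_i > 0 .
\]
\[
\lim_{p\to 0} M_p = \Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n},
\qquad
\lim_{p\to -\infty} M_p = \min_i x_i,
\qquad
\lim_{p\to +\infty} M_p = \max_i x_i .
\]
\[
p \le q \;\Longrightarrow\; M_p(x_1,\dots,x_n) \le M_q(x_1,\dots,x_n).
\]
```

The case $p = 1$ is the arithmetic mean and $p = -1$ the harmonic mean, so a single parameter sweeps continuously between aggregations dominated by the smallest values and those dominated by the largest.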

Findings

01. Achieves 54.9% average accuracy on mathematical benchmarks.

02. Secures 93.8% success rate on ALFWorld.

03. Outperforms standard GRPO with a 7.2% relative gain.

Abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove…
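
To make the role of $p$ concrete, the following is a minimal sketch of the generic Hölder mean applied to a list of per-token probabilities. It is not the authors' implementation: the function name `holder_mean`, the sample values in `token_probs`, and the idea of feeding raw token probabilities straight into the mean are all assumptions for illustration.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of strictly positive values with exponent p.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0 for x in probs):
        raise ValueError("all values must be strictly positive")
    n = len(probs)
    if p == 0:
        # Limit p -> 0 is the geometric mean; computed in log space.
        return math.exp(sum(math.log(x) for x in probs) / n)
    if math.isinf(p):
        # Limits p -> +inf and p -> -inf are the max and the min.
        return max(probs) if p > 0 else min(probs)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


# Made-up per-token probabilities for one sampled sequence.
token_probs = [0.9, 0.8, 0.1, 0.95]

for p in (-2.0, 0.0, 1.0, 2.0):
    print(f"p = {p:+.1f}  ->  {holder_mean(token_probs, p):.4f}")
```

Because the Hölder mean is non-decreasing in $p$, lower values of $p$ let the least likely token dominate the aggregate, while higher values weight the most likely tokens more. This is the generic property behind the "continuous control" described in the abstract; how HölderPO schedules $p$ during training is not specified in this summary.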
