Hölder Policy Optimisation
Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

TL;DR
HölderPO introduces a flexible policy optimization framework using the Hölder mean to balance gradient concentration and variance, improving training stability and performance in large language models.
Contribution
It unifies token-level probability aggregation via the Hölder mean and employs a dynamic schedule for the parameter p, enhancing training stability and effectiveness.
Findings
Achieves 54.9% average accuracy on mathematical benchmarks.
Secures 93.8% success rate on ALFWorld.
Outperforms standard GRPO with a 7.2% relative gain.
Abstract
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter p, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove…
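As a rough illustration of the aggregation idea in the abstract, the sketch below computes the standard Hölder (power) mean of a sequence's token-level probabilities for several values of p. It is not the paper's implementation: the function name `holder_mean` and the sample probabilities are hypothetical, and how the aggregated value enters the GRPO objective is not given in the text above, so that step is left out.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of positive values.

    Illustrative helper, not taken from the paper.
    p = 1 is the arithmetic mean; p = 0 is treated as the
    geometric-mean limit of the general formula.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0 for x in probs):
        raise ValueError("all probabilities must be positive")
    n = len(probs)
    if p == 0:
        # Limit p -> 0: exponential of the mean log-probability.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


# Hypothetical per-token probabilities for one sampled sequence.
token_probs = [0.9, 0.5, 0.1]

for p in (-2.0, 0.0, 1.0, 2.0):
    print(f"p = {p:+.1f}  mean = {holder_mean(token_probs, p):.4f}")
```

The power mean is non-decreasing in p, so smaller p moves the aggregate toward the lowest-probability token and larger p toward the highest. That is one way to read the "continuous control" the abstract describes.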
