Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Yun Qu; Qi Wang; Yixiu Mao; Heming Zou; Yuhang Jiang; Yingyue Li; Wutong Xu; Lizhou Cai; Weijie Liu; Clive Bai; Kai Yang; Yangkun Chen; Saiyong Yang; Xiangyang Ji

arXiv:2605.06139·cs.LG·May 21, 2026

TL;DR

This paper introduces Listwise Policy Optimization (LPO), a reinforcement learning method for large language models that makes the target distribution over sampled responses explicit and projects the policy toward it, with the aim of improving training stability and response diversity.

Contribution

LPO is an RLVR framework that performs target-projection on the response simplex explicitly rather than implicitly, improving performance and stability over existing group-based policy gradient methods.

Findings

01. LPO achieves consistent improvements across diverse reasoning tasks.
02. LPO maintains optimization stability and response diversity.
03. LPO outperforms typical policy gradient baselines.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, group-based policy gradient methods are prevalent: they sample a group of responses per prompt and update the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via a first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO), which conducts the target-projection explicitly. LPO demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the…
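To make the geometric picture in the abstract more concrete, here is a minimal sketch of a target distribution on the simplex of one sampled response group. It is not the paper's formulation: the abstract does not state the exact target or divergence. The sketch assumes the standard closed form of a KL-regularized proximal objective, q_i ∝ p_old_i · exp(A_i / β), restricted to the sampled group, with GRPO-style mean/std normalization for the advantages. The function names, the temperature `beta`, and the choice of KL(q ‖ p) are all hypothetical.

```python
import math


def group_relative_advantages(rewards):
    """Center and scale rewards within one group (GRPO-style; an assumption)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # All responses scored the same: no group-relative signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def target_on_simplex(old_logps, advantages, beta=1.0):
    """Assumed target: q_i proportional to p_old_i * exp(A_i / beta),
    normalized over the sampled group only."""
    return softmax([lp + a / beta for lp, a in zip(old_logps, advantages)])


def kl(q, p):
    """KL(q || p) between two distributions on the same finite support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Hypothetical group of four responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]          # verifiable 0/1 rewards
old_logps = [-2.0, -1.0, -1.5, -3.0]    # sequence log-probs under the old policy

adv = group_relative_advantages(rewards)
p_old = softmax(old_logps)              # old policy renormalized on the group
q = target_on_simplex(old_logps, adv, beta=1.0)
loss = kl(q, p_old)                     # divergence a projection step would reduce
```

In this toy setup the target shifts probability mass toward the rewarded responses while staying anchored to the old policy. A training step would then update the policy's group-normalized probabilities to reduce the divergence to `q`. How the paper defines and minimizes that divergence is not specified in the text above.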
