LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Zhe Yuan; Yipeng Zhou; Jinghan Li; Xinyuan Chen; Bowen Deng; Zhiqian Chen; Liang Zhao

arXiv:2605.19416·cs.CL·May 20, 2026

TL;DR

LambdaPO introduces a pairwise preference-based advantage estimation framework for reinforcement learning, enhancing reasoning language models by capturing fine-grained reward signals and improving performance on complex tasks.

Contribution

LambdaPO redefines advantage estimation from a single scalar to a decomposed pairwise preference structure, addressing the information that group policy optimization loses when it reduces each group to one statistical baseline such as the mean.

Findings

01

LambdaPO outperforms baseline methods on math reasoning tasks.

02

The pairwise preference approach captures more detailed reward information.

03

Augmenting the reward with a semantic density term improves the supervision signal.

Abstract

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce Lambda Policy Optimization (LambdaPO), a framework that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward…
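To make the scalar baseline described in the abstract concrete, the sketch below computes group-normalized advantages in the style commonly attributed to GRPO: each reward is centered on the group mean and scaled by the group standard deviation. The function names, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration. The second function is a hypothetical, unweighted pairwise sum and is not LambdaPO's formula, which the truncated abstract above does not give.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center on the group mean, scale by the group std.

    Assumption: population standard deviation and a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_difference_sums(rewards):
    """Hypothetical illustration only: sum of raw reward gaps to every group member.

    This is NOT the paper's estimator; it shows what an unweighted pairwise sum gives.
    """
    return [sum(r_i - r_j for r_j in rewards) for r_i in rewards]


# One group of four sampled trajectories with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
pair = pairwise_difference_sums(rewards)
```

For any group of size G, the unweighted sum of reward gaps for trajectory i equals G times (r_i minus the group mean), so it carries exactly the same information as the mean baseline. A pairwise formulation can therefore only add signal if the pairs are weighted non-uniformly; how LambdaPO does this is specified in the full paper, not in the excerpt above.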
