LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao

TL;DR
LambdaPO introduces a pairwise preference-based advantage estimation framework for reinforcement learning, enhancing reasoning language models by capturing fine-grained reward signals and improving performance on complex tasks.
Contribution
It redefines advantage estimation from a single scalar baseline to a pairwise preference structure, addressing the information loss in group policy optimization and improving reasoning capabilities.
Findings
LambdaPO outperforms baseline methods on math reasoning tasks.
The pairwise preference approach captures more detailed reward information.
Augmenting the reward with a semantic density term improves the supervision signal.
Abstract
Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce Lambda Policy Optimization (LambdaPO), a novel framework that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward…
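The contrast the abstract draws between a group-mean baseline and a pairwise structure can be illustrated with a minimal sketch. The abstract is truncated before it defines LambdaPO's advantage, so only `grpo_advantages` follows a standard, well-known form (reward minus group mean, divided by group standard deviation). The functions `pairwise_advantages` and `bounded_gap_weight` are hypothetical names and forms chosen here for illustration and are not the paper's definition. The sketch also shows why pair weights matter: with uniform weights, summing pairwise reward differences reduces to a rescaled mean baseline.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standard group-normalized advantage: (r_i - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_advantages(rewards, weight=None):
    """Hypothetical pairwise form (an assumption, not the paper's formula):
    A_i = sum_j w(r_i, r_j) * (r_i - r_j).
    """
    if weight is None:
        # Uniform weights: every pair counts equally.
        weight = lambda ri, rj: 1.0
    return [
        sum(weight(ri, rj) * (ri - rj) for rj in rewards)
        for ri in rewards
    ]


def bounded_gap_weight(ri, rj):
    """Illustrative weight (an assumption): shrinks the contribution of
    pairs with a large reward gap, so each pair contributes a value in (-1, 1).
    """
    return 1.0 / (1.0 + abs(ri - rj))


rewards = [0.0, 1.0, 3.0]

# With uniform weights the pairwise sum equals n * (r_i - mean),
# i.e. it carries no more information than the mean baseline.
uniform = pairwise_advantages(rewards)
print(uniform)  # [-4.0, -1.0, 5.0]

# With a non-uniform weight the result is no longer a rescaled
# mean baseline, because each pair is treated differently.
weighted = pairwise_advantages(rewards, weight=bounded_gap_weight)
print(weighted)
```

Under these assumptions, any extra information in a pairwise formulation has to come from how individual pairs are weighted or transformed, since the unweighted sum is algebraically equivalent to the scalar baseline up to a factor of the group size.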
