LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Zhe Yuan; Yipeng Zhou; Jinghan Li; Xinyuan Chen; Bowen Deng; Zhiqian Chen; Liang Zhao

arXiv:2605.21235·cs.CL·May 21, 2026

TL;DR

LamPO introduces a pairwise advantage approach for reinforcement learning in reasoning language models, improving credit assignment and training stability over existing methods.

Contribution

The paper proposes LamPO, a lambda-style policy optimization method that compares rewards pairwise within each response group, instead of reducing the group to scalar statistics, and incorporates auxiliary rewards to improve the training of reasoning models.

Findings

01. LamPO outperforms GRPO and recent RLVR methods on multiple reasoning benchmarks.

02. It achieves more stable training dynamics and improved sample efficiency.

03. Experimental results demonstrate consistent performance gains across various models and datasets.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose LamPO, a Lambda-Style Policy Optimization method that replaces scalar group advantages with a Pairwise Decomposed Advantage. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining…
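The contrast the abstract draws between a scalar group advantage and a pairwise decomposed advantage can be illustrated with a minimal sketch. The abstract is truncated and does not state LamPO's formula, so everything specific below is an assumption made for illustration: the function names, the `temperature` parameter, the choice of a sigmoid over the sign-aligned sequence log-probability difference as the confidence-aware weight, and the averaging over the other group members. It should not be read as the authors' implementation.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Scalar group-relative advantage: standardize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_advantages(
    rewards: List[float],
    seq_logps: List[float],
    temperature: float = 1.0,  # hypothetical knob, not taken from the paper
) -> List[float]:
    """Illustrative pairwise advantage with a confidence-aware weight.

    ASSUMPTION: the weight for a pair (i, j) is sigmoid(-margin / temperature),
    where margin is the log-probability difference aligned with the reward
    ordering. Pairs the policy already ranks correctly get a small weight;
    pairs it ranks incorrectly get a large one (a LambdaRank-like choice).
    """
    n = len(rewards)
    if n != len(seq_logps):
        raise ValueError("rewards and seq_logps must have the same length")
    advantages = [0.0] * n
    for i in range(n):
        for j in range(n):
            gap = rewards[i] - rewards[j]
            if i == j or gap == 0:
                continue  # tied pairs carry no preference signal
            sign = 1.0 if gap > 0 else -1.0
            margin = sign * (seq_logps[i] - seq_logps[j])
            weight = _sigmoid(-margin / temperature)
            advantages[i] += weight * gap
        advantages[i] /= max(n - 1, 1)  # average over the other responses
    return advantages


# Two correct and two incorrect responses with different sequence log-probs.
example_rewards = [1.0, 1.0, 0.0, 0.0]
example_logps = [-5.0, -9.0, -6.0, -4.0]
print(grpo_advantages(example_rewards))
print(pairwise_advantages(example_rewards, example_logps))
```

In this toy group the scalar advantage assigns the two correct responses the same value, while the pairwise version gives the larger advantage to the correct response that the policy currently considers less likely. Under the assumed weighting, each pair's weight is symmetric and its reward gap is antisymmetric, so the advantages still sum to zero across the group.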

Code & Models

Models

🤗 xychen123/LamPO
model · 32k downloads · ♡ 1
