LAD: Learning Advantage Distribution for Reasoning

Wendi Li; Sharon Li

arXiv:2602.20132·cs.LG·February 24, 2026

Open Access

TL;DR

LAD introduces a distribution-matching framework for reinforcement learning that enhances reasoning diversity and accuracy without extra training costs, by learning advantage-induced distributions instead of maximizing expected rewards.

Contribution

This paper proposes Learning Advantage Distributions (LAD), a novel approach that replaces advantage maximization with distribution matching, improving reasoning diversity and performance in large language models.

Findings

01

LAD faithfully recovers multimodal advantage distributions in bandit settings.

02

LAD improves accuracy and diversity in math and code reasoning tasks.

03

LAD scales naturally to large language models without additional training cost.

Abstract

Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse…
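The contrast the abstract draws between advantage maximization and distribution matching can be illustrated on a toy bandit. The sketch below is not the paper's LAD objective, whose exact form is not given in this excerpt. It assumes a target distribution proportional to exp(A / tau) and uses the forward KL divergence as one particular $f$-divergence. The function names, the temperature `tau`, and the advantage values are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def advantage_target(adv, tau):
    # ASSUMPTION: the advantage-induced target is taken to be
    # proportional to exp(A / tau); the paper's definition may differ.
    return softmax(adv / tau)

def train_maximize(adv, steps=5000, lr=1.0):
    # Baseline: gradient ascent on the expected advantage E_p[A].
    # For a softmax policy, dE_p[A]/dlogit_i = p_i * (A_i - E_p[A]).
    logits = np.zeros_like(adv)
    for _ in range(steps):
        p = softmax(logits)
        logits += lr * p * (adv - p @ adv)
    return softmax(logits)

def train_match(adv, tau=0.5, steps=5000, lr=1.0):
    # Distribution matching: gradient descent on KL(q || p), where q is
    # the assumed target. For a softmax policy the gradient is p - q.
    # ASSUMPTION: forward KL stands in for the paper's f-divergence.
    q = advantage_target(adv, tau)
    logits = np.zeros_like(adv)
    for _ in range(steps):
        p = softmax(logits)
        logits -= lr * (p - q)
    return softmax(logits)

if __name__ == "__main__":
    # Two good arms with similar advantages and two bad arms (hypothetical).
    adv = np.array([1.0, 0.8, -1.0, -1.0])
    print("target:  ", np.round(advantage_target(adv, 0.5), 3))
    print("maximize:", np.round(train_maximize(adv), 3))
    print("match:   ", np.round(train_match(adv, tau=0.5), 3))
```

Under these assumptions, the maximizing policy concentrates nearly all of its mass on the single best arm, while the matching policy keeps substantial probability on both high-advantage arms. This is the kind of mode preservation the abstract describes, shown here only for a four-arm toy problem.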

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI)