LAD: Learning Advantage Distribution for Reasoning
Wendi Li, Sharon Li

TL;DR
LAD is a distribution-matching framework for reinforcement learning that learns the advantage-induced distribution instead of maximizing expected reward, improving reasoning diversity and accuracy at no extra training cost.
Contribution
This paper proposes Learning Advantage Distributions (LAD), a novel approach that replaces advantage maximization with distribution matching, improving reasoning diversity and performance in large language models.
Findings
LAD faithfully recovers multimodal advantage distributions in bandit settings.
LAD improves accuracy and diversity in math and code reasoning tasks.
LAD scales naturally to large language models without additional training cost.
Abstract
Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an f-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse…
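The contrast the abstract draws between matching and maximizing can be illustrated on a toy bandit. The sketch below is not the paper's objective: it assumes a target of the form q(a) ∝ exp(A(a)/β), picks the forward KL divergence KL(q ‖ π) as one particular f-divergence, and uses exact gradients on a softmax policy. The function names, the temperature `beta`, and the advantage values are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def advantage_target(advantages, beta=1.0):
    """Assumed target distribution: q(a) proportional to exp(A(a) / beta)."""
    return softmax([a / beta for a in advantages])


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def train_matching(advantages, steps=2000, lr=0.5, beta=1.0):
    """Gradient descent on KL(q || pi) with pi = softmax(theta).

    For a softmax policy the gradient with respect to theta is (pi - q).
    """
    q = advantage_target(advantages, beta)
    theta = [0.0] * len(advantages)
    for _ in range(steps):
        pi = softmax(theta)
        theta = [t - lr * (pk - qk) for t, pk, qk in zip(theta, pi, q)]
    return softmax(theta), q


def train_maximizing(advantages, steps=2000, lr=0.5):
    """Exact policy-gradient ascent on the expected advantage E_pi[A].

    For a softmax policy the gradient is pi_k * (A_k - E_pi[A]).
    """
    theta = [0.0] * len(advantages)
    for _ in range(steps):
        pi = softmax(theta)
        mean_a = sum(pk * ak for pk, ak in zip(pi, advantages))
        theta = [t + lr * pk * (ak - mean_a)
                 for t, pk, ak in zip(theta, pi, advantages)]
    return softmax(theta)


if __name__ == "__main__":
    # Two good arms (one slightly better) and two bad arms.
    adv = [1.0, 0.9, -1.0, -1.0]
    pi_match, q = train_matching(adv)
    pi_max = train_maximizing(adv)
    print("target     :", [round(x, 3) for x in q])
    print("matching   :", [round(x, 3) for x in pi_match])
    print("maximizing :", [round(x, 3) for x in pi_max])
    print("KL(q || pi_match):", kl(q, pi_match))
```

In this toy setup the matching policy keeps substantial probability on both good arms, while the maximizing policy concentrates on the single best arm. Other f-divergences or target constructions would change the numbers, so this should be read only as a picture of the qualitative behaviour the abstract describes.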
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI)
