Beyond Distribution Sharpening: The Importance of Task Rewards
Sarthak Mittal, Leo Gagnon, Guillaume Lajoie

TL;DR
This paper compares distribution sharpening with task-reward-based reinforcement learning, showing that the latter yields more robust performance improvements and more stable training in the language models tested.
Contribution
It provides a first-principles analysis and empirical evidence demonstrating the limitations of distribution sharpening and the advantages of task-reward-based learning.
Findings
Distribution sharpening has unfavorable optima and is fundamentally unstable.
Task-reward-based learning significantly improves robustness and performance.
Sharpening yields limited gains compared to reward-based methods.
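A toy illustration of the first finding, offered as a sketch rather than the paper's own analysis. It assumes one common formalization of sharpening (raising the base distribution to a power alpha and renormalizing) and the standard closed form for KL-regularized reward maximization (base probability times exp(reward / beta)). The three-answer distribution, the rewards, and the values of alpha and beta are all hypothetical and chosen only to make the effect visible.

```python
import math

def sharpen(p, alpha):
    """Power-sharpen a distribution: q_i is proportional to p_i ** alpha."""
    weights = [pi ** alpha for pi in p]
    z = sum(weights)
    return [w / z for w in weights]

def reward_tilt(p, rewards, beta):
    """KL-regularized reward optimum: q_i is proportional to p_i * exp(r_i / beta)."""
    weights = [pi * math.exp(ri / beta) for pi, ri in zip(p, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical base model over three candidate answers.
# The most likely answer (index 0) is wrong; the correct one is index 1.
base = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 0.0]  # assumed task reward: 1 for the correct answer only

sharpened = sharpen(base, alpha=4.0)
tilted = reward_tilt(base, rewards, beta=0.2)

print("base     :", [round(x, 3) for x in base])
print("sharpened:", [round(x, 3) for x in sharpened])
print("tilted   :", [round(x, 3) for x in tilted])
```

Under these assumptions, sharpening moves mass toward the base model's mode whether or not that mode is correct, while the reward-tilted distribution moves mass toward the rewarded answer. This only shows how a sharpening optimum can be unfavorable in a toy setting; it does not reproduce the paper's instability argument or its experiments.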
Abstract
Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on…
