TL;DR
This paper shows that reinforcement learning improves large language model reasoning mainly through sparse, targeted corrections at uncertain decision points, rather than teaching new capabilities.
Contribution
It introduces ReasonMaxxer, a minimal RL-free method that applies contrastive loss at high-entropy points, matching RL performance with significantly less training.
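To make the idea of a contrastive loss at a single token position concrete, here is a minimal sketch. This summary does not give the loss ReasonMaxxer actually uses, so the logistic pairwise form below is an assumption, and the names `pairwise_contrastive_loss`, `preferred`, and `dispreferred` are hypothetical. The sketch pushes the log-probability of one candidate token above that of another at one selected position.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def pairwise_contrastive_loss(logits, preferred, dispreferred):
    """Assumed loss form (not taken from the paper): -log sigmoid(margin),
    where margin = log p(preferred) - log p(dispreferred) at one position."""
    logp = log_softmax(logits)
    margin = logp[preferred] - logp[dispreferred]
    # Stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the preferred token gains probability over the other.
undecided = pairwise_contrastive_loss([0.0, 0.0, 0.0], preferred=0, dispreferred=1)
corrected = pairwise_contrastive_loss([2.0, 0.0, 0.0], preferred=0, dispreferred=1)
```

In a real training setup, a loss of this kind would be summed over the selected positions only, leaving all other tokens untouched.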
Findings
RL changes only 1-3% of token positions, and the token it promotes always lies within the base model's top-5 alternatives.
Base model entropy can identify correction points without RL.
ReasonMaxxer achieves comparable performance with minimal data and training time.
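The findings above suggest a simple selection step: score each position by the entropy of the base model's next-token distribution, keep the most uncertain few percent, and restrict candidate corrections to the top-5 tokens there. The sketch below illustrates that step under stated assumptions; the function names, the `fraction=0.03` default, and the use of raw per-position logits as input are illustrative choices, not the paper's implementation.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_decision_points(logits_per_position, fraction=0.03, k=5):
    """Return the highest-entropy positions and the top-k token ids at each.

    fraction and k are assumed defaults, chosen to mirror the 1-3% and
    top-5 figures quoted in the findings."""
    entropies = [token_entropy(row) for row in logits_per_position]
    n = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    positions = sorted(ranked[:n])
    candidates = {}
    for pos in positions:
        row = logits_per_position[pos]
        by_logit = sorted(range(len(row)), key=lambda t: row[t], reverse=True)
        candidates[pos] = by_logit[:k]
    return positions, candidates
```

Given a 100-token sequence, the default settings keep 3 positions, each with 5 candidate tokens that a targeted correction could choose among.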
Abstract
Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base…
