Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

Ömer Faruk Akgül; Rajgopal Kannan; Willie Neiswanger; Viktor Prasanna

arXiv:2605.06241·cs.CL·May 12, 2026

TL;DR

This paper shows that reinforcement learning improves large language model reasoning mainly through sparse, targeted corrections at uncertain decision points, rather than teaching new capabilities.

Contribution

The paper introduces ReasonMaxxer, a minimal RL-free method that applies a contrastive loss at high-entropy points and matches RL performance with significantly less training.

Findings

01. RL changes only 1-3% of token positions, and the token it promotes always lies within the base model's top-5 alternatives.

02. Base model entropy can identify the correction points without running RL.

03. ReasonMaxxer achieves comparable performance with minimal data and training time.
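The entropy-based selection described in the findings can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy logits, and the `fraction` threshold are all assumptions made for illustration, with the 3% default picked from the upper end of the reported 1-3% range. The sketch ranks token positions by the entropy of the base model's next-token distribution, keeps the most uncertain fraction, and checks whether a candidate token falls inside the top-5 alternatives.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(logits_per_position, fraction=0.03):
    """Return the indices of the most uncertain positions.

    `fraction` is a hypothetical knob, not a value taken from the paper's code.
    At least one position is always returned.
    """
    entropies = [entropy(softmax(row)) for row in logits_per_position]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def in_top_k(logits, token_id, k=5):
    """Check whether `token_id` is among the k highest-scoring tokens."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return token_id in top


# Toy example: four positions over a six-token vocabulary (made-up numbers).
toy_logits = [
    [8.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # confident
    [0.0, 7.0, 0.0, 0.0, 0.0, 0.0],  # confident
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # maximally uncertain
    [0.0, 0.0, 6.0, 0.0, 0.0, 0.0],  # confident
]
selected = high_entropy_positions(toy_logits, fraction=0.25)
```

In this toy input only the third position has a flat distribution, so it is the one selected. A real pipeline would take the logits from a base model's forward pass; how the paper thresholds entropy or applies its loss at the selected positions is not specified in the text above.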

Abstract

Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base…

Code & Models

Repositories

farukakgul/ReasonMaxxer (GitHub)