TL;DR
This paper introduces a discounted reinforcement learning approach to encourage concise reasoning in large models, reducing token usage without sacrificing accuracy, supported by theoretical analysis and experiments.
Contribution
It proposes a novel discounted RL framework for reasoning models, integrating token cost penalties to promote brevity and efficiency.
Findings
Shorter reasoning chains with maintained accuracy
Theoretical validation of Blackwell optimality in policy selection
Empirical results confirm reduced token usage without accuracy loss
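A rough first-order sketch of why discounting can favor accuracy first and brevity second. It is not the paper's proof. The symbols are assumptions made for illustration: R ∈ {0,1} is a binary terminal correctness reward, T is the number of discounted tokens (bounded by a finite horizon H), and V_γ(π) is the discounted value of policy π.

```latex
\begin{align*}
V_\gamma(\pi) &= \mathbb{E}_\pi\!\left[\gamma^{T} R\right], \qquad R \in \{0,1\},\; T \le H \\
\gamma^{T} &= 1 - (1-\gamma)\,T + O\!\left((1-\gamma)^2\right) \quad \text{as } \gamma \to 1 \\
V_\gamma(\pi) &= \mathbb{E}_\pi[R] \;-\; (1-\gamma)\,\mathbb{E}_\pi[T R] \;+\; O\!\left((1-\gamma)^2\right)
\end{align*}
```

Under these assumptions, when γ is close enough to 1 and the policy class is finite, the leading term E[R] (accuracy) dominates the comparison between policies, and the length-dependent term only breaks ties among policies of equal accuracy. The precise statement about mean response length is the one given in the paper's theorems, which this expansion does not reproduce.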
Abstract
Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
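The "small token cost" reading of discounting can be illustrated with a minimal Python sketch. The function names, the reward form `gamma ** num_reasoning_tokens`, and the `format_reward` parameter are hypothetical and chosen for illustration; the paper's actual training objective may differ.

```python
def discounted_return(correct: bool, num_reasoning_tokens: int,
                      gamma: float, format_reward: float = 0.0) -> float:
    """Assumed reward shape: only the correctness reward is discounted,
    and only by the number of reasoning tokens; the formatting reward
    is added undiscounted."""
    extrinsic = 1.0 if correct else 0.0
    return (gamma ** num_reasoning_tokens) * extrinsic + format_reward


def token_cost_return(correct: bool, num_reasoning_tokens: int,
                      gamma: float, format_reward: float = 0.0) -> float:
    """First-order approximation of the above: each reasoning token
    costs (1 - gamma) units of the correctness reward."""
    extrinsic = 1.0 if correct else 0.0
    cost = (1.0 - gamma) * num_reasoning_tokens
    return extrinsic * (1.0 - cost) + format_reward


if __name__ == "__main__":
    gamma = 0.9999  # hypothetical value, close to 1
    for length in (200, 800):
        exact = discounted_return(True, length, gamma)
        approx = token_cost_return(True, length, gamma)
        print(f"tokens={length}  discounted={exact:.4f}  token-cost={approx:.4f}")
```

With γ near 1 the two returns are close, a correct short response scores higher than a correct long one, and an incorrect response receives only the formatting reward regardless of length.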
Peer Reviews
Decision: ICLR 2026 Poster
* The paper makes a principled design choice to discount only extrinsic (correctness) rewards while leaving intrinsic (formatting) rewards undiscounted, which is well-justified both theoretically and practically.
* The theoretical exposition clearly communicates Blackwell optimality concepts and the main results, making the mathematical framework accessible (I am not a hard RL person and found it pleasant to read). The paper provides thorough theoretical analysis that comprehensively covers fin
* The evaluation is limited to math benchmarks (GSM8K, MATH, AMC, AIME, MINERVA, OLYMPIAD); it remains unclear whether these findings generalize to coding tasks or other reasoning domains.
* The theory section relies on the deterministic-transitions assumption, but the paper does not specify whether training uses stochastic sampling (T>0) or greedy decoding (T=0), creating a potential theory-practice gap.
* The paper assumes the policy class is finite, which is needed for the existence of the Blackwe
1) The four design components used here are: discounting only the environment reward, regularizing KL to a changing reference, discounting only reasoning tokens, and using comparable token budgets across methods. These components are simple to adopt and are also defended by the theory.
2) In finite restricted policy classes, for γ close to 1, the Blackwell-optimal policies are accuracy-maximizing and have the shortest mean response length. This is proved in Theorems 3.4, 3.7 and 3.10, which strengthen
1) Modeling the problem as a finite-horizon process with deterministic transitions and a binary terminal reward works theoretically, but many real-world reasoning workflows violate these assumptions.
2) In the Blackwell optimality analysis, γ is selected to be as far from 1 as possible. However, this is done by a simple bisection search.
3) Discounting is applied only to reasoning tokens. The authors say that discounting the entire response slightly hurt accuracy. This suggests there may be some issue that prevents errors
- The paper makes a nice connection between discounting and reducing trajectory length in the context of reasoning language models.
- The authors provide a formal underpinning to the connection.
- Their recipe for training a model seems reasonable, and leads to a reduction in reasoning token usage in their tested settings.
- The experimental results do not seem to address the standard long chain-of-thought setting. For instance, in Table 1 the lengths are in the hundreds, and in Table 2 the lengths are typically 1000 or less. These inference budgets are lower than those typically used to evaluate long chain-of-thought models. Based on the results, it is unclear whether the proposed methods work in these settings.
- There are many related methods for efficient reasoning models; the area is a very active research ar
