Learning to Reason Efficiently with Discounted Reinforcement Learning

Alex Ayoub; Kavosh Asadi; Dale Schuurmans; Csaba Szepesvári; Karim Bouyarmane

arXiv:2510.23486·cs.LG·October 28, 2025

TL;DR

This paper introduces a discounted reinforcement learning approach that encourages concise reasoning in large reasoning models, reducing token usage without sacrificing accuracy. The approach is supported by theoretical analysis and experiments.

Contribution

It proposes a novel discounted RL framework for reasoning models, integrating token cost penalties to promote brevity and efficiency.

Findings

01

Shorter reasoning chains with maintained accuracy

02

Theoretical validation of Blackwell optimality in policy selection

03

Empirical results confirm reduced token usage without accuracy loss

Abstract

Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
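A rough illustration of why discounting can be read as a small token cost: if a correct answer earns reward 1 after T reasoning tokens, a discounted reward of γ^T is approximately 1 − (1 − γ)T when γ is close to 1, so each extra reasoning token costs about 1 − γ. The sketch below is not the authors' implementation. The function names, the value γ = 0.999, and the convention that the terminal reward is scaled by γ^T are all assumptions made for illustration.

```python
def discounted_reward(correct: bool, num_reasoning_tokens: int, gamma: float) -> float:
    """Binary correctness reward scaled by gamma ** T (assumed convention)."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    reward = 1.0 if correct else 0.0
    return reward * gamma ** num_reasoning_tokens


def linear_token_cost_reward(correct: bool, num_reasoning_tokens: int, gamma: float) -> float:
    """First-order approximation: a cost of (1 - gamma) per reasoning token."""
    reward = 1.0 if correct else 0.0
    return reward * (1.0 - (1.0 - gamma) * num_reasoning_tokens)


if __name__ == "__main__":
    gamma = 0.999  # hypothetical value, chosen only to make the effect visible
    for length in (50, 200, 800):
        exact = discounted_reward(True, length, gamma)
        approx = linear_token_cost_reward(True, length, gamma)
        print(f"T={length:4d}  gamma^T={exact:.4f}  1-(1-gamma)T={approx:.4f}")
```

Under this convention, two equally correct responses are ranked by length, with the shorter one receiving the higher reward, while an incorrect response receives zero regardless of length.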

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

* The paper makes a principled design choice to discount only extrinsic (correctness) rewards while leaving intrinsic (formatting) rewards undiscounted, which is well-justified both theoretically and practically.
* The theoretical exposition clearly communicates Blackwell optimality concepts and the main results, making the mathematical framework accessible (I am not a hard RL person and found it pleasant to read). The paper provides thorough theoretical analysis that comprehensively covers fin

Weaknesses

* The evaluation is limited to math benchmarks (GSM8K, MATH, AMC, AIME, MINERVA, OLYMPIAD); it remains unclear whether these findings generalize to coding tasks or other reasoning domains.
* The theory section relies on the deterministic-transitions assumption, but the paper does not specify whether training uses stochastic sampling (T>0) or greedy decoding (T=0), creating a potential theory-practice gap.
* The paper assumes the policy class is finite, which is needed for the existence of the Blackwe

Reviewer 02·Rating 6·Confidence 4

Strengths

1) The four design components used here are: discounting only the environment reward, regularizing KL to a changing reference, discounting only reasoning tokens, and using comparable token budgets across methods. These components are simple to adopt and are also defended by the theory.
2) In finite restricted policy classes, for γ close to 1 the Blackwell-optimal policies are accuracy-maximizing and have the shortest mean response length. This is proved in Theorems 3.4, 3.7 and 3.10, which strengthen

Weaknesses

1) Modeling the problem as finite-horizon with deterministic transitions and a binary terminal reward works theoretically, but many real-world reasoning workflows violate these assumptions.
2) In the Blackwell optimality analysis, γ is selected to be as far from 1 as possible. However, this is done by a simple bisection search.
3) Discounting is applied only to reasoning tokens. The authors say that discounting the entire response slightly hurt accuracy. This suggests there may be some issues that prevent errors

Reviewer 03·Rating 4·Confidence 4

Strengths

- The paper makes a nice connection between discounting and reducing trajectory length in the context of reasoning language models.
- The authors provide a formal underpinning to the connection.
- Their recipe for training a model seems reasonable, and leads to a reduction in reasoning token usage in their tested settings.

Weaknesses

- The experimental results do not seem to address the standard long chain-of-thought setting. For instance, in Table 1 the lengths are in the hundreds, and in Table 2 the lengths are typically 1000 or less. These inference budgets are lower than those typically used to evaluate long chain-of-thought models. Based on the results, it is unclear whether the proposed methods work in these settings.
- There are many related methods for efficient reasoning models; the area is a very active research ar
