BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens

Hao Wen; Xinrui Wu; Yi Sun; Feifei Zhang; Liye Chen; Jie Wang; Yunxin Liu; Yunhao Liu; Ya-Qin Zhang; Yuanchun Li

arXiv:2508.17196·cs.LG·September 1, 2025

TL;DR

BudgetThinker introduces a control token-based framework for LLMs that enables precise management of reasoning length, balancing accuracy and resource constraints through a two-stage training process involving supervised fine-tuning and reinforcement learning.

Contribution

The paper presents a novel method for budget-aware reasoning in LLMs using control tokens and a two-stage training pipeline, improving efficiency and performance under resource constraints.

Findings

01. Outperforms baseline models in reasoning accuracy across various budgets.
02. Effectively maintains reasoning quality while adhering to token limits.
03. Scalable approach suitable for real-time, resource-constrained environments.

Abstract

Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware…
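The periodic control-token idea can be illustrated with a small sketch. It assumes a ratio-based scheme in which K fixed tokens mark how much of the budget remains, so the same K tokens serve any budget size. The token spelling `<remaining_i/K>`, the function names, and the offline interleaving over an already-generated token list are all hypothetical choices made for illustration; the paper's actual token format and decoding-time integration may differ.

```python
from typing import List, Tuple


def control_token_schedule(budget: int, k: int) -> List[Tuple[int, str]]:
    """Return (position, token) pairs for k ratio-based control tokens.

    The i-th token is placed after floor(budget * i / k) generated tokens
    and reports that (k - i)/k of the budget remains. The token spelling
    is a hypothetical placeholder, not the paper's vocabulary.
    """
    if k <= 0 or budget < k:
        raise ValueError("need k >= 1 and budget >= k")
    schedule = []
    for i in range(1, k + 1):
        position = (budget * i) // k  # number of tokens emitted so far
        schedule.append((position, f"<remaining_{k - i}/{k}>"))
    return schedule


def insert_control_tokens(tokens: List[str], budget: int, k: int) -> List[str]:
    """Interleave control tokens into a token list, truncating at the budget.

    This is an offline illustration; during real inference the tokens
    would be injected while decoding is in progress.
    """
    marks = dict(control_token_schedule(budget, k))
    out: List[str] = []
    for n, tok in enumerate(tokens[:budget], start=1):
        out.append(tok)
        if n in marks:
            out.append(marks[n])  # tell the model how much budget is left
    return out


if __name__ == "__main__":
    print(control_token_schedule(budget=8, k=4))
    print(insert_control_tokens(["a", "b", "c", "d"], budget=4, k=2))
```

Because the positions are computed as fractions of the budget, changing the budget only moves the insertion points; the set of K control tokens stays the same.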

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper is interesting in that it combines test-time behavior with an RL training reward to better elicit token-control behavior for length-controlled generation. The method appears novel and makes fairly solid technical contributions.

Weaknesses

The comparison against baselines seems relatively weak: no alternative RL training method is compared. Extra ablations would also help to better understand the mechanism behind the improvement.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

Originality:
- The ratio-based control token design is elegant and scalable, requiring only K fixed tokens regardless of budget size, which addresses a practical limitation of fixed-interval approaches
- The training methodology is thoughtfully designed, with a well-motivated balance of three data types (long reasoning, iteratively compressed, non-thinking) and a sensible curriculum learning strategy
- The plug-and-play nature of the framework makes it potentially valuable as a complementary mod

Weaknesses

- The task coverage is somewhat focused on mathematical reasoning. I'm interested in how the approach might generalize to other important applications like code generation, open-domain QA, or multi-step agentic tasks
- For the 7B experiments, it would be helpful to include ThinkPrune comparisons if feasible
- The budget range tested (50-10K tokens) is reasonable, though I wonder about behavior in more extreme regimes

Beyond Pass@1 accuracy, it might be illuminating to examine additional dimensio

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* The use of periodic special tokens to remind the LLM of its remaining budget is somewhat novel, and incorporating it into SFT and RL is reasonable.
* Experiments cover both LLM and VLM.

Weaknesses

* The paper shows that the model adheres to the budget, but offers less insight into how the reasoning strategy changes. A qualitative analysis comparing the CoT traces at different budgets (e.g., does it skip certain reasoning steps? does it use more abbreviations?) would be very valuable.
* The method is a bit complex and costly due to the SFT stage.

Code & Models

Models

🤗 Xin-Rui/BudgetThinker_backup (model)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Materials Science