BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens
Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li

TL;DR
BudgetThinker introduces a control token-based framework for LLMs that enables precise management of reasoning length, balancing accuracy and resource constraints through a two-stage training process involving supervised fine-tuning and reinforcement learning.
Contribution
The paper presents a novel method for budget-aware reasoning in LLMs using control tokens and a two-stage training pipeline, improving efficiency and performance under resource constraints.
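The second stage of this pipeline is described in the abstract as curriculum-based RL, but its schedule is not given on this page. As one illustration only, the sketch below assumes a curriculum that starts from a generous budget cap and tightens it geometrically across stages. The function `budget_curriculum`, its parameters, and the geometric shape are hypothetical choices, not the paper's method.

```python
from typing import List


def budget_curriculum(stages: int, loose: int, tight: int) -> List[int]:
    """Hypothetical per-stage token-budget cap for curriculum RL.

    Assumption (not from the paper): training begins with a generous
    budget `loose` and shrinks it by a constant factor each stage until
    the final stage uses `tight`.
    """
    if stages < 2 or not (0 < tight <= loose):
        raise ValueError("need stages >= 2 and 0 < tight <= loose")
    # Constant per-stage shrink factor so that stage `stages - 1` hits `tight`.
    ratio = (tight / loose) ** (1 / (stages - 1))
    return [round(loose * ratio**s) for s in range(stages)]


if __name__ == "__main__":
    print(budget_curriculum(4, 8000, 1000))  # [8000, 4000, 2000, 1000]
```

A linear schedule would serve equally well for illustration; the geometric form simply spends more stages on the tighter, presumably harder budgets.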
Findings
Outperforms baseline models in reasoning accuracy across various budgets.
Effectively maintains reasoning quality while adhering to token limits.
Scalable approach suitable for real-time, resource-constrained environments.
Abstract
Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware…
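A minimal sketch of the inference-time idea described above: a decoding loop that periodically injects a token stating how many tokens remain. The `step` callable stands in for a real model's next-token function. The `<remaining:N>` marker format, the fixed `interval`, and the choice not to count markers against the budget are all assumptions for illustration, not details confirmed by the paper.

```python
from typing import Callable, List


def budgeted_decode(
    step: Callable[[List[str]], str],
    prompt: List[str],
    budget: int,
    interval: int,
    end_token: str = "</think>",
) -> List[str]:
    """Generate at most `budget` tokens, injecting a remaining-budget
    control token before every `interval`-th generated token.

    `step` is a hypothetical stand-in for one model decoding step: it maps
    the current context to the next token. Injected control tokens are not
    counted against the budget in this sketch.
    """
    if budget <= 0 or interval <= 0:
        raise ValueError("budget and interval must be positive")
    context = list(prompt)
    used = 0
    while used < budget:
        if used % interval == 0:
            # Tell the model how many tokens it may still generate.
            context.append(f"<remaining:{budget - used}>")
        token = step(context)
        context.append(token)
        used += 1
        if token == end_token:  # the model chose to stop reasoning early
            break
    return context[len(prompt):]


if __name__ == "__main__":
    # A stub "model" that always emits "x" shows where the markers land.
    print(budgeted_decode(lambda ctx: "x", ["Q"], budget=5, interval=2))
    # ['<remaining:5>', 'x', 'x', '<remaining:3>', 'x', 'x', '<remaining:1>', 'x']
```

In this sketch the markers are inserted by the decoding loop rather than sampled, so the model only has to learn to condition on them; what happens once the budget is exhausted (for example, forcing an end-of-thinking token) is left out.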
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is interesting in that it combines a test-time mechanism with an RL training reward to better elicit token-control behavior for length-controlled generation. The method appears novel and makes fairly solid technical contributions.
One weakness is that the baseline comparison seems relatively weak: no alternative RL training method is compared against, and additional ablations would help clarify the mechanism behind the improvement.
Originality:
- The ratio-based control token design is elegant and scalable, requiring only K fixed tokens regardless of budget size, which addresses a practical limitation of fixed-interval approaches.
- The training methodology is thoughtfully designed, with a well-motivated balance of three data types (long reasoning, iteratively compressed, non-thinking) and a sensible curriculum learning strategy.
- The plug-and-play nature of the framework makes it potentially valuable as a complementary mod…
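To make the ratio-based idea from this review concrete, here is a sketch under stated assumptions: K control tokens are placed at evenly spaced fractions of the budget, so the same K token strings serve every budget size. The token names `<ctrl_j>` (read as "about j/K of the budget remains") and the even spacing are hypothetical choices, not the paper's exact design.

```python
from typing import List, Tuple


def ratio_control_schedule(budget: int, k: int) -> List[Tuple[int, str]]:
    """Return (position, control_token) pairs for K ratio-based tokens.

    Token i is inserted once i/K of the budget has been consumed and is
    named `<ctrl_{K - i}>` (hypothetical), i.e. (K - i)/K of the budget
    remains. Assumes budget >= k so that all positions are distinct.
    """
    if k <= 0 or budget < k:
        raise ValueError("need k > 0 and budget >= k")
    return [((i * budget) // k, f"<ctrl_{k - i}>") for i in range(k)]


if __name__ == "__main__":
    print(ratio_control_schedule(1000, 4))
    # [(0, '<ctrl_4>'), (250, '<ctrl_3>'), (500, '<ctrl_2>'), (750, '<ctrl_1>')]
    print(ratio_control_schedule(8000, 4))
    # [(0, '<ctrl_4>'), (2000, '<ctrl_3>'), (4000, '<ctrl_2>'), (6000, '<ctrl_1>')]
```

By contrast, a fixed-interval scheme (one marker every N tokens) inserts about budget/N markers, a count that grows with the budget; here it stays at K, and only the positions scale.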
- The task coverage is somewhat focused on mathematical reasoning. I'm interested in how the approach might generalize to other important applications like code generation, open-domain QA, or multi-step agentic tasks.
- For the 7B experiments, it would be helpful to include ThinkPrune comparisons if feasible.
- The budget range tested (50-10K tokens) is reasonable, though I wonder about behavior in more extreme regimes. Beyond Pass@1 accuracy, it might be illuminating to examine additional dimensio…
* The use of periodic special tokens to remind the LLM of its remaining budget is somewhat novel, and incorporating them into SFT and RL is reasonable.
* Experiments cover both LLMs and VLMs.
* The paper shows that the model adheres to the budget but offers less insight into how its reasoning strategy changes. A qualitative analysis comparing CoT traces at different budgets (e.g., does it skip certain reasoning steps? does it use more abbreviations?) would be very valuable.
* The method is a bit complex and costly due to the SFT stage.
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Materials Science
