Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models

Junhong Lin; Xinyue Zeng; Jie Zhu; Song Wang; Julian Shun; Jun Wu; Dawei Zhou

arXiv:2505.16122 · cs.LG · March 3, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces Plan-and-Budget, a test-time framework that decomposes complex reasoning queries into sub-questions and allocates token budgets adaptively, significantly improving efficiency and accuracy of large language models.

Contribution

It formalizes reasoning as a sequence of sub-questions with BAM, introduces the E3 metric, and proposes a model-agnostic adaptive scheduling method for efficient inference.

Findings

01. Up to 70% accuracy gains across tasks

02. 39% token reduction in reasoning

03. 193.8% improvement in E3 metric

Abstract

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E3 metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget,…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

I feel this addresses a real problem: overthinking and underthinking are genuine issues in reasoning LLMs. The theoretical grounding via BAM provides principled justification beyond heuristics. The E3 metric is sensible: A^2/T appropriately weights correctness over pure efficiency, unlike A/T. The model-agnostic approach works across multiple architectures without retraining. The experiments are comprehensive, covering three diverse domains with consistent improvements. The decay scheduling strategies are clearly presented.
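The reviewer's point about A^2/T versus A/T can be made concrete with a small sketch. It assumes accuracy A is a fraction in [0, 1] and T is the average number of generated tokens; the function names and the two example systems are hypothetical and are not taken from the paper.

```python
def e3(accuracy: float, avg_tokens: float) -> float:
    """E3-style score A^2 / T, in the form the reviewer describes."""
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return accuracy ** 2 / avg_tokens


def accuracy_per_token(accuracy: float, avg_tokens: float) -> float:
    """Plain A / T ratio, shown only for comparison."""
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return accuracy / avg_tokens


# Two made-up systems: one accurate but verbose, one terse but weak.
careful = {"accuracy": 0.9, "avg_tokens": 1000.0}
terse = {"accuracy": 0.5, "avg_tokens": 500.0}

# A / T prefers the terse system; A^2 / T prefers the accurate one.
prefers_terse_under_ratio = (
    accuracy_per_token(**terse) > accuracy_per_token(**careful)
)
prefers_careful_under_e3 = e3(**careful) > e3(**terse)
```

Squaring the accuracy means that halving the token count does not compensate for a large drop in correctness, which is the weighting the reviewer calls sensible.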

Weaknesses

The core assumption is questionable: the uncertainty decomposition (epistemic vs. aleatoric) requires a Bayesian treatment, but the paper doesn't actually compute the posterior p(theta|D); it just hand-waves with a "Monte Carlo approximation" in Appendix B without showing how this applies to deterministic transformer inference. BAM's power law U_epistemic = c/b^β is asserted, not derived: why an inverse power law specifically? The parameters c_ij and β_ij are never actually estimated, making Eq. 6 theoretical only. Decay sche
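As a reading aid for the power law the reviewer quotes, the following worked step shows what the form U_epistemic = c/b^β implies for a single sub-question. It assumes b > 0, c > 0, and β > 0, and it is an illustration of the functional form only, not a reproduction of the paper's Eq. 6.

```latex
% Illustrative only: one sub-question with budget b > 0, and c > 0, \beta > 0.
U_{\text{epistemic}}(b) = \frac{c}{b^{\beta}}
\quad\Longrightarrow\quad
\frac{dU_{\text{epistemic}}}{db} = -\frac{c\,\beta}{b^{\beta+1}} < 0,
\qquad
\frac{d^{2}U_{\text{epistemic}}}{db^{2}} = \frac{c\,\beta\,(\beta+1)}{b^{\beta+2}} > 0 .
```

Under these assumptions uncertainty falls as budget grows, but each additional token removes less of it than the previous one. If several sub-questions with their own c_j and β_j share a fixed total budget, the standard Lagrange-multiplier condition equalizes the marginal reductions c_j β_j / b_j^(β_j+1) across j, which is why the reviewer's concern about unestimated parameters matters for applying such a rule in practice.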

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Clear formulation & theory: BAM provides a principled lens on how to distribute limited compute across subproblems.
2. Simple, model-agnostic implementation: Works at prompt time; no training; adds a lightweight planning LLM whose contribution is controlled by planned baselines.
3. Consistent gains across tasks/models: Large E3 improvements with stable or improved accuracy; front-loaded schedules (polynomial/cosine) perform best on complex tasks.
4. Bridging model sizes: The method help
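The "front-loaded schedules" mentioned in the strengths above can be sketched as follows. The weight formulas here are common textbook forms of linear, polynomial, and cosine decay chosen for illustration; the paper's exact definitions are not given in this excerpt, so the function names, the `power` parameter, and the rounding rule are all assumptions.

```python
import math


def decay_weights(n: int, schedule: str = "cosine", power: float = 2.0) -> list:
    """Normalized per-sub-question weights (assumed forms, not the paper's)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if schedule == "uniform":
        raw = [1.0] * n
    elif schedule == "linear":
        raw = [1.0 - i / n for i in range(n)]
    elif schedule == "polynomial":
        raw = [(1.0 - i / n) ** power for i in range(n)]
    elif schedule == "cosine":
        raw = [0.5 * (1.0 + math.cos(math.pi * i / n)) for i in range(n)]
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    total = sum(raw)
    return [w / total for w in raw]


def allocate_budget(total_tokens: int, n: int, schedule: str = "cosine",
                    power: float = 2.0) -> list:
    """Split an integer token budget across n ordered sub-questions."""
    weights = decay_weights(n, schedule, power)
    budgets = [int(total_tokens * w) for w in weights]
    # Flooring can leave a few tokens unassigned; give them to the
    # earliest sub-questions so the split stays front-loaded.
    leftover = total_tokens - sum(budgets)
    for i in range(leftover):
        budgets[i % n] += 1
    return budgets


cosine_split = allocate_budget(1000, 4, "cosine")
polynomial_split = allocate_budget(1000, 4, "polynomial")
uniform_split = allocate_budget(1000, 4, "uniform")
```

Every decaying schedule here gives earlier sub-questions at least as many tokens as later ones, while the uniform split serves as the comparison point for a single global budget divided evenly.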

Weaknesses

1. Uncertainty signal not directly validated. The budgeting step is motivated by uncertainty reduction, yet the experiments do not present a correlation between the proxies used (e.g., prefix-entropy drop) and actual downstream gains, nor do they show a head-to-head comparison between decay-only and truly measured uncertainty allocation. An ablation that (a) measures token-level entropy and (b) reallocates budgets on-the-fly based on it would strengthen the central claim.
2. Metric definition vs. datasets
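Part (a) of the ablation suggested in the first weakness, measuring token-level entropy, could look roughly like the sketch below. It assumes access to raw next-token logits as plain lists of floats; the function names are hypothetical, and nothing here reflects how the paper itself measures uncertainty.

```python
import math


def token_entropy(logits: list) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    if not logits:
        raise ValueError("logits must be non-empty")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_logits: list) -> float:
    """Average token entropy over a sequence of decoding steps."""
    if not step_logits:
        raise ValueError("step_logits must be non-empty")
    return sum(token_entropy(step) for step in step_logits) / len(step_logits)


# A flat distribution is maximally uncertain; a peaked one is nearly certain.
flat = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([100.0, 0.0, 0.0])
```

A per-sub-question average of this quantity is one candidate signal that a measured-uncertainty allocation could use in place of a fixed decay schedule.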

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The paper is detailed and transparent, making it easy for readers to understand and replicate the proposed approach.
2. The method design is rigorous: using uncertainty as a proxy for reasoning difficulty and allocating computation where it provides the highest marginal gain is both intuitive and elegant.
3. The experiments are comprehensive, covering multiple models (both open- and closed-source) and diverse benchmarks, demonstrating strong generalizability.

Weaknesses

1. Although the method is conceptually rigorous, Tables 3–5 show that it may still hurt accuracy compared to full-length baselines.
2. Some of the simpler baselines (e.g., Planned Vanilla or Global Budget) already achieve noticeable token savings with minimal performance loss. This raises the question of whether the added complexity of PLAN-AND-BUDGET always justifies its gains.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification