Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
Zhiyuan Zhai, Bingcong Li, Bingnan Xiao, Ming Li, Xin Wang

TL;DR
This paper introduces a method for dynamically allocating inference-time compute across the inputs to a large language model, maximizing accuracy under a fixed average compute budget through a two-stage solve-then-learn approach.
Contribution
It formalizes compute allocation as a constrained optimization problem and develops a practical policy, learned via supervised classification, that makes allocation decisions in real time to improve inference efficiency.
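As a rough illustration, the constrained problem can be written as follows. The notation is assumed here for exposition and is not taken from the paper: \pi maps an input x to a compute level, p(x, a) is the probability of answering x correctly at level a, c(a) is the cost of level a, and B is the average compute budget.

```latex
\max_{\pi} \; \mathbb{E}_{x}\big[\, p(x, \pi(x)) \,\big]
\quad \text{subject to} \quad
\mathbb{E}_{x}\big[\, c(\pi(x)) \,\big] \le B
```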
Findings
Achieves up to 12.8% relative accuracy improvement on the MATH dataset.
Closely tracks the Lagrangian oracle with over 91% imitation accuracy.
Outperforms uniform and heuristic baselines across multiple models.
Abstract
Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In…
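The solve stage described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a small discrete set of compute levels, a known matrix `acc[i, a]` of per-instance success probabilities, and a cost vector `cost[a]` sorted in increasing order. The names `oracle_actions`, `average_cost`, and `fit_lambda`, and the toy numbers, are all hypothetical. For a multiplier `lam`, each instance independently picks the level maximizing `acc[i, a] - lam * cost[a]`; because the resulting average cost does not increase as `lam` grows, bisection finds a multiplier that meets the budget.

```python
import numpy as np

def oracle_actions(acc, cost, lam):
    """Per-instance action maximizing accuracy minus priced cost.

    Ties go to the lowest index, i.e. the cheapest level when
    `cost` is sorted in increasing order.
    """
    return np.argmax(acc - lam * cost[None, :], axis=1)

def average_cost(acc, cost, lam):
    """Mean compute spent when every instance follows the oracle at `lam`."""
    return float(cost[oracle_actions(acc, cost, lam)].mean())

def fit_lambda(acc, cost, budget, iters=50):
    """Bisect for a multiplier whose induced average cost is within budget."""
    if budget < cost.min():
        raise ValueError("budget is below the cheapest action's cost")
    if average_cost(acc, cost, 0.0) <= budget:
        return 0.0  # constraint is slack: no price on compute needed
    lo, hi = 0.0, 1.0
    # Grow the bracket until the upper end satisfies the budget.
    while average_cost(acc, cost, hi) > budget:
        lo, hi = hi, hi * 2.0
    # Invariant: cost at `lo` exceeds the budget, cost at `hi` does not.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if average_cost(acc, cost, mid) > budget:
            lo = mid
        else:
            hi = mid
    return hi

# Toy data (made up): 4 instances, 3 compute levels.
cost = np.array([1.0, 4.0, 16.0])
acc = np.array([
    [0.90, 0.92, 0.93],  # easy: extra compute barely helps
    [0.20, 0.60, 0.80],  # hard: benefits from more compute
    [0.50, 0.55, 0.90],  # only the largest level helps
    [0.05, 0.06, 0.07],  # out of reach at any level
])
budget = 6.0

lam = fit_lambda(acc, cost, budget)
actions = oracle_actions(acc, cost, lam)
print(lam, actions, average_cost(acc, cost, lam))
```

With a finite set of levels the induced cost is a step function of `lam`, so this sketch returns an allocation at or below the budget rather than one that matches it exactly. The learn stage would then train a classifier to predict `actions` from input features, which this sketch does not cover.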
