On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Yiming Wang, Zhuosheng Zhang, Rui Wang

TL;DR
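This paper analyzes the overscaling curse in parallel thinking, where a single global sampling budget tuned for dataset accuracy is much larger than many samples need. It introduces LanBo to predict sample-specific budgets from the model's latent representations, and proposes PreAda, which allocates budgets before decoding to improve hardware efficiency.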
Contribution
The paper formally analyzes the overscaling curse, introduces LanBo (Latent Budget Predictor) to predict sample-specific optimal budgets, and develops PreAda (Pre-decoding Budget Adaptation), which allocates budgets before decoding to improve efficiency.
Findings
LanBo significantly improves budget utilization while maintaining dataset accuracy.
PreAda improves hardware efficiency in both latency and memory.
The analysis quantifies the prevalence and severity of the overscaling curse in real-world systems.
Abstract
Parallel thinking improves LLM reasoning through multi-path sampling and aggregation. In standard evaluations, due to a lack of sample-specific priors, all samples share a global budget chosen to maximize dataset accuracy. However, many samples reach their best accuracy with much smaller budgets, causing low budget utilization. This contradiction between system efficacy and sample efficiency constitutes the Overscaling Curse. In this paper, we first provide a formal analysis of the overscaling curse and quantify its prevalence and severity in real-world systems. To break it, we propose Latent Budget Predictor (LanBo), which probes model latent representations to predict sample-specific optimal budgets. LanBo significantly improves budget utilization while maintaining dataset accuracy. We further integrate LanBo into the full decoding pipeline, inspiring Pre-decoding Budget Adaptation…
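The contradiction described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's formalization: the accuracy table is invented, the helper names (`first_argmax`, `global_budget`, `sample_budgets`, `budget_utilization`) are hypothetical, and "utilization" is assumed here to mean the average sample-specific optimal budget divided by the global budget.

```python
from typing import List, Sequence


def first_argmax(values: Sequence[float]) -> int:
    """Index of the first maximum, i.e. the smallest budget reaching the best value."""
    best = max(values)
    return next(i for i, v in enumerate(values) if v == best)


def global_budget(budgets: Sequence[int], acc: Sequence[Sequence[float]]) -> int:
    """Smallest budget that maximizes accuracy averaged over all samples."""
    n = len(acc)
    mean_acc = [sum(row[j] for row in acc) / n for j in range(len(budgets))]
    return budgets[first_argmax(mean_acc)]


def sample_budgets(budgets: Sequence[int], acc: Sequence[Sequence[float]]) -> List[int]:
    """Smallest budget at which each sample reaches its own best accuracy."""
    return [budgets[first_argmax(row)] for row in acc]


def budget_utilization(budgets: Sequence[int], acc: Sequence[Sequence[float]]) -> float:
    """Assumed metric: mean sample-specific budget / global budget."""
    g = global_budget(budgets, acc)
    per_sample = sample_budgets(budgets, acc)
    return sum(per_sample) / (len(per_sample) * g)


# Invented data: acc[i][j] is the accuracy of sample i with budgets[j] parallel paths.
budgets = [1, 2, 4, 8]
acc = [
    [1.0, 1.0, 1.0, 1.0],  # easy sample: one path already suffices
    [0.2, 0.6, 0.6, 0.6],  # saturates at two paths
    [0.1, 0.3, 0.5, 0.9],  # hard sample: keeps improving up to eight paths
]

print(global_budget(budgets, acc))       # 8
print(sample_budgets(budgets, acc))      # [1, 2, 8]
print(budget_utilization(budgets, acc))  # 11/24, roughly 0.458
```

In this toy case the single hard sample pushes the global budget to 8, while the other two samples would reach their best accuracy with 1 and 2 paths, so under the assumed metric more than half of the spent budget is unnecessary.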
