Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
Yuhang Wang, Youhe Jiang, Bin Cui, Fangcheng Fu

TL;DR
SART is a serving framework for LLM reasoning that manages response length and quality. It uses early stopping and dynamic pruning to reduce computation and memory costs while maintaining or improving accuracy.
Contribution
The paper introduces SART, a novel serving framework that optimizes LLM reasoning efficiency and accuracy by controlling response length and quality via empirical and theoretical methods.
Findings
SART is up to 28.2 times more efficient than existing methods.
SART maintains high reasoning accuracy with shorter responses.
Dynamic pruning reduces memory consumption significantly.
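To make early stopping and dynamic pruning concrete, here is a toy Python simulation. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Branch` fields, the fixed `score` values, and the parameters `stop_after`, `prune_below`, and `prune_every` are all hypothetical, and this summary does not describe the stopping rule or quality signal SART actually uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Branch:
    answer: str         # final answer this branch would eventually produce
    length: int         # decoding steps the branch needs to finish
    score: float        # hypothetical quality score in [0, 1] (assumed given)
    generated: int = 0  # tokens decoded so far


def serve_request(branches, stop_after, prune_below, prune_every):
    """Decode branches in lockstep; return (voted answer, total tokens decoded)."""
    active = list(branches)
    finished = []
    total_tokens = 0
    step = 0
    # Early stopping: quit as soon as `stop_after` branches have finished.
    while active and len(finished) < stop_after:
        step += 1
        for b in active:
            b.generated += 1
            total_tokens += 1
        finished += [b for b in active if b.generated >= b.length]
        active = [b for b in active if b.generated < b.length]
        # Dynamic pruning: periodically drop low-scoring unfinished branches,
        # which frees the memory they would otherwise keep occupying.
        if step % prune_every == 0:
            active = [b for b in active if b.score >= prune_below]
    votes = Counter(b.answer for b in finished[:stop_after])
    answer = votes.most_common(1)[0][0] if votes else None
    return answer, total_tokens


def make_branches():
    # Four made-up branches: two short, one long low-quality, one long.
    return [
        Branch("42", 3, 0.9),
        Branch("42", 4, 0.8),
        Branch("17", 10, 0.2),
        Branch("42", 12, 0.7),
    ]


# With early stopping and pruning: only the two shortest branches run to completion.
answer, cost = serve_request(make_branches(), stop_after=2, prune_below=0.5, prune_every=2)
# Baseline: wait for every branch and never prune.
base_answer, base_cost = serve_request(make_branches(), stop_after=4, prune_below=0.0, prune_every=2)
print(answer, cost, base_answer, base_cost)  # 42 13 42 29
```

In this made-up example both runs vote for the same answer, but the early-stopped, pruned run decodes 13 tokens instead of 29. The numbers illustrate the mechanism only and say nothing about SART's measured gains.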
Abstract
Recent advances in test-time scaling suggest that Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning (analogous to human thinking) to respond to a given request, and that exploring more reasoning branches (i.e., generating multiple responses and ensembling them) can improve the final output quality. However, when incorporating the two scaling dimensions, we find that the system efficiency is dampened significantly for two reasons. Firstly, the time cost to generate the final output increases substantially as many reasoning branches would be trapped in the over-thinking dilemma, producing excessively long responses. Secondly, generating multiple reasoning branches for each request increases memory consumption, which is unsuitable for LLM serving since we can only batch a limited number of requests to process simultaneously. To address…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Early Stopping
