Faster and Better LLMs via Latency-Aware Test-Time Scaling

Zili Wang; Tianyu Zhang; Haoli Bai; Lu Hou; Xianzhi Yu; Wulong Liu; Shiming Xiang; Lei Zhu

arXiv:2505.19634·cs.CL·September 15, 2025

TL;DR

This paper introduces latency-aware test-time scaling (TTS) methods that optimize large language model inference for wall-clock latency as well as accuracy, combining concurrent inference branches with speculative decoding for latency-critical applications.

Contribution

It proposes two concurrency-based approaches for latency-optimal TTS, branch-wise parallelism and sequence-wise parallelism via speculative decoding, to improve inference efficiency and accuracy in latency-sensitive scenarios.

Findings

1. Achieves 82.3% accuracy on MATH-500 within 1 minute for a 32B model.
2. Enables a 3B model to reach 72.4% accuracy within 10 seconds.
3. Demonstrates the importance of latency-aware TTS for speed and accuracy trade-offs.

Abstract

Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a…
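To make the idea of branch-wise parallelism concrete, here is a minimal Python sketch. It is not the paper's implementation: `toy_generate` is a hypothetical stand-in for an LLM call, majority voting is assumed as the aggregation rule, and `estimated_latency` is a deliberately crude wall-clock model that assumes every branch takes the same time and that branches run in waves limited by the concurrency setting. Sequence-wise parallelism (speculative decoding) is not modeled here.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def run_branches(generate, prompt, n_branches, max_workers):
    """Run n_branches independent generations concurrently, then vote.

    `generate(prompt, seed)` is a hypothetical callable standing in for
    one sampled LLM completion.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so answers[i] belongs to seed i.
        answers = list(pool.map(lambda seed: generate(prompt, seed),
                                range(n_branches)))
    return majority_vote(answers), answers


def estimated_latency(n_branches, max_workers, seconds_per_branch):
    """Toy model (assumption): branches run in waves of size max_workers."""
    return math.ceil(n_branches / max_workers) * seconds_per_branch


def toy_generate(prompt, seed):
    # Hypothetical stub: one branch in four returns a wrong answer.
    return "5" if seed % 4 == 3 else "4"


if __name__ == "__main__":
    answer, answers = run_branches(toy_generate, "What is 2 + 2?", 8, 4)
    print(answer, answers)
    # Same number of branches (same compute), different wall-clock time.
    print(estimated_latency(8, 4, 2.0), estimated_latency(8, 8, 2.0))
```

Under this toy model, the two `estimated_latency` calls spend the same total compute on eight branches but differ in wall-clock time depending on how many run concurrently, which is the kind of gap between compute-optimal and latency-optimal settings the abstract refers to.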

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings