$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts

Mert Cemri; Nived Rajaraman; Rishabh Tiwari; Xiaoxuan Liu; Kurt Keutzer; Ion Stoica; Kannan Ramchandran; Ahmad Beirami; Ziteng Sun

arXiv:2506.15733·cs.AI·February 20, 2026

TL;DR

SPECS is a latency-aware test-time scaling method for large language models that uses speculative drafts and reward signals to improve efficiency and reduce latency without sacrificing accuracy.

Contribution

The paper introduces SPECS, which combines speculative decoding with reward-guided evaluation to optimize test-time scaling under latency constraints.

Findings

01 Reduces latency by up to 19.1% while maintaining or improving accuracy.

02 Matches or surpasses beam search accuracy on multiple datasets.

03 Converges to the solution of a KL-regularized reinforcement learning objective.
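
For context on the third finding, the standard form of a KL-regularized reinforcement learning objective and its well-known closed-form solution can be written as follows. The symbols here are assumptions introduced for illustration and do not come from this page: $\pi_{\mathrm{ref}}$ is a reference policy, $r$ a reward function, and $\beta > 0$ a regularization strength. The paper's exact notation and choice of reference policy may differ.

$$
\pi^{\star} = \arg\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] - \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
$$

$$
\pi^{\star}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x) \, \exp\big( r(x, y) / \beta \big)}{\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x) \, \exp\big( r(x, y') / \beta \big)}
$$

In words, the optimal policy reweights the reference policy by exponentiated reward, and a larger $\beta$ keeps it closer to the reference.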

Abstract

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPS), often overlooking latency constraints. To address this gap, we propose $\texttt{SPECS}$, a latency-aware test-time scaling method inspired by speculative decoding. $\texttt{SPECS}$ uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral…
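
The draft-then-evaluate loop described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names `soft_select` and `specs_step`, the parameters `beta` and `defer_threshold`, the specific weighting rule (target-to-draft likelihood ratio times exponentiated reward), and the "defer when the best reward is low" rule are all hypothetical choices made for this example. Model calls are replaced by precomputed log-probabilities and rewards.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def soft_select(
    log_p_target: Sequence[float],
    log_p_draft: Sequence[float],
    rewards: Sequence[float],
    beta: float = 1.0,
) -> List[float]:
    """Return selection probabilities over draft candidates.

    Assumed rule (illustrative only): candidate i gets weight
    (p_target_i / p_draft_i) * exp(reward_i / beta), normalized
    with a numerically stable softmax in log space.
    """
    logits = [
        lt - ld + r / beta
        for lt, ld, r in zip(log_p_target, log_p_draft, rewards)
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def specs_step(
    draft_candidates: Sequence[str],
    log_p_target: Sequence[float],
    log_p_draft: Sequence[float],
    rewards: Sequence[float],
    target_fallback: Callable[[], str],
    beta: float = 1.0,
    defer_threshold: float = 0.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one candidate step, or defer to the larger target model.

    Assumed deferral rule (illustrative only): if no draft candidate
    reaches `defer_threshold` in reward, call the target model instead.
    """
    rng = rng or random.Random(0)
    if max(rewards) < defer_threshold:
        return target_fallback()
    probs = soft_select(log_p_target, log_p_draft, rewards, beta)
    return rng.choices(list(draft_candidates), weights=probs, k=1)[0]
```

In this sketch the cheap draft model proposes every candidate, and the slower target model generates text only on deferral, which is the intuition behind the latency savings. A smaller `beta` makes selection closer to greedy on reward, while a larger `beta` keeps it closer to the likelihood ratio alone.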

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications