Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song; Saket Dingliwal; Sai Muralidhar Jayanthi; Bhavana Ganesh; Jinwoo Shin; Aram Galstyan; Sravan Babu Bodapati

arXiv:2506.04708 · cs.CL · May 22, 2026

TL;DR

STAND is a model-free speculative decoding method that accelerates language model reasoning by exploiting redundancy in reasoning paths, reducing inference latency by over 60% without accuracy loss.

Contribution

The paper introduces STAND (STochastic Adaptive N-gram Drafting), a model-free speculative decoding approach that exploits redundancy across reasoning paths to accelerate inference without a separate draft model and without loss of accuracy.

Findings

1. Reduces inference latency by 60-65% across multiple models and tasks.

2. Maintains accuracy comparable to standard decoding methods.

3. Outperforms existing speculative decoding techniques in diverse inference scenarios.

Abstract

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven…
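To make the idea of stochastic N-gram drafting more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that previously generated reasoning paths are available as token lists, and it uses empirical follow-up counts as a stand-in for the logit-based N-gram module mentioned in the abstract. The names `build_ngram_table`, `gumbel_top_k`, and `draft_candidates` are hypothetical. The sampling step relies on the standard Gumbel-Top-K trick: adding independent Gumbel noise to log-probabilities and keeping the k largest values draws k distinct items without replacement.

```python
import math
import random
from collections import defaultdict


def build_ngram_table(token_sequences, n=3):
    """Map each (n-1)-token context to counts of the token that followed it.

    Assumption: counts stand in for the stored probabilistic information;
    the paper's module is described as logit-based.
    """
    table = defaultdict(lambda: defaultdict(int))
    for seq in token_sequences:
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i:i + n - 1])
            table[context][seq[i + n - 1]] += 1
    return table


def gumbel_top_k(log_probs, k, rng):
    """Sample k distinct tokens without replacement via Gumbel-Top-K."""
    perturbed = []
    for token, lp in log_probs.items():
        u = max(rng.random(), 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((lp + gumbel, token))
    perturbed.sort(reverse=True)
    return [token for _, token in perturbed[:k]]


def draft_candidates(table, context, k, rng):
    """Stochastically propose up to k next-token drafts for a context."""
    counts = table.get(tuple(context))
    if not counts:
        return []  # no match: fall back to ordinary decoding
    total = sum(counts.values())
    log_probs = {tok: math.log(c / total) for tok, c in counts.items()}
    return gumbel_top_k(log_probs, min(k, len(log_probs)), rng)


# Two toy reasoning paths that share most of their tokens.
paths = [
    ["let", "x", "=", "2", "so", "x", "+", "1", "=", "3"],
    ["let", "x", "=", "2", "then", "x", "+", "1", "=", "3"],
]
table = build_ngram_table(paths, n=3)
rng = random.Random(0)
candidates = draft_candidates(table, ["=", "2"], k=2, rng=rng)
```

In a full speculative decoding loop, the drafted tokens would then be verified by the target model, which is what keeps the output distribution unchanged; that verification step is omitted here.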
