TL;DR
STAND is a model-free speculative decoding method that accelerates language model reasoning by exploiting redundancy in reasoning paths, reducing inference latency by over 60% without accuracy loss.
Contribution
Introduces STAND, a model-free speculative decoding approach that exploits redundancy across reasoning paths to accelerate inference without sacrificing accuracy.
Findings
Reduces inference latency by 60-65% across multiple models and tasks.
Maintains accuracy comparable to standard decoding methods.
Outperforms existing speculative decoding techniques in diverse inference scenarios.
Abstract
Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven…
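The abstract names two ingredients that can be illustrated in isolation: an N-gram module that keeps probability information rather than a single most-likely continuation, and Gumbel-Top-K sampling for drawing draft candidates stochastically. The following is a minimal Python sketch of that idea, not the authors' implementation. The class `NGramDraftTable`, its `update` and `draft` methods, and the toy probabilities are hypothetical names and values chosen for illustration; the paper's actual data structure, memory layout, and verification step are not shown in the excerpt above and are not reproduced here.

```python
import math
import random
from collections import defaultdict


def gumbel_top_k(log_probs, k, rng):
    """Draw up to k distinct tokens without replacement (Gumbel-Top-K trick).

    log_probs: dict mapping token -> log-probability (may be unnormalized).
    Adding independent Gumbel(0, 1) noise to each log-probability and keeping
    the k largest perturbed scores yields a sample without replacement.
    """
    perturbed = []
    for token, lp in log_probs.items():
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((lp + gumbel_noise, token))
    perturbed.sort(key=lambda pair: pair[0], reverse=True)
    return [token for _, token in perturbed[:k]]


class NGramDraftTable:
    """Hypothetical toy table: context tuple -> accumulated next-token probability mass."""

    def __init__(self):
        self.sums = defaultdict(lambda: defaultdict(float))

    def update(self, context, next_token_probs):
        # Assumption: we record the model's next-token probabilities observed
        # after `context`, instead of only the token that was finally chosen.
        entry = self.sums[tuple(context)]
        for token, p in next_token_probs.items():
            entry[token] += p

    def draft(self, context, k, rng):
        # Return up to k stochastic draft candidates for a previously seen context.
        entry = self.sums.get(tuple(context))
        if not entry:
            return []
        total = sum(entry.values())
        log_probs = {t: math.log(p / total) for t, p in entry.items() if p > 0}
        return gumbel_top_k(log_probs, k, rng)


rng = random.Random(0)
table = NGramDraftTable()
# Two earlier reasoning paths passed through the same context.
table.update(("so", "the"), {"answer": 0.7, "sum": 0.2, "value": 0.1})
table.update(("so", "the"), {"answer": 0.5, "total": 0.5})
candidates = table.draft(("so", "the"), k=2, rng=rng)
unseen = table.draft(("never", "seen"), k=2, rng=rng)
```

Here `candidates` holds two distinct tokens drawn from the stored distribution, and `unseen` is empty because no earlier path supplied that context. In a full speculative decoding loop, drafted tokens would then be checked by the target model; that step is outside this sketch.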