Speeding up Speculative Decoding via Sequential Approximate Verification
Meiyu Zhong, Noel Teku, Ravi Tandon

TL;DR
This paper introduces SPRINTER, a method that uses sequential approximate verification with a low-complexity verifier to speed up Large Language Model inference by reducing calls to the larger target model.
Contribution
SPRINTER is a novel approach that replaces periodic parallel verification with sequential approximate verification, significantly reducing latency and computational costs in speculative decoding.
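The control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sequential_approx_decode`, the three callables, and the toy token rules are hypothetical stand-ins for a draft LLM, a low-complexity verifier, and a target LLM. The sketch also assumes that, on rejection, the target model simply supplies the replacement token, a detail this summary does not specify.

```python
from typing import Callable, List, Tuple


def sequential_approx_decode(
    draft_next: Callable[[List[int]], int],
    verifier_accepts: Callable[[List[int], int], bool],
    target_next: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time, checking each draft token with a cheap verifier.

    Returns the full token list and the number of target-model calls.
    All three callables are hypothetical placeholders for real models.
    """
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        candidate = draft_next(tokens)
        if verifier_accepts(tokens, candidate):
            # Verifier predicts the target would accept: keep the draft token
            # without calling the target model.
            tokens.append(candidate)
        else:
            # Verifier flags the token: fall back to the target model
            # (assumed here to produce the replacement token directly).
            tokens.append(target_next(tokens))
            target_calls += 1
    return tokens, target_calls


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_draft(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_verifier(ctx: List[int], tok: int) -> bool:
    return tok % 4 != 0  # reject multiples of 4


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 2) % 10


out_tokens, n_target_calls = sequential_approx_decode(
    toy_draft, toy_verifier, toy_target, prompt=[1], max_new_tokens=5
)
print(out_tokens, n_target_calls)
```

In this toy run the target is called only for the single token the verifier rejects, which is the source of the reduced target-model usage the summary describes. How often that happens in practice depends on the verifier's accuracy.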
Findings
SPRINTER achieves higher speedups than traditional speculative decoding.
Theoretical analysis confirms the statistical soundness of the approximate verifier.
Experimental results show maintained generation quality with reduced latency.
Abstract
Speculative Decoding (SD) is a recently proposed technique for faster inference using Large Language Models (LLMs). SD operates by using a smaller draft LLM for autoregressively generating a sequence of tokens and a larger target LLM for parallel verification to ensure statistical consistency. However, periodic parallel calls to the target LLM for verification prevent SD from achieving even lower latencies. We propose SPRINTER, which utilizes a low-complexity verifier trained to predict if tokens generated from a draft LLM would be accepted by the target LLM. By performing sequential approximate verification, SPRINTER does not require verification by the target LLM, which is invoked only when a token is deemed unacceptable. This reduces the number of calls to the larger LLM, achieving further speedups and lower computation cost. We present a theoretical analysis of SPRINTER, examining the…
Taxonomy
Topics: Algorithms and Data Compression
