SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

TL;DR
This paper introduces SpecExec, a parallel speculative decoding method enabling efficient inference of large language models on consumer GPUs with RAM offloading, achieving significant speedups over traditional methods.
Contribution
SpecExec is a novel parallel decoding approach that leverages token probability distributions to enable fast LLM inference on consumer hardware with offloaded parameters.
Findings
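Achieves 4-6 tokens/sec with 4-bit quantization on consumer GPUs with RAM offloading
Achieves 2-3 tokens/sec with 16-bit weights in the same offloaded setting
Enables inference of the largest available LLMs (50B+ parameters) on consumer GPUs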
Abstract
As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens at the same time as just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It utilizes the high spikiness of the token…
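The abstract's argument is that an offloaded forward pass takes about as long for a large batch of tokens as for one token, so verifying many draft tokens per pass raises throughput. The following back-of-the-envelope sketch illustrates that arithmetic only. It is not the SpecExec algorithm. The function name `tokens_per_second`, the 5-second pass time, and the 1-second draft overhead are hypothetical values chosen for illustration. Only the figure of up to 20 tokens per target iteration comes from the abstract.

```python
def tokens_per_second(accepted_per_iteration: float,
                      target_iteration_s: float,
                      draft_overhead_s: float = 0.0) -> float:
    """Estimate throughput when one target-model pass yields several tokens.

    Assumes (as the abstract suggests for offloaded inference) that the
    target pass time does not grow with the number of tokens verified.
    """
    if accepted_per_iteration <= 0 or target_iteration_s <= 0:
        raise ValueError("token count and iteration time must be positive")
    if draft_overhead_s < 0:
        raise ValueError("draft overhead cannot be negative")
    return accepted_per_iteration / (target_iteration_s + draft_overhead_s)


# Hypothetical timings, not measurements from the paper.
TARGET_PASS_S = 5.0   # assumed time for one offloaded target-model pass
DRAFT_COST_S = 1.0    # assumed time spent drafting tokens per iteration

# Plain autoregressive decoding: one token per target pass, no draft model.
baseline = tokens_per_second(1, TARGET_PASS_S)

# Speculative decoding at the abstract's upper bound of 20 tokens per pass.
speculative = tokens_per_second(20, TARGET_PASS_S, DRAFT_COST_S)

speedup = speculative / baseline
print(f"baseline: {baseline:.2f} tok/s, speculative: {speculative:.2f} tok/s, "
      f"speedup: {speedup:.1f}x")
```

Under these assumed timings, the gain comes almost entirely from the number of tokens accepted per pass. The draft overhead matters only when it becomes comparable to the target pass time.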
Taxonomy
Topics: Digital Rights Management and Security · Mathematics, Computing, and Information Processing
