SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Ruslan Svirschevski; Avner May; Zhuoming Chen; Beidi Chen; Zhihao Jia; Max Ryabinin

arXiv:2406.02532·cs.CL·December 3, 2024·3 cites

TL;DR

This paper introduces SpecExec, a parallel speculative decoding method enabling efficient inference of large language models on consumer GPUs with RAM offloading, achieving significant speedups over traditional methods.

Contribution

SpecExec is a novel parallel decoding approach that leverages token probability distributions to enable fast LLM inference on consumer hardware with offloaded parameters.

Findings

01. Achieves 4-6 tokens/sec with 4-bit quantization

02. Achieves 2-3 tokens/sec with 16-bit weights

03. Enables inference of large LLMs (50B+ parameters) on consumer GPUs with RAM offloading

Abstract

As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens at the same time as just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It utilizes the high spikiness of the token…
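The draft-and-verify idea behind speculative decoding can be illustrated with a minimal sketch. This is a generic greedy variant, not SpecExec's own procedure, and every name in it (`speculative_generate`, `toy_target`, `toy_draft`, `draft_len`) is a hypothetical stand-in. The "models" are plain Python functions that map a token list to the next token. The point it shows is the one the abstract relies on: if one target pass over many tokens costs about as much as a pass over one token, then the number of target iterations, not the number of tokens, determines latency.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical model interface


def greedy_generate(target_next: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target iteration per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    n_new: int,
    draft_len: int,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify; returns (tokens, number of target iterations)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_iterations = 0
    while len(tokens) < goal:
        # 1. Draft: the cheap model proposes draft_len tokens one by one.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. Verify: a real engine would score all draft positions in ONE
        #    batched target pass; the inner loop here only emulates that pass.
        target_iterations += 1
        for i in range(draft_len + 1):
            expected = target_next(tokens + draft[:i])
            tokens.append(expected)  # either the confirmed draft token or the fix
            if i == draft_len or draft[i] != expected:
                break  # stop at the first mismatch (or after the bonus token)
    return tokens[:goal], target_iterations


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (3 * ctx[-1] + 1) % 17


def toy_draft(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    # Deliberately wrong whenever the context length is a multiple of 4.
    return guess if len(ctx) % 4 else (guess + 1) % 17


if __name__ == "__main__":
    out, iters = speculative_generate(toy_target, toy_draft, [1], n_new=12, draft_len=4)
    print(out == greedy_generate(toy_target, [1], 12), iters)
```

In this toy setup the speculative loop reproduces the baseline output exactly while using 3 target iterations for 12 new tokens instead of 12. The acceptance rate is an artifact of how `toy_draft` was constructed and says nothing about real models.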

Code & Models

Repositories

yandex-research/specexec (PyTorch, official)

Videos

SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices · SlidesLive

Taxonomy

Topics: Digital Rights Management and Security · Mathematics, Computing, and Information Processing