Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Nadav Timor; Jonathan Mamou; Daniel Korat; Moshe Berchansky; Oren Pereg; Moshe Wasserblat; Tomer Galanti; Michal Gordon; David Harel

arXiv:2405.14105·cs.DC·March 18, 2025

TL;DR

Distributed Speculative Inference (DSI) is an inference algorithm that uses speculation parallelism to run faster than both speculative inference (SI) and standard autoregressive inference (non-SI) across various models and tasks, with simulated speedups of 1.29-1.92x over SI.

Contribution

The paper introduces DSI, an inference algorithm proven to be faster than both speculative inference (SI) and standard autoregressive inference (non-SI) given any drafters. This closes the gap in which SI becomes slower than non-SI because its drafters are too slow or inaccurate.

Findings

1. DSI achieves a 1.29-1.92x speedup over SI in simulations.
2. DSI is faster than both SI and non-SI for various LMs and tasks.
3. An open-source implementation is available.

Abstract

This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI, given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages speculation…
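The gap described in the abstract, where SI can be slower than non-SI, can be illustrated with the standard analytic model from the SI literature (the expected-speedup formula in [leviathan2023]). The sketch below is not the paper's own analysis and does not model DSI. It assumes each drafted token is accepted independently with a fixed probability, and that one drafter forward pass costs a fixed fraction of one target forward pass. The function name `si_expected_speedup` and its parameters are hypothetical and chosen for illustration only.

```python
def si_expected_speedup(acceptance_rate: float,
                        drafter_cost_ratio: float,
                        lookahead: int) -> float:
    """Expected speedup of SI over non-SI under a simplified model.

    Assumptions (not taken from the DSI paper):
      - each drafted token is accepted i.i.d. with probability `acceptance_rate`;
      - one drafter forward costs `drafter_cost_ratio` target forwards;
      - each SI iteration drafts `lookahead` tokens, then verifies them
        with a single target forward.
    """
    a, c, k = acceptance_rate, drafter_cost_ratio, lookahead
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if c < 0.0 or k < 1:
        raise ValueError("drafter_cost_ratio must be >= 0 and lookahead >= 1")

    # Expected number of tokens produced per SI iteration:
    # 1 + a + a^2 + ... + a^k (accepted drafts plus one target token).
    if a == 1.0:
        expected_tokens = float(k + 1)
    else:
        expected_tokens = (1.0 - a ** (k + 1)) / (1.0 - a)

    # Cost of one SI iteration, in units of one target forward pass.
    iteration_cost = k * c + 1.0

    # Non-SI produces exactly one token per target forward pass.
    return expected_tokens / iteration_cost


if __name__ == "__main__":
    # A fast, accurate drafter: SI helps (value > 1).
    print(si_expected_speedup(acceptance_rate=0.8, drafter_cost_ratio=0.1, lookahead=5))
    # A slow, inaccurate drafter: SI hurts (value < 1).
    print(si_expected_speedup(acceptance_rate=0.2, drafter_cost_ratio=0.5, lookahead=5))
```

Under these assumptions, a value below 1 means SI is slower than non-SI, which is the regime the abstract says DSI is proven to avoid.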

Code & Models

Repositories

keyboardAnt/distributed-speculative-inference (official)

Videos

Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference (SlidesLive)

Taxonomy

Topics: Topic Modeling