Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Nadav Timor; Jonathan Mamou; Daniel Korat; Moshe Berchansky; Gaurav Jain; Oren Pereg; Moshe Wasserblat; David Harel

arXiv:2502.05202 · cs.CL · June 12, 2025

TL;DR

This paper introduces three lossless speculative decoding algorithms that speed up large language model inference without requiring the drafter and target models to share a vocabulary and without retraining. This widens the pool of usable drafters, and the algorithms achieve speedups of up to 2.8x over standard autoregressive decoding.

Contribution

The paper presents three novel lossless speculative decoding methods that remove the shared-vocabulary constraint and work with off-the-shelf models without retraining.

Findings

1. Achieved up to 2.8x speedup in inference
2. Enabled use of any off-the-shelf model as drafter
3. Demonstrated effectiveness on summarization, programming, and long-context tasks

Abstract

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to…
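
For readers new to the term "lossless", the following minimal sketch shows the standard speculative sampling verification step from earlier shared-vocabulary SD work. It is not one of the three algorithms introduced in this paper, whose details are not given on this page. The function names (`speculative_accept`, `sample_one`, `empirical_distribution`) and the toy distributions are hypothetical and chosen only for illustration. The sketch assumes the drafter and target assign probabilities over the same small vocabulary, which is exactly the constraint the paper removes.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng):
    """Verify one drafted token (standard shared-vocabulary rule, illustrative only).

    p_target and q_draft are probability lists over the SAME vocabulary.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # q > 0 because the token was sampled from q_draft
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the renormalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(p_target)), weights=weights, k=1)[0]
    return token, False


def sample_one(p_target, q_draft, rng):
    """Draft one token from q_draft, then verify it against p_target."""
    draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    token, _ = speculative_accept(draft, p_target, q_draft, rng)
    return token


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Estimate the output distribution of the draft-then-verify procedure."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[sample_one(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy target distribution (assumption)
    q = [0.2, 0.5, 0.3]  # toy drafter distribution (assumption)
    print(empirical_distribution(p, q, n=100_000))
```

With enough samples, the printed frequencies should approach the target distribution `p` even though every candidate token was proposed by the drafter `q`. That property is what "preserve the target distribution" refers to in the abstract.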

Videos

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques