PaSS: Parallel Speculative Sampling

Giovanni Monea; Armand Joulin; Edouard Grave

arXiv:2311.13581 · cs.CL · November 23, 2023 · 2 cites

Open Access

TL;DR

PaSS introduces a parallel decoding method for large language models that speeds up token generation by up to 30% without needing a second model, reducing memory bottlenecks during inference.

Contribution

The paper presents a parallel decoding approach that drafts multiple tokens simultaneously within a single model, removing the second, smaller draft model that standard speculative sampling requires.

Findings

01  Achieves up to 30% speed-up in token generation.

02  Requires only O(d_emb) additional parameters.

03  Does not need a second model or shared tokenizer.
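To give a sense of scale for the O(d_emb) finding, the arithmetic below is a rough sketch only. It assumes the added parameters are a small number of extra learned vectors of width d_emb; this page does not spell out the mechanism, and the vector count, embedding width, and base model size are hypothetical values chosen for illustration.

```python
def extra_param_fraction(num_extra_vectors, d_emb, base_params):
    """Return (extra parameter count, fraction of the base model size).

    Assumption: each added vector has d_emb entries, so the overhead
    is num_extra_vectors * d_emb. All inputs here are illustrative.
    """
    extra = num_extra_vectors * d_emb
    return extra, extra / base_params


# Hypothetical numbers: 4 extra vectors, d_emb = 4096, a 7B-parameter model.
extra, fraction = extra_param_fraction(4, 4096, 7_000_000_000)
print(f"extra parameters: {extra}, fraction of model: {fraction:.2e}")
```

Under these assumed values the overhead is a few thousand parameters against billions, which is why it is negligible next to loading a second draft model.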

Abstract

Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for just one token. These two observations led to the development of speculative sampling, where a second smaller model is used to draft a few tokens, which are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer and thus limits its adoption. As an alternative, we propose to use…
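The "validated or rejected" step of two-model speculative sampling can be sketched as follows. The abstract does not give the acceptance rule; the code assumes the rule commonly used in the speculative sampling literature (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q). The function name `verify_draft` and the toy list-based distributions are hypothetical, and this illustrates the baseline the abstract describes, not the PaSS method itself.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Validate k drafted tokens against the large model's distributions.

    draft_tokens: k token ids proposed by the draft model
    draft_probs:  k distributions (lists over the vocabulary) they were sampled from
    target_probs: k + 1 distributions from one parallel pass of the large model
    Returns the tokens to emit: between 1 and k + 1 of them.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]  # large-model probability of the drafted token
        q = draft_probs[i][tok]   # draft-model probability (> 0, since tok was sampled from it)
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)   # accepted: keep the drafted token
            continue
        # Rejected: resample from the positive part of (p - q), then stop.
        residual = [max(0.0, pt - qt)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        emitted.append(rng.choices(range(len(residual)), weights=residual)[0])
        return emitted
    # Every draft was accepted: sample one extra token from the last distribution.
    last = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted


rng = random.Random(0)
same = [[0.5, 0.5], [0.25, 0.75]]
# When draft and target distributions agree, every drafted token is accepted.
print(verify_draft([0, 1], same, same + [[1.0, 0.0]], rng))
```

Because the large model scores all drafted positions in one pass, each verification step emits at least one token and up to k + 1, which is where the speed-up comes from when parallel passes cost about the same as a single-token pass.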

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training