PaSS: Parallel Speculative Sampling
Giovanni Monea, Armand Joulin, Edouard Grave

TL;DR
PaSS is a parallel decoding method for large language models in which a single model drafts several tokens at once. It speeds up token generation by up to 30% without a second draft model, by amortizing the memory-bound cost of each forward pass over multiple tokens.
Contribution
The paper presents a parallel decoding approach that drafts multiple tokens simultaneously within a single model, removing the second, smaller draft model that standard speculative sampling requires.
Findings
Achieves up to 30% speed-up in token generation.
Requires only O(d_emb) additional parameters, where d_emb is the embedding dimension.
Does not need a second model or shared tokenizer.
Abstract
Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation, and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for just one token. These two observations lead to the development of speculative sampling, where a second smaller model is used to draft a few tokens, which are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer and thus limits its adoption. As an alternative, we propose to use…
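The draft-then-verify step described in the abstract can be illustrated with a small sketch. This is a simplified greedy variant, not the paper's sampling procedure: a drafted token is kept only if it matches the target model's own greedy choice, and the first mismatch is replaced by the target's token. The names `verify_draft` and `toy_target_next` are hypothetical, and the toy "model" (next token = last token + 1 mod 10) merely stands in for a single parallel forward pass of a real language model.

```python
from typing import Callable, List, Sequence


def toy_target_next(tokens: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for ONE parallel forward pass of the large model.

    out[i] is the model's greedy next token given tokens[: i + 1].
    The toy rule is: next token = (last token + 1) mod 10.
    """
    return [(t + 1) % 10 for t in tokens]


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], List[int]],
) -> List[int]:
    """Validate drafted tokens with a single call to the target model.

    Greedy simplification (an assumption, not the paper's rule): accept
    drafted tokens while they equal the target's greedy prediction; at the
    first mismatch, emit the target's token instead and stop.
    """
    if not prefix:
        raise ValueError("prefix must contain at least one token")

    # One pass scores the prefix and every drafted position together.
    preds = target_next(list(prefix) + list(draft))
    start = len(prefix) - 1  # preds[start] predicts the first drafted slot

    out: List[int] = []
    for i, tok in enumerate(draft):
        target_tok = preds[start + i]
        if tok != target_tok:
            out.append(target_tok)  # correct the first wrong guess, then stop
            return out
        out.append(tok)

    # Every drafted token was accepted, so the same pass yields one more token.
    out.append(preds[start + len(draft)])
    return out


if __name__ == "__main__":
    print(verify_draft([1, 2, 3], [4, 5, 9], toy_target_next))
    print(verify_draft([1, 2, 3], [4, 5, 6], toy_target_next))
```

In this sketch each verify step always produces at least one token and at most `len(draft) + 1`, which is why drafting pays off when a multi-token forward pass costs about the same as a single-token one.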
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
