ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Zilin Xiao; Hongming Zhang; Tao Ge; Siru Ouyang; Vicente Ordonez; Dong Yu

arXiv:2410.05589·cs.CL·October 10, 2024

Open Access · 3 Reviews

TL;DR

ParallelSpec replaces auto-regressive drafting in speculative decoding for large language models with a parallel drafter that predicts multiple tokens in a single pass, reducing inference latency and yielding speedups over auto-regressive drafting.

Contribution

It proposes a novel parallel drafter trained to predict multiple future tokens at once, replacing auto-regressive drafting in speculative decoding for improved efficiency.

Findings

01. Up to 62% latency reduction on text generation benchmarks.

02. Achieves a 2.84× overall speedup on Llama-2-13B.

03. Compatible with existing speculative decoding frameworks.

Abstract

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most existing works still draft tokens auto-regressively to maintain sequential dependency in language modeling, which we consider a huge computational burden in speculative decoding. We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model. ParallelSpec learns to efficiently predict multiple future tokens in parallel using a single model, and it can be integrated into any speculative decoding framework that requires aligning the output…
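The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy models, the function names, and the greedy (temperature 0) acceptance rule are not taken from the paper. In a real system the target model checks all drafted positions in one batched forward pass; the Python loop below only simulates that check. The point is the difference in drafting cost: an auto-regressive drafter needs `k` sequential calls, while a parallel drafter proposes `k` tokens in one call.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def draft_autoregressive(drafter: NextToken, prefix: List[Token], k: int) -> List[Token]:
    """Draft k tokens with k sequential drafter calls."""
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(prefix + draft))
    return draft


def verify_greedy(target: NextToken, prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Accept the longest draft prefix that matches the target's greedy choice.

    On the first mismatch the target's own token is emitted instead; if every
    drafted token is accepted, the target contributes one extra token.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # bonus token
    return accepted


# Hypothetical toy models over tokens 0..9 (not real language models).
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token]) -> Token:
    # Agrees with the target except after token 3, where it guesses wrongly.
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


def toy_parallel_drafter(prefix: List[Token], k: int) -> List[Token]:
    # One call proposes all k positions from the last prefix token alone.
    return [(prefix[-1] + i) % 10 for i in range(1, k + 1)]


ar_draft = draft_autoregressive(toy_drafter, [1], 4)   # 4 drafter calls
par_draft = toy_parallel_drafter([1], 4)               # 1 drafter call
print(verify_greedy(toy_target, [1], ar_draft))        # [2, 3, 4]
print(verify_greedy(toy_target, [1], par_draft))       # [2, 3, 4, 5, 6]
```

The acceptance lengths above are artifacts of the toy models and say nothing about real acceptance rates; only the contrast between `k` drafter calls and a single call reflects the idea in the abstract.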

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

+ The paper’s parallel drafter introduces a novel alternative to sequential SD models, advancing speculative decoding with efficient multi-token generation.
+ Empirical evaluations across multiple benchmarks and models, along with sound theoretical justifications, confirm the validity of the approach.
+ The work has strong implications for real-time, large-scale LLM applications, especially with the demonstrated integration into established frameworks like Medusa and EAGLE.
+ By employing reject

Weaknesses

- Unlike speculative decoding approaches that can use pre-trained models or lightweight modifications, PARALLELSPEC requires a dedicated training process for the parallel drafter. This additional training step may increase setup time and computational cost, which could limit the method’s immediate applicability in certain use cases.
- The absence of publicly available code for the training and experimental setups raises concerns about reproducibility. Without the code, it may be challenging for

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. The intuition for why ParallelSpec can accelerate EAGLE is convincing. Based on the experimental results in Table 1, although ParallelSpec generates fewer tokens per iteration, it improves drafting efficiency via parallel drafting, and so improves overall efficiency.
2. The paper is well written and easy to understand.
3. The paper includes a comprehensive discussion of how to integrate the proposed method with state-of-the-art frameworks such as EAGLE and Medusa.
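The trade-off in the first point (fewer accepted tokens per iteration, but cheaper drafting) can be made concrete with a rough cost model. This is a sketch under stated assumptions, not the paper's analysis: one target forward pass costs 1 unit, each drafter pass costs `draft_cost_ratio` units, and verification overhead beyond the single target pass is ignored. All numbers are hypothetical and are not taken from Table 1.

```python
def estimated_speedup(tokens_per_iteration: float,
                      draft_passes: int,
                      draft_cost_ratio: float) -> float:
    """Estimated speedup over plain auto-regressive decoding.

    Plain decoding produces 1 token per unit of cost. One speculative
    iteration produces `tokens_per_iteration` tokens for a cost of one
    target pass plus `draft_passes` drafter passes.
    """
    iteration_cost = 1.0 + draft_passes * draft_cost_ratio
    return tokens_per_iteration / iteration_cost


# Hypothetical numbers: an auto-regressive drafter needs 5 passes per
# iteration, a parallel drafter needs 1 but gets fewer tokens accepted.
ar_speedup = estimated_speedup(4.0, draft_passes=5, draft_cost_ratio=0.1)
parallel_speedup = estimated_speedup(3.6, draft_passes=1, draft_cost_ratio=0.1)

print(f"auto-regressive drafting: {ar_speedup:.2f}x")   # 2.67x
print(f"parallel drafting:        {parallel_speedup:.2f}x")  # 3.27x
```

Under these assumed values the parallel drafter comes out ahead despite the lower acceptance length; with a cheaper drafter or a larger drop in accepted tokens the comparison could go the other way.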

Weaknesses

1. One of my main concerns is the comparison between the proposed method and Medusa. Although the authors provide an intuitive explanation that ParallelSpec has better parameter sharing, I am not fully convinced, so I expect to see a comprehensive comparison between Medusa and ParallelSpec in the experiments. However, in Table 1, the comparison between Medusa and Medusa+ParallelSpec covers only temperature=0, not temperature=1.
2. In Table 1, I believe the two settings (temp=0 and temp=1) a

Reviewer 03 · Rating 5 · Confidence 4

Strengths

**Originality:** While the idea of parallel drafting in speculative decoding has already been explored, this paper is the first to demonstrate the effectiveness of this method with a separate drafter.
**Clarity:** Overall the method is simple, elegant, and clearly presented.
**Significance:** Speculative decoding is a widely used technique for accelerating language model inference, and speeding up state-of-the-art speculative decoding methods is an important problem with clear implications fo

Weaknesses

While accelerating speculative decoding methods is an important problem, and the ~15% speedup over existing methods is commendable, I believe the paper should be rejected for the following reasons: (1) the method is not novel enough, as it is very similar to ideas already present in existing works; (2) the results are not overly convincing, and their magnitude is generally far below the 62% reported in the abstract; (3) there are some reproducibility issues, though I think these can be easily clari

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Database Systems and Queries · Rough Sets and Fuzzy Logic