ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

TL;DR
ParallelSpec introduces a parallel drafting approach for speculative decoding in large language models, significantly reducing inference latency by predicting multiple tokens simultaneously, and achieving substantial speedups over traditional auto-regressive methods.
Contribution
It proposes a novel parallel drafter trained to predict multiple future tokens at once, replacing auto-regressive drafting in speculative decoding for improved efficiency.
Findings
Up to 62% latency reduction on text generation benchmarks.
Achieves 2.84X overall speedup on Llama-2-13B.
Compatible with existing speculative decoding frameworks.
Abstract
Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most existing works still draft tokens auto-regressively to maintain sequential dependency in language modeling, which we consider a huge computational burden in speculative decoding. We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model. ParallelSpec learns to efficiently predict multiple future tokens in parallel using a single model, and it can be integrated into any speculative decoding framework that requires aligning the output…
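The draft-then-verify loop described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: `target_next`, `draft_next`, and both drafter functions are hypothetical stand-ins for real models, verification is shown for greedy decoding only, and the assumption that the parallel drafter is less accurate at far-ahead positions is made up for illustration. The sketch shows the cost difference the abstract points to: an auto-regressive drafter spends k drafter passes per round, while a parallel drafter spends one.

```python
from typing import Callable, List, Tuple

Token = int
Drafter = Callable[[List[Token], int], Tuple[List[Token], int]]

def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the target LLM's greedy next token (hypothetical)."""
    return (31 * sum(ctx) + len(ctx)) % 50

def draft_next(ctx: List[Token]) -> Token:
    """Toy small drafter: matches the target except when len(ctx) % 4 == 0."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50

def autoregressive_draft(ctx: List[Token], k: int) -> Tuple[List[Token], int]:
    """Draft k tokens one at a time; costs k drafter passes."""
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))
    return draft, k

def parallel_draft(ctx: List[Token], k: int) -> Tuple[List[Token], int]:
    """Simulate a drafter that emits k tokens in ONE pass.
    Toy assumption: guesses at offsets >= 2 are always wrong."""
    draft: List[Token] = []
    for i in range(k):
        tok = draft_next(ctx + draft)
        if i >= 2:
            tok = (tok + 1) % 50
        draft.append(tok)
    return draft, 1

def verify(ctx: List[Token], draft: List[Token]) -> List[Token]:
    """Greedy verification: keep the longest matching prefix, then append
    the target's own token (a correction, or a bonus if all matched).
    A real system scores all draft positions in one target forward pass."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(ctx + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(ctx + accepted))
    return accepted

def generate(prompt: List[Token], n_new: int, k: int, drafter: Drafter):
    """Return (tokens, drafter passes, target passes)."""
    out = list(prompt)
    draft_passes = target_passes = 0
    while len(out) - len(prompt) < n_new:
        draft, cost = drafter(out, k)
        draft_passes += cost
        out += verify(out, draft)
        target_passes += 1
    return out[: len(prompt) + n_new], draft_passes, target_passes

def greedy(prompt: List[Token], n_new: int) -> List[Token]:
    """Plain token-by-token decoding with the target, for comparison."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out
```

In this toy, both drafters reproduce the target's greedy output exactly, because every emitted token is checked against the target. The parallel drafter may get fewer tokens accepted per round, yet it pays only one drafter pass per round; whether that trade-off wins in practice depends on real model latencies, which the sketch does not model.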
Peer Reviews
Decision·Submitted to ICLR 2025
+ The paper’s parallel drafter introduces a novel alternative to sequential SD models, advancing speculative decoding with efficient multi-token generation.
+ Empirical evaluations across multiple benchmarks and models, along with sound theoretical justifications, confirm the validity of the approach.
+ The work has strong implications for real-time, large-scale LLM applications, especially with the demonstrated integration into established frameworks like Medusa and EAGLE.
+ By employing reject
- Unlike speculative decoding approaches that can use pre-trained models or lightweight modifications, ParallelSpec requires a dedicated training process for the parallel drafter. This additional training step may increase setup time and computational cost, which could limit the method’s immediate applicability in certain use cases.
- The absence of publicly available code for the training and experimental setups raises concerns about reproducibility. Without the code, it may be challenging for
1. The intuition for why ParallelSpec can accelerate EAGLE is convincing. Based on the experimental results in Table 1, although ParallelSpec generates fewer tokens per iteration, it improves drafting efficiency via parallel drafting, and so improves overall efficiency.
2. The paper is well written and easy to understand.
3. The paper includes a comprehensive discussion of how to integrate the proposed method with state-of-the-art frameworks such as EAGLE and Medusa.
1. One of my main concerns is the comparison between the proposed method and Medusa. Although the authors provide an intuitive explanation that ParallelSpec has better parameter sharing, I am not fully convinced, so I expected to see a comprehensive comparison between Medusa and ParallelSpec in the experiments. However, in Table 1, the comparison between Medusa and Medusa+ParallelSpec covers only temperature=0, not temperature=1.
2. In Table 1, I believe the two settings (temp=0 and temp=1) a
**Originality:** While the idea of parallel drafting in speculative decoding was already explored, this paper is the first time the effectiveness of this method is demonstrated with a separate drafter.
**Clarity:** Overall the method is simple, elegant, and clearly presented.
**Significance:** Speculative decoding is a widely used technique for accelerating language model inference, and speeding up state-of-the-art speculative decoding methods is an important problem with clear implications fo
While accelerating speculative decoding methods is an important problem, and the ~15% speedup over existing methods is commendable, I believe the paper should be rejected for the following reasons: (1) the method is not novel enough, as it is very similar to ideas already present in existing works; (2) the results are not overly convincing, and their magnitude is generally far below the 62% reported in the abstract; (3) there are some reproducibility issues, though I think these can be easily clari
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Database Systems and Queries · Rough Sets and Fuzzy Logic
