PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

Bradley McDanel; Sai Qian Zhang; Yunhai Hu; Zining Liu

arXiv:2505.01572 · cs.AI · May 6, 2025

Open Access

TL;DR

PipeSpec is a hierarchical pipeline framework for speculative decoding in large language models. It lets draft and verification stages run asynchronously and reports up to 2.54× speedup on text summarization and code generation tasks.

Contribution

PipeSpec generalizes speculative decoding from a single draft-and-verify pair to $k$ models arranged in a hierarchical pipeline, removing sequential stage dependencies and raising hardware utilization for faster LLM inference.

Findings

01 Achieves up to 2.54× speedup over existing methods.

02 Demonstrates scalability with model depth in multi-device systems.

03 Validates effectiveness on text summarization and code generation tasks.

Abstract

Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to $k$ models arranged in a hierarchical pipeline, enabling asynchronous execution with lightweight coordination for prediction verification and rollback. Our analytical model characterizes token generation rates across pipeline stages and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. We further derive closed-form expressions for steady-state verification probabilities that explain the empirical benefits of pipeline depth. Experimental results show that PipeSpec achieves up to 2.54$\times$ speedup while…
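For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of conventional two-model speculative decoding with greedy verification. It is not PipeSpec's asynchronous $k$-stage pipeline, and nothing in it is taken from the paper's code: `target_model`, `draft_model`, the `lookahead` parameter, and the toy integer vocabulary are all hypothetical stand-ins chosen so the example runs without any ML library. The sketch shows the sequential draft-then-verify loop whose stage dependency PipeSpec aims to break.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model: a fixed deterministic rule.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small model: agrees with the target
    # except when the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Plain autoregressive decoding: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    draft: Model, target: Model, prompt: List[int], n_new: int, lookahead: int = 4
) -> Tuple[List[int], int]:
    # Returns the decoded tokens and the number of draft/verify rounds used.
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # Draft phase: the small model proposes `lookahead` candidate tokens.
        candidates: List[int] = []
        for _ in range(lookahead):
            candidates.append(draft(tokens + candidates))
        # Verify phase: a real system scores all candidates in one batched
        # forward pass of the target; this toy loops over them instead.
        accepted: List[int] = []
        for cand in candidates:
            expected = target(tokens + accepted)
            if cand == expected:
                accepted.append(cand)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every candidate matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


if __name__ == "__main__":
    out, rounds = speculative_decode(draft_model, target_model, [1, 2, 3], 20)
    print(out == greedy_decode(target_model, [1, 2, 3], 20), rounds)
```

Because every kept token is checked against the target's own greedy choice, the output matches target-only decoding, and each round emits at least one token. In this baseline the draft model sits idle while the target verifies, and the target waits while the draft proposes; that alternation is the sequential stage dependency the abstract refers to.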

Taxonomy

Topics: Neural Networks and Applications · Algorithms and Data Compression · Advanced Data Processing Techniques

Methods: LLaMA