PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Bradley McDanel, Sai Qian Zhang, Yunhai Hu, Zining Liu

TL;DR
PipeSpec introduces a hierarchical pipeline framework for speculative decoding in large language models, enabling asynchronous execution and significantly improving inference speed across multiple tasks and models.
Contribution
PipeSpec generalizes speculative decoding to a hierarchical pipeline of models, reducing stage dependencies and increasing hardware utilization for faster LLM inference.
Findings
Achieves up to 2.54× speedup over existing methods.
Demonstrates scalability with model depth in multi-device systems.
Validates effectiveness on text summarization and code generation tasks.
Abstract
Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to models arranged in a hierarchical pipeline, enabling asynchronous execution with lightweight coordination for prediction verification and rollback. Our analytical model characterizes token generation rates across pipeline stages and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. We further derive closed-form expressions for steady-state verification probabilities that explain the empirical benefits of pipeline depth. Experimental results show that PipeSpec achieves up to 2.54× speedup while…
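For readers new to the baseline that the abstract builds on, the following is a minimal sketch of conventional two-model, synchronous speculative decoding with greedy verification. It is not PipeSpec's asynchronous pipeline and is not taken from the paper. The toy `draft_model` and `target_model` functions, the parameter `k`, and all other names are hypothetical stand-ins chosen for illustration only.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: next token is the prefix sum modulo 7.
    return sum(prefix) % 7


def draft_model(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def greedy_decode(prompt: List[int], num_new: int, target: Model) -> List[int]:
    # Reference: plain autoregressive decoding with the target model only.
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(prompt: List[int], num_new: int, k: int,
                       draft: Model, target: Model) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # Draft stage: propose k candidate tokens autoregressively.
        candidates: List[int] = []
        for _ in range(k):
            candidates.append(draft(tokens + candidates))

        # Verify stage: a real system would score all k positions in one
        # batched forward pass of the target; this sketch loops instead.
        accepted: List[int] = []
        for cand in candidates:
            expected = target(tokens + accepted)
            if cand == expected:
                accepted.append(cand)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every candidate matched, so the target adds one more token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]
```

Because every kept token equals what the target would have produced for the same prefix, the output matches plain greedy decoding with the target. In this sketch the draft stage waits while the verify stage runs, and vice versa, which is the kind of sequential stage dependency the abstract describes.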
Taxonomy
Topics: Neural Networks and Applications · Algorithms and Data Compression · Advanced Data Processing Techniques
Methods: LLaMA
