FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang, Xu Chen

TL;DR
FlowSpec is a pipeline-parallel, tree-based speculative decoding framework for distributed LLM inference at the network edge. It targets the low pipeline utilization that occurs when inference requests are sparse, and reports a 1.37× to 1.73× speedup over baselines.
Contribution
The paper presents a novel tree-based speculative decoding framework with three key mechanisms to enhance distributed LLM inference efficiency at the network edge.
Findings
Achieves 1.37× to 1.73× speedup over baselines.
Improves pipeline utilization and speculative decoding efficiency.
Effective on diverse models and configurations.
Abstract
Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, where the pipeline is typically at low utilization. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification prioritizes more important draft tokens to bring earlier accepted tokens; 2) efficient draft management to prune…
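As a rough illustration of the first mechanism named in the abstract (score-based step-wise verification), the sketch below orders the nodes of a draft token tree best-first and splits them into fixed-size segments, one per verification step. It is not FlowSpec's implementation. The scoring rule (cumulative draft log-probability), the `DraftNode` class, the `stepwise_segments` function, and the `budget` parameter are all assumptions made for this example.

```python
from __future__ import annotations

import heapq
import math
from dataclasses import dataclass, field


@dataclass
class DraftNode:
    """One draft token in the speculation tree (hypothetical structure)."""
    token: str
    logprob: float  # draft model's log-probability of this token (assumed score basis)
    children: list[DraftNode] = field(default_factory=list)


def stepwise_segments(roots, budget):
    """Group draft paths into verification steps of at most `budget` nodes.

    Nodes are visited best-first by cumulative log-probability. A child is
    only queued after its parent has been emitted, so every node appears
    in the same segment as its parent or in a later one.
    """
    heap = []
    counter = 0  # unique tie-breaker so the heap never compares DraftNode objects
    for node in roots:
        heapq.heappush(heap, (-node.logprob, counter, (node.token,), node))
        counter += 1

    segments, current = [], []
    while heap:
        neg_score, _, path, node = heapq.heappop(heap)
        current.append((path, -neg_score))
        for child in node.children:
            # Cumulative score of the child = parent's score + child's log-probability.
            heapq.heappush(
                heap,
                (neg_score - child.logprob, counter, path + (child.token,), child),
            )
            counter += 1
        if len(current) == budget:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Toy draft tree with made-up probabilities.
a = DraftNode("A", math.log(0.6), [
    DraftNode("C", math.log(0.4)),
    DraftNode("D", math.log(0.2)),
])
b = DraftNode("B", math.log(0.3), [DraftNode("E", math.log(0.9))])

segments = stepwise_segments([a, b], budget=2)
for step, segment in enumerate(segments):
    print(step, [("".join(path), round(math.exp(score), 2)) for path, score in segment])
```

In this toy tree the higher-scoring paths are sent for verification in earlier steps, which is the intuition behind "earlier accepted tokens"; how FlowSpec actually scores, segments, and schedules draft tokens across pipeline stages is not specified in the text above.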
Taxonomy
Topics: Power Systems and Technologies · Neural Networks and Applications · Advanced Computational Techniques and Applications
