FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

Xing Liu; Lizhuo Luo; Ming Tang; Chao Huang; Xu Chen

arXiv:2507.02620 · cs.DC · January 13, 2026

TL;DR

FlowSpec is a pipeline-parallel, tree-based speculative decoding framework for distributed LLM inference at the network edge. It targets the low pipeline utilization that arises when inference requests are sparse, and reports a 1.37× to 1.73× speedup over baselines.

Contribution

The paper presents FlowSpec, a pipeline-parallel tree-based speculative decoding framework that combines three key mechanisms to improve decoding efficiency for distributed LLM inference at the network edge.

Findings

01. Achieves 1.37× to 1.73× speedup over baselines.

02. Improves pipeline utilization and speculative decoding efficiency.

03. Effective on diverse models and configurations.

Abstract

Distributed inference is a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process across multiple devices so that the LLM fits into device memory. Recent pipeline-based approaches can parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, in which case the pipeline typically runs at low utilization. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification prioritizes more important draft tokens to bring earlier accepted tokens; 2) efficient draft management to prune…
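
The first mechanism in the abstract, score-based step-wise verification, can be illustrated with a small sketch. This is not the paper's algorithm: the names DraftNode and schedule_segments, the use of a single per-token score, and the fixed segment size are all assumptions made for illustration. The sketch only shows the general idea of sending higher-scoring draft tokens to verification first while never scheduling a token before its parent in the draft tree.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraftNode:
    token: str
    score: float            # hypothetical importance score from the draft model
    parent: Optional[int]   # index of the parent node; None for a top-level draft


def schedule_segments(nodes: List[DraftNode], segment_size: int) -> List[List[int]]:
    """Group draft-tree nodes into verification segments, highest score first.

    A node becomes eligible only once its parent has been scheduled, so every
    segment can be verified given the segments that precede it.
    """
    children = {}
    ready = []  # min-heap on negated score, so the best node pops first
    for i, node in enumerate(nodes):
        if node.parent is None:
            heapq.heappush(ready, (-node.score, i))
        else:
            children.setdefault(node.parent, []).append(i)

    segments, current = [], []
    while ready:
        _, i = heapq.heappop(ready)
        current.append(i)
        for c in children.get(i, []):
            heapq.heappush(ready, (-nodes[c].score, c))
        if len(current) == segment_size:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Toy draft tree (scores are made up for the example).
nodes = [
    DraftNode("the", 0.6, None),
    DraftNode("a", 0.3, None),
    DraftNode("cat", 0.5, 0),
    DraftNode("dog", 0.1, 0),
    DraftNode("car", 0.2, 1),
]
segments = schedule_segments(nodes, segment_size=2)
print(segments)  # [[0, 2], [1, 4], [3]]
```

In this toy run the high-scoring path "the" then "cat" goes into the first segment, so it would reach the pipeline before the low-scoring "dog" branch. How FlowSpec actually computes scores and sizes each step is described in the paper itself, not here.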

Taxonomy

Topics: Power Systems and Technologies · Neural Networks and Applications · Advanced Computational Techniques and Applications