A Pipelined Collaborative Speculative Decoding Framework for Efficient Edge-Cloud LLM Inference
Yida Zhang, Zhiyong Gao, Shuaibing Yue, Jie Li, Rui Wang

TL;DR
This paper introduces PicoSpec, a training-free edge-cloud collaborative speculative decoding framework for LLMs that combines an asynchronous pipeline with sparse compression, achieving up to 2.9x speedup over baseline methods.
Contribution
The paper presents a training-free speculative decoding framework with an asynchronous pipeline and sparse compression, addressing mutual waiting and communication latency in edge-cloud LLM inference.
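To make the idea of sparse compression concrete, here is a minimal sketch of one common way to shrink a token probability vector before sending it over a bandwidth-limited link: keep only the k largest entries and renormalize on the receiving side. This summary does not describe PicoSpec's actual compression scheme, so the top-k approach and the function names `compress_top_k` and `decompress` are assumptions chosen for illustration only.

```python
from typing import Dict, List


def compress_top_k(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest probabilities as {token_id: prob}.

    Hypothetical scheme for illustration; not taken from the paper.
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}


def decompress(sparse: Dict[int, float], vocab_size: int) -> List[float]:
    """Rebuild a dense vector, renormalizing the kept mass to sum to 1."""
    total = sum(sparse.values())
    dense = [0.0] * vocab_size
    for token_id, prob in sparse.items():
        dense[token_id] = prob / total
    return dense


if __name__ == "__main__":
    probs = [0.5, 0.1, 0.3, 0.05, 0.05]
    sparse = compress_top_k(probs, k=2)   # only 2 of 5 entries are transmitted
    print(sparse)
    print(decompress(sparse, vocab_size=len(probs)))
```

The trade-off in a scheme like this is that smaller k means fewer bytes per step but a coarser approximation of the original distribution.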
Findings
Achieves up to 2.9x speedup over baseline methods.
Introduces a one-time compressed vocabulary transmission for rejection sampling.
Effectively balances edge and cloud computation for LLM inference.
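The rejection sampling mentioned in the findings can be illustrated with the standard speculative-sampling acceptance rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p is the target model's; on rejection, a replacement is drawn from the normalized residual max(0, p - q). The sketch below shows that generic rule for a single token. It is not PicoSpec's implementation, and the function name `verify_draft_token` and the plain-list representation of distributions are assumptions for illustration.

```python
import random
from typing import List, Tuple


def verify_draft_token(
    token: int, q: List[float], p: List[float], rng: random.Random
) -> Tuple[bool, int]:
    """Standard speculative-sampling check for one drafted token.

    q: draft-model distribution, p: target-model distribution.
    Returns (accepted, token_to_emit). Generic rule, not the paper's code.
    """
    # Accept with probability min(1, p[token] / q[token]).
    if q[token] > 0 and rng.random() < min(1.0, p[token] / q[token]):
        return True, token

    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    if sum(residual) == 0.0:
        # Degenerate case with no residual mass; fall back to sampling from p.
        residual = list(p)
    new_token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, new_token


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.7, 0.2, 0.1]   # draft (edge-side) distribution
    p = [0.4, 0.5, 0.1]   # target (cloud-side) distribution
    print(verify_draft_token(0, q, p, rng))
```

This rule keeps the emitted tokens distributed exactly according to p, which is why speculative decoding can speed up generation without changing the target model's output distribution.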
Abstract
Recent advancements and widespread adoption of Large Language Models (LLMs) in both industry and academia have catalyzed significant demand for LLM serving. However, traditional cloud services incur high costs, while on-device inference alone faces challenges due to limited resources. Edge-cloud collaboration emerges as a key research direction to combine the strengths of both paradigms, yet efficiently utilizing limited network bandwidth while fully leveraging and balancing the computational capabilities of edge devices and the cloud remains an open problem. To address these challenges, we propose Pipelined Collaborative Speculative Decoding Framework (PicoSpec), a novel, general-purpose, and training-free speculative decoding framework for LLM edge-cloud collaborative inference. We design an asynchronous pipeline that resolves the mutual waiting problem inherent in vanilla speculative…
