FlexSpec: Frozen Drafts Meet Evolving Targets in Edge-Cloud Collaborative LLM Speculative Decoding

Yuchen Li; Rui Kong; Zhonghao Lyu; Qiyang Li; Xinran Chen; Hengyi Cai; Lingyong Yan; Shuaiqiang Wang; Jiashu Zhao; Guangxu Zhu; Linghe Kong; Guihai Chen; Haoyi Xiong; Dawei Yin

arXiv:2601.00644·cs.DC·January 5, 2026

Open Access

TL;DR

FlexSpec introduces a communication-efficient, adaptable framework for edge-cloud collaborative inference with LLMs, reducing latency and costs by decoupling edge and cloud models and dynamically adjusting to network conditions.

Contribution

FlexSpec proposes a shared-backbone architecture enabling compatibility with evolving cloud models, eliminating frequent retraining and reducing communication overhead in edge-cloud LLM inference.

Findings

01. Significantly reduces communication and maintenance costs.

02. Achieves higher inference efficiency compared to traditional speculative decoding.

03. Effectively adapts to varying wireless and device conditions.

Abstract

Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with speculative decoding (SD) can reduce end-to-end latency by executing a lightweight draft model at the edge and verifying its proposed tokens with a cloud-side target model, existing frameworks fundamentally rely on tight coupling between the two models. Consequently, repeated model synchronization introduces excessive communication overhead, increases end-to-end latency, and ultimately limits the scalability of SD in edge environments. To address these limitations, we propose FlexSpec, a communication-efficient collaborative inference framework tailored for evolving edge-cloud systems. The core design of FlexSpec is a shared-backbone architecture that allows a single and…
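The draft-then-verify loop that the abstract refers to can be illustrated with a minimal sketch of generic greedy speculative decoding. This is not FlexSpec's method: the shared-backbone design and network adaptation are not shown, and `draft_model`, `target_model`, `speculative_step`, and `generate` are hypothetical toy stand-ins over integer tokens. The sketch assumes greedy decoding, so a drafted token is accepted only if it equals the target's own next token.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def draft_model(prefix: List[int]) -> int:
    # Toy edge-side draft: always predicts last token + 1 (mod 10).
    return (prefix[-1] + 1) % 10


def target_model(prefix: List[int]) -> int:
    # Toy cloud-side target: agrees with the draft, except it emits 0 after 4.
    last = prefix[-1]
    return 0 if last == 4 else (last + 1) % 10


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    # 1) The draft proposes k tokens autoregressively (cheap, on the edge).
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target checks each proposed token (in a real system this is
    #    one batched forward pass in the cloud, not a Python loop).
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All k tokens matched: the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prompt: List[int], draft: Model, target: Model,
             k: int, max_new: int) -> Tuple[List[int], int]:
    out = list(prompt)
    rounds = 0  # each round stands for one edge-cloud verification exchange
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


if __name__ == "__main__":
    tokens, rounds = generate([1], draft_model, target_model, k=4, max_new=8)
    print(tokens, rounds)
```

With these toy models, eight new tokens are produced in two verification rounds rather than eight sequential target calls, and the output is identical to what the target alone would generate greedily. In an edge-cloud setting the number of rounds is a rough proxy for the number of network exchanges, which is why the draft's agreement with the target matters.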

Taxonomy

Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Big Data and Digital Economy