SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
Ce Zheng, Xinghan Wang, Jiahong Ning, Yuxuan Shi, Ning Huang, Tingting Yang

TL;DR
This paper introduces SpecFed, a method that accelerates federated LLM inference by combining speculative decoding with compressed transmission, reducing communication bottlenecks while maintaining high fidelity.
Contribution
The paper proposes a top-K compressed transmission scheme with two server-side reconstruction strategies, and derives theoretical bounds that characterize the method's robustness.
Findings
Achieves high generation fidelity with reduced communication overhead.
Significantly improves decoding throughput in federated LLM inference.
Provides theoretical bounds on local reconstruction error, aggregation bias, and acceptance-rate bias.
Abstract
Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting decoding throughput. Distributed deployment further aggravates this due to a communication bottleneck: each worker must transmit full token probability distributions per draft token, dominating end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity…
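The pipeline the abstract describes (top-K compression on each worker, reconstruction on the server, weighted averaging, then a speculative-decoding acceptance test) can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say what the two reconstruction strategies are, so the two shown here (renormalizing the kept mass, and spreading the dropped mass uniformly over the remaining tokens) are assumptions chosen for illustration, as are all function names. The acceptance rule is the standard speculative-sampling rule min(1, p/q); whether SpecFed uses exactly this form is also an assumption.

```python
from typing import Dict, List, Tuple


def topk_compress(probs: List[float], k: int) -> Tuple[Dict[int, float], float]:
    """Worker side: keep the k largest probabilities and the dropped (residual) mass."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = {i: probs[i] for i in top}
    residual = max(0.0, 1.0 - sum(kept.values()))
    return kept, residual


def reconstruct_renormalize(kept: Dict[int, float], vocab_size: int) -> List[float]:
    """Hypothetical strategy A: zero-fill the tail and rescale kept entries to sum to 1."""
    total = sum(kept.values())
    out = [0.0] * vocab_size
    for i, p in kept.items():
        out[i] = p / total
    return out


def reconstruct_uniform_tail(
    kept: Dict[int, float], residual: float, vocab_size: int
) -> List[float]:
    """Hypothetical strategy B: spread the residual mass evenly over non-kept tokens."""
    n_tail = vocab_size - len(kept)
    fill = residual / n_tail if n_tail > 0 else 0.0
    out = [fill] * vocab_size
    for i, p in kept.items():
        out[i] = p
    return out


def aggregate(dists: List[List[float]], weights: List[float]) -> List[float]:
    """Server side: weighted average of the reconstructed worker distributions."""
    w_sum = sum(weights)
    vocab_size = len(dists[0])
    return [
        sum(w * d[t] for w, d in zip(weights, dists)) / w_sum
        for t in range(vocab_size)
    ]


def acceptance_probability(target_p: float, draft_q: float) -> float:
    """Standard speculative-sampling rule: accept a draft token with prob min(1, p/q)."""
    if draft_q <= 0.0:
        return 0.0
    return min(1.0, target_p / draft_q)


# Toy example: two workers, a 5-token vocabulary, K = 2.
worker_a = [0.5, 0.3, 0.1, 0.06, 0.04]
worker_b = [0.2, 0.6, 0.1, 0.05, 0.05]

kept_a, res_a = topk_compress(worker_a, k=2)
kept_b, res_b = topk_compress(worker_b, k=2)

recon_a = reconstruct_uniform_tail(kept_a, res_a, vocab_size=5)
recon_b = reconstruct_uniform_tail(kept_b, res_b, vocab_size=5)

ensemble = aggregate([recon_a, recon_b], weights=[0.5, 0.5])
```

Each worker here sends K (index, probability) pairs plus one residual scalar instead of a full vocabulary-sized vector, which is where the communication saving comes from. The gap between `ensemble` and the average of the uncompressed distributions is the kind of aggregation bias the paper's analysis bounds.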
