TL;DR
SpecEdge is a scalable, cost-effective framework that leverages edge and server GPUs with speculative decoding to improve LLM serving efficiency and reduce latency.
Contribution
It introduces a novel edge-assisted inference framework that splits workloads, overlaps token creation with verification, and interleaves requests for better throughput.
Findings
Achieves 2.22x the server throughput of a server-only baseline.
Improves overall cost efficiency by 1.91x.
Reduces inter-token latency by 11.24% relative to the same baseline.
Abstract
Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification, and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show that SpecEdge improves overall cost efficiency by 1.91x by achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving. The code is available at…
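The draft-then-verify split described in the abstract can be illustrated with a minimal sketch of greedy speculative decoding. This is not SpecEdge's implementation: the function names, the toy next-token models, and the draft length `k` are all assumptions made for illustration, and the paper's actual acceptance rule, proactive drafting, and request scheduling are not modeled. The sketch only shows why exchanging token IDs is sufficient: the "edge" side proposes `k` tokens, and the "server" side keeps the longest prefix it agrees with plus one token of its own.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # hypothetical greedy next-token model


def draft_tokens(draft_next: NextTokenFn, prefix: List[Token], k: int) -> List[Token]:
    """Edge side (assumed): propose k tokens autoregressively with the small model."""
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def verify(target_next: NextTokenFn, prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Server side (assumed): accept the longest draft prefix matching the target's
    greedy choice, then append one token chosen by the target itself.
    A real system would score all draft positions in one batched forward pass;
    the loop here is sequential only for readability."""
    ctx = list(prefix)
    accepted = []
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first mismatching draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one extra target token
    return accepted


def speculative_decode(draft_next: NextTokenFn, target_next: NextTokenFn,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Alternate drafting and verification until max_new tokens are produced."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        draft = draft_tokens(draft_next, seq, k)   # only token IDs cross the network
        seq.extend(verify(target_next, seq, draft))
    return seq[:len(prompt) + max_new]


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], max_new: int) -> List[Token]:
    """Server-only reference: plain greedy decoding with the target model."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target_next(seq))
    return seq


# Toy models over the vocabulary 0..9: the target counts upward modulo 10,
# and the draft agrees except after token 4, where it guesses wrongly.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Under greedy verification the output is identical to what the target model alone would produce, so the draft model affects only how many server steps are needed, not the result. Calling `speculative_decode(toy_draft, toy_target, [0], 8, k=3)` and `greedy_decode(toy_target, [0], 8)` on the toy models above can be used to check this.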