WANSpec: Leveraging Global Compute Capacity for LLM Inference
Noah Martin, Fahad Dogar

TL;DR
WANSpec leverages under-utilized global data centers and speculative decoding to optimize LLM inference, reducing latency and computational load by intelligently offloading parts of the workload across geographically distributed resources.
Contribution
This work introduces WANSpec, an approach that uses speculative decoding to shift parts of LLM inference to under-utilized data centers, improving efficiency and reducing latency.
Findings
Reduces forward passes of speculative decoding by over 50%
Mitigates capacity issues in high-demand data centers
Effectively utilizes global compute resources for LLM inference
Abstract
Data centers capable of running large language models (LLMs) are spread across the globe. Some have high-end GPUs for running the most advanced models (100B+ parameters), while others are suitable only for smaller models (1B parameters). The most capable GPUs are in high demand thanks to the rapidly expanding applications of LLMs. Because of this demand, the choice of location for an LLM inference workload can affect request latency. In this work, we explore options for shifting some aspects of inference to under-utilized data centers. We first observe varying delays affecting inference in AWS services across regions, demonstrating that load is not spread evenly. We then introduce WANSpec, which offloads part of LLM generation to the under-utilized data centers. In doing so, WANSpec can mitigate capacity issues as well as effectively use…
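For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic greedy draft-and-verify loop that the abstract builds on. It is not WANSpec's algorithm, and it does not model wide-area placement or network delay. The names target_model, draft_model, greedy_decode, and speculative_decode are hypothetical, and both "models" are toy deterministic functions chosen only so the example runs without any dependencies. The sketch shows why a cheap draft model can reduce the number of forward passes needed on the large model while leaving the output unchanged.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token (greedy)
VOCAB = 50


def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model (assumption, not from the paper)."""
    return (sum(ctx) * 31 + len(ctx) * 7) % VOCAB


def draft_model(ctx: List[int]) -> int:
    """Hypothetical small model: matches the target except when len(ctx) % 4 == 0."""
    tok = target_model(ctx)
    return (tok + 1) % VOCAB if len(ctx) % 4 == 0 else tok


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call (forward pass) per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target checks all proposed positions. A real transformer does
        #    this in one batched forward pass; here it is simulated with
        #    repeated calls but counted as a single pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_model, prompt, 20)
    tokens, passes = speculative_decode(target_model, draft_model, prompt, 20)
    print("same output as baseline:", tokens == baseline)
    print("target passes:", passes, "vs baseline passes:", 20)
```

Every round appends at least one token that the target itself would have produced, so the result always equals plain greedy decoding with the target model, and the number of target passes never exceeds the number of generated tokens. How many passes are saved depends entirely on how often the draft agrees with the target.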
Taxonomy
Topics: Big Data and Digital Economy · Scientific Computing and Data Management · Software System Performance and Reliability
