Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models
Tingyang Sun, Ting He, Bo Ji, Parimal Parag

TL;DR
This paper systematically studies resource allocation in distributed large language model inference, proposing models, algorithms, and a simulator to optimize performance and reduce inference time across geographically distributed servers.
Contribution
The paper introduces the first comprehensive approach to optimizing block placement and request routing in distributed LLM inference, comprising performance models, algorithms with performance guarantees, and a CPU-only simulator.
Findings
The proposed performance models accurately predict inference performance.
The authors give a polynomial-time resource allocation algorithm with performance guarantees.
The proposed solution significantly reduces inference time compared to existing solutions.
Abstract
Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower the barrier for deploying LLMs by splitting the model blocks across multiple servers with low-end GPUs distributed over the Internet, which was much faster than swapping the model parameters between the GPU memory and other cheaper but slower local storage media. However, the performance of such a distributed system critically depends on the resource allocation, and how to do so optimally remains unknown. In this work, we present the first systematic study of the resource allocation problem in distributed LLM inference, with focus on two important decisions: block placement and request routing. Our main results include: experimentally validated…
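To make the two decisions named in the abstract concrete, here is a minimal Python sketch of request routing under a fixed block placement. It is a toy illustration, not the paper's performance model or algorithm. The `Server` class, the `best_route` function, the per-block compute times, and the link delays are all hypothetical, and the cost model (per-block compute time plus a fixed delay per hop) is an assumption made only for this example.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Server:
    """Hypothetical server hosting the contiguous block range [first_block, last_block)."""
    name: str
    first_block: int
    last_block: int
    ms_per_block: float  # assumed compute time per block, in milliseconds


def link_delay(delay, a, b):
    """Symmetric lookup of the assumed one-way delay (ms) between two nodes."""
    if a == b:
        return 0.0
    return delay[(a, b)] if (a, b) in delay else delay[(b, a)]


def best_route(num_blocks, servers, delay, client="client"):
    """Dynamic program over block positions.

    best[(b, node)] holds the least time to finish blocks [0, b) with the
    activations currently at `node`, together with the route that achieves it.
    """
    best = {(0, client): (0.0, [])}
    for b in range(num_blocks):
        # Snapshot the states at position b; new states always have a larger position.
        frontier = [(node, val) for (pos, node), val in best.items() if pos == b]
        for node, (t, route) in frontier:
            for s in servers:
                if not (s.first_block <= b < s.last_block):
                    continue  # this server does not host block b
                hop = link_delay(delay, node, s.name)
                # The server may process any prefix of its remaining blocks.
                for e in range(b + 1, s.last_block + 1):
                    cand = t + hop + (e - b) * s.ms_per_block
                    if cand < best.get((e, s.name), (math.inf, None))[0]:
                        best[(e, s.name)] = (cand, route + [(s.name, b, e)])
    # Add the return hop to the client and keep the cheapest complete route.
    done = [(t + link_delay(delay, node, client), route)
            for (pos, node), (t, route) in best.items() if pos == num_blocks]
    if not done:
        raise ValueError("placement does not cover every block")
    return min(done, key=lambda c: c[0])


# Toy placement: A and B each host half of a 4-block model, C hosts all of it slowly.
servers = [Server("A", 0, 2, 10), Server("B", 2, 4, 10), Server("C", 0, 4, 30)]
delay = {("client", "A"): 20, ("client", "B"): 20, ("client", "C"): 50,
         ("A", "B"): 30, ("A", "C"): 40, ("B", "C"): 40}

total_ms, route = best_route(4, servers, delay)
print(total_ms, route)  # 110.0 [('A', 0, 2), ('B', 2, 4)]
```

In this toy instance the two-server chain through A and B (110 ms) beats the single slow server C (220 ms) even though it pays an extra network hop. This is the kind of trade-off that makes routing depend on placement.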
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Natural Language Processing Techniques
