DeServe: Towards Affordable Offline LLM Inference via Decentralization
Linyu Wu, Xiaoyuan Liu, Tianneng Shi, Zhe Ye, Dawn Song

TL;DR
DeServe is a decentralized offline serving system that uses idle GPU resources to lower the cost and raise the throughput of large language model (LLM) inference, particularly in high-latency network environments.
Contribution
This paper presents the design of DeServe, a decentralized serving system that makes offline LLM inference more affordable by running on idle GPU resources and optimizing serving throughput for high-latency networks.
Findings
Achieves a 6.7x-12.6x throughput improvement over existing serving system baselines in high-latency network conditions.
Effectively utilizes idle GPU resources for cost reduction.
Optimized for high-latency network environments.
Abstract
The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in open-source LLMs have positioned them as strong contenders. However, deploying these models is often constrained by the high costs and limited availability of GPU resources. In response, this paper presents the design of a decentralized offline serving system for LLM inference. Utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost. DeServe specifically addresses key challenges in optimizing serving throughput in high-latency network environments. Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such conditions.
Taxonomy
Topics: Imbalanced Data Classification Techniques · Artificial Intelligence in Law
