MoLink: Distributed and Efficient Serving Framework for Large Models
Lewei Jin, Yongqi Chen, Kui Zhang, Yifan Zhuo, Yi Gao, Bowei Yang, Zhengong Cai, Wei Dong

TL;DR
MoLink is a distributed framework that enables efficient, cost-effective large language model serving on heterogeneous consumer-grade GPUs over limited network connections, significantly improving throughput and profit margins.
Contribution
The paper introduces MoLink, a distributed serving system tailored to heterogeneous consumer-grade GPUs that addresses the network and host-system limitations hindering large-model deployment on such hardware.
Findings
Achieves up to 458% throughput improvement
Realizes up to 151% cost-profit margin increase
Supports 18 open-source LLM architectures
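To make the percentage figures above easier to read, the sketch below converts a relative improvement into a multiplicative factor. It assumes that "improvement" means a relative increase over a baseline (so +100% is 2x); the summary does not state which baseline or convention the authors use, and the function name is hypothetical.

```python
def improvement_to_factor(improvement_pct: float) -> float:
    """Convert a relative improvement in percent into a multiplicative factor.

    Assumption (not confirmed by the source): the percentage is a relative
    increase over a baseline, so +100% corresponds to a factor of 2.0.
    """
    return 1.0 + improvement_pct / 100.0


# Figures quoted in the findings list, both reported as "up to" values.
throughput_factor = improvement_to_factor(458.0)
margin_factor = improvement_to_factor(151.0)

print(f"throughput: about {throughput_factor:.2f}x the baseline")
print(f"cost-profit margin: about {margin_factor:.2f}x the baseline")
```

Under that reading, the best-case throughput is roughly 5.6 times the baseline and the best-case cost-profit margin roughly 2.5 times. Both are upper bounds, not averages.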
Abstract
Large language models represent a groundbreaking shift in generative AI. Yet, these advances come with a significant challenge: the high cost of model serving. To mitigate these costs, consumer-grade GPUs emerge as a more affordable alternative. This presents an opportunity for more cost-efficient LLM serving by leveraging these GPUs. However, it is non-trivial to achieve high-efficiency LLM serving on consumer-grade GPUs, mainly due to two challenges: 1) these GPUs are often deployed in limited network conditions; 2) these GPUs often exhibit heterogeneity in host systems. To address these challenges, we present MoLink, a distributed LLM serving system for large models. It incorporates several key techniques, enabling efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate that it achieves throughput improvements of up to 458% and…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Context-Aware Activity Recognition Systems
