MoLink: Distributed and Efficient Serving Framework for Large Models

Lewei Jin; Yongqi Chen; Kui Zhang; Yifan Zhuo; Yi Gao; Bowei Yang; Zhengong Cai; Wei Dong

arXiv:2507.05043 · cs.DC · October 17, 2025

TL;DR

MoLink is a distributed framework for efficient, cost-effective large language model serving on heterogeneous consumer-grade GPUs over limited network connections, improving throughput by up to 458% and cost-profit margin by up to 151%.

Contribution

MoLink introduces a distributed serving system tailored to heterogeneous consumer-grade GPUs, addressing the limited network conditions and host-system heterogeneity that hinder large model deployment.

Findings

01 Achieves up to 458% throughput improvement

02 Realizes up to 151% cost-profit margin increase

03 Supports 18 open-source LLM architectures
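
As a reading aid for the percentages above, the short sketch below converts a relative improvement into a multiplier over a baseline. It assumes "improvement" means relative increase over some baseline system, which this page does not spell out; the function names are hypothetical and are not part of MoLink.

```python
def improvement_to_multiplier(percent_improvement: float) -> float:
    """Convert a relative improvement (in percent) into a multiplier.

    Assumption: 'X% improvement' means new = baseline * (1 + X / 100).
    """
    return 1.0 + percent_improvement / 100.0


def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Under the stated assumption, a 458% improvement is roughly 5.58x the baseline.
    print(f"{improvement_to_multiplier(458):.2f}x")
    # Likewise, a 151% increase is roughly 2.51x the baseline.
    print(f"{improvement_to_multiplier(151):.2f}x")
```

Under this reading, "up to 458%" describes the best case against a baseline, not an average across all configurations.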

Abstract

Large language models represent a groundbreaking shift in generative AI. Yet, these advances come with a significant challenge: the high cost of model serving. To mitigate these costs, consumer-grade GPUs emerge as a more affordable alternative. This presents an opportunity for more cost-efficient LLM serving by leveraging these GPUs. However, it is non-trivial to achieve high-efficiency LLM serving on consumer-grade GPUs, mainly due to two challenges: 1) these GPUs are often deployed in limited network conditions; 2) these GPUs often exhibit heterogeneity in host systems. To address these challenges, we present MoLink, a distributed LLM serving system for large models. It incorporates several key techniques, enabling efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate that it achieves throughput improvements of up to 458% and…

Taxonomy

Topics: Distributed and Parallel Computing Systems · Context-Aware Activity Recognition Systems