Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow

Yixuan Mei; Yonghao Zhuang; Xupeng Miao; Juncheng Yang; Zhihao Jia; Rashmi Vinayak

arXiv:2406.01566 · cs.DC · March 7, 2025 · 2 cites

TL;DR

Helix is a system for serving large language models on heterogeneous GPU clusters. It formulates inference as a max-flow problem, improving throughput by up to 3.3x and reducing prompting and decoding latency by up to 66% and 24%, respectively.

Contribution

Helix introduces a max-flow-based formulation of LLM inference and uses MILP optimization to jointly decide model placement and request scheduling in heterogeneous GPU clusters.

Findings

1. Up to 3.3x increase in serving throughput.

2. Latency reductions of up to 66% in prompting and 24% in decoding.

3. Evaluated on several heterogeneous clusters ranging from 24 to 42 GPU nodes.

Abstract

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving in heterogeneous GPU clusters. The key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem on directed, weighted graphs, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs on heterogeneous GPUs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous clusters ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 3.3x and reduces prompting and decoding…
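The max-flow view described in the abstract can be illustrated with a toy example. The sketch below is not Helix's implementation: the GPU names, throughput numbers, link bandwidths, and the helper functions `build_graph` and `max_flow` are all hypothetical, and it omits the MILP step entirely. It assumes model placement is already fixed, models each GPU's compute limit by splitting the GPU into an "in" and an "out" vertex joined by a capacity edge (a standard max-flow trick, assumed here), and models each network link as an edge whose capacity is the link's bandwidth, all in the same made-up unit of tokens per second.

```python
from collections import defaultdict, deque


def build_graph(gpu_throughput, links):
    """Build a capacity map from per-GPU compute limits and per-link bandwidths."""
    capacity = defaultdict(dict)
    # Split every GPU into gpu_in -> gpu_out so its compute limit becomes an edge.
    for gpu, rate in gpu_throughput.items():
        capacity[f"{gpu}_in"][f"{gpu}_out"] = rate
    # Network links connect the out-vertex of one GPU to the in-vertex of the next.
    for (u, v), bandwidth in links.items():
        tail = u if u == "source" else f"{u}_out"
        head = v if v == "sink" else f"{v}_in"
        capacity[tail][head] = bandwidth
    return capacity


def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest residual paths found by BFS."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, neighbors in capacity.items():
        for v, cap in neighbors.items():
            residual[u][v] += cap
            residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # BFS for a path with spare capacity from source to sink.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical cluster: one fast GPU holds the first pipeline stage,
# two slower GPUs each hold a replica of the second stage.
gpu_throughput = {"a100": 100, "l4": 40, "t4": 25}
links = {
    ("source", "a100"): 200,
    ("a100", "l4"): 30,   # slow link: the network is the bottleneck here
    ("a100", "t4"): 50,   # fast link: the GPU is the bottleneck here
    ("l4", "sink"): 200,
    ("t4", "sink"): 200,
}

throughput = max_flow(build_graph(gpu_throughput, links), "source", "sink")
print(throughput)  # 55
```

In this toy graph the L4 path is limited by its 30-unit network link and the T4 path by the GPU's 25-unit compute capacity, so the maximum flow is 55 even though the first-stage GPU could sustain 100. A single graph capturing both kinds of bottleneck is the property the abstract attributes to the formulation.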

Code & Models

Repositories

Thesys-lab/Helix-ASPLOS25 (official)

Taxonomy

Topics: Topic Modeling