HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment

Youhe Jiang; Ran Yan; Binhang Yuan

arXiv:2502.07903·cs.DC·February 13, 2025

TL;DR

HexGen-2 introduces a distributed system that disaggregates LLM inference across heterogeneous GPUs, optimizing resource allocation and communication to improve throughput and reduce latency and costs.

Contribution

It presents a novel scheduling algorithm for disaggregated LLM inference on heterogeneous GPUs, leveraging graph partitioning and max-flow algorithms for resource optimization.
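To make the max-flow ingredient concrete, here is a minimal sketch of a generic max-flow computation (Edmonds-Karp) on a toy graph of prefill and decoding replicas. It is not the HexGen-2 scheduler: the node names (`src`, `P0`, `P1`, `D0`, `D1`, `sink`), the capacities, and the reading of capacities as requests per second are all assumptions made up for illustration.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Generic Edmonds-Karp max-flow: repeatedly augment along BFS paths."""
    # Residual graph with zero-capacity reverse edges for every forward edge.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0

    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy graph: capacities are illustrative, not measured.
# src -> prefill replica: prefill capacity; prefill -> decode: link capacity
# for KV-cache transfer; decode -> sink: decoding capacity.
toy = {
    "src": {"P0": 8, "P1": 4},
    "P0": {"D0": 6, "D1": 3},
    "P1": {"D0": 2, "D1": 5},
    "D0": {"sink": 4},
    "D1": {"sink": 7},
}

print(max_flow(toy, "src", "sink"))
```

In this toy instance the two decoding replicas form the bottleneck, so the end-to-end flow is capped by their combined capacity rather than by the prefill side. A scheduler in this style could compare such flow values across candidate GPU groupings, although how HexGen-2 actually combines graph partitioning with max-flow is described in the paper itself, not here.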

Findings

01. Up to 2.0x increase in serving throughput.

02. 1.5x reduction in inference latency.

03. Achieves similar performance at 30% lower cost.

Abstract

Disaggregating the prefill and decoding phases is an effective new paradigm for generative inference of large language models (LLMs): it eliminates prefill-decoding interference and optimizes resource allocation. However, how to deploy the disaggregated inference paradigm across a group of heterogeneous GPUs, which can be an economical alternative to deployment over homogeneous high-performance GPUs, remains an open problem. To this end, we introduce HexGen-2, a distributed system for efficient and economical LLM serving on heterogeneous GPUs following the disaggregated paradigm. Built on top of HexGen, the core component of HexGen-2 is a scheduling algorithm that formalizes the allocation of disaggregated LLM inference computations and communications over heterogeneous GPUs and network connections as a constraint optimization problem. We leverage the graph…

Taxonomy

Topics: Natural Language Processing Techniques