HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
Youhe Jiang, Ran Yan, Binhang Yuan

TL;DR
HexGen-2 introduces a distributed system that disaggregates LLM inference across heterogeneous GPUs, optimizing resource allocation and communication to improve throughput and reduce latency and costs.
Contribution
The paper presents a novel scheduling algorithm for disaggregated LLM inference on heterogeneous GPUs that uses graph partitioning and max-flow algorithms to optimize resource allocation.
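To make the max-flow idea concrete, here is a minimal sketch under stated assumptions; it is not the HexGen-2 scheduler. It uses the textbook Edmonds-Karp algorithm on a toy graph in which a source feeds prefill replicas, prefill replicas send KV caches to decode replicas, and decode replicas drain to a sink. The node names (`P0`, `D0`, ...) and all capacities are hypothetical numbers chosen for illustration, loosely read as requests per second.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: push flow along shortest augmenting paths until none remain."""
    # Residual capacities, with a zero-capacity reverse edge for every forward edge.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
            residual[v][u] += 0

    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left, so the flow is maximal

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Update residual capacities along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy deployment: two prefill replicas, two decode replicas.
capacity = {
    "src": {"P0": 10, "P1": 6},   # assumed prefill throughput per replica
    "P0": {"D0": 4, "D1": 5},     # assumed KV-cache transfer limits per link
    "P1": {"D0": 3, "D1": 3},
    "D0": {"sink": 8},            # assumed decode throughput per replica
    "D1": {"sink": 7},
}
rate = max_flow(capacity, "src", "sink")
print(rate)
```

In this framing, the maximum flow is an upper bound on the end-to-end request rate the toy deployment can sustain, and the saturated edges indicate where the bottleneck lies (prefill compute, a network link, or decode compute).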
Findings
Up to 2.0x increase in serving throughput.
1.5x reduction in inference latency.
Achieves similar performance at 30% lower cost.
Abstract
Disaggregating the prefill and decoding phases is an effective new paradigm for generative inference of large language models (LLMs): it eliminates prefill-decoding interference and optimizes resource allocation. However, how to deploy the disaggregated inference paradigm across a group of heterogeneous GPUs, which can be an economical alternative to deployment on homogeneous high-performance GPUs, remains an open problem. To this end, we introduce HexGen-2, a distributed system for efficient and economical LLM serving on heterogeneous GPUs following the disaggregated paradigm. Built on top of HexGen, the core component of HexGen-2 is a scheduling algorithm that formalizes the allocation of disaggregated LLM inference computations and communications over heterogeneous GPUs and network connections as a constraint optimization problem. We leverage the graph…
Taxonomy
Topics: Natural Language Processing Techniques
