HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
Yan Liang, Youhe Jiang, Ran Yan, Binhang Yuan, Wei Wang, Chuan Wu

TL;DR
HexiSeq is a system designed to enable efficient long-context training of large language models on heterogeneous GPU clusters by optimizing partitioning and scheduling based on device capabilities.
Contribution
HexiSeq extends Context Parallelism (CP) and Head Parallelism (HP) to heterogeneous hardware, formalizes the heterogeneous allocation as a constrained optimization problem, and develops a hierarchical scheduler to improve training throughput.
Findings
HexiSeq improves training throughput by up to 1.72x on heterogeneous clusters.
It achieves near-homogeneous performance on mixed GPU setups.
HexiSeq effectively supports models from 3B to 70B parameters with long context lengths.
Abstract
Long-context training of large language models (LLMs) is commonly distributed with Context Parallelism (CP) and Head Parallelism (HP), but existing training systems largely assume homogeneous GPU meshes. This paper extends CP and HP to heterogeneous GPU clusters with mixed GPU models and non-uniform network bandwidths, a common setting in production training. We introduce HexiSeq, a system that supports fully asymmetric CP-HP partitioning by assigning sequence shards and attention heads according to device compute, memory, and communication capabilities. We formalize heterogeneous CP-HP allocation as a constrained optimization problem and develop an efficient hierarchical scheduler for finding optimal schedules. We evaluate HexiSeq against state-of-the-art CP and HP baselines on both real and simulated heterogeneous clusters. Across models from 3B to 70B parameters and context lengths…
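To make the idea of asymmetric partitioning concrete, the sketch below splits an integer number of units (sequence tokens or attention heads) across devices in proportion to a per-device capability weight. It is an illustration only, not HexiSeq's scheduler: the function name `proportional_split`, the single scalar weight per device, and the largest-remainder rounding are all assumptions made here, and the sketch ignores the memory and communication constraints that the abstract says the real optimization accounts for.

```python
from typing import List, Sequence


def proportional_split(total: int, weights: Sequence[float]) -> List[int]:
    """Split `total` integer units across devices in proportion to `weights`.

    Hypothetical helper: `weights` stands in for a relative capability score
    per device (for example, measured throughput). Rounding uses the
    largest-remainder method so the shares always sum to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty sequence of positive numbers")

    scale = sum(weights)
    # Ideal (fractional) share for each device.
    exact = [total * w / scale for w in weights]
    # Start from the floor of each ideal share.
    shares = [int(x) for x in exact]
    leftover = total - sum(shares)

    # Give the remaining units to the devices with the largest fractional parts.
    by_remainder = sorted(
        range(len(weights)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Made-up weights: two faster devices and two slower ones.
    tokens_per_device = proportional_split(131072, [3.0, 3.0, 1.0, 1.0])
    heads_per_device = proportional_split(32, [2.0, 1.0, 1.0])
    print(tokens_per_device, heads_per_device)
```

A uniform weight vector reduces this to the even split that homogeneous CP and HP setups assume; unequal weights give faster devices longer sequence shards or more heads.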
