H2: Towards Efficient Large-Scale LLM Training on Hyper-Heterogeneous Cluster over 1,000 Chips

Ding Tang; Jiecheng Zhou; Jiakai Hu; Shengwei Li; Huihuang Zheng; Zhilin Pei; Hui Wang; Xingcheng Zhang

arXiv:2505.17548 · cs.DC · May 26, 2025

TL;DR

H2 is a framework for efficient large-scale LLM training on hyper-heterogeneous clusters of over 1,000 chips. It combines a unified PyTorch-compatible interface, optimized communication, and adaptive parallelism.

Contribution

The paper introduces H2, a system that integrates DiTorch (a unified PyTorch-compatible interface), DiComm (a device-direct RDMA communication library), and HeteroPP with HeteroAuto for scalable, efficient LLM training on highly heterogeneous hardware clusters.

Findings

01. Achieves superlinear speedup, outperforming baseline methods by up to 16.37%.

02. Demonstrates effective training of a 100-billion-parameter LLM on heterogeneous clusters.

03. Validates the feasibility of hyper-heterogeneous large-scale LLM training.

Abstract

Recent advancements in large language models (LLMs) necessitate extensive computational resources, prompting the use of diverse hardware accelerators from multiple vendors. However, traditional distributed training frameworks struggle to efficiently utilize hyper-heterogeneous clusters comprising thousands of chips due to significant disparities in software stacks, operator implementations, communication libraries, and hardware capabilities. To address these challenges, we propose H2, which stands for HyperHetero and is a systematic framework enabling efficient training of LLMs on clusters with over 1,000 heterogeneous chips. H2 incorporates DiTorch, a unified PyTorch-compatible interface ensuring program consistency across chips, and DiComm, a device-direct RDMA communication library optimized for heterogeneous environments. Furthermore, we introduce HeteroPP with HeteroAuto, an…
