HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Si Xu; Zixiao Huang; Yan Zeng; Shengen Yan; Xuefei Ning; Quanlu Zhang; Haolin Ye; Sipei Gu; Chunsheng Shui; Zhezheng Lin; Hao Zhang; Sheng Wang; Guohao Dai; Yu Wang

arXiv:2405.16256·cs.DC·August 12, 2024

TL;DR

HETHUB is a distributed training system that enables efficient large-scale model training on heterogeneous GPU clusters, supporting multiple GPU types and achieving up to 97.49% of theoretical maximum performance.

Contribution

The paper introduces HETHUB, the first distributed training system supporting heterogeneous GPU clusters with unified communication, performance prediction, and automatic parallel planning.

Findings

01. Supports six combinations of heterogeneous GPU-accelerators.

02. Achieves up to 97.49% of theoretical maximum performance.

03. Successfully trains Llama-140B on a mixed GPU cluster.

Abstract

Training large-scale models requires vast computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs. Building a large-scale cluster from a single type of GPU-accelerator is challenging. Constructing the cluster from multiple types of GPU-accelerators is an effective way to address the shortage of homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models support only homogeneous GPU-accelerators, not heterogeneous ones. To address this problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models that supports heterogeneous clusters, including AMD GPUs, Nvidia GPUs, and other types of GPU-accelerators. It introduces a distributed unified communicator to realize the communication between…

Taxonomy

Topics: Advanced Data Processing Techniques

Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections