Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters

Runsheng Benson Guo; Utkarsh Anand; Khuzaima Daudjee; Rathijit Sen

arXiv:2507.10392 · cs.DC · July 15, 2025

TL;DR

Zorse is a system designed to improve the efficiency of training large language models on heterogeneous GPU clusters by integrating flexible parallelism strategies and an automatic configuration planner.

Contribution

It introduces Zorse, the first system to unify pipeline and data parallelism with adaptive configuration for heterogeneous GPU clusters.

Findings

01. Zorse outperforms existing systems in heterogeneous training scenarios.

02. It effectively balances load and memory across diverse GPUs.

03. The automatic planner optimizes training strategies for specific workloads.

Abstract

Large language models (LLMs) require vast amounts of GPU compute to train, but limited availability and high costs of GPUs make homogeneous clusters impractical for many organizations. Instead, assembling heterogeneous clusters by pooling together GPUs of different generations allows them to achieve higher aggregate compute and make use of all available GPUs. However, training on heterogeneous clusters presents several challenges, including load balancing across GPUs, optimizing memory usage to accommodate varying memory capacities, and ensuring communication-efficient training over diverse network interconnects potentially spanning multiple datacenters. In this paper, we make the case that efficient training on heterogeneous clusters requires (1) the integration of pipeline parallelism and data parallelism in a manner that is both communication- and memory-efficient, and (2) a more…
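To make the load-balancing challenge named in the abstract concrete, here is a minimal sketch of one simple way to split a global batch across GPUs of different speeds and memory sizes. It is not Zorse's planner or algorithm; the function name `split_batch` and the inputs `throughput` (samples per second per GPU) and `capacity` (the most samples each GPU can hold in memory) are hypothetical and chosen only for illustration.

```python
import heapq


def split_batch(global_batch, throughput, capacity):
    """Greedily assign samples so the slowest GPU finishes as early as possible.

    throughput[i]: assumed samples/second for GPU i (hypothetical input).
    capacity[i]:   assumed max samples GPU i can fit in memory (hypothetical input).
    Returns a list with the number of samples assigned to each GPU.
    """
    if len(throughput) != len(capacity):
        raise ValueError("throughput and capacity must have the same length")
    if global_batch > sum(capacity):
        raise ValueError("global batch exceeds combined memory capacity")

    shares = [0] * len(throughput)
    # Heap entries: (finish time if this GPU takes one more sample, GPU index).
    heap = [(1.0 / t, i) for i, t in enumerate(throughput) if capacity[i] > 0]
    heapq.heapify(heap)

    for _ in range(global_batch):
        _, i = heapq.heappop(heap)
        shares[i] += 1
        # Keep the GPU in play only while it still has memory headroom.
        if shares[i] < capacity[i]:
            heapq.heappush(heap, ((shares[i] + 1) / throughput[i], i))
    return shares


# A GPU that is 3x faster receives 3x the samples when memory allows.
fast_slow = split_batch(8, throughput=[3.0, 1.0], capacity=[100, 100])
# With a tight memory cap on the fast GPU, the overflow moves to the slow one.
capped = split_batch(8, throughput=[3.0, 1.0], capacity=[4, 100])
```

The second call shows why compute and memory have to be balanced together: the faster GPU cannot take its proportional share once its memory cap is reached, so the slower GPU ends up determining the step time.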

Taxonomy

Topics: Advanced Data Storage Technologies