Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization

Haoyang Li; Fangcheng Fu; Hao Ge; Sheng Lin; Xuanyu Wang; Jiawen Niu; Yujie Wang; Hailin Zhang; Xiaonan Nie; Bin Cui

arXiv:2410.13333·cs.DC·May 7, 2025

Open Access

TL;DR

Malleus is a novel framework that enhances large-scale model training efficiency by dynamically detecting and mitigating GPU stragglers through adaptive parallelization and seamless model state migration.

Contribution

Malleus introduces per-GPU straggler quantification and a re-planning algorithm for adaptive hybrid parallel training, improving both resilience to stragglers and training efficiency.

Findings

01. Achieves 2.63-5.28x efficiency gains over existing methods.

02. Effectively adapts to dynamic straggler situations during training.

03. Supports large language models up to 110B parameters.

Abstract

As models and training data continue to grow, training large-scale models relies on ever more GPUs, which inevitably increases the likelihood of encountering dynamic stragglers, i.e., devices that occasionally lag behind in performance. However, hybrid parallel training, one of the de facto paradigms for training large models, is typically sensitive to stragglers. This paper presents Malleus, a straggler-resilient hybrid parallel training framework for large-scale models. Malleus quantifies stragglers at a fine, per-GPU granularity during training, and develops a novel planning algorithm to deduce the optimal parallelization of GPU devices, pipeline stages, model layers, and training data, maximizing training efficiency when stragglers exist. In addition, once a shift in the straggler situation is detected, Malleus adaptively adjusts the…
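To make the sensitivity to stragglers concrete, the following is a minimal toy sketch and not Malleus's planning algorithm. It assumes a purely data-parallel, synchronous step in which every GPU must finish before the step completes, and it assumes each GPU has a known slowdown factor (1.0 means full speed). The function names, the slowdown values, and the batch size are all hypothetical and chosen only for illustration.

```python
def step_time(samples_per_gpu, slowdown):
    """Synchronous step time: the slowest GPU gates the whole step."""
    return max(n * s for n, s in zip(samples_per_gpu, slowdown))


def proportional_split(total_samples, slowdown):
    """Split samples in proportion to each GPU's speed (1 / slowdown)."""
    speeds = [1.0 / s for s in slowdown]
    total_speed = sum(speeds)
    # Round each share down, then hand the remainder to the fastest GPUs.
    shares = [int(total_samples * v / total_speed) for v in speeds]
    leftover = total_samples - sum(shares)
    fastest_first = sorted(range(len(shares)), key=lambda i: slowdown[i])
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical cluster: GPU 3 runs at half speed (slowdown factor 2.0).
slowdown = [1.0, 1.0, 1.0, 2.0]
uniform = [16, 16, 16, 16]
balanced = proportional_split(64, slowdown)

uniform_time = step_time(uniform, slowdown)    # gated by the straggler
balanced_time = step_time(balanced, slowdown)  # work shifted off the straggler
```

Under these assumptions a uniform split lets the single half-speed GPU double the step time, while an uneven data split recovers most of the loss. Per the abstract, the planning in Malleus also covers GPU devices, pipeline stages, and model layers, which this sketch does not model.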

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning in Healthcare