Improving Automatic Parallel Training via Balanced Memory Workload Optimization
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui

TL;DR
This paper introduces Galvatron-BMW, a system that automatically finds the most efficient hybrid parallelism strategies for training Transformer models across multiple GPUs, improving throughput and resource utilization.
Contribution
The paper presents a framework that automates hybrid parallelism strategy selection: a decision tree approach decomposes and prunes the search space, and a dynamic programming search then selects the strategy.
Findings
Galvatron-BMW outperforms previous methods in training throughput.
It effectively balances workload across GPUs under memory constraints.
The system adapts to different Transformer models and hardware setups.
Abstract
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search…
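The dynamic programming search mentioned in the abstract can be pictured with a toy example. The sketch below is only an illustration under simplifying assumptions, not Galvatron-BMW's actual algorithm: it assumes each layer has a few candidate strategies with a known time cost and an integer memory cost, ignores any cost of switching strategies between layers, and minimizes total time under a single memory budget. The names `search_strategies`, `Candidates`, and the strategy labels are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-layer candidates: strategy name -> (time cost, memory cost in integer units).
Candidates = Dict[str, Tuple[float, int]]


def search_strategies(layers: List[Candidates], memory_budget: int) -> Tuple[float, List[str]]:
    """Pick one strategy per layer, minimizing total time with total memory <= memory_budget.

    Returns (inf, []) when no assignment fits in the budget.
    """
    # best[m] = (minimum total time, chosen strategies) over the layers seen so far,
    # using exactly m memory units.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time_so_far, plan) in best.items():
            for name, (dt, dm) in candidates.items():
                m = used + dm
                if m > memory_budget:
                    continue  # this choice would exceed the memory budget
                if m not in nxt or time_so_far + dt < nxt[m][0]:
                    nxt[m] = (time_so_far + dt, plan + [name])
        best = nxt
    if not best:
        return float("inf"), []
    return min(best.values(), key=lambda entry: entry[0])


# Two identical layers; "dp" is faster but uses more memory than "tp" (made-up numbers).
toy_layers = [{"dp": (1.0, 4), "tp": (2.0, 2)}, {"dp": (1.0, 4), "tp": (2.0, 2)}]
print(search_strategies(toy_layers, memory_budget=6))  # (3.0, ['dp', 'tp'])
```

With a budget of 6 units, the search cannot afford the faster strategy on both layers, so it mixes the two; raising the budget to 8 lets it pick the faster one everywhere. This is the kind of memory-versus-time trade-off a per-layer strategy search has to resolve.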
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Position-Wise Feed-Forward Layer
