Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Yujie Wang; Youhe Jiang; Xupeng Miao; Fangcheng Fu; Shenhan Zhu; Xiaonan Nie; Yaofeng Tu; Bin Cui

arXiv:2307.02031 · cs.LG · September 6, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces Galvatron-BMW, a system that automatically finds the most efficient hybrid parallelism strategies for training Transformer models across multiple GPUs, improving throughput and resource utilization.

Contribution

The paper presents a framework that automates the selection of hybrid parallelism strategies: a decision-tree approach decomposes and prunes the search space, and a dynamic programming search is then applied to choose a strategy.

Findings

1. Galvatron-BMW outperforms previous methods in training throughput.

2. It effectively balances workload across GPUs under memory constraints.

3. The system adapts to different Transformer models and hardware setups.

Abstract

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search…
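The idea of a dynamic programming search over parallelism choices can be illustrated with a small sketch. This is not the paper's algorithm: it assumes each layer independently picks one strategy, that every strategy has a known time cost and an integer memory cost, and that the goal is the lowest total time within a memory budget. The strategy names (`dp`, `tp`), the cost numbers, and the function `search_strategies` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-layer candidates: strategy name -> (time cost, integer memory cost).
Candidates = Dict[str, Tuple[float, int]]

INF = float("inf")


def search_strategies(
    layers: List[Candidates], memory_budget: int
) -> Tuple[float, List[str]]:
    """Choose one strategy per layer, minimizing total time subject to
    total memory <= memory_budget. Returns (inf, []) if nothing fits."""
    # best[m] = (lowest total time, plan) over the layers seen so far,
    # using exactly m units of memory.
    best: List[Tuple[float, List[str]]] = [(INF, []) for _ in range(memory_budget + 1)]
    best[0] = (0.0, [])

    for candidates in layers:
        nxt: List[Tuple[float, List[str]]] = [(INF, []) for _ in range(memory_budget + 1)]
        for used, (time_so_far, plan) in enumerate(best):
            if time_so_far == INF:
                continue  # this memory level is unreachable
            for name, (dt, dm) in candidates.items():
                total_mem = used + dm
                if total_mem > memory_budget:
                    continue  # would exceed the budget
                if time_so_far + dt < nxt[total_mem][0]:
                    nxt[total_mem] = (time_so_far + dt, plan + [name])
        best = nxt

    # The answer is the fastest plan at any feasible memory level.
    return min(best, key=lambda entry: entry[0])


if __name__ == "__main__":
    # Two identical layers with made-up costs: "dp" is faster but uses more memory.
    toy_layers = [
        {"dp": (1.0, 4), "tp": (2.0, 2)},
        {"dp": (1.0, 4), "tp": (2.0, 2)},
    ]
    total_time, plan = search_strategies(toy_layers, memory_budget=6)
    print(total_time, plan)
```

With a budget of 6 in this toy setup, using the faster strategy on both layers does not fit, so the search mixes the two. The table has one entry per memory level, which keeps the cost proportional to layers × budget × candidates rather than growing exponentially with the number of layers.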

Code & Models

Repositories

pku-dair/hetu-galvatron (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Position-Wise Feed-Forward Layer