Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Xupeng Miao; Yujie Wang; Youhe Jiang; Chunan Shi; Xiaonan Nie; Hailin Zhang; Bin Cui

arXiv:2211.13878·cs.LG·November 28, 2022

TL;DR

Galvatron is a system that automatically finds the most efficient hybrid parallelism strategy for training large Transformer models across multiple GPUs, significantly improving throughput over previous methods.

Contribution

Introduces Galvatron, a framework that automates the selection of hybrid parallelism strategies for Transformer training. It uses a decision tree to decompose and prune the search space, then a dynamic programming search to generate the optimal plan.

Findings

1. Galvatron outperforms previous methods in system throughput.

2. It effectively handles various GPU memory budgets.

3. Automatic parallelism selection improves training efficiency.

Abstract

Transformer models have achieved state-of-the-art performance across various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging because of the large number of parallelism choices. Existing DL systems either rely on manual effort to make distributed training plans or apply parallelism combinations within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an extremely large search space, we 1) use a decision tree to decompose and prune the space based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal…
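To make the dynamic programming idea concrete, here is a minimal toy sketch, not Galvatron's actual algorithm. It assumes each layer independently picks one strategy from a small candidate set, that each candidate has a fixed time cost and memory cost, and that there is no cost for switching strategies between layers. The function name `search_plan`, the strategy names, and all cost numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-layer candidates: strategy name -> (time cost, memory cost).
# The numbers used below are invented for illustration, not measurements.
Candidates = Dict[str, Tuple[float, int]]
Plan = Tuple[float, List[str]]


def search_plan(layers: List[Candidates], memory_budget: int) -> Optional[Plan]:
    """Pick one strategy per layer, minimizing total time within a memory budget.

    Returns (total_time, [strategy per layer]) or None if no plan fits.
    """
    # best[m] = cheapest (time, plan) found so far that uses exactly m memory units.
    best: Dict[int, Plan] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Plan] = {}
        for used, (time_so_far, plan) in best.items():
            for name, (time_cost, mem_cost) in candidates.items():
                total_mem = used + mem_cost
                if total_mem > memory_budget:
                    continue  # this choice would exceed the budget
                total_time = time_so_far + time_cost
                if total_mem not in nxt or total_time < nxt[total_mem][0]:
                    nxt[total_mem] = (total_time, plan + [name])
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Assumed toy candidates: faster strategies cost more memory.
layer: Candidates = {
    "data_parallel": (1.0, 4),
    "tensor_parallel": (1.5, 2),
    "sharded_data_parallel": (2.0, 1),
}
print(search_plan([layer, layer, layer], memory_budget=8))
```

In this toy setting a tighter budget pushes more layers toward the slower, memory-saving candidates, which is the trade-off the summary above describes. The decision-tree decomposition and pruning step is not modeled here.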

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Layer Normalization · Adam · Softmax · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing