BAGUA: Scaling up Distributed Learning with System Relaxations

Shaoduo Gan; Xiangru Lian; Rui Wang; Jianbin Chang; Chengjun Liu; Hongmei Shi; Shengzhuo Zhang; Xianghong Li; Tengxu Sun; Jiawei Jiang; Binhang Yuan; Sen Yang; Ji Liu; Ce Zhang

arXiv:2107.01499·cs.LG·November 29, 2021

TL;DR

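BAGUA is a flexible MPI-style communication library that supports system relaxations (quantization, decentralization, and communication delay) for distributed training. It cuts training time by up to 2x relative to PyTorch-DDP, Horovod, and BytePS, and lets users choose the algorithm that suits their network conditions.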

Contribution

It introduces a modular MPI-style communication library supporting various system relaxation techniques for distributed learning.

Findings

01

BAGUA reduces training time by up to 2x relative to PyTorch-DDP, Horovod, and BytePS.

02

It demonstrates effective use of system relaxations like quantization and decentralization.

03

Tradeoff analysis shows different algorithms excel under different network conditions.
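The sketch below illustrates why no single algorithm wins everywhere. It is a rough illustration only, built on the textbook latency/bandwidth ("alpha-beta") cost model rather than the paper's own analysis. The function names, the 4x compression factor, and the network numbers are all assumptions made for this example, and none of them come from BAGUA's API. Treating quantization as nothing more than a smaller payload is also a simplification.

```python
# Textbook alpha-beta cost model; NOT the paper's analysis or BAGUA's API.
# All names and numbers here are illustrative assumptions.

def ring_allreduce_time(n_workers, msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Ring allreduce: 2*(n-1) steps, each moving msg_bytes/n per worker."""
    steps = 2 * (n_workers - 1)
    return steps * (latency_s + msg_bytes / n_workers / bandwidth_bytes_per_s)


def neighbor_exchange_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Decentralized ring: one full-size exchange with each of 2 neighbors."""
    return 2 * (latency_s + msg_bytes / bandwidth_bytes_per_s)


def fastest(n_workers, msg_bytes, latency_s, bandwidth_bytes_per_s, compression=4):
    """Return the name of the cheapest strategy under this simplified model."""
    costs = {
        "allreduce": ring_allreduce_time(
            n_workers, msg_bytes, latency_s, bandwidth_bytes_per_s
        ),
        # Assumption: quantization only shrinks the payload (e.g. 32-bit -> 8-bit).
        "quantized": ring_allreduce_time(
            n_workers, msg_bytes / compression, latency_s, bandwidth_bytes_per_s
        ),
        "decentralized": neighbor_exchange_time(
            msg_bytes, latency_s, bandwidth_bytes_per_s
        ),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    model_bytes = 100e6  # hypothetical 100 MB of gradients
    # High latency, high bandwidth: fewer communication rounds matter most.
    print(fastest(16, model_bytes, latency_s=5e-3, bandwidth_bytes_per_s=10e9))
    # Low latency, low bandwidth: fewer bytes on the wire matter most.
    print(fastest(16, model_bytes, latency_s=50e-6, bandwidth_bytes_per_s=125e6))
```

Under these assumed numbers the first call selects the decentralized exchange and the second selects the quantized allreduce. This matches the intuition that latency-bound networks reward fewer rounds, while bandwidth-bound networks reward smaller messages.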

Abstract

Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower communication cost via system relaxations: quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all the optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, an MPI-style communication library providing a collection of primitives that is both flexible and modular to support state-of-the-art system…
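To make the quantization relaxation named in the abstract concrete, here is a minimal sketch of uniform 8-bit min-max quantization applied to a gradient vector before it is communicated. It is a generic illustration, not one of the algorithms from the paper or the BAGUA code base. The function names `quantize_uint8` and `dequantize` are hypothetical.

```python
# Generic uniform 8-bit quantization sketch; not taken from BAGUA.
# Function names are hypothetical and chosen for this example only.

def quantize_uint8(values):
    """Map floats to integer codes in [0, 255] plus (offset, scale) metadata."""
    lo, hi = min(values), max(values)
    # Fall back to scale 1.0 when all values are equal, avoiding division by zero.
    scale = (hi - lo) / 255 or 1.0
    codes = [min(255, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from codes and metadata."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    grads = [0.5, -1.25, 0.0, 3.0]
    codes, lo, scale = quantize_uint8(grads)
    restored = dequantize(codes, lo, scale)
    worst = max(abs(a - b) for a, b in zip(grads, restored))
    # Each value costs 8 bits instead of 32; the error is bounded by scale / 2.
    print(codes, worst <= scale / 2 + 1e-12)
```

A sender would transmit `codes` together with `lo` and `scale`, and the receiver would call `dequantize` before applying the update. The reconstruction error per element is at most half a quantization step, which is the kind of bounded inaccuracy that relaxed training algorithms are designed to tolerate.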

Code & Models

Repositories

BaguaSys/bagua (PyTorch, official)

Taxonomy

Methods: BAGUA