TL;DR
BAGUA is a flexible MPI-style communication library that supports advanced system relaxations (quantization, decentralization, and communication delay) for distributed training, improving training speed across diverse network conditions.
Contribution
The paper introduces a modular MPI-style communication library whose primitives support a range of system relaxation techniques for distributed learning.
Findings
BAGUA outperforms PyTorch-DDP, Horovod, and BytePS in training time by up to 2x.
It makes effective use of system relaxations such as quantization and decentralization.
A tradeoff analysis shows that different algorithms excel under different network conditions.
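To make the decentralization relaxation concrete, here is a minimal sketch in plain Python. It is not BAGUA's API: the function names `ring_gossip_step` and `allreduce_mean`, the ring topology, and the equal 1/3 weights are all assumptions chosen for illustration. Each worker averages only with its two ring neighbors instead of taking part in a global all-reduce, and repeated rounds still drive every worker toward the global mean.

```python
def allreduce_mean(values):
    """Centralized baseline: every worker ends up with the exact global mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def ring_gossip_step(values):
    """One decentralized round: each worker averages with its two ring neighbors.

    Hypothetical illustration; the topology and weights are assumptions.
    """
    n = len(values)
    return [
        (values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


# Four workers, each holding one scalar "parameter".
workers = [1.0, 5.0, 9.0, 13.0]
exact = allreduce_mean(workers)

# Repeated neighbor-only averaging approaches the same global mean.
state = workers
for _ in range(20):
    state = ring_gossip_step(state)

gap = max(abs(s - e) for s, e in zip(state, exact))
print(f"largest gap to the all-reduce result after 20 rounds: {gap:.2e}")
```

Each round needs only neighbor-to-neighbor messages, which is the kind of communication saving that the tradeoff analysis weighs against slower agreement between workers.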
Abstract
Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower communication cost via system relaxations: quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all the optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, an MPI-style communication library providing a collection of primitives that is both flexible and modular to support state-of-the-art system…
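As an illustration of the quantization relaxation named in the abstract, the following is a minimal sketch of uniform min-max gradient quantization in plain Python. It does not come from the paper and is not BAGUA's API: the names `quantize` and `dequantize`, the 8-bit width, and the min-max scheme are assumptions. The idea is that a worker sends small integer codes plus two floats instead of full-precision values, accepting a bounded rounding error.

```python
def quantize(grad, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] (hypothetical helper)."""
    levels = (1 << bits) - 1
    lo, hi = min(grad), max(grad)
    # Guard against a constant gradient, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((g - lo) / scale) for g in grad]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


grad = [-0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize(grad)
restored = dequantize(codes, lo, scale)

# The rounding error per element is at most half a quantization step.
max_err = max(abs(g - r) for g, r in zip(grad, restored))
print(codes, f"max error {max_err:.5f}, half step {scale / 2:.5f}")
```

With 8-bit codes, the payload per element shrinks from 32 bits to 8, at the cost of an error of at most half a quantization step per element.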
Taxonomy
Methods: BAGUA
