Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev; Mike Del Balso

arXiv:1802.05799·cs.LG·February 22, 2018·522 cites

TL;DR

Horovod is an open-source library that simplifies and accelerates distributed deep learning in TensorFlow by providing efficient inter-GPU communication and minimal code modifications.

Contribution

The paper introduces Horovod, which uses ring-allreduce for efficient inter-GPU communication and requires only minimal changes to user code, improving both scalability and ease of use.
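To make the ring-based reduction named above concrete, here is a minimal single-process sketch in Python. It simulates the workers as plain lists and is not Horovod's implementation; the function name `ring_allreduce` and the chunking scheme are assumptions chosen for illustration. Each of the N simulated workers passes one chunk to its right-hand neighbour per step: N-1 steps to accumulate partial sums (scatter-reduce), then N-1 steps to circulate the finished chunks (allgather).

```python
def ring_allreduce(worker_data):
    """Simulate a ring allreduce (sum) over equal-length lists, one per worker.

    Illustrative sketch only: real libraries exchange chunks between
    processes or GPUs, whereas here every "worker" is a list in one process.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    bufs = [list(v) for v in worker_data]

    def chunk(i):
        i %= n
        return range(bounds[i], bounds[i + 1])

    # Phase 1, scatter-reduce: at step s, worker r sends chunk (r - s) to
    # worker r + 1, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        msgs = [[bufs[r][j] for j in chunk(r - s)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for j, val in zip(chunk(r - s), msgs[r]):
                bufs[dst][j] += val

    # Worker r now holds the fully reduced chunk (r + 1).
    # Phase 2, allgather: at step s, worker r forwards chunk (r + 1 - s) to
    # worker r + 1, which overwrites its own copy of that chunk.
    for s in range(n - 1):
        msgs = [[bufs[r][j] for j in chunk(r + 1 - s)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for j, val in zip(chunk(r + 1 - s), msgs[r]):
                bufs[dst][j] = val

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, summed in enumerate(ring_allreduce(grads)):
        print(rank, summed)
```

In this scheme each worker sends 2(N-1) chunks of roughly 1/N of the vector, so the per-worker traffic stays close to twice the vector size regardless of N. For gradient averaging, the summed result would be divided by the number of workers afterwards.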

Findings

01. Horovod trains faster than existing multi-GPU methods for TensorFlow.

02. Users need to modify only a few lines of their training code.

03. Horovod scales efficiently across multiple GPUs.

Abstract

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU…

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques