Efficient Embedding of MPI Collectives in MXNET DAGs for scaling Deep Learning
Amith R Mamidala

TL;DR
This paper presents efficient methods for embedding MPI collective operations into the DAGs of the MXNET deep learning framework, enabling scalable data-parallel training on large GPU clusters with short epoch times.
Contribution
It introduces three designs for embedding MPI collectives in MXNET DAGs (Funneled, Concurrent communication, and Dependency chaining) that enable overlap of communication and computation, improving scalability and performance.
Findings
Scales to 256 GPUs with 50-second epoch times on ImageNet.
Demonstrates overlap of communication and computation in DAG execution.
Achieves efficient distributed training on large GPU clusters.
Abstract
The availability of high performance computing infrastructures such as clusters of GPUs and CPUs has fueled the growth of distributed learning systems. Deep Learning frameworks express neural nets as DAGs and execute these DAGs on computation resources such as GPUs. In this paper, we propose efficient designs for embedding MPI collective operations into data parallel DAGs. Incorrect designs can easily lead to deadlocks or program crashes. In particular, we demonstrate three designs for using MPI collectives with DAGs: Funneled, Concurrent communication, and Dependency chaining. These designs automatically enable overlap of computation with communication by allowing the collectives to execute concurrently with the other tasks. We implement these designs directly in the KVStore API of MXNET, which allows us to leverage the rest of the infrastructure. Using ImageNet and CIFAR data sets, we show…
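The abstract notes that incorrect designs can deadlock or crash. One well-known cause is that MPI requires every rank to call collectives on a communicator in the same order, while a DAG engine may run ready tasks in a different order on each rank. The sketch below is a pure-Python simulation of that ordering problem, not the paper's implementation: `issue_order`, `allreduce_mean`, and `run_step` are hypothetical names, and treating "chaining" as a fixed key order is only one plausible reading of the Dependency chaining design, since the abstract does not give details.

```python
from typing import Dict, List

# key (e.g. layer index) -> this rank's local gradient
Gradients = Dict[int, List[float]]


def issue_order(ready_order: List[int], chained: bool) -> List[int]:
    """Order in which one rank issues its collectives.

    Unchained: whatever order the gradients happened to become ready in.
    Chained (assumption): each collective depends on the one for the
    previous key, so every rank issues them in ascending key order.
    """
    return sorted(ready_order) if chained else list(ready_order)


def allreduce_mean(buffers: List[List[float]]) -> List[float]:
    """Stand-in for a sum allreduce followed by division by world size."""
    world_size = len(buffers)
    return [sum(column) / world_size for column in zip(*buffers)]


def run_step(grads: List[Gradients],
             ready_orders: List[List[int]],
             chained: bool) -> Gradients:
    """Simulate one data-parallel step across all ranks.

    Collectives are matched by call position, as in MPI. If the n-th call
    refers to different keys on different ranks, the simulation raises
    instead of silently mixing buffers.
    """
    orders = [issue_order(order, chained) for order in ready_orders]
    averaged: Gradients = {}
    for call_index in range(len(orders[0])):
        keys = [order[call_index] for order in orders]
        if len(set(keys)) != 1:
            raise RuntimeError(
                f"collective #{call_index} mismatched across ranks: {keys}")
        key = keys[0]
        averaged[key] = allreduce_mean([g[key] for g in grads])
    return averaged


if __name__ == "__main__":
    grads = [
        {0: [1.0, 2.0], 1: [10.0]},  # rank 0
        {0: [3.0, 4.0], 1: [30.0]},  # rank 1
    ]
    # The two ranks finish their backward tasks in different orders.
    ready_orders = [[1, 0], [0, 1]]
    print(run_step(grads, ready_orders, chained=True))
    try:
        run_step(grads, ready_orders, chained=False)
    except RuntimeError as err:
        print("unchained:", err)
```

With chaining, both ranks issue the collective for key 0 and then key 1, so the averaged gradients are well defined. Without it, the first call pairs key 1 on rank 0 with key 0 on rank 1, which in a real MPI program would mean mismatched buffers or a hang.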
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques
