Efficient distributed algorithms for Convolutional Neural Networks
Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P. Sadayappan

TL;DR
This paper introduces communication-efficient distributed-memory algorithms for CNNs, modeled on the 2D SUMMA, 2.5D, and 3D algorithms for matrix-matrix multiplication, which trade per-node memory against inter-node communication volume.
Contribution
It generalizes the 2D, 2.5D, and 3D distributed matrix-multiplication algorithms to CNN computations, yielding distributed-memory algorithms designed to reduce inter-node communication.
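For readers unfamiliar with the matrix-multiplication baseline, the following is a minimal single-process sketch of the 2D SUMMA idea on a g x g process grid. It is a simulation for illustration only, not the paper's CNN algorithm and not an MPI implementation. The function names (`summa`, `block`, `matmul`, `add`) and the word-counting scheme are assumptions introduced here, and the sketch uses one block per broadcast step rather than narrower panels.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(m, i, j, bs):
    """Return the bs x bs block of m at block coordinates (i, j)."""
    return [row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]


def summa(a, b, g):
    """Simulate 2D SUMMA for n x n matrices on a g x g grid (hypothetical helper).

    Process (i, j) owns blocks A[i][j], B[i][j] and computes C[i][j].
    Returns the assembled product and the total number of matrix
    elements received over the simulated network.
    """
    n = len(a)
    assert n % g == 0, "sketch assumes the grid size divides n"
    bs = n // g
    A = [[block(a, i, j, bs) for j in range(g)] for i in range(g)]
    B = [[block(b, i, j, bs) for j in range(g)] for i in range(g)]
    C = [[[[0] * bs for _ in range(bs)] for _ in range(g)] for _ in range(g)]
    words_received = 0

    for t in range(g):                  # one broadcast step per block column/row
        for i in range(g):
            for j in range(g):
                a_blk = A[i][t]         # broadcast along grid row i by process (i, t)
                b_blk = B[t][j]         # broadcast along grid column j by process (t, j)
                if j != t:              # process (i, j) does not own a_blk
                    words_received += bs * bs
                if i != t:              # process (i, j) does not own b_blk
                    words_received += bs * bs
                C[i][j] = add(C[i][j], matmul(a_blk, b_blk))

    # Gather the distributed result back into one matrix for checking.
    c = [[0] * n for _ in range(n)]
    for i in range(g):
        for j in range(g):
            for r in range(bs):
                for s in range(bs):
                    c[i * bs + r][j * bs + s] = C[i][j][r][s]
    return c, words_received
```

In this simulation each process holds only its own blocks plus the two blocks received in the current step, which is the low-memory end of the memory/communication trade-off that the 2.5D and 3D variants relax by replicating data.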
Findings
Algorithms reduce inter-node communication volume
Memory requirements are optimized for distributed CNN training
Framework applicable to various CNN architectures
Abstract
Several efficient distributed algorithms have been developed for matrix-matrix multiplication: the 3D algorithm, the 2D SUMMA algorithm, and the 2.5D algorithm. Each of these algorithms was independently conceived, and they trade off the memory needed per node against the inter-node data communication volume. The convolutional neural network (CNN) computation may be viewed as a generalization of matrix multiplication combined with neighborhood stencil computations. We develop communication-efficient distributed-memory algorithms for CNNs that are analogous to the 2D/2.5D/3D algorithms for matrix-matrix multiplication.
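The view of a CNN layer as matrix multiplication plus a stencil can be made concrete with a small loop nest. The sketch below is illustrative only: the tensor layouts, the stride-1 no-padding setting, and the names `conv2d` and `matmul` are assumptions made here, not details taken from the paper. The loop over input channels `c` is the matmul-style contraction, and the loops over `r` and `s` are the neighborhood stencil.

```python
def conv2d(inp, ker):
    """Direct convolution, stride 1, no padding.

    Assumed layouts: inp[n][c][h][w], ker[k][c][r][s], out[n][k][p][q].
    """
    N, C = len(inp), len(inp[0])
    H, W = len(inp[0][0]), len(inp[0][0][0])
    K = len(ker)
    R, S = len(ker[0][0]), len(ker[0][0][0])
    P, Q = H - R + 1, W - S + 1          # output height and width
    out = [[[[0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):                   # batch: like rows of the left matrix
        for k in range(K):               # output channels: like columns of the right matrix
            for p in range(P):
                for q in range(Q):
                    acc = 0
                    for c in range(C):           # contraction index, as in matmul
                        for r in range(R):       # stencil offset in height
                            for s in range(S):   # stencil offset in width
                                acc += inp[n][c][p + r][q + s] * ker[k][c][r][s]
                    out[n][k][p][q] = acc
    return out


def matmul(a, b):
    """Plain dense matrix product, used only for comparison."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# With a 1x1 image and a 1x1 kernel the stencil loops vanish and the
# convolution reduces to an ordinary matrix product.
a = [[1, 2], [3, 4]]                     # shape N x C
b = [[5, 6], [7, 8]]                     # shape C x K
inp = [[[[a[n][c]]] for c in range(2)] for n in range(2)]
ker = [[[[b[c][k]]] for c in range(2)] for k in range(2)]
out = conv2d(inp, ker)
as_matrix = [[out[n][k][0][0] for k in range(2)] for n in range(2)]
```

Here `as_matrix` equals `matmul(a, b)`. With larger kernels, each output point additionally reads a neighborhood of the input, which is why a distributed version has to account for data shared across partition boundaries in addition to the matmul-style communication.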
