Memory-efficient array redistribution through portable collective communication
Norman A. Rink, Adam Paszke, Dimitrios Vytiniotis, Georg Stefan Schmid

TL;DR
This paper introduces a memory-efficient method for array redistribution in large-scale deep learning. It uses a type-directed approach to synthesize sequences of MPI-style collective operations that avoid excessive data transfers and keep redistribution from becoming a bottleneck.
Contribution
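We propose a type-directed approach to synthesizing array redistributions as sequences of collective operations, with formal proofs that the synthesized redistributions are memory-efficient and perform no excessive data transfers. Redistribution using collective operations has also been implemented in the context of the XLA SPMD partitioner, a production-grade tool.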
Findings
Achieves a 1.22x average speedup over existing methods
Reaches a maximum observed speedup of 5.7x
Provides provable memory guarantees for large-scale models
Abstract
Modern large-scale deep learning workloads highlight the need for parallel execution across many devices in order to fit model data into hardware accelerator memories. In these settings, array redistribution may be required during a computation, but can also become a bottleneck if not done efficiently. In this paper we address the problem of redistributing multi-dimensional array data in SPMD computations, the most prevalent form of parallelism in deep learning. We present a type-directed approach to synthesizing array redistributions as sequences of MPI-style collective operations. We prove formally that our synthesized redistributions are memory-efficient and perform no excessive data transfers. Array redistribution for SPMD computations using collective operations has also been implemented in the context of the XLA SPMD partitioner, a production-grade tool for partitioning programs…
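To make the memory argument concrete, the following minimal sketch simulates one redistribution: a 2-D array sharded by rows across n devices is resharded by columns. It is an illustration only and not the paper's synthesis algorithm. Plain Python lists stand in for device memory, the helper names (`shard_rows`, `shard_cols`, `all_gather_then_slice`, `all_to_all`) are hypothetical, and "peak" is assumed here to mean the number of elements in the largest single buffer a device holds.

```python
# Simulated redistribution from row-sharded to column-sharded layout.
# Assumption: row and column counts are divisible by the device count n.

def shard_rows(a, n):
    """Split the rows of `a` into n contiguous blocks, one per device."""
    step = len(a) // n
    return [a[d * step:(d + 1) * step] for d in range(n)]

def shard_cols(a, n):
    """Split the columns of `a` into n contiguous blocks, one per device."""
    step = len(a[0]) // n
    return [[row[d * step:(d + 1) * step] for row in a] for d in range(n)]

def all_gather_then_slice(shards):
    """Baseline: every device gathers the full array, then keeps its column block."""
    n = len(shards)
    full = [row for s in shards for row in s]  # all-gather along rows
    peak = len(full) * len(full[0])            # each device holds the whole array
    return shard_cols(full, n), peak

def all_to_all(shards):
    """Device i sends column block j of its row shard to device j."""
    n = len(shards)
    step = len(shards[0][0]) // n
    out = []
    for j in range(n):
        rows = []
        for s in shards:  # pieces arrive in source-device order
            rows.extend(row[j * step:(j + 1) * step] for row in s)
        out.append(rows)
    peak = len(out[0]) * len(out[0][0])        # largest buffer is one output shard
    return out, peak

a = [[r * 4 + c for c in range(4)] for r in range(4)]
row_shards = shard_rows(a, 2)
gathered, peak_gather = all_gather_then_slice(row_shards)
exchanged, peak_a2a = all_to_all(row_shards)
assert gathered == exchanged == shard_cols(a, 2)
print(peak_gather, peak_a2a)  # prints "16 8"
```

Both routes produce the same column-sharded result, but the gather-based route materializes the full array on every device, while the all-to-all route never holds more than one shard-sized buffer under the accounting assumed above.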
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Advanced Data Storage Technologies
