Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering
Liang Luo, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze

TL;DR
This paper introduces Cloud Collectives, a method that reorders VM ranks to optimize collective communication in cloud environments, significantly improving ML workload training performance without requiring code changes.
Contribution
It proposes a cloud-aware rank reordering technique for collectives that enhances communication efficiency in cloud-based ML workloads, with no need for application modifications.
Findings
Up to 3.7x speedup in microbenchmarks
1.3x speedup in real-world ML training workloads
Effective exploitation of network locality through rank reordering
Abstract
ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and the multi-tenant nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network. Cloud Collectives is non-intrusive, requires neither code changes nor a rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x…
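To make the idea of rank reordering concrete, here is a minimal sketch for a ring-style allreduce. It is not the paper's algorithm: the pairwise cost matrix, the sum-of-hops cost model, and the helper names `ring_cost` and `greedy_ring_order` are all assumptions made for illustration. The sketch greedily picks the cheapest unvisited neighbor so that consecutive ranks in the ring tend to be close in the network.

```python
from typing import List, Sequence


def ring_cost(order: Sequence[int], cost: Sequence[Sequence[float]]) -> float:
    """Sum of link costs around the ring that visits VMs in `order` and wraps around.

    Assumption: total hop cost is used as a simple proxy for ring quality.
    """
    n = len(order)
    return sum(cost[order[i]][order[(i + 1) % n]] for i in range(n))


def greedy_ring_order(cost: Sequence[Sequence[float]], start: int = 0) -> List[int]:
    """Hypothetical heuristic: repeatedly append the cheapest unvisited VM.

    Ties are broken by the lower VM index so the result is deterministic.
    """
    order = [start]
    remaining = set(range(len(cost))) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda v: (cost[last][v], v))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Made-up measurements: VMs 0 and 2 are close to each other, as are VMs 1 and 3.
NEAR, FAR = 1.0, 10.0
cost = [
    [0.0, FAR, NEAR, FAR],
    [FAR, 0.0, FAR, NEAR],
    [NEAR, FAR, 0.0, FAR],
    [FAR, NEAR, FAR, 0.0],
]

default_order = [0, 1, 2, 3]          # ranks as originally assigned
new_order = greedy_ring_order(cost)   # position i in the ring is served by VM new_order[i]

print(ring_cost(default_order, cost), ring_cost(new_order, cost))
```

In this toy setup the default ring crosses a far link on every hop, while the reordered ring crosses only two. A real deployment would need measured pairwise latency or bandwidth between VMs and a cost model matched to the specific collective being used.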
Taxonomy
Topics: Advanced Neural Network Applications · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
