Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering

Liang Luo; Jacob Nelson; Arvind Krishnamurthy; Luis Ceze

arXiv:2105.14088·cs.DC·June 1, 2021

TL;DR

This paper introduces Cloud Collectives, a prototype that reorders VM ranks so that collective communication exploits network locality in the cloud, reporting up to 3.7x speedup in microbenchmarks and 1.3x in real-world ML training workloads without requiring code changes.

Contribution

It proposes a cloud-aware rank reordering technique for collectives that enhances communication efficiency in cloud-based ML workloads, with no need for application modifications.

Findings

1. Up to 3.7x speedup in microbenchmarks
2. 1.3x speedup in real-world ML training workloads
3. Effective exploitation of network locality through rank reordering

Abstract

ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and the multi-tenant nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network. Cloud Collectives is non-intrusive, requires neither code changes nor a rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x…
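To make the idea of rank reordering concrete, here is a minimal toy sketch. It is not the paper's implementation. It assumes a ring-style allreduce in which rank i sends to rank i+1, and a hypothetical pairwise cost matrix between VMs (in practice such costs would have to be measured). The names `ring_cost`, `best_ring_order`, `NEAR`, and `FAR` are illustrative only, and the brute-force search is practical only for a handful of VMs.

```python
from itertools import permutations

def ring_cost(order, cost):
    """Sum of link costs along a ring that visits VMs in the given rank order."""
    n = len(order)
    return sum(cost[order[i]][order[(i + 1) % n]] for i in range(n))

def best_ring_order(cost):
    """Brute-force the rank order with the lowest ring cost (small n only)."""
    n = len(cost)
    best = None
    # Fix VM 0 as rank 0: rotating a ring does not change its cost.
    for rest in permutations(range(1, n)):
        order = (0,) + rest
        c = ring_cost(order, cost)
        if best is None or c < best[1]:
            best = (order, c)
    return best

# Hypothetical costs: VMs 0 and 1 share a rack, VMs 2 and 3 share another.
NEAR, FAR = 1, 10
cost = [
    [0, NEAR, FAR, FAR],
    [NEAR, 0, FAR, FAR],
    [FAR, FAR, 0, NEAR],
    [FAR, FAR, NEAR, 0],
]

# An arbitrary rank assignment that alternates between racks on every hop.
default_order = (0, 2, 1, 3)
default_cost = ring_cost(default_order, cost)

# The reordered assignment keeps rack-local VMs adjacent in the ring.
reordered, reordered_cost = best_ring_order(cost)
```

In this toy setup the alternating order crosses racks on every hop, while the reordered ring crosses racks only twice. The application's communication code is untouched; only the mapping from VMs to ranks changes.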

Taxonomy

Topics: Advanced Neural Network Applications · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques