Scaling Distributed Machine Learning with In-Network Aggregation

Amedeo Sapio; Marco Canini; Chen-Yu Ho; Jacob Nelson; Panos Kalnis; Changhoon Kim; Arvind Krishnamurthy; Masoud Moshref; Dan R. K. Ports; Peter Richtárik

arXiv:1903.06701·cs.DC·October 1, 2020·119 cites

TL;DR

This paper introduces SwitchML, an in-network aggregation primitive that uses a programmable switch dataplane to combine model updates from multiple workers during distributed training, reducing the volume of exchanged data and speeding up training by up to 5.5×.

Contribution

It co-designs the switch processing with the end-host protocols and ML frameworks to enable in-network aggregation, speeding up training by up to 5.5× for a number of real-world benchmark models.

Findings

01 Up to 5.5× speedup in training time

02 Reduced data exchange volume through in-network aggregation

03 Effective integration with existing ML frameworks

Abstract

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.
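The aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is a toy simulation, not the SwitchML switch program or its protocol: the `ToyAggregator` class, the `aggregate_updates` helper, the chunking scheme, and the use of integer values (chosen only so the sums stay exact) are all assumptions made for illustration. It shows a single aggregation point that sums each chunk element-wise across workers and releases one combined chunk once every worker has contributed.

```python
from typing import Dict, List, Optional, Set


class ToyAggregator:
    """Hypothetical stand-in for an aggregation point in the network."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.partial: Dict[int, List[int]] = {}  # chunk id -> running element-wise sum
        self.seen: Dict[int, Set[int]] = {}      # chunk id -> workers already counted

    def receive(self, worker_id: int, chunk_id: int,
                values: List[int]) -> Optional[List[int]]:
        """Add one worker's chunk; return the full sum once all workers arrived."""
        seen = self.seen.setdefault(chunk_id, set())
        if worker_id in seen:
            return None  # a repeat from the same worker is ignored while the chunk is open
        seen.add(worker_id)

        acc = self.partial.get(chunk_id)
        if acc is None:
            self.partial[chunk_id] = list(values)
        else:
            if len(acc) != len(values):
                raise ValueError("chunk length mismatch")
            for i, v in enumerate(values):
                acc[i] += v

        if len(seen) == self.num_workers:
            # Every worker has contributed: release the result and free the state.
            del self.seen[chunk_id]
            return self.partial.pop(chunk_id)
        return None


def aggregate_updates(updates: List[List[int]], chunk_size: int) -> List[int]:
    """Stream each worker's update through the aggregator, chunk by chunk."""
    aggregator = ToyAggregator(num_workers=len(updates))
    length = len(updates[0])
    result = [0] * length
    for start in range(0, length, chunk_size):
        chunk_id = start // chunk_size
        for worker_id, update in enumerate(updates):
            out = aggregator.receive(worker_id, chunk_id,
                                     update[start:start + chunk_size])
            if out is not None:
                result[start:start + len(out)] = out
    return result


if __name__ == "__main__":
    worker_updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(aggregate_updates(worker_updates, chunk_size=2))
```

In this toy setup each worker sends its update once and gets back one aggregated vector, rather than receiving every other worker's update separately, which is the intuition behind the reduced data volume. How SwitchML actually represents values, manages switch memory, and handles packet loss is not covered by this sketch.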

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Advanced Graph Neural Networks