Scaling Distributed Machine Learning with In-Network Aggregation
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, Peter Richtárik

TL;DR
This paper introduces SwitchML, a novel in-network aggregation method that leverages programmable switches to efficiently combine model updates during distributed machine learning training, significantly reducing communication overhead and speeding up training.
Contribution
It presents a co-designed switch processing and end-host protocol approach that enables in-network aggregation, achieving up to 5.5× faster training for real-world models.
Findings
Up to 5.5× speedup in training time
Reduced data exchange volume through in-network aggregation
Effective integration with existing ML frameworks
Abstract
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.
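The core idea in the abstract, summing the workers' model updates inside the network so that each worker gets back a single aggregate, can be illustrated with a small simulation. This is only a toy sketch and not the authors' switch program or protocol: the names `ToySwitch` and `aggregate`, the chunked transfer, the zero padding, and the use of plain Python integers are all assumptions made for illustration.

```python
from typing import List, Optional


class ToySwitch:
    """Hypothetical stand-in for a programmable switch (not the SwitchML code).

    It keeps one accumulator per chunk element and a count of contributors.
    """

    def __init__(self, num_workers: int, chunk_size: int) -> None:
        self.num_workers = num_workers
        self.acc = [0] * chunk_size
        self.seen = 0

    def receive(self, chunk: List[int]) -> Optional[List[int]]:
        """Add one worker's chunk; return the sum once every worker has sent."""
        for i, value in enumerate(chunk):
            self.acc[i] += value
        self.seen += 1
        if self.seen < self.num_workers:
            return None
        # All workers contributed: emit the aggregate and reset for the next chunk.
        result = self.acc
        self.acc = [0] * len(result)
        self.seen = 0
        return result


def aggregate(updates: List[List[int]], chunk_size: int) -> List[int]:
    """Sum equal-length integer updates chunk by chunk through a ToySwitch."""
    length = len(updates[0])
    assert all(len(u) == length for u in updates), "updates must have equal length"
    switch = ToySwitch(len(updates), chunk_size)
    out: List[int] = []
    for start in range(0, length, chunk_size):
        result = None
        for update in updates:
            chunk = update[start:start + chunk_size]
            # Pad the final chunk with zeros so every chunk has chunk_size elements.
            chunk = chunk + [0] * (chunk_size - len(chunk))
            result = switch.receive(chunk)
        out.extend(result)  # result is set after the last worker's chunk
    return out[:length]


if __name__ == "__main__":
    workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(aggregate(workers, chunk_size=2))
```

In this sketch each worker sends its update once and receives one aggregated vector back, instead of exchanging updates with every other worker, which is the sense in which aggregation in the network reduces the volume of exchanged data.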
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Advanced Graph Neural Networks
