NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning
Yisu Wang, Xinjiao Li, Ruilong Wu, Huangxun Chen, and Dirk Kutscher

TL;DR
NetSenseML is a dynamic framework that adaptively adjusts gradient compression techniques based on real-time network conditions to optimize distributed machine learning training efficiency without sacrificing accuracy.
Contribution
It introduces a network-adaptive approach that monitors network conditions during training and applies gradient compression only when congestion slows convergence, balancing payload reduction against model accuracy.
Findings
Improves training throughput by up to 9.84x in bandwidth-constrained environments.
Effectively balances network load and model accuracy during distributed training.
Demonstrates significant performance gains over existing compression methods.
Abstract
Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput, which ultimately affect convergence times and overall training performance. While gradient compression techniques are commonly employed to alleviate network load, they frequently compromise model accuracy due to the loss of gradient information. This paper introduces NetSenseML, a novel network-adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring network conditions, NetSenseML applies gradient compression only when network congestion negatively impacts convergence speed, thus effectively balancing data payload reduction and…
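The idea of compressing only when the network is the bottleneck can be illustrated with a small sketch. This is not the NetSenseML algorithm, whose details are not given in this summary. It is a minimal illustration that assumes a measured bandwidth estimate, a hypothetical per-step time budget for gradient exchange, and top-k sparsification as a stand-in for the pruning step. The names `choose_keep_ratio`, `top_k_sparsify`, and `MIN_KEEP_RATIO` are invented for this example, and the estimate ignores the extra bytes needed to transmit indices.

```python
from typing import List, Tuple

# Assumed floor on the kept fraction, so the gradient is never pruned to nothing.
MIN_KEEP_RATIO = 0.01


def choose_keep_ratio(payload_bytes: int, bandwidth_bps: float, budget_s: float) -> float:
    """Return the fraction of gradient entries to send (1.0 means no compression).

    bandwidth_bps is the currently measured bandwidth in bits per second;
    budget_s is a hypothetical time budget for one gradient exchange.
    """
    if bandwidth_bps <= 0 or budget_s <= 0:
        raise ValueError("bandwidth and budget must be positive")
    transfer_s = payload_bytes * 8 / bandwidth_bps
    if transfer_s <= budget_s:
        # The network keeps up, so send the full gradient and lose no information.
        return 1.0
    # The network is the bottleneck: shrink the payload to roughly fit the budget.
    return max(budget_s / transfer_s, MIN_KEEP_RATIO)


def top_k_sparsify(grad: List[float], keep_ratio: float) -> List[Tuple[int, float]]:
    """Keep the largest-magnitude entries as (index, value) pairs, in index order."""
    k = max(1, round(len(grad) * keep_ratio))
    largest = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(largest)]
```

With a 1000-byte payload and a 1-second budget, a link measured at 8000 bit/s leaves the gradient uncompressed, while a link measured at 4000 bit/s halves the payload, so `top_k_sparsify([0.1, -3.0, 0.5, 2.0], 0.5)` keeps only the two largest-magnitude entries.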
Taxonomy
Topics: Advanced Neural Network Applications · Traffic Prediction and Management Techniques · Cloud Computing and Resource Management
