Application-aware Congestion Mitigation for High-Performance Computing Systems
Archit Patke, Saurabh Jha, Haoran Qiu, Jim Brandt, Ann Gentile, Joe Greenseid, Zbigniew Kalbarczyk, Ravishankar Iyer

TL;DR
This paper introduces Netscope, an ML-driven framework that dynamically mitigates network congestion in HPC systems by considering application-specific network characteristics, significantly reducing runtime variability and improving system utility.
Contribution
Netscope is an automated ML framework that estimates the impact of congestion on HPC application runtime and adapts its mitigation strategy in real time.
Findings
Netscope estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9 for common scientific applications.
It reduces tail runtime variability by up to 14.9 times.
It improves median system utility by 12%.
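The findings above rest on two kinds of statistic: a correlation between estimated and observed congestion impact, and a measure of tail runtime variability. This page does not give the paper's exact definitions, so the following is only a minimal sketch under stated assumptions: Pearson correlation for the first, and the ratio of a high-percentile runtime to the median runtime for the second. The function names `pearson` and `tail_variability` are hypothetical and are not taken from Netscope.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Assumption: the summary does not say which correlation measure
    the authors report; Pearson is used here for illustration only.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def tail_variability(runtimes, q=0.99):
    """Ratio of the q-th percentile runtime to the median runtime.

    Assumption: this is one common way to express tail variability;
    the paper's own definition may differ.
    """
    ordered = sorted(runtimes)
    # Simple rank-based percentile, clamped to the last element.
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx] / statistics.median(ordered)
```

With such helpers, one could compare `tail_variability` for a set of application runs with and without a mitigation policy; a reduction factor would then be the ratio of the two values.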
Abstract
High-performance computing (HPC) systems frequently experience congestion leading to significant application performance variation. However, the impact of congestion on application runtime differs from application to application depending on their network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework that considers those network characteristics to dynamically mitigate congestion. We evaluate Netscope on four Cray Aries systems, including a production supercomputer, on real scientific applications. Netscope has a lower training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9 for common scientific applications. Moreover, we find that Netscope reduces tail runtime variability by up to 14.9 times while improving median system…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
