Noise in the Clouds: Influence of Network Performance Variability on Application Scalability
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, Salvatore Di Girolamo, Tobias Rahn, Torsten Hoefler

TL;DR
This paper investigates how network performance variability in cloud environments affects the scalability and cost-efficiency of HPC workloads, highlighting the impact of network noise on application performance.
Contribution
It provides a detailed analysis and simulation of network noise effects on HPC scalability across multiple cloud providers and on-premise systems, emphasizing the importance of network stability.
Findings
Network noise can significantly reduce HPC application performance.
Cloud network variability impacts scalability and cost.
Validation on four cloud providers and three on-premise HPC systems confirms the effects.
Abstract
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly…
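To make concrete the abstract's claim that noise can limit application scaling, here is a toy sketch in Python. It is not the paper's simulator. It assumes a bulk-synchronous application in which every node finishes a step only when the slowest node does. The function name `mean_step_time` and all parameters (`compute`, `noise_prob`, `noise_delay`) are hypothetical values chosen for illustration.

```python
import random


def mean_step_time(num_nodes, compute=1.0, noise_prob=0.01,
                   noise_delay=0.5, steps=1000, seed=0):
    """Mean duration of one bulk-synchronous step (toy model).

    Each node needs `compute` time units per step and, with probability
    `noise_prob`, suffers an extra `noise_delay` (hypothetical noise event).
    The step ends when the slowest node is done.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(steps):
        slowest = compute
        for _ in range(num_nodes):
            delay = noise_delay if rng.random() < noise_prob else 0.0
            slowest = max(slowest, compute + delay)
        total += slowest
    return total / steps


# Rare per-node noise hits almost every step once many nodes synchronize.
for n in (1, 16, 256):
    print(n, round(mean_step_time(n), 3))
```

Under these assumptions, the chance that at least one node is delayed in a step is 1 - (1 - noise_prob)^num_nodes, so the mean step time drifts from roughly `compute` toward `compute + noise_delay` as the node count grows, even though each individual node is rarely disturbed.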
Taxonomy
Topics: Cloud Computing and Resource Management · Caching and Content Delivery · Interconnection Networks and Systems
