Characterizing the Impact of Congestion in Modern HPC Interconnects
Lorenzo Piarulli, Marco Faltelli, Dirk Pleiter, Karthee Sivalingam, Dancheng Zhang, Kexue Zhao, Matteo Turisini, Francesco Iannone, Aldo Artigiani, Daniele De Sensi

TL;DR
This paper provides a comprehensive analysis of congestion behavior in various modern HPC interconnects, highlighting scale-dependent effects and informing future network optimization strategies.
Contribution
It offers the first detailed characterization of congestion responses across multiple HPC fabrics under diverse traffic patterns and system scales.
Findings
Congestion manifests differently across fabrics as system size increases.
Bursty traffic patterns significantly impact network performance.
Insights can guide the design of better congestion-control mechanisms.
Abstract
High-performance computing (HPC) systems increasingly support both scalable AI training and large-scale simulation workloads. Both typically rely heavily on collective communication operations. On modern supercomputers, however, network congestion has emerged as a major limitation, driven by heterogeneous traffic patterns resulting from diverse workload mixes. As system scale and the number of active users continue to grow, understanding how today's interconnect technologies respond to congestion is essential for establishing realistic performance expectations and informing future system design. This paper presents a comprehensive characterization of congestion behavior across several major HPC fabrics: EDR InfiniBand, HDR InfiniBand, NDR InfiniBand, Cray Slingshot, and emerging Ethernet fabrics. These fabrics span high-performance proprietary interconnects as well as adaptive Ethernet-based designs…
