TL;DR
This paper presents an empirical analysis of network congestion in petascale supercomputers, comparing two interconnect technologies, and introduces a framework for long-term congestion monitoring to inform better congestion control strategies.
Contribution
It provides the first field-based congestion characterization for two major high-speed interconnects using a new monitoring framework.
Findings
Congestion patterns differ significantly between Cray Gemini and Cray Aries.
The monitoring framework effectively captures long-term congestion trends.
Results highlight the need for tailored congestion control approaches for different topologies.
Abstract
Network congestion in high-speed interconnects is a major source of application run time performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.
| Blue Waters | Edison | |
|---|---|---|
| Flow control | Credit-based | Credit-based |
| Technology | Cray Gemini | Cray Aries |
| Topology | 3D Torus | Dragonfly |
| Routing | Directional-order routing | Adaptive |
| Number of Nodes | 27,648 | 5,586 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
Saurabh Jha1, Archit Patke1, Jim Brandt2, Ann Gentile2, Mike Showerman4, Eric Roman3,
Zbigniew T. Kalbarczyk1, Bill Kramer1,4, and Ravishankar K. Iyer1
1University of Illinois at Urbana-Champaign
2Sandia National Labs
4National Center for Supercomputing Applications
3National Energy Research Scientific Computing Center
Abstract
Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.
I Introduction
Despite years of innovation in network routing, congestion avoidance, and mitigation algorithms across generation high-performance interconnects, extreme-scale applications running on high-performance computing systems continue to suffer from performance variation and scaling challenges [1, 2, 3] due to (i) frequent exposure to congestion; and (ii) the inability to automatically optimize resource parameters (such as placement, rank mapping, and application scheduling) to improve network utilization. The current interconnects suffer from congestion that can occur because of (i) bad application placement (e.g., tightly packed ranks versus ranks spread across the network) [4, 5] and presence of bully applications [6]; (ii) the presence of large numbers of failed links [7] and new link failures, which force adversarial traffic shaping/flow on the network; and (iii) inherent congestion susceptibility due to network design choices (e.g., directional order routing in torus networks) or design bugs/gaps (e.g., incorrect/bad dynamic routing policy).
Growing body of work on network congestion [4, 5, 6, 7, 1, 3, 8] reflect the challenges in achieving high application performance, and the difficulty of providing timely monitoring and diagnosis of network congestion. However, the prior studies and tools are limited to proxy applications, simulations and benchmarks and hence are not reflective of production characteristics and issues. This is mainly due to inability to monitor, collect, analyze data on network congestion. To fill this gap, we provide (a) an end-to-end monitoring and analysis tool for understanding congestion causes and support long-term field-congestion characteristics, and (b) an empirical study of network congestion in high-performance computing systems. In particular, we present empirical measurements and insights obtained from two generations of Cray interconnects by using the Lightweight Distributed Metric Service (LDMS) [9] as a monitoring tool, and Monet [10] as a diagnosis tool. The studied high-speed interconnects are: (a) a Cray Gemini network deployed on Blue Waters [11] which uses a 3D torus-based topology and direction order routing; and (b) a Cray Aries network deployed on Edison [12] and 2-cabinet experimental system that uses a DragonFly-based topology and dynamic routing. The study provides empirical data and insights to influence research directions in application run-time, networking and system development.
Our contributions include the following:
- •
Measurements obtained from a production system running production workloads,
- •
Demonstration of end-to-end monitoring and analysis framework at scale. The proposed end-to-end framework uses LDMS [9] for monitoring and Monet [10] for diagnosis of network congestion to generate actionable insights for application developers, system managers, and network designers. LDMS collects performance-related information on links via Cray’s gpcdr [13] kernel module, whereas Monet uses a combination of data science and machine learning techniques on LDMS-collected data to enable online diagnosis, data summarization, and visualization.
- •
An empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D Torus topology; and (ii) Cray Aries which uses the DragonFly topology.
The key results obtained from empirical measurements are:
- •
Despite the use of low-level flow control and routing algorithms, hotspots in the network are common and exist for long duration of time.
- •
Heterogeneity in link bandwidth across different link-types (electrical and optical links) increases the susceptibility to congestion.
- •
Use of adaptive routing and a low-diameter network topology (such as the Cray Aries DragonFly topology) significantly improves congestion avoidance and mitigation compared to non-adaptive routing protocols, such as directional order routing in a high-diameter network topology (such as Cray Gemini Torus networks).
- •
Identification of design gaps in HPC interconnects. Using our monitoring and analysis, we found that routing algorithms in the production network may not choose the least congested path among symmetrical paths.
More detailed results on Blue Waters Cray Gemini network can be found in [10].
II Network and Data Description
In this paper, we demonstrate our monitoring tool on two production systems, Blue Waters and Edison, which use the Cray Gemini and Cray Aries interconnects respectively. Subsection II-A provides system description, Subsection II-B provides dataset description and Subsection II-C defines the congestion metric used in the study.
II-A System and Interconnect Description
Table I shows differences and similarities between the two production system with respect to network interconnect technology features.
NCSA’s (National Center for Supercomputing Applications) Blue Waters system is composed of 27,648 nodes and has a large-scale (13,824 x 48 port switches) Gemini 3D torus (dimension 24x24x24) interconnect. The available bandwidth on a particular network link is dependent on the link type (i.e., electrical vs optical) and number of tiles in the link. Where multiple links from a Gemini switch connect in the same direction, it is convenient to consider them as a directionally aggreggated link which we will henceforth call link, one in each of the 6 directions, X+/-, Y+/-,Z+/-, in the torus. For large XE/XK systems [14] all such aggregated X links have an aggregate bandwidth of 9.4 GB/s, Y links alternate between 9.4 GB/s and and 4.7 GB/s, and Z links are predominantly 15 GB/s with 1/8 of them at 9.4 GB/s. Gemini interconnect uses directional-order routing which is predominantly static.
NERSC’s (National Energy Research Scientific Computing Center) Edison system is composed of 5,586 nodes and has an (1,440 x 48 port switches) Aries DragonFly interconnect with 15 electrical groups. Electrical links (Green and Black) connect Aries switches within the group, where as, optical links (Blue) form group to group connections. Optical and electrical links have a bandwidth of 1.56 GB/s and 1.75 GB/s respectively. Adaptive routing is used to route packets on non-minimal paths to alleviate congestion on the minimal path.
II-B Field-Congestion Datasets
Network performance on links is exposed via Cray’s gpcdr kernel module. These are not collected nor made available for analysis via vendor-provided collection mechanisms. This data is collected and transported off the system for storage and analysis via the Lightweight Distributed Metric Service (LDMS) monitoring framework [9]. LDMS daemons synchronize their sampling (node to node time skew not accounted for) in order to provide coherent snapshots of network state across the whole system. The resolution of sampling is one second on Edison and sixty seconds on Blue Waters. In this work, we demonstrate the capability using one week of production data; that amounts to 7.7 TB for Edison and 370 GB for Blue Waters.
II-C Congestion Metric
In this paper, we use Percent Time Stalled (PTS) as a congestion metric to quantify network congestion. It is a suitable metric for interconnect networks that use credit-based flow control algorithms. In credit-based flow control networks, a source is allowed to send packets to the destination only if the source has sufficient credits. If sufficient credits are not available, then the link stalls, and the percentage of time spent in stalled state per unit time quantifies the extent of congestion. In this paper, we refer to this metric (i.e., percentage of time spent in stalled state per unit time) as the Percent Time Stalled.
III Tool Demonstration
In this section, we describe the following two results obtained from Monet [10] tool on field-congestion data.
- •
impact of routing algorithms on congestion (see Subsection III-A)
- •
impact of heterogeneity in link-bandwidth on congestion (see Subsection III-B)
III-A Impact of Routing Algorithms
Figure 1 shows the quantile values for different congested link durations, i.e., durations for which the PTS value on the link is above a fixed threshold (). The figure leads to the following insights:
- •
Use of the dragonfly topology and adaptive routing has led to improvement in congestion control between two generations of Cray interconnects. The Dragonfly topology used in Aries has a low global diameter of one hop, which helps to contain the back pressure of congested links. Furthermore, adaptive routing allows packets to take a longer but less congested path, which helps to alleviate congestion on the minimal path. Figure 1 provides empirical evidence for that observation. For every threshold, the congested link duration in Aries is an order of magnitude less than in Gemini. For example, if the threshold for congestion is fixed at 15% PTS, while the median duration is close to zero in both systems, the 99.9th percentile duration is approximately 1 minute for Edison and 400 minutes for Blue Waters. However, while Aries manages long bouts of congestion better than Gemini does, application runtime variability due to network performance remains a concern[15].
- •
Detection of long-duration congestion using traffic measurements can facilitate intervention such as rank remapping or rescheduling of bully jobs [6]. The 99.9th percentile congested link duration observed in both systems for is greater than a minute. Such long duration congestion allows us to tolerate greater latency for detection and diagnosis in real time. Moreover, a diagnosis can be converted to actionable feedback to be used by tools such as TopoMesh [16], which can remap MPI ranks or the scheduler to reschedule bully jobs.
III-B Impact of Heterogeneity in Link-bandwidth
Heterogeneity in link bandwidth across different link types (electrical and optical links) increases the susceptibility to congestion. Figure 2 (a), Figure 2 (b) and Figure 2 (c) respectively show congested link durations at different quantile values for X, Y and Z directional links of Cray Gemini interconnect in Blue Waters, and Figure 3 (a), Figure 3 (b) and Figure 3 (c) respecitvely show the congested link durations at different quantile values for Green, Black and Blue links of Cray Aries interconnect in Edison. In Gemini, for higher thresholds (), links along the X direction have longer lasting congestion than those on the Y and Z direction links. Similarly, in Aries, optical links (Blue) have shorter and less severe bursts of congestion than the electrical links (Green and Black). Thus, mismatch and heterogeneity in link-bandwidth leads to varying levels of congestion along network path.
IV Conclusion
In this work, we demonstrated the use of Monet [10] to conduct long-term characterization of field-congestion data obtained from petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology. Future work will include an in-depth analysis of field-congestion data and methods to alleviate congestion issues in HPC interconnects.
V Acknowledgement*
We thank Larry Kaplan (Cray), Gregory Bauer (NCSA) and Jeremy Enos (NCSA) for having many insightful conversations. We thank K. Atchley and J. Applequist for their help in preparing the manuscript.
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 2015-02674. This work is partially supported by NSF CNS 13-14891, and an IBM faculty award.
This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Application.
Sandia National Laboratories (SNL) is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs, “There goes the neighborhood: performance degradation due to nearby jobs,” in Proc. International Conference for High Performance Computing, Networking, Storage and Analysis , 2013, pp. 41:1–41:12.
- 2[2] N. Jain, A. Bhatele, S. White, T. Gamblin, and L. V. Kale, “Evaluating HPC networks via simulation of parallel workloads,” in High Performance Computing, Networking, Storage and Analysis, SC 16: International Conference for . IEEE, 2016, pp. 154–165.
- 3[3] T. Hoefler, T. Schneider, and A. Lumsdaine, “Characterizing the influence of system noise on large-scale applications by simulation,” in Proc. ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis . IEEE, 2010, pp. 1–11.
- 4[4] T. Agarwal, A. Sharma, A. Laxmikant, and L. V. Kalé, “Topology-aware task mapping for reducing communication contention on large parallel machines,” in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International . IEEE, 2006, pp. 10–pp.
- 5[5] M. Mubarak, P. Carns, J. Jenkins, J. K. Li, N. Jain, S. Snyder, R. Ross, C. D. Carothers, A. Bhatele, and K.-L. Ma, “Quantifying I/O and communication traffic interference on dragonfly networks equipped with burst buffers,” in Cluster Computing, 2017 IEEE Int’l Conf. on . IEEE, 2017, pp. 204–215.
- 6[6] X. Yang, J. Jenkins, M. Mubarak, R. B. Ross, and Z. Lan, “Watch out for the bully!: job interference study on dragonfly network,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis . IEEE Press, 2016, p. 64.
- 7[7] S. Jha, V. Formicola, C. Di Martino, M. Dalton, W. T. Kramer, Z. Kalbarczyk, and R. K. Iyer, “Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters,” IEEE Transactions on Dependable and Secure Computing , 2017.
- 8[8] S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer, “Characterizing supercomputer traffic networks through link-level analysis,” in 2018 IEEE International Conference on Cluster Computing (CLUSTER) . IEEE, 2018, pp. 562–570.
