On the Impact of Intra-node Communication in the Performance of Supercomputer and Data Center Interconnection Networks
Joaquin Tarraga-Moreno, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles

TL;DR
This paper investigates how increasing intra-node communication bandwidth in supercomputers and data centers can negatively impact inter-node communication performance due to interference, using simulation models and traffic pattern analysis.
Contribution
It introduces a generic simulation model for intra- and inter-node communication and demonstrates that higher intra-node bandwidth may hinder overall system performance.
Findings
Higher intra-node bandwidth can cause interference with inter-node traffic.
Increasing the number of accelerators per node may reduce inter-node communication efficiency.
Simulation results confirm the counterproductive effect of high intra-node bandwidth on inter-node performance.
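One way to see why faster intra-node links may hurt inter-node traffic is a back-of-envelope oversubscription estimate. The sketch below is not the paper's simulation model: the function names, the single shared NIC per node, and all numeric values are hypothetical assumptions chosen for illustration. It only compares the inter-node load that a node's accelerators could offer against the capacity of that NIC.

```python
def nic_oversubscription(accels_per_node, intra_bw_gbps, inter_fraction, nic_bw_gbps):
    """Ratio of offered inter-node load to NIC capacity.

    Assumption (not from the paper): every accelerator injects at the full
    intra-node link rate, and a fixed fraction of that traffic leaves the node
    through one shared NIC. A ratio above 1.0 means the NIC is oversubscribed.
    """
    offered_gbps = accels_per_node * intra_bw_gbps * inter_fraction
    return offered_gbps / nic_bw_gbps


def delivered_inter_node_gbps(accels_per_node, intra_bw_gbps, inter_fraction, nic_bw_gbps):
    """Inter-node throughput in this toy model, capped at the NIC capacity."""
    offered_gbps = accels_per_node * intra_bw_gbps * inter_fraction
    return min(offered_gbps, nic_bw_gbps)


# Hypothetical configurations: same NIC, different intra-node fabrics.
modest = nic_oversubscription(4, 100.0, 0.25, 200.0)   # 4 accelerators, 100 Gb/s links
scaled = nic_oversubscription(8, 400.0, 0.25, 200.0)   # 8 accelerators, 400 Gb/s links
```

In this toy setting the first configuration offers half of the NIC capacity, while the second offers four times as much, so raising intra-node bandwidth and accelerator count without scaling the NIC moves the bottleneck to the node boundary. The model ignores queuing and head-of-line blocking, which are the effects a full simulation would capture.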
Abstract
In the last decade, special-purpose computing and storage devices, such as GPUs, TPUs, or high-speed storage, have been incorporated into the server nodes of supercomputers and data centers. The development of high-bandwidth memory (HBM) enabled a much more compact form factor for these devices, allowing several of them to be interconnected within a server node, typically through an intra-node interconnection network (e.g., PCIe, NVLink, or Infinity Fabric). These networks allow scaling up the number of special-purpose computing and storage devices per node. In turn, inter-node networks connect thousands of these devices placed in different server nodes of a supercomputer or data center. Unfortunately, the intra- and inter-node networks may become the system's bottleneck because of the growing communication demand among accelerators in applications such as generative AI. Although…
Taxonomy
Topics: Interconnection Networks and Systems · Cloud Computing and Resource Management · Advanced Data Storage Technologies
