Switching Efficiency: A Novel Framework for Dissecting AI Data Center Network Efficiency
Niangen Ye, Jiawen Zhu, Baojun Chen, Dong Wang, Jiang Sun, Weiqiang Sun, Weisheng Hu

TL;DR
This paper introduces the Switching Efficiency Framework, a new metric system for analyzing and diagnosing communication bottlenecks in AI data center networks, guiding future design improvements.
Contribution
It proposes a novel, comprehensive metric framework that links network activity to computational progress and decomposes efficiency into analyzable factors.
Findings
Symmetric, distributed switching aligns with sparse LLM traffic.
All-to-All traffic from Mixture-of-Experts models degrades port utilization.
Design choices like resource allocation and in-network computing improve efficiency.
Abstract
Communication is pivotal in LLM training, and a thorough analysis of the communication efficiency of AI data center (AIDC) networks is essential for guiding the design of these capital-intensive clusters. However, conventional metrics are inadequate for such analysis: they do not directly link network activity to computational progress, and they lack the granularity to diagnose the impact of different network design patterns. To address this, we introduce a metric framework, the Switching Efficiency Framework, whose core metric, Switching Efficiency, quantifies computationally effective data throughput per unit switching capacity. We further decompose this metric into three factors (Data, Routing Efficiency, and Port Utilization) to facilitate analysis of distinct communication bottlenecks. Using this metric framework, we demonstrate how the symmetric, distributed switching of 3D-Torus…
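The core metric described in the abstract can be read as a simple ratio, sketched below in Python. This is an illustrative sketch only, not the paper's formulation: the function names, the units, and especially the multiplicative combination of the three factors are assumptions made here for illustration, since the abstract names the factors without stating how they combine.

```python
def switching_efficiency(effective_throughput: float,
                         switching_capacity: float) -> float:
    """Computationally effective data throughput per unit switching capacity.

    Both arguments are assumed to share one unit (e.g. Gbit/s), so the
    result is dimensionless.
    """
    if switching_capacity <= 0:
        raise ValueError("switching_capacity must be positive")
    return effective_throughput / switching_capacity


def combine_factors(data_factor: float,
                    routing_efficiency: float,
                    port_utilization: float) -> float:
    """ASSUMPTION: treat the three factors as fractions in [0, 1] that
    multiply together. The abstract does not confirm this form."""
    factors = (data_factor, routing_efficiency, port_utilization)
    for value in factors:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each factor must lie in [0, 1]")
    product = 1.0
    for value in factors:
        product *= value
    return product


# Hypothetical numbers, chosen only to exercise the functions.
ratio = switching_efficiency(effective_throughput=200.0,
                             switching_capacity=800.0)
combined = combine_factors(0.5, 0.5, 1.0)
print(ratio, combined)
```

Under the assumed multiplicative form, a low value in any single factor caps the overall figure, which is why a per-factor breakdown helps locate the bottleneck.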
