RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
Yinxiao Feng, Tiancheng Chen, Yuchen Wei, Siyuan Shen, Shiju Wang, Wei Li, Kaisheng Ma, Torsten Hoefler

TL;DR
RailX introduces a reconfigurable, scalable, and cost-effective network architecture for hyper-scale LLM training, utilizing intra-node direct connectivity and optical circuit switching to interconnect over 100,000 chips efficiently.
Contribution
The paper presents RailX, a novel network design combining intra-node direct links and optical circuit switching, with a Hamiltonian Decomposition-based method for efficient all-to-all topology organization.
Findings
Supports over 100K chips with hyper bandwidth
Cost per bandwidth significantly lower than Fat-Tree
Diameter of 2-4 inter-node hops
Abstract
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architecture is neither scalable nor cost-effective enough. Tree-based topologies such as the \textit{Rail-optimized} network are extremely expensive, while direct topologies such as \textit{Torus} have insufficient bisection bandwidth and flexibility. In this paper, we propose \textit{RailX}, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit switching networks. We propose a novel interconnection method based on \textit{Hamiltonian Decomposition} theory to organize separate rail-based rings into \textit{all-to-all} topology, simultaneously optimizing ring-collective and all-to-all…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsParallel Computing and Optimization Techniques · Interconnection Networks and Systems · Graph Theory and Algorithms
