High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)
John Gliksberg (LI-PaRAD, UCLM), Antoine Capra, Alexandre Louvet, Pedro Javier Garcia (UCLM), Devan Sohier (LI-PaRAD)

TL;DR
This paper introduces Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) that minimizes congestion risk when equipment fails, so that large-scale HPC networks can be rerouted rapidly without disrupting running applications.
Contribution
The paper presents Dmodc, a novel modulo-based routing algorithm that allows quick rerouting in large PGFT topologies, improving fault resilience and management efficiency.
Findings
Dmodc reroutes topologies with tens of thousands of nodes in less than a second.
Dmodc achieves congestion risk comparable to the most stable algorithms under heavy degradation.
Dmodc maintains near-optimal congestion risk for shift permutation traffic when the topology is only minimally degraded.
Abstract
Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete rerouting of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software…
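The modulo-based port selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's Dmodc algorithm, whose details the abstract does not give; it only shows the general idea of dividing a destination identifier by a per-level divider and taking the result modulo the number of usable ports. The names `select_port`, `divider`, and `candidate_ports` are hypothetical, and the divider is assumed to be supplied by the caller (for example, derived from subtree sizes).

```python
from typing import Sequence


def select_port(dest_id: int, divider: int, candidate_ports: Sequence[int]) -> int:
    """Pick an output port for dest_id (illustrative, not the paper's code).

    divider: assumed pre-modulo divisor, e.g. derived from subtree sizes.
    candidate_ports: ports still usable toward the destination; failed
    ports are assumed to have been filtered out by the caller.
    """
    if divider < 1:
        raise ValueError("divider must be a positive integer")
    if not candidate_ports:
        raise ValueError("no usable port toward destination")
    # Pre-modulo division, then modulo over the surviving ports.
    index = (dest_id // divider) % len(candidate_ports)
    return candidate_ports[index]


# Healthy switch with four ports: destinations spread round-robin.
healthy = [select_port(d, 1, [10, 11, 12, 13]) for d in range(8)]

# Same switch after port 12 fails: the modulo spreads over three ports.
degraded = [select_port(d, 1, [10, 11, 13]) for d in range(8)]
```

Because each choice depends only on the destination, the divider, and the surviving ports, a forwarding table entry can be recomputed independently of the others, which is consistent with the fast, deterministic rerouting the abstract describes.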
