High-Quality Fault Resiliency in Fat Trees
John Gliksberg (LI-PaRAD, UCLM), Antoine Capra, Alexandre Louvet, Pedro Javier Garcia (UCLM), Devan Sohier (LI-PaRAD)

TL;DR
This paper introduces Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) that minimizes congestion risk during network failures, enabling quick re-routing in large-scale HPC clusters.
Contribution
The paper presents Dmodc, a novel routing algorithm that computes high-quality routing tables rapidly for large PGFT networks, improving fault resiliency.
Findings
Recomputes routing tables in less than a second for networks with tens of thousands of nodes.
Minimizes congestion risk even under massive network degradation.
Enables high-quality, fault-tolerant routing without impacting running applications.
Abstract
Coupling regular topologies with optimised routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) which minimises congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete re-routing of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralised fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters.
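To give a feel for what a closed-form arithmetic routing rule looks like, here is a minimal sketch of the classic D-mod-k rule for a complete, fault-free fat tree. It is not the Dmodc formula from the paper, which additionally relies on a preprocessing phase to cope with degraded topologies. The function name `dmodk_up_port` and the parameter `up_ports_per_level` are hypothetical, chosen only for illustration.

```python
from typing import List


def dmodk_up_port(dest: int, level: int, up_ports_per_level: List[int]) -> int:
    """Pick an up port for destination `dest` at a switch on `level`.

    Illustrative D-mod-k rule (an assumption, not the paper's Dmodc):
    level 0 is the leaf level, and up_ports_per_level[l] is the number
    of up ports of every switch at level l in a fault-free tree.
    """
    # Divider = product of the up-port counts of all levels below this one,
    # so that consecutive destinations spread over different up ports.
    divider = 1
    for width in up_ports_per_level[:level]:
        divider *= width
    return (dest // divider) % up_ports_per_level[level]


# Example: two switch levels with 4 up ports each.
widths = [4, 4]
leaf_choice = dmodk_up_port(13, 0, widths)   # 13 % 4
spine_choice = dmodk_up_port(13, 1, widths)  # (13 // 4) % 4
```

Because each table entry is a single integer division and modulo, a full forwarding table can be filled in time proportional to its size. That is the kind of property that makes sub-second re-routing plausible for large fabrics.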
