SprayCheck: Finding Gray Failures in Adaptive Routing Networks
Jakob Krebs, Daniel Amir, Shir Landau Feibish, Mark Silberstein

TL;DR
SprayCheck is a passive detection system that identifies gray failures in adaptive routing networks by analyzing traffic patterns, enabling early failure detection without extra network load.
Contribution
SprayCheck introduces a scalable, passive method for detecting gray failures that exploits the properties of adaptive routing together with flow-level data, improving both the speed and the accuracy of failure detection.
Findings
Detects failures with a 1.5% packet drop rate within a single iteration.
Identifies failures with a 0.5% packet drop rate within 5 iterations.
Effective in large-scale data center network topologies.
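This summary does not describe SprayCheck's actual algorithm, so the following is only a minimal sketch of the general idea that lower drop rates need more observed traffic before they stand out. It assumes per-path sent and received packet counts are available, and it uses a plain one-sided two-proportion z-test as a stand-in technique. The function name `flag_lossy_paths`, the `z_threshold` parameter, and all packet counts are hypothetical.

```python
import math


def flag_lossy_paths(sent, received, z_threshold=3.0):
    """Return the paths whose loss rate is significantly above the others.

    sent, received: dicts mapping a path id to cumulative packet counts.
    Illustrative stand-in only; this is not the method from the paper.
    """
    total_sent = sum(sent.values())
    total_lost = sum(sent[p] - received[p] for p in sent)
    flagged = []
    for path, n in sent.items():
        lost = n - received[path]
        rest_n = total_sent - n          # packets sent on all other paths
        rest_lost = total_lost - lost    # packets lost on all other paths
        if n == 0 or rest_n == 0:
            continue
        pooled = total_lost / total_sent
        # Standard error of the difference between two proportions
        se = math.sqrt(pooled * (1 - pooled) * (1 / n + 1 / rest_n))
        if se == 0:
            continue  # no loss anywhere, nothing to flag
        z = (lost / n - rest_lost / rest_n) / se
        if z > z_threshold:
            flagged.append(path)
    return flagged


def counts(packets_per_path, lost_on_p3, num_paths=8):
    """Build hypothetical counters: only path 'p3' loses packets."""
    sent = {f"p{i}": packets_per_path for i in range(num_paths)}
    received = dict(sent)
    received["p3"] -= lost_on_p3
    return sent, received
```

With these made-up counters, a 0.5% loss on one of eight paths (1 lost packet out of 200) stays below the threshold, while the same loss rate accumulated over five times as much traffic (5 lost out of 1000) is flagged. That is one way to see why a weaker failure can take several iterations to surface, although the thresholds and traffic volumes in the paper may differ entirely.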
Abstract
Distributed machine learning (ML) training has become a dominant workload in modern data center networks, operating at massive scale with clusters comprising tens to hundreds of thousands of GPUs. The scale of these networks makes failures, and particularly gray failures, inevitable. Gray failures can significantly degrade both network and application performance, yet they are notoriously difficult to detect, localize, and debug. To meet the performance demands of ML workloads, adaptive routing is widely deployed to maximize network utilization by dynamically spreading traffic across many paths. While adaptive routing increases network utilization, it also greatly intensifies the effect of gray failures. Prior work has either dismissed gray failures as negligible or proposed detection mechanisms that fail to scale, rendering these approaches increasingly impractical for large-scale…
