The Ghost in the Datacenter: Link Flapping, Topology Knowledge Failures, and the FITO Category Mistake
Paul Borrill

TL;DR
This paper examines how link flaps and failures corrupt a datacenter network's knowledge of its own topology, shows why timeout-based failure detection cannot prevent that corruption, and proposes an atomic link-layer alternative.
Contribution
The paper traces the fundamental limitations of existing failure detectors to the forward-in-time-only (FITO) channel model and introduces Open Atomic Ethernet as a novel approach to eliminating ghosts at the link layer.
Findings
Ghosts occur at all network scales due to timeout-based failure detection.
Existing mitigation techniques still create ghosts because of fundamental timeout limitations.
Open Atomic Ethernet can eliminate ghosts through transactional topology knowledge.
Abstract
Every link disconnection or flap in a datacenter corrupts the network's self-knowledge -- its graph. We call this corruption a ghost: a node that appears reachable but is not, a link that reports "up" but silently drops traffic, or an IP address that resolves to a partitioned machine. Ghosts arise at every scale -- chiplet-to-chiplet (PCIe, UCIe), GPU-to-GPU (NVLink, NVSwitch), node-to-node (Ethernet, Thunderbolt), and cluster-to-cluster (IP, BGP) -- because all these protocols inherit Shannon's forward-in-time-only (FITO) channel model and use Timeout And Retry (TAR) as their failure detector. TAR cannot distinguish "slow" from "dead," which is precisely the ambiguity that Fischer--Lynch--Paterson proved unresolvable in asynchronous systems. We survey the problem using production data from Meta (419 interruptions in 54 days of LLaMA 3 training), ByteDance (38,236 explicit and 5,948…
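The "slow" versus "dead" ambiguity described in the abstract can be illustrated with a minimal sketch. It is not the paper's method or any real protocol implementation: the `Peer` class, the `probe` and `tar_detect` functions, and all latency and timeout values are hypothetical, and time is modeled as plain numbers rather than real network I/O.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    """Hypothetical peer: replies after latency_ms, or never if latency_ms is None."""
    name: str
    latency_ms: Optional[float]


def probe(peer: Peer, timeout_ms: float) -> Optional[float]:
    """Return the reply latency if it arrives within the timeout, else None."""
    if peer.latency_ms is not None and peer.latency_ms <= timeout_ms:
        return peer.latency_ms
    return None


def tar_detect(peer: Peer, timeout_ms: float, retries: int) -> str:
    """Timeout And Retry: report 'down' only after every probe times out."""
    for _ in range(retries):
        if probe(peer, timeout_ms) is not None:
            return "up"
    return "down"


healthy = Peer("healthy", 2.0)
slow = Peer("slow", 250.0)   # alive, but slower than the timeout
dead = Peer("dead", None)    # never replies

# The detector sees only silence, so 'slow' and 'dead' get the same verdict.
verdicts = {
    p.name: tar_detect(p, timeout_ms=100.0, retries=3)
    for p in (healthy, slow, dead)
}

# A ghost in this toy model: the topology table keeps the last verdict,
# so a peer that dies right after a successful probe still appears 'up'.
topology = dict(verdicts)
healthy.latency_ms = None
ghost = topology["healthy"] == "up" and healthy.latency_ms is None
```

In this toy model, raising the timeout to 300 ms would classify the slow peer correctly, but only by lengthening the window in which a dead peer lingers in the topology table as "up". That trade-off is the ambiguity the abstract attributes to TAR.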
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Software System Performance and Reliability
