The Ghost in the Datacenter: Link Flapping, Topology Knowledge Failures, and the FITO Category Mistake

Paul Borrill

arXiv:2603.03736·cs.DC·March 5, 2026

TL;DR

This paper examines how link flaps and failures corrupt a datacenter network's knowledge of its own topology, argues that timeout-based failure detection cannot prevent this corruption, and proposes an atomic link-layer alternative.

Contribution

The paper traces the fundamental limitations of existing failure detectors to the forward-in-time-only (FITO) channel model and introduces Open Atomic Ethernet, an approach intended to eliminate ghosts at the link layer.

Findings

01. Ghosts occur at all network scales due to timeout-based failure detection.

02. Existing mitigation techniques still create ghosts because of fundamental timeout limitations.

03. Open Atomic Ethernet can eliminate ghosts through transactional topology knowledge.

Abstract

Every link disconnection or flap in a datacenter corrupts the network's self-knowledge -- its graph. We call this corruption a ghost: a node that appears reachable but is not, a link that reports "up" but silently drops traffic, or an IP address that resolves to a partitioned machine. Ghosts arise at every scale -- chiplet-to-chiplet (PCIe, UCIe), GPU-to-GPU (NVLink, NVSwitch), node-to-node (Ethernet, Thunderbolt), and cluster-to-cluster (IP, BGP) -- because all these protocols inherit Shannon's forward-in-time-only (FITO) channel model and use Timeout And Retry (TAR) as their failure detector. TAR cannot distinguish "slow" from "dead," which is precisely the ambiguity that Fischer--Lynch--Paterson proved unresolvable in asynchronous systems. We survey the problem using production data from Meta (419 interruptions in 54 days of LLaMA 3 training), ByteDance (38,236 explicit and 5,948…
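The "slow" versus "dead" ambiguity described in the abstract can be illustrated with a toy model. The sketch below is not taken from the paper: the `Peer` class, the `tar_probe` function, and all delay and timeout values are hypothetical, and the "network" is simulated with fixed reply delays rather than real I/O. It shows only that a Timeout And Retry detector returns the same verdict for a peer that answers late and a peer that never answers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    """Hypothetical peer: reply_delay_ms is None when the peer never replies."""
    name: str
    reply_delay_ms: Optional[float]


def tar_probe(peer: Peer, timeout_ms: float, retries: int) -> str:
    """Toy Timeout And Retry detector (an illustrative assumption, not the paper's code).

    Each attempt succeeds only if the simulated reply arrives within timeout_ms.
    After the initial attempt plus `retries` retries, the peer is suspected dead.
    """
    for _ in range(retries + 1):
        if peer.reply_delay_ms is not None and peer.reply_delay_ms <= timeout_ms:
            return "up"
    return "suspected dead"


fast = Peer("fast", reply_delay_ms=5.0)
slow = Peer("slow", reply_delay_ms=250.0)   # alive, but slower than the timeout
dead = Peer("dead", reply_delay_ms=None)    # never replies

# With a 100 ms timeout, the slow peer and the dead peer are indistinguishable.
verdicts = {p.name: tar_probe(p, timeout_ms=100.0, retries=3) for p in (fast, slow, dead)}

# Raising the timeout rescues the slow peer, at the cost of waiting longer
# before a genuinely dead peer is reported.
relaxed = {p.name: tar_probe(p, timeout_ms=300.0, retries=3) for p in (fast, slow, dead)}
```

Any fixed timeout trades false suspicion of slow peers against detection latency for dead ones; no choice of value removes the ambiguity itself.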

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Software System Performance and Reliability