Avoiding Cross-Datacenter Collective Congestion via Disaggregated Buffering
Mariano Scazzariello, Noga H. Rotman, Dima Gavrilenko, Sajy Khashab, Alexander Shpiner, Matty Kadosh, Marco Chiesa, Dejan Kostic, Mark Silberstein

TL;DR
Spillway is a novel in-network buffering mechanism that prevents congestion collapse in cross-datacenter GPU training by buffering dropped packets at switches, improving performance without modifying end systems.
Contribution
We introduce Spillway, a transparent switch-disaggregated buffering system that mitigates cross-DC collective collisions, validated through simulations and hardware prototypes.
Findings
Reduces iteration time by up to 14% in large-scale training.
Eliminates performance degradation caused by collective collisions.
Works without changes to end hosts or training frameworks.
Abstract
LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DCs), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination, a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse. We present Spillway, a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers in the destination datacenter and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.
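The buffer-then-drain idea in the abstract can be illustrated with a toy queue model. This is a minimal sketch under stated assumptions, not the authors' design: the function `simulate`, its parameters, the tick-based timing, and the rule that spilled packets refill the port queue ahead of new arrivals are all hypothetical choices made for illustration. The spill buffer is also modeled as unbounded, which a real switch buffer is not.

```python
def simulate(arrivals, port_capacity, drain_rate, spill):
    """Toy model of one congested egress port (illustrative only).

    arrivals:      packets arriving in each tick (list of ints)
    port_capacity: packets the port queue can hold
    drain_rate:    packets the port transmits per tick
    spill:         if True, overflow goes to a side buffer instead of being dropped
    """
    queue = spilled = delivered = dropped = 0
    for n in arrivals:
        # 1. Transmit up to drain_rate packets from the port queue.
        sent = min(drain_rate, queue)
        queue -= sent
        delivered += sent

        # 2. Drain the spill buffer back into the queue as room appears
        #    (assumption: spilled packets go ahead of new arrivals).
        refill = min(spilled, port_capacity - queue)
        queue += refill
        spilled -= refill

        # 3. Admit new arrivals; whatever does not fit overflows.
        admitted = min(n, port_capacity - queue)
        queue += admitted
        overflow = n - admitted
        if spill:
            spilled += overflow   # buffered elsewhere, drained later
        else:
            dropped += overflow   # lost; would need retransmission

    return {"delivered": delivered, "dropped": dropped,
            "queued": queue, "spilled": spilled}


# A short burst (standing in for a collective collision) followed by idle ticks.
burst = [10, 10, 10, 0, 0, 0, 0, 0, 0, 0]
without_spill = simulate(burst, port_capacity=8, drain_rate=4, spill=False)
with_spill = simulate(burst, port_capacity=8, drain_rate=4, spill=True)
print("without spill buffer:", without_spill)
print("with spill buffer:   ", with_spill)
```

In this toy setting the burst overflows the port queue in both runs; the difference is only whether the overflow is lost or held and delivered once the queue has room. The model ignores the long-haul round-trip time, retransmissions, and packet ordering, all of which matter in the real system.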
