Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
Ioannis Papavasileiou, Sairam Prabhakar, Indu Kant Deo, Sergejs Makovejs

TL;DR
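This paper uses discrete-event simulation to analyze how fiber latency affects the efficiency of geo-distributed, data-parallel AI training. It finds that the optimum distance between two AI clusters is 10-100 km, and that hollow-core fiber enables 25% higher compute-communication overlap over that range.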
Contribution
It provides quantitative insights into fiber latency effects on geo-distributed AI training and identifies optimal cluster distances and fiber types for improved overlap.
Findings
The optimum distance between two AI clusters is 10-100 km.
Over 10-100 km, hollow-core fiber enables 25% higher compute-communication overlap.
Fiber latency significantly impacts training efficiency.
Abstract
We use discrete-event simulation to quantify the impact of fiber latency on the efficacy of geo-distributed AI model training with data parallelism. We conclude that the optimum distance between two AI clusters is 10-100 km, over which hollow-core fiber enables 25% higher compute-communication overlap.
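To make the link between fiber latency and overlap concrete, here is a minimal Python sketch. It is not the paper's discrete-event simulation. It is a toy model built on assumptions the abstract does not state: a group index of about 1.468 for conventional solid-core fiber and about 1.003 for hollow-core fiber, and a communication phase modeled as a fixed transfer time plus a hypothetical number of latency-bound round trips. The function names and the `round_trips` parameter are illustrative only.

```python
# Toy model (assumption, not the paper's simulator): how propagation delay
# over fiber changes the share of communication hidden behind compute.

C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s

# Assumed group indices; these values are not taken from the paper.
GROUP_INDEX = {"solid_core": 1.468, "hollow_core": 1.003}


def one_way_delay_us(distance_km: float, fiber: str) -> float:
    """One-way propagation delay in microseconds over the given fiber."""
    return distance_km * GROUP_INDEX[fiber] / C_KM_PER_S * 1e6


def overlap_fraction(compute_s: float, transfer_s: float,
                     distance_km: float, fiber: str,
                     round_trips: int) -> float:
    """Fraction of communication time that can hide behind compute.

    Communication time is modeled as a bandwidth-bound transfer time plus
    `round_trips` latency-bound round trips (a hypothetical parameter).
    """
    rtt_s = 2 * one_way_delay_us(distance_km, fiber) * 1e-6
    comm_s = transfer_s + round_trips * rtt_s
    hidden_s = min(compute_s, comm_s)
    return hidden_s / comm_s


if __name__ == "__main__":
    for distance_km in (10, 100, 1000):
        for fiber in GROUP_INDEX:
            delay = one_way_delay_us(distance_km, fiber)
            frac = overlap_fraction(compute_s=0.050, transfer_s=0.040,
                                    distance_km=distance_km, fiber=fiber,
                                    round_trips=50)
            print(f"{distance_km:>5} km  {fiber:<12} "
                  f"delay={delay:8.1f} us  overlap={frac:.2f}")
```

The input values in the usage loop (50 ms of compute, 40 ms of transfer, 50 round trips) are placeholders chosen only to show the trend. The numbers the sketch prints should not be compared with the 25% figure reported in the abstract.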
