Photonic Rails in ML Datacenters
Eric Ding, Chuhan Ouyang, Rachee Singh

TL;DR
This paper proposes using optical circuit switches to implement rail-optimized network fabrics in ML datacenters. It targets the power, cost, and complexity overheads of high-radix electrical switches through parallelism-driven rail reconfiguration and a new control plane, Opus.
Contribution
It introduces a novel optical switch-based rail abstraction with a control plane for time-multiplexed emulation, enabling dynamic, model-aware datacenter network reconfiguration.
Findings
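Optical circuit switches can emulate electrical rail switches through time-multiplexed reconfiguration, despite supporting only one-to-one connectivity at a time.
Parallelism-driven rail reconfiguration exploits the sequential ordering between traffic from different parallelisms to serve the fan-out of hybrid-parallel ML workloads.
The Opus control plane manages this time-multiplexed emulation of electrical rail switches.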
Abstract
Rail-optimized network fabrics have become the de facto datacenter scale-out fabric for large-scale ML training. However, the use of high-radix electrical switches to provide all-to-all connectivity in rails imposes massive power, cost, and complexity overheads. We propose a rethinking of the rail abstraction by retaining its communication semantics, but realizing it using optical circuit switches. The key challenge is that optical switches support only one-to-one connectivity at a time, limiting the fan-out of traffic in ML workloads using hybrid parallelisms. We introduce parallelism-driven rail reconfiguration as a solution that leverages the sequential ordering between traffic from different parallelisms. We design a control plane, Opus, to enable time-multiplexed emulation of electrical rail switches using optical switches. More broadly, our work discusses a new research agenda:…
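To make the one-to-one constraint and the idea of time-multiplexing concrete, here is a minimal toy sketch in Python. It is not the Opus control plane: the class `OpticalCircuitSwitch`, the helper `run_phases`, the port numbering, and the phase names are all hypothetical and chosen only for illustration. It assumes that traffic from different parallelisms arrives in sequential phases, as the abstract describes, and models a switch configuration as a set of disjoint port pairs.

```python
from dataclasses import dataclass, field


@dataclass
class OpticalCircuitSwitch:
    """Toy model (hypothetical): each port has at most one peer at a time."""
    num_ports: int
    circuits: dict = field(default_factory=dict)  # port -> peer port
    reconfigurations: int = 0

    def configure(self, pairs):
        """Install a new set of circuits, replacing the previous set."""
        new = {}
        for a, b in pairs:
            if a == b:
                raise ValueError("a port cannot be connected to itself")
            for p in (a, b):
                if not 0 <= p < self.num_ports:
                    raise ValueError(f"port {p} out of range")
                if p in new:
                    # One-to-one connectivity: no fan-out from a single port.
                    raise ValueError(f"port {p} is already in a circuit")
            new[a] = b
            new[b] = a
        # Only count a reconfiguration when the circuits actually change.
        if new != self.circuits:
            self.circuits = new
            self.reconfigurations += 1

    def can_send(self, src, dst):
        return self.circuits.get(src) == dst


def run_phases(switch, phases):
    """Reconfigure before each phase; phases is a list of (name, pairs)."""
    log = []
    for name, pairs in phases:
        switch.configure(pairs)
        assert all(switch.can_send(s, d) for s, d in pairs)
        log.append((name, switch.reconfigurations))
    return log


if __name__ == "__main__":
    sw = OpticalCircuitSwitch(num_ports=4)
    # Hypothetical schedule: two parallelism dimensions with different peers.
    schedule = [
        ("phase_a", [(0, 1), (2, 3)]),
        ("phase_b", [(0, 2), (1, 3)]),
    ]
    print(run_phases(sw, schedule))
```

In this sketch, port 0 cannot reach ports 1 and 2 at the same time, so a request such as `configure([(0, 1), (0, 2)])` is rejected. Serving both peers requires two configurations applied one after the other, which is the time-multiplexing idea in miniature. Real reconfiguration delays and scheduling decisions are outside what this toy models.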
