Photonic Rails in ML Datacenters with Opus
Eric Ding, Barry Lyu, Bhaskar Kataria, Rachee Singh

TL;DR
This paper introduces Opus, a photonic rail-based network fabric for ML datacenters that replaces electrical rail switches with optical circuit switches and reconfigures them according to the parallelism phases of training, cutting network fabric power by over 23x and cost by up to 4x at less than 6% training overhead.
Contribution
The paper proposes a rail abstraction built on optical circuit switches, with a reconfiguration method tailored to ML workloads, and evaluates it on real hardware and in simulation.
Findings
Over 23x power reduction in network fabric
Up to 4x cost savings compared to electrical switches
Less than 6% training overhead due to reconfiguration latency
Abstract
Rail-optimized network fabrics have become the de facto datacenter scale-out fabric for large-scale ML training. However, the use of high-radix electrical switches to provide all-to-all connectivity in rails imposes massive power and cost. We propose a rethinking of the rail abstraction by retaining its communication semantics, but realizing it using optical circuit switches. The key challenge is that optical switches support one-to-one connectivity at a time, limiting the fan-out of traffic in ML workloads using hybrid parallelisms. We overcome this through parallelism-driven rail reconfiguration, which exploits the non-overlapping communication phases of different parallelism dimensions. This time-multiplexes a single set of physical ports across circuit configurations tailored to each phase within a training iteration. We design and implement Opus, a control plane that…
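The time-multiplexing idea in the abstract can be illustrated with a toy model. The sketch below is not the Opus control plane: the names `ring_circuit`, `merge_circuits`, `OpticalSwitch`, and `run_iterations` are hypothetical, as are the four-port rail, the grouping of ports into data-parallel and pipeline-parallel groups, and the ring-style neighbor exchange. It only shows that each phase gets its own one-to-one circuit configuration and that the same ports are reused across phases.

```python
from typing import Dict, List

# A circuit configuration maps each source port to exactly one destination port.
Circuit = Dict[int, int]


def ring_circuit(group: List[int]) -> Circuit:
    """Connect each port in a group to its ring successor (assumed traffic pattern)."""
    return {group[i]: group[(i + 1) % len(group)] for i in range(len(group))}


def merge_circuits(circuits: List[Circuit]) -> Circuit:
    """Combine per-group circuits, rejecting anything that is not one-to-one."""
    merged: Circuit = {}
    for circuit in circuits:
        for src, dst in circuit.items():
            if src in merged:
                raise ValueError(f"port {src} already has an outgoing circuit")
            merged[src] = dst
    if len(set(merged.values())) != len(merged):
        raise ValueError("two circuits terminate on the same destination port")
    return merged


class OpticalSwitch:
    """Toy optical circuit switch: holds one configuration at a time."""

    def __init__(self) -> None:
        self.circuit: Circuit = {}
        self.reconfigurations = 0

    def apply(self, circuit: Circuit) -> None:
        # Reconfigure only when the requested circuits differ from the current ones.
        if circuit != self.circuit:
            self.circuit = dict(circuit)
            self.reconfigurations += 1


def run_iterations(switch: OpticalSwitch, configs: Dict[str, Circuit],
                   schedule: List[str], iterations: int) -> int:
    """Step through the phases of each iteration, reconfiguring before each phase."""
    for _ in range(iterations):
        for phase in schedule:
            switch.apply(configs[phase])
    return switch.reconfigurations


# Hypothetical layout: 4 ports on one rail, 2-way data parallel x 2-way pipeline parallel.
configs = {
    "dp": merge_circuits([ring_circuit([0, 1]), ring_circuit([2, 3])]),
    "pp": merge_circuits([ring_circuit([0, 2]), ring_circuit([1, 3])]),
}
switch = OpticalSwitch()
total = run_iterations(switch, configs, schedule=["pp", "dp"], iterations=3)
print(total)  # 6: one reconfiguration per phase change
```

In this toy model every phase change costs one reconfiguration, which is where the reconfiguration latency reported in the findings would be incurred. The model does not attempt to capture that latency or how it overlaps with computation.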
Taxonomy
Topics: Advanced Optical Network Technologies · Cloud Computing and Resource Management · Software-Defined Networks and 5G
