Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings
Anton Juerss, Stefan Schmid

TL;DR
Bridge introduces a reconfiguration strategy for optical networks that reuses links across multiple steps, significantly improving collective communication performance in AI/ML and HPC workloads.
Contribution
Bridge leverages Bruck's communication pattern to reconfigure sparsely and to reuse optical links across steps, reducing reconfiguration delay and improving throughput for collective primitives.
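As background for the Bruck pattern mentioned above, here is a minimal sketch of the standard Bruck peer schedule: in step k, each rank exchanges data with the ranks at distance 2^k (modulo n). The function names (`bruck_offsets`, `bruck_step_peers`, `offset_cycles`) are hypothetical, the send direction is one common convention, and the cycle decomposition is only an illustration of how links at a fixed offset form disjoint rings. Whether this matches Bridge's notion of subrings or its actual schedule is an assumption, not something stated in the summary.

```python
def bruck_offsets(n: int) -> list[int]:
    """Offsets 1, 2, 4, ... used in the ceil(log2(n)) steps of Bruck's pattern."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return [1 << k for k in range((n - 1).bit_length())]


def bruck_step_peers(rank: int, n: int) -> list[tuple[int, int]]:
    """Per step, the (send_to, recv_from) pair for one rank.

    Direction is a convention (assumed here): send "forward" by the offset,
    receive from the rank the same distance "behind".
    """
    return [((rank + d) % n, (rank - d) % n) for d in bruck_offsets(n)]


def offset_cycles(n: int, d: int) -> list[list[int]]:
    """Group ranks into the disjoint cycles formed by links i -> (i + d) % n.

    There are gcd(n, d) such cycles; this is illustrative only and is not
    taken from the paper.
    """
    cycles: list[list[int]] = []
    seen: set[int] = set()
    for start in range(n):
        if start in seen:
            continue
        cycle, node = [], start
        while node not in seen:
            seen.add(node)
            cycle.append(node)
            node = (node + d) % n
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    n = 8
    print("offsets:", bruck_offsets(n))
    print("rank 0 peers:", bruck_step_peers(0, n))
    for d in bruck_offsets(n):
        print(f"offset {d}:", offset_cycles(n, d))
```

With n = 8, the pattern needs only three distinct offsets (1, 2, 4), which gives a sense of why a schedule built on it can get by with few distinct circuit configurations.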
Findings
Reduces All-to-All completion time by 3x to 10x over static baselines.
Outperforms existing reconfiguration strategies for AllReduce, with up to 1.5x speedup.
Outperforms the bandwidth-optimal Ring algorithm by 1.5x to 6.6x on certain workloads.
Abstract
Optical circuit-switched networks have emerged as an appealing alternative to electrical fabrics as they can reconfigure the network topology at runtime, reducing communication cost and improving bandwidth utilization. Yet exploiting optical reconfigurable networks for collective communication comes with a fundamental trade-off: each reconfiguration incurs non-negligible delay, communication must pause while the fabric reconfigures, and the benefit of a new topology depends on future traffic. The central question is therefore when reconfiguration is worth its cost. While prior work has demonstrated the benefits of reconfiguration, existing strategies use optical links only to optimize the current step, without reusing them for future steps. In this paper, we present Bridge, a reconfiguration strategy for important collective communication primitives used in AI/ML and HPC applications,…
