OptINC: Optical In-Network-Computing for Scalable Distributed Learning
Sijie Fei, Grace Li Zhang, Bing Li, Ulf Schlichtmann

TL;DR
This paper introduces OptINC, an optical in-network computing architecture that reduces communication overhead in distributed learning by performing gradient averaging and quantization inside the optical interconnects, while keeping accuracy comparable to a ring all-reduce baseline.
Contribution
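It proposes an optical neural network built from devices such as Mach-Zehnder interferometers placed in the optical interconnects, together with hardware-aware training and preprocessing algorithms, so that computation otherwise done in the servers is offloaded to the network and communication costs in distributed training are reduced.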
Findings
Achieves accuracy comparable to the ring all-reduce baseline.
Eliminates communication overhead in distributed training.
Effective on large models like ResNet50 and LLaMA-based networks.
Abstract
Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning such as ring all-reduce result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder-Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce…
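To make the contrast in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. `ring_all_reduce_average` simulates the standard ring all-reduce baseline and counts its communication steps, and `average_and_quantize` computes the arithmetic that an in-network unit would have to produce in a single pass. The function names, the one-element-per-chunk layout, and the uniform symmetric 8-bit quantizer are all assumptions made for illustration; the source does not specify the quantization scheme, and nothing here models the optical hardware.

```python
from typing import List, Tuple


def ring_all_reduce_average(grads: List[List[float]]) -> Tuple[List[List[float]], int]:
    """Simulate ring all-reduce over n workers, one chunk (element) per worker.

    Returns every worker's averaged gradient and the number of
    communication steps, which is 2 * (n - 1).
    """
    n = len(grads)
    assert all(len(g) == n for g in grads), "sketch assumes vector length == n"
    bufs = [list(g) for g in grads]
    steps = 0

    # Phase 1 (reduce-scatter): partial sums travel around the ring.
    for s in range(n - 1):
        msgs = [(i, (i - s) % n, bufs[i][(i - s) % n]) for i in range(n)]
        for i, chunk, value in msgs:
            bufs[(i + 1) % n][chunk] += value
        steps += 1

    # Phase 2 (all-gather): completed sums are forwarded to every worker.
    for s in range(n - 1):
        msgs = [(i, (i + 1 - s) % n, bufs[i][(i + 1 - s) % n]) for i in range(n)]
        for i, chunk, value in msgs:
            bufs[(i + 1) % n][chunk] = value
        steps += 1

    return [[x / n for x in b] for b in bufs], steps


def quantize(x: float, bits: int, max_abs: float) -> float:
    """Assumed quantizer: uniform, symmetric, clipped to [-max_abs, max_abs]."""
    levels = 2 ** (bits - 1) - 1
    clipped = max(-max_abs, min(max_abs, x))
    return round(clipped / max_abs * levels) / levels * max_abs


def average_and_quantize(grads: List[List[float]], bits: int = 8,
                         max_abs: float = 1.0) -> List[float]:
    """Reference result of averaging then quantizing in one pass."""
    n = len(grads)
    length = len(grads[0])
    assert all(len(g) == length for g in grads)
    return [quantize(sum(g[i] for g in grads) / n, bits, max_abs)
            for i in range(length)]
```

With three workers the simulated ring needs four sequential communication steps before every worker holds the average, whereas `average_and_quantize` describes a result that is available after the gradients pass through the network once. That difference in round trips is the overhead the abstract refers to.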
