Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML
Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, Rachee Singh

TL;DR
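Morphlux is a server-scale programmable photonic fabric that interconnects accelerators within servers. Added to torus-based ML data-centers, it raises the bandwidth of tenant compute allocations by up to 66%, cuts compute fragmentation by up to 70%, and limits the blast radius of chip failures, which translates to a 1.72X improvement in training throughput.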
Contribution
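We introduce Morphlux, a programmable photonic fabric that augments torus networks to improve bandwidth, reduce compute fragmentation, and contain chip failures in multi-tenant ML data-centers, and we demonstrate it with an end-to-end hardware prototype.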
Findings
Up to 66% higher bandwidth for tenant compute allocations
Up to 70% reduction in compute fragmentation
1.72X increase in ML training throughput
Abstract
We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML data-centers with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize the blast radius of chip failures. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits, which translate to a 1.72X improvement in the training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can replace a failed accelerator chip with a healthy one in 1.2 seconds.
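The fragmentation claim in the abstract can be illustrated with a toy model. The sketch below is not the paper's allocator: it assumes a 4x4 torus, a tenant that needs a 2x2 slice, and an idealized reconfigurable fabric that can connect any free chips in the modeled group. The function names and the grid size are hypothetical and chosen only for illustration.

```python
from itertools import product


def free_block_exists(free, rows, cols, h, w):
    """Return True if some h x w block (with torus wraparound) is entirely free."""
    for r0, c0 in product(range(rows), range(cols)):
        block = {
            ((r0 + dr) % rows, (c0 + dc) % cols)
            for dr, dc in product(range(h), range(w))
        }
        if block <= free:  # every chip in the block is free
            return True
    return False


def can_allocate_fixed(free, rows, cols, h, w):
    """Fixed torus (assumed model): the tenant needs a contiguous h x w or w x h block."""
    return (free_block_exists(free, rows, cols, h, w)
            or free_block_exists(free, rows, cols, w, h))


def can_allocate_reconfigurable(free, h, w):
    """Idealized reconfigurable fabric (assumption): any h*w free chips can be wired together."""
    return len(free) >= h * w


ROWS, COLS = 4, 4
# Four chips are free, but no two of them are adjacent in the torus.
free_chips = {(0, 0), (0, 2), (2, 0), (2, 2)}

fixed_ok = can_allocate_fixed(free_chips, ROWS, COLS, 2, 2)
reconf_ok = can_allocate_reconfigurable(free_chips, 2, 2)
print("fixed torus can place 2x2 tenant:", fixed_ok)
print("reconfigurable fabric can place 2x2 tenant:", reconf_ok)
```

In this toy case enough chips are free for the tenant, but the fixed torus cannot place the request because the free chips are scattered. That is the sense of fragmentation the sketch is meant to convey. How Morphlux actually selects and connects chips is not described in this summary.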
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Advanced Materials and Mechanics · Advanced Sensor and Energy Harvesting Materials
