Nezha: Breaking Multi-Rail Network Barriers for Distributed DNN Training
Enda Yu, Dezun Dong, Xiangke Liao

TL;DR
Nezha is a protocol-agnostic system that optimizes multi-rail network communication for distributed deep learning, significantly improving throughput and efficiency on legacy hardware without requiring hardware upgrades.
Contribution
Nezha introduces a unified, protocol-agnostic framework with dynamic load balancing and fault tolerance for multi-rail networks in distributed DNN training.
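As a rough illustration of what dynamic load balancing with fault tolerance across rails can mean in practice, the sketch below splits a payload across healthy rails in proportion to their measured bandwidth. It is not Nezha's actual algorithm: the `Rail` class, the `split_payload` function, the rail names, and the bandwidth figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rail:
    """One network rail (hypothetical model, not Nezha's data structure)."""
    name: str
    bandwidth_gbps: float  # assumed: most recently measured throughput
    healthy: bool = True   # assumed: set to False when the rail fails


def split_payload(total_bytes: int, rails: list[Rail]) -> dict[str, int]:
    """Assign bytes to each healthy rail in proportion to its bandwidth."""
    live = [r for r in rails if r.healthy and r.bandwidth_gbps > 0]
    if not live:
        raise RuntimeError("no healthy rail available")

    total_bw = sum(r.bandwidth_gbps for r in live)
    shares = {
        r.name: int(total_bytes * r.bandwidth_gbps / total_bw) for r in live
    }

    # Integer truncation can leave a few bytes unassigned;
    # give the remainder to the fastest rail.
    fastest = max(live, key=lambda r: r.bandwidth_gbps)
    shares[fastest.name] += total_bytes - sum(shares.values())
    return shares


if __name__ == "__main__":
    rails = [Rail("ib0", 100.0), Rail("eth0", 25.0)]
    print(split_payload(1000, rails))

    # Simulate a rail failure: traffic falls back to the remaining rail.
    rails[1].healthy = False
    print(split_payload(1000, rails))
```

Re-running the split whenever bandwidth measurements or health flags change is what would make such a scheme dynamic, in contrast to the static single-rail binding described in the abstract.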
Findings
Achieves 74-80% higher throughput than MPTCP.
Delivers 2.36x higher training efficiency than Gloo on large clusters.
Reduces latency by a factor of 1.7 to 4.3 compared to Gloo.
Abstract
In distributed deep learning, communication remains a critical bottleneck. While modern hardware advances rapidly, over 60 percent of production HPC systems still rely on legacy infrastructure (V100 GPUs, multi-plane Ethernet/InfiniBand), necessitating communication optimization without hardware upgrades. Existing approaches face three key limitations: (1) static single-rail binding underutilizes multi-rail bandwidth, (2) protocol heterogeneity (TCP-RDMA coexistence) causes synchronization delays, and (3) mainstream libraries (NCCL/MPI) lack cross-protocol coordination. We present Nezha, the first protocol-agnostic system for multi-rail networks. Our contributions include: (1) Hardware-agnostic cross-protocol coordination: A unified abstraction enabling seamless collaboration between in-network computing (SHARP), adaptive RDMA (GLEX), and TCP, achieving 1.7 to 4.3 times lower latency…
Taxonomy
Topics: Interconnection Networks and Systems · Network Traffic and Congestion Control · IPv6, Mobility, Handover, Networks, Security
