Nezha: Breaking Multi-Rail Network Barriers for Distributed DNN Training

Enda Yu; Dezun Dong; Xiangke Liao

arXiv:2405.17870 · cs.DC · February 10, 2026

TL;DR

Nezha is a protocol-agnostic system that optimizes multi-rail network communication for distributed deep learning, significantly improving throughput and efficiency on legacy hardware without requiring hardware upgrades.

Contribution

Nezha introduces a unified, protocol-agnostic framework with dynamic load balancing and fault tolerance for multi-rail networks in distributed DNN training.
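To make the idea of dynamic load balancing with fault tolerance concrete, here is a minimal sketch of bandwidth-proportional splitting across rails. It is not Nezha's actual algorithm, which this summary does not describe; the `Rail` class, the `split_across_rails` function, the rail names, and the bandwidth figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rail:
    """One network rail. All fields are illustrative assumptions."""
    name: str            # hypothetical label, e.g. "tcp0" or "rdma0"
    bandwidth: float     # assumed throughput estimate, in Gbit/s
    healthy: bool = True # set to False when the rail is considered failed


def split_across_rails(nbytes, rails):
    """Split a message of nbytes across healthy rails, proportional to bandwidth."""
    live = [r for r in rails if r.healthy and r.bandwidth > 0]
    if not live:
        raise RuntimeError("no healthy rail available")
    total = sum(r.bandwidth for r in live)
    # Integer share per rail, rounded down.
    shares = {r.name: int(nbytes * r.bandwidth / total) for r in live}
    # Hand the rounding remainder to the fastest rail so the shares sum to nbytes.
    fastest = max(live, key=lambda r: r.bandwidth)
    shares[fastest.name] += nbytes - sum(shares.values())
    return shares


rails = [Rail("tcp0", 10.0), Rail("rdma0", 30.0)]
print(split_across_rails(1000, rails))

# Simulated failure of one rail: all traffic falls back to the remaining rail.
rails[1].healthy = False
print(split_across_rails(1000, rails))
```

A real system would presumably refresh the bandwidth estimates at runtime rather than take them as fixed inputs; the static values here only keep the sketch self-contained.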

Findings

1. Achieves 74-80% higher throughput than MPTCP.
2. Delivers 2.36x higher training efficiency than Gloo on large clusters.
3. Achieves 1.7x to 4.3x lower latency than Gloo.

Abstract

In distributed deep learning, communication remains a critical bottleneck. While modern hardware advances rapidly, over 60 percent of production HPC systems still rely on legacy infrastructure (V100 GPUs, multi-plane Ethernet/InfiniBand), necessitating communication optimization without hardware upgrades. Existing approaches face three key limitations: (1) static single-rail binding underutilizes multi-rail bandwidth, (2) protocol heterogeneity (TCP-RDMA coexistence) causes synchronization delays, and (3) mainstream libraries (NCCL/MPI) lack cross-protocol coordination. We present Nezha, the first protocol-agnostic system for multi-rail networks. Our contributions include: (1) Hardware-agnostic cross-protocol coordination: A unified abstraction enabling seamless collaboration between in-network computing (SHARP), adaptive RDMA (GLEX), and TCP, achieving 1.7 to 4.3 times lower latency…

Taxonomy

Topics: Interconnection Networks and Systems · Network Traffic and Congestion Control · IPv6, Mobility, Handover, Networks, Security