SHIFT: Exploring the Boundary of RDMA Network Fault Tolerance
Shengkai Lin, Kairui Zhou, Hongtao Zhang, Yibo Wu, Yi Pan, Yihan Yang, Qinwei Yang, Wei Zhang, Arvind Krishnamurthy, Shizhen Zhao

TL;DR
This paper introduces SHIFT, a user-space RDMA layer that extends fault tolerance to the cross-NIC level in distributed training by failing traffic over to intra-host backup NICs, so training keeps making progress despite network failures.
Contribution
We demonstrate the fundamental Trilemma in cross-NIC RDMA failover and propose SHIFT, a novel solution that provides fault tolerance while preserving memory semantics.
Findings
SHIFT incurs negligible overhead during normal operation.
SHIFT successfully masks NIC failures and link anomalies.
Training continues without costly restarts despite network faults.
Abstract
Under gang scheduling for large-scale distributed large language model (LLM) training, a single network anomaly can stall or abort an entire job. Current network fault tolerance mechanisms typically adopt a "fallback and bypass" approach within the switching fabric and at the access layer, tolerating in-network and access-layer failures. We explore whether RDMA fault tolerance can be extended to the cross-NIC level by failing over traffic to intra-host backup NICs. For the first time, we prove a fundamental Trilemma: it is impossible to have Cross-NIC RDMA failover that simultaneously preserves Exactly-Once Execution, Receiver-NIC Opacity, and a Zero-Copy datapath. Fortunately, we observe that dominant training frameworks (e.g., NCCL) rely on idempotent bulk transfers that tolerate relaxed memory ordering, as long as notification ordering is preserved. Leveraging this insight, we…
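The observation about idempotent bulk transfers can be illustrated with a toy model. The sketch below is not SHIFT's implementation and uses no real RDMA verbs. The names Receiver, send_with_failover, and fail_after are hypothetical, and the "primary" and "backup" paths are plain Python loops. It assumes only what the abstract states: writes are idempotent, and the notification must arrive after the data it covers.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy receiver (hypothetical): a byte buffer plus delivered notifications."""
    buf: bytearray
    notifications: list = field(default_factory=list)

    def write(self, offset: int, data: bytes) -> None:
        # Idempotent: writing the same bytes to the same offset a second
        # time leaves the buffer exactly as it was after the first write.
        self.buf[offset:offset + len(data)] = data

    def notify(self, tag: str) -> None:
        self.notifications.append(tag)


def send_with_failover(rx: Receiver, chunks, tag: str, fail_after: int) -> None:
    """Send chunks over a 'primary' path that dies after `fail_after` writes,
    then replay every chunk over a 'backup' path before notifying."""
    # Primary path: some writes land, but their acknowledgements are lost.
    for offset, data in chunks[:fail_after]:
        rx.write(offset, data)

    # Backup path: the sender cannot tell which writes landed, so it replays
    # all of them. Duplicates are harmless because the writes are idempotent.
    for offset, data in chunks:
        rx.write(offset, data)

    # The notification is issued only after every write has been re-sent,
    # so the receiver never sees "done" ahead of the data.
    rx.notify(tag)


chunks = [(0, b"aaaa"), (4, b"bbbb"), (8, b"cccc")]
rx = Receiver(bytearray(12))
send_with_failover(rx, chunks, "step-0", fail_after=2)
print(bytes(rx.buf), rx.notifications)
```

In this model the first two chunks are written twice, yet the final buffer and the single notification match a failure-free run. A non-idempotent operation such as an atomic increment would not survive the same replay, which is the case where exactly-once execution matters.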
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Distributed systems and fault tolerance
