SHIFT: Exploring the Boundary of RDMA Network Fault Tolerance

Shengkai Lin; Kairui Zhou; Hongtao Zhang; Yibo Wu; Yi Pan; Yihan Yang; Qinwei Yang; Wei Zhang; Arvind Krishnamurthy; Shizhen Zhao

arXiv:2512.11094 · cs.NI · February 27, 2026

TL;DR

This paper introduces SHIFT, a user-space RDMA layer that extends fault tolerance to the cross-NIC level in distributed training by failing traffic over to intra-host backup NICs, so that training keeps making progress despite network failures.

Contribution

We prove a fundamental Trilemma in cross-NIC RDMA failover (Exactly-Once Execution, Receiver-NIC Opacity, and a Zero-Copy datapath cannot all be preserved at once) and propose SHIFT, a novel solution that provides fault tolerance while preserving memory semantics.

Findings

01. SHIFT incurs negligible overhead during normal operation.

02. SHIFT successfully masks NIC failures and link anomalies.

03. Training continues without costly restarts despite network faults.

Abstract

Under gang scheduling for large-scale distributed large language model (LLM) training, a single network anomaly can stall or abort an entire job. Current network fault tolerance mechanisms typically adopt a "fallback and bypass" approach within the switching fabric and at the access layer, tolerating in-network and access-layer failures. We explore whether RDMA fault tolerance can be extended to the cross-NIC level by failing over traffic to intra-host backup NICs. For the first time, we prove a fundamental Trilemma: it is impossible to have cross-NIC RDMA failover that simultaneously preserves Exactly-Once Execution, Receiver-NIC Opacity, and a Zero-Copy datapath. Fortunately, we observe that dominant training frameworks (e.g., NCCL) rely on idempotent bulk transfers that tolerate relaxed memory ordering, as long as notification ordering is preserved. Leveraging this insight, we…
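The observation about idempotent bulk transfers can be illustrated with a toy model. The sketch below is not SHIFT's implementation (the abstract is truncated before the mechanism is described); it only simulates, in plain Python, why replaying writes on a backup path is harmless when each write carries the same bytes to the same offset and the notification is issued last. The function name `transfer_with_failover` and its parameters are hypothetical.

```python
import random


def transfer_with_failover(payload, chunk, fail_after, seed=0):
    """Toy model (an assumption, not the paper's design): write `payload`
    in chunks over a primary path that dies after `fail_after` chunks,
    then replay every chunk on a backup path and notify last."""
    rng = random.Random(seed)
    recv = bytearray(len(payload))          # stands in for receiver memory
    offsets = list(range(0, len(payload), chunk))

    # Primary path: chunks land in arbitrary order (relaxed memory ordering).
    primary = offsets[:]
    rng.shuffle(primary)
    for off in primary[:fail_after]:
        recv[off:off + chunk] = payload[off:off + chunk]
    # The primary path fails here. The sender does not know which writes
    # landed, so it cannot resume from a precise point.

    # Backup path: replay all chunks. A duplicate write stores the same
    # bytes at the same offset, so re-execution is harmless (idempotent).
    replay = offsets[:]
    rng.shuffle(replay)
    for off in replay:
        recv[off:off + chunk] = payload[off:off + chunk]

    # The notification is issued only after every replayed write, so a
    # receiver that waits for it never observes a partial buffer.
    notified = True
    return bytes(recv), notified


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    received, notified = transfer_with_failover(data, chunk=64, fail_after=5)
    assert received == data and notified
```

The model gives up exactly-once execution (some chunks are written twice) in exchange for not having to track which writes completed before the failure, which is one way to read the relaxation the abstract describes.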

Taxonomy

Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · Distributed systems and fault tolerance