SDR-RDMA: Software-Defined Reliability Architecture for Planetary Scale RDMA Communication
Mikhail Khalilov, Siyuan Shen, Marcin Chrapek, Tiancheng Chen, Kenji Nakano, Peter-Jan Gootzen, Salvatore Di Girolamo, Rami Nudelman, Gil Bloch, Sreevatsa Anantharamu, Mahmoud Elhaddad, Jithin Jose, Abdul Kabbani, Scott Moe, Konstantin Taranov, Zhuolong Yu, Jie Zhang

TL;DR
SDR-RDMA introduces a flexible, software-defined reliability architecture for RDMA that allows custom reliability schemes, improving efficiency and performance for large-scale inter-datacenter AI training.
Contribution
It proposes a novel software-defined reliability stack with a receive buffer bitmap, enabling customizable reliability schemes on existing hardware.
Findings
Achieves line-rate performance with offloaded reliability backend.
Supports partial message completion for tailored reliability.
Enhances efficiency of inter-datacenter RDMA communication.
Abstract
RDMA is vital for efficient distributed training across datacenters, but millisecond-scale latencies complicate the design of its reliability layer. We show that depending on long-haul link characteristics, such as drop rate, distance and bandwidth, the widely used Selective Repeat algorithm can be inefficient, warranting alternatives like Erasure Coding. To enable such alternatives on existing hardware, we propose SDR-RDMA, a software-defined reliability stack for RDMA. Its core is a lightweight SDR SDK that extends standard point-to-point RDMA semantics -- fundamental to AI networking stacks -- with a receive buffer bitmap. SDR bitmap enables partial message completion to let applications implement custom reliability schemes tailored to specific deployments, while preserving zero-copy RDMA benefits. By offloading the SDR backend to NVIDIA's Data Path Accelerator (DPA), we achieve…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCloud Computing and Resource Management · Software-Defined Networks and 5G · Advanced Data Storage Technologies
