Varuna: Enabling Failure-Type Aware RDMA Failover
Xiaoyang Wang, Yongkun Li, Lulu Yao, Guoli Wei, Longcheng Yang, Yinlong Xu, Weiqing Kong, Weiguang Wang, Peng Dong, Bingyang Liu

TL;DR
Varuna is a failure-type-aware RDMA recovery mechanism that improves failover efficiency by selectively retransmitting requests based on whether they were executed before a link failure.
Contribution
It introduces a lightweight completion log that enables correct, selective retransmission of RDMA requests, reducing recovery time and overhead during link failures.
Findings
Reduces recovery retransmission time by 65%.
Incurs only 0.6-10% latency overhead in steady state.
Preserves transactional consistency during failover.
Abstract
RDMA link failures can render connections temporarily unavailable, causing both performance degradation and significant recovery overhead. To tolerate such failures, production datacenters pair each primary link with a standby link and, upon failure, uniformly retransmit all in-flight RDMA requests over the backup path. However, we observe that such blanket retransmission is unnecessary. In-flight requests can be split into pre-failure and post-failure categories depending on whether the responder has already executed them. Retransmitting post-failure requests is not only redundant (consuming bandwidth), but also incorrect for non-idempotent operations, where duplicate execution can violate application semantics. We present Varuna, a failure-type-aware RDMA recovery mechanism that enables correct retransmission and µs-level failover. Varuna piggybacks a lightweight completion log on every…
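The selective retransmission idea in the abstract can be illustrated with a small sketch. This is not Varuna's implementation: the `Request` type, the per-connection sequence numbers, the `executed_seqs` set standing in for the completion log, and the `send_on_backup` callback are all hypothetical names chosen for illustration. The sketch only shows the partitioning step, assuming the requester can learn which in-flight requests the responder already executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    seq: int      # hypothetical per-connection sequence number
    opcode: str   # e.g. "WRITE" or "FETCH_ADD" (non-idempotent)


def split_in_flight(in_flight, executed_seqs):
    """Partition in-flight requests by whether the responder executed them.

    `executed_seqs` is an assumed stand-in for the information a
    completion log would provide.
    """
    executed = [r for r in in_flight if r.seq in executed_seqs]
    pending = [r for r in in_flight if r.seq not in executed_seqs]
    return executed, pending


def failover(in_flight, executed_seqs, send_on_backup):
    """Retransmit only the requests the responder has not executed."""
    executed, pending = split_in_flight(in_flight, executed_seqs)
    for request in pending:
        send_on_backup(request)  # hypothetical send over the standby link
    return executed, pending


if __name__ == "__main__":
    in_flight = [
        Request(1, "WRITE"),
        Request(2, "FETCH_ADD"),
        Request(3, "WRITE"),
    ]
    sent = []
    failover(in_flight, executed_seqs={1, 2}, send_on_backup=sent.append)
    # Only request 3 is resent; the already-executed FETCH_ADD is not duplicated.
    assert [r.seq for r in sent] == [3]
```

Blanket retransmission would correspond to resending all three requests, which would apply the `FETCH_ADD` twice.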
