Swift: Rethinking RDMA Control Plane for Elastic Computing

Junxue Zhang; Han Tian; Xinyang Huang; Wenxue Li; Kaiqiang Xu; Dian Shen; Yong Wang; Kai Chen

arXiv:2501.19051 · cs.NI · February 3, 2025

TL;DR

This paper introduces Swift, an RDMA control plane design for elastic computing that uses caching and process forking to raise throughput by 30.56-46.50% and cut latency by 18.55-37.21% in serverless environments.

Contribution

Swift rethinks RDMA control plane assumptions, proposing cache-based connection setup and fork-based resource sharing to enhance elastic computing performance.
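A minimal sketch of the cache-based setup idea, assuming that the expensive per-device part of user-space control-plane setup can be reused across requests. `DeviceContext`, `ControlPlaneCache`, `_slow_setup`, and the device name `mlx5_0` are hypothetical illustrations; the code does not call libibverbs and is not Swift's implementation. It only shows how memoizing the slow setup path lets later requests skip it.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical stand-in for state produced by slow user-space RDMA setup
# (e.g. an opened device and a protection domain). Not a real libibverbs type.
@dataclass
class DeviceContext:
    device: str
    pd_handle: int


class ControlPlaneCache:
    """Memoizes per-device setup so only the first request pays its cost."""

    def __init__(self) -> None:
        self._contexts: dict[str, DeviceContext] = {}
        self.setup_calls = 0  # number of times the slow path actually ran

    def _slow_setup(self, device: str) -> DeviceContext:
        # Hypothetical expensive path; a real system would open the device here.
        self.setup_calls += 1
        return DeviceContext(device=device, pd_handle=self.setup_calls)

    def get(self, device: str) -> DeviceContext:
        ctx = self._contexts.get(device)
        if ctx is None:  # cache miss: run the slow setup once and remember it
            ctx = self._slow_setup(device)
            self._contexts[device] = ctx
        return ctx


cache = ControlPlaneCache()
first = cache.get("mlx5_0")   # cold request: slow path runs
second = cache.get("mlx5_0")  # warm request: served from the cache
print(first is second, cache.setup_calls)  # True 1
```

The trade-off in this sketch is memory held by cached contexts versus setup time saved on every warm request; eviction is deliberately left out.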

Findings

1. Achieves 30.56-46.50% higher throughput
2. Reduces latency by 18.55-37.21%
3. Adds only 6.5% control plane overhead

Abstract

Elastic computing enables dynamic scaling to meet workload demands, and Remote Direct Memory Access (RDMA) enhances this by providing high-throughput, low-latency network communication. However, integrating RDMA into elastic computing remains a challenge, particularly in control plane operations for RDMA connection setup. This paper revisits the assumptions of prior work on high-performance RDMA for elastic computing and reveals that extreme microsecond-level control plane optimizations are often unnecessary. By challenging conventional beliefs about the slowness of the user-space RDMA control plane and the difficulty of user-space RDMA resource sharing, we uncover new design opportunities. Our key insight is that user-space RDMA connection setup can be significantly improved with caching, while RDMA resources can be efficiently shared among processes using fork. In light of this, we…
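The fork-based sharing idea can be illustrated generically: a parent process performs setup once, then forks children that inherit the already-initialized state instead of repeating the setup. This is a minimal POSIX-only sketch using Python's `os.fork`; `expensive_setup`, `fork_workers`, and the resource fields are hypothetical and nothing here touches RDMA. Real RDMA resources such as registered memory need explicit fork support (for example, libibverbs' `ibv_fork_init()`), which this sketch does not model.

```python
from __future__ import annotations

import os


def expensive_setup() -> dict:
    # Hypothetical stand-in for RDMA resource setup done once in the parent.
    return {"device": "mlx5_0", "pd_handle": 42}


def fork_workers(resource: dict, n: int) -> list[str]:
    """Fork n children that reuse the parent's already-initialized resource."""
    results = []
    for i in range(n):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: `resource` is inherited via the copied address space,
            # so no setup is repeated here.
            os.close(r)
            msg = f"worker{i}:{resource['device']}:{resource['pd_handle']}"
            os.write(w, msg.encode())
            os.close(w)
            os._exit(0)  # leave immediately; skip the parent's cleanup logic
        # Parent: close its write end so read() sees EOF when the child exits.
        os.close(w)
        with os.fdopen(r, "rb") as f:
            results.append(f.read().decode())
        os.waitpid(pid, 0)  # reap the child
    return results


shared = expensive_setup()  # setup runs exactly once, in the parent
print(fork_workers(shared, 2))
```

Children here run one at a time for simplicity; a serverless runtime would keep them alive concurrently.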


Taxonomy

Topics: Distributed and Parallel Computing Systems