Closing the HPC-Cloud Convergence Gap: Multi-Tenant Slingshot RDMA for Kubernetes
Philipp A. Friese, Ahmed Eleliemy, Utz-Uwe Haus, Martin Schulz

TL;DR
This paper presents an extension to the HPE Slingshot host software stack that enables secure, multi-tenant RDMA communication in Kubernetes, facilitating converged HPC-Cloud environments with minimal performance overhead.
Contribution
We designed and implemented a Slingshot stack extension that supports secure, container-level multi-tenant RDMA networking for Kubernetes, addressing a key gap in converged HPC-Cloud systems.
Findings
Achieved secure, container-granular RDMA access with minimal overhead.
Enabled multi-tenant HPC-Cloud networking on Slingshot hardware.
Demonstrated effective integration with Kubernetes for converged workloads.
Abstract
Converged HPC-Cloud computing is an emerging computing paradigm that aims to support increasingly complex and multi-tenant scientific workflows. These systems must reconcile the isolation requirements of native cloud workloads with the performance demands of HPC applications. In this context, networking hardware is a critical boundary component: it is the conduit for high-throughput, low-latency communication and enables isolation across tenants. HPE Slingshot is a high-speed network interconnect that provides up to 200 Gbps of throughput per port and targets high-performance computing (HPC) systems. The Slingshot host software, including hardware drivers and network middleware libraries, is designed for HPC deployments, which predominantly use single-tenant access modes. Hence, the Slingshot stack is not suited for secure use in multi-tenant deployments, such as converged…
