Kub: Enabling Elastic HPC Workloads on Containerized Environments
Daniel Medeiros, Jacob Wahlgren, Gabin Schieffer, Ivy Peng

TL;DR
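Kub is a methodology for running HPC workloads elastically on Kubernetes: the resources allocated to a job can be scaled during execution, and the original allocation is reused as much as possible to minimize disruption. It is evaluated on one synthetic benchmark and two production-level MPI applications, GROMACS and CM1.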
Contribution
This work presents a novel approach for elastic resource management in HPC on Kubernetes, enabling dynamic scaling with minimal disruption and application-specific optimization.
Findings
Up to 2x speedup when scaling points are chosen well.
Resource adaptation benefits vary with workload characteristics.
Overhead of checkpointing influences scaling decisions.
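The interplay between scaling point and checkpointing overhead in the findings above can be illustrated with a back-of-the-envelope model. This is a minimal sketch, not the paper's cost model: it assumes a single scaling event, a constant processing rate before and after it, and a fixed overhead for the checkpoint and restart. The function names and parameters (`elastic_runtime`, `worth_scaling`, `scale_at`, `overhead`) are hypothetical.

```python
def elastic_runtime(total_work, old_rate, new_rate, scale_at, overhead):
    """Estimate wall time for a job that scales once.

    Assumed (hypothetical) model:
      total_work : amount of work, in arbitrary units
      old_rate   : work units per second on the original allocation
      new_rate   : work units per second after scaling
      scale_at   : fraction of the work done before scaling (0.0 to 1.0)
      overhead   : seconds lost to checkpointing and restarting
    """
    if not 0.0 <= scale_at <= 1.0:
        raise ValueError("scale_at must be between 0 and 1")
    before = scale_at * total_work / old_rate
    after = (1.0 - scale_at) * total_work / new_rate
    return before + overhead + after


def worth_scaling(total_work, old_rate, new_rate, scale_at, overhead):
    """Return True if the elastic run beats the static run in this model."""
    static = total_work / old_rate
    elastic = elastic_runtime(total_work, old_rate, new_rate, scale_at, overhead)
    return elastic < static


if __name__ == "__main__":
    # Doubling the rate halfway through, with a 10 s overhead.
    print(elastic_runtime(100, 1.0, 2.0, 0.5, 10))
    # Scaling late leaves too little work to recover the overhead.
    print(worth_scaling(100, 1.0, 2.0, 0.9, 10))
```

In this simplified model, a 2x speedup is only reached when scaling happens at the very start with negligible overhead; the later the scaling point, the less remaining work there is to amortize the checkpoint cost.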
Abstract
The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during the execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during the execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications -- GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on…
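The idea of maximizing reuse of the originally allocated resources can be sketched as a simple set-planning step. This is an illustration under stated assumptions, not Kub's actual procedure: nodes are identified by plain strings, the function `plan_allocation` is hypothetical, and no Kubernetes API is involved. When growing, every current node is kept and only the shortfall is added; when shrinking, only the surplus is released.

```python
def plan_allocation(current, free, target):
    """Plan a resize to `target` nodes while keeping as many current nodes as possible.

    current : list of node names the job already holds
    free    : list of node names available in the cluster
    target  : desired number of nodes after scaling

    Returns a tuple (keep, add, release) of node-name lists.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    current = list(current)
    # Ignore any node that is listed as free but is already in use by the job.
    candidates = [n for n in free if n not in current]

    if target <= len(current):
        # Scale down (or no change): keep a prefix, release the rest.
        return current[:target], [], current[target:]

    # Scale up: keep everything and add only the missing nodes.
    needed = target - len(current)
    if needed > len(candidates):
        raise ValueError("not enough free nodes to reach target")
    return current, candidates[:needed], []


if __name__ == "__main__":
    keep, add, release = plan_allocation(["n1", "n2"], ["n3", "n4", "n5"], 4)
    print(keep, add, release)
```

Keeping the existing nodes in the new allocation means the processes already running there do not have to be moved, which is one plausible reading of how reuse limits disruption.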
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
