Towards an Adaptive Runtime System for Cloud-Native HPC
Aditya Bhosale, Advait Tahilyani, Laxmikant Kale, Sara Kokkila-Schumacher

TL;DR
This paper presents a Charm++-based adaptive runtime system that lets HPC applications run efficiently and resiliently on dynamic, heterogeneous cloud infrastructure through rate-aware load balancing and extended resource management.
Contribution
It introduces novel strategies for load balancing and resource management in Charm++ to support heterogeneous CPU and GPU cloud instances with minimal disruption.
Findings
Rate-aware load balancing improves performance on heterogeneous cloud instances.
Charm++ mitigates performance issues caused by network contention and variability.
Support for GPU and CPU spot instances reduces costs with minimal overhead.
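The idea behind rate-aware load balancing can be illustrated with a small, self-contained sketch. It is not the Charm++ load balancer or its API. The function names, the greedy heaviest-first strategy, and the sample work and rate values are all assumptions chosen for illustration. The sketch places each task on the processing element (PE) that would finish it earliest given that PE's relative speed, then compares the result with a placement that assumes all PEs are equally fast.

```python
def rate_aware_assign(task_work, pe_rates):
    """Greedy placement (hypothetical sketch): heaviest tasks first, each on the
    PE with the earliest projected finish time, (assigned work + task) / rate."""
    assigned_work = [0.0] * len(pe_rates)
    placement = {}
    # Visit tasks from heaviest to lightest; ties keep their original order.
    for task_id in sorted(range(len(task_work)), key=lambda t: -task_work[t]):
        best_pe = min(
            range(len(pe_rates)),
            key=lambda p: (assigned_work[p] + task_work[task_id]) / pe_rates[p],
        )
        assigned_work[best_pe] += task_work[task_id]
        placement[task_id] = best_pe
    return placement, assigned_work


def makespan(assigned_work, pe_rates):
    """Time until the slowest-finishing PE completes its assigned work."""
    return max(work / rate for work, rate in zip(assigned_work, pe_rates))


# Illustrative inputs (assumed): six equal tasks, one PE twice as fast as the other.
tasks = [4.0] * 6
true_rates = [2.0, 1.0]

# Rate-aware: decisions use the true rates.
_, aware_work = rate_aware_assign(tasks, true_rates)
aware_time = makespan(aware_work, true_rates)          # 8.0

# Rate-oblivious baseline: decisions assume equal rates, but execution
# still happens at the true rates.
_, oblivious_work = rate_aware_assign(tasks, [1.0, 1.0])
oblivious_time = makespan(oblivious_work, true_rates)  # 12.0
```

In this toy case the rate-aware placement gives the faster PE twice as much work, so both PEs finish together, while the rate-oblivious placement leaves the faster PE idle as the slower one finishes.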
Abstract
The ongoing convergence of HPC and cloud computing presents a fundamental challenge: HPC applications, designed for static and homogeneous supercomputers, are ill-suited for the dynamic, heterogeneous, and volatile nature of the cloud. Traditional parallel programming models like MPI struggle to leverage key cloud advantages, such as resource elasticity and low-cost spot instances, while also failing to address challenges like performance variability and processor heterogeneity. This paper demonstrates how the asynchronous, message-driven paradigm of the Charm++ parallel runtime system can bridge this gap. We present a set of tools and strategies that enable HPC applications to run efficiently and resiliently on dynamic cloud infrastructure across both CPU and GPU resources. Our work makes two key contributions. First, we demonstrate that rate-aware load balancing in Charm++ improves…
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
