More for Less: Integrating Capability-Predominant and Capacity-Predominant Computing
Zhong Zheng, Michael E. Papka, Zhiling Lan

TL;DR
This paper investigates integrating capability-predominant and capacity-predominant workloads in HPC systems to improve resource utilization and reduce costs through workload fusion and injection strategies, supported by real workload analysis and simulations.
Contribution
The paper presents a comprehensive study of unifying siloed HPC platforms, analyzing real workloads and exploring workload fusion and injection methods for better resource efficiency.
Findings
Workload fusion can improve resource utilization in unified HPC systems.
Workload injection identifies opportunities to leverage capacity jobs on capability systems.
Simulations show potential cost reductions and efficiency gains from integration.
Abstract
Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct computing workloads. For example, Theta at the Argonne Leadership Computing Facility (ALCF) primarily serves capability jobs, while Cori at the National Energy Research Scientific Computing Center (NERSC) predominantly handles capacity workloads. However, this segregation often leads to inefficient resource utilization and higher costs due to the need for operating separate computing platforms. This work examines what-if scenarios for integrating siloed platforms. Specifically, we collect and characterize two real workloads from production systems at DOE laboratories, representing capability-predominant and capacity-predominant computing, respectively.…
Taxonomy
Topics: Big Data and Business Intelligence · Cloud Computing and Resource Management · Computability, Logic, AI Algorithms
