Is Disaggregation possible for HPC Cognitive Simulation?
Michael R Wyatt II, Valen Yamamoto, Zoe Tosi, Ian Karlin, Brian Van Essen

TL;DR
This paper investigates the feasibility of disaggregating AI accelerators from compute nodes to improve the efficiency of in-loop surrogate model inference in HPC cognitive simulations, comparing it with traditional node-local GPU approaches.
Contribution
The paper explores using disaggregated AI accelerators for HPC cognitive simulation workloads, an approach that has not been extensively studied.
Findings
Disaggregated accelerators may reduce latency for in-the-loop inference.
Comparison shows trade-offs between disaggregated and node-local GPU accelerators.
Disaggregation may improve resource utilization and scalability in HPC workflows.
Abstract
Cognitive simulation (CogSim) is an important and emerging workflow for HPC scientific exploration and scientific machine learning (SciML). One challenging workload for CogSim is the replacement of one component in a complex physical simulation with a fast, learned, surrogate model that is "inside" of the computational loop. The execution of this in-the-loop inference is particularly challenging because it requires frequent inference across multiple possible target models, can be on the simulation's critical path (latency bound), is subject to requests from multiple MPI ranks, and typically contains a small number of samples per request. In this paper we explore the use of large, dedicated Deep Learning / AI accelerators that are disaggregated from compute nodes for this CogSim workload. We compare the trade-offs of using these accelerators versus the node-local GPU accelerators on…
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
