Fusion research using Azure A100 HPC instances
Igor Sfiligoi, Jeff Candy, Devarajan Subramanian

TL;DR
This paper demonstrates running complex fusion plasma simulations with CGYRO on NVIDIA A100 GPUs in Azure HPC, and compares performance against older-generation CPU and GPU resources.
Contribution
It presents the first experience of deploying CGYRO on Azure A100 HPC instances, showcasing the potential of cloud-based high-performance fusion simulations.
Findings
CGYRO runs efficiently on Azure A100 GPUs with InfiniBand networking.
Performance compares favorably with older-generation CPU and GPU resources in Azure.
Cloud HPC can meet the demanding requirements of large-scale fusion simulations.
Abstract
Fusion simulations have in the past required the use of leadership-scale HPC resources to produce advances in physics. One such package is CGYRO, a premier multi-scale plasma turbulence simulation code. CGYRO is a typical HPC application that would not fit into a single node, as it requires O(100 GB) of memory and O(100 TFLOPS) worth of compute for relevant simulations. When distributed across multiple nodes, CGYRO requires high-throughput and low-latency networking to effectively use the compute resources. While in the past such compute may have required hundreds or even thousands of nodes, recent advances in hardware capabilities allow for just a couple of nodes to deliver the necessary compute power. This paper presents our experience running CGYRO on NVIDIA A100 GPUs on InfiniBand-connected HPC resources in the Microsoft Azure Cloud. A comparison to older generation CPU and GPU…
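The sizing argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the helper name nodes_needed is hypothetical, and the per-node figures (8 GPUs per node, 40 GB of memory and 9.7 peak FP64 TFLOPS per GPU) are assumptions based on commonly quoted A100 specifications. It also treats the order-of-magnitude requirements as literal values and uses peak rather than sustained compute rates, so the result is only a rough indication.

```python
import math


def nodes_needed(mem_gb, tflops, node_mem_gb, node_tflops):
    """Smallest node count that satisfies both the memory and the compute need."""
    by_memory = math.ceil(mem_gb / node_mem_gb)
    by_compute = math.ceil(tflops / node_tflops)
    return max(by_memory, by_compute)


# Assumed per-node figures (not from the paper): 8 GPUs per node,
# each with 40 GB of memory and 9.7 peak FP64 TFLOPS.
GPUS_PER_NODE = 8
node_mem_gb = GPUS_PER_NODE * 40
node_tflops = GPUS_PER_NODE * 9.7

# Requirements from the abstract, read literally: 100 GB and 100 TFLOPS.
estimate = nodes_needed(100, 100, node_mem_gb, node_tflops)
print(estimate)  # prints 2
```

Under these assumptions compute, not memory, is the binding constraint, and the estimate lands in the "couple of nodes" range the abstract describes. Real simulations would need more headroom, since sustained performance is typically well below peak.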
Taxonomy
Topics: Advanced Data Storage Technologies · Magnetic confinement fusion research · Distributed and Parallel Computing Systems
