TL;DR
This paper presents a CUDA-aware MPI communication scheme for high-order stencil computations that strong-scales GPU-based magnetohydrodynamics simulations from 1 to 64 GPUs at 50-87% efficiency.
Contribution
It introduces a generic GPU communication scheme using CUDA-aware MPI that enhances intra-node locality and scales efficiently across multiple GPUs for high-order stencil computations.
Findings
Strong scaling from 1 to 64 GPUs at 50-87% efficiency
20-60x speedup over CPU solvers
9-12x energy efficiency improvement on 16 nodes
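For readers unfamiliar with the metric, strong-scaling efficiency is commonly defined as the measured speedup divided by the ideal speedup for a fixed problem size. The snippet below is a minimal sketch of that definition. The function name and the timings are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def strong_scaling_efficiency(t_base, t_n, n_base, n):
    """Parallel efficiency of a run on n devices relative to a baseline
    run on n_base devices, for the same total problem size."""
    speedup = t_base / t_n   # how much faster the larger run is
    ideal = n / n_base       # speedup expected under perfect scaling
    return speedup / ideal


# Hypothetical wall-clock times per simulation step (NOT from the paper).
t_1gpu = 64.0    # seconds on 1 GPU
t_64gpu = 2.0    # seconds on 64 GPUs

eff = strong_scaling_efficiency(t_1gpu, t_64gpu, n_base=1, n=64)
print(f"{eff:.0%}")  # speedup 32 out of an ideal 64 -> 50%
```

Under this definition, an efficiency of 50% on 64 GPUs corresponds to a 32x speedup over a single GPU, and 87% to roughly a 56x speedup.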
Abstract
Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving…
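To illustrate why stencil codes need a communication scheme at all, the following is a minimal sketch of a halo (ghost-zone) exchange for a decomposed one-dimensional periodic domain. It is plain Python rather than CUDA-aware MPI, and it is not the paper's implementation: list slicing stands in for the send/receive pairs, the sixth-order, radius-3 stencil is an assumption (the abstract says only "high-order"), and all function names are hypothetical. In a CUDA-aware MPI setting, the same neighbour copies would be issued by passing GPU device pointers directly to the MPI calls.

```python
import math

# Sixth-order central first-derivative coefficients (assumed order; radius 3).
COEFFS = [-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60]
R = 3  # stencil radius = number of ghost cells needed per side


def derivative(padded, h):
    """Apply the stencil to the interior of a block that carries R ghost
    cells on each side; returns one value per interior cell."""
    n = len(padded) - 2 * R
    return [
        sum(c * padded[i + k] for k, c in enumerate(COEFFS)) / h
        for i in range(n)
    ]


def exchange_halos(subdomains):
    """Fill ghost cells from periodic neighbours. Each slice copy stands in
    for one send/receive pair between neighbouring ranks."""
    p = len(subdomains)
    padded = []
    for rank, local in enumerate(subdomains):
        left = subdomains[(rank - 1) % p][-R:]   # last R cells of left neighbour
        right = subdomains[(rank + 1) % p][:R]   # first R cells of right neighbour
        padded.append(left + local + right)
    return padded


def decomposed_derivative(field, h, nranks):
    """Split the field into equal blocks, exchange halos, differentiate."""
    size = len(field) // nranks
    subdomains = [field[r * size:(r + 1) * size] for r in range(nranks)]
    out = []
    for block in exchange_halos(subdomains):
        out.extend(derivative(block, h))
    return out


n = 64
h = 2 * math.pi / n
field = [math.sin(i * h) for i in range(n)]

single = decomposed_derivative(field, h, nranks=1)
split = decomposed_derivative(field, h, nranks=4)

# The decomposed result should match the single-domain result,
# and both should approximate cos(x).
max_diff = max(abs(a - b) for a, b in zip(single, split))
max_err = max(abs(d - math.cos(i * h)) for i, d in enumerate(split))
print(max_diff, max_err)
```

The sketch shows that each block must receive R cells from each neighbour before every stencil evaluation, so the volume of exchanged data grows with the stencil radius. This is one reason data movement tends to dominate in high-order schemes.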
