Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters
Massimo Bernaschi, Mauro G. Carrozzo, Alessandro Celestini, Giacomo Piperno, Pasqua D'Ambra

TL;DR
This paper presents an efficient GPU-accelerated implementation of a communication-reduced Conjugate Gradient method, improving scalability for large sparse linear systems in scientific computing.
Contribution
It introduces a parallel solver that fully exploits low-granularity operations and overlaps communication with computation on multi-GPU clusters.
Findings
Improves strong and weak scalability on GPU clusters for large sparse linear systems.
Reduces synchronization and communication overhead compared to standard CG.
Demonstrates effectiveness on Poisson PDE discretization benchmarks.
Abstract
Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of NVIDIA GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that…
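To make the idea of fewer global synchronizations concrete, here is a minimal sketch of the single-reduction CG recurrence commonly attributed to Chronopoulos and Gear, which corresponds to a step size of one. It is not the authors' solver: it is unpreconditioned, runs in pure Python on one process, and does not cover the s > 1 case or any GPU or communication code. The names `poisson_1d_matvec`, `fused_dots`, and `single_reduction_cg` are hypothetical and chosen only for illustration. The point it shows is that both inner products needed per iteration are computed together, so a distributed version would need one global reduction per iteration instead of two.

```python
def poisson_1d_matvec(v):
    """Apply the 1D Poisson stencil [-1, 2, -1] with zero Dirichlet boundaries."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


def fused_dots(r, w):
    """Compute (r, r) and (w, r) in one pass.

    Assumption: on a cluster, this pair would be sent in a single
    all-reduce, which is the only global synchronization per iteration.
    """
    gamma = 0.0
    delta = 0.0
    for ri, wi in zip(r, w):
        gamma += ri * ri
        delta += wi * ri
    return gamma, delta


def single_reduction_cg(matvec, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned single-reduction CG (illustrative sketch only).

    Returns (x, iterations, reductions), where `reductions` counts calls
    to fused_dots, i.e. the would-be global synchronization points.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # x0 = 0, so r0 = b
    w = matvec(r)                    # w = A r
    gamma, delta = fused_dots(r, w)
    reductions = 1
    b_norm2 = gamma
    if gamma == 0.0:
        return x, 0, reductions
    alpha = gamma / delta
    beta = 0.0
    p = [0.0] * n                    # search direction
    s = [0.0] * n                    # s = A p, updated by recurrence
    for it in range(1, max_iter + 1):
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        s = [wi + beta * si for wi, si in zip(w, s)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = matvec(r)
        gamma_new, delta = fused_dots(r, w)   # the single sync point
        reductions += 1
        if gamma_new <= tol * tol * b_norm2:
            return x, it, reductions
        beta = gamma_new / gamma
        # (p, A p) for the next direction, obtained without an extra dot product
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        gamma = gamma_new
    return x, max_iter, reductions


if __name__ == "__main__":
    n = 50
    b = poisson_1d_matvec([1.0] * n)          # right-hand side with known solution of all ones
    x, iters, reductions = single_reduction_cg(poisson_1d_matvec, b)
    print(iters, reductions, max(abs(xi - 1.0) for xi in x))
```

Textbook CG evaluates its two inner products at different points of the iteration, so a distributed run synchronizes twice per step. The recurrence above trades one extra vector update for a single fused reduction. The s-step method in the abstract goes further by grouping the inner products of several steps into one block reduction, which this sketch does not attempt.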
