Efficient executions of Pipelined Conjugate Gradient Method on Heterogeneous Architectures
Manasi Tiwari, Sathish Vadhiyar

TL;DR
This paper presents three methods for efficiently executing the Pipelined Preconditioned Conjugate Gradient (PCG) algorithm on heterogeneous CPU-GPU architectures, reporting speedups of up to 8x over existing CPU implementations and up to 5x over existing GPU implementations.
Contribution
It introduces task-parallel and data-parallel strategies for the Pipelined PCG method on heterogeneous systems, including a workload decomposition between CPU and GPU that is driven by a performance model.
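As a rough illustration of what a performance-model-based decomposition can look like, the sketch below splits matrix rows between CPU and GPU in proportion to their measured speeds from a few trial runs. It is a simplified 1D row split under assumed inputs, not the paper's 2D decomposition; the function name `split_rows` and the timing parameters are hypothetical.

```python
def split_rows(n_rows, cpu_time, gpu_time):
    """Split n_rows between CPU and GPU in proportion to measured speed.

    cpu_time and gpu_time are assumed to be the wall-clock times (in seconds)
    each device needed for the same trial workload. Hypothetical helper,
    not taken from the paper.
    """
    if cpu_time <= 0 or gpu_time <= 0:
        raise ValueError("trial timings must be positive")
    cpu_speed = 1.0 / cpu_time
    gpu_speed = 1.0 / gpu_time
    # The faster device receives a proportionally larger share of the rows.
    gpu_share = gpu_speed / (cpu_speed + gpu_speed)
    gpu_rows = round(n_rows * gpu_share)
    cpu_rows = n_rows - gpu_rows  # remainder goes to the CPU, so nothing is lost
    return cpu_rows, gpu_rows


if __name__ == "__main__":
    # Example: the GPU finished the trial workload four times faster than the CPU.
    print(split_rows(1000, cpu_time=4.0, gpu_time=1.0))
```

With this kind of split, both devices should finish their share at roughly the same time, provided the trial timings are representative of the full workload.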
Findings
Up to 8x speedup over CPU implementations
Up to 5x speedup over GPU implementations
Effective handling of large problems exceeding GPU memory
Abstract
The Preconditioned Conjugate Gradient (PCG) method is widely used for solving linear systems of equations with sparse matrices. A recent version of PCG, Pipelined PCG, eliminates the dependencies in the computations of the PCG algorithm so that the non-dependent computations can be overlapped with communication. In this paper, we propose three methods for efficient execution of the Pipelined PCG algorithm on GPU-accelerated heterogeneous architectures. The first two methods achieve task parallelism using asynchronous executions of different tasks on CPU cores and GPU. The third method achieves data parallelism by decomposing the workload between CPU and GPU based on a performance model. The performance model takes into account the relative performance of CPU cores and GPU using some initial executions and performs 2D data decomposition. We also implement optimization strategies like…
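For readers unfamiliar with the baseline, here is a minimal sketch of the standard (non-pipelined) PCG iteration on a small dense system, assuming a Jacobi (diagonal) preconditioner. It is not the paper's implementation; it only marks the dot products that act as dependent synchronization points, which are the dependencies that pipelined variants restructure so the reductions can overlap with other work. All function names are illustrative.

```python
def matvec(A, x):
    """Dense matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    """Dot product; in a distributed setting this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=100):
    """Standard PCG with a Jacobi preconditioner (illustrative assumption)."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # M^{-1} = diag(A)^{-1}
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    u = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(u)
    gamma = dot(r, u)
    for _ in range(max_iter):
        s = matvec(A, p)
        delta = dot(s, p)          # reduction 1: must finish before alpha
        alpha = gamma / delta
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        if dot(r, r) ** 0.5 < tol:  # convergence check on the residual norm
            break
        u = [d * ri for d, ri in zip(inv_diag, r)]
        gamma_new = dot(r, u)      # reduction 2: must finish before beta
        beta = gamma_new / gamma
        p = [ui + beta * pi for ui, pi in zip(u, p)]
        gamma = gamma_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]  # small symmetric positive definite matrix
    b = [1.0, 2.0]
    print(pcg(A, b))
```

In this form each reduction blocks the next vector update, so a parallel run stalls twice per iteration; that stall is what motivates the pipelined formulation described in the abstract.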
Taxonomy
Topics: Matrix Theory and Algorithms · Tensor decomposition and applications · Electromagnetic Scattering and Analysis
