Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

M. A. Clark, Alexei Strelchenko, Alejandro Vaquero, Mathias Wagner, and Evan Weinberg

arXiv:1710.09745·hep-lat·August 9, 2018·Comput. Phys. Commun.

TL;DR

This paper introduces an efficient GPU implementation of a block-CG solver for lattice QCD simulations that eases the memory-bandwidth bottleneck and achieves a 5x speedup over traditional methods.

Contribution

The paper presents a GPU implementation of a block-CG solver, built on the QUDA library, that reduces the memory-bandwidth complexity of the solver's vector-vector operations from quadratic to linear, enabling faster lattice QCD computations.

Findings

01

Achieved a 5x speedup over existing methods.

02

Reduced vector-vector operation complexity from quadratic to linear.

03

Demonstrated effectiveness on NVIDIA's SaturnV cluster.
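
As a rough illustration of the second finding, the sketch below counts how many vector elements move through memory for a block update Y += X C with s vectors of length n. It is an idealised cost model, not code from the paper or from QUDA: the function names are hypothetical, and the fused count assumes every vector can be loaded once and reused from registers or cache.

```python
def traffic_separate_axpys(s, n):
    """Elements moved when Y += X C is issued as s*s independent axpy calls.

    Each call y_j += c_ij * x_i reads x_i, reads y_j and writes y_j,
    so it streams 3*n elements, and there are s*s such calls.
    """
    return 3 * n * s * s


def traffic_fused_update(s, n):
    """Elements moved if each x_i and y_j is loaded once and each y_j stored once.

    Assumption: the s-by-s reuse happens entirely in registers or cache.
    """
    return 3 * n * s


if __name__ == "__main__":
    n = 1_000_000  # illustrative vector length
    for s in (1, 4, 16):
        ratio = traffic_separate_axpys(s, n) / traffic_fused_update(s, n)
        print(f"s={s:2d}: separate/fused traffic ratio = {ratio:.0f}")
```

Under these assumptions the traffic ratio equals s, which is the sense in which the cost drops from quadratic to linear in the number of vectors.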

Abstract

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of…
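
To make the structure described in the abstract concrete, here is a minimal NumPy sketch of a textbook block-CG iteration. It is not the QUDA implementation: the function name `block_cg`, the dense matrix `A`, and the stopping rule are assumptions chosen for illustration. The comments mark the single batched matrix-vector product per iteration and the s-by-s Gram matrices and block updates whose cost grows quadratically with the number of right-hand sides s.

```python
import numpy as np


def block_cg(A, B, tol=1e-8, max_iter=200):
    """Solve A X = B for a Hermitian positive-definite A and s right-hand sides.

    Illustrative only: assumes the residual columns stay linearly independent,
    so the small s-by-s systems below remain solvable.
    """
    X = np.zeros_like(B)
    R = B.copy()                 # residual block (X starts at zero)
    P = R.copy()                 # search-direction block
    rr = R.conj().T @ R          # s-by-s Gram matrix: s*s inner products
    b_norm = np.linalg.norm(B, axis=0)
    for it in range(1, max_iter + 1):
        Q = A @ P                # one batched matrix-vector product for all s columns
        alpha = np.linalg.solve(P.conj().T @ Q, rr)
        X += P @ alpha           # block update: s*s scalar-vector products
        R -= Q @ alpha
        if np.all(np.linalg.norm(R, axis=0) <= tol * b_norm):
            return X, it
        rr_new = R.conj().T @ R
        beta = np.linalg.solve(rr, rr_new)
        P = R + P @ beta         # all s directions share one Krylov space
        rr = rr_new
    return X, max_iter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, s = 200, 4
    M = rng.standard_normal((n, n))
    A = M @ M.T / n + np.eye(n)  # small, well-conditioned SPD test matrix
    B = rng.standard_normal((n, s))
    X, iters = block_cg(A, B)
    print("iterations:", iters)
    print("max residual:", np.abs(A @ X - B).max())
```

In a production solver the products `P.conj().T @ Q`, `P @ alpha`, and `Q @ alpha` are the vector-vector operations whose memory traffic matters; here NumPy simply hands them to a matrix-multiply routine.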
