An implementation of tensor product patch smoothers on GPU
Cu Cui, Paul Grosse-Bley, Guido Kanschat, Robert Strzodka

TL;DR
This paper introduces a GPU implementation of tensor product vertex-patch smoothers for higher-order finite element methods. By keeping local operations in on-chip memory, the optimized kernel runs at least 2 times faster than a straightforward implementation.
Contribution
It presents a GPU implementation that minimizes global data transfer and achieves a conflict-free memory access pattern, enabling faster multigrid smoothing for higher-order finite element methods in 2D and 3D.
Findings
At least 2x speedup over a straightforward implementation for the Poisson problem
Achieves up to 36% of peak performance on an Nvidia A100 GPU
Effective in both single and double precision
Abstract
We present a GPU implementation of vertex-patch smoothers for higher-order finite element methods in two and three dimensions. Analysis shows that they are not memory bound with respect to GPU DRAM, but with respect to on-chip scratchpad memory. Multigrid operations are optimized through localization and reorganized local operations in on-chip memory, achieving minimal global data transfer and a conflict-free memory access pattern. Performance tests demonstrate that the optimized kernel is at least 2 times faster than the straightforward implementation for the Poisson problem, across various polynomial degrees in 2D and 3D, achieving up to 36% of the peak performance in both single and double precision on an Nvidia A100 GPU.
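The abstract does not spell out what the local operations look like. As a rough illustration of why tensor product structure matters, the sketch below assumes a 2D Cartesian patch whose local Poisson operator has the separable form A ⊗ M + M ⊗ A, with A and M being 1D stiffness and mass matrices. It applies that operator using only the small 1D matrices and checks the result against the explicitly assembled Kronecker matrix. The matrices, sizes, and function names are hypothetical and chosen for illustration; this is plain CPU code, not the paper's GPU kernel.

```python
# Sketch: apply (A ⊗ M + M ⊗ A) to a vector without forming the big matrix.
# Assumption: 2D Cartesian patch with separable operator; names are hypothetical.

def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*X)]

def kron(B, C):
    """Explicit Kronecker product of square matrices (reference only)."""
    n, m = len(B), len(C)
    return [[B[i][k] * C[j][l] for k in range(n) for l in range(m)]
            for i in range(n) for j in range(m)]

def apply_tensor(A, M, u):
    """Compute (A ⊗ M + M ⊗ A) u via small 1D matrix products.

    With row-major ordering, (B ⊗ C) vec(U) = vec(B U C^T).
    """
    n = len(A)
    U = [u[i * n:(i + 1) * n] for i in range(n)]  # reshape u to n x n
    T1 = matmul(matmul(A, U), transpose(M))       # A U M^T
    T2 = matmul(matmul(M, U), transpose(A))       # M U A^T
    return [T1[i][j] + T2[i][j] for i in range(n) for j in range(n)]

def apply_dense(A, M, u):
    """Reference: assemble the full n^2 x n^2 matrix and multiply."""
    K1, K2 = kron(A, M), kron(M, A)
    N = len(u)
    return [sum((K1[r][c] + K2[r][c]) * u[c] for c in range(N))
            for r in range(N)]

# Hypothetical 1D matrices: linear elements, unit mesh width, 3 interior nodes.
A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
M = [[4 / 6, 1 / 6, 0.0], [1 / 6, 4 / 6, 1 / 6], [0.0, 1 / 6, 4 / 6]]
u = [float(k) for k in range(1, 10)]

y_fast = apply_tensor(A, M, u)
y_ref = apply_dense(A, M, u)
max_diff = max(abs(a - b) for a, b in zip(y_fast, y_ref))
```

In this form the working set per patch is a handful of small 1D matrices plus the local vector, which is the kind of data that can plausibly be kept in on-chip memory rather than streamed from DRAM.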
Taxonomy
Topics: Tensor decomposition and applications · Computational Physics and Python Applications · Distributed and Parallel Computing Systems
