Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs
Chetan Jhurani

TL;DR
This paper introduces a fast, memory-efficient GPU implementation for batched Kronecker products on small matrices and arrays, optimized for finite element applications and outperforming generic GEMM-based methods.
Contribution
A specialized GPU algorithm for batched Kronecker products that is faster and more memory-efficient than existing GEMM-based approaches, tailored for small matrix sizes.
Findings
Achieves up to 285 GFlop/s for single precision on matrices of size 16
Faster and uses less memory than generic GEMM-based methods
Targets matrix sizes up to 16, the typical polynomial degrees in finite elements
Abstract
We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours [1] or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory, partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in finite elements, but the implementation can be easily extended to other sizes. We obtain 143…
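The matrix-free "Kronecker product action" mentioned in the abstract rests on a standard identity: for column-major vectorization, (A ⊗ B) vec(X) = vec(B X Aᵀ), so the large Kronecker matrix never has to be formed. The following is a minimal CPU sketch of that identity in pure Python, with shared A and B and a batch of operands X. It is not the paper's GPU implementation or interface; all function names (`kron`, `kron_action`, `batched_kron_action`, and the helpers) are hypothetical and chosen only for illustration.

```python
def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def vec(X):
    """Column-major vectorization: stack the columns of X."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


def kron(A, B):
    """Explicit Kronecker product A (x) B, used here only as a reference."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def kron_action(A, B, X):
    """Matrix-free action: returns B X A^T, whose vec equals (A (x) B) vec(X)."""
    return matmul(matmul(B, X), transpose(A))


def batched_kron_action(A, B, Xs):
    """Apply the same A and B to every operand in the batch Xs."""
    return [kron_action(A, B, X) for X in Xs]


# Illustrative data (assumed, not from the paper): shared A and B, two operands.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
Xs = [[[1, 0], [0, 1]], [[2, 3], [5, 7]]]

results = batched_kron_action(A, B, Xs)
reference = [matvec(kron(A, B), vec(X)) for X in Xs]
```

For n × n matrices, the explicit route stores an n² × n² matrix and costs about 2n⁴ flops per operand, while the matrix-free route needs only A, B, and X and costs about 4n³ flops (two n × n products). On a GPU each of those two products could be issued as a batched GEMM, which is the generic approach the abstract compares against.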
Taxonomy
Topics: Matrix Theory and Algorithms · Model Reduction and Neural Networks · VLSI and FPGA Design Techniques
