Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

Chetan Jhurani

arXiv:1304.7054·cs.MS·April 29, 2013·1 cites

TL;DR

This paper introduces a fast, memory-efficient GPU implementation for batched Kronecker products on small matrices and arrays, optimized for finite element applications and outperforming generic GEMM-based methods.

Contribution

A specialized GPU algorithm for batched Kronecker products that is faster and more memory-efficient than existing GEMM-based approaches, tailored for small matrix sizes.

Findings

01

Achieves up to 285 GFlop/s for single precision on matrices of size 16

02

Faster and uses less memory than generic GEMM-based methods

03

Targets matrix sizes up to 16, which cover the typical polynomial degrees in finite element methods

Abstract

We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any batched GEMM (General Matrix Multiply) implementation, for example ours [1] or the one in cuBLAS, can also be used for performing batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory. This is partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in Finite Elements, but the implementation can be easily extended for other sizes. We obtain 143…
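
To make the "Kronecker product action" concrete, the following is a minimal CPU sketch in plain Python, not the paper's GPU kernel or interface. It relies on the standard identity that, with row-major flattening, applying A ⊗ B to vec(X) gives vec(A X Bᵀ), so the large Kronecker matrix never has to be formed. The function names (`kron_apply`, `batched_kron_apply`, and the helpers) and the row-major convention are assumptions made for illustration. The component matrices A and B are shared across the batch while the operands X vary, as in the setting the abstract describes.

```python
def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def flatten(X):
    """Row-major vectorization of a matrix (illustrative convention)."""
    return [v for row in X for v in row]


def kron(A, B):
    """Explicit Kronecker product; used here only as a reference check."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def kron_apply(A, B, X):
    """Matrix-free action: A X B^T equals (A kron B) applied to flatten(X)."""
    return matmul(matmul(A, X), transpose(B))


def batched_kron_apply(A, B, batch):
    """Apply the same pair (A, B) to every operand X in the batch."""
    return [kron_apply(A, B, X) for X in batch]


# Hypothetical small example: shared components, two varying operands.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
batch = [[[1, 0], [0, 1]], [[2, -1], [5, 3]]]

K = kron(A, B)
results = batched_kron_apply(A, B, batch)

# Compare each matrix-free result with the explicit Kronecker matrix-vector product.
for X, Y in zip(batch, results):
    x = flatten(X)
    reference = [sum(K[r][c] * x[c] for c in range(len(x)))
                 for r in range(len(K))]
    assert flatten(Y) == reference
```

For n × n components, the explicit matrix `K` has n⁴ entries, while the matrix-free form only touches the n² entries of each of A, B, and X. This sketch expresses the action as two small matrix products, which corresponds to the generic GEMM-style formulation that the abstract says also works, not to the specialized implementation it presents.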

Taxonomy

Topics: Matrix Theory and Algorithms · Model Reduction and Neural Networks · VLSI and FPGA Design Techniques