Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein

TL;DR
This paper presents a novel, autotuned approach for optimizing tall & skinny matrix multiplication kernels on GPUs, significantly improving performance over existing vendor libraries by leveraging code generation and flexible parallelization strategies.
Contribution
It introduces a new implementation strategy for tall & skinny matrix multiplication on GPUs, using code generation and autotuning to achieve near-roofline performance.
Findings
Achieves at least 2/3 of roofline performance across various matrix sizes.
Substantially outperforms NVIDIA's CUBLAS for tall & skinny matrices.
Provides a flexible, configurable mapping scheme for optimized GPU kernel execution.
Abstract
General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA's current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model in this case. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and…
| ILP, Gbyte/s | ||||
|---|---|---|---|---|
| occupancy | 1 | 4 | 16 | |
| 1 block, 4 warps | 3.0 | 10.1 | 16.3 | |
| 6.25% | 228 | 629 | 815 | |
| 12.5% | 419 | 824 | 877 | |
| 25% | 681 | 872 | 884 | |
| 50% | 834 | 884 | 887 | |
| 100% | 879 | 891 | 877 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\corrauth
Dominik Ernst, Erlangen Regional Computing Center, Martensstraße 1, 91058 Erlangen, Germany
Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
Dominik Ernst11affiliationmark:
Georg Hager11affiliationmark:
Jonas Thies22affiliationmark:
and Gerhard Wellein11affiliationmark:
11affiliationmark: Erlangen Regional Computing Center (RRZE), 91058 Erlangen, Germany
22affiliationmark: German Aerospace Center (DLR), Simulation and Software Technology
Abstract
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an NVIDIA Volta GPGPU.
keywords:
performance engineering, complex, tall & skinny, matrix multiplication, CUDA, GPU
1 Introduction
1.1 Tall & Skinny Matrix Multiplications
The general matrix-matrix multiplication (GEMM) is an essential linear algebra operation used in many numerical algorithms and hardware vendors usually supply an implementation that is perfectly optimized for their hardware. In case of NVIDIA, this is part of CUBLAS (cublas). However, since these implementations are focused on mostly square matrices, they often perform poorly for matrices with unusual shapes.
This paper covers two types of matrix multiplications with tall & skinny matrices, i.e., matrices that are much taller than they are wide. We define skinny as having in the range of columns, and tall as having more than rows. Both types of multiplications involve the two tall & skinny matrices and , with sizes and , respectively, and being the long dimension. The small dimensions and form a small matrix with size .
The two variants are shown in Figures 1 and 2: The Tall & Skinny Matrix Transposed times Tall & Skinny Matrix (TSMTTSM) multiplication and the Tall & Skinny Matrix times Matrix (TSMM) multiplication .
We are interested in a highly efficient implementation of these operations using double precision real and complex data types on the NVIDIA Volta GPGPU, used nowadays in many HPC systems.
1.2 Application
Row-major tall & skinny matrices are the result of combining several vectors to block vectors. Block Vector Algorithms are linear algebra algorithms that compute on multiple vectors simultaneously for improved performance. For instance, by combining multiple, consecutive sparse matrix-vector (SpMV) multiplications to a sparse matrix-multiple-vector (SpMMV) multiplication, the matrix entries are loaded only once and used for the multiple vectors, which reduces the overall memory traffic and consequently increases performance of this memory-bound operation. This has first been analytically shown in Gropp99 and is used in many applications; see, e.g., blockjada; Kreutzer:2018.
The simultaneous computation on multiple vectors can also be used to gain numerical advantages. This has been shown for block vector versions of the Lanzcos algorithm (see blocklanzcos), of the biconjugate gradient algorithm (see blockcg), and of the Jacobi-Davidson Method (see blockjada), each of which use block vectors to compute multiple eigenvectors simultaneously. Many such algorithms require multiplications of block vectors. For example, both the TSMTTSM (\mbox{\mathbf{A}}^{T}\mbox{\mathbf{B}}) and TSMM (\mathbf{A}$$\mathbf{C}) occur in classical Gram-Schmidt orthogonalization of a number of vectors represented by against an orthogonal basis .
1.3 Roofline Model
We use the roofline model by roofline to obtain an upper limit for the performance of these kernels. In all cases, each of the three matrices has to be transferred between the memory and the chip at least once. Even though the directions of data transfers differ between the kernels, the total data volume does not, as GPUs generally do not need a write-allocate transfer. Therefore the computational intensity is the same for both kernels if and are the same. floating point operations are performed in a matrix-matrix multiplication, so for double precision the arithmetic intensity assuming and is
[TABLE]
In this symmetric case, the arithmetic intensity grows linearly with . We will show measurements only for this symmetric case, although the nonsymmetric case is not fundamentally different, with the intensity being proportional to the harmonic mean of both dimensions and consequently dominated by the smaller number. If the achievable memory bandwidth is (see below), the model predicts as the as an absolute upper performance limit. In the case of complex numbers, the data volume increases by and the number of floating-point operations by , resulting in a doubled arithmetic intensity .
With proper loop optimizations in place, the GEMM is usually considered a classic example for a compute-bound problem with high arithmetic intensity. However, at , the arithmetic intensity of , is far to the left of the roofline knee of modern compute devices (typical values ranging from 5 flop/byte to 17 flop/byte) and strongly memory bound. This is not surprising given that a matrix multiplication with is the same as a scalar product. At the other end of the considered spectrum, at , the arithmetic intensity is , which is close to the roofline knee of a V100 GPU (see below). Therefore the performance character of the operation changes from extremely memory bound at to simultaneously memory and compute bound at . An implementation with perfect performance thus needs to fully utilize the memory bandwidth at all sizes and additionally reach peak floating point performance for the large sizes. The very different performance characteristics make it hard to write an optimal implementation for both ends of the spectrum, i.e., different optimizations and specialization is required for both cases.
It is possible to judge the quality of an implementation’s performance as the percentage of the roofline limit. This metric is shown for CUBLAS in Figures 3 and 4, where the ratio of measured and roofline performance is plotted as a function of the matrix width. There is very little performance improvement headroom for CUBLAS’ TSMM implementation for real-valued matrices, but there is some opportunity for complex matrices. For the TSMTTSM kernel, there is a to gap to the upper limit, apart from , where NVIDIA obviously implemented a special case handling. Similarly to the BLAS nomenclature, we use the shorthand “D” for double precision real values and “Z” for double precision complex values.
1.4 Contribution
This paper presents the necessary implementation techniques to achieve near-perfect performance for two tall & skinny matrix-matrix multiplication variants on an NVIDIA V100 GPGPU with real- and complex-valued matrices.
To this end, two parallel reduction schemes are implemented and analyzed as to their suitability for small matrices.
A code generator is implemented that produces code for specific matrix sizes and tunes many configuration options specifically to that size. This allows to exploit regularity where size parameters allow it, while still generating the least possible overhead where they do not. As a result, our implementation outperforms state-of-the-art vendor implementations for most of the parameter range.
1.5 Related Work
This work is an extended version of Ernst:2019. In comparison to that paper we have added a different variant of matrix-matrix multiplication (TSMM), added a more in depth performance analysis, extended the analysis to double precision complex data types, and examined a new TSMTTSM thread mapping scheme.
CUBLAS is NVIDIA’s BLAS implementation. The GEMM function interface in BLAS only accepts column-major matrices. Treating the matrices as transposed column major matrices and executing \mbox{\mathbf{A}}\mbox{\mathbf{B}}^{T} for the TSMTTSM operation and \mathbf{C}$$\mathbf{A} for TSMM are equivalent operations.
CUTLASS (cutlass) is a collection of primitives for multiplications especially of small matrices, which can be composed in different ways to form products of larger matrices. One of these is the splitK kernel, which additionally parallelizes the inner summation of the matrix multiplication for increase parallelism for the TSMTTSM kernel. We adapted the “06_splitK_gemm” code sample from the library for benchmarking.
1.6 Hardware
In this work we use NVIDIA’s V100-PCIe-16GB GPGPU (Volta architecture) with CUDA 10.0. The hardware data was collected with our own CUDA micro benchmarks, which are available at cuda-benches together with more detailed data.
Memory Bandwidth. Whereas the TSMM operation has a read and a write stream and fits well to the “scale” kernel from the STREAM benchmarks (McCalpin1995), the TSMTTSM is read-only. We thus use a thread-local sum reduction to estimate the achievable memory bandwidth (see Table 1). Read-only has a much higher maximum ceiling of about 880 Gbyte/s, compared to 820 Gbyte/s for a “scale” kernel. Maximum bandwidth is only attainable with sufficient parallelism, either through high occupancy or instruction level parallelism (ILP) in the form of multiple read streams, achieved here through unrolling.
Floating-Point Throughput. The V100 can execute one 32-wide double precision (DP) floating point multiply add (FMA) per cycle on each of its 80 streaming multiprocessors (SMs) and runs at a clock speed of 1.38 GHz for a DP peak of . One SM quadrant can process one instruction that is 32 warp lanes wide every four cycles at a latency of eight cycles. Full throughput can already be achieved with a single warp per quadrant if instructions are independent.
L1 Cache. The L1 cache plays is instrumental in achieving the theoretically possible arithmetic intensity. Though load and DP FMA instructions have the same throughput of , the actual L1 cache bandwidth of one 128-byte cache line per cycle means that the actual load instruction throughput is dependent on the number of touched cache lines. For example, a 32-wide, unit-stride DP load touches 2 cache lines and therefore takes two cycles. For that access pattern, the floating point to load instruction ratio would need to be at least 2:1 to attain peak performance.
Shared Memory. The Shared Memory uses the same physical structure on the chip as the L1 cache. It has the same bandwidth, but lower access latency than the L1 cache.
2 General Implementation Strategies
2.1 Code Generation
A single implementation cannot be suitable for all matrix sizes. In order to engineer the best code for each size, some form of meta programming is required. C++ templates allow some degree of meta programming but are limited in their expressiveness or require convoluted constructs. Usually the compiler unrolls and eliminates short loops with known iteration count in order to reduce loop overhead, combine address calculations, avoid indexed loads from arrays for the thread-local results, deduplicate and batch loads, and much more. Direct generation of the intended code offers more control, however. For example, when using a thread count per row that is not a divisor of the matrix width, some threads would need to compute fewer results than others. Guarding if statements have to be added around these computations that could exceed the matrix size. These can be omitted wherever it is safe, i.e. all threads compute a valid value, in order to not compromise performance for even, dividing thread mappings. We therefore use a code generating script in python, which allows to prototype new techniques much quicker and with more control. Many different parameters can be configured easily and benchmarked automatically, for example whether leap frogging and unrolling (see below) are used, how the reduction is performed, and what thread mapping to set. The same reasoning for code generation is made by codegen, where it is used to generate small matrix multiplication kernels for CPUs.
2.2 Thread Mapping Options
