Optimizing CUDA Code By Kernel Fusion: Application on BLAS
J. Filipovič, M. Madzin, J. Fousek, L. Matyska

TL;DR
This paper presents an automatic kernel fusion compiler for CUDA that improves GPU performance by enhancing memory locality, demonstrated on BLAS routines with up to 2.61x speedup over CUBLAS.
Contribution
The paper introduces a source-to-source compiler that automatically fuses map and reduce kernels, optimizing memory locality and performance for GPU applications.
Findings
Generated fused kernels run up to 2.61x faster than equivalent CUBLAS call sequences on the examples tested.
Automatic kernel fusion improves memory locality and GPU utilization.
Demonstrated on BLAS-1 and BLAS-2 routines.
Abstract
Modern GPUs can perform significantly more arithmetic operations per unit of time than transfers of single words to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs. However, memory locality can often be improved by kernel fusion when a sequence of kernels is executed and some kernels in this sequence share data. In this paper, we show how kernels performing map, reduce, or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared to similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.61x faster for the examples tested.
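To make the idea of fusing a map kernel with a reduce kernel concrete, here is a minimal hand-written sketch. It is not output of the authors' compiler and is not taken from the paper. The kernel names, the helper `run_both`, the block size, and the choice of operation (a map `z = a*x + y` followed by a sum of `z`) are all assumptions made for illustration. The unfused pair writes `z` to global memory and reads it back, moving about 4n words. The fused kernel keeps each element in a register and moves about 2n words.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // assumed block size; kernels must be launched with it

// Unfused step 1 (map): z = a*x + y, with z stored in global memory.
__global__ void axpy_kernel(int n, float a, const float* x, const float* y, float* z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = a * x[i] + y[i];
}

// Tree reduction within one block; thread 0 adds the block's partial sum to *out.
// Every thread of the block must call this (no early returns before it).
__device__ void block_reduce_add(float v, float* out) {
    __shared__ float buf[BLOCK];
    buf[threadIdx.x] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, buf[0]);
}

// Unfused step 2 (reduce): sums z, which is read back from global memory.
__global__ void sum_kernel(int n, const float* z, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    block_reduce_add(i < n ? z[i] : 0.0f, out);
}

// Fused map + reduce: the mapped value stays in a register and z is never stored.
__global__ void fused_axpy_sum_kernel(int n, float a, const float* x, const float* y,
                                      float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? a * x[i] + y[i] : 0.0f;
    block_reduce_add(v, out);
}

// Hypothetical host helper: runs both variants on x = 1, y = 2 and returns both sums.
// Error checking is omitted for brevity.
void run_both(int n, float a, float* unfused, float* fused) {
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *x, *y, *z, *out;
    cudaMalloc(&x, bytes);
    cudaMalloc(&y, bytes);
    cudaMalloc(&z, bytes);
    cudaMalloc(&out, 2 * sizeof(float));
    cudaMemcpy(x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(out, 0, 2 * sizeof(float));

    int grid = (n + BLOCK - 1) / BLOCK;
    axpy_kernel<<<grid, BLOCK>>>(n, a, x, y, z);                  // two launches,
    sum_kernel<<<grid, BLOCK>>>(n, z, out);                       // z round-trips
    fused_axpy_sum_kernel<<<grid, BLOCK>>>(n, a, x, y, out + 1);  // one launch, no z

    float h[2];
    cudaMemcpy(h, out, sizeof(h), cudaMemcpyDeviceToHost);
    *unfused = h[0];
    *fused = h[1];
    cudaFree(x); cudaFree(y); cudaFree(z); cudaFree(out);
}
```

Calling `run_both(1 << 20, 3.0f, &u, &f)` from a `main` should leave the same sum in `u` and `f`; only the amount of global memory traffic differs. Fusing sequences with more kernels or nested map/reduce combinations, as the abstract describes, involves more choices than this two-kernel case shows.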
