The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries
Oscar Amoros (1), Albert Andaluz (2), Johnny Nunez (3), Antonio J. Pena (4) ((1) Universitat Politecnica de Catalunya, (2) Independent researcher, (3) NVIDIA, (4) Barcelona Supercomputing Center)

TL;DR
This paper introduces a C++17-based methodology and library that automatically fuses GPU functions at compile time, significantly improving performance and resource utilization without manual kernel development.
Contribution
It presents a novel approach for automatic kernel fusion in GPU libraries using high-level C++ interfaces and metaprogramming, enabling arbitrary function combinations with optimized fused kernels.
Findings
Achieves speedups from 2x to over 1000x compared to traditional libraries.
Enables automatic, on-demand kernel fusion for arbitrary GPU function sequences.
Maintains high-level programmability while maximizing GPU resource utilization.
Abstract
Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and Vertical Fusion (VF) can mitigate this, current library implementations often require library developers to manually create fused kernels. Hence, library users rely on limited sets of pre-compiled or template-based fused kernels. This limits the use cases that can benefit from HF and VF and increases development costs. In order to solve these issues, we present a novel methodology for building GPU libraries that enables automatic on-demand HF and VF for arbitrary combinations of GPU library functions. Our methodology defines reusable, fusionable components that users combine via high-level programming interfaces. Leveraging C++17 metaprogramming…
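To make the idea of Vertical Fusion concrete, here is a minimal host-side C++17 sketch. It is not the library's API: the names Mul, Add, Clamp01, apply_chain and fused_transform are hypothetical, and a plain CPU loop stands in for a GPU kernel. It only illustrates how a variadic template can compose an arbitrary sequence of per-element operations at compile time, so that intermediate values stay in a local variable (a register, on a GPU) instead of being written to memory between separate passes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "fusionable" operations: small functors exposing a
// per-element exec(). These names are illustrative assumptions only.
struct Mul { float k; float exec(float x) const { return x * k; } };
struct Add { float k; float exec(float x) const { return x + k; } };
struct Clamp01 {
    float exec(float x) const { return x < 0.f ? 0.f : (x > 1.f ? 1.f : x); }
};

// Base case: no operations left, return the value unchanged.
template <typename T>
T apply_chain(T x) { return x; }

// Recursive case: apply the first operation, then the rest.
// The recursion is unrolled by the compiler for each distinct chain.
template <typename T, typename Op, typename... Rest>
T apply_chain(T x, const Op& op, const Rest&... rest) {
    return apply_chain(op.exec(x), rest...);
}

// Stand-in for a fused kernel: one pass over the data applies the whole
// chain per element. `out` is preallocated, as a device buffer would be.
template <typename... Ops>
void fused_transform(const std::vector<float>& in, std::vector<float>& out,
                     const Ops&... ops) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = apply_chain(in[i], ops...);  // intermediates never leave locals
    }
}
```

Calling fused_transform(in, out, Mul{2.0f}, Add{-0.5f}, Clamp01{}) instantiates one loop specialized for that exact chain. An unfused equivalent would run three separate passes, each reading and writing a full buffer, which is the memory traffic that fusion is meant to avoid.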
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization · Embedded Systems Design Techniques
