Fast GPU Linear Algebra via Compile Time Expression Fusion
Ryan R. Curtin, Marcus Edel, Conrad Sanderson

TL;DR
The paper introduces Bandicoot, a C++ GPU linear algebra library that fuses expressions into GPU kernels at compile time and, in the authors' benchmarks, outperforms PyTorch, TensorFlow, and JAX.
Contribution
It presents a compile-time kernel fusion approach, based on template metaprogramming, in an easy-to-use C++ library whose API is compatible with the Armadillo CPU linear algebra library.
Findings
Bandicoot generates fused GPU kernels at compile time; these kernels are often able to saturate memory bandwidth.
Empirical results show Bandicoot outperforms PyTorch, TensorFlow, and JAX, sometimes by considerable margins.
Because fusion happens at compile time, the library needs no runtime kernel generation or JIT infrastructure.
Abstract
We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.
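The compile-time fusion idea in the abstract can be illustrated with the classic C++ expression-template idiom. The following is a minimal CPU-only sketch, not Bandicoot's actual implementation: the names `Expr`, `Vec`, `Stored`, `Sum`, `Scaled`, and `scaled_add` are all hypothetical, and a plain loop stands in for the generated GPU kernel. It shows how an expression such as `a + b * s` can be encoded as a type at compile time and then evaluated in a single pass with no temporary vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// CRTP base: marks any type that can be evaluated element by element.
template <typename Derived>
struct Expr {
    const Derived& self() const { return static_cast<const Derived&>(*this); }
};

// Owning vector (hypothetical). Building or assigning it from an
// expression runs exactly one loop over the elements.
class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, float value = 0.0f) : data_(n, value) {}

    template <typename E>
    Vec(const Expr<E>& e) : data_(e.self().size()) { assign(e.self()); }

    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        assert(e.self().size() == size());
        assign(e.self());
        return *this;
    }

    float operator[](std::size_t i) const { return data_[i]; }
    float& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

private:
    // The "fused kernel": one pass that evaluates the whole expression tree.
    template <typename E>
    void assign(const E& e) {
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = e[i];
    }
    std::vector<float> data_;
};

// Expression nodes are stored by value, leaf vectors by reference.
template <typename T> struct Stored { using type = T; };
template <> struct Stored<Vec> { using type = const Vec&; };

// Node representing l + r; nothing is computed until operator[] is called.
template <typename L, typename R>
class Sum : public Expr<Sum<L, R>> {
public:
    Sum(const L& l, const R& r) : l_(l), r_(r) {}
    float operator[](std::size_t i) const { return l_[i] + r_[i]; }
    std::size_t size() const { return l_.size(); }
private:
    typename Stored<L>::type l_;
    typename Stored<R>::type r_;
};

// Node representing e * s for a scalar s.
template <typename E>
class Scaled : public Expr<Scaled<E>> {
public:
    Scaled(const E& e, float s) : e_(e), s_(s) {}
    float operator[](std::size_t i) const { return e_[i] * s_; }
    std::size_t size() const { return e_.size(); }
private:
    typename Stored<E>::type e_;
    float s_;
};

// Operators build the expression tree instead of computing results.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

template <typename E>
Scaled<E> operator*(const Expr<E>& e, float s) {
    return Scaled<E>(e.self(), s);
}

// Hypothetical helper for demonstration: computes a + b * s elementwise.
inline Vec scaled_add(const Vec& a, const Vec& b, float s) {
    // The whole operation tree is visible in the expression's type.
    static_assert(std::is_same<decltype(a + b * s), Sum<Vec, Scaled<Vec>>>::value,
                  "expression tree is encoded in the type");
    return a + b * s;  // evaluated in one loop, no intermediate vectors
}
```

Calling `scaled_add(a, b, 3.0f)` on two equally sized `Vec` objects evaluates `a[i] + b[i] * 3.0f` in one loop. A GPU library following this pattern would presumably use the compile-time type of the expression to emit a single kernel rather than a CPU loop; how Bandicoot does this in detail is not described in the summary above.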
