Fast GPU Linear Algebra via Compile Time Expression Fusion

Ryan R. Curtin; Marcus Edel; Conrad Sanderson

arXiv:2604.22242·cs.MS·April 27, 2026

TL;DR

The paper introduces Bandicoot, a C++ GPU linear algebra library that generates fused GPU kernels at compile time and, in the authors' experiments, outperforms PyTorch, TensorFlow, and JAX.

Contribution

It presents a compile-time kernel fusion approach, built on C++ template metaprogramming, in an easy-to-use library whose API is compatible with the Armadillo CPU linear algebra library.

Findings

01

Compile-time fusion yields GPU kernels that are often able to saturate memory bandwidth.

02

Empirical results show Bandicoot outperforming PyTorch, TensorFlow, and JAX, sometimes by considerable margins.

03

The library prioritises ease of use, and because kernels are generated at compile time it needs no JIT infrastructure and avoids the associated runtime overhead.

Abstract

We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that are often able to saturate memory bandwidth. This removes the need for runtime overhead or JIT infrastructure. Empirical results show that Bandicoot outperforms (sometimes by considerable margins) commonly-used linear algebra toolkits including PyTorch, TensorFlow, and JAX.
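The idea of fusing an expression at compile time can be illustrated with a minimal expression-template sketch. This is a CPU-only illustration of the general technique, not Bandicoot's implementation or API: the names `Expr`, `Vec`, `AddExpr`, `ScaleExpr`, and `fused_demo` are all hypothetical, and a plain loop stands in for the GPU kernel that a real library would generate. The point is only that `2.0f * (a + b) + c` builds a nested type at compile time, and the whole expression is then evaluated in a single pass with no intermediate vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names here are hypothetical and for illustration only;
// this is not Bandicoot's API and runs on the CPU.

// CRTP base: lets operators accept any expression type statically.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node representing lhs + rhs; nothing is computed until indexed.
template <typename L, typename R>
struct AddExpr : Expr<AddExpr<L, R>> {
    const L& lhs;
    const R& rhs;
    AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {
        assert(l.size() == r.size());
    }
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Node representing k * inner.
template <typename E>
struct ScaleExpr : Expr<ScaleExpr<E>> {
    const E& inner;
    float k;
    ScaleExpr(const E& e, float scale) : inner(e), k(scale) {}
    float operator[](std::size_t i) const { return k * inner[i]; }
    std::size_t size() const { return inner.size(); }
};

// Concrete vector; constructing it from an expression runs one fused loop.
struct Vec : Expr<Vec> {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}

    template <typename E>
    Vec(const Expr<E>& e) : data(e.self().size()) {
        const E& ex = e.self();
        // Single pass over the data: the stand-in for one fused kernel.
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = ex[i];
    }
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

template <typename L, typename R>
AddExpr<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return AddExpr<L, R>(l.self(), r.self());
}

template <typename E>
ScaleExpr<E> operator*(float k, const Expr<E>& e) {
    return ScaleExpr<E>(e.self(), k);
}

// Evaluates 2 * (a + b) + c elementwise in one loop.
// Expression nodes hold references, so the expression must be consumed
// within the same full statement that builds it, as done here.
std::vector<float> fused_demo() {
    Vec a(4, 1.0f), b(4, 2.0f), c(4, 0.5f);
    Vec r = 2.0f * (a + b) + c;
    return r.data;
}
```

Calling `fused_demo()` returns the four results of the fused evaluation. In this sketch the type of the right-hand side encodes the entire operation tree, which is what allows a compiler to emit one loop (or, in a GPU setting, one kernel) rather than one per operator.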
