Lightweight Gaussian Process Inference in C++ on Metal and CUDA

Yu-Hsueh Fang

arXiv:2605.17898 · cs.LG · May 19, 2026

TL;DR

LightGP is a fast, dependency-free C++17 library for Gaussian process inference supporting Metal and CUDA, outperforming Python-based libraries like GPyTorch in speed and scalability.

Contribution

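The paper introduces LightGP, a dependency-free C++17 library for GP regression with Python bindings. It targets Apple Metal and NVIDIA CUDA backends alongside CPU paths tuned via Apple Accelerate and OpenBLAS, and the reported benchmarks show it running faster than GPyTorch on both CPU and GPU.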

Findings

01

LightGP CPU is 2.6–8.7× faster than GPyTorch CPU for exact GP on an Apple M4.

02

LightGP CUDA is 2.3–6.7× faster than GPyTorch CUDA for exact GP up to N = 2,048 on an NVIDIA RTX 3060.

03

Fused matrix-free kernel-vector product on Metal achieves 32× speedup at N=20,000.
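To make the idea of a matrix-free kernel-vector product concrete, here is a minimal CPU sketch in C++17. It is not LightGP's code or API: the function name `rbf_matvec`, the choice of an RBF kernel on 1-D inputs, and the `lengthscale` and `noise` parameters are all assumptions made for illustration. The point is only that each kernel entry is recomputed on the fly inside the product, so the N × N matrix is never stored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of LightGP): computes
//   y = (K + noise * I) v,  K_ij = exp(-(x_i - x_j)^2 / (2 * lengthscale^2)),
// evaluating each kernel entry inside the loop instead of storing K.
// Memory use is O(N); arithmetic cost is still O(N^2).
std::vector<double> rbf_matvec(const std::vector<double>& x,
                               const std::vector<double>& v,
                               double lengthscale,
                               double noise) {
    assert(x.size() == v.size());
    const std::size_t n = x.size();
    const double inv_two_l2 = 1.0 / (2.0 * lengthscale * lengthscale);

    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double acc = noise * v[i];  // diagonal noise term
        for (std::size_t j = 0; j < n; ++j) {
            const double d = x[i] - x[j];
            acc += std::exp(-d * d * inv_two_l2) * v[j];  // K_ij * v_j
        }
        y[i] = acc;
    }
    return y;
}
```

On a GPU, the outer loop over `i` would typically map to one thread per output row, with kernel evaluation and accumulation fused into a single pass; how LightGP actually organizes this on Metal is not stated in the text above.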

Abstract

Gaussian process (GP) inference in Python is dominated by libraries such as GPyTorch and GPflow, which are built on deep-learning frameworks and inherit their dispatch overhead and dependency footprint. We present LightGP, a dependency-free C++17 library for GP regression with Python bindings, supporting Apple Metal and NVIDIA CUDA backends alongside tuned CPU paths via Apple Accelerate and OpenBLAS. LightGP provides four inference paths (exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT), covering problems from $N = 100$ to $N = 500{,}000$. On an Apple M4, LightGP CPU is 2.6–8.7$\times$ faster than GPyTorch CPU for exact GP and $\sim 1.5\times$ faster for sparse GP at every scale tested. On an NVIDIA RTX 3060, LightGP CUDA is 2.3–6.7$\times$ faster than GPyTorch CUDA for exact GP up to $N = 2{,}048$, with…
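Of the four inference paths listed, matrix-free conjugate gradients is the easiest to illustrate in a few lines. In GP regression the expensive step is solving $(K + \sigma^2 I)\,\alpha = y$; conjugate gradients does this using only products of the matrix with a vector. The sketch below is a textbook CG solver in C++17, not LightGP's implementation: the names `Vec`, `dot`, and `conjugate_gradient`, the default tolerance, and the iteration cap are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Plain dot product of two equally sized vectors.
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Hypothetical helper (not part of LightGP): solves A x = b for a symmetric
// positive-definite A that is available only through apply(v) = A v.
Vec conjugate_gradient(const std::function<Vec(const Vec&)>& apply,
                       const Vec& b,
                       double tol = 1e-10,
                       std::size_t max_iter = 1000) {
    Vec x(b.size(), 0.0);
    Vec r = b;  // residual b - A x for the initial guess x = 0
    Vec p = r;  // first search direction
    double rs_old = dot(r, r);

    for (std::size_t it = 0; it < max_iter && std::sqrt(rs_old) > tol; ++it) {
        const Vec ap = apply(p);  // the only place A is touched
        assert(ap.size() == b.size());
        const double alpha = rs_old / dot(p, ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        const double rs_new = dot(r, r);
        const double beta = rs_new / rs_old;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rs_old = rs_new;
    }
    return x;
}
```

Passing a callable that computes $(K + \sigma^2 I)\,v$ without forming $K$ as `apply` gives a solver whose memory use grows linearly in $N$, which is the usual motivation for preferring this path over exact Cholesky at larger scales.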
