gp2Scale: A Class of Compactly-Supported Non-Stationary Kernels and Distributed Computing for Exact Gaussian Processes on 10 Million Data Points
Marcus M. Noack, Mark D. Risser, Hengrui Luo, Vardaan Tekriwal, Ronald J. Pandolfi

TL;DR
gp2Scale enables exact Gaussian process inference on more than 10 million data points by combining compactly supported, non-stationary kernels with distributed computing. It avoids common approximation methods such as inducing points, kernel interpolation, and neighborhood-based schemes, while keeping kernel and noise-model design flexible.
Contribution
The paper introduces gp2Scale, a novel scalable method for exact Gaussian processes that exploits kernel design and sparse structure without approximations, handling large datasets efficiently.
Findings
Successfully scales to over 10 million data points.
Outperforms state-of-the-art approximation algorithms in accuracy.
Maintains flexibility for various kernel and noise configurations.
Abstract
Despite a large corpus of recent work on scaling up Gaussian processes, a stubborn trade-off between computational speed, prediction and uncertainty quantification accuracy, and customizability persists. This is because the vast majority of existing methodologies exploit various levels of approximations that lower accuracy and limit the flexibility of kernel and noise-model designs, an unacceptable drawback at a time when expressive non-stationary kernels are on the rise in many fields. Here, we propose a methodology we term gp2Scale that scales exact Gaussian processes to more than 10 million data points without relying on inducing points, kernel interpolation, or neighborhood-based approximations, and instead leveraging the existing capabilities of a GP: its kernel design. Highly flexible, compactly supported, and non-stationary kernels lead to the identification of naturally…
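The abstract's central idea, that a compactly supported kernel yields a covariance matrix with many exact zeros, can be illustrated with a minimal sketch. It is not the paper's kernel or implementation: it assumes a stationary Wendland C2 kernel (a standard compactly supported kernel, positive definite in up to three dimensions) on synthetic 1D inputs, and the names `wendland`, `sparse_covariance`, and `radius` are hypothetical.

```python
import random


def wendland(r, radius):
    """Wendland C2 kernel: (1 - s)^4 * (4s + 1) for s = r / radius < 1, else exactly 0."""
    s = r / radius
    if s >= 1.0:
        return 0.0
    return (1.0 - s) ** 4 * (4.0 * s + 1.0)


def sparse_covariance(xs, radius):
    """Return {(i, j): k(x_i, x_j)} holding only the non-zero covariance entries."""
    entries = {}
    # Sorting lets us stop scanning as soon as a point falls outside the support.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for a, i in enumerate(order):
        for j in order[a:]:
            d = xs[j] - xs[i]
            if d >= radius:
                break  # every later point is even farther away
            v = wendland(d, radius)
            entries[(i, j)] = v
            entries[(j, i)] = v  # the covariance matrix is symmetric
    return entries


random.seed(0)
# Assumption: 2000 synthetic inputs spread over [0, 100], support radius 1.
xs = [random.uniform(0.0, 100.0) for _ in range(2000)]
K = sparse_covariance(xs, radius=1.0)
density = len(K) / len(xs) ** 2  # fraction of non-zero entries
print(f"non-zero fraction of the covariance matrix: {density:.4f}")
```

Because the zeros are exact rather than thresholded, the stored matrix is the true covariance matrix, which is the sense in which such a GP remains exact. A non-stationary variant would let the support or amplitude depend on the input location.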
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is well-written and clear, and the authors point out all the caveats.
- There have been many works trying to address scalability in exact GPs, but viewing it through the lens of compactly supported kernels to unveil sparsity and speed up training is a new and seemingly effective approach.
- The authors demonstrate that exact GP inference on up to 10 million points is possible using their framework. The large-scale experiments (topography, housing, MNIST, and temperature data) show
- The authors should provide more detail about how exactly the distributed computing works, as well as the block MCMC. They mention both rather than describe or explain them.
- It is unclear whether the goal is accuracy, scalability, or flexibility; the narrative oscillates between all three.
- Claims that gp2Scale is “exact” and “agnostic to any input space” are not fully substantiated; discrete or structured domains may still pose issues. I feel the claims may be a bit overstated.
The systematic construction of non-stationary compactly supported kernels (Sections 4.1-4.5) provides a useful taxonomy for practitioners. The combination kernel offers flexibility for encoding complex non-stationary structure. The paper includes a study on a 3D regression dataset with 10M points using extreme computational resources (1000+ GPUs).
The authors state that: “In this work, we argue that the fundamental problem of scaling Gaussian processes (GPs) stems from the misconception that the covariance matrix is inherently dense.” I have to disagree with this sweeping generalization. The problem of designing kernels that lead to covariance matrices with computationally advantageous structure is an old and extensively studied topic. The framing adopted by the authors presents a well-studied area as if it contains a fundamental misconce
1. This paper manages to train Gaussian processes on 10 million data points across 1024 GPUs using distributed computing. This seems to be a nontrivial engineering effort, even though the authors mentioned that "they are much more mundane compared to the kernel designs". I actually think these technical details deserve a spot in the main paper. In particular, I am personally interested in how sparsity patterns are handled and how the matrix blocks are distributed across GPUs.
1. Most experiments are conducted on datasets with \(\leq 10^5\) data points. The only truly large dataset is the temperature dataset (Menne et al., 2012), which has 10 million data points. Ideally, there would be more diverse datasets in the middle range, e.g., larger than 100 thousand and smaller than a few million points.
- The absence of CRPS (or similar metrics that capture the quality of the predictive distributions) on the largest dataset is unfortunate, because the current evalua
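For reference, the CRPS requested in the review above has a well-known closed form when the predictive distribution is Gaussian, as it is for a GP with Gaussian noise: CRPS = σ [ z (2Φ(z) − 1) + 2φ(z) − 1/√π ] with z = (y − μ)/σ, where φ and Φ are the standard normal density and distribution function. The sketch below is a generic implementation of that formula, not code from the paper; the function name `crps_gaussian` is hypothetical.

```python
import math


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian predictive N(mu, sigma^2) at observation y.

    Lower is better; the score is in the same units as y.
    """
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# A confident but wrong prediction scores worse than a well-calibrated one.
print(crps_gaussian(mu=0.0, sigma=0.1, y=1.0))
print(crps_gaussian(mu=0.0, sigma=1.0, y=1.0))
```

Averaging this score over held-out points evaluates the mean prediction and the predictive uncertainty jointly, which an RMSE-style metric alone does not. As σ shrinks toward zero, the score approaches the absolute error |y − μ|.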
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Machine Learning in Materials Science · Advanced Multi-Objective Optimization Algorithms
