Stochastic Gradient Descent for Gaussian Processes Done Right
Jihao Andreas Lin, Shreyas Padhy, Javier Antorán, Austin Tripp, Alexander Terenin, Csaba Szepesvári, José Miguel Hernández-Lobato, David Janz

TL;DR
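This paper demonstrates that stochastic gradient descent, when combined with specific insights from the optimisation and kernel communities, can efficiently solve the large linear systems that arise in Gaussian process regression, outperforming preconditioned conjugate gradients and variational methods and achieving performance comparable to state-of-the-art graph neural networks.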
Contribution
The paper introduces a simple stochastic dual descent algorithm tailored for Gaussian processes, leveraging insights from optimization and kernel theory, with extensive empirical validation.
Findings
The proposed method is highly competitive on UCI regression tasks.
It outperforms preconditioned conjugate gradients and variational methods.
It achieves performance comparable to state-of-the-art graph neural networks.
Abstract
As is well known, both sampling from the posterior and computing the mean of the posterior in Gaussian process regression reduces to solving a large linear system of equations. We study the use of stochastic gradient descent for solving this linear system, and show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- stochastic gradient descent is highly effective. To that end, we introduce a particularly simple stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices through a series of ablation studies. Further experiments demonstrate that our new method is highly competitive. In particular, our evaluations on the UCI regression tasks and on Bayesian optimisation set our approach apart from preconditioned conjugate gradients and variational Gaussian process…
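The linear system mentioned in the abstract can be illustrated with a small example. The following is a minimal sketch, not the authors' implementation: it assumes a squared-exponential kernel, synthetic 1-D data, and hypothetical names (`rbf_kernel`, `dual_coordinate_sgd`, `posterior_mean`). It minimises a dual objective whose minimiser solves (K + noise_var * I) alpha = y, using gradient estimates built from a random batch of coordinates. The additional optimisation techniques mentioned in the reviews are deliberately left out, and the step size is hand-picked for this toy problem.

```python
import math
import random


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def dual_coordinate_sgd(K, y, noise_var, step_size, batch_size, num_steps, seed=0):
    """Approximately solve (K + noise_var * I) alpha = y.

    Minimises the dual objective
        0.5 * alpha^T (K + noise_var * I) alpha - alpha^T y
    with gradient estimates supported on a random batch of coordinates.
    """
    n = len(y)
    rng = random.Random(seed)
    alpha = [0.0] * n
    # The factor n / batch_size makes the sparse gradient estimate unbiased.
    scale = step_size * n / batch_size
    for _ in range(num_steps):
        batch = rng.sample(range(n), batch_size)
        # Evaluate all sampled gradient coordinates at the current alpha
        # before applying any of them.
        grads = [
            sum(K[i][j] * alpha[j] for j in range(n)) + noise_var * alpha[i] - y[i]
            for i in batch
        ]
        for i, g in zip(batch, grads):
            alpha[i] -= scale * g
    return alpha


# Toy 1-D regression problem (synthetic data, for illustration only).
xs = [0.3 * i for i in range(20)]
ys = [math.sin(x) for x in xs]
noise_var = 0.5
K = [[rbf_kernel(a, b) for b in xs] for a in xs]

alpha = dual_coordinate_sgd(K, ys, noise_var, step_size=0.045,
                            batch_size=5, num_steps=3000)

# Residual of the linear system; it should be close to zero at convergence.
residual = [
    sum(K[i][j] * alpha[j] for j in range(len(xs))) + noise_var * alpha[i] - ys[i]
    for i in range(len(xs))
]
max_residual = max(abs(r) for r in residual)


def posterior_mean(x_new):
    """GP posterior mean at x_new, computed from the weights alpha."""
    return sum(rbf_kernel(x_new, x) * a for x, a in zip(xs, alpha))
```

Each step touches only `batch_size` rows of the kernel matrix, which is what makes this style of update attractive when the full matrix is too large to factorise. Once `alpha` is available, `posterior_mean(x_new)` evaluates the posterior mean at any new input.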
Peer Reviews
Decision: ICLR 2024 poster
* This paper proposes a new method for the kernel ridge regression problem.
* Experimental results show that the proposed algorithms can achieve better performance than baselines. When combined with the Gaussian process, the method can also achieve comparable performance to that of graph neural networks.
* This paper only provides numerical experiments to evaluate the performance of different algorithms. However, it would be good if rigorous theoretical guarantees could be proved, at least for some special cases. Besides, I think the authors place too much emphasis on the algorithm details, a major part of which could be deferred to the appendix to leave some room for theoretical analysis.
* There are many different optimizers for the kernel ridge regression, such as AdaGrad, Adam, e
The authors present a novel "dual" formulation for the Gaussian process regression problem. After studying the condition number of new and old formulations, the authors observe that the "dual" formulation allows for the use of larger learning rates, indicating its potential to converge faster. They then propose the stochastic dual gradient descent method, leveraging various optimization techniques based on the "dual" formulation, including feature and coordinate sampling (or minibatch) [1], Nest
The authors do not provide a theoretical justification to verify the convergence of the proposed method. Nevertheless, it is likely that convergence can be ensured under mild conditions, as the optimization techniques employed are standard and well-established in the community and literature. From my perspective, the primary contribution of this paper lies in the introduction of the "dual" formulation, as presented on page 4 after Equation (2). This formulation allows for the use of larger step
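As background for the step-size remark in the review above, the two objectives can be written out explicitly. This is a sketch in assumed notation ($K$ is the kernel matrix, $\lambda > 0$ the noise variance, $y$ the targets, $\alpha$ the weight vector, and $\lambda_1 \ge \dots \ge \lambda_n$ the eigenvalues of $K$); the paper's own Equation (2) may use different symbols.

```latex
\begin{align*}
L(\alpha)   &= \tfrac12 \lVert y - K\alpha \rVert_2^2 + \tfrac{\lambda}{2}\, \alpha^\top K \alpha,
            & \nabla^2 L   &= K (K + \lambda I), \\
L^*(\alpha) &= \tfrac12\, \alpha^\top (K + \lambda I)\, \alpha - \alpha^\top y,
            & \nabla^2 L^* &= K + \lambda I.
\end{align*}
% Both gradients vanish at \alpha^\star = (K + \lambda I)^{-1} y.
% Gradient descent on a quadratic is stable for step sizes \beta < 2 / \lambda_{\max}(\text{Hessian}):
\begin{align*}
\text{primal: } \beta < \frac{2}{\lambda_1 (\lambda_1 + \lambda)},
\qquad
\text{dual: } \beta < \frac{2}{\lambda_1 + \lambda}.
\end{align*}
```

Under these assumptions the dual objective tolerates step sizes larger by a factor of $\lambda_1$, and its condition number $(\lambda_1 + \lambda)/(\lambda_n + \lambda)$ is never larger than the primal one, $\lambda_1(\lambda_1 + \lambda) / (\lambda_n(\lambda_n + \lambda))$.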
This is a well-written paper that considers an interesting problem. The use of several benchmarks in the experimental section and comparison with recent work is a plus. The justification for use of the dual objective as well as the illustrative example is clear. The reason behind the choice of random coordinate estimates is well explained.
It would help to emphasise that this work applies when the kernel is already known. Comments on whether these methods could also be used for hyperparameter estimation would be welcome. The claim that the method can be implemented in a few lines of code should be demonstrated. The repo given does not clearly illustrate this using a simple example. The paper would benefit from a visualisation comparing samples from a GP using SDD to an exact GP fit to show that the samples lie within the confi
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Spectroscopy Techniques in Biomedical and Chemical Research
Methods: Sparse Evolutionary Training · Gaussian Process
