Parametric Gaussian Process Regression for Big Data

Maziar Raissi

arXiv:1704.03144·stat.ML·May 8, 2017

TL;DR

This paper proposes parametric Gaussian processes (PGPs) for big data, offering a scalable alternative to stochastic variational inference, and demonstrates their effectiveness on large-scale datasets.

Contribution

It introduces the novel concept of parametric Gaussian processes that operate efficiently in big data settings without relying on stochastic variational inference.

Findings

01 Effective on simulated data

02 Performs well on the airline industry benchmark dataset

03 Avoids the need for stochastic variational inference

Abstract

This work introduces the concept of parametric Gaussian processes (PGPs), which is built upon the seemingly self-contradictory idea of making Gaussian processes parametric. Parametric Gaussian processes, by construction, are designed to operate in "big data" regimes where one is interested in quantifying the uncertainty associated with noisy data. The proposed methodology circumvents the well-established need for stochastic variational inference, a scalable algorithm for approximating posterior distributions. The effectiveness of the proposed approach is demonstrated using an illustrative example with simulated data and a benchmark dataset in the airline industry with approximately 6 million records.

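Equations

$$u(\bm{x}) \sim \mathcal{GP}\left(0, k(\bm{x},\bm{x}^{\prime};\bm{\theta})\right) \tag{1}$$

$$\bm{u} \sim \mathcal{N}\left(\bm{m}, \bm{S}\right) \tag{2}$$

$$f(\bm{x}) := u(\bm{x}) \mid \bm{m}, \bm{S} \sim \mathcal{GP}\left(\mu(\bm{x};\bm{\theta},\bm{m}), \Sigma(\bm{x},\bm{x}^{\prime};\bm{\theta},\bm{S})\right) \tag{3}$$

$$\mu(\bm{x};\bm{\theta},\bm{m}) = k(\bm{x},\bm{Z};\bm{\theta})\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} \bm{m} \tag{4}$$

$$\Sigma(\bm{x},\bm{x}^{\prime};\bm{\theta},\bm{S}) = k(\bm{x},\bm{x}^{\prime};\bm{\theta}) - k(\bm{x},\bm{Z};\bm{\theta})\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} k(\bm{Z},\bm{x}^{\prime};\bm{\theta}) + k(\bm{x},\bm{Z};\bm{\theta})\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} \bm{S}\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} k(\bm{Z},\bm{x}^{\prime};\bm{\theta}) \tag{5}$$

$$\bm{m} = \mu(\bm{Z};\bm{\theta},\bm{m}) + \Sigma(\bm{Z},\widetilde{\bm{X}};\bm{\theta},\bm{S}) \left[\Sigma(\widetilde{\bm{X}},\widetilde{\bm{X}};\bm{\theta},\bm{S}) + \sigma_{\epsilon}^{2}\bm{I}\right]^{-1} \left(\widetilde{\bm{y}} - \mu(\widetilde{\bm{X}};\bm{\theta},\bm{m})\right) \tag{6}$$

$$\bm{S} = \Sigma(\bm{Z},\bm{Z};\bm{\theta},\bm{S}) - \Sigma(\bm{Z},\widetilde{\bm{X}};\bm{\theta},\bm{S}) \left[\Sigma(\widetilde{\bm{X}},\widetilde{\bm{X}};\bm{\theta},\bm{S}) + \sigma_{\epsilon}^{2}\bm{I}\right]^{-1} \Sigma(\widetilde{\bm{X}},\bm{Z};\bm{\theta},\bm{S}) \tag{7}$$

$$\mathcal{NLML}(\bm{\theta},\sigma_{\epsilon}^{2}) := \frac{1}{2}\bm{m}^{T} k(\bm{Z},\bm{Z};\bm{\theta})^{-1}\bm{m} + \frac{1}{2}\log\left|k(\bm{Z},\bm{Z};\bm{\theta})\right| + \frac{1}{2} M \log(2\pi) \tag{8}$$

$$k(x,x^{\prime};\bm{\theta}) = \gamma^{2}\exp\left(-\frac{1}{2} w^{2} (x - x^{\prime})^{2}\right) \tag{9}$$

$$k(\bm{x},\bm{x}^{\prime};\bm{\theta}) = \gamma^{2}\exp\left(-\frac{1}{2}\sum_{d=1}^{8} w_{d}^{2} (x_{d} - x_{d}^{\prime})^{2}\right) \tag{10}$$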

Code & Models

Repositories

maziarraissi/ParametricGP
tf

Full text

Parametric Gaussian Process Regression for Big Data

Maziar Raissi

Division of Applied Mathematics

Brown University

Providence, RI 02912

[email protected] http://www.dam.brown.edu/people/mraissi/

1 Introduction

Gaussian processes (see [1, 2]) are a non-parametric Bayesian machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, and has a fully probabilistic workflow that returns robust posterior variance estimates, which quantify uncertainty in a natural way. Moreover, Gaussian processes are among a class of methods known as kernel machines (see [3, 4, 5]) and are analogous to regularization approaches (see [6, 7, 8]). They can also be viewed as a prior on one-layer feed-forward Bayesian neural networks with an infinite number of hidden units [9]. Non-parametric models such as Gaussian processes need to “remember” the full dataset in order to be trained and make predictions. Therefore, the complexity of non-parametric models grows with the size of the dataset. For instance, when applying a Gaussian process to a dataset of size $N$, exact inference has computational complexity $\mathcal{O}(N^{3})$ with storage demands of $\mathcal{O}(N^{2})$. In recent years, a tremendous amount of effort (see e.g., [10, 11]) has gone into reducing these complexities. Such efforts generally lead to a computational complexity of $\mathcal{O}(NM^{2})$ and storage demands of $\mathcal{O}(NM)$, where $M$ is a user-specified parameter governing the number of “inducing variables” (see e.g., [12, 13, 14, 15]). However, as rightly pointed out in [16], even these reduced storage demands are prohibitive for “big data”. In [16], the authors combine the idea of inducing variables with recent advances in variational inference (see e.g., [17, 18]) to develop a practical algorithm for fitting Gaussian processes using stochastic variational inference.
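To give a sense of scale, the following back-of-the-envelope sketch compares the storage demands quoted above. It borrows $N \approx 5.93$ million and $M = 500$ from the airline example in Section 3.2 and assumes 8-byte floating point entries; the helper name `storage_bytes` is hypothetical and introduced only for illustration.

```python
# Rough storage estimates for the complexities quoted above.
# Assumptions: N and M are borrowed from the airline example (Section 3.2),
# and every matrix entry is stored as an 8-byte float.
N = 5_930_000
M = 500
BYTES_PER_ENTRY = 8


def storage_bytes(rows, cols, bytes_per_entry=BYTES_PER_ENTRY):
    """Bytes needed to store a dense rows-by-cols matrix."""
    return rows * cols * bytes_per_entry


exact_gp = storage_bytes(N, N)    # O(N^2): full covariance matrix
sparse_gp = storage_bytes(N, M)   # O(NM): inducing-variable approximations
m_by_m = storage_bytes(M, M)      # a single M-by-M matrix, for comparison

for name, n_bytes in [("N x N", exact_gp), ("N x M", sparse_gp), ("M x M", m_by_m)]:
    print(f"{name}: {n_bytes / 1e9:.3f} GB")
```

Under these assumptions the three figures come to roughly 281 TB, 24 GB, and 2 MB respectively, which is why even the $\mathcal{O}(NM)$ regime becomes uncomfortable for datasets of this size.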

In contrast, the current work avoids stochastic variational inference and attempts to present an alternative approach to the one proposed in [16]. The seemingly self-contradictory idea is to make Gaussian processes parametric. The key feature of parametric models in general, and the current work in particular, is that predictions are conditionally independent of the observed data given the parameters. In other words, the data is distilled into the parameters and any subsequent prediction does not make use of the original dataset. This is very convenient as it enables efficient mini-batch training procedures. However, this is not without drawbacks since choosing a model from a particular parametric class constrains its flexibility. Therefore, it is of great importance to devise models that are aware of their imperfections and are capable of properly quantifying the uncertainty in their predictions associated with such limitations.

2 Methodology

Let us start by making the prior assumption that

$$u(\bm{x}) \sim \mathcal{GP}\left(0, k(\bm{x},\bm{x}^{\prime};\bm{\theta})\right), \tag{1}$$

is a zero mean Gaussian process [1] with covariance function $k(\bm{x},\bm{x}^{\prime};\bm{\theta})$ which depends on the hyper-parameters $\bm{\theta}$ . Moreover, let us postulate the existence of some hypothetical dataset $\{\bm{Z},\bm{u}\}$ with

$$\bm{u} \sim \mathcal{N}\left(\bm{m}, \bm{S}\right). \tag{2}$$

Here, $\bm{Z}=\{\bm{z}^{i}\}_{i=1}^{M}$ and $\bm{u}=\{u^{i}\}_{i=1}^{M}$ . Let us define a parametric Gaussian process by the resulting conditional distribution

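$$f(\bm{x}) := u(\bm{x}) \mid \bm{m}, \bm{S} \sim \mathcal{GP}\left(\mu(\bm{x};\bm{\theta},\bm{m}), \Sigma(\bm{x},\bm{x}^{\prime};\bm{\theta},\bm{S})\right), \tag{3}$$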

where

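$$\mu(\bm{x};\bm{\theta},\bm{m}) = k(\bm{x},\bm{Z};\bm{\theta})\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} \bm{m}, \tag{4}$$

$$\Sigma(\bm{x},\bm{x}^{\prime};\bm{\theta},\bm{S}) = k(\bm{x},\bm{x}^{\prime};\bm{\theta}) - k(\bm{x},\bm{Z};\bm{\theta})\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} k(\bm{Z},\bm{x}^{\prime};\bm{\theta}) + k(\bm{x},\bm{Z};\bm{\theta})\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} \bm{S}\, k(\bm{Z},\bm{Z};\bm{\theta})^{-1} k(\bm{Z},\bm{x}^{\prime};\bm{\theta}). \tag{5}$$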
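As a minimal NumPy sketch of the conditional mean and covariance of a parametric Gaussian process, assuming they take the standard Gaussian conditioning form on the hypothetical dataset: the function names `sq_exp_kernel` and `pgp_mean_cov`, the fixed kernel settings, and the `jitter` term added for numerical stability are illustrative assumptions, not part of the paper.

```python
import numpy as np


def sq_exp_kernel(X1, X2, gamma=1.0, w=10.0):
    """Squared exponential kernel gamma^2 * exp(-0.5 * w^2 * ||x - x'||^2).

    X1 has shape (N1, D) and X2 has shape (N2, D); gamma and w are
    illustrative fixed values, not trained hyper-parameters.
    """
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return gamma**2 * np.exp(-0.5 * w**2 * d2)


def pgp_mean_cov(X, Z, m, S, kernel=sq_exp_kernel, jitter=1e-8):
    """Evaluate mu(X) and Sigma(X, X) for a parametric GP with parameters (m, S)."""
    K_zz = kernel(Z, Z) + jitter * np.eye(len(Z))  # jitter is an assumed safeguard
    K_zx = kernel(Z, X)
    A = np.linalg.solve(K_zz, K_zx)                # k(Z,Z)^{-1} k(Z,X), shape (M, N)
    mu = A.T @ m
    cov = kernel(X, X) - K_zx.T @ A + A.T @ S @ A
    return mu, cov


# Small usage example with M = 5 hypothetical points in one dimension.
Z = np.linspace(0.0, 1.0, 5)[:, None]
m = np.array([0.0, 0.5, -0.2, 0.3, 0.1])
S = 0.05 * np.eye(5)
X_new = np.array([[0.1], [0.6]])
mu_new, cov_new = pgp_mean_cov(X_new, Z, m, S)
```

Evaluating the same functions at $\bm{x}=\bm{Z}$ returns $\bm{m}$ and $\bm{S}$ up to the jitter, so the parameters can be read as the mean and covariance of the process at the hypothetical locations.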

The parameters $\bm{m}$ and $\bm{S}$ of a parametric Gaussian process (3) play a crucial role: the data are distilled into these parameters, and any subsequent prediction does not make use of the original dataset. This is very convenient as it enables the efficient mini-batch training procedure outlined in the following. Taking advantage of the favorable form (3) of a parametric Gaussian process, the mean $\bm{m}$ and covariance matrix $\bm{S}$ of the hypothetical dataset (2) can be updated by employing the posterior distribution resulting from conditioning on the observed mini-batch of data $\{\widetilde{\bm{X}},\widetilde{\bm{y}}\}$ of size $\widetilde{N}$; i.e.,

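$$\bm{m} = \mu(\bm{Z};\bm{\theta},\bm{m}) + \Sigma(\bm{Z},\widetilde{\bm{X}};\bm{\theta},\bm{S}) \left[\Sigma(\widetilde{\bm{X}},\widetilde{\bm{X}};\bm{\theta},\bm{S}) + \sigma_{\epsilon}^{2}\bm{I}\right]^{-1} \left(\widetilde{\bm{y}} - \mu(\widetilde{\bm{X}};\bm{\theta},\bm{m})\right), \tag{6}$$

$$\bm{S} = \Sigma(\bm{Z},\bm{Z};\bm{\theta},\bm{S}) - \Sigma(\bm{Z},\widetilde{\bm{X}};\bm{\theta},\bm{S}) \left[\Sigma(\widetilde{\bm{X}},\widetilde{\bm{X}};\bm{\theta},\bm{S}) + \sigma_{\epsilon}^{2}\bm{I}\right]^{-1} \Sigma(\widetilde{\bm{X}},\bm{Z};\bm{\theta},\bm{S}). \tag{7}$$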

It is worth mentioning that $\mu(\bm{Z};\bm{\theta},\bm{m})=\bm{m}$ and $\Sigma(\bm{Z},\bm{Z};\bm{\theta},\bm{S})=\bm{S}$ . The information corresponding to the mini-batch $\{\widetilde{\bm{X}},\widetilde{\bm{y}}\}$ is now distilled in the parameters $\bm{m}$ and $\bm{S}$ . The hyper-parameters $\bm{\theta}$ and noise variance parameter $\sigma_{\epsilon}^{2}$ can be updated by taking a step proportional to the gradient of the negative log marginal likelihood

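$$\mathcal{NLML}(\bm{\theta},\sigma_{\epsilon}^{2}) := \frac{1}{2}\bm{m}^{T} k(\bm{Z},\bm{Z};\bm{\theta})^{-1}\bm{m} + \frac{1}{2}\log\left|k(\bm{Z},\bm{Z};\bm{\theta})\right| + \frac{1}{2} M \log(2\pi). \tag{8}$$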

The training procedure is initialized by setting $\bm{m}_{0}=\bm{0}$ and $\bm{S}_{0}=k(\bm{Z},\bm{Z};\bm{\theta}_{0})$, where $\bm{\theta}_{0}$ is some initial set of hyper-parameters. Having trained the hyper-parameters and parameters of the model, one can use equation (4) to predict the mean $\mu(\bm{x}^{*};\bm{\theta},\bm{m})$ of the solution at a new test point $\bm{x}^{*}$. Moreover, the predicted variance is given by $\Sigma(\bm{x}^{*},\bm{x}^{*};\bm{\theta},\bm{S})$, where $\Sigma$ is obtained from equation (5).
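The mini-batch update of $\bm{m}$ and $\bm{S}$ and the negative log marginal likelihood can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions rather than the author's implementation: the hyper-parameters and noise variance are held fixed (the gradient step on $\bm{\theta}$ and $\sigma_{\epsilon}^{2}$ would need automatic differentiation and is omitted), the conditional mean and covariance are assumed to take the standard Gaussian conditioning form, and the names `kernel`, `pgp_update`, `nlml`, the `jitter` term, and all numerical settings in the demo are hypothetical.

```python
import numpy as np


def kernel(X1, X2, gamma=1.0, w=10.0):
    """Squared exponential kernel; gamma and w are fixed, illustrative values."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return gamma**2 * np.exp(-0.5 * w**2 * d2)


def pgp_update(Z, m, S, X_b, y_b, noise_var, jitter=1e-8):
    """Condition the parametric GP on one mini-batch and return the new (m, S)."""
    K_zz = kernel(Z, Z) + jitter * np.eye(len(Z))
    K_zb = kernel(Z, X_b)
    A = np.linalg.solve(K_zz, K_zb)                        # k(Z,Z)^{-1} k(Z,X_b)
    mu_b = A.T @ m                                         # mu(X_b)
    cov_bb = kernel(X_b, X_b) - K_zb.T @ A + A.T @ S @ A   # Sigma(X_b, X_b)
    cov_zb = S @ A                                         # Sigma(Z, X_b), since Sigma(Z,Z) = S
    G = cov_bb + noise_var * np.eye(len(X_b))
    m_new = m + cov_zb @ np.linalg.solve(G, y_b - mu_b)    # uses mu(Z) = m
    S_new = S - cov_zb @ np.linalg.solve(G, cov_zb.T)
    return m_new, S_new


def nlml(Z, m, jitter=1e-8):
    """0.5 m^T K^{-1} m + 0.5 log|K| + 0.5 M log(2 pi), with K = k(Z, Z)."""
    M = len(Z)
    K_zz = kernel(Z, Z) + jitter * np.eye(M)
    L = np.linalg.cholesky(K_zz)
    quad = m @ np.linalg.solve(K_zz, m)
    # sum(log(diag(L))) equals 0.5 * log|K| for the Cholesky factor L.
    return 0.5 * quad + np.sum(np.log(np.diag(L))) + 0.5 * M * np.log(2.0 * np.pi)


# Demo on simulated data (sample size, noise level, and Z are assumptions).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 1))
y = X[:, 0] * np.sin(4.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(600)
Z = np.linspace(0.0, 1.0, 8)[:, None]
m, S = np.zeros(8), kernel(Z, Z)                           # m_0 = 0, S_0 = k(Z, Z)
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    m, S = pgp_update(Z, m, S, X[batch], y[batch], noise_var=0.01)
print("NLML after one pass:", nlml(Z, m))
```

Each update touches only the current mini-batch and the $M$ hypothetical points, so the cost of a step does not depend on the total number of records.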

3 Experiments

Parametric Gaussian process regression is entirely agnostic to the size of the dataset and can effectively handle datasets with millions or billions of records. The effectiveness of the proposed methodology will be demonstrated using an illustrative example with simulated data and a benchmark dataset in the literature on Gaussian processes and big data.

3.1 Illustrative example

To demonstrate the proposed framework, let us begin with a simple dataset generated by random perturbations of a one dimensional function given explicitly by $f(x)=x\sin(4\pi x)$ . The $6000$ training data are depicted in panel (A) of figure 1. The Gaussian process prior (1) used for this example is assumed to have a squared exponential [1] covariance function, i.e.,

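$$k(x,x^{\prime};\bm{\theta}) = \gamma^{2}\exp\left(-\frac{1}{2} w^{2} (x - x^{\prime})^{2}\right), \tag{9}$$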

where $\gamma^{2}$ is a variance parameter and $\bm{\theta}=\left(\gamma,w\right)$ are the hyper-parameters. The model employs a hypothetical dataset (see equation (2)) of size $M=8$. The locations $\bm{Z}$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer [19] with default settings and mini-batches of size one. After one pass through the entire training data, it is remarkable how the parameters $\bm{m}$ and $\bm{S}$ of the hypothetical dataset enable us to summarize the actual training data. The red circles in figure 1 denote the pairs $\{\bm{Z},\bm{m}\}$ of the hypothetical data. The resulting prediction of the model is plotted in figure 1.
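The data generation and the choice of the locations $\bm{Z}$ for this example can be sketched as follows. The input range $[0,1]$, the noise standard deviation of $0.1$, the random seed, and the small hand-written Lloyd iteration `kmeans` are assumptions made for illustration; the paper does not state these details.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6000, 8

# Simulated training data: noisy evaluations of f(x) = x sin(4 pi x).
# The input range and the noise level are assumptions.
X = rng.uniform(0.0, 1.0, size=(N, 1))
y = X[:, 0] * np.sin(4.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(N)


def kmeans(X, k, n_iter=50, rng=rng):
    """Plain Lloyd iterations; returns k cluster centers of shape (k, D)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:        # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


# Locations Z of the hypothetical dataset, one per cluster.
Z = kmeans(X, M)
```

Because the inputs are spread roughly uniformly, the resulting centers cover the input range fairly evenly, which is what lets eight pairs $\{\bm{Z},\bm{m}\}$ summarize the training data.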

3.2 Airline delays

The US flight delay prediction example, originally proposed in [16], has reached the status of a standard benchmark dataset (see e.g., [20, 21, 22, 23, 24]) in Gaussian process regression, partly because of the massive size of the dataset, with nearly $5.93$ million records, and partly because of its large-scale non-stationary nature. The dataset (available at http://stat-computing.org/dataexpo/2009/) consists of flight arrival and departure times for every commercial flight in the USA for the year 2008. Each record is complemented with details on the flight and the aircraft. The aim is to predict the delay in minutes of the aircraft at landing, $y$. The eight covariates $\bm{x}$ are the same as in [16], namely the age of the aircraft (number of years since deployment), route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. Two thirds of the entire dataset, roughly $3.95$ million records, are used for training and one third for testing. The output data are normalized by subtracting the training sample mean from the outputs and dividing the results by the sample standard deviation. The input data are normalized to the interval $[0,1]$. The Gaussian process prior (1) used for this example is assumed to have a squared exponential [1] covariance function, i.e.,
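The preprocessing described above can be sketched as follows. Several details are assumptions, since the text does not specify them: the split is taken to be a random permutation, the min-max and mean/standard-deviation statistics are computed on the training split only, and synthetic numbers stand in for the airline records. The function name `split_and_normalize` is hypothetical.

```python
import numpy as np


def split_and_normalize(X, y, train_frac=2.0 / 3.0, seed=0):
    """Random train/test split followed by the normalizations described above."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_train = int(round(train_frac * len(X)))
    train, test = perm[:n_train], perm[n_train:]

    # Inputs: min-max scaling to [0, 1] using training statistics (assumption).
    x_min = X[train].min(axis=0)
    x_range = X[train].max(axis=0) - x_min
    x_range = np.where(x_range > 0, x_range, 1.0)   # guard against constant columns
    X_n = (X - x_min) / x_range

    # Outputs: subtract the training mean, divide by the training standard deviation.
    y_mean, y_std = y[train].mean(), y[train].std()
    y_n = (y - y_mean) / y_std
    return X_n[train], y_n[train], X_n[test], y_n[test]


# Synthetic stand-in for the airline data: 900 records with 8 covariates.
rng = np.random.default_rng(1)
X_raw = rng.uniform(0.0, 100.0, size=(900, 8))
y_raw = rng.normal(10.0, 30.0, size=900)
X_tr, y_tr, X_te, y_te = split_and_normalize(X_raw, y_raw)
```

With training statistics used for scaling, test inputs can fall slightly outside $[0,1]$; computing the statistics on the full dataset instead would avoid this at the cost of using test information.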

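$$k(\bm{x},\bm{x}^{\prime};\bm{\theta}) = \gamma^{2}\exp\left(-\frac{1}{2}\sum_{d=1}^{8} w_{d}^{2} (x_{d} - x_{d}^{\prime})^{2}\right), \tag{10}$$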

where $\gamma^{2}$ is a variance parameter, $\bm{x}$ is the vector of covariates, and $\bm{\theta}=\left(\gamma,w_{1},\ldots,w_{8}\right)$ are the hyper-parameters. Moreover, anisotropy across input dimensions is handled by Automatic Relevance Determination (ARD) weights $w_{d}$. From a theoretical point of view, each kernel gives rise to a Reproducing Kernel Hilbert Space [25, 26, 27] that defines a class of functions that can be represented by this kernel. In particular, the squared exponential covariance function chosen above implies smooth approximations. More complex function classes can be accommodated by appropriately choosing kernels. The model employs a hypothetical dataset (see equation (2)) of size $M=500$. The locations $\bm{Z}$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer [19] with default settings and mini-batches of size $1000$. After $10000$ iterations of the training procedure, the predictive mean squared error (MSE) on the normalized test data is $0.832810$. This value for the MSE is within the range reported in the literature (see e.g., table 2 in [20]). The MSE over the normalized data can be interpreted as a fraction of the sample variance of airline arrival delays. Thus an MSE of $1.00$ is as good as using the training mean as predictor. In order to further reduce the MSE one could increase the size $M$ of the hypothetical dataset, increase the batch size, and/or choose a more accommodative covariance function. Moreover, to get a better idea of the relevance of the different features available in this dataset, figure 2 plots the automatic relevance determination parameters $w_{d}$. The most relevant variable turns out to be the airtime that needs to be covered. The month and time of departure of the flight are also two important features in predicting flight delays.
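Two points from this paragraph lend themselves to a short illustration: how an ARD weight $w_{d}$ controls the relevance of a covariate, and why an MSE of $1.00$ on normalized outputs corresponds to predicting the training mean. The sketch below uses synthetic numbers; the function names `ard_sq_exp_kernel` and `mse` and all numerical values are hypothetical.

```python
import numpy as np


def ard_sq_exp_kernel(X1, X2, gamma, w):
    """gamma^2 * exp(-0.5 * sum_d w_d^2 (x_d - x'_d)^2) for row-stacked inputs."""
    diff = X1[:, None, :] - X2[None, :, :]
    return gamma**2 * np.exp(-0.5 * np.sum((w * diff) ** 2, axis=-1))


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


rng = np.random.default_rng(0)

# ARD: a weight of zero switches a covariate off entirely.
X = rng.uniform(size=(5, 8))
w = np.ones(8)
w[2] = 0.0                              # covariate 2 is made irrelevant
X_shift = X.copy()
X_shift[:, 2] += 0.3                    # perturb only the irrelevant covariate
K = ard_sq_exp_kernel(X, X, gamma=1.5, w=w)
K_shift = ard_sq_exp_kernel(X, X_shift, gamma=1.5, w=w)   # identical to K

# Normalized MSE: predicting the mean (zero after normalization) gives MSE 1.
y = rng.normal(10.0, 30.0, size=1000)
y_n = (y - y.mean()) / y.std()
baseline = mse(y_n, np.zeros_like(y_n))
```

Larger values of $w_{d}$ make the kernel more sensitive to covariate $d$, which is the sense in which the learned weights rank the features.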

4 Related works

Despite some subtle differences, it is generally safe to recognize the input-output pairs $\{\bm{Z},\bm{u}\}$ (see equation (2)) as the so-called “inducing points”, a frequently used term in the literature on sparse approximations to Gaussian process priors (see e.g., [14] for a comprehensive review). However, it is not advisable to interpret $\bm{m}$ and $\bm{S}$ (see equation (2)) as variational parameters [16] since no (stochastic) variational inference is carried out in the current work. Furthermore, to highlight the subtle differences between “inducing points” and what this work calls the hypothetical dataset (2), it is worth observing that in the literature on sparse approximations to Gaussian processes it turns out that $\bm{m}=\bm{0}$ and $\bm{S}=k(\bm{Z},\bm{Z};\bm{\theta})$. Under these assumptions and using equations (4) and (5), one obtains $\mu(\bm{x};\bm{\theta},\bm{m})=0$ and $\Sigma(\bm{x},\bm{x}^{\prime};\bm{\theta},\bm{S})=k(\bm{x},\bm{x}^{\prime};\bm{\theta})$. In other words, in the sparse Gaussian processes framework, $f(\bm{x})$ and $u(\bm{x})$ are essentially identical; i.e., $f(\bm{x})=u(\bm{x})\sim\mathcal{GP}\left(0,k(\bm{x},\bm{x}^{\prime};\bm{\theta})\right)$. In contrast, this work treats $\bm{m}$ and $\bm{S}$ as parameters of the model responsible for encoding the history of observed data. In this regard, the current work is similar to [16]. However, unlike [16], the parameters $\bm{m}$ and $\bm{S}$ are not variational parameters of some variational distribution.
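The limiting case described above can be checked directly. The short derivation below assumes that the conditional mean and covariance of the parametric Gaussian process take the standard Gaussian conditioning form, and it uses the shorthand $\bm{K}=k(\bm{Z},\bm{Z};\bm{\theta})$, a symbol introduced only for this sketch; the dependence of $k$ on $\bm{\theta}$ is suppressed for brevity.

```latex
\begin{aligned}
\mu(\bm{x};\bm{\theta},\bm{m})
  &= k(\bm{x},\bm{Z})\,\bm{K}^{-1}\bm{m}
   \;\overset{\bm{m}=\bm{0}}{=}\; 0, \\
\Sigma(\bm{x},\bm{x}^{\prime};\bm{\theta},\bm{S})
  &= k(\bm{x},\bm{x}^{\prime})
   - k(\bm{x},\bm{Z})\,\bm{K}^{-1}k(\bm{Z},\bm{x}^{\prime})
   + k(\bm{x},\bm{Z})\,\bm{K}^{-1}\bm{S}\,\bm{K}^{-1}k(\bm{Z},\bm{x}^{\prime}) \\
  &\overset{\bm{S}=\bm{K}}{=}\; k(\bm{x},\bm{x}^{\prime})
   - k(\bm{x},\bm{Z})\,\bm{K}^{-1}k(\bm{Z},\bm{x}^{\prime})
   + k(\bm{x},\bm{Z})\,\bm{K}^{-1}k(\bm{Z},\bm{x}^{\prime})
   \;=\; k(\bm{x},\bm{x}^{\prime}).
\end{aligned}
```

The second and third terms cancel exactly when $\bm{S}=\bm{K}$, so the prior is recovered; any other choice of $\bm{m}$ and $\bm{S}$ moves the process away from the prior, which is how the observed data are encoded.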

5 Concluding remarks

Modern datasets are rapidly growing in size and complexity, and there is a pressing need to develop new statistical methods and machine learning techniques to harness this wealth of data. This work presented a novel regression framework for encoding massive amounts of data into a small number of hypothetical data points. While being effective, the resulting model is conceptually very simple, is based on the idea of making Gaussian processes parametric, and takes at most $8$ mathematical formulas to explain every single detail of the algorithm. This simplicity is extremely important, especially when it comes to deploying machine learning algorithms on big data flow engines (see e.g., [28]) such as MapReduce [29] and Apache Spark [30]. Moreover, Gaussian processes are a powerful tool for probabilistic inference over functions. They offer desirable properties such as uncertainty estimates, automatic discovery of important dimensions, robustness to over-fitting, and principled ways of tuning hyper-parameters. Thus, scaling Gaussian processes to big datasets and deploying them on big data flow engines is, and will remain, an active area of research.

Acknowledgments

This work received support from the DARPA EQUiPS grant N66001-15-2-4055 and the AFOSR grant FA9550-17-1-0013.

Bibliography

[1] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
[2] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[3] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
[4] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[5] Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1:211–244, 2001.
[6] Andrey Tikhonov. Solution of incorrectly formulated problems and the regularization method. In Soviet Math. Dokl., volume 5, pages 1035–1038, 1963.
[7] A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-posed problems. W.H. Winston, 1977.
[8] Tomaso Poggio and Federico Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.