TL;DR
This paper proposes parametric Gaussian processes (PGPs) for big data, offering a scalable alternative to stochastic variational inference, and demonstrates their effectiveness on large-scale datasets.
Contribution
It introduces the novel concept of parametric Gaussian processes that operate efficiently in big data settings without relying on stochastic variational inference.
Findings
Effective on simulated data
Performs well on airline industry benchmark dataset
Avoids the need for stochastic variational inference
Abstract
This work introduces the concept of parametric Gaussian processes (PGPs), which is built upon the seemingly self-contradictory idea of making Gaussian processes parametric. Parametric Gaussian processes, by construction, are designed to operate in "big data" regimes where one is interested in quantifying the uncertainty associated with noisy data. The proposed methodology circumvents the well-established need for stochastic variational inference, a scalable algorithm for approximating posterior distributions. The effectiveness of the proposed approach is demonstrated using an illustrative example with simulated data and a benchmark dataset in the airline industry with approximately 6 million records.
Parametric Gaussian Process Regression for Big Data
Maziar Raissi
Division of Applied Mathematics
Brown University
Providence, RI 02912
http://www.dam.brown.edu/people/mraissi/
Abstract
This work introduces the concept of parametric Gaussian processes (PGPs), which is built upon the seemingly self-contradictory idea of making Gaussian processes parametric. Parametric Gaussian processes, by construction, are designed to operate in “big data” regimes where one is interested in quantifying the uncertainty associated with noisy data. The proposed methodology circumvents the well-established need for stochastic variational inference, a scalable algorithm for approximating posterior distributions. The effectiveness of the proposed approach is demonstrated using an illustrative example with simulated data and a benchmark dataset in the airline industry with approximately 6 million records.
1 Introduction
Gaussian processes (see [1, 2]) are a non-parametric Bayesian machine learning technique that provides a flexible prior distribution over functions, enjoys analytical tractability, and has a fully probabilistic workflow that returns robust posterior variance estimates, which quantify uncertainty in a natural way. Moreover, Gaussian processes are among a class of methods known as kernel machines (see [3, 4, 5]) and are analogous to regularization approaches (see [6, 7, 8]). They can also be viewed as a prior on one-layer feed-forward Bayesian neural networks with an infinite number of hidden units [9]. Non-parametric models such as Gaussian processes need to “remember” the full dataset in order to be trained and make predictions. Therefore, the complexity of non-parametric models grows with the size of the dataset. For instance, when applying a Gaussian process to a dataset of size $N$, exact inference has computational complexity $\mathcal{O}(N^3)$ with storage demands of $\mathcal{O}(N^2)$. In recent years, a tremendous amount of effort (see e.g., [10, 11]) has gone into reducing these complexities. Such efforts generally lead to a computational complexity of $\mathcal{O}(NM^2)$ and storage demands of $\mathcal{O}(NM)$, where $M$ is a user-specified parameter governing the number of “inducing variables” (see e.g., [12, 13, 14, 15]). However, as rightly pointed out in [16], even these reduced storage demands are prohibitive for “big data”. In [16], the authors combine the idea of inducing variables with recent advances in variational inference (see e.g., [17, 18]) to develop a practical algorithm for fitting Gaussian processes using stochastic variational inference.
In contrast, the current work avoids stochastic variational inference and attempts to present an alternative approach to the one proposed in [16]. The seemingly self-contradictory idea is to make Gaussian processes parametric. The key feature of parametric models in general, and the current work in particular, is that predictions are conditionally independent of the observed data given the parameters. In other words, the data is distilled into the parameters and any subsequent prediction does not make use of the original dataset. This is very convenient as it enables efficient mini-batch training procedures. However, this is not without drawbacks since choosing a model from a particular parametric class constrains its flexibility. Therefore, it is of great importance to devise models that are aware of their imperfections and are capable of properly quantifying the uncertainty in their predictions associated with such limitations.
2 Methodology
Let us start by making the prior assumption that
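$$f(x) \sim \mathcal{GP}\left(0, k(x, x'; \theta)\right) \tag{1}$$
is a zero mean Gaussian process [1] with covariance function $k(x, x'; \theta)$ which depends on the hyper-parameters $\theta$. Moreover, let us postulate the existence of some hypothetical dataset $\{Z, u\}$ with
$$u \sim \mathcal{N}(m, S). \tag{2}$$
Here, $m$ is the mean vector and $S$ is the covariance matrix of the $M$ hypothetical outputs $u$ at the locations $Z$. Let us define a parametric Gaussian process by the resulting conditional distribution
$$f(x) \mid Z, m, S \sim \mathcal{GP}\left(\mu(x), \Sigma(x, x')\right), \tag{3}$$
where
$$\begin{aligned}
\mu(x) &= k(x, Z) K^{-1} m, \\
\Sigma(x, x') &= k(x, x') - k(x, Z) K^{-1} k(Z, x') + k(x, Z) K^{-1} S K^{-1} k(Z, x'),
\end{aligned} \tag{4}$$
with $K = k(Z, Z)$.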
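As a minimal sketch of equations (3) and (4), the NumPy snippet below evaluates the mean and covariance of a parametric Gaussian process at a set of query points. It assumes $\mu(x) = k(x, Z) K^{-1} m$ and $\Sigma(x, x') = k(x, x') - k(x, Z) K^{-1} k(Z, x') + k(x, Z) K^{-1} S K^{-1} k(Z, x')$ with $K = k(Z, Z)$. The function names `sq_exp_kernel` and `pgp_moments`, the fixed kernel parameters, and the small `jitter` term added for numerical stability are illustrative choices, not part of the original text.

```python
import numpy as np

def sq_exp_kernel(A, B, gamma=1.0, w=1.0):
    """Squared exponential kernel gamma^2 * exp(-0.5 * w * ||a - b||^2)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return gamma**2 * np.exp(-0.5 * w * d2)

def pgp_moments(X, Z, m, S, jitter=1e-8):
    """Mean mu(X) and covariance Sigma(X, X) of the parametric GP."""
    K = sq_exp_kernel(Z, Z) + jitter * np.eye(len(Z))   # K = k(Z, Z), regularized
    k_xz = sq_exp_kernel(X, Z)                          # k(X, Z)
    A = np.linalg.solve(K, k_xz.T).T                    # k(X, Z) K^{-1} (K is symmetric)
    mu = A @ m
    Sigma = sq_exp_kernel(X, X) - A @ k_xz.T + A @ S @ A.T
    return mu, Sigma
```

Evaluating at the hypothetical locations themselves, `pgp_moments(Z, Z, m, S)`, should return `m` and `S` up to the jitter, which is a convenient sanity check on any implementation.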
The parameters $m$ and $S$ of a parametric Gaussian process (3) play a crucial role: the data are distilled into these parameters, and subsequent predictions do not make use of the original dataset. This is very convenient as it enables the efficient mini-batch training procedure outlined in the following. Taking advantage of the favorable form (3) of a parametric Gaussian process, the mean $m$ and covariance matrix $S$ of the hypothetical dataset (2) can be updated by employing the posterior distribution resulting from conditioning on the observed mini-batch of data $\{X, y\}$ of size $N$, with noise variance $\sigma^2$; i.e.,
$$\begin{aligned}
m &\leftarrow \mu(Z) + \Sigma(Z, X) \left[\Sigma(X, X) + \sigma^2 I\right]^{-1} \left[y - \mu(X)\right], \\
S &\leftarrow \Sigma(Z, Z) - \Sigma(Z, X) \left[\Sigma(X, X) + \sigma^2 I\right]^{-1} \Sigma(X, Z).
\end{aligned} \tag{5}$$
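The update can be read as standard Gaussian conditioning. As a sketch, assume the observation model $y = f(X) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ (an assumption consistent with the noise variance parameter used in the text, but not spelled out there). Under the parametric Gaussian process with mean $\mu$ and covariance $\Sigma$, the outputs at the hypothetical locations $Z$ and the mini-batch observations are jointly Gaussian:

$$\begin{bmatrix} f(Z) \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(Z) \\ \mu(X) \end{bmatrix}, \begin{bmatrix} \Sigma(Z, Z) & \Sigma(Z, X) \\ \Sigma(X, Z) & \Sigma(X, X) + \sigma^2 I \end{bmatrix} \right).$$

Conditioning the first block on the observed $y$ gives

$$f(Z) \mid y \sim \mathcal{N}\left( \mu(Z) + \Sigma(Z, X) \left[\Sigma(X, X) + \sigma^2 I\right]^{-1} \left[y - \mu(X)\right],\; \Sigma(Z, Z) - \Sigma(Z, X) \left[\Sigma(X, X) + \sigma^2 I\right]^{-1} \Sigma(X, Z) \right),$$

whose mean and covariance are the new values of $m$ and $S$.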
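It is worth mentioning that $\mu(Z) = m$ and $\Sigma(Z, Z) = S$. The information corresponding to the mini-batch $\{X, y\}$ is now distilled in the parameters $m$ and $S$. The hyper-parameters $\theta$ and the noise variance parameter $\sigma^2$ can be updated by taking a step proportional to the gradient of the negative log marginal likelihood
$$\mathrm{NLML}(\theta, \sigma^2) = \frac{1}{2} \left[y - \mu(X)\right]^T \left[\Sigma(X, X) + \sigma^2 I\right]^{-1} \left[y - \mu(X)\right] + \frac{1}{2} \log \left|\Sigma(X, X) + \sigma^2 I\right| + \frac{N}{2} \log(2\pi). \tag{6}$$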
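A minimal sketch of how this objective might be evaluated for one mini-batch is given below. It assumes the Gaussian negative log marginal likelihood with mean `mu_X` $= \mu(X)$ and covariance `Sigma_XX` $= \Sigma(X, X)$ already computed, and uses a Cholesky factorization for the inverse and the log-determinant. The function name `nlml` and its argument names are hypothetical. The gradient step itself (for example through automatic differentiation) is not shown.

```python
import numpy as np

def nlml(y, mu_X, Sigma_XX, sigma2):
    """Negative log marginal likelihood of a mini-batch under N(mu_X, Sigma_XX + sigma2 * I)."""
    N = len(y)
    G = Sigma_XX + sigma2 * np.eye(N)
    L = np.linalg.cholesky(G)                   # G = L L^T
    r = np.linalg.solve(L, y - mu_X)            # L^{-1} (y - mu_X)
    quad = 0.5 * r @ r                          # 0.5 * (y - mu)^T G^{-1} (y - mu)
    half_logdet = np.sum(np.log(np.diag(L)))    # 0.5 * log|G|
    return quad + half_logdet + 0.5 * N * np.log(2.0 * np.pi)
```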
The training procedure is initialized by setting $m = 0$ and $S = k(Z, Z; \theta_0)$, where $\theta_0$ is some initial set of hyper-parameters. Having trained the hyper-parameters and parameters of the model, one can use equation (4) to predict the mean $\mu(x^*)$ of the solution at a new test point $x^*$. Moreover, the predicted variance is given by $\Sigma(x^*, x^*)$, where $\Sigma$ is the covariance function of the parametric Gaussian process (3).
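The initialization, mini-batch updates, and prediction described above can be put together in a short streaming sketch. This is a simplified illustration, not the author's implementation: the hyper-parameters and noise variance are held fixed (no gradient step on the negative log marginal likelihood), the class name `ParametricGP`, the toy sine data, the batch size of 20, and the `jitter` term are all assumptions. The update uses the identities $\mu(Z) = m$, $\Sigma(Z, Z) = S$, and $\Sigma(Z, X) = S K^{-1} k(Z, X)$, which follow from the assumed form of $\mu$ and $\Sigma$.

```python
import numpy as np

def kern(A, B, gamma=1.0, w=1.0):
    """Squared exponential kernel between the rows of A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return gamma**2 * np.exp(-0.5 * w * d2)

class ParametricGP:
    def __init__(self, Z, sigma2=0.01, jitter=1e-8):
        self.Z, self.sigma2 = Z, sigma2
        self.K = kern(Z, Z) + jitter * np.eye(len(Z))
        self.m = np.zeros(len(Z))   # initialization m = 0
        self.S = kern(Z, Z)         # initialization S = k(Z, Z; theta_0)

    def moments(self, X):
        A = np.linalg.solve(self.K, kern(self.Z, X)).T   # k(X, Z) K^{-1}
        mu = A @ self.m
        Sigma = kern(X, X) - A @ kern(self.Z, X) + A @ self.S @ A.T
        return mu, Sigma, A

    def update(self, X, y):
        """Condition (m, S) on one mini-batch (X, y)."""
        mu, Sigma, A = self.moments(X)
        Sigma_ZX = self.S @ A.T                          # Sigma(Z, X)
        G = Sigma + self.sigma2 * np.eye(len(X))
        gain = np.linalg.solve(G, Sigma_ZX.T).T          # Sigma(Z, X) G^{-1}
        self.m = self.m + gain @ (y - mu)
        self.S = self.S - gain @ Sigma_ZX.T

    def predict(self, X_star):
        """Predictive mean and variance at the test points."""
        mu, Sigma, _ = self.moments(X_star)
        return mu, np.diag(Sigma)

# Toy data: noisy samples of sin(x), streamed in mini-batches of 20.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

model = ParametricGP(Z=np.linspace(-3.0, 3.0, 10)[:, None])
for start in range(0, len(X), 20):
    model.update(X[start:start + 20], y[start:start + 20])

X_star = np.linspace(-2.5, 2.5, 50)[:, None]
mean, var = model.predict(X_star)
rmse = np.sqrt(np.mean((mean - np.sin(X_star[:, 0])) ** 2))
```

After the loop, prediction only touches `model.Z`, `model.m`, and `model.S`; the training arrays `X` and `y` are no longer needed, which is the sense in which the data are distilled into the parameters.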
3 Experiments
Parametric Gaussian process regression is entirely agnostic to the size of the dataset and can effectively handle datasets with millions or billions of records. The effectiveness of the proposed methodology will be demonstrated using an illustrative example with simulated data and a benchmark dataset in the literature on Gaussian processes and big data.
3.1 Illustrative example
To demonstrate the proposed framework, let us begin with a simple dataset generated by random perturbations of a known one-dimensional function $f(x)$. The training data are depicted in panel (A) of figure 1. The Gaussian process prior (1) used for this example is assumed to have a squared exponential [1] covariance function, i.e.,
$$k(x, x'; \theta) = \gamma^2 \exp\left(-\frac{1}{2} w (x - x')^2\right), \tag{7}$$
where $\gamma^2$ is a variance parameter and $\theta = (\gamma, w)$ are the hyper-parameters. The model employs a hypothetical dataset (see equation (2)) of size $M$. The locations $Z$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer [19] with default settings and mini-batches of size one. After one pass through the entire training data, the parameters $m$ and $S$ of the hypothetical dataset summarize the actual training data remarkably well. The red circles in figure 1 denote the pairs $\{Z, m\}$ of the hypothetical data. The resulting prediction of the model is plotted in figure 1.
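The text does not say which k-means implementation is used to place the hypothetical locations. As one possible sketch, a plain Lloyd-style k-means in NumPy is shown below; the function name `kmeans_locations`, the iteration count, and the random initialization from data points are assumptions made for illustration.

```python
import numpy as np

def kmeans_locations(X, M, n_iter=50, seed=0):
    """Return M cluster centers of the rows of X, usable as hypothetical locations Z."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(d2, axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(M):
            pts = X[labels == j]
            if len(pts) > 0:        # keep the old center if a cluster becomes empty
                centers[j] = pts.mean(axis=0)
    return centers

# Example: 8 locations for 1000 one-dimensional inputs.
X_demo = np.random.default_rng(1).uniform(0.0, 1.0, size=(1000, 1))
Z_demo = kmeans_locations(X_demo, M=8)
```

For datasets with millions of records, the distance matrix above would be too large to form at once; running k-means on a random subsample is a common workaround, though the source does not state what was done.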
3.2 Airline delays
The US flight delay prediction example, originally proposed in [16], has become a standard benchmark (see e.g., [20, 21, 22, 23, 24]) in Gaussian process regression, partly because of the massive size of the dataset, with nearly 6 million records, and partly because of its large-scale non-stationary nature. The dataset (http://stat-computing.org/dataexpo/2009/) consists of flight arrival and departure times for every commercial flight in the USA for the year 2008. Each record is complemented with details on the flight and the aircraft. The aim is to predict the delay in minutes of the aircraft at landing, $y$. The eight covariates $x$ are the same as in [16], namely the age of the aircraft (number of years since deployment), route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. Two thirds of the entire dataset, which totals approximately 6 million records, are used for training and one third for testing. The output data are normalized by subtracting the training sample mean from the outputs and dividing the results by the sample standard deviation. The input data are rescaled to a common bounded interval. The Gaussian process prior (1) used for this example is assumed to have a squared exponential [1] covariance function, i.e.,
$$k(x, x'; \theta) = \gamma^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{8} w_d (x_d - x'_d)^2\right), \tag{8}$$
where $\gamma^2$ is a variance parameter, $x$ is the vector of covariates, and $\theta = (\gamma, w_1, \ldots, w_8)$ are the hyper-parameters. Moreover, anisotropy across input dimensions is handled by Automatic Relevance Determination (ARD) weights $w_d$. From a theoretical point of view, each kernel gives rise to a Reproducing Kernel Hilbert Space [25, 26, 27] that defines the class of functions that can be represented by this kernel. In particular, the squared exponential covariance function chosen above implies smooth approximations. More complex function classes can be accommodated by appropriately choosing kernels.
The model employs a hypothetical dataset (see equation (2)) of size $M$. The locations $Z$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm. The training procedure is carried out using the Adam stochastic optimizer [19] with default settings and mini-batches of size $N$. After training, the predictive mean squared error (MSE) on the normalized test data is within the range reported in the literature (see e.g., table 2 in [20]). The MSE over the normalized data can be interpreted as a fraction of the sample variance of airline arrival delays. Thus an MSE of 1 is as good as using the training mean as predictor. In order to further reduce the MSE one could increase the size of the hypothetical dataset, increase the batch size, and/or choose a more expressive covariance function.
Moreover, to get a better idea of the relevance of the different features available in this dataset, figure 2 plots the automatic relevance determination parameters, one per covariate. The most relevant variable turns out to be the airtime. The month and time of departure of the flight are also two important features in predicting flight delays.
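Two ingredients of this experiment lend themselves to a short sketch: the ARD squared exponential kernel and the output normalization that makes an MSE of 1 equivalent to predicting the training mean. The snippet below assumes the kernel form $\gamma^2 \exp(-\frac{1}{2} \sum_d w_d (x_d - x'_d)^2)$. The function names and the synthetic exponential "delays" are illustrative stand-ins, not the airline data, and the population standard deviation (NumPy's default) is used where the text says "sample standard deviation".

```python
import numpy as np

def ard_sq_exp_kernel(A, B, gamma, w):
    """gamma^2 * exp(-0.5 * sum_d w_d (a_d - b_d)^2), one weight w_d per input dimension."""
    diff = A[:, None, :] - B[None, :, :]
    return gamma**2 * np.exp(-0.5 * np.sum(w * diff**2, axis=-1))

def standardize_outputs(y_train, y_test):
    """Subtract the training mean and divide by the training standard deviation."""
    mean, std = y_train.mean(), y_train.std()
    return (y_train - mean) / std, (y_test - mean) / std

rng = np.random.default_rng(0)

# A zero ARD weight switches a covariate off: the kernel ignores that dimension.
X_demo = rng.uniform(size=(4, 8))
w_demo = np.zeros(8)
w_demo[2] = 5.0                       # only the third covariate is relevant here
K_demo = ard_sq_exp_kernel(X_demo, X_demo, gamma=1.0, w=w_demo)

# Synthetic stand-in for delays in minutes, split two thirds / one third.
delays = rng.exponential(scale=30.0, size=3000)
y_train, y_test = standardize_outputs(delays[:2000], delays[2000:])

# Predicting the training mean means predicting 0 after normalization.
baseline_mse_train = np.mean(y_train**2)   # equals 1 by construction
baseline_mse_test = np.mean(y_test**2)     # close to 1 if train and test are similar
```

A larger weight $w_d$ makes the kernel more sensitive to differences in covariate $d$, which is why the learned weights can be read as relevance scores.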
4 Related works
Despite some subtle differences, it is generally safe to recognize the input-output pairs $\{Z, u\}$ (see equation (2)) as the so-called “inducing points”, a frequently used term in the literature on sparse approximations to Gaussian process priors (see e.g., [14] for a comprehensive review). However, it is not advisable to interpret $m$ and $S$ (see equation (2)) as variational parameters [16] since no (stochastic) variational inference is carried out in the current work. Furthermore, to highlight the subtle differences between “inducing points” and what this work calls the hypothetical dataset (2), it is worth observing that in the literature on sparse approximations to Gaussian processes it turns out that $m = u$ and $S = 0$. Under these assumptions and using equations (4) and (2), one obtains $\mu(Z) = u$ and $\Sigma(Z, Z) = 0$. In other words, in the sparse Gaussian processes framework, $u$ and $f(Z)$ are essentially identical; i.e., $u = f(Z)$. In contrast, this work treats $m$ and $S$ as parameters of the model responsible for encoding the history of observed data. In this regard, the current work is similar to [16]. However, unlike [16], the parameters $m$ and $S$ are not variational parameters of some variational distribution.
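The reduction to the sparse Gaussian process setting can be checked in two lines. As a sketch, assume the mean and covariance of the parametric Gaussian process take the form $\mu(x) = k(x, Z) K^{-1} m$ and $\Sigma(x, x') = k(x, x') - k(x, Z) K^{-1} k(Z, x') + k(x, Z) K^{-1} S K^{-1} k(Z, x')$ with $K = k(Z, Z)$. Evaluating both at the hypothetical locations gives

$$\begin{aligned}
\mu(Z) &= K K^{-1} m = m, \\
\Sigma(Z, Z) &= K - K K^{-1} K + K K^{-1} S K^{-1} K = S.
\end{aligned}$$

Setting the mean to the inducing outputs and the covariance to zero therefore leaves a distribution over $f(Z)$ with no remaining variance, concentrated on those outputs. Any nonzero $S$ keeps uncertainty about the function values at $Z$, which is what allows $m$ and $S$ to be refined as further mini-batches arrive.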
5 Concluding remarks
Modern datasets are rapidly growing in size and complexity, and there is a pressing need to develop new statistical methods and machine learning techniques to harness this wealth of data. This work presented a novel regression framework for encoding massive amounts of data into a small number of hypothetical data points. While being effective, the resulting model is conceptually very simple, is based on the idea of making Gaussian processes parametric, and takes only a few mathematical formulas to explain every detail of the algorithm. This simplicity is extremely important, especially when it comes to deploying machine learning algorithms on big data flow engines (see e.g., [28]) such as MapReduce [29] and Apache Spark [30]. Moreover, Gaussian processes are a powerful tool for probabilistic inference over functions. They offer desirable properties such as uncertainty estimates, automatic discovery of important dimensions, robustness to over-fitting, and principled ways of tuning hyper-parameters. Thus, scaling Gaussian processes to big datasets and deploying them on big data flow engines is, and will remain, an active area of research.
Acknowledgments
This work received support from the DARPA EQUiPS grant N66001-15-2-4055 and the AFOSR grant FA9550-17-1-0013.
References
- [1] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
- [2] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
- [3] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
- [4] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
- [5] Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1:211–244, 2001.
- [6] Andrey Tikhonov. Solution of incorrectly formulated problems and the regularization method. In Soviet Math. Dokl., volume 5, pages 1035–1038, 1963.
- [7] A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-posed problems. W.H. Winston, 1977.
- [8] Tomaso Poggio and Federico Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.
