Scalable GAM using sparse variational Gaussian processes

Vincent Adam; Nicolas Durrande; ST John

arXiv:1812.11106·cs.LG·December 31, 2018


TL;DR

This paper introduces a scalable Bayesian approach to generalized additive models using sparse variational Gaussian processes, enabling efficient and interpretable modeling of complex data.

Contribution

It presents a novel scalable Bayesian GAM framework with sparse GPs and variational inference, improving computational efficiency and model calibration.

Findings

1. Efficient inference for large datasets using sparse GPs.
2. Enhanced interpretability of GAM components.
3. Well-calibrated Bayesian uncertainty estimates.

Abstract

Generalized additive models (GAMs) are a widely used class of models of interest to statisticians as they provide a flexible way to design interpretable models of data beyond linear models. We here propose a scalable and well-calibrated Bayesian treatment of GAMs using Gaussian processes (GPs) and leveraging recent advances in variational inference. We use sparse GPs to represent each component and exploit the additive structure of the model to efficiently represent a Gaussian a posteriori coupling between the components.


Equations

$$\log p(Y\,|\,X)\geq\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F},X)]-\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]={\cal L}(q).$$

$${\cal L}(q)=\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F})]+\frac{1}{2}\Big(\operatorname{tr}(K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}})+\mu_{{\cal F}}^{\intercal}K_{{\cal F},{\cal F}}^{-1}\mu_{{\cal F}}-\log|\Sigma_{{\cal F}}|-NC\Big).$$

$$\Sigma^{-1}_{{\cal F}}=K_{{\cal F},{\cal F}}^{-1}+\nabla_{\Sigma_{{\cal F}}}\big[\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F})]\big].$$

$$\Sigma^{-1}_{{\cal F}}=K_{{\cal F},{\cal F}}^{-1}+(1_{C}\otimes\Lambda)(1_{C}\otimes\Lambda)^{\intercal}.$$

$$s_{i}(x_{i},y_{i})=g_{i}(x_{i},y_{i})-\frac{\int_{0}^{1}g_{i}(x_{i},s)\,ds\int_{0}^{1}g_{i}(y_{i},s)\,ds}{\iint_{0}^{1}g_{i}(s,t)\,ds\,dt}.$$

$$\Sigma^{-1}_{{\cal F}}=K_{{\cal F},{\cal F}}^{-1}+\nabla_{\Sigma_{{\cal F}}}V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}}).$$

$$V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}})=\sum_{n}v(y_{n},\mu_{\rho(x_{n})},\sigma^{2}_{\rho(x_{n})}),$$

$$\nabla_{\Sigma_{{\cal F}}}V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}})=\sum_{n}\lambda_{n}^{2}\sum_{c,c^{\prime}}e_{cn,c^{\prime}n},$$

$$q({\cal F})={\cal N}(K_{{\cal F},{\cal F}}\alpha,\Sigma_{{\cal F}})$$

$$\Sigma_{{\cal F}}=\big(K_{{\cal F},{\cal F}}^{-1}+(1\otimes\Lambda)(1\otimes\Lambda)^{\intercal}\big)^{-1},$$

$${\cal L}(q)=\mathds{E}_{q(\sum_{c}f_{c})}[\log p(y\,|\,\textstyle\sum_{c}f_{c})]-\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})],$$

$$\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]=\frac{1}{2}\big[-\log|K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}}|+\alpha^{\intercal}K_{{\cal F},{\cal F}}\alpha+\operatorname{tr}(K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}})-NC\big].$$

$$\begin{aligned}\Sigma_{{\cal F}}&=K_{{\cal F},{\cal F}}-K_{{\cal F},{\cal F}}(1\otimes\Lambda)\big(I+(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}(1\otimes\Lambda)\big)^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}\\&=K_{{\cal F},{\cal F}}-K_{{\cal F},{\cal F}}(1\otimes\Lambda)A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}},\end{aligned}$$

$$A=I+\textstyle\sum_{c}\Lambda^{\intercal}K_{f_{c},f_{c}}\Lambda.$$

$$\begin{aligned}\Sigma_{\text{sum}}&=(1\otimes I)^{\intercal}\big[K_{{\cal F},{\cal F}}-K_{{\cal F},{\cal F}}(1\otimes\Lambda)A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}\big](1\otimes I)\\&=\textstyle\sum_{c}K_{f_{c},f_{c}}-\sum_{c,c^{\prime}}K_{f_{c},f_{c}}\Lambda A^{-1}\Lambda K_{f_{c^{\prime}},f_{c^{\prime}}}\\&=\textstyle\sum_{c}K_{f_{c},f_{c}}-\sum_{c,c^{\prime}}(L_{A}^{-\intercal}\Lambda K_{f_{c},f_{c}})^{\intercal}(L_{A}^{-\intercal}\Lambda K_{f_{c^{\prime}},f_{c^{\prime}}}).\end{aligned}$$

$$\begin{aligned}|K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}}|&=|K_{{\cal F},{\cal F}}^{-1}|\,/\,|K_{{\cal F},{\cal F}}^{-1}+(1\otimes\Lambda)(1\otimes\Lambda)^{\intercal}|\\&=|K_{{\cal F},{\cal F}}^{-1}|\,/\,\big[|I+(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}(1\otimes\Lambda)|\,|I|\,|K_{{\cal F},{\cal F}}^{-1}|\big]\\&=1/|A|\end{aligned}$$

$$\begin{aligned}\operatorname{tr}(K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}})&=\operatorname{tr}(I-(1\otimes\Lambda)A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}})\\&=NC-\operatorname{tr}(A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}(1\otimes\Lambda))\\&=NC-\textstyle\sum_{c}\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}K_{f_{c},f_{c}})\\&=NC-\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}\textstyle\sum_{c}K_{f_{c},f_{c}})\end{aligned}$$

$$\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]=\frac{1}{2}\big[\log|A|+\alpha^{\intercal}K_{{\cal F},{\cal F}}\alpha-\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}\textstyle\sum_{c}K_{f_{c},f_{c}})\big].$$


Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Control Systems and Identification · Advanced Multi-Objective Optimization Algorithms

Full text

1st Symposium on Advances in Approximate Bayesian Inference (AABI), 2018

Scalable GAM using sparse variational Gaussian processes

Vincent Adam, Nicolas Durrande, ST John

PROWLER.io, 72 Hills Road, Cambridge CB2 1LA, United Kingdom


Keywords: GAM, Gaussian Process, Variational Inference

1 Introduction

Generalized additive models (GAMs) are a class of interpretable regression models with non-linear yet additive predictors (Hastie, 2017). Their Bayesian treatment requires the specification of priors over functions. Here, we use Gaussian processes (GPs) (Rasmussen and Williams, 2006) and propose an approximate inference algorithm that scales with both the number of data points and the number of additive components, and that provides accurate posterior uncertainty estimates. We extend the variational pseudo-point GP approximation (Titsias, 2009; Bauer et al., 2016) to capture posterior dependencies across GPs. This approximation provides state-of-the-art performance for GP regression and yields an approximate posterior that is itself a GP. It has been successfully extended to the multiple-GP setting using a factorized (mean-field) approximation of the posterior across GPs (Saul et al., 2016; Adam et al., 2016). However, mean-field approximations are known to underestimate posterior variance, which can lead to poor predictions or bias learning (Turner and Sahani, 2011). Adam (2017) introduced additional structure to the posterior distribution by allowing some coupling across the inducing variables of the different GPs, but at the cost of scalability.

2 Background

2.1 Regression with multiple GPs

We consider models with an additive predictor and a factorizing likelihood $p(Y\,|\,f_{1\dots C},X)=\prod_{n=1}^{N}p(y_{n}\,|\,\textstyle\sum_{c}f_{c}(x_{n}))$, where $f_{1},\dots,f_{C}$ are functions $f_{c}:{\cal X}_{c}\to\mathds{R}$. The specific form of the likelihood is arbitrary. We denote ${\cal F}=\{f_{1},...,f_{C}\}$ such that $p({\cal F})=\prod_{c}p(f_{c})$ constitutes the joint distribution over the a priori independent processes. ${\cal F}(x)=[f_{1}(x),...,f_{C}(x)]$ is the vector of function evaluations at $x$. To simplify notation, when no argument is given, ${\cal F}={\cal F}(X)\in\mathds{R}^{NC}$. We denote by $K_{{\cal F},{\cal F}}$ the block-diagonal prior covariance matrix over ${\cal F}$. We are interested in computing the joint posterior $p({\cal F}\,|\,X,Y)$.
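As a minimal sketch of this setup, the snippet below builds the block-diagonal prior covariance $K_{{\cal F},{\cal F}}$ and draws one sample of the additive predictor. The squared exponential kernel, its lengthscale, the jitter and the sizes `N` and `C` are illustrative assumptions, and each component is assumed to act on its own input column.

```python
import numpy as np

def se_kernel(x, y, lengthscale=0.3, variance=1.0):
    """Squared exponential kernel matrix between 1-D input vectors x and y."""
    d = x[:, None] - y[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

rng = np.random.default_rng(0)
N, C = 5, 3                                   # illustrative sizes (assumption)
X = rng.uniform(size=(N, C))                  # one input column per component
K_blocks = [se_kernel(X[:, c], X[:, c]) for c in range(C)]
K_FF = block_diag(K_blocks)                   # (NC, NC): components independent a priori

# Draw F ~ N(0, K_FF), stacked component by component, then sum over components.
L = np.linalg.cholesky(K_FF + 1e-6 * np.eye(N * C))   # jitter for stability
F = L @ rng.standard_normal(N * C)
predictor = F.reshape(C, N).sum(axis=0)       # sum_c f_c(x_n), shape (N,)
```

The stacking order (all of $f_1$, then all of $f_2$, and so on) is a convention chosen here so that the prior covariance is block-diagonal with one $N\times N$ block per component.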

2.2 Variational Inference

The classical variational lower bound (or ELBO) to the marginal likelihood is given by

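$$\log p(Y\,|\,X)\geq\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F},X)]-\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]={\cal L}(q).\tag{1}$$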

This is the optimization objective in the Variational Free Energy (VFE) approximation. We choose $q({\cal F})$ to be a multivariate normal distribution with mean $\mu_{{\cal F}}$ and covariance $\Sigma_{{\cal F}}$ , which is not an approximation in the conjugate likelihood setting. This leads to

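$${\cal L}(q)=\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F})]+\frac{1}{2}\Big(\operatorname{tr}(K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}})+\mu_{{\cal F}}^{\intercal}K_{{\cal F},{\cal F}}^{-1}\mu_{{\cal F}}-\log|\Sigma_{{\cal F}}|-NC\Big).\tag{2}$$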

The expectation term in equation (2) is intractable in most cases and needs to be approximated. See Hensman et al. (2015) for deterministic approximations and Salimbeni and Deisenroth (2017) for stochastic approximations.
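One common deterministic approximation of the expectation term is Gauss-Hermite quadrature over the Gaussian marginal of the predictor at each data point. The sketch below is an illustration under that assumption, not necessarily the scheme used by the authors; the function names and the two example likelihoods are hypothetical.

```python
import numpy as np

def expected_log_lik(log_lik, y, mean, var, num_points=20):
    """Approximate E_{N(f | mean, var)}[log p(y | f)] for each data point.

    Uses the substitution f = mean + sqrt(2 var) t with Gauss-Hermite
    nodes t and weights w, so the integral becomes sum_i w_i g(t_i) / sqrt(pi).
    """
    t, w = np.polynomial.hermite.hermgauss(num_points)
    f = mean[:, None] + np.sqrt(2.0 * var)[:, None] * t[None, :]   # (N, P)
    return (log_lik(y[:, None], f) @ w) / np.sqrt(np.pi)

def gaussian_log_lik(y, f, noise_var=1.0):
    """log N(y | f, noise_var)."""
    return -0.5 * np.log(2.0 * np.pi * noise_var) - 0.5 * (y - f) ** 2 / noise_var

def bernoulli_logit_log_lik(y, f):
    """log Bernoulli(y | sigmoid(f)) for y in {0, 1}."""
    return y * f - np.logaddexp(0.0, f)

y = np.array([0.3, -1.2])
m = np.array([0.1, 0.5])      # marginal means of the predictor
v = np.array([0.4, 2.0])      # marginal variances of the predictor
approx = expected_log_lik(gaussian_log_lik, y, m, v)
```

For the Gaussian likelihood the expectation is available in closed form, $-\tfrac{1}{2}\log(2\pi)-\tfrac{1}{2}((y-m)^{2}+v)$ with unit noise, which gives a convenient check of the quadrature.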

3 Optimal Gaussian posterior in variational inference

Following Opper and Archambeau (2009), we derive the expression for the optimal $\Sigma_{{\cal F}}$ by noting that at the optimum, $\nabla_{\Sigma_{{\cal F}}}{\cal L}(q)=0$ . This implies that

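$$\Sigma^{-1}_{{\cal F}}=K_{{\cal F},{\cal F}}^{-1}+\nabla_{\Sigma_{{\cal F}}}\big[\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F})]\big].\tag{3}$$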

3.1 Optimality in the additive case

In the additive case considered here, the gradient term in (3) is low-rank and can be parameterized by a vector $\lambda\in\mathds{R}^{N}$ as follows, with $\Lambda=\operatorname{diag}(\lambda)$ and $1_{C}=[\underbrace{1,\dots,1}_{C\text{ times}}]^{\intercal}$ :

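$$\Sigma^{-1}_{{\cal F}}=K_{{\cal F},{\cal F}}^{-1}+(1_{C}\otimes\Lambda)(1_{C}\otimes\Lambda)^{\intercal}.\tag{4}$$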

This parameterization requires $2N$ values, equal to that of the classical single GP regression setting described in Opper and Archambeau (2009). It also inherits the non-convexity of this objective as highlighted by Khan et al. (2012).
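To make the low-rank structure concrete, the sketch below forms the term $(1_{C}\otimes\Lambda)(1_{C}\otimes\Lambda)^{\intercal}$ for small illustrative sizes and compares the $2N$ count with that of an unstructured Gaussian over ${\cal F}\in\mathds{R}^{NC}$. The sizes and the random $\lambda$ are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 4, 3                                   # illustrative sizes (assumption)
lam = rng.uniform(0.5, 1.5, size=N)           # strictly positive, so the rank is N
W = np.kron(np.ones((C, 1)), np.diag(lam))    # 1_C (x) Lambda, shape (NC, N)
low_rank_term = W @ W.T                       # shape (NC, NC), every block is Lambda^2

rank = np.linalg.matrix_rank(low_rank_term)   # at most N although the matrix is NC x NC

n_structured = 2 * N                          # count quoted in the text
n_full = N * C + (N * C) * (N * C + 1) // 2   # mean plus symmetric covariance
```

With these sizes the structured form uses 8 values against 90 for the unstructured Gaussian, and the gap widens as $C$ grows.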

3.2 Optimality in the sparse additive case

Following Adam et al. (2016), we introduce for each GP indexed by $c$ a set of $M$ ‘inducing points’ $Z_{c}=[z_{c}^{(1)},...,z_{c}^{(M)}]\in{\cal X}_{c}^{M}$. The vector of associated function evaluations is given by $\mathbf{U}_{c}=[u_{c}^{(1)},...,u_{c}^{(M)}]=[f_{c}(z_{c}^{(1)}),...,f_{c}(z_{c}^{(M)})]$. We also define the stacked vector $\mathbf{U}=[\mathbf{U}_{1},...,\mathbf{U}_{C}]\in\mathds{R}^{MC}$.

Following Adam (2017), we parameterize $q({\cal F})=q(\mathbf{U})\prod_{c}p(f_{c\neg\mathbf{U}_{c}}\,|\,\mathbf{U}_{c})$ . This choice leads to a simplification of the lower bound (2) as

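$${\cal L}(q)=\mathds{E}_{q({\cal F})}[\log p(Y\,|\,{\cal F})]-\operatorname{KL}[q(\mathbf{U})\,\|\,p(\mathbf{U})].\tag{5}$$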

Saul et al. (2016) considered the mean field case $q(\mathbf{U})=\prod_{c}q(\mathbf{U}_{c})$ with each factor parameterized as a multivariate normal distribution ${\cal N}(\mu_{\mathbf{U}_{c}},\Sigma_{\mathbf{U}_{c}})$ . This approach does not capture posterior dependencies across GPs. Adam (2017) parameterized $q(\mathbf{U})$ as a multivariate normal distribution ${\cal N}(\mu_{\mathbf{U}},\Sigma_{\mathbf{U}})$ to include cross-GP coupling through the inducing variables $\mathbf{U}$ . We extend this last approach but ask what the optimal $q(\mathbf{U})$ should be. It turns out to be (see Appendix A):

[TABLE]

This form again has $2N$ parameters, which becomes an over-parameterization as soon as $N>M^{2}C^{2}/2$. Since we are interested in scalability, it is not of practical interest.

4 A new parameterization for $q(\mathbf{U})$

The second term of the sum in (6) can be expressed as $AA^{\intercal}$ with $A$ of size $MC\times N$ . Keeping this structure arising from the additivity of the model, we propose the parameterization

$$\Sigma_{\mathbf{U},\mathbf{U}}^{-1}=K_{\mathbf{U},\mathbf{U}}^{-1}+BB^{\intercal},\tag{7}$$

with $B$ of size $MC\times M$ smaller than $A$ . This parameterization preserves the structure of the optimal covariance. It requires storing $M^{2}C$ values, which is less than a direct representation of a Cholesky factor of $\Sigma^{-1}_{\mathbf{U},\mathbf{U}}$ that would require $M^{2}C^{2}$ parameters.
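The parameter counts quoted in this section and the previous one can be compared with a few lines of arithmetic. The helper name and the example values of `N`, `M` and `C` below are illustrative assumptions.

```python
def parameter_counts(N, M, C):
    """Parameter counts quoted in the text for the three covariance forms."""
    return {
        "optimal_2N": 2 * N,              # optimal form, grows with the data
        "structured_B": M * M * C,        # B has shape (M*C, M)
        "full_factor": (M * C) ** 2,      # dense factor of the (M*C, M*C) precision
    }

# Example values (assumptions): 5000 data points, 16 inducing points, 7 components.
counts = parameter_counts(N=5000, M=16, C=7)

# The 2N form exceeds the dense factor once N > M^2 C^2 / 2.
threshold = (16 * 7) ** 2 / 2
```

With these values the structured form stores 1792 numbers, the dense factor 12544, and the optimal form 10000; the threshold beyond which the optimal form is the largest of the three is $N>6272$.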

5 Summary of complexities

The time and space complexities of the sparse variational algorithms are summarized in Table 1.

6 Related work

Variational inference for the multi-GP setting has so far only used the mean-field (MF) approximation as described in Saul et al. (2016). When posterior dependencies are a quantity of interest, a natural approach is to increase the complexity of the variational posterior to capture these dependencies. This often results in a prohibitive increase in the complexity of the inference. Different solutions have been proposed to tackle this problem. A first approach in Giordano et al. (2015) consists of a two-step scheme where MF inference is assumed to provide accurate posterior mean estimates. A perturbation analysis is then performed around the MF posterior means to provide second order (covariance) estimates. A second approach consists in ‘relaxing’ the MF approximation by extending the variational posterior $q({\cal F})$ with additional multiplicative terms capturing dependencies while keeping the computational complexity of the resulting inference scheme low (Tran et al., 2015; Hoffman and Blei, 2015). Our approach fits in this second family of extensions of the MF parameterization. It is tailored to the VFE approximation to GP models and leverages its sparsity to provide a fast and scalable inference algorithm.

7 Illustration

We consider a simple regression task consisting of approximating the following function: $f(x)=10\sin(\pi x_{1}x_{2})+20(x_{3}-0.5)^{2}+10x_{4}+5x_{5}$ with $x\in[0,1]^{6}$ (note that the last variable has no effect), given 5000 observation points uniformly distributed in the input space and a Gaussian observation noise with unit variance.

We choose a kernel dedicated to sensitivity analysis and tailored to the structure of the function at stake (Durrande et al., 2013). Given univariate squared exponential kernels $g_{1},\dots,g_{8}$, we define the kernel as $k(x,y)=\sigma_{0}+\sum_{i=1}^{6}s_{i}(x_{i},y_{i})+s_{7}(x_{1},y_{1})s_{8}(x_{2},y_{2})$ with

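$$s_{i}(x_{i},y_{i})=g_{i}(x_{i},y_{i})-\frac{\int_{0}^{1}g_{i}(x_{i},s)\,ds\int_{0}^{1}g_{i}(y_{i},s)\,ds}{\iint_{0}^{1}g_{i}(s,t)\,ds\,dt}.$$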

Since the number of observations is relatively large and the kernel has an additive structure (it is the sum of 8 kernels), we choose the sparse additive model described above. We choose 16 regularly spaced one-dimensional inducing points for each kernel $s_{1},\dots,s_{6}$ and 16 points distributed as a $4\times 4$ grid for the bi-dimensional kernel $s_{7}s_{8}$. The final model is obtained by maximizing the ELBO with respect to the variational parameters and the hyper-parameters of the $g_{i}$. Given the structure of the model and the fact that inducing inputs are dedicated to model components, it is then possible to decompose the model predictions and to represent separately all the components of the ANOVA representation of the test function. Figure 1 shows that the model accurately approximates the test function and that the proposed framework is helpful to reveal its inner structure.
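The zero-mean kernels $s_{i}$ can be sketched numerically by replacing the integrals over $[0,1]$ with Gauss-Legendre quadrature. The lengthscale, the number of quadrature nodes and the inclusion of the endpoints in the inducing grids are assumptions made for illustration; for squared exponential kernels the integrals also have closed forms, which this sketch does not use.

```python
import numpy as np

def se(x, y, ell=0.2):
    """Univariate squared exponential kernel g(x, y), broadcast over inputs."""
    return np.exp(-0.5 * ((x - y) / ell) ** 2)

# Gauss-Legendre nodes and weights mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(60)
nodes = 0.5 * (nodes + 1.0)
weights = 0.5 * weights

def zero_mean_kernel(x, y, g=se):
    """s(x, y) = g(x, y) - int g(x, s) ds * int g(y, s) ds / iint g(s, t) ds dt."""
    x = np.atleast_1d(x)
    y = np.atleast_1d(y)
    gx = g(x[:, None], nodes[None, :]) @ weights            # int_0^1 g(x, s) ds
    gy = g(y[:, None], nodes[None, :]) @ weights            # int_0^1 g(y, s) ds
    gg = weights @ g(nodes[:, None], nodes[None, :]) @ weights
    return g(x[:, None], y[None, :]) - np.outer(gx, gy) / gg

# Inducing inputs: 16 regularly spaced points in 1-D, and a 4 x 4 grid in 2-D.
Z_1d = np.linspace(0.0, 1.0, 16)
g1, g2 = np.meshgrid(np.linspace(0.0, 1.0, 4), np.linspace(0.0, 1.0, 4))
Z_2d = np.column_stack([g1.ravel(), g2.ravel()])
```

By construction $\int_{0}^{1}s_{i}(x,t)\,dt=0$ for every $x$, so functions drawn from a GP with kernel $s_{i}$ integrate to zero over $[0,1]$; this is what makes the additive components identifiable as ANOVA terms.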

8 Conclusion

We presented a method that provides a fast, scalable and well-calibrated Bayesian treatment of GAMs. Although motivated by GAMs, our structured variational distribution may be used in models where the predictor is non-additive but where the posterior is well-approximated by a unimodal distribution.

Appendix A Optimal covariance in the additive case

We first define $V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}})=\mathds{E}_{q({\cal F})}\log p(Y\,|\,{\cal F})$ . From Opper and Archambeau (2009), we know that the optimal variational precision is structured as

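$$\Sigma^{-1}_{{\cal F}}=K_{{\cal F},{\cal F}}^{-1}+\nabla_{\Sigma_{{\cal F}}}V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}}).$$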

For factorizing likelihood and additive predictors, and defining $\rho(\cdot)=\textstyle\sum_{c}f_{c}(\cdot)$ , we have

$$V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}})=\sum_{n}v(y_{n},\mu_{\rho(x_{n})},\sigma^{2}_{\rho(x_{n})}),$$

where $q(\rho(x_{n}))$ has variance $\sigma^{2}_{\rho(x_{n})}=1_{C}^{\intercal}\Sigma_{{\cal F}(x_{n})}1_{C}=\textstyle\sum_{c,c^{\prime}}\Sigma_{f_{c}(x_{n}),f_{c^{\prime}}(x_{n})}$ .

The gradient term in the optimal precision thus can be written as

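$$\nabla_{\Sigma_{{\cal F}}}V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}})=\sum_{n}\lambda_{n}^{2}\sum_{c,c^{\prime}}e_{cn,c^{\prime}n},$$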

where $e_{i,j}$ is the indicator matrix of size $NC\times NC$ with $1$ at location $(i,j)$ . With $\Lambda=\operatorname{diag}(\lambda)$ , this can be rewritten in matrix form:

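$$\nabla_{\Sigma_{{\cal F}}}V(Y,\mu_{{\cal F}},\Sigma_{{\cal F}})=(1_{C}\otimes\Lambda)(1_{C}\otimes\Lambda)^{\intercal}.$$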
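The equivalence between the sum of indicator matrices and the Kronecker form can be checked numerically on a small case. The sizes and the values of $\lambda$ are arbitrary, and the index convention (entry $cn$ is row $cN+n$, with ${\cal F}$ stacked component by component) is an assumption consistent with the Kronecker product $1_{C}\otimes\Lambda$.

```python
import numpy as np

N, C = 3, 2
lam = np.array([0.5, 1.0, 2.0])

# Sum over n of lambda_n^2 times the indicator matrices e_{cn, c'n}.
indicator_sum = np.zeros((N * C, N * C))
for n in range(N):
    for c in range(C):
        for c_prime in range(C):
            indicator_sum[c * N + n, c_prime * N + n] += lam[n] ** 2

# Matrix form: (1_C (x) Lambda)(1_C (x) Lambda)^T.
W = np.kron(np.ones((C, 1)), np.diag(lam))
kron_form = W @ W.T

same = np.allclose(indicator_sum, kron_form)
```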

Appendix B ELBO evaluation: additive case

We parameterize the approximate posterior as

$$q({\cal F})={\cal N}(K_{{\cal F},{\cal F}}\alpha,\Sigma_{{\cal F}})$$

with

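$$\Sigma_{{\cal F}}=\big(K_{{\cal F},{\cal F}}^{-1}+(1\otimes\Lambda)(1\otimes\Lambda)^{\intercal}\big)^{-1},$$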

and optimize

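$${\cal L}(q)=\mathds{E}_{q(\sum_{c}f_{c})}[\log p(y\,|\,\textstyle\sum_{c}f_{c})]-\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})],$$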

where

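$$\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]=\frac{1}{2}\big[-\log|K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}}|+\alpha^{\intercal}K_{{\cal F},{\cal F}}\alpha+\operatorname{tr}(K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}})-NC\big].$$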

B.1 Computing marginals of $\Sigma_{{\cal F}^{(n)}}$

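$$\begin{aligned}\Sigma_{{\cal F}}&=K_{{\cal F},{\cal F}}-K_{{\cal F},{\cal F}}(1\otimes\Lambda)\big(I+(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}(1\otimes\Lambda)\big)^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}\\&=K_{{\cal F},{\cal F}}-K_{{\cal F},{\cal F}}(1\otimes\Lambda)A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}},\end{aligned}$$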

where

$$A=I+\textstyle\sum_{c}\Lambda^{\intercal}K_{f_{c},f_{c}}\Lambda.$$

To evaluate the ELBO we need, for each data point $(x_{n},y_{n})$ , the marginal $q(\sum_{c}f^{n}_{c})$ . This corresponds to the diagonal elements of

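$$\begin{aligned}\Sigma_{\text{sum}}&=(1\otimes I)^{\intercal}\big[K_{{\cal F},{\cal F}}-K_{{\cal F},{\cal F}}(1\otimes\Lambda)A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}\big](1\otimes I)\\&=\textstyle\sum_{c}K_{f_{c},f_{c}}-\sum_{c,c^{\prime}}K_{f_{c},f_{c}}\Lambda A^{-1}\Lambda K_{f_{c^{\prime}},f_{c^{\prime}}}\\&=\textstyle\sum_{c}K_{f_{c},f_{c}}-\sum_{c,c^{\prime}}(L_{A}^{-\intercal}\Lambda K_{f_{c},f_{c}})^{\intercal}(L_{A}^{-\intercal}\Lambda K_{f_{c^{\prime}},f_{c^{\prime}}}).\end{aligned}$$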

B.2 Computing the KL

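$$\begin{aligned}|K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}}|&=|K_{{\cal F},{\cal F}}^{-1}|\,/\,|K_{{\cal F},{\cal F}}^{-1}+(1\otimes\Lambda)(1\otimes\Lambda)^{\intercal}|\\&=|K_{{\cal F},{\cal F}}^{-1}|\,/\,\big[|I+(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}(1\otimes\Lambda)|\,|I|\,|K_{{\cal F},{\cal F}}^{-1}|\big]\\&=1/|A|\\\operatorname{tr}(K_{{\cal F},{\cal F}}^{-1}\Sigma_{{\cal F}})&=\operatorname{tr}(I-(1\otimes\Lambda)A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}})\\&=NC-\operatorname{tr}(A^{-1}(1\otimes\Lambda)^{\intercal}K_{{\cal F},{\cal F}}(1\otimes\Lambda))\\&=NC-\textstyle\sum_{c}\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}K_{f_{c},f_{c}})\\&=NC-\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}\textstyle\sum_{c}K_{f_{c},f_{c}})\end{aligned}$$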

In the end,

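$$\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]=\frac{1}{2}\big[\log|A|+\alpha^{\intercal}K_{{\cal F},{\cal F}}\alpha-\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}\textstyle\sum_{c}K_{f_{c},f_{c}})\big].$$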

B.3 Summary

$A=I+\textstyle\sum_{c}\Lambda^{\intercal}K_{f_{c},f_{c}}\Lambda$

$\operatorname{KL}[q({\cal F})\,\|\,p({\cal F})]=\frac{1}{2}[\log|A|+\alpha^{\intercal}K_{{\cal F},{\cal F}}\alpha-\operatorname{tr}(\Lambda A^{-1}\Lambda^{\intercal}\textstyle\sum_{c}K_{f_{c},f_{c}})]$

$\mu_{\text{sum}}=\textstyle\sum_{c}K_{f_{c},f_{c}}\alpha_{c}$

$\Sigma_{\text{sum}}=\textstyle\sum_{c}\operatorname{diag}(K_{f_{c},f_{c}})-\textstyle\sum_{c,c^{\prime}}\operatorname{diag}(K_{f_{c},f_{c}}\Lambda A^{-1}\Lambda K_{f_{c^{\prime}},f_{c^{\prime}}})$
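The summary quantities only involve $N\times N$ matrices. The sketch below computes them and compares the result with a direct $NC\times NC$ computation of the same Gaussian. Function names, test sizes, kernels and the jitter are illustrative assumptions, and the dense reference is there only as a check.

```python
import numpy as np

def additive_summary(K_blocks, alpha, lam):
    """KL, mean and marginal variance of sum_c f_c using N x N algebra only.

    K_blocks: list of C prior covariances K_{f_c,f_c}, each (N, N)
    alpha:    array (C, N), with mean mu_c = K_{f_c,f_c} alpha_c
    lam:      array (N,), the diagonal of Lambda
    """
    N = lam.shape[0]
    K_sum = sum(K_blocks)
    Lam = np.diag(lam)
    A = np.eye(N) + Lam @ K_sum @ Lam
    LAL = Lam @ np.linalg.solve(A, Lam)                  # Lambda A^{-1} Lambda
    _, logdet_A = np.linalg.slogdet(A)
    quad = sum(a @ K @ a for K, a in zip(K_blocks, alpha))
    kl = 0.5 * (logdet_A + quad - np.trace(LAL @ K_sum))
    mu_sum = sum(K @ a for K, a in zip(K_blocks, alpha))
    # The double sum over (c, c') factorizes into K_sum LAL K_sum.
    var_sum = np.diag(K_sum) - np.einsum("ij,jk,ki->i", K_sum, LAL, K_sum)
    return kl, mu_sum, var_sum

def dense_reference(K_blocks, alpha, lam):
    """Same quantities from the full (NC, NC) Gaussian; for checking only."""
    C, N = alpha.shape
    K = np.zeros((N * C, N * C))
    for c, Kc in enumerate(K_blocks):
        K[c * N:(c + 1) * N, c * N:(c + 1) * N] = Kc
    W = np.kron(np.ones((C, 1)), np.diag(lam))           # 1 (x) Lambda
    K_inv = np.linalg.inv(K)
    Sigma = np.linalg.inv(K_inv + W @ W.T)
    mu = K @ alpha.ravel()
    _, logdet = np.linalg.slogdet(K_inv @ Sigma)
    kl = 0.5 * (-logdet + mu @ K_inv @ mu + np.trace(K_inv @ Sigma) - N * C)
    S = np.kron(np.ones((C, 1)), np.eye(N))              # sums the C components
    return kl, S.T @ mu, np.diag(S.T @ Sigma @ S)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 6)
K_blocks = [np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2) + 1e-6 * np.eye(6)
            for ell in (0.1, 0.2, 0.3)]
alpha = rng.standard_normal((3, 6))
lam = rng.uniform(0.5, 1.5, size=6)

kl, mu_sum, var_sum = additive_summary(K_blocks, alpha, lam)
kl_ref, mu_ref, var_ref = dense_reference(K_blocks, alpha, lam)
```

The structured version never forms an $NC\times NC$ matrix, which is the point of the summary formulas; the dense version is kept deliberately naive.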

Appendix C ELBO evaluation: sparse additive case

We parameterize an approximate posterior over the inducing values as

[TABLE]

with

[TABLE]

where $B=[B_{1},\dots,B_{C}]\in\mathds{R}^{MC\times M}$ . We optimize

[TABLE]

with

[TABLE]

C.1 Computing marginals of $\Sigma_{{\cal F}}$

We have

[TABLE]

where $A=I+B^{\intercal}K_{\mathbf{U},\mathbf{U}}B$, so

[TABLE]

and

[TABLE]

Therefore

[TABLE]

The Cholesky decomposition $A=L_{A}L_{A}^{\intercal}$ costs $\mathcal{O}(M^{3})$. The triangular solves $L_{A}^{-1}B_{c}^{\intercal}$ for each additive term cost a total of $\mathcal{O}(CM^{3})$. Computing the marginal predictor variances then costs an extra $\mathcal{O}(NC^{2}M^{2})$. In total, the computational cost of posterior predictions is $\mathcal{O}(CM^{3}+NC^{2}M^{2})$.

C.2 Computing the KL

As in the additive case, we have

[TABLE]

and

[TABLE]

In the end,

[TABLE]

C.3 Summary

$A=I+B^{\intercal}K_{\mathbf{U},\mathbf{U}}B$

$\operatorname{KL}[q(\mathbf{U})\,\|\,p(\mathbf{U})]=\frac{1}{2}[\log|A|+\alpha^{\intercal}K_{\mathbf{U},\mathbf{U}}\alpha-\textstyle\sum_{c}\operatorname{tr}(B_{c}A^{-1}B_{c}^{\intercal}K_{\mathbf{U}_{c},\mathbf{U}_{c}})]$

$\mu_{\text{sum}}=\textstyle\sum_{c}K_{f_{c},\mathbf{U}_{c}}\alpha_{c}$

$\Sigma_{\text{sum}}=\textstyle\sum_{c}\operatorname{diag}(K_{f_{c},f_{c}})-\textstyle\sum_{c,c^{\prime}}\operatorname{diag}(K_{f_{c},\mathbf{U}_{c}}B_{c}A^{-1}B_{c^{\prime}}^{\intercal}K_{\mathbf{U}_{c^{\prime}},f_{c^{\prime}}})$
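A minimal sketch of these summary formulas is given below. It assumes the variational precision over $\mathbf{U}$ is the prior precision plus $BB^{\intercal}$, with $B$ stacked from $C$ blocks $B_{c}$ of size $M\times M$, and a mean $K_{\mathbf{U},\mathbf{U}}\alpha$. The function name, kernels, sizes and jitter are illustrative assumptions.

```python
import numpy as np

def se(x, z, ell):
    """Squared exponential kernel matrix between 1-D inputs x and z."""
    return np.exp(-0.5 * ((x[:, None] - z[None, :]) / ell) ** 2)

def sparse_additive_summary(K_uu, K_fu, k_ff_diag, alpha, B):
    """KL[q(U) || p(U)] and marginals of sum_c f_c at the data inputs.

    K_uu:      list of C matrices K_{U_c,U_c}, each (M, M)
    K_fu:      list of C matrices K_{f_c,U_c}, each (N, M)
    k_ff_diag: list of C vectors diag(K_{f_c,f_c}), each (N,)
    alpha:     array (C, M)
    B:         list of C blocks B_c, each (M, M)
    """
    M = B[0].shape[0]
    A = np.eye(M) + sum(Bc.T @ Kc @ Bc for Bc, Kc in zip(B, K_uu))
    L_A = np.linalg.cholesky(A)                          # A = L_A L_A^T
    logdet_A = 2.0 * np.sum(np.log(np.diag(L_A)))
    quad = sum(a @ Kc @ a for a, Kc in zip(alpha, K_uu))
    trace = sum(np.trace(Bc @ np.linalg.solve(A, Bc.T) @ Kc)
                for Bc, Kc in zip(B, K_uu))
    kl = 0.5 * (logdet_A + quad - trace)
    mu_sum = sum(Kfu @ a for Kfu, a in zip(K_fu, alpha))
    # The double sum over (c, c') equals P A^{-1} P^T with P = sum_c K_fu_c B_c.
    P = sum(Kfu @ Bc for Kfu, Bc in zip(K_fu, B))        # (N, M)
    Q = np.linalg.solve(L_A, P.T)                        # L_A^{-1} P^T, (M, N)
    var_sum = sum(k_ff_diag) - np.sum(Q ** 2, axis=0)
    return kl, mu_sum, var_sum

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)       # data inputs, shared by both components here
z = np.linspace(0.0, 1.0, 4)       # inducing inputs
ells = (0.3, 0.5)                  # one lengthscale per component
K_uu = [se(z, z, ell) + 1e-6 * np.eye(4) for ell in ells]
K_fu = [se(x, z, ell) for ell in ells]
k_ff_diag = [np.ones(8) for _ in ells]                   # unit-variance kernels
alpha = rng.standard_normal((2, 4))
B = [0.5 * rng.standard_normal((4, 4)) for _ in ells]

kl, mu_sum, var_sum = sparse_additive_summary(K_uu, K_fu, k_ff_diag, alpha, B)
```

Summing $K_{f_{c},\mathbf{U}_{c}}B_{c}$ over components before applying $A^{-1}$ is a choice made in this sketch to avoid an explicit double loop over $(c,c^{\prime})$; it yields the same diagonal as the summary formula.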

Bibliography

1. Vincent Adam. Structured variational inference for coupled Gaussian processes. arXiv preprint arXiv:1711.01131, 2017.
2. Vincent Adam, James Hensman, and Maneesh Sahani. Scalable transformed additive signal decomposition by non-conjugate Gaussian process inference. In Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on, pages 1–6. IEEE, 2016.
3. Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 1533–1541, 2016.
4. Nicolas Durrande, David Ginsbourger, Olivier Roustant, and Laurent Carraro. ANOVA kernels and RKHS of zero mean functions for model-based sensitivity analysis. Journal of Multivariate Analysis, 115:57–67, 2013.
5. Ryan J Giordano, Tamara Broderick, and Michael I Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Advances in Neural Information Processing Systems, pages 1441–1449, 2015.
6. Trevor J Hastie. Generalized additive models. In Statistical models in S, pages 249–307. Routledge, 2017.
7. James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. JMLR, 2015.
8. Matthew Hoffman and David Blei. Stochastic structured variational inference. In Artificial Intelligence and Statistics, pages 361–369, 2015.
