Efficient Deep Gaussian Process Models for Variable-Sized Input

Issam H. Laradji; Mark Schmidt; Vladimir Pavlovic; Minyoung Kim

arXiv:1905.06982·cs.LG·May 20, 2019

TL;DR

This paper introduces GP-DRF, a novel Bayesian model combining Gaussian processes and deep random features, designed to efficiently handle variable-sized inputs and learn deep dependencies, with improved uncertainty quantification.

Contribution

The paper proposes GP-DRF, a new scalable Bayesian model that effectively manages variable-sized data and deep dependencies, overcoming limitations of existing DGP and DRF models.

Findings

1. GP-DRF outperforms standard GP and DRF models on multiple datasets.
2. The model provides better uncertainty quantification than GP and DRF.
3. The paper provides an efficient inference method for posterior estimation in GP-DRF.

Tables

Table I: Dataset statistics and benchmark results.

Dataset Statistics
	POWERPLANT	PROTEIN	SPAM	EEG	MNIST	MUSIC	REUTERS	SCOP
# Train	9469	45515	4532	14857	60000	900	8084	2575
# Test	99	215	69	123	10000	100	899	287
# Classes	1	1	2	2	10	10	2	7
Benchmark Results (Lower is better; best method in bold)
GP	0.207	0.737	0.043	0.114	0.059	0.350	0.050	0.301
DRF	0.201	0.613	0.029	0.106	0.045	0.660	0.020	0.381
GP-DRF (Ours)	0.194	0.652	0.001	0.016	0.033	0.300	0.017	0.286

Table II: Comparison with respect to the average Bhattacharyya distance for correctly labeled samples (higher is better) and misclassified samples (lower is better). The best scores are boldfaced.

	EEG		MNIST		MUSIC
Model	correctly $↑$	misclassified $↓$	correctly $↑$	misclassified $↓$	correctly $↑$	misclassified $↓$
GP	37.41	6.37	18.85	0.79	0.89	0.21
DRF	6.16	2.19	0.31	0.18	1.40	0.97
GP-DRF (Ours)	110.52	5.45	50.17	4.65	0.99	0.21

Equations

k_{ARD}(x,x^{\prime})=\alpha\exp\Big(-\frac{1}{2}(x-x^{\prime})^{\top}\Gamma^{-1}(x-x^{\prime})\Big),

\phi_{ARD}(x)=\sqrt{\frac{\alpha}{M}}\,\Big[\cos(\omega_{(1)}^{\top}x),\sin(\omega_{(1)}^{\top}x),\dots,\cos(\omega_{(M)}^{\top}x),\sin(\omega_{(M)}^{\top}x)\Big]^{\top},

h_{j}^{l+1}=f_{j}^{l}(h^{l}) \quad\text{with}\quad f_{j}^{l}(\cdot)\sim\mathcal{GP}(k^{l}(\cdot,\cdot)) \quad\text{for } j=1,\dots,d_{l+1}.

h_{j}^{l+1}=(w_{j}^{l})^{\top}\phi^{l}(h^{l}) \quad\text{with}\quad w_{j}^{l}\sim\mathcal{N}(0,I_{D_{l}}) \quad\text{for } j=1,\dots,d_{l+1},

h^{l+1}=(W^{l})^{\top}\phi^{l}(h^{l};\Omega^{l}) \quad\text{with}\quad W^{l}\sim\mathcal{N}(0,I) \text{ and } \Omega^{l}\sim\mathcal{N}(0,\Lambda^{l}).

G(x;W,\Omega,\theta_{o})=g^{L-1}(\dots(g^{1}(g^{0}(x)))\dots),

P(F)=\prod_{j=1}^{d_{0}}\mathcal{N}(F^{j};0,K_{j}),

P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l}),

P(Y,W,\Omega,F\mid X,\Theta)=P(F\mid\theta_{k})\,P(W)\,P(\Omega\mid\Lambda)\prod_{n=1}^{N}P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l}),

P(F,W,\Omega\mid X,Y,\Theta).

q(W,\Omega,F\mid\Psi)=q(W\mid\Psi_{W})\,q(\Omega\mid\Psi_{\Omega})\int P(F\mid\overline{F})\,q(\overline{F}\mid\Psi_{F})\,d\overline{F},

q(W\mid\Psi_{W})

q(\Omega\mid\Psi_{\Omega})

q(\overline{F}\mid\Psi_{F})

\log P(Y\mid X,\overline{X},\Theta)\geq\mathrm{ELBO}(\Psi,\Theta),

\mathrm{ELBO}(\Psi,\Theta)

\mathbb{E}_{q}\big[\log P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l})\big].

q(W,\Omega,F_{n})=q(W)\,q(\Omega)\,q(F_{n}).

[a_{n}]_{j}

[B_{n}]_{j,j}

w_{i,j}^{l}

\omega_{i,j}^{l}

[F_{n}]_{j}

\frac{1}{S}\sum_{s=1}^{S}\log P(y_{n}\mid G(F_{n}^{(s)};W^{(s)},\Omega^{(s)},\theta_{o}),\theta_{l}).

P(y_{*}\mid x_{*},X,\overline{X},Y,\Theta)\approx\int P(y_{*}\mid G(F_{*};W,\Omega,\theta_{o}),\theta_{l})\,P(F_{*}\mid\overline{F})\,q(W,\Omega,\overline{F})\,dW\,d\Omega\,d\overline{F}.

\frac{1}{S}\sum_{s=1}^{S}P(y_{*}\mid G(F_{*}^{(s)};W^{(s)},\Omega^{(s)},\theta_{o}),\theta_{l}).

\mathbb{E}[y_{*}\mid x_{*},X,\overline{X},Y]

\mathbb{V}(y_{*}\mid x_{*},X,\overline{X},Y)

D(F_{*}(x),F_{+}(x))=\frac{1}{4}\ln\Big(\frac{1}{4}\Big(\frac{\sigma_{*}}{\sigma_{+}}+\frac{\sigma_{+}}{\sigma_{*}}+2\Big)\Big)+\frac{1}{4}\Big(\frac{(\mu_{*}-\mu_{+})^{2}}{\sigma_{*}+\sigma_{+}}\Big),

Code & Models

Repositories

IssamLaradji/GP_DRF (PyTorch, official)

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Machine Learning and Data Classification · Time Series Analysis and Forecasting

Full text

Efficient Deep Gaussian Process Models for Variable-Sized Inputs

Issam H. Laradji$^{1,2}$, Mark Schmidt$^{1}$, Vladimir Pavlovic$^{3,4}$, Minyoung Kim$^{3,5}$

$^{1}$Dept. of Computer Science, University of British Columbia, Vancouver, Canada

$^{2*}$Element AI, Montreal, Canada

$^{3}$Dept. of Computer Science, Rutgers University, Piscataway, New Jersey, USA

$^{4}$Samsung AI Center, Cambridge, UK

$^{5}$Dept. of Electronic Engineering, Seoul Nat’l Univ. of Science & Technology, Seoul, South Korea

*1{issamou, schmidtm}@cs.ubc.ca, [email protected], [email protected], [email protected]

Abstract

Deep Gaussian processes (DGP) have appealing Bayesian properties, can handle variable-sized data, and learn deep features. Their limitation is that they do not scale well with the size of the data. Existing approaches address this using a deep random feature (DRF) expansion model, which makes inference tractable by approximating DGPs. However, DRF is not suitable for variable-sized input data such as trees, graphs, and sequences. We introduce the GP-DRF, a novel Bayesian model with an input layer of GPs, followed by DRF layers. The key advantage is that the combination of GP and DRF leads to a tractable model that can both handle a variable-sized input as well as learn deep long-range dependency structures of the data. We provide a novel efficient method to simultaneously infer the posterior of GP’s latent vectors and infer the posterior of DRF’s internal weights and random frequencies. Our experiments show that GP-DRF outperforms the standard GP model and DRF model across many datasets. Furthermore, they demonstrate that GP-DRF enables improved uncertainty quantification compared to GP and DRF alone, with respect to a Bhattacharyya distance assessment. Source code is available at https://github.com/IssamLaradji/GP_DRF.

I Introduction

Deep neural network (DNN) models have achieved ground-breaking performance in many real-life domains such as computer vision and natural language processing [1]. This is mainly due to their ability to model long-range dependency structures that may reside in the data. However, they do not provide uncertainty quantification, which can be useful in decision making and high risk applications such as medical informatics or autonomous driving [2]. A recent method addresses this limitation using random feature expansion [3], but is unable to efficiently handle variable-sized data, such as trees [4], protein sequences [5], audio sequences [6], or graphs [7], in an end-to-end fashion. For instance, to predict the chemical properties of variable-sized molecular data [8], a separate feature extraction stage was required in order to construct fixed-sized fingerprint vectors. Such sophisticated feature extraction schemes often require human expertise. In this work, we propose Gaussian Process Deep Random Feature (GP-DRF), a scalable Bayesian method, that addresses the aforementioned limitations.

Bayesian models have received significant attention over the last decade. Gaussian processes (GP) are a family of flexible function distributions that can exploit the kernel trick to avoid dealing with input instances directly, leading to a scalable approach for computing instance similarities, and uncertainties about the latent functions. This is determined by the covariance of the GP model using a kernel function. Hence, the many choices of kernels, such as string kernel functions for sequence classification [9], enable GPs to handle variable-sized data effectively in the Euclidean, non-Euclidean, and RKHS metric spaces. However, GPs are shallow and therefore unable to benefit from the key properties of DNNs, namely exploiting the deep long-range dependency structures in the data through a long chain of composite operations.

To overcome the aforementioned shortcoming, Damianou et al. [10] proposed deep GP models. The main idea of deep GPs is to replace each layer of linear-to-nonlinear mappings in the DNN by a layer of random (nonlinear) functions sampled from a GP, thus achieving a long chain of composite operations where all the functions involved have their own GP priors. However, a critical weakness of this approach is that it is highly intractable; the functions in one layer take as the input the outputs of the latent functions in the prior layer, implying that the kernels are built on the outputs of other latent functions. As a result, the marginalization of the latent functions in a long chain of compositions becomes computationally expensive. Methods such as the inducing point approach [11] have been proposed to address the scalability issue of deep GPs. However, many parameters still need to be inferred, making the deep GP impractical for many large-scale applications [3].

Several works address the computational difficulty of non-parametric kernel machines by transforming them into parametric models using random feature expansion [12, 13]. Random features are (nonlinear) feature vector representations whose inner products approximate, in expectation, the kernel values. For a Gaussian process, the latent function can be expressed as a linear function in the feature space with a Gaussian-priored weight vector, leading to a parametric Bayesian model. As a result, the kernel matrices need not be stored or inverted, which dramatically improves computational efficiency. Recently, Cutajar et al. [3] proposed a deep random feature (DRF) model as a parametric formulation of the deep GP by approximating its kernel features. Their results showed that DRF can yield significant computational benefits compared to deep GPs, while maintaining comparable performance in their benchmarks. However, DRF is limited only to kernels that are shift-invariant, such as the radial basis (RBF) and arc-cosine kernel functions. Consequently, DRF has difficulty dealing with variable-sized data, in contrast to generic sequence, tree, and graph kernels, which can be leveraged by Gaussian process models.

The motivation of this work is to build a deep GP model that is not only as scalable as DRF, but can also handle variable-sized data in the same manner as GPs coupled with kernel machines, while achieving comparable performance. To that end, we propose a GP-DRF, which combines a single layer of a GP model with multiple layers of DRF models. As shown in Fig. 1, the GP model represents the first layer that takes arbitrarily shaped data as the input and returns fixed-sized feature vectors of latent functions as the output. The upper DRF model then maps the feature vectors to the prediction space. For the GP-DRF, we propose an efficient variational inference scheme that can handle large-scale data using pseudo-inputs as inducing points, in a fashion similar to Dai et al. [14] and Bui et al. [15].

We summarize the benefits of GP-DRF as follows:

1. It can effectively handle variable-sized data by learning sequence kernels using the GP component in its first layer, unlike DRF-only models;
2. It is a more accurate representation of a deep GP model than DRF, as GP-DRF’s first layer is the exact non-parametric formulation of the GP model, whereas DRF layers are all approximations to GPs;
3. It is a scalable approximation of deep GPs as the random feature expansion method allows computationally efficient training and inference;
4. The Bayesian nature of our model allows it to estimate uncertainty, which is crucial for many real-world applications; and
5. It outperformed DRF and GPs on several classification, regression, and uncertainty benchmarks.

II Background on Random Features in GP and DRF

This section briefly reviews the DRF model [3], which can accurately approximate deep GPs. Since the core idea of the DRF is to model each layer of a DNN by the random-feature expansion of a GP, we begin the discussion with random features and their application to GPs. Note that the DRF model assumes, and is restricted to, fixed-sized inputs; hence we denote by $d$ the dimensionality of the input vector $x$ throughout this section.

Random features [12, 13] have a finite dimensional feature vector representation for inputs where the inner product on their feature space equals (approximately and/or in expectation) the value of the kernel function of interest. That is, the aim is to find a $D$ -dim feature vector $\phi(x)$ such that for a given kernel $k(x,x^{\prime})$ , we have $\phi(x)^{\top}\phi(x^{\prime})\simeq k(x,x^{\prime})$ . For instance, the ARD kernel,

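$$k_{ARD}(x,x^{\prime})=\alpha\exp\Big(-\frac{1}{2}(x-x^{\prime})^{\top}\Gamma^{-1}(x-x^{\prime})\Big), \tag{1}$$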

with parameters $\theta=\{\alpha,\Gamma=\textrm{diag}(\gamma_{1},\dots,\gamma_{d})\}$ , admits the $D=2M$ -dim feature representation:

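$$\phi_{ARD}(x)=\sqrt{\frac{\alpha}{M}}\,\Big[\cos(\omega_{(1)}^{\top}x),\sin(\omega_{(1)}^{\top}x),\dots,\cos(\omega_{(M)}^{\top}x),\sin(\omega_{(M)}^{\top}x)\Big]^{\top}, \tag{2}$$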

where $\omega_{(m)}$ for $m=1,\dots,M$ are i.i.d. samples from $\mathcal{N}(0,\Gamma^{-1})$ [12]. Refer to Cho et al. [13] for more details about the arc-cosine kernel feature representations.
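The approximation $\phi(x)^{\top}\phi(x^{\prime})\simeq k(x,x^{\prime})$ can be checked numerically. The following is a minimal NumPy sketch, not taken from the authors' repository; the function names, the dimensions, and the values of $\alpha$ and $\Gamma$ are illustrative assumptions.

```python
import numpy as np

def ard_kernel(x, x2, alpha, gamma):
    """Exact ARD kernel: alpha * exp(-0.5 * (x - x2)^T Gamma^{-1} (x - x2))."""
    diff = x - x2
    return alpha * np.exp(-0.5 * np.sum(diff ** 2 / gamma))

def ard_random_features(x, omegas, alpha):
    """2M-dim feature map: interleaved cos/sin of omega_m^T x, scaled by sqrt(alpha / M)."""
    M = omegas.shape[0]
    proj = omegas @ x                     # shape (M,)
    feats = np.empty(2 * M)
    feats[0::2] = np.cos(proj)
    feats[1::2] = np.sin(proj)
    return np.sqrt(alpha / M) * feats

rng = np.random.default_rng(0)
d, M, alpha = 3, 20000, 1.5               # illustrative sizes (assumed)
gamma = np.array([0.5, 1.0, 2.0])         # diagonal of Gamma (assumed values)
# omega_m ~ N(0, Gamma^{-1}): per-dimension standard deviation is 1 / sqrt(gamma)
omegas = rng.standard_normal((M, d)) / np.sqrt(gamma)

x, x2 = rng.standard_normal(d), rng.standard_normal(d)
exact = ard_kernel(x, x2, alpha, gamma)
approx = ard_random_features(x, omegas, alpha) @ ard_random_features(x2, omegas, alpha)
print(exact, approx)
```

The inner product reduces to $\frac{\alpha}{M}\sum_{m}\cos(\omega_{(m)}^{\top}(x-x^{\prime}))$, so the error shrinks at the usual Monte-Carlo rate as $M$ grows.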

We acquire several advantages by having a random feature representation for the kernel. Particularly in the GP, as we need $\textrm{Cov}(f(x),f(x^{\prime}))=k(x,x^{\prime})$ for every input pair $(x,x^{\prime})$ , it can be achieved by having a Bayesian parametric linear model on top of the feature space, namely, $f(x)=w^{\top}\phi(x)$ with prior $w\sim\mathcal{N}(0,I_{D})$ . This is because $\textrm{Cov}(f(x),f(x^{\prime}))=\textrm{Cov}(w^{\top}\phi(x),w^{\top}\phi(x^{\prime}))=\phi(x)^{\top}\phi(x^{\prime})=k(x,x^{\prime})$ . Hence, using the random feature expansion method, one can reduce all inference operations of a non-parametric GP to a parametric formulation, which significantly improves computational time as illustrated in Cutajar et al. [3]. This is because the setup does not require storing training data points nor kernel matrices, thus avoiding the need to perform kernel matrix inversions which are costly. In other words, we have a succinct summary about the posterior information through the finite dimensional $w$ , namely $P(w|\mathcal{D})$ .
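The covariance identity used above follows in a few lines once the random spectra (and hence $\phi$) are treated as fixed and only $w$ is random. The worked steps below are a sketch under that assumption:

```latex
\begin{aligned}
\mathrm{Cov}\big(f(x), f(x')\big)
  &= \mathbb{E}\big[w^{\top}\phi(x)\, w^{\top}\phi(x')\big]
     - \mathbb{E}\big[w^{\top}\phi(x)\big]\,\mathbb{E}\big[w^{\top}\phi(x')\big] \\
  &= \phi(x)^{\top}\,\mathbb{E}\big[w w^{\top}\big]\,\phi(x') - 0
     \qquad (\text{since } \mathbb{E}[w] = 0) \\
  &= \phi(x)^{\top} I_{D}\, \phi(x')
   = \phi(x)^{\top}\phi(x') \simeq k(x,x').
\end{aligned}
```

The last step is exact only in expectation over the random spectra, which is why the parametric model approximates, rather than reproduces, the GP prior.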

Next, we briefly describe how the DRF model [3] approximates the deep GP model [10] using random feature expansion. In the deep GP model, the $l$ -th layer ( $l=0,\dots,L-1$ ) taking $h^{l}\in\mathbb{R}^{d_{l}}$ as input (by convention, $h^{0}=x$ and $h^{L}$ is the model’s final output) is modeled as:

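$$h_{j}^{l+1}=f_{j}^{l}(h^{l}) \quad\text{with}\quad f_{j}^{l}(\cdot)\sim\mathcal{GP}(k^{l}(\cdot,\cdot)) \quad\text{for } j=1,\dots,d_{l+1}. \tag{3}$$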

Note that the functions within the $l$ -th layer (i.e., $f^{l}_{j}$ for $j=1,\dots,d_{l+1}$ ) do not need to share the same GP prior defined by the kernel function $k^{l}(\cdot,\cdot)$ ; each function can have its own prior. In the DRF model, the random feature expansion replaces the GP-priored $f^{l}_{j}(\cdot)$ in Eq. 3 by Gaussian-priored linear functions of random features, yielding:

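$$h_{j}^{l+1}=(w_{j}^{l})^{\top}\phi^{l}(h^{l}) \quad\text{with}\quad w_{j}^{l}\sim\mathcal{N}(0,I_{D_{l}}) \quad\text{for } j=1,\dots,d_{l+1}, \tag{4}$$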

where $\phi^{l}(h^{l})$ is the $D_{l}$ -dim feature vector corresponding to $k^{l}(\cdot,\cdot)$ , the kernel in the $l$ -th layer. If it is ARD, for instance, one can use the form defined in Eq. 2 with $x$ replaced by $h^{l}$ .

That is, using vector forms and explicitly specifying the dependency of $\phi^{l}(\cdot)$ on the random spectra, the $l$ -th layer of the DRF model can be written as:

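$$h^{l+1}=(W^{l})^{\top}\phi^{l}(h^{l};\Omega^{l}) \quad\text{with}\quad W^{l}\sim\mathcal{N}(0,I) \text{ and } \Omega^{l}\sim\mathcal{N}(0,\Lambda^{l}). \tag{5}$$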

Here, $W^{l}$ is a $(D_{l}\times d_{l+1})$ matrix whose $j$ -th column is $w^{l}_{j}$ , $\Omega^{l}$ denotes all the random spectra $\omega$ ’s in the random features $\phi^{l}(\cdot)$ (such as those in Eq. 2), and $\Lambda^{l}$ defines the parameters of the density from which the random spectra are sampled (for example, $\Lambda^{l}=\Gamma^{-1}$ for the ARD kernel), where we assume a zero-mean Gaussian. (Footnote 1: Although there exist random features based on non-Gaussian samples, we confine all our derivations to the Gaussian density for its simplicity and popularity. Nonetheless, this can be extended to non-Gaussian densities where sampling is easy and the corresponding Gaussian-expected log-density and its gradient are easy to evaluate, at least approximately.)

Now, cascading Eq. 5 for $l=0,\dots,L-1$ forms the feed-forward function of the DRF, which is denoted as $y=G(x)$ . That is,

$$G(x;W,\Omega,\theta_{o})=g^{L-1}(\dots(g^{1}(g^{0}(x)))\dots), \tag{6}$$

where $h^{l+1}=g^{l}(h^{l};W^{l},\Omega^{l},\theta^{l}_{o})$ is shorthand for Eq. 5, and $\theta^{l}_{o}$ denotes the parameters other than $W^{l}$ and $\Omega^{l}$ in the $l$ -th layer (this includes the output variance parameter $\alpha^{l}$ defined in Eq. 2). We also write $W=\{W^{l}\}_{l=0}^{L-1}$ (similarly for $\Omega$ and $\theta_{o}$ ).
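The cascade of Eq. 5 into the feed-forward function $G$ can be sketched in a few lines of NumPy. This is an illustrative sketch that draws $W^{l}$ and $\Omega^{l}$ from their priors; the helper names, the isotropic choice $\Lambda^{l}=\lambda I$, and the layer sizes are assumptions, not the authors' implementation.

```python
import numpy as np

def rff(h, omega, alpha):
    """Random features of Eq. 2 for a batch h of shape (n, d_l) and spectra omega of shape (M, d_l).

    Cos and sin blocks are concatenated rather than interleaved; inner products are unchanged.
    """
    M = omega.shape[0]
    proj = h @ omega.T                                    # (n, M)
    return np.sqrt(alpha / M) * np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

def drf_forward(x, layers):
    """Compose g^{L-1}(... g^1(g^0(x)) ...): each layer maps h to phi(h; Omega) W."""
    h = x
    for W, omega, alpha in layers:
        h = rff(h, omega, alpha) @ W                      # (n, 2M) @ (2M, d_{l+1})
    return h

def sample_layers(dims, M, rng, lam=1.0, alpha=1.0):
    """Draw (W^l, Omega^l) from the priors, assuming Lambda^l = lam * I (hypothetical choice)."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        omega = rng.standard_normal((M, d_in)) * np.sqrt(lam)   # Omega^l ~ N(0, lam * I)
        W = rng.standard_normal((2 * M, d_out))                 # W^l ~ N(0, I)
        layers.append((W, omega, alpha))
    return layers

rng = np.random.default_rng(0)
layers = sample_layers(dims=[4, 3, 2], M=50, rng=rng)     # two layers: 4 -> 3 -> 2
y = drf_forward(rng.standard_normal((5, 4)), layers)      # batch of 5 fixed-size inputs
print(y.shape)
```

Every quantity here has a fixed shape determined by `dims`, which illustrates why the DRF alone requires fixed-sized inputs.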

DRF is a deep Bayesian neural network model that addresses the critical drawback of deep GPs by making inference much more scalable using a parametric formulation. However, the random feature expansion method can only be applied to a restricted class of kernel functions. First, random feature representations are only known for a limited number of kernel functions such as ARD and arc-cosine. Second, it is not applicable to kernel functions that operate on variable-sized inputs, as they are not shift-invariant (it is not feasible to define a shift operation for a pair of variable-sized inputs), a property required by Bochner’s theorem [16]. This limits DRF’s ability to deal with sequence data. In the next section, we introduce a novel model that extends DRF to handle variable-sized inputs.

III Proposed approach

We propose GP-DRF, a deep Bayesian model that uses GP’s kernel machines (which can utilize sequence kernel functions for variable-sized inputs) in conjunction with the deep architecture of the DRF model. As shown in Fig. 2, the GP layer is placed at the bottom which takes possibly variable-sized input data $x$ as input and returns a vector of latent functions $F$ as output. $F$ is then fed into the upper DRF model as input which maps it to the prediction space.

In the next section, we provide a detailed description of the semi-parametric model that is GP-DRF. Further, we propose an efficient variational inference method for computing the posterior of GP’s latent vector, used as input to the DRF model, in conjunction with the posterior of the internal weights and random spectra of the DRF model. To make the GP layer computationally efficient, we use the inducing point method [14] in our implementation.

III-A Model Architecture

The bottom layer of GP-DRF is a GP whose latent functions operate on (possibly variable-sized) input $x$ , and returns an output vector that is fed into the upper DRF model as input. More specifically, we consider $d_{0}$ latent functions $\{f_{j}(\cdot)\}_{j=1}^{d_{0}}$ (so, $d_{0}$ becomes the input dimension of the DRF), and each latent function is drawn from $\mathcal{GP}(k_{j}(\cdot,\cdot))$ independently from one another.

We are given $N$ training instances $\mathcal{D}=\{(x_{n},y_{n})\}_{n=1}^{N}$ where each $x_{n}$ is a (sequence) input and $y_{n}$ is the corresponding target value (for example, a discrete class label for classification or a real value for regression). We denote the inputs and targets separately by $X=\{x_{1},\dots,x_{N}\}$ and $Y=\{y_{1},\dots,y_{N}\}$ . As the model contains a non-parametric component (the bottom GP layer), we need to maintain the outputs of the latent functions as random variables. Formally, we denote them by the $(N\times d_{0})$ matrix $F=[F_{1},\dots,F_{N}]^{\top}$ , whose $n$ -th row contains the GP’s output vector for $x_{n}$ , denoted (using a subscript) by $F_{n}=[f_{1}(x_{n}),\dots,f_{d_{0}}(x_{n})]^{\top}$ . The $j$ -th column of $F$ consists of the outputs of the $j$ -th function over all input instances, denoted (using a superscript) by $F^{j}=[f_{j}(x_{1}),\dots,f_{j}(x_{N})]^{\top}$ for $j=1,\dots,d_{0}$ . From the aforementioned independent GP prior assumption, $F$ follows a Gaussian distribution factorized over $j$ :

$$P(F)=\prod_{j=1}^{d_{0}}\mathcal{N}(F^{j};0,K_{j}), \tag{7}$$

where $K_{j}$ is the $(N\times N)$ kernel matrix extracted from $X$ using the kernel function $k_{j}(\cdot,\cdot)$ for $f_{j}(\cdot)$ .
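To illustrate how the GP layer turns variable-sized inputs into fixed-sized vectors, the sketch below builds a kernel matrix on strings of different lengths and draws $F$ from the factorized prior. The $k$-mer spectrum kernel is a toy stand-in chosen for brevity (it is not the sequence kernel used in the experiments), all $d_{0}$ functions share one kernel for simplicity, and the function names are hypothetical.

```python
import numpy as np
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """Toy k-mer spectrum kernel: inner product of k-mer count vectors (illustrative only)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(cs[m] * ct[m] for m in cs))

def kernel_matrix(seqs, k=2):
    """(N x N) kernel matrix on variable-length sequences, normalised so that k(x, x) = 1."""
    N = len(seqs)
    K = np.array([[spectrum_kernel(seqs[i], seqs[j], k) for j in range(N)] for i in range(N)])
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def sample_gp_layer(K, d0, rng, jitter=1e-6):
    """Draw F (N x d0): each column F^j ~ N(0, K) independently, i.e. K_j = K for all j."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return L @ rng.standard_normal((K.shape[0], d0))

seqs = ["ACGTAC", "ACG", "TTGACGTA", "GGGTTTAAACCC"]      # inputs of different lengths
K = kernel_matrix(seqs)
F = sample_gp_layer(K, d0=3, rng=np.random.default_rng(0))
print(F.shape)   # one d0-dimensional row F_n per sequence
```

Each row $F_{n}$ has length $d_{0}$ regardless of the length of $x_{n}$, which is what allows the upper DRF layers to consume it.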

For each instance $n$ , the output $F_{n}$ from the GP layer serves as input to the DRF model, resulting in the final output $G(F_{n};W,\Omega,\theta_{o})$ by following Eq. 6. We link this output to the target $y_{n}$ by a likelihood model. The likelihood model can be chosen according to the prediction task (some examples are logistic or probit model for class-labeled $y$ , and Gaussian for real-valued $y$ ). We denote the likelihood model as:

$$P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l}), \tag{8}$$

where $\theta_{l}$ stands for the parameters of the likelihood model (for instance, the weight vector in a logistic model or the noise variance in a Gaussian). As is common in practice, we assume the data instances are i.i.d., which lets the total likelihood be a product of Eq. 8 over $n=1,\dots,N$ .

Placing the priors on $W$ and $\Omega$ in the upper DRF model together, the full joint likelihood of our GP-DRF model can be written as follows:

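$$P(Y,W,\Omega,F\mid X,\Theta)=P(F\mid\theta_{k})\,P(W)\,P(\Omega\mid\Lambda)\prod_{n=1}^{N}P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l}), \tag{9}$$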

where (i) $\theta_{k}$ indicates the parameters of all the kernel functions $k_{j}(\cdot,\cdot)$ in Eq. 7; (ii) $P(W)=\prod_{l}\mathcal{N}(W^{l};0,I)$ ; (iii) $P(\Omega|\Lambda)=\prod_{l}\mathcal{N}(\Omega^{l};0,\Lambda^{l})$ ; and (iv) $\Theta=\{\theta_{k},\theta_{l},\theta_{o},\Lambda\}$ which represents all the parameters of the GP-DRF model.

III-B Variational Inference

In this section, we describe the inference formulation for the posterior distribution of the underlying latent variables of the GP-DRF model, specifically

$$P(F,W,\Omega\mid X,Y,\Theta). \tag{10}$$

A main benefit of our approach is that from Eq. 10, we can quantify the uncertainty about not only the parameters of the deep model ( $W$ and $\Omega$ ), but also the inputs ( $F$ ) to the deep model.

To perform inference, we opt for the popular variational inference method that uses pseudo inputs [17, 18]. This is computationally feasible for large-scale data as the complexity grows linearly with $N$ . Furthermore, this allows for mini-batch variational optimization since the log of Eq. 10 takes the form of a sum of log-likelihoods over instances. We introduce $M$ ( $\ll N$ ) pseudo inputs, denoted by $\overline{X}=\{\overline{x}_{1},\dots,\overline{x}_{M}\}$ . The pseudo inputs can be either selected randomly from $X$ , or chosen as representatives by performing clustering on $X$ . Note that clustering variable-sized data is feasible as sequence kernels can operate directly on points in $X$ . The latent function vectors on $\overline{X}$ are denoted as $\overline{F}$ , analogously to $F$ .

Next, we introduce the variational density $q(\cdot)$ that approximates Eq. 10. In defining $q(\cdot)$ , we assume fully factorized Gaussians for $W$ and $\Omega$ for computational simplicity. For $F$ , we force the conditional density $q(F|\overline{F})$ to coincide with the prior $P(F|\overline{F})$ , which is crucial to have some difficult terms canceled out, making the inference scalable [17]. In essence, the variational density is defined as:

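$$q(W,\Omega,F\mid\Psi)=q(W\mid\Psi_{W})\,q(\Omega\mid\Psi_{\Omega})\int P(F\mid\overline{F})\,q(\overline{F}\mid\Psi_{F})\,d\overline{F}, \tag{11}$$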

where

[Eqs. 12-14: $q(W\mid\Psi_{W})$ , $q(\Omega\mid\Psi_{\Omega})$ , $q(\overline{F}\mid\Psi_{F})$ ]

where the notation is as follows: (i) $w^{l}_{i,j}$ (scalar) is the $(i,j)$ -element of $W^{l}$ , and all the variational parameters for $q(W)$ are denoted as $\Psi_{W}=\{(m^{l}_{i,j},s^{l}_{i,j})\}$ (similarly for $\omega^{l}_{i,j}$ and $\Psi_{\Omega}$ ); (ii) $\mu_{j}$ and $\Sigma_{j}$ are the $M$ -dim mean vector and $(M\times M)$ full covariance matrix of the Gaussian $q(\overline{F}^{j})$ , where $\Psi_{F}=\{(\mu_{j},\Sigma_{j})\}$ ; and (iii) $\Psi=\{\Psi_{W},\Psi_{\Omega},\Psi_{F}\}$ denotes the full set of variational parameters.

The following inequality, derived from the KL divergence between $q(\cdot)$ and the posterior Eq. 10, provides the lower bound of the log-evidence.

$$\log P(Y\mid X,\overline{X},\Theta)\geq\mathrm{ELBO}(\Psi,\Theta), \tag{15}$$

where the evidence lower-bound (ELBO) is defined as:

[Eq. 16: $\mathrm{ELBO}(\Psi,\Theta)$ ]

Since the bounding gap in Eq. 15 is exactly the KL divergence between $q(\cdot)$ and the posterior, increasing $\textrm{ELBO}(\Psi,\Theta)$ with respect to $\Psi$ leads to a better variational density, whereas increasing with respect to $\Theta$ may improve the data evidence score of the model. Hence, maximizing $\textrm{ELBO}(\Psi,\Theta)$ with respect to both variable sets can achieve variational inference and model selection.
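One form of the bound that is consistent with the surrounding description (a data term summed over instances, plus KL divergences between Gaussians) is sketched below. This is an assumption based on standard sparse variational GP derivations, in which the choice $q(F\mid\overline{F})=P(F\mid\overline{F})$ cancels the conditional terms; it is not a transcription of the paper's Eq. 16 and should be checked against the original.

```latex
\begin{aligned}
\mathrm{ELBO}(\Psi,\Theta)
  ={}& \sum_{n=1}^{N} \mathbb{E}_{q}\big[\log P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l})\big] \\
     & - \mathrm{KL}\big(q(W\mid\Psi_{W})\,\|\,P(W)\big)
       - \mathrm{KL}\big(q(\Omega\mid\Psi_{\Omega})\,\|\,P(\Omega\mid\Lambda)\big)
       - \sum_{j=1}^{d_{0}} \mathrm{KL}\big(q(\overline{F}^{j})\,\|\,P(\overline{F}^{j})\big).
\end{aligned}
```

Under this form, each KL term is between Gaussians and has a closed-form expression, while the data term decomposes over $n$ and therefore admits mini-batch estimates.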

Next we describe how to evaluate the objective $\textrm{ELBO}(\Psi,\Theta)$ and its gradient. The second term of Eq. 16 consists of KL divergences between Gaussians, which admit closed forms and are easy to derive. The first term, as briefly mentioned earlier, has the form of a summation over the data instances, which can be readily approximated by a mini-batch average over a small subset of data (and is thus scalable to large datasets via stochastic gradient methods [19]). We now explain the individual term for instance $n$ ( $=1,\dots,N$ ), that is,

$$\mathbb{E}_{q}\big[\log P(y_{n}\mid G(F_{n};W,\Omega,\theta_{o}),\theta_{l})\big]. \tag{17}$$

Note that the expectation is with respect to

$$q(W,\Omega,F_{n})=q(W)\,q(\Omega)\,q(F_{n}). \tag{18}$$

For $q(F_{n})$ , the integration in the third term of Eq. 11 can be done analytically, yielding a Gaussian: $q(F_{n})=\mathcal{N}(F_{n};a_{n},B_{n})$ . Specifically, the mean vector $a_{n}$ is $(d_{0}\times 1)$ and the covariance matrix $B_{n}$ is $(d_{0}\times d_{0})$ diagonal, and their $j$ -th elements can be written as ( $j=1,\dots,d_{0}$ ):

[Eqs. 19-20: $[a_{n}]_{j}$ , $[B_{n}]_{j,j}$ ]

where $\overline{k}_{j}(x_{n})=[k_{j}(x_{n},\overline{x}_{1}),\dots,k_{j}(x_{n},\overline{x}_{M})]^{\top}$ and $\overline{K}_{j}$ is the $(M\times M)$ kernel matrix for $k_{j}(\cdot,\cdot)$ on pseudo inputs $\overline{X}$ .
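The marginal $q(F_{n})$ can be computed per output dimension from the pseudo-input kernel quantities. The sketch below uses the standard sparse-GP expressions for the mean and variance; these are an assumption about the form of $[a_{n}]_{j}$ and $[B_{n}]_{j,j}$ (they are not copied from the paper), and the function name and the toy RBF kernel are hypothetical.

```python
import numpy as np

def marginal_q_fn(k_nn, k_bar, K_bar, mu, Sigma, jitter=1e-8):
    """Mean and variance of q(f_j(x_n)) after integrating out the inducing values.

    Assumed standard sparse-GP form:
      a = k_bar^T K_bar^{-1} mu
      b = k_nn - k_bar^T K_bar^{-1} k_bar + k_bar^T K_bar^{-1} Sigma K_bar^{-1} k_bar
    """
    M = K_bar.shape[0]
    A = np.linalg.solve(K_bar + jitter * np.eye(M), k_bar)   # K_bar^{-1} k_bar
    a = A @ mu
    b = k_nn - k_bar @ A + A @ Sigma @ A
    return a, b

def rbf(u, v):
    """Toy unit-lengthscale RBF kernel on scalars, used only to build example inputs."""
    return np.exp(-0.5 * (u - v) ** 2)

x_bar = np.array([-1.0, 0.0, 1.0])                 # M = 3 pseudo inputs
K_bar = rbf(x_bar[:, None], x_bar[None, :])        # (M x M) kernel matrix on pseudo inputs
x_n = 0.3
k_bar = rbf(x_n, x_bar)                            # kernel vector between x_n and pseudo inputs

# Sanity check: if q(F_bar) equals the prior N(0, K_bar), q(f(x_n)) equals the prior marginal.
a, b = marginal_q_fn(rbf(x_n, x_n), k_bar, K_bar, mu=np.zeros(3), Sigma=K_bar)
print(a, b)
```

Only an $(M\times M)$ system is solved per output dimension, which is where the linear-in-$N$ cost of the pseudo-input approach comes from.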

Although the expectation is taken with respect to a Gaussian distribution, the log-likelihood is a highly complex function of the integration variables $W$ , $\Omega$ , and $F_{n}$ , and thus the expectation cannot be computed analytically. Furthermore, when we take the gradient of Eq. 17 with respect to $\Psi$ and $\Theta$ , we should note that the underlying density $q(\cdot)$ is dependent on both of these variable sets. To overcome this difficulty, we follow the re-parametrized Monte-Carlo estimation technique suggested by Kingma et al. [20] for the Bayesian DNN, and also adopted in Cutajar et al. [3] for the parametric inference of the DRF model. The idea is to re-parametrize the Gaussian integration variables by decomposing them into parameters that we optimize over and random variables that are parameter-free. More specifically, we re-write each variable as:

[Eqs. 21-23: $w^{l}_{i,j}$ , $\omega^{l}_{i,j}$ , $[F_{n}]_{j}$ ]

After sampling $S$ sets of independent standard normal random numbers $\{e^{(s)}_{lij},\tau^{(s)}_{lij},\epsilon^{(s)}_{nj}\}_{l,i,j,n}$ for $s=1,\dots,S$ , we plug these into Eq. 21–23 to get the sample versions of $(W^{(s)},\Omega^{(s)},F^{(s)}_{n})$ , and have an unbiased estimate of Eq. 17:

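$$\frac{1}{S}\sum_{s=1}^{S}\log P(y_{n}\mid G(F_{n}^{(s)};W^{(s)},\Omega^{(s)},\theta_{o}),\theta_{l}). \tag{24}$$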

Note that since we separated the parameters from random samples, the gradient of Eq. 24 can be derived for individual terms with respect to $\Psi$ and $\Theta$ , yielding an unbiased estimate of the gradient of Eq. 17.
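The re-parametrized estimator can be illustrated on a deliberately simple case where the expectation is known in closed form: a single linear layer with a Gaussian likelihood. In this sketch the weight is written as $w = m + s\,e$ with $e\sim\mathcal{N}(0,I)$, treating $s$ as a standard deviation; that convention, the toy model, and all names are assumptions made for illustration.

```python
import numpy as np

def gaussian_loglik(y, f, noise_var):
    """log N(y; f, noise_var)."""
    return -0.5 * np.log(2 * np.pi * noise_var) - 0.5 * (y - f) ** 2 / noise_var

def mc_expected_loglik(y, phi, m, s, noise_var, S, rng):
    """(1/S) sum_s log P(y | w^(s)) with w^(s) = m + s * e^(s), e^(s) ~ N(0, I)."""
    e = rng.standard_normal((S, m.shape[0]))     # parameter-free noise
    w = m + s * e                                # re-parametrized samples, shape (S, D)
    return gaussian_loglik(y, w @ phi, noise_var).mean()

rng = np.random.default_rng(0)
phi = np.array([0.5, -1.0, 2.0])                 # fixed feature vector (toy values)
m = np.array([0.1, 0.2, -0.3])                   # variational means
s = np.array([0.3, 0.1, 0.2])                    # variational standard deviations (assumed)
y, noise_var = 0.4, 0.5

estimate = mc_expected_loglik(y, phi, m, s, noise_var, S=200000, rng=rng)
# Closed form for this linear-Gaussian toy case, used only to check the estimator.
exact = (-0.5 * np.log(2 * np.pi * noise_var)
         - 0.5 * ((y - m @ phi) ** 2 + np.sum((s * phi) ** 2)) / noise_var)
print(estimate, exact)
```

Because the noise `e` does not depend on `m` or `s`, the estimate is a differentiable function of the variational parameters, which is what makes gradient-based optimization of the bound possible. In the full model the log-likelihood passes through all DRF layers, so no closed form is available and only the sampled estimate is used.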

Three options were used to perform DRF inference in Cutajar et al. [3], known as PRIOR-FIXED, VAR-FIXED, and VAR-RESAMPLED. With PRIOR-FIXED, the random spectra $\Omega$ are not inferred (for simplicity) but marginalized out from Eq. 10, and only the parameters $\Lambda$ are trained. This can be achieved by removing the KL term regarding $\Omega$ in Eq. 16 and using $\Omega^{(s)}$ sampled from $P(\Omega|\Lambda)$ in Eq. 24 instead of Eq. 22.

With VAR-FIXED and VAR-RESAMPLED, $\Omega$ is inferred in the posterior $q(\Omega)$ with the corresponding variational parameters $\Psi_{\Omega}$ (this is shown in our derivation above). The difference between the two VAR options is whether the random numbers $\{e^{(s)}_{lij},\tau^{(s)}_{lij},\epsilon^{(s)}_{nj}\}_{l,i,j,n}$ are sampled once and fixed throughout the optimization (VAR-FIXED), or sampled at every iteration (VAR-RESAMPLED).
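The distinction between the two VAR options comes down to when the standard-normal noise is drawn. A minimal sketch of that switch follows; the `NoiseSource` class is a hypothetical helper introduced here for illustration and does not correspond to any API in the referenced implementations.

```python
import numpy as np

class NoiseSource:
    """Supplies standard-normal noise for the re-parametrization, either fixed or resampled."""

    def __init__(self, shape, resample, seed=0):
        self.shape = shape
        self.resample = resample
        self.rng = np.random.default_rng(seed)
        self._fixed = self.rng.standard_normal(shape)   # drawn once, reused if not resampling

    def draw(self):
        if self.resample:                 # VAR-RESAMPLED: fresh noise at every iteration
            return self.rng.standard_normal(self.shape)
        return self._fixed                # VAR-FIXED: the same noise throughout optimization

fixed = NoiseSource((3,), resample=False)
fresh = NoiseSource((3,), resample=True)
print(fixed.draw(), fixed.draw())   # identical arrays
print(fresh.draw(), fresh.draw())   # different arrays
```

With fixed noise the objective is a deterministic function of the parameters, whereas resampling yields a stochastic objective whose gradients are unbiased for the true expectation.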

III-C Prediction

Given a trained model, where the variational density $q(\cdot)$ and model parameters $\Theta$ are optimized, we predict the model’s output and its uncertainty for an unseen test input $x_{*}$ as follows. Let $F_{*}=[f_{1}(x_{*}),\dots,f_{d_{0}}(x_{*})]^{\top}$ be the output vector of the bottom GP layer on $x_{*}$ (also the input vector to the upper DRF), and $y_{*}$ the final target output of the model. The posterior distribution for $y_{*}$ is approximated as,

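$$P(y_{*}\mid x_{*},X,\overline{X},Y,\Theta)\approx\int P(y_{*}\mid G(F_{*};W,\Omega,\theta_{o}),\theta_{l})\,P(F_{*}\mid\overline{F})\,q(W,\Omega,\overline{F})\,dW\,d\Omega\,d\overline{F}. \tag{25}$$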

Although the last two integrands in Eq. 25 are Gaussians, the first term depends on the integration variables in a highly complex way, so an analytic solution is infeasible. Instead, we use Monte-Carlo estimation similar to Section III-B. That is, after sampling $(W^{(s)},\Omega^{(s)},F^{(s)}_{*})$ from the Gaussians of the last two integrands, we approximate Eq. 25 as:

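$$\frac{1}{S}\sum_{s=1}^{S}P(y_{*}\mid G(F_{*}^{(s)};W^{(s)},\Omega^{(s)},\theta_{o}),\theta_{l}). \tag{26}$$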

Then we can represent the posterior distribution of $y_{*}$ by the samples $\{y^{(t)}_{*}\}_{t=1}^{T}$ which are obtained by sampling from the mixture density defined in Eq. 26. Namely, for each $t=1,\dots,T$ , (i) select $s$ uniformly at random from $\{1,\dots,S\}$ , then (ii) sample $y^{(t)}_{*}\sim P(y_{*}|G(F^{(s)}_{*};W^{(s)},\Omega^{(s)},\theta_{o}),\theta_{l})$ which requires a full feed-forward pass of the input $x_{*}$ through the GP-DRF model. Therefore, for a scalar target $y_{*}$ , the posterior mean and variance can be estimated as:

[TABLE]
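
The two-step sampling procedure above can be sketched in Python for the scalar regression case. This is a hedged illustration rather than the authors' code: it assumes a Gaussian likelihood with a known noise standard deviation `noise_std`, it takes the $S$ feed-forward outputs $G(F^{(s)}_{*};W^{(s)},\Omega^{(s)},\theta_{o})$ as a precomputed list `forward_outputs`, and it uses the plain $1/T$ sample moments, which may differ from the exact estimator in the paper.

```python
import random
import statistics


def predictive_samples(forward_outputs, noise_std, T, seed=0):
    """Draw T samples of y* from the equal-weight mixture over S forward passes."""
    rng = random.Random(seed)
    samples = []
    for _ in range(T):
        g = rng.choice(forward_outputs)          # (i) pick s uniformly at random
        samples.append(rng.gauss(g, noise_std))  # (ii) y* ~ N(g, noise_std^2)
    return samples


def predictive_moments(samples):
    """Sample mean and (1/T) sample variance of the drawn y* values."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu=mean)
    return mean, var


# Hypothetical outputs of S = 4 forward passes for one test input.
ys = predictive_samples([0.9, 1.1, 1.0, 1.2], noise_std=0.1, T=1000)
mean, var = predictive_moments(ys)
```

The resulting variance reflects both the spread across the $S$ forward passes and the likelihood noise.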

IV Experiments

IV-A Experimental Setup

To showcase the efficacy of GP-DRF, we evaluate it on several datasets, grouped into 3 tasks: (1) fixed-sized input classification, which includes MNIST [21], EEG, and SPAM [22]; (2) fixed-sized input regression, which includes POWERPLANT and PROTEIN [22]; and (3) variable-sized (sequence) input classification, which includes MUSIC [23] (a music genre dataset for multi-class genre prediction), REUTERS (a text dataset for text categorization, available at http://www.daviddlewis.com/resources/testcollections/reuters21578/), and SCOP [5] (a protein sequence dataset for protein fold recognition). Their statistics are described in Table I. The evaluation metric for classification datasets is the mean number of misclassifications (error rate), and, for regression datasets, the root mean square error (RMSE).

We compare GP-DRF against two baselines: (1) GP, and (2) DRF. GP is a Gaussian process based model with the same architecture as the first layer of GP-DRF. Each target class is associated with a Gaussian process, which is trained using variational inference as described in Hensman et al. [24]. DRF has the same architecture as the DRF component of GP-DRF, and we train it using the procedure in Cutajar et al. [3]. For the sequence datasets, the models use the double-(1,5) kernel features described in Kuksa et al. [6]; for the rest of the datasets, they use the ARD kernel features described in Cutajar et al. [3]. For the Gaussian processes, each kernel feature $K_{i}(x,y)$ is associated with two trainable parameters: (1) $\alpha_{i}$, which scales the output as $\alpha_{i}\cdot K_{i}(x,y)$, and (2) $\sigma_{i}$, which is a parameter within the kernel function.
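
To make the role of the two trainable parameters concrete, the following sketch evaluates $\alpha_{i}\cdot K_{i}(x,y)$ for one kernel feature. It assumes an isotropic RBF kernel with $\sigma_{i}$ as its bandwidth; this is a stand-in chosen for brevity, not the ARD or sequence kernels used in the experiments, and the function name `scaled_rbf_kernel` is hypothetical.

```python
import math


def scaled_rbf_kernel(x, y, alpha, sigma):
    """Return alpha * exp(-||x - y||^2 / (2 * sigma^2)).

    alpha scales the kernel output; sigma is the parameter inside the kernel.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return alpha * math.exp(-sq_dist / (2.0 * sigma ** 2))


# Identical inputs give the maximum value, which equals alpha.
k_same = scaled_rbf_kernel([0.0, 0.0], [0.0, 0.0], alpha=2.0, sigma=1.0)
```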

IV-B Implementation Details

We run the ADAM [25] optimizer for 1000 epochs with learning rate $1\times 10^{-5}$. An L2 penalty is added to all parameters with the coefficient $5\times 10^{-4}$. For GP and GP-DRF, the number of inducing points is 200. At each iteration, a single example is selected uniformly at random from the training set and 100 Monte Carlo samples are collected from each random variable. Each model uses the Gaussian likelihood for regression and the softmax likelihood for classification problems.

IV-C Comparison to GP and DRF Models

Table I shows that GP-DRF consistently outperforms GP and DRF on all eight datasets. Further, GP-DRF reduces the error rate of "Double-(1,5) (MFCC)" [6] by $7.7\%$ on the MUSIC dataset while also providing uncertainty quantification. This suggests that combining exact and approximate approaches to computing kernel features, together with deep structures, can be useful.

IV-D Bhattacharyya Distance Benchmark

The Bhattacharyya distance [26] is a widely used measure within the research community [27, 28, 29]. It can be used to measure the separability of classes in classification. It is more reliable than the Mahalanobis distance [30] as the Bhattacharyya distance grows depending on the difference between the means of the classes as well as their standard deviations, rather than just the means.

In this setup, we perform uncertainty analysis on our models by computing the distance between "the two most confident class posterior distributions" with respect to the Bhattacharyya measure. For a K-way classification task, the certainty is

[TABLE]

where $F_{*}(x)=\mathcal{N}(\mu_{*},\sigma_{*})$ is the distribution over the posterior samples obtained for the test example's most confident predicted class, and $F_{+}(x)=\mathcal{N}(\mu_{+},\sigma_{+})$ is that of the test example's second most confident predicted class. This is the notion of "margin" in class prediction, where a larger distance suggests the model is more certain about its prediction.
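
As a hedged illustration of this margin, the sketch below computes the standard closed-form Bhattacharyya distance between two univariate Gaussians. It assumes that $\sigma_{*}$ and $\sigma_{+}$ denote standard deviations and that each class posterior is summarized by a single mean and standard deviation; the function name `bhattacharyya_gaussian` is introduced here and does not come from the paper.

```python
import math


def bhattacharyya_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Bhattacharyya distance between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2).

    D_B = 1/4 * ln(1/4 * (vp/vq + vq/vp + 2)) + 1/4 * (mu_p - mu_q)^2 / (vp + vq),
    where vp and vq are the two variances.
    """
    vp, vq = sigma_p ** 2, sigma_q ** 2
    spread_term = 0.25 * math.log(0.25 * (vp / vq + vq / vp + 2.0))
    mean_term = 0.25 * (mu_p - mu_q) ** 2 / (vp + vq)
    return spread_term + mean_term


# Hypothetical posterior summaries for the top two predicted classes.
margin = bhattacharyya_gaussian(mu_p=0.0, sigma_p=1.0, mu_q=2.0, sigma_q=1.0)
```

The first term grows when the two standard deviations differ and the second when the means differ, which is why the distance responds to both.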

Table II shows the average Bhattacharyya distances for the correctly classified samples, $D_{c}$, and the misclassified samples, $D_{m}$, on the three datasets. We see that GP-DRF has the largest discrepancy between $D_{c}$ and $D_{m}$, suggesting it is significantly more confident than competing models when making a correct prediction.

The histograms in Figure 3 show the Bhattacharyya distances for each sample in the test set for the correctly classified (shown as green bars) and the misclassified samples (red bars). The histograms further justify GP-DRF’s efficacy, as it offers higher certainty compared to GP and DRF when it correctly classifies a test example. This implies that the quantitative measure of prediction uncertainty, derived from our Bayesian model, can be used as an accurate gauge of the quality of prediction.

V Conclusion

We proposed GP-DRF, a novel deep Gaussian process, which defines a powerful Bayesian model that is scalable, can deal with sequential inputs, provides uncertainty estimates, and achieves superior performance compared to its counterparts. It combines the non-parametric structure of Gaussian processes in its first layer and the parametric approximation of Gaussian processes in the rest of the network. GP-DRF consistently outperforms the GP and the DRF models on several benchmarks. GP-DRF can also provide better certainty estimates, quantified by the Bhattacharyya distance. In our future work, we will explore other structured, variable-size data domains, including graph and language data.

VI Acknowledgements

We would like to thank the anonymous referees for their constructive comments and suggestions. Issam Laradji was funded by the UBC Four-Year Doctoral Fellowships (4YF).

Bibliography

[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. 2016.
[2] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 2016.
[3] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Random feature expansions for deep Gaussian processes. ICML, 2017.
[4] Alessandro Moschitti. Making tree kernels practical for natural language learning. EACL, 2006.
[5] Loredana Lo Conte, Bart Ailey, Tim JP Hubbard, Steven E Brenner, Alexey G Murzin, and Cyrus Chothia. SCOP: a structural classification of proteins database. Nucleic Acids Research, 2000.
[6] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Kernel methods and algorithms for general sequence analysis. Technical report, 2008.
[7] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. JMLR, 2010.
[8] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. 2015.