Deep Learning: A Bayesian Perspective

Nicholas Polson; Vadim Sokolov

arXiv:1706.00473·stat.ML·January 23, 2018


TL;DR

This paper offers a Bayesian perspective on deep learning, providing insights into optimization, regularization, and data reduction techniques, and demonstrates their application through Airbnb booking data analysis.

Contribution

It introduces a Bayesian framework for understanding deep learning, highlighting how traditional data reduction methods relate to deep models and discussing optimization and regularization strategies.

Findings

01

Deep layers improve data reduction and predictive performance.

02

Bayesian regularization aids in weight and connection selection.

03

Application to Airbnb data demonstrates practical utility.


Tables

Table 1: Percent of each class in out-of-sample data set

Dest	AU	CA	DE	ES	FR	GB	IT	NDF	NL	PT	US	other
% obs	0.3	0.6	0.5	1	2.2	1.2	1.2	59	0.31	0.11	29	4.8

Equations

Y = F(X) \quad \text{where} \quad X = (X_{1}, \dots, X_{p}).

f_{l}^{W,b} = f_{l}\left(\sum_{j=1}^{N_{l}} W_{ij} z_{j} + b_{l}\right)

\hat{Y}(X) := \left(f_{1}^{W_{1},b_{1}} \circ \dots \circ f_{L}^{W_{L},b_{L}}\right)(X).

Z^{(1)}, \quad Z^{(2)}, \quad \dots, \quad Z^{(L)}, \quad \hat{Y}(X)

\mathcal{L}(Y, \hat{Y}) = -\log p(Y \mid Y^{\hat{W},\hat{b}}(X)).

\mathcal{L}_{\lambda}(Y, \hat{Y}) = -\log p(Y \mid Y^{\hat{W},\hat{b}}(X)) - \log p(\phi(W,b) \mid \lambda).

-\log p(\phi(W,b) \mid \lambda), \quad p(\phi(W,b) \mid \lambda)

p(W,b \mid D) \propto \exp\left(\log p(Y \mid Y^{W,b}(X)) + \log p(W,b)\right).

\hat{Y} := Y^{\hat{W},\hat{b}}(X) \quad \text{where} \quad (\hat{W},\hat{b}) := \arg\max_{W,b} \log p(W,b \mid D),

-\log p(W,b \mid D) = \sum_{i=1}^{T} \mathcal{L}\left(Y^{(i)}, Y^{W,b}(X^{(i)})\right) + \lambda \phi(W,b).

\arg\min_{W} E_{D \sim \text{Ber}(p)} \| Y - W(D \star X) \|_{2}^{2},

\arg\min_{W} \| Y - pWX \|_{2}^{2} + p(1-p) \| \Gamma W \|_{2}^{2}.

Y_{i}^{(l)}, \quad Z_{i}^{(l)}

D_{i}^{(l)}, \quad \tilde{Y}_{i}^{(l)}, \quad Y_{i}^{(l)}, \quad Z_{i}^{(l)}

\hat{Y} = f_{1}^{W_{1},b_{1}}\big(f_{2}(W_{2}X+b_{2})\big) = f_{1}^{W_{1},b_{1}}(Z), \quad \text{where } Z := f_{2}(W_{2}X+b_{2}).

Z = f_{2}(X) = W^{\top} X + b,

Z = f_{2}(X) = \sum_{i=1}^{N_{1}} f_{i}(W_{i1} X_{1} + \dots + W_{ip} X_{p}).

x_{1} x_{2} = \frac{1}{4}(x_{1}+x_{2})^{2} - \frac{1}{4}(x_{1}-x_{2})^{2}

\max(x_{1},x_{2}) = \frac{1}{2}(x_{1}+x_{2}) + \frac{1}{2}|x_{1}-x_{2}|

(x_{1} x_{2})^{2} = \frac{1}{4}(x_{1}+x_{2})^{4} + \frac{7}{4 \cdot 3^{3}}(x_{1}-x_{2})^{4} - \frac{1}{2 \cdot 3^{3}}(x_{1}+2x_{2})^{4} - \frac{2^{3}}{3^{3}}\left(x_{1}+\frac{1}{2}x_{2}\right)^{4}

(f_{x_{1}} \circ \dots \circ f_{x_{k}})(0) = \left(x_{1} + \left(x_{2} + \dots + \left(x_{k-1} + x_{k}^{+}\right)^{+} \dots\right)^{+}\right)^{+} = \max_{1 \leq j \leq k} (x_{1} + \dots + x_{j})^{+}

Y_{j}(x) = F_{W}^{m}(X)_{j} = \sum_{k=1}^{K} W_{2}^{jk} Z_{k} \quad \text{for} \quad Z_{k} = f\left(\sum_{i=1}^{N} W_{1}^{ki} x_{i}\right).

L(W) = \quad \text{with} \quad \phi(W) =

\arg\min_{W,Z} \| X - W_{2} Z \|^{2} + \lambda \phi(Z) + \| Z - f(W_{1}, X) \|^{2},


Full text

Deep Learning: A Bayesian Perspective

Nicholas G. Polson

Booth School of Business

University of Chicago

Vadim O. Sokolov

Systems Engineering and Operations Research

George Mason University

Polson is Professor of Econometrics and Statistics at the Chicago Booth School of Business, email: [email protected]. Sokolov is an assistant professor at George Mason University, email: [email protected].

First Draft: May 2017

This Draft: November 2017

Abstract

Deep learning is a form of machine learning for nonlinear high dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), projection pursuit regression (PPR) are all shown to be shallow learners. Their deep learning counterparts exploit multiple deep layers of data reduction which provide predictive performance gains. Stochastic gradient descent (SGD) training optimisation and Dropout (DO) regularization provide estimation and variable selection. Bayesian regularization is central to finding weights and connections in networks to optimize the predictive bias-variance trade-off. To illustrate our methodology, we provide an analysis of international bookings on Airbnb. Finally, we conclude with directions for future research.

1 Introduction

Deep learning (DL) is a form of machine learning that uses hierarchical abstract layers of latent variables to perform pattern matching and prediction. Deep learners are probabilistic predictors where the conditional mean is a stacked generalized linear model (sGLM). The current interest in DL stems from its remarkable success in a wide range of applications, including Artificial Intelligence (AI) (DeepMind, 2016; Kubota, 2017; Esteva et al., 2017), image processing (Simonyan and Zisserman, 2014), learning in games (DeepMind, 2017), neuroscience (Poggio, 2016), energy conservation (DeepMind, 2016), and skin cancer diagnostics (Kubota, 2017; Esteva et al., 2017). Schmidhuber (2015) provides a comprehensive historical survey of deep learning and its applications.

Deep learning is designed for massive data sets with many high dimensional input variables. For example, Google’s translation algorithm (Sutskever et al., 2014) uses $\sim$ 1-2 billion parameters and very large dictionaries. Computational speed is essential, and automated differentiation and matrix manipulations are available in TensorFlow (Abadi et al., 2015). Baidu successfully deployed speech recognition systems (Amodei et al., 2016) with an extremely large deep learning model with over 100 million parameters, 11 layers and almost 12 thousand hours of speech for training. DL is algorithmic rather than probabilistic in nature; see Breiman (2001) for the merits of both approaches.

Our approach is Bayesian and probabilistic. We view the theoretical roots of DL in Kolmogorov’s representation of a multivariate response surface as a superposition of univariate activation functions applied to an affine transformation of the input variable (Kolmogorov, 1963). An affine transformation of a vector is a weighted sum of its elements (linear transformation) plus an offset constant (bias). Our Bayesian perspective on DL leads to new avenues of research including faster stochastic algorithms, hyper-parameter tuning, construction of good predictors, and model interpretation.

On the theoretical side, we show how DL exploits Kolmogorov’s “universal basis”. By construction, deep learning models are very flexible and gradient information can be efficiently calculated for a variety of architectures. On the empirical side, we show that the advances in DL are due to:

(i) New activation (a.k.a. link) functions, such as the rectified linear unit ( $\text{ReLU}(x)=\max(0,x)$ ), instead of the sigmoid function.

(ii) Depth of the architecture and dropout as a variable selection technique.

(iii) Computationally efficient routines to train and evaluate the models as well as accelerated computing via graphics processing unit (GPU) and tensor processing unit (TPU).

(iv) Deep learning has very well developed computational software where pure MCMC is too slow.

To illustrate DL, we provide an analysis of a dataset from Airbnb on first time international bookings. Different statistical methodologies can then be compared, see Kaggle (2015) and Ripley (1994) who provides a comparison of traditional statistical methods with neural network based approaches for classification.

The rest of the paper is outlined as follows. Section 1.1 provides a review of deep learning. Section 2 provides a Bayesian probabilistic interpretation of many traditional statistical techniques (PCA, PCR, SIR, LDA) which are shown to be “shallow learners” with two layers. Much of the recent success in DL applications has been achieved by including deeper layers and these gains pass over to traditional statistical models. Section 3 provides heuristics on why Bayes procedures provide good predictors in high dimensional data reduction problems. Section 4 describes how to train, validate and test deep learning models. We provide computational details associated with stochastic gradient descent (SGD). Section 5 provides an application to bookings data from the Airbnb website. Finally, Section 6 concludes with directions for future research.

1.1 Deep Learning

Machine learning finds a predictor of an output $Y$ given a high dimensional input $X$ . A learning machine is an input-output mapping, $Y=F(X)$ , where the input space is high-dimensional,

$$Y=F(X)\quad\text{where}\quad X=(X_{1},\ldots,X_{p}).$$

The output $Y$ can be continuous, discrete or mixed. For a classification problem, we need to learn $F:X\rightarrow Y$ , where $Y\in\{1,\ldots,K\}$ indexes categories. A predictor is denoted by $\hat{Y}(X)$ .

To construct a multivariate function, $F(X)$ , we start with building blocks of hidden layers. Let $f_{1},\ldots,f_{l}$ be univariate activation functions. A semi-affine activation rule is given by

$$f_{l}^{W,b}=f_{l}\left(\sum_{j=1}^{N_{l}}W_{ij}z_{j}+b_{l}\right).$$

Here $W$ and $z$ are the weight matrix and inputs of the $l$ th layer.

Our deep predictor, given the number of layers $L$ , then becomes the composite map

$$\hat{Y}(X):=\left(f_{1}^{W_{1},b_{1}}\circ\ldots\circ f_{L}^{W_{L},b_{L}}\right)(X).$$

Put simply, a high dimensional mapping, $F$ , is modeled via the superposition of univariate semi-affine functions. Similar to a classic basis decomposition, the deep approach uses univariate activation functions to decompose a high dimensional $X$ . To select the number of hidden units (a.k.a neurons), $N_{l}$ , at each layer we will use a stochastic search technique known as dropout.

The offset vector is essential. For example, using $f(x)=\sin(x)$ without the bias term $b$ would not allow us to recover an even function like $\cos(x)$ . An offset element (e.g. $\sin(x+\pi/2)=\cos(x)$ ) immediately corrects this problem.

Let $Z^{(l)}$ denote the $l$ -th layer, and so $X=Z^{(0)}$ . The final output is the response $Y$ , which can be numeric or categorical. A deep prediction rule is then

[TABLE]

Here, $W^{(l)}$ are weight matrices, and $b^{(l)}$ are threshold or activation levels. Designing a good predictor depends crucially on the choice of univariate activation functions $f^{(l)}$ . Kolmogorov’s representation requires only two layers in principle. Vitushkin and Khenkin (1967) prove the remarkable fact that a discontinuous link is required at the second layer even though the multivariate function is continuous. Neural networks (NN) simply approximate a univariate function as mixtures of sigmoids, typically with an exponential number of neurons, which does not generalize well. They can simply be viewed as projection pursuit regression $F(X)=\sum_{i=1}^{N}g_{i}(WX+b)$ with the only difference being that in a neural network the nonlinear link functions are parameter dependent and learned from training data.
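The layer-by-layer prediction rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `deep_predictor`, the `(W, b, f)` layer container, the toy weights, and the choice of a ReLU hidden layer followed by an identity output layer are all assumptions made for the example.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def identity(x):
    """Identity link, used here for the output layer (an assumption)."""
    return x

def deep_predictor(X, layers):
    """Apply semi-affine layers in sequence, starting from Z^(0) = X.

    `layers` is a hypothetical list of (W, b, f) triples; each step
    computes Z <- f(W @ Z + b).
    """
    Z = np.asarray(X, dtype=float)
    for W, b, f in layers:
        Z = f(W @ Z + b)
    return Z

# Toy network: 2 inputs -> 2 hidden ReLU units -> 1 linear output.
layers = [
    (np.eye(2), np.zeros(2), relu),
    (np.array([[1.0, 1.0]]), np.array([0.5]), identity),
]
y_hat = deep_predictor([1.0, -2.0], layers)
print(y_hat)
```

Setting a row of a weight matrix to zero removes the corresponding unit, which is how the non-zero pattern of the weights implies a network structure.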

Figure 1 illustrates a number of commonly used structures; for example, feed-forward architectures, auto-encoders, convolutional, and neural Turing machines. Once you have learned the dimensionality of the weight matrices which are non-zero, there’s an implied network structure.

Recent deep architectures (indicating non-zero weights) include convolutional neural networks (CNN), recurrent NN (RNN), long short-term memory (LSTM), and neural Turing machines (NTM). Pascanu et al. (2013) and Montúfar and Morton (2015) provide results on the advantage of representing some functions compactly with deep layers. Poggio (2016) extends theoretical results on when deep learning can be exponentially better than shallow learning. Bryant (2008) implements the Sprecher (1972) algorithm to estimate the non-smooth inner link function. In practice, deep layers allow for smooth activation functions to provide “learned” hyper-planes which find the underlying complex interactions and regions without having to see an exponentially large number of training samples.

2 Deep Probabilistic Learning

Probabilistically, the output $Y$ can be viewed as a random variable being generated by a probability model $p(Y|Y^{W,b}(X))$ . Given $\hat{W},\hat{b}$ , the negative log-likelihood defines $\mathcal{L}$ as

$$\mathcal{L}(Y,\hat{Y})=-\log p(Y\mid Y^{\hat{W},\hat{b}}(X)).$$

The $L_{2}$ -norm, $\mathcal{L}(Y_{i},\hat{Y}(X_{i}))=\|Y_{i}-\hat{Y}(X_{i})\|^{2}_{2}$ , is traditional least squares, and the cross-entropy loss $\mathcal{L}(Y_{i},\hat{Y}(X_{i}))=-\sum_{i=1}^{n}Y_{i}\log\hat{Y}(X_{i})$ is used for multi-class logistic classification. The procedure to obtain estimates $\hat{W},\hat{b}$ of the deep learning model parameters is described in Section 4.

To control the predictive bias-variance trade-off we add a regularization term and optimize

$$\mathcal{L}_{\lambda}(Y,\hat{Y})=-\log p(Y\mid Y^{\hat{W},\hat{b}}(X))-\log p(\phi(W,b)\mid\lambda).$$

Probabilistically this is a negative log-prior distribution over parameters, namely

[TABLE]

Deep predictors are regularized maximum a posteriori (MAP) estimators, where

$$p(W,b\mid D)\propto\exp\left(\log p(Y\mid Y^{W,b}(X))+\log p(W,b)\right).$$

Training requires the solution of a highly nonlinear optimization

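$$\hat{Y}:=Y^{\hat{W},\hat{b}}(X)\quad\text{where}\quad(\hat{W},\hat{b}):=\arg\max_{W,b}\log p(W,b\mid D),$$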

and the log-posterior is optimised given the training data, $D=\{Y^{(i)},X^{(i)}\}_{i=1}^{T}$ with

$$-\log p(W,b\mid D)=\sum_{i=1}^{T}\mathcal{L}\left(Y^{(i)},Y^{W,b}(X^{(i)})\right)+\lambda\phi(W,b).$$

Deep learning has the key property that $\nabla_{W,b}\log p(Y|Y^{W,b}(X))$ is computationally inexpensive to evaluate using tensor methods for very complicated architectures and fast implementation on large datasets. TensorFlow and TPUs provide a state-of-the-art framework for a plethora of architectures. From a statistical perspective, one caveat is that the posterior is highly multi-modal and providing good hyper-parameter tuning can be expensive. This is clearly a fruitful area of research for state-of-the-art stochastic Bayesian MCMC algorithms to provide more efficient algorithms. For shallow architectures, the alternating direction method of multipliers (ADMM) is an efficient solution to the optimization problem.
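The link between a regularized loss and a MAP estimate can be checked on a toy problem. The sketch below swaps the deep network for a linear-Gaussian model, which is an assumption made purely so that the minimizer has a closed form: with noise variance `sigma2` and a Gaussian prior of variance `tau2` on the weights, the negative log-posterior is a squared-error loss plus an $L_2$ penalty, and its minimizer is ridge regression with $\lambda=\sigma^{2}/\tau^{2}$. All names and data here are illustrative.

```python
import numpy as np

def neg_log_posterior(w, X, y, sigma2, tau2):
    """-log p(w | D) up to an additive constant: squared-error loss plus L2 penalty."""
    loss = np.sum((y - X @ w) ** 2) / (2.0 * sigma2)   # sum of -log-likelihood terms
    penalty = np.sum(w ** 2) / (2.0 * tau2)            # -log N(w; 0, tau2 * I)
    return loss + penalty

def map_estimate(X, y, sigma2, tau2):
    """Closed-form minimizer: ridge regression with lambda = sigma2 / tau2."""
    lam = sigma2 / tau2
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
sigma2, tau2 = 0.01, 1.0
w_map = map_estimate(X, y, sigma2, tau2)

# The gradient of the negative log-posterior should vanish at the MAP estimate.
grad = X.T @ (X @ w_map - y) / sigma2 + w_map / tau2
print(np.linalg.norm(grad))
```

For a deep network no closed form exists and the same objective is minimized by stochastic gradient methods, but the correspondence between penalty and prior is unchanged.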

2.1 Dropout for Model and Variable Selection

Dropout is a model selection technique designed to avoid over-fitting in the training process. This is achieved by removing input dimensions in $X$ randomly with a given probability $p$ . It is instructive to see how this affects the underlying loss function and optimization problem. For example, suppose that we wish to minimise MSE, $\mathcal{L}(Y,\hat{Y})=\|Y-\hat{Y}\|^{2}_{2}$ , then, when marginalizing over the randomness, we have a new objective

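$$\arg\min_{W}\;E_{D\sim\text{Ber}(p)}\|Y-W(D\star X)\|_{2}^{2},$$

where $\star$ denotes the element-wise product. With $\Gamma=({\rm diag}(X^{\top}X))^{\frac{1}{2}}$ , this is equivalent to

$$\arg\min_{W}\;\|Y-pWX\|_{2}^{2}+p(1-p)\|\Gamma W\|_{2}^{2}.$$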

Dropout then is simply Bayes ridge regression with a $g$ -prior as an objective function. This reduces the likelihood of over-reliance on small sets of input data in training, see Hinton and Salakhutdinov (2006) and Srivastava et al. (2014). Dropout can also be viewed as the optimization version of the traditional spike-and-slab prior, which has proven so popular in Bayesian model averaging. For example, in a simple model with one hidden layer, we replace the network
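The equivalence between the marginalized dropout objective and a ridge-type penalty can be verified exactly on a tiny problem by enumerating every mask. The sketch below is only a check under stated assumptions: it uses a single-output linear model written row-wise as $(D\star X)w$, treats $p$ as the probability that an entry is kept ($D=1$), and uses made-up data. The function names are hypothetical.

```python
import itertools
import numpy as np

def dropout_objective_exact(y, X, w, p):
    """Exact E_D ||y - (D * X) w||^2 over all Bernoulli(p) masks D (tiny X only)."""
    n, d = X.shape
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n * d):
        D = np.array(bits, dtype=float).reshape(n, d)
        kept = int(D.sum())
        prob = p**kept * (1 - p) ** (n * d - kept)   # probability of this mask
        total += prob * np.sum((y - (D * X) @ w) ** 2)
    return total

def ridge_form(y, X, w, p):
    """||y - p X w||^2 + p (1 - p) ||Gamma w||^2 with Gamma = diag(X'X)^(1/2)."""
    gamma = np.sqrt(np.diag(X.T @ X))                # diagonal of Gamma as a vector
    return np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum((gamma * w) ** 2)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))   # 3 observations, 2 inputs -> 2^6 = 64 masks
y = rng.standard_normal(3)
w = rng.standard_normal(2)
p = 0.7

exact = dropout_objective_exact(y, X, w, p)
closed = ridge_form(y, X, w, p)
print(exact, closed)
```

The penalty weight $p(1-p)$ is largest at $p=1/2$ and vanishes at $p=1$, where no inputs are dropped and the objective reduces to ordinary least squares.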

[TABLE]

with the dropout architecture

[TABLE]

In effect, this replaces the input $X$ by $D\star X$ , where $D$ is a matrix of independent $\text{Ber}(p)$ distributed random variables.

Dropout also regularizes the choice of the number of hidden units in a layer. This can be achieved if we drop units of the hidden rather than the input layer and then establish which probability $p$ gives the best results. It is worth recalling though, as we have stated before, one of the dimension reduction properties of a network structure is that once a variable from a layer is dropped, all terms above it in the network also disappear.

2.2 Shallow Learners

Almost all shallow data reduction techniques can be viewed as consisting of a low dimensional auxiliary variable $Z$ and a prediction rule specified by a composition of functions

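$$\hat{Y}=f_{1}^{W_{1},b_{1}}\big(f_{2}(W_{2}X+b_{2})\big)=f_{1}^{W_{1},b_{1}}(Z),\quad\text{where } Z:=f_{2}(W_{2}X+b_{2}).$$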

The problem of high dimensional data reduction is to find the $Z$ -variable and to estimate the layer functions $(f_{1},f_{2})$ correctly. In the layers, we want to uncover the low-dimensional $Z$ -structure, in a way that does not disregard information about predicting the output $Y$ .

Principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), linear discriminant analysis (LDA), projection pursuit regression (PPR), and logistic regression are all shallow learners. Mallows (1973) provides an interesting perspective on how Bayesian shrinkage provides good predictors in regression settings. Frank and Friedman (1993) provide excellent discussions of PLS and why Bayesian shrinkage methods provide good predictors. Wold (1956), Diaconis and Shahshahani (1984), Ripley (1994), Cook (2007), and Hastie et al. (2016) provide further discussion of dimension reduction techniques. Other connections exist for Fisher’s Linear Discriminant classification rule, which is simply fitting $H(wX+b)$ , where $H$ is a Heaviside function. Polson et al. (2015a) provide a Bayesian version of support vector machines (SVMs) and a comparison with logistic regression for classification.

PCA reduces $X$ to $f_{2}(X)$ using a singular value decomposition of the form

$$Z=f_{2}(X)=W^{\top}X+b,$$

where the columns of the weight matrix $W$ form an orthogonal basis for directions of greatest variance (which is in effect an eigenvector problem).
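As an illustration of PCA written as a single affine layer, the following NumPy sketch builds $W$ and $b$ from a singular value decomposition of the centered data. The helper name `pca_layer`, the simulated inputs, and the choice of $b$ as the offset that centers the data are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def pca_layer(X, k):
    """Return (W, b) such that Z = W.T @ x + b gives the first k principal scores of x."""
    mean = X.mean(axis=0)
    # Rows of Vt are right singular vectors of the centered data, ordered by singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T          # p x k matrix with orthonormal columns
    b = -W.T @ mean       # offset that centers the input
    return W, b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated inputs
W, b = pca_layer(X, k=2)
Z = X @ W + b             # row-wise version of Z = W^T X + b

print(np.round(W.T @ W, 6))   # orthonormal columns
print(Z.var(axis=0))          # variances in decreasing order
```

The variance of the first score equals the largest eigenvalue of the sample covariance matrix, which is the eigenvector problem mentioned above.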

Similarly PPR reduces $X$ to $f_{2}(X)$ by setting

$$Z=f_{2}(X)=\sum_{i=1}^{N_{1}}f_{i}(W_{i1}X_{1}+\ldots+W_{ip}X_{p}).$$

Example: Interaction terms, $x_{1}x_{2}$ and $(x_{1}x_{2})^{2}$ , and max functions, $\max(x_{1},x_{2})$ can be expressed as nonlinear functions of semi-affine combinations. Specifically,

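$$x_{1}x_{2}=\frac{1}{4}(x_{1}+x_{2})^{2}-\frac{1}{4}(x_{1}-x_{2})^{2}$$

$$\max(x_{1},x_{2})=\frac{1}{2}(x_{1}+x_{2})+\frac{1}{2}|x_{1}-x_{2}|$$

$$(x_{1}x_{2})^{2}=\frac{1}{4}(x_{1}+x_{2})^{4}+\frac{7}{4\cdot 3^{3}}(x_{1}-x_{2})^{4}-\frac{1}{2\cdot 3^{3}}(x_{1}+2x_{2})^{4}-\frac{2^{3}}{3^{3}}\left(x_{1}+\frac{1}{2}x_{2}\right)^{4}$$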
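A quick numerical check of these identities is sketched below. Each right-hand side is a sum of univariate functions (square, absolute value, fourth power) applied to affine combinations of $x_{1}$ and $x_{2}$. The max identity is checked in the form $\frac{1}{2}(x_{1}+x_{2})+\frac{1}{2}|x_{1}-x_{2}|$ , which holds for all real inputs; a variant with $|x_{1}+x_{2}|$ in the first term agrees only when $x_{1}+x_{2}\geq 0$ . Function names are illustrative.

```python
import numpy as np

def product_via_squares(x1, x2):
    """x1 * x2 from squares of two affine combinations."""
    return 0.25 * (x1 + x2) ** 2 - 0.25 * (x1 - x2) ** 2

def max_via_abs(x1, x2):
    """max(x1, x2) from an affine term and one absolute value."""
    return 0.5 * (x1 + x2) + 0.5 * np.abs(x1 - x2)

def squared_product_via_quartics(x1, x2):
    """(x1 * x2)^2 from fourth powers of four affine combinations."""
    return (0.25 * (x1 + x2) ** 4
            + 7.0 / (4 * 3**3) * (x1 - x2) ** 4
            - 1.0 / (2 * 3**3) * (x1 + 2 * x2) ** 4
            - 2**3 / 3**3 * (x1 + 0.5 * x2) ** 4)

rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=1000)
x2 = rng.uniform(-3, 3, size=1000)

print(np.allclose(product_via_squares(x1, x2), x1 * x2))
print(np.allclose(max_via_abs(x1, x2), np.maximum(x1, x2)))
print(np.allclose(squared_product_via_quartics(x1, x2), (x1 * x2) ** 2))
```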

Diaconis and Shahshahani (1981) provide further discussion for Projection Pursuit Regression, where the network uses a layered model of the form $\sum_{i=1}^{N}f(w_{i}^{\top}X)$ . Diaconis et al. (1998) provide an ergodic view of composite iterated functions, a precursor to the use of multiple layers of single operators that can model complex multivariate systems. Sjöberg et al. (1995) provide the approximation theory for composite functions.

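**Example:** Deep ReLU architectures can be viewed as Max-Sum networks via the following simple identity. Define $x^{+}=\max(x,0)$ . Let $f_{x}(b)=(x+b)^{+}$ where $b$ is an offset. Then $(x+y^{+})^{+}=\max(0,x,x+y)$ . This is generalized in Feller (1971) (p.272) who shows by induction that

$$(f_{x_{1}}\circ\ldots\circ f_{x_{k}})(0)=\left(x_{1}+\left(x_{2}+\ldots+\left(x_{k-1}+x_{k}^{+}\right)^{+}\ldots\right)^{+}\right)^{+}=\max_{1\leq j\leq k}(x_{1}+\ldots+x_{j})^{+}.$$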

A composition or convolution of $\max$ -layers is then a one layer max-sum network.
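The max-sum identity can be checked numerically. In this sketch, `relu_composition` evaluates the nested composition $(f_{x_{1}}\circ\ldots\circ f_{x_{k}})(0)$ with $f_{x}(b)=(x+b)^{+}$ , and `max_partial_sum` evaluates the largest partial sum $x_{1}+\ldots+x_{j}$ , floored at zero. Both function names and the test sequences are made up for the illustration.

```python
import numpy as np

def relu_composition(xs):
    """(f_{x_1} o ... o f_{x_k})(0) with f_x(b) = max(x + b, 0); f_{x_k} is applied first."""
    b = 0.0
    for x in reversed(xs):
        b = max(x + b, 0.0)
    return b

def max_partial_sum(xs):
    """max over j of (x_1 + ... + x_j)^+, a one-layer max-sum form."""
    return max(0.0, float(np.max(np.cumsum(xs))))

xs = [1.0, -2.0, 3.0]
print(relu_composition(xs), max_partial_sum(xs))

rng = np.random.default_rng(0)
agree = all(
    np.isclose(relu_composition(list(seq)), max_partial_sum(seq))
    for seq in rng.standard_normal((1000, 6))
)
print(agree)
```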

2.3 Stacked Auto-Encoders

Auto-encoding is an important data reduction technique. An auto-encoder is a deep learning architecture designed to replicate $X$ itself, namely $X=Y$ , via a bottleneck structure. This means we select a model $F^{W,b}(X)$ which aims to concentrate the information required to recreate $X$ . See Heaton et al. (2017) for an application to smart indexing in finance. Suppose that we have $N$ input vectors $X=\{x_{1},\ldots,x_{N}\}\in\mathbb{R}^{M\times N}$ and $N$ output (or target) vectors $\{x_{1},\ldots,x_{N}\}\in\mathbb{R}^{M\times N}$ .

Setting biases to zero, for the purpose of illustration, and using only one hidden layer ( $L=2$ ) with $K<N$ factors, gives for $j=1,\ldots,N$

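$$Y_{j}(x)=F_{W}^{m}(X)_{j}=\sum_{k=1}^{K}W_{2}^{jk}Z_{k}\quad\text{for}\quad Z_{k}=f\left(\sum_{i=1}^{N}W_{1}^{ki}x_{i}\right).$$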

In an auto-encoder we fit the model $X=F_{W}(X)$ , and train the weights $W=(W_{1},W_{2})$ with regularization penalty of the form

[TABLE]

Writing our DL objective as an augmented Lagrangian (as in ADMM) with a hidden factor $Z$ , leads to a two step algorithm, an encoding step (a penalty for $Z$ ), and a decoding step for reconstructing the output signal via

$$\arg\min_{W,Z}\;\|X-W_{2}Z\|^{2}+\lambda\phi(Z)+\|Z-f(W_{1},X)\|^{2},$$

where the regularization on $W_{1}$ induces a penalty on $Z$ . The last term is the encoder, the first two the decoder.

If $W_{2}$ is estimated from the structure of the training data matrix, then we have a traditional factor model, and the $W_{1}$ matrix provides the factor loadings. PCA, PLS, SIR fall into this category, see Cook (2007) for further discussion. If $W_{2}$ is trained based on the pair $\hat{X}=\{Y,X\}$ then we have a sliced inverse regression model. If $W_{1}$ and $W_{2}$ are simultaneously estimated based on the training data $X$ , then we have a two layer deep learning model.

Auto-encoding demonstrates that deep learning does not model the variance-covariance matrix explicitly, as the architecture is already in predictive form. Given a hierarchical non-linear combination of deep learners, an implicit variance-covariance matrix exists, but that is not the focus of the algorithm.

Another interesting area for future research is long short-term memory models (LSTMs). For example, a dynamic one layer auto-encoder for a financial time series $(Y_{t})$ is a coupled system

[TABLE]

The state equation encodes and the matrix $W$ decodes the $Y_{t}$ vector into its history $Y_{t-1}$ and the current state $X_{t}$ .

2.4 Bayesian Inference for Deep Learning

Bayesian neural networks have a long history. Early results on stochastic recurrent neural networks (a.k.a. Boltzmann machines) were published in Ackley et al. (1985). Accounting for uncertainty by integrating over parameters is discussed in Denker et al. (1987). MacKay (1992) proposed a general Bayesian framework for tuning network architecture and training parameters for feed forward architectures. Neal (1993) proposed using Hamiltonian Monte Carlo (HMC) to sample from the posterior distribution over the set of model parameters and then averaging outputs of multiple models. Markov Chain Monte Carlo algorithms were proposed by Müller and Insua (1998) to jointly identify parameters of a feed forward neural network as well as the architecture. A connection of neural networks with Bayesian nonparametric techniques was demonstrated in Lee (2004).

A Bayesian extension of feed forward network architectures has been considered by several authors (Neal, 1990; Saul et al., 1996; Frey and Hinton, 1999; Lawrence, 2005; Adams et al., 2010; Mnih and Gregor, 2014; Kingma and Welling, 2013; Rezende et al., 2014). Recent results show how dropout regularization can be used to represent uncertainty in deep learning models. In particular, Gal (2015) shows that the dropout technique provides uncertainty estimates for the predicted values. The predictions generated by deep learning models with dropout are nothing but samples from the predictive posterior distribution.

Graphical models with deep learning encode a joint distribution via a product of conditional distributions and allow for computing (inference) many different probability distributions associated with the same set of variables. Inference requires the calculation of a posterior distribution over the variables of interest, given the relations between the variables encoded in a graph and the prior distributions. This approach is powerful when learning from samples with missing values or predicting with some missing inputs.

A classical example of using neural networks to model a vector of binary variables is the Boltzmann machine (BM), with two layers. The first layer encodes latent variables and the second layer encodes the observed variables. Both conditional distributions $p(\text{data}\mid\text{latent variables})$ and $p(\text{latent variables}\mid\text{data})$ are specified using a logistic function parametrized by weights and offset vectors. The size of the joint distribution table grows exponentially with the number of variables, and Hinton and Sejnowski (1983) proposed using a Gibbs sampler to calculate the update to the model weights on each iteration. The multimodal nature of the posterior distribution leads to prohibitive computational times required to learn models of a practical size. Tieleman (2008) proposed a variational approach that replaces the posterior $p(\text{latent variables}\mid\text{data})$ with another, easy to calculate distribution; this approach was also considered in Salakhutdinov (2008). Several extensions to the BMs have been proposed. Exponential family extensions have been considered by Smolensky (1986), Salakhutdinov (2008), Salakhutdinov and Hinton (2009), and Welling et al. (2005).

There have also been multiple approaches to building inference algorithms for deep learning models (MacKay, 1992; Hinton and Van Camp, 1993; Neal, 1992; Barber and Bishop, 1998). Performing Bayesian inference on a neural network calculates the posterior distribution over the weights given the observations. In general, such a posterior cannot be calculated analytically, or even efficiently sampled from. However, several recently proposed approaches address the computational problem for some specific deep learning models (Graves, 2011; Kingma and Welling, 2013; Rezende et al., 2014; Blundell et al., 2015; Hernández-Lobato and Adams, 2015; Gal and Ghahramani, 2016).

The recent successful approaches to developing efficient Bayesian inference algorithms for deep learning networks are based on reparameterization techniques for calculating Monte Carlo gradients while performing variational inference. Given the data $D=(X,Y)$ , variational inference relies on approximating the posterior $p(\theta\mid D)$ with a variational distribution $q(\theta\mid D,\phi)$ , where $\theta=(W,b)$ . Then $q$ is found by minimizing the Kullback-Leibler divergence between the approximate distribution and the posterior, namely

$$\text{KL}(q\mid\mid p)=\int q(\theta\mid D,\phi)\log\frac{q(\theta\mid D,\phi)}{p(\theta\mid D)}d\theta.$$

Since $p(\theta\mid D)$ is not necessarily tractable, we replace minimization of $\text{KL}(q\mid\mid p)$ with maximization of evidence lower bound (ELBO)

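$$\text{ELBO}(\phi)=\int q(\theta\mid D,\phi)\log\frac{p(Y\mid X,\theta)p(\theta)}{q(\theta\mid D,\phi)}d\theta.$$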

The $\log$ of the total probability (evidence) is then

$$\log p(D)=\text{ELBO}(\phi)+\text{KL}(q\mid\mid p).$$

The sum does not depend on $\phi$ , thus minimizing $\text{KL}(q\mid\mid p)$ is the same as maximizing $\text{ELBO}(\phi)$ . Also, since $\text{KL}(q\mid\mid p)\geq 0$ , which follows from Jensen’s inequality, we have $\log p(D)\geq\text{ELBO}(\phi)$ ; hence the name evidence lower bound. The resulting maximization problem $\text{ELBO}(\phi)\rightarrow\max_{\phi}$ is solved using stochastic gradient descent.
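The fact that the ELBO and the KL divergence sum to the log evidence, whatever the variational parameters, can be confirmed in a model where everything is available in closed form. The sketch below assumes a one-parameter conjugate Gaussian model ( $\theta\sim N(0,1)$ , $y\mid\theta\sim N(\theta,1)$ , a single observation) and a Gaussian $q(\theta)=N(m,s^{2})$ . This toy model and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def log_evidence(y):
    """log p(y) for theta ~ N(0, 1), y | theta ~ N(theta, 1), so that y ~ N(0, 2)."""
    return -0.5 * np.log(2 * np.pi * 2.0) - y**2 / 4.0

def elbo(y, m, s):
    """ELBO for q = N(m, s^2): E_q[log p(y|theta)] + E_q[log p(theta)] + entropy of q."""
    exp_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s**2)
    exp_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return exp_loglik + exp_logprior + entropy

def kl_to_posterior(y, m, s):
    """KL(q || p(theta | y)); the exact posterior is N(y / 2, 1 / 2)."""
    mu_p, var_p = y / 2.0, 0.5
    return 0.5 * np.log(var_p / s**2) + (s**2 + (m - mu_p) ** 2) / (2 * var_p) - 0.5

y = 1.0
for m, s in [(0.3, 0.8), (-1.0, 2.0), (0.5, np.sqrt(0.5))]:
    print(m, s, elbo(y, m, s) + kl_to_posterior(y, m, s), log_evidence(y))
```

The last setting uses the exact posterior as $q$ , so the KL term vanishes and the ELBO attains the log evidence.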

To calculate the gradient, it is convenient to write the ELBO as

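$$\text{ELBO}(\phi)=\int q(\theta\mid D,\phi)\log p(Y\mid X,\theta)d\theta-\int q(\theta\mid D,\phi)\log\frac{q(\theta\mid D,\phi)}{p(\theta)}d\theta.$$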

The gradient of the first term $\nabla_{\phi}\int q(\theta\mid D,\phi)\log p(Y\mid X,\theta)d\theta=\nabla_{\phi}E_{q}\log p(Y\mid X,\theta)$ is not an expectation and thus cannot be calculated using Monte Carlo methods. The idea is to represent the gradient $\nabla_{\phi}E_{q}\log p(Y\mid X,\theta)$ as an expectation of some random variable, so that Monte Carlo techniques can be used to calculate it. There are two standard methods to do it. First, the log-derivative trick uses the identity $\nabla_{x}f(x)=f(x)\nabla_{x}\log f(x)$ to write $\nabla_{\phi}E_{q}\log p(Y\mid\theta)$ as the expectation $E_{q}\left[\log p(Y\mid\theta)\nabla_{\phi}\log q(\theta\mid\phi)\right]$ . Thus, if we select $q(\theta\mid\phi)$ so that it is easy to compute its derivative and generate samples from it, the gradient can be efficiently calculated using Monte Carlo techniques. Second, we can use the reparametrization trick by representing $\theta$ as a value of a deterministic function, $\theta=g(\epsilon,x,\phi)$ , where $\epsilon\sim r(\epsilon)$ does not depend on $\phi$ . The derivative is given by

$$\nabla_{\phi}E_{q}\log p(Y\mid X,\theta)=\int r(\epsilon)\nabla_{\phi}\log p(Y\mid X,g(\epsilon,x,\phi))d\epsilon=E_{\epsilon}\left[\nabla_{g}\log p(Y\mid X,g(\epsilon,x,\phi))\nabla_{\phi}g(\epsilon,x,\phi)\right].$$

The reparametrization is trivial when $q(\theta\mid D,\phi)=N(\theta\mid\mu(D,\phi),\Sigma(D,\phi))$ , and $\theta=\mu(D,\phi)+\epsilon\Sigma(D,\phi),~{}\epsilon\sim N(0,I)$ . Kingma and Welling (2013) propose using $\Sigma(D,\phi)=I$ and representing $\mu(D,\phi)$ and $\epsilon$ as outputs of a neural network (multi-layer perceptron); the resulting approach was called the variational auto-encoder. A generalized reparametrization has been proposed by Ruiz et al. (2016) and combines both log-derivative and reparametrization techniques by assuming that $\epsilon$ can depend on $\phi$ .
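The two gradient estimators can be compared on a toy expectation. The sketch below replaces $\log p(Y\mid X,\theta)$ with the stand-in integrand $f(\theta)=\theta^{2}$ and takes $q=N(\mu,\sigma^{2})$ , so that the exact gradient with respect to $\mu$ is $2\mu$ . The integrand, parameter values, and function names are assumptions made only for this illustration.

```python
import numpy as np

def grad_reparam(mu, sigma, eps):
    """Reparametrization estimate of d/dmu E[theta^2], with theta = mu + sigma * eps."""
    theta = mu + sigma * eps
    return np.mean(2.0 * theta)                       # f'(theta) * dtheta/dmu, dtheta/dmu = 1

def grad_score(mu, sigma, eps):
    """Log-derivative estimate: mean of f(theta) * d/dmu log q(theta)."""
    theta = mu + sigma * eps
    return np.mean(theta**2 * (theta - mu) / sigma**2)

rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)                    # noise that does not depend on mu
mu, sigma = 1.5, 0.5

print("exact:", 2 * mu)
print("reparametrization:", grad_reparam(mu, sigma, eps))
print("log-derivative:", grad_score(mu, sigma, eps))
```

Both estimators target the same gradient. In this example the reparametrization estimate has the smaller Monte Carlo variance, which is one common reason for preferring it when $g$ is differentiable.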

3 Finding Good Bayes Predictors

The Bayesian paradigm provides novel insights into how to construct estimators with good predictive performance. The goal is simply to find a good predictive MSE, namely $E_{Y,\hat{Y}}(\|\hat{Y}-Y\|^{2})$ , where $\hat{Y}$ denotes a prediction value. Stein shrinkage (a.k.a. regularization with an $L^{2}$ norm) is known to provide good mean squared error properties in estimation, namely $E(||\hat{\theta}-\theta||^{2})$ . These gains translate into predictive performance (in an iid setting) for $E(||\hat{Y}-Y||^{2})$ .

The main issue is how to tune the amount of regularisation (a.k.a prior hyper-parameters). Stein’s unbiased estimator of risk provides a simple empirical rule to address this problem as does cross-validation. From a Bayes perspective, the marginal likelihood (and full marginal posterior) provides a natural method for hyper-parameter tuning. The issue is computational tractability and scalability. In the context of DL, the posterior for $(W,b)$ is extremely high dimensional and multimodal and posterior MAP provides good predictors $\hat{Y}(X)$ .

Bayes conditional averaging performs well in high dimensional regression and classification problems. High dimensionality, however, brings with it the curse of dimensionality and it is instructive to understand why certain kernels can perform badly. Adaptive kernel predictors (a.k.a. smart conditional averagers) are of the form

[TABLE]

Here $\hat{Y}_{r}(X)$ is a deep predictor with its own trained parameters. For tree models, the kernel $K_{r}(X_{i},X)$ is a cylindrical region $R_{r}$ (open box set). Figure 2 illustrates the implied kernels for trees (cylindrical sets) and random forests. Not too many points will be neighbors in a high dimensional input space.

Constructing the regions over which to perform conditional averaging is fundamental to reducing the curse of dimensionality. Imagine a large dataset, e.g. 100k images, and think about how a new image’s input coordinates, $X$, are “neighbors” to data points in the training set. Our predictor is a smart conditional average of the observed outputs, $Y$, from our neighbors. When $p$ is large, spheres ($L^{2}$ balls or Gaussian kernels) are terrible: degenerate cases occur in which either none or all of the points are “neighbors” of the new input. Tree-based models address this issue by limiting the number of “neighbors”.

Figure 3 further illustrates the challenge with the 2D image of 1000 uniform samples from a 50-dimensional ball $B_{50}$ . The image is calculated as $w^{T}Y$ , where $w=(1,1,0,\ldots,0)$ and $Y\sim U(B_{50})$ . Samples are centered around the equators and none of the samples fall anywhere close to the boundary of the set.

As the dimensionality of the space grows, the variance of the marginal distribution goes to zero. Figure 4 shows histograms of the 1D image of uniform samples from balls of different dimensionality, that is $e_{1}^{T}Y$, where $e_{1}=(1,0,\ldots,0)$.
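The shrinking marginal can be checked with a short simulation. The following is a minimal sketch and not the authors' code: it assumes the unit ball, draws uniform samples by scaling Gaussian directions (a standard construction), and compares the sample variance of $e_{1}^{T}Y$ with the closed form $1/(p+2)$ for the uniform distribution on the unit ball in $\mathbb{R}^{p}$. The function name and sample sizes are illustrative choices.

```python
import numpy as np

def sample_uniform_ball(n, p, rng):
    """Draw n points uniformly from the unit ball in R^p."""
    g = rng.standard_normal((n, p))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    radii = rng.uniform(size=(n, 1)) ** (1.0 / p)              # radius density ~ r^(p-1)
    return directions * radii

rng = np.random.default_rng(0)
marginal_var = {}
for p in (2, 10, 50, 200):
    Y = sample_uniform_ball(20000, p, rng)
    marginal_var[p] = Y[:, 0].var()        # sample variance of e_1^T Y
    print(p, marginal_var[p], 1.0 / (p + 2))
```

The printed pairs should agree closely, and the variance falls roughly like $1/p$, which is why the projected samples concentrate near the centre.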

Similar central limit results were known to Maxwell, who showed that the random variable $w^{T}Y$ is close to standard normal when $Y\sim U(B_{p})$, $p$ is large, and $w$ is a unit vector (lies on the boundary of the ball); see Diaconis and Freedman (1987). More general results in this direction were obtained by Klartag (2007) and by Milman and Schechtman (2009), who present many analytical and geometrical results for finite dimensional normed spaces as the dimension grows to infinity.

Deep learning can improve on traditional methods by performing a sequence of GLM-like transformations. Effectively DL learns a distributed partition of the input space. For example, suppose that we have $K$ partitions and a DL predictor that takes the form of a weighted average or soft-max of the weighted average for classification. Given a new high dimensional input $X_{\mathrm{new}}$ , many deep learners are then an average of learners obtained by our hyper-plane decomposition. Our predictor takes the form

[TABLE]

where $w_{k}$ are the weights learned in region $K$ , and $k$ is an indicator of the region with appropriate weighting given the training data.

The partitioning of the input space by a deep learner is similar to the one performed by decision trees and partition-based models such as CART, MARS, RandomForests, BART, and Gaussian Processes. Each neuron in a deep learning model corresponds to a manifold that divides the input space. In the case of the ReLU activation function $f(x)=\max(x,0)$, the manifold is simply a hyperplane, and the neuron is activated when the new observation is on the “right” side of this hyperplane. The activation amount is equal to how far the given point is from the boundary. For example, in two dimensions, three neurons with ReLU activation functions will divide the space into seven regions, as shown in Figure 5.
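The count of seven regions can be verified numerically. The sketch below uses three hypothetical neurons (the weights `W` and offsets `b` are illustrative choices, not values from the paper) whose boundary lines are in general position, evaluates them on a grid, and counts the distinct on/off activation patterns.

```python
import numpy as np

# Three hypothetical neurons z = W x + b in two dimensions. Their zero sets are the
# lines x1 = 0, x2 = 0 and x1 + x2 = 1, which are in general position.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])

# Evaluate the pre-activations on a grid and record which neurons are active.
grid = np.linspace(-3.0, 3.0, 121)
X = np.array([(x1, x2) for x1 in grid for x2 in grid])
pre_activation = X @ W.T + b
patterns = pre_activation > 0                      # a ReLU neuron is active when z > 0
n_regions = len({tuple(row) for row in patterns})  # distinct activation patterns
print(n_regions)
```

Only seven of the eight possible patterns occur: no point can have $x_{1}\leq 0$ and $x_{2}\leq 0$ while also satisfying $x_{1}+x_{2}>1$.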

The key difference between tree-based and neural network based models is the way hyper-planes are combined. Figure 6 compares the space decomposition by hyperplanes performed by a tree-based architecture and by a neural network architecture. We compare a neural network with two layers (bottom row) with a tree model trained with the CART algorithm (top row). The network architecture is given by

[TABLE]

The weight matrices are $W^{1},W^{2}\in\mathbb{R}^{2\times 2}$ for the simple data, $W^{1}\in\mathbb{R}^{2\times 2}$ and $W^{2}\in\mathbb{R}^{3\times 2}$ for the circle data, and $W^{1}\in\mathbb{R}^{2\times 2}$ and $W^{2}\in\mathbb{R}^{4\times 2}$ for the spiral data. In our notation, we assume that the activation function is applied pointwise at each layer. An advantage of deep architectures is that the number of hyper-planes grows exponentially with the number of layers. The key property of an activation function (link) is that $f(0)=0$ and that it has zero value in certain regions. For example, the hinge or rectified linear function $\max(x,0)$ and box-car (differences of Heaviside) functions are very common. In contrast to logistic regression, which uses the softmax (logistic) link $1/(1+e^{-x})$, deep learning typically uses $\tanh(x)$ for training, as $\tanh(0)=0$.

Amit and Geman (1997) provide an interesting discussion of efficiency. Formally, a Bayesian probabilistic approach (if computationally feasible) optimally weights predictors via model averaging with $\hat{Y}_{k}(x)=E(Y\mid X_{k})$

[TABLE]

Such rules can achieve optimal out-of-sample performance. Amit et al. (2000) discuss the striking success of multiple randomized classifiers. Using a simple set of binary local features, one classification tree can achieve 5% error on the NIST database with 100,000 training data points. On the other hand, 100 trees, trained in under one hour, yield an error rate under 0.7% when aggregated. This stems from the fact that a sample from a very rich and diverse set of classifiers produces, on average, weakly dependent classifiers conditional on class.

To further exploit this, consider the Bayesian model of weak dependence, namely exchangeability. Suppose that we have $K$ exchangeable, $\mathbb{E}(\hat{Y}_{i})=\mathbb{E}(\hat{Y}_{\pi(i)})$ , and stacked predictors

[TABLE]

Suppose that we wish to find weights, $w$, to attain $\arg\min_{w}\;E\,l(Y,w^{T}\hat{Y})$, where $l$ is convex in the second argument;

[TABLE]

where $\iota=(1,\ldots,1)$ . Hence, the randomised multiple predictor with weights $w=(1/K)\iota$ provides the optimal Bayes predictive performance.
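A small numerical check illustrates the equal-weights result in one special case. This sketch rests on assumptions that go beyond the text: squared-error loss, unbiased predictors, weights constrained to sum to one, and an exchangeable error covariance with common variance `sigma2` and common correlation `rho` (both hypothetical values). Under these assumptions the risk of $w^{T}\hat{Y}$ is $w^{T}\Sigma w$.

```python
import numpy as np

K, sigma2, rho = 10, 1.0, 0.3
# Exchangeable error covariance: equal variances and equal pairwise correlations.
Sigma = sigma2 * ((1.0 - rho) * np.eye(K) + rho * np.ones((K, K)))

def risk(w):
    """Mean squared error of w^T Y_hat for unbiased predictors and weights summing to one."""
    return float(w @ Sigma @ w)

rng = np.random.default_rng(1)
equal = np.full(K, 1.0 / K)
random_weights = rng.dirichlet(np.ones(K), size=1000)   # other weightings on the simplex
risk_equal = risk(equal)
risk_random_min = min(risk(w) for w in random_weights)
print(risk_equal, risk_random_min)
```

In this setting the equal-weight risk equals $\sigma^{2}\left((1-\rho)/K+\rho\right)$, and none of the randomly drawn weightings improves on it.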

We now turn to algorithmic issues.

4 Algorithmic Issues

In this section we discuss two types of algorithms for training learning models. First, stochastic gradient descent, which is a very general algorithm that efficiently works for large scale datasets and has been used for many deep learning applications. Second, we discuss specialized statistical learning algorithms, which are tailored for certain types of traditional statistical models.

4.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) is the default gold standard for minimizing a function $f(W,b)$ (maximizing the likelihood) to find the deep learning weights and offsets. SGD simply minimizes the function by taking a negative step along an estimate $g^{k}$ of the gradient $\nabla f(W^{k},b^{k})$ at iteration $k$. The gradients are available via the chain rule applied to the superposition of semi-affine functions.

The approximate gradient is estimated by calculating

[TABLE]

where $E_{k}\subset\{1,\ldots,T\}$ and $|E_{k}|$ is the number of elements in $E_{k}$ .

When $|E_{k}|>1$ the algorithm is called batch SGD and simply SGD otherwise. Typically, the subset $E$ is chosen by going cyclically and picking consecutive elements of $\{1,\ldots,T\}$ , $E_{k+1}=[E_{k}\mod T]+1$ . The direction $g^{k}$ is calculated using a chain rule (a.k.a. back-propagation) providing an unbiased estimator of $\nabla f(W^{k},b^{k})$ . Specifically, this leads to

[TABLE]

At each iteration, SGD updates the solution

[TABLE]

Deep learning algorithms use a step size $t_{k}$ (a.k.a. learning rate) that is either kept constant or reduced according to a simple schedule, such as $t_{k}=a\exp(-kt)$. The hyper-parameters of the reduction schedule are usually found empirically from numerical experiments and observations of the loss function progression.
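The pieces described above (a gradient estimate on a batch $E_{k}$, cyclic selection of consecutive indices, and an exponentially decaying step size) can be sketched on a least squares problem. This is an illustration under stated assumptions, not the authors' implementation: the objective, the function name `sgd_least_squares`, and the values of `a` and `decay` (the constant written $t$ in the schedule) are hypothetical.

```python
import numpy as np

def sgd_least_squares(X, y, batch_size=32, epochs=30, a=0.1, decay=1e-3):
    """Mini-batch SGD for f(w) = mean((X w - y)^2) / 2 with step size a * exp(-decay * k)."""
    T, p = X.shape
    w = np.zeros(p)
    k = 0
    for _ in range(epochs):
        for start in range(0, T, batch_size):          # cycle through consecutive indices
            idx = np.arange(start, min(start + batch_size, T))
            residual = X[idx] @ w - y[idx]
            g = X[idx].T @ residual / len(idx)          # gradient estimate on the batch E_k
            w = w - a * np.exp(-decay * k) * g          # step against the estimated gradient
            k += 1
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(1000)
w_hat = sgd_least_squares(X, y)
print(np.round(w_hat, 2))
```

Setting `batch_size=1` gives plain SGD in the terminology used here; larger batches reduce the variance of each gradient estimate at a higher cost per step.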

One caveat of SGD is that the descent in $f$ is not guaranteed, or it can be very slow at every iteration. Stochastic Bayesian approaches ought to alleviate these issues. The variance of the gradient estimate $g^{k}$ can also be near zero as the iterates converge to a solution. To tackle those problems, coordinate descent (CD) and momentum-based modifications can be applied. The alternating direction method of multipliers (ADMM) can also provide a natural alternative, and leads to non-linear alternating updates; see Carreira-Perpiñán and Wang (2014).

CD evaluates a single component $E_{k}$ of the gradient $\nabla f$ at the current point and then updates the $E_{k}$th component of the variable vector in the negative gradient direction. The momentum-based versions of SGD, or so-called accelerated algorithms, were originally proposed by Nesterov (1983). For a more recent discussion, see Nesterov (2013). The momentum term adds memory to the search process by combining new gradient information with the previous search directions. Empirically, momentum-based methods have been shown to converge better for deep learning networks (Sutskever et al., 2013). The gradient only influences changes in the velocity of the update, which then updates the variable

[TABLE]

The hyper-parameter $\mu$ controls the damping effect on the rate of update of the variables. The physical analogy is the reduction in kinetic energy that allows the movement to “slow down” at the minima. This parameter can also be chosen empirically using cross-validation.

Nesterov’s momentum method (a.k.a. Nesterov acceleration) calculates the gradient at the point predicted by the momentum. One can view this as a look-ahead strategy with updating scheme

[TABLE]
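The two momentum schemes can be compared on a toy problem. The sketch below assumes the common textbook form of the updates (velocity $v\leftarrow\mu v-t\,g$ followed by $x\leftarrow x+v$, with the gradient evaluated at the look-ahead point $x+\mu v$ for the Nesterov variant); the exact notation of the paper's equations may differ. The quadratic objective, step size and function name are illustrative choices.

```python
import numpy as np

def momentum_descent(grad, x0, step=0.02, mu=0.9, iters=300, nesterov=False):
    """Gradient descent with a velocity term; mu damps the previous velocity."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(iters):
        lookahead = x + mu * v if nesterov else x   # Nesterov evaluates the gradient ahead
        v = mu * v - step * grad(lookahead)         # the gradient changes the velocity
        x = x + v                                   # the velocity updates the variable
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x with its minimum at the origin.
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x
x_plain = momentum_descent(grad, [1.0, 1.0], mu=0.0)
x_momentum = momentum_descent(grad, [1.0, 1.0], mu=0.9)
x_nesterov = momentum_descent(grad, [1.0, 1.0], mu=0.9, nesterov=True)
print(np.linalg.norm(x_plain), np.linalg.norm(x_momentum), np.linalg.norm(x_nesterov))
```

With `mu=0.0` the routine reduces to plain gradient descent, which is held back by the flat direction of the quadratic; both momentum variants end much closer to the minimum after the same number of iterations.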

Another popular modification is the AdaGrad method (Zeiler, 2012), which adaptively scales the learning rate of each parameter at each iteration

[TABLE]

where $a$ is usually a small number, e.g. $a=10^{-6}$, that prevents dividing by zero. RMSprop takes the AdaGrad idea further and places more weight on recent values of the squared gradient to scale the update direction, i.e. we have

[TABLE]

The Adam method (Kingma and Ba, 2014) combines the RMSprop and momentum methods, and leads to the following update equations

[TABLE]
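The three adaptive schemes differ only in how they accumulate squared gradients. The sketch below uses the commonly published forms of AdaGrad, RMSprop and Adam (including Adam's bias corrections); it is an assumption that these match the paper's equations term by term. The function name, default hyper-parameters and test problem are illustrative.

```python
import numpy as np

def adaptive_descent(grad, x0, method, step=0.05, iters=500,
                     a=1e-6, decay=0.9, beta1=0.9, beta2=0.999):
    """Minimise a function with AdaGrad, RMSprop or Adam per-coordinate scaling."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # running mean of gradients (used by Adam only)
    s = np.zeros_like(x)   # running sum or mean of squared gradients
    for k in range(1, iters + 1):
        g = grad(x)
        if method == "adagrad":
            s = s + g ** 2                               # accumulate all past squares
            update = g / (np.sqrt(s) + a)                # a prevents division by zero
        elif method == "rmsprop":
            s = decay * s + (1.0 - decay) * g ** 2       # weight recent squares more
            update = g / (np.sqrt(s) + a)
        elif method == "adam":
            m = beta1 * m + (1.0 - beta1) * g            # momentum-style average
            s = beta2 * s + (1.0 - beta2) * g ** 2
            m_hat = m / (1.0 - beta1 ** k)               # bias corrections
            s_hat = s / (1.0 - beta2 ** k)
            update = m_hat / (np.sqrt(s_hat) + a)
        else:
            raise ValueError(f"unknown method: {method}")
        x = x - step * update
    return x

A = np.diag([1.0, 50.0])
grad = lambda x: A @ x
finals = {name: adaptive_descent(grad, [1.0, 1.0], name)
          for name in ("adagrad", "rmsprop", "adam")}
for name, x in finals.items():
    print(name, np.linalg.norm(x))
```

Because each coordinate is rescaled by its own gradient history, the badly scaled second coordinate does not force a smaller step on the first one.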

Second order methods solve the optimization problem by solving the system of nonlinear equations $\nabla f(W,b)=0$ with Newton’s method

[TABLE]

Here SGD simply approximates $\nabla^{2}f(W,b)$ by $1/t$. The advantages of a second order method include much faster convergence rates and insensitivity to the conditioning of the problem. In practice, second order methods are rarely used for deep learning applications (Dean et al., 2012b). Their major disadvantage is the inability to train models using batches of data as SGD does. Since a typical deep learning model relies on large scale data sets, second order methods become memory and computationally prohibitive at even modest-sized training data sets.
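The insensitivity of Newton's method to conditioning is easy to see on a quadratic. The sketch below is an illustration only: the matrix `A`, vector `c`, and step size `t` are hypothetical choices, and the first-order method is deterministic gradient descent rather than SGD, so that the comparison isolates the effect of replacing the Hessian by $1/t$.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x - c^T x, whose gradient is A x - c.
A = np.diag([1.0, 1000.0])
c = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, c)

grad = lambda x: A @ x - c
hessian = lambda x: A

# One Newton step: solve the linearised system H(x) d = -grad(x).
x_newton = np.zeros(2)
x_newton = x_newton - np.linalg.solve(hessian(x_newton), grad(x_newton))

# First-order steps: the Hessian is replaced by the scalar 1/t.
t = 1.0 / 1000.0          # the largest curvature limits the fixed step size
x_gd = np.zeros(2)
for _ in range(100):
    x_gd = x_gd - t * grad(x_gd)

err_newton = np.linalg.norm(x_newton - x_star)
err_gd = np.linalg.norm(x_gd - x_star)
print(err_newton, err_gd)
```

Newton's method lands on the minimiser in one step, while the fixed-step method has barely moved along the flat direction after 100 iterations. The price is a linear solve with the full Hessian, which is what becomes prohibitive at deep learning scale.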

4.2 Learning Shallow Predictors

Traditional factor models use a linear combination of $K$ latent factors, $\{z_{1},z_{2},\ldots,z_{K}\}$,

[TABLE]

Here factors $z_{k}$ and weights $B_{ik}$ can be found by solving the following problem

[TABLE]

Then, we minimize the reconstruction error (a.k.a. accuracy), plus the regularization penalty, to control the variance-bias trade-off for out-of-sample prediction. Algorithms exist to solve this problem very efficiently. Such a model can be represented as a neural network model with $L=2$ layers and the identity activation function.

The basic sliced inverse regression (SIR) model takes the form $Y=G(WX,\epsilon)$, where $G(\cdot)$ is a nonlinear function and $W\in R^{k\times p}$, with $k<p$. In other words, $Y$ is a function of $k$ linear combinations of $X$. To find $W$, we first slice the feature matrix, then we analyze the data’s covariance matrices and the slice means of $X$, weighted by the size of each slice. The function $G$ is found empirically by visually exploring relations. The key advantage of the deep learning approach is that the functional relation $G$ is found automatically. To extend the original SIR fitting algorithm, Jiang and Liu (2013) proposed a variable selection procedure under the SIR modeling framework. Partial least squares regression (PLS) (Wold et al., 2001) finds $T$, a lower dimensional representation of $X=TP^{T}$, and then regresses $Y$ onto it via $Y=TBC^{T}$.
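The SIR recipe (slice on the response, average the standardised features within each slice, and take the leading eigenvectors of the weighted covariance of those slice means) can be sketched as follows. This follows the textbook form of the algorithm and is not the authors' code; the function name, the number of slices, and the simulated cubic link are illustrative assumptions.

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, k=1):
    """Estimate k SIR directions from the slice means of standardised X."""
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten X so that its sample covariance is the identity.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt
    # Slice the observations by the order of the response.
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)       # slice means weighted by slice size
    # Leading eigenvectors of M, mapped back to the original scale of X.
    _, vecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
    directions = inv_sqrt @ vecs[:, ::-1][:, :k]
    return (directions / np.linalg.norm(directions, axis=0)).T   # rows of W, shape (k, p)

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.0]) / np.sqrt(6.0)
y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(2000)   # unknown nonlinear link G
W_hat = sliced_inverse_regression(X, y)
cosine = abs(W_hat[0] @ beta)
print(cosine)
```

The direction is recovered only up to sign, so the check uses the absolute cosine with the true `beta`. The link $G$ itself is never estimated here, which is the step the text says is done by visual exploration.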

A deep learning least squares network arrives at a criterion function given by a negative log-posterior, which needs to be minimized. The penalized log-posterior, with $\phi$ denoting a generic regularization penalty is given by

[TABLE]

Carreira-Perpiñán and Wang (2014) propose a method of auxiliary coordinates which replaces the original unconstrained optimization problem, associated with model training, with an alternative function in a constrained space that can be optimized using an alternating directions method and is thus highly parallelizable. Extensions of these methods include ADMM and Divide and Concur (DC) algorithms; for further discussion see Polson et al. (2015a). The gains from applying these to deep layered models, in an iterative fashion, appear to be large but have yet to be quantified empirically.

5 Application: Predicting Airbnb Bookings

To illustrate our methodology, we use the dataset provided by the Airbnb Kaggle competition. This dataset, whilst not designed to optimize the performance of DL, provides a useful benchmark to compare and contrast traditional statistical models. The goal is to build a model that can predict in which country a new user will make his or her first booking. Though Airbnb offers bookings in more than 190 countries, there are 10 countries where users make frequent bookings. We treat the problem as classification into one of 12 classes (10 major countries + other + NDF), where other corresponds to any country which is not in the list of the top 10 and NDF corresponds to situations where no booking was made.

The data consists of two tables: one contains the attributes of each of the users and the other contains data about the sessions of each user at the Airbnb website. The user data contains demographic characteristics, the type of device and browser used to sign up, and the destination country of the first booking, which is our dependent variable $Y$. The data involves 213,451 users and 10,567,737 individual sessions. The sessions data contains information about actions taken during each session, duration and devices used. Both datasets have a large number of missing values. For example, age information is missing for 42% of the users. Figure 7(a) shows that nearly half of the gender data is missing and that there is a slight imbalance between the genders.

Figure 7(b) shows the country of origin for the first booking by gender. Most of the entries in the destination columns are NDF, meaning no booking was made by the user. Further, Figure 7(c) shows the relationship between gender and age: the gender value is missing for most of the users who did not identify their age.

We find that there is little difference in booking behavior between the genders. However, as we will see later, the fact that gender was specified, is an important predictor. Intuitively, users who filled the gender field are more likely to book.

On the other hand, as Figure 8 shows, the age variable does play a role.

Figure 8(a) shows that most of the users are between 25 and 40 years old. Furthermore, the two age groups, the younger-than-45 cohort and the older-than-45 cohort, have very different booking behavior (see Figure 8(b)). Further, as we can see from Figure 8(c), half of the users who did not book did not identify their age either.

Another effect of interest is the non-linearity between the time the account was created and booking behavior. Figure 9 shows that “old timers” are more likely to book when compared to recent users. Since the number of records in the sessions data is different for each user, we developed features from those records so that the sessions data can be used for prediction. The general idea is to convert multiple session records to a single set of features per user. The list of the features we calculate is

(i) Number of sessions records

(ii) For each action type, we calculate the count and standard deviation

(iii) For each device type, we calculate the count and standard deviation

(iv) For session duration we calculate mean, standard deviation and median

Furthermore, we use one-hot encoding for categorical variables from the user table, e.g. gender, language, affiliate provider, etc. One-hot encoding replaces a categorical variable with $K$ categories by $K$ binary dummy variables.
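The two kinds of feature construction described above can be sketched on toy values. This is a simplified illustration, not the authors' pipeline: the helper names, the category labels, and the duration values are hypothetical, and only the session-duration summary (count, mean, standard deviation, median) is shown.

```python
import numpy as np

def one_hot(values):
    """Replace a categorical variable with K categories by K binary dummy variables."""
    categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for i, v in enumerate(values):
        encoded[i, index[v]] = 1
    return categories, encoded

def session_features(durations):
    """Collapse one user's session durations into a fixed-length set of summaries."""
    d = np.asarray(durations, dtype=float)
    return {"n_sessions": len(d), "mean": float(d.mean()),
            "std": float(d.std()), "median": float(np.median(d))}

# Hypothetical toy values, not taken from the Airbnb data.
categories, encoded = one_hot(["FEMALE", "MALE", "-unknown-", "FEMALE"])
features = session_features([30.0, 120.0, 60.0])
print(categories)
print(encoded)
print(features)
```

Keeping a separate dummy column for the missing-value label mirrors the observation that whether a field was filled in can itself be predictive.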

We build a deep learning model with two hidden dense layers and the ReLU activation function $f(x)=\max(x,0)$. We use ADAGRAD optimization to train the model. We predict probabilities of future destination booking for each of the new users. The evaluation metric for this competition is NDCG (normalized discounted cumulative gain). It uses the top five predicted destinations and is calculated as:

[TABLE]

where $\mathrm{DCG}_{5}^{i}=1/\log_{2}{\left(p(i)+1\right)}$ and $p(i)$ is the position of the true destination in the list of five predicted destinations. For example, if for a particular user $i$ the destination is FR, and FR was at the top of the list of five predicted countries, then

$$\mathrm{DCG}_{5}^{i}=\frac{1}{\log_{2}(1+1)}=1.$$

When FR is second, e.g. model prediction (US, FR, DE, NDF, IT) gives a

$$\mathrm{DCG}_{5}^{i}=\frac{1}{\log_{2}(2+1)}\approx 0.6309.$$
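The per-user score follows directly from the definition $\mathrm{DCG}_{5}^{i}=1/\log_{2}(p(i)+1)$. The sketch below computes it for the two FR examples. Two points are assumptions rather than statements from the text: a true destination outside the top five scores zero, and the overall metric is the plain average of the per-user scores. The function names are hypothetical.

```python
import math

def dcg_at_5(predicted, true_destination):
    """Score one user: 1 / log2(position + 1) for the true destination in the top five."""
    top5 = list(predicted)[:5]
    if true_destination not in top5:
        return 0.0                      # assumption: a miss scores zero
    position = top5.index(true_destination) + 1
    return 1.0 / math.log2(position + 1)

def mean_ndcg(predictions, truths):
    """Average the per-user scores over all users (assumed aggregation)."""
    scores = [dcg_at_5(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)

first = dcg_at_5(["FR", "US", "DE", "NDF", "IT"], "FR")    # FR ranked first
second = dcg_at_5(["US", "FR", "DE", "NDF", "IT"], "FR")   # FR ranked second
overall = mean_ndcg([["FR", "US", "DE", "NDF", "IT"], ["US", "FR", "DE", "NDF", "IT"]],
                    ["FR", "FR"])
print(first, second, overall)
```

The score drops quickly with rank, so placing the true destination first matters far more than merely including it in the list of five.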

We trained our deep learning network for 20 epochs with a mini-batch size of 256. For a hold-out sample we used 10% of the data, namely 21,346 observations. The fitting algorithm evaluates the $DCG$ function at every epoch to monitor the improvement in the quality of predictions from epoch to epoch. It takes approximately 10 minutes to train, whereas the variational inference approach is computationally prohibitive at this scale.

Our model uses a two-hidden layer architecture with ReLU activation functions

[TABLE]

The weight matrices are $W^{1}\in\mathbb{R}^{64\times p}$ and $W^{2}\in\mathbb{R}^{64\times 64}$. In our notation, we assume that the activation function is applied pointwise at each layer.

The resulting model has an out-of-sample $NDCG$ of $-0.8351$. The classes are imbalanced in this problem. Table 1 shows the percentage of each class in the out-of-sample data set.

Figure 10 shows out-of-sample NDCG for each of the destinations.

Figure 11 shows accuracy of prediction for each of the destination countries.

The model accurately predicts bookings for the US, FR and other classes when the top three predictions are considered.

Furthermore, we compared the performance of our deep learning model with the XGBoost algorithm (Chen and Guestrin, 2016) for fitting a gradient boosted tree model. The performance of the two models is comparable, and XGBoost yields an NDCG of $-0.8476$. One of the advantages of the tree-based model is its ability to calculate the importance of each of the features (Hastie et al., 2016). Figure 12 shows the variable importance calculated from our XGBoost model.

The importance scores calculated by the XGBoost model confirm our exploratory data analysis findings. In particular, we see that whether a user specified gender is a strong predictor. The number of sessions on the Airbnb site recorded for a given user before booking is a strong predictor as well. Intuitively, users who visited the site multiple times are more likely to book. Further, web users who signed up via devices with large screens are also more likely to book.

6 Discussion

We view deep learning as a high dimensional nonlinear data reduction scheme, generated probabilistically as a stacked generalized linear model (GLM). This sheds light on how to train a deep architecture using SGD, a first order gradient method for finding a posterior mode in a very high dimensional space. By taking a predictive approach, where regularization learns the architecture, deep learning has been very successful in many fields.

There are many areas of future research for Bayesian deep learning which include

(i) Viewing deep learning probabilistically as stacked GLMs allows for many statistical models, such as exponential family models and heteroscedastic errors.

(ii) Bayesian hierarchical models have similar advantages to deep learners. Hierarchical models include extra stochastic layers and provide extra interpretability and flexibility.

(iii) Viewing deep learning as a Gaussian Process allows for exact Bayesian inference (Neal, 1996; Williams, 1997; Lee et al., 2017). The Gaussian Process connection opens opportunities to develop more flexible and interpretable models for engineering (Gramacy and Polson, 2011) and natural science applications (Banerjee et al., 2008).

(iv) With gradient information easily available via the chain rule (a.k.a. back propagation), a new avenue of stochastic methods to fit networks exists, such as MCMC, HMC, proximal methods, and ADMM, which could dramatically reduce the time needed to train deep learners.

(v) Comparison with traditional Bayesian non-parametric approaches, such as treed Gaussian Models (Gramacy, 2005) and BART (Chipman et al., 2010), or using hyperplanes in Bayesian non-parametric methods, ought to yield good predictors (Francom, 2017).

(vi) Improved Bayesian algorithms for hyper-parameter training and optimization (Snoek et al., 2012). Langevin diffusion MCMC, proximal MCMC and Hamiltonian Monte Carlo (HMC) can exploit the derivatives as well as Hessian information (Polson et al., 2015a, b; Dean et al., 2012a).

(vii) Rather than searching a grid of values with the goal of minimising out-of-sample mean squared error, one could place further regularisation penalties (priors) on these parameters and integrate them out.

MCMC methods also have lots to offer to DL and can be included seamlessly in TensorFlow (Abadi et al., 2015). Given the availability of high performance computing, it is now possible to implement high dimensional posterior inference on large data sets; see Dean et al. (2012a). The same advantages are now available for Bayesian inference. Further, we believe deep learning models have a bright future in many fields of application, such as finance, where DL is a form of nonlinear factor model (Heaton et al., 2016a, b), with each layer capturing different time scale effects, and where spatio-temporal data is viewed as an image in space-time (Dixon et al., 2017; Polson and Sokolov, 2017). In summary, the Bayes perspective adds helpful interpretability; however, the full power of a Bayes approach has still not been explored. From a practical perspective, current regularization approaches have provided great gains in predictive model power for recovering nonlinear complex data relationships.

Bibliography

1. Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, V
2. Ackley et al. (1985) David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
3. Adams et al. (2010) Ryan Adams, Hanna Wallach, and Zoubin Ghahramani. Learning the structure of deep sparse graphical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 1–8, 2010.
4. Amit and Geman (1997) Y. Amit and D. Geman. Shape Quantization and Recognition with Randomized Trees. Neural Computation, 9(7):1545–1588, July 1997.
5. Amit et al. (2000) Yali Amit, Gilles Blanchard, and Kenneth Wilder. Multiple randomized classifiers: Mrcl. 2000.
6. Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
7. Banerjee et al. (2008) Sudipto Banerjee, Alan E Gelfand, Andrew O Finley, and Huiyan Sang. Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848, 2008.
8. Barber and Bishop (1998) David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. Neural Networks and Machine Learning, 168:215–238, 1998.