MaxEntropy Pursuit Variational Inference

Evgenii Egorov; Kirill Neklydov; Ruslan Kostoev; Evgeny Burnaev

arXiv:1905.07855·cs.LG·May 21, 2019

TL;DR

This paper introduces MaxEntropy Pursuit, a variational inference method that uses a greedy approach with tractable base learners and a Max-Entropy framework to better capture complex, multimodal posterior distributions in neural networks.

Contribution

It presents a novel greedy variational inference technique leveraging Max-Entropy to improve approximation of complex posteriors in neural network models.

Findings

01 Effective in capturing multimodal posteriors

02 Demonstrates improved inference in continual learning settings

03 Utilizes tractable base learners for scalable inference

Abstract

One of the core problems in variational inference is the choice of the approximate posterior distribution. It is crucial to trade off efficient inference with simple families, such as mean-field models, against the accuracy of inference. We propose a variant of greedy approximation of the posterior distribution with tractable base learners. Using the Max-Entropy approach, we obtain a well-defined optimization problem. We demonstrate the ability of the method to capture complex multimodal posteriors in a continual learning setting for neural networks.

Equations

$$p(\theta|X)=\frac{p(X,\theta)}{\int p(X,\theta)\,d\theta}.$$

$$D_{KL}(q(\theta)||p(\theta))=-\int q(\theta)\log\frac{p(\theta)}{q(\theta)}\,d\theta.$$

$$\begin{aligned}\log p(X)&=\log\int p(X,\theta)\,d\theta=\log\int\frac{p(X,\theta)\,q_{\lambda}(\theta)}{q_{\lambda}(\theta)}\,d\theta\\&=\log\mathbb{E}_{q_{\lambda}(\theta)}\left[\frac{p(X,\theta)}{q_{\lambda}(\theta)}\right]\geq\mathbb{E}_{q_{\lambda}(\theta)}\left[\log\frac{p(X,\theta)}{q_{\lambda}(\theta)}\right]=\mathcal{F}[q]=:\mathrm{ELBO}.\end{aligned}$$

$$\log p(X)=\mathcal{F}(\lambda)+D_{KL}(q_{\lambda}(\theta)||p(\theta|X)),$$

$$q_{t+1}=(1-\alpha)q_{t}+\alpha h,\quad\alpha\in(0;1),\quad h\in Q.$$

$$\max_{h\in Q}\mathcal{H}[h],\quad\text{s.t.}\quad\mathcal{F}[q_{t+1}]-\mathcal{F}[q_{t}]>0.\tag{1}$$

$$\begin{aligned}\mathcal{F}[q_{t+1}]&=\int[q_{t}+\alpha(h-q_{t})]\left(\log\frac{L(\theta)}{q_{t}}-\log\left(1+\alpha\frac{h-q_{t}}{q_{t}}\right)\right)d\theta\\&=\underbrace{\int q_{t}\log\frac{L(\theta)}{q_{t}}\,d\theta}_{\mathcal{F}[q_{t}]}+\alpha\int(h-q_{t})\left(\log\frac{L(\theta)}{q_{t}}-\log\left[1+\alpha\frac{h-q_{t}}{q_{t}}\right]\right)d\theta-\int q_{t}\log\left(1+\alpha\frac{h-q_{t}}{q_{t}}\right)d\theta.\end{aligned}$$

$$\mathcal{F}[q_{t+1}]-\mathcal{F}[q_{t}]=\alpha\left\langle h-q_{t},\log\frac{L(\theta)}{q_{t}}\right\rangle-\frac{\alpha^{2}}{2}\int\frac{(h-q_{t})^{2}}{q_{t}}\,d\theta+o\left(\left\|\alpha\frac{h-q_{t}}{q_{t}}\right\|_{2}^{2}\right).$$

$$\max_{h\in Q}\mathcal{H}[h]+\lambda\left\langle h,\log\frac{L(\theta)}{q_{t}}\right\rangle.\tag{2}$$

$$\frac{\delta}{\delta h}\left[\mathcal{H}[h]+\lambda\left\langle h,\log\frac{L(\theta)}{q_{t}}\right\rangle+\gamma\left(\int h\,d\theta-1\right)\right]=0,\qquad h^{*}=\left[\frac{L(\theta)}{q_{t}}\right]^{\lambda}\exp(\gamma-1).$$

$$h^{*}=\frac{1}{Z(\lambda)}\left[\frac{L(\theta)}{q_{t}}\right]^{\lambda}.\tag{3}$$

$$\min_{h\in Q}D_{KL}\left(h\,\Big|\Big|\,\frac{1}{Z(\lambda)}\left[\frac{L(\theta)}{q_{t}}\right]^{\lambda}\right).\tag{4}$$

$$T_{\lambda}:p\to\frac{p^{\lambda}(\theta)}{\int p^{\lambda}(\theta)\,d\theta},\quad\lambda>0.\tag{5}$$

$$\begin{aligned}D_{KL}(U||p)&<D_{KL}(U||T_{\lambda}p),\quad\text{for }\lambda>1,\\D_{KL}(U||p)&>D_{KL}(U||T_{\lambda}p),\quad\text{for }\lambda<1.\end{aligned}\tag{6}$$

$$\arg\max_{h\in Q}\mathcal{H}[h]+\left\langle h,\log\frac{L(\theta)}{q_{t}}\right\rangle=\arg\max_{h\in Q}\underbrace{\int h\log\frac{L(\theta)}{h}\,d\theta}_{\text{term }(1)}\underbrace{-\int h\log q_{t}\,d\theta}_{\text{term }(2)}.\tag{7}$$

$$q_{t+1}(\theta)=(1-\alpha)q_{t}(\theta)+\alpha h(\theta).$$

$$\min_{\alpha\in(0;1)}D_{KL}\left((1-\alpha)q_{t}(\theta)+\alpha h(\theta)\,||\,p(\theta|X)\right).\tag{8}$$

$$D_{f}(q||p)\approx f^{\prime\prime}(1)\chi^{2}(q||p).$$

$$\min_{\alpha\in(0;1)}\int\frac{1}{p(\theta|X)}\left[q_{t}+\alpha(h-q_{t})\right]^{2}d\theta.\tag{9}$$

$$\begin{aligned}\nabla_{\alpha}\int\frac{1}{p(\theta|X)}[q_{t}+\alpha(h-q_{t})]^{2}d\theta&=2\int\frac{1}{p(\theta|X)}[q_{t}+\alpha(h-q_{t})](h-q_{t})\,d\theta,\\\nabla_{\alpha}^{2}\int\frac{1}{p(\theta|X)}[q_{t}+\alpha(h-q_{t})]^{2}d\theta&=2\int\frac{(h-q_{t})^{2}}{p(\theta|X)}\,d\theta>0.\end{aligned}$$

$$\alpha^{*}=-\frac{\int\frac{1}{p(\theta|X)}q_{t}(h-q_{t})\,d\theta}{\int\frac{1}{p(\theta|X)}(h-q_{t})^{2}\,d\theta}=-\frac{\int\frac{1}{L(\theta)}q_{t}(h-q_{t})\,d\theta}{\int\frac{1}{L(\theta)}(h-q_{t})^{2}\,d\theta}.\tag{10}$$

Full text

MaxEntropy Pursuit Variational Inference

Evgenii Egorov (1), Kirill Neklydov (2, 3), Ruslan Kostoev (1), Evgeny Burnaev (1)

(1) Skolkovo Institute of Science and Technology, Moscow, Russia
(2) National Research University Higher School of Economics, Moscow, Russia
(3) Samsung AI Center, Moscow, Russia

{e.egorov,r.kostoev,e.burnaev}@skoltech.ru, [email protected]

Keywords: Variational Inference, Deep Learning, Maximum Entropy, Bayesian Inference

1 Introduction

Evaluating the posterior distribution is the primary challenge in Bayesian modeling. Computing the exact posterior is usually intractable, and methods such as MCMC, while flexible, can be unacceptably expensive. Variational inference instead approximates a complicated probability distribution with a simpler one. It is now used in semi-supervised classification, drives some of the most realistic generative models of images, and is a useful tool for the analysis of dynamical systems. The intractable posterior is approximated within a class of known probability distributions, over which we search for the best representative.

We study posterior approximation by a sequentially fitted composition of simple distributions, under the condition that each fitting step can be cast as a tractable optimization problem. The structure of the resulting model makes working with the posterior approximation efficient.

The rest of the paper is organized in the following way. In Section 2, we review the variational inference framework. In Section 3, we derive the stochastic optimization algorithm for sequential approximation of the posterior distribution, named MaxEntropy Pursuit Variational Inference. In Section 4, we apply the proposed approach to incremental learning of neural networks. In Section 5, we discuss the obtained results and future work.

Notations. We denote: the differential entropy of a distribution $h$ by $\mathcal{H}[h]:=-\int h\log h\,d\theta$; the inner product between two Lebesgue integrable functions by $\langle f_{1},f_{2}\rangle:=\int f_{1}f_{2}\,d\theta$; the full likelihood of the probabilistic model over the dataset $X$, i.e. the joint density $p(X,\theta)$, by $L(\theta):=p(X|\theta)p(\theta)$; the posterior distribution by $p(\theta|X)\propto L(\theta)$.

2 Variational Inference

We consider the posterior distribution of latent variables $\theta$ given observations $X$ :

$$p(\theta|X)=\frac{p(X,\theta)}{\int p(X,\theta)\,d\theta}.$$

The integral in the denominator is high dimensional, so the normalization is intractable.

The idea of variational inference is to introduce a variational distribution $q_{\lambda}(\theta)$, parametrized by the variational parameter $\lambda$. Instead of computing the normalization constant, we approximate the posterior $p(\theta|X)$ with the simpler distribution $q_{\lambda}$ and tune $\lambda$ to match the posterior as closely as possible.

One of the most common approaches to evaluate proximity between $p$ and $q$ is to use KL-divergence (also known as relative entropy or information gain):

$$D_{KL}(q(\theta)||p(\theta))=-\int q(\theta)\log\frac{p(\theta)}{q(\theta)}\,d\theta.$$

The KL-divergence is asymmetric ($D_{KL}(q||p)\neq D_{KL}(p||q)$ in general), non-negative, and equal to zero iff $q(\theta)=p(\theta)$.

This asymmetry gives rise to two different approximation methods: variational inference, which minimizes $D_{KL}(q||p)$, and expectation propagation, which is based on $D_{KL}(p||q)$ (not reviewed in this paper).

Reducing KL-divergence to zero leads to exact matching of distributions, but usually, the variational family $q\in Q$ is not flexible enough for this.
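The properties of the KL-divergence listed above can be illustrated with a minimal sketch on discrete distributions. The two distributions `p` and `q` below are arbitrary illustrative values, not quantities from the paper.

```python
import math

def kl(a, b):
    """D_KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# Two arbitrary distributions over three outcomes (illustrative values).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

# The two directions differ, both are non-negative, and D_KL(p || p) = 0.
print(kl(q, p), kl(p, q), kl(p, p))
```

The direction $D_{KL}(q||p)$ is the one minimized in variational inference.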

We can formulate minimization of KL-divergence in another way:

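$$\begin{aligned}\log p(X)&=\log\int p(X,\theta)\,d\theta=\log\int\frac{p(X,\theta)\,q_{\lambda}(\theta)}{q_{\lambda}(\theta)}\,d\theta\\&=\log\mathbb{E}_{q_{\lambda}(\theta)}\left[\frac{p(X,\theta)}{q_{\lambda}(\theta)}\right]\geq\mathbb{E}_{q_{\lambda}(\theta)}\left[\log\frac{p(X,\theta)}{q_{\lambda}(\theta)}\right]=\mathcal{F}[q]=:\mathrm{ELBO}.\end{aligned}$$

The inequality is Jensen's inequality applied to the concave logarithm. The ELBO (Evidence Lower Bound) and the KL-divergence between the variational distribution and the posterior sum to the true log marginal probability of the data:

$$\log p(X)=\mathcal{F}(\lambda)+D_{KL}(q_{\lambda}(\theta)||p(\theta|X)),$$

so minimizing the KL-divergence is equivalent to maximizing the ELBO, since $\log p(X)$ does not depend on $\lambda$.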
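The decomposition of $\log p(X)$ into the ELBO and the KL-divergence to the posterior can be checked numerically. The sketch below assumes a toy model in which $\theta$ takes four values and $X$ is fixed. The joint probabilities and the distribution `q` are arbitrary illustrative numbers.

```python
import math

# Toy discrete model: theta takes 4 values and X is held fixed.
# joint[i] = p(X, theta_i); the numbers are arbitrary illustrative values.
joint = [0.02, 0.10, 0.05, 0.03]
evidence = sum(joint)                       # p(X) = sum over theta of p(X, theta)
posterior = [j / evidence for j in joint]   # p(theta | X)

# An arbitrary variational distribution q over the same 4 values.
q = [0.4, 0.3, 0.2, 0.1]

# ELBO = E_q[log p(X, theta) / q(theta)]
elbo = sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint))
# KL(q || p(theta | X))
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))

# The two printed numbers agree up to floating-point error.
print(math.log(evidence), elbo + kl)
```

Since the KL term is non-negative, the ELBO never exceeds $\log p(X)$, and the gap closes only when `q` equals the posterior.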

However, optimizing over a parametric variational family of distributions and obtaining the optimal solution $q^{*}=\arg\max\limits_{q\in\mathcal{Q}_{\lambda}}\mathcal{F}[q]$ still leaves an approximation gap [9], equal to $\log p(X)-\mathcal{F}[q^{*}]$. Several papers have shown that the choice of the variational family $\mathcal{Q}_{\lambda}$ is important for the quality of the variational approximation [1, 25, 23].

There are a number of approaches for reducing the approximation gap. Some of them increase the flexibility of the approximation family, e.g. normalizing flows [22] or hierarchical variational models [21]. Another research direction explores the idea of incrementally expanding the variational family with an additive mixture of tractable base learners [11, 18]; the authors of [17] investigate the theoretical justification of this approach from an optimization perspective. In general, both approaches are able to capture multimodality and nonstandard posterior shapes. However, incremental learning of the posterior approximation appears more promising from an applied point of view, as the additive mixture composes the approximation from simple and easy-to-evaluate building blocks.

Here we address several problems with this approach. First, starting from the Maximum Entropy principle [8], we obtain a naturally regularized optimization problem instead of the ad hoc regularization proposed in other works. This leads to interesting connections with other fields and allows the use of stochastic optimization, in contrast to the original boosting approach [11]. Second, we demonstrate the ability of the proposed approach to approximate complex posteriors on Bayesian neural networks, which is a data-intensive and challenging task [28].

3 Max Entropy Pursuit Variational Inference

In this section we derive an algorithm in which the posterior distribution is approximated by an additive mixture. The components are obtained sequentially, and each step consists of two optimization problems: one for the new component $h$ and one for the corresponding mixture weight $\alpha$.
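The bookkeeping implied by this sequential construction can be sketched as follows. Each step is a convex update $q_{t+1}=(1-\alpha)q_{t}+\alpha h$, so it rescales all existing mixture weights by $(1-\alpha)$ and appends the weight $\alpha$ for the new component. The sketch assumes one-dimensional Gaussian components stored as `(mean, std)` pairs. The helper names `add_component` and `sample` and all numeric values are hypothetical and chosen only for illustration.

```python
import random

def add_component(weights, components, alpha, new_component):
    """Return the mixture after the update q_{t+1} = (1 - alpha) q_t + alpha h."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    new_weights = [(1.0 - alpha) * w for w in weights] + [alpha]
    return new_weights, components + [new_component]

def sample(weights, components, rng):
    """Draw theta from the mixture: pick a component, then sample from it."""
    mean, std = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, std)

# q_0 is a single Gaussian (mean, std); two components are then added.
weights, components = [1.0], [(0.0, 1.0)]
weights, components = add_component(weights, components, 0.4, (3.0, 0.5))
weights, components = add_component(weights, components, 0.25, (-2.0, 0.7))
print(weights)  # the weights still sum to one

rng = random.Random(0)
draws = [sample(weights, components, rng) for _ in range(5)]
print(draws)
```

Storing the mixture as a flat list of weights keeps sampling and density evaluation linear in the number of components.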

3.1 Optimization over new component $h$

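Suppose we are given some approximation $q_{t}$ of the posterior distribution. Our goal is to improve the accuracy of the approximation in terms of the KL-divergence $D_{KL}[q_{t}(\theta)||p(\theta|X)]$ by using the additive mixture:

$$q_{t+1}=(1-\alpha)q_{t}+\alpha h,\quad\alpha\in(0;1),\quad h\in Q.$$

Following the Maximum Entropy approach [8], we look for the component $h$ of maximal entropy among those that increase the ELBO:

$$\max_{h\in Q}\mathcal{H}[h],\quad\text{s.t.}\quad\mathcal{F}[q_{t+1}]-\mathcal{F}[q_{t}]>0.\tag{1}$$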

As the optimization problem in Eq. (1) is highly non-linear, we propose to follow the framework based on the Frank-Wolfe algorithm [26, 17] and consider the constraint as a functional perturbation.

Expanding the $\mathcal{F}[q_{t+1}]$ term, we get

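$$\begin{aligned}\mathcal{F}[q_{t+1}]&=\int[q_{t}+\alpha(h-q_{t})]\left(\log\frac{L(\theta)}{q_{t}}-\log\left(1+\alpha\frac{h-q_{t}}{q_{t}}\right)\right)d\theta\\&=\underbrace{\int q_{t}\log\frac{L(\theta)}{q_{t}}\,d\theta}_{\mathcal{F}[q_{t}]}+\alpha\int(h-q_{t})\left(\log\frac{L(\theta)}{q_{t}}-\log\left[1+\alpha\frac{h-q_{t}}{q_{t}}\right]\right)d\theta-\int q_{t}\log\left(1+\alpha\frac{h-q_{t}}{q_{t}}\right)d\theta.\end{aligned}$$

Using the Taylor expansion $\log(1+x)=x-x^{2}/2+o(x^{2})$ and the identity $\int(h-q_{t})\,d\theta=0$, we obtain the constraint in the following form:

$$\mathcal{F}[q_{t+1}]-\mathcal{F}[q_{t}]=\alpha\left\langle h-q_{t},\log\frac{L(\theta)}{q_{t}}\right\rangle-\frac{\alpha^{2}}{2}\int\frac{(h-q_{t})^{2}}{q_{t}}\,d\theta+o\left(\left\|\alpha\frac{h-q_{t}}{q_{t}}\right\|_{2}^{2}\right).$$

Keeping only the first-order term, dropping its part $-\alpha\langle q_{t},\log\tfrac{L(\theta)}{q_{t}}\rangle$ that does not depend on $h$, and absorbing $\alpha$ into the multiplier $\lambda$ of the constraint, we get the following optimization problem:

$$\max_{h\in Q}\mathcal{H}[h]+\lambda\left\langle h,\log\frac{L(\theta)}{q_{t}}\right\rangle.\tag{2}$$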

We can perform scalable optimization of (2) by doubly stochastic gradient descent [14, 24]. Here $\lambda>0$ is the Lagrange multiplier of the constraint. The exact solution of the dual problem for the optimal $\lambda$ is intractable. Below we analyze how the solution depends on $\lambda$, which allows us to propose a practically useful heuristic for selecting its value.

Note that retaining only the first-order terms corresponds to the “functional gradient” of the KL-divergence [11]. However, the MaxEntropy approach additionally yields a natural regularization term, the entropy $\mathcal{H}[h]$. Below we show that this term is critical for obtaining a data-scalable algorithm and for interpreting the parameter $\lambda$. Also, in Section 4 we discuss whether the first-order expansion is sufficient for high-dimensional problems.
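The expansion of $\mathcal{F}[q_{t+1}]-\mathcal{F}[q_{t}]$ in powers of $\alpha$ can be checked numerically on a small discrete example. The sketch below compares the exact ELBO difference with the first-order term $\alpha\langle h-q_{t},\log\tfrac{L}{q_{t}}\rangle$ and with the second-order correction $-\tfrac{\alpha^{2}}{2}\sum(h-q_{t})^{2}/q_{t}$, which is the coefficient that a direct expansion of $\log(1+x)$ gives. The vectors `L`, `q_t` and `h` are arbitrary illustrative values.

```python
import math

def elbo(q, L):
    """F[q] = sum over theta of q * log(L / q) for a discrete density q."""
    return sum(qi * math.log(Li / qi) for qi, Li in zip(q, L))

# Arbitrary illustrative values on a 5-point parameter space.
L = [0.5, 2.0, 1.0, 0.25, 1.5]     # unnormalized target L(theta)
q_t = [0.1, 0.3, 0.2, 0.3, 0.1]    # current approximation
h = [0.3, 0.1, 0.4, 0.1, 0.1]      # candidate new component

# First-order term <h - q_t, log(L / q_t)> and second-order factor.
first = sum((hi - qi) * math.log(Li / qi) for hi, qi, Li in zip(h, q_t, L))
second = sum((hi - qi) ** 2 / qi for hi, qi in zip(h, q_t))

for alpha in (1e-1, 1e-2, 1e-3):
    q_next = [(1 - alpha) * qi + alpha * hi for qi, hi in zip(q_t, h)]
    exact = elbo(q_next, L) - elbo(q_t, L)
    approx = alpha * first - 0.5 * alpha ** 2 * second
    print(alpha, exact, alpha * first, approx)
```

As $\alpha$ shrinks, the first-order term dominates the difference, which is why the sign of $\langle h-q_{t},\log\tfrac{L}{q_{t}}\rangle$ decides whether a small step towards $h$ increases the ELBO.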

3.2 Analysis of optimization problem for $h$

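To obtain a heuristic rule for choosing $\lambda$, we optimize the objective in Eq. (2) not over a parametric family $Q$ of base learners $h$, but over all probability densities. As the objective is concave in $h$, we can derive the global optimum of the maximization problem from the first-order conditions, with the multiplier $\gamma$ enforcing normalization:

$$\frac{\delta}{\delta h}\left[\mathcal{H}[h]+\lambda\left\langle h,\log\frac{L(\theta)}{q_{t}}\right\rangle+\gamma\left(\int h\,d\theta-1\right)\right]=0,\qquad h^{*}=\left[\frac{L(\theta)}{q_{t}}\right]^{\lambda}\exp(\gamma-1).$$

Hence, the optimal new component has the following form:

$$h^{*}=\frac{1}{Z(\lambda)}\left[\frac{L(\theta)}{q_{t}}\right]^{\lambda}.\tag{3}$$

The solution $h^{*}$ is intractable, as finding the normalization constant $Z(\lambda)=\int\left[\tfrac{L(\theta)}{q_{t}}\right]^{\lambda}d\theta$ has the same complexity as solving the original problem. Still, as the global optimum is known up to normalization, instead of the optimization problem in Eq. (2) we can consider the equivalent problem of projecting $h^{*}$ onto $Q$:

$$\min_{h\in Q}D_{KL}\left(h\,\Big|\Big|\,\frac{1}{Z(\lambda)}\left[\frac{L(\theta)}{q_{t}}\right]^{\lambda}\right).\tag{4}$$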
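The link between the entropy-regularized objective and the KL projection can be verified on a discrete example, where $Z(\lambda)$ is a finite sum. For any density $h$, the objective equals $\log Z(\lambda)-D_{KL}(h||h^{*})$, so maximizing the objective and minimizing the KL-divergence to $h^{*}$ select the same $h$. In the sketch below, the vectors, the value `lam` and the helper names are illustrative assumptions.

```python
import math

# Arbitrary illustrative values on a 5-point parameter space.
L = [0.5, 2.0, 1.0, 0.25, 1.5]     # unnormalized target L(theta)
q_t = [0.1, 0.3, 0.2, 0.3, 0.1]    # current approximation
lam = 0.7                           # hypothetical value of lambda

def objective(h):
    """H[h] + lam * <h, log(L / q_t)> for a discrete density h."""
    entropy = -sum(hi * math.log(hi) for hi in h)
    inner = sum(hi * math.log(Li / qi) for hi, Li, qi in zip(h, L, q_t))
    return entropy + lam * inner

# Unrestricted optimum: h* proportional to (L / q_t) ** lam.
weights = [(Li / qi) ** lam for Li, qi in zip(L, q_t)]
Z = sum(weights)
h_star = [w / Z for w in weights]

def kl_to_h_star(h):
    """D_KL(h || h*) for a discrete density h."""
    return sum(hi * math.log(hi / si) for hi, si in zip(h, h_star))

h_other = [0.3, 0.1, 0.4, 0.1, 0.1]  # any other density on the same points

# objective(h) = log Z - KL(h || h*), hence the maximum log Z is attained at h*.
print(objective(h_star), math.log(Z))
print(objective(h_other), math.log(Z) - kl_to_h_star(h_other))
```

In a continuous parameter space $Z(\lambda)$ is not available, but it only shifts the objective by a constant, which is why a black-box solver for the KL problem suffices.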

Problem (4) is a well-known optimization problem for which many black-box variational inference (BBVI) solvers exist, see e.g. [10, 20]. Hence, practitioners can benefit from our approach without significant additional cost of implementing or reformulating the initial statistical problem. Moreover, analyzing the form of (3) provides intuition for selecting $\lambda$ through a connection with the Renyi divergence [16]. Namely, we consider a parametric mapping in the space of probability densities:

$$T_{\lambda}:p\to\frac{p^{\lambda}(\theta)}{\int p^{\lambda}(\theta)\,d\theta},\quad\lambda>0.\tag{5}$$

Consider a uniform distribution $U$ on a bounded domain and a non-uniform density $p$ on the same domain, so that $\mathcal{H}[p]<\mathcal{H}[U]$. One can show that

$$\begin{aligned}D_{KL}(U||p)&<D_{KL}(U||T_{\lambda}p),\quad\text{for }\lambda>1,\\D_{KL}(U||p)&>D_{KL}(U||T_{\lambda}p),\quad\text{for }\lambda<1,\end{aligned}\tag{6}$$

that is, $T_{\lambda}$ moves $p$ away from the uniform distribution for $\lambda>1$ and towards it for $\lambda<1$. Hence, for $\lambda>1$ we obtain a mode-seeking solution, and for $\lambda<1$ a mass-covering solution. Interestingly, the same behavior is described in [19] for Renyi divergence optimization with different values of its order $\alpha$. Hence, we can refer to $\lambda$ as the temperature and select an annealing schedule that tunes $\lambda$ at each step of the optimization process.
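The effect of the power mapping $T_{\lambda}$ can be inspected on a small discrete example. The sketch below applies $T_{\lambda}$ to an arbitrary illustrative distribution `p` and prints, for several values of $\lambda$, the KL-divergence from the uniform distribution to $T_{\lambda}p$ together with the entropy of $T_{\lambda}p$. On this example the divergence from uniform grows and the entropy shrinks as $\lambda$ increases, i.e. large $\lambda$ sharpens the modes and small $\lambda$ flattens the density.

```python
import math

def temper(p, lam):
    """T_lambda: raise p to the power lam and renormalize."""
    w = [pi ** lam for pi in p]
    z = sum(w)
    return [wi / z for wi in w]

def kl(a, b):
    """D_KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

# Arbitrary bimodal distribution on 5 points and the uniform reference.
p = [0.05, 0.40, 0.10, 0.35, 0.10]
U = [1.0 / len(p)] * len(p)

for lam in (0.25, 0.5, 1.0, 2.0, 4.0):
    tp = temper(p, lam)
    print(lam, kl(U, tp), entropy(tp))
```

This is the sense in which $\lambda$ behaves like a temperature parameter that can be annealed over the steps of the procedure.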

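Let us consider the special case $\lambda=1$. Then we can rewrite the objective in (2):

$$\arg\max_{h\in Q}\mathcal{H}[h]+\left\langle h,\log\frac{L(\theta)}{q_{t}}\right\rangle=\arg\max_{h\in Q}\underbrace{\int h\log\frac{L(\theta)}{h}\,d\theta}_{\text{term }(1)}\underbrace{-\int h\log q_{t}\,d\theta}_{\text{term }(2)}.\tag{7}$$

Term (1) in (7) is the standard variational inference objective [12], i.e. the ELBO of $h$. Term (2) in (7), the cross-entropy between $h$ and $q_{t}$, acts as a penalty for similarity with the current solution $q_{t}$.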

3.3 Optimization over mixture weight $\alpha$ corresponding to $h$

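After we obtain the new mixture component $h$ for the current variational approximation $q_{t}$, we select the mixture weight $\alpha$ to obtain a new variational approximation as a convex combination:

$$q_{t+1}(\theta)=(1-\alpha)q_{t}(\theta)+\alpha h(\theta).$$

This gives the following optimization problem over $\alpha\in(0;1)$:

$$\min_{\alpha\in(0;1)}D_{KL}\left((1-\alpha)q_{t}(\theta)+\alpha h(\theta)\,||\,p(\theta|X)\right).\tag{8}$$

Using a Taylor expansion, any $f$-divergence [27] can be approximated, for $q$ close to $p$, by the Pearson chi-squared divergence:

$$D_{f}(q||p)\approx f^{\prime\prime}(1)\chi^{2}(q||p).$$

Hence, dropping the terms that do not depend on $\alpha$, we can reformulate the approximation problem:

$$\min_{\alpha\in(0;1)}\int\frac{1}{p(\theta|X)}\left[q_{t}+\alpha(h-q_{t})\right]^{2}d\theta.\tag{9}$$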

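Consider the gradient and the Hessian of the objective in (9) w.r.t. $\alpha$:

$$\begin{aligned}\nabla_{\alpha}\int\frac{1}{p(\theta|X)}[q_{t}+\alpha(h-q_{t})]^{2}d\theta&=2\int\frac{1}{p(\theta|X)}[q_{t}+\alpha(h-q_{t})](h-q_{t})\,d\theta,\\\nabla_{\alpha}^{2}\int\frac{1}{p(\theta|X)}[q_{t}+\alpha(h-q_{t})]^{2}d\theta&=2\int\frac{(h-q_{t})^{2}}{p(\theta|X)}\,d\theta>0.\end{aligned}$$

As the objective (9) is convex, we can obtain the solution of the optimization problem (9) from the first-order condition. The unknown normalization constant of the posterior cancels in the ratio, so $p(\theta|X)$ can be replaced with $L(\theta)$:

$$\alpha^{*}=-\frac{\int\frac{1}{p(\theta|X)}q_{t}(h-q_{t})\,d\theta}{\int\frac{1}{p(\theta|X)}(h-q_{t})^{2}\,d\theta}=-\frac{\int\frac{1}{L(\theta)}q_{t}(h-q_{t})\,d\theta}{\int\frac{1}{L(\theta)}(h-q_{t})^{2}\,d\theta}.\tag{10}$$

In practice, a sample-based estimator of (10) has high variance. Evaluating $L(\theta)$ for each sample requires a forward pass through the whole dataset, hence the variance cannot be reduced efficiently by averaging. Therefore, we propose to use the exact solution (10) for medium-size datasets and a stochastic gradient approach with projection onto the feasible interval for the objective from Eq. (8) for large-scale datasets.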
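The closed-form weight can be illustrated on a one-dimensional grid, where the integrals reduce to sums. The sketch below builds the target as a known 0.6 / 0.4 mixture of $q_{t}$ and $h$, multiplied by an arbitrary constant that stands in for the unknown evidence, so the formula should recover $\alpha\approx 0.4$. The grid, the Gaussian shapes, the helper names and the final clipping to $[0,1]$ are all assumptions made for this illustration.

```python
import math

# Toy one-dimensional grid for theta (an illustrative assumption).
N = 201
theta = [-6.0 + 12.0 * i / (N - 1) for i in range(N)]

def gaussian_on_grid(mean, std):
    """Gaussian density evaluated on the grid and normalized to sum to one."""
    w = [math.exp(-0.5 * ((t - mean) / std) ** 2) for t in theta]
    z = sum(w)
    return [wi / z for wi in w]

q_t = gaussian_on_grid(-2.5, 1.0)   # current approximation
h = gaussian_on_grid(2.5, 0.7)      # new component

# Unnormalized target L(theta): a 0.6 / 0.4 mixture of q_t and h, times an
# arbitrary constant that plays the role of the unknown evidence p(X).
L = [3.0 * (0.6 * qi + 0.4 * hi) for qi, hi in zip(q_t, h)]

def optimal_alpha(q_t, h, L):
    """Closed-form weight from Eq. (10), with integrals replaced by grid sums."""
    num = sum(qi * (hi - qi) / Li for qi, hi, Li in zip(q_t, h, L))
    den = sum((hi - qi) ** 2 / Li for qi, hi, Li in zip(q_t, h, L))
    return -num / den

def chi2_objective(alpha):
    """Objective of Eq. (9) up to the constant factor p(X)."""
    return sum(((1 - alpha) * qi + alpha * hi) ** 2 / Li
               for qi, hi, Li in zip(q_t, h, L))

alpha_star = optimal_alpha(q_t, h, L)
# Clipping to [0, 1] is an added safeguard, not part of Eq. (10).
alpha_star = min(1.0, max(0.0, alpha_star))
print(alpha_star)
```

Because only ratios of sums involving $1/L(\theta)$ enter the formula, rescaling `L` by any positive constant leaves `alpha_star` unchanged.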

4 Neural Network Incremental Learning via Bayesian Inference

Deep neural networks provide state-of-the-art solutions for image classification problems. However, as a network is trained for a specific classification task, it is problematic to incrementally learn a new task: training on the new task degrades performance on the previous ones. This behaviour of neural networks is known as catastrophic forgetting. Ideally, we expect a different outcome: the performance should be similar to that obtained when training over the whole dataset in the offline mode [13]. In this section, we show how our approach helps to overcome this limitation.

Experimental Setup. We perform the incremental class learning experiment using the MNIST dataset with the LeNet-5 Convolutional Neural Network (CNN) [15]. The dataset contains grey-scale images belonging to 10 classes. We split the dataset into 5 tasks, the first task containing digits '0' and '1', the second task containing digits '2' and '3', and so on. For each task, we perform 10 epochs of training. We compare our incremental posterior approximation of the neural network parameters with a baseline of naive continual neural network learning. The size of the test dataset is $10^{4}$ samples, the total train size for all tasks is $5\times 10^{4}$. As the prior distribution on the neural network parameters we use the fully factorized standard normal distribution. The predictive distribution of the model is approximated by an ensemble of the weights sampled from the variational approximation.
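The class-incremental split described in the setup can be sketched as follows. The helper `split_by_task` and the short label list are hypothetical stand-ins. In the actual experiment the labels would come from the MNIST training set, whose loading is omitted here.

```python
from collections import defaultdict

# Class-incremental split from the setup: 5 tasks of 2 digits each.
NUM_CLASSES = 10
CLASSES_PER_TASK = 2
tasks = [tuple(range(c, c + CLASSES_PER_TASK))
         for c in range(0, NUM_CLASSES, CLASSES_PER_TASK)]

def split_by_task(labels):
    """Map task index -> indices of the examples whose label belongs to that task."""
    label_to_task = {c: t for t, classes in enumerate(tasks) for c in classes}
    indices = defaultdict(list)
    for i, y in enumerate(labels):
        indices[label_to_task[y]].append(i)
    return dict(indices)

# Hypothetical labels standing in for the MNIST training labels.
labels = [0, 3, 1, 9, 2, 8, 5, 4, 7, 6, 1, 0]
print(tasks)
print(split_by_task(labels))
```

Training then visits the tasks in order and uses only the examples of the current task, while evaluation uses test digits from all tasks seen so far.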

Results. We find that our incremental posterior approximation maintains higher test accuracy than the naive baseline throughout the whole sequence of tasks, almost matching the performance of a network trained simultaneously on all observed data. Fig. 1 shows the test accuracy as new tasks are observed. We conclude that the incremental posterior approximation leads to a drastic increase in performance on incremental learning tasks.

5 Conclusion

In this work, we developed an efficient approach for learning complex multimodal posteriors by constructing an additive mixture of simple densities. Following the MaxEntropy approach, we state a well-defined and tractable optimization problem. The additive mixture allows us to control the complexity of the posterior approximation by simply increasing or decreasing the number of components.

An important avenue of future research is to develop approaches for modeling covariance structure that accurately account for different characteristics of the posterior and that still allow for efficient computations in case of deep neural networks.

Also, we plan to consider various applications of the proposed approximation scheme including uncertainty quantification [2, 3, 4] and Bayesian parameter estimation for Gaussian Processes regression [5, 6, 7].

Acknowledgements

The work was supported by the Russian Science Foundation under Grant 19-41-04109.

Bibliography

[1] Burda, Y., Grosse, R., Salakhutdinov, R.: Importance weighted autoencoders. arXiv preprint arXiv:1509.00519 (2015)
[2] Burnaev, E., Panin, I.: Adaptive design of experiments for Sobol indices estimation based on quadratic metamodel. In: Gammerman, A., Vovk, V., Papadopoulos, H. (eds.) Statistical Learning and Data Sciences. pp. 86–95. Springer (2015)
[3] Burnaev, E., Panin, I., Sudret, B.: Effective design for Sobol indices estimation based on polynomial chaos expansions. In: Proc. of the 5th International Symposium on Conformal and Probabilistic Prediction with Applications. vol. 9653, pp. 165–184. Springer (2016)
[4] Burnaev, E., Panin, I., Sudret, B.: Efficient design of experiments for sensitivity analysis based on polynomial chaos expansions. Annals of Mathematics and Artificial Intelligence 81(1), 187–207 (2017)
[5] Burnaev, E., Zaytsev, A., Spokoiny, V.: Properties of the posterior distribution of a regression model based on Gaussian random fields. Automation and Remote Control 74(10), 1645–1655 (2013)
[6] Burnaev, E., Zaytsev, A., Spokoiny, V.: The Bernstein-von Mises theorem for regression based on Gaussian processes. Russ. Math. Surv. 68(5), 954–956 (2013)
[7] Burnaev, E., Zaytsev, A., Spokoiny, V.: Properties of the Bayesian parameter estimation of a regression based on Gaussian processes. Journal of Mathematical Sciences 203(6), 789–798 (2014)
[8] Caticha, A.: Relative entropy and inductive inference. In: AIP conference proc. vol. 707, pp. 75–96 (2004)
