Enhanced Variational Inference with Dyadic Transformation

Sarin Chandy; Amin Rasekh

arXiv:1901.10621·cs.LG·March 11, 2019

TL;DR

This paper introduces dyadic transformation, a new method to improve the flexibility of variational autoencoders by better modeling the posterior distribution, leading to improved performance on MNIST.

Contribution

The paper proposes dyadic transformation, a computationally efficient single-stage transformation that enhances the posterior modeling capability of VAEs.

Findings

1. DT improves posterior flexibility in VAEs
2. Achieves competitive results on MNIST
3. Low computational overhead

Abstract

The variational autoencoder (VAE) is a powerful deep generative model trained with variational inference. The VAE's original formulation models latent variables as normal distributions with a diagonal covariance matrix, which limits its flexibility to match the true posterior distribution. We propose a new transformation, dyadic transformation (DT), that can model a multivariate normal distribution. DT is a single-stage transformation with low computational requirements. We demonstrate empirically on the MNIST dataset that DT enhances posterior flexibility and attains competitive results compared to other VAE enhancements.

Tables

Table 1: Lower bound of the marginal log-likelihood on the MNIST test dataset for the regular VAE and for the VAE with our dyadic transformation (DT), HF (Tomczak and Welling, 2016), NF (Rezende and Mohamed, 2015), and HVI (Salimans et al., 2015). Listed are averages across 5 optimization runs. $T$ denotes the length of the flows. $T$ does not apply to the dyadic transformation method because it is a single-step transformation.

Model	$\leq \log p (x)$
VAE	-89.93
VAE+DT (k=10)	-88.24
VAE+DT (k=20)	-88.00
VAE+DT (k=50)	-87.42
VAE+NF (T=80)	-85.1
VAE+HF (T=10)	-87.68
VAE+HVI (T=8)	-88.30

Equations

$$\log p(\textbf{X})=\log p\big(\textbf{x}^{(1)},...,\textbf{x}^{(N)}\big)=\sum_{i=1}^{N}\log p\big(\textbf{x}^{(i)}\big) \tag{1}$$

$$\log p_{\theta}(\textbf{x})\geq\mathbb{E}_{q_{\phi}(\textbf{z}|\textbf{x})}\Big[\log p_{\theta}(\textbf{x}|\textbf{z})\Big]-D_{KL}\Big(q_{\phi}(\textbf{z}|\textbf{x})\,||\,p_{\theta}(\textbf{z})\Big) \tag{2}$$

$$\mathcal{L}\big(\theta,\phi;\textbf{x}\big)=\log p_{\theta}(\textbf{x})-D_{KL}\Big(q_{\phi}(\textbf{z}|\textbf{x})\,||\,p_{\theta}(\textbf{z}|\textbf{x})\Big) \tag{3}$$

$$\textbf{G}=\textbf{B}\textbf{Y} \tag{4}$$

$$\textbf{B}=\textbf{I}+\epsilon\textbf{U}\textbf{V} \tag{5}$$

$$(\textbf{A}+\textbf{U}\textbf{C}\textbf{V})^{-1}=\textbf{A}^{-1}-\textbf{A}^{-1}\textbf{U}(\textbf{C}^{-1}+\textbf{V}\textbf{A}^{-1}\textbf{U})^{-1}\textbf{V}\textbf{A}^{-1} \tag{6}$$

$$\det(\textbf{I}_{m}+\textbf{U}\textbf{V})=\det(\textbf{I}_{n}+\textbf{V}\textbf{U}) \tag{7}$$

$$D_{KL}\big(q(\textbf{z}|\textbf{x})\,||\,p(\textbf{z})\big)=-\frac{1}{2}\sum_{j=1}^{J}\big(1+\log\big(\boldsymbol{\sigma}_{j}^{2}\big)-\boldsymbol{\mu}_{j}^{2}-\boldsymbol{\sigma}_{j}^{2}\big) \tag{8}$$

$$D_{KL}(\mathcal{N}_{0}\,||\,\mathcal{N}_{1})=\frac{1}{2}\Big(\mathrm{Tr}(\boldsymbol{\Sigma}_{1}^{-1}\boldsymbol{\Sigma}_{0})+(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})^{T}\boldsymbol{\Sigma}_{1}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})-J+\ln\frac{\det\boldsymbol{\Sigma}_{1}}{\det\boldsymbol{\Sigma}_{0}}\Big) \tag{9}$$

$$D_{KL}\big(q(\textbf{z}|\textbf{x})\,||\,p(\textbf{z})\big)=\frac{1}{2}\big(\mathrm{Tr}(\boldsymbol{\Sigma})+\boldsymbol{\mu}^{T}\boldsymbol{\mu}-J-\ln\det(\boldsymbol{\Sigma})\big) \tag{10}$$

$$\frac{\partial\textbf{D}^{-1}}{\partial t}=-\textbf{D}^{-1}\frac{\partial\textbf{D}}{\partial t}\textbf{D}^{-1} \tag{11}$$

$$\frac{\partial\det(\textbf{D})}{\partial t}=\det(\textbf{D})\,\mathrm{Tr}\Big(\textbf{D}^{-1}\frac{\partial\textbf{D}}{\partial t}\Big) \tag{12}$$

$$\det(\textbf{B})=1+\epsilon\,\mathrm{Tr}(\textbf{U}\textbf{V})+O(\epsilon^{2}) \tag{13}$$

$$\textbf{B}^{-1}=\textbf{I}-\epsilon\textbf{U}\textbf{V}+O(\epsilon^{2}) \tag{14}$$

Sarin Chandy

Amin Rasekh

Xylem Inc., 817 West Peachtree Street, Atlanta, GA 30308, USA

The University of Chicago, 5807 S Woodlawn Ave, Chicago, IL 60637, USA

Keywords: Autoencoder; Generative models; Variational inference; Dyadic transformation.

Source code is available at https://github.com/sarin1991/DyadicFlow

1 Introduction

A variational autoencoder (VAE) is a deep generative model trained with variational inference. A generative model is an unsupervised learning approach that learns a domain by processing a large amount of data from it and can then generate new data like it (Hinton and Ghahramani, 1997; Yu et al., 2018). The VAE, together with Generative Adversarial Networks (Goodfellow et al., 2016) and Deep Autoregressive Networks (Gregor et al., 2013), is among the most powerful and popular generative modeling techniques. VAEs have been successfully applied in many domains, such as image processing (Pu et al., 2016), natural language processing (Semeniuta et al., 2017), and cybersecurity (Chandy et al., 2019).

A VAE works by maximizing a variational lower bound of the likelihood of the data (Kingma and Welling, 2013). A VAE has two halves: a recognition model (an encoder) and a generative model (a decoder). The recognition model learns a latent representation of the input data, and the generative model learns to transform this representation back into the original data. The recognition and generative models are jointly trained by optimizing the probability of the input data using stochastic gradient ascent.

Applying a VAE involves selecting an approximate posterior distribution for the latent variables. This decision determines the flexibility and tractability of the VAE, and hence the quality and efficiency of the inference, and it poses a core challenge in variational inference. Conventionally, the choice is the normal distribution with a diagonal covariance matrix. This choice helps computational efficiency but limits the flexibility to match the true posterior. We introduce a new transformation, the dyadic transformation (DT), which approximates the posterior as a normal distribution with full covariance. DT offers theoretical advantages in model flexibility, parallelizability, scalability, and efficiency, which together make the VAE better suited to statistical inference on large, complex datasets.

2 Variational Autoencoder

2.1 Formulation

Let x be a (set of) observed variables, z a (set of) continuous, stochastic latent variables that represent their encoding, and $p(\textbf{x},\textbf{z})$ the parametric model of their joint distribution. The observations of x (datapoints) are generated by a random process that involves the unobserved random variables z. The encoder network with parameters $\phi$ encodes the given dataset with an approximate posterior distribution $q_{\phi}(\textbf{z}|\textbf{x})$ defined over the latent variables, while the decoder network with parameters $\theta$ decodes z into x with probability $p_{\theta}(\textbf{x}|\textbf{z})$. The encoder tries to approximate the true but intractable posterior $p_{\theta}(\textbf{z}|\textbf{x})$. Assuming a standard normal prior on z and given a dataset X, we can optimize the network parameters by maximizing the log-probability of the data, $\log p_{\theta}(\textbf{X})$, i.e., by maximizing

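$$\log p(\textbf{X})=\log p\big(\textbf{x}^{(1)},...,\textbf{x}^{(N)}\big)=\sum_{i=1}^{N}\log p\big(\textbf{x}^{(i)}\big) \tag{1}$$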

where, given our approximation to the true posterior distribution, for each datapoint x we can write

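$$\log p_{\theta}(\textbf{x})\geq\mathbb{E}_{q_{\phi}(\textbf{z}|\textbf{x})}\Big[\log p_{\theta}(\textbf{x}|\textbf{z})\Big]-D_{KL}\Big(q_{\phi}(\textbf{z}|\textbf{x})\,||\,p_{\theta}(\textbf{z})\Big) \tag{2}$$

The right-hand side is denoted $\mathcal{L}\big(\theta,\phi;\textbf{x}\big)$. It can equivalently be written as follows and, because the KL divergence $D_{KL}(\cdot)$ is always non-negative, it is the (variational) lower bound on the marginal log-likelihood of datapoint x:

$$\mathcal{L}\big(\theta,\phi;\textbf{x}\big)=\log p_{\theta}(\textbf{x})-D_{KL}\Big(q_{\phi}(\textbf{z}|\textbf{x})\,||\,p_{\theta}(\textbf{z}|\textbf{x})\Big) \tag{3}$$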

Therefore, maximizing the lower bound simultaneously increases the probability of the data and reduces the divergence from the true posterior. Thus, we would like to maximize it w.r.t. the encoder and decoder parameters, $\phi$ and $\theta$, respectively.
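The relationship between the lower bound, the log-likelihood, and the posterior gap can be checked numerically on a toy model where everything is available in closed form. The sketch below is not from the paper: it assumes a one-dimensional model with prior $z\sim\mathcal{N}(0,1)$ and likelihood $x|z\sim\mathcal{N}(z,1)$, for which the marginal is $\mathcal{N}(0,2)$ and the exact posterior is $\mathcal{N}(x/2,1/2)$. The function names are illustrative.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log-density of a univariate normal N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def kl_normal(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for univariate normals."""
    return 0.5 * (v0 / v1 + (m0 - m1) ** 2 / v1 - 1.0 + math.log(v1 / v0))

def elbo(x, m, v):
    """Closed-form lower bound for the toy model with q(z|x) = N(m, v)."""
    # E_q[log p(x|z)] with x|z ~ N(z, 1): E_q[(x - z)^2] = (x - m)^2 + v
    expected_log_lik = -0.5 * (math.log(2.0 * math.pi) + (x - m) ** 2 + v)
    return expected_log_lik - kl_normal(m, v, 0.0, 1.0)

x = 1.3
log_px = log_normal_pdf(x, 0.0, 2.0)   # exact marginal: x ~ N(0, 2)
post_mean, post_var = x / 2.0, 0.5     # exact posterior: z|x ~ N(x/2, 1/2)

for m, v in [(0.0, 1.0), (0.4, 0.8), (post_mean, post_var)]:
    gap = kl_normal(m, v, post_mean, post_var)
    print(f"q=N({m:.2f},{v:.2f})  bound={elbo(x, m, v):.4f}  "
          f"gap={gap:.4f}  log p(x)={log_px:.4f}")
```

In this toy setting the bound plus the gap always equals $\log p(x)$, and the bound is tight only when $q$ equals the exact posterior.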

2.2 Need for model flexibility

The encoder and decoder in a VAE are conventionally modeled using the normal distribution with a diagonal covariance matrix, i.e., $\mathcal{N}(\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\sigma}^{2}))$, where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are commonly nonlinear functions parametrized by neural networks. This practice is mainly driven by the requirements for computational tractability. It limits the flexibility of the model, however, especially for the encoder, which generally cannot match the true posterior distribution.

3 Dyadic Transformation

3.1 Motivation

Theoretically, the approximate model will be significantly more flexible if it is modeled as a multivariate normal distribution with a full covariance matrix.

A linear transformation matrix B of size $n\times n$ applied to an $n$-dimensional normal random variable $\textbf{Y}\sim\mathcal{N}(\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\sigma}^{2}))$ produces another normal random variable $\textbf{G}\sim\mathcal{N}(\textbf{B}\boldsymbol{\mu},\,\textbf{B}\,\mathrm{diag}(\boldsymbol{\sigma}^{2})\,\textbf{B}^{T})$. Thus, although Y has a diagonal covariance, its transformation through B results in a multivariate normal distribution with a full covariance matrix:

$$\textbf{G}=\textbf{B}\textbf{Y} \tag{4}$$

This transformation matrix B introduces $O(n^{2})$ new parameters. To use this transformation in our generative model, we need to compute the log-probability and KL divergence of G. These computations do not scale well with the size of B.

To overcome this issue, we define the transformation matrix B as follows:

$$\textbf{B}=\textbf{I}+\epsilon\textbf{U}\textbf{V} \tag{5}$$

where I is an identity matrix, $\epsilon$ is a scalar parameter, U is an $n\times k$ matrix, and V is a $k\times n$ matrix. Here $k$ is a model hyper-parameter that can be adjusted to set the trade-off between algorithm flexibility and computational efficiency.

In what follows, we show that this affine transformation gives the higher flexibility desired without introducing much additional computational complexity and thus it scales well with n.
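As a small illustration of the construction above, the following NumPy sketch builds a dyadic transformation matrix from randomly drawn U and V and shows that the transformed covariance is no longer diagonal. The sizes, the random initialization, and the value of $\epsilon$ are assumptions chosen for readability, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, eps = 6, 2, 0.1                    # illustrative sizes and step, not the paper's

U = rng.standard_normal((n, k))          # n x k factor
V = rng.standard_normal((k, n))          # k x n factor
sigma = rng.uniform(0.5, 1.5, size=n)    # standard deviations of Y

B = np.eye(n) + eps * U @ V              # dyadic transformation matrix
cov_Y = np.diag(sigma ** 2)              # diagonal covariance of Y
cov_G = B @ cov_Y @ B.T                  # covariance of G = B Y

off_diag = cov_G - np.diag(np.diag(cov_G))
print("largest off-diagonal covariance entry:", np.abs(off_diag).max())
print("new parameters: dense B =", n * n, ", dyadic U and V =", 2 * n * k)
```

The parameter count of the two factors grows as $2nk$ rather than $n^{2}$, which is where the trade-off controlled by $k$ comes from.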

3.2 Efficient calculation of matrix determinant and inverse

Computing the log-probability and KL Divergence of the generative model involves the calculation of the determinant and inverse of the dyadic transformation matrix. We show that these operations can be efficiently computed with the help of the following theorems:

Theorem 1. (Sherman-Morrison-Woodbury). Given four matrices A, U, C, and V,

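$$(\textbf{A}+\textbf{U}\textbf{C}\textbf{V})^{-1}=\textbf{A}^{-1}-\textbf{A}^{-1}\textbf{U}(\textbf{C}^{-1}+\textbf{V}\textbf{A}^{-1}\textbf{U})^{-1}\textbf{V}\textbf{A}^{-1} \tag{6}$$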

if the matrices are of conformable sizes and also if the matrices A and $\textbf{C}^{-1}+\textbf{VA}^{-1}\textbf{U}$ are invertible (Woodbury, 1950). With the help of this theorem, we can efficiently calculate the inverse for the Dyadic Transformation matrix B.
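For the dyadic transformation matrix, setting $\textbf{A}=\textbf{I}$ and $\textbf{C}=\epsilon\textbf{I}_{k}$ in Theorem 1 gives $\textbf{B}^{-1}=\textbf{I}-\epsilon\textbf{U}(\textbf{I}_{k}+\epsilon\textbf{VU})^{-1}\textbf{V}$, so only a $k\times k$ system has to be solved. The NumPy sketch below checks this specialization against a dense inverse; the helper name `dyadic_inverse` and the test sizes are illustrative assumptions.

```python
import numpy as np

def dyadic_inverse(U, V, eps):
    """Inverse of B = I + eps * U @ V via the Woodbury identity (A = I, C = eps * I_k)."""
    n, k = U.shape
    small = np.eye(k) + eps * V @ U                      # k x k matrix to invert
    return np.eye(n) - eps * U @ np.linalg.solve(small, V)

rng = np.random.default_rng(1)
n, k, eps = 50, 5, 0.001
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))

B = np.eye(n) + eps * U @ V
max_err = np.abs(dyadic_inverse(U, V, eps) - np.linalg.inv(B)).max()
print("max abs difference from np.linalg.inv:", max_err)
```

The sketch materializes the full $n\times n$ inverse only to make the comparison easy; applying the same formula directly to a vector avoids forming any $n\times n$ matrix.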

Theorem 2. (Sylvester’s Determinant Identity). Given two matrices U and V of sizes $m\times n$ and $n\times m$ ,

$$\det(\textbf{I}_{m}+\textbf{U}\textbf{V})=\det(\textbf{I}_{n}+\textbf{V}\textbf{U}), \tag{7}$$

where $\mathbf{I}_{m}$ and $\mathbf{I}_{n}$ are identity matrices of orders $m$ and $n$, respectively (Sylvester, 1851). This theorem relates the determinant of an $n\times n$ matrix to the determinant of an $m\times m$ matrix, which is very useful in regimes where $n\gg m$. We use this property to make the determinant calculations of B computationally tractable.
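Applied to the dyadic transformation matrix with U of size $n\times k$ and V of size $k\times n$, the identity reduces the $n\times n$ determinant to a $k\times k$ one. The sketch below compares the two routes using `numpy.linalg.slogdet`; the helper name and the sizes are assumptions made for illustration.

```python
import numpy as np

def dyadic_slogdet(U, V, eps):
    """Sign and log|det| of B = I_n + eps * U @ V, from the k x k matrix I_k + eps * V @ U."""
    k = U.shape[1]
    return np.linalg.slogdet(np.eye(k) + eps * V @ U)

rng = np.random.default_rng(2)
n, k, eps = 200, 10, 0.001
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))

sign_small, logdet_small = dyadic_slogdet(U, V, eps)
sign_full, logdet_full = np.linalg.slogdet(np.eye(n) + eps * U @ V)
print("k x k route:", sign_small, logdet_small)
print("n x n route:", sign_full, logdet_full)
```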

3.3 KL divergence between two normal distributions

Using the above theorems we show that the KL divergence for a multivariate normal distribution obtained using Dyadic Transformation can be efficiently computed.

KL divergence between the independent normal posterior and standard normal prior can be written as (Kingma and Welling, 2013)

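$$D_{KL}\big(q(\textbf{z}|\textbf{x})\,||\,p(\textbf{z})\big)=-\frac{1}{2}\sum_{j=1}^{J}\big(1+\log\big(\boldsymbol{\sigma}_{j}^{2}\big)-\boldsymbol{\mu}_{j}^{2}-\boldsymbol{\sigma}_{j}^{2}\big) \tag{8}$$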

where $J$ is the dimensionality of z. We can show that in general the KL divergence between two normal distributions, with means $\boldsymbol{\mu}_{0}$ and $\boldsymbol{\mu}_{1}$ , and covariance matrices $\boldsymbol{\Sigma}_{0}$ and $\boldsymbol{\Sigma}_{1}$ , is (Duchi, 2007):

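$$D_{KL}(\mathcal{N}_{0}\,||\,\mathcal{N}_{1})=\frac{1}{2}\Big(\mathrm{Tr}(\boldsymbol{\Sigma}_{1}^{-1}\boldsymbol{\Sigma}_{0})+(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})^{T}\boldsymbol{\Sigma}_{1}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})-J+\ln\frac{\det\boldsymbol{\Sigma}_{1}}{\det\boldsymbol{\Sigma}_{0}}\Big) \tag{9}$$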

Given that in our case $q(\textbf{z}|\textbf{x})\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ and $p(\textbf{z})\sim\mathcal{N}(\textbf{0},\textbf{I})$, we can write

$$D_{KL}\big(q(\textbf{z}|\textbf{x})\,||\,p(\textbf{z})\big)=\frac{1}{2}\big(\mathrm{Tr}(\boldsymbol{\Sigma})+\boldsymbol{\mu}^{T}\boldsymbol{\mu}-J-\ln\det(\boldsymbol{\Sigma})\big) \tag{10}$$

The KL divergence therefore also involves $\det(\boldsymbol{\Sigma})$, which is computed efficiently using Sylvester's determinant identity.
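One way to evaluate this KL divergence without ever forming an $n\times n$ matrix is sketched below. It assumes the transformed posterior has mean $\textbf{B}\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}=\textbf{B}\,\mathrm{diag}(\boldsymbol{\sigma}^{2})\,\textbf{B}^{T}$, so that $\ln\det\boldsymbol{\Sigma}=2\ln|\det\textbf{B}|+\sum_{j}\ln\boldsymbol{\sigma}_{j}^{2}$ and the trace is a weighted sum of squared column norms of B. The function name `dyadic_kl` and the test sizes are hypothetical; this is not taken from the authors' repository.

```python
import numpy as np

def dyadic_kl(mu, sigma, U, V, eps):
    """KL( N(B mu, B diag(sigma^2) B^T) || N(0, I) ) for B = I + eps * U @ V."""
    n, k = U.shape
    var = sigma ** 2
    uv_diag = np.einsum("ik,ki->i", U, V)                # diagonal of U @ V
    quad = np.einsum("kj,kl,lj->j", V, U.T @ U, V)       # squared column norms of U @ V
    # Tr(B diag(var) B^T) = sum_j var_j * ||column j of B||^2
    trace = np.sum(var * (1.0 + 2.0 * eps * uv_diag + eps ** 2 * quad))
    mean = mu + eps * U @ (V @ mu)                       # B @ mu without forming B
    _, logabsdet_B = np.linalg.slogdet(np.eye(k) + eps * V @ U)   # Sylvester's identity
    logdet_cov = 2.0 * logabsdet_B + np.sum(np.log(var))
    return 0.5 * (trace + mean @ mean - n - logdet_cov)

rng = np.random.default_rng(5)
n, k, eps = 8, 3, 0.05
mu = rng.standard_normal(n)
sigma = rng.uniform(0.5, 1.5, size=n)
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))

# Dense reference computed directly from the general formula
B = np.eye(n) + eps * U @ V
Sigma = B @ np.diag(sigma ** 2) @ B.T
kl_dense = 0.5 * (np.trace(Sigma) + (B @ mu) @ (B @ mu) - n - np.linalg.slogdet(Sigma)[1])
kl_fast = dyadic_kl(mu, sigma, U, V, eps)
print("low-rank route:", kl_fast, " dense route:", kl_dense)
```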

3.4 Calculation of the gradient of matrix determinant and inverse

Given a matrix D, the derivative of the inverse and determinant of D w.r.t. a variable t can be calculated as

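$$\frac{\partial\textbf{D}^{-1}}{\partial t}=-\textbf{D}^{-1}\frac{\partial\textbf{D}}{\partial t}\textbf{D}^{-1} \tag{11}$$

$$\frac{\partial\det(\textbf{D})}{\partial t}=\det(\textbf{D})\,\mathrm{Tr}\Big(\textbf{D}^{-1}\frac{\partial\textbf{D}}{\partial t}\Big) \tag{12}$$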

We make a key observation from the two derivative equations above: if the determinant and inverse of a matrix are finite, their gradients are also finite. Calculating either derivative therefore should not lead to numerical instability, even if the matrix is initialized randomly.
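Both derivative formulas can be sanity-checked with central finite differences. The sketch below assumes a simple matrix path $\textbf{D}(t)=\textbf{D}_{0}+t\,\textbf{D}_{1}$, so that $\partial\textbf{D}/\partial t=\textbf{D}_{1}$; the path, the sizes, and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
D0 = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # D(0), close to the identity
D1 = 0.5 * rng.standard_normal((n, n))               # constant derivative dD/dt

def D(t):
    """Matrix path D(t) = D0 + t * D1."""
    return D0 + t * D1

t, h = 0.2, 1e-6
Dt_inv = np.linalg.inv(D(t))

# Analytic derivatives of the inverse and of the determinant
d_inv = -Dt_inv @ D1 @ Dt_inv
d_det = np.linalg.det(D(t)) * np.trace(Dt_inv @ D1)

# Central finite differences for comparison
d_inv_fd = (np.linalg.inv(D(t + h)) - np.linalg.inv(D(t - h))) / (2 * h)
d_det_fd = (np.linalg.det(D(t + h)) - np.linalg.det(D(t - h))) / (2 * h)

print("inverse derivative, max abs error:", np.abs(d_inv - d_inv_fd).max())
print("determinant derivative, abs error:", abs(d_det - d_det_fd))
```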

Also from Equations (11) and (12) we can show that for the Dyadic Transformation matrix B

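$$\det(\textbf{B})=1+\epsilon\,\mathrm{Tr}(\textbf{U}\textbf{V})+O(\epsilon^{2}) \tag{13}$$

$$\textbf{B}^{-1}=\textbf{I}-\epsilon\textbf{U}\textbf{V}+O(\epsilon^{2}) \tag{14}$$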

An important observation here is that if the value of $\epsilon$ is small enough, then the determinant and inverse of the dyadic transformation matrix will be finite. This observation was crucial for making the numerical computations stable.
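The first-order expansions of $\det(\textbf{B})$ and $\textbf{B}^{-1}$ can be examined numerically by shrinking $\epsilon$ and watching the approximation error. The following sketch uses randomly drawn U and V with assumed sizes; the helper `approx_errors` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 20, 3
U = rng.standard_normal((n, k))
V = rng.standard_normal((k, n))
UV = U @ V

def approx_errors(eps):
    """Errors of the first-order approximations of det(B) and B^{-1} for B = I + eps * UV."""
    B = np.eye(n) + eps * UV
    det_err = abs(np.linalg.det(B) - (1.0 + eps * np.trace(UV)))
    inv_err = np.abs(np.linalg.inv(B) - (np.eye(n) - eps * UV)).max()
    return det_err, inv_err

for eps in (1e-2, 1e-3, 1e-4):
    det_err, inv_err = approx_errors(eps)
    print(f"eps={eps:.0e}  det error={det_err:.3e}  inverse error={inv_err:.3e}")
```

If the remainder is second order, each tenfold reduction of $\epsilon$ should shrink both errors by roughly a factor of one hundred.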

Pseudo code for VAE with Dyadic Transformation

repeat
    $\textbf{X}^{M}\leftarrow$ Random minibatch of $M$ datapoints
    $\boldsymbol{\alpha}\leftarrow$ Random samples from noise distribution
    $\textbf{U},\textbf{V},\boldsymbol{\mu},\boldsymbol{\sigma}\leftarrow$ Encoder NN $(\textbf{X}^{M},\phi)$
    $\textbf{Y}\leftarrow\boldsymbol{\mu}+\boldsymbol{\alpha}\times\boldsymbol{\sigma}$
    $\textbf{z}\leftarrow(\textbf{I}+\epsilon\textbf{UV})\textbf{Y}$
    $\textbf{g}\leftarrow\nabla_{\theta,\phi}\hat{\mathcal{L}}^{M}(\theta,\phi;\textbf{X}^{M},\textbf{z})$
    $\theta,\phi\leftarrow$ Update parameters using gradient $\textbf{g}$
until convergence of parameters ($\theta$, $\phi$)
return $\theta$, $\phi$
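The sampling steps of the pseudo code can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the encoder network is replaced by random arrays, the per-datapoint shapes of U and V are an assumption, and the gradient and parameter-update steps are omitted. The product $(\textbf{I}+\epsilon\textbf{UV})\textbf{Y}$ is evaluated as $\textbf{Y}+\epsilon\textbf{U}(\textbf{V}\textbf{Y})$ so that no $n\times n$ matrix is formed.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Y = mu + alpha * sigma with alpha ~ N(0, I); mu and sigma have shape (M, n)."""
    alpha = rng.standard_normal(mu.shape)
    return mu + alpha * sigma

def dyadic_apply(Y, U, V, eps):
    """z = (I + eps * U V) Y per datapoint; U is (M, n, k), V is (M, k, n), Y is (M, n)."""
    VY = np.einsum("mkn,mn->mk", V, Y)                # V @ Y for each row, shape (M, k)
    return Y + eps * np.einsum("mnk,mk->mn", U, VY)   # Y + eps * U @ (V @ Y)

rng = np.random.default_rng(6)
M, n, k, eps = 4, 50, 10, 0.001                       # eps as in the experiments section

# Stand-ins for encoder outputs (hypothetical; a real encoder would produce these)
mu = rng.standard_normal((M, n))
sigma = rng.uniform(0.5, 1.5, size=(M, n))
U = rng.standard_normal((M, n, k))
V = rng.standard_normal((M, k, n))

Y = reparameterize(mu, sigma, rng)
z = dyadic_apply(Y, U, V, eps)
print("latent sample shape:", z.shape)
```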

4 Related Work

Many recent strategies proposed to improve the flexibility of inference models are based on the concept of normalizing flows, introduced by Rezende and Mohamed (2015) in the context of stochastic variational inference. Members of this family build a flexible variational posterior by starting with a conventional normal distribution for generating the latent variables and then applying a chain of invertible transformations, such as the Householder transformation (Tomczak and Welling, 2016) and the inverse autoregressive transformation (Kingma et al., 2016). Our proposed strategy requires only a single transformation and can be applied to both the encoder and the decoder.

5 Experiments

We conducted experiments on the MNIST dataset to empirically evaluate our approach. MNIST is a dataset of 60,000 training and 10,000 test images of handwritten digits with a resolution of 28 $\times$ 28 pixels (LeCun et al., 1998). The dataset was dynamically binarized as in Salakhutdinov and Murray (2008).

Our model had 50 stochastic units, and the encoder and decoder were each parameterized by a two-layer feed-forward network with 500 units per layer. The model was trained using the ADAM gradient-based optimization algorithm (Kingma and Ba, 2015) with a mini-batch size of 128. For the dyadic transformation matrix B we used a value of 0.001 for ${\epsilon}$.

The results of the experiments are presented in Table 1. They indicate that our proposed strategy obtains competitive lower bounds on the log-likelihood despite its simplicity and low computational requirements. Compared to the VAE, DT adds a computational cost of $O(k^{2.37})$, which is primarily for the determinant calculation. Hence, for smaller values of k, DT adds little computational cost. The memory requirement for DT is $O(kn)$, which is also reasonable for small values of k.

Our idea is fundamentally different from the other strategies for improving VAE since it does not belong to the existing large family of normalizing flow transformations. Thus, it holds promise for creating a new family of strategies for building flexible distributions in the context of stochastic variational inference.

6 Conclusion

We presented Dyadic Transformation, a new transformation that builds a flexible multivariate distribution to enhance variational inference without sacrificing computational tractability. Our simple idea boosts model flexibility with only a single transformation step. The experiments indicate that DT increases VAE performance and that its results are competitive with the family of normalizing flows, which involve multiple levels of transformation. Our transformation can be readily integrated with the methods in this family to build hybrids. Dyadic Transformation can also be applied to the decoder, which may yield further performance gains, and to binary data by modifying a Restricted Boltzmann Machine. These possibilities will be explored in future research.

Bibliography

1. Sarin Chandy, Amin Rasekh, Zachary Barker, and Ehsan Shafiee. Cyberattack detection using deep generative models with variational inference. ASCE Journal of Water Resources Planning and Management, 2019.
2. John Duchi. Derivations for linear algebra and optimization. Berkeley, California, 2007.
3. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
4. Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
5. Geoffrey Hinton and Zoubin Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 352(1358):1177–1190, 1997.
6. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations, 2015.
7. Diederik Kingma and Max Welling. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
8. Diederik Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.