Amortized Inference of Variational Bounds for Learning Noisy-OR

Yiming Yan; Melissa Ailem; Fei Sha

arXiv:1906.02428·cs.LG·October 10, 2019


TL;DR

This paper introduces Amortized Conjugate Posterior (ACP), a hybrid inference method combining classical and modern variational techniques, applied to noisy-OR models, improving inference accuracy and efficiency.

Contribution

The paper proposes ACP, a novel hybrid inference approach that leverages classical conjugate priors with amortized inference, enhancing posterior approximation in noisy-OR models.

Findings

01

ACP outperforms classical methods in accuracy.

02

ACP matches or exceeds modern amortized inference.

03

The approach is effective for noisy-OR models.

Abstract

Classical approaches for approximate inference depend on cleverly designed variational distributions and bounds. Modern approaches employ amortized variational inference, which uses a neural network to approximate any posterior without leveraging the structures of the generative models. In this paper, we propose Amortized Conjugate Posterior (ACP), a hybrid approach that takes advantage of both types of approaches. Specifically, we use the classical methods to derive specific forms of posterior distributions and then learn the variational parameters using amortized inference. We study the effectiveness of the proposed approach on the noisy-or model and compare it to both the classical and the modern approaches for approximate inference and parameter learning. Our results show that the proposed method outperforms or is on par with other approaches.

Equations

$$q(\mathbf{z}\mid\mathbf{x},\bm{\psi})=\frac{1}{Z}\prod_{i:x_{i}=1}\tilde{p}(x_{i}=1\mid\mathbf{z},\psi_{i})\prod_{i:x_{i}=0}p(x_{i}=0\mid\mathbf{z})\prod_{k=0}^{K}p(z_{k})$$

$$Z=\sum_{\mathbf{z}}\prod_{i:x_{i}=1}\tilde{p}(x_{i}=1\mid\mathbf{z},\psi_{i})\prod_{i:x_{i}=0}p(x_{i}=0\mid\mathbf{z})\prod_{k=0}^{K}p(z_{k})$$

$$\begin{aligned}
\tilde{p}(\mathbf{x},\mathbf{z},\bm{\psi})&=\prod_{i:x_{i}=1}\tilde{p}(x_{i}=1\mid\mathbf{z},\psi_{i})\prod_{i:x_{i}=0}p(x_{i}=0\mid\mathbf{z})\prod_{k=0}^{K}p(z_{k})\\
&=\exp\bigg(\sum_{i=1}^{D}\Big(x_{i}\big(\psi_{i}\bm{\theta}_{i}^{T}\mathbf{z}-g(\psi_{i})\big)-(1-x_{i})\bm{\theta}_{i}^{T}\mathbf{z}\Big)\bigg)p(\mathbf{z})\\
&=\exp\bigg(C+\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\sum_{k=0}^{K}\theta_{ik}z_{k}\bigg)p(\mathbf{z})
\end{aligned}$$

$$\begin{aligned}
Z&=\exp(C)\,\mathbb{E}_{p(\mathbf{z})}\Big[\prod_{k=0}^{K}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}z_{k}\Big)\Big]\\
&=\exp(C)\prod_{k=0}^{K}\mathbb{E}_{p(z_{k})}\Big[\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}z_{k}\Big)\Big]\\
&=\exp(C)\prod_{k=0}^{K}\Big[\mu_{k}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}\Big)+(1-\mu_{k})\Big]
\end{aligned}$$

$$\begin{aligned}
q(z_{k}=1\mid\mathbf{x},\bm{\psi})&=\frac{\mu_{k}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}\Big)}{\mu_{k}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}\Big)+(1-\mu_{k})}\\
&=\sigma\Big(\sum_{i:x_{i}=1}\psi_{i}\theta_{ik}-\sum_{i:x_{i}=0}\theta_{ik}+\log\frac{\mu_{k}}{1-\mu_{k}}\Big)
\end{aligned}$$

Taxonomy

Topics: Neural Networks and Applications · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis

Full text

**Amortized Inference of Variational Bounds for Learning Noisy-OR: Supplementary Materials**

1 Derivation of variational posterior

In this section we provide a detailed derivation of the variational posterior

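$$q(\mathbf{z}\mid\mathbf{x},\bm{\psi})=\frac{1}{Z}\prod_{i:x_{i}=1}\tilde{p}(x_{i}=1\mid\mathbf{z},\psi_{i})\prod_{i:x_{i}=0}p(x_{i}=0\mid\mathbf{z})\prod_{k=0}^{K}p(z_{k})\tag{1}$$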

where $Z$ is the normalization term and

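$$Z=\sum_{\mathbf{z}}\prod_{i:x_{i}=1}\tilde{p}(x_{i}=1\mid\mathbf{z},\psi_{i})\prod_{i:x_{i}=0}p(x_{i}=0\mid\mathbf{z})\prod_{k=0}^{K}p(z_{k})\tag{2}$$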

The approximate joint probability $\tilde{p}(\mathbf{x},\mathbf{z},\bm{\psi})$ is

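$$\begin{aligned}
\tilde{p}(\mathbf{x},\mathbf{z},\bm{\psi})&=\prod_{i:x_{i}=1}\tilde{p}(x_{i}=1\mid\mathbf{z},\psi_{i})\prod_{i:x_{i}=0}p(x_{i}=0\mid\mathbf{z})\prod_{k=0}^{K}p(z_{k})\\
&=\exp\bigg(\sum_{i=1}^{D}\Big(x_{i}\big(\psi_{i}\bm{\theta}_{i}^{T}\mathbf{z}-g(\psi_{i})\big)-(1-x_{i})\bm{\theta}_{i}^{T}\mathbf{z}\Big)\bigg)p(\mathbf{z})\\
&=\exp\bigg(C+\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\sum_{k=0}^{K}\theta_{ik}z_{k}\bigg)p(\mathbf{z})
\end{aligned}\tag{3}$$

where $C=-\sum_{i=1}^{D}x_{i}g(\psi_{i})$.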

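The normalization term $Z$ is the marginal likelihood $\tilde{p}(\mathbf{x},\bm{\psi})$, which can be computed in closed form because the exponent is linear in $\mathbf{z}$ and the prior $p(\mathbf{z})$ factorizes over $k$:

$$\begin{aligned}
Z&=\exp(C)\,\mathbb{E}_{p(\mathbf{z})}\Big[\prod_{k=0}^{K}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}z_{k}\Big)\Big]\\
&=\exp(C)\prod_{k=0}^{K}\mathbb{E}_{p(z_{k})}\Big[\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}z_{k}\Big)\Big]\\
&=\exp(C)\prod_{k=0}^{K}\Big[\mu_{k}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}\Big)+(1-\mu_{k})\Big]
\end{aligned}\tag{4}$$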
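The factorization of $Z$ over the latent variables can be checked numerically on a toy problem. The sketch below is illustrative only: it assumes Bernoulli priors $p(z_{k}=1)=\mu_{k}$, uses made-up values for $\mathbf{x}$, $\bm{\psi}$, $\bm{\theta}$ and $\bm{\mu}$, and replaces $g$ with an arbitrary stand-in (`placeholder_g`), since the identity holds for any $g$. All function names are hypothetical.

```python
import itertools
import math


def exponent_weight(x_i, psi_i):
    """Coefficient x_i * psi_i - (1 - x_i) that multiplies theta_i^T z."""
    return x_i * psi_i - (1 - x_i)


def brute_force_Z(x, psi, theta, mu, g):
    """Sum the unnormalized joint over every binary z (exponential cost)."""
    D, K = len(theta), len(mu)
    C = -sum(x[i] * g(psi[i]) for i in range(D))
    total = 0.0
    for z in itertools.product([0, 1], repeat=K):
        expo = sum(
            exponent_weight(x[i], psi[i]) * theta[i][k] * z[k]
            for i in range(D)
            for k in range(K)
        )
        # Factorized Bernoulli prior p(z) = prod_k mu_k^z_k (1 - mu_k)^(1 - z_k).
        prior = math.prod(mu[k] if z[k] else 1.0 - mu[k] for k in range(K))
        total += math.exp(C + expo) * prior
    return total


def factorized_Z(x, psi, theta, mu, g):
    """Closed form: exp(C) * prod_k [mu_k * exp(a_k) + (1 - mu_k)]."""
    D, K = len(theta), len(mu)
    C = -sum(x[i] * g(psi[i]) for i in range(D))
    result = math.exp(C)
    for k in range(K):
        a_k = sum(exponent_weight(x[i], psi[i]) * theta[i][k] for i in range(D))
        result *= mu[k] * math.exp(a_k) + (1.0 - mu[k])
    return result


# Toy values chosen only for illustration (not from the paper).
x = [1, 0, 1]
psi = [0.7, 1.3, 0.4]
theta = [[0.2, 1.0], [0.5, 0.1], [0.9, 0.3]]  # theta[i][k]
mu = [0.3, 0.6]
placeholder_g = lambda p: 0.5 * p * p  # arbitrary stand-in, NOT the paper's g

z_brute = brute_force_Z(x, psi, theta, mu, placeholder_g)
z_fact = factorized_Z(x, psi, theta, mu, placeholder_g)
```

The brute-force sum costs $O(2^{K})$ terms, while the factorized form needs only one pass over the $D\times K$ weights, which is what makes the conjugate form usable at scale.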

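Substituting eqs. (3) and (4) into eq. (1), we obtain the variational posterior

$$\begin{aligned}
q(z_{k}=1\mid\mathbf{x},\bm{\psi})&=\frac{\mu_{k}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}\Big)}{\mu_{k}\exp\Big(\sum_{i=1}^{D}\big(x_{i}\psi_{i}-(1-x_{i})\big)\theta_{ik}\Big)+(1-\mu_{k})}\\
&=\sigma\Big(\sum_{i:x_{i}=1}\psi_{i}\theta_{ik}-\sum_{i:x_{i}=0}\theta_{ik}+\log\frac{\mu_{k}}{1-\mu_{k}}\Big)
\end{aligned}\tag{5}$$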
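As a minimal sketch of how the resulting posterior might be evaluated, the code below computes $q(z_{k}=1\mid\mathbf{x},\bm{\psi})$ in both the ratio form and the logistic form and lets the two be compared. The function names and toy inputs are assumptions made for illustration; in ACP the values of $\bm{\psi}$ would come from the inference network rather than being fixed by hand.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def posterior_logistic(x, psi, theta, mu):
    """q(z_k = 1 | x, psi) via the sigmoid form of the posterior."""
    D = len(theta)
    q = []
    for k in range(len(mu)):
        on = sum(psi[i] * theta[i][k] for i in range(D) if x[i] == 1)
        off = sum(theta[i][k] for i in range(D) if x[i] == 0)
        prior_logit = math.log(mu[k] / (1.0 - mu[k]))
        q.append(sigmoid(on - off + prior_logit))
    return q


def posterior_ratio(x, psi, theta, mu):
    """Same quantity via mu_k e^{a_k} / (mu_k e^{a_k} + 1 - mu_k)."""
    D = len(theta)
    q = []
    for k in range(len(mu)):
        a_k = sum((x[i] * psi[i] - (1 - x[i])) * theta[i][k] for i in range(D))
        num = mu[k] * math.exp(a_k)
        q.append(num / (num + (1.0 - mu[k])))
    return q


# Toy values chosen only for illustration (not from the paper).
x = [1, 0, 1]
psi = [0.7, 1.3, 0.4]
theta = [[0.2, 1.0], [0.5, 0.1], [0.9, 0.3]]  # theta[i][k]
mu = [0.3, 0.6]

q_logistic = posterior_logistic(x, psi, theta, mu)
q_ratio = posterior_ratio(x, psi, theta, mu)
```

The logistic form is usually preferable in practice because it avoids exponentiating large sums before normalizing.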

2 Implementation details

All our experiments were performed using the Adam optimizer [kingma2014adam] with a batch size of $128$. During training, we use $L=10$ Monte Carlo samples per data point to compute the ELBO. We rely on the Gumbel-softmax reparametrization trick [jang2016categorical], which approximates samples of the discrete latent variables $\mathbf{z}$ with continuous values so that gradients can be back-propagated. Following [jang2016categorical], we use an exponential temperature decay schedule with an initial temperature of $0.5$ and a minimum temperature of $0.2$. During testing, we instead draw $100$ true discrete samples from the posterior to compute the ELBO. For ACP, the variational parameter $\psi$ is the output of a neural network and is constrained to be greater than [math]; we therefore use a softplus layer as the last layer of the network. The architecture of the inference model (number of hidden layers and hidden dimensions) for both AVI and ACP, as well as the other hyperparameters (learning rate, momentum, temperature decay rate and temperature decay step), are sampled randomly $100$ times, and we report only the result with the best hyperparameters. All experimental results are averaged over $5$ different random initializations.
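The training-time ingredients described above can be sketched in a few lines. This is not the authors' code: it assumes the two-class special case of Gumbel-softmax (a relaxed Bernoulli, where the difference of two Gumbel noises is a logistic noise), an exponential schedule of the form $\tau_{t}=\max(\tau_{\min},\tau_{0}e^{-rt})$, and a hypothetical decay rate `decay_rate` (the source treats the rate as a tuned hyperparameter and does not report it).

```python
import math
import random


def temperature(step, tau0=0.5, tau_min=0.2, decay_rate=1e-3):
    """Exponential decay from tau0, floored at tau_min (decay_rate is assumed)."""
    return max(tau_min, tau0 * math.exp(-decay_rate * step))


def softplus(a):
    """log(1 + exp(a)), computed stably; positive in exact arithmetic."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def stable_sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def relaxed_bernoulli(q, tau, rng):
    """Continuous relaxation of z ~ Bernoulli(q) at temperature tau."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)
    u = min(max(rng.random(), eps), 1.0 - eps)
    logit = math.log(q) - math.log1p(-q)
    # Difference of two Gumbel(0, 1) variables is Logistic(0, 1).
    logistic_noise = math.log(u) - math.log1p(-u)
    return stable_sigmoid((logit + logistic_noise) / tau)


rng = random.Random(0)
tau = temperature(step=0)
samples = [relaxed_bernoulli(0.8, tau, rng) for _ in range(1000)]
```

Lower temperatures push the relaxed samples toward 0 or 1, at the price of higher-variance gradients, which is why the schedule starts warm and is floored at a minimum value.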

3 Experiments

3.1 Parameter Estimation

Fig. 1 shows the parameters recovered by LB-CDI and SVI. Even with sufficient training data ($N_{train}=1000$), both methods yield poor estimates: each learns the parameter patterns to some extent, but the patterns are merged together. Hence we conclude that ACP and AVI achieve better parameter estimation than the two non-amortized methods when sufficient training data is available.

Additionally, we ran the parameter estimation experiments on the multi-mnist dataset; the results are depicted in Fig. 2. Since the multi-mnist training set is large, we did not run LB-CDI.

A similar phenomenon is observed in Fig. 2. With a large amount of training data, both AVI and ACP (Fig. 2(a) and 2(b)) recover the parameters well. AVI does not capture pattern "1", but separating patterns "1" and "7" in this dataset is indeed not trivial. SVI, however, does not recover the parameters well.

When we reduce the amount of training data, the number of patterns detected by AVI drops substantially: three weight patterns are recovered as "0", which also indicates worse latent representation learning. ACP, in contrast, confuses patterns "4" and "5" but recovers all other patterns, even with a small amount of training data.

4 Additional experiments

4.1 Document classification

Here we assess the impact of our inference method on the representations learned by the noisy-or model. In particular, we use a document classification task to evaluate the quality of the features learned by our model. To this end, we use the Reuters corpus from NLTK (https://www.nltk.org/book/ch02.html), which consists of $1.3$ million words and $10,788$ news articles organized into $90$ categories. For this experiment, we retain the top $3$ categories (the $3$ classes containing the most documents), namely acq, earn and money-fx. Each document is represented by its headline. We lemmatize the words, remove stop words, and remove words with fewer than $5$ occurrences. We obtain a final corpus of $839$ unique words and $7030$ documents, including $5048$ for training and $1982$ for testing. As in topic modeling, each document is represented by a binary vector in which each dimension indicates the presence or absence of a word.
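The binary presence/absence representation described above can be sketched as follows. The sketch assumes that tokenization, lemmatization and stop-word removal have already been applied upstream, and that the frequency threshold counts total occurrences across the corpus (the source does not say whether it is a token count or a document count). The function names and the toy headlines are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=5):
    """Keep words that occur at least min_count times across the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return sorted(w for w, c in counts.items() if c >= min_count)


def binarize(tokenized_docs, vocab):
    """Map each document to a 0/1 vector marking which vocab words it contains."""
    index = {w: j for j, w in enumerate(vocab)}
    vectors = []
    for doc in tokenized_docs:
        vec = [0] * len(vocab)
        for tok in doc:
            j = index.get(tok)
            if j is not None:
                vec[j] = 1  # presence only; repeated words do not add up
        vectors.append(vec)
    return vectors


# Toy headlines, already tokenized and lemmatized (illustrative only).
docs = [
    ["oil", "price", "rise"],
    ["oil", "price", "fall"],
    ["bank", "rate", "rise"],
]
vocab = build_vocab(docs, min_count=2)  # small threshold for the toy corpus
vectors = binarize(docs, vocab)
```

Each row of `vectors` plays the role of one observation $\mathbf{x}^{(n)}$ for the noisy-or model, with one dimension per retained word.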

After training AVI and ACP, we take the approximate posterior distribution $\big\{q(z_{k}^{(n)}=1\mid\mathbf{x}^{(n)};\bm{\phi})\big\}_{k=1}^{K}$ as the latent representation of document $\mathbf{x}^{(n)}$. We evaluate the quality of the learned representations on the test set. More specifically, we train a linear multilabel classifier that takes the posterior distribution as input and predicts the document classes. We perform $5$-fold cross-validation and report the average EM scores.

Fig. 3 shows the classification performance for different amounts of training data and different dimensionalities of the latent variables. The black dashed line corresponds to the results obtained when performing classification in the original space $\mathbf{X}$. We notice that with a training set of more than $1000$ documents, AVI achieves higher classification accuracy owing to its larger inference capacity and flexibility. However, its performance drops quickly as we reduce the size of the training set. In contrast, our ACP inference is more stable w.r.t. the number of training examples and reaches higher classification performance on smaller training sets.

Fig. 4, 5 and 6 present the t-SNE visualizations of the approximate posterior distributions learned by each model using 50, 100 and 150 hidden dimensions, respectively. We observe that with a small training set (middle and right columns), the acq and money-fx features learned by AVI tend to fuse together, while with ACP we can still distinguish the three categories. This observation confirms our previous results and claims about the effectiveness of our model when training data is scarce.