Amortized Inference of Variational Bounds for Learning Noisy-OR
Yiming Yan, Melissa Ailem, Fei Sha

TL;DR
This paper introduces Amortized Conjugate Posterior (ACP), a hybrid inference method combining classical and modern variational techniques, applied to noisy-OR models, improving inference accuracy and efficiency.
Contribution
The paper proposes ACP, a novel hybrid inference approach that leverages classical conjugate priors with amortized inference, enhancing posterior approximation in noisy-OR models.
Findings
ACP outperforms classical methods in accuracy.
ACP matches or exceeds modern amortized inference.
The approach is effective for noisy-OR models.
Abstract
Classical approaches for approximate inference depend on cleverly designed variational distributions and bounds. Modern approaches employ amortized variational inference, which uses a neural network to approximate any posterior without leveraging the structure of the generative model. In this paper, we propose Amortized Conjugate Posterior (ACP), a hybrid approach that takes advantage of both types of approaches. Specifically, we use the classical methods to derive specific forms of posterior distributions and then learn the variational parameters using amortized inference. We study the effectiveness of the proposed approach on the noisy-or model and compare it to both the classical and the modern approaches for approximate inference and parameter learning. Our results show that the proposed method outperforms or is on par with the other approaches.
Taxonomy
Topics: Neural Networks and Applications · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis
**Amortized Inference of Variational Bounds for Learning Noisy-OR
Supplementary Materials**
1 Derivation of variational posterior
In this section we provide a detailed derivation of the variational posterior.
[TABLE]
where is the normalization term and
[TABLE]
The approximate joint probability is
[TABLE]
where .
The normalization term is the marginal likelihood , which can be computed as
[TABLE]
Substituting eqs. (3) and (4) into eq. (1), we obtain the variational posterior
[TABLE]
2 Implementation details
All our experiments were performed using the Adam optimizer [kingma2014adam] with a batch size of . During training, we set the number of Monte Carlo samples to for each data point to compute the ELBO. We rely on the Gumbel-softmax reparametrization trick [jang2016categorical], which approximates samples of the discrete latent variables with continuous values so that gradients can be back-propagated. Following [jang2016categorical], we use an exponential temperature decay schedule, with the initial temperature set to and the minimum temperature set to . During testing, we instead draw true discrete samples from the posterior and sample times to compute the ELBO. For ACP, the variational parameter is the output of a neural network and is constrained to be greater than [math]; we therefore use a softplus layer as the last layer of the network. The architecture (number of hidden layers and hidden dimensions) of the inference model for both AVI and ACP, as well as the other hyperparameters (learning rate, momentum, temperature decay rate and temperature decay step), are sampled randomly times, and we report only the results obtained with the best hyperparameters. All experimental results are averaged over different random initializations.
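The training-time relaxation, the test-time discrete sampling and the temperature schedule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes binary latent variables, for which the Gumbel-softmax sample reduces to a sigmoid of the logit plus logistic noise, and the names `temperature`, `relaxed_bernoulli`, `hard_bernoulli` and `positive_param`, together with every default value (`tau0`, `tau_min`, `rate`, `lower`), are hypothetical placeholders, since the values used in the experiments are not recoverable from the text.

```python
import numpy as np

def sigmoid(a):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(a, dtype=np.float64)))

def temperature(step, tau0=1.0, tau_min=0.5, rate=1e-4):
    # Exponential decay clipped at a floor. tau0, tau_min and rate are
    # placeholder values, not the ones used in the paper.
    return max(tau_min, tau0 * float(np.exp(-rate * step)))

def relaxed_bernoulli(logits, tau, rng):
    # Binary Gumbel-softmax sample: perturb the logits with logistic noise
    # (the difference of two Gumbel variables) and apply a tempered sigmoid.
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(logits))
    noise = np.log(u) - np.log1p(-u)
    return sigmoid((np.asarray(logits) + noise) / tau)

def hard_bernoulli(logits, rng):
    # True discrete sample from the same posterior, as used at test time.
    p = sigmoid(logits)
    return (rng.uniform(size=p.shape) < p).astype(np.float64)

def positive_param(raw, lower=0.0):
    # Softplus output layer: maps any real value to a value above `lower`.
    # `lower` stands in for the bound that is elided in the text.
    return lower + np.logaddexp(0.0, raw)

rng = np.random.default_rng(0)
logits = np.zeros((4, 3))                      # toy posterior logits
z_train = relaxed_bernoulli(logits, temperature(step=0), rng)
z_test = hard_bernoulli(logits, rng)
```

As the temperature decreases, the relaxed samples concentrate near 0 and 1, which is why the schedule starts warm (smoother gradients) and decays towards a floor.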
3 Experiments
3.1 Parameter Estimation
Fig. 1 shows the parameters recovered by LB-CDI and SVI. Even with sufficient training data (), both methods yield poor estimates. Both are able to learn the parameter patterns to some extent, but the patterns are merged together. Hence we conclude that ACP and AVI achieve better parameter estimation than the two non-amortized methods when sufficient training data is available.
Additionally, we ran the parameter estimation experiments on the multi-mnist dataset; the results are shown in Fig. 2. Since the multi-mnist training set is large, we did not run LB-CDI.
A similar phenomenon is observed in Fig. 2. With a large amount of training data, both AVI and ACP (Fig. 2(a) and 2(b)) recover the parameters well. Although AVI did not capture pattern , separating patterns and is indeed not trivial in this dataset. SVI, however, did not recover the parameters well.
When we reduce the amount of training data, the number of patterns detected by AVI drops substantially: three weight patterns are recovered as , which also indicates worse latent representation learning. ACP, in contrast, confuses patterns and but recovers all other patterns, even with a small amount of training data.
4 Additional experiments
4.1 Document classification
Here, we assess the impact of our inference method on the representations learned by the noisy-or model. In particular, we rely on a document classification task to evaluate the quality of the features learned by our model. To this end, we use the Reuters corpus from NLTK (https://www.nltk.org/book/ch02.html), which consists of million words and news articles organized into categories. For this experiment, we retain the top categories (the 3 classes containing the most documents), namely acq, earn and money-fx. Each document is represented by its headline. We lemmatize the words, remove stop words, and remove words with fewer than occurrences. We obtain a final corpus of unique words and documents, including for training and for test. As in topic modeling, each document is represented by a binary vector in which each dimension indicates the presence or absence of a word.
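The binary presence/absence representation can be illustrated with a short sketch. It assumes the headlines are already tokenized, lemmatized and stripped of stop words; the function name `binary_bow` and the threshold `min_count=2` are hypothetical, since the occurrence cutoff used in the experiments is not recoverable from the text.

```python
from collections import Counter
import numpy as np

def binary_bow(docs, min_count=2):
    # docs: list of token lists (already lemmatized, stop words removed).
    # min_count is a placeholder threshold, not the paper's value.
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)), dtype=np.int8)
    for i, doc in enumerate(docs):
        for tok in doc:
            j = index.get(tok)
            if j is not None:
                X[i, j] = 1      # presence only; repeated words still give 1
    return X, vocab

# Toy headlines, purely illustrative.
docs = [["bank", "raise", "rate"],
        ["bank", "profit", "rise"],
        ["rate", "cut", "rate"]]
X, vocab = binary_bow(docs)
# vocab == ["bank", "rate"]; X == [[1, 1], [1, 0], [0, 1]]
```

Each row of `X` then plays the role of the observed binary vector for one document in the noisy-or model.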
After training AVI and ACP, we take the approximate posterior distribution $\big\{q(z_{k}^{(n)}=1 \mid \mathbf{x}^{(n)};\bm{\phi})\big\}_{k=1}^{K}$ as the latent representation of document . We evaluate the quality of the learned representations on the test set. More specifically, we train a linear multilabel classifier that takes the posterior distribution as input and predicts the document classes. We perform -fold cross-validation and report the average EM scores.
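The evaluation protocol (posterior probabilities as features, a linear classifier, cross-validated scores) can be sketched as follows. This is only an illustration under stated assumptions: the classifier is a one-vs-rest ridge regression standing in for whatever linear model was actually used, the score is plain accuracy rather than the EM score reported in the paper, and `kfold_linear_accuracy`, `n_folds`, `ridge` and the synthetic data are all hypothetical.

```python
import numpy as np

def kfold_linear_accuracy(Q, y, n_folds=5, ridge=1e-3, seed=0):
    # Q: (N, K) posterior probabilities q(z_k = 1 | x), used as features.
    # y: (N,) integer class labels. n_folds and ridge are placeholders.
    Q = np.asarray(Q, dtype=np.float64)
    y = np.asarray(y)
    n_classes = int(y.max()) + 1
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    A = np.hstack([Q, np.ones((len(y), 1))])     # append a bias column
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        T = np.eye(n_classes)[y[train]]          # one-hot targets
        G = A[train].T @ A[train] + ridge * np.eye(A.shape[1])
        W = np.linalg.solve(G, A[train].T @ T)   # ridge least-squares fit
        pred = np.argmax(A[test] @ W, axis=1)
        scores.append(np.mean(pred == y[test]))
    return float(np.mean(scores))

# Synthetic, well-separated "posteriors" for three classes.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 20)
Q = 0.1 + 0.05 * rng.uniform(size=(60, 3))
Q[np.arange(60), y] += 0.8
acc = kfold_linear_accuracy(Q, y)
```

Keeping the classifier linear means that the score mostly reflects how well the inference network separates the classes, not the capacity of the classifier itself.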
Fig. 3 shows the classification performance for different amounts of training data and different dimensionalities of the latent variables. The black dashed line corresponds to the results obtained when performing classification on the original space . We notice that when using a training set of more than documents, AVI achieves higher classification accuracy owing to its larger inference capacity and flexibility. However, its performance drops quickly as we reduce the size of the training set. In contrast, our ACP inference is more stable w.r.t. the number of training examples, and reaches higher classification performance when using smaller training sets.
Figs. 4, 5 and 6 present the t-SNE visualizations of the approximate posterior distributions learned by each model using 50, 100 and 150 hidden dimensions respectively. We observe that when using a small training set (middle and right columns), the acq and money-fx features learned by AVI tend to fuse together, while with ACP we can still distinguish the three categories. This observation is consistent with our previous results and claims about the effectiveness of our model when training data is scarce.
