Learning Correlated Latent Representations with Adaptive Priors
Da Tang, Dawen Liang, Nicholas Ruozzi, Tony Jebara

TL;DR
This paper introduces ACVAEs, an advanced VAE model with adaptive priors that better capture correlations in data, leading to improved performance in link prediction and clustering tasks.
Contribution
We propose ACVAEs, which use adaptive priors to effectively learn correlated latent representations and enable tractable joint distributions, overcoming limitations of previous CVAEs.
Findings
ACVAEs outperform CVAEs in link prediction.
ACVAEs achieve better hierarchical clustering results.
Adaptive priors improve correlation modeling.
| Method | Epinions | Citation | LibraryThing |
|---|---|---|---|
| vae | | | |
| GraphSAGE | | | |
| cvae | | | |
| cvae | | | |
| acvae | | | |
| acvae | | | |
| acvae | | | |
| acvae | | | |

| Method | MI Scores |
|---|---|
| GraphSAGE | |
| cvae | |
| cvae | |
| acvae | |
| acvae | |

| Dataset | acvae ELBO | acvae NCRR | acvae ELBO | acvae NCRR |
|---|---|---|---|---|
| Epinions | -31.8 | 0.034 | -31.9 | 0.031 |
| Epinions | -36.4 | 0.031 | -38.3 | 0.035 |
| Epinions | -61.3 | 0.028 | -119 | 0.034 |
| Epinions | -674 | 0.028 | -1535 | 0.037 |
| Citation | -7.48 | 0.126 | -7.48 | 0.124 |
| Citation | -7.91 | 0.113 | -8.59 | 0.121 |
| Citation | -24.4 | 0.112 | -49.2 | 0.120 |
| Citation | -184 | 0.099 | -288 | 0.054 |

| Dataset | acvae ELBO | acvae NCRR | acvae ELBO | acvae NCRR | cvae ELBO | cvae NCRR |
|---|---|---|---|---|---|---|
| Citation | -7.47 | 0.012 | -7.48 | 0.010 | -7.48 | 0.011 |
| Citation | -7.88 | 0.031 | -8.51 | 0.025 | -8.49 | 0.023 |
| Citation | -23.8 | 0.043 | -47.9 | 0.037 | -42.3 | 0.042 |
| Citation | -183 | 0.049 | -286 | 0.039 | -267 | 0.058 |

| Method | NCRR |
|---|---|
| vae | 0.002 |
| GraphSAGE | 0.002 |
| acvae | 0.076 |
| acvae | 0.073 |
Learning Correlated Latent Representations with Adaptive Priors
Da Tang
Columbia University
Dawen Liang
Netflix Inc.
Nicholas Ruozzi
The University of Texas at Dallas
Tony Jebara
Columbia University & Spotify Inc.
Abstract
Variational Auto-Encoders (vaes) have been widely applied for learning compact, low-dimensional latent representations of high-dimensional data. When the correlation structure among data points is available, previous work proposed Correlated Variational Auto-Encoders (cvaes), which employ a structured mixture model as prior and a structured variational posterior for each mixture component to enforce that the learned latent representations follow the same correlation structure. However, as we demonstrate in this work, such a choice cannot guarantee that cvaes capture all the correlations. Furthermore, it prevents us from obtaining a tractable joint and marginal variational distribution. To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (acvaes), which apply an adaptive prior distribution that can be adjusted during training and can learn a tractable joint variational distribution. Its tractable form also enables further refinement with belief propagation. Experimental results on link prediction and hierarchical clustering show that acvaes significantly outperform cvaes among other benchmarks.
1 INTRODUCTION
Variational Auto-Encoders (vaes) [13, 23] are a family of deep generative models that learn latent embeddings for data. By applying variational inference on the latent variables, vaes learn a stochastic mapping from high-dimensional data to low-dimensional representations, which can be used for many downstream tasks, including classification, regression, and clustering.
vaes assume the data points are generated independently and treat the model and posterior approximations as factorized over data points. However, if we know a priori that there is structured correlation between the data points, e.g., for graph-structured datasets [24, 3, 8, 27], correlated variational approximations can help. Tang et al. [27] proposed Correlated Variational Auto-Encoders (cvaes), which take this kind of correlation structure as auxiliary information to guide the variational approximations for the latent embeddings by constructing a prior from a uniform mixture of tractable distributions on maximal acyclic subgraphs of the given undirected correlation graph.
However, there are several limitations that potentially prevent cvaes from learning better correlated latent embeddings. First, some of the maximal acyclic subgraphs of the given graph may, by themselves, capture the correlation between the data points well, while others may capture it poorly. As a result, taking a uniform average may yield a sub-optimal result. Second, while the prior in cvaes is over multiple subgraphs, each subgraph has its own joint variational distribution, and there is no single global joint variational distribution over the latent variables. cvaes do learn pairwise variational approximation functions, but these are not exact pairwise marginal variational distributions on the latent variables. As a result, applying these variational approximation functions to some downstream tasks, e.g., link prediction, may result in poor performance due to the inexact approximations. In addition, cvaes require a pre-processing step that takes time cubic in the number of vertices, which limits their applicability to smaller datasets.
To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (acvaes), which choose a non-uniform average over tractable distributions on the maximal acyclic subgraphs as a prior. This prior is adaptive and is adjusted during optimization. To learn the mixture weights, we provide two options, empirical Bayes or saddle-point optimization, both of which maximize the objective with respect to the model and variational parameters. The difference is that while empirical Bayes also maximizes the objective with respect to the prior structure, saddle-point optimization seeks to optimize the objective under the worst prior for more robust inference. In both cases, the non-uniform average converges to a tractable prior on a single graph, which ensures that we obtain a holistic tractable joint variational distribution. With this variational distribution, we obtain exact marginal evaluation using exact inference algorithms, e.g., belief propagation. Moreover, acvaes do not require the cubic-time pre-processing step embedded in cvaes, and they are generally faster for evaluation in practice. We demonstrate the superior empirical performance of acvaes for link prediction and hierarchical clustering on various real datasets.
2 VAES WITH CORRELATIONS
In this section, we provide a brief overview of Variational Auto-Encoders (vaes) [13, 23] as well as Correlated Variational Auto-Encoders (cvaes) [27], which take the correlation structure among data points into consideration.
2.1 Variational Auto-Encoders
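We use a latent variable model to fit data $x = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^D$. The model assumes that there exist low-dimensional latent embeddings $z = \{z_1, \dots, z_n\} \subseteq \mathbb{R}^d$, one for each data point ($d \ll D$), which come from a prior distribution $p_0(z)$, and the $x_i$'s are drawn conditionally independently given $z_i$. Denote the model parameters as $\theta$. The likelihood of this model is $p_\theta(x) = \int p_0(z) \prod_{i=1}^n p_\theta(x_i \mid z_i)\, dz$.
To simultaneously learn the model parameters $\theta$ as well as a mapping from the observed data $x$ to the latent embeddings $z$, Variational Auto-Encoders (vaes) [13, 23] apply a data-dependent variational approximation $q_\lambda(z \mid x) = \prod_{i=1}^n q_\lambda(z_i \mid x_i)$, where $\lambda$ denotes the variational parameters, and maximize the evidence lower-bound (ELBO) on the log-likelihood of the data, $\log p_\theta(x)$:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\lambda(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\lambda(z \mid x)\,\|\,p_0(z)\right) = \sum_{i=1}^n \left(\mathbb{E}_{q_\lambda(z_i \mid x_i)}\left[\log p_\theta(x_i \mid z_i)\right] - \mathrm{KL}\left(q_\lambda(z_i \mid x_i)\,\|\,p_0(z_i)\right)\right) := \mathcal{L}^{\text{vae}}(\lambda, \theta). \tag{1}$$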
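As an illustration of the ELBO above, the following is a minimal NumPy sketch of a single-sample estimate for one data point. It assumes a diagonal-Gaussian variational distribution, a standard normal prior, and a Bernoulli likelihood; the function names and the `decode` callable are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def bernoulli_log_likelihood(x, logits):
    """log p(x | z) for binary x, given Bernoulli logits (numerically stable)."""
    # log sigmoid(l) = -log(1 + exp(-l)); log(1 - sigmoid(l)) = -log(1 + exp(l))
    return np.sum(-x * np.logaddexp(0.0, -logits) - (1.0 - x) * np.logaddexp(0.0, logits))

def elbo_estimate(x, mu, log_var, decode, rng):
    """One-sample ELBO estimate: E_q[log p(x|z)] - KL(q || p0).

    `decode` is an assumed callable mapping a latent vector to Bernoulli logits.
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps  # reparameterization trick
    return bernoulli_log_likelihood(x, decode(z)) - gaussian_kl_to_standard_normal(mu, log_var)
```

In practice the encoder producing `mu` and `log_var` and the decoder would be neural networks trained by stochastic gradient ascent on this quantity.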
2.2 Correlated Variational Auto-Encoders
Standard vaes are capable of learning compact low-dimensional embeddings for high-dimensional data. However, due to the i.i.d. assumption, they fail to account for the correlations between data points when a priori we know such correlations exist. Correlated Variational Auto-Encoders (cvaes) [27] mitigate the issue by employing a structured prior as well as a structured variational posterior.
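Formally, assume we are given an undirected correlation graph $G = (V, E)$ with $V = \{v_1, \dots, v_n\}$, where $(v_i, v_j) \in E$ represents that the data points $x_i$ and $x_j$ are correlated. cvaes apply a correlated prior $p_0^{\text{corr}}(z)$ on the latent variables $z_i$'s which satisfies
$$p_0^{\text{corr}}(z_i) = p_0(z_i) \ \ \text{for all } v_i \in V, \qquad p_0^{\text{corr}}(z_i, z_j) = p_0(z_i, z_j) \ \ \text{if } (v_i, v_j) \in E. \tag{2}$$
Here $p_0(\cdot)$ and $p_0(\cdot, \cdot)$ are parameter-free functions that capture the singleton and pairwise marginal distributions of the latent variables. For example, we can set $p_0(\cdot)$ to be the density of a standard multivariate normal distribution and $p_0(\cdot, \cdot)$ to be a multivariate normal density that has high values if the two inputs are close to each other. With such a prior, we again assume the $x_i$'s are drawn conditionally independently given $z_i$. When $G$ is acyclic, such a prior does exist [30]:
$$p_0^{\text{corr}}(z) = \prod_{i=1}^n p_0(z_i) \prod_{(v_i, v_j) \in E} \frac{p_0(z_i, z_j)}{p_0(z_i)\, p_0(z_j)}. \tag{3}$$
However, when $G$ is not acyclic, Eq. 3 is not necessarily a valid probability density function. To deal with this issue, cvaes propose constructing a prior that is a mixture over the set of all of $G$'s maximal acyclic subgraphs, which are defined as follows.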
**Definition 1 (Maximal acyclic subgraph).**
For an undirected graph $G = (V, E)$, an acyclic subgraph $G' = (V', E')$ is a maximal acyclic subgraph of $G$ if:
- $V' = V$, i.e., $G'$ contains all vertices of $G$.
- Adding any edge from $E \setminus E'$ to $E'$ will create a cycle in $G'$.
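The definition can be checked mechanically. The sketch below is an illustration under stated assumptions, not code from the paper: vertices are labelled `0..n-1`, graphs are given as edge lists, and the helper names are hypothetical. It relies on the fact that a subgraph on all vertices is maximal acyclic exactly when it is acyclic and has as many edges as a spanning forest of $G$.

```python
def _find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def spanning_forest(n, edges):
    """Greedily keep every edge that joins two different components."""
    parent = list(range(n))
    forest = []
    for u, v in edges:
        ru, rv = _find(parent, u), _find(parent, v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest

def is_maximal_acyclic_subgraph(n, edges, sub_edges):
    """Check the two conditions of the definition for an edge subset of G."""
    edge_set = {frozenset(e) for e in edges}
    if any(frozenset(e) not in edge_set for e in sub_edges):
        return False  # not a subgraph of G
    if len(spanning_forest(n, sub_edges)) != len(sub_edges):
        return False  # the greedy pass dropped an edge, so there is a cycle
    # Acyclic and as large as a spanning forest of G: no edge can be added.
    return len(sub_edges) == len(spanning_forest(n, edges))
```

For a 4-cycle, any three of its edges form a maximal acyclic subgraph, while two edges (not maximal) or all four (cyclic) do not.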
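Each of the maximal acyclic subgraphs partially approximates the correlation structure in $G$, and cvaes set the prior to be the uniform average over all of these tractable densities:
$$p_0^{\text{corr}_g}(z) = \frac{1}{|\mathcal{A}_G|} \sum_{G' \in \mathcal{A}_G} p_0^{G'}(z), \tag{4}$$
where $\mathcal{A}_G$ is the set of all maximal acyclic subgraphs of $G$ and $p_0^{G'}(z)$ is a prior on a maximal acyclic subgraph $G'$ with the same form as in Eq. 3. For each $G'$, we can similarly define a structured variational approximation $q_\lambda^{G'}(z \mid x)$ following the form of Eq. 3 (see Appendix A for details). With this structured prior and variational posterior, cvaes optimize a different ELBO:
$$\log p_\theta(x) \ge \frac{1}{|\mathcal{A}_G|} \sum_{G' \in \mathcal{A}_G} \left(\mathbb{E}_{q_\lambda^{G'}(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\lambda^{G'}(z \mid x)\,\|\,p_0^{G'}(z)\right)\right) := \mathcal{L}^{\text{cvae}}(\lambda, \theta). \tag{5}$$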
Empirically, Tang et al. [27] show that cvaes are capable of capitalizing on the correlation structure as auxiliary information when learning latent embeddings. However, as discussed in the introduction, there are a few limitations to this approach. In the next section, we propose fixes to all of these limitations.
3 ADAPTIVE CORRELATED VAES
3.1 A Non-uniform Mixture Prior
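As motivated in Section 2.2, rather than using a uniform average, we instead employ a categorical distribution $\pi$ representing the normalized weights over all maximal acyclic subgraphs of $G$. In the ELBO in Eq. 5, we can replace the uniform average in the prior in Eq. 4 with the non-uniform distribution $\pi$, which gives us the following ELBO:
$$\log p_\theta(x) = \log \mathbb{E}_{p_0^{\pi}(z)}\left[p_\theta(x \mid z)\right] \ge \sum_{G' \in \mathcal{A}_G} \pi(G') \left(\mathbb{E}_{q_\lambda^{G'}(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\lambda^{G'}(z \mid x)\,\|\,p_0^{G'}(z)\right)\right) := \mathcal{L}(\lambda, \theta, \pi). \tag{6}$$
Here we define the non-uniform prior $p_0^{\pi}(z) = \sum_{G' \in \mathcal{A}_G} \pi(G')\, p_0^{G'}(z)$, where $\mathcal{A}_G$ denotes the set of all maximal acyclic subgraphs of $G$. From the above inequality we can see that, using the non-uniform prior $p_0^{\pi}$, we are still able to obtain a lower bound of the log-likelihood $\log p_\theta(x)$, which is now also parametrized by the weight parameter $\pi$. If we optimize $\pi$ together with all the other parameters, the above loss function implies that we are optimizing with an adaptive prior. Hence, we call the above model Adaptive Correlated Variational Auto-Encoders (acvaes). If we replace $\pi$ with a uniform distribution over all subgraphs in $\mathcal{A}_G$, we recover cvaes.
Plugging $p_0^{G'}$ and $q_\lambda^{G'}$ from Section 2.2 into Eq. 6 yields the following ELBO for acvaes:
$$\mathcal{L}(\lambda, \theta, \pi) = \sum_{i=1}^n \left(\mathbb{E}_{q_\lambda(z_i \mid x_i)}\left[\log p_\theta(x_i \mid z_i)\right] - \mathrm{KL}\left(q_\lambda(z_i \mid x_i)\,\|\,p_0(z_i)\right)\right) - \sum_{(v_i, v_j) \in E} w^{\text{MAS}}_{G, \pi, (v_i, v_j)} \left(\mathrm{KL}\left(q_\lambda(z_i, z_j \mid x_i, x_j)\,\|\,p_0(z_i, z_j)\right) - \mathrm{KL}\left(q_\lambda(z_i \mid x_i)\,\|\,p_0(z_i)\right) - \mathrm{KL}\left(q_\lambda(z_j \mid x_j)\,\|\,p_0(z_j)\right)\right). \tag{7}$$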
Similar to cvaes, we have edge weights $w^{\text{MAS}}_{G, \pi, e}$ representing the expected appearance probability for edge $e$ over the set of maximal acyclic subgraphs given the distribution $\pi$. In the following definition, we abusively write $\pi(G')$ as the probability of $G'$ being sampled from $\pi$.
**Definition 2 (Non-uniform maximal acyclic subgraph edge weight).**
For an undirected graph $G = (V, E)$, an edge $e \in E$ and a distribution $\pi$ on the set of maximal acyclic subgraphs of $G$, define $w^{\text{MAS}}_{G, \pi, e}$ to be the expected appearance probability of the edge $e$ in a random maximal acyclic subgraph $G' = (V, E') \sim \pi$, i.e., $w^{\text{MAS}}_{G, \pi, e} = \sum_{G'} \pi(G')\, \mathbb{1}[e \in E']$.
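For a graph small enough to enumerate its maximal acyclic subgraphs, the edge weights of Definition 2 can be computed directly. The sketch below is illustrative only; the function name and the representation of subgraphs as edge lists are assumptions.

```python
def edge_weights(subgraphs, pi):
    """Expected appearance probability of each edge under a distribution pi.

    subgraphs: list of edge lists, one per maximal acyclic subgraph.
    pi:        list of probabilities, one per subgraph (should sum to 1).
    Returns a dict keyed by frozenset({u, v}) so edge orientation is ignored.
    """
    weights = {}
    for edges, prob in zip(subgraphs, pi):
        for e in edges:
            key = frozenset(e)
            weights[key] = weights.get(key, 0.0) + prob
    return weights
```

For a triangle, the three spanning trees with probabilities 0.5, 0.3, and 0.2 give edge weights that sum to 2, the number of edges in every spanning tree. A uniform distribution gives each edge weight 2/3.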
Similar to cvaes, we can apply negative sampling (equivalent to applying a complete graph as a weak prior) to acvaes as regularization, which helps prevent overfitting on the learned pairwise variational approximation ( is the regularization strength):
[TABLE]
In what follows, we use to refer to for notational brevity.
3.2 Learning the Non-uniform Mixture
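With the loss function in Eq. 8, an intuitive direction for estimating $\pi$ would be to perform empirical Bayes [7] and directly maximize $\mathcal{L}$ with respect to $\lambda$, $\theta$ and $\pi$, as in Eq. 9:
$$\max_{\lambda, \theta, \pi}\ \mathcal{L}(\lambda, \theta, \pi). \tag{9}$$
Alternatively, we can consider a minimax saddle-point optimization, which may lead to more robust inference:
$$\min_{\pi} \max_{\lambda, \theta}\ \mathcal{L}(\lambda, \theta, \pi). \tag{10}$$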
As Eq. 10 indicates, we are optimizing the ELBO under the prior that produces the lowest lower bound. The intuition is that if we can even optimize the worst lower bound well, the variational distribution and the model distribution we learn would be robust and generalize better. This is similar to the least favorable prior, under which a Bayes estimator can achieve minimax risk [15].
Empirical Bayes (Eq. 9) aims to find the best variational approximation, while the saddle-point option (Eq. 10) aims for robust inference. At first glance, the empirical Bayes option seems more reasonable since it gives us the tightest lower bound. However, a better ELBO does not necessarily translate into better predictive performance in the downstream task. In Section 5, we compare these two optimization options on various datasets, and discuss the pros and cons of each.
An important observation is that, no matter which option is applied, for fixed $\lambda$ and $\theta$, the loss function is linear w.r.t. the weight parameter $\pi$. Therefore, if optima for $\pi$ exist, then at least one optimum will have a $\pi$ which puts all of its probability mass on a single subgraph $G^*$.
**Proposition 1 (Optimum for $\pi$).**
If the optimization in Eq. 9 or Eq. 10 has global optima, then at least one optimum will have a $\pi$ that places all of its probability mass on a single maximal acyclic subgraph $G^*$.
From this proposition, we know both Eq. 9 and Eq. 10 return a single subgraph $G^*$, which drastically simplifies the structured prior. At this optimum, the loss function becomes the ELBO on a single acyclic subgraph $G^*$, with $q_\lambda^{G^*}(z \mid x)$ as the variational distribution. Therefore, we have a holistic variational approximation, overcoming one of the limitations of cvaes.
3.3 Learning with Alternating Updates
Direct optimization of either Eq. 9 or Eq. 10 is non-trivial. Following similar saddle-point optimization for a spanning tree structured upper bound for the log-partition function of undirected graphical models [31, 32], we perform an alternating optimization procedure on the parameters $\lambda$, $\theta$ and $\pi$. Details are shown in Algorithm 1.
Updates for $\pi$
When the parameters $\lambda$ and $\theta$ are fixed, the loss function is linear in $\pi$. However, we cannot directly optimize over $\pi$, as it may contain exponentially many dimensions. We can instead update the edge weights $w^{\text{MAS}}_{G, \pi, e}$, as the loss function is also linear in them.
By definition, we know that each maximal acyclic subgraph of $G$ is a forest, consisting of one spanning tree for each connected component of $G$. Therefore, the domain for the edge weights is the projection of the Cartesian product of the spanning tree polytopes for all connected components of $G$ [31, 32] to the edge weight space. This Cartesian product of polytopes is convex and its boundary is determined by potentially exponentially many linear inequalities. Despite that, directly maximizing (or minimizing) $\mathcal{L}$ with respect to these weights is in fact tractable: the optimum for Eq. 9 or Eq. 10 is obtained at a $\pi$ that has all the mass on a single maximal acyclic subgraph $G^*$. This means the optimum for these edge weights can be obtained from a single subgraph $G^*$. By re-arranging terms in Eq. 8 with respect to the edge weights, it is not difficult to see that $G^*$ should have the smallest (for empirical Bayes) or largest (for saddle-point) “edge mass” sum over all maximal acyclic subgraphs of $G$, where the “edge mass” of edge $e = (v_i, v_j)$ is:
$$m_e = \mathrm{KL}\left(q_\lambda(z_i, z_j \mid x_i, x_j)\,\|\,p_0(z_i, z_j)\right) - \mathrm{KL}\left(q_\lambda(z_i \mid x_i)\,\|\,p_0(z_i)\right) - \mathrm{KL}\left(q_\lambda(z_j \mid x_j)\,\|\,p_0(z_j)\right), \tag{11}$$
which means $G^*$ is the combination of the minimum (for empirical Bayes) or maximum (for saddle-point) spanning trees of all connected components of the graph $G$ with the edge masses $m_e$ as the weights.
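The selection step described above amounts to a minimum or maximum spanning forest computation. The following sketch uses Kruskal's algorithm with a union-find structure. It takes the edge masses as given numbers, and the function name and the `mode` strings are hypothetical choices for illustration, not the paper's implementation.

```python
def select_subgraph(n, edge_mass, mode="empirical_bayes"):
    """Pick the maximal acyclic subgraph with extreme total edge mass.

    n:         number of vertices, labelled 0..n-1.
    edge_mass: dict mapping an edge (u, v) to its edge mass.
    mode:      "empirical_bayes" selects the minimum spanning forest,
               "saddle_point" selects the maximum spanning forest.
    """
    if mode not in ("empirical_bayes", "saddle_point"):
        raise ValueError("mode must be 'empirical_bayes' or 'saddle_point'")
    order = sorted(edge_mass, key=edge_mass.get, reverse=(mode == "saddle_point"))
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    chosen = []
    for u, v in order:
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components, so it creates no cycle
            parent[ru] = rv
            chosen.append((u, v))
    return chosen
```

Because Kruskal's algorithm processes each connected component independently, the result is one spanning tree per component, which matches the forest structure of a maximal acyclic subgraph.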
Once we identify $G^* = (V, E^*)$, the optimal weights are either 1 (if the edge is selected) or 0 (otherwise). Instead of directly updating the weights to the optimal values, we perform a soft update with step size $\alpha_t$ at iteration $t$, similar to Wainwright [31], Wainwright et al. [32]:
$$w^{(t+1)}_{e} = (1 - \alpha_t)\, w^{(t)}_{e} + \alpha_t\, \mathbb{1}[e \in E^*] \quad \text{for all } e \in E. \tag{12}$$
This soft update helps prevent the algorithm from becoming trapped in bad local optima early in the optimization procedure. The step size $\alpha_t$ can be either a constant or dynamically adjusted during optimization. We set it to be a constant in our experiments.
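A minimal sketch of such a soft update follows. It assumes the update is a convex combination of the current weights and the 0/1 indicator of the selected subgraph, as in conditional-gradient methods. The function name and dictionary representation are illustrative.

```python
def soft_update(weights, selected_edges, step):
    """Move edge weights a fraction `step` toward the selected subgraph.

    weights:        dict mapping each edge to its current weight in [0, 1].
    selected_edges: edges of the currently selected maximal acyclic subgraph.
    step:           step size in [0, 1]; step = 1 jumps to the 0/1 optimum.
    """
    if not 0.0 <= step <= 1.0:
        raise ValueError("step must lie in [0, 1]")
    selected = set(selected_edges)
    return {
        e: (1.0 - step) * w + step * (1.0 if e in selected else 0.0)
        for e, w in weights.items()
    }
```

Since both the old weights and the indicator vector lie in the same convex polytope, the updated weights remain valid edge appearance probabilities, and their sum stays equal to the number of edges in a spanning forest.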
One of the limitations of cvaes mentioned in Section 2.2 is the pre-processing step to compute all the edge weights . We alleviate this bottleneck in acvaes, as it only takes operations per initialization (details in Section B.2) and per update on the weights, which ensures that acvaes can scale to datasets with many more vertices than would be feasible with cvaes.
Updates for $\lambda$ and $\theta$
When $\pi$ is fixed, $\lambda$ and $\theta$ can be updated by taking a stochastic gradient step on $\mathcal{L}$ with the reparametrization gradient [13, 23], as done in standard vaes.
If empirical Bayes (Eq. 9) is applied, Algorithm 1 will converge with properly selected learning rates. On the other hand, it is difficult to make any general statement about the convergence for saddle-point optimization (Eq. 10) since the objective is generally non-concave in $\lambda$ and $\theta$. However, as we show in Section 5, empirically we find that Algorithm 1 is stable for both options and performs well on multiple real datasets.
3.4 Exact Marginal Posterior Approximation with Belief Propagation
From Proposition 1, we know the weights returned from Algorithm 1 are from a single maximal acyclic subgraph $G^*$. Consequently, we have a holistic variational approximation $q_\lambda^{G^*}(z \mid x)$. However, by itself this variational approximation might not necessarily be better at the downstream predictive tasks than cvaes since it can only make use of the structure from one maximal acyclic subgraph $G^*$.
On the plus side, the acyclic structure of $G^*$ makes it possible to compute the exact pairwise marginal variational distribution between any pair of vertices via a belief-propagation-style [21] message-passing algorithm, which is not possible for cvaes, as they do not have a single joint variational distribution on $z$. This can be crucial in tasks in which we need an accurate pairwise marginal approximation, e.g., link prediction and hierarchical clustering.
Consider any $v_i, v_j$ that are in the same connected component of $G^*$. Since $G^*$ is acyclic there is a unique path from $v_i$ to $v_j$. Denote it as $v_i = u_0, u_1, \dots, u_l = v_j$. The exact pairwise marginal equals
$$q_\lambda^{G^*}(z_{u_0}, z_{u_l} \mid x) = \int \frac{\prod_{k=1}^{l} q_\lambda(z_{u_{k-1}}, z_{u_k} \mid x_{u_{k-1}}, x_{u_k})}{\prod_{k=1}^{l-1} q_\lambda(z_{u_k} \mid x_{u_k})}\, dz_{u_1} \cdots dz_{u_{l-1}}. \tag{13}$$
The above pairwise marginal densities can be computed for all pairs of $v_i, v_j$ by doing a depth- or breadth-first search starting from each $v_i$ after we obtain the variational approximation from Algorithm 1, which has a total complexity of $O(n^2)$. Note that the time complexity for evaluating every pairwise marginal in cvaes is also $O(n^2)$. But the belief propagation refinement computation is usually more efficient in practice, since it involves far fewer neural network function evaluations, which dominate the runtime.
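To make the path marginalization concrete, here is a discrete analogue in NumPy. The latent variables in the paper are continuous, so this is only a sketch under the assumption of finite state spaces, where the integral over intermediate variables becomes a matrix product. The function name is hypothetical.

```python
import numpy as np

def path_pairwise_marginal(edge_joints):
    """Exact joint of the two endpoints of a path in a tree-structured model.

    edge_joints[k][a, b] is the pairwise joint of consecutive path vertices,
    q(z_k = a, z_{k+1} = b). Consecutive joints are assumed to agree on their
    shared singleton marginal, and all singleton marginals are assumed positive.
    """
    joint = edge_joints[0]
    for nxt in edge_joints[1:]:
        shared = nxt.sum(axis=1)             # q(z_k), the shared marginal
        conditional = nxt / shared[:, None]  # q(z_{k+1} | z_k)
        joint = joint @ conditional          # sum out the intermediate z_k
    return joint
```

Running this from every start vertex along a depth- or breadth-first traversal of the forest reuses the partial products, which is what keeps the total cost quadratic in the number of vertices.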
4 RELATED WORK
This work extends cvaes with the idea of learning a non-uniform average loss over some tractable loss functions on maximal acyclic subgraphs of the given graph. This is similar to the idea of obtaining a tighter upper bound on the log-partition function for an undirected graphical model by minimizing over a convex combination of spanning trees of the given graph [32]. To optimize the parameters, Wainwright et al. [32] also apply alternating updates on the parameters and the distributions over the spanning trees, similar to the approach in acvae learning. Alternating parameter updates are useful in many other settings, for example, Alternating Least Squares for matrix factorization [25] and the Alternating Direction Method of Multipliers (ADMM) for convex optimization [4, 26, 10].
Some recent work also focuses on incorporating correlation structures over latent variables. For example, Hoffman and Blei [9] proposed structured variational families that can improve over traditional mean-field variational inference. Johnson et al. [11] proposed Structured vaes that apply more complex forms for the priors on the latent embeddings. Recently in the NLP community, Yin et al. [34] proposed utilizing tree-structured latent variable models to deal with semantic parsing. However, most of these works focus on correlations within dimensions of latent variables, whereas our work focuses on correlations between latent variables, similar to the setting of cvaes. In addition, Luo et al. [17] incorporated pairwise correlations between latent variables into deep generative models for semi-crowdsourced clustering.
Another line of related work appears in convolutional networks for graphs and their extensions [3, 6, 5, 20, 8, 28], which also take the graph structure of data into consideration.
5 EXPERIMENTS
In this section, we evaluate acvaes on the task of link prediction and hierarchical clustering. We show that our method significantly outperforms various baselines. We attempt to identify the contributing factors for the gain, answering the following questions:
Q1: Uniform mixture (cvae) versus non-uniform mixture (acvae), which one is better? (Section 5.2.1)
Q2: How important is the belief propagation refinement for acvae? (Section 5.2.1)
Q3: Empirical Bayes versus saddle-point, which one performs better? Can we select purely based on ELBO? (Section 5.2.2)
Q4: Does the learned single graph capture more information than singleton representations? What do the learned latent embeddings look like? (Section 5.2.3)
Q5: Can acvae scale to datasets that cvae cannot? (Section 5.2.4)
5.1 Experiment Settings
Before presenting our experimental results, we describe the tasks, datasets, baselines, and metrics for evaluation. Additional details can be found in Appendix B.
5.1.1 Tasks
For each of the tasks, we are given a correlation graph $G = (V, E)$ and a feature vector $x_i$ for each $v_i \in V$.
For the link prediction task, we keep consistent with the setting of Tang et al. [27]. For the hierarchical clustering experiments, we apply the complete-linkage algorithm [33], which is relatively more stable among common hierarchical clustering algorithms. We cluster all data points into clusters.
5.1.2 Datasets
We evaluate acvaes on the following 3 datasets. All 3 datasets are tested for link prediction, and in addition the LibraryThing dataset is tested for the hierarchical clustering experiment:
- Epinions (http://www.trustlet.org/downloaded_epinions.html) [19], a public product rating dataset that contains users and products. After pre-processing, the dataset contains users.
- Citation (http://snap.stanford.edu/data/cit-HepTh.html) [16], a High-energy physics theory citation network dataset, which has a citation graph with papers and citation edges. After preprocessing, the dataset contains users (for the results as in Section 5.2.1). We also perform an experiment in Section 5.2.4 on a larger version of this dataset, which contains users.
- LibraryThing (https://cseweb.ucsd.edu/~jmcauley/datasets.html#social_data) [14], a public book review data set that contains users and items. After pre-processing, the dataset contains users.
For the hierarchical clustering task, the LibraryThing dataset does not contain cluster labels for users. We generate the cluster labels for each user by learning a standard vae on the feature vectors , and perform the complete-linkage algorithm to cluster the data points into clusters. This helps us generate a semi-synthetic dataset.
5.1.3 Baselines
We compare acvae with 4 baseline methods:
- vae [13]: standard variational auto-encoders, with no information about the correlations.
- GraphSAGE [8]: the state-of-the-art method for learning latent embeddings that takes the correlation structure into account with graph convolutional neural networks.
- cvae$_{\text{ind}}$ and cvae$_{\text{corr}}$ [27]: two variations of cvaes with factorized and structured variational approximations, respectively.
There are many different variants of GraphSAGE, and we applied one of them (details in Section B.2). It is possible that some other variants or parameter settings of this method may perform better on our tasks. But our main goal is not to derive a state-of-the-art method for these tasks. Instead, we aim to show insights on how to improve over standard vaes and cvaes through learning adaptive correlated priors.
5.1.4 Metrics
For all methods, we first learn latent embeddings , which are deterministic for GraphSAGE and stochastic for the vaes-based methods. Then we compute the pairwise distance between each pair of the latent embeddings as . Recall that the embeddings are stochastic for the vaes-based methods, hence we use as the pairwise distance. The expectation is taken over the variational pairwise marginal or the refined pairwise marginal if we perform belief propagation (Section 3.4).
For the link prediction experiments, for each user , we compute the Cumulative Reciprocal Rank (CRR) as follows.
[TABLE]
A larger CRR value indicates the heldout edges have a higher rank among all the candidates. We further normalize the CRR values to be in $[0, 1]$, and report the normalized CRR (NCRR).
For hierarchical clustering, we apply the normalized mutual-information scores [29] as the metric. These scores are in the range $[0, 1]$ and a larger score indicates better clustering performance.
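As a reference point for the clustering metric, the sketch below computes a normalized mutual-information score between two labelings. Several normalizations exist in the literature; this version divides by the arithmetic mean of the two entropies, which is an assumption and may differ from the variant used in the experiments.

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """NMI in [0, 1], normalized by the mean of the two label entropies."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Contingency table of co-occurring cluster labels.
    counts = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(counts, (a_idx, b_idx), 1.0)
    joint = counts / counts.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    denom = 0.5 * (ha + hb)
    # Both labelings trivial (a single cluster each): treat as identical.
    return float(mi / denom) if denom > 0 else 1.0
```

The score is invariant to permutations of the cluster labels, so a labeling and its relabeled copy score 1, while two statistically independent labelings score 0.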
5.2 Results
We show the heldout NCRR values for link predictions and the normalized MI scores in Table 1 and Table 2, respectively. acvae and acvae stand for empirical Bayes (Eq. 9) and saddle-point optimization (Eq. 10), respectively. The rows with BP mean we perform belief-propagation refinement (Section 3.4). We dissect the results in the following sections.
5.2.1 Advantages of the Non-uniform Mixture
As motivated in Section 2.2, acvaes improve over the limitations of cvaes by providing a holistic variational approximation at the end of the empirical Bayes or saddle-point optimization, which further enables applying belief propagation for more accurate marginal approximation.
At first glance, the performance results in Table 1 for the single joint distribution (the rows acvae and acvae ) are no better than those of cvae, which applies a uniform mixture. As we speculated in Section 3.4, by itself this holistic variational approximation might not necessarily be better at the downstream predictive tasks since it can only make use of the structure from one maximal acyclic subgraph, even though it sometimes has a higher ELBO (Table 4). However, we can observe a huge performance boost after applying the belief propagation refinement, which outperforms the baseline methods by a wide margin for link prediction and performs comparably or better for hierarchical clustering. (Footnote 4: vae does not count as a baseline method for the clustering experiment since it is applied as an oracle in the pre-processing steps.) We omitted the results for acvaes without belief propagation for the hierarchical clustering experiments since empirically we found their performance is much worse than with belief propagation refinement.
Recall that the prerequisite for applying belief propagation is to have a variational distribution on a single acyclic subgraph (i.e., we cannot perform BP with cvaes). This answers two of the questions we sought to answer. First, the non-uniform mixture is not necessarily better than the uniform mixture at the downstream task even when it has a higher ELBO, but it opens up the possibility of performing exact inference. Second, variational approximations have a lot of room for improvement when compared with exact inference (i.e., belief propagation) on an acyclic graph.
5.2.2 Empirical Bayes versus Saddle-Point
As shown in Table 1 and Table 2, both empirical Bayes and saddle-point optimization perform similarly on most tasks, though the saddle-point option is often more stable (normally having a smaller variance in the metrics). This is reasonable since the saddle-point objective optimizes the most conservative lower bound.
Moreover, we show that we should not select between these two methods purely based on ELBO: by definition, the saddle-point optimization will yield an ELBO lower than empirical Bayes. In Table 3 and Table 4 we report ELBO as well as NCRR for 4 choices of the negative sampling parameter (Eq. 8) on Epinions and Citation with and without belief propagation refinement. We can see clearly that a better ELBO does not necessarily correlate with a better NCRR, regardless of whether belief propagation refinement is applied.
In general, both methods have their advantages. On simpler datasets, e.g., Citation, on which all methods perform well, empirical Bayes is preferred since it can easily capture the best correlation structure. On the other hand, with more complex datasets/difficult tasks, saddle-point optimization tends to provide more robust inference and stable results.
5.2.3 Learned Graph Structures
In Figure 1 we visualize part of the largest connected component of the maximal acyclic subgraph that acvaes learn for the variational distribution on the Citation dataset with both empirical Bayes and saddle-point optimization (colors are for clarity only). The coordinates are t-SNE embeddings of the variational approximation means of the latent variables. The edge widths are proportional to the strength of the learned correlations. We can see that some of the learned embeddings are not necessarily close to each other even when they have high correlations. This indicates that the learned graph structure provides some additional information that singleton marginals cannot provide.
5.2.4 Scalability to Large Datasets
To demonstrate the scalability of acvae compared to cvae, we perform an experiment on a larger version of the Citation dataset with times more vertices, which cvae can not easily scale to due to the cubic time initialization step and the quadratic pairwise marginal evaluations.
We compare the performance of acvae plus the belief propagation refinement on both the empirical Bayes and the saddle-point schemes with the other two baseline methods (vae and GraphSAGE). As shown in Table 5, both schemes of acvae can significantly outperform the baseline methods.
6 CONCLUSION
In this paper, we introduce acvaes, which learn a joint variational distribution on the latent embeddings of input data by optimizing a loss function that is a non-uniform average over some tractable correlated ELBOs. To learn the mixture weights, we provide two different options, and compare them on various datasets and tasks. The learned joint variational distribution can be used to perform efficient evaluation using belief propagation. Experimental results show that acvaes can outperform existing methods for link prediction and hierarchical clustering on three real datasets. Future work will include better understanding the learned graph structures from both options and learning higher-order correlations between latent variables.
Appendix
In the appendix, we provide more details on our baseline method cvae [27] as well as the experiment data pre-processing and protocols.
Appendix A MORE DETAILS ON cvaes
cvaes set the prior to be the uniform average over all of these tractable densities:
[TABLE]
where is a prior on a maximal acyclic subgraph with the same form as in Eq. 3. For each , we can similarly define a structured variational approximation following the form of Eq. 3:
[TABLE]
where and are two conditional density functions that capture the singleton and pairwise variational approximation densities. These two functions need to satisfy the following symmetry and consistency properties:
[TABLE]
The ELBO in Eq. 5 is an average over potentially exponentially many ELBOs. To make computations tractable, Tang et al. [27] simplify this lower bound and represent it as
[TABLE]
where for each edge represents the fraction of ’s maximal acyclic subgraphs of that contain . These weights can be computed easily from the Moore-Penrose inverse of the Laplacian matrix of .
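The weight computation mentioned above can be sketched with NumPy. This is a minimal illustration, not the authors' code: it assumes an unweighted, simple, undirected graph given as an edge list over integer vertex ids, and it relies on the standard identity that the fraction of spanning trees (or spanning forests, for a disconnected graph) containing an edge equals the effective resistance of that edge, which can be read off the Moore-Penrose inverse of the graph Laplacian. The function name and input format are hypothetical.

```python
import numpy as np


def edge_inclusion_fractions(num_vertices, edges):
    """Fraction of maximal acyclic subgraphs that contain each edge.

    Assumes a simple, unweighted, undirected graph (no self-loops,
    no duplicate edges). Names and input format are illustrative.
    """
    laplacian = np.zeros((num_vertices, num_vertices))
    for i, j in edges:
        laplacian[i, i] += 1.0
        laplacian[j, j] += 1.0
        laplacian[i, j] -= 1.0
        laplacian[j, i] -= 1.0
    # Moore-Penrose inverse of the Laplacian.
    pinv = np.linalg.pinv(laplacian)
    # Effective resistance of edge (i, j), equal to its inclusion fraction.
    return {
        (i, j): pinv[i, i] + pinv[j, j] - 2.0 * pinv[i, j]
        for i, j in edges
    }


# A triangle (0, 1, 2) with a pendant edge (2, 3).
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
fractions = edge_inclusion_fractions(4, example_edges)
```

In this example each triangle edge appears in two of the three spanning trees, while the pendant edge appears in all of them, and the fractions sum to the number of edges in any spanning tree.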
Appendix B EXPERIMENT DETAILS
B.1 Dataset Pre-processing Details
Epinions
We follow the same pre-processing scheme as Tang et al. [27]: we binarize the rating data and create a bag-of-words binary feature vector for each user. We retain only the items that have been rated at least 100 times. We construct the undirected graph by keeping an edge between two users only if both directions of that edge appear in the original directed graph. Finally, we retain only users that have at least one edge in this graph (i.e., at least one bi-directional edge in the original dataset).
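The graph construction step above can be sketched in plain Python. This is only an illustration under stated assumptions: the directed graph is given as a list of (source, target) user pairs, and the function name and toy data are hypothetical rather than taken from the original pipeline.

```python
def build_mutual_graph(directed_edges):
    """Keep an undirected edge only when both directions are present.

    Returns the set of undirected edges (stored as sorted pairs) and the
    set of users that have at least one such edge.
    """
    directed = set(directed_edges)
    undirected = {
        (min(u, v), max(u, v))
        for (u, v) in directed
        if u != v and (v, u) in directed
    }
    retained_users = {user for edge in undirected for user in edge}
    return undirected, retained_users


# Toy directed graph: only a<->b and c<->d are bi-directional.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"),
             ("c", "d"), ("d", "c"), ("e", "a")]
mutual_edges, mutual_users = build_mutual_graph(toy_edges)
```

User "e" is dropped here because its only edge is one-directional.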
Citations
This dataset includes the abstract and the citation information for high-energy physics theory papers on arXiv from 1992 to 2003. We work on all papers from 1998 in this dataset (in total papers). We treat all citation edges as undirected edges and build the graph . We retain only papers that cite or are cited by at least one of the other papers within this subset (for year 1998) of the dataset. We compute the TF-IDF (with stop words removed) for the abstract of each paper as the raw feature vector, retaining only the coordinates corresponding to the top 50 words. We then binarize the raw feature vectors by setting to one only the non-zero entries that are above the median of all of the non-zero entries, and use these binarized vectors as the feature vectors.
For the larger experiment on this dataset, we apply the same pre-processing steps but work on the whole dataset (instead of the subset for year 1998).
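The binarization rule used for the Citation features can be sketched as follows. This is a minimal NumPy illustration that assumes the TF-IDF matrix has already been computed and restricted to the selected words, and that the median is taken over all non-zero entries of the whole matrix, as the text describes. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def binarize_above_nonzero_median(features):
    """Set an entry to 1 if it exceeds the median of all non-zero entries."""
    features = np.asarray(features, dtype=float)
    nonzero = features[features != 0]
    if nonzero.size == 0:
        # Nothing to threshold: every entry stays 0.
        return np.zeros(features.shape, dtype=np.int8)
    threshold = np.median(nonzero)
    return (features > threshold).astype(np.int8)


# Toy TF-IDF matrix (2 papers, 3 words); the non-zero median is 0.25.
toy_tfidf = np.array([[0.0, 0.1, 0.4],
                      [0.3, 0.0, 0.2]])
toy_binary = binarize_above_nonzero_median(toy_tfidf)
```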
LibraryThing
For the link prediction experiment, we follow the same pre-processing scheme as for the Epinions dataset, except that we retain only the items that have been rated at least 200 times (since this dataset is larger than the Epinions dataset). For the clustering experiment, we follow the same scheme to get a graph , but we do not split the edges into training and test sets (since clustering is unsupervised), and we apply a normal vae to generate the labels (as mentioned in the main paper). This normal vae has the same hidden layer size as the one used in testing, but has a smaller latent representation (we use 10) to avoid generating unreasonable labels due to overfitting.
B.2 Experimental Protocol
We perform 3 runs of each method for the Epinions experiments and 5 runs for the other experiments (except that we perform only 1 run for the Citation experiment on the larger dataset in Section 5.2.4). We use fewer runs for Epinions because those experiments are empirically more stable.
For vae, cvae and acvae, we apply a two-layer feed-forward neural inference network for the singleton variational distributions and a two-layer feed-forward neural generative network for the model distributions. Each singleton variational distribution is a diagonal normal distribution with the mean and standard deviation output by the inference network, and each model distribution is a multinomial distribution with the logits output by the generative network. The latent dimensionality is 100 for the Epinions experiments and the LibraryThing clustering, and 10 for the other two link prediction experiments. The hidden layer dimensionality is 300 for the Epinions experiments and 30 for the other experiments.
For GraphSAGE, we choose to use aggregations, the mean aggregator, and negative samples to optimize the loss function. The hidden layer size and latent dimensionality we apply to GraphSAGE are the same as those of the standard vae.
For cvae and acvae, we set the pairwise marginal prior density function to be with . For cvae and acvae, we model each pairwise variational approximation as a multivariate normal distribution that factorizes across the dimensions into a product of independent bivariate normal distributions. The correlation coefficients of these bivariate normal distributions are computed by two-layer feed-forward neural networks that take and as inputs. These two-layer neural networks have a latent dimensionality of 1000 for the Epinions experiments and 100 for the other experiments. For cvae and acvae on the link prediction experiments, we select the negative sampling parameter from a set of choices and report the performance with the best average train NCRR. This parameter is selected from for the LibraryThing dataset and the Citation dataset, and for the Epinions dataset. For the clustering experiments, we select for cvae and acvae, since we found empirically that this choice gives reasonably good performance.
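The factorized pairwise density described above can be written out as a short NumPy sketch. It assumes that the per-dimension means, standard deviations, and correlation coefficients are already available as arrays (in the paper they come from neural networks, which are not reproduced here). The function and argument names are hypothetical.

```python
import numpy as np


def pairwise_log_density(z_i, z_j, mu_i, mu_j, sigma_i, sigma_j, rho):
    """Log density of a pair of latent vectors under a product of
    independent bivariate normals, one per latent dimension.

    All arguments are arrays of the same length; rho holds the
    per-dimension correlation coefficients, each strictly in (-1, 1).
    """
    z_i, z_j = np.asarray(z_i, float), np.asarray(z_j, float)
    mu_i, mu_j = np.asarray(mu_i, float), np.asarray(mu_j, float)
    sigma_i, sigma_j = np.asarray(sigma_i, float), np.asarray(sigma_j, float)
    rho = np.asarray(rho, float)

    a = (z_i - mu_i) / sigma_i  # standardized residuals for vertex i
    b = (z_j - mu_j) / sigma_j  # standardized residuals for vertex j
    one_minus_rho2 = 1.0 - rho ** 2
    quad = (a * a - 2.0 * rho * a * b + b * b) / one_minus_rho2
    log_norm = (np.log(2.0 * np.pi) + np.log(sigma_i) + np.log(sigma_j)
                + 0.5 * np.log(one_minus_rho2))
    # Independence across dimensions: sum the per-dimension log densities.
    return float(np.sum(-0.5 * quad - log_norm))


zi = np.array([0.5, -1.0])
zj = np.array([0.2, 0.3])
zero_mean = np.zeros(2)
si = np.ones(2)
sj = np.array([2.0, 0.5])
log_q_uncorrelated = pairwise_log_density(zi, zj, zero_mean, zero_mean,
                                          si, sj, np.zeros(2))
```

With all correlation coefficients set to zero, the result reduces to the sum of the two singleton diagonal normal log densities, which is a convenient sanity check.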
For link prediction, for all methods, we evaluate performance every fixed number of iterations (the specific number depends on the model) and update the current best test NCRR value if both the train ELBO and the train NCRR reach better values. We report the final current best test NCRR value as the result. For clustering, we update the current best normalized MI score if the train ELBO reaches a better value, and we report the final current best normalized MI score.
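The link prediction selection rule above can be sketched in a few lines of Python. One detail is an assumption on our part: "better" is interpreted here as better than the train ELBO and train NCRR recorded at the last accepted evaluation. The function name and the example numbers are illustrative only and are not results from the paper.

```python
def select_best_test_ncrr(evaluations):
    """Pick the reported test NCRR from periodic evaluations.

    `evaluations` is a sequence of (train_elbo, train_ncrr, test_ncrr)
    tuples in training order. The test NCRR is updated only when both
    training quantities improve on the last accepted evaluation.
    """
    best_train_elbo = float("-inf")
    best_train_ncrr = float("-inf")
    best_test_ncrr = None
    for train_elbo, train_ncrr, test_ncrr in evaluations:
        if train_elbo > best_train_elbo and train_ncrr > best_train_ncrr:
            best_train_elbo = train_elbo
            best_train_ncrr = train_ncrr
            best_test_ncrr = test_ncrr
    return best_test_ncrr


# Made-up evaluation history, for illustration only.
history = [
    (-120.0, 0.010, 0.008),  # accepted (first evaluation)
    (-110.0, 0.009, 0.012),  # rejected: train NCRR did not improve
    (-105.0, 0.013, 0.011),  # accepted: both improved
    (-108.0, 0.015, 0.014),  # rejected: train ELBO did not improve
]
reported = select_best_test_ncrr(history)
```

Note that the rule never looks at the test metric when deciding whether to update, so the reported value is not necessarily the maximum test NCRR seen during training.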
For acvae, we set the step size parameter (in Eq. 12) to be a constant. We train the parameters using alternating updates as in Algorithm 1. We switch between updates on the parameters , for an epoch of the edges in , and a single update on the weights according to Eq. 12. For the random initialization on the tree weights , we just assign random weights to the graph . Then we use Kruskal’s algorithm to compute the maximal acyclic subgraph according to these random weights, and set . It is straightforward to see that this is a valid initialization for the weights ’s since these weights relate to the distribution that has all of its mass on the single subgraph .
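The random initialization described above can be sketched in plain Python. This is an illustration under assumptions rather than the authors' Cython implementation: vertices are integer ids, edges are given as a list of pairs, Kruskal's algorithm is run in its maximum-weight form (for random weights the choice between minimum and maximum does not matter), and all function names are hypothetical.

```python
import random


def kruskal_max_forest(num_vertices, weighted_edges):
    """Kruskal's algorithm for a maximum-weight maximal acyclic subgraph.

    `weighted_edges` maps each edge (u, v) to its weight.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    ordered = sorted(weighted_edges.items(), key=lambda kv: kv[1], reverse=True)
    for (u, v), _ in ordered:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge creates no cycle
            parent[root_u] = root_v
            forest.append((u, v))
    return forest


def random_tree_weight_init(num_vertices, edges, seed=0):
    """Indicator weights of one randomly drawn maximal acyclic subgraph."""
    rng = random.Random(seed)
    random_weights = {edge: rng.random() for edge in edges}
    forest = set(kruskal_max_forest(num_vertices, random_weights))
    return {edge: (1.0 if edge in forest else 0.0) for edge in edges}


# Triangle (0, 1, 2), a pendant edge (2, 3), and a separate component (4, 5).
graph_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (4, 5)]
init_weights = random_tree_weight_init(6, graph_edges)
```

Every weight is either 0 or 1, and the weights sum to the number of vertices minus the number of connected components. The same `kruskal_max_forest` helper could in principle be applied to learned edge weights to extract a single subgraph.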
For acvae, after running the algorithm for some iterations, we use Kruskal’s algorithm to compute the maximal acyclic subgraph on the converged edge weights to find the learned single graph . This heuristic resolves the issue of finding the converged maximal acyclic subgraph when we perform early stopping (recall that we evaluate our metrics every fixed number of iterations) or when there is a numerical issue.
For all methods, we apply stochastic gradient optimization and use Adam [12] to adjust the learning rates. We set the step size to be . For all methods, we use a batch size for sampling the vertices. For cvae and acvae, we use a batch size for sampling the edges and non-edges.
All experiments are done using Python. The training and evaluations are done with TensorFlow [1] and NumPy. The TF-IDF features and the t-SNE embeddings [18] in the visualization (Figure 1) are computed using scikit-learn [22]. For faster computations, we call C++ functions for belief propagation and Kruskal’s algorithm using Cython [2].
References
- [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- [2] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31, 2011.
- [3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 3rd International Conference on Learning Representations, 2015.
- [4] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
- [5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
- [6] David Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
- [7] Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.
- [8] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.
