Learning Correlated Latent Representations with Adaptive Priors

Da Tang; Dawen Liang; Nicholas Ruozzi; Tony Jebara

arXiv:1906.06419·cs.LG·December 20, 2019

TL;DR

This paper introduces ACVAEs, an advanced VAE model with adaptive priors that better capture correlations in data, leading to improved performance in link prediction and clustering tasks.

Contribution

We propose ACVAEs, which use adaptive priors to effectively learn correlated latent representations and enable tractable joint distributions, overcoming limitations of previous CVAEs.

Findings

01 ACVAEs outperform CVAEs in link prediction.

02 ACVAEs achieve better hierarchical clustering results.

03 Adaptive priors improve correlation modeling.

Abstract

Variational Auto-Encoders (VAEs) have been widely applied for learning compact, low-dimensional latent representations of high-dimensional data. When the correlation structure among data points is available, previous work proposed Correlated Variational Auto-Encoders (CVAEs), which employ a structured mixture model as prior and a structured variational posterior for each mixture component to enforce that the learned latent representations follow the same correlation structure. However, as we demonstrate in this work, such a choice cannot guarantee that CVAEs capture all the correlations. Furthermore, it prevents us from obtaining a tractable joint and marginal variational distribution. To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (ACVAEs), which apply an adaptive prior distribution that can be adjusted during training and can learn a tractable joint…

Tables

Table 1: Link Prediction Normalized CRR

Method	Epinions	Citation	LibraryThing
vae	$0.005 \pm 0.001$	$0.018 \pm 0.004$	$0.006 \pm 0.000$
GraphSAGE	$0.012 \pm 0.003$	$0.020 \pm 0.002$	$0.004 \pm 0.001$
cvae $_{ind}$	$0.016 \pm 0.000$	$0.040 \pm 0.003$	$0.012 \pm 0.001$
cvae $_{corr}$	$0.017 \pm 0.001$	$0.058 \pm 0.002$	$0.020 \pm 0.001$
acvae $_{EB}$	$0.013 \pm 0.000$	$0.049 \pm 0.001$	$0.018 \pm 0.001$
acvae $_{SP}$	$0.010 \pm 0.003$	$0.039 \pm 0.002$	$0.018 \pm 0.001$
acvae $_{EB+BP}$	$0.034 \pm 0.003$	$0.126 \pm 0.005$	$0.032 \pm 0.002$
acvae $_{SP+BP}$	$0.035 \pm 0.001$	$0.123 \pm 0.007$	$0.032 \pm 0.001$

Table 2: Hierarchical Clustering Normalized MI Scores

Method	MI Scores
GraphSAGE	$0.002 \pm 0.000$
cvae $_{ind}$	$0.010 \pm 0.004$
cvae $_{corr}$	$0.002 \pm 0.000$
acvae $_{EB+BP}$	$0.012 \pm 0.003$
acvae $_{SP+BP}$	$0.011 \pm 0.002$

Table 3: ELBO | Average NCRR Comparisons between ACVAE (with BP) on Epinions and Citation

Epinions	acvae $_{EB+BP}$	acvae $_{SP+BP}$
$\gamma = 0.001$	-31.8 | 0.034	-31.9 | 0.031
$\gamma = 0.1$	-36.4 | 0.031	-38.3 | 0.035
$\gamma = 10$	-61.3 | 0.028	-119 | 0.034
$\gamma = 1000$	-674 | 0.028	-1535 | 0.037
Citation	acvae $_{EB+BP}$	acvae $_{SP+BP}$
$\gamma = 0.001$	-7.48 | 0.126	-7.48 | 0.124
$\gamma = 0.1$	-7.91 | 0.113	-8.59 | 0.121
$\gamma = 10$	-24.4 | 0.112	-49.2 | 0.120
$\gamma = 1000$	-184 | 0.099	-288 | 0.054

Table 4: ELBO | Average NCRR Comparisons between ACVAE (without BP) and CVAE on Citation

Citation	acvae $_{EB}$	acvae $_{SP}$	cvae
$\gamma = 0.001$	-7.47 | 0.012	-7.48 | 0.010	-7.48 | 0.011
$\gamma = 0.1$	-7.88 | 0.031	-8.51 | 0.025	-8.49 | 0.023
$\gamma = 10$	-23.8 | 0.043	-47.9 | 0.037	-42.3 | 0.042
$\gamma = 1000$	-183 | 0.049	-286 | 0.039	-267 | 0.058

Table 5: NCRR on the Larger Citation Dataset

Method	NCRR
vae	0.002
GraphSAGE	0.002
acvae $_{EB+BP}$	0.076
acvae $_{SP+BP}$	0.073

Equations

\mathcal{L}(\bm{\lambda},\bm{\theta})

\begin{cases}p_{0}^{\textrm{corr}}(\bm{z}_{i})=p_{0}(\bm{z}_{i})&\text{for all }v_{i}\in V,\\ p_{0}^{\textrm{corr}}(\bm{z}_{i},\bm{z}_{j})=p_{0}(\bm{z}_{i},\bm{z}_{j})&\text{if }(v_{i},v_{j})\in E.\end{cases}

p_{0}^{\textrm{corr}}(\bm{z})=\prod\limits_{i=1}^{n}p_{0}(\bm{z}_{i})\prod\limits_{(v_{i},v_{j})\in E}\frac{p_{0}(\bm{z}_{i},\bm{z}_{j})}{p_{0}(\bm{z}_{i})p_{0}(\bm{z}_{j})}.

p_{0}^{\textrm{corr}_{g}}(\bm{z})\triangleq\frac{1}{|\mathcal{A}_{G}|}\sum\limits_{G^{\prime}=(V,E^{\prime})\in\mathcal{A}_{G}}p_{0}^{G^{\prime}}(\bm{z}),

\log p_{\bm{\theta}}(\bm{x})

\displaystyle\geq\frac{1}{|\mathcal{A}_{G}|}\sum\limits_{G^{\prime}\in\mathcal{A}_{G}}\Big{(}\mathbb{E}_{q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})}[\log p_{\bm{\theta}}(\bm{x}|\bm{z})]-\text{KL}(q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})||p_{0}^{G^{\prime}}(\bm{z}))\Big{)}.

\mathbb{E}_{G^{\prime}\sim\bm{\pi}}\Big[\mathbb{E}_{q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})}[\log p_{\bm{\theta}}(\bm{x}|\bm{z})]-\text{KL}(q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})||p_{0}^{G^{\prime}}(\bm{z}))\Big]

\leq\mathbb{E}_{G^{\prime}\sim\bm{\pi}}\big[\mathbb{E}_{p_{0}^{G^{\prime}}(\bm{z})}[\log p_{\bm{\theta}}(\bm{x}|\bm{z})]\big]:=\mathbb{E}_{p_{0}^{\bm{\pi}}(\bm{z})}[\log p_{\bm{\theta}}(\bm{x}|\bm{z})]

\leq\log p_{\bm{\pi},\bm{\theta}}(\bm{x}).

\displaystyle\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta}):=\sum\limits_{i=1}^{n}\Big{(}\mathbb{E}_{q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})}\left[\log p_{\bm{\theta}}(\bm{x}_{i}|\bm{z}_{i})\right]-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))\Big{)}-\sum\limits_{(v_{i},v_{j})\in E}w^{\text{MAS}}_{G,\bm{\pi},(v_{i},v_{j})}\cdot

\displaystyle\Big{(}\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})||p_{0}(\bm{z}_{i},\bm{z}_{j}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})||p_{0}(\bm{z}_{j}))\Big{)}.

\mathcal{L}^{\textrm{ACVAE}\textrm{-NS}}(\bm{\pi},\bm{\lambda},\bm{\theta}):=

\displaystyle+\frac{2}{n}\sum\limits_{1\leq i<j\leq n}\mathbb{E}_{q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})}\log\frac{q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})}{q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})}\Big{)}.

\max\limits_{\bm{\lambda},\bm{\theta}}\max\limits_{\bm{\pi}}\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta}).

\max\limits_{\bm{\lambda},\bm{\theta}}\min\limits_{\bm{\pi}}\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta}).

m_{(v_{i},v_{j})}:=\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})||p_{0}(\bm{z}_{i},\bm{z}_{j}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})||p_{0}(\bm{z}_{j})),

w^{\text{MAS}^{t+1}}_{G,\bm{\pi},e}\leftarrow(1-\alpha^{t})w^{\text{MAS}^{t}}_{G,\bm{\pi},e}+\alpha^{t}\hat{w}^{\text{MAS}}_{G,\bm{\hat{\pi}},e}.

\int\prod\limits_{l=0}^{k_{i,j}-1}q_{\bm{\lambda}}(\bm{z}_{u^{i,j}_{l}},\bm{z}_{u^{i,j}_{l+1}}|\bm{x}_{u^{i,j}_{l}},\bm{x}_{u^{i,j}_{l+1}})\prod\limits_{l=1}^{k_{i,j}-1}\frac{d\bm{z}_{u^{i,j}_{l}}}{q_{\bm{\lambda}}(\bm{z}_{u^{i,j}_{l}}|\bm{x}_{u^{i,j}_{l}})}.

\textrm{CRR}_{i}=\sum\limits_{(v_{i},v_{j})\in E_{\textrm{test}}}\frac{1}{|\{k:(v_{i},v_{k})\notin E_{\textrm{train}},\ \textrm{dis}_{i,k}\leq\textrm{dis}_{i,j}\}|}.

p_{0}^{\textrm{corr}_{g}}(\bm{z})=\frac{1}{|\mathcal{A}_{G}|}\sum\limits_{G^{\prime}=(V,E^{\prime})\in\mathcal{A}_{G}}p_{0}^{G^{\prime}}(\bm{z}),

q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})=\prod\limits_{i=1}^{n}q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})\prod\limits_{(v_{i},v_{j})\in E^{\prime}}\frac{q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})}{q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})},

\begin{cases}q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})=q_{\bm{\lambda}}(\bm{z}_{j},\bm{z}_{i}|\bm{x}_{j},\bm{x}_{i})&\text{for all }\bm{z}_{i},\bm{z}_{j},\bm{x}_{i},\bm{x}_{j},\\ \int q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})d\bm{z}_{j}=q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})&\text{for all }\bm{z}_{i},\bm{x}_{i},\bm{x}_{j}.\end{cases}

\displaystyle\mathcal{L}^{\textrm{CVAE}}(\bm{\lambda},\bm{\theta}):=\sum\limits_{i=1}^{n}\Big{(}\mathbb{E}_{q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})}\left[\log p_{\bm{\theta}}(\bm{x}_{i}|\bm{z}_{i})\right]-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))\Big{)}-\sum\limits_{(v_{i},v_{j})\in E}w^{\text{MAS}}_{G,(v_{i},v_{j})}\cdot

\displaystyle\Big{(}\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})||p_{0}(\bm{z}_{i},\bm{z}_{j}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})||p_{0}(\bm{z}_{j}))\Big{)}.

Taxonomy

Topics: Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Full text

Learning Correlated Latent Representations with Adaptive Priors

Da Tang
Columbia University

Dawen Liang
Netflix Inc.

Nicholas Ruozzi
The University of Texas at Dallas

Tony Jebara
Columbia University & Spotify Inc.

Abstract

Variational Auto-Encoders (vaes) have been widely applied for learning compact, low-dimensional latent representations of high-dimensional data. When the correlation structure among data points is available, previous work proposed Correlated Variational Auto-Encoders (cvaes), which employ a structured mixture model as prior and a structured variational posterior for each mixture component to enforce that the learned latent representations follow the same correlation structure. However, as we demonstrate in this work, such a choice cannot guarantee that cvaes capture all the correlations. Furthermore, it prevents us from obtaining a tractable joint and marginal variational distribution. To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (acvaes), which apply an adaptive prior distribution that can be adjusted during training and can learn a tractable joint variational distribution. Its tractable form also enables further refinement with belief propagation. Experimental results on link prediction and hierarchical clustering show that acvaes significantly outperform cvaes among other benchmarks.

1 INTRODUCTION

Variational Auto-Encoders (vaes) [13, 23] are a family of deep generative models that learn latent embeddings for data. By applying variational inference on the latent variables, vaes learn a stochastic mapping from high-dimensional data to low-dimensional representations, which can be used for many downstream tasks, including classification, regression, and clustering.

vaes assume that the data points are generated i.i.d. and treat the model and posterior approximations as factorized over data points. However, if we know a priori that there is structured correlation between the data points, e.g., for graph-structured datasets [24, 3, 8, 27], correlated variational approximations can help. Tang et al. [27] proposed Correlated Variational Auto-Encoders (cvaes), which take this kind of correlation structure as auxiliary information to guide the variational approximations for the latent embeddings by constructing a prior from a uniform mixture of tractable distributions on maximal acyclic subgraphs of the given undirected correlation graph.

However, there are several limitations that potentially prevent cvaes from learning better correlated latent embeddings. First, some of the maximal acyclic subgraphs of the given graph may, by themselves, capture the correlation between the data points well, while others may capture it poorly. As a result, taking a uniform average may yield a sub-optimal result. Second, while the prior in cvaes is over multiple subgraphs, each subgraph has its own joint variational distribution, and there is no single global joint variational distribution over the latent variables. cvaes do learn pairwise variational approximation functions, but these are not exact pairwise marginal variational distributions on the latent variables. As a result, applying these variational approximation functions to some downstream tasks, e.g., link prediction, may result in poor performance due to the inexact approximations. In addition, cvaes require a pre-processing step that takes time cubic in the number of vertices, which limits their applicability to smaller datasets.

To address these issues, we propose Adaptive Correlated Variational Auto-Encoders (acvaes), which choose a non-uniform average over tractable distributions over the maximal acyclic subgraphs as a prior. This prior is adaptive, and will be adjusted during optimization. To learn the mixture weights, we provide two options, empirical Bayes or saddle-point optimization, both of which maximize the objective with respect to the model and variational parameters. The difference is that while empirical Bayes also maximizes the objective with respect to the prior structure, saddle-point optimization seeks to optimize the objective under the worst prior for more robust inference. In both cases, the non-uniform average converges to a tractable prior on a single graph, which ensures that we obtain a holistic tractable joint variational distribution. With this variational distribution, we obtain exact marginal evaluation using exact inference algorithms, e.g., belief propagation. Moreover, acvaes do not require the cubic-time pre-processing step embedded in cvaes, and they are generally faster for evaluation in practice. We demonstrate the superior empirical performance of acvaes for link prediction and hierarchical clustering on various real datasets.

2 VAES WITH CORRELATIONS

In this section, we provide a brief overview of Variational Auto-Encoders (vaes) [13, 23] as well as Correlated Variational Auto-Encoders (cvaes) [27], which take the correlation structure among data points into consideration.

2.1 Variational Auto-Encoders

We use a latent variable model to fit data $\bm{x}=\{\bm{x}_{1},\ldots,\bm{x}_{n}\}\subset\mathbb{R}^{D}$ . The model assumes that there exist low-dimensional latent embeddings for each data point $\bm{z}=\{\bm{z}_{1},\ldots,\bm{z}_{n}\}\subset\mathbb{R}^{d}$ ( $d\ll D$ ), which come from a prior distribution $p_{0}(\cdot)$ , and $\bm{x}_{i}$ ’s are drawn conditionally independently given $\bm{z}_{i}$ . Denote the model parameters as $\bm{\theta}$ . The likelihood of this model is $p_{\bm{\theta}}(\bm{x})=\prod\limits_{i=1}^{n}\int p_{0}(\bm{z}_{i})p_{\bm{\theta}}(\bm{x}_{i}|\bm{z}_{i})d\bm{z}_{i}$ .

To simultaneously learn the model parameters $\bm{\theta}$ as well as a mapping from the observed data $\bm{x}$ to the latent embeddings $\bm{z}$ , Variational Auto-Encoders (vaes) [13, 23] apply a data-dependent variational approximation $q_{\bm{\lambda}}(\bm{z}|\bm{x})=\prod\limits_{i=1}^{n}q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})$ , where $\bm{\lambda}$ denotes the variational parameters, and maximize the evidence lower-bound (ELBO) on the log-likelihood of the data, $\log p_{\bm{\theta}}(\bm{x})$ :

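$$\log p_{\bm{\theta}}(\bm{x})\geq\mathcal{L}(\bm{\lambda},\bm{\theta}):=\sum\limits_{i=1}^{n}\Big(\mathbb{E}_{q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})}\left[\log p_{\bm{\theta}}(\bm{x}_{i}|\bm{z}_{i})\right]-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))\Big).\tag{1}$$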

2.2 Correlated Variational Auto-Encoders

Standard vaes are capable of learning compact low-dimensional embeddings for high-dimensional data. However, due to the i.i.d. assumption, they fail to account for the correlations between data points when a priori we know such correlations exist. Correlated Variational Auto-Encoders (cvaes) [27] mitigate the issue by employing a structured prior as well as a structured variational posterior.

Formally, assume we are given an undirected correlation graph $G=(V,E)$ , where $(v_{i},v_{j})\in E$ represents that the data points $x_{i}$ and $x_{j}$ are correlated. cvaes apply a correlated prior $p^{\textrm{corr}}_{0}(\bm{z})$ on the latent variables $\bm{z}_{i}$ ’s which satisfies

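$$\begin{cases}p_{0}^{\textrm{corr}}(\bm{z}_{i})=p_{0}(\bm{z}_{i})&\text{for all }v_{i}\in V,\\ p_{0}^{\textrm{corr}}(\bm{z}_{i},\bm{z}_{j})=p_{0}(\bm{z}_{i},\bm{z}_{j})&\text{if }(v_{i},v_{j})\in E.\end{cases}\tag{2}$$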

Here $p_{0}(\cdot)$ and $p_{0}(\cdot,\cdot)$ are parameter-free functions that capture the singleton and pairwise marginal distributions of the latent variables. For example, we can set $p_{0}(\cdot)$ to be the density of a standard multivariate normal distribution and $p_{0}(\cdot,\cdot)$ to be a multivariate normal density that has high values if the two inputs are close to each other. With such a prior, we again assume $\bm{x}_{i}$ ’s are drawn conditionally independently given $\bm{z}_{i}$ . When $G$ is acyclic, such a prior $p_{0}^{\text{corr}}(\bm{z})$ does exist [30]:

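$$p_{0}^{\textrm{corr}}(\bm{z})=\prod\limits_{i=1}^{n}p_{0}(\bm{z}_{i})\prod\limits_{(v_{i},v_{j})\in E}\frac{p_{0}(\bm{z}_{i},\bm{z}_{j})}{p_{0}(\bm{z}_{i})p_{0}(\bm{z}_{j})}.\tag{3}$$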

However, when $G$ is not acyclic, Eq. 3 is not necessarily a valid probability density function. To deal with this issue, cvaes propose constructing a prior that is a mixture over the set $\mathcal{A}_{G}$ of all of $G$ ’s maximal acyclic subgraphs, which are defined as follows.

Definition 1 (Maximal acyclic subgraph).

For an undirected graph $G=(V,E)$ , an acyclic subgraph $G^{\prime}=(V^{\prime},E^{\prime})$ is a maximal acyclic subgraph of $G$ if:

• $V^{\prime}=V$ , i.e., $G^{\prime}$ contains all vertices of $G$ .

• Adding any edge from $E\setminus E^{\prime}$ to $E^{\prime}$ will create a cycle in $G^{\prime}$ .
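Definition 1 can be checked mechanically with a union-find structure. The following is a minimal sketch rather than code from the paper: the function names are hypothetical, vertices are assumed to be labeled $0,\ldots,n-1$, and the graph is assumed to have no self-loops.

```python
def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def is_maximal_acyclic_subgraph(n, edges, sub_edges):
    """Check Definition 1 for a graph on vertices 0..n-1 (hypothetical helper).

    `edges` is E and `sub_edges` is E'; both are iterables of (u, v) pairs
    describing undirected edges without self-loops.
    """
    edge_set = {frozenset(e) for e in edges}
    sub_set = {frozenset(e) for e in sub_edges}
    if not sub_set <= edge_set:
        return False  # E' must be a subset of E
    parent = list(range(n))
    for e in sub_set:
        u, v = tuple(e)
        ru, rv = find(parent, u), find(parent, v)
        if ru == rv:
            return False  # E' already contains a cycle
        parent[ru] = rv
    # Maximality: every edge left out of E' must join two vertices that
    # E' already connects, so adding it would close a cycle.
    for e in edge_set - sub_set:
        u, v = tuple(e)
        if find(parent, u) != find(parent, v):
            return False
    return True
```

On a triangle, any two edges form a maximal acyclic subgraph, while a single edge (not maximal) and all three edges (cyclic) do not.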

Each of the maximal acyclic subgraphs $G^{\prime}\in\mathcal{A}_{G}$ partially approximates the correlation structure in $G$ , and cvaes set the prior $p_{0}^{\textrm{corr}_{g}}(\bm{z})$ to be the uniform average over all of these tractable densities:

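$$p_{0}^{\textrm{corr}_{g}}(\bm{z})\triangleq\frac{1}{|\mathcal{A}_{G}|}\sum\limits_{G^{\prime}=(V,E^{\prime})\in\mathcal{A}_{G}}p_{0}^{G^{\prime}}(\bm{z}),\tag{4}$$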

where $p_{0}^{G^{\prime}}(\bm{z})=\prod\limits_{i=1}^{n}p_{0}(\bm{z}_{i})\prod\limits_{(v_{i},v_{j})\in E^{\prime}}\frac{p_{0}(\bm{z}_{i},\bm{z}_{j})}{p_{0}(\bm{z}_{i})p_{0}(\bm{z}_{j})}$ is a prior on a maximal acyclic subgraph $G^{\prime}=(V,E^{\prime})$ with the same form as in Eq. 3. For each $G^{\prime}\in\mathcal{A}_{G}$ , we can similarly define a structured variational approximation $q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})$ following the form of Eq. 3 (see Appendix A for details). With this structured prior and variational posterior, cvaes optimize a different ELBO:

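$$\log p_{\bm{\theta}}(\bm{x})\geq\frac{1}{|\mathcal{A}_{G}|}\sum\limits_{G^{\prime}\in\mathcal{A}_{G}}\Big(\mathbb{E}_{q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})}[\log p_{\bm{\theta}}(\bm{x}|\bm{z})]-\text{KL}(q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})||p_{0}^{G^{\prime}}(\bm{z}))\Big).\tag{5}$$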

Tang et al. [27] show empirically that cvaes are capable of capitalizing on the correlation structure as auxiliary information when learning latent embeddings. However, as discussed in the introduction, there are a few limitations to this approach. In the next section, we propose fixes to all of these limitations.

3 ADAPTIVE CORRELATED VAES

3.1 A Non-uniform Mixture Prior

As motivated in Section 2.2, rather than using a uniform average, we instead employ a categorical distribution $\bm{\pi}\in\triangle^{|\mathcal{A}_{G}|-1}$ representing the normalized weights over all maximal acyclic subgraphs $G^{\prime}\in\mathcal{A}_{G}$ of $G$ . In the ELBO in Eq. 5, we can replace the uniform average in the prior $p_{0}^{\textrm{corr}_{g}}(\bm{z})$ in Eq. 4 with the non-uniform distribution $\bm{\pi}$ , which gives us the following ELBO:

[TABLE]

Here we define the non-uniform prior $p_{0}^{\bm{\pi}}(\bm{z})=\mathbb{E}_{G^{\prime}\sim\bm{\pi}}[p_{0}^{G^{\prime}}(\bm{z})]$ . From the above inequality we can see that, using the non-uniform prior $p_{0}^{\bm{\pi}}$ , we are still able to obtain a lower bound of the log-likelihood $\log p_{\bm{\pi},\bm{\theta}}(\bm{x})$ , which is now also parametrized by the weight parameter $\bm{\pi}$ . If we optimize $\bm{\pi}$ together with all the other parameters, the above loss function implies that we are optimizing with an adaptive prior. Hence, we call the above model Adaptive Correlated Variational Auto-Encoders (acvaes). If we replace $\bm{\pi}$ with a uniform distribution over all subgraphs in $\mathcal{A}_{G}$ , we recover cvaes.

Plugging $q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})$ and $p_{0}^{G^{\prime}}(\bm{z})$ from Section 2.2 into Eq. 6 yields the following ELBO for acvaes:

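$$\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta}):=\sum\limits_{i=1}^{n}\Big(\mathbb{E}_{q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})}\left[\log p_{\bm{\theta}}(\bm{x}_{i}|\bm{z}_{i})\right]-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))\Big)-\sum\limits_{(v_{i},v_{j})\in E}w^{\text{MAS}}_{G,\bm{\pi},(v_{i},v_{j})}\cdot\Big(\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})||p_{0}(\bm{z}_{i},\bm{z}_{j}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})||p_{0}(\bm{z}_{j}))\Big).\tag{7}$$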

Similar to cvaes, we have edge weights $w^{\text{MAS}}_{G,\bm{\pi},(v_{i},v_{j})}$ representing the expected appearance probability for edge $(v_{i},v_{j})$ over the set of maximal acyclic subgraphs $\mathcal{A}_{G}$ given the distribution $\bm{\pi}$ . In the following definition, with a slight abuse of notation, we write $\bm{\pi}(G^{\prime})$ for the probability of $G^{\prime}$ being sampled from $\mathcal{A}_{G}$ .

Definition 2 (Non-uniform maximal acyclic subgraph edge weight).

For an undirected graph $G=(V,E)$ , an edge $e\in E$ and a distribution $\bm{\pi}$ on the set $\mathcal{A}_{G}$ of maximal acyclic subgraphs of $G$ , define $w^{\text{MAS}}_{G,\bm{\pi},e}$ to be the expected appearance probability of the edge $e$ in a random maximal acyclic subgraph $G^{\prime}=(V,E^{\prime})\sim\bm{\pi}$ , i.e., $w^{\text{MAS}}_{G,\bm{\pi},e}:=\sum\limits_{G^{\prime}\in\mathcal{A}_{G},e\in E^{\prime}}\bm{\pi}(G^{\prime})$ .
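To make Definition 2 concrete, the sketch below enumerates the maximal acyclic subgraphs of a small graph by brute force and sums $\bm{\pi}(G^{\prime})$ over the subgraphs containing each edge. This is for illustration only: the helper names are hypothetical, the enumeration is exponential in $|E|$, and the learning procedure in this paper never enumerates $\mathcal{A}_{G}$ explicitly.

```python
from itertools import combinations


def _is_acyclic(n, edge_subset):
    # Union-find cycle check on vertices 0..n-1.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edge_subset:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True


def maximal_acyclic_subgraphs(n, edges):
    # Every maximal acyclic subgraph is a spanning forest, and all spanning
    # forests of G have the same number of edges, so it suffices to find the
    # largest size at which an acyclic edge subset exists.
    for size in range(min(len(edges), n - 1), -1, -1):
        found = [s for s in combinations(edges, size) if _is_acyclic(n, s)]
        if found:
            return found
    return [()]


def edge_weights(subgraphs, pi):
    # w_e is the sum of pi(G') over the subgraphs G' whose edge set contains e.
    weights = {}
    for sub, prob in zip(subgraphs, pi):
        for e in sub:
            weights[e] = weights.get(e, 0.0) + prob
    return weights
```

For a triangle there are three maximal acyclic subgraphs, each omitting one edge. A uniform $\bm{\pi}$ therefore gives every edge weight $2/3$, while a point mass on one subgraph gives weight 1 to its two edges and 0 to the omitted edge.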

Similar to cvaes, we can apply negative sampling (equivalent to applying a complete graph as a weak prior) to acvaes as regularization, which helps prevent overfitting on the learned pairwise variational approximation ( $\gamma>0$ is the regularization strength):

[TABLE]

In what follows, we use $\mathcal{L}^{\textrm{ACVAE}}$ to refer to $\mathcal{L}^{\textrm{ACVAE}\textrm{-NS}}$ for notational brevity.

3.2 Learning the Non-uniform Mixture

With the loss function in Eq. 8, an intuitive direction for estimating $\bm{\pi}$ would be to perform empirical Bayes [7] and directly maximize $\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta})$ with respect to $\bm{\pi}$ , $\bm{\lambda}$ and $\bm{\theta}$ , as in Eq. 9:

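$$\max\limits_{\bm{\lambda},\bm{\theta}}\max\limits_{\bm{\pi}}\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta}).\tag{9}$$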

Alternatively, we can consider a minimax saddle-point optimization, which may lead to more robust inference:

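$$\max\limits_{\bm{\lambda},\bm{\theta}}\min\limits_{\bm{\pi}}\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta}).\tag{10}$$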

As Eq. 10 indicates, we are optimizing the ELBO under the prior that produces the lowest lower bound. The intuition is that if we can even optimize the worst lower bound well, the variational distribution and the model distribution we learn would be robust and generalize better. This is similar to the least favorable prior, under which a Bayes estimator can achieve minimax risk [15].

Empirical Bayes (Eq. 9) aims to find the best variational approximation, while the saddle-point option (Eq. 10) aims for robust inference. At first glance, the empirical Bayes option seems more reasonable since it gives us the tightest lower bound. However, a better ELBO does not necessarily translate into better predictive performance in the downstream task. In Section 5, we compare these two optimization options on various datasets, and discuss the pros and cons of each.

An important observation is that, no matter which option is applied, for fixed $\bm{\lambda}$ and $\bm{\theta}$ , the loss function $\mathcal{L}^{\textrm{ACVAE}}$ is linear w.r.t. the weight parameter $\bm{\pi}$ . Therefore, if optima for $\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta})$ exist, then at least one optimum will have a $\bm{\pi}^{*}$ which puts all of its probability mass on a single subgraph $G^{\prime*}$ .

Proposition 1 (Optimum for $\bm{\pi}$ ).

If the optimization in Eq. 9 or Eq. 10 has global optima, then at least one optimum $(\bm{\pi}^{*},\bm{\lambda}^{*},\bm{\theta}^{*})$ will have a $\bm{\pi}^{*}$ that places all of its probability mass on a single maximal acyclic subgraph $G^{\prime*}\in\mathcal{A}_{G}$ .

From this proposition, we know both Eq. 9 and Eq. 10 return a single subgraph $G^{\prime*}$ , which drastically simplifies the structured prior. At this optimum, the loss function becomes the ELBO on a single acyclic subgraph $G^{\prime*}$ , with $q_{\bm{\lambda}^{*}}^{G^{\prime*}}(\bm{z}|\bm{x})$ as the variational distribution. Therefore, we have a holistic variational approximation, overcoming one of the limitations of cvaes.
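The linearity argument behind Proposition 1 can be spelled out in one line. The symbol $\ell_{G^{\prime}}$ below is shorthand introduced here (it does not appear elsewhere in the paper) for the per-subgraph ELBO inside the expectation in Eq. 6, and we assume the negative-sampling regularizer does not depend on $\bm{\pi}$, so that it only contributes a constant for fixed $\bm{\lambda}$ and $\bm{\theta}$:

```latex
\begin{align*}
\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta})
  &= \sum_{G^{\prime}\in\mathcal{A}_{G}} \bm{\pi}(G^{\prime})\,
     \ell_{G^{\prime}}(\bm{\lambda},\bm{\theta}) + \text{const},\\
\ell_{G^{\prime}}(\bm{\lambda},\bm{\theta})
  &:= \mathbb{E}_{q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})}
      [\log p_{\bm{\theta}}(\bm{x}|\bm{z})]
      - \text{KL}(q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})\,||\,p_{0}^{G^{\prime}}(\bm{z})),\\
\max_{\bm{\pi}\in\triangle^{|\mathcal{A}_{G}|-1}}
  \sum_{G^{\prime}\in\mathcal{A}_{G}} \bm{\pi}(G^{\prime})\,\ell_{G^{\prime}}
  &= \max_{G^{\prime}\in\mathcal{A}_{G}} \ell_{G^{\prime}},
\qquad
\min_{\bm{\pi}\in\triangle^{|\mathcal{A}_{G}|-1}}
  \sum_{G^{\prime}\in\mathcal{A}_{G}} \bm{\pi}(G^{\prime})\,\ell_{G^{\prime}}
  = \min_{G^{\prime}\in\mathcal{A}_{G}} \ell_{G^{\prime}}.
\end{align*}
```

A linear function on the probability simplex attains its maximum and its minimum at a vertex, and the vertices are exactly the point masses on single subgraphs.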

3.3 Learning with Alternating Updates

Direct optimization of either Eq. 9 or Eq. 10 is non-trivial. Following similar saddle-point optimization for a spanning tree structured upper bound for the log-partition function of undirected graphical models [31, 32], we perform an alternating optimization procedure on the parameters $\bm{\lambda}$ , $\bm{\theta}$ and $\bm{\pi}$ . Details are shown in Algorithm 1.

Updates For $\bm{\pi}$

When the parameters $\bm{\lambda}$ and $\bm{\theta}$ are fixed, the loss function $\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta})$ is linear in $\bm{\pi}$ . However, we cannot directly optimize over $\bm{\pi}\in\triangle^{|\mathcal{A}_{G}|-1}$ , as it may contain exponentially many dimensions. We can instead update the edge weights $w^{\text{MAS}}_{G,\bm{\pi},(v_{i},v_{j})}$ as the loss function is also linear in them.

By definition, we know that each maximal acyclic subgraph $G^{\prime}$ of $G$ is a forest, consisting of one spanning tree for each connected component of $G$ . Therefore, the domain for the edge weights $\bigcup\limits_{e\in E}\{w^{\text{MAS}}_{G,\bm{\pi},e}\}$ is the projection of the Cartesian product of the spanning tree polytopes for all connected components of $G$ [31, 32] to the edge weight space. This Cartesian product on the polytopes is convex and its boundary is determined by potentially exponentially many linear inequalities. Despite that, directly maximizing (or minimizing) $\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta})$ with respect to these weights $\bigcup\limits_{e\in E}\{w^{\text{MAS}}_{G,\bm{\pi},e}\}$ is in fact tractable: the optimum for Eq. 9 or Eq. 10 is obtained at $\hat{\bm{\pi}}$ that has all the mass on a single maximal acyclic subgraph $\hat{G}^{\prime}$ . This means the optimum for these edge weights can be obtained from a single subgraph $\hat{G}^{\prime}$ . By re-arranging terms in Eq. 8 with respect to $\bigcup\limits_{e\in E}\{w^{\text{MAS}}_{G,\bm{\pi},e}\}$ , it is not difficult to see that $\hat{G}^{\prime}$ should have the smallest (for empirical Bayes) or largest (for saddle-point) “edge mass” sum over all maximal acyclic subgraphs $\mathcal{A}_{G}$ , where the “edge mass” $m_{(v_{i},v_{j})}$ of edge $e=(v_{i},v_{j})$ is:

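$$m_{(v_{i},v_{j})}:=\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})||p_{0}(\bm{z}_{i},\bm{z}_{j}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})||p_{0}(\bm{z}_{i}))-\text{KL}(q_{\bm{\lambda}}(\bm{z}_{j}|\bm{x}_{j})||p_{0}(\bm{z}_{j})),\tag{11}$$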

which means $\hat{G}^{\prime}$ is the combination of the minimum (for empirical Bayes) or maximum (for saddle-point) spanning trees of all connected components of the graph with $m_{(v_{i},v_{j})}$ as the weights.
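The subgraph selection just described reduces to a standard minimum or maximum spanning forest computation. The sketch below uses Kruskal's algorithm, which is our choice for illustration and is not necessarily what the authors implemented. The function name, the `mode` flag, and the dictionary layout for the edge masses are all hypothetical.

```python
def select_forest(n, edge_mass, mode="EB"):
    """Pick the spanning forest with extremal total edge mass (illustrative).

    `edge_mass` maps each edge (u, v), with vertices in 0..n-1, to its mass.
    mode="EB" (empirical Bayes) returns the minimum-mass forest and
    mode="SP" (saddle-point) returns the maximum-mass forest.
    """
    if mode not in ("EB", "SP"):
        raise ValueError("mode must be 'EB' or 'SP'")
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal: scan edges in ascending (EB) or descending (SP) order of mass
    # and keep an edge whenever it joins two different components.
    ordered = sorted(edge_mass.items(), key=lambda kv: kv[1], reverse=(mode == "SP"))
    chosen = set()
    for (u, v), _ in ordered:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            chosen.add((u, v))
    return chosen
```

Because edges from different connected components never compete with each other, a single pass over all edges yields one spanning tree per component, which is exactly a maximal acyclic subgraph.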

Once we identify $\hat{G}^{\prime}$ , the optimal weights $\hat{w}^{\text{MAS}}_{G,\bm{\hat{\pi}},e}$ are either 1 (if the edge $e$ is selected) or 0 (otherwise). Instead of directly updating the weights to the optimal values, we perform a soft update with step size $\alpha^{t}$ at iteration $t$ , similar to Wainwright [31], Wainwright et al. [32]:

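$$w^{\text{MAS}^{t+1}}_{G,\bm{\pi},e}\leftarrow(1-\alpha^{t})w^{\text{MAS}^{t}}_{G,\bm{\pi},e}+\alpha^{t}\hat{w}^{\text{MAS}}_{G,\bm{\hat{\pi}},e}.\tag{12}$$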

This soft update helps prevent the algorithm from becoming trapped in bad local optima early in the optimization procedure. The step size $\alpha^{t}$ can be either a constant or dynamically adjusted during optimization. We set it to be a constant in our experiments.
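A minimal sketch of the soft update, assuming the weights are stored in a dictionary keyed by edge and the selected subgraph $\hat{G}^{\prime}$ is given as a set of edges (both representations, and the function name, are hypothetical):

```python
def soft_update(weights, chosen, alpha):
    """Move each edge weight toward its 0/1 target with step size alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # The target weight is 1 for edges in the selected forest and 0 otherwise.
    return {
        e: (1.0 - alpha) * w + alpha * (1.0 if e in chosen else 0.0)
        for e, w in weights.items()
    }
```

Since the new weights are a convex combination of two points in the (convex) domain of valid edge weights, they remain valid. In particular, for a connected graph the weights keep summing to $|V|-1$ after every update.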

One of the limitations of cvaes mentioned in Section 2.2 is the $O(|V|^{3})$ pre-processing step to compute all the edge weights $w^{\text{MAS}}_{G,e}$ . We alleviate this bottleneck in acvaes, as it only takes $O(\min(|V|^{2},|E|\log|V|))$ operations per initialization (details in Section B.2) and per update on the weights, which ensures that acvaes can scale to datasets with many more vertices than would be feasible with cvaes.

Updates For $\bm{\lambda}$ And $\bm{\theta}$

When $\bm{\pi}$ is fixed, $\bm{\lambda}$ and $\bm{\theta}$ can be updated by taking a stochastic gradient step following $\nabla_{\bm{\lambda},\bm{\theta}}\mathcal{L}^{\textrm{ACVAE}}(\bm{\pi},\bm{\lambda},\bm{\theta})$ with reparametrization gradient [13, 23], as done in standard vaes.

If empirical Bayes (Eq. 9) is applied, Algorithm 1 will converge with properly selected learning rates. On the other hand, it is difficult to make any general statement about the convergence for saddle-point optimization (Eq. 10) since the objective is generally non-concave in $(\bm{\lambda},\bm{\theta})$ . However, as we show in Section 5, empirically we find that Algorithm 1 is stable for both options and performs well on multiple real datasets.

3.4 Exact Marginal Posterior Approximation with Belief Propagation

From Proposition 1, we know the weights $w^{\text{MAS}}_{G,\bm{\pi},e}$ returned from Algorithm 1 are from a single maximal acyclic subgraph $G^{\prime}\in\mathcal{A}_{G}$ . Consequently, we have a holistic variational approximation $q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})$ . However, by itself this variational approximation is not necessarily better at the downstream predictive tasks than cvaes, since it can only make use of the structure from one maximal acyclic subgraph $G^{\prime}$ .

On the plus side, the acyclic structure of $G^{\prime}$ makes it possible to compute the exact pairwise marginal variational distribution between any pair of vertices via a belief-propagation-style [21] message-passing algorithm, which is not possible for cvaes, as they do not have a single joint variational distribution on $\bm{z}$ . This can be crucial in tasks in which we need an accurate pairwise marginal approximation, e.g., link prediction and hierarchical clustering.

Consider any $v_{i}\neq v_{j}\in V$ that are in the same connected component of $G^{\prime}$ . Since $G^{\prime}$ is acyclic there is a unique path from $v_{i}$ to $v_{j}$ . Denote it as $v_{i}=u^{i,j}_{0}\rightarrow u^{i,j}_{1}\rightarrow\ldots\rightarrow u^{i,j}_{k_{i,j}}=v_{j}$ . The exact pairwise marginal $r_{\bm{\lambda}}(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})$ equals

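$$\int\prod\limits_{l=0}^{k_{i,j}-1}q_{\bm{\lambda}}(\bm{z}_{u^{i,j}_{l}},\bm{z}_{u^{i,j}_{l+1}}|\bm{x}_{u^{i,j}_{l}},\bm{x}_{u^{i,j}_{l+1}})\prod\limits_{l=1}^{k_{i,j}-1}\frac{d\bm{z}_{u^{i,j}_{l}}}{q_{\bm{\lambda}}(\bm{z}_{u^{i,j}_{l}}|\bm{x}_{u^{i,j}_{l}})}.\tag{13}$$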

The above pairwise marginal densities can be computed for all pairs of $(v_{i},v_{j})$ by doing a depth- or breadth-first search starting from each $v_{i}\in V$ after we obtain the variational approximation $q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})$ from Algorithm 1, which has a total complexity of $O(|V|^{2})$ . Note that the time complexity for evaluating every pairwise marginal in cvaes is also $O(|V|^{2})$ . However, the belief propagation refinement is usually more efficient in practice, since it involves far fewer neural network function evaluations, which dominate the runtime.
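The path-based marginalization can be illustrated on a toy discrete model, where the integral over intermediate latent variables becomes a sum. This is only a sketch under that simplifying assumption (the latent variables in this paper are continuous), and the helper names and table layout are hypothetical. Pairwise tables are assumed to be stored in the orientation in which the path traverses each edge.

```python
from collections import deque


def tree_path(adj, src, dst):
    """Return the unique path src -> dst in a forest, or None if disconnected."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj.get(u, ()):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        return None
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def pairwise_marginal(path, single, pair):
    """Discrete analogue of the exact pairwise marginal along a path.

    single[u][a] is q(z_u = a); pair[(u, v)][a][b] is q(z_u = a, z_v = b).
    The path must contain at least two vertices.
    """
    m = [row[:] for row in pair[(path[0], path[1])]]
    for u, v in zip(path[1:], path[2:]):
        step = pair[(u, v)]
        # Multiply by q(z_u, z_v) / q(z_u) and sum out the intermediate z_u.
        m = [
            [
                sum(row[b] * step[b][c] / single[u][b] for b in range(len(step)))
                for c in range(len(step[0]))
            ]
            for row in m
        ]
    return m
```

For a chain $0 - 1 - 2$ of binary variables with uniform singleton marginals and edge tables $[[0.4, 0.1], [0.1, 0.4]]$, the resulting marginal of $(z_{0}, z_{2})$ is $[[0.34, 0.16], [0.16, 0.34]]$, which sums to one as expected.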

4 RELATED WORK

This work extends cvaes with the idea of learning a non-uniform average loss over some tractable loss functions on maximal acyclic subgraphs of the given graph. This is similar to the idea of obtaining a tighter upper bound on the log-partition function for an undirected graphical model by minimizing over a convex combination of spanning trees of the given graph [32]. To optimize the parameters, Wainwright et al. [32] also apply alternating updates on the parameters and the distributions over the spanning trees, similar to the approach in acvae learning. Alternating parameter updates are useful in many other settings, for example, Alternating Least Squares for matrix factorization [25] and the Alternating Direction Method of Multipliers (ADMM) for convex optimization [4, 26, 10].

Some recent work also focuses on incorporating correlation structures over latent variables. For example, Hoffman and Blei [9] proposed structured variational families that can improve over traditional mean-field variational inference. Johnson et al. [11] proposed Structured vaes that apply more complex forms for the priors on the latent embeddings. Recently in the NLP community, Yin et al. [34] proposed utilizing tree-structured latent variable models to deal with semantic parsing. However, most of these works focus on correlations within dimensions of latent variables, whereas our work focuses on correlations between latent variables, similar to the setting of cvaes. In addition, Luo et al. [17] incorporated pairwise correlations between latent variables into deep generative models for semi-crowdsourced clustering.

Another line of related work appears in convolutional networks for graphs and their extensions [3, 6, 5, 20, 8, 28], which also take the graph structure of the data into consideration.

5 EXPERIMENTS

In this section, we evaluate acvaes on the tasks of link prediction and hierarchical clustering. We show that our method significantly outperforms various baselines. We also attempt to identify the factors contributing to the gain by answering the following questions:

Q1: Uniform mixture (cvae) versus non-uniform mixture (acvae), which one is better? (Section 5.2.1)

Q2: How important is the belief propagation refinement for acvae? (Section 5.2.1)

Q3: Empirical Bayes versus saddle-point, which one performs better? Can we select purely based on ELBO? (Section 5.2.2)

Q4: Does the learned single graph capture more information than singleton representations? What do the learned latent embeddings look like? (Section 5.2.3)

Q5: Can acvae scale to datasets that cvae cannot? (Section 5.2.4)

5.1 Experiment Settings

Before presenting our experimental results, we describe the tasks, datasets, baselines, and metrics for evaluation. Additional details can be found in Appendix B.

5.1.1 Tasks

For each of the tasks, we are given a correlation graph $G=(V=\{v_{1},\ldots,v_{n}\},E)$ and a feature vector $\bm{x}_{i}\in\mathbb{R}^{N}$ for each $i\in\{1,\ldots,n\}$ .

For the link prediction task, we follow the setting of Tang et al. [27]. For the hierarchical clustering experiments, we apply the complete-linkage algorithm [33], which is relatively stable among common hierarchical clustering algorithms. We cluster all data points into $K=5$ clusters.

5.1.2 Datasets

We evaluate acvaes on the following three datasets. All three datasets are used for link prediction; in addition, the LibraryThing dataset is used for the hierarchical clustering experiment:

• Epinions (http://www.trustlet.org/downloaded_epinions.html) [19], a public product rating dataset that contains $\approx 49\text{K}$ users and $\approx 140\text{K}$ products. After pre-processing, the dataset contains $\approx 16\text{K}$ users.

• Citation (http://snap.stanford.edu/data/cit-HepTh.html) [16], a high-energy physics theory citation network dataset, which has a citation graph with $\approx 28\text{K}$ papers and $\approx 353\text{K}$ citation edges. After pre-processing, the dataset contains $\approx 2\text{K}$ papers (for the results in Section 5.2.1). We also perform an experiment in Section 5.2.4 on a larger version of this dataset, which contains $\approx 26\text{K}$ papers.

• LibraryThing (https://cseweb.ucsd.edu/~jmcauley/datasets.html#social_data) [14], a public book review dataset that contains $\approx 73\text{K}$ users and $\approx 337\text{K}$ items. After pre-processing, the dataset contains $\approx 6\text{K}$ users.

For the hierarchical clustering task, the LibraryThing dataset does not contain cluster labels for users. We generate the cluster labels for each user by learning a standard vae on the feature vectors $\bm{x}$ and then running the complete-linkage algorithm to cluster the data points into $K=5$ clusters. This gives us a semi-synthetic dataset.

5.1.3 Baselines

We compare acvae with 4 baseline methods:

• vae [13]: standard variational auto-encoders, with no information about the correlations.

• GraphSAGE [8]: the state-of-the-art method for learning latent embeddings that takes the correlation structure into account with graph convolutional neural networks.

• cvae ${}_{\textrm{ind}}$ and cvae ${}_{\textrm{corr}}$ [27]: two variations of cvaes with factorized and structured variational approximations, respectively.

There are many different variants of GraphSAGE, and we applied one of them (details in Section B.2). It is possible that some other variants or parameter settings of this method may perform better on our tasks. But our main goal is not to derive a state-of-the-art method for these tasks. Instead, we aim to show insights on how to improve over standard vaes and cvaes through learning adaptive correlated priors.

5.1.4 Metrics

For all methods, we first learn latent embeddings $\bm{z}_{1},\ldots,\bm{z}_{n}$ , which are deterministic for GraphSAGE and stochastic for the vae-based methods. For GraphSAGE, we compute the pairwise distance $\text{dis}_{i,j}$ between each pair $(\bm{z}_{i},\bm{z}_{j})$ of the latent embeddings as $\|\bm{z}_{i}-\bm{z}_{j}\|_{2}^{2}$ . Since the embeddings are stochastic for the vae-based methods, we use $\mathbb{E}[\|\bm{z}_{i}-\bm{z}_{j}\|_{2}^{2}]$ as the pairwise distance for them. The expectation is taken over the variational pairwise marginal $q(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})$ , or over the refined pairwise marginal $r(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})$ if we perform belief propagation (Section 3.4).
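For Gaussian pairwise marginals that factorize across dimensions (the parameterization described in Appendix B.2), the expected squared distance has a closed form: each dimension contributes $(\mu_{i}-\mu_{j})^{2}+\sigma_{i}^{2}+\sigma_{j}^{2}-2\rho\sigma_{i}\sigma_{j}$ . The sketch below evaluates this sum; the function name and the list-based inputs are assumptions made for illustration, not the authors' implementation.

```python
def expected_sq_distance(mu_i, sigma_i, mu_j, sigma_j, rho):
    """E[||z_i - z_j||^2] for per-dimension bivariate normal marginals.

    mu_*, sigma_* -- per-dimension means and standard deviations
    rho           -- per-dimension correlation between z_i and z_j
    All arguments are equal-length sequences of floats (hypothetical format).
    """
    total = 0.0
    for mi, si, mj, sj, r in zip(mu_i, sigma_i, mu_j, sigma_j, rho):
        # Var(z_i - z_j) = s_i^2 + s_j^2 - 2 * rho * s_i * s_j per dimension.
        total += (mi - mj) ** 2 + si ** 2 + sj ** 2 - 2.0 * r * si * sj
    return total


distance = expected_sq_distance(
    mu_i=[0.0, 1.0], sigma_i=[1.0, 1.0],
    mu_j=[0.0, 3.0], sigma_j=[1.0, 1.0],
    rho=[1.0, 0.0],
)
```

A higher correlation shrinks the expected distance even when the means are unchanged, which is why the pairwise marginal (rather than the two singleton marginals alone) matters for this metric.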

For the link prediction experiments, for each user $u_{i}$ , we compute the Cumulative Reciprocal Rank (CRR) as follows.

[TABLE]

A larger CRR value indicates the heldout edges have a higher rank among all the candidates. We further normalize the CRR values to be in $[0,1]$ , and report the normalized CRR (NCRR).
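The sketch below shows one way such a rank-based score could be computed. It rests on assumptions that should be checked against the displayed definition: the candidates for user $u_{i}$ are all other vertices that are not training neighbors, the rank of a heldout neighbor is the number of candidates at least as close as it, CRR sums the reciprocal ranks, and normalization divides by the best achievable value. All function and argument names are hypothetical.

```python
def crr(dist_row, i, train_neighbors, heldout_neighbors):
    """Cumulative reciprocal rank for vertex i (assumed definition).

    dist_row          -- dist_row[k] is the distance between vertices i and k
    train_neighbors   -- set of vertices linked to i in the training graph
    heldout_neighbors -- set of vertices linked to i by heldout edges
    """
    candidates = [k for k in range(len(dist_row))
                  if k != i and k not in train_neighbors]
    total = 0.0
    for j in heldout_neighbors:
        # Rank 1 means no other candidate is closer than the heldout neighbor.
        rank = sum(1 for k in candidates if dist_row[k] <= dist_row[j])
        total += 1.0 / rank
    return total


def normalized_crr(dist_row, i, train_neighbors, heldout_neighbors):
    """CRR divided by its maximum (heldout neighbors ranked first)."""
    best = sum(1.0 / r for r in range(1, len(heldout_neighbors) + 1))
    if best == 0.0:
        return 0.0
    return crr(dist_row, i, train_neighbors, heldout_neighbors) / best


row = [0.0, 1.0, 2.0, 3.0, 4.0]
score = normalized_crr(row, 0, {1}, {2, 4})
```

With this normalization a score of 1 corresponds to every heldout neighbor being ranked ahead of every non-neighbor.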

For hierarchical clustering, we apply the normalized mutual-information scores [29] as the metric. These scores are in the range $[0,1]$ and a larger score indicates better clustering performance.
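For concreteness, a normalized mutual information score between two label assignments can be computed as in the following sketch. Several normalizations exist; this version divides by the geometric mean of the two entropies, which is an assumption on our part and may differ from the variant used in the experiments.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (in nats) of a clustering given its cluster sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def normalized_mutual_information(labels_a, labels_b):
    """NMI in [0, 1] using geometric-mean normalization (assumed variant)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
             for (a, b), n_ab in joint.items())
    h_a, h_b = entropy(count_a, n), entropy(count_b, n)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    # Clamp to guard against tiny floating-point overshoot.
    return min(1.0, max(0.0, mi / math.sqrt(h_a * h_b)))


same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to relabeling of clusters, so the first call compares two identical partitions, while the second compares two independent ones.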

5.2 Results

We show the heldout NCRR values for link predictions and the normalized MI scores in Table 1 and Table 2, respectively. acvae ${}_{\textrm{EB}}$ and acvae ${}_{\textrm{SP}}$ stand for empirical Bayes (Eq. 9) and saddle-point optimization (Eq. 10), respectively. The rows with BP mean we perform belief-propagation refinement (Section 3.4). We dissect the results in the following sections.

5.2.1 Advantages of the Non-uniform Mixture

As motivated in Section 2.2, acvaes improve over the limitations of cvaes by providing a holistic variational approximation at the end of the empirical Bayes or saddle-point optimization, which further enables applying belief propagation for more accurate marginal approximation.

At first glance, the performance in Table 1 for the single joint distribution (the rows acvae ${}_{\textrm{EB}}$ and acvae ${}_{\textrm{SP}}$ ) is no better than that of cvae ${}_{\textrm{corr}}$ , which applies a uniform mixture. We speculate in Section 3.4 that by itself this holistic variational approximation might not necessarily be better at the downstream predictive tasks since it can only make use of the structure from one maximal acyclic subgraph, even though it sometimes has a higher ELBO (Table 4). However, we observe a large performance boost after applying the belief propagation refinement, which outperforms the baseline methods by a wide margin for link prediction and performs comparatively better for hierarchical clustering (vae does not count as a baseline method for the clustering experiment since it is applied as an oracle in the pre-processing steps). We omit the results for acvaes without belief propagation for the hierarchical clustering experiments since empirically we found their performance to be much worse than with belief propagation refinement.

Recall that the prerequisite for applying belief propagation is to have a variational distribution on a single acyclic subgraph (i.e., we cannot perform BP with cvaes). This answers two of the questions we posed. First, the non-uniform mixture is not necessarily better than the uniform mixture at the downstream task even when it has a higher ELBO, but it opens up the possibility of performing exact inference. Second, variational approximations have a lot of room for improvement when compared with exact inference (i.e., belief propagation) on an acyclic graph.

5.2.2 Empirical Bayes versus Saddle-Point

As shown in Table 1 and Table 2, both empirical Bayes and saddle-point optimization perform similarly on most tasks, though the saddle-point option is often more stable (normally having a smaller variance in the metrics). This is reasonable since the saddle-point objective optimizes the most conservative lower bound.

Moreover, we show that we should not select between these two methods purely based on ELBO: by definition, the saddle-point optimization will yield an ELBO lower than empirical Bayes. In Table 3 and Table 4 we report ELBO as well as NCRR for 4 choices of the negative sampling parameter $\gamma$ (Eq. 8) on Epinions and Citation with and without belief propagation refinement. We can see clearly that a better ELBO does not necessarily correlate with a better NCRR, with or without belief propagation refinement.

In general, both methods have their advantages. On simpler datasets, e.g., Citation, on which all methods perform well, empirical Bayes is preferred since it can easily capture the best correlation structure. On the other hand, with more complex datasets/difficult tasks, saddle-point optimization tends to provide more robust inference and stable results.

5.2.3 Learned Graph Structures

In Figure 1 we visualize part of the largest connected component of the maximal acyclic subgraph $\hat{G}^{\prime}=(V,\hat{E}^{\prime})$ that acvaes learn for the variational distribution on the Citation dataset with both empirical Bayes and saddle-point optimization (colors for better clarity only). The coordinates are t-SNE embeddings for the variational approximation mean of the latent variables. The edge widths are proportional to the strength of the learned correlations. We can see some of the learned embeddings are not necessarily close to each other even when they have high correlations. This indicates that the learned $\hat{G}^{\prime}$ provides some additional information that singleton marginals cannot provide.

5.2.4 Scalability to Large Datasets

To demonstrate the scalability of acvae compared to cvae, we perform an experiment on a larger version of the Citation dataset with more than 12 times as many vertices, to which cvae cannot easily scale due to the cubic-time initialization step and the quadratic pairwise marginal evaluations.

We compare the performance of acvae plus the belief propagation refinement on both the empirical Bayes and the saddle-point schemes with the other two baseline methods (vae and GraphSAGE). As shown in Table 5, both schemes of acvae can significantly outperform the baseline methods.

6 CONCLUSION

In this paper, we introduce acvaes, which learn a joint variational distribution on the latent embeddings of input data by optimizing a loss function that is a non-uniform average over some tractable correlated ELBOs. To learn the mixture weights, we provide two different options and compare them on various datasets and tasks. The learned joint variational distribution can be used to perform efficient evaluation using belief propagation. Experimental results show that acvaes can outperform existing methods for link prediction and hierarchical clustering on three real datasets. Future work will include better understanding the learned graph structures from both options and learning higher-order correlations between latent variables.

Appendix

In the appendix, we provide more details on our baseline method cvae [27] as well as the experiment data pre-processing and protocols.

Appendix A MORE DETAILS ON cvaes

cvaes set the prior $p_{0}^{\textrm{corr}_{g}}(\bm{z})$ to be the uniform average over all of these tractable densities:

[TABLE]

where $p_{0}^{G^{\prime}}(\bm{z})=\prod\limits_{i=1}^{n}p_{0}(\bm{z}_{i})\prod\limits_{(v_{i},v_{j})\in E^{\prime}}\frac{p_{0}(\bm{z}_{i},\bm{z}_{j})}{p_{0}(\bm{z}_{i})p_{0}(\bm{z}_{j})}$ is a prior on a maximal acyclic subgraph $G^{\prime}=(V,E^{\prime})$ with the same form as in Eq. 3. For each $G^{\prime}\in\mathcal{A}_{G}$ , we can similarly define a structured variational approximation $q_{\bm{\lambda}}^{G^{\prime}}(\bm{z}|\bm{x})$ following the form of Eq. 3:

[TABLE]

where $q_{\bm{\lambda}}(\cdot|\cdot)$ and $q_{\bm{\lambda}}(\cdot,\cdot|\cdot,\cdot)$ are two conditional density functions that capture the singleton and pairwise variational approximation densities. These two functions need to satisfy the symmetry and consistency properties:

[TABLE]

The ELBO in Eq. 5 is an average over potentially exponentially many ELBOs. To make computations tractable, Tang et al. [27] simplify this lower bound and represent it as

[TABLE]

where $w^{\text{MAS}}_{G,e}:=\frac{|\{G^{\prime}\in\mathcal{A}_{G}:e\in G^{\prime}\}|}{|\mathcal{A}_{G}|}$ for each edge $e=(v_{i},v_{j})$ represents the fraction of the maximal acyclic subgraphs of $G$ that contain $e$ . These weights can be computed easily from the Moore-Penrose inverse of the Laplacian matrix of $G$ .
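To make the definition of $w^{\text{MAS}}_{G,e}$ concrete, the following sketch computes the weights by brute-force enumeration on a tiny graph. This is only an illustration of the definition, with hypothetical function names; it is exponential in the number of edges and is not the Laplacian-based computation mentioned above. It relies on the fact that every acyclic edge subset of size $|V|$ minus the number of connected components is a maximal acyclic subgraph.

```python
from itertools import combinations


def union_find_stats(n, edges):
    """Return (is_acyclic, number_of_components) for the given edge list."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    acyclic, components = True, n
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            acyclic = False  # this edge closes a cycle
        else:
            parent[root_i] = root_j
            components -= 1
    return acyclic, components


def mas_edge_weights(n, edges):
    """Fraction of maximal acyclic subgraphs that contain each edge."""
    _, components = union_find_stats(n, edges)
    size = n - components  # edge count of every maximal acyclic subgraph
    counts = {e: 0 for e in edges}
    total = 0
    for subset in combinations(edges, size):
        acyclic, _ = union_find_stats(n, subset)
        if acyclic:
            total += 1
            for e in subset:
                counts[e] += 1
    return {e: counts[e] / total for e in edges}


# A triangle (0, 1, 2) with one pendant edge (2, 3).
weights = mas_edge_weights(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

In this example each triangle edge appears in two of the three spanning trees, while the pendant edge is a bridge and appears in all of them.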

Appendix B EXPERIMENT DETAILS

B.1 Dataset Pre-processing Details

Epinions

We follow the same pre-processing scheme as Tang et al. [27]: binarize the rating data and create a bag-of-words binary feature vector for each user. We only retain the items that have been rated at least 100 times. We construct the graph $G=(V,E)$ and only keep an edge $(v_{i},v_{j})$ in $E$ if both $v_{i}\rightarrow v_{j}$ and $v_{j}\rightarrow v_{i}$ appear in the original directed graph. Finally, we only retain users that have at least one edge in $E$ (i.e., at least one bi-directional edge in the original dataset).

Citations

This dataset includes the abstract and the citation information for high-energy physics theory papers on arXiv from 1992 to 2003. We work on all papers from 1998 in this dataset (in total $\approx 2.8\text{K}$ papers). We treat all citation edges as undirected edges and build the graph $G=(V,E)$ . We only retain papers that cite or are cited by at least one of the other papers within this subset (for year 1998) of the dataset. We compute the TF-IDF (with stop words removed) for the abstract of each paper as the raw feature vectors, retaining only the coordinates corresponding to the top 50 words. We then binarize the raw feature vectors, setting an entry to one only if it is non-zero and above the median of all of the non-zero entries, and use these binarized vectors as the feature vectors.
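The binarization step can be sketched as follows. We assume here that the median is taken over the non-zero entries of the whole feature matrix (rather than per paper or per word) and that "above" means strictly greater; both are our reading of the description, and the function name is hypothetical.

```python
from statistics import median


def binarize_above_nonzero_median(rows):
    """Map each entry to 1 if it exceeds the median of all non-zero entries.

    rows -- list of equal-length lists of TF-IDF values (hypothetical format)
    """
    nonzero = [value for row in rows for value in row if value != 0]
    if not nonzero:
        return [[0] * len(row) for row in rows]
    threshold = median(nonzero)  # global median over non-zero entries
    return [[1 if value > threshold else 0 for value in row] for row in rows]


features = binarize_above_nonzero_median([[0.0, 0.2, 0.8], [0.5, 0.0, 0.1]])
```

Zero entries always map to zero, since the threshold is a median of positive TF-IDF values.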

For the larger experiment on this dataset, we apply the same pre-processing steps but work on the whole dataset (instead of the subset for year 1998).

LibraryThing

For the link prediction experiment, we follow the same pre-processing scheme as for the Epinions dataset, except that we only retain the items that have been rated at least 200 times (since this dataset is larger than the Epinions dataset). For the clustering experiment, we follow the same scheme to get a graph $G=(V,E)$ , but we do not split the edges into training and testing sets (since clustering is unsupervised), and we apply a normal vae to generate the labels (as mentioned in the main paper). This normal vae has the same hidden layer size as the one used in testing, but has a smaller latent representation (we use 10) to avoid generating unreasonable labels due to overfitting.

B.2 Experimental Protocol

We perform 3 runs of each method for the Epinions experiments and 5 runs for the other experiments (except for the Citation experiment on the larger dataset in Section 5.2.4, for which we perform a single run). We use fewer runs for Epinions because those experiments are empirically more stable.

For vae, cvae and acvae, we apply a two-layer feed-forward neural inference network for the singleton variational distributions $q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})$ and a two-layer feed-forward neural generative network for the model distributions $p_{\bm{\theta}}(\bm{x}|\bm{z})$ . $q_{\bm{\lambda}}(\bm{z}_{i}|\bm{x}_{i})$ is a diagonal normal distribution with the mean and standard deviation output by the inference network, and $p_{\bm{\theta}}(\bm{x}|\bm{z})$ is a multinomial distribution with the logits output by the generative network. The latent dimensionality $d$ is 100 for the Epinions experiments and the LibraryThing clustering, and 10 for the other two link prediction experiments. The hidden layer dimensionality $h_{1}$ is 300 for the Epinions experiments and 30 for the other experiments.

For GraphSAGE, we use $K=2$ aggregations, the mean aggregator, and $Q=20$ negative samples to optimize the loss function. The hidden layer size and latent dimensionality we apply to GraphSAGE are the same as those of the standard vae.

For cvae and acvae, we set the pairwise marginal prior density function to be $p_{0}(\cdot)=\mathcal{N}\left(\bm{\mu}=\bm{0}_{2d},\Sigma=\begin{pmatrix}I_{d}&\tau\cdot I_{d}\\ \tau\cdot I_{d}&I_{d}\end{pmatrix}\right)$ with $\tau=0.99$ . For cvae ${}_{\textrm{corr}}$ and acvae, we model the pairwise variational approximations $q(\bm{z}_{i},\bm{z}_{j}|\bm{x}_{i},\bm{x}_{j})$ as a multivariate normal distribution that factorizes across the $d$ dimensions as the product of $d$ independent bivariate normal distributions. The correlation coefficients of these bivariate normal distributions are computed by two-layer feed-forward neural networks that take $\bm{x}_{i}$ and $\bm{x}_{j}$ as inputs. These two-layer neural networks have latent dimensionality $h_{2}$ of 1000 for the Epinions experiments and 100 for the other experiments. For cvae and acvae on link prediction experiments, we select the negative sampling parameter $\gamma$ from a set of choices and report the performance with the best average train NCRR. This parameter is selected from $\{0.0001,0.001,0.01,0.1,1,10,100,1000\}$ for the LibraryThing dataset and the Citation dataset, and from $\{0.001,0.1,10,1000\}$ for the Epinions dataset. For the clustering experiments, we select $\gamma=1$ for cvae and acvae since empirically we found that this choice gives reasonably good performance.
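Because the prior covariance above couples only matching dimensions of $\bm{z}_{i}$ and $\bm{z}_{j}$ , the pairwise prior is a product of $d$ bivariate normals with zero mean, unit variances and correlation $\tau$ . The following sketch evaluates its log-density under that factorization; the function name and list-based inputs are illustrative assumptions rather than the authors' code.

```python
import math


def pairwise_prior_logpdf(z_i, z_j, tau=0.99):
    """Log-density of the pairwise prior N(0, [[I, tau*I], [tau*I, I]]).

    z_i, z_j -- equal-length sequences holding the two latent vectors
    tau      -- prior correlation between matching dimensions, |tau| < 1
    """
    one_minus_tau_sq = 1.0 - tau * tau
    # Log normalizer of one standard bivariate normal with correlation tau.
    log_norm = -math.log(2.0 * math.pi) - 0.5 * math.log(one_minus_tau_sq)
    total = 0.0
    for a, b in zip(z_i, z_j):
        quad = (a * a - 2.0 * tau * a * b + b * b) / one_minus_tau_sq
        total += log_norm - 0.5 * quad
    return total


aligned = pairwise_prior_logpdf([1.0, -0.5], [1.0, -0.5])
opposed = pairwise_prior_logpdf([1.0, -0.5], [-1.0, 0.5])
```

With $\tau$ close to one, the prior assigns far higher density to pairs of nearly equal embeddings than to pairs that point in opposite directions.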

For link prediction, for all methods, we evaluate performance every fixed number of iterations (the specific number depends on the model) and update the current best test NCRR value if both the train ELBO and the train NCRR reach better values. We report the final current best test NCRR value as the result. For clustering, we update the current best normalized MI score if the train ELBO reaches a better value and report the final current best normalized MI score.
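The selection rule for link prediction can be sketched as below. We assume that "better" means strictly better than the values recorded at the last update, which is one possible reading of the protocol; the tuple layout and function name are hypothetical.

```python
def select_reported_ncrr(checkpoints):
    """Return the test NCRR reported under the rule described above.

    checkpoints -- iterable of (train_elbo, train_ncrr, test_ncrr) tuples,
                   one per evaluation point, in training order
    """
    best_elbo = float("-inf")
    best_train_ncrr = float("-inf")
    reported = None
    for train_elbo, train_ncrr, test_ncrr in checkpoints:
        # Update only when BOTH training metrics improve.
        if train_elbo > best_elbo and train_ncrr > best_train_ncrr:
            best_elbo, best_train_ncrr = train_elbo, train_ncrr
            reported = test_ncrr
    return reported


history = [(-10.0, 0.10, 0.05), (-9.0, 0.09, 0.20), (-8.0, 0.12, 0.07)]
reported = select_reported_ncrr(history)
```

Note that the rule never looks at the test metric when deciding whether to update, so the second checkpoint is skipped even though its test NCRR is the highest.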

For acvae, we set the step size parameter (in Eq. 12) $\alpha^{t}=0.1$ to be a constant. We train the parameters using alternating updates as in Algorithm 1. We switch between updates on the parameters $\bm{\lambda}$ , $\bm{\theta}$ for an epoch of the edges in $E$ , and a single update on the weights $w^{\text{MAS}}_{G,\pi,e}$ according to Eq. 12. For the random initialization on the tree weights $w^{\text{MAS}}_{G,\pi,e}$ , we just assign random weights to the graph $G=(V,E)$ . Then we use Kruskal’s algorithm to compute the maximal acyclic subgraph $\tilde{G}=(V,\tilde{E})$ according to these random weights, and set $w^{\text{MAS}}_{G,\pi,e}=I[e\in\tilde{E}]$ . It is straightforward to see that this is a valid initialization for the weights $w^{\text{MAS}}_{G,\pi,e}$ ’s since these weights relate to the distribution $\bm{\tilde{\pi}}$ that has all of its mass on the single subgraph $\tilde{G}$ .

For acvae, after running the algorithm for some iterations, we use Kruskal’s algorithm to compute the maximal acyclic subgraph $\hat{G}=(V,\hat{E})$ on the converged edge weights $\hat{w}^{\text{MAS}}_{G,\pi,e}$ to find the learned single graph $\hat{G}^{\prime}$ . This heuristic lets us recover a maximal acyclic subgraph even if we stop early (recall that we evaluate our metrics every fixed number of iterations) or if there is a numerical issue.
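Both uses of Kruskal’s algorithm described above (the random indicator initialization and the extraction of a single graph from the edge weights) can be sketched with one routine. The code below is a minimal pure-Python illustration with hypothetical names; in particular, we assume that the learned graph is obtained by preferring edges with larger weights, which the text does not state explicitly.

```python
import random


def kruskal_forest(n, weighted_edges, maximize=True):
    """Maximal acyclic subgraph chosen greedily by edge weight.

    n              -- number of vertices, labelled 0..n-1
    weighted_edges -- dict {(i, j): weight}
    maximize       -- prefer large weights if True, small weights otherwise
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    ordered = sorted(weighted_edges.items(), key=lambda kv: kv[1],
                     reverse=maximize)
    for (i, j), _ in ordered:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding the edge keeps the graph acyclic
            parent[root_i] = root_j
            chosen.append((i, j))
    return chosen


def random_indicator_init(n, edges, seed=0):
    """Indicator weights I[e in forest] for a forest drawn via random weights."""
    rng = random.Random(seed)
    random_weights = {e: rng.random() for e in edges}
    forest = set(kruskal_forest(n, random_weights))
    return {e: 1.0 if e in forest else 0.0 for e in edges}


edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
init_weights = random_indicator_init(4, edges, seed=0)
learned = kruskal_forest(4, {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.1, (2, 3): 0.5})
```

The initialization puts all of its mass on a single maximal acyclic subgraph, so exactly $|V|$ minus the number of connected components of $G$ edges receive weight one.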

For all methods, we apply stochastic gradient optimizations and use Adam [12] to adjust the learning rates. We set the step size to be $10^{-3}$ . For all methods, we use a batch size $B_{1}=64$ for sampling the vertices. For cvae and acvae, we use a batch size $B_{2}=256$ for sampling the edges and non-edges.

All experiments are done using Python. The training and evaluations are done with TensorFlow [1] and NumPy. The TF-IDF and the t-SNE embeddings [18] in the visualization (Figure 1) are computed using scikit-learn [22]. For faster computations, we call C++ functions to do belief propagation and Kruskal’s algorithm using Cython [2].

Bibliography

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31, 2011.
[3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In 3rd International Conference on Learning Representations, 2015.
[4] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[6] David Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[7] Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.

Table 3. ELBO | NCRR with belief propagation refinement

Epinions	acvae $_{EB+BP}$	acvae $_{SP+BP}$
$\gamma = 0.001$	-31.8 | 0.034	-31.9 | 0.031
$\gamma = 0.1$	-36.4 | 0.031	-38.3 | 0.035
$\gamma = 10$	-61.3 | 0.028	-119 | 0.034
$\gamma = 1000$	-674 | 0.028	-1535 | 0.037
Citation	acvae $_{EB+BP}$	acvae $_{SP+BP}$
$\gamma = 0.001$	-7.48 | 0.126	-7.48 | 0.124
$\gamma = 0.1$	-7.91 | 0.113	-8.59 | 0.121
$\gamma = 10$	-24.4 | 0.112	-49.2 | 0.120
$\gamma = 1000$	-184 | 0.099	-288 | 0.054

Table 4. ELBO | NCRR without belief propagation refinement

Citation	acvae $_{EB}$	acvae $_{SP}$	cvae
$\gamma = 0.001$	-7.47 | 0.012	-7.48 | 0.010	-7.48 | 0.011
$\gamma = 0.1$	-7.88 | 0.031	-8.51 | 0.025	-8.49 | 0.023
$\gamma = 10$	-23.8 | 0.043	-47.9 | 0.037	-42.3 | 0.042
$\gamma = 1000$	-183 | 0.049	-286 | 0.039	-267 | 0.058
