Exact Inference with Latent Variables in an Arbitrary Domain

Chuyang Ke; Jean Honorio

arXiv:1902.03099·cs.SI·June 30, 2020

TL;DR

This paper establishes conditions under which exact inference in latent variable models is possible using semidefinite programming, without prior knowledge of the latent variables or their domain, supported by theoretical analysis and concentration inequalities.

Contribution

It introduces a novel SDP-based method for exact inference in latent models without prior domain knowledge, supported by theoretical guarantees and spectral analysis.

Findings

01 SDP approach achieves exact inference without latent domain knowledge

02 KKT conditions and spectral analysis predict SDP correctness accurately

03 Provides new concentration inequalities related to latent variables

Tables

Table 1: Comparison of various latent models. In dot product models, $g:{\mathbb{R}}\to[0,1]$ is a function that normalizes dot products to the range $[0,1]$. In kernel models, $K:\mathcal{X}\times\mathcal{X}\to{\mathbb{R}}$ is an arbitrary kernel function.

Model	Latent domain $\mathcal{X}$	Homophily function $f(x_{i},x_{j})$
Latent space model [16]	${\mathbb{R}}^{d}$	$\exp(-\|x_{i}-x_{j}\|^{2})$
Exchangeable graph model [11]	$\{0,1\}^{d}$	$\exp(-\|x_{i}-x_{j}\|_{1})$
Dot product graph (DPG) [22]	${\mathbb{R}}^{d}$	$g(x_{i}\cdot x_{j})$
Uniform DPG [24]	$[0,1]^{d}$	$g(x_{i}\cdot x_{j})$
Extremal vertices model [8]	$\{x\mid x\in{\mathbb{R}}^{d},\ x_{i}\geq 0,\ \sum_{i=1}^{d}x_{i}=1\}$	$g(x_{i}\cdot x_{j})$
Kernel latent variable model	${\mathbb{R}}^{d}$, sets, graphs, text, etc.	$g(K(x_{i},x_{j}))$

Equations

$$p:=\mathbb{E}_{X}\left[f(x_{i},x_{j})\mid z_{i}^{\ast}=z_{j}^{\ast}\right],\quad q:=\mathbb{E}_{X}\left[f(x_{i},x_{j})\mid z_{i}^{\ast}\neq z_{j}^{\ast}\right].$$

$$\mathbb{P}_{XY}\left\{\sum_{i}(x_{i}-\mu_{i})\geq\epsilon\right\}\leq\exp\left(-\frac{\epsilon^{2}}{2\sum_{i}\sigma_{i}^{2}}\right).$$

$$\mathbb{P}_{XY}\left\{\sum_{i}(x_{i}-\mu_{i})\geq\epsilon\right\}\leq\dots$$

$$\mathbb{P}_{XY}\left\{\sum_{i}x_{i}\geq\epsilon\right\}\leq\exp\left(-\frac{\epsilon^{2}/2}{\sum_{i}\nu_{i}^{2}+M\epsilon/3}\right).$$

$$\mathbb{P}_{XY}\left\{\lambda_{\max}\left(\sum_{i}X_{i}\right)\geq\epsilon\right\}\leq d\cdot\inf_{\theta>0}\mathrm{e}^{-\theta\epsilon+g(\theta)\cdot\rho}.$$

$$\mathbb{P}_{XY}\left\{\lambda_{\max}\left(\sum_{i}X_{i}\right)\geq\epsilon\right\}\leq d\cdot\exp\left(-\frac{\epsilon^{2}/2}{\sigma^{2}+R\epsilon/3}\right).$$

$$\underset{Z}{\text{maximize}}\ \dots\quad\text{subject to}\ \dots$$

$$\underset{Y}{\text{maximize}}\ \dots\quad\text{subject to}\ \dots,\quad Y\succeq_{\mathcal{S}_{+}^{n}}0,\ Y\succeq_{{\mathbb{R}}_{+}^{n}}0,\ \operatorname{rank}(Y)=k.$$

$$\underset{Y}{\text{maximize}}\ \dots\quad\text{subject to}\ \dots,\quad Y\succeq_{\mathcal{S}_{+}^{n}}0,\ Y\succeq_{{\mathbb{R}}_{+}^{n}}0.$$

$$\underset{v,A,\Gamma}{\text{minimize}}\ \dots\quad\text{subject to}\ \dots,\quad A\text{ is diagonal},\ \Gamma_{S_{i}}=0\ \forall i\in[k],\ \Gamma\succeq_{{\mathbb{R}}_{+}^{n}}0.$$

$$v=\frac{\phi}{2}{1}_{n},\quad A=D-m\phi\mathbf{I},\quad\Gamma_{S_{i}}=0\ \forall i\in[k],\quad\Gamma_{S_{i}S_{j}}=\phi{1}_{m}{1}_{m}^{\top}+PW_{S_{i}S_{j}}P-W_{S_{i}S_{j}}\ \forall i\neq j,$$

$$\Lambda:=D-m\phi\mathbf{I}-W+\phi{1}_{n}{1}_{n}^{\top}-\Gamma\succeq_{\mathcal{S}_{+}^{n}}0,$$

$$\Gamma_{S_{i}S_{j}}\succeq_{{\mathbb{R}}_{+}^{m}}0$$

$$\lambda_{k+1}(\Lambda)>0.$$

$$\Gamma_{S_{i}S_{j}}\succeq_{{\mathbb{R}}_{+}^{m}}0$$

$$\lambda_{k+1}(\Lambda)>0,$$

$$\lambda_{k+1}\left(\mathbb{E}_{WX}\left[\Lambda\right]\right)=m(p-\phi).$$

$$\min_{i}\left(d_{i}-\mathbb{E}_{WX}\left[d_{i}\right]\right)+\frac{m}{2}(p-\phi)$$

$$-\lambda_{\max}\left(W-\mathbb{E}_{WX}\left[W\right]\right)+\frac{m}{2}(p-\phi)$$

$$\frac{(p-q)^{2}}{k^{2}}=\Omega\left(\frac{\log n}{n}\right),$$

$$\frac{p}{k}\log(p/q)=O\left(\frac{1}{n}\right),$$

$$\underset{Y\in\mathcal{Y}}{\text{maximize}}\ \dots$$

$$\mathcal{Y}=\left\{ZZ^{\top}\mid Z\in\{0,1\}^{n\times k},\ Z^{\top}{1}_{n}=\frac{n}{k}{1}_{k},\ Z{1}_{k}={1}_{n}\right\},$$

$$\frac{(p-q)^{2}}{k}=\Omega\left(\frac{\log n}{n}\right),$$

$$\mathbb{P}_{XY}\left\{\sum_{i}(x_{i}-\mu_{i})\geq\epsilon\right\}\leq\mathrm{e}^{-t\epsilon}\cdot\mathbb{E}_{XY}\left[\mathrm{e}^{t\sum_{i}(x_{i}-\mu_{i})}\right]=\mathrm{e}^{-t\epsilon}\cdot\mathbb{E}_{Y}\left[\mathbb{E}_{X}\left[\mathrm{e}^{t\sum_{i}(x_{i}-\mu_{i})}\mid Y\right]\right]$$

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Machine Learning and Algorithms · Markov Chains and Monte Carlo Methods

Full text

Exact Inference with Latent Variables in an Arbitrary Domain

Chuyang Ke
Department of Computer Science
Purdue University

Jean Honorio
Department of Computer Science
Purdue University

Abstract

We analyze the necessary and sufficient conditions for exact inference of a latent model. In latent models, each entity is associated with a latent variable following some probability distribution. The challenging question we try to solve is: can we perform exact inference without observing the latent variables, even without knowing what the domain of the latent variables is? We show that exact inference can be achieved using a semidefinite programming (SDP) approach without knowing either the latent variables or their domain. Our analysis predicts the experimental correctness of SDP with high accuracy, showing the suitability of our focus on the Karush-Kuhn-Tucker (KKT) conditions and the spectrum of a properly defined matrix. As a byproduct of our analysis, we also provide concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices. To the best of our knowledge, these results are novel and could be useful for many other problems.

1 Introduction

Generative network models have become a powerful tool for researchers in various fields, including data mining, social sciences, and biology [11, 9]. With the emergence of social media in the past decade, researchers now have access to millions of interaction records generated on the Internet every day. The generic structure and organization of social media resemble certain network models, for instance the Erdős-Rényi model, the stochastic block model, the latent space model, and the random dot product model [11, 21, 24]. The analogy comes from the fact that in a social network each user can be modeled as an entity, and the interactions between users can be modeled as edges. One common assumption is that nodes belong to different groups. In social networks these groups can be users’ political views, music genre preferences, or whether the user is a cat or dog person. Another common assumption, often referred to as homophily in prior literature, is that entities from the same group are more likely to be connected with each other than those from different groups [11, 15, 18]. The core task of inference, also known as graph partitioning, is to partition the nodes into groups based on the observed interaction information [1, 17, 9].

In this paper, we are particularly interested in the class of latent models beyond graphs, with latent variables in arbitrary domains. In a latent model, every entity belongs to one of $k$ groups. Every entity is associated with a latent variable in some arbitrary latent domain. It is natural to assume that for entities from the same group, their associated latent variables follow the same probability distribution. The latent model is equipped with a function to measure the homophily of two latent variables. Finally, two entities have some affinity score depending on their homophily in the latent domain. In other words, similar entities are more likely to have a higher affinity score. We want to highlight that, for the particular case of binary (i.e., $\{0,1\}$) affinity scores, the latent model is a random graph model. The challenging problem we try to solve is to infer the true group assignments without observing the latent variables or knowing the latent domain.

Over the past decade a large body of literature on network models has appeared, most of it focusing on the class of fully observed models, for example the Erdős-Rényi model and the stochastic block model. These models are called “fully observed” because there are no latent variables, and edges are generated based on the agreement of entity labels. Efficient algorithms have also been proposed for inference in these fully observed models [2, 4, 14, 6]. On the other hand, there is limited research on the class of latent models. Researchers have motivated various network models with latent variables, including the latent space model [16], the exchangeable graph model [11], the dot product model [22], the uniform dot product model [24], and the extremal vertices model [8]. However, to the best of our knowledge, no polynomial-time algorithms with formal guarantees have been proposed or analyzed for exact inference in latent models.

In this paper we address the problem of exact inference in latent models with arbitrary domains. More specifically, our goal is to correctly infer the group assignment of every entity in a latent model without observing the latent variables or the latent domain. We also propose a polynomial-time algorithm for exact inference in latent models using semidefinite programming (SDP). We want to highlight that many techniques used in the analysis of fully observed models do not directly apply to latent models. This is because in latent models, affinities are no longer statistically independent. As a result, latent models are more challenging to analyze than fully observed models, such as the stochastic block model.

While SDP has been widely applied to machine learning problems, our goal in this paper is to study the optimality of SDP for our more challenging model. Our analysis focuses on the Karush-Kuhn-Tucker (KKT) conditions and the spectrum of a carefully constructed primal-dual certificate. For convex problems such as SDPs in which strong duality holds, the KKT conditions are necessary and sufficient for optimality [5]. To the best of our knowledge, we provide the first polynomial-time method with formal guarantees for a problem that is computationally hard in general. Problems involving latent variables are typically computationally hard and nonconvex, for instance learning restricted Boltzmann machines [20] or structural support vector machines with latent variables [26]. It is worth mentioning that theoretical computer science typically assumes arbitrary inputs ("worst-case" computationally hard), whereas we assume inputs are generated by a probabilistic generative model. Our results can be seen as "average-case" polynomial time: we provide exact inference conditions with respect to the model parameters $(p,q)$.

Summary of our contributions. We provide a series of novel results in this paper:

• We propose the definition of the latent model class, which is highly general and subsumes several latent models from prior literature (see Table 1).

• We provide the first polynomial time algorithm for a generally computationally hard problem with formal guarantees. We also analyze the sufficient conditions for exact inference in latent models using a semidefinite programming approach.

• For completeness, we provide an information-theoretic lower bound on exact inference, and we analyze when nonconvex maximum likelihood estimation is correct.

• As a byproduct of our analysis, we provide concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices. To the best of our knowledge, these results are novel and could be useful for many other problems.

2 Preliminaries

In this section, we introduce the notations that will be used in later sections. First we provide the definition of the class of latent models.

Definition 1 (Class of latent models).

A model $\mathcal{M}$ is called a latent model with $n$ entities and $k$ clusters, if $\mathcal{M}$ is equipped with structure $(\mathcal{X},f,\mathcal{P})$ satisfying the following properties:

• $\mathcal{X}$ is an arbitrary latent domain;

• $f:\mathcal{X}\times\mathcal{X}\to[0,1]$ is a homophily function, such that $f({x},{x}^{\prime})=f({x}^{\prime},{x})$;

• $\mathcal{P}=(\mathcal{P}_{1},\dots,\mathcal{P}_{k})$ is the collection of $k$ distributions with support on $\mathcal{X}$.

For simplicity we consider the balanced case in this paper: each cluster has the same size $m:=n/k$. Let $Z^{\ast}\in\{0,1\}^{n\times k}$ be the true cluster assignment matrix, such that $Z_{ij}^{\ast}=1$ if entity $i$ is in cluster $j$, and $Z_{ij}^{\ast}=0$ otherwise. For every entity $i$ in cluster $j$, nature randomly generates a latent variable $x_{i}\in\mathcal{X}$ from distribution $\mathcal{P}_{j}$. A random observed affinity matrix $W\in[0,1]^{n\times n}$ is generated, such that the conditional expectation fulfills $\mathbb{E}\left[W_{ij}|x_{i},x_{j}\right]=f(x_{i},x_{j})$.

Remark. We use $[0,1]$ for $f$ and $W$ for clarity of exposition. Our results can be trivially extended to a general domain $[0,B]$ for $B>0$ using the same techniques in later sections.

Remark. A particular case of the latent model is a random graph model, in which every entry $W_{ij}$ in the affinity matrix is binary (i.e., $W_{ij}\in\{0,1\}$ ) and generated from a Bernoulli distribution with parameter $f(x_{i},x_{j})$ .
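The generative process of Definition 1 can be sketched in a few lines for the random graph case described in the remark above. This is only an illustration under stated assumptions: the Gaussian cluster distributions, the `centers` and `noise` parameters, the zero diagonal of $W$, and the function name `sample_latent_model` are all hypothetical choices, and the homophily function is the latent space model entry of Table 1.

```python
import math
import random

def sample_latent_model(n, k, centers, noise=0.3, seed=0):
    """Sample (labels, X, W) from a toy latent space model with binary affinities."""
    rng = random.Random(seed)
    m = n // k  # balanced case: every cluster has size m = n / k
    labels = [i // m for i in range(n)]
    # Latent variable x_i ~ P_j, assumed here to be Gaussian around the center of cluster j.
    X = [[rng.gauss(c, noise) for c in centers[labels[i]]] for i in range(n)]

    def f(x, y):
        # Homophily function of the latent space model: exp(-||x - y||^2), which lies in (0, 1].
        return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

    # The diagonal is left at zero, which is an assumption of this sketch.
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # W_ij ~ Bernoulli(f(x_i, x_j)), so E[W_ij | x_i, x_j] = f(x_i, x_j).
            W[i][j] = W[j][i] = 1 if rng.random() < f(X[i], X[j]) else 0
    return labels, X, W

labels, X, W = sample_latent_model(n=8, k=2, centers=[(0.0, 0.0), (3.0, 3.0)])
```

Only `W` would be available to an inference algorithm; `labels` and `X` are returned here purely so that a result can be compared against the ground truth.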

Our definition of latent models is highly general. In Table 1, we illustrate several latent models motivated from prior literature that can be subsumed under our model class by properly defining $\mathcal{X}$ and $f$ .

In latent models, affinities are not independent unless one conditions on the latent variables. For example, suppose $i$, $j$ and $k$ are three entities. In fully observed models the affinities $W_{ij}$ and $W_{ik}$ are independent, but this is not true in latent models, as shown graphically in Figure 1. This motivates the following definition of latent conditional independence (LCI).

Definition 2 (Latent Conditional Independence).

We say random variables $Y=(y_{1},\dots,y_{n})$ are latently conditionally independent given $X$ if $y_{1},\dots,y_{n}$ are conditionally independent given the unobserved latent random variable $X$.

2.1 Notations

We denote $[n]:=\{1,2,\dots,n\}$ . We use ${\mathcal{S}_{+}^{n}}$ to denote the $n$ -dimensional positive semidefinite matrix cone, and ${{\mathbb{R}}_{+}^{n}}$ to denote the $n$ -dimensional nonnegative orthant.

For simplicity of analysis, we use $z_{i}\in\{0,1\}^{k}$ to denote the $i$ -th row of $Z$ , and $z^{(i)}\in\{0,1\}^{n}$ to denote the $i$ -th column of $Z$ . We use $X=(x_{1},\dots,x_{n})$ to denote the collection of latent variables.

Regarding eigenvalues of matrices, we use $\lambda_{i}(\cdot)$ to refer to the $i$ -th smallest eigenvalue, and $\lambda_{\max}(\cdot)$ to refer to the maximum eigenvalue.

Regarding probabilities $\mathbb{P}_{{W}}\left\{{\cdot}\right\},\mathbb{P}_{{X}}\left\{{\cdot}\right\}$ , and $\mathbb{P}_{{WX}}\left\{{\cdot}\right\}$ , the subscripts indicate the random variables. Regarding expectations $\mathbb{E}_{{W}}\left[\cdot\right],\mathbb{E}_{{X}}\left[\cdot\right]$ , and $\mathbb{E}_{{WX}}\left[\cdot\right]$ , the subscripts indicate which variables we are averaging over. We use $\mathbb{P}_{{W}}\left\{{\cdot\mid{X}}\right\}$ to denote the conditional probability with respect to ${W}$ given ${X}$ , and $\mathbb{E}_{{W}}\left[\cdot\mid{X}\right]$ to denote the conditional expectation with respect to ${W}$ given ${X}$ .

For matrices, we use $\left\|{\cdot}\right\|$ to denote the spectral norm of a matrix, and ${\left\|{\cdot}\right\|}_{F}$ to denote the Frobenius norm. We use $\operatorname{tr}(\cdot)$ to denote the trace of a matrix, and $\operatorname{rank}(\cdot)$ to denote the rank. We use the notation $\operatorname{diag}(a_{1},\dots,a_{n})$ to denote a diagonal matrix with diagonal entries $a_{1},\dots,a_{n}$. We also use $\mathbf{I}$ to refer to the identity matrix, and ${1}_{n}$ to refer to an all-one vector of length $n$. We use ${\mathbb{S}^{n-1}}$ to denote the unit $(n-1)$-sphere.

Let $S_{i}\in[n]^{m}$ denote the index set of the $i$ -th cluster. For any vector $v\in{\mathbb{R}}^{n}$ , we define $v_{S_{i}}$ to be the subvector of $v$ on indices $S_{i}$ . Similarly for any matrix $X\in{\mathbb{R}}^{n\times n}$ , we define $X_{S_{i}S_{j}}$ to be the submatrix of $X$ on indices $S_{i}\times S_{j}$ . Denote the shorthand notation $X_{S_{i}}:=X_{S_{i}S_{i}}$ .

Define $d_{i}(S_{l}):=\sum_{j\in S_{l}}W_{ij}$ to be the degree of entity $i$ with respect to cluster $l$. Define the shorthand notation $d_{i}$ to be the degree of entity $i$ with respect to its own cluster. Algebraically, we have $d_{i}:=\sum_{j}W_{ij}z_{i}^{\ast\top}z_{j}^{\ast}$. We also denote $D:=\operatorname{diag}(d_{1},\dots,d_{n})$.

In the following sections we will frequently use the expected values related to the observed affinity matrix $W$ . It would be tedious to derive every expression from $(\mathcal{X},f,\mathcal{P})$ . To simplify this, we introduce the following induced model parameters, which will be used throughout the paper.

Definition 3 (Induced model parameters).

In a latent model $\mathcal{M}$ equipped with structure $(\mathcal{X},f,\mathcal{P})$ , one can derive the induced parameters $(p,q)$ defined as

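$$p:=\mathbb{E}_{X}\left[f(x_{i},x_{j})\mid z_{i}^{\ast}=z_{j}^{\ast}\right],\quad q:=\mathbb{E}_{X}\left[f(x_{i},x_{j})\mid z_{i}^{\ast}\neq z_{j}^{\ast}\right].$$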

Note that both $p,q\in[0,1]$ .
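When the expectations in Definition 3 have no convenient closed form, the induced parameters can be approximated by simulation. The sketch below assumes a one-dimensional latent space model with Gaussian cluster distributions and averages uniformly over clusters; the function name `estimate_p_q` and the parameters `centers`, `noise`, and `trials` are hypothetical and not part of the paper.

```python
import math
import random

def estimate_p_q(centers, noise=0.3, trials=20000, seed=0):
    """Monte Carlo estimate of the induced parameters (p, q) for a 1-D toy model."""
    rng = random.Random(seed)

    def f(x, y):
        # Latent space homophily function in one dimension.
        return math.exp(-(x - y) ** 2)

    k = len(centers)
    same, diff = 0.0, 0.0
    for _ in range(trials):
        a = rng.randrange(k)
        b = rng.choice([c for c in range(k) if c != a])
        # Same cluster: two independent draws from P_a.
        same += f(rng.gauss(centers[a], noise), rng.gauss(centers[a], noise))
        # Different clusters: one draw from P_a and one from P_b.
        diff += f(rng.gauss(centers[a], noise), rng.gauss(centers[b], noise))
    return same / trials, diff / trials

p, q = estimate_p_q(centers=[0.0, 3.0])
```

With well-separated centers, as in this call, the estimate of $p$ comes out much larger than that of $q$, which is the homophily assumption expressed in terms of the induced parameters.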

2.2 LCI Concentration Inequalities

Here we provide new concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices.

Lemma 1 (LCI Tail Bound).

Consider a finite sequence of random variables $\{x_{i}\}$ that are LCI given $Y$ . Assume that $\mathbb{E}_{x_{i}Y}\left[x_{i}\right]=\mu_{i}$ , and $\mathbb{E}_{x_{i}}\left[\mathrm{e}^{t(x_{i}-\mu_{i})}\mid Y\right]\leq\mathrm{e}^{t^{2}\sigma_{i}^{2}/2}$ for all $Y$ . Then for all positive $\epsilon$ ,

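$$\mathbb{P}_{XY}\left\{\sum_{i}(x_{i}-\mu_{i})\geq\epsilon\right\}\leq\exp\left(-\frac{\epsilon^{2}}{2\sum_{i}\sigma_{i}^{2}}\right).$$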

Corollary 1 (LCI Hoeffding’s Inequality).

Consider a finite sequence of random variables $\{x_{i}\}$ that are LCI given $Y$. Assume that $x_{i}\in[a_{i},b_{i}]$ almost surely, and $\mathbb{E}_{x_{i}Y}\left[x_{i}\right]=\mu_{i}$. Then for all positive $\epsilon$,

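$$\mathbb{P}_{XY}\left\{\sum_{i}(x_{i}-\mu_{i})\geq\epsilon\right\}\leq\exp\left(-\frac{2\epsilon^{2}}{\sum_{i}(b_{i}-a_{i})^{2}}\right).$$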

Corollary 2 (LCI Bernstein Inequality).

Consider a finite sequence of random variables $\{x_{i}\}$ that are LCI given $Y$ . Assume that $\left|{x_{i}}\right|\leq M$ almost surely, $\mathbb{E}_{x_{i}}\left[x_{i}\mid Y\right]=0$ , and $\operatorname{Var}_{x_{i}}\left[x_{i}\mid Y\right]\leq\nu_{i}^{2}$ for all $Y$ . Then for all positive $\epsilon$ ,

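$$\mathbb{P}_{XY}\left\{\sum_{i}x_{i}\geq\epsilon\right\}\leq\exp\left(-\frac{\epsilon^{2}/2}{\sum_{i}\nu_{i}^{2}+M\epsilon/3}\right).$$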

Lemma 2 (LCI Matrix Tail Bound).

Consider a finite sequence of random symmetric matrices $\{X_{i}\}$ with dimension $d$ that are LCI given $Y$ . Assume there is a function $g:(0,\infty)\to[0,\infty]$ and a sequence $\{A_{i}\}$ of fixed symmetric matrices that satisfy the relations $\mathbb{E}_{X_{i}}\left[\mathrm{e}^{\theta X_{i}}\mid Y\right]\preceq\mathrm{e}^{g(\theta)\cdot A_{i}}$ for $\theta>0$ and for all $Y$ . Define the scale parameter $\rho:=\lambda_{\max}\left(\sum_{i}A_{i}\right)$ . Then for all positive $\epsilon$ ,

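$$\mathbb{P}_{XY}\left\{\lambda_{\max}\left(\sum_{i}X_{i}\right)\geq\epsilon\right\}\leq d\cdot\inf_{\theta>0}\mathrm{e}^{-\theta\epsilon+g(\theta)\cdot\rho}.$$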

Corollary 3 (LCI Matrix Bernstein Inequality).

Consider a finite sequence of random symmetric matrices $\{X_{i}\}$ with dimension $d$ that are LCI given $Y$. Assume that $\mathbb{E}_{X_{i}}\left[X_{i}\mid Y\right]=0$ for all $Y$, and $\lambda_{\max}(X_{i})\leq R$ almost surely. Also assume that the norm of the total variance satisfies $\left\|{\sum_{i}\mathbb{E}_{X_{i}}\left[X_{i}^{2}\mid Y\right]}\right\|\leq\sigma^{2}$ for all $Y$. Then for all positive $\epsilon$,

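$$\mathbb{P}_{XY}\left\{\lambda_{\max}\left(\sum_{i}X_{i}\right)\geq\epsilon\right\}\leq d\cdot\exp\left(-\frac{\epsilon^{2}/2}{\sigma^{2}+R\epsilon/3}\right).$$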

3 Polynomial-Time Regime with Semidefinite Programming

In this section we investigate the sufficient conditions for exactly inferring the group assignment of entities in latent models. An algorithm achieves exact inference if the recovered group assignment matrix $Z\in\{0,1\}^{n\times k}$ is identical to the true assignment matrix $Z^{\ast}$ up to permutation of its columns (without prior knowledge it is impossible to infer the order of groups).

Overview of the proof. Our proof starts by looking at a maximum likelihood estimation (MLE) problem (1), which cannot be solved efficiently (for more details see Section 4). We relax the MLE problem (1) to problem (2) (matrix-form relaxation), then to problem (3) (convex SDP relaxation). We ask under what conditions the relaxation is tight (i.e., returns the ground truth). Our analysis proves that, if the statistical conditions in Theorem 1 are satisfied, then by solving the relaxed convex optimization problem (3) one can recover the true group assignment $Z^{\ast}$ perfectly and efficiently with probability tending to $1$.

Our analysis can be broken down into two parts. In the first part we demonstrate that the exact inference problem in latent models can be relaxed to a semidefinite programming problem. It is well-known that SDP problems can be solved efficiently [5]. Motivated by [3] we employ Karush-Kuhn-Tucker (KKT) conditions in our proof to construct a pair of primal-dual certificates, which shows that the SDP relaxation leads to the optimal solution under certain deterministic spectrum conditions. In the second part we analyze the statistical conditions for exact inference to succeed with high probability.

3.1 SDP Relaxation

We first consider a maximum likelihood estimation approach to recover the true assignment $Z^{\ast}$ . The use of MLE in graph partitioning and community detection literature is customary [4, 2, 6]. The motivation is to find cluster assignments, such that the number of edges within clusters is maximized. Recall that $z_{i}\in\{0,1\}^{k}$ is the $i$ -th row of $Z$ , and $z^{(i)}\in\{0,1\}^{n}$ is the $i$ -th column of $Z$ . Given the observed matrix $W$ , the goal is to find a binary assignment matrix $Z$ , such that $\sum_{i,j}W_{ij}z_{i}^{\top}z_{j}$ is maximized. In the matrix form, MLE can be cast as the following optimization problem:

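$$\underset{Z}{\text{maximize}}\quad\langle W,ZZ^{\top}\rangle\quad\text{subject to}\quad Z\in\{0,1\}^{n\times k},\ Z{1}_{k}={1}_{n},\ Z^{\top}{1}_{n}=m{1}_{k},\tag{1}$$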

where the last two constraints enforce that each entity is in one of the $k$ groups, and each group has size $m=n/k$ .
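To make the objective and the two assignment constraints concrete, the following sketch solves the MLE problem by exhaustive search on a tiny instance. It is an illustration only and not a practical method, since it visits all $k^{n}$ label vectors; the function name `mle_bruteforce` and the hand-made matrix `W` are hypothetical. Representing an assignment as one label per entity enforces the one-group-per-entity constraint automatically, so only the group-size constraint is checked explicitly.

```python
from itertools import product

def mle_bruteforce(W, k):
    """Enumerate balanced assignments and maximize sum_ij W_ij z_i^T z_j."""
    n = len(W)
    m = n // k
    best_score, best_labels = float("-inf"), None
    for labels in product(range(k), repeat=n):  # k^n candidate assignments
        # Group-size constraint: every cluster must contain exactly m entities.
        if any(labels.count(c) != m for c in range(k)):
            continue
        # z_i^T z_j equals 1 exactly when entities i and j share a label.
        score = sum(W[i][j] for i in range(n) for j in range(n)
                    if labels[i] == labels[j])
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score

# Tiny hand-made affinity matrix with two obvious groups {0, 1, 2} and {3, 4, 5}.
W = [[1 if (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
labels, score = mle_bruteforce(W, k=2)
```

The maximizer is only determined up to a relabeling of the groups, which matches the notion of exact inference up to a permutation of the columns of $Z$.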

Problem (1) is nonconvex and hard to solve because of the $\{0,1\}$ constraint. In fact, in the case of two clusters ($k=2$) and $0$-$1$ weights, the MLE formulation reduces to the Minimum Bisection problem, which is known to be NP-hard [10]. To relax it, we introduce the cluster matrix $Y=ZZ^{\top}$. One can see that $Y$ is a rank-$k$ positive semidefinite matrix with entries in $\{0,1\}$. Each entry is $1$ if and only if the corresponding two entities are in the same group ($z_{i}=z_{j}$). Similarly we can define $Y^{\ast}=Z^{\ast}Z^{\ast\top}$ for the true cluster matrix. Then the optimization problem becomes

[TABLE]

Problem (2) is still nonconvex because of the rank constraint. By dropping this constraint, we obtain the main SDP problem:

[TABLE]

Problem (3) is now convex and can be solved efficiently. A natural question is: under what circumstances will the optimal solution to (3) match the solution to the original problem (1)? To answer this question, we take a primal-dual approach. One can easily see that there exists a strictly feasible $Y$ for the constraints in (3), so Slater’s condition guarantees strong duality [5]. We now proceed to derive the dual problem.

Lemma 3 (Lagrangian Dual).

The dual problem of (3) is

[TABLE]

We now construct the primal-dual certificates to close the duality gap between problem (3) and (4).

Lemma 4 (Primal-dual Certificates).

Let $P:=\mathbf{I}-\frac{1}{m}{1}_{m}{1}_{m}^{\top}$ be the projection onto the orthogonal complement of $\operatorname*{span}\left({{1}_{m}}\right)$. By setting the dual variables as follows

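$$v=\frac{\phi}{2}{1}_{n},\quad A=D-m\phi\mathbf{I},\quad\Gamma_{S_{i}}=0\ \ \forall i\in[k],\quad\Gamma_{S_{i}S_{j}}=\phi{1}_{m}{1}_{m}^{\top}+PW_{S_{i}S_{j}}P-W_{S_{i}S_{j}}\ \ \forall i\neq j,$$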

where $\phi\in{\mathbb{R}}$ is a constant to be determined later, the duality gap between (3) and (4) is closed.

It remains to verify feasibility of the dual constraints in (4). It is trivial to verify that $A=D-m\phi\mathbf{I}$ is diagonal, and $\Gamma_{S_{i}}\succeq_{{\mathbb{R}}_{+}^{m}}0$ . We now summarize the dual feasibility conditions.

Lemma 5 (Dual Feasibility).

Let $v,A,\Gamma$ be defined as in Lemma 4. If

$$\Lambda:=D-m\phi\mathbf{I}-W+\phi{1}_{n}{1}_{n}^{\top}-\Gamma\succeq_{\mathcal{S}_{+}^{n}}0,$$

and

$$\Gamma_{S_{i}S_{j}}\succeq_{{\mathbb{R}}_{+}^{m}}0$$

for every $i,j\in[k]$ with $i\neq j$ , then the dual constraints in (4) are satisfied.

We also require the optimal solution to be unique. This means $Y^{\ast}=Z^{\ast}Z^{\ast\top}$ should be the only optimal solution to problem (3). To do so we look into the eigenvalues of $\Lambda$ defined in Lemma 5. It is easy to verify that every $z^{\ast(i)}$ is an eigenvector of $\Lambda$ with $\Lambda z^{\ast(i)}=0$ . To ensure uniqueness, it is sufficient to require that all other $n-k$ eigenvalues of $\Lambda$ are strictly positive. We now provide the following lemma about uniqueness.

Lemma 6 (Uniqueness).

The convex relaxed problem (3) achieves exact inference and outputs the unique optimal solution $Y=Y^{\ast}=Z^{\ast}Z^{\ast\top}$ , if

$$\lambda_{k+1}(\Lambda)>0.$$

Remark. The requirement of uniqueness is reasonable because our latent models are generative: the ground truth $Z^{\ast}$ is unique and generates everything, including the latent variables $X$ and the observed matrix $W$ (see Figure 1). From the perspective of optimization, in some cases there may exist multiple optimal solutions, but we are only interested in the cases in which the preexisting ground truth $Z^{\ast}$ is returned. In fact, the requirement of uniqueness is customary in generative models [2, 4, 6].

Combining the results above, we now give the sufficient conditions for exact inference.

Lemma 7 (Deterministic Sufficient Conditions).

Let $v,A,\Gamma$ be defined as in Lemma 4. If

$$\Gamma_{S_{i}S_{j}}\succeq_{{\mathbb{R}}_{+}^{m}}0\tag{8}$$

for every $i,j\in[k]$ with $i\neq j$ , and

$$\lambda_{k+1}(\Lambda)>0,\tag{9}$$

then $Y^{\ast}=Z^{\ast}Z^{\ast\top}$ is the unique primal optimal solution to (3), and $(v,A,\Gamma)$ is the dual optimal solution to (4).

Note that Lemma 7 gives the deterministic condition for our SDP relaxation to succeed. In the following two sections, we characterize the statistical conditions for (8) and (9) to hold with probability tending to $1$ .
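For a given affinity matrix and a known assignment, the two deterministic conditions can be checked numerically. The sketch below builds $\Gamma$ and $\Lambda$ from the formulas in Lemmas 4 and 5 and tests entrywise nonnegativity of $\Gamma$ and positivity of $\lambda_{k+1}(\Lambda)$. It assumes NumPy, a symmetric `W`, and balanced clusters; the function names `dual_certificate` and `check_conditions` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def dual_certificate(W, labels, phi):
    """Build Gamma and Lambda from Lemmas 4 and 5 for a known assignment."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    m = len(clusters[0])  # balanced case: every cluster has size m
    P = np.eye(m) - np.ones((m, m)) / m  # projection onto the complement of span(1_m)

    Gamma = np.zeros((n, n))  # diagonal blocks Gamma_{S_i} stay zero
    for a, Sa in enumerate(clusters):
        for b, Sb in enumerate(clusters):
            if a != b:
                Wab = W[np.ix_(Sa, Sb)]
                Gamma[np.ix_(Sa, Sb)] = phi * np.ones((m, m)) + P @ Wab @ P - Wab

    same = (labels[:, None] == labels[None, :]).astype(float)
    D = np.diag((W * same).sum(axis=1))  # d_i: degree of i within its own cluster
    Lam = D - m * phi * np.eye(n) - W + phi * np.ones((n, n)) - Gamma
    return Gamma, Lam

def check_conditions(W, labels, phi, tol=1e-9):
    """Return (Gamma entrywise nonnegative?, lambda_{k+1}(Lambda) > 0?)."""
    Gamma, Lam = dual_certificate(W, labels, phi)
    k = len(set(labels))
    eigs = np.linalg.eigvalsh(Lam)  # eigenvalues in ascending order
    return bool(Gamma.min() >= -tol), bool(eigs[k] > tol)

# Idealized input: affinity 1 within clusters and 0 across clusters.
labels = [0, 0, 0, 1, 1, 1]
W = np.array([[1.0 if a == b else 0.0 for b in labels] for a in labels])
gamma_ok, eig_ok = check_conditions(W, labels, phi=0.5)
```

In this idealized input both conditions hold. For noisy `W` the outcome depends on the choice of `phi`, which is the subject of the next two sections.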

3.2 Entrywise Nonnegativity of $\Gamma$

In this section we analyze the statistical conditions for (8) to hold with high probability. From Lemma 4 it follows that $\Gamma_{S_{i}S_{j}}=\phi{1}_{m}{1}_{m}^{\top}+PW_{S_{i}S_{j}}P-W_{S_{i}S_{j}},\forall i\neq j$ . To ensure dual feasibility, it is necessary to ensure that every entry in $\Gamma_{S_{i}S_{j}}$ is nonnegative with high probability by setting a proper $\phi$ .

We now present the condition for (8) to hold with high probability.

Lemma 8 (Choice of $\phi$ ).

If $\phi\geq q+O\left(\sqrt{\frac{k\log n}{n}}\right)$ , then $\Gamma_{S_{i}S_{j}}\succeq_{{\mathbb{R}}_{+}^{m}}0$ holds for every $i,j\in[k]$ with probability at least $1-O\left(\frac{1}{n}\right)$ .

Remark. To ensure nonnegativity, one may think about setting $\phi$ to be some sufficiently large constant (for example, set $\phi=2$ ). This is not going to work, however, as the choice of $\phi$ also plays a critical role in the analysis of (9) in the next section. In order to obtain a tighter final result, it is necessary to pick the smallest possible $\phi$ , without breaking the nonnegativity of $\Gamma$ . For further details see Lemma 10.
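The trade-off in the choice of $\phi$ can be made explicit. Expanding $P=\mathbf{I}-\frac{1}{m}{1}_{m}{1}_{m}^{\top}$ in the expression for $\Gamma_{S_{i}S_{j}}$ cancels the $W_{S_{i}S_{j}}$ term, so that entry $(a,b)$ equals $\phi-r_{a}-c_{b}+\mu$, where $r_{a}$, $c_{b}$ and $\mu$ denote the row mean, column mean and overall mean of the block $W_{S_{i}S_{j}}$. These three symbols are introduced here for illustration, and the expansion is not stated in the text; it follows directly from the definition of $P$. The sketch below, with hypothetical function names, checks the expansion against the explicit matrix product and computes the smallest $\phi$ that keeps one block nonnegative.

```python
def block_means(W_ij):
    """Row means, column means and overall mean of an m x m block."""
    m = len(W_ij)
    row_mean = [sum(row) / m for row in W_ij]
    col_mean = [sum(W_ij[a][b] for a in range(m)) / m for b in range(m)]
    return row_mean, col_mean, sum(row_mean) / m

def gamma_block(W_ij, phi):
    """Entries of phi*11^T + P W P - W, written via row and column means."""
    m = len(W_ij)
    r, c, mu = block_means(W_ij)
    return [[phi - r[a] - c[b] + mu for b in range(m)] for a in range(m)]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def gamma_block_explicit(W_ij, phi):
    """Same block computed literally from the formula in Lemma 4."""
    m = len(W_ij)
    P = [[(1.0 if a == b else 0.0) - 1.0 / m for b in range(m)] for a in range(m)]
    PWP = matmul(matmul(P, W_ij), P)
    return [[phi + PWP[a][b] - W_ij[a][b] for b in range(m)] for a in range(m)]

def smallest_phi(W_ij):
    """Smallest phi for which this block is entrywise nonnegative."""
    r, c, mu = block_means(W_ij)
    return max(r) + max(c) - mu

W_ij = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
phi_min = smallest_phi(W_ij)
```

Since the row, column and overall means of a cross-cluster block each have expectation $q$, the smallest feasible $\phi$ is $q$ plus fluctuation terms, which is consistent with the form of the threshold in Lemma 8.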

3.3 Statistical Conditions of Efficient Inference

In this section we analyze the statistical conditions for (9) to hold with high probability. To do so, we first look at the expectation of $\Lambda$ .

Lemma 9.

It follows that

$$\lambda_{k+1}\left(\mathbb{E}_{WX}\left[\Lambda\right]\right)=m(p-\phi).$$

Remark. The expectation above shows why the choice of $\phi$ matters. With a larger $\phi$, one has fewer degrees of freedom to work with in the concentration inequalities.

The next step is to show that the eigenvalues of $\Lambda$ do not deviate too much from their expectation, so that $\lambda_{k+1}(\Lambda)$ is greater than $0$ with high probability. In fact we have the following lemma.

Lemma 10.

Assume that $\phi<p$. To prove that (9) holds with high probability, it is sufficient to prove that

[TABLE]

and

[TABLE]

hold with high probability.

We now present the statistical conditions for exact inference of latent models using semidefinite programming.

Theorem 1.

In a latent model of $k$ clusters and $n$ entities, and with induced parameters $(p,q)$ as in Definition 3, if

$$\frac{(p-q)^{2}}{k^{2}}=\Omega\left(\frac{\log n}{n}\right),$$

then the SDP-relaxed problem (3) achieves exact inference, i.e., $Y=Y^{\ast}=Z^{\ast}Z^{\ast\top}$ , with probability at least $1-O\left(\frac{1}{n}\right)$ .

4 Additional Analysis

In this section, for completeness, we also provide an information-theoretic lower bound on exact inference (i.e., the impossible regime), and we analyze when (nonconvex) maximum likelihood estimation is correct (i.e., the hard regime).

4.1 Impossible Regime

In this section we analyze the necessary conditions for exact inference of latent models. Our goal is to characterize the information-theoretic lower limit of any algorithm for inferring the true labels $Z^{\ast}$ in our model. More specifically, we would like to infer labels $\hat{Z}$ given the observation of the affinity matrix $W$. Also note that we do not observe the collection of latent variables ${X}$. We present the following information-theoretic lower bound for our model.

Claim 1.

Let $Z^{\ast}$ be the true assignment matrix sampled uniformly at random. In a latent model of $k$ clusters and $n$ entities, and with induced parameters $(p,q)$ as in Definition 3, if

$$\frac{p}{k}\log(p/q)=O\left(\frac{1}{n}\right),$$

then the probability of error $\mathbb{P}\left\{{\hat{Z}\neq Z^{\ast}}\right\}\geq 1/2$ , for any algorithm that a learner could use for picking $\hat{Z}$ .

4.2 Hard Regime with Maximum Likelihood Estimation

In this section we analyze the conditions for exact inference of the true labels in latent models using nonconvex maximum likelihood estimation by solving optimization problem (1). We call this the hard regime because without some convex relaxation, enumerating $Z$ takes $O(k^{n})$ iterations. The problem can be rewritten in the following square matrix form:

$$\underset{Y\in\mathcal{Y}}{\text{maximize}}\quad\langle W,Y\rangle,\tag{13}$$

where

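$$\mathcal{Y}=\left\{ZZ^{\top}\mid Z\in\{0,1\}^{n\times k},\ Z^{\top}{1}_{n}=\frac{n}{k}{1}_{k},\ Z{1}_{k}={1}_{n}\right\}$$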

is the space of all feasible solutions. We now state the conditions for exact inference of latent models using maximum likelihood estimation.

Claim 2.

In a latent model of $k$ clusters and $n$ entities, and with induced parameters $(p,q)$ as in Definition 3, if

[TABLE]

then maximum likelihood estimation (13) achieves exact inference, i.e., $Y=Y^{\ast}=Z^{\ast}Z^{\ast\top}$ , with probability at least $1-O\left(\frac{1}{n}\right)$ .
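To make the $O(k^{n})$ cost of the hard regime concrete, the following is a minimal brute-force sketch on a toy instance. It assumes that the objective of (13) is $\langle W,Y\rangle$ with $Y=ZZ^{\top}$ and balanced clusters of size $m=n/k$, which is consistent with the comparison of $\langle W,Y^{\ast}\rangle$ and $\langle W,Y\rangle$ in Appendix C.2. The function name and the toy matrix are illustrative and do not come from the paper.

```python
from itertools import product


def brute_force_mle(W, k):
    """Enumerate all k**n labelings, keep balanced ones, maximize <W, ZZ^T>."""
    n = len(W)
    m = n // k  # assumed equal cluster size
    best_score, best_labels = None, None
    for labels in product(range(k), repeat=n):  # k**n candidates
        if any(labels.count(c) != m for c in range(k)):
            continue  # skip unbalanced assignments
        # <W, ZZ^T> sums W_ij over all pairs (i, j) placed in the same cluster
        score = sum(W[i][j] for i in range(n) for j in range(n)
                    if labels[i] == labels[j])
        if best_score is None or score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score


# Toy adjacency matrix: {0, 1, 2} and {3, 4, 5} are dense blocks,
# with one noisy cross-cluster edge between entities 2 and 3.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels, score = brute_force_mle(W, k=2)
print(labels, score)
```

Even at $n=6$ and $k=2$ the loop visits $2^{6}=64$ labelings; the count grows exponentially in $n$, which is what motivates the SDP relaxation.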

5 Experiments

We validate our theoretical findings through experiments. We run synthetic experiments for the latent space model, the exchangeable graph model, and the kernel latent variable model. We also test our algorithm on a real-world dataset in which the assumptions might not necessarily hold. See Appendix D for details.

Appendix A Proof of LCI Concentration Inequalities

In this section we present the proofs of the LCI concentration inequalities used in the main paper.

Proof of Lemma 1.

Starting from the left-hand side, we have

[TABLE]

The second line follows from Markov’s inequality, the third line follows from the law of total expectation, the fourth line follows from the LCI assumption, and the fifth line follows from the assumption $\mathbb{E}_{x_{i}}\left[\mathrm{e}^{t(x_{i}-\mu_{i})}\mid Y\right]\leq\mathrm{e}^{t^{2}\sigma_{i}^{2}/2}$ . This completes the proof. ∎

Proof of Corollary 1.

By Hoeffding’s lemma we have $\mathbb{E}_{x_{i}}\left[\mathrm{e}^{t(x_{i}-\mu_{i})}\middle|Y\right]\leq\mathrm{e}^{t^{2}(b_{i}-a_{i})^{2}/8}$. Setting $\sigma_{i}^{2}=(b_{i}-a_{i})^{2}/4$ in the statement of Lemma 1 leads to the desired result. ∎
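As a sanity check on the constant, the sketch below numerically verifies Hoeffding's lemma for a single Bernoulli variable on $[a,b]=[0,1]$ and confirms that $\sigma^{2}=(b-a)^{2}/4$ turns the sub-Gaussian form $\mathrm{e}^{t^{2}\sigma^{2}/2}$ into $\mathrm{e}^{t^{2}(b-a)^{2}/8}$. The conditioning on $Y$ is dropped here, and the helper names and the grid of test values are illustrative assumptions.

```python
import math


def centered_mgf_bernoulli(prob, t):
    """E[exp(t (x - mu))] for x ~ Bernoulli(prob), where mu = prob."""
    return (1 - prob) * math.exp(-t * prob) + prob * math.exp(t * (1 - prob))


def hoeffding_bound(t, a=0.0, b=1.0):
    """Hoeffding's lemma bound exp(t^2 (b - a)^2 / 8) for x in [a, b]."""
    return math.exp(t ** 2 * (b - a) ** 2 / 8)


def subgaussian_bound(t, sigma_sq):
    """Sub-Gaussian moment generating function bound exp(t^2 sigma^2 / 2)."""
    return math.exp(t ** 2 * sigma_sq / 2)


# Grid over success probabilities 0.1, ..., 0.9 and t in [-5, 5].
grid = [(prob / 10, t / 4) for prob in range(1, 10) for t in range(-20, 21)]
holds = all(
    centered_mgf_bernoulli(pr, t) <= hoeffding_bound(t) + 1e-12
    for pr, t in grid
)
# With sigma^2 = (b - a)^2 / 4 = 1/4 the two bounds coincide.
same_form = all(
    math.isclose(subgaussian_bound(t, 0.25), hoeffding_bound(t))
    for _, t in grid
)
print(holds, same_form)
```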

Proof of Corollary 2.

For any single $x_{i}$ , by Taylor expansion and the assumption of $\left|{x_{i}}\right|\leq M$ and $\operatorname{Var}_{x_{i}}\left[x_{i}\mid Y\right]\leq\nu_{i}^{2}$ , we have

[TABLE]

for any $t>0$. Setting $\sigma_{i}^{2}=\frac{2\nu_{i}^{2}(\mathrm{e}^{tM}-tM-1)}{t^{2}M^{2}}$ in the statement of Lemma 1 leads to the desired result. ∎
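For reference, the standard Taylor-expansion argument behind this step can be sketched as follows. This is a reconstruction under the additional assumption that $x_{i}$ is conditionally centered, i.e., $\mathbb{E}_{x_{i}}\left[x_{i}\mid Y\right]=0$, so that $\mathbb{E}_{x_{i}}\left[x_{i}^{j}\mid Y\right]\leq M^{j-2}\nu_{i}^{2}$ for $j\geq 2$; the paper's own display may differ in form.

```latex
\begin{align*}
\mathbb{E}_{x_i}\left[\mathrm{e}^{t x_i} \mid Y\right]
  &= 1 + t\,\mathbb{E}_{x_i}\left[x_i \mid Y\right]
     + \sum_{j=2}^{\infty} \frac{t^{j}\,\mathbb{E}_{x_i}\left[x_i^{j} \mid Y\right]}{j!} \\
  &\leq 1 + \sum_{j=2}^{\infty} \frac{t^{j} M^{j-2} \nu_i^{2}}{j!}
   = 1 + \frac{\nu_i^{2}}{M^{2}}\left(\mathrm{e}^{tM} - tM - 1\right) \\
  &\leq \exp\left(\frac{\nu_i^{2}\left(\mathrm{e}^{tM} - tM - 1\right)}{M^{2}}\right)
   = \exp\left(\frac{t^{2}\sigma_i^{2}}{2}\right),
  \qquad
  \sigma_i^{2} = \frac{2\nu_i^{2}\left(\mathrm{e}^{tM} - tM - 1\right)}{t^{2}M^{2}} .
\end{align*}
```

The last inequality uses $1+u\leq\mathrm{e}^{u}$, and the final equality is exactly the choice of $\sigma_{i}^{2}$ stated in the proof.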

Before we present the proof of the LCI matrix Bernstein inequality, we first give the proof of Lemma 2, which is motivated by [23].

Proof of Lemma 2.

Starting from the left-hand side, we have

[TABLE]

The second line follows from Markov’s inequality, the third line follows from the spectral mapping theorem, the fourth line follows from the law of total expectation, the fifth line follows from the LCI assumption and the fact that the matrix cumulant generating functions are subadditive, and the sixth line follows from the assumption $\mathbb{E}_{X_{i}}\left[\mathrm{e}^{\theta X_{i}}\mid Y\right]\preceq\mathrm{e}^{g(\theta)\cdot A_{i}}$ . This completes the proof. ∎

We now present the proof of LCI matrix Bernstein inequality.

Proof of Corollary 3.

In this proof we assume $R=1$ for simplicity. The general case follows by scaling the corresponding terms.

For any single $X_{i}$ , by Taylor expansion and the assumption of $\mathbb{E}_{X_{i}}\left[X_{i}\mid Y\right]=0$ , we have

[TABLE]

for any $\theta>0$ . Then by Lemma 2 we have

[TABLE]

Setting $\theta=\log(1+\epsilon/\sigma^{2})$ completes the proof. ∎

Appendix B Proofs for Polynomial-Time Regime with Semidefinite Programming

Proof of Lemma 3.

We define the Lagrangian variables $a_{1},\dots,a_{n}\in{\mathbb{R}},v_{1},v_{2}\in{\mathbb{R}}^{n},\Lambda\succeq_{\mathcal{S}_{+}^{n}}0,\Gamma\succeq_{{\mathbb{R}}_{+}^{n}}0$ for the constraints in (3), respectively. Then the Lagrangian of (3) is

[TABLE]

For simplicity we denote $A=\operatorname{diag}{(}a_{1},\dots,a_{n})$ to be a diagonal matrix. By the KKT stationarity condition and dual feasibility, we have

[TABLE]

Note that in the equation above, positive semidefiniteness requires symmetry. Thus we set $v:=v_{1}=v_{2}$ , and we require $\Gamma$ to be symmetric. Then we obtain the dual objective function $\operatorname{tr}{(}A)+2mv^{\top}{1}_{n}$ .

We now look at the remaining constraints. The KKT complementary slackness condition requires that

[TABLE]

and

[TABLE]

for every $i$ and $j$. We want to highlight that (15) is equivalent to $\Lambda Y=0$, given that both matrices are positive semidefinite. Since $Y^{\ast}=Z^{\ast}Z^{\ast\top}$, this implies that for the optimal solution $Y^{\ast}$, every $z^{\ast(i)}$ is an eigenvector of $\Lambda$ with an eigenvalue of $0$. Furthermore, the second complementary slackness condition implies that $\Gamma_{S_{i}}=0$ for all $i\in[k]$, because $Y^{\ast}_{S_{i}}$ is an all-one submatrix. ∎

Proof of Lemma 4.

Strong duality requires that the optimal primal and dual objective values are equal. In other words, the objective values of problems (3) and (4) should match. Note that the optimal primal solution $Y^{\ast}$ can be decomposed as $Y^{\ast}=Z^{\ast}Z^{\ast\top}$. Thus the primal objective function can be rewritten as

[TABLE]

On the other hand, the dual objective function is equal to

[TABLE]

Recall that $d_{i}=\sum_{j=1}^{n}W_{ij}z_{i}^{\ast\top}z_{j}^{\ast}$ and $D=\operatorname{diag}{(}d_{1},\dots,d_{n})$ . One can see that by setting $a_{i}=d_{i}-m\phi$ , or $A=D-m\phi\mathbf{I}$ , the duality gap is closed. One may notice that the choice of $\phi$ does not change the objective values here. For the sole purpose of strong duality, $\phi$ is an arbitrary constant that will be determined later. ∎

Proof of Lemma 5.

This directly follows from the constraint in (4), by plugging in the construction of $v,A$ and $\Gamma$ . ∎

Proof of Lemma 6.

Again we use $Y^{\ast}=Z^{\ast}Z^{\ast\top}$ to denote the optimal primal solution. Since $\Lambda$ and $Y^{\ast}$ are both positive semidefinite, the KKT complementary slackness condition (15) is equivalent to $\Lambda Y^{\ast}=\Lambda Z^{\ast}Z^{\ast\top}=0$, which implies that every $z^{\ast(i)}$ is an eigenvector of $\Lambda$ with an eigenvalue of $0$. Condition (7) further requires that $\{z^{\ast(i)}\}_{i=1}^{k}$ spans the whole null space of $\Lambda$. As a result, any optimal primal solution $Y$ needs to be a multiple of $Y^{\ast}$. Since $Y_{ii}=1$, the choice of $Y$ is unique. ∎

Proof of Lemma 7.

This directly follows from Lemmas 5 and 6. ∎

Proof of Lemma 8.

Motivated by [3], in the following proof we introduce the notation $\bar{d}(S_{l},S_{r})$ to denote the average degree of connectivity between cluster $l$ and $r$ . In other words, we have $\bar{d}(S_{l},S_{r}):=\frac{1}{m}\sum_{i\in S_{r}}d_{i}(S_{l})=\frac{1}{m}\sum_{j\in S_{l}}d_{j}(S_{r})$ . Note that dual feasibility condition (8) is satisfied, if for every $i\in S_{r},j\in S_{l}$ , we have

[TABLE]

By definition, dividing both sides by $m$ , this is equivalent to

[TABLE]

One may note that each random variable $d_{i}(S_{l})$ is the summation of $m$ LCI random variables given $X$ , with the expectation $\mathbb{E}_{WX}\left[d_{i}(S_{l})\right]=mq$ . Using LCI Hoeffding’s inequality, we obtain

[TABLE]

Taking a union bound for all $S_{l},S_{r}$ and $i\in S_{r}$ gives us

[TABLE]

where the last inequality holds if $t\geq O\left(\sqrt{\frac{n\log n}{k}}\right)$ .
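The order of $t$ can be recovered by a short calculation. The sketch below assumes the two-sided Hoeffding form $2\exp(-2t^{2}/m)$ for a sum of $m$ variables taking values in $[0,1]$, and a union bound over at most $nk\leq n^{2}$ pairs of an entity $i$ and a cluster $S_{l}$ with $i\notin S_{l}$. The constants are illustrative and need not match the paper's.

```latex
\begin{align*}
\mathbb{P}\left\{\exists\, i, l:\ \left|d_{i}(S_{l}) - mq\right| \geq t\right\}
  &\leq 2nk\exp\left(-\frac{2t^{2}}{m}\right)
   \leq 2n^{2}\exp\left(-\frac{2t^{2}}{m}\right) \\
  &\leq \frac{2}{n}
  \quad\text{whenever}\quad
  t \geq \sqrt{\tfrac{3}{2}\, m\log n} = \sqrt{\frac{3n\log n}{2k}} .
\end{align*}
```

With $t^{2}=\frac{3}{2}m\log n$ the exponential equals $n^{-3}$, so the failure probability is $O(n^{-1})$ and $t$ is of order $\sqrt{n\log n/k}$.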

Note that by definition, the average degree $\bar{d}(S_{l},S_{r})$ is always bounded between the minimum and the maximum of $d_{i}(S_{l})$ and $d_{j}(S_{r})$ . Then with probability at least $1-O(n^{-1})$ , it follows that

[TABLE]

Dividing both sides by $m=\frac{n}{k}$ , this is equivalent to

[TABLE]

Thus, by setting $\phi\geq q+O\left(\sqrt{\frac{k\log n}{n}}\right)$ , nonnegativity is satisfied with probability at least $1-O(n^{-1})$ . ∎

Proof of Lemma 9.

Here we look at the expectation of $\Lambda$ . Note that

[TABLE]

Note that for each summand above, we have

[TABLE]

given that $u_{i}$ ’s are orthogonal to ${1}$ . Thus we obtain

[TABLE]

∎

Proof of Lemma 10.

Starting from (9), we have

[TABLE]

Regarding (18), note that $D$ is a diagonal matrix. As a result, $\lambda_{k+1}\left(D-\mathbb{E}_{WX}\left[D\right]\right)\geq\min_{i}(d_{i}-\mathbb{E}_{WX}\left[d_{i}\right])$ .

Regarding (19), it follows that $\lambda_{k+1}\left(-W+\mathbb{E}_{WX}\left[W\right]\right)\geq-\lambda_{\max}\left(W-\mathbb{E}_{WX}\left[W\right]\right)$ .

Regarding (20), we have

[TABLE]

Note that

[TABLE]

given that $u_{i}$ ’s are orthogonal to ${1}$ . Thus $\lambda_{k+1}\left(-\Gamma+\mathbb{E}_{WX}\left[\Gamma\right]\right)=0$ .

Combining the results above, it is sufficient to prove that

[TABLE]

This gives us the result in the statement. ∎

Proof of Theorem 1.

Our proof relies on the use of LCI concentration inequalities. First we show that (11) holds with high probability. Note that, for any fixed latent variable $X$ and any $i\in[n]$ , we have $\mathbb{P}_{WX}\left\{{d_{i}-\mathbb{E}_{WX}\left[d_{i}\middle|X\right]\leq-t}\right\}\leq\exp(-2t^{2}/n)$ by LCI Hoeffding’s inequality. By a union bound, it follows that

[TABLE]

Setting $t=\frac{1}{2}m(p-\phi)$ , we obtain

[TABLE]

where the last inequality holds given that $\left(\frac{p-\phi}{k}\right)^{2}\geq 4\frac{\log n}{n}-2\frac{\log c_{1}}{n}$ .
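The stated condition can be checked by direct substitution. The sketch below assumes that the union bound over $i\in[n]$ yields $n\exp(-2t^{2}/n)$ and that the target failure probability is $c_{1}/n$; it uses $m=n/k$, so that $t=\frac{1}{2}m(p-\phi)=\frac{n(p-\phi)}{2k}$.

```latex
\begin{align*}
n\exp\left(-\frac{2t^{2}}{n}\right)\Bigg|_{t=\frac{1}{2}m(p-\phi)}
  = n\exp\left(-\frac{n(p-\phi)^{2}}{2k^{2}}\right) \leq \frac{c_{1}}{n}
  \quad&\Longleftrightarrow\quad
  \frac{n(p-\phi)^{2}}{2k^{2}} \geq 2\log n - \log c_{1} \\
  \quad&\Longleftrightarrow\quad
  \left(\frac{p-\phi}{k}\right)^{2} \geq 4\frac{\log n}{n} - 2\frac{\log c_{1}}{n} .
\end{align*}
```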

Next we show that (12) holds with high probability, using the LCI matrix Bernstein inequality. In this part we denote $\bar{W}_{ij}:=W_{ij}-\mathbb{E}_{WX}\left[W_{ij}\right]$, and $\delta_{ij}$ to be the matrix with $1$ in entries $(i,j)$ and $(j,i)$, and $0$ everywhere else. Note that $\delta_{ij}^{2}$ is a matrix with $1$ in entries $(i,i)$ and $(j,j)$, and $0$ everywhere else. Furthermore we define the matrix $\Delta_{ij}:=\bar{W}_{ij}\delta_{ij}$. One can note that the $\Delta_{ij}$'s are LCI random matrices given $X$, with the maximum eigenvalue bounded above by $1$. Also note that for any given $X$, we have $\mathbb{E}_{W}\left[\Delta_{ij}\middle|X\right]=0$. By our construction, it follows that $\sum_{i<j}\Delta_{ij}=W-\mathbb{E}_{WX}\left[W\right]$. Thus for any given $X$, it follows that $\left\|{\sum_{i<j}\mathbb{E}_{W}\left[\Delta_{ij}^{2}\middle|X\right]}\right\|=\left\|{\sum_{i<j}\delta_{ij}^{2}\mathbb{E}_{WX}\left[\bar{W}_{ij}^{2}\middle|X\right]}\right\|\leq n-1$. Then applying the LCI matrix Bernstein inequality, we obtain

[TABLE]

Setting $t=\frac{1}{2}m(p-\phi)$ , we obtain

[TABLE]

where the last inequality holds given that $\left(\frac{p-\phi}{k}\right)^{2}\geq 32\frac{\log n}{n}-16\frac{\log c_{2}}{n}$ .
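The variance bound $n-1$ used in this part rests on the identity $\sum_{i<j}\delta_{ij}^{2}=(n-1)\mathbf{I}$: each index appears in exactly $n-1$ pairs, and every weight $\mathbb{E}\left[\bar{W}_{ij}^{2}\mid X\right]$ is at most $1$. A minimal sketch that checks the identity for a small $n$ is given below; the helper names and the choice $n=5$ are illustrative.

```python
def delta(n, i, j):
    """Symmetric n x n matrix with 1 in entries (i, j) and (j, i), 0 elsewhere."""
    d = [[0] * n for _ in range(n)]
    d[i][j] = 1
    d[j][i] = 1
    return d


def matmul(a, b):
    """Plain dense matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[r][s] * b[s][c] for s in range(n)) for c in range(n)]
            for r in range(n)]


n = 5
total = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        # delta_ij squared has 1 at (i, i) and (j, j), 0 elsewhere
        sq = matmul(delta(n, i, j), delta(n, i, j))
        for r in range(n):
            for c in range(n):
                total[r][c] += sq[r][c]

# The accumulated sum should equal (n - 1) times the identity matrix.
print(total)
```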

Combining the results above, the probability of $\lambda_{k+1}(\Lambda)$ being greater than zero is at least $1-(c_{1}+c_{2})n^{-1}$ , as long as $\left(\frac{p-\phi}{k}\right)^{2}\geq 32\frac{\log n}{n}$ . The last remaining task is to take $\phi$ into account. By Lemma 8, setting $\phi=q+c\sqrt{\frac{k\log n}{n}}$ for some constant $c$ gives us

[TABLE]

Simplification leads to

[TABLE]

To further simplify the bound above we consider two cases. If $2c(p-q)\geq 32k\sqrt{\frac{k\log n}{n}}$ , a sufficient condition will be $(p-q)^{2}\geq 16c^{2}k\frac{\log n}{n}$ . On the other hand if $2c(p-q)\leq 32k\sqrt{\frac{k\log n}{n}}$ , a sufficient condition will be $(p-q)^{2}\geq 64k^{2}\frac{\log n}{n}$ . Thus for either case, $\frac{(p-q)^{2}}{k^{2}}\geq c^{\prime}\frac{\log n}{n}$ , for some large constant $c^{\prime}$ , is a sufficient condition. This completes our proof. ∎

Appendix C Proofs for Additional Analysis

C.1 Proof of Claim 1

In the following proof, we use notation $\mathcal{Z}$ to denote the space of feasible solutions. Mathematically, we have the following definition

[TABLE]

and we assume that the groundtruth $Z^{\ast}$ is sampled uniformly at random from $\mathcal{Z}$ .

Proof.

First we characterize the mutual information between the true labels $Z^{\ast}$ and the observed matrix $W$ . Using the pairwise KL-based bound [25], we obtain

[TABLE]

where $D_{\text{KL}}{(}\cdot\mid\cdot)$ denotes the KL-divergence between two probability distributions. Then we can apply Fano’s inequality [7]. For any predicted labels $\hat{Z}$ , we have

[TABLE]

By definition of $\mathcal{Z}$ and counting, it follows that

[TABLE]

Note that $\sqrt{n}(n/\mathrm{e})^{n}\leq n!\leq\mathrm{e}\sqrt{n}(n/\mathrm{e})^{n}$ . It follows that

[TABLE]

which indicates that

[TABLE]

and the last inequality holds under the mild assumption of $n\geq 2k$ .

Finally, by Fano’s inequality, for the probability of error to be at least $1/2$ , it is sufficient to require the lower bound to be greater than $1/2$ . Hence

[TABLE]

and the last inequality holds provided that $\frac{p}{k}\log(p/q)\leq\frac{1}{8n}$ and $nk\geq 8$ . ∎
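The factorial bounds used in this proof are easy to check numerically. The sketch below verifies $\sqrt{n}(n/\mathrm{e})^{n}\leq n!\leq\mathrm{e}\sqrt{n}(n/\mathrm{e})^{n}$ on a range of $n$ in log space, and evaluates the log of a multinomial count $n!/(m!)^{k}$. Treating $|\mathcal{Z}|$ as this count (clusters labeled, each of size $m=n/k$) is an assumption, since the paper's definition of $\mathcal{Z}$ is not reproduced here; the function names are illustrative.

```python
import math


def log_factorial(n):
    """log(n!) via the log-gamma function."""
    return math.lgamma(n + 1)


def stirling_log_bounds(n):
    """Logs of sqrt(n) (n/e)^n and of e sqrt(n) (n/e)^n."""
    lower = 0.5 * math.log(n) + n * (math.log(n) - 1)
    return lower, lower + 1


def log_num_balanced_assignments(n, k):
    """log of n! / (m!)^k with m = n / k (labeled clusters; an assumption)."""
    m = n // k
    return log_factorial(n) - k * log_factorial(m)


bounds_hold = all(
    stirling_log_bounds(n)[0]
    <= log_factorial(n)
    <= stirling_log_bounds(n)[1] + 1e-9
    for n in range(1, 2001)
)
print(bounds_hold, log_num_balanced_assignments(30, 3))
```

The second printed value is the quantity that would enter Fano's inequality as $\log|\mathcal{Z}|$ under the labeled-cluster assumption.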

C.2 Proof of Claim 2

In the following proof we define $d({Y})=\langle{Y}^{\ast},{Y}^{\ast}-{Y}\rangle\geq 0$ . Before we start our proof we first present the following result.

Lemma 11 (Lemma 1.1, [6]).

For each $t\in[2(n/k-1),n^{2}/k]$ , we have

[TABLE]

Our proof consists of two steps. We first show the deterministic condition for problem (13) to succeed, and then derive the statistical condition by bounding ${W}$ from its expectation $\mathbb{E}_{{WX}}\left[{W}\right]$ . We present the following lemma.

Lemma 12.

If the following condition

[TABLE]

holds, then maximum likelihood estimation (13) achieves exact inference.

Proof.

To prove problem (13) returns the optimal solution, it is sufficient to prove that for every ${Y}\neq{Y}^{\ast}$ , $\langle{W},{Y}^{\ast}-{Y}\rangle$ is strictly positive. Note that

[TABLE]

Regarding the last term in (23), note that $\mathbb{E}_{{WX}}\left[{W}\right]=q{1}_{n}{1}_{n}^{\top}+(p-q){Y}^{\ast}$ . Given the fact that ${\left\|{{Y}^{\ast}}\right\|}_{F}={\left\|{{Y}}\right\|}_{F}$ for every ${Y}\in\mathcal{Y}$ , we have

[TABLE]

∎
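The two facts used at the end of this proof combine as follows. This is a reconstruction that assumes every $Y\in\mathcal{Y}$ has entries in $\{0,1\}$, so that $\langle 1_{n}1_{n}^{\top},Y\rangle=\|Y\|_{F}^{2}$; it is consistent with the later choice $t=(p-q)d(Y)$.

```latex
\begin{align*}
\left\langle \mathbb{E}_{WX}[W],\, Y^{\ast}-Y \right\rangle
  &= q\left\langle 1_{n}1_{n}^{\top},\, Y^{\ast}-Y \right\rangle
     + (p-q)\left\langle Y^{\ast},\, Y^{\ast}-Y \right\rangle \\
  &= q\left(\|Y^{\ast}\|_{F}^{2} - \|Y\|_{F}^{2}\right) + (p-q)\,d(Y) \\
  &= (p-q)\,d(Y).
\end{align*}
```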

We now present the proof of Claim 2.

Proof.

To show that (22) holds with high probability, we use the LCI Bernstein inequality. For any fixed collection of latent variables ${X}$ and any $i\neq j$, $({W}_{ij}-\mathbb{E}_{{WX}}\left[{W}_{ij}\right])({Y}^{\ast}_{ij}-{Y}_{ij})$ is a Bernoulli random variable centered at $0$, bounded between $-1$ and $1$, with a variance bounded above by $\frac{1}{4}$. Thus $\langle{W}-\mathbb{E}_{{WX}}\left[{W}\right],{Y}^{\ast}-{Y}\rangle=2\sum_{i<j}({W}_{ij}-\mathbb{E}_{{WX}}\left[{W}_{ij}\right])({Y}^{\ast}_{ij}-{Y}_{ij})$ is the summation of $\frac{1}{2}d({Y})$ LCI random variables given $X$. The LCI Bernstein inequality implies

[TABLE]

for every $t>0$ .

Setting $t=(p-q)d({Y})$ , it follows that

[TABLE]

By a union bound we obtain

[TABLE]

where the third line follows from Lemma 11, and the second to last inequality holds given that $\frac{(p-q)^{2}}{k}\geq 175\frac{\log n}{n}$.

Finally applying Lemma 12, the probability of $\langle{W},{Y}^{\ast}-{Y}\rangle$ being greater than zero is at least $1-n^{-1}$ . This completes our proof. ∎

Appendix D Experiments

In this section, we validate our theoretical findings through synthetic experiments. We compare the theoretical exact inference condition suggested by our SDP analysis with the experimental results of exact inference using CVX [13, 12] to solve the SDP problem. We run synthetic experiments on four models: latent space model with three clusters, latent space model with two clusters, exchangeable graph model with two clusters, and kernel latent variable model with two clusters.

Latent space model with three clusters (Fig. 2). We pick $\mathcal{X}={\mathbb{R}}^{3}$ as the latent domain. We fix the number of entities $n$ to be $30$ . We generate $Z^{\ast}$ by randomly assigning entities to three groups of equal size. We generate the latent variables using Gaussian distributions, such that $P_{1}=\mathcal{N}_{3}((\left\|{\mu}\right\|,0,0),\sigma^{2}\mathbf{I})$ , $P_{2}=\mathcal{N}_{3}((0,\left\|{\mu}\right\|,0),\sigma^{2}\mathbf{I})$ , $P_{3}=\mathcal{N}_{3}((0,0,\left\|{\mu}\right\|),\sigma^{2}\mathbf{I})$ , and $f({x},{x}^{\prime})=\exp(-{\left\|{{x}-{x^{\prime}}}\right\|}^{2})$ . The parameters in our simulations are $\left\|{{\mu}}\right\|$ and $\sigma$ . Each entry ${W}_{ij}$ follows Bernoulli distribution with probability $f({x}_{i},{x}_{j})$ . For each pair of $\left\|{{\mu}}\right\|$ and $\sigma$ , we count: a) how many times (out of $10$ ) the fourth smallest eigenvalue of $\Lambda$ is greater than zero, and b) how many times (out of $5$ ) CVX returns the correct ${Y}={Y}^{\ast}=Z^{\ast}Z^{\ast\top}$ . This allows us to compute an empirical probability of success for the statistical condition $\lambda_{k+1}(\Lambda)>0$ and CVX, respectively. Our experiments show that if the fourth smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.
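The data-generating process described above can be sketched in a few lines. The code below samples the cluster assignment, the Gaussian latent variables, the affinities $f(x_{i},x_{j})$ and the Bernoulli matrix $W$. It does not solve the SDP or compute $\Lambda$. The function names, the random seed, the parameter values, the zero diagonal of $W$, and the use of within-cluster and across-cluster averages of $f$ as rough proxies for the induced parameters $(p,q)$ are all assumptions made for illustration.

```python
import math
import random


def sample_latent_space_model(n, k, mu_norm, sigma, seed=0):
    """Sample labels, affinities F and a symmetric Bernoulli matrix W."""
    rng = random.Random(seed)
    m = n // k
    labels = [c for c in range(k) for _ in range(m)]
    rng.shuffle(labels)  # random equal-size assignment
    # Cluster c has mean mu_norm * e_c in R^k and covariance sigma^2 I.
    X = [[rng.gauss(mu_norm if d == labels[i] else 0.0, sigma)
          for d in range(k)] for i in range(n)]

    def f(x, y):
        return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

    F = [[f(X[i], X[j]) for j in range(n)] for i in range(n)]
    W = [[0] * n for _ in range(n)]  # diagonal left at 0 (assumption)
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = 1 if rng.random() < F[i][j] else 0
    return labels, F, W


def empirical_affinities(labels, F):
    """Average f within clusters and across clusters."""
    within, across = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            (within if labels[i] == labels[j] else across).append(F[i][j])
    return sum(within) / len(within), sum(across) / len(across)


labels, F, W = sample_latent_space_model(n=30, k=3, mu_norm=2.0, sigma=0.1)
p_hat, q_hat = empirical_affinities(labels, F)
print(p_hat, q_hat)
```

Sweeping `mu_norm` and `sigma` over a grid and repeating the draw would reproduce the structure of the experiment, with the eigenvalue test and the CVX solve applied to each sampled $W$.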

Latent space model with two clusters (Fig. 3). We pick $\mathcal{X}={\mathbb{R}}^{2}$ as the latent domain. We fix the number of entities $n$ to be $150$ . Note that in the two cluster case, we can let the group assignment matrix $Z$ become a vector by using the $\{+1,-1\}$ encoding. We generate $Z^{\ast}$ by randomly assigning $n/2$ entities to one group ( $z^{\ast}_{i}=1$ ), and $n/2$ entities to the other group ( $z^{\ast}_{i}=-1$ ). Since we are using the $\{+1,-1\}$ encoding, we only need to check the second smallest eigenvalue $\lambda_{2}(D-W)>0$ as the sufficient condition. We generate the latent variables using Gaussian distributions, such that $P_{1}=\mathcal{N}_{2}({\mu},\sigma^{2}\mathbf{I})$ , $P_{2}=\mathcal{N}_{2}(-{\mu},\sigma^{2}\mathbf{I})$ , where $\mathcal{N}$ denotes the Gaussian distribution. We also set $f({x},{x}^{\prime})=\exp(-{\left\|{{x}-{x^{\prime}}}\right\|}^{2})$ . The parameters in our simulations are $\left\|{{\mu}}\right\|$ and $\sigma$ . Each entry ${W}_{ij}$ follows Bernoulli distribution with probability $f({x}_{i},{x}_{j})$ . For each pair of $\left\|{{\mu}}\right\|$ and $\sigma$ , we count: a) how many times (out of $10$ ) the second smallest eigenvalue of $D-W$ is greater than zero, and b) how many times (out of $5$ ) CVX returns the correct ${Y}={Y}^{\ast}=Z^{\ast}Z^{\ast\top}$ . Our experiments show that if the second smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.

Exchangeable graph model with two clusters (Fig. 4). We pick $\mathcal{X}=\{0,1\}^{32}$ as the latent domain. We fix the number of entities $n$ to be $150$ . We generate $Z$ using the same method as in the latent space model with two clusters. We generate the latent variables as follows: for every $x_{i}\in\{0,1\}^{32}$ , its digits follow Bernoulli distribution with parameter $\alpha$ , if entity $i$ is in the first group; its digits follow Bernoulli distribution with parameter $1-\alpha$ , if entity $i$ is in the second group. We set $f({x},{x}^{\prime})=\exp(-\left\|{{x}-{x^{\prime}}}\right\|_{1}/\beta)$ . The parameters in our simulations are $\alpha$ and $\beta$ . Each entry ${W}_{ij}$ follows Bernoulli distribution with probability $f({x}_{i},{x}_{j})$ . Our experiments show that if the second smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.

Kernel latent variable model with two clusters (Fig. 5). We pick $\mathcal{X}$ to be the power set of $\{1,\dots,32\}$ as the latent domain. We fix the number of entities $n$ to be $150$ . We generate $Z$ using the same method as in the latent space model with two clusters. We generate the latent variables as follows: every $x_{i}$ is a subset of $\{1,\dots,32\}$ . Each element $1$ through $16$ is in set $x_{i}$ with probability $p$ if entity $i$ is in the first group, and with probability $1-p$ if entity $i$ is in the second group. Each element $17$ through $32$ is in set $x_{i}$ with probability $1-p$ if entity $i$ is in the first group, and with probability $p$ if entity $i$ is in the second group. We set the kernel $K({x},{x}^{\prime})=2^{\left|{x\cap x^{\prime}}\right|}$ , and $f({x}_{i},{x}_{j})=(\log_{2}K(x_{i},x_{j}))/32$ . The parameters in our simulations are $p$ and $\alpha$ . Each entry ${W}_{ij}$ follows Beta distribution with parameters $(\alpha,\alpha\frac{1-f({x}_{i},{x}_{j})}{f({x}_{i},{x}_{j})})$ . Our experiments show that if the second smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.
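One property of this construction is worth making explicit: a Beta distribution with parameters $(a,b)$ has mean $a/(a+b)$, so the choice $(\alpha,\alpha\frac{1-f}{f})$ gives $\mathbb{E}[W_{ij}\mid x_{i},x_{j}]=f(x_{i},x_{j})$. The sketch below computes the kernel and the affinity for two example subsets and checks this mean. The helper names, the example sets and the value of $\alpha$ are illustrative assumptions, and the degenerate cases $f=0$ and $f=1$ are not handled.

```python
import math
import random


def kernel(x, y):
    """K(x, x') = 2^{|x ∩ x'|} for subsets x, x' of {1, ..., 32}."""
    return 2 ** len(x & y)


def affinity(x, y, universe_size=32):
    """f(x, x') = log2 K(x, x') / 32, the normalized intersection size."""
    return math.log2(kernel(x, y)) / universe_size


def beta_params(f, alpha):
    """Parameters (alpha, alpha (1 - f) / f); requires 0 < f < 1."""
    return alpha, alpha * (1 - f) / f


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


x = frozenset(range(1, 17))   # {1, ..., 16}
y = frozenset(range(9, 25))   # {9, ..., 24}, shares 8 elements with x
f_xy = affinity(x, y)         # 8 / 32
a, b = beta_params(f_xy, alpha=2.0)
rng = random.Random(0)
w_sample = rng.betavariate(a, b)  # one draw of W_ij
print(f_xy, beta_mean(a, b), w_sample)
```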

D.1 Larger Number of Entities

Here we provide synthetic experiment results for a large number of entities with $n=5000$ in the latent space model with two clusters. We pick $\mathcal{X}={\mathbb{R}}^{2}$ as the latent domain, $\left\|{{\mu}}\right\|$ to be $1$, and the number of trials to be $10$. We compute the second minimum eigenvalue with $\sigma$ being $0.05$ and $0.3$. With $\sigma=0.05$, the number of runs with positive second minimum eigenvalue is $10$ (out of $10$). With $\sigma=0.3$, the number of runs with positive second minimum eigenvalue is $0$ (out of $10$). We also run SDP for both cases. With $\sigma=0.05$ the number of runs where SDP succeeded is $10$ (out of $10$). With $\sigma=0.3$ the number of runs where SDP succeeded is $0$ (out of $10$). Both results (success for $\sigma=0.05$ and failure for $\sigma=0.3$) confirm our finding in Theorem 1.

D.2 Real-world Data

To test the adequacy of SDP in a real-world dataset in which assumptions might not necessarily hold, we use an openly available Stanford large network dataset, email-Eu-core [19]. In our experiments we used CVX [13, 12] as the solver.

The procedure is as follows. We select the two largest clusters from the dataset as the test data. The size of the test data is $n=201$, and the sizes of the two clusters are $109$ and $92$, respectively. The adjacency matrix is shown in Figure 6. Note that in the diagonal blocks of the adjacency matrix, the distribution of edges is not uniform and seems to depend highly on the entities. That is, some rows are denser than others, indicating that some entities might be closer (in a latent space) to other entities. We run SDP with the adjacency matrix and obtain the solution ${{Y}}$. We then set ${Y}=\text{sign}({{Y}})$ as the output of the algorithm. Comparing our test result with the ground truth, our algorithm achieved an accuracy of 95.52%.
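The text does not spell out how labels are read off $\text{sign}(Y)$ or how accuracy is scored. One plausible reading, sketched below under that assumption, takes the signs of a single row of $Y$ as $\{+1,-1\}$ labels and scores them up to a global sign flip, since $zz^{\top}=(-z)(-z)^{\top}$. The function names and the toy matrix are hypothetical.

```python
def labels_from_sign_matrix(Y, ref=0):
    """Read +1/-1 labels off row `ref` of sign(Y); the global sign is arbitrary."""
    return [1 if v >= 0 else -1 for v in Y[ref]]


def accuracy_up_to_sign(pred, truth):
    """Fraction of matching labels, maximized over the global sign flip."""
    agree = sum(1 for a, b in zip(pred, truth) if a == b)
    n = len(truth)
    return max(agree, n - agree) / n


truth = [1, 1, 1, -1, -1, -1]
# A hypothetical SDP output close to z z^T, with entity 2 misplaced.
z_hat = [-1, -1, 1, 1, 1, 1]
Y = [[0.9 * a * b for b in z_hat] for a in z_hat]
pred = labels_from_sign_matrix(Y)
acc = accuracy_up_to_sign(pred, truth)
print(pred, acc)
```

Under this scoring, an accuracy of 95.52% with $n=201$ would correspond to $192$ correctly labeled entities, since $192/201\approx 0.9552$.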

For comparison, we ran the same real-world experiment using the Kernighan-Lin algorithm with random initialization for $100$ iterations. The average accuracy was 52.91%, with a standard error of 0.21%.

Bibliography


[1] Emmanuel Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.
[2] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
[3] Arash A Amini, Elizaveta Levina, et al. On semidefinite relaxations for the block model. The Annals of Statistics, 46(1):149–179, 2018.
[4] Afonso S Bandeira. Random Laplacian matrices and convex relaxations. Foundations of Computational Mathematics, 18(2):345–379, 2018.
[5] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[6] Yudong Chen and Jiaming Xu. Statistical-computational phase transitions in planted models: The high-dimensional setting. In International Conference on Machine Learning, pages 244–252, 2014.
[7] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
[8] Jean-Jacques Daudin, Laurent Pierre, and Corinne Vacher. Model for heterogeneous random networks using continuous latent variables and an application to a tree–fungus network. Biometrics, 66(4):1043–1051, 2010.
