Exact Inference with Latent Variables in an Arbitrary Domain
Chuyang Ke, Jean Honorio

TL;DR
This paper establishes conditions under which exact inference in latent variable models is possible using semidefinite programming, without prior knowledge of the latent variables or their domain, supported by theoretical analysis and concentration inequalities.
Contribution
It introduces a novel SDP-based method for exact inference in latent models without prior domain knowledge, supported by theoretical guarantees and spectral analysis.
Findings
SDP approach achieves exact inference without latent domain knowledge
KKT conditions and spectral analysis predict SDP correctness accurately
Provides new concentration inequalities related to latent variables
Exact Inference with Latent Variables in an Arbitrary Domain
**Chuyang Ke**
Department of Computer Science
Purdue University
**Jean Honorio**
Department of Computer Science
Purdue University
Abstract
We analyze the necessary and sufficient conditions for exact inference of a latent model. In latent models, each entity is associated with a latent variable following some probability distribution. The challenging question we try to solve is: can we perform exact inference without observing the latent variables, even without knowing what the domain of the latent variables is? We show that exact inference can be achieved using a semidefinite programming (SDP) approach without knowing either the latent variables or their domain. Our analysis predicts the experimental correctness of SDP with high accuracy, showing the suitability of our focus on the Karush-Kuhn-Tucker (KKT) conditions and the spectrum of a properly defined matrix. As a byproduct of our analysis, we also provide concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices. To the best of our knowledge, these results are novel and could be useful for many other problems.
1 Introduction
Generative network models have become a powerful tool for researchers in various fields, including data mining, social sciences, and biology [11, 9]. With the emergence of social media in the past decade, researchers are now exposed to millions of records of interaction generated on the Internet every day. One can note that the generic structure and organization of social media resemble certain network models, for instance, the Erdős-Rényi model, the stochastic block model, the latent space model, and the random dot product model [11, 21, 24]. The analogy comes from the fact that, in a social network, each user can be modeled as an entity, and the interactions between users can be modeled as edges. One common assumption is that nodes belong to different groups. In social networks these groups can reflect users’ political views, music genre preferences, or whether the user is a cat or dog person. Another common assumption, often referred to as homophily in prior literature, suggests that entities from the same group are more likely to be connected with each other than those from different groups [11, 15, 18]. The core task of inference, also known as graph partitioning, is to partition the nodes into groups based on the observed interaction information [1, 17, 9].
In this paper, we are particularly interested in the class of latent models beyond graphs, with latent variables in arbitrary domains. In a latent model, every entity belongs to one of groups. Every entity is associated with a latent variable in some arbitrary latent domain. It is natural to assume that for entities from the same group, their associated latent variables follow the same probability distribution. The latent model is equipped with a function to measure the homophily of two latent variables. Finally, two entities have some affinity score depending on their homophily in the latent domain. In other words, similar entities are more likely to have a higher affinity score. We want to highlight that, for the particular case of binary (i.e., ) affinity scores, the latent model is a random graph model. The challenging problem we try to solve is to infer the true group assignments without observing the latent variables or knowing the latent domain.
In the past decade, a large body of literature on network models has appeared, most of it focusing on the class of fully observed models, for example, the Erdős-Rényi model and the stochastic block model. These models are called “fully observed” because there are no latent variables, and edges are generated based on the agreement of entity labels. Some efficient algorithms have also been proposed for inference in these fully observed models [2, 4, 14, 6]. On the other hand, there is limited research on the class of latent models. Researchers have motivated various network models with latent variables, including the latent space model [16], the exchangeable graph model [11], the dot product model [22], the uniform dot product model [24], and the extremal vertices model [8]. However, to the best of our knowledge, no efficient polynomial-time algorithms with formal guarantees have been proposed or analyzed for exact inference in latent models.
In this paper we address the problem of exact inference in latent models with arbitrary domains. More specifically, our goal is to correctly infer the group assignment of every entity in a latent model without observing the latent variables or the latent domain. We also propose a polynomial-time algorithm for exact inference in latent models using semidefinite programming (SDP). We want to highlight that many techniques used in the analysis of fully observed models do not directly apply to latent models. This is because in latent models, affinities are no longer statistically independent. As a result, latent models are more challenging to analyze than fully observed models, such as the stochastic block model.
While SDP has been widely used for different machine learning problems, our goal in this paper is to study the optimality of SDP for our more challenging model. Our analysis focuses on Karush-Kuhn-Tucker (KKT) conditions and the spectrum of a carefully constructed primal-dual certificate. For convex problems including SDPs, the KKT conditions are sufficient and necessary for strong duality and optimality [5]. To the best of our knowledge, we are providing the first polynomial time method for a generally computationally hard problem with formal guarantees. In general, problems involving latent variables are computationally hard and nonconvex, for instance, learning restricted Boltzmann machines [20] or structural Support Vector Machines with latent variables [26]. It is worth mentioning that theoretical computer science typically assumes arbitrary inputs ("worst-case" computationally hard), whereas we assume inputs are generated by a probabilistic generative model. Our results could be seen as "average-case" polynomial time: we provide exact inference conditions with respect to the model parameters .
Summary of our contributions. We provide a series of novel results in this paper:
- We propose the definition of the latent model class, which is highly general and subsumes several latent models from prior literature (see Table 1).
- We provide the first polynomial time algorithm for a generally computationally hard problem with formal guarantees. We also analyze the sufficient conditions for exact inference in latent models using a semidefinite programming approach.
- For completeness, we provide an information-theoretic lower bound on exact inference, and we analyze when nonconvex maximum likelihood estimation is correct.
- As a byproduct of our analysis, we provide concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices. To the best of our knowledge, these results are novel and could be useful for many other problems.
2 Preliminaries
In this section, we introduce the notations that will be used in later sections. First we provide the definition of the class of latent models.
**Definition 1** (Class of latent models).
A model is called a latent model with entities and clusters, if is equipped with structure satisfying the following properties:
- is an arbitrary latent domain;
- is a homophily function, such that ;
- is the collection of distributions with support on .
For simplicity we consider the balanced case in this paper: each cluster has the same size . Let be the true cluster assignment matrix, such that if entity is in cluster , and otherwise. For every entity in cluster , nature randomly generates a latent vector from distribution . A random observed affinity matrix is generated, such that the conditional expectation fulfills .
Remark. We use for and for clarity of exposition. Our results can be trivially extended to a general domain for using the same techniques in later sections.
Remark. A particular case of the latent model is a random graph model, in which every entry in the affinity matrix is binary (i.e., ) and generated from a Bernoulli distribution with parameter .
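To make the generative process concrete, the following minimal sketch samples a random graph instance of a latent model in the balanced case. It is an illustration only, not the setup used in the paper: the function name `sample_latent_model`, the Gaussian cluster distributions, the `spread` parameter, and the homophily function exp(-‖x - x'‖²) are all assumptions chosen for the example.

```python
import math
import random

def sample_latent_model(n, k, dim=2, spread=3.0, seed=0):
    """Sample group labels, latent vectors, and a binary affinity matrix."""
    rng = random.Random(seed)
    m = n // k  # balanced case: every cluster has the same size (n divisible by k)
    labels = [i // m for i in range(n)]
    rng.shuffle(labels)
    # Assumed cluster distributions: unit-variance Gaussians with distinct means.
    centers = [[spread * l if d == 0 else 0.0 for d in range(dim)] for l in range(k)]
    latent = [[rng.gauss(centers[labels[i]][d], 1.0) for d in range(dim)]
              for i in range(n)]

    def homophily(x, y):
        # Assumed homophily function: symmetric, with values in (0, 1].
        return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

    # Each off-diagonal affinity is Bernoulli with parameter homophily(x_i, x_j);
    # the diagonal is left at zero in this sketch.
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = homophily(latent[i], latent[j])
            W[i][j] = W[j][i] = 1 if rng.random() < p else 0
    return labels, latent, W

labels, latent, W = sample_latent_model(n=12, k=3)
```

An inference algorithm would only be given `W`; the `labels` and `latent` values are returned here solely so that a recovered assignment can be compared against the truth.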
Our definition of latent models is highly general. In Table 1, we illustrate several latent models motivated from prior literature that can be subsumed under our model class by properly defining and .
In latent models, affinities are not independent unless we condition on the latent variables. For example, suppose and are three entities. In fully observed models the affinities and are independent, but this is not true in latent models, as shown graphically in Figure 1. This motivates our following definition of latent conditional independence (LCI).
**Definition 2** (Latent Conditional Independence).
We say random variables are latently conditionally independent given , if are conditionally independent given the unobserved latent random variable .
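The dependence that LCI is designed to handle can be checked exactly on a toy example. The sketch below is an illustration under stated assumptions, not the paper's model: three entities share one cluster, each latent variable takes the values 0.2 or 0.8 with equal probability, and the homophily function is a scalar dot product. Two affinities that share an entity are independent once the latent variables are fixed, yet their marginal covariance is strictly positive.

```python
from itertools import product

values = [0.2, 0.8]  # assumed latent domain; each value has probability 1/2

def f(x, y):
    # Assumed dot-product homophily, with values in [0, 1].
    return x * y

def expectation(g):
    """Exact expectation of g(x_i, x_j, x_k) over independent uniform latents."""
    triples = list(product(values, repeat=3))
    return sum(g(xi, xj, xk) for xi, xj, xk in triples) / len(triples)

# Given the latents, the two affinities are independent Bernoulli variables,
# so E[W_ij W_ik | x] = f(x_i, x_j) * f(x_i, x_k).
e_ij = expectation(lambda xi, xj, xk: f(xi, xj))
e_ik = expectation(lambda xi, xj, xk: f(xi, xk))
e_both = expectation(lambda xi, xj, xk: f(xi, xj) * f(xi, xk))

# Marginal covariance after averaging out the latent variables.
cov = e_both - e_ij * e_ik
print(cov)
```

In this toy setting the covariance works out to 0.0225, which is the variance of the conditional mean 0.5·x_i induced by the shared latent variable.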
2.1 Notations
We denote . We use to denote the -dimensional positive semidefinite matrix cone, and to denote the -dimensional nonnegative orthant.
For simplicity of analysis, we use to denote the -th row of , and to denote the -th column of . We use to denote the collection of latent variables.
Regarding eigenvalues of matrices, we use to refer to the -th smallest eigenvalue, and to refer to the maximum eigenvalue.
Regarding probabilities , and , the subscripts indicate the random variables. Regarding expectations , and , the subscripts indicate which variables we are averaging over. We use to denote the conditional probability with respect to given , and to denote the conditional expectation with respect to given .
For matrices, we use to denote the spectral norm of a matrix, and to denote the Frobenius norm. We use to denote the trace of a matrix, and to denote the rank. We use the notation to denote a diagonal matrix with diagonal entries . We also use to refer to the identity matrix, and to refer to an all-one vector of length . We use to denote the unit -sphere.
Let denote the index set of the -th cluster. For any vector , we define to be the subvector of on indices . Similarly for any matrix , we define to be the submatrix of on indices . Denote the shorthand notation .
Define to be the degree of entity with respect to cluster . Define shorthand notation to be the degree of entity with respect to its own cluster. Algebraically, we have . We also denote .
In the following sections we will frequently use the expected values related to the observed affinity matrix . It would be tedious to derive every expression from . To simplify this, we introduce the following induced model parameters, which will be used throughout the paper.
**Definition 3** (Induced model parameters).
In a latent model equipped with structure , one can derive the induced parameters defined as
[TABLE]
Note that both .
2.2 LCI Concentration Inequalities
Here we provide new concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices.
**Lemma 1** (LCI Tail Bound).
Consider a finite sequence of random variables that are LCI given . Assume that , and for all . Then for all positive ,
[TABLE]
**Corollary 1** (LCI Hoeffding’s Inequality).
Consider a finite sequence of random variables that are LCI given . Assume that and almost surely, and . Then for all positive ,
[TABLE]
**Corollary 2** (LCI Bernstein Inequality).
Consider a finite sequence of random variables that are LCI given . Assume that almost surely, , and for all . Then for all positive ,
[TABLE]
**Lemma 2** (LCI Matrix Tail Bound).
Consider a finite sequence of random symmetric matrices with dimension that are LCI given . Assume there is a function and a sequence of fixed symmetric matrices that satisfy the relations for and for all . Define the scale parameter . Then for all positive ,
[TABLE]
**Corollary 3** (LCI Matrix Bernstein Inequality).
Consider a finite sequence of random symmetric matrices with dimension that are LCI given . Assume that for all , and almost surely. Also assume that the norm of the total variance for all . Then for all positive ,
[TABLE]
3 Polynomial-Time Regime with Semidefinite Programming
In this section we investigate the sufficient conditions for exactly inferring the group assignment of entities in latent models. An algorithm achieves exact inference if the recovered group assignment matrix is identical to the true assignment matrix up to permutation of its columns (without prior knowledge it is impossible to infer the order of groups).
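Because group order cannot be identified, any check of exact inference has to compare assignments up to a relabeling of the groups. A minimal sketch of such a check is given below; the helper name `same_partition` and the use of label lists in place of assignment matrices are assumptions made for illustration.

```python
def same_partition(labels_a, labels_b):
    """Return True if two label lists induce the same grouping of entities."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # The label correspondence must be consistent in both directions,
        # i.e. a bijection between the groups of the two assignments.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True

print(same_partition([0, 0, 1, 1], [1, 1, 0, 0]))  # groups swapped
print(same_partition([0, 0, 1, 1], [0, 1, 0, 1]))  # different grouping
```

The first call compares two assignments that differ only by swapping the group names, so it counts as exact inference; the second does not.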
Overview of the proof. Our proof starts by looking at a maximum likelihood estimation (MLE) problem (1), which cannot be solved efficiently (for more details see Section 4). We relax the MLE problem (1) to problem (2) (matrix-form relaxation), then to problem (3) (convex SDP relaxation). We ask under what conditions the relaxation holds (i.e., returns the groundtruth). Our analysis proves that, if the statistical conditions in Theorem 1 are satisfied, by solving the relaxed convex optimization problem (3), one can recover the true group assignment perfectly and efficiently with probability tending to 1.
Our analysis can be broken down into two parts. In the first part we demonstrate that the exact inference problem in latent models can be relaxed to a semidefinite programming problem. It is well-known that SDP problems can be solved efficiently [5]. Motivated by [3] we employ Karush-Kuhn-Tucker (KKT) conditions in our proof to construct a pair of primal-dual certificates, which shows that the SDP relaxation leads to the optimal solution under certain deterministic spectrum conditions. In the second part we analyze the statistical conditions for exact inference to succeed with high probability.
3.1 SDP Relaxation
We first consider a maximum likelihood estimation approach to recover the true assignment . The use of MLE in graph partitioning and community detection literature is customary [4, 2, 6]. The motivation is to find cluster assignments, such that the number of edges within clusters is maximized. Recall that is the -th row of , and is the -th column of . Given the observed matrix , the goal is to find a binary assignment matrix , such that is maximized. In the matrix form, MLE can be cast as the following optimization problem:
[TABLE]
where the last two constraints enforce that each entity is in one of the groups, and each group has size .
Problem (1) is nonconvex and hard to solve because of the constraint. In fact, in the case of two clusters () and 0-1 weights, the MLE formulation reduces to the Minimum Bisection problem, which is known to be NP-hard [10]. To relax it, we introduce the cluster matrix . One can see that is a rank-, positive semidefinite matrix. Each entry is if and only if the corresponding two entities are in the same group (). Similarly we can define for the true cluster matrix. Then the optimization problem becomes
[TABLE]
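The structure of the cluster matrix can be verified numerically on a small example. The sketch below is illustrative only: it builds the cluster matrix from a label list (the names `cluster_matrix` and `quad_form` are assumptions) and checks three properties that hold in the balanced case: the diagonal is all ones, every row sums to the cluster size, and the quadratic form equals a sum of squared within-group sums, which is why the matrix is positive semidefinite.

```python
def cluster_matrix(labels):
    """Entry (i, j) is 1 if entities i and j share a group, and 0 otherwise."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]

def quad_form(Y, x):
    """Compute x^T Y x."""
    n = len(x)
    return sum(x[i] * Y[i][j] * x[j] for i in range(n) for j in range(n))

labels = [0, 1, 0, 2, 1, 2]   # three groups of size two
Y = cluster_matrix(labels)
m = len(labels) // 3

diag_ok = all(Y[i][i] == 1 for i in range(len(labels)))
rows_ok = all(sum(row) == m for row in Y)

# x^T Y x equals the sum over groups of (sum of x within the group)^2,
# which is nonnegative for every x.
x = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0]
group_sums = {}
for label, value in zip(labels, x):
    group_sums[label] = group_sums.get(label, 0.0) + value
sum_of_squares = sum(s * s for s in group_sums.values())
print(diag_ok, rows_ok, quad_form(Y, x), sum_of_squares)
```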
Problem (2) is still nonconvex because of the rank constraint. By dropping this constraint, we obtain the main SDP problem:
[TABLE]
Problem (3) is now convex and can be solved efficiently. A natural question is: under what circumstances will the optimal solution to (3) match the solution to the original problem (1)? To answer the question, we take a primal-dual approach. One can easily see there exists a strictly feasible for the constraints in (3). Thus Slater’s condition guarantees strong duality [5]. We now proceed to derive the dual problem.
**Lemma 3** (Lagrangian Dual).
The dual problem of (3) is
[TABLE]
We now construct the primal-dual certificates to close the duality gap between problem (3) and (4).
**Lemma 4** (Primal-dual Certificates).
Let be the projection onto the orthogonal complement of . By setting the dual variables as follows
[TABLE]
where is a constant to be determined later, the duality gap between (3) and (4) is closed.
It remains to verify feasibility of the dual constraints in (4). It is trivial to verify that is diagonal, and . We now summarize the dual feasibility conditions.
**Lemma 5** (Dual Feasibility).
Let be defined as in Lemma 4. If
[TABLE]
and
[TABLE]
for every with , then the dual constraints in (4) are satisfied.
We also require the optimal solution to be unique. This means should be the only optimal solution to problem (3). To do so we look into the eigenvalues of defined in Lemma 5. It is easy to verify that every is an eigenvector of with . To ensure uniqueness, it is sufficient to require that all other eigenvalues of are strictly positive. We now provide the following lemma about uniqueness.
**Lemma 6** (Uniqueness).
The convex relaxed problem (3) achieves exact inference and outputs the unique optimal solution , if
[TABLE]
Remark. Why is the requirement of uniqueness reasonable? Because our latent models are generative, i.e., the ground truth is unique and generates everything, including the latent variables and the observed matrix (see Figure 1). From the perspective of optimization, in some cases there may exist multiple optimal solutions, but we are only interested in the cases in which the preexisting groundtruth is returned. In fact, the requirement of uniqueness is customary in generative models [2, 4, 6].
Combining the results above, we now give the sufficient conditions for exact inference.
**Lemma 7** (Deterministic Sufficient Conditions).
Let be defined as in Lemma 4. If
[TABLE]
for every with , and
[TABLE]
then is the unique primal optimal solution to (3), and is the dual optimal solution to (4).
Note that Lemma 7 gives the deterministic condition for our SDP relaxation to succeed. In the following two sections, we characterize the statistical conditions for (8) and (9) to hold with probability tending to 1.
3.2 Entrywise Nonnegativity of
In this section we analyze the statistical conditions for (8) to hold with high probability. From Lemma 4 it follows that . To ensure dual feasibility, it is necessary to ensure that every entry in is nonnegative with high probability by setting a proper .
We now present the condition for (8) to hold with high probability.
**Lemma 8** (Choice of ).
If , then holds for every with probability at least .
Remark. To ensure nonnegativity, one may think about setting to be some sufficiently large constant (for example, set ). This is not going to work, however, as the choice of also plays a critical role in the analysis of (9) in the next section. In order to obtain a tighter final result, it is necessary to pick the smallest possible , without breaking the nonnegativity of . For further details see Lemma 10.
3.3 Statistical Conditions of Efficient Inference
In this section we analyze the statistical conditions for (9) to hold with high probability. To do so, we first look at the expectation of .
**Lemma 9.**
It follows that
[TABLE]
Remark. The expectation above shows why the choice of matters. With a larger , one has less degree of freedom to work with, in terms of the concentration inequalities.
The next step is to show that the eigenvalue of will not deviate too much from its expectation, so that is greater than 0 with high probability. In fact we have the following lemma.
**Lemma 10.**
Assume that . To prove (9) holds with high probability, it is sufficient to prove
[TABLE]
and
[TABLE]
hold with high probability.
We now present the statistical conditions for exact inference of latent models using semidefinite programming.
**Theorem 1.**
In a latent model of clusters and entities, and with induced parameters as in Definition 3, if
[TABLE]
then the SDP-relaxed problem (3) achieves exact inference, i.e., , with probability at least .
4 Additional Analysis
In this section, for completeness, we also provide an information-theoretic lower bound on exact inference (i.e., the impossible regime), and we analyze when (nonconvex) maximum likelihood estimation is correct (i.e., the hard regime).
4.1 Impossible Regime
In this section we analyze the necessary conditions for exact inference of latent models. Our goal is to characterize the information-theoretic lower limit of any algorithm for inferring the true labels in our model. More specifically, we would like to infer labels given the observation of the adjacency matrix . Also note that we do not observe the collection of latent variables . We present the following information-theoretic lower bound for our model.
**Claim 1.**
Let be the true assignment matrix sampled uniformly at random. In a latent model of clusters and entities, and with induced parameters as in Definition 3, if
[TABLE]
then the probability of error , for any algorithm that a learner could use for picking .
4.2 Hard Regime with Maximum Likelihood Estimation
In this section we analyze the conditions for exact inference of the true labels in latent models using nonconvex maximum likelihood estimation by solving optimization problem (1). We call this the hard regime because without some convex relaxation, enumerating takes iterations. The problem can be rewritten in the following square matrix form:
[TABLE]
where
[TABLE]
is the space of all feasible solutions. We now state the conditions for exact inference of latent models using maximum likelihood estimation.
**Claim 2.**
In a latent model of clusters and entities, and with induced parameters as in Definition 3, if
[TABLE]
then maximum likelihood estimation (13) achieves exact inference, i.e., , with probability at least .
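To illustrate why this regime is called hard, the sketch below solves the nonconvex problem by explicit enumeration on a tiny instance. It is a toy illustration under assumptions: the objective is taken to be the total affinity within clusters, as described in Section 3.1, and the helper names are hypothetical. The number of balanced partitions grows very quickly with the number of entities, so this approach is only feasible for very small inputs.

```python
from itertools import combinations

def balanced_partitions(items, k):
    """Yield every split of `items` into k unordered groups of equal size."""
    if not items:
        yield []
        return
    m = len(items) // k
    first, rest = items[0], items[1:]
    # Fix the first item in the current group to avoid counting relabelings twice.
    for others in combinations(rest, m - 1):
        group = (first,) + others
        remaining = [x for x in rest if x not in others]
        for tail in balanced_partitions(remaining, k - 1):
            yield [group] + tail

def within_cluster_affinity(W, partition):
    """Sum of affinities over all pairs of entities that share a group."""
    return sum(W[i][j] for group in partition for i, j in combinations(group, 2))

def brute_force_mle(W, k):
    n = len(W)
    return max(balanced_partitions(list(range(n)), k),
               key=lambda part: within_cluster_affinity(W, part))

# Noise-free planted instance: affinity 1 within groups and 0 across groups.
planted = [0, 1, 2, 0, 1, 2]
W = [[1 if i != j and planted[i] == planted[j] else 0 for j in range(6)]
     for i in range(6)]
best = brute_force_mle(W, k=3)
num_candidates = sum(1 for _ in balanced_partitions(list(range(6)), 3))
print(best, num_candidates)
```

Even for six entities and three groups the search visits 15 candidate partitions; the SDP relaxation of Section 3 avoids this enumeration entirely.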
5 Experiments
We validate our theoretical findings through experiments. We run synthetic experiments for the latent space model, the exchangeable graph model, and the kernel latent variable model. We also test our algorithm on a real-world dataset in which assumptions might not necessarily hold. See Appendix for details.
Appendix A Proof of LCI Concentration Inequalities
In this section we present the proof of LCI concentration inequalities used in the main paper.
Proof of Lemma 1.
Starting from the left-hand side, we have
[TABLE]
The second line follows from Markov’s inequality, the third line follows from the law of total expectation, the fourth line follows from the LCI assumption, and the fifth line follows from the assumption . This completes the proof. ∎
Proof of Corollary 1.
By Hoeffding’s lemma we have . Setting in the statement of Lemma 1 leads to the desired result. ∎
Proof of Corollary 2.
For any single , by Taylor expansion and the assumption of and , we have
[TABLE]
for any . Setting in the statement of Lemma 1 leads to the desired result. ∎
Before we present the proof of the LCI matrix Bernstein inequality, we first introduce the proof of Lemma 2, which is motivated by [23].
Proof of Lemma 2.
Starting from the left-hand side, we have
[TABLE]
The second line follows from Markov’s inequality, the third line follows from the spectral mapping theorem, the fourth line follows from the law of total expectation, the fifth line follows from the LCI assumption and the fact that the matrix cumulant generating functions are subadditive, and the sixth line follows from the assumption . This completes the proof. ∎
We now present the proof of LCI matrix Bernstein inequality.
Proof of Corollary 3.
In this proof we assume for simplicity. The general case follows by scaling the corresponding terms.
For any single , by Taylor expansion and the assumption of , we have
[TABLE]
for any . Then by Lemma 2 we have
[TABLE]
Setting completes the proof. ∎
Appendix B Proofs for Polynomial-Time Regime with Semidefinite Programming
Proof of Lemma 3.
We define the Lagrangian variables for the constraints in (3) respectively. Then the Lagrangian of (3) is
[TABLE]
For simplicity we denote to be a diagonal matrix. By the KKT stationarity condition and dual feasibility, we have
[TABLE]
Note that in the equation above, positive semidefiniteness requires symmetry. Thus we set , and we require to be symmetric. Then we obtain the dual objective function .
We now look at the remaining constraints. The KKT complementary slackness condition requires that
[TABLE]
and
[TABLE]
for every and . We want to highlight that (15) is equivalent to , given that both matrices are positive semidefinite. Since , this implies that for the optimal solution , every is an eigenvector of with an eigenvalue of 0. Furthermore implies that for all , because is an all-one submatrix. ∎
Proof of Lemma 4.
Strong duality requires that the optimal primal and dual objective values are equal. In other words, the objective value of problem (3) and (4) should match. Note that the optimal primal solution can be decomposed as . Thus the primal objective function can be rewritten as
[TABLE]
On the other hand, the dual objective function is equal to
[TABLE]
Recall that and . One can see that by setting , or , the duality gap is closed. One may notice that the choice of does not change the objective values here. For the sole purpose of strong duality, is an arbitrary constant that will be determined later. ∎
Proof of Lemma 5.
This directly follows from the constraint in (4), by plugging in the construction of and . ∎
Proof of Lemma 6.
Again we use to denote the optimal primal solution. Since and are both positive semidefinite, the KKT complementary slackness condition (15) is equivalent to , which implies that every is an eigenvector of with an eigenvalue of 0. Condition (7) further requires that spans the whole null space of . As a result, any optimal primal solution needs to be a multiple of . Since , the choice of is unique. ∎
Proof of Lemma 7.
This directly follows from Lemmas 5 and 6. ∎
Proof of Lemma 8.
Motivated by [3], in the following proof we introduce the notation to denote the average degree of connectivity between cluster and . In other words, we have . Note that dual feasibility condition (8) is satisfied, if for every , we have
[TABLE]
By definition, dividing both sides by , this is equivalent to
[TABLE]
One may note that each random variable is the summation of LCI random variables given , with the expectation . Using LCI Hoeffding’s inequality, we obtain
[TABLE]
Taking a union bound for all and gives us
[TABLE]
where the last inequality holds if .
Note that by definition, the average degree is always bounded between the minimum and the maximum of and . Then with probability at least , it follows that
[TABLE]
Dividing both sides by , this is equivalent to
[TABLE]
Thus, by setting , nonnegativity is satisfied with probability at least . ∎
Proof of Lemma 9.
Here we look at the expectation of . Note that
[TABLE]
Note that for each summand above, we have
[TABLE]
given that ’s are orthogonal to . Thus we obtain
[TABLE]
∎
Proof of Lemma 10.
Starting from (9), we have
[TABLE]
Regarding (18), note that is a diagonal matrix. As a result, .
Regarding (19), it follows that .
Regarding (20), we have
[TABLE]
Note that
[TABLE]
given that ’s are orthogonal to . Thus .
Combining the results above, it is sufficient to prove that
[TABLE]
This gives us the result in the statement. ∎
Proof of Theorem 1.
Our proof relies on the use of LCI concentration inequalities. First we show that (11) holds with high probability. Note that, for any fixed latent variable and any , we have by LCI Hoeffding’s inequality. By a union bound, it follows that
[TABLE]
Setting , we obtain
[TABLE]
where the last inequality holds given that .
Next we show that (12) holds with high probability, and we use the LCI matrix Bernstein inequality in our proof. In this part we denote , and to be the matrix with in entry , and 0 everywhere else. Note that is a matrix with in entry , and 0 everywhere else. Furthermore we define the matrix . One can note that ’s are LCI random matrices given , with the maximum eigenvalue bounded above by . Also note that for any given , we have . By our construction, it follows that . Thus for any given , it follows that . Then applying the LCI matrix Bernstein inequality, we obtain
[TABLE]
Setting , we obtain
[TABLE]
where the last inequality holds given that .
Combining the results above, the probability of being greater than zero is at least , as long as . The last remaining task is to take into account. By Lemma 8, setting for some constant gives us
[TABLE]
Simplification leads to
[TABLE]
To further simplify the bound above we consider two cases. If , a sufficient condition will be . On the other hand if , a sufficient condition will be . Thus for either case, , for some large constant , is a sufficient condition. This completes our proof. ∎
Appendix C Proofs for Additional Analysis
C.1 Proof of Claim 1
In the following proof, we use notation to denote the space of feasible solutions. Mathematically, we have the following definition
[TABLE]
and we assume that the groundtruth is sampled uniformly at random from .
Proof.
First we characterize the mutual information between the true labels and the observed matrix . Using the pairwise KL-based bound [25], we obtain
[TABLE]
where denotes the KL-divergence between two probability distributions. Then we can apply Fano’s inequality [7]. For any predicted labels , we have
[TABLE]
By definition of and counting, it follows that
[TABLE]
Note that . It follows that
[TABLE]
which indicates that
[TABLE]
and the last inequality holds under the mild assumption of .
Finally, by Fano’s inequality, for the probability of error to be at least , it is sufficient to require the lower bound to be greater than . Hence
[TABLE]
and the last inequality holds provided that and . ∎
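The Fano step of this argument is easy to evaluate numerically. The sketch below is an illustration under assumptions, not a reproduction of the paper's bound: it uses the standard form of Fano's inequality for a uniformly distributed hypothesis, P(error) ≥ 1 - (I + log 2) / log |Z|, and it assumes that the feasible set consists of all balanced labeled assignments, of which there are n! / (m!)^k. The function names are hypothetical.

```python
import math

def log_num_balanced_assignments(n, k):
    """Natural log of n! / (m!)^k, the number of balanced labeled assignments."""
    m = n // k  # assumes n is divisible by k
    return math.lgamma(n + 1) - k * math.lgamma(m + 1)

def fano_error_lower_bound(mutual_info_bound, log_num_hypotheses):
    """Standard Fano bound: P(error) >= 1 - (I + log 2) / log |Z|, in nats."""
    return max(0.0, 1.0 - (mutual_info_bound + math.log(2)) / log_num_hypotheses)

log_size = log_num_balanced_assignments(n=4, k=2)      # log of 6 assignments
no_information = fano_error_lower_bound(0.0, log_size)
print(math.exp(log_size), no_information)
```

When the observation carries no information about the assignment, the bound depends only on the size of the hypothesis space; as the mutual information grows, the bound decreases and eventually becomes vacuous.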
C.2 Proof of Claim 2
In the following proof we define . Before we start our proof we first present the following result.
**Lemma 11** (Lemma 1.1, [6]).
For each , we have
[TABLE]
Our proof consists of two steps. We first show the deterministic condition for problem (13) to succeed, and then derive the statistical condition by bounding from its expectation . We present the following lemma.
Lemma 12.
If the following condition
[TABLE]
holds, then maximum likelihood estimation (13) achieves exact inference.
Proof.
To prove that problem (13) returns the optimal solution, it suffices to show that for every , is strictly positive. Note that
[TABLE]
Regarding the last term in (23), note that . Given the fact that for every , we have
[TABLE]
∎
We now present the proof of Theorem 2.
Proof.
To show that (22) holds with high probability, we use the LCI Bernstein inequality. For any fixed collection of latent variables and any , is a Bernoulli random variable centered at [math], bounded between and , with a variance bounded above by . Thus is a sum of LCI random variables given . The LCI Bernstein inequality implies
[TABLE]
for every .
Setting , it follows that
[TABLE]
By a union bound we obtain
[TABLE]
where the third line follows from Lemma 11, and the second-to-last inequality holds given that .
Finally, applying Lemma 12, the probability of being greater than zero is at least . This completes our proof. ∎
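To give a feel for the kind of tail bound used in this proof, the sketch below evaluates the classical scalar Bernstein inequality for a sum of independent centered Bernoulli variables and compares it with a Monte Carlo estimate. This is the textbook inequality, P(S ≥ t) ≤ exp(−(t²/2) / (σ² + Rt/3)), and not the paper's LCI variant, which additionally conditions on the latent variables. The function names and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def bernstein_tail_bound(t, variance_sum, max_abs):
    """Classical Bernstein bound on P(sum of centered terms >= t).

    variance_sum: sum of the variances of the independent terms.
    max_abs: almost-sure bound on the absolute value of each centered term.
    """
    return math.exp(-(t * t / 2.0) / (variance_sum + max_abs * t / 3.0))


def empirical_tail(n, p, t, trials, seed=0):
    """Monte Carlo estimate of P(sum_i (B_i - p) >= t) for B_i ~ Bernoulli(p)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        centered_sum = sum((1.0 if rng.random() < p else 0.0) - p for _ in range(n))
        if centered_sum >= t:
            hits += 1
    return hits / trials
```

With n Bernoulli(p) terms, one would call `bernstein_tail_bound(t, n * p * (1 - p), max(p, 1 - p))` and compare it against `empirical_tail(n, p, t, trials)`; the bound is typically loose but holds for every t > 0.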
Appendix D Experiments
In this section, we validate our theoretical findings through synthetic experiments. We compare the theoretical exact inference condition suggested by our SDP analysis with the experimental results of exact inference using CVX [13, 12] to solve the SDP problem. We run synthetic experiments on four models: a latent space model with three clusters, a latent space model with two clusters, an exchangeable graph model with two clusters, and a kernel latent variable model with two clusters.
Latent space model with three clusters (Fig. 2). We pick as the latent domain. We fix the number of entities to be . We generate by randomly assigning entities to three groups of equal size. We generate the latent variables using Gaussian distributions, such that , , , and . The parameters in our simulations are and . Each entry follows a Bernoulli distribution with probability . For each pair of and , we count: a) how many times (out of ) the fourth smallest eigenvalue of is greater than zero, and b) how many times (out of ) CVX returns the correct . This allows us to compute an empirical probability of success for the statistical condition and CVX, respectively. Our experiments show that if the fourth smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.
Latent space model with two clusters (Fig. 3). We pick as the latent domain. We fix the number of entities to be . Note that in the two-cluster case, we can let the group assignment matrix become a vector by using the encoding. We generate by randomly assigning entities to one group (), and entities to the other group (). Since we are using the encoding, we only need to check the second smallest eigenvalue as the sufficient condition. We generate the latent variables using Gaussian distributions, such that , , where denotes the Gaussian distribution. We also set . The parameters in our simulations are and . Each entry follows a Bernoulli distribution with probability . For each pair of and , we count: a) how many times (out of ) the second smallest eigenvalue of is greater than zero, and b) how many times (out of ) CVX returns the correct . Our experiments show that if the second smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.
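The data-generating procedure described above can be sketched as follows. This is an illustrative sketch only and not the authors' experimental code: the one-dimensional latent domain, the cluster centers at ±`mu`, the link function exp(−(x_i − x_j)²), and every parameter value are assumptions introduced here, since the concrete settings are given in the text and figures rather than reproduced in this sketch.

```python
import math
import random


def sample_two_cluster_latent_space(n, mu, sigma, seed=0):
    """Sample labels, latent positions, and a symmetric 0/1 observation matrix."""
    if n < 4 or n % 2 != 0:
        raise ValueError("n must be an even number >= 4")
    rng = random.Random(seed)
    # Balanced +1/-1 encoding of the two groups.
    labels = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(labels)
    # ASSUMPTION: one-dimensional Gaussian latent variables centered at +mu / -mu.
    x = [rng.gauss(y * mu, sigma) for y in labels]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # ASSUMPTION: edge probability decays with squared latent distance.
            p = math.exp(-((x[i] - x[j]) ** 2))
            A[i][j] = A[j][i] = 1 if rng.random() < p else 0
    return labels, x, A


def edge_densities(labels, A):
    """Return (within-group, cross-group) fractions of connected pairs."""
    within = [0, 0]  # [edges, pairs]
    cross = [0, 0]
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            bucket = within if labels[i] == labels[j] else cross
            bucket[0] += A[i][j]
            bucket[1] += 1
    return within[0] / within[1], cross[0] / cross[1]
```

Repeating the sampling over many seeds for each parameter pair, and recording how often a recovery procedure returns the planted labels, gives the kind of empirical success probability reported in the figures.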
Exchangeable graph model with two clusters (Fig. 4). We pick as the latent domain. We fix the number of entities to be . We generate using the same method as in the latent space model with two clusters. We generate the latent variables as follows: for every , its digits follow a Bernoulli distribution with parameter , if entity is in the first group; its digits follow a Bernoulli distribution with parameter , if entity is in the second group. We set . The parameters in our simulations are and . Each entry follows a Bernoulli distribution with probability . Our experiments show that if the second smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.
Kernel latent variable model with two clusters (Fig. 5). We pick to be the power set of as the latent domain. We fix the number of entities to be . We generate using the same method as in the latent space model with two clusters. We generate the latent variables as follows: every is a subset of . Each element through is in set with probability if entity is in the first group, and with probability if entity is in the second group. Each element through is in set with probability if entity is in the first group, and with probability if entity is in the second group. We set the kernel , and . The parameters in our simulations are and . Each entry follows a Beta distribution with parameters . Our experiments show that if the second smallest eigenvalue is strictly positive, then exact inference can be performed efficiently by semidefinite programming.
D.1 Larger Number of Entities
Here we provide synthetic experiment results for a large number of entities with in the latent space model with two clusters. We pick as the latent domain, to be , and the number of trials to be . We compute the second minimum eigenvalue with being and . With , the number of runs with positive second minimum eigenvalue is (out of ). With , the number of runs with positive second minimum eigenvalue is [math] (out of ). We also run SDP for both cases. With the number of runs where SDP succeeded is (out of ). With the number of runs where SDP succeeded is [math] (out of ). Both results (success for and failure for ) confirm our finding in Theorem 1.
D.2 Real-world Data
To test the adequacy of SDP on a real-world dataset in which our assumptions might not hold, we use email-Eu-core [19], an openly available dataset from the Stanford Large Network Dataset Collection. In our experiments we used CVX [13, 12] as the solver.
The procedure is as follows. We select the two largest clusters from the dataset as the test data. The size of the test data is , and the sizes of the two clusters are and , respectively. The adjacency matrix is shown in Figure 6. Note that in the diagonal blocks of the adjacency matrix, the distribution of edges is not uniform and seems to depend strongly on the entities. That is, some rows are denser than others, indicating that some entities might be closer (in a latent space) to other entities. We run SDP with the adjacency matrix and obtain the solution . We then set as the output of the algorithm. Compared with the ground truth, our algorithm achieved an accuracy of 95.52%.
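When comparing a two-cluster output with the ground truth, the cluster names are arbitrary, so a labeling and its global swap describe the same partition. The sketch below computes accuracy up to that swap. It assumes ±1-encoded labels and assumes that accuracy is measured in this swap-invariant way; the paper does not spell out the exact formula, and the function name `accuracy_up_to_swap` is hypothetical.

```python
def accuracy_up_to_swap(predicted, truth):
    """Fraction of entities labeled correctly, maximized over swapping the
    two cluster names. Both inputs are sequences of +1 / -1 labels."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(truth)
    agree = sum(1 for p, t in zip(predicted, truth) if p == t)
    # With binary labels, flipping every prediction turns each disagreement
    # into an agreement, so the swapped labeling agrees on n - agree entities.
    return max(agree, n - agree) / n
```

Under this convention a random guess scores close to 0.5 rather than 0, which is the natural baseline for two-cluster comparisons.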
For comparison, we ran the same real-world experiment using the Kernighan-Lin algorithm with random initialization for iterations. The average accuracy was 52.91%, with a standard error of 0.21%.
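The average and standard error above summarize repeated randomly initialized runs. A minimal sketch of such a summary follows, assuming the standard error is the sample standard deviation divided by the square root of the number of runs (the usual definition; the paper does not state which formula was used). The function name is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of run accuracies."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) over sqrt(number of runs).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

A call such as `mean_and_standard_error([0.52, 0.54, 0.53])` uses made-up accuracies purely to show the interface; they are not results from the paper.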
References
- [1] Emmanuel Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.
- [2] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
- [3] Arash A Amini, Elizaveta Levina, et al. On semidefinite relaxations for the block model. The Annals of Statistics, 46(1):149–179, 2018.
- [4] Afonso S Bandeira. Random Laplacian matrices and convex relaxations. Foundations of Computational Mathematics, 18(2):345–379, 2018.
- [5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- [6] Yudong Chen and Jiaming Xu. Statistical-computational phase transitions in planted models: The high-dimensional setting. In International Conference on Machine Learning, pages 244–252, 2014.
- [7] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
- [8] Jean-Jacques Daudin, Laurent Pierre, and Corinne Vacher. Model for heterogeneous random networks using continuous latent variables and an application to a tree–fungus network. Biometrics, 66(4):1043–1051, 2010.
