Sequential Local Learning for Latent Graphical Models
Sejun Park, Eunho Yang, Jinwoo Shin

TL;DR
This paper introduces a sequential local learning framework for latent graphical models, expanding the class of models that can be effectively learned by leveraging marginalization and conditioning techniques.
Contribution
It proposes a novel sequential learning approach that enlarges the class of latent GMs solvable by method of moments, including complex loopy models.
Findings
Enlarged the class of learnable latent GMs
Successfully applied to convolutional and random regular models
Achieved broader applicability over existing methods
Sejun Park and Jinwoo Shin∗: Department of Electrical Engineering, Korea Advanced Institute of Science & Technology, Republic of Korea.
Eunho Yang: Department of Computer Science, Korea Advanced Institute of Science & Technology, Republic of Korea.
Abstract
Learning parameters of latent graphical models (GM) is inherently much harder than that of no-latent ones since the latent variables make the corresponding log-likelihood non-concave. Nevertheless, expectation-maximization schemes are popularly used in practice, but they typically get stuck in local optima. In recent years, the method of moments has provided a refreshing angle for resolving the non-convex issue, but it is applicable to a quite limited class of latent GMs. In this paper, we aim to enhance its power by enlarging this class of latent GMs. To this end, we introduce two novel concepts, coined marginalization and conditioning, which can reduce the problem of learning a larger GM to that of a smaller one. More importantly, they lead to a sequential learning framework that repeatedly increases the learned portion of a given latent GM, and thus covers a significantly broader and more complicated class of loopy latent GMs, which includes convolutional and random regular models.
1 Introduction
Graphical models (GM) are a succinct representation of a joint distribution on a graph where each node corresponds to a random variable and the edges encode the conditional independence structure among the random variables. GMs have been successfully applied in various fields including information theory [12, 19], physics [24] and machine learning [18, 11]. Introducing latent variables to GMs has been a popular approach for enhancing their representation power in recent deep models, e.g., convolutional/restricted/deep Boltzmann machines [20, 27]. Furthermore, latent variables are unavoidable in certain scenarios where a part of the samples is missing, e.g., see [10].
However, learning parameters of latent GMs is significantly harder than that of no-latent ones since the latent variables make the corresponding negative log-likelihood non-convex. The main challenge comes from the difficulty of inferring unobserved/latent marginal probabilities associated with latent/hidden variables. Nevertheless, expectation-maximization (EM) schemes [9] have been popularly used in practice with empirical successes, e.g., contrastive divergence learning for deep models [14]. They iteratively infer unobserved marginals given the current estimate of the parameters, and typically get stuck at local optima of the log-likelihood function [26].
To address this issue, the spectral methods have provided a refreshing angle on learning probabilistic latent models [2]. These theoretical methods exploit the linear algebraic properties of a model to factorize observed (low-order) moments/marginals into unobserved ones. Furthermore, the factorization methods can be combined with convex log-likelihood optimizations under certain structures, coined exclusive views, of latent GMs [7]. Both factorization methods and exclusive views can be understood as ‘local algorithms’ handling certain partial structures of latent GMs. However, up to now, they are known to be applicable to a quite limited class of latent GMs, and not as broadly applicable as EM, which is the main motivation of this paper.
Contribution. Our major question is “Can we learn latent GMs of more complicated structures beyond naive applications of local algorithms, e.g., known factorization methods or exclusive views?”. To address this, we introduce two novel concepts, called marginalization and conditioning, which reduce the problem of learning a larger GM to that of a smaller one. Hence, if the smaller one can be processed by known local algorithms, then so can the larger one. Our marginalization concept suggests searching for a ‘marginalizable’ subset of variables of a GM so that their marginal distributions are invariant with respect to other variables under certain graphical transformations. It allows one to focus on learning the smaller transformed GM, instead of the original larger one. On the other hand, our conditioning concept removes some dependencies among variables of a GM, simply by conditioning on some subset of variables. Hence, it enables us to discover marginalizable structures that were not marginalizable before conditioning. At first glance, conditioning looks very powerful, as conditioning on more variables would discover more of the desired marginalizable structures. However, as more variables are conditioned, the algorithmic complexity grows exponentially. Therefore, we set an upper bound on the number of conditioned variables.
Marginalization and conditioning naturally motivate a sequential scheme that repeatedly recovers larger portions of unobserved marginals given previously recovered/observed ones, i.e., recursively recovering unobserved marginals utilizing any ‘black-box’ local algorithms. Developing new local algorithms, other than known factorization methods and exclusive views, is not our main focus. Nevertheless, we provide two new such algorithms, coined disjoint views and linear views, which play a similar role to exclusive views, i.e., they can also be combined with known factorization methods. Given these local algorithms, the proposed sequential learning scheme can learn a significantly broader and more complicated class of latent GMs than previously known ones, including convolutional restricted Boltzmann machines and GMs on random regular graphs, as described in Section 5. Consequently, our results imply that there exists a one-to-one correspondence between observed distributions and parameters for this class of latent GMs. Furthermore, for arbitrary latent GMs, it can be used to boost the performance of EM as a pre-processing stage: first run it to recover as many unobserved marginals as possible, and then run EM using the additional information. We believe that our approach provides a new angle for the important problem of learning latent GMs.
Related works. Parameter estimation of latent GMs has a long history, dating back to [9]. While it can be broadly applied to most latent GMs, the EM algorithm suffers not only from local optima but also from a risk of slow convergence. A natural alternative to the general EM method is to constrain the structure of graphical models. In independent component analysis (ICA) and its extensions [17, 4], latent variables are assumed to be independent, inducing a simple product form of the latent distribution. Recently, spectral methods have been successfully applied to various classes of GMs including latent trees [21, 31], ICA [8, 25], Gaussian mixture models [15], hidden Markov models [28, 30, 16, 3, 34], latent Dirichlet allocation [1] and others [13, 6, 35, 29]. In particular, [2] proposed an algorithm of tensor type under certain graph structures.
Another important line of work using the method of moments for latent GMs concerns recovering joint or conditional probabilities only among observable variables (see [5] and its references). [23, 22] proposed spectral algorithms to recover the joint distribution among observable variables when the graph structure is a bottlenecked tree. [7] relaxed the constraint of tree structure and proposed a technique to combine the method of moments with likelihood for certain structures. Our generic sequential learning framework allows the use of all these approaches as key components, in order to broaden the applicability of the methods. We note that we primarily focus on undirected pairwise binary GMs in this paper, but our results can be naturally extended to other GMs.
2 Preliminaries
2.1 Graphical Model and Parameter Learning
Given undirected graph , we consider the following pairwise binary Graphical Model (GM), where the joint probability distribution on is defined as:
[TABLE]
for some parameter and . The normalization constant is called the partition function.
Given samples drawn from the distribution (1) with some true (fixed but unknown) parameter , the problem of our interest is recovering it. The popular method for the parameter learning task is the following maximum likelihood estimation (MLE):
[TABLE]
where it is well known [32] that the log-likelihood is concave with respect to , and the gradient of the log-likelihood is
[TABLE]
Here, the last term, expectation of corresponding sufficient statistics, comes from the partial derivative of the log-partition function. Furthermore, it is well known that there exists a one-to-one correspondence between parameter and sufficient statistics (see [32] for details).
One can further observe that if the number of samples is sufficiently large, i.e., , then (2) is equivalent to
[TABLE]
where the true parameter achieves the (unique) optimal solution. This directly implies that, once empirical nodewise and pairwise marginals in (3) and (4) approach the true marginals, the gradient method can recover modulo the difficulty of exactly computing the expectations of sufficient statistics.
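The moment-matching view of fully observed MLE can be illustrated with a minimal sketch. It assumes the standard pairwise binary form with variables in {-1, +1}, edge parameters `beta` and node parameters `gamma` (the encoding and all function names here are assumptions of this sketch, not taken from the text), and it computes expectations by brute-force enumeration, which is only feasible for tiny graphs. Each gradient step adds the gap between the target moments and the model moments.

```python
import itertools
import math


def model_moments(nodes, edges, beta, gamma):
    """Exact E[x_i] and E[x_i x_j] by enumerating all states (tiny graphs only)."""
    states = list(itertools.product([-1, 1], repeat=len(nodes)))
    weights = []
    for x in states:
        energy = sum(beta[e] * x[e[0]] * x[e[1]] for e in edges)
        energy += sum(gamma[i] * x[i] for i in nodes)
        weights.append(math.exp(energy))
    z = sum(weights)  # partition function
    node_m = {i: sum(w * x[i] for w, x in zip(weights, states)) / z for i in nodes}
    edge_m = {
        e: sum(w * x[e[0]] * x[e[1]] for w, x in zip(weights, states)) / z
        for e in edges
    }
    return node_m, edge_m


def fit_by_moment_matching(nodes, edges, target_node, target_edge, lr=0.2, steps=5000):
    """Gradient ascent on the concave log-likelihood: gradient = target - model moments."""
    beta = {e: 0.0 for e in edges}
    gamma = {i: 0.0 for i in nodes}
    for _ in range(steps):
        node_m, edge_m = model_moments(nodes, edges, beta, gamma)
        for e in edges:
            beta[e] += lr * (target_edge[e] - edge_m[e])
        for i in nodes:
            gamma[i] += lr * (target_node[i] - node_m[i])
    return beta, gamma


# Hypothetical 3-node loopy model; its exact moments stand in for infinite samples.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
true_beta = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.2}
true_gamma = {0: 0.1, 1: -0.2, 2: 0.3}
target_node, target_edge = model_moments(nodes, edges, true_beta, true_gamma)
beta_hat, gamma_hat = fit_by_moment_matching(nodes, edges, target_node, target_edge)
```

Because the moments determine the parameters uniquely in the fully observed case, `beta_hat` and `gamma_hat` approach the generating parameters. The difficulty discussed next is that moments involving latent nodes are not available from samples.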
Now let us consider a more challenging task: parameter learning under latent variables. Given a subset of and , we assume that for every sample , are observed/visible and other variables are hidden/latent. In this case, MLE only involves observed variables:
[TABLE]
where . As before, the true parameter achieves the optimal solution of (5) if the number of samples is large enough. However, the log-likelihood under latent variables is no longer concave, which makes the parameter learning task harder. One can apply an expectation-maximization (EM) scheme, but it typically gets stuck in local optima.
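The loss of concavity can be seen in a toy case. The sketch below assumes one visible variable v and one latent variable h, both in {-1, +1}, with joint weight exp(beta * v * h + gamma_h * h); this two-node model and its names are illustrative assumptions. Flipping the label of h maps (beta, gamma_h) to (-beta, -gamma_h) without changing the distribution of v, so two distinct parameters share the same likelihood while their midpoint does worse, which a concave function would not allow.

```python
import math


def visible_log_likelihood(beta, gamma_h, p_plus):
    """Expected log-likelihood of v when the data has P(v = +1) = p_plus.

    Summing the latent h out of exp(beta*v*h + gamma_h*h) leaves an
    unnormalized weight 2*cosh(beta*v + gamma_h) for each value of v.
    """
    w = {v: 2.0 * math.cosh(beta * v + gamma_h) for v in (-1, 1)}
    z = w[-1] + w[1]
    return p_plus * math.log(w[1] / z) + (1.0 - p_plus) * math.log(w[-1] / z)


p_plus = 0.8
ll_a = visible_log_likelihood(1.0, 1.0, p_plus)      # one labeling of h
ll_b = visible_log_likelihood(-1.0, -1.0, p_plus)    # the flipped labeling
ll_mid = visible_log_likelihood(0.0, 0.0, p_plus)    # midpoint of the two

# Concavity would require ll_mid >= (ll_a + ll_b) / 2; here it is strictly smaller.
violates_concavity = ll_mid < 0.5 * (ll_a + ll_b)
```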
2.2 Tensor Decomposition
The fundamental issue in parameter learning of latent GMs is that it is hard to infer the pairwise marginals for latent variables directly from samples. If one could infer them, one could also recover as discussed in the previous section. Somewhat surprisingly, however, under certain conditions on a latent GM, pairwise marginals including latent variables can be recovered using low-order visible marginals. Before introducing such conditions, we first make the following assumption for any GM on a graph considered throughout this paper.
**Assumption 1 (Faithful).**
*For any two nodes , if are connected, then are dependent.*
This faithfulness assumption implies that GM only has conditional independences given by the graph . We also introduce the following notion [2].
**Definition 1 (Bottleneck).**
*A node is a bottleneck if there exists , denoted as ‘views’, such that every path between two of contains .*
Figure 1(a) illustrates the bottleneck. By construction, views are conditionally independent given the bottleneck. Armed with this notion, now we introduce the following theorem to provide sufficient conditions for recovering unobserved/latent marginals [2].
**Theorem 1.**
*Given GM with a parameter , suppose is a bottleneck with views . If is given, then there exists an algorithm which outputs up to relabeling of , i.e. ignoring symmetry of and .*
The above theorem implies that using visible marginals , one can recover unobserved marginals involving . For a bottleneck with more than three views, the joint distribution of the bottleneck and views is recoverable using Theorem 1 by choosing three views at a time.
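To make the phrase "up to relabeling" concrete, here is a minimal sketch for one binary latent bottleneck with three binary views. It is not the tensor algorithm of [2]; it swaps in a simpler eigenvalue argument that works for 2x2 moment matrices, and the {0, 1} encoding and all names are assumptions of this sketch. With P12 the joint table of the first two views and P12|3 the same table restricted to the third view being 1, the matrix P12|3 P12^{-1} has eigenvalues P(x3 = 1 | h = 0) and P(x3 = 1 | h = 1), obtained from visible moments alone but with no indication of which eigenvalue belongs to which label.

```python
import math


def joint_moments(pi, a, b, c):
    """Visible moments of a bottleneck model.

    pi[h] = P(h); a[h], b[h], c[h] = P(view = 1 | h) for views 1, 2, 3.
    Returns P(x1, x2) and P(x1, x2, x3 = 1) as 2x2 tables.
    """
    def cond(p, value):
        return p if value == 1 else 1.0 - p

    p12 = [[0.0, 0.0], [0.0, 0.0]]
    p12_3 = [[0.0, 0.0], [0.0, 0.0]]
    for h in (0, 1):
        for x1 in (0, 1):
            for x2 in (0, 1):
                w = pi[h] * cond(a[h], x1) * cond(b[h], x2)
                p12[x1][x2] += w
                p12_3[x1][x2] += w * c[h]
    return p12, p12_3


def recover_view_conditionals(p12, p12_3):
    """Eigenvalues of p12_3 @ inv(p12), sorted; the latent labels stay ambiguous."""
    det = p12[0][0] * p12[1][1] - p12[0][1] * p12[1][0]
    inv = [[p12[1][1] / det, -p12[0][1] / det],
           [-p12[1][0] / det, p12[0][0] / det]]
    m = [[sum(p12_3[i][k] * inv[k][j] for k in (0, 1)) for j in (0, 1)]
         for i in (0, 1)]
    tr = m[0][0] + m[1][1]
    dt = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * dt, 0.0))
    return sorted([(tr - disc) / 2.0, (tr + disc) / 2.0])


# Hypothetical ground truth, used only to generate the visible moments.
pi = [0.4, 0.6]
a, b, c = [0.2, 0.7], [0.3, 0.8], [0.1, 0.9]
p12, p12_3 = joint_moments(pi, a, b, c)
recovered = recover_view_conditionals(p12, p12_3)
```

The sketch requires P12 to be invertible, which holds here because both latent states have positive probability and each of the first two views depends on the latent node.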
Besides , there are other conditions on latent GMs under which marginals including latent variables are recoverable. Before elaborating on these conditions, we introduce the following notion for a GM on a graph [7].
**Definition 2 (Exclusive View).**
*For a set of nodes , we say it satisfies the exclusive view property if for each , there exists , denoted as ‘exclusive view’, such that every path between and contains .*
Figure 1(b) illustrates the exclusive view property. Now, we are ready to state the conditions for recovering unobserved marginals using the property [7].
**Theorem 2.**
*Given GM with a parameter , suppose a set of nodes satisfies the exclusive view property with a set of exclusive views . If and are given for all and an exclusive view of , then there exists an algorithm which outputs .*
At first glance, Theorem 2 does not seem useful, as it requires a set of marginals including every variable corresponding to . However, suppose a set of latent nodes satisfies the property while its set of exclusive views is visible, i.e., is observed. If for all , is a bottleneck with views containing its exclusive view , then one can resort to to obtain .
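Both structural conditions are path conditions and can be checked by graph search. The sketch below is an illustration under our reading of Definitions 1 and 2: a node is a bottleneck for a set of views if the views fall into different components once the node is removed, and a set of nodes has exclusive views if each node separates its own view from the other nodes in the set. The adjacency-dictionary format and the function names are assumptions of this sketch.

```python
from collections import deque


def reachable(adj, start, removed):
    """Nodes reachable from `start` without entering any node in `removed`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in removed and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def is_bottleneck(adj, node, views):
    """True if every path between two distinct views passes through `node`."""
    views = list(views)
    if node in views:
        return False
    for idx, v in enumerate(views):
        component = reachable(adj, v, removed={node})
        if any(u in component for u in views[idx + 1:]):
            return False
    return True


def has_exclusive_views(adj, nodes, view_of):
    """True if each i in `nodes` separates view_of[i] from the other nodes."""
    for i in nodes:
        component = reachable(adj, view_of[i], removed={i})
        if any(j in component for j in nodes if j != i):
            return False
    return True


# Star graph: h is a bottleneck with views a, b, c.
star = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
star_ok = is_bottleneck(star, "h", ["a", "b", "c"])

# Adding the edge a-b creates a path between views that avoids h.
shortcut = {"h": {"a", "b", "c"}, "a": {"h", "b"}, "b": {"h", "a"}, "c": {"h"}}
shortcut_ok = is_bottleneck(shortcut, "h", ["a", "b", "c"])

# Two linked latent nodes, each with its own leaf as exclusive view.
pair = {"h1": {"h2", "v1"}, "h2": {"h1", "v2"}, "v1": {"h1"}, "v2": {"h2"}}
pair_ok = has_exclusive_views(pair, ["h1", "h2"], {"h1": "v1", "h2": "v2"})
```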
3 Marginalizing and Conditioning
In Section 2.2, we introduced sufficient conditions for recovering unobservable marginals. Specifically, Theorems 1 and 2 state that for certain structures of latent GMs, it is possible to recover latent marginals simply from low-order visible marginals and in turn the parameters of latent GMs via convex MLE estimators in (2).
Now, a natural question arises: “Can we even recover unobserved marginals for latent GMs with more complicated structures beyond naive applications of the bottlenecks or exclusive views?” To address this, in this section we enlarge the class of such latent GMs by proposing generic concepts, marginalization and conditioning.
3.1 Key Ideas
We start by defining two concepts, marginalization and conditioning, formally. The former is a combinatorial concept defined as follows.
**Definition 3 (Marginalization).**
Given graph , we say is marginalizable if for all , there exists a (minimal) set with such that and are disconnected in . (Footnote 1: is the subgraph of induced by .) For marginalizable set in , the marginalization of , denoted by , is the graph on with edges
[TABLE]
In Figure 2, for example, node is disconnected from when is removed. Hence, the edge between and is additionally included in the marginalization of .
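The combinatorial definition can be sketched as a graph routine. The version below rests on an assumption about the stripped formulas: the separating set attached to each removed node has at most two nodes, and when it has exactly two, an edge between them is added. Under that reading, the minimal separating set of a removed node is the set of kept neighbours of its connected component among the removed nodes. The function name, the `max_boundary` parameter and the adjacency format are hypothetical.

```python
from collections import deque


def marginalization(adj, keep, max_boundary=2):
    """Edge set of the marginalization on `keep`, or None if not marginalizable."""
    keep = set(keep)
    # Edges of the original graph with both endpoints kept.
    edges = {frozenset((u, w)) for u in keep for w in adj[u] if w in keep}
    visited = set()
    for start in adj:
        if start in keep or start in visited:
            continue
        # Grow the component of removed nodes and collect its kept neighbours.
        component, boundary = {start}, set()
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in keep:
                    boundary.add(w)
                elif w not in component:
                    component.add(w)
                    queue.append(w)
        visited |= component
        if len(boundary) > max_boundary:
            return None  # separating set too large under the assumed bound
        if len(boundary) == 2:
            edges.add(frozenset(boundary))  # edge induced by summing out the component
    return edges


# Path a - u - b with an extra edge b - c; removing u links a and b.
path = {"a": {"u"}, "u": {"a", "b"}, "b": {"u", "c"}, "c": {"b"}}
path_edges = marginalization(path, keep={"a", "b", "c"})

# A removed centre with three kept neighbours exceeds the assumed bound.
star = {"u": {"a", "b", "c"}, "a": {"u"}, "b": {"u"}, "c": {"u"}}
star_edges = marginalization(star, keep={"a", "b", "c"})
```

The size bound matters because summing out a component attached to three kept nodes would generally create a three-way interaction, which a pairwise model cannot represent.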
With the definition of marginalization, the following key proposition reveals that recovering unobserved marginals of a latent GM can be actually reduced to that of much smaller latent GM.
**Proposition 3.**
Consider a GM on with a parameter . If is marginalizable in , then there exists a (unique) such that the GM on with a parameter induces the same distribution on , i.e.,
[TABLE]
The proof of the above proposition is presented in Appendix A. Proposition 3 indeed provides a way of representing the marginal probability on of GM via the smaller GM on . Suppose there exists any algorithm (e.g., via bottlenecks, although we do not restrict ourselves to this method) that can recover a joint distribution , or equivalently sufficient statistics, of latent GM on only using observed marginals in . Then, it should be
[TABLE]
where is the unique parameter satisfying (6). Using Proposition 3 and marginalization, one can recover unobserved marginals of a large GM by considering smaller GMs corresponding to marginalizations of the large one. The role of marginalization will be further discussed and clarified in Section 4.
In addition to marginalizing, we introduce the second key ingredient, called conditioning, with which the class of recoverable latent GMs can be further expanded.
**Proposition 4.**
*For a graph , for and , is a subgraph of .*
The proof of the above proposition is straightforward since (defined in Definition 3) for in contains that for in , i.e., the edge set of contains that of . Figure 3 illustrates how conditioning actually broadens the class of recoverable latent GMs, as suggested by Proposition 4. Once the node is conditioned out, the marginalization (Figure 3(c)) is of a form that can be handled by .
3.2 Labeling Issues
In spite of its usefulness, there is a caveat in performing conditioning: consistent labeling of latent nodes. For example, consider the latent GM as in Figure 3. Conditioned on , is a bottleneck with views , , (Figure 3(c)). If is given, one can recover the conditional distribution up to labeling of , from Theorem 1 and conditioning. Here, the conditioning worsens the relabeling problem in the sense that we might choose different labels for for each conditioned value and . As a result, the recovered joint distribution computed as with mixed labeling of , would be different from the true joint. To handle this issue, we define the following concept for consistent labeling of latent variables.
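The effect of mixed labeling can be checked numerically. The sketch below uses a hypothetical three-variable chain c, h, v with values in {0, 1}, where c is the conditioned node, h is latent and v is visible; the probability tables are made up for illustration. Swapping the label of h in only one of the two conditional tables produces a recombined joint on (h, v) that is neither the true joint nor its globally relabeled version, even though the visible marginal of v is unchanged.

```python
def swap_latent_label(table):
    """Exchange the rows h = 0 and h = 1 of a table indexed as table[h][v]."""
    return [table[1], table[0]]


def recombine(p_c, cond_tables):
    """P(h, v) = sum over c of P(c) * P(h, v | c)."""
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for c in (0, 1):
        for h in (0, 1):
            for v in (0, 1):
                joint[h][v] += p_c[c] * cond_tables[c][h][v]
    return joint


# Hypothetical chain c -> h -> v.
p_c = [0.5, 0.5]
p_h_given_c = [[0.8, 0.2], [0.3, 0.7]]
p_v_given_h = [[0.9, 0.1], [0.2, 0.8]]
cond = [
    [[p_h_given_c[c][h] * p_v_given_h[h][v] for v in (0, 1)] for h in (0, 1)]
    for c in (0, 1)
]

true_joint = recombine(p_c, cond)
# Labels of h chosen inconsistently: kept for c = 0, swapped for c = 1.
mixed_joint = recombine(p_c, [cond[0], swap_latent_label(cond[1])])
# Labels swapped for every value of c: a harmless global relabeling.
relabeled_joint = recombine(p_c, [swap_latent_label(t) for t in cond])
```

Since both joints agree on the visible marginal, the inconsistency cannot be detected from the distribution of v alone, which is why an external reference is needed to fix the labels.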
**Definition 4 (Label-Consistency).**
Given GM on with a parameter , we say is label-consistent for if there exists , called ‘reference’, such that
[TABLE]
called ‘preference’, is consistently positive or negative for all . (Footnote 3: Note that the preference cannot be zero due to Assumption 1.)
In Figure 3, for example, is label-consistent for with reference since the corresponding preference is a function only of , which is fixed as either or (note that the reference can be arbitrarily chosen due to the symmetry of the structure). Using the label-consistency of , one can choose a consistent label of by choosing the label consistent with the preference of the reference node .
Even if is label-consistent under GM with the true known parameter, we need to specify the reference and corresponding preference to obtain a correct labeling on . We note however that attractive GMs (i.e., for all ) always satisfy the label-consistency with any reference node since for any and where are connected in ,
[TABLE]
Furthermore, there can be some settings in which we can force the label-consistency from the structure of latent GMs even without the information of its true parameter. For example, consider a latent GM on and a parameter . For a set , a latent node and its neighbor such that is the only path from to in , by symmetry of labels of latent nodes, one can assume that , i.e.,
[TABLE]
to force the label-consistency of for . In general, one can still choose labels of latent variables to maximize the log-likelihood of observed variables.
As in conditioning, marginalization also has a labeling issue. Consider a latent GM on . Suppose that every unobserved pairwise marginal can be recovered by two marginalizations of . If there is a common latent node , then the labeling for might be inconsistent. To address this issue, we make the following assumption on graph , node , and parameter of GM.
**Assumption 2 (Degeneracy).**
*. *
Under the assumption, one can choose a label of to satisfy using the symmetry of labels of latent nodes.
4 Sequential Marginalizing and Conditioning
In the previous section, we introduced two concepts marginalization and conditioning to translate the marginal recovery problem of a large GM into that of smaller and tractable GMs. In this section, we present a sequential strategy, adaptively applying marginalization and conditioning, by which we substantially enlarge the class of tractable GMs with hidden/latent variables.
4.1 Example
We begin with a simple example describing our sequential learning framework. Consider a latent GM as illustrated in Figure 4(a) and a parameter . Given visible marginal , our goal is to recover all unobserved pairwise marginals including or in order to learn via convex MLE (2). As neither node nor is a bottleneck, one can consider the conditioning strategy described in the previous section, i.e., the conditional distribution in Figure 4(b). Now, node is a bottleneck with views . Hence, one can recover using where the label of is set to satisfy
[TABLE]
i.e., node is label consistent. Further, can be recovered using the known visible marginals and the following identity
[TABLE]
Since we recovered pairwise marginals between and , , , the remaining goal is to recover pairwise marginals including . Now consider a latent GM where is conditioned, as illustrated in Figure 4(c). This time, the node is a bottleneck with views , which can be handled by an additional application of (the details are the same as in the previous case for node ).
This example shows that the sequential application of conditioning extends the class of latent GMs for which unobserved pairwise marginals are recoverable. Here, we use an algorithm as a black box, hence one can consider other algorithms as long as they have similar guarantees. One caveat is that conditioning on an arbitrary number of variables is very expensive, as the algorithmic (and sample) complexity grows exponentially with respect to the number of conditioned variables. Therefore, it is reasonable to bound the number of conditioned variables.
4.2 Algorithm Design
Now, we are ready to state the main learning framework sequentially applying marginalization and conditioning, summarized in Algorithm 1. Suppose that there exists an algorithm, called , e.g., , for a class of pairs such that all satisfy the following:
Given GM with a parameter on and marginals , outputs the entire distribution , up to labeling of variables on .
For example, consider a graph illustrated in Figure 1(a) with $\mathcal{S}_{G}=\big\{\{j,k,\ell\}\big\}$. Then, outputs the entire distribution .
In addition, suppose that there exists an algorithm, called , e.g., , for a class of pairs such that all satisfy the following:
Given GM with a parameter on and marginals , outputs the distribution where .
Namely, simply merges the small marginal distributions for into the entire distribution on . For example, consider a graph illustrated in Figure 1(b) with
[TABLE]
where have exclusive views , respectively. Then, outputs the distribution .
For a GM on with a parameter , suppose we know a family of label-consistency quadruples
[TABLE]
and marginals for some . As we mentioned in the previous section, we also bound the number of conditioning variables by some . Under the setting, our goal is to recover more marginals beyond initially known ones .
The following conditions for with and are sufficient so that additional marginals can be recovered by conditioning variables on , marginalizing and applying :
for some
For all , there exists such that
For all , there exist and such that ,
where . In the above, implies that if are given, then outputs up to labeling of . In addition, says that the required marginals and are known. Finally, is necessary so that all nodes whose labels we need to infer are label-consistent.
Similarly, the following conditions for with and are sufficient so that can be recovered by conditioning variables on and applying where :
For all , there exists such that ,
In the above, says that the required marginals for merging are given.
The above procedures imply that given initial marginals , one can recover additional marginals , where
[TABLE]
from and , respectively. One can repeat the above procedure for recovering more marginals as
[TABLE]
Recall that we are primarily interested in recovering all pairwise marginals, i.e.,
[TABLE]
The following theorem implies that one can check the success of Algorithm 1 in time, where are typically chosen as small constants.
**Theorem 5.**
*Suppose we have a label-consistency family of GM on and marginals for some . If Algorithm 1 eventually recovers all pairwise marginals, then it does so in iterations, where and denote the maximum numbers of conditioning variables and nodes of graphs in , respectively.*
The proof of the above theorem is presented in Appendix B. We note that one can design a custom sequence for recovering marginals, rather than recovering all marginals in , for computational efficiency. In Section 5, we provide such examples, whose strategies have linear-time complexity at each iteration. We also remark that even when Algorithm 1 recovers some, but not all, pairwise unobserved marginals for given latent GMs, it is still useful since one can run the EM algorithm using the additional information provided by Algorithm 1. We leave this suggestion for further exploration in the future.
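Stripped of the probabilistic details, the sequential framework behaves like a fixed-point iteration over sets of recovered marginals. The sketch below is an abstraction, not Algorithm 1 itself: each rule is a hypothetical stand-in for "condition, marginalize, then apply a local algorithm", stating which known marginals it requires and which new ones it yields, and the node names are made up. Since the set of known marginals only grows and the number of candidate marginals is finite, the loop must stop, which mirrors the counting argument behind Theorem 5.

```python
def sequential_closure(known, rules):
    """Apply rules (premises -> conclusions) until no new marginal is recovered.

    `known` is a set of marginals, each a frozenset of node names.
    `rules` is a list of (premises, conclusions) pairs of frozensets of marginals.
    Returns the final set of marginals and the number of productive iterations.
    """
    known = set(known)
    iterations = 0
    while True:
        new = set()
        for premises, conclusions in rules:
            if premises <= known:          # every required marginal is available
                new |= conclusions - known
        if not new:
            return known, iterations
        known |= new
        iterations += 1


def marginal(*names):
    return frozenset(names)


# Hypothetical rules on visible nodes a, b, c and latent nodes h, g.
rules = [
    # A bottleneck-style step: the visible triple yields pairs involving h.
    (frozenset({marginal("a", "b", "c")}),
     frozenset({marginal("a", "h"), marginal("b", "h"), marginal("c", "h")})),
    # A later step that needs marginals recovered in the first round.
    (frozenset({marginal("a", "h"), marginal("c", "h")}),
     frozenset({marginal("g", "h")})),
    # A rule whose premise is never recovered, so it never fires.
    (frozenset({marginal("z", "g")}),
     frozenset({marginal("z", "h")})),
]
recovered, rounds = sequential_closure({marginal("a", "b", "c")}, rules)
```

In this toy run the second rule fires only after the first round has produced its premises, so two productive iterations are needed.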
4.3 Recoverable Local Structures
For running the sequential learning framework in the previous section, one requires ‘black-box’ knowledge of a label-consistency family and a class of locally recoverable structures of latent GMs, i.e., and . The complete study on them is out of our scope, but we provide the following guidelines on their choices.
As mentioned in Section 3.2, can be found easily for some classes of GMs including attractive ones. One can also infer it heuristically for general GMs in practice. As we mentioned in the previous section, one can choose that corresponds to . Beyond , in practice, one might hope to choose an additional option for small latent GMs, since even a generic non-convex solver might compute a near-optimal solution of the MLE due to their small dimensionality.
For the choice of , we mentioned those corresponding to in the previous section. In addition, we provide the following two more examples, called and , as described in Algorithms 2 and 3, respectively. In Algorithm 3, is defined as
[TABLE]
Figure 5 illustrates and .
5 Examples
In this section, we provide concrete examples of loopy latent GM where the proposed sequential learning framework is applicable. In what follows, we assume that it uses classes corresponding to , , and .
Grid graph. We first consider a latent GM on a grid graph illustrated in Figure 6(a) where boundary nodes are visible and internal nodes are latent. The following lemma states that all pairwise marginals can be successfully recovered given observed ones, utilizing the proposed sequential learning algorithm.
**Lemma 6.**
*Consider any latent GM with a parameter illustrated in Figure 6(a), , and . Then, updated under Algorithm 1 contains all pairwise marginals.*
In the above, recall that is the set of visible nodes. The proof strategy is illustrated in Figure 6 and the formal proof is presented in Appendix C. We remark that and are not needed to prove Lemma 6.
Convolutional graph. Second, we consider a latent GM illustrated in Figure 7(a), which corresponds to a convolutional restricted Boltzmann machine (CRBM) [20], and also prove the following lemma.
**Lemma 7.**
*Consider any latent GM with a parameter illustrated in Figure 7(a), , and . Then, updated under Algorithm 1 contains all pairwise marginals.*
The proof strategy is illustrated in Figure 7 and the formal proof is presented in Appendix D. We remark that and are not needed to prove Lemma 7. Furthermore, it is straightforward to generalize the proof of Lemma 7 to arbitrary CRBMs.
**Lemma 8.**
*Consider any CRBM with visible nodes and a filter size , , and . Then, updated under Algorithm 1 contains all pairwise marginals.* (Footnote 4: The statement holds for an arbitrary stride of the CRBM.)
Random regular graph. Finally, we state the following lemma for latent random regular GMs.
**Lemma 9.**
*Consider any latent GM with a parameter on a random -regular graph for some constant , and . There exists a constant such that if the number of latent variables is at most , updated under Algorithm 1 contains all pairwise marginals a.a.s.*
The proof of the above lemma is presented in Appendix E. We note that this result cannot be obtained without our sequential learning strategy. One can obtain an explicit formula of from our proof, but it is quite a loose bound since we made little effort to optimize it.
6 Conclusion
In this paper, we present a new learning strategy for latent graphical models. Unlike known algebraic, e.g., and optimization, e.g., , approaches for this non-convex problem, ours is of combinatorial flavor and more generic using them as subroutines. We believe that our approach provides a new angle for the important learning task.
Appendix A Proof of Proposition 3
We use mathematical induction on , where is defined in Definition 3. Before starting the proof, we define the equivalence class . Now, we start the proof by considering
[TABLE]
where is some positive function. Since , one can modify a parameter only between elements of to achieve the following identity
[TABLE]
where . Using the induction hypothesis, the above identity completes the proof of Proposition 3.
Appendix B Proof of Theorem 5
Since the algorithm only uses the marginals of at most dimensions, instead of , consider the following sequence
[TABLE]
One can observe that if , then the sequential local framework cannot recover more marginals after the -th iteration, while the cardinality of increases by at least otherwise. However, the maximum cardinality of is , and this implies that the algorithm always terminates in . This completes the proof of Theorem 5.
Appendix C Proof of Lemma 6
We first consider the distribution conditioned on as illustrated in Figure 6(b). In Figure 6(b), observe that is a bottleneck with views . Furthermore, is label consistent for with a reference by assuming (or ). Hence, one can recover using and obtain using the following identity.
[TABLE]
Similarly, one can recover , , .
In order to recover marginals including or , and should be bottlenecks. Conditioned on , as illustrated in Figure 6(d), is a bottleneck with views ; however, we do not currently have the marginal . Now, we recover the marginal . Consider the distribution conditioned on as illustrated in Figure 6(c). In Figure 6(c), observe that and are disconnected if is removed. Furthermore, and are already observed. Hence, using by setting , , and conditioning , one can obtain . Now, is a bottleneck with views by conditioning . Using one can obtain . Using the same procedure, one can also obtain .
So far, we have recovered all pairwise marginals between visible and latent variables. The remaining goal is to recover pairwise marginals between latent variables. First, by setting , , and conditioning , one can recover using . Subsequently, by setting , , and conditioning , one can recover using which includes the pairwise marginals . Other pairwise marginals between latent variables can also be recovered using the same procedure. Since we end the sequence in 5 steps, this completes the proof of Lemma 6.
Appendix D Proof of Lemma 7
We first consider the distribution conditioned on as illustrated in Figure 7(b). In Figure 7(b), observe that is a bottleneck with views with a reference by assuming (or ). Hence, one can recover using and obtain using the following identity.
[TABLE]
Similarly, one can recover , , .
In order to recover marginals including or , and should be bottlenecks. Conditioned on , is a bottleneck with views ; however, we do not currently have the marginal . Now, we recover the marginal . Since we observed and , we can recover using by setting , and . Likewise, using , one can recover a marginal as well. Using the recovered marginal , conditioning on and using , one can recover . Similarly, one can recover . Since the sequence ends in 4 steps, this completes the proof of Lemma 7.
Appendix E Proof of Lemma 9
The main idea of the proof is to show that every set of latent nodes of size contains at least one latent node recoverable using , where . We first state the following condition for a latent node .
**Condition 1.**
*For a latent node , two of its neighbors are visible and a set of neighbors of are visible except for , not containing . Also, there exists such that is a bottleneck with views in .*
In the above condition, denotes the set of visible nodes. One can easily observe that if any latent node satisfies the above condition, then it is recoverable by conditioning on the neighbors of and applying with views and some other view.
Now consider the following procedure. First, duplicate for each into , where is visible/latent if is visible/latent. Let be such a duplicated vertex set, be a set of visible nodes, and be a set of latent nodes. The procedure starts with a graph on without edges.
1. Choose latent nodes . For each if deg, choose a single neighbor of with probability
[TABLE]
2. Similarly, for each neighbor of , for all satisfying deg, add neighbors of as in step 1.
3. Check whether there exists an edge or a pair of edges , . If such an edge or pair of edges exists, then the procedure restarts from the beginning.
4. Let be the graph obtained by contracting into for all . Check whether satisfies Condition 1 with and is a bottleneck by conditioning on the neighbors of .
5. If satisfies the condition in step 4, then the procedure succeeds. If not, repeat the procedure for the next latent node until every latent node decides its neighbor.
6. If every latent node has decided its neighbor, the procedure fails.
The above procedure constructs the fractional edges of a random -regular graph by contracting into . Step 3 checks whether the procedure creates a loop or multiple edges. One can notice that if any node satisfies Condition 1 in step 4, then there exists a recoverable latent node. Our primary goal is to bound the probability that the procedure fails, i.e., no latent node satisfies Condition 1 under the fractional graph.
One can observe that if some visible node is chosen to be a neighbor of a latent node in the procedure but it is already a neighbor of another latent node, then it cannot help to satisfy Condition 1. Also, at each iteration, choosing a neighbor has the effect of removing at most nodes from the whole node set, as at most edges are created. Now, suppose there exist latent nodes where . Using this fact, one can observe that the probability that a visible node connected to a latent node has visible neighbors is at least . We also note that the probability that the procedure starts over in step 3 is at each iteration. Therefore, one can conclude that
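The construction described above resembles the standard pairing (configuration) model for random regular graphs: each vertex is duplicated into one copy per unit of degree, copies are matched at random, and the whole attempt is restarted if a loop or a multiple edge appears after contraction. The following is a minimal sketch of that standard model only. It is a simplification under stated assumptions: it pairs all copies at once instead of revealing neighbors one latent node at a time, it does not check Condition 1, and the names `random_regular_graph` and `max_restarts` are hypothetical.

```python
import random
from typing import Set, Tuple


def random_regular_graph(
    n: int, d: int, rng: random.Random, max_restarts: int = 10_000
) -> Set[Tuple[int, int]]:
    """Sample a simple d-regular graph on n vertices via the pairing model."""
    if (n * d) % 2 != 0 or d >= n:
        raise ValueError("need n * d even and d < n for a simple d-regular graph")
    for _ in range(max_restarts):
        # Duplicate every vertex into d copies ("stubs").
        stubs = [v for v in range(n) for _ in range(d)]
        # Shuffling and pairing consecutive stubs gives a uniform matching.
        rng.shuffle(stubs)
        edges: Set[Tuple[int, int]] = set()
        simple = True
        for i in range(0, len(stubs), 2):
            u, v = stubs[i], stubs[i + 1]
            edge = (min(u, v), max(u, v))
            # Contracting the copies back must not create a loop or a
            # repeated edge; otherwise restart from the beginning.
            if u == v or edge in edges:
                simple = False
                break
            edges.add(edge)
        if simple:
            return edges
    raise RuntimeError("no simple graph found within max_restarts attempts")
```

For example, `random_regular_graph(10, 3, random.Random(0))` returns a set of 15 edges in which every vertex has degree 3. Restarting on failure, rather than repairing the bad pair, keeps the accepted graph uniformly distributed over simple regular graphs.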
[TABLE]
for sufficiently small (up to constant), where in the bracket represents the probability of the non-existence of in Condition 1, and the degree varies as the procedure iterates. Also, is an indicator function having a value if an event occurs, and [math] if not. The second-to-last inequality follows from the fact that we can first choose at least latent nodes of degree [math], and then choose at least latent nodes of degree less than or equal to . in the last inequality is
[TABLE]
for all . One might be concerned that, after the procedure succeeds, the extension of the procedure to all vertices may start over with high probability, so that the probability becomes significantly larger than (9). However, we note that the restarting probability when extending the procedure to all vertices is a.a.s., i.e., constant (see [33]), and therefore
[TABLE]
for in the above equation. Now, we consider all and all choices of sets of latent nodes to apply the union bound as below. The explicit choice of will be presented later.
[TABLE]
where the first inequality follows from Stirling’s formula, and we choose to satisfy so as to obtain the last equality. Such always exists as
[TABLE]
for a sufficiently small .
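The union bound above counts choices of sets of latent nodes and invokes Stirling's formula. The exact form the authors apply is not visible in this text, so the following is only an assumption about the kind of estimate involved: one standard consequence of Stirling's formula is the bound C(n, k) <= (e n / k)^k on binomial coefficients. The sketch below checks that bound numerically; `binomial_upper_bound` and `bound_holds` are hypothetical helper names.

```python
import math


def binomial_upper_bound(n: int, k: int) -> float:
    """Standard Stirling-type bound (e * n / k) ** k on C(n, k), for k >= 1."""
    return (math.e * n / k) ** k


def bound_holds(n_max: int) -> bool:
    """Check C(n, k) <= (e * n / k) ** k for all 1 <= k <= n <= n_max."""
    return all(
        math.comb(n, k) <= binomial_upper_bound(n, k)
        for n in range(1, n_max + 1)
        for k in range(1, n + 1)
    )
```

A bound of this shape is what typically lets a sum over all subsets of a given size be dominated by a single exponential term, which is the role a union bound needs it to play.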
Now, we know that at each iteration of the sequential learning framework, there exists at least one bottleneck latent node which can be recovered without a labeling issue (forcing labels). Furthermore, using and conditioning, one can also treat recovered latent nodes as visible nodes, while the marginals including latent nodes always contain the conditioned variables, i.e., the order of marginals reduces in some sense, as recovered marginals have a fixed order while a part of the order is the constant number (at most ) of conditioned variables. Using this fact, one can conclude that the sequential learning framework recovers every pairwise marginal in iterations, where follows from the fact that the upper bound on the number of calls of for recovering a single latent node is and at most two bottleneck calls are required. This completes the proof of Theorem 9.
References
- [1] Animashree Anandkumar, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
- [2] Animashree Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
- [3] Animashree Anandkumar, Daniel J. Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, 2012.
- [4] Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002.
- [5] Borja Balle and Mehryar Mohri. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems, pages 2159–2167, 2012.
- [6] Arun T. Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
- [7] Arun T. Chaganty and Percy Liang. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning, pages 1872–1880, 2014.
- [8] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.
