Sequential Local Learning for Latent Graphical Models

Sejun Park; Eunho Yang; Jinwoo Shin

arXiv:1703.04082·cs.LG·March 17, 2017


TL;DR

This paper introduces a sequential local learning framework for latent graphical models, expanding the class of models that can be effectively learned by leveraging marginalization and conditioning techniques.

Contribution

It proposes a novel sequential learning approach that enlarges the class of latent GMs learnable by the method of moments, including complex loopy models.

Findings

01 Enlarged the class of learnable latent GMs

02 Successfully applied to convolutional and random regular models

03 Achieved broader applicability over existing methods


Equations

$$\mathbb{P}(x)=\mathbb{P}_{\beta,\gamma}(x)=\frac{1}{Z}\exp\Big(\sum_{(i,j)\in E}\beta_{ij}x_{i}x_{j}+\sum_{i\in V}\gamma_{i}x_{i}\Big),$$

$$\text{maximize}_{\beta,\gamma}\quad\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x^{(n)}),$$

$$\frac{\partial}{\partial\gamma_{i}}\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x^{(n)})=\frac{1}{N}\sum_{n=1}^{N}x_{i}^{(n)}-\mathbb{E}_{\beta,\gamma}[x_{i}]$$

$$\frac{\partial}{\partial\beta_{ij}}\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x^{(n)})=\frac{1}{N}\sum_{n=1}^{N}x_{i}^{(n)}x_{j}^{(n)}-\mathbb{E}_{\beta,\gamma}[x_{i}x_{j}].$$

$$\text{maximize}_{\beta,\gamma}\quad\sum_{x\in\{0,1\}^{V}}\mathbb{P}_{\beta^{*},\gamma^{*}}(x)\log\mathbb{P}_{\beta,\gamma}(x),$$

$$\text{maximize}_{\beta,\gamma}\quad\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x_{O}^{(n)}),$$

$$\big\{(i,j)\in E\,:\,i,j\in S\big\}\cup\big\{(j,k)\,:\,S_{i}=\{j,k\}\text{ for }i\in V\setminus S\big\}.$$

$$\mathbb{P}_{\beta,\gamma}(x_{S})=\mathbb{P}_{\beta^{\prime},\gamma^{\prime}}(x_{S}).$$

$$\mathbb{P}_{\beta^{\dagger},\gamma^{\dagger}}(x_{S})=\mathbb{P}_{\beta^{\prime},\gamma^{\prime}}(x_{S}),$$

$$\log\frac{\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=1,x_{C}=s)}{\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=0,x_{C}=s)},$$

$$\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=1,x_{C})>\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=0,x_{C}).$$

$$\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=1,x_{C})>\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=0,x_{C}),$$

$$\mathbb{P}_{\beta,\gamma}\ldots$$

$$\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell,m,n\}}\right)=\mathbb{P}_{\beta,\gamma}\left(x_{\{i,\ell,m,n\}}\mid x_{\{j,k\}}\right)\mathbb{P}_{\beta,\gamma}\left(x_{\{j,k\}}\right).$$

$$\mathcal{T}_{G}=\big\{\{i,j,k,\ell\},\{i,i^{\prime}\},\{j,j^{\prime}\},\{k,k^{\prime}\},\{\ell,\ell^{\prime}\}\big\}$$

$$\mathcal{L}=\big\{\ldots\ \text{with reference $j$ and preference $p$}\big\}$$

$$A_{0}=\{\ldots$$

$$B_{0}=\{\ldots$$

$$\ldots\ \text{satisfy C4 where }T=\cup_{S\in\mathcal{T}_{G}}S\},$$

$$\sigma_{t+1}=\sigma_{t}\cup A_{t}\cup B_{t}.$$

$$\{\mathbb{P}_{\beta,\gamma}(x_{i},x_{j}):(i,j)\in E\}.$$

$$\begin{bmatrix}\mathbb{P}_{\beta,\gamma}(x_{j}=0,x_{i}=0)&\mathbb{P}_{\beta,\gamma}(x_{j}=1,x_{i}=0)\\ \mathbb{P}_{\beta,\gamma}(x_{j}=0,x_{i}=1)&\mathbb{P}_{\beta,\gamma}(x_{j}=1,x_{i}=1)\end{bmatrix}^{-1}.$$

$$\begin{aligned}\sum_{i\in V\setminus S}\mathbb{P}_{\beta,\gamma}(x)&=\sum_{i\in V\setminus S:i\notin[\ell]}\ \sum_{i\in V\setminus S:i\in[\ell]}\frac{1}{Z}\exp\Big(\sum_{(i,j)\in E}\beta_{ij}x_{i}x_{j}+\sum_{i\in V}\gamma_{i}x_{i}\Big)\\ &=\sum_{i\in V\setminus S:i\notin[\ell]}\frac{1}{Z}\exp\Big(\sum_{(i,j)\in E:i,j\notin[\ell]}\beta_{ij}x_{i}x_{j}+\sum_{i\in V\setminus[\ell]}\gamma_{i}x_{i}\Big)\\ &\qquad\times\sum_{i\in V\setminus S:i\in[\ell]}\exp\Big(\sum_{(i,j)\in E:i\in[\ell]}\beta_{ij}x_{i}x_{j}+\sum_{i\in[\ell]}\gamma_{i}x_{i}\Big)\\ &=\sum_{i\in V\setminus S:i\notin[\ell]}\frac{1}{Z}\exp\Big(\sum_{(i,j)\in E:i,j\notin[\ell]}\beta_{ij}x_{i}x_{j}+\sum_{i\in V\setminus[\ell]}\gamma_{i}x_{i}\Big)f_{[\ell]}(x_{S_{\ell}})\end{aligned}$$

$$\sum_{i\in V\setminus S}\mathbb{P}_{\beta,\gamma}(x)=\sum_{i\in V\setminus\{S\cup[\ell]\}}\frac{1}{Z^{\dagger}}\exp\Big(\sum_{(i,j)\in E^{\dagger}}\beta_{ij}^{\dagger}x_{i}x_{j}+\sum_{i\in V\setminus[\ell]}\gamma_{i}^{\dagger}x_{i}\Big)$$

$$\sigma_{t}^{\prime}=\{S\subset S^{\prime}:S^{\prime}\in\sigma_{t},|S|\leq K+L\}.$$

$$\mathbb{P}_{\beta,\gamma}\big(\ldots$$

$$\mathbb{P}_{\beta,\gamma}\big(\ldots$$

$$\mathbb{P}(j_{m}\text{ is chosen})=\frac{1-\text{deg}(j_{m})}{\sum_{k_{o}\in V^{\prime}}(1-\text{deg}(k_{o}))}.$$

$$\mathbb{P}(\text{the procedure fails})\leq\prod_{i\in H}\Bigg[O(1)\sum_{n=0}^{d-\text{deg}(i)}\left(\frac{\alpha N}{(1-2d\alpha)N}\right)^{d-\text{deg}(i)-n}\left((1-p)^{n}+np(1-p)^{n-1}+O\left(\frac{1}{N}\right)\right)^{\mathbf{1}_{n\geq 2}}\Bigg]^{\mathbf{1}_{d-\text{deg}(i)\geq 2}}$$


Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Natural Language Processing Techniques

Full text

Sequential Local Learning for Latent Graphical Models

Sejun Park, Eunho Yang, Jinwoo Shin∗

S. Park and J. Shin are with the Department of Electrical Engineering, Korea Advanced Institute of Science & Technology, Republic of Korea. E. Yang is with the Department of Computer Science, Korea Advanced Institute of Science & Technology, Republic of Korea.

Abstract

Learning the parameters of latent graphical models (GMs) is inherently much harder than learning those of models without latent variables, since the latent variables make the corresponding log-likelihood non-concave. Nevertheless, expectation-maximization schemes are popular in practice, but they typically get stuck in local optima. In recent years, the method of moments has provided a refreshing angle for resolving the non-convexity issue, but it is applicable to a quite limited class of latent GMs. In this paper, we aim to enhance its power by enlarging this class of latent GMs. To this end, we introduce two novel concepts, coined marginalization and conditioning, which can reduce the problem of learning a larger GM to that of a smaller one. More importantly, they lead to a sequential learning framework that repeatedly enlarges the learned portion of a given latent GM, and thus covers a significantly broader and more complicated class of loopy latent GMs, which includes convolutional and random regular models.

1 Introduction

Graphical models (GMs) are a succinct representation of a joint distribution on a graph, where each node corresponds to a random variable and the edges encode the conditional independence structure among the random variables. GMs have been successfully applied in various fields including information theory [12, 19], physics [24] and machine learning [18, 11]. Introducing latent variables into GMs has been a popular approach for enhancing their representation power in recent deep models, e.g., convolutional/restricted/deep Boltzmann machines [20, 27]. Furthermore, latent variables are unavoidable in certain scenarios when part of the samples is missing, e.g., see [10].

However, learning the parameters of latent GMs is significantly harder than learning those of GMs without latent variables, since the latent variables make the corresponding negative log-likelihood non-convex. The main challenge comes from the difficulty of inferring the unobserved marginal probabilities associated with latent/hidden variables. Nevertheless, expectation-maximization (EM) schemes [9] have been popular in practice with empirical successes, e.g., contrastive divergence learning for deep models [14]. They iteratively infer unobserved marginals given the current estimate of the parameters, and typically get stuck at local optima of the log-likelihood function [26].

To address this issue, the spectral methods have provided a refreshing angle on learning probabilistic latent models [2]. These theoretical methods exploit the linear algebraic properties of a model to factorize observed (low-order) moments/marginals into unobserved ones. Furthermore, the factorization methods can be combined with convex log-likelihood optimizations under certain structures, coined exclusive views, of latent GMs [7]. Both factorization methods and exclusive views can be understood as ‘local algorithms’ handling certain partial structures of latent GMs. However, up to now, they are known to be applicable to a quite limited class of latent GMs, and not as broadly applicable as EM, which is the main motivation of this paper.

Contribution. Our major question is “Can we learn latent GMs of more complicated structures beyond naive applications of local algorithms, e.g., known factorization methods or exclusive views?”. To address this, we introduce two novel concepts, called marginalization and conditioning, which reduce the problem of learning a larger GM to that of a smaller one. Hence, if the smaller one can be processed by known local algorithms, then so can the larger one. Our marginalization concept suggests searching for a ‘marginalizable’ subset of variables of a GM so that their marginal distribution is invariant with respect to the other variables under certain graphical transformations. It allows one to focus on learning the smaller transformed GM instead of the original larger one. On the other hand, our conditioning concept removes some dependencies among variables of a GM simply by conditioning on some subset of variables. Hence, it enables us to discover marginalizable structures that were not available before conditioning. At first glance, conditioning looks very powerful, as conditioning on more variables would discover more of the desired marginalizable structures. However, as more variables are conditioned on, the algorithmic complexity grows exponentially. Therefore, we set an upper bound on the number of conditioned variables.

Marginalization and conditioning naturally motivate a sequential scheme that repeatedly recovers larger portions of unobserved marginals given previously recovered/observed ones, i.e., recursively recovering unobserved marginals using any ‘black-box’ local algorithms. Developing new local algorithms, other than known factorization methods and exclusive views, is not the main scope of this paper. Nevertheless, we provide two new such algorithms, coined disjoint views and linear views, which play a similar role to exclusive views, i.e., they can also be combined with known factorization methods. Given these local algorithms, the proposed sequential learning scheme can learn a significantly broader and more complicated class of latent GMs than previously known ones, including convolutional restricted Boltzmann machines and GMs on random regular graphs, as described in Section 5. Consequently, our results imply that there exists a one-to-one correspondence between observed distributions and parameters for this class of latent GMs. Furthermore, for arbitrary latent GMs, the scheme can be used as a pre-processing stage to boost the performance of EM: first run it to recover as many unobserved marginals as possible, and then run EM using the additional information. We believe that our approach provides a new angle on the important problem of learning latent GMs.

Related works. Parameter estimation of latent GMs has a long history, dating back to [9]. While it can be applied to most latent GMs, the EM algorithm suffers not only from local optima but also from a risk of slow convergence. A natural alternative to the general EM method is to constrain the structure of graphical models. In independent component analysis (ICA) and its extensions [17, 4], latent variables are assumed to be independent, which induces a simple product form of the latent distribution. Recently, spectral methods have been successfully applied to various classes of GMs including latent trees [21, 31], ICA [8, 25], Gaussian mixture models [15], hidden Markov models [28, 30, 16, 3, 34], latent Dirichlet allocation [1] and others [13, 6, 35, 29]. In particular, [2] proposed a tensor-type algorithm under certain graph structures.

Another important line of work using the method of moments for latent GMs concerns recovering joint or conditional probabilities only among observable variables (see [5] and its references). [23, 22] proposed spectral algorithms to recover the joint distribution among observable variables when the graph structure is a bottlenecked tree. [7] relaxed the constraint of tree structure and proposed a technique to combine the method of moments with likelihood for certain structures. Our generic sequential learning framework allows the use of all these approaches as key components, in order to broaden their applicability. We note that we primarily focus on undirected pairwise binary GMs in this paper, but our results can be naturally extended to other GMs.

2 Preliminaries

2.1 Graphical Model and Parameter Learning

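Given an undirected graph $G=(V,E)$, we consider the following pairwise binary graphical model (GM), where the joint probability distribution on $x=[x_{i}\in\{0,1\}:i\in V]$ is defined as:

$$\mathbb{P}(x)=\mathbb{P}_{\beta,\gamma}(x)=\frac{1}{Z}\exp\Big(\sum_{(i,j)\in E}\beta_{ij}x_{i}x_{j}+\sum_{i\in V}\gamma_{i}x_{i}\Big),\tag{1}$$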

for some parameter $\beta=[\beta_{ij}:(i,j)\in E]\in\mathbb{R}^{E}$ and $\gamma=[\gamma_{i}:i\in V]\in\mathbb{R}^{V}$ . The normalization constant $Z$ is called the partition function.
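As a concrete illustration of the model (1), the following minimal sketch computes the joint distribution by brute-force enumeration. The function name `joint_distribution`, the 3-node chain and all parameter values are assumptions made for illustration only; enumeration is feasible only for very small $V$.

```python
import itertools
import math

def joint_distribution(nodes, beta, gamma):
    """Return {assignment tuple: probability} for the pairwise binary GM (1).

    nodes: list of node names; beta: {(i, j): weight}; gamma: {i: weight}.
    """
    index = {v: n for n, v in enumerate(nodes)}
    weights = {}
    for x in itertools.product((0, 1), repeat=len(nodes)):
        energy = sum(b * x[index[i]] * x[index[j]] for (i, j), b in beta.items())
        energy += sum(g * x[index[i]] for i, g in gamma.items())
        weights[x] = math.exp(energy)
    Z = sum(weights.values())  # partition function
    return {x: w / Z for x, w in weights.items()}

# Hypothetical 3-node chain a - b - c with made-up parameters.
nodes = ["a", "b", "c"]
beta = {("a", "b"): 1.0, ("b", "c"): -0.5}
gamma = {"a": 0.2, "b": -0.1, "c": 0.3}
P = joint_distribution(nodes, beta, gamma)
```

Here `P` maps each of the $2^{|V|}$ binary assignments to its probability, so ratios of entries depend only on the exponent in (1) and not on $Z$.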

Given samples $x^{(1)},x^{(2)},\cdots,x^{(N)}\in\{0,1\}^{V}$ drawn from the distribution (1) with some true (fixed but unknown) parameter $\beta^{*},\gamma^{*}$, our problem of interest is to recover this parameter. The popular method for this parameter learning task is the following maximum likelihood estimation (MLE):

$$\text{maximize}_{\beta,\gamma}\quad\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x^{(n)}),\tag{2}$$

where it is well known [32] that the log-likelihood $\log\mathbb{P}_{\beta,\gamma}\left(\cdot\right)$ is concave with respect to $\beta,\gamma$ , and the gradient of the log-likelihood is

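$$\frac{\partial}{\partial\gamma_{i}}\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x^{(n)})=\frac{1}{N}\sum_{n=1}^{N}x_{i}^{(n)}-\mathbb{E}_{\beta,\gamma}[x_{i}]\tag{3}$$

$$\frac{\partial}{\partial\beta_{ij}}\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x^{(n)})=\frac{1}{N}\sum_{n=1}^{N}x_{i}^{(n)}x_{j}^{(n)}-\mathbb{E}_{\beta,\gamma}[x_{i}x_{j}].\tag{4}$$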

Here, the last term, expectation of corresponding sufficient statistics, comes from the partial derivative of the log-partition function. Furthermore, it is well known that there exists a one-to-one correspondence between parameter $\beta,\gamma$ and sufficient statistics $\mathbb{E}_{\beta,\gamma}[x_{i}x_{j}],\mathbb{E}_{\beta,\gamma}[x_{i}]$ (see [32] for details).

One can further observe that if the number of samples is sufficiently large, i.e., $N\to\infty$ , then (2) is equivalent to

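$$\text{maximize}_{\beta,\gamma}\quad\sum_{x\in\{0,1\}^{V}}\mathbb{P}_{\beta^{*},\gamma^{*}}(x)\log\mathbb{P}_{\beta,\gamma}(x),$$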

where the true parameter $\beta^{*},\gamma^{*}$ achieves the (unique) optimal solution. This directly implies that, once empirical nodewise and pairwise marginals in (3) and (4) approach the true marginals, the gradient method can recover $\beta^{*},\gamma^{*}$ modulo the difficulty of exactly computing the expectations of sufficient statistics.
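The moment-matching view of the gradient can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a hypothetical 3-node chain, uses exact moments of the true model in place of empirical ones (the $N\to\infty$ limit), computes model expectations by enumeration, and runs plain gradient ascent with an arbitrarily chosen fixed step size.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2)]          # hypothetical chain 0 - 1 - 2
NUM_NODES = 3

def moments(beta, gamma):
    """Exact E[x_i] and E[x_i x_j] under P_{beta,gamma}, by enumeration."""
    states = list(itertools.product((0, 1), repeat=NUM_NODES))
    w = [math.exp(sum(b * x[i] * x[j] for (i, j), b in zip(EDGES, beta))
                  + sum(g * xi for g, xi in zip(gamma, x))) for x in states]
    Z = sum(w)
    node = [sum(wx * x[i] for wx, x in zip(w, states)) / Z
            for i in range(NUM_NODES)]
    pair = [sum(wx * x[i] * x[j] for wx, x in zip(w, states)) / Z
            for i, j in EDGES]
    return node, pair

def gradient(beta, gamma, target_node, target_pair):
    """Gradient of the log-likelihood: target moments minus model moments."""
    node, pair = moments(beta, gamma)
    g_gamma = [t - m for t, m in zip(target_node, node)]
    g_beta = [t - m for t, m in zip(target_pair, pair)]
    return g_beta, g_gamma

# Made-up "true" parameters; their exact moments stand in for sample averages.
true_beta, true_gamma = [1.0, -0.5], [0.2, -0.1, 0.3]
target_node, target_pair = moments(true_beta, true_gamma)

beta, gamma = [0.0, 0.0], [0.0, 0.0, 0.0]
for _ in range(10000):             # plain gradient ascent, fixed step size 0.5
    g_beta, g_gamma = gradient(beta, gamma, target_node, target_pair)
    beta = [b + 0.5 * g for b, g in zip(beta, g_beta)]
    gamma = [c + 0.5 * g for c, g in zip(gamma, g_gamma)]
```

The gradient vanishes exactly when the model moments match the target moments, which is why recovering nodewise and pairwise marginals suffices for recovering the parameters in this setting.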

Now let us consider a more challenging task: parameter learning under latent variables. Given a subset $H$ of $V$ and $O=V\setminus H$, we assume that for every sample $x=(x_{O},x_{H})$, the variables $x_{O}=\left[x_{i}\in\{0,1\}:i\in O\right]$ are observed/visible and the other variables $x_{H}=\left[x_{i}\in\{0,1\}:i\in H\right]$ are hidden/latent. In this case, MLE only involves observed variables:

$$\text{maximize}_{\beta,\gamma}\quad\frac{1}{N}\sum_{n=1}^{N}\log\mathbb{P}_{\beta,\gamma}(x_{O}^{(n)}),\tag{5}$$

where $\mathbb{P}_{\beta,\gamma}(x_{O})=\sum_{x_{H}\in\{0,1\}^{H}}\mathbb{P}_{\beta,\gamma}(x_{O},x_{H})$ . Similarly as before, the true parameter $\beta^{*},\gamma^{*}$ achieves the optimal solution of (5) if the number of samples is large enough. However, the log-likelihood under latent variables is no longer concave, which makes the parameter learning task harder. One can apply an expectation-maximization (EM) scheme, but it is typically stuck in local optima.
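To make the latent objective concrete, the sketch below marginalizes the hidden variables by enumeration and evaluates the observed log-likelihood. The star graph, the parameter values and the function names are hypothetical. The example also shows a simple fact that follows from relabeling a hidden bit ($x_{0}\mapsto 1-x_{0}$): two different parameters can induce exactly the same observed distribution, so the observed likelihood cannot single out one of them.

```python
import itertools
import math

def observed_marginal(num_nodes, edges, beta, gamma, observed):
    """P(x_O): sum the joint distribution over all hidden assignments x_H."""
    marginal = {}
    Z = 0.0
    for x in itertools.product((0, 1), repeat=num_nodes):
        w = math.exp(sum(b * x[i] * x[j] for (i, j), b in zip(edges, beta))
                     + sum(g * xi for g, xi in zip(gamma, x)))
        Z += w
        key = tuple(x[i] for i in observed)
        marginal[key] = marginal.get(key, 0.0) + w
    return {k: v / Z for k, v in marginal.items()}

def latent_log_likelihood(samples, num_nodes, edges, beta, gamma, observed):
    """Average log P(x_O^{(n)}) over the observed parts of the samples."""
    p = observed_marginal(num_nodes, edges, beta, gamma, observed)
    return sum(math.log(p[tuple(s)]) for s in samples) / len(samples)

# Hypothetical star: hidden center 0, visible leaves 1, 2, 3.
edges = [(0, 1), (0, 2), (0, 3)]
observed = [1, 2, 3]
beta, gamma = [1.0, 0.7, -0.4], [0.3, 0.1, -0.2, 0.5]

# Relabeling x_0 -> 1 - x_0 negates beta_{0j} and gamma_0 and shifts each
# leaf bias gamma_j by beta_{0j}; the observed distribution is unchanged.
beta_flip = [-b for b in beta]
gamma_flip = [-gamma[0]] + [g + b for g, b in zip(gamma[1:], beta)]

p = observed_marginal(4, edges, beta, gamma, observed)
q = observed_marginal(4, edges, beta_flip, gamma_flip, observed)
```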

2.2 Tensor Decomposition

The fundamental issue in parameter learning of latent GMs is that it is hard to infer the pairwise marginals for latent variables directly from samples. If one could infer them, it would also be possible to recover $\beta^{*},\gamma^{*}$, as discussed in the previous section. Somewhat surprisingly, however, under certain conditions on the latent GM, pairwise marginals including latent variables can be recovered using low-order visible marginals. Before introducing such conditions, we first make the following assumption for any GM on a graph $G=(V,E)$ considered throughout this paper.

Assumption 1 (Faithful).

*For any two nodes $i,j\in V$, if $i,j$ are connected, then $x_{i},x_{j}$ are dependent.*

This faithfulness assumption implies that GM only has conditional independences given by the graph $G$ . We also introduce the following notion [2].

Definition 1 (Bottleneck).

*A node $i\in V$ is a bottleneck if there exist $j,k,\ell\in V$, denoted as ‘views’, such that every path between two of $j,k,\ell$ contains $i$.*
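Definition 1 is purely graph-theoretic, so it can be checked by a connectivity search. A minimal sketch, assuming the graph is given as an adjacency dictionary and that the views are distinct from $i$ (the names `reachable` and `is_bottleneck` are ours):

```python
from collections import deque

def reachable(adj, start, removed):
    """Nodes reachable from `start` in the graph with `removed` deleted."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen and v not in removed:
                seen.add(v)
                queue.append(v)
    return seen

def is_bottleneck(adj, i, views):
    """True if every path between two distinct views passes through i."""
    views = list(views)
    for a in range(len(views)):
        comp = reachable(adj, views[a], removed={i})
        if any(views[b] in comp for b in range(a + 1, len(views))):
            return False
    return True

# Hypothetical star graph: i is a bottleneck with views j, k, l.
star = {"i": ["j", "k", "l"], "j": ["i"], "k": ["i"], "l": ["i"]}
# Adding the edge j - k creates a path between two views that avoids i.
loopy = {"i": ["j", "k", "l"], "j": ["i", "k"], "k": ["i", "j"], "l": ["i"]}
```

With these inputs, `is_bottleneck(star, "i", ["j", "k", "l"])` holds while `is_bottleneck(loopy, "i", ["j", "k", "l"])` does not.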

Figure 1(a) illustrates the bottleneck. By construction, views are conditionally independent given the bottleneck. Armed with this notion, now we introduce the following theorem to provide sufficient conditions for recovering unobserved/latent marginals [2].

Theorem 1.

*Given GM with a parameter $\beta,\gamma$, suppose $i$ is a bottleneck with views $j,k,\ell$. If $\mathbb{P}_{\beta,\gamma}\left(x_{\{j,k,\ell\}}\right)$ is given, then there exists an algorithm $\mathtt{TensorDecomp}$ which outputs $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell\}}\right)$ up to relabeling of $x_{i}$, i.e., ignoring the symmetry between $x_{i}=0$ and $x_{i}=1$.*

The above theorem implies that using visible marginals $\mathbb{P}_{\beta,\gamma}\left(x_{\{j,k,\ell\}}\right)$, one can recover unobserved marginals $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell\}}\right)$ involving $x_{i}$. For a bottleneck with more than three views, the joint distribution of the bottleneck and its views is recoverable using Theorem 1 by choosing three views at a time.

Besides $\mathtt{TensorDecomp}$, there are other conditions on latent GMs under which marginals including latent variables are recoverable. Before elaborating on these conditions, we further introduce the following notion for a GM on a graph $G=(V,E)$ [7].

Definition 2 (Exclusive View).

*For a set of nodes $S\subset V$, we say it satisfies the exclusive view property if for each $i\in S$, there exists $j\in V\setminus S$, denoted as ‘exclusive view’, such that every path between $j$ and $S\setminus\{i\}$ contains $i$.*
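The exclusive view property can likewise be checked by graph search. The sketch below is one possible reading of Definition 2, with hypothetical names and a hypothetical example graph. As an extra assumption not stated in the definition, a candidate view $j$ is also required to be connected to $i$, since a node disconnected from all of $S$ would satisfy the path condition only vacuously.

```python
from collections import deque

def reachable(adj, start, removed=frozenset()):
    """Nodes reachable from `start` in the graph with `removed` deleted."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen and v not in removed:
                seen.add(v)
                queue.append(v)
    return seen

def exclusive_views(adj, S):
    """Map each i in S to one exclusive view j outside S (None if absent)."""
    S = set(S)
    result = {}
    for i in S:
        result[i] = None
        for j in adj:
            if j in S or i not in reachable(adj, j):  # assumption: j sees i
                continue
            # Every path from j to S \ {i} must pass through i.
            if not (reachable(adj, j, removed={i}) & (S - {i})):
                result[i] = j
                break
    return result

# Hypothetical graph: nodes a, b joined by an edge, each with a private leaf.
adj = {"a": ["b", "a1"], "b": ["a", "b1"], "a1": ["a"], "b1": ["b"]}
views = exclusive_views(adj, ["a", "b"])
```

The set $S$ satisfies the property exactly when no value in the returned dictionary is `None`.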

Figure 1(b) illustrates the exclusive view property. Now, we are ready to state the conditions for recovering unobserved marginals using the property [7].

Theorem 2.

*Given GM with a parameter $\beta,\gamma$, suppose a set of nodes $S$ satisfies the exclusive view property with a set of exclusive views $E$. If $\mathbb{P}_{\beta,\gamma}(x_{E})$ and $\mathbb{P}_{\beta,\gamma}\left(x_{i},x_{j}\right)$ are given for all $i\in S$ and an exclusive view $j\in E$ of $i$, then there exists an algorithm $\mathtt{ExclusiveView}$ which outputs $\mathbb{P}_{\beta,\gamma}(x_{S\cup E})$.*

At first glance, Theorem 2 does not seem useful, as it requires a set of marginals including every variable corresponding to $S\cup E$. However, suppose a set of latent nodes $S$ satisfies the property while its set of exclusive views $E$ is visible, i.e., $\mathbb{P}_{\beta,\gamma}(x_{E})$ is observed. If every $i\in S$ is a bottleneck with views containing its exclusive view $j\in E$, then one can resort to $\mathtt{TensorDecomp}$ to obtain $\mathbb{P}_{\beta,\gamma}(x_{i},x_{j})$.

3 Marginalizing and Conditioning

In Section 2.2, we introduced sufficient conditions for recovering unobservable marginals. Specifically, Theorems 1 and 2 state that for certain structures of latent GMs, it is possible to recover latent marginals simply from low-order visible marginals, and in turn the parameters of latent GMs via the convex MLE in (2).

Now, a natural question arises: “Can we even recover unobserved marginals for latent GMs with more complicated structures beyond naive applications of the bottlenecks or exclusive views?” To address this, in this section we enlarge the class of such latent GMs by proposing generic concepts, marginalization and conditioning.

3.1 Key Ideas

We start by defining two concepts, marginalization and conditioning, formally. The former is a combinatorial concept defined as follows.

Definition 3 (Marginalization).

Given graph $G=(V,E)$, we say $S\subset V$ is marginalizable if for all $i\in V\setminus S$, there exists a (minimal) set $S_{i}\subset S$ with $|S_{i}|\leq 2$ such that $i$ and $S\setminus S_{i}$ are disconnected in $G\setminus S_{i}$ (here $G\setminus S_{i}$ is the subgraph of $G=(V,E)$ induced by $V\setminus S_{i}$). For a marginalizable set $S$ in $G=(V,E)$, the marginalization of $S$, denoted by $\mathtt{Marg}(S,G)$, is the graph on $S$ with edges

$$\big\{(i,j)\in E\,:\,i,j\in S\big\}\cup\big\{(j,k)\,:\,S_{i}=\{j,k\}\text{ for }i\in V\setminus S\big\}.$$

In Figure 2, for example, node $i$ is disconnected from $\{k,o\}$ when removing $S_{i}=\{j,n\}$. Hence, the edge between $j$ and $n$ is additionally included in the marginalization of $S$.
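The construction of $\mathtt{Marg}(S,G)$ can be sketched in code. This is a minimal sketch under our reading of Definition 3: the minimal $S_{i}$ is taken to be the set of nodes of $S$ that $i$ can reach along paths whose interior avoids $S$, because any such node must be removed to separate $i$ from the rest of $S$. The function names and the small example graph (which is not the graph of Figure 2) are hypothetical, and the graph is assumed to have no self-loops.

```python
from collections import deque

def boundary(adj, S, i):
    """S_i: nodes of S reachable from i along paths whose interior avoids S."""
    seen, queue, hit = {i}, deque([i]), set()
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in S:
                hit.add(v)          # record, but do not walk through S
            elif v not in seen:
                seen.add(v)
                queue.append(v)
    return hit

def marg(adj, S):
    """Edge set of Marg(S, G), or None if S is not marginalizable."""
    S = set(S)
    edges = {frozenset((u, v)) for u in S for v in adj[u] if v in S}
    for i in adj:
        if i in S:
            continue
        S_i = boundary(adj, S, i)
        if len(S_i) > 2:
            return None             # some node needs |S_i| > 2
        if len(S_i) == 2:
            edges.add(frozenset(S_i))
    return edges

# Hypothetical graph: k - j - i - n, marginalizing out i (S = {j, n, k}).
path = {"i": ["j", "n"], "j": ["i", "k"], "n": ["i"], "k": ["j"]}
# Hypothetical star: h touches three nodes of S, so S is not marginalizable.
star = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h"]}
```

For `path`, the result keeps the original edge between `j` and `k` and adds the new edge between `j` and `n`; for `star` with `S = {a, b, c}` the function returns `None`.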

With the definition of marginalization, the following key proposition reveals that recovering unobserved marginals of a latent GM can actually be reduced to recovering those of a much smaller latent GM.

Proposition 3.

Consider a GM on $G=(V,E)$ with a parameter $\beta,\gamma$. If $S\subset V$ is marginalizable in $G$, then there exists a (unique) $\beta^{\prime},\gamma^{\prime}$ such that the GM on $\mathtt{Marg}(S,G)$ with parameter $\beta^{\prime},\gamma^{\prime}$ induces the same distribution on $x_{S}$, i.e.,

$$\mathbb{P}_{\beta,\gamma}(x_{S})=\mathbb{P}_{\beta^{\prime},\gamma^{\prime}}(x_{S}).\tag{6}$$

The proof of the above proposition is presented in Appendix A. Proposition 3 provides a way of representing the marginal probability on $S$ of a GM via the smaller GM on $\mathtt{Marg}(S,G)$. Suppose there exists any algorithm (e.g., via bottlenecks, although we do not restrict ourselves to this method) that can recover a joint distribution $\mathbb{P}_{\beta^{\dagger},\gamma^{\dagger}}(x_{S})$, or equivalently the sufficient statistics, of the latent GM on $\mathtt{Marg}(S,G)$ using only observed marginals in $S$. Then, it must hold that

$$\mathbb{P}_{\beta^{\dagger},\gamma^{\dagger}}(x_{S})=\mathbb{P}_{\beta^{\prime},\gamma^{\prime}}(x_{S}),$$

where $\beta^{\prime},\gamma^{\prime}$ is the unique parameter satisfying (6). Using Proposition 3 and marginalization, one can recover unobserved marginals of a large GM by considering smaller GMs corresponding to marginalizations of the large one. The role of marginalization will be further discussed and clarified in Section 4.
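For intuition on why a marginalizable set yields another pairwise GM, consider the simplest case of a single node $i$ whose only neighbors are $j$ and $k$. Summing out $x_{i}$ leaves a factor $f(x_{j},x_{k})$, and any positive function of two bits can be written as $\exp(a+bx_{j}+cx_{k}+dx_{j}x_{k})$. The sketch below checks this numerically; the parameter values are made up, and the closed-form expressions for $b,c,d$ are our own worked example rather than formulas quoted from the paper.

```python
import itertools
import math

# Hypothetical parameters for the path j - i - k, where i is marginalized out.
gamma = {"i": 0.4, "j": -0.3, "k": 0.2}
beta_ij, beta_ik = 1.2, -0.8

def f(xj, xk):
    """Sum over x_i of exp(gamma_i x_i + beta_ij x_i x_j + beta_ik x_i x_k)."""
    return 1.0 + math.exp(gamma["i"] + beta_ij * xj + beta_ik * xk)

# log f is a function of two bits, hence exactly a + b x_j + c x_k + d x_j x_k.
b = math.log(f(1, 0) / f(0, 0))
c = math.log(f(0, 1) / f(0, 0))
d = math.log(f(1, 1) * f(0, 0) / (f(1, 0) * f(0, 1)))
gamma_j_new, gamma_k_new, beta_jk_new = gamma["j"] + b, gamma["k"] + c, d

def normalize(weights):
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

# Marginal of the original 3-node GM on (x_j, x_k).
original = normalize({
    (xj, xk): sum(math.exp(gamma["i"] * xi + gamma["j"] * xj + gamma["k"] * xk
                           + beta_ij * xi * xj + beta_ik * xi * xk)
                  for xi in (0, 1))
    for xj, xk in itertools.product((0, 1), repeat=2)})

# Pairwise GM on Marg({j, k}, G), which has the single new edge (j, k).
reduced = normalize({
    (xj, xk): math.exp(gamma_j_new * xj + gamma_k_new * xk
                       + beta_jk_new * xj * xk)
    for xj, xk in itertools.product((0, 1), repeat=2)})
```

The two tables `original` and `reduced` agree, and the new coupling `beta_jk_new` plays the role of the edge that marginalization adds between $j$ and $k$.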

In addition to marginalizing, we introduce the second key ingredient, called conditioning, with which the class of recoverable latent GMs can be further expanded.

Proposition 4.

*For a graph $G=(V,E)$, for $C\subset V$ and $S\subset V\setminus C$, $\mathtt{Marg}(S,G\setminus C)$ is a subgraph of $\mathtt{Marg}(S,G)$.*

The proof of the above proposition is straightforward since $S_{i}$ (defined in Definition 3) for $S$ in $G$ contains that for $S$ in $G\setminus C$, i.e., the edge set of $\mathtt{Marg}(S,G)$ contains that of $\mathtt{Marg}(S,G\setminus C)$. Figure 3 illustrates how conditioning actually broadens the class of recoverable latent GMs, as suggested by Proposition 4. Once the node $\ell$ is conditioned out, the marginalization $\mathtt{Marg}(S,G\setminus\{\ell\})$ (Figure 3(c)) has a form that can be handled by $\mathtt{TensorDecomp}$.

3.2 Labeling Issues

In spite of its usefulness, there is a caveat in performing conditioning: consistent labeling of latent nodes. For example, consider the latent GM as in Figure 3. Conditioned on $x_{\ell}$ , $h$ is a bottleneck with views $i$ , $j$ , $k$ (Figure 3(c)). If $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell\}}\right)$ is given, one can recover the conditional distribution $\mathbb{P}_{\beta,\gamma}\left(x_{\{h,i,j,k\}}|x_{\ell}=s\right)$ up to labeling of $x_{h}$ , from Theorem 1 and conditioning. Here, the conditioning worsens the relabeling problem in the sense that we might choose different labels for $x_{h}$ for each conditioned value $x_{\ell}=0$ and $x_{\ell}=1$ . As a result, the recovered joint distribution computed as $\sum_{x_{\ell}\in\{0,1\}}\mathbb{P}_{\beta,\gamma}\left(x_{\{h,i,j,k\}}|x_{\ell}\right)\mathbb{P}_{\beta,\gamma}(x_{\ell})$ with mixed labeling of $x_{h}$ , would be different from the true joint. To handle this issue, we define the following concept for consistent labeling of latent variables.

Definition 4 (Label-Consistency).

Given GM on $G=(V,E)$ with a parameter $\beta,\gamma$, we say $i\in V$ is label-consistent for $C\subset V\setminus\{i\}$ if there exists $j\in V\setminus(C\cup\{i\})$, called ‘reference’, such that

$$\log\frac{\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=1,x_{C}=s)}{\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=0,x_{C}=s)},$$

called ‘preference’, is consistently positive or negative for all $s\in\{0,1\}^{C}$ (note that the preference cannot be zero due to Assumption 1).

In Figure 3 for example, $h$ is label-consistent for $\{\ell\}$ with reference $i$ since the corresponding preference is the function only on $\beta_{hi}$ , which is fixed as either $\beta_{hi}>0$ or $\beta_{hi}<0$ (note that the reference can be arbitrarily chosen due to the symmetry of structure). Using the label-consistency of $h$ , one can choose a consistent label of $x_{h}$ by choosing the label consistent to the preference of the reference node $i$ .
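Label-consistency can be checked numerically on a small model by evaluating the preference for every conditioned assignment $s$. In the sketch below, the 5-node graph, its parameters and the function names are hypothetical (this is not the graph of Figure 3): node 0 plays the role of the latent node, node 1 is a leaf attached only to node 0 and serves as the reference, and node 4 is conditioned on.

```python
import itertools
import math

def joint(num_nodes, beta, gamma):
    """Joint table {assignment: probability} of a pairwise binary GM."""
    states = list(itertools.product((0, 1), repeat=num_nodes))
    w = [math.exp(sum(b * x[i] * x[j] for (i, j), b in beta.items())
                  + sum(g * xi for g, xi in zip(gamma, x))) for x in states]
    Z = sum(w)
    return {x: wx / Z for x, wx in zip(states, w)}

def prob(P, fixed):
    """Probability that x[n] == v for every (n, v) in `fixed`."""
    return sum(p for x, p in P.items()
               if all(x[n] == v for n, v in fixed.items()))

def preferences(P, i, j, C):
    """Preference of reference j for node i, for every assignment s of x_C."""
    out = {}
    for s in itertools.product((0, 1), repeat=len(C)):
        cond = dict(zip(C, s))
        ratio = []
        for xi in (1, 0):
            given = {**cond, i: xi}
            ratio.append(prob(P, {**given, j: 1}) / prob(P, given))
        out[s] = math.log(ratio[0] / ratio[1])
    return out

def is_label_consistent(P, i, j, C):
    """True if the preference has the same sign for all s."""
    return len({p > 0 for p in preferences(P, i, j, C).values()}) == 1

# Hypothetical model: node 0 adjacent to 1, 2, 3; node 4 adjacent to 2, 3.
beta = {(0, 1): -0.9, (0, 2): 0.6, (0, 3): 1.1, (4, 2): 0.5, (4, 3): -0.7}
gamma = [0.1, 0.3, -0.2, 0.4, 0.0]
P = joint(5, beta, gamma)
prefs = preferences(P, i=0, j=1, C=[4])
```

Because node 1 is a leaf of node 0, its conditional distribution given $x_{0}$ does not depend on $x_{4}$, so the two entries of `prefs` coincide and share the sign of the coupling between nodes 0 and 1.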

Even if $i\in V$ is label-consistent under GM with the true known parameter, we need to specify the reference and corresponding preference to obtain a correct labeling on $x_{i}$. We note however that attractive GMs (i.e., $\beta_{ij}>0$ for all $(i,j)\in E$) always satisfy the label-consistency with any reference node since for any $i,j\in V$ and $C\subset V\setminus\{i,j\}$ where $i,j$ are connected in $G\setminus C$,

$$\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=1,x_{C})>\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=0,x_{C}).$$

Furthermore, there can be some settings in which we can force the label-consistency from the structure of latent GMs even without the information of its true parameter. For example, consider a latent GM on $G=(V,E)$ and a parameter $\beta,\gamma$. For a set $C\subset V$, a latent node $i\in V\setminus C$ and its neighbor $j\in V\setminus(C\cup\{i\})$ such that $(i,j)\in E$ is the only path from $i$ to $j$ in $G\setminus C$, by symmetry of labels of latent nodes, one can assume that $\beta_{ij}>0$, i.e.,

$$\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=1,x_{C})>\mathbb{P}_{\beta,\gamma}(x_{j}=1\mid x_{i}=0,x_{C}),$$

to force the label-consistency of $i$ for $C$ . In general, one can still choose labels of latent variables to maximize the log-likelihood of observed variables.

As in conditioning, marginalization also has a labeling issue. Consider a latent GM on $G=(V,E)$ . Suppose that every unobserved pairwise marginal can be recovered by two marginalizations of $S_{1},S_{2}\subset V$ . If there is a common latent node $i\in S_{1}\cap S_{2}$ , then the labeling for $x_{i}$ might be inconsistent. To address this issue, we make the following assumption on graph $G=(V,E)$ , node $i\in V$ , and parameter $\beta,\gamma$ of GM.

Assumption 2 (Degeneracy).

*$\mathbb{P}_{\beta,\gamma}(x_{i}=1)\neq 0.5$.*

Under the assumption, one can choose a label of $x_{i}$ to satisfy $\mathbb{P}_{\beta,\gamma}(x_{i}=1)>0.5$ using the symmetry of labels of latent nodes.
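The relabeling convention under Assumption 2 can be sketched on a joint probability table. The functions `flip_label` and `canonical_label` and the example table are hypothetical; the point is only that two recoveries of the same distribution that differ by a swap of the labels of $x_{i}$ are mapped to the same table.

```python
def flip_label(P, i):
    """Swap the two labels of binary variable i in a joint table."""
    return {x[:i] + (1 - x[i],) + x[i + 1:]: p for x, p in P.items()}

def canonical_label(P, i):
    """Relabel x_i so that P(x_i = 1) > 0.5 (undefined if exactly 0.5)."""
    p1 = sum(p for x, p in P.items() if x[i] == 1)
    if p1 == 0.5:
        raise ValueError("Assumption 2 violated: P(x_i = 1) = 0.5")
    return P if p1 > 0.5 else flip_label(P, i)

# Made-up joint table on (x_0, x_1); here P(x_0 = 1) is about 0.3.
P = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.2}
Q = flip_label(P, 0)   # the same distribution with the labels of x_0 swapped
```

Applying `canonical_label` to either `P` or `Q` with `i = 0` gives the same table, which is how a common latent node shared by two marginalizations can be labeled consistently.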

4 Sequential Marginalizing and Conditioning

In the previous section, we introduced two concepts, marginalization and conditioning, to translate the marginal recovery problem of a large GM into that of smaller and tractable GMs. In this section, we present a sequential strategy, adaptively applying marginalization and conditioning, by which we substantially enlarge the class of tractable GMs with hidden/latent variables.

4.1 Example

We begin with a simple example describing our sequential learning framework. Consider a latent GM as illustrated in Figure 4(a) and a parameter $\beta,\gamma$. Given the visible marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{j,k,\ell,m,n\}}\right)$, our goal is to recover all unobserved pairwise marginals including $x_{h}$ or $x_{i}$ in order to learn $\beta,\gamma$ via the convex MLE (2). As neither node $h$ nor node $i$ is a bottleneck, one can consider the conditioning strategy described in the previous section, i.e., the conditional distribution $\mathbb{P}_{\beta,\gamma}\left(x_{\{h,i,\ell,m,n\}}|x_{\{j,k\}}\right)$ in Figure 4(b). Now, node $i$ is a bottleneck with views $\ell,m,n$. Hence, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,\ell,m,n\}}|x_{\{j,k\}}\right)$ using $\mathtt{TensorDecomp}$ where the label of $x_{i}$ is set to satisfy

[TABLE]

i.e., node $i$ is label consistent. Further, $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell,m,n\}}\right)$ can be recovered using the known visible marginals $\mathbb{P}_{\beta,\gamma}\left(x_{\{j,k\}}\right)$ and the following identity

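$$\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell,m,n\}}\right)=\mathbb{P}_{\beta,\gamma}\left(x_{\{i,\ell,m,n\}}\mid x_{\{j,k\}}\right)\mathbb{P}_{\beta,\gamma}\left(x_{\{j,k\}}\right).$$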

Since we recovered the pairwise marginals between $x_{i}$ and $x_{\ell}$, $x_{m}$, $x_{n}$, the remaining goal is to recover the pairwise marginals including $x_{h}$. Now consider the latent GM where $x_{\{\ell,m,n\}}$ is conditioned on, as illustrated in Figure 4(c). This time, the node $h$ is a bottleneck with views $i,j,k$, which can be handled by an additional application of $\mathtt{TensorDecomp}$ (the details are the same as in the previous case of node $i$).

This example shows that the sequential application of conditioning extends the class of latent GMs whose unobserved pairwise marginals are recoverable. Here, we use the algorithm $\mathtt{TensorDecomp}$ as a black box, hence one can consider other algorithms as long as they have similar guarantees. One caveat is that conditioning on an arbitrary number of variables is very expensive, as the algorithmic (and sampling) complexity of learning grows exponentially with the number of conditioned variables. Therefore, it is reasonable to bound the number of conditioned variables.
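The recombination step used in the example, and the cost caveat, can be sketched together. This is a minimal sketch with hypothetical names and made-up numbers: `recover_joint` multiplies each conditional table by the probability of its conditioned assignment, and `num_black_box_calls` counts one run of the local algorithm per assignment of the conditioned binary variables.

```python
def recover_joint(conditionals, marginal_C):
    """Combine P(x_rest | x_C = s) and P(x_C = s) into P(x_rest, x_C).

    conditionals: {s: {x_rest: probability}}; marginal_C: {s: probability}.
    Keys of the result are the tuples x_rest + s.
    """
    joint = {}
    for s, cond in conditionals.items():
        for x_rest, p in cond.items():
            joint[x_rest + s] = p * marginal_C[s]
    return joint

def num_black_box_calls(num_conditioned):
    """One call to the local algorithm per assignment of the conditioned set."""
    return 2 ** num_conditioned

# Made-up tables with a single conditioned variable.
conditionals = {(0,): {(0,): 0.25, (1,): 0.75},
                (1,): {(0,): 0.5, (1,): 0.5}}
marginal_C = {(0,): 0.4, (1,): 0.6}
joint = recover_joint(conditionals, marginal_C)
```

The recombination is only valid when the latent labels are chosen consistently across the conditional tables, which is the role of label-consistency; the count $2^{K}$ for $K$ conditioned variables is what motivates bounding $K$.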

4.2 Algorithm Design

Now, we are ready to state the main learning framework sequentially applying marginalization and conditioning, summarized in Algorithm 1. Suppose that there exists an algorithm, called $\mathtt{NonConvexSolver}$ , e.g., $\mathtt{TensorDecomp}$ , for a class of pairs $\mathcal{N}\subset\{(G,\mathcal{S}_{G}):G=(V,E),\mathcal{S}_{G}\subset 2^{V}\}$ such that all $(G,\mathcal{S}_{G})\in\mathcal{N}$ satisfy the following:

$\circ$

Given GM with a parameter $\beta,\gamma$ on $G=(V,E)$ and marginals $\{\mathbb{P}_{\beta,\gamma}\left(x_{S}\right):S\in\mathcal{S}_{G}\}$ , $\mathtt{NonConvexSolver}$ outputs the entire distribution $\mathbb{P}_{\beta,\gamma}(x)$ , up to labeling of variables on $V\setminus\left(\bigcup_{S\in\mathcal{S}_{G}}S\right)$ .

For example, consider a graph $G$ illustrated in Figure 1(a) with $\mathcal{S}_{G}=\big\{\{j,k,\ell\}\big\}$. Then, $\mathtt{TensorDecomp}$ outputs the entire distribution $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j,k,\ell\}}\right)$.

In addition, suppose that there exists an algorithm, called $\mathtt{Merge}$, e.g., $\mathtt{ExclusiveView}$, for a class of pairs $\mathcal{M}\subset\{(G,\mathcal{T}_{G}):G=(V,E),\mathcal{T}_{G}\subset 2^{V}\}$ such that all $(G,\mathcal{T}_{G})\in\mathcal{M}$ satisfy the following:

$\circ$

Given GM with a parameter $\beta,\gamma$ on $G=(V,E)$ and marginals $\{\mathbb{P}_{\beta,\gamma}\left(x_{S}\right):S\in\mathcal{T}_{G}\}$ , $\mathtt{Merge}$ outputs the distribution $\mathbb{P}_{\beta,\gamma}\left(x_{T}\right)$ where $T=\bigcup_{S\in\mathcal{T}_{G}}S$ .

Namely, $\mathtt{Merge}$ simply merges the small marginal distributions for $S\in\mathcal{T}_{G}$ into the entire distribution on $\bigcup_{S\in\mathcal{T}_{G}}S$ . For example, consider a graph $G$ illustrated in Figure 1(b) with

[TABLE]

where $i^{\prime},j^{\prime},k^{\prime},\ell^{\prime}\in S$ have exclusive views $i,j,k,\ell$ , respectively. Then, $\mathtt{ExclusiveView}$ outputs the distribution $\mathbb{P}_{\beta,\gamma}\left(x_{S\cup\{i,j,k,\ell\}}\right)$ .

For a GM on $G=(V,E)$ with a parameter $\beta,\gamma$ , suppose we know a family of label-consistency quadruples

[TABLE]

and marginals $\{\mathbb{P}_{\beta,\gamma}(x_{S}):S\in\sigma_{0}\}$ for some $\sigma_{0}\subset 2^{V}$. As mentioned in the previous section, we also bound the number of conditioning variables by some $K\geq 0$. Under this setting, our goal is to recover more marginals beyond the initially known ones $\{\mathbb{P}_{\beta,\gamma}(x_{S}):S\in\sigma_{0}\}$.

The following conditions for $C\subset V$ with $|C|\leq K$ and $R\subset V\setminus C$ are sufficient so that additional marginals $\mathbb{P}_{\beta,\gamma}(x_{R\cup C})$ can be recovered by conditioning variables on $C$ , marginalizing $R$ and applying $\mathtt{NonConvexSolver}$ :

$\mathcal{C}1.$

$(H,\mathcal{S}_{H})\in\mathcal{N}$ for some $\mathcal{S}_{H}\subset 2^{V}$

$\mathcal{C}2.$

For all $S\in\mathcal{S}_{H}$ , there exists $S^{\prime}\in\sigma_{0}$ such that $S\cup C\subset S^{\prime}$

$\mathcal{C}3.$

For all $i\in R\setminus\left(\bigcup_{S\in\mathcal{S}_{H}}S\right)$ , there exist $j\in\bigcup_{S\in\mathcal{S}_{H}}S$ and $p$ such that $(i,j,p,C)\in\mathcal{L}$ ,

where $H=\mathtt{Marg}(R,G\setminus C)$. In the above, $\mathcal{C}1$ implies that if $\{\mathbb{P}_{\beta,\gamma}(x_{S}|x_{C}):S\in\mathcal{S}_{H}\}$ are given, then $\mathtt{NonConvexSolver}$ outputs $\mathbb{P}_{\beta,\gamma}(x_{R}|x_{C})$ up to labeling of $R\setminus\left(\bigcup_{S\in\mathcal{S}_{H}}S\right)$. In addition, $\mathcal{C}2$ says that the required marginals $\{\mathbb{P}_{\beta,\gamma}(x_{S}|x_{C}):S\in\mathcal{S}_{H}\}$ and $\mathbb{P}_{\beta,\gamma}(x_{C})$ are known. Finally, $\mathcal{C}3$ ensures that every node whose label must be inferred is label-consistent.

Similarly, the following conditions for $C\subset V$ with $|C|\leq K$ and $(G\setminus C,\mathcal{T}_{G\setminus C})\in\mathcal{M}$ are sufficient so that $\mathbb{P}_{\beta,\gamma}(x_{T\cup C})$ can be recovered by conditioning variables on $C$ and applying $\mathtt{Merge}$ where $T=\bigcup_{S\in\mathcal{T}_{G}}S$ :

$\mathcal{C}4.$

For all $S\in\mathcal{T}_{G\setminus C}$ , there exists $S^{\prime}\in\sigma_{0}$ such that $S\cup C\subset S^{\prime}$ ,

In the above, $\mathcal{C}4$ says that the required marginals for merging are given.
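Conditions $\mathcal{C}2$ and $\mathcal{C}4$ share the same shape: every required index set, enlarged by the conditioned set $C$, must be contained in some index set whose marginal is already known. A minimal sketch of that check follows; the function name `marginals_available` and the use of Python sets of node labels are assumptions made for illustration, and the toy instance only loosely mirrors the example of Section 4.1.

```python
def marginals_available(required, conditioned, known):
    """Check: for every S in `required` there is S' in `known` with S | C <= S'.

    required    -- iterable of node sets (S_H for C2, T_{G \\ C} for C4)
    conditioned -- the conditioned node set C
    known       -- iterable of node sets whose marginals are known (sigma_0)
    """
    c = frozenset(conditioned)
    known = [frozenset(s_prime) for s_prime in known]
    return all(
        any((frozenset(s) | c) <= s_prime for s_prime in known)
        for s in required
    )

# Toy instance: the visible marginal on {j, k, l, m, n} is known and we
# condition on {j, k}; the views {l, m, n} of node i are then available.
sigma0 = [{"j", "k", "l", "m", "n"}]
views_ok = marginals_available([{"l", "m", "n"}], {"j", "k"}, sigma0)

# A set that involves the latent node i is not covered by sigma0.
latent_ok = marginals_available([{"i", "l"}], {"j", "k"}, sigma0)
```

The check is purely combinatorial: it inspects index sets only and never touches the probability tables themselves.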

The above procedures imply that given initial marginals $\{\mathbb{P}_{\beta,\gamma}(x_{S}):S\in\sigma_{0}\}$ , one can recover additional marginals $\{\mathbb{P}_{\beta,\gamma}(x_{S}):S\in\mathcal{A}_{0}\cup\mathcal{B}_{0}\}$ , where

[TABLE]

from $\mathtt{NonConvexSolver}$ and $\mathtt{Merge}$ , respectively. One can repeat the above procedure for recovering more marginals as

[TABLE]

Recall that we are primarily interested in recovering all pairwise marginals, i.e.,

[TABLE]

The following theorem implies that one can check the success of Algorithm 1 in $O\left(|V|^{K+L}\right)$ time, where $K,L$ are typically chosen as small constants.

Theorem 5.

*Suppose we have a label-consistency family $\mathcal{L}$ of GM on $G=(V,E)$ and marginals $\{\mathbb{P}_{\beta,\gamma}(x_{S}):S\in\sigma_{0}\}$ for some $\sigma_{0}\subset 2^{V}$. If Algorithm 1 eventually recovers all pairwise marginals, then it does so in $O\left(|V|^{K+L}\right)$ iterations, where $K$ and $L$ denote the maximum numbers of conditioning variables and nodes of graphs in $\mathcal{N},\mathcal{M}$, respectively. *

The proof of the above theorem is presented in Appendix B. We note that, for computational efficiency, one can design a custom sequence for recovering marginals rather than recovering all marginals in $\mathcal{A}_{t},\mathcal{B}_{t}$. In Section 5, we provide such examples, whose strategies have linear-time complexity at each iteration. We also remark that even when Algorithm 1 recovers some, but not all, unobserved pairwise marginals of a given latent GM, it is still useful, since one can run the EM algorithm using the additional information provided by Algorithm 1. We leave this suggestion for future exploration.
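The iteration that repeatedly enlarges the family of known marginals can be viewed as a fixed-point computation over index sets. The sketch below illustrates only that control flow and is not Algorithm 1 itself: the two recovery steps are abstracted as hypothetical callables (`recover_rules`) that map the currently known index sets to newly recoverable ones, and the toy rules hard-code the two bottleneck steps of the example in Section 4.1.

```python
def covered(sigma, target):
    """True if some known index set contains `target`."""
    target = frozenset(target)
    return any(target <= s for s in sigma)

def sequential_closure(sigma0, recover_rules, max_iters=1000):
    """Grow the family of known index sets until no rule adds anything."""
    sigma = {frozenset(s) for s in sigma0}
    for t in range(max_iters):
        new = set()
        for rule in recover_rules:
            new |= {frozenset(s) for s in rule(sigma)}
        if new <= sigma:          # fixed point: nothing new was recovered
            return sigma, t
        sigma |= new
    return sigma, max_iters

# Hypothetical rules for the example of Section 4.1.
def recover_i(sigma):
    # Condition on {j, k}; node i is a bottleneck with views l, m, n.
    if covered(sigma, {"j", "k", "l", "m", "n"}):
        return [{"i", "j", "k", "l", "m", "n"}]
    return []

def recover_h(sigma):
    # Condition on {l, m, n}; node h is a bottleneck with views i, j, k.
    if covered(sigma, {"i", "j", "k", "l", "m", "n"}):
        return [{"h", "i", "j", "k", "l", "m", "n"}]
    return []

sigma, iters = sequential_closure(
    [{"j", "k", "l", "m", "n"}], [recover_i, recover_h]
)
```

In this toy run the marginal containing $x_{h}$ only becomes recoverable after the one containing $x_{i}$ has been added, which is the sequential effect the framework relies on.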

4.3 Recoverable Local Structures

For running the sequential learning framework in the previous section, one requires ‘black-box’ knowledge of a label-consistency family $\mathcal{L}$ and a class of locally recoverable structures of latent GMs, i.e., $\mathcal{N}$ and $\mathcal{M}$ . The complete study on them is out of our scope, but we provide the following guidelines on their choices.

As mentioned in Section 3.2, $\mathcal{L}$ can be found easily for some classes of GMs, including attractive ones. One can also infer it heuristically for general GMs in practice. As mentioned in the previous section, one can choose $(G,\mathcal{S}_{G})\in\mathcal{N}$ corresponding to $\mathtt{TensorDecomp}$. Beyond $\mathtt{TensorDecomp}$, in practice one might add small latent GMs as a further option, since even a generic non-convex solver might compute a near-optimal solution of the MLE owing to their small dimensionality.

For the choice of $(G,\mathcal{T}_{G})\in\mathcal{M}$, we mentioned those corresponding to $\mathtt{ExclusiveView}$ in the previous section. In addition, we provide two more examples, called $\mathtt{DisjointView}$ and $\mathtt{LinearView}$, described in Algorithms 2 and 3, respectively. In Algorithm 3, $[\mathbb{P}_{\beta,\gamma}(x_{j},x_{i})]^{-1}$ is defined as

[TABLE]

Figure 5 illustrates $\mathtt{DisjointView}$ and $\mathtt{LinearView}$ .
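To give a feel for why a matrix inverse of a pairwise marginal appears in $\mathtt{LinearView}$, the following sketch works under an assumption inferred from how $\mathtt{LinearView}$ is used in Appendix C: node $i$ separates node $j$ from the set $S$, so that $x_{S}$ and $x_{j}$ are conditionally independent given $x_{i}$. Then $\mathbb{P}(x_{S},x_{j})=\sum_{x_{i}}\mathbb{P}(x_{S}|x_{i})\,\mathbb{P}(x_{i},x_{j})$ is a matrix product that can be inverted. The function name `linear_view`, the array layout and the toy numbers are illustrative and are not taken from Algorithm 3.

```python
import numpy as np

def linear_view(p_sj, p_ij):
    """Recover P(x_S, x_i, x_j) from P(x_S, x_j) and P(x_i, x_j).

    Assumes x_S is independent of x_j given x_i and that the matrix
    p_ij[i, j] = P(x_i = i, x_j = j) is invertible.
    p_sj[s, j] = P(x_S = s, x_j = j).
    """
    # p_sj = P(S | i) @ p_ij, so P(S | i) = p_sj @ inv(p_ij).
    p_s_given_i = np.linalg.solve(p_ij.T, p_sj.T).T
    # P(s, i, j) = P(s | i) * P(i, j).
    return p_s_given_i[:, :, None] * p_ij[None, :, :]

# Toy model: x_S takes 4 joint states, x_i and x_j are binary.
p_i = np.array([0.4, 0.6])
p_s_given_i_true = np.array([[0.1, 0.3],
                             [0.2, 0.3],
                             [0.3, 0.2],
                             [0.4, 0.2]])     # columns sum to 1
p_j_given_i = np.array([[0.9, 0.2],
                        [0.1, 0.8]])          # indexed [j, i]
joint_true = np.einsum("si,i,ji->sij", p_s_given_i_true, p_i, p_j_given_i)

p_sj = joint_true.sum(axis=1)   # marginal that does not involve x_i
p_ij = joint_true.sum(axis=0)   # pairwise marginal of (x_i, x_j)
recovered = linear_view(p_sj, p_ij)
```

The inversion fails when `p_ij` is singular, i.e., when $x_{j}$ carries no information that distinguishes the states of $x_{i}$.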

5 Examples

In this section, we provide concrete examples of loopy latent GMs to which the proposed sequential learning framework is applicable. In what follows, we assume that it uses the classes $\mathcal{N},\mathcal{M}$ corresponding to $\mathtt{TensorDecomp}$, $\mathtt{ExclusiveView}$, $\mathtt{DisjointView}$ and $\mathtt{LinearView}$.

Grid graph. We first consider a latent GM on a grid graph illustrated in Figure 6(a) where boundary nodes are visible and internal nodes are latent. The following lemma states that all pairwise marginals can be successfully recovered given observed ones, utilizing the proposed sequential learning algorithm.

Lemma 6.

*Consider any latent GM with a parameter $\beta,\gamma$ illustrated in Figure 6(a), $K=3$ , and $\sigma_{0}=\{S\subset O:|S|\leq 6\}$ . Then, $\sigma_{5}$ updated under Algorithm 1 contains all pairwise marginals. *

In the above, recall that $O$ is the set of visible nodes. The proof strategy is illustrated in Figure 6 and the formal proof is presented in Appendix C. We remark that $\mathtt{ExclusiveView}$ and $\mathtt{DisjointView}$ are not needed to prove Lemma 6.

Convolutional graph. Second, we consider a latent GM illustrated in Figure 7(a), which corresponds to a convolutional restricted Boltzmann machine (CRBM) [20], and also prove the following lemma.

Lemma 7.

*Consider any latent GM with a parameter $\beta,\gamma$ illustrated in Figure 7(a), $K=3$ , and $\sigma_{0}=\{S\subset O:|S|\leq 8\}$ . Then, $\sigma_{4}$ updated under Algorithm 1 contains all pairwise marginals. *

The proof strategy is illustrated in Figure 7 and the formal proof is presented in Appendix D. We remark that $\mathtt{ExclusiveView}$ and $\mathtt{LinearView}$ are not needed to prove Lemma 7. Furthermore, it is straightforward to generalize the proof of Lemma 7 to an arbitrary CRBM.

Lemma 8.

*Consider any CRBM with $N\times M$ visible nodes and a filter size $n\times m$, $2\leq n\leq m$, $K=2mn-4$ and $\sigma_{0}=\{S\subset O:|S|\leq 4mn-2m\}$. Then, $\sigma_{MNmn/2}$ updated under Algorithm 1 contains all pairwise marginals. *

Footnote 4: The statement holds for an arbitrary stride of the CRBM.

Random regular graph.

Finally, we state the following lemma for latent random regular GMs.

Lemma 9.

*Consider any latent GM with a parameter $\beta,\gamma$ on a random $d$-regular graph $(V,E)$ for some constant $d\geq 5$, $K=2d-2$ and $\sigma_{0}=\{S\subset O:|S|\leq 2(d-1)^{2}|H|\}$. There exists a constant $c=c(d)$ such that if the number of latent variables is at most $c|V|$, then $\sigma_{2d|H|}$ updated under Algorithm 1 contains all pairwise marginals a.a.s. *

The proof of the above lemma is presented in Appendix E; such recovery is not possible without our sequential learning strategy. One can obtain an explicit formula for $c(d)$ from our proof, but it is a rather loose bound since we made little effort to optimize it.

6 Conclusion

In this paper, we present a new learning strategy for latent graphical models. Unlike known algebraic (e.g., $\mathtt{TensorDecomp}$) and optimization (e.g., $\mathtt{ExclusiveView}$) approaches to this non-convex problem, ours has a combinatorial flavor and is more generic, using them as subroutines. We believe that our approach provides a new angle on this important learning task.

Appendix A Proof of Proposition 3

We use mathematical induction on $|\{S_{i}:i\in V\setminus S\}|$, where $S_{i}$ is defined in Definition 3. Before starting the proof, we define the equivalence class $[\ell]=\{i\in V\setminus S:S_{i}=S_{\ell}\}$. Now, we start the proof by considering

[TABLE]

where $f_{[\ell]}(x_{S_{\ell}})$ is some positive function. Since $|S_{\ell}|\leq 2$ , one can modify a parameter $\beta^{\dagger},\gamma^{\dagger}$ only between elements of $S_{\ell}$ to achieve the following identity

[TABLE]

where $E^{\prime}=E\cup\{(j,k):S_{\ell}=\{j,k\}\}$ . Using the induction hypothesis, the above identity completes the proof of Proposition 3.

Appendix B Proof of Theorem 5

Since the algorithm only uses the marginals of at most $K+L$ dimensions, instead of $\sigma_{t}$ , consider the following sequence

[TABLE]

One can observe that if $\sigma^{\prime}_{t}=\sigma^{\prime}_{t-1}$, then the sequential local framework cannot recover more marginals after the $t$-th iteration; otherwise, the cardinality of $\sigma^{\prime}_{t}$ increases by at least $1$. However, the maximum cardinality of $\sigma^{\prime}_{t}$ is $O(|V|^{K+L})$, which implies that the algorithm always terminates in $O(|V|^{K+L})$ iterations. This completes the proof of Theorem 5.
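The counting step behind the $O(|V|^{K+L})$ bound can be checked numerically. The sketch below assumes, as the proof suggests, that only index sets with at most $K+L$ elements need to be tracked; it counts such subsets of a vertex set of size $n$ and compares the count with the crude bound $(n+1)^{K+L}$. The function name `num_small_subsets` and the specific values of $n$, $K$, $L$ are illustrative.

```python
from math import comb

def num_small_subsets(n, m):
    """Number of subsets of an n-element set that have at most m elements."""
    return sum(comb(n, s) for s in range(m + 1))

# Illustrative sizes: 20 nodes, at most K = 3 conditioned variables and
# local structures with at most L = 4 nodes.
n, K, L = 20, 3, 4
count = num_small_subsets(n, K + L)
bound = (n + 1) ** (K + L)   # polynomial in n for constant K + L
```

Since the tracked family grows by at least one set in every iteration that makes progress, the number of such iterations cannot exceed `count`.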

Appendix C Proof of Lemma 6

We first consider the distribution conditioned on $x_{\{a,c,k\}}$ as illustrated in Figure 6(b). In Figure 6(b), observe that $g$ is a bottleneck with views $b,f,p$ . Furthermore, $g$ is label consistent for $\{a,c,k\}$ with a reference $b$ by assuming $\beta_{bg}>0$ (or $\beta_{bg}<0$ ). Hence, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,f,g,p\}}|x_{\{a,c,k\}}\right)$ using $\mathtt{TensorDecomp}$ and obtain $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,c,f,g,k,p\}}\right)$ using the following identity.

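$$\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,c,f,g,k,p\}}\right)=\mathbb{P}_{\beta,\gamma}\left(x_{\{b,f,g,p\}}|x_{\{a,c,k\}}\right)\mathbb{P}_{\beta,\gamma}\left(x_{\{a,c,k\}}\right).$$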

Similarly, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,f,k,\ell,p,q,r\}}\right)$ , $\mathbb{P}_{\beta,\gamma}\left(x_{\{c,d,e,i,j,o,t\}}\right)$ , $\mathbb{P}_{\beta,\gamma}\left(x_{\{e,j,n,o,r,s,t\}}\right)$ .

In order to recover marginals including $x_{h}$ or $x_{m}$, $h$ and $m$ should be bottlenecks. Conditioned on $x_{\{b,d,\ell,q\}}$, as illustrated in Figure 6(d), $h$ is a bottleneck with views $c,p,r$; however, we do not currently have the marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,c,d,\ell,p,q,r\}}\right)$. We therefore first recover the marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,c,d,\ell,p,q,r\}}\right)$. Consider the distribution conditioned on $x_{\{p,r\}}$ as illustrated in Figure 6(c). In Figure 6(c), observe that $q$ and $b,c,d$ are disconnected if $\ell$ is removed. Furthermore, $\mathbb{P}_{\beta,\gamma}\left(x_{\{\ell,q\}}|x_{\{p,r\}}\right)$ and $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,c,d,q\}}|x_{\{p,r\}}\right)$ are already observed. Hence, using $\mathtt{LinearView}$ by setting $S\leftarrow\{b,c,d\}$, $i\leftarrow\ell$, $j\leftarrow q$ and conditioning $x_{\{p,r\}}$, one can obtain $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,c,d,\ell,p,q,r\}}\right)$. Now, $h$ is a bottleneck with views $c,p,r$ by conditioning $x_{\{b,d,\ell,q\}}$. Using $\mathtt{TensorDecomp}$, one can obtain $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,c,d,h,\ell,p,q,r\}}\right)$. Using the same procedure, one can also obtain $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,c,g,m,q,r,s\}}\right)$.

So far, we have recovered all pairwise marginals between a visible variable and a latent variable. The remaining goal is to recover the pairwise marginals between latent variables. First, by setting $S\leftarrow\{e,j,o\}$, $i\leftarrow h$, $j\leftarrow c$ and conditioning $x_{\{b,d\}}$, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{b,c,d,e,h,j,o\}}\right)$ using $\mathtt{LinearView}$. Next, by setting $S\leftarrow\{h\}$, $i\leftarrow i$, $j\leftarrow j$ and conditioning $x_{\{e,o\}}$, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{e,h,i,j,o\}}\right)$ using $\mathtt{LinearView}$, which includes the pairwise marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{i,j\}}\right)$. The other pairwise marginals between latent variables can also be recovered using the same procedure. Since the sequence ends in 5 steps, this completes the proof of Lemma 6.

Appendix D Proof of Lemma 7

We first consider the distribution conditioned on $x_{\{c,e,f\}}$ as illustrated in Figure 7(b). In Figure 7(b), observe that $m$ is a bottleneck with views $a,b,d$ with a reference $a$ by assuming $\beta_{am}>0$ (or $\beta_{am}<0$ ). Hence, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,d,m\}}|x_{\{c,e,f\}}\right)$ using $\mathtt{TensorDecomp}$ and obtain $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,c,d,e,f,m\}}\right)$ using the following identity.

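$$\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,c,d,e,f,m\}}\right)=\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,d,m\}}|x_{\{c,e,f\}}\right)\mathbb{P}_{\beta,\gamma}\left(x_{\{c,e,f\}}\right).$$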

Similarly, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,c,d,e,f,n\}}\right)$ , $\mathbb{P}_{\beta,\gamma}\left(x_{\{g,h,i,j,k,\ell,q\}}\right)$ , $\mathbb{P}_{\beta,\gamma}\left(x_{\{g,h,i,j,k,\ell,r\}}\right)$ .

In order to recover marginals including $x_{o}$ or $x_{p}$, $o$ and $p$ should be bottlenecks. Conditioned on $x_{\{h,m,q\}}$, $o$ is a bottleneck with views $d,e,g$; however, we do not currently have the marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{d,e,g,h,m,q\}}\right)$. We therefore first recover the marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{d,e,g,h,m,q\}}\right)$. Since we observed $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,d,e,g,h,j,k\}}\right)$ and $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,d,e,m\}}\right)$, we can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,d,e,g,h,j,k,m\}}\right)$ using $\mathtt{DisjointView}$ by setting $S\leftarrow\{g,h,j,k\}$, $T\leftarrow\{m\}$ and $C\leftarrow\{a,b,d,e\}$. Likewise, using $\mathtt{DisjointView}$, one can recover the marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{a,b,d,e,g,h,j,k,m,q\}}\right)$ as well. Using the recovered marginal $\mathbb{P}_{\beta,\gamma}\left(x_{\{d,e,g,h,m,q\}}\right)$, conditioning $x_{\{h,m,q\}}$ and using $\mathtt{TensorDecomp}$, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{d,e,g,h,m,o,q\}}\right)$. Similarly, one can recover $\mathbb{P}_{\beta,\gamma}\left(x_{\{e,f,h,i,n,p,r\}}\right)$. Since the sequence ends in 4 steps, this completes the proof of Lemma 7.
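The way $\mathtt{DisjointView}$ is used in this proof (two marginals that overlap only on a set $C$ are merged into one marginal on their union) can be sketched numerically. The sketch assumes that the merge rests on conditional independence of $x_{S}$ and $x_{T}$ given $x_{C}$, so that $\mathbb{P}(x_{S},x_{T},x_{C})=\mathbb{P}(x_{S},x_{C})\,\mathbb{P}(x_{T},x_{C})/\mathbb{P}(x_{C})$. This is an assumption about Algorithm 2, which is not reproduced here, and the function name `disjoint_view` and the toy numbers are illustrative.

```python
import numpy as np

def disjoint_view(p_sc, p_tc):
    """Merge P(x_S, x_C) and P(x_T, x_C) into P(x_S, x_T, x_C).

    Assumes x_S is independent of x_T given x_C and P(x_C = c) > 0.
    p_sc[s, c] = P(x_S = s, x_C = c); p_tc[t, c] = P(x_T = t, x_C = c).
    """
    p_c = p_sc.sum(axis=0)
    return p_sc[:, None, :] * p_tc[None, :, :] / p_c[None, None, :]

# Toy model in which the conditional independence holds by construction.
p_c = np.array([0.3, 0.7])
p_s_given_c = np.array([[0.5, 0.1],
                        [0.3, 0.2],
                        [0.2, 0.7]])   # columns sum to 1
p_t_given_c = np.array([[0.6, 0.25],
                        [0.4, 0.75]])  # columns sum to 1
joint_true = np.einsum("sc,tc,c->stc", p_s_given_c, p_t_given_c, p_c)

p_sc = joint_true.sum(axis=1)   # marginal on (S, C)
p_tc = joint_true.sum(axis=0)   # marginal on (T, C)
merged = disjoint_view(p_sc, p_tc)
```

With $S\leftarrow\{g,h,j,k\}$, $T\leftarrow\{m\}$ and $C\leftarrow\{a,b,d,e\}$, the arrays `p_sc` and `p_tc` would play the roles of the two observed marginals in the proof.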

Appendix E Proof of Lemma 9

The main idea of the proof is to show that every set of latent nodes of size at most $cN$ contains at least one latent node that is recoverable using $\mathtt{TensorDecomp}$, where $N=|V|$. We first state the following condition for a latent node $i$.

Condition 1.

*For a latent node $i$, two of its neighbors $j,k$ are visible, and the set $S$ of neighbors of $j,k$ other than $i$ is visible and does not contain $j,k$. Also, there exists $\ell\in O\setminus S$ such that $i$ is a bottleneck with views $j,k,\ell$ in $G\setminus S$. *

In the above condition, $O$ denotes the set of visible nodes. One can easily observe that if a latent node satisfies the above condition, then it is recoverable by conditioning on the neighbors of $j,k$ and applying $\mathtt{TensorDecomp}$ with views $j,k$ and one other node.

Now consider the following procedure. First, duplicate each $i\in V$ into $i_{1},\dots,i_{d}$, where $i_{n}$ is visible/latent if $i$ is visible/latent. Let $V^{\prime}$ be the resulting duplicated vertex set, $O^{\prime}\subset V^{\prime}$ be the set of visible nodes and $H^{\prime}=V^{\prime}\setminus O^{\prime}$ be the set of latent nodes. The procedure starts with a graph on $V^{\prime}$ without edges.

1.

Choose latent nodes $i_{1},\dots,i_{d}\in H^{\prime}$. For each $n\in\{1,\dots,d\}$, if $\deg(i_{n})\neq 1$, choose a single neighbor $j_{m}$ of $i_{n}$ with probability

[TABLE]

2.

Similarly, for each neighbor $j_{m}\in O^{\prime}$ of $i_{1},\dots,i_{d}$ , for all $j_{1},\dots,j_{d}$ satisfying deg $(j_{o})=0$ , add neighbors of $j_{o}$ as in step 1.

3.

Check whether there exists an edge $(i_{n},i_{m})$ or a pair of edges $(i_{n},j_{m})$, $(i_{n^{\prime}},j_{m^{\prime}})$. If such an edge or pair of edges exists, then the procedure restarts from the beginning.

4.

Let $G$ be the graph obtained by contracting $\ell_{1},\dots,\ell_{d}$ into $\ell$ for all $\ell\in V$. Check whether $i$ satisfies Condition 1 with $j,k$ and $i$ is a bottleneck by conditioning on the neighbors of $j,k$.

5.

If $G$ satisfies the condition in step 4, then the procedure succeeds. If not, repeat the procedure for the next latent node until every latent node has decided its neighbors.

6.

If every latent node has decided its neighbors, the procedure fails.

The above procedure constructs the fractional edges of a random $d$-regular graph by contracting $\ell_{1},\dots,\ell_{d}$ into $\ell$. Step 3 checks whether the procedure creates a loop or multiple edges. One can notice that if any node satisfies Condition 1 in step 4, then there exists a recoverable latent node. Our primary goal is to bound the probability that the procedure fails, i.e., that no latent node satisfies Condition 1 under the fractional graph.
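The duplicate-then-contract construction is closely related to the standard pairing (configuration) model for random $d$-regular graphs. The sketch below is a simplification offered only for intuition: it pairs all $d$ copies of every vertex at once and restarts whenever a loop or a multiple edge appears, whereas the procedure in the proof reveals edges incrementally around latent nodes. The function name `random_regular_pairing` and its parameters are illustrative.

```python
import random

def random_regular_pairing(n, d, rng, max_restarts=10_000):
    """Sample a simple d-regular graph on n vertices via the pairing model.

    Each vertex is duplicated into d copies ("stubs"); a uniformly random
    perfect matching of the stubs is contracted back to the n vertices.
    The attempt is rejected if it creates a loop or a multiple edge.
    """
    if (n * d) % 2 != 0:
        raise ValueError("n * d must be even")
    for _ in range(max_restarts):
        stubs = [v for v in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        edges = set()
        simple = True
        for a, b in zip(stubs[0::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            if a == b or edge in edges:   # loop or multiple edge: restart
                simple = False
                break
            edges.add(edge)
        if simple:
            return edges
    raise RuntimeError("no simple graph found within max_restarts")

rng = random.Random(0)
edges = random_regular_pairing(10, 3, rng)

# Degree of every vertex after contraction.
degree = {v: 0 for v in range(10)}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
```

Rejection on loops and multiple edges plays the role of the restart in step 3 of the procedure.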

One can observe that if a visible node is chosen as a neighbor of a latent node in the procedure but is already a neighbor of another latent node, then it cannot help to satisfy Condition 1. Also, at each iteration, choosing neighbors has the effect of removing at most $2d$ nodes from the pool of available nodes, as at most $d^{2}$ edges are created. Now, suppose there exist $\alpha N$ latent nodes where $\alpha<\frac{1}{2d(d-1)}$. Using this fact, one can observe that the probability that a visible node connected to a latent node has $d-1$ visible neighbors is at least $p=(1-2d\alpha)^{d-1}$. We also note that the probability that the procedure starts over in step 3 is $O(1/N)$ at each iteration. Therefore, one can conclude that

[TABLE]

for sufficiently small $\alpha$ (up to a constant), where the $O(1/N)$ term in the bracket accounts for the probability that no $\ell$ as in Condition 1 exists and for the variation of the degrees as the procedure iterates. Also, $\mathbf{1}_{S}$ is an indicator function taking the value $1$ if the event $S$ occurs and $0$ otherwise. The second-to-last inequality follows from the fact that we can first choose at least $\alpha N/(d+1)$ latent nodes of degree [math], and then at least $d\alpha N/(d+1)^{2}$ latent nodes of degree at most $1$. The constant $k$ in the last inequality is

[TABLE]

for all $d\geq 5$. One might be concerned that, after the procedure succeeds, extending the procedure to all vertices may start over with high probability, so that the probability $\mathbb{P}(\text{no latent node satisfies Condition 1})$ becomes significantly larger than (9). However, we note that the probability of restarting when extending the procedure to all vertices is $1-\exp\left(\frac{1-d^{2}}{4}\right)$ a.a.s., i.e., a constant (see [33]), and therefore

[TABLE]

for $O(1)\alpha<1$ in the above equation. Now, we consider all $1\leq\alpha N\leq cN$ and all choices of sets of latent node to apply the union bound as below. The explicit choice of $c$ will be presented later.

[TABLE]

where the first inequality is from Stirling’s formula and we choose $c$ to satisfy that $(k-1)c\log c+c\log O(1)-(1-c)\log(1-c)<0$ to obtain the last equality. Such $c$ always exists as

[TABLE]

for a sufficiently small $c$ .

Now, we know that at each iteration of the sequential learning framework, there exists at least one bottleneck latent node that can be recovered without a labeling issue (by forcing labels). Furthermore, using $\mathtt{LinearView}$ and conditioning, one can also treat recovered latent nodes as visible nodes, although the marginals including latent nodes always contain the conditioned variables; i.e., the order of the marginals is reduced in some sense, since recovered marginals have a fixed order of which a constant number (at most $d-1$) is taken up by conditioned variables. Using this fact, one can conclude that the sequential learning framework recovers all pairwise marginals in $2d|H|$ iterations, where $2d$ follows from the fact that at most $2d-2$ calls of $\mathtt{LinearView}$ and at most two bottleneck calls are required to recover a single latent node. This completes the proof of Lemma 9.

Bibliography

[1] Animashree Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
[2] Animashree Anandkumar, Rong Ge, Daniel J Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[3] Animashree Anandkumar, Daniel J Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, 2012.
[4] Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002.
[5] Borja Balle and Mehryar Mohri. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems, pages 2159–2167, 2012.
[6] Arun T. Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
[7] Arun T. Chaganty and Percy Liang. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning, pages 1872–1880, 2014.
[8] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.
