Learning Restricted Boltzmann Machines with Arbitrary External Fields

Surbhi Goel

arXiv:1906.06595·cs.LG·June 18, 2019

TL;DR

This paper introduces an algorithm, with optimal dependence on dimension in sample complexity and run time, for learning locally consistent (ferromagnetic or antiferromagnetic) Restricted Boltzmann Machines with arbitrary external fields, removing the earlier restriction to consistent external fields.

Contribution

It presents the first algorithm capable of learning RBMs with arbitrary external fields, utilizing a new structural property and covariance-based neighborhood construction.

Findings

01 Algorithm has optimal dependence on dimension for sample complexity and runtime.

02 Successfully learns RBMs with arbitrary external fields, extending prior work.

03 Relies on covariance properties even with arbitrary external fields.

Abstract

We study the problem of learning graphical models with latent variables. We give the first algorithm for learning locally consistent (ferromagnetic or antiferromagnetic) Restricted Boltzmann Machines (or RBMs) with arbitrary external fields. Our algorithm has optimal dependence on dimension in the sample complexity and run time however it suffers from a sub-optimal dependency on the underlying parameters of the RBM. Prior results have been established only for ferromagnetic RBMs with consistent external fields (signs must be same) [3]. The proposed algorithm strongly relies on the concavity of magnetization which does not hold in our setting. We show the following key structural property: even in the presence of arbitrary external field, for any two observed nodes that share a common latent neighbor, the covariance is high. This enables us to…

Equations

$$\Pr[X=x,Y=y]=\frac{1}{Z}\exp\left(x^{T}Jy+h^{T}x+g^{T}y\right)$$

$$\mathsf{Cov}(u,v|X_{S}=x_{S}):=\mathbb{E}[X_{u}X_{v}|X_{S}=x_{S}]-\mathbb{E}[X_{u}|X_{S}=x_{S}]\,\mathbb{E}[X_{v}|X_{S}=x_{S}].$$

$$\mathsf{Cov}(u,v|X_{S}=x_{S})\geq\alpha^{2}\exp(-12\lambda).$$

$$\Pr[X=x,Y=y,X^{\prime}=x^{\prime},Y^{\prime}=y^{\prime}]\propto\exp\left(x^{T}Jy+h^{T}x+g^{T}y+x^{\prime T}Jy^{\prime}+h^{T}x^{\prime}+g^{T}y^{\prime}\right)$$

$$\begin{aligned}
&\Pr[X=x,Y=y,X^{\prime}=x^{\prime},Y^{\prime}=y^{\prime}]\\
&\quad\propto\exp\left(x^{T}Jy+h^{T}x+g^{T}y+x^{\prime T}Jy^{\prime}+h^{T}x^{\prime}+g^{T}y^{\prime}\right)\\
&\quad=\exp\Big(\tfrac{1}{2}\left(x^{T}Jy+x^{\prime T}Jy+x^{T}Jy^{\prime}+x^{\prime T}Jy^{\prime}\right)+\tfrac{1}{2}\left(x^{T}Jy-x^{\prime T}Jy-x^{T}Jy^{\prime}+x^{\prime T}Jy^{\prime}\right)\\
&\qquad\qquad+h^{T}(x+x^{\prime})+g^{T}(y+y^{\prime})\Big)\\
&\quad=\exp\left((x^{+})^{T}Jy^{+}+(x^{-})^{T}Jy^{-}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right).
\end{aligned}$$

$$\begin{aligned}
\mathsf{Cov}(u,v)&=\mathbb{E}_{\mathcal{D}}[X_{u}X_{v}]-\mathbb{E}_{\mathcal{D}}[X_{u}X^{\prime}_{v}]=\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X^{\prime}_{v}]-\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X_{v}]\\
&=\frac{1}{2}\left(\mathbb{E}_{\mathcal{D}}[X_{u}X_{v}]+\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X^{\prime}_{v}]-\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X_{v}]-\mathbb{E}_{\mathcal{D}}[X_{u}X^{\prime}_{v}]\right)\\
&=\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[(X_{u}-X^{\prime}_{u})(X_{v}-X^{\prime}_{v})\right]\\
&=\mathbb{E}_{\mathcal{D}}[X^{-}_{u}X^{-}_{v}]\\
&=\frac{\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}x^{-}_{u}x^{-}_{v}\exp\left((x^{+})^{T}Jy^{+}+(x^{-})^{T}Jy^{-}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right)}{\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\exp\left((x^{+})^{T}Jy^{+}+(x^{-})^{T}Jy^{-}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right)}.
\end{aligned}$$

$$\begin{aligned}
N&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}x^{-}_{u}x^{-}_{v}\exp\left((x^{-}_{u}J_{uk}+x^{-}_{v}J_{vk})y^{-}_{k}\right)\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\\
&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\sum_{i=0}^{\infty}x^{-}_{u}x^{-}_{v}\left(\frac{(x^{-}_{u}J_{uk}+x^{-}_{v}J_{vk})^{i}(y^{-}_{k})^{i}}{i!}\right)\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\\
&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\sum_{i=0}^{\infty}\sum_{j=0}^{i}\frac{1}{i!}\binom{i}{j}J_{uk}^{j}J_{vk}^{i-j}(x^{-}_{u})^{j+1}(x^{-}_{v})^{i+1-j}(y^{-}_{k})^{i}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})
\end{aligned}$$

$$\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\prod_{a\in[n]}(x^{-}_{a})^{A_{a}}\prod_{b\in[m]}(y^{-}_{b})^{B_{b}}f(x^{+},y^{+})\geq 0.$$

$$\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}(x^{-}_{u})^{j+1}(x^{-}_{v})^{i+1-j}(y^{-}_{k})^{i}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\geq 0.$$

$$N\geq\alpha^{2}\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}(x^{-}_{u})^{2}(x^{-}_{v})^{2}(y^{-}_{k})^{2}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+}).$$

$$\begin{aligned}
\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})&=\exp\left((x^{-})^{T}Jy^{-}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k}+(x^{+})^{T}Jy^{+}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right)\\
&=\exp\left((x^{-}_{L})^{T}J(L,R)y^{-}_{R}+(x^{+}_{L})^{T}J(L,R)y^{+}_{R}+\sqrt{2}\,h_{L}^{T}x^{+}_{L}+\sqrt{2}\,g_{R}^{T}y^{+}_{R}\right)\\
&\quad\times\exp\left(x^{-}_{u}J(\{u\},R)y^{-}_{R}+x^{+}_{u}J(\{u\},R)y^{+}_{R}+\sqrt{2}\,h_{u}x^{+}_{u}\right)\\
&\quad\times\exp\left(x^{-}_{v}J(\{v\},R)y^{-}_{R}+x^{+}_{v}J(\{v\},R)y^{+}_{R}+\sqrt{2}\,h_{v}x^{+}_{v}\right)\\
&\quad\times\exp\left(x^{-}_{L}J(L,\{k\})y^{-}_{k}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k}+x^{+}J(V_{obs},\{k\})y^{+}_{k}+\sqrt{2}\,g_{k}y^{+}_{k}\right)
\end{aligned}$$

$$\begin{aligned}
&\exp\left(x^{-}_{u}J(\{u\},R)y^{-}_{R}+x^{+}_{u}J(\{u\},R)y^{+}_{R}+\sqrt{2}\,h_{u}x^{+}_{u}\right)\\
&\quad=\exp\left(x_{u}J(\{u\},R)y_{R}+x^{\prime}_{u}J(\{u\},R)y^{\prime}_{R}+h_{u}(x_{u}+x^{\prime}_{u})\right)\\
&\quad\geq\exp\Big(-2\Big(\sum_{j\in R}|J_{uj}|+|h_{u}|\Big)\Big)\geq\exp(-2\lambda)
\end{aligned}$$

$$\begin{aligned}
&\exp\left(x^{-}_{L}J(L,\{k\})y^{-}_{k}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k}+x^{+}J(V_{obs},\{k\})y^{+}_{k}+\sqrt{2}\,g_{k}y^{+}_{k}\right)\\
&\quad=\exp\left(x_{L}J(L,\{k\})y_{k}+x^{\prime}_{L}J(L,\{k\})y^{\prime}_{k}+x_{u}J_{uk}y^{\prime}_{k}+x^{\prime}_{u}J_{uk}y_{k}+x_{v}J_{vk}y^{\prime}_{k}+x^{\prime}_{v}J_{vk}y_{k}+g_{k}(y_{k}+y^{\prime}_{k})\right)\\
&\quad\geq\exp\Big(-2\Big(\sum_{i\in V_{obs}}|J_{ik}|+|g_{k}|\Big)\Big)\geq\exp(-2\lambda)
\end{aligned}$$

$$\rho(L,R):=\sum_{\substack{x_{L},x^{\prime}_{L}\in\{\pm 1\}^{|L|}\\ y_{R},y^{\prime}_{R}\in\{\pm 1\}^{|R|}}}\exp\left((x^{-}_{L})^{T}J(L,R)y^{-}_{R}+(x^{+}_{L})^{T}J(L,R)y^{+}_{R}+\sqrt{2}\,h_{L}^{T}x^{+}_{L}+\sqrt{2}\,g_{R}^{T}y^{+}_{R}\right),$$

$$N=2^{6}\exp(-6\lambda)\,\rho(L,R).$$

$$\begin{aligned}
D&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\exp\left((x^{-}_{L})^{T}J(L,R)y^{-}_{R}+(x^{+}_{L})^{T}J(L,R)y^{+}_{R}+\sqrt{2}\,h_{L}^{T}x^{+}_{L}+\sqrt{2}\,g_{R}^{T}y^{+}_{R}\right)\\
&\quad\times\exp\left(x^{-}_{u}J(\{u\},R)y^{-}_{R}+x^{+}_{u}J(\{u\},R)y^{+}_{R}+\sqrt{2}\,h_{u}x^{+}_{u}\right)\\
&\quad\times\exp\left(x^{-}_{v}J(\{v\},R)y^{-}_{R}+x^{+}_{v}J(\{v\},R)y^{+}_{R}+\sqrt{2}\,h_{v}x^{+}_{v}\right)\\
&\quad\times\exp\left(x^{-}J(V_{obs},\{k\})y^{-}_{k}+x^{+}J(V_{obs},\{k\})y^{+}_{k}+\sqrt{2}\,g_{k}y^{+}_{k}\right)
\end{aligned}$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning

Full text

Abstract

We study the problem of learning graphical models with latent variables. We give the first algorithm for learning locally consistent (ferromagnetic or antiferromagnetic) Restricted Boltzmann Machines (or RBMs) with arbitrary external fields. Our algorithm has optimal dependence on dimension in the sample complexity and run time; however, it suffers from a sub-optimal dependency on the underlying parameters of the RBM.

Prior results have been established only for ferromagnetic RBMs with consistent external fields (all signs must be the same) [3]. That prior algorithm relies strongly on the concavity of magnetization, which does not hold in our setting. We show the following key structural property: even in the presence of arbitrary external fields, for any two observed nodes that share a common latent neighbor, the covariance is high. This enables us to design a simple greedy algorithm that maximizes covariance to iteratively build the neighborhood of each vertex.

1 Introduction

Graphical models are a popular framework for expressing high dimensional distributions by using an underlying graph to represent conditional dependencies among the variables. Learning the underlying dependency structure of a graphical model using samples drawn from the distribution is a core problem in understanding graphical models. Much progress has been made in recent years towards developing efficient algorithms for learning fundamental models such as the Ising model and Markov random fields (MRFs) with near optimal sample and time complexity under the assumptions of sparsity and/or correlation decay.

The structure learning problem becomes even more challenging when the underlying model is allowed to have latent (or hidden) variables. Compared to fully observed models, latent variable models can induce more complex dependencies among the observed variables once the latent variables are marginalized. In this work we restrict ourselves to a special class of latent variable models known as Restricted Boltzmann machines (RBMs). RBMs have been used for various unsupervised learning tasks [7, 10, 15, 8] since their inception in the early 2000s by Geoffrey Hinton. In RBMs, the interactions are restricted to be pairwise between observed and latent variables. More formally, a RBM induces a probability distribution over $n$ observed variables $X\in\{\pm 1\}^{n}$ and $m$ latent variables $Y\in\{\pm 1\}^{m}$ as follows,

$$\Pr[X=x,Y=y]=\frac{1}{Z}\exp\left(x^{T}Jy+h^{T}x+g^{T}y\right)$$

Here $J\in\mathbb{R}^{n\times m}$ is the interaction matrix, $h\in\mathbb{R}^{n},g\in\mathbb{R}^{m}$ are the external fields and $Z$ is the partition function. Alternatively, a RBM can be viewed as a bipartite graph between the set of observed and latent variables with edge weights given by $J$ .
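To make the definition concrete, here is a minimal sketch that evaluates the RBM distribution by brute-force enumeration. It is only feasible for very small $n$ and $m$, and the function name `rbm_joint` and the toy parameters are illustrative assumptions, not part of the paper.

```python
import itertools
import math

def rbm_joint(J, h, g):
    """Return {(x, y): Pr[X = x, Y = y]} for a small RBM by enumeration."""
    n, m = len(h), len(g)
    weights = {}
    for x in itertools.product((-1, 1), repeat=n):
        for y in itertools.product((-1, 1), repeat=m):
            # Exponent x^T J y + h^T x + g^T y from the definition above.
            exponent = sum(x[i] * J[i][j] * y[j] for i in range(n) for j in range(m))
            exponent += sum(h[i] * x[i] for i in range(n))
            exponent += sum(g[j] * y[j] for j in range(m))
            weights[(x, y)] = math.exp(exponent)
    Z = sum(weights.values())  # partition function
    return {config: w / Z for config, w in weights.items()}

# Hypothetical toy instance: 2 observed nodes, 1 latent node, mixed-sign fields.
J = [[0.5], [0.5]]
h = [1.0, -1.0]
g = [-0.5]
joint = rbm_joint(J, h, g)
```

The dictionary has $2^{n+m}$ entries, which is why structure learning from samples, rather than enumeration, is the problem of interest.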

Recently Bresler et al. [3] proposed an algorithm that learns ferromagnetic RBMs ( $J\geq 0$ ) with non-negative external fields ( $h,g\geq 0$ ). They apply the famous Griffiths-Hurst-Sherman (GHS) correlation inequality to prove that a certain influence function is submodular and use a simple greedy algorithm to maximize it. Their work relies heavily on the GHS inequality, which requires the external fields to be consistent, that is, to have the same sign.

In this paper we focus on learning locally consistent RBMs (outgoing edges of each latent variable have the same sign) with arbitrary external fields. The presence of inconsistent external fields allows for different biases on different hidden nodes potentially creating more conflicts between the observed nodes making the problem more challenging. It is well-known that the presence of arbitrary external fields can greatly change the complexity of closely related problems such as approximating the partition function [5].

Our Results.

The main contribution of our paper is the following key structural property of locally consistent RBMs with arbitrary external fields.

Lemma 1 (Informal version of Lemma 2).

For any observed node $u$ in a locally consistent RBM, for all observed nodes $v$ that share a common neighbor with $u$ in the underlying graph, the covariance between $u$ and $v$ is at least some positive constant independent of the dimension $n$ .

The above key property gives us the following structure learning result for locally consistent RBMs.

Theorem 1 (Informal version of Theorem 2).

Consider a locally consistent RBM with arbitrary external fields such that all non-zero interactions are bounded below by $\alpha$ in absolute value and, for every node, the sum of the absolute weights of its edges plus the absolute value of its external field is bounded above by $\lambda$ . Then there is an algorithm that recovers the Markov blanket of each observed variable in time $\widetilde{O}_{\alpha,\lambda}(n^{2})$ and sample complexity $O_{\alpha,\lambda}(\log n)$ . (The subscript indicates that the dependency on $\alpha,\lambda$ is suppressed, and $\widetilde{O}$ hides logarithmic dependencies.)

Here the dependence on $\alpha$ is exponential and that on $\lambda$ is doubly exponential. Singly exponential dependence is necessary for learning. Note that our bounds are similar to those in [2]. However, for ferromagnetic RBMs with consistent fields, [3] obtain a singly exponential dependence on $\lambda$ , which is optimal. Improving the dependence on $\alpha,\lambda$ for locally consistent RBMs with arbitrary external fields is an outstanding open question.

Our Techniques.

For our key structural result, we define a transformation on the variables that enables us to use symmetry arguments in order to prove the non-negativity of the covariance. A more involved analysis lets us go further and bound the covariance by a constant independent of the input dimension.

For learning RBMs, in the spirit of the influence maximization algorithm due to Bresler [2], we maximize covariance to iteratively build the neighborhood of each observed vertex. Using an entropy argument, we can show that our iterative algorithm returns us the exact neighborhood of each vertex.

Related Work.

Structure learning for graphical models is a well studied problem, with major focus on the fully-observed model. The first algorithms were proposed by Chow and Liu [4] for learning undirected graphical models on trees. Subsequently, various algorithms were proposed for structure learning under varying assumptions on the underlying model [11, 14, 19, 2, 17, 9, 6, 18]. Bresler [2] proposed a simple greedy algorithm based on influence maximization for assumption-free structure learning of Ising models. His algorithm achieved optimal sample/time complexity in terms of the dimension; however, it depended doubly exponentially on the degree of the underlying graph. Subsequently Vuffray et al. [17] and Klivans and Meka [9] proposed alternative techniques to remove the doubly exponential dependence.

The problem of structure recovery in the presence of latent variables is not as well understood as the fully-observed setting. For locally tree-like models, Anandkumar and Valluvan [1] gave efficient algorithms for recovery under a correlation decay assumption. Assuming that the latent variables are distributed according to a Gaussian distribution, Nussbaum and Giesen [12] proposed a sparse + low rank likelihood model for structure learning. The most relevant to our work is [3], which proposed the first algorithm to recover the structure of ferromagnetic RBMs with non-negative external fields using concavity of magnetization. Unlike their setup, we allow the external fields to be arbitrary and relax the ferromagnetic condition to a locally-consistent condition at the cost of a worse dependence on $\alpha,\lambda$ .

2 Preliminaries

We consider a RBM on an underlying bipartite graph $G=(V_{obs},V_{lat},E)$ over observed variables $X$ and latent variables $Y$ with $|V_{obs}|=n$ and $|V_{lat}|=m$ . Each configuration of observed/latent variables in $\{\pm 1\}$ is assigned probability

$$\Pr[X=x,Y=y]=\frac{1}{Z}\exp\left(x^{T}Jy+h^{T}x+g^{T}y\right)$$

where $J$ is the interaction matrix and $h,g$ are external fields. In this work, we consider the following class of locally consistent RBMs.

Definition 1.

A RBM is said to be $(\alpha,\lambda)$ -locally consistent if the following conditions are satisfied:

• $J$ is locally consistent, that is, for each $j\in[m]$ , $J_{ij}\geq 0$ for all $i$ (ferromagnetic) or $J_{ij}\leq 0$ for all $i$ (anti-ferromagnetic).

• For all $(i,j)\in E$ , $|J_{ij}|\geq\alpha$ .

• For all $i\in[n]$ , $\sum_{j}|J_{ij}|+|h_{i}|\leq\lambda$ .

• For all $j\in[m]$ , $\sum_{i}|J_{ij}|+|g_{j}|\leq\lambda$ .

Define $N(u):=\{j:J_{uj}\neq 0\}$ to be the graph-theoretic neighborhood of observed node $u$ and define $N_{2}(u)=\{i:\exists\ j,J_{ij},J_{uj}\neq 0\}$ to be the two-hop graph-theoretic neighborhood. We also define $N^{mkv}_{2}(u)$ to be the two-hop Markov neighborhood, that is, the smallest set $S\subseteq V_{obs}\backslash\{u\}$ such that conditioned on $X_{S}$ , $X_{u}$ is independent of $X_{v}$ for all $v\in V_{obs}\backslash(S\cup\{u\})$ .

Our objective is to recover the two-hop Markov neighborhood of each observed variable. In our setting, this will correspond to the two-hop graph-theoretic neighborhood of each observed variable.

Remark.

We can WLOG assume $J\geq 0$ since if there exists $j$ such that $J_{ij}\leq 0$ for all $i$ (locally consistent) then we can map $Y_{j}\rightarrow-Y_{j}$ without affecting the marginal on $X$ and the model is ferromagnetic at $j$ . The change of variable will reverse the external field at $j$ however since we do not make any assumption on the sign of the external field, our model assumptions still hold. We can repeat this for all such $j$ and the model can therefore be made globally ferromagnetic. We will subsequently assume that $J\geq 0$ .
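The sign-flip argument in this remark can be checked numerically on a toy model. The sketch below, with hypothetical helper names `observed_marginal` and `flip_latent` and made-up parameters, negates column $j$ of $J$ together with $g_{j}$ and confirms that the marginal on $X$ is unchanged.

```python
import itertools
import math

def observed_marginal(J, h, g):
    """Exact Pr[X = x] for a small RBM, summing out the latent variables."""
    n, m = len(h), len(g)
    weights = {}
    for x in itertools.product((-1, 1), repeat=n):
        total = 0.0
        for y in itertools.product((-1, 1), repeat=m):
            exponent = sum(x[i] * J[i][j] * y[j] for i in range(n) for j in range(m))
            exponent += sum(h[i] * x[i] for i in range(n))
            exponent += sum(g[j] * y[j] for j in range(m))
            total += math.exp(exponent)
        weights[x] = total
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

def flip_latent(J, g, j):
    """Apply the change of variable Y_j -> -Y_j: negate column j of J and g_j."""
    J_new = [row[:] for row in J]
    for row in J_new:
        row[j] = -row[j]
    g_new = list(g)
    g_new[j] = -g_new[j]
    return J_new, g_new

# Toy model (assumed values): latent node 1 is anti-ferromagnetic.
J = [[0.4, -0.3], [0.0, -0.6], [0.7, 0.0]]
h = [0.2, -0.5, 0.1]
g = [-0.3, 0.4]
J_ferro, g_flipped = flip_latent(J, g, 1)
before = observed_marginal(J, h, g)
after = observed_marginal(J_ferro, h, g_flipped)
```

Only the sign of $g_{1}$ changes, which is harmless here because no assumption is made on the signs of the external fields.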

3 Conditional Covariance

In this section we present our main structural result. We show that for two observed nodes sharing a common latent neighbor, the covariance is positive and bounded away from 0. The main motivation to believe that such a structural result holds is the famous FKG inequality [13, 16] which states that for ferromagnetic Ising models with arbitrary external field the covariance of any two nodes is non-negative.

Define the conditional covariance for observed nodes $u,v\in V_{obs}$ and a subset of observed nodes $S\subseteq V_{obs}\backslash\{u,v\}$ with configuration $x_{S}$ as follows,

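$$\mathsf{Cov}(u,v|X_{S}=x_{S}):=\mathbb{E}[X_{u}X_{v}|X_{S}=x_{S}]-\mathbb{E}[X_{u}|X_{S}=x_{S}]\,\mathbb{E}[X_{v}|X_{S}=x_{S}].$$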

We also define the notion of average conditional covariance as follows, $\mathsf{Cov}^{\mathsf{avg}}(u,v|S)=\mathbb{E}_{x_{S}}[\mathsf{Cov}(u,v|X_{S}=x_{S})]$ . We will prove the following useful property of the conditional covariance:

Lemma 1.

For fixed node $u$ and any fixed subset of observed nodes $S\subseteq V_{obs}\backslash\{u\}$ with configuration $x_{S}$ , then for all $v\in N_{2}(u)\backslash S$ ,

$$\mathsf{Cov}(u,v|X_{S}=x_{S})\geq\alpha^{2}\exp(-12\lambda).$$

Proof.

It is easy to verify that on conditioning over a set of observed variables ( $X_{S}=x_{S}$ ), an $(\alpha,\lambda)$ -locally consistent RBM remains an $(\alpha,\lambda)$ -locally consistent RBM. Moreover, the edges between the remaining nodes remain the same with the same edge weights. Thus, we can restrict to looking at $S=\emptyset$ . Also, we will WLOG assume $J\geq 0$ as discussed before.

Consider the direct sum of two RBMs, $G\oplus G$ , with two copies of $G$ such that the probability of a configuration under this new distribution $\mathcal{D}$ is

$$\Pr[X=x,Y=y,X^{\prime}=x^{\prime},Y^{\prime}=y^{\prime}]\propto\exp\left(x^{T}Jy+h^{T}x+g^{T}y+x^{\prime T}Jy^{\prime}+h^{T}x^{\prime}+g^{T}y^{\prime}\right)$$

Define $X^{-}_{i}=\frac{X_{i}-X^{\prime}_{i}}{\sqrt{2}},Y^{-}_{i}=\frac{Y_{i}-Y^{\prime}_{i}}{\sqrt{2}}$ and $X^{+}_{i}=\frac{X_{i}+X^{\prime}_{i}}{\sqrt{2}},Y^{+}_{i}=\frac{Y_{i}+Y^{\prime}_{i}}{\sqrt{2}}$ . Then we have

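$$\begin{aligned}
&\Pr[X=x,Y=y,X^{\prime}=x^{\prime},Y^{\prime}=y^{\prime}]\\
&\quad\propto\exp\left(x^{T}Jy+h^{T}x+g^{T}y+x^{\prime T}Jy^{\prime}+h^{T}x^{\prime}+g^{T}y^{\prime}\right)\\
&\quad=\exp\Big(\tfrac{1}{2}\left(x^{T}Jy+x^{\prime T}Jy+x^{T}Jy^{\prime}+x^{\prime T}Jy^{\prime}\right)+\tfrac{1}{2}\left(x^{T}Jy-x^{\prime T}Jy-x^{T}Jy^{\prime}+x^{\prime T}Jy^{\prime}\right)\\
&\qquad\qquad+h^{T}(x+x^{\prime})+g^{T}(y+y^{\prime})\Big)\\
&\quad=\exp\left((x^{+})^{T}Jy^{+}+(x^{-})^{T}Jy^{-}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right).
\end{aligned}$$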

Observe that $\Pr[X=x,Y=y,X^{\prime}=x^{\prime},Y^{\prime}=y^{\prime}]=\Pr[X=x^{\prime},Y=y^{\prime},X^{\prime}=x,Y^{\prime}=y]=\Pr[X=x,Y=y]\Pr[X^{\prime}=x^{\prime},Y^{\prime}=y^{\prime}]$ . Thus under this transformation, we have

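$$\begin{aligned}
\mathsf{Cov}(u,v)&=\mathbb{E}_{\mathcal{D}}[X_{u}X_{v}]-\mathbb{E}_{\mathcal{D}}[X_{u}X^{\prime}_{v}]=\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X^{\prime}_{v}]-\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X_{v}]\\
&=\frac{1}{2}\left(\mathbb{E}_{\mathcal{D}}[X_{u}X_{v}]+\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X^{\prime}_{v}]-\mathbb{E}_{\mathcal{D}}[X^{\prime}_{u}X_{v}]-\mathbb{E}_{\mathcal{D}}[X_{u}X^{\prime}_{v}]\right)\\
&=\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[(X_{u}-X^{\prime}_{u})(X_{v}-X^{\prime}_{v})\right]\\
&=\mathbb{E}_{\mathcal{D}}[X^{-}_{u}X^{-}_{v}]\\
&=\frac{\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}x^{-}_{u}x^{-}_{v}\exp\left((x^{+})^{T}Jy^{+}+(x^{-})^{T}Jy^{-}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right)}{\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\exp\left((x^{+})^{T}Jy^{+}+(x^{-})^{T}Jy^{-}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right)}.
\end{aligned}$$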
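The duplication step rests on the identity $\mathsf{Cov}(u,v)=\mathbb{E}_{\mathcal{D}}[X^{-}_{u}X^{-}_{v}]$ for two independent copies $X,X^{\prime}$ . It holds for any distribution on $\{\pm 1\}^{n}$ , not only RBM marginals, so the sketch below checks it on an arbitrary made-up distribution. The function names and weights are illustrative assumptions.

```python
import itertools
import math

def covariance(p, u, v):
    """Cov(X_u, X_v) under the distribution p over tuples in {-1, 1}^n."""
    e_uv = sum(px * x[u] * x[v] for x, px in p.items())
    e_u = sum(px * x[u] for x, px in p.items())
    e_v = sum(px * x[v] for x, px in p.items())
    return e_uv - e_u * e_v

def duplicated_moment(p, u, v):
    """E_D[X_u^- X_v^-], where X and X' are independent draws from p."""
    total = 0.0
    for x, px in p.items():
        for xp, pxp in p.items():
            xm_u = (x[u] - xp[u]) / math.sqrt(2)
            xm_v = (x[v] - xp[v]) / math.sqrt(2)
            total += px * pxp * xm_u * xm_v
    return total

# Arbitrary (assumed) unnormalized weights on the 8 points of {-1, 1}^3.
configs = list(itertools.product((-1, 1), repeat=3))
raw = [1.0, 2.0, 0.5, 3.0, 1.5, 0.2, 4.0, 0.8]
p = {x: w / sum(raw) for x, w in zip(configs, raw)}
```

The $1/\sqrt{2}$ scaling in $X^{-}$ is what turns the factor $\frac{1}{2}$ in the third line of the derivation into a plain product.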

Now we will bound the numerator (N) and denominator (D) separately. Since $v\in N_{2}(u)$ , there exists $k$ such that $J_{uk},J_{vk}\neq 0$ . Let $\gamma(x^{-},y^{-})=\exp((x^{-})^{T}Jy^{-}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k})$ and $\Delta(x^{+},y^{+})=\exp((x^{+})^{T}Jy^{+}+\sqrt{2}h^{T}x^{+}+\sqrt{2}g^{T}y^{+})$ . We have,

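$$\begin{aligned}
N&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}x^{-}_{u}x^{-}_{v}\exp\left((x^{-}_{u}J_{uk}+x^{-}_{v}J_{vk})y^{-}_{k}\right)\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\\
&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\sum_{i=0}^{\infty}x^{-}_{u}x^{-}_{v}\left(\frac{(x^{-}_{u}J_{uk}+x^{-}_{v}J_{vk})^{i}(y^{-}_{k})^{i}}{i!}\right)\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\\
&=\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\sum_{i=0}^{\infty}\sum_{j=0}^{i}\frac{1}{i!}\binom{i}{j}J_{uk}^{j}J_{vk}^{i-j}(x^{-}_{u})^{j+1}(x^{-}_{v})^{i+1-j}(y^{-}_{k})^{i}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})
\end{aligned}$$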

The following lemma is the main observation to bound the above term, it shows that each term in the summation is non-negative.

Lemma 2.

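For all $A\in\mathbb{Z}_{+}^{n},B\in\mathbb{Z}_{+}^{m}$ and function $f$ over $x^{+},y^{+}$ such that $f\geq 0$ ,

$$\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}\prod_{a\in[n]}(x^{-}_{a})^{A_{a}}\prod_{b\in[m]}(y^{-}_{b})^{B_{b}}f(x^{+},y^{+})\geq 0.$$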

Proof.

Observe that for any $i\in[n]$ , exchanging $x_{i}\leftrightarrow x^{\prime}_{i}$ does not change the summation; however, it maps $x^{-}_{i}\rightarrow-x^{-}_{i}$ while leaving $x^{+}_{i}$ unchanged. Thus, if $A_{i}$ is odd, then the summation will be 0. Therefore, for the term to be non-zero, for all $i\in[n]$ , $A_{i}$ must be even. Similarly, for all $j\in[m]$ , $B_{j}$ must be even. Now since $f\geq 0$ and there are only even powers, the summation must be non-negative. ∎
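The symmetry argument can be seen on a tiny instance. The following sketch restricts attention to the observed coordinates (the latent ones behave identically) and sums $\prod_{a}(x^{-}_{a})^{A_{a}}f(x^{+})$ over all pairs $x,x^{\prime}$ ; the function name `symmetric_sum` and the particular non-negative $f$ are assumptions made for illustration.

```python
import itertools
import math

def symmetric_sum(A, f):
    """Sum over x, x' in {-1, 1}^n of prod_a (x_a^-)^{A_a} * f(x^+)."""
    n = len(A)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        for xp in itertools.product((-1, 1), repeat=n):
            x_minus = [(a - b) / math.sqrt(2) for a, b in zip(x, xp)]
            x_plus = [(a + b) / math.sqrt(2) for a, b in zip(x, xp)]
            # Python evaluates 0.0 ** 0 as 1.0, matching the usual convention.
            monomial = math.prod(xm ** e for xm, e in zip(x_minus, A))
            total += monomial * f(x_plus)
    return total

def f(x_plus):
    """An arbitrary non-negative function of x^+ (assumed for the demo)."""
    return math.exp(0.3 * x_plus[0] - 0.7 * x_plus[1])

odd_case = symmetric_sum((1, 2), f)   # one odd exponent: terms cancel in pairs
even_case = symmetric_sum((2, 2), f)  # all exponents even: every term is >= 0
```

In the even case only the four pairs with $x_{a}\neq x^{\prime}_{a}$ in both coordinates contribute, each with $x^{+}=0$ , which is why the value does not depend on the coefficients inside $f$ .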

It is easy to see that $\gamma(x^{-},y^{-})$ can be expanded as a multivariate polynomial over $x^{-},y^{-}$ with non-negative coefficients (since $J\geq 0$ ). Indeed, $\gamma$ is the exponential of a polynomial with non-negative coefficients, so the Taylor expansion of $e^{a}$ yields a polynomial whose coefficients are all non-negative. Therefore, applying Lemma 2, we have for all $i\geq j$ ,

$$\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}(x^{-}_{u})^{j+1}(x^{-}_{v})^{i+1-j}(y^{-}_{k})^{i}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\geq 0.$$

This implies that the covariance is indeed non-negative.

Now we will show that in fact the covariance is at least a constant independent of $n$ . Since all terms are non-negative, we can lower bound the numerator by the term corresponding to $i=2$ and $j=1$ . This yields only squares of $x^{-}_{u},x^{-}_{v},y^{-}_{k}$ as follows,

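$$\begin{aligned}
N&\geq J_{uk}J_{vk}\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}(x^{-}_{u})^{2}(x^{-}_{v})^{2}(y^{-}_{k})^{2}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})\\
&\geq\alpha^{2}\sum_{\substack{x,x^{\prime}\in\{\pm 1\}^{n}\\ y,y^{\prime}\in\{\pm 1\}^{m}}}(x^{-}_{u})^{2}(x^{-}_{v})^{2}(y^{-}_{k})^{2}\gamma(x^{-},y^{-})\Delta(x^{+},y^{+}).
\end{aligned}$$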

Here the second inequality follows from noting that by our assumption $J_{uk},J_{vk}\neq 0$ and hence must be at least $\alpha$ . Lastly we bound $\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})$ . Define $L:=V_{obs}\backslash\{u,v\}$ and $R:=V_{lat}\backslash\{k\}$ . We have

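$$\begin{aligned}
\gamma(x^{-},y^{-})\Delta(x^{+},y^{+})&=\exp\left((x^{-})^{T}Jy^{-}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k}+(x^{+})^{T}Jy^{+}+\sqrt{2}\,h^{T}x^{+}+\sqrt{2}\,g^{T}y^{+}\right)\\
&=\exp\left((x^{-}_{L})^{T}J(L,R)y^{-}_{R}+(x^{+}_{L})^{T}J(L,R)y^{+}_{R}+\sqrt{2}\,h_{L}^{T}x^{+}_{L}+\sqrt{2}\,g_{R}^{T}y^{+}_{R}\right)\\
&\quad\times\exp\left(x^{-}_{u}J(\{u\},R)y^{-}_{R}+x^{+}_{u}J(\{u\},R)y^{+}_{R}+\sqrt{2}\,h_{u}x^{+}_{u}\right)\\
&\quad\times\exp\left(x^{-}_{v}J(\{v\},R)y^{-}_{R}+x^{+}_{v}J(\{v\},R)y^{+}_{R}+\sqrt{2}\,h_{v}x^{+}_{v}\right)\\
&\quad\times\exp\left(x^{-}_{L}J(L,\{k\})y^{-}_{k}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k}+x^{+}J(V_{obs},\{k\})y^{+}_{k}+\sqrt{2}\,g_{k}y^{+}_{k}\right)
\end{aligned}$$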

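Here $x_{T}$ ( $y_{T}$ ) denotes the restriction of $x$ ( $y$ ) to the indices in $T$ , and similarly $J(T_{1},T_{2})$ denotes the sub-matrix obtained by restricting $J$ to the rows and columns indexed by $T_{1},T_{2}$ respectively. Each of the last three terms in the product can be bounded in $[\exp(-2\lambda),\exp(2\lambda)]$ . Observe that

$$\begin{aligned}
&\exp\left(x^{-}_{u}J(\{u\},R)y^{-}_{R}+x^{+}_{u}J(\{u\},R)y^{+}_{R}+\sqrt{2}\,h_{u}x^{+}_{u}\right)\\
&\quad=\exp\left(x_{u}J(\{u\},R)y_{R}+x^{\prime}_{u}J(\{u\},R)y^{\prime}_{R}+h_{u}(x_{u}+x^{\prime}_{u})\right)\\
&\quad\geq\exp\Big(-2\Big(\sum_{j\in R}|J_{uj}|+|h_{u}|\Big)\Big)\geq\exp(-2\lambda)
\end{aligned}$$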

Similarly we can bound $\exp\left(x^{-}_{v}J({\{v\},R})y^{-}_{R}+x^{+}_{v}J(\{v\},R)y^{+}_{R}+\sqrt{2}h_{v}x^{+}_{v}\right)\geq\exp(-2\lambda)$ . As for the last term, we have

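$$\begin{aligned}
&\exp\left(x^{-}_{L}J(L,\{k\})y^{-}_{k}-x^{-}_{u}J_{uk}y^{-}_{k}-x^{-}_{v}J_{vk}y^{-}_{k}+x^{+}J(V_{obs},\{k\})y^{+}_{k}+\sqrt{2}\,g_{k}y^{+}_{k}\right)\\
&\quad=\exp\left(x_{L}J(L,\{k\})y_{k}+x^{\prime}_{L}J(L,\{k\})y^{\prime}_{k}+x_{u}J_{uk}y^{\prime}_{k}+x^{\prime}_{u}J_{uk}y_{k}+x_{v}J_{vk}y^{\prime}_{k}+x^{\prime}_{v}J_{vk}y_{k}+g_{k}(y_{k}+y^{\prime}_{k})\right)\\
&\quad\geq\exp\Big(-2\Big(\sum_{i\in V_{obs}}|J_{ik}|+|g_{k}|\Big)\Big)\geq\exp(-2\lambda)
\end{aligned}$$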

Now, setting

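$$\rho(L,R):=\sum_{\substack{x_{L},x^{\prime}_{L}\in\{\pm 1\}^{|L|}\\ y_{R},y^{\prime}_{R}\in\{\pm 1\}^{|R|}}}\exp\left((x^{-}_{L})^{T}J(L,R)y^{-}_{R}+(x^{+}_{L})^{T}J(L,R)y^{+}_{R}+\sqrt{2}\,h_{L}^{T}x^{+}_{L}+\sqrt{2}\,g_{R}^{T}y^{+}_{R}\right),$$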

we have

[TABLE]

Here the second equality follows from observing that $\sum\limits_{x_{u},x^{\prime}_{u}\in\{\pm 1\}}(x^{-}_{u})^{2}=4$ (similarly for $x^{-}_{v}$ and $y^{-}_{k}$ ). Similarly, the denominator can be bounded as follows,
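As a worked check of the count used here, enumerate the four values of $(x_{u},x^{\prime}_{u})$ with $x^{-}_{u}=(x_{u}-x^{\prime}_{u})/\sqrt{2}$ ; the same computation applies to $x^{-}_{v}$ and $y^{-}_{k}$ , so the three sums together contribute $4^{3}=2^{6}$ .

```latex
\[
\sum_{x_u, x'_u \in \{\pm 1\}} (x_u^-)^2
  = \underbrace{0}_{(1,1)}
  + \underbrace{2}_{(1,-1)}
  + \underbrace{2}_{(-1,1)}
  + \underbrace{0}_{(-1,-1)}
  = 4,
\qquad
\Big(\sum_{x_u, x'_u} (x_u^-)^2\Big)
\Big(\sum_{x_v, x'_v} (x_v^-)^2\Big)
\Big(\sum_{y_k, y'_k} (y_k^-)^2\Big) = 4^3 = 2^6 .
\]
```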

[TABLE]

Combining, we have $\mathsf{Cov}(u,v)\geq\alpha^{2}\exp(-12\lambda)$ . ∎
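A quick numerical sanity check of the bound $\mathsf{Cov}(u,v)\geq\alpha^{2}\exp(-12\lambda)$ on a toy ferromagnetic RBM with mixed-sign external fields is sketched below. The instance and the helper names `observed_covariance` and `lemma_lower_bound` are assumptions for illustration; $\alpha$ and $\lambda$ are computed as the tightest values satisfying Definition 1.

```python
import itertools
import math

def observed_covariance(J, h, g, u, v):
    """Exact Cov(X_u, X_v) in a small RBM, by enumerating all configurations."""
    n, m = len(h), len(g)
    Z = e_u = e_v = e_uv = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        for y in itertools.product((-1, 1), repeat=m):
            exponent = sum(x[i] * J[i][j] * y[j] for i in range(n) for j in range(m))
            exponent += sum(h[i] * x[i] for i in range(n))
            exponent += sum(g[j] * y[j] for j in range(m))
            w = math.exp(exponent)
            Z += w
            e_u += w * x[u]
            e_v += w * x[v]
            e_uv += w * x[u] * x[v]
    return e_uv / Z - (e_u / Z) * (e_v / Z)

def lemma_lower_bound(J, h, g):
    """alpha^2 * exp(-12 * lambda) with the tightest alpha, lambda for (J, h, g)."""
    n, m = len(h), len(g)
    alpha = min(abs(J[i][j]) for i in range(n) for j in range(m) if J[i][j] != 0)
    lam = max(
        max(sum(abs(J[i][j]) for j in range(m)) + abs(h[i]) for i in range(n)),
        max(sum(abs(J[i][j]) for i in range(n)) + abs(g[j]) for j in range(m)),
    )
    return alpha ** 2 * math.exp(-12 * lam)

# Two observed nodes sharing one latent neighbor; fields have inconsistent signs.
J = [[0.5], [0.5]]
h = [1.0, -1.0]
g = [-0.5]
cov = observed_covariance(J, h, g, 0, 1)
bound = lemma_lower_bound(J, h, g)
```

On this instance the true covariance is far above the bound, which is consistent with the bound being loose in $\lambda$ .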

Corollary 1.

For $u\neq v\in V_{obs}$ such that there exists $k\in V_{lat}$ with $(u,k),(v,k)\in E$ and a subset of observed nodes $S\subseteq V_{obs}\backslash\{u,v\}$ , $\mathsf{Cov}^{\mathsf{avg}}(u,v|X_{S})\geq{\alpha^{2}}\exp(-12\lambda)$ .

Proof.

Since for any $X_{S}=x_{S}$ , by Lemma 2, the covariance is bounded below by ${\alpha^{2}}\exp(-12\lambda)$ , the expectation over $x_{S}$ is also bounded below by the same quantity. ∎

Remark.

Observe that the above lemma also shows that $N_{2}(u)\subseteq N^{mkv}_{2}(u)$ . It is not hard to see that $N^{mkv}_{2}(u)\subseteq N_{2}(u)$ by the structure of the RBM therefore $N_{2}(u)=N^{mkv}_{2}(u)$ .

Remark.

The key structural result can be extended to the setting in which there are edges between hidden and observed variables using the same techniques, however now the bound will depend on the length of the shortest path connecting two observed nodes similar to [3].

4 Algorithm

In this section we present the main algorithm (Algorithm 1) and a proof of its correctness. Our algorithm and analysis are similar to the influence maximization algorithms for learning Ising models as in [2]. However, instead of maximizing influence, our algorithm exploits the key property to maximize conditional covariance. For completeness, we give the full proof.

Theorem 2.

Consider $M$ samples drawn from an $(\alpha,\lambda)$ -locally consistent RBM, $X^{(1)},\ldots,X^{(M)}$ . For $\tau=\frac{\alpha^{2}}{2}\exp(-12\lambda)$ , with probability $1-\zeta$ , $\textsc{LearnRBMNbhd}(X^{(1)},\ldots,X^{(M)},\tau,u)$ outputs exactly the two-hop neighborhood of each observed variable $u$ as long as

[TABLE]

Moreover, the algorithm runs in time $O(T^{*}Mn)$ for each node $u$ .

Proof.

The proof follows along the same lines as [2]. We will first show that our estimates of conditional covariance are close to the true values with the given $M$ samples. We will then show that after $T$ iterations, set $S$ contains a superset of the two-hop neighbors. Lastly we will show that our refining step removes all nodes except the two-hop neighbors. This will complete our proof.
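Algorithm 1 itself is not reproduced in this text, so the following is only a sketch reconstructed from the proof outline: greedily add the node with the largest average conditional covariance while it is at least $\tau$ , then prune nodes whose covariance given the rest of $S$ falls below $\tau$ . As a simplifying assumption it queries an exact covariance oracle computed by enumeration rather than empirical estimates from samples; all function names and the toy chain-shaped RBM are hypothetical.

```python
import itertools
import math

def observed_marginal(J, h, g):
    """Exact Pr[X = x] for a small RBM, summing out the latent variables."""
    n, m = len(h), len(g)
    weights = {}
    for x in itertools.product((-1, 1), repeat=n):
        total = 0.0
        for y in itertools.product((-1, 1), repeat=m):
            exponent = sum(x[i] * J[i][j] * y[j] for i in range(n) for j in range(m))
            exponent += sum(h[i] * x[i] for i in range(n))
            exponent += sum(g[j] * y[j] for j in range(m))
            total += math.exp(exponent)
        weights[x] = total
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

def avg_cond_cov(p, u, v, S):
    """Cov^avg(u, v | S) = E_{x_S}[Cov(u, v | X_S = x_S)] under marginal p."""
    total = 0.0
    for x_S in itertools.product((-1, 1), repeat=len(S)):
        mass = e_u = e_v = e_uv = 0.0
        for x, px in p.items():
            if all(x[s] == val for s, val in zip(S, x_S)):
                mass += px
                e_u += px * x[u]
                e_v += px * x[v]
                e_uv += px * x[u] * x[v]
        if mass > 0:
            # Equals mass * (E[X_u X_v | x_S] - E[X_u | x_S] * E[X_v | x_S]).
            total += e_uv - e_u * e_v / mass
    return total

def learn_rbm_nbhd(u, n, tau, cov):
    """Greedy growth of S followed by pruning, using the oracle cov(u, v, S)."""
    S = []
    while True:
        candidates = [v for v in range(n) if v != u and v not in S]
        if not candidates:
            break
        best = max(candidates, key=lambda v: cov(u, v, S))
        if cov(u, best, S) < tau:
            break
        S.append(best)
    return {v for v in S if cov(u, v, [w for w in S if w != v]) >= tau}

# Chain: obs 0 - lat 0 - obs 1 - lat 1 - obs 2, with alpha = 0.5, lambda = 1.3.
J = [[0.5, 0.0], [0.5, 0.5], [0.0, 0.5]]
h = [0.2, -0.1, 0.3]
g = [-0.3, 0.2]
alpha, lam = 0.5, 1.3
tau = 0.5 * alpha ** 2 * math.exp(-12 * lam)
p = observed_marginal(J, h, g)
cov = lambda u, v, S: avg_cond_cov(p, u, v, S)
neighborhoods = {u: learn_rbm_nbhd(u, 3, tau, cov) for u in range(3)}
```

With samples instead of an oracle, `cov` would be replaced by the empirical estimate $\widehat{\mathsf{Cov}}^{\mathsf{avg}}$ , and the guarantee would hold only on the event that all estimates are within $\tau/2$ of the truth.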

Closeness of Estimates.

Denote by $\mathcal{A}(l,\epsilon)$ the event such that for all $u,v$ and $S$ with $|S|\leq l$ , simultaneously, $\left|\widehat{\mathsf{Cov}}^{\mathsf{avg}}(u,v|S)-\mathsf{Cov}^{\mathsf{avg}}(u,v|S)\right|\leq\epsilon$ .

Lemma 1.

For fixed $l,\epsilon,\zeta\geq 0$ , if the number of samples is $\Omega\left(\left(\log(1/\zeta)+l\log(n)\right)\frac{2^{2l}}{\epsilon^{2}\delta^{2l}}\right)$ , then $\Pr[\mathcal{A}(l,\epsilon)]\geq 1-\zeta$ .

We defer the proof of the above lemma to the appendix. Choosing $M=\Omega\left(\left(\log(1/\zeta)+T^{*}\log(n)\right)\frac{2^{2T^{*}}}{\tau^{2}\delta^{2T^{*}}}\right)$ , the event $A:=\mathcal{A}(T^{*},\tau/2)$ holds for $T^{*}=8/\tau^{2}$ with probability $1-\zeta$ . From now on we assume $A$ holds.

Entropy Gain.

We will show that the conditional mutual information is bounded below by a function of the average conditional covariance thus at each iteration of the algorithm we are increasing the overall entropy of $X_{u}$ .

Lemma 2.

For $u\neq v\in V_{obs}$ and a subset of observed nodes $S\subseteq V_{obs}\backslash\{u,v\}$ with configuration $x_{S}$ ,

[TABLE]

Proof.

We have

[TABLE]

Here the first inequality follows from Jensen’s inequality, the second from Pinsker’s inequality, and the rest from simple algebraic manipulations. ∎

Upper Bound on Size of $S$ .

We will show that $|S|\leq T^{*}$ . Let the sequence of added nodes be $i_{1},\ldots,i_{T}$ for some $T$ and $S_{l}=\{i_{1},\ldots,i_{l}\}$ for $1\leq l\leq T$ . For each $j\in[T]$ , we have $\widehat{\mathsf{Cov}}^{\mathsf{avg}}(u;i_{j}|X_{S_{j}})\geq\tau$ (by Step 3). If $T>T^{*}$ , then we have $\mathsf{Cov}^{\mathsf{avg}}(u;i_{j}|X_{S_{j}})\geq\tau/2$ for all $j\leq T^{*}+1$ (since $A$ holds). Thus we have,

[TABLE]

Here the inequalities follow from standard properties of entropy and mutual information. This leads to a contradiction since $T^{*}=\frac{8}{\tau^{2}}$ . Thus, we have $T\leq T^{*}$ . Observe that each iteration requires $O(Mn)$ time and at most $T^{*}$ iterations take place prior to pruning. Also pruning takes $O(Mn)$ time, giving us a total runtime of $O(T^{*}Mn)$ .
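To see how the parameters enter, the sketch below evaluates the threshold $\tau=\frac{\alpha^{2}}{2}\exp(-12\lambda)$ from Theorem 2 and the iteration cap $T^{*}=8/\tau^{2}=32\,\alpha^{-4}\exp(24\lambda)$ . The function names are illustrative, not from the paper.

```python
import math

def threshold(alpha, lam):
    """tau = (alpha^2 / 2) * exp(-12 * lambda), as in Theorem 2."""
    return 0.5 * alpha ** 2 * math.exp(-12 * lam)

def max_iterations(alpha, lam):
    """T* = 8 / tau^2, the cap on the size of S before pruning."""
    return 8 / threshold(alpha, lam) ** 2

# T* grows like exp(24 * lambda); the sample bound then carries a 2^(2 T*) factor.
growth = [max_iterations(0.5, lam) for lam in (0.25, 0.5, 1.0)]
```

Since $T^{*}$ is already exponential in $\lambda$ and the sample bound is exponential in $T^{*}$ , this is where the doubly exponential dependence on $\lambda$ comes from.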

Recovery of Two-hop Neighborhood.

We will show that $N_{2}(u)\subseteq S$ . Suppose $N_{2}(u)\not\subseteq S$ ; then there exists $v\in N_{2}(u)\backslash S$ . By Lemma 1, we know that $\mathsf{Cov}^{\mathsf{avg}}(u,v|X_{S})\geq{\alpha^{2}}\exp(-12\lambda)=2\tau$ . Since $A$ holds and $|S|\leq 8/\tau^{2}$ , we have $\widehat{\mathsf{Cov}}^{\mathsf{avg}}(u,v|X_{S})\geq 3\tau/2$ , thus the algorithm would not have terminated. This is a contradiction, thus $N_{2}(u)\subseteq S$ before pruning.

Now if $v\in S$ but $v\not\in N_{2}(u)$ , then $\mathsf{Cov}(u,v|X_{S\backslash\{v\}})=0$ , since conditional on the two-hop neighborhood $X_{u}$ and $X_{v}$ are independent; therefore $v$ will be removed. Whereas, by Lemma 1, if $v\in N_{2}(u)$ then $\mathsf{Cov}(u,v|X_{S\backslash\{v\}})\geq 2\tau$ and our test will not remove it (the covariance estimates are correct to within $\tau/2$ ). Thus we will exactly obtain the neighborhood at the end of the algorithm. ∎

5 Hardness of Learning General RBMs

In this section we will discuss why our model does not violate the hardness result stated in [3]. The hardness result in the paper reduces the problem of learning sparse parities with noise over the uniform distribution to the problem of structure recovery of a RBM.

Suppose $S\subseteq[n]$ is the subset on which the parity problem is defined. The main technique used for the reduction is the observation from [9] that the joint distribution on the input and noisy parity $(x,y)$ can be represented as a single term MRF (term $y\prod_{i\in S}x_{i}$ ). Further [3] showed that every MRF can be represented as a RBM with sufficiently many hidden units. Here we show that even if the external fields are arbitrary, any ferromagnetic RBM when expressed as an MRF has pairwise potentials for every two-hop neighbor pair. This implies that it cannot represent the MRF corresponding to the noisy parity.

Lemma 1 ([3]).

Given a RBM, with $\rho(a)=\log(\exp(a)+\exp(-a))$ , we have,

[TABLE]

Let us look at the potential corresponding to $k\in V_{lat}$ , $\rho(x^{T}J_{V_{obs},\{k\}}+g_{k})$ . We will show that when you expand the term over the monomial basis, the coefficient corresponding to $x_{i}x_{j}$ for any $i,j\in V_{obs}$ is non-negative and the coefficient corresponding to $x_{u}x_{v}$ for $u,v\in V_{obs}$ such that $k\in N(u)\cap N(v)$ is strictly positive. More formally,

Lemma 2.

$f(x)=\sum_{j=1}^{m}\rho(x^{T}J({V_{obs},\{j\}})+g_{j})+h^{T}x$ , when expressed in the monomial basis with coefficients $\widehat{f}_{S}$ for every monomial $S$ , satisfies: $\widehat{f}_{\{i,j\}}\geq 0$ for all $i,j\in V_{obs}$ ; moreover, $\widehat{f}_{\{i,j\}}>0$ for $i,j$ such that $i\in N_{2}(j)$ .

We defer the proof of the above lemma to the appendix. Since we sum such potentials, this positive coefficient cannot be canceled and $f$ cannot represent the parity MRF as in the reduction. This raises the question of understanding the exact class of RBMs for which the hardness result truly holds.
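The claim about pairwise coefficients can be checked by brute force on a single latent node. The sketch below computes the coefficient of $x_{i}x_{j}$ in $\rho(x^{T}J(V_{obs},\{k\})+g_{k})$ as $2^{-n}\sum_{x}\rho(\cdot)\,x_{i}x_{j}$ ; the function names and the toy column of $J$ are assumptions.

```python
import itertools
import math

def rho(a):
    """rho(a) = log(exp(a) + exp(-a))."""
    return math.log(math.exp(a) + math.exp(-a))

def pair_coefficient(J_col, g_k, i, j):
    """Coefficient of x_i * x_j in rho(x^T J_col + g_k) over {-1, 1}^n."""
    n = len(J_col)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        a = sum(J_col[t] * x[t] for t in range(n)) + g_k
        total += rho(a) * x[i] * x[j]
    return total / 2 ** n

# Latent node k touches observed nodes 0 and 1 but not node 2; g_k is negative.
J_col = [0.5, 0.5, 0.0]
g_k = -0.7
shared = pair_coefficient(J_col, g_k, 0, 1)    # both are neighbors of k
unshared = pair_coefficient(J_col, g_k, 0, 2)  # node 2 is not a neighbor of k
```

The coefficient for the pair sharing $k$ stays strictly positive despite the negative field, while the pair that does not share $k$ gets coefficient zero from this potential.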

6 Conclusions and Open Problems

In this work we presented a key structural property of locally consistent RBMs with arbitrary external fields and subsequently showed how to use this property to iteratively build the two-hop neighborhood of each node. Our algorithm runs in optimal time and sample complexity in terms of the dimension; however, it pays doubly exponentially in the upper bound on the weights. This seems to be an artifact of the approach of maximizing influence in general, whereas algorithms using convex optimization are able to avoid this dependence for fully-observed graphical models. A natural open question is to improve this dependency, potentially using tools from convex optimization. Alternatively, proving a stronger structural result such as weak-submodularity could lead to the correct dependency. More broadly, understanding the most expressive class of RBMs that allow efficient structure learning while not violating the hardness result is a worthwhile future direction to pursue.

Acknowledgements.

The author would like to thank Sumegha Garg and Jessica Hoffmann for comments on the initial draft, and Adam Klivans, Frederic Koehler and Josh Vekhter for useful discussions.

Appendix A Omitted Proofs

Proof of Lemma 1.

The proof follows essentially from [2]. Let $m$ denote the number of samples. Using standard concentration inequalities, we know that for any subset $W\subseteq V_{obs}$ and configuration $x_{W}\in\{\pm 1\}^{|W|}$ , we have

[TABLE]

We need the above to hold over all possible choices of $W$ and $x_{W}$ with $|W|\leq l+2$ . There are at most $\sum_{k=1}^{l+2}2^{k}{n\choose k}\leq(l+2)(2n)^{l+2}$ many choices. Thus for $m\geq\frac{\log(2(l+2))+\log(1/\zeta)+(l+2)\log(2n)}{2\gamma^{2}}$ , with probability $1-\zeta$ , for all $W$ and $x_{W}$ with $|W|\leq l+2$ , we have $|\widehat{\Pr}(X_{W}=x_{W})-\Pr(X_{W}=x_{W})|\leq\gamma$ .
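The union-bound sample count above is easy to evaluate directly. The sketch below simply plugs numbers into $m\geq\frac{\log(2(l+2))+\log(1/\zeta)+(l+2)\log(2n)}{2\gamma^{2}}$ ; the function name and the example values are assumptions.

```python
import math

def samples_for_uniform_estimates(l, n, zeta, gamma):
    """Smallest integer m satisfying the union-bound condition stated above."""
    numerator = (
        math.log(2 * (l + 2))
        + math.log(1 / zeta)
        + (l + 2) * math.log(2 * n)
    )
    return math.ceil(numerator / (2 * gamma ** 2))

# Squaring n (100 -> 10_000) less than doubles the requirement: log n dependence.
m_small = samples_for_uniform_estimates(l=2, n=100, zeta=0.05, gamma=0.1)
m_large = samples_for_uniform_estimates(l=2, n=10_000, zeta=0.05, gamma=0.1)
```

The dependence on $n$ is only logarithmic; the expensive part of the overall bound comes from how small $\gamma$ must be chosen relative to $\epsilon$ , $l$ and $\delta$ .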

Now assuming that the above is true, we will show that $\left|\widehat{\mathsf{Cov}}^{\mathsf{avg}}(u,v|S)-\mathsf{Cov}^{\mathsf{avg}}(u,v|S)\right|$ is bounded for all $|S|\leq l$ . We have

[TABLE]

The second term can be bounded as follows,

[TABLE]

Choosing $\gamma\leq\epsilon 2^{-l}\frac{\delta^{l}}{20}$ , we get,

[TABLE]

Thus, we have $\Pr[A(l,\epsilon)]\geq 1-\zeta$ for $m=\Omega\left(\left(\log(1/\zeta)+l\log(n)\right)\frac{2^{2l}}{\epsilon^{2}\delta^{2l}}\right)$ . ∎

Proof of Lemma 2.

Let $c_{ij}$ be the coefficient corresponding to $x_{i}x_{j}$ and $L=V_{obs}\backslash\{i,j\}$ ; then, using the standard Fourier expansion, we have

[TABLE]

Observe that $\exp(a)+\exp(-a)$ is an increasing function of $|a|$ . Since $|J_{ik}+J_{jk}|\geq|J_{ik}-J_{jk}|$ (as $J\geq 0$ ), we have $\exp(2J_{ik}+2J_{jk})+\exp(-2J_{ik}-2J_{jk})\geq\exp(2J_{ik}-2J_{jk})+\exp(-2J_{ik}+2J_{jk})$ . Thus the above term is non-negative. Also notice that if $J_{ik},J_{jk}>0$ then the sum is strictly greater than 0. Thus we have the desired property for $c_{ij}$ . ∎

Bibliography

[1] Animashree Anandkumar, Ragupathyraj Valluvan, et al. Learning loopy graphical models with latent variables: Efficient methods and guarantees. The Annals of Statistics, 41(2):401–435, 2013.
[2] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782. ACM, 2015.
[3] Guy Bresler, Frederic Koehler, Ankur Moitra, and Elchanan Mossel. Learning restricted Boltzmann machines via influence maximization. arXiv preprint arXiv:1805.10262, 2018.
[4] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[5] Leslie Ann Goldberg and Mark Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability and Computing, 16(1):43–61, 2007.
[6] Linus Hamilton, Frederic Koehler, and Ankur Moitra. Information theoretic properties of Markov random fields, and their algorithmic applications. In Advances in Neural Information Processing Systems, pages 2463–2472, 2017.
[7] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[8] Geoffrey E Hinton and Ruslan R Salakhutdinov. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pages 1607–1614, 2009.
