Empirical Bayes Method for Boltzmann Machines

Muneki Yasuda; Tomoyuki Obuchi

arXiv:1906.06002·stat.ML·January 7, 2020

TL;DR

This paper introduces a fast, non-iterative empirical Bayes algorithm for Boltzmann machines that uses the replica method and Plefka expansion to estimate hyperparameters efficiently, despite some bias issues.

Contribution

It proposes a novel, simple, and fast empirical Bayes method for Boltzmann machines that bypasses computational intractability using advanced approximation techniques.

Findings

01

The method is computationally efficient and does not require iterative procedures.

02

It introduces a bias in estimates due to the Plefka expansion.

03

The peculiar bias behavior is linked to the approximation method used.

Abstract

In this study, we consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function referred to as the empirical Bayes likelihood function in this study. However, the maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method does not require any iterative procedures and is quite simple and fast, though it introduces a bias to the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is supposed to be due to the approximate treatment by the Plefka…

Tables (3)

Table 1: Detailed values (the averages and standard deviations) of some plots in Fig. 2 (when $H_{\mathrm{true}}=0$ and $\alpha=0.4$ ).

	$J_{\mathrm{true}}$	0	0.2	0.4	0.6	0.8	1	1.2
$\hat{J}$	$n = 300$	$0.048 \pm 0.06$	$0.20 \pm 0.04$	$0.41 \pm 0.02$	$0.62 \pm 0.02$	$0.82 \pm 0.02$	$0.96 \pm 0.02$	$1.03 \pm 0.02$
$\hat{J}$	$n = 500$	$0.038 \pm 0.05$	$0.20 \pm 0.03$	$0.40 \pm 0.01$	$0.62 \pm 0.01$	$0.82 \pm 0.01$	$0.96 \pm 0.01$	$1.03 \pm 0.01$

Table 2: Detailed values (the averages and standard deviations) of some plots in Fig. 5 (a) (when $H_{\mathrm{true}}=0.2$ and $\alpha=30/n$ ).

	$J_{\mathrm{true}}$	0	0.2	0.4	0.6	0.8	1	1.2
$\hat{J}$	$n = 300$	$0.083 \pm 0.10$	$0.17 \pm 0.12$	$0.38 \pm 0.07$	$0.58 \pm 0.05$	$0.79 \pm 0.06$	$1.05 \pm 0.12$	$1.35 \pm 0.16$
$\hat{J}$	$n = 500$	$0.075 \pm 0.09$	$0.16 \pm 0.11$	$0.38 \pm 0.06$	$0.57 \pm 0.04$	$0.78 \pm 0.06$	$1.05 \pm 0.10$	$1.39 \pm 0.16$

Table 3: Detailed values (the averages and standard deviations) of some plots in Fig. 5 (b) (when $H_{\mathrm{true}}=0.4$ and $\alpha=5/n$ ).

	$J_{\mathrm{true}}$	0	0.2	0.4	0.6	0.8	1	1.2
$\hat{J}$	$n = 300$	$0.15 \pm 0.17$	$0.17 \pm 0.17$	$0.33 \pm 0.19$	$0.53 \pm 0.14$	$0.75 \pm 0.12$	$0.95 \pm 0.14$	$1.22 \pm 0.20$
$\hat{J}$	$n = 500$	$0.12 \pm 0.15$	$0.17 \pm 0.17$	$0.33 \pm 0.17$	$0.55 \pm 0.12$	$0.76 \pm 0.10$	$0.98 \pm 0.11$	$1.20 \pm 0.16$

Equations (163)

P(\bm{S}\mid h,\bm{J}):=\frac{1}{Z(h,\bm{J})}\exp\Big(h\sum_{i=1}^{n}S_{i}+\sum_{i<j}J_{ij}S_{i}S_{j}\Big),

Z(h,\bm{J}):=\sum_{\bm{S}}\exp\Big(h\sum_{i=1}^{n}S_{i}+\sum_{i<j}J_{ij}S_{i}S_{j}\Big),

L_{\mathrm{ML}}(h,\bm{J}):=\frac{1}{nN}\sum_{\mu=1}^{N}\ln P(\mathbf{S}^{(\mu)}\mid h,\bm{J}).

\{\hat{h}_{\mathrm{ML}},\hat{\bm{J}}_{\mathrm{ML}}\}=\operatorname*{arg\,max}_{h,\bm{J}}L_{\mathrm{ML}}(h,\bm{J}).

P_{\mathrm{prior}}(\bm{J}\mid\gamma)

P_{\mathrm{post}}(h,\bm{J}\mid\mathcal{D},H,\gamma)=\frac{P(\mathcal{D}\mid h,\bm{J})P_{\mathrm{prior}}(h\mid H)P_{\mathrm{prior}}(\bm{J}\mid\gamma)}{P(\mathcal{D}\mid H,\gamma)},

P(\mathcal{D}\mid h,\bm{J}):=\prod_{\mu=1}^{N}P(\mathbf{S}^{(\mu)}\mid h,\bm{J}).

\{\hat{h}_{\mathrm{MAP}},\hat{\bm{J}}_{\mathrm{MAP}}\}=\operatorname*{arg\,max}_{h,\bm{J}}L_{\mathrm{MAP}}(h,\bm{J}),

L_{\mathrm{MAP}}(h,\bm{J}):=\frac{1}{nN}\ln P_{\mathrm{post}}(h,\bm{J}\mid\mathcal{D},H,\gamma)=L_{\mathrm{ML}}(h,\bm{J})+\frac{1}{nN}R_{0}(h)+\frac{1}{nN}R_{1}(\bm{J})+\mathrm{constant}.

P_{\mathrm{prior}}(J_{ij}\mid\gamma)=\sqrt{\frac{n}{2\pi\gamma}}\exp\Big(-\frac{nJ_{ij}^{2}}{2\gamma}\Big),\quad\gamma>0,

P_{\mathrm{prior}}(J_{ij}\mid\gamma)=\sqrt{\frac{n}{2\gamma}}\exp\Big(-\sqrt{\frac{2n}{\gamma}}|J_{ij}|\Big),\quad\gamma>0

P_{\mathrm{prior}}(h\mid H)=\delta(h-H),

L_{\mathrm{EB}}(H,\gamma)

[\cdots]_{h,\bm{J}}:=\int d\bm{J}\int dh\,(\cdots)P_{\mathrm{prior}}(h\mid H)P_{\mathrm{prior}}(\bm{J}\mid\gamma).

\{\hat{H},\hat{\gamma}\}=\operatorname*{arg\,max}_{H,\gamma}L_{\mathrm{EB}}(H,\gamma).

L_{\mathrm{EB}}(H,\gamma)=\frac{1}{nN}\ln\Big[\exp\big(nNL_{\mathrm{ML}}(h,\bm{J})\big)\Big]_{h,\bm{J}}

L_{\mathrm{EB}}(H,\gamma)

+\frac{1}{nN}\ln P_{\mathrm{prior}}(\hat{\bm{J}}_{\mathrm{ML}}\mid\gamma)+\mathrm{constant}.

L_{\mathrm{EB}}(H,\gamma)=\frac{1}{nN}\ln\lim_{x\to-1}\Psi_{x}(H,\gamma),

\Psi_{x}(H,\gamma):=\Big[Z(h,\bm{J})^{xN}\exp N\Big(h\sum_{i=1}^{n}d_{i}+\sum_{i<j}J_{ij}d_{ij}\Big)\Big]_{h,\bm{J}},

d_{i}:=\frac{1}{N}\sum_{\mu=1}^{N}S_{i}^{(\mu)},\quad d_{ij}:=\frac{1}{N}\sum_{\mu=1}^{N}S_{i}^{(\mu)}S_{j}^{(\mu)}

\Psi_{x}(H,\gamma)

+\sum_{i<j}J_{ij}\Big(\sum_{a=1}^{\tau_{x}}S_{i}^{\{a\}}S_{j}^{\{a\}}+Nd_{ij}\Big)\Big\}\Big]_{h,\bm{J}},

\Psi_{x}^{\mathrm{Gauss}}(H,\gamma)

-F_{x}(H,\gamma)\Big\},

M:=\frac{1}{n}\sum_{i=1}^{n}d_{i},\quad C_{k}:=\frac{2}{n(n-1)}\sum_{i<j}d_{ij}^{k},

F_{x}(H,\gamma):=-\ln\sum_{\mathcal{S}_{x}}\exp\big(-E_{x}(\mathcal{S}_{x};H,\gamma)\big)

E_{x}(\mathcal{S}_{x};H,\gamma):=-H\sum_{i=1}^{n}\sum_{a=1}^{\tau_{x}}S_{i}^{\{a\}}-\frac{\gamma N}{n}\sum_{i<j}d_{ij}\sum_{a=1}^{\tau_{x}}S_{i}^{\{a\}}S_{j}^{\{a\}}-\frac{\gamma}{n}\sum_{i<j}\sum_{a<b}S_{i}^{\{a\}}S_{j}^{\{a\}}S_{i}^{\{b\}}S_{j}^{\{b\}}

G_{x}(m,H,\gamma)

Full text

Empirical Bayes Method for Boltzmann Machines

Muneki Yasuda

Graduate School of Science and Engineering, Yamagata University, Japan.

Tomoyuki Obuchi

Department of Mathematical and Computing Science, Tokyo Institute of Technology, Japan.

Abstract

In this study, we consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function referred to as the empirical Bayes likelihood function in this study. However, the maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method does not require any iterative procedures and is quite simple and fast, though it introduces a bias to the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is supposed to be due to the approximate treatment by the Plefka expansion. A possible extension to overcome this behavior is also discussed.

Keywords: Boltzmann machine, inverse Ising problem, empirical Bayes method, replica method, Plefka expansion

I Introduction

Boltzmann machine learning (BML) Ackley et al. (1985) has been actively studied in the field of machine learning and also in statistical mechanics. In statistical mechanics, the problem of BML is sometimes referred to as the inverse Ising problem, because a Boltzmann machine is the same as an Ising model, and BML can be regarded as an inverse problem for the Ising model. The framework of the usual BML is as follows. Given a set of observed data points (e.g., spin snapshots), we estimate appropriate values of the parameters, the external field and couplings, of our Boltzmann machine through maximum likelihood (ML) estimation (cf. Sec. II.1). Because BML involves intractable multiple summations (i.e., evaluation of the partition function), many approximations for it were proposed from the viewpoint of statistical mechanics Roudi et al. (2009): for example, methods based on mean-field approximations (such as the Plefka expansion Plefka (1982) and the cluster variation method Pelizzola (2005)) Kappen and Rodríguez (1998); Tanaka (1998); Yasuda and Horiguchi (2006); Sessak and Monasson (2009); Yasuda and Tanaka (2009); Ricci-Tersenghi (2012); Furtlehner (2013) and methods based on other approximations Sohl-Dickstein et al. (2011); Yasuda (2015).

In this study, we focus on another type of learning problem. We consider prior distributions of parameters of the Boltzmann machine and assume that the prior distributions are governed by some hyperparameters. The introduction of the prior distributions is strongly connected with the regularized ML estimation (cf. Sec. II.1). As mentioned above, the aim of the usual BML is to optimize the values of the parameters of the Boltzmann machine by using a set of observed data points. Meanwhile, the aim of the problem investigated in this study is the estimation of appropriate values of the hyperparameters from the dataset without estimating specific values of the parameters. From the Bayesian point of view, one way to accomplish this is the empirical Bayes method (also called type-II ML estimation or evidence approximation) MacKay (1992); Bishop (2006) (cf. Sec. II.2). The schemes of the usual BML and of our problem are illustrated in Fig. 1.

However, the evaluation of the likelihood function in the empirical Bayes method is again intractable, because it involves intractable multiple integrations of the partition function. In this study, we analyze the empirical Bayes method for fully-connected Boltzmann machines, using statistical mechanical techniques based on the replica method Mezard et al. (1987); Nishimori (2001) and the Plefka expansion to derive an algorithm for it. We consider two cases for the prior distribution of $\bm{J}$ : the Gaussian prior and the Laplace prior.

The rest of this paper is organized as follows. The formulations of the usual BML and the empirical Bayes method are presented in Sec. II. In Sec. III, we describe our statistical mechanical analysis for the empirical Bayes method. The proposed inference algorithm obtained from our analysis is shown in Sec. III.3 with its pseudocode. In Sec. IV, we examine our proposed method through numerical experiments. Finally, the summary and some discussions are presented in Sec. V.

II Boltzmann Machine and Empirical Bayes Method

II.1 Boltzmann machine and prior distributions

Consider a fully-connected Boltzmann machine with $n$ Ising variables $\bm{S}:=\{S_{i}\in\{-1,+1\}\mid i=1,2,\ldots,n\}$ Ackley et al. (1985):

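$$P(\bm{S}\mid h,\bm{J}):=\frac{1}{Z(h,\bm{J})}\exp\Big(h\sum_{i=1}^{n}S_{i}+\sum_{i<j}J_{ij}S_{i}S_{j}\Big),\tag{1}$$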

where $\sum_{i<j}$ is the sum over all the distinct pairs of variables; i.e., $\sum_{i<j}=\sum_{i=1}^{n}\sum_{j=i+1}^{n}$ . $Z(h,\bm{J})$ is the partition function defined by

$$Z(h,\bm{J}):=\sum_{\bm{S}}\exp\Big(h\sum_{i=1}^{n}S_{i}+\sum_{i<j}J_{ij}S_{i}S_{j}\Big),$$

where $\sum_{\bm{S}}$ is the sum over all the possible configurations of $\bm{S}$ ; i.e., $\sum_{\bm{S}}:=\prod_{i=1}^{n}\sum_{S_{i}=\pm 1}$ . The parameters, $h\in(-\infty,+\infty)$ and $\bm{J}:=\{J_{ij}\in(-\infty,+\infty)\mid i<j\}$ , denote the external field and couplings, respectively.
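As a hedged illustration of Eqs. (1) and (2), the following minimal sketch evaluates the partition function by brute-force enumeration. This is feasible only for very small $n$ , because the sum has $2^{n}$ terms, which is exactly the intractability the paper works around. The function names and the toy parameter values are assumptions made for illustration and do not come from the paper.

```python
import itertools
import math


def exponent(s, h, J):
    """Return h * sum_i S_i + sum_{i<j} J_ij S_i S_j for one configuration s."""
    n = len(s)
    field = h * sum(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    return field + pair


def partition_function(h, J, n):
    """Z(h, J): sum of exp(exponent) over all 2**n configurations (tiny n only)."""
    return sum(math.exp(exponent(s, h, J))
               for s in itertools.product((-1, 1), repeat=n))


def probability(s, h, J):
    """P(S | h, J) for one configuration s."""
    return math.exp(exponent(s, h, J)) / partition_function(h, J, len(s))


# Hypothetical toy parameters; only the upper triangle of J is used.
n = 3
h = 0.1
J = [[0.0, 0.5, -0.2],
     [0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0]]

# The probabilities of all configurations should sum to one.
total = sum(probability(s, h, J) for s in itertools.product((-1, 1), repeat=n))
```

With all parameters set to zero, every configuration is equally likely and `partition_function` returns $2^{n}$ .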

Given $N$ observed data points, $\mathcal{D}:=\{\mathbf{S}^{(\mu)}\in\{-1,+1\}^{n}\mid\mu=1,2,\ldots,N\}$ , we define the log-likelihood function:

$$L_{\mathrm{ML}}(h,\bm{J}):=\frac{1}{nN}\sum_{\mu=1}^{N}\ln P(\mathbf{S}^{(\mu)}\mid h,\bm{J}).$$

Maximizing the log-likelihood function with respect to $h$ and $\bm{J}$ (i.e., the ML estimation) just corresponds to the BML (or the inverse Ising problem), i.e.,

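$$\{\hat{h}_{\mathrm{ML}},\hat{\bm{J}}_{\mathrm{ML}}\}=\operatorname*{arg\,max}_{h,\bm{J}}L_{\mathrm{ML}}(h,\bm{J}).$$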
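The log-likelihood normalized by $nN$ can be sketched as follows for a toy system, again using brute-force enumeration for the partition function. The function name `log_likelihood` and the toy dataset are hypothetical; this is an illustration under those assumptions, not the authors' implementation.

```python
import itertools
import math


def log_likelihood(data, h, J):
    """L_ML(h, J) = (1 / (n N)) * sum_mu ln P(S^(mu) | h, J), by exact enumeration."""
    n = len(data[0])
    N = len(data)

    def exponent(s):
        pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
        return h * sum(s) + pair

    # ln Z by summing over all 2**n configurations (tiny n only).
    log_z = math.log(sum(math.exp(exponent(s))
                         for s in itertools.product((-1, 1), repeat=n)))
    return sum(exponent(s) - log_z for s in data) / (n * N)


# Hypothetical toy dataset with N = 4 data points on n = 3 spins.
data = [(1, 1, 1), (1, 1, -1), (-1, -1, 1), (1, -1, -1)]
zeros = [[0.0] * 3 for _ in range(3)]
value_at_zero = log_likelihood(data, 0.0, zeros)
```

At $h=0$ and $\bm{J}=\bm{0}$ the model is uniform, so the normalized log-likelihood equals $-\ln 2$ regardless of the data, which gives a quick sanity check.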

Now, we introduce prior distributions for the parameters $h$ and $\bm{J}$ as $P_{\mathrm{prior}}(h\mid H)$ and

[TABLE]

respectively. $H$ and $\gamma$ are the hyperparameters of these prior distributions. One of the most important motivations for introducing the prior distributions is for a Bayesian interpretation of the regularized ML estimation Bishop (2006). Given the observed dataset $\mathcal{D}$ , by using the prior distributions, the posterior distribution of $h$ and $\bm{J}$ is expressed as

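$$P_{\mathrm{post}}(h,\bm{J}\mid\mathcal{D},H,\gamma)=\frac{P(\mathcal{D}\mid h,\bm{J})P_{\mathrm{prior}}(h\mid H)P_{\mathrm{prior}}(\bm{J}\mid\gamma)}{P(\mathcal{D}\mid H,\gamma)},\tag{5}$$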

where

$$P(\mathcal{D}\mid h,\bm{J}):=\prod_{\mu=1}^{N}P(\mathbf{S}^{(\mu)}\mid h,\bm{J}).$$

The distribution in the denominator in Eq. (5), $P(\mathcal{D}\mid H,\gamma)$ , is sometimes referred to as the evidence. By using the posterior distribution, the maximum a posteriori (MAP) estimation of the parameters is obtained as

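$$\{\hat{h}_{\mathrm{MAP}},\hat{\bm{J}}_{\mathrm{MAP}}\}=\operatorname*{arg\,max}_{h,\bm{J}}L_{\mathrm{MAP}}(h,\bm{J}),\tag{6}$$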

where

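$$L_{\mathrm{MAP}}(h,\bm{J}):=\frac{1}{nN}\ln P_{\mathrm{post}}(h,\bm{J}\mid\mathcal{D},H,\gamma)=L_{\mathrm{ML}}(h,\bm{J})+\frac{1}{nN}R_{0}(h)+\frac{1}{nN}R_{1}(\bm{J})+\mathrm{constant}.\tag{7}$$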

The MAP estimation in Eq. (6) corresponds to the regularized ML estimation, in which $R_{0}(h):=\ln P_{\mathrm{prior}}(h\mid H)$ and $R_{1}(\bm{J}):=\ln P_{\mathrm{prior}}(\bm{J}\mid\gamma)$ work as a penalty. For example, (i) when the prior distribution of $\bm{J}$ is the Gaussian prior,

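$$P_{\mathrm{prior}}(J_{ij}\mid\gamma)=\sqrt{\frac{n}{2\pi\gamma}}\exp\Big(-\frac{nJ_{ij}^{2}}{2\gamma}\Big),\quad\gamma>0,\tag{8}$$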

$R_{1}(\bm{J})$ corresponds to the $L_{2}$ regularization term, and $\gamma$ corresponds to its coefficient; (ii) when the prior distribution of $\bm{J}$ is the Laplace prior,

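$$P_{\mathrm{prior}}(J_{ij}\mid\gamma)=\sqrt{\frac{n}{2\gamma}}\exp\Big(-\sqrt{\frac{2n}{\gamma}}|J_{ij}|\Big),\quad\gamma>0,\tag{9}$$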

$R_{1}(\bm{J})$ corresponds to the $L_{1}$ regularization term, and $\gamma$ again corresponds to its coefficient. The variances of these prior distributions are identical, $\mathrm{Var}[J_{ij}]=\gamma/n$ . In this study, as a simple test case, we use these two prior distributions for $\bm{J}$ and

$$P_{\mathrm{prior}}(h\mid H)=\delta(h-H),\tag{10}$$

where $\delta(x)$ is the Dirac delta function, for $h$ .
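To make the link between the two priors and the regularizers explicit, the penalty terms and variances can be written out as below. This is a short worked calculation, assuming that the prior over all couplings factorizes over the pairs $i<j$ (an assumption, since the joint prior is not reproduced here); the scale $b$ is notation introduced only for this sketch.

```latex
% Assumption: P_prior(J | gamma) is the product of the per-pair densities over i<j.
\begin{align*}
\text{Gaussian:}\quad R_{1}(\bm{J}) &= -\frac{n}{2\gamma}\sum_{i<j}J_{ij}^{2}+\mathrm{constant},
 & \mathrm{Var}[J_{ij}] &= \frac{\gamma}{n},\\
\text{Laplace:}\quad R_{1}(\bm{J}) &= -\sqrt{\frac{2n}{\gamma}}\sum_{i<j}|J_{ij}|+\mathrm{constant},
 & \mathrm{Var}[J_{ij}] &= 2b^{2}=\frac{\gamma}{n},\quad b:=\sqrt{\frac{\gamma}{2n}}.
\end{align*}
```

In both cases a smaller $\gamma$ gives a stronger penalty, since the penalty strengths are $n/(2\gamma)$ and $\sqrt{2n/\gamma}$ , respectively.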

II.2 Framework of the empirical Bayes method

Using the empirical Bayes method, we can infer the values of the hyperparameters, $H$ and $\gamma$ , from the observed dataset $\mathcal{D}$ . We define a marginal log-likelihood function as

$$L_{\mathrm{EB}}(H,\gamma):=\frac{1}{nN}\ln\big[P(\mathcal{D}\mid h,\bm{J})\big]_{h,\bm{J}},\tag{11}$$

where $[\cdots]_{h,\bm{J}}$ is the average over the prior distributions; i.e.,

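$$[\cdots]_{h,\bm{J}}:=\int d\bm{J}\int dh\,(\cdots)P_{\mathrm{prior}}(h\mid H)P_{\mathrm{prior}}(\bm{J}\mid\gamma).$$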

We refer to the marginal log-likelihood function as the empirical Bayes likelihood function in this study. From the perspective of the empirical Bayes method, the optimal values of the hyperparameters, $\hat{H}$ and $\hat{\gamma}$ , are obtained by maximizing the empirical Bayes likelihood function; i.e.,

$$\{\hat{H},\hat{\gamma}\}=\operatorname*{arg\,max}_{H,\gamma}L_{\mathrm{EB}}(H,\gamma).\tag{12}$$

It is noteworthy that $[P(\mathcal{D}\mid h,\bm{J})]_{h,\bm{J}}$ in Eq. (11) is identified as the evidence appearing in Eq. (5).

The marginal log-likelihood function can be rewritten as

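$$L_{\mathrm{EB}}(H,\gamma)=\frac{1}{nN}\ln\Big[\exp\big(nNL_{\mathrm{ML}}(h,\bm{J})\big)\Big]_{h,\bm{J}}.\tag{13}$$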

Consider the case $N\gg n$ . In this case, by using the saddle point evaluation, Eq. (13) is reduced to

[TABLE]

In this case, the empirical Bayes estimates $\{\hat{H},\hat{\gamma}\}$ thus converge to the maximum likelihood estimates of the hyperparameters in the prior distributions in which the maximum likelihood estimates of the parameters $\{\hat{h}_{\mathrm{ML}},\hat{\bm{J}}_{\mathrm{ML}}\}$ (i.e., the solution to the BML) are inserted. This indicates that the parameter estimations can be conducted independently of the hyperparameter estimation. In this study, we do not concern ourselves with this trivial case.

III Statistical Mechanical Analysis

The empirical Bayes likelihood function in Eq. (11) involves intractable multiple integrations. In this section, we evaluate the empirical Bayes likelihood function using a statistical mechanical analysis. We consider two types of prior distribution for $\bm{J}$ : one is the Gaussian prior in Eq. (8), and the other is the Laplace prior in Eq. (9).

First, we evaluate the empirical Bayes likelihood function on the basis of the Gaussian prior in Secs. III.1–III.3, after which we describe the evaluation based on the Laplace prior in Sec. III.4.

III.1 Replica method

The empirical Bayes likelihood function in Eq. (11) can be represented as

$$L_{\mathrm{EB}}(H,\gamma)=\frac{1}{nN}\ln\lim_{x\to-1}\Psi_{x}(H,\gamma),\tag{14}$$

where

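$$\Psi_{x}(H,\gamma):=\Big[Z(h,\bm{J})^{xN}\exp N\Big(h\sum_{i=1}^{n}d_{i}+\sum_{i<j}J_{ij}d_{ij}\Big)\Big]_{h,\bm{J}},\tag{15}$$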

and

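$$d_{i}:=\frac{1}{N}\sum_{\mu=1}^{N}S_{i}^{(\mu)},\quad d_{ij}:=\frac{1}{N}\sum_{\mu=1}^{N}S_{i}^{(\mu)}S_{j}^{(\mu)}$$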

are the sample averages of the observed data points. We assume that $\tau_{x}:=xN$ is a natural number, and therefore Eq. (15) can be expressed as

[TABLE]

where $a,b\in\{1,2,\ldots,\tau_{x}\}$ are replica indices, and $S_{i}^{\{a\}}$ is the Ising variable on site $i$ in the $a$th replica. $\mathcal{S}_{x}:=\{S_{i}^{\{a\}}\mid i=1,2,\ldots,n;\,a=1,2,\ldots,\tau_{x}\}$ is the set of all the Ising variables in the replicated system, and $\sum_{\mathcal{S}_{x}}$ is the sum over all the possible configurations of $\mathcal{S}_{x}$ ; i.e., $\sum_{\mathcal{S}_{x}}:=\prod_{i=1}^{n}\prod_{a=1}^{\tau_{x}}\sum_{S_{i}^{\{a\}}=\pm 1}$ . We evaluate $\Psi_{x}(H,\gamma)$ under the assumption that $\tau_{x}$ is a natural number, after which we take the limit of $x\to-1$ of the evaluation result to obtain the empirical Bayes likelihood function (this is the so-called replica trick).
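The identity behind the replicated system, namely that a natural-number power $Z^{\tau}$ equals a single sum over $\tau$ independent copies of the spins, can be checked numerically on a toy system. The sketch below is only an illustration of that integer case; it says nothing about the analytic continuation to $x\to-1$ . All names and parameter values are hypothetical.

```python
import itertools
import math


def exponent(s, h, J):
    """h * sum_i S_i + sum_{i<j} J_ij S_i S_j for one configuration s."""
    n = len(s)
    return h * sum(s) + sum(J[i][j] * s[i] * s[j]
                            for i in range(n) for j in range(i + 1, n))


def partition_function(h, J, n):
    """Z(h, J) by exact enumeration over 2**n configurations."""
    return sum(math.exp(exponent(s, h, J))
               for s in itertools.product((-1, 1), repeat=n))


def replicated_sum(h, J, n, tau):
    """Sum over all configurations of tau replicas sharing the same h and J."""
    configs = list(itertools.product((-1, 1), repeat=n))
    total = 0.0
    for replicas in itertools.product(configs, repeat=tau):
        # The replicated exponent is the sum of the single-replica exponents.
        total += math.exp(sum(exponent(s, h, J) for s in replicas))
    return total


# Hypothetical toy parameters.
n, tau = 3, 2
h = 0.1
J = [[0.0, 0.5, -0.2],
     [0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0]]

lhs = partition_function(h, J, n) ** tau   # Z^tau
rhs = replicated_sum(h, J, n, tau)         # sum over the replicated system
```

The two quantities agree up to floating-point error for any natural number `tau`; the cost of the right-hand side grows as $2^{n\tau}$ , so only tiny values are practical here.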

By employing the Gaussian prior in Eq. (8), Eq. (16) becomes

[TABLE]

where

$$M:=\frac{1}{n}\sum_{i=1}^{n}d_{i},\quad C_{k}:=\frac{2}{n(n-1)}\sum_{i<j}d_{ij}^{k},\tag{18}$$

and

$$F_{x}(H,\gamma):=-\ln\sum_{\mathcal{S}_{x}}\exp\big(-E_{x}(\mathcal{S}_{x};H,\gamma)\big)\tag{19}$$

is the replicated (Helmholtz) free energy Rizzo et al. (2010); Yasuda et al. (2012); Lage-Castellanos et al. (2013); Yasuda et al. (2015); here,

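$$E_{x}(\mathcal{S}_{x};H,\gamma):=-H\sum_{i=1}^{n}\sum_{a=1}^{\tau_{x}}S_{i}^{\{a\}}-\frac{\gamma N}{n}\sum_{i<j}d_{ij}\sum_{a=1}^{\tau_{x}}S_{i}^{\{a\}}S_{j}^{\{a\}}-\frac{\gamma}{n}\sum_{i<j}\sum_{a<b}S_{i}^{\{a\}}S_{j}^{\{a\}}S_{i}^{\{b\}}S_{j}^{\{b\}}\tag{20}$$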

is the Hamiltonian of the replicated system, where $\sum_{a<b}$ is the sum over all the distinct pairs of replicas; i.e., $\sum_{a<b}=\sum_{a=1}^{\tau_{x}}\sum_{b=a+1}^{\tau_{x}}$ .
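The data enter the analysis of this section only through the sample averages $d_{i}$ and $d_{ij}$ and the summaries $M$ and $C_{k}$ built from them. A minimal sketch of how these could be computed from a dataset is given below; the function names and the toy dataset are assumptions for illustration.

```python
def sample_statistics(data):
    """Return (d, dd): d[i] = mean of S_i, dd[(i, j)] = mean of S_i * S_j for i < j."""
    N = len(data)
    n = len(data[0])
    d = [sum(s[i] for s in data) / N for i in range(n)]
    dd = {(i, j): sum(s[i] * s[j] for s in data) / N
          for i in range(n) for j in range(i + 1, n)}
    return d, dd


def summaries(d, dd, k):
    """Return (M, C_k): M averages d_i over sites, C_k averages d_ij**k over pairs."""
    n = len(d)
    M = sum(d) / n
    C_k = 2.0 / (n * (n - 1)) * sum(v ** k for v in dd.values())
    return M, C_k


# Hypothetical toy dataset: N = 4 data points on n = 3 spins.
data = [(1, 1, 1), (1, 1, -1), (-1, -1, 1), (1, -1, -1)]
d, dd = sample_statistics(data)
M, C1 = summaries(d, dd, 1)
_, C2 = summaries(d, dd, 2)
```

For this toy dataset, $d=(0.5,0,0)$ and the pair averages are $0.5$ , $-0.5$ , and $0$ , so $M=1/6$ , $C_{1}=0$ , and $C_{2}=1/6$ .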

III.2 Plefka expansion

Because the replicated free energy in Eq. (19) includes intractable multiple summations, an approximation is needed to proceed with our evaluation. In this section, we approximate the replicated free energy using the Plefka expansion Plefka (1982). In brief, the Plefka expansion is the perturbative expansion in a Gibbs free energy that is a dual form of a corresponding Helmholtz free energy.

The Gibbs free energy is obtained as

[TABLE]

The derivation of this Gibbs free energy is described in Appendix A. It is noteworthy that this type of expression of the Gibbs free energy implies the replica-symmetric (RS) assumption. To take the replica-symmetry breaking (RSB) into account, explicit treatments of overlaps between different replicas are needed Yasuda et al. (2012). By expanding $G_{x}(m,H,\gamma)$ around $\gamma=0$ , we obtain

[TABLE]

where $e(m)$ is the negative mean-field entropy defined by

[TABLE]

and the coefficients, $\phi_{x}^{(1)}(m)$ and $\phi_{x}^{(2)}(m)$ , are expressed as Eqs. (41) and (46), respectively. The detailed derivation of these coefficients is presented in Appendix B.

From Eqs. (14), (17), (22), and (37), we obtain the empirical Bayes likelihood function as

[TABLE]

where

[TABLE]

From Eqs. (41) and (46), $\Phi(m)$ and $\phi_{-1}^{(2)}(m)$ are

[TABLE]

and

[TABLE]

respectively. The coefficient $\Omega$ appearing in the above equation is defined by

[TABLE]

where

[TABLE]

here, $\partial(i):=\{1,2,\ldots,n\}\setminus\{i\}$ .

III.3 Inference algorithm

As mentioned in Sec. II.2, the empirical Bayes inference is achieved by maximizing $L_{\mathrm{EB}}(H,\gamma)$ with respect to $H$ and $\gamma$ (cf. Eq. (12)). From the extremum condition of Eq. (24) with respect to $H$ , we obtain

[TABLE]

where $\hat{m}$ is the value of $m$ that satisfies the extremum condition in Eq. (24). From the extremum condition of Eq. (24) with respect to $m$ and Eq. (29), we obtain

[TABLE]

From Eqs. (24) and (29), the optimal value of $\gamma$ is obtained by

[TABLE]

From Eq. (31), $\hat{\gamma}$ is immediately obtained as follows: (i) when $\phi_{-1}^{(2)}(M)>0$ and $\Phi(M)\geq 0$ or when $\phi_{-1}^{(2)}(M)=0$ and $\Phi(M)>0$ , $\hat{\gamma}=0$ , (ii) when $\phi_{-1}^{(2)}(M)>0$ and $\Phi(M)<0$ , $\hat{\gamma}=-\Phi(M)/(2\phi_{-1}^{(2)}(M))$ , and (iii) $\hat{\gamma}\to\infty$ elsewhere. Here, we ignore the case $\phi_{-1}^{(2)}(M)=\Phi(M)=0$ , because it hardly occurs in realistic settings. By using Eqs. (30) and (31), we can obtain the solution to the empirical Bayes inference without any iterative processes. The pseudocode of the proposed procedure is shown in Algorithm 1.
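The three-way case rule for $\hat{\gamma}$ stated above can be sketched as a small function. The sketch takes the values $\phi_{-1}^{(2)}(M)$ and $\Phi(M)$ as inputs, because the formulas that compute them from the data are not reproduced here; the function and argument names are hypothetical, and this is not the paper's Algorithm 1.

```python
import math


def gamma_hat(phi2, big_phi):
    """Case rule for the estimate of gamma.

    phi2    : value of phi_{-1}^{(2)}(M), supplied by the caller
    big_phi : value of Phi(M), supplied by the caller
    """
    if phi2 == 0 and big_phi == 0:
        # The text ignores this degenerate case; the sketch refuses to guess.
        raise ValueError("phi2 = Phi = 0 is not covered by the case rule")
    # Case (i): the estimate sits at the boundary gamma = 0.
    if (phi2 > 0 and big_phi >= 0) or (phi2 == 0 and big_phi > 0):
        return 0.0
    # Case (ii): interior solution.
    if phi2 > 0 and big_phi < 0:
        return -big_phi / (2.0 * phi2)
    # Case (iii): the estimate diverges.
    return math.inf


# With the notation J := sqrt(gamma) used in the experiments section.
example_gamma = gamma_hat(2.0, -1.0)
example_J = math.sqrt(example_gamma)
```

Because the rule is a closed-form expression in two scalars, no iteration is involved once the two inputs are available.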

In the proposed method, the value of $\hat{H}$ does not affect the determination of $\hat{\gamma}$ . Many mean-field-based methods for BML (e.g., listed in Sec. I) have similar procedures, in which $\hat{\bm{J}}_{\mathrm{ML}}$ are determined separately from $\hat{h}_{\mathrm{ML}}$ . This is seen as one of the common properties of the mean-field-based methods for BML including the current empirical Bayes problem.

III.4 Evaluation based on Laplace prior

The above evaluation was for the Gaussian prior in Eq. (8). Here, we explain the evaluation for the Laplace prior in Eq. (9). By employing the Laplace prior in Eq. (9), Eq. (16) becomes

[TABLE]

where $\xi:=\sqrt{2n/\gamma}$ . Here, we assume

[TABLE]

By using the perturbative approximation,

[TABLE]

we obtain the approximation of Eq. (32) as

[TABLE]

The right-hand side of this equation coincides with $\Psi_{x}^{\mathrm{Gauss}}(H,\gamma)$ in Eq. (17). This means that the empirical Bayes inference based on the Laplace prior in Eq. (9) is (approximately) equivalent to that based on the Gaussian prior in Eq. (8) (i.e., $\Psi_{x}^{\mathrm{Laplace}}(H,\gamma)\approx\Psi_{x}^{\mathrm{Gauss}}(H,\gamma)$ ) when the assumption of Eq. (33) is justified. Thus, we can also use the algorithm presented in Sec. III.3 for the case of the Laplace prior.
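One way to see why the two priors can give nearly the same result is to compare their averages of $\exp(aJ_{ij})$ for a single coupling. For the Gaussian prior this average is $\exp(\gamma a^{2}/(2n))$ , and for the Laplace prior it is $\xi^{2}/(\xi^{2}-a^{2})$ when $|a|<\xi$ ; the two agree to leading order when $|a|\ll\xi$ . The sketch below checks this numerically. Reading the assumption of Eq. (33) as a condition of this kind is an interpretation made here, since that equation is not reproduced, and the names and values are hypothetical.

```python
import math


def gauss_average(a, gamma, n):
    """Average of exp(a * J) under the Gaussian prior with variance gamma / n."""
    return math.exp(gamma * a * a / (2.0 * n))


def laplace_average(a, gamma, n):
    """Average of exp(a * J) under the Laplace prior with rate xi = sqrt(2 n / gamma)."""
    xi = math.sqrt(2.0 * n / gamma)
    if abs(a) >= xi:
        raise ValueError("the average diverges unless |a| < xi")
    return xi * xi / (xi * xi - a * a)


# Hypothetical values: n = 300, gamma = 1, so xi is about 24.5.
n, gamma = 300, 1.0
small_gap = abs(gauss_average(1.0, gamma, n) - laplace_average(1.0, gamma, n))
large_gap = abs(gauss_average(20.0, gamma, n) - laplace_average(20.0, gamma, n))
```

The gap is tiny for $a=1$ and large for $a=20$ , which is close to $\xi$ , illustrating that the equivalence is only approximate and depends on the size of the exponent.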

IV Numerical Experiments

In this section, we describe the results of our numerical experiments. In these experiments, the observed dataset $\mathcal{D}$ is generated from the generative Boltzmann machine, which has the same form as Eq. (1), by using annealed importance sampling (AIS) Neal (2001). In AIS, we controlled the annealing schedule using a series of inverse temperatures $0=\beta_{0}<\beta_{1}<\cdots<\beta_{T}=1$ , where $\beta_{t+1}=\beta_{t}+0.03$ . The parameters of the generative Boltzmann machine are drawn from the prior distributions in Eqs. (4) and (10). That is, we consider the model-matched case (i.e., the generative and learning models are identical).

In the following, we use the notations $\alpha:=N/n$ and $J:=\sqrt{\gamma}$ . The standard deviations of the Gaussian prior in Eq. (8) and of the Laplace prior in Eq. (9) are then $J/\sqrt{n}$ . We express the hyperparameters for the generative Boltzmann machine by $H_{\mathrm{true}}$ and $J_{\mathrm{true}}$ .
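Drawing the couplings of a generative model from either prior with standard deviation $J/\sqrt{n}$ can be sketched as follows. This covers only the parameter draw, not the AIS data generation. The function name, the use of Python's `random` module, and the choice to sample the Laplace prior as a difference of two exponential variates are assumptions made for this illustration.

```python
import math
import random


def draw_couplings(n, J, prior, rng):
    """Draw J_ij for all pairs i < j with standard deviation J / sqrt(n)."""
    std = J / math.sqrt(n)
    couplings = {}
    for i in range(n):
        for j in range(i + 1, n):
            if prior == "gauss":
                couplings[(i, j)] = rng.gauss(0.0, std)
            elif prior == "laplace":
                # Laplace scale b with variance 2 * b**2 = std**2; the difference
                # of two exponentials with rate 1 / b is Laplace with scale b.
                b = std / math.sqrt(2.0)
                couplings[(i, j)] = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
            else:
                raise ValueError("prior must be 'gauss' or 'laplace'")
    return couplings


def sample_variance(values):
    """Mean squared value; the priors are zero-mean, so this estimates the variance."""
    values = list(values)
    return sum(v * v for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, J = 80, 0.8
gauss_J = draw_couplings(n, J, "gauss", rng)
laplace_J = draw_couplings(n, J, "laplace", rng)
target = J * J / n  # gamma / n with gamma = J**2
```

Both draws should have a sample variance close to `target`, reflecting that the two priors are matched in variance.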

IV.1 Gaussian prior case

Here, we consider the case in which the prior distribution of $\bm{J}$ is the Gaussian prior in Eq. (8). In this case, the Boltzmann machine corresponds to the Sherrington-Kirkpatrick (SK) model Sherrington and Kirkpatrick (1975), and therefore it shows the spin-glass transition at $J=1$ when $h=0$ (i.e., when $H=0$ ).

First, we consider the case $H_{\mathrm{true}}=0$ . We show the scatter plots for the estimation of $\hat{J}$ for various $J_{\mathrm{true}}$ when $H_{\mathrm{true}}=0$ and $\alpha=0.4$ in Fig. 2.

The detailed values of the plots for some $J_{\mathrm{true}}$ values are shown in Tab. 1.
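To read the trend in Tab. 1 more directly, the deviation $\hat{J}-J_{\mathrm{true}}$ of the reported averages can be tabulated. The numbers below are the $n=300$ averages copied from Tab. 1; the variable names are hypothetical.

```python
# Averages of the estimate for n = 300, copied from Tab. 1.
j_true = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
j_hat_n300 = [0.048, 0.20, 0.41, 0.62, 0.82, 0.96, 1.03]

# Deviation of the average estimate from the true value, rounded to 3 decimals.
bias = [round(est - true, 3) for est, true in zip(j_hat_n300, j_true)]

for true, b in zip(j_true, bias):
    print(f"J_true = {true:.1f}: average deviation = {b:+.3f}")
```

The deviations stay within a few hundredths up to $J_{\mathrm{true}}=0.8$ and become clearly negative at $J_{\mathrm{true}}=1.2$ .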

When $J_{\mathrm{true}}<1$ , our estimates of $\hat{J}$ are in good agreement with $J_{\mathrm{true}}$ , whereas for $J_{\mathrm{true}}>1$ they fall below it (e.g., $\hat{J}\approx 1.03$ at $J_{\mathrm{true}}=1.2$ in Tab. 1). This implies that the validity of our perturbative approximation is lost in the spin-glass phase, as is often the case with many mean-field approximations. Fig. 3 shows the scatter plots for various $\alpha$ .

A smaller $\alpha$ causes $\hat{J}$ to be overestimated and a larger $\alpha$ causes it to be underestimated. At least in our experiments, the optimal value of $\alpha$ seems to be $\alpha_{\mathrm{opt}}\approx 0.4$ when $H_{\mathrm{true}}=0$ . Our method can estimate $\hat{H}$ together with $\hat{J}$ . The results for the estimation of $\hat{H}$ when $H_{\mathrm{true}}=0$ and $\alpha=0.4$ are shown in Fig. 4.

Figs. 4(a) and (b) show the average of $|H_{\mathrm{true}}-\hat{H}|$ (i.e., the mean absolute error (MAE)) and the standard deviation of $\hat{H}$ over 300 experiments, respectively. The MAE and standard deviation increase in the region $J_{\mathrm{true}}>1$ .

Next, we consider the cases $H_{\mathrm{true}}>0$ . The scatter plots for the estimation of $\hat{J}$ for various $J_{\mathrm{true}}$ values when $H_{\mathrm{true}}=0.2$ and $H_{\mathrm{true}}=0.4$ are shown in Fig. 5.

The appropriate values of $\alpha$ when $H_{\mathrm{true}}=0.2$ and $H_{\mathrm{true}}=0.4$ “approximately” seem to be $\alpha_{\mathrm{opt}}\approx 30/n$ and $\alpha_{\mathrm{opt}}\approx 5/n$ , respectively. The detailed values of these plots for some $J_{\mathrm{true}}$ values are shown in Tabs. 2 and 3. The results for the estimation of $\hat{H}$ when $H_{\mathrm{true}}=0.2$ and $\alpha=30/n$ and when $H_{\mathrm{true}}=0.4$ and $\alpha=5/n$ are shown in Figs. 6 and 7, respectively.

The increases in the MAE and standard deviations occur earlier than for the case in Fig. 4.

One of the largest qualitative differences between the cases $H_{\mathrm{true}}=0$ and $H_{\mathrm{true}}>0$ is the scale of $\alpha$ . In the case $H_{\mathrm{true}}=0$ , the optimal $\alpha$ was scaled by $O(1)$ with respect to $n$ (i.e., $N=O(n)$ ). Meanwhile, in the case $H_{\mathrm{true}}>0$ , the optimal $\alpha$ is scaled by $O(1/n)$ with respect to $n$ (i.e., $N=O(1)$ ). This change of scale can be understood from a scale evaluation for the terms in the empirical Bayes likelihood function in Eq. (24). The detailed reasoning is given in Appendix C.
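The difference between the two scalings becomes concrete when the dataset size $N=\alpha n$ is computed for the system sizes used in the tables. The sketch below uses the empirically reported values of $\alpha_{\mathrm{opt}}$ ; the function name is hypothetical.

```python
def dataset_size(alpha, n):
    """Return N = alpha * n, rounded to the nearest integer."""
    return round(alpha * n)


for n in (300, 500):
    sizes = {
        "H_true = 0   (alpha = 0.4)": dataset_size(0.4, n),
        "H_true = 0.2 (alpha = 30/n)": dataset_size(30 / n, n),
        "H_true = 0.4 (alpha = 5/n)": dataset_size(5 / n, n),
    }
    print(n, sizes)
```

For $H_{\mathrm{true}}=0$ the dataset size grows with $n$ (120 and 200 data points), whereas for $H_{\mathrm{true}}>0$ it stays fixed at 30 or 5 data points.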

IV.2 Laplace prior case

Here, we consider the case in which the prior distribution of $\bm{J}$ is the Laplace prior in Eq. (9). The scatter plots for the estimation of $\hat{J}$ for various $J_{\mathrm{true}}$ values when $H_{\mathrm{true}}=0$ are shown in Fig. 8.

The plots shown in Fig. 8 almost completely overlap with those in Fig. 3. Furthermore, all the numerical results in the case $H_{\mathrm{true}}>0$ also almost completely overlap with the corresponding results obtained in the above Gaussian prior case, and therefore we do not show those results.

V Summary and Discussions

In this study, we proposed a hyperparameter inference algorithm by analyzing the empirical Bayes likelihood function in Eq. (11) using the replica method and the Plefka expansion. The validity of our method was examined in numerical experiments for the Gaussian and Laplace priors, which demonstrated the existence of an appropriate scale in the size of the dataset that can accurately recover the values of the hyperparameters.

However, some problems remain. The first one is the scale of $N$ . In our experiments, we found that an appropriate $N$ is scaled by $O(n)$ when $H_{\mathrm{true}}=0$ or by $O(1)$ when $H_{\mathrm{true}}\neq 0$ . However, such scales seem to be unnatural, because they should not appear in the original framework of the empirical Bayes method. As discussed in Sec. II.2, when $N\gg n$ , maximizing the empirical Bayes likelihood function is reduced to the maximum likelihood estimation of the prior distributions for the solution to BML. This must lead to the correct $\hat{\gamma}$ and $\hat{H}$ , because the solution to BML is perfect when $N\to\infty$ . Therefore, such unnatural scales appear due to our approximation, which is also supported by a scale analysis given in Appendix C. An improvement of the approximation (e.g., by evaluating the leading terms in the Plefka expansion or using some other approximations) might reduce these unnatural behaviors.

The second problem is the optimal setting $\alpha=N/n$ . Empirically, we found that $\alpha_{\mathrm{opt}}\approx 0.4$ when $H_{\mathrm{true}}=0$ and that it decreases as $H_{\mathrm{true}}$ increases (e.g., $\alpha_{\mathrm{opt}}\approx 30/n$ when $H_{\mathrm{true}}=0.2$ and $\alpha_{\mathrm{opt}}\approx 5/n$ when $H_{\mathrm{true}}=0.4$ ). As can be seen in the results of our experiments, the solution to our method is robust to the choice of $\alpha$ when $J_{\mathrm{true}}$ is small ( $J_{\mathrm{true}}<J_{c}$ ) and is sensitive to it when $J_{\mathrm{true}}$ is large ( $J_{\mathrm{true}}>J_{c}$ ), where $J_{c}\approx 0.4$ . The estimation of $\alpha_{\mathrm{opt}}$ is very important for our method, and it will make our method more practical. This problem would be strongly related to the first problem.

The third problem is the degradation of the estimation accuracy in the spin-glass phase. In our experiments, the estimation accuracies of $\hat{\gamma}$ and $\hat{H}$ were obviously degraded in the spin-glass phase. This means that our Plefka expansion based on the RS assumption loses its validity in the spin-glass phase. In Ref. Yasuda et al. (2012), a Plefka expansion for the one-step RSB was proposed. Employing this expansion instead of the current expansion could reduce the degradation in the spin-glass phase. These three problems should be addressed in our future studies.

In this study, we used fully-connected Boltzmann machines whose variables are all visible. We are also interested in an extension of our method to other types of Boltzmann machines such as Boltzmann machines having specific structures or hidden variables. Furthermore, we considered the model-matched case (i.e., the case in which the generative model and learning model are the same model) in the current study, but model-mismatched cases are more practical and important.

Appendix A Gibbs Free Energy

In this appendix, we derive the Gibbs free energy for the replicated (Helmholtz) free energy in Eq. (19).

The replicated free energy is obtained by minimizing the variational free energy, defined by

[TABLE]

under the normalization constraint, i.e., $\sum_{\mathcal{S}_{x}}Q(\mathcal{S}_{x})=1$ , where $Q(\mathcal{S}_{x})$ is a test distribution over $\mathcal{S}_{x}$ , and $E_{x}(\mathcal{S}_{x};H,\gamma)$ is the Hamiltonian for the replicated system defined in Eq. (20).

The Gibbs free energy is obtained by adding new constraints to the minimization of $f[Q]$ . Here, we add the relation $(n\tau_{x})^{-1}\sum_{i=1}^{n}\sum_{a=1}^{\tau_{x}}\sum_{\mathcal{S}_{x}}S_{i}^{\{a\}}Q(\mathcal{S}_{x})=m$ as the constraint. By using Lagrange multipliers, the Gibbs free energy is obtained as

[TABLE]

where “ $\operatorname*{extr}$ ” denotes the extremum with respect to the assigned parameters. By performing the extremum operation with respect to $Q(\mathcal{S}_{x})$ and $r$ in Eq. (35), we obtain

[TABLE]

The replicated free energy in Eq. (19) coincides with the extremum of this Gibbs free energy with respect to $m$ ; i.e.,

[TABLE]

By performing the shift $H+\lambda\to\lambda$ in Eq. (36), we obtain Eq. (21).

Appendix B Derivation of Coefficients of Plefka Expansion

The Plefka expansion considered in this study can be obtained by expanding the Gibbs free energy in Eq. (21) around $\gamma=0$ .

When $\gamma=0$ , we have

[TABLE]

where $e(m)$ is defined in Eq. (23).
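At $\gamma=0$ the spins decouple, and the Legendre-transform structure of the Gibbs free energy can be checked on a single spin: the extremum over $\lambda$ of $\lambda m-\ln(2\cosh\lambda)$ is attained at $\lambda^{*}=\tanh^{-1}m$ and equals the negative entropy of a spin with magnetization $m$ . The sketch below verifies this numerically. It assumes the standard form $e(m)=\frac{1+m}{2}\ln\frac{1+m}{2}+\frac{1-m}{2}\ln\frac{1-m}{2}$ for the negative mean-field entropy, because Eq. (23) is not reproduced here; the function names are hypothetical.

```python
import math


def neg_entropy(m):
    """Assumed standard form of the negative mean-field entropy e(m), for |m| < 1."""
    p_up = (1.0 + m) / 2.0
    p_down = (1.0 - m) / 2.0
    return p_up * math.log(p_up) + p_down * math.log(p_down)


def legendre_value(m):
    """Value of lambda * m - ln(2 cosh lambda) at its stationary point."""
    lam = math.atanh(m)  # stationarity gives m = tanh(lambda)
    return lam * m - math.log(2.0 * math.cosh(lam))


# Largest mismatch between the two expressions on a few test magnetizations.
max_gap = max(abs(neg_entropy(m) - legendre_value(m))
              for m in (-0.9, -0.3, 0.0, 0.2, 0.7))
```

The two expressions agree to floating-point precision, consistent with the relation $m=\tanh\lambda^{*}$ that holds at $\gamma=0$ .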

For the derivations of the coefficients $\phi_{x}^{(1)}(m)$ and $\phi_{x}^{(2)}(m)$ , we decompose $E_{x}(\mathcal{S}_{x};H,\lambda)$ in Eq. (21) into two parts:

[TABLE]

where

[TABLE]

The coefficient $\phi_{x}^{(1)}(m)$ is defined by

[TABLE]

The derivative leads to

[TABLE]

where $\langle\cdots\rangle_{\gamma}$ denotes the average for the distribution

[TABLE]

where $\lambda^{*}$ is the value of $\lambda$ that satisfies the extremum condition in Eq. (21); it is a function of $\gamma$ and $m$, i.e., $\lambda^{*}=\lambda^{*}(\gamma,m)$. From the extremum condition for $\lambda$ in Eq. (21), we obtain the equation

[TABLE]

which holds for any $\gamma$. In the derivation of Eq. (39), we used Eq. (40). When $\gamma=0$, Eq. (40) reduces to $m=\tanh\lambda^{*}$. This means that $\langle S_{i}^{\{a\}}\rangle_{0}=m$ for any $i$ and $a$. Therefore, we obtain

[TABLE]

where $K_{x}:=\tau_{x}(\tau_{x}-1)/2$. In the derivation of Eq. (41), we used the relation $\langle S_{i}^{\{a\}}S_{j}^{\{b\}}\rangle_{0}=\langle S_{i}^{\{a\}}\rangle_{0}\langle S_{j}^{\{b\}}\rangle_{0}$ if $i\neq j$ or $a\neq b$.
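The relation $m=\tanh\lambda^{*}$ at $\gamma=0$ can be made explicit with a one-line calculation. The sketch below assumes that, at $\gamma=0$, each spin $S\in\{-1,+1\}$ carries an independent weight proportional to $\exp(\lambda^{*}S)$; this factorized form is an assumption consistent with the relations quoted above, not a restatement of Eq. (40).

```latex
\langle S \rangle_{0}
  = \frac{\sum_{S=\pm 1} S\, e^{\lambda^{*} S}}{\sum_{S=\pm 1} e^{\lambda^{*} S}}
  = \frac{e^{\lambda^{*}} - e^{-\lambda^{*}}}{e^{\lambda^{*}} + e^{-\lambda^{*}}}
  = \tanh\lambda^{*} = m,
\qquad
\langle S_{i}^{\{a\}} S_{j}^{\{b\}} \rangle_{0} = m^{2}
\quad (i \neq j \ \text{or}\ a \neq b).
```

For a positive integer $\tau_{x}$, $K_{x}=\tau_{x}(\tau_{x}-1)/2$ is the number of unordered replica pairs $a<b$.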

The coefficient $\phi_{x}^{(2)}(m)$ is defined by

[TABLE]

From Eq. (39), the second derivative is

[TABLE]

where

[TABLE]

is Georges’s operator, proposed by Georges and Yedidia (1991). To simplify the notation, we omit the explicit dependence of the operator on $\mathcal{S}_{x}$ and $m$. By using this operator, the derivative of $\langle A\rangle_{\gamma}$ with respect to $\gamma$ is obtained as

[TABLE]

This immediately leads to $\langle S_{i}^{\{a\}}U_{x}(\gamma)\rangle_{\gamma}=0$, because $\partial\langle S_{i}^{\{a\}}\rangle_{\gamma}/\partial\gamma=\partial m/\partial\gamma=0$. Therefore,

[TABLE]

is obtained, where we have used $\langle U_{x}(\gamma)\rangle_{\gamma}=0$. From Eqs. (42) and (43), we have

[TABLE]

Because

[TABLE]

when $\gamma=0$, we obtain

[TABLE]

where $\omega_{i}$ is defined in Eq. (28).

By using Eqs. (44) and (45), we obtain

[TABLE]

where $\Omega$ is defined in Eq. (27).

Appendix C Evaluation of Orders of Each Term in the Empirical Bayes Likelihood

Here, we evaluate the order of each term in Eq. (24) at $m=M$ for $n\gg 1$, that is, the order of each term in

[TABLE]

In the following, we assume that $N=O(n^{\rho})$ ($\rho\geq 0$) and that $\{\mathrm{S}_{i}^{(\mu)}\}$ are i.i.d. samples from a certain distribution.

First, we consider the case $H_{\mathrm{true}}=0$, in which the distribution of $\{\mathrm{S}_{i}^{(\mu)}\}$ is unbiased. In this case, we obtain $M=O(n^{-(1+\rho)/2})$, $C_{1}=O(n^{-1-\rho/2})$, and

[TABLE]

Similarly, we obtain

[TABLE]

because $C_{1}^{2}=O(n^{-2-\rho})$. Using the above results and Eqs. (23), (25), and (26), we obtain $e(M)=O(1)$, $\Phi(M)=O(1)$, and $\phi_{-1}^{(2)}(M)=O(n^{\rho-1})$, respectively. Therefore, when $\rho=1$, all the terms in Eq. (47) are $O(1)$ with respect to $n$.

Next, we consider the case $H_{\mathrm{true}}\neq 0$, in which the distribution of $\{\mathrm{S}_{i}^{(\mu)}\}$ is biased. In this case, $M$, $C_{1}$, and $C_{2}$ are $O(1)$, and furthermore, $\Omega$ is $O(1)$ because $\omega_{i}=O(1)$. This leads to $e(M)=O(1)$, $\Phi(M)=O(n^{\rho})$, and $\phi_{-1}^{(2)}(M)=O(n^{2\rho})$. Therefore, when $\rho=0$, all the terms in Eq. (47) are $O(1)$ with respect to $n$.
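The two scalings of $M$ can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: $M$ is taken to be the sample mean over all $nN$ spin values (its definition appears in the main text, not in this appendix), $N=n^{\rho}$ with $\rho=1$, and the spins are drawn as independent $\pm 1$ variables with a prescribed mean instead of being sampled from a Boltzmann machine. The function name and the bias value 0.3 are illustrative.

```python
import numpy as np

def sample_mean_magnetization(n, rho, bias, trials, rng):
    """Return `trials` realizations of M = (nN)^{-1} * sum of nN i.i.d. +/-1 spins.

    `bias` is the assumed mean E[S] of each spin (0 for the unbiased case).
    """
    N = max(1, int(round(n ** rho)))
    p_up = 0.5 * (1.0 + bias)  # P(S = +1)
    # The number of +1 spins among nN i.i.d. spins is binomially distributed.
    ups = rng.binomial(n * N, p_up, size=trials)
    return (2.0 * ups - n * N) / (n * N)

rng = np.random.default_rng(0)
rho = 1.0
results = {}
for n in (50, 200, 800):
    M_unbiased = sample_mean_magnetization(n, rho, 0.0, 4000, rng)
    M_biased = sample_mean_magnetization(n, rho, 0.3, 4000, rng)
    # Rescale the typical size of M by n^{(1+rho)/2}; it should stay O(1).
    rescaled = np.sqrt(np.mean(M_unbiased ** 2)) * n ** ((1.0 + rho) / 2.0)
    results[n] = (rescaled, M_biased.mean())
    print(n, rescaled, M_biased.mean())
```

In the unbiased case the rescaled magnitude stays close to a constant as $n$ grows, consistent with $M=O(n^{-(1+\rho)/2})$, whereas in the biased case $M$ itself stays close to the prescribed mean, consistent with $M=O(1)$.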

This consideration and the experiments in Sec. IV suggest that our method based on the Plefka expansion is valid when all the terms in the empirical Bayes likelihood are $O(1)$. Introducing the external field changes the condition under which this criterion is satisfied, which leads to the appropriate scaling of $\alpha$. This is consistent with the numerical observation in Sec. IV that stable results are obtained for different values of $n$ as long as the appropriate scaling of $\alpha$ is maintained.

Acknowledgment

This work was partially supported by JSPS KAKENHI (Grant Numbers: 15H03699, 18K11459, 18H03303, 25120013, and 17H00764), JST CREST (Grant Number: JPMJCR1402), and the COI Program from the JST (Grant Number: JPMJCE1312). TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation.

Bibliography

1. Ackley et al. (1985): D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science 9, 147 (1985).
2. Roudi et al. (2009): Y. Roudi, E. Aurell, and J. Hertz, Frontiers in Computational Neuroscience 3, 1 (2009).
3. Plefka (1982): T. Plefka, J. Phys. A: Math. and Gen. 15, 1971 (1982).
4. Pelizzola (2005): A. Pelizzola, J. Phys. A: Math. and Gen. 38, R309 (2005).
5. Kappen and Rodríguez (1998): H. J. Kappen and F. B. Rodríguez, Neural Computation 10, 1137 (1998).
6. Tanaka (1998): T. Tanaka, Phys. Rev. E 58, 2302 (1998).
7. Yasuda and Horiguchi (2006): M. Yasuda and T. Horiguchi, Physica A 368, 83 (2006).
8. Sessak and Monasson (2009): V. Sessak and R. Monasson, Journal of Physics A: Mathematical and Theoretical 42, 055001 (2009).