Empirical Bayes Method for Boltzmann Machines
Muneki Yasuda, Tomoyuki Obuchi

TL;DR
This paper introduces a fast, non-iterative empirical Bayes algorithm for Boltzmann machines that uses the replica method and Plefka expansion to estimate hyperparameters efficiently, despite some bias issues.
Contribution
It proposes a novel, simple, and fast empirical Bayes method for Boltzmann machines that bypasses computational intractability using advanced approximation techniques.
Findings
The method is computationally efficient and does not require iterative procedures.
It introduces a bias in estimates due to the Plefka expansion.
The peculiar bias behavior is linked to the approximation method used.
Abstract
In this study, we consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function referred to as the empirical Bayes likelihood function in this study. However, the maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method does not require any iterative procedures and is quite simple and fast, though it introduces a bias to the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is supposed to be due to the approximate treatment by the Plefka…
| 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 | 1.2 | ||
|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 | 1.2 | ||
|---|---|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 | 1.2 | ||
|---|---|---|---|---|---|---|---|---|
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Empirical Bayes Method for Boltzmann Machines
Muneki Yasuda
Graduate School of Science and Engineering, Yamagata University, Japan.
Tomoyuki Obuchi
Department of Mathematical and Computing Science, Tokyo Institute of Technology, Japan.
Abstract
In this study, we consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function referred to as the empirical Bayes likelihood function in this study. However, the maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method does not require any iterative procedures and is quite simple and fast, though it introduces a bias to the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is supposed to be due to the approximate treatment by the Plefka expansion. A possible extension to overcome this behavior is also discussed.
Boltzmann machine, inverse Ising problem, empirical Bayes method, replica method, Plefka expansion
pacs:
Valid PACS appear here
††preprint: APS/123-QED
I Introduction
Boltzmann machine learning (BML) Ackley et al. (1985) has been actively studied in the field of machine learning and also in statistical mechanics. In statistical mechanics, the problem of BML is sometimes referred to as the inverse Ising problem, because a Boltzmann machine is the same as an Ising model, and BML can be regarded as an inverse problem for the Ising model. The framework of the usual BML is as follows. Given a set of observed data points (e.g., spin snapshots), we estimate appropriate values of the parameters, the external field and couplings, of our Boltzmann machine through maximum likelihood (ML) estimation (cf. Sec. II.1). Because BML involves intractable multiple summations (i.e., evaluation of the partition function), many approximations for it were proposed from the viewpoint of statistical mechanics Roudi et al. (2009): for example, methods based on mean-field approximations (such as the Plefka expansion Plefka (1982) and the cluster variation method Pelizzola (2005)) Kappen and Rodríguez (1998); Tanaka (1998); Yasuda and Horiguchi (2006); Sessak and Monasson (2009); Yasuda and Tanaka (2009); Ricci-Tersenghi (2012); Furtlehner (2013) and methods based on other approximations Sohl-Dickstein et al. (2011); Yasuda (2015).
In this study, we focus on another type of learning problem. We consider prior distributions of parameters of the Boltzmann machine and assume that the prior distributions are governed by some hyperparameters. The introduction of the prior distributions is strongly connected with the regularized ML estimation (cf. Sec. II.1). As mentioned above, the aim of the usual BML is to optimize the values of the parameters of the Boltzmann machine by using a set of observed data points. Meanwhile, the aim of the problem investigated in this study is the estimation of appropriate values of the hyperparameters from the dataset without estimating specific values of the parameters. One way to allow us to accomplish this from the Bayesian point of view is the empirical Bayes method (or also called type-II ML estimation or evidence approximation) MacKay (1992); Bishop (2006) (cf. Sec. II.2). The schemes of the usual BML and of our problem are illustrated in Fig. 1.
However, the evaluation of the likelihood function in the empirical Bayes method is again intractable, because it involves intractable multiple integrations of the partition function. In this study, we analyze the empirical Bayes method for fully-connected Boltzmann machines, using statistical mechanical techniques based on the replica method Mezard et al. (1987); Nishimori (2001) and the Plefka expansion to derive an algorithm for it. We consider two types of cases of the prior distribution of : the cases of Gaussian and Laplace priors.
The rest of this paper is organized as follows. The formulations of the usual BML and the empirical Bayes method are presented in Sec. II. In Sec. III, we describe our statistical mechanical analysis for the empirical Bayes method. The proposed inference algorithm obtained from our analysis is shown in Sec. III.3 with its pseudocode. In Sec. IV, we examine our proposed method through numerical experiments. Finally, the summary and some discussions are presented in Sec. V.
II Boltzmann Machine and Empirical Bayes Method
II.1 Boltzmann machine and prior distributions
Consider a fully-connected Boltzmann machine with Ising variables Ackley et al. (1985):
[TABLE]
where is the sum over all the distinct pairs of variables; i.e., . is the partition function defined by
[TABLE]
where is the sum over all the possible configurations of ; i.e., . The parameters, and , denote the external field and couplings, respectively.
Given observed data points, , we define the log-likelihood function:
[TABLE]
Maximizing the log-likelihood function with respect to and (i.e., the ML estimation) just corresponds to the BML (or the inverse Ising problem), i.e.,
[TABLE]
Now, we introduce prior distributions for the parameters and as and
[TABLE]
respectively. and are the hyperparameters of these prior distributions. One of the most important motivations for introducing the prior distributions is for a Bayesian interpretation of the regularized ML estimation Bishop (2006). Given the observed dataset , by using the prior distributions, the posterior distribution of and is expressed as
[TABLE]
where
[TABLE]
The distribution in the denominator in Eq. (5), , is sometimes referred to as the evidence. By using the posterior distribution, the maximum a posteriori (MAP) estimation of the parameters is obtained as
[TABLE]
where
[TABLE]
The MAP estimation in Eq. (6) corresponds to the regularized ML estimation, in which and work as a penalty. For example, (i) when the prior distribution of is the Gaussian prior,
[TABLE]
corresponds to the regularization term, and corresponds to its coefficient; (ii) when the prior distribution of is the Laplace prior,
[TABLE]
corresponds to the regularization term, and again corresponds to its coefficient. The variances of these prior distributions are identical, . In this study, as a simple test case, we use these two prior distributions for and
[TABLE]
where is the Dirac delta function, for .
II.2 Framework of the empirical Bayes method
Using the empirical Bayes method, we can infer the values of the hyperparameters, and , from the observed dataset . We define a marginal log-likelihood function as
[TABLE]
where is the average over the prior distributions; i.e.,
[TABLE]
We refer to the marginal log-likelihood function as the empirical Bayes likelihood function in this study. From the perspective of the empirical Bayes method, the optimal values of the hyperparameters, and , are obtained by maximizing of the empirical Bayes likelihood function; i.e.,
[TABLE]
It is noteworthy that in Eq. (11) is identified as the evidence appearing in Eq. (5).
The marginal log-likelihood function can be rewritten as
[TABLE]
Consider the case . In this case, by using the saddle point evaluation, Eq. (13) is reduced to
[TABLE]
In this case, the empirical Bayes’ estimates thus converge to the maximum likelihood estimates of the hyperparameters in the prior distributions in which the maximum likelihood estimates of the parameters (i.e., the solution to the BML) are inserted. This indicates that the parameter estimations can be conducted independently of the hyperparameter estimation. In this study, we do not concern ourselves with this trivial case.
III Statistical Mechanical Analysis
The empirical Bayes likelihood function in Eq. (11) involves intractable multiple integrations. In this section, we evaluate the empirical Bayes likelihood function using a statistical mechanical analysis. We consider the two types of the prior distribution of : one is the Gaussian prior in Eq. (8), and the other is the Laplace prior in Eq. (9).
First, we evaluate the empirical Bayes likelihood function on the basis of the Gaussian prior in Secs. III.1–III.3, after which we describe the evaluation based on the Laplace prior in Sec. III.4.
III.1 Replica method
The empirical Bayes likelihood function in Eq. (11) can be represented as
[TABLE]
where
[TABLE]
and
[TABLE]
are the sample averages of the observed data points. We assume that is a natural number, and therefore Eq. (15) can be expressed as
[TABLE]
where are replica indices, and is the Ising variable on site in the th replica. is the set of all the Ising variables in the replicated system, and is the sum over all the possible configurations of ; i.e., . We evaluate under the assumption that us a natural number, after which we take the limit of of the evaluation result to obtain the empirical Bayes likelihood function (this is the so-called replica trick).
By employing the Gaussian prior in Eq. (8), Eq. (16) becomes
[TABLE]
where
[TABLE]
and
[TABLE]
is the replicated (Helmholtz) free energy Rizzo et al. (2010); Yasuda et al. (2012); Lage-Castellanos et al. (2013); Yasuda et al. (2015); here,
[TABLE]
is the Hamiltonian of the replicated system, where is the sum over all the distinct pairs of replicas; i.e., .
III.2 Plefka expansion
Because the replicated free energy in Eq. (19) includes intractable multiple summations, an approximation is needed to proceed with our evaluation. In this section, we approximate the replicated free energy using the Plefka expansion Plefka (1982). In brief, the Plefka expansion is the perturbative expansion in a Gibbs free energy that is a dual form of a corresponding Helmholtz free energy.
The Gibbs free energy is obtained as
[TABLE]
The derivation of this Gibbs free energy is described in Appendix A. It is noteworthy that this type of expression of the Gibbs free energy implies the replica-symmetric (RS) assumption. To take the replica-symmetry breaking (RSB) into account, explicit treatments of overlaps between different replicas are needed Yasuda et al. (2012). By expanding around , we obtain
[TABLE]
where is the negative mean-field entropy defined by
[TABLE]
and the coefficients, and , are expressed as Eqs. (41) and (46), respectively. The detailed derivation of these coefficients is presented in Appendix B.
From Eqs. (14), (17), (22), and (37), we obtain the empirical Bayes likelihood function as
[TABLE]
where
[TABLE]
From Eqs. (41) and (46), and are
[TABLE]
and
[TABLE]
respectively. The coefficient appearing in the above equation is defined by
[TABLE]
where
[TABLE]
here, .
III.3 Inference algorithm
As mentioned in Sec. II.2, the empirical Bayes inference is achieved by maximizing with respect to and (cf. Eq. (12)). From the extremum condition of Eq. (24) with respect to , we obtain
[TABLE]
where is the value of that satisfies the extremum condition in Eq. (24). From the extremum condition of Eq. (24) with respect to and Eq. (29), we obtain
[TABLE]
From Eqs. (24) and (29), the optimal value of is obtained by
[TABLE]
From Eq. (31), is immediately obtained as follows: (i) when and or when and , , (ii) when and , , and (iii) elsewhere. Here, we ignore the case , because it hardly occurs in realistic settings. By using Eqs. (30) and (31), we can obtain the solution to the empirical Bayes inference without any iterative processes. The pseudocode of the proposed procedure is shown in Algorithm 1.
In the proposed method, the value of does not affect the determination of . Many mean-field-based methods for BML (e.g., listed in Sec. I) have similar procedures, in which are determined separately from . This is seen as one of the common properties of the mean-field-based methods for BML including the current empirical Bayes problem.
III.4 Evaluation based on Laplace prior
The above evaluation was for the Gaussian prior in Eq. (8). Here, we explain the evaluation for the Laplace prior in Eq. (9). By employing the Laplace prior in Eq. (9), Eq. (16) becomes
[TABLE]
where . Here, we assume
[TABLE]
By using the perturbative approximation,
[TABLE]
we obtain the approximation of Eq. (32) as
[TABLE]
The right-hand side of this equation coincides with in Eq. (17). This means that the empirical Bayes inference based on the Laplace prior in Eq. (9) is (approximately) equivalent to that based on the Gaussian prior in Eq. (8) (i.e., ) when the assumption of Eq. (33) is justified. Thus, we can also use the algorithm presented in Sec. III.3 for the case of the Laplace prior.
IV Numerical Experiments
In this section, we describe the results of our numerical experiments. In these experiments, the observed dataset are generated from the generative Boltzmann machine, which has the same form as Eq. (1), by using annealed importance sampling (AIS) Neal (2001). In AIS, we controlled the annealing schedule using a series of inverse temperature , where . The parameters of the generative Boltzmann machine are drawn from the prior distributions in Eqs. (4) and (10). That is, we consider the model-matched case (i.e., the generative and learning models are identical).
In the following, we use the notations and . The standard deviations of the Gaussian prior in Eq. (8) and of the Laplace prior in Eq. (9) are then . We express the hyperparameters for the generative Boltzmann machine by and .
IV.1 Gaussian prior case
Here, we consider the case in which the prior distribution of is the Gaussian prior in Eq. (8). In this case, the Boltzmann machine corresponds to the Sherrington-Kirkpatrick (SK) model Sherrington and Kirkpatrick (1975), and therefore it shows the spin-glass transition at when (i.e., when ).
First, we consider the case . We show the scatter plots for the estimation of for various when and in Fig. 2.
The detailed values of the plots for some values are shown in Tab. 1.
When , our estimates of are in good agreement with . This implies that the validity of our perturbative approximation is lost in the spin-glass phase, as is often the case with many mean-field approximations. Fig. 3 shows the scatter plots for various .
A smaller causes to be overestimated and a larger causes it to be underestimated. At least in our experiments, the optimal value of seems to be when . Our method can estimate together with . The results for the estimation of when and are shown in Fig. 4.
Figs. 4(a) and (b) show the average of (i.e., the mean absolute error (MAE)) and the standard deviation of over 300 experiments, respectively. The MAE and standard deviation increase in the region .
Next, we consider the cases . The scatter plots for the estimation of for various values when and are shown in Fig. 5.
The appropriate values of when and “approximately” seem to be and , respectively. The detailed values of these plots for some values are shown in Tabs. 2 and 3. The results for the estimation of when and and when and are shown in Figs. 6 and 7, respectively.
The increases in the MAE and standard deviations occur earlier than for the case in Fig. 4.
One of the largest qualitative differences between the cases and is the scale of . In the case , the optimal was scaled by with respect to (i.e., ). Meanwhile, in the case , the optimal is scaled by with respect to (i.e., ). This change of scale can be understood from a scale evaluation for the terms in the empirical Bayes likelihood function in Eq. (24). The detailed reasoning is given in Appendix C.
IV.2 Laplace prior case
Here, we consider the case in which the prior distribution of is the Laplace prior in Eq. (9). The scatter plots for the estimation of for various values when are shown in Fig. 8.
The plots shown in Fig. 8 almost completely overlap with those in Fig. 3. Furthermore, all the numerical results in the case also almost completely overlap with the corresponding results obtained in the above Gaussian prior case, and therefore we do not show those results.
V Summary and Discussions
In this study, we proposed a hyperparameters inference algorithm by analyzing the empirical Bayes likelihood function in Eq. (11) using the replica method and the Plefka expansion. The validity of our method was examined in numerical experiments for the Gaussian and Laplace priors, which demonstrated the existence of an appropriate scale in the size of the dataset that can accurately recover the values of the hyperparameters.
However, some problems remain. The first one is the scale of . In our experiments, we found that an appropriate is scaled by when or by when . However, such scales seem to be unnatural, because they should not appear in the original framework of the empirical Bayes method. As discussed in Sec. II.2, when , maximizing the empirical Bayes likelihood function is reduced to the maximum likelihood estimation of the prior distributions for the solution to BML. This must lead to the correct and , because the solution to BML is perfect when . Therefore, such unnatural scales appear due to our approximation, which is also supported by a scale analysis given in Appendix C. An improvement of the approximation (e.g., by evaluating the leading terms in the Plefka expansion or using some other approximations) might reduce these unnatural behaviors.
The second problem is the optimal setting . Empirically, we found that when and that it decreases as increases (e.g., when and when ). As can be seen in the results of our experiments, the solution to our method is robust for the choice of when is small () and is sensitive to it when is large (), where . The estimation of is very important for our method, and it will make our method more practical. This problem would be strongly related to the first problem.
The third problem is the degradation of the estimation accuracy in the spin-glass phase. In our experiments, the estimation accuracies of and were obviously degraded in the spin-glass phase. This means that our Plefka expansion based on the RS assumption loses its validity in the spin-glass phase. In Ref. Yasuda et al. (2012), a Plefka expansion for the one-step RSB was proposed. Employing this expansion instead of the current expansion could reduce the degradation in the spin-glass phase. These three problems should be addressed in our future studies.
In this study, we used fully-connected Boltzmann machines whose variables are all visible. We are also interested in an extension of our method to other types of Boltzmann machines such as Boltzmann machines having specific structures or hidden variables. Furthermore, we considered the model-matched case (i.e., the case in which the generative mode and learning model are the same model) in the current study, but model-mismatched cases are more practical and important.
Appendix A Gibbs Free Energy
In this appendix, we derive the Gibbs free energy for the replicated (Helmholtz) free energy in Eq. (19).
The replicated free energy is obtained by minimizing the variational free energy, defined by
[TABLE]
under the normalization constraint, i.e., , where is a test distribution over , and is the Hamiltonian for the replicated system defined in Eq. (20).
The Gibbs free energy is obtained by adding new constraints to the minimization of . Here, we add the relation as the constraint. By using Lagrange multipliers, the Gibbs free energy is obtained as
[TABLE]
where “” denotes the extremum with respect to the assigned parameters. By performing the extremum operation with respect to and in Eq. (35), we obtain
[TABLE]
The replicated free energy in Eq. (19) coincides with the extremum of this Gibbs free energy with respect to ; i.e.,
[TABLE]
By performing the shift in Eq. (36), we obtain Eq. (21).
Appendix B Derivation of Coefficients of Plefka Expansion
The Plefka expansion considered in this study can be obtained by expanding the Gibbs free energy in Eq. (21) around .
When , we have
[TABLE]
where is defined in Eq. (23).
For the derivations of the coefficients and , we decompose in Eq. (21) into two parts:
[TABLE]
where
[TABLE]
Coefficient is defined by
[TABLE]
The derivative leads to
[TABLE]
where denotes the average for the distribution
[TABLE]
where is the value of that satisfies the extremum condition in Eq. (21) and which is the function relating and ; i.e., . From the extremum condition for in Eq. (21), we obtain the equation
[TABLE]
which holds for any . In the derivation of Eq. (39), we used Eq. (40). When , Eq. (40) reduces to . This means that for any and . Therefore, we obtain
[TABLE]
where . In the derivation of Eq. (41), we used the relation if or .
The coefficient is defined by
[TABLE]
From Eq. (39), the second derivative is
[TABLE]
where
[TABLE]
is Georges’s operator, proposed in Ref. Georges and Yedidia (1991). To simplify the notation, we omit the explicit description of the dependency of the operator on and . By using this operator, the derivative of with respect to is obtained as
[TABLE]
This immediately leads to , because . Therefore,
[TABLE]
is obtained, where we have used . From Eqs. (42) and (43), we have
[TABLE]
Because
[TABLE]
when , we obtain
[TABLE]
where is defined in Eq. (28).
By using Eqs. (44) and (45), we obtain
[TABLE]
where is defined in Eq. (27).
Appendix C Evaluation of Orders of Each Term in the Empirical Bayes Likelihood
Here, we evaluate the orders of each term in Eq. (24) with , with respect to , that is, the orders of each term in
[TABLE]
In the following, we assume that N=O\big{(}n^{\rho}\big{)} () and that are i.i.d. samples from a certain distribution.
First, we consider the case in which the distribution of is unbiased. In this case, we obtain M=O\big{(}n^{-(1+\rho)/2}\big{)}, C_{1}=O\big{(}n^{-1-\rho/2}\big{)}, and
[TABLE]
Similarly, we obtain
[TABLE]
because C_{1}^{2}=O\big{(}n^{-2-\rho}\big{)}. Using the above results and Eqs. (23), (25), and (26), we obtain , , and \phi_{-1}^{(2)}(M)=O\big{(}n^{\rho-1}\big{)}, respectively. Therefore, when , the orders of all the terms in Eq. (47) are just with respect to .
Next, we consider the case in which the distribution of is biased. In this case, , , and are , and furthermore, is because . This leads to , \Phi(M)=O\big{(}n^{\rho}\big{)}, and \phi_{-1}^{(2)}(M)=O\big{(}n^{2\rho}\big{)}. Therefore, when , the orders of all the terms in Eq. (47) are just with respect to .
This consideration and the experiments in Sec. IV imply that our method based on the Plefka expansion can be validated when all the terms in the empirical Bayes likelihood are . The introduction of the external field changes the condition to satisfy this criterion, leading to the appropriate scaling of . This statement is consistent with the numerical observation that a stable result is obtained even for different ’s as long as the appropriate scale in is maintained, as shown in Sec. IV.
Acknowledgment
This work was partially supported by JSPS KAKENHI (Grant Numbers: 15H03699, 18K11459, 18H03303, 25120013, and 17H00764), JST CREST (Grant Number: JPMJCR1402), and the COI Program from the JST (Grant Number JPMJCE1312). TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ackley et al. (1985) D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science 9 , 147 (1985).
- 2Roudi et al. (2009) Y. Roudi, E. Aurell, and J. Hertz, Frontiers in Computational Neuroscience 3 , 1 (2009).
- 3Plefka (1982) T. Plefka, J. Phys. A: Math. and Gen. 15 , 1971 (1982).
- 4Pelizzola (2005) A. Pelizzola, J. Phys. A: Math. and Gen. 38 , R 309 (2005).
- 5Kappen and Rodríguez (1998) H. J. Kappen and F. B. Rodríguez, Neural Computation 10 , 1137 (1998).
- 6Tanaka (1998) T. Tanaka, Phys. Rev. E 58 , 2302 (1998).
- 7Yasuda and Horiguchi (2006) M. Yasuda and T. Horiguchi, Physica A 368 , 83 (2006).
- 8Sessak and Monasson (2009) V. Sessak and R. Monasson, Journal of Physics A: Mathematical and Theoretical 42 , 055001 (2009).
