Active learning for enumerating local minima based on Gaussian process derivatives
Yu Inatsu, Daisuke Sugita, Kazuaki Toyoura, Ichiro Takeuchi

TL;DR
This paper introduces an active learning approach using Gaussian Processes to efficiently identify all local minima of a black-box function by sequentially selecting points based on derivative confidence intervals.
Contribution
The paper proposes a novel active learning method that leverages GP derivatives to enumerate local minima, with theoretical analysis and numerical validation.
Findings
Effective enumeration of local minima achieved
The method outperforms baseline approaches in experiments
Theoretical guarantees support the approach's validity
Abstract
We study active learning (AL) based on Gaussian Processes (GPs) for efficiently enumerating all of the local minimum solutions of a black-box function. This problem is challenging due to the fact that local solutions are characterized by their zero gradient and positive-definite Hessian properties, but those derivatives cannot be directly observed. We propose a new AL method in which the input points are sequentially selected such that the confidence intervals of the GP derivatives are effectively updated for enumerating local minimum solutions. We theoretically analyze the proposed method and demonstrate its usefulness through numerical experiments.
Active learning for enumerating local minima based on Gaussian process derivatives
Yu Inatsu RIKEN Center for Advanced Intelligence Project
Daisuke Sugita Nagoya Institute of Technology
Kazuaki Toyoura Kyoto University
Ichiro Takeuchi Center for Materials Research by Information Integration, National Institute for Materials Science. E-mail: [email protected]
1 Introduction
In many areas of science and technology, machine learning has been successfully used for uncovering unknown complex systems which are formulated as black-box functions. When the evaluation of a black-box function is expensive, it is often difficult to exhaustively investigate the function in the entire input domain. Active learning (AL) has been developed as a method for effectively selecting the input points at which the function evaluations are helpful for the target task. For example, if the target task is to find the global minimum, it is reasonable to evaluate the function at the input points which are likely to be global minima (this AL problem has been intensively studied in the context of Bayesian Optimization (BO) [9, 2, 1, 6, 14, 5]).
In this paper, we study the problem of enumerating local minima (or maxima) of a black-box function. In many applications, it is beneficial to identify the positions of local minima and/or maxima because it helps to roughly grasp the “shape” of the black-box function. Furthermore, it is often the case that each local minimum point has its own special meaning. For example, when modeling the energy space of a physical system, each local minimum point corresponds to a stable energy point of the system, which is crucially important for revealing various physical properties of the system (see §5 for an application of the proposed method to a physical problem).
A local minimum point is characterized by the first and the second derivatives of the function, i.e., an input point is a local minimum if the gradient vector is zero and the Hessian matrix is positive-definite (PD). The difficulty of this problem is due to the fact that we need to select the input points which are likely to be local minima under a situation that those derivatives cannot be directly observed. In other words, we need to select a set of input points at which the function evaluations are helpful for getting information on the zero gradient and the PD Hessian properties.
We employ Gaussian Processes (GPs) for modeling a black-box function. GPs are useful in many AL problems since they enable one to predict not only the average but also the uncertainty of the black-box function. Our basic idea is to exploit the property that the derivative of a GP is also a GP. Based on this property, we develop a method for computing the confidence intervals (CIs) of each element of the gradient vector and the minimum eigenvalue of the Hessian matrix (a Hessian matrix is PD if and only if its minimum eigenvalue is positive). Then, these CIs are used for designing an acquisition function (AF) for efficiently enumerating all of the local minima. We call the proposed method Active learning for Local Optima Enumeration (ALOE).
Related works
BO has been intensively studied (see [10, 9] for comprehensive surveys of BO). In a few existing BO studies, the gradient of a GP is used for accelerating the BO task. For example, [15] discussed the advantage of using the gradient in a framework called Knowledge-Gradient. Furthermore, [11] demonstrated that the gradient of a GP is helpful for modeling dynamical systems. In these works, it is assumed that not only the function values but also the gradient vectors are directly observed. In contrast, we consider a setup where neither the gradient nor the Hessian is directly observed. The CI-based approach in ALOE is motivated by [4], in which the CIs of function values are used for estimating a level set of the function. Similarly, the CIs of the function values were also used for safe BO in [13]. We employ some of the theoretical techniques developed in [12, 4] for analyzing the theoretical properties of ALOE. Unlike these existing studies, we use the CIs of the gradient and the Hessian, which are not easily available since these derivatives cannot be directly observed.
Our contribution
To the best of our knowledge, there is no existing AL method for enumerating local minima. We propose a new AL method called ALOE, in which we develop a method to compute the CIs of the gradients and the minimum eigenvalue of the Hessian, without observing these derivatives directly. Furthermore, based on these CIs, we propose a novel AF for efficiently enumerating local minima. We theoretically analyze the accuracy and the convergence of ALOE, and evaluate its empirical performance by numerical experiments with synthetic data and real application to a physical problem.
2 Preliminaries
Problem setting
Suppose that an unknown function is defined on a set . For simplicity, we consider a finite set of input points , and consider an AL method to classify whether each point is a local minimum point (all the methods and theories in this paper can be extended to the case where is continuous under reasonable assumptions). Let us define the following subset of points in .
Definition 2.1 (The set of local minima).
[TABLE]
where indicates that the matrix is PD.
Note that does not contain “pathological” local minimum points at which all the eigenvalues of the Hessian are zero, e.g., for . Hereafter, with a slight abuse of terminology, we refer to as the set of local minima. The goal of ALOE is to classify all the points in into either or with as few function evaluations as possible.
We employ a GP for modeling the unknown function . Specifically, we assume that the prior distribution of is , where is a PD kernel. Consider the step where a sequence of input points on has been selected by an AL method. Then, the joint distribution follows the -dimensional normal distribution with the mean vector and the covariance matrix whose element is . The output is assumed to be obtained as , where are independent random variables from . Furthermore, the posterior distribution of is also represented as a GP whose mean , variance and covariance are given by
[TABLE]
where , , and is a -dimensional identity matrix.
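As an illustration of the posterior formulas above, the following is a minimal sketch of the standard GP posterior mean and variance. It assumes a zero prior mean and a unit-amplitude Gaussian kernel; the names `gaussian_kernel`, `gp_posterior`, `noise_var`, and `length` are hypothetical and do not come from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, length=1.0):
    """Gaussian kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length**2))

def gp_posterior(X_train, y_train, X_query, noise_var=1e-2, length=1.0):
    """Posterior mean and variance of a zero-mean GP at each row of X_query.

    Assumption: i.i.d. Gaussian observation noise with variance noise_var.
    """
    C = gaussian_kernel(X_train, X_train, length) + noise_var * np.eye(len(X_train))
    K_s = gaussian_kernel(X_train, X_query, length)   # shape (t, m)
    mean = K_s.T @ np.linalg.solve(C, y_train)
    # The prior variance k(x, x) equals 1 for this unit-amplitude kernel.
    var = 1.0 - np.sum(K_s * np.linalg.solve(C, K_s), axis=0)
    return mean, var
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse matrix explicitly; a Cholesky factorization would be a common alternative when many queries share the same training set.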
GP derivatives
We assume that the kernel function is differentiable up to order four. Many commonly used kernels, including Gaussian and linear kernels, satisfy this assumption. Under this assumption, it is known that the first and second derivatives of are also GPs (e.g., [8], [7]). Here, let and be the first and the second derivatives of in the and elements, respectively. Then, given the observations , the posterior distribution of is also a GP, and its mean, variance and covariance are respectively given by
[TABLE]
where the element of is and . Similarly, the posterior distribution of is also a GP, and its mean, variance, and covariance are respectively given by
[TABLE]
where the th element of the second derivative is given by , and
[TABLE]
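To make the derivative posterior concrete, here is a minimal sketch for the first derivatives only, assuming a zero-mean GP with a unit-amplitude Gaussian kernel. For this kernel the cross-covariance between a partial derivative at x and a function value at x_i is the kernel derivative -(x_j - x_ij) / length^2 * k(x, x_i), and the prior variance of each partial derivative is 1 / length^2. The function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, length=1.0):
    """Gaussian kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length**2))

def grad_posterior(X_train, y_train, x, noise_var=1e-2, length=1.0):
    """Posterior mean and variance of each partial derivative of f at x.

    Only function values y_train are used; no derivative is observed.
    """
    C = gaussian_kernel(X_train, X_train, length) + noise_var * np.eye(len(X_train))
    k_x = gaussian_kernel(x[None, :], X_train, length)[0]          # shape (t,)
    # Cov(d f(x) / d x_j, f(x_i)) = -(x_j - x_ij) / length^2 * k(x, x_i)
    dk = -(x[None, :] - X_train) / length**2 * k_x[:, None]        # shape (t, d)
    mean = dk.T @ np.linalg.solve(C, y_train)
    prior_var = 1.0 / length**2
    var = prior_var - np.sum(dk * np.linalg.solve(C, dk), axis=0)
    return mean, var
```

The posterior mean of the gradient returned here coincides with the gradient of the posterior mean of the function, which gives a simple finite-difference check of an implementation.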
3 Proposed method
In this section, we describe the proposed ALOE method for efficiently identifying the set of local minima in (2.1). At the step, ALOE estimates whether each is included in using the CIs of the gradients and the Hessian minimum eigenvalue. Figure 1 illustrates the behavior of ALOE.
3.1 Local minimum estimation based on the CIs of GP derivatives
For each , we define the CIs of at the iteration as , where , and . Then, by using an accuracy parameter we define and as
[TABLE]
Here, is the set of points at which the CIs of the gradients fall within , i.e., the set of points expected to have zero gradient with the accuracy . Similarly, is the set of points at which the CIs of the gradients are sufficiently away from zero.
Next, we consider the identification of the points at which the Hessian of is PD. Note that the Hessian is PD if and only if its minimum eigenvalue is positive. Using this equivalence, we identify PD Hessians using the CI of the minimum eigenvalue. For each point , we define the CI of the minimum eigenvalue as , where , , and is the minimum eigenvalue of the matrix whose element is . On the other hand, since the variance of is not readily available, we use defined as , where . As shown in Section 4, by appropriately adjusting , is shown to be included in with high probability. Then, using an accuracy parameter we define and as
[TABLE]
where (resp. ) is the set of points where the minimum eigenvalue is expected to be positive (resp. negative) with an accuracy . Then, from (3.1)–(3.2) and (3.3)–(3.4), we estimate as follows:
Definition 3.1 ( estimation).
The estimates of and are respectively defined as
[TABLE]
The set of remaining points at step is defined as . Figure 2 shows an example of CIs.
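The classification step can be sketched as follows. The thresholds here are a simplified assumption rather than the paper's exact rules: a point is treated as having zero gradient when every gradient CI lies inside [-eps_g, eps_g], as having nonzero gradient when some gradient CI lies entirely outside that band, and the accuracy parameter for the eigenvalue CI is omitted. All names are hypothetical.

```python
import numpy as np

def classify_points(grad_lo, grad_hi, eig_lo, eig_hi, eps_g):
    """Label each point: 1 = estimated local minimum, 0 = estimated non-minimum,
    -1 = still undecided.

    grad_lo, grad_hi: (n, d) lower/upper CI bounds of each gradient element.
    eig_lo, eig_hi:   (n,) lower/upper CI bounds of the Hessian minimum eigenvalue.
    """
    grad_lo, grad_hi = np.asarray(grad_lo), np.asarray(grad_hi)
    eig_lo, eig_hi = np.asarray(eig_lo), np.asarray(eig_hi)
    zero_grad = np.all((grad_lo >= -eps_g) & (grad_hi <= eps_g), axis=1)
    nonzero_grad = np.any((grad_lo > eps_g) | (grad_hi < -eps_g), axis=1)
    pos_def = eig_lo > 0.0       # whole CI above zero
    not_pos_def = eig_hi < 0.0   # whole CI below zero
    labels = np.full(len(eig_lo), -1)
    labels[zero_grad & pos_def] = 1
    labels[nonzero_grad | not_pos_def] = 0
    return labels
```

With these simplified rules the two decided classes cannot overlap, so the order of the two assignments does not matter; points labeled -1 play the role of the remaining set.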
3.2 Acquisition function by predicted violations
Based on and , we propose an AF for efficiently enumerating local minima. The proposed AF consists of two components:
[TABLE]
where adjusts the trade-off between the two components. The first component is merely the posterior variance of . Thus, when , the AF reduces to that of uncertainty sampling [9]. The second component is a function designed for reducing the uncertainties in the gradients and the Hessian minimum eigenvalue for the task of enumerating local minima. In the remainder of this section, we describe the details of .
First, we define violations (see Figure 2). Recalling that the goal is to classify each point into either or , the violation of the CI of at is defined as
[TABLE]
where and . Here, if and otherwise [math]. The in the second component of the AF is designed to select the next point so that the largest violation is maximally reduced.
Let be the input at which the sum of the violations is largest, i.e.,
[TABLE]
Unfortunately, since the gradient cannot be directly observed, it is not sufficient to simply select the input as the next input point. Instead, we need to select the input point such that the predicted violation of can be maximally reduced by evaluating . Let be the posterior variance of when the function is newly evaluated. By replacing in with , we obtain the predicted CIs (here, the mean is not replaced since its update is unknown before we actually evaluate ). Then, by replacing and with the similarly defined and , respectively, we define the predicted violation , which represents the violation predicted when is newly evaluated. In summary, the second component of the AF is defined as
[TABLE]
Algorithm 1 shows the flow of ALOE.
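The predicted posterior variance used in the AF can be computed before the new function value is known, because a GP posterior variance does not depend on the observed outputs. The sketch below uses the standard one-step Gaussian conditioning identity for a single gradient element, assuming a zero-mean GP with a unit-amplitude Gaussian kernel. All function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, length=1.0):
    """Gaussian kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length**2))

def grad_cross_cov(x, X, j, length=1.0):
    """Prior Cov(d f(x) / d x_j, f(X_i)) for each row X_i of X."""
    k = gaussian_kernel(x[None, :], X, length)[0]
    return -(x[j] - X[:, j]) / length**2 * k

def grad_var(X_train, x_star, j, noise_var=1e-2, length=1.0):
    """Posterior variance of d f(x_star) / d x_j given evaluations at X_train."""
    C = gaussian_kernel(X_train, X_train, length) + noise_var * np.eye(len(X_train))
    c = grad_cross_cov(x_star, X_train, j, length)
    return 1.0 / length**2 - c @ np.linalg.solve(C, c)

def predicted_grad_var(X_train, x_star, x_new, j, noise_var=1e-2, length=1.0):
    """Variance of d f(x_star) / d x_j if f were additionally evaluated at x_new."""
    C = gaussian_kernel(X_train, X_train, length) + noise_var * np.eye(len(X_train))
    c = grad_cross_cov(x_star, X_train, j, length)
    k_new = gaussian_kernel(x_new[None, :], X_train, length)[0]
    Cinv_k = np.linalg.solve(C, k_new)
    var_f = 1.0 - k_new @ Cinv_k   # current posterior variance of f(x_new)
    # Current posterior covariance between the gradient element and f(x_new).
    cov_gf = grad_cross_cov(x_star, x_new[None, :], j, length)[0] - c @ Cinv_k
    current = grad_var(X_train, x_star, j, noise_var, length)
    return current - cov_gf**2 / (var_f + noise_var)
```

The look-ahead value agrees with refitting the GP on the augmented input set, but it reuses the current posterior quantities, which matters when the AF has to be evaluated at every candidate point.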
4 Theoretical results
We provide theorems on the performance and convergence of Algorithm 1. First, the following theorem holds:
Theorem 4.1.
Let be positive numbers, and let . For any , define , and
[TABLE]
Then, Algorithm 1 completes classification after at least the minimum positive integer trials that satisfy the following inequality . Moreover, with probability at least , for any , and it holds that
[TABLE]
and
[TABLE]
Proof.
First, when the inequality on holds, the lengths of and are less than and , respectively. Hence, from the classification rules, all the points are classified. Next, noting that the posterior of is also a GP, from Lemma 5.1 in [12], with probability at least (w.p.a.l.) it holds that for any , and . This implies that w.p.a.l. it holds that for all . Similarly, by using the same argument for , w.p.a.l. the inequality holds for any , and , where . Thus, w.p.a.l. the above inequalities are simultaneously satisfied. Here, denote the Hessian matrix of by , where the element of is and that of follows a normal distribution with mean [math] and variance . Therefore, w.p.a.l. the absolute value of each element in is less than . Then, for any satisfying , it holds that
[TABLE]
Furthermore, noting that and
[TABLE]
the inequality holds. Hence, we have . Similarly, we also have . This implies that w.p.a.l. it holds that . Therefore, w.p.a.l. it holds that , and . ∎
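The step from element-wise bounds on the Hessian to a bound on its minimum eigenvalue can be illustrated numerically. The sketch below uses Weyl's inequality together with the fact that the spectral norm is bounded by the Frobenius norm; the resulting radius is a generic bound and is not claimed to match the constant used in the proof. The names `min_eig_interval` and `halfwidth` are hypothetical.

```python
import numpy as np

def min_eig_interval(mean_hessian, halfwidth):
    """Interval containing lambda_min(H) for every symmetric H with
    |H - mean_hessian| <= halfwidth element-wise.

    Weyl's inequality gives |lambda_min(H) - lambda_min(M)| <= ||H - M||_2,
    and ||H - M||_2 <= ||H - M||_F <= ||halfwidth||_F.
    """
    lam = np.linalg.eigvalsh(mean_hessian)[0]        # eigenvalues are ascending
    radius = np.linalg.norm(halfwidth, ord="fro")
    return lam - radius, lam + radius
```

If the lower end of this interval is positive, every Hessian compatible with the element-wise bounds is PD, which is the kind of conclusion the classification rule needs.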
Next, we provide an upper bound of . Let
[TABLE]
where and is a -dimensional vector whose th element is one and whose remaining elements are zero. Assume the following conditions:
- (A1). There exists a positive constant such that for any satisfying , and for any and .
- (A2). There exists a positive constant such that
[TABLE]
for any , and satisfying , where is given in the assumption (A1).
Finally, let be the mutual information between and . Also, let be the maximum information gain after rounds on , defined by . Then, the following theorem holds:
Theorem 4.2.
Let , , , and
[TABLE]
Then, it holds that
The proof is given in Appendix A. In addition, Srinivas et al. [12] provided the order of for certain kernels under mild conditions. For example, for the Gaussian kernel, its order is . Hence, if we set , then converges to 0, i.e., satisfies for some .
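For a GP observed with Gaussian noise, the mutual information between the noisy outputs and the function values on a finite set has the closed form 0.5 * log det(I + K / noise_var), as used by Srinivas et al. [12]. The following minimal sketch evaluates this quantity for a given kernel matrix; the maximum information gain is its maximum over input sets of a fixed size, which is not computed here. The names `information_gain`, `K_A`, and `noise_var` are hypothetical.

```python
import numpy as np

def information_gain(K_A, noise_var):
    """Mutual information I(y_A; f_A) = 0.5 * log det(I + K_A / noise_var).

    K_A: (t, t) prior kernel matrix on the selected inputs.
    noise_var: variance of the i.i.d. Gaussian observation noise.
    """
    t = K_A.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(t) + K_A / noise_var)
    return 0.5 * logdet
```

Using `slogdet` instead of `det` keeps the computation stable when the number of selected inputs is large.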
5 Numerical experiments
We evaluate the performance of ALOE through numerical experiments. We used the F-score defined as , where precision and recall are given by
[TABLE]
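The evaluation metric can be sketched as follows, assuming the standard definitions: precision is the fraction of estimated local minima that are true local minima, recall is the fraction of true local minima that are estimated, and the F-score is their harmonic mean. The function name and the convention of returning 0 for empty sets are assumptions.

```python
def f_score(estimated, true):
    """F-score between the estimated and the true sets of local minima."""
    estimated, true = set(estimated), set(true)
    true_positive = len(estimated & true)
    precision = true_positive / len(estimated) if estimated else 0.0
    recall = true_positive / len(true) if true else 0.0
    if precision + recall == 0.0:
        return 0.0   # convention used here when nothing is correctly identified
    return 2.0 * precision * recall / (precision + recall)
```

Points can be represented by any hashable identifier, for example grid indices or coordinate tuples.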
Here, we compared the following seven AFs:
(Random):
Random sampling.
(US):
Uncertainty sampling, i.e., we set for any in (3.5).
(LCB):
Lower confidence bound (LCB) , where .
(No_):
AF for ALOE with for any , and use .
(ALOE1):
AF for ALOE with for any .
(ALOE2):
AF for ALOE with if is a multiple of 5 and otherwise 0.
(ALOE3):
AF for ALOE with if is a multiple of 10 and otherwise 0.
Furthermore, we consider the following as a competitor. Let
[TABLE]
where satisfies and satisfies . Then, if satisfies , is classified as . When using this neighborhood-based classification rule (Neighbor), we use as the acquisition function, and the next input is selected by . In all the experiments, we use the Gaussian kernel . Moreover, we assume that the error variance is known.
5.1 Synthetic function experiments
We considered 2-dimensional synthetic functions. Let be a grid point set obtained by dividing the interval into 40 equal parts, and let . Define
[TABLE]
The following three cases were considered:
(Case 1):
, , , , , , .
(Case 2):
, , , , , , .
(Case 3):
, , , , , , .
Furthermore, we set , , and . One initial point was chosen at random, and function evaluations were then performed sequentially according to each AF up to step 200. This was repeated 50 times, and the average F-score was calculated (Fig. 3). The results indicate that ALOE performs better than the other methods.
5.2 Real data experiments
We analyzed potential energy (PE) data for the inorganic crystal AlLaO3. The data include 3-dimensional inputs corresponding to 3-dimensional coordinates and PE , for . Here, is given by and is a grid point set obtained by dividing the interval into 17 equal parts, where . In this experiment, a GP was first fitted to the whole data set excluding outliers, and the posterior mean at each point was defined as the true function . We used this to calculate the energy at each point as . Moreover, we defined . Furthermore, we set , , , , and . In addition, it is known from domain knowledge in materials science that there are 6 local minimum points in . Therefore, these six points are defined as the members of . Here, one initial point was randomly selected, and function evaluations were iterated according to each AF up to step 300. This was repeated 50 times, and the average F-score was calculated (see the bottom right plot in Fig. 3). The results indicate that ALOE performs better than the other competitors, as in the synthetic experiments.
6 Conclusion
In this paper, we proposed an AL method called ALOE for enumerating local minima using GP derivatives. The theoretical results and numerical experiments support the usefulness of ALOE.
Acknowledgments
This work was partially supported by MEXT KAKENHI (17H00758, 16H06538), JST CREST (JPMJCR1302, JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and JST support program for starting up innovation-hub on materials research by information integration initiative.
Appendix
A Proof of Theorem 4.2
From the definition of , can be written as
[TABLE]
Here, for any , and , let
[TABLE]
Then, it holds that
[TABLE]
where the last inequality is derived by . Note that is distributed as a multivariate normal distribution. Thus, from the definition of the conditional variance in the multivariate normal distribution and the assumption (A2), we get . Hence, by substituting this inequality into (A.2), we obtain
[TABLE]
Furthermore, the variance satisfies the following inequality:
[TABLE]
where the last inequality is derived by . Therefore, by using (A.3) and (A.4), we have
[TABLE]
Here, let be a positive number satisfying , and . Then, we set
[TABLE]
Thus, noting that , we obtain and
[TABLE]
Hence, by substituting (A.6) and (A.7) into (A.5), we get
[TABLE]
Here, define
[TABLE]
Moreover, from the definition of and the assumption (A1), it holds that , . Therefore, it holds that
[TABLE]
Thus, we have
[TABLE]
Next, suppose that are positive integers satisfying and . Then, noting the monotonicity of posterior variances and the definition (3.5), we obtain
[TABLE]
and
[TABLE]
Therefore, from Lemma 5.3 in Srinivas et al. [12], we have
[TABLE]
Thus, substituting (A.9) into (A.8) we obtain
[TABLE]
This implies that
[TABLE]
Similarly, for any , and , let
[TABLE]
Then, using the same arguments, we get
[TABLE]
Furthermore, we set . Thus, noting that , we have
[TABLE]
and
[TABLE]
Therefore, from (A.9) it holds that
[TABLE]
Hence, noting that , we obtain
[TABLE]
Finally, by substituting (A.10) and (A.11) into (A.1), we get Theorem 4.2. ∎
B Local minima identification for infinite set
In this section, we consider the case where is infinite. Let be an infinite set, and let be a finite subset of . In addition, we assume that is a compact and convex set. Moreover, we may assume , without loss of generality. Here, for each point , we define and . Similarly, using accuracy parameters and , we define
[TABLE]
and
[TABLE]
Moreover, for each , let be a point in closest to . Then, from (B.1) we define the estimated sets and as follows:
Definition B.1 (Estimated sets and for infinite ).
Estimated sets and of and are respectively defined as
[TABLE]
Furthermore, we define the acquisition function as follows:
Definition B.2 (Function based on predicted violations for infinite ).
Define
[TABLE]
Then, the function is defined by
[TABLE]
Finally, the flow of the proposed method when is infinite is shown in Algorithm 2.
C Theoretical results for infinite
In this section, we provide the theorem on the performance and convergence of Algorithm 2. Hereafter, instead of the assumptions (A1) and (A2), we assume the following conditions:
- (C1). There exists a positive constant such that for any satisfying , and for any and .
- (C2). There exists a positive constant such that
[TABLE]
for any , and satisfying , where is given in the assumption (C1).
Furthermore, we also assume the following condition.
- (C3). There exist positive constants and such that
[TABLE]
and
[TABLE]
where is given by .
Furthermore, let be a set which has elements, and let
[TABLE]
Then, the following theorem holds:
Theorem C.1.
Let be positive numbers, and let . For any , define , ,
[TABLE]
Then, Algorithm 2 completes classification after at least the minimum positive integer trials that satisfy the following inequality . Moreover, with probability at least , for any , and it holds that
[TABLE]
and
[TABLE]
In order to prove Theorem C.1, we use the following lemmas and corollary.
Lemma C.1.
From the assumption (C3), it holds that
[TABLE]
Proof.
The proof follows the same arguments as in Appendix A.2 of Srinivas et al. [12]. ∎
Furthermore, from Appendix A.2 of Srinivas et al. [12] and Lemma C.1, we obtain the following corollary.
Corollary C.1.
With probability at least , it holds that
[TABLE]
and
[TABLE]
Then, the following lemma holds:
Lemma C.2.
For any , let and . Then, with probability at least ,
[TABLE]
and
[TABLE]
Proof.
From Corollary C.1 and (C.1), w.p.a.l. we have
[TABLE]
and
[TABLE]
Hence, noting that , we obtain Lemma C.2. ∎
Using these results, we provide a proof of Theorem C.1.
Proof.
First, for each , when the inequality on holds, the lengths of and are less than and , respectively. Hence, from classification rules, all points in are classified. Moreover, by replacing of and in Theorem 4.1 with , we get
[TABLE]
Therefore, from Theorem 4.1, w.p.a.l. , for any , and it holds that
[TABLE]
Here, by combining this with Lemma C.2, w.p.a.l. , the equations (C.2), (C.3), (C.4) and (C.5) hold. In addition, letting be a matrix whose element is given by , for any satisfying we obtain
[TABLE]
when (C.3) holds. Thus, noting that , we get
[TABLE]
Hence, noting that and , from the definition of , we have
[TABLE]
Similarly, by using the same argument, we obtain
[TABLE]
∎
In addition, the following theorem holds:
Theorem C.2.
Let , , , and
[TABLE]
Then, it holds that
Proof.
The proof follows the same argument as that of Theorem 4.2. ∎
D Sufficient conditions for assumptions
In this section, we provide sufficient conditions for the assumptions. First, the following lemma holds:
Lemma D.1.
Let be a compact set, and let be the interior of . Suppose that is a finite subset of . Assume that the kernel function is five times continuously differentiable. Then, the assumptions (A1) and (A2) hold.
Proof.
For any , noting that is an open set, there exists a positive number such that , where is the -neighborhood of . Here, since is a finite set, we can define . Hence, for any , it holds that . This implies that the assumption (A1) holds.
Next, for any , and satisfying , the variance of is given by
[TABLE]
Here, we put . Then, can be written as , and using Taylor’s expansion we have
[TABLE]
where is a point on a line segment connecting and . Similarly, we obtain
[TABLE]
where is also a point on a line segment connecting and . Moreover, we get
[TABLE]
and
[TABLE]
Here, noting that
[TABLE]
the covariance can be expressed as
[TABLE]
By using the same argument, we also have
[TABLE]
Therefore, using (D.3), (D.4) and (D.5), we obtain
[TABLE]
Thus, by combining (D.2) and (D.8), we have
[TABLE]
Finally, by substituting (D.6), (D.7) and (D.9) into (D.1), we get
[TABLE]
Note that is five times continuously differentiable and is a compact set. This implies that there exists a positive constant such that . Similarly, using the same argument, the inequality also holds. ∎
By replacing with , we obtain the following corollary.
Corollary D.1.
Let be a compact set, and let be the interior of . Suppose that is a finite subset of . Assume that the kernel function is five times continuously differentiable. Then, the assumptions (C1) and (C2) hold.
Finally, the following lemma holds:
Lemma D.2.
Let be a compact set. Assume that the kernel function is eight times differentiable. Then, the assumption (C3) holds.
Proof.
For GP samples , from Theorem 5 of Ghosal and Roy [3], if a kernel function of has a fourth derivative, there exist positive constants and such that
[TABLE]
Here, for the GP sample , its kernel function is the second derivative of the kernel function . Therefore, if has a sixth derivative, then the kernel function of has a fourth derivative. Similarly, if has an eighth derivative, then the kernel function of also has a fourth derivative. ∎
References
[1] Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.
[2] David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems 24, pages 226–234. Curran Associates, Inc., 2011.
[3] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, pages 2413–2429, 2006.
[4] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1344–1350, 2013.
[5] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. J. Mach. Learn. Res., 13(1):1809–1837, June 2012.
[6] Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, and Svetha Venkatesh. Regret for expected improvement over the best-observed value and stopping condition. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77, pages 279–294, 2017.
[7] Athanasios Papoulis and S. Unnikrishna Pillai. Probability, Random Variables, and Stochastic Processes. Tata McGraw-Hill Education, 2002.
[8] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
