Active learning for enumerating local minima based on Gaussian process derivatives

Yu Inatsu; Daisuke Sugita; Kazuaki Toyoura; Ichiro Takeuchi

arXiv:1903.03279·stat.ML·March 11, 2019

Open Access

TL;DR

This paper introduces an active learning approach using Gaussian Processes to efficiently identify all local minima of a black-box function by sequentially selecting points based on derivative confidence intervals.

Contribution

The paper proposes a novel active learning method that leverages GP derivatives to enumerate local minima, with theoretical analysis and numerical validation.

Findings

01

Effective enumeration of local minima achieved

02

The method outperforms baseline approaches in experiments

03

Theoretical guarantees support the approach's validity

Abstract

We study active learning (AL) based on Gaussian Processes (GPs) for efficiently enumerating all of the local minimum solutions of a black-box function. This problem is challenging due to the fact that local solutions are characterized by their zero gradient and positive-definite Hessian properties, but those derivatives cannot be directly observed. We propose a new AL method in which the input points are sequentially selected such that the confidence intervals of the GP derivatives are effectively updated for enumerating local minimum solutions. We theoretically analyze the proposed method and demonstrate its usefulness through numerical experiments.

Equations

$$S:=\left\{\bm{x}\in\mathcal{X}~\Big|~\frac{\partial f}{\partial\bm{x}}=\bm{0}\text{ and }\frac{\partial^{2}f}{\partial\bm{x}\partial\bm{x}^{\top}}\succ 0\right\},$$

$$\mu_{t}(\bm{x})$$

$$k_{t}(\bm{x},\bm{x}^{\prime})$$

$$\mu^{(1)}_{t,i}(\bm{x})={\bm{k}}^{(1)}_{t,i}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}\bm{y}_{t},\quad\{\sigma^{(1)}_{t,i}(\bm{x})\}^{2}=v^{(1)}_{t,i}(\bm{x},\bm{x}),$$

$$v^{(1)}_{t,i}(\bm{x},\bm{x}^{\prime})=v^{(1)}_{i}(\bm{x},\bm{x}^{\prime})-{\bm{k}}^{(1)}_{t,i}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}{\bm{k}}^{(1)}_{t,i}(\bm{x}^{\prime}),$$

$$\mu^{(2)}_{t,jk}(\bm{x})={\bm{k}}^{(2)}_{t,jk}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}\bm{y}_{t},\quad\{\sigma^{(2)}_{t,jk}(\bm{x})\}^{2}=v^{(2)}_{t,jk}(\bm{x},\bm{x}),$$

$$v^{(2)}_{t,jk}(\bm{x},\bm{x}^{\prime})=v^{(2)}_{jk}(\bm{x},\bm{x}^{\prime})-{\bm{k}}^{(2)}_{t,jk}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}{\bm{k}}^{(2)}_{t,jk}(\bm{x}^{\prime}),$$

$$v^{(2)}_{jk}(\bm{x},\bm{x}^{\prime})=\frac{\partial^{4}k(\bm{x},\bm{x}^{\prime})}{\partial x_{j}\partial x_{k}\partial x^{\prime}_{j}\partial x^{\prime}_{k}}.$$

$$G^{(1)}_{t,i},\quad\bar{G}^{(1)}_{t,i},\quad H^{(2)}_{t},\quad\bar{H}^{(2)}_{t}$$

$$\widehat{S}_{t}=H^{(2)}_{t}\cap\bigcap_{i=1}^{d}G^{(1)}_{t,i},\qquad\widehat{\bar{S}}_{t}=\bar{H}^{(2)}_{t}\cup\bigcup_{i=1}^{d}\bar{G}^{(1)}_{t,i}.$$

$$a_{t}(\bm{x})=r_{t}\sigma^{2}_{t}(\bm{x})+(1-r_{t})b_{t}(\bm{x}),$$

$$V^{(1)}_{t,i}(\bm{x})=\min\left\{\xi(u^{(1)}_{t,i}(\bm{x})),\ \xi(-l^{(1)}_{t,i}(\bm{x})),\ \xi(u^{(1)}_{t,i,\epsilon^{(1)}_{i}}(\bm{x}))+\xi(l^{(1)}_{t,i,\epsilon^{(1)}_{i}}(\bm{x}))\right\},$$

$$\bm{x}^{+}_{t}:=\operatornamewithlimits{argmax}_{\bm{x}\in U_{t}}\sum_{i=1}^{d}V^{(1)}_{t,i}(\bm{x}).$$

$$b_{t}(\bm{x})=\sum_{i=1}^{d}\left(V^{(1)}_{t,i}(\bm{x}^{+}_{t})-V^{(1)}_{t,i}(\bm{x}^{+}_{t};\bm{x})\right).$$

$$\eta_{t}=\max\left\{\max_{\bm{x}\in\mathcal{X}}2\beta^{1/2}_{t}\sigma^{(1)}_{t,1}(\bm{x}),\ldots,\max_{\bm{x}\in\mathcal{X}}2\beta^{1/2}_{t}\sigma^{(1)}_{t,d}(\bm{x}),\ \max_{\bm{x}\in\mathcal{X}}\gamma^{1/2}_{t}\varsigma^{(2)}_{t}(\bm{x})\right\}.$$

$$\bm{x}\in\widehat{S}_{t}\Rightarrow-\epsilon^{(1)}_{i}<f^{(1)}_{i}(\bm{x})<\epsilon^{(1)}_{i}\ \land\ \lambda(\bm{x})>-\epsilon^{(2)}$$

$$\bm{x}\in\widehat{\bar{S}}_{t}\Rightarrow f^{(1)}_{i}(\bm{x})\neq 0\ \lor\ \lambda(\bm{x})<\epsilon^{(2)}.$$

$$\lambda(\bm{x})=\inf_{\|\bm{a}\|=1}\bm{a}^{\top}\bm{H}(\bm{x})\bm{a}\geq\lambda_{t}(\bm{x})+\inf_{\|\bm{a}\|=1}\bm{a}^{\top}\bm{Z}_{t}(\bm{x})\bm{a}.$$

$$|\bm{a}^{\top}\bm{Z}_{t}(\bm{x})\bm{a}|\leq\tilde{\gamma}^{1/2}_{t}\varsigma^{(2)}_{t}(\bm{x})(|a_{1}|+\cdots+|a_{d}|)^{2},$$

$$\tilde{f}^{(1)}_{t,i}(\bm{x};\zeta),\quad\tilde{f}^{(2)}_{t,jk}(\bm{x};\zeta)$$

$${\rm{Var}}[\tilde{f}^{(1)}_{0,i}(\bm{x};\zeta)]\leq|\zeta|C_{0},\quad{\rm{Var}}[\tilde{f}^{(2)}_{0,jk}(\bm{x};\zeta)]\leq|\zeta|C_{0},$$

$$\tilde{\eta}^{2}_{t}=\max\left\{\frac{3200\,\tilde{C}^{2}_{0}C_{1}\beta^{3}_{t}\kappa_{t}}{R_{t}\epsilon^{4}},\ \frac{1250\,\tilde{C}^{4}_{0}C_{1}\gamma^{5}_{t}\kappa_{t}}{R_{t}\epsilon^{8}}\right\}.$$

$$Pre=|S\cap\widehat{S}_{t}|/|\widehat{S}_{t}|,\quad Rec=|S\cap\widehat{S}_{t}|/|S|.$$

$$P_{t}(\bm{x};\alpha)=\mathbb{P}\left(f_{t}(\bm{x})\leq f_{t}(\bm{x}^{(\alpha)}_{l,s}),\ l\in[d],\ s\in\{1,-1\}\right)$$

$$\mathcal{X}=\{\bm{x}=(x_{1},x_{2})\in D\mid x_{1}\in[a,b],\ x_{2}\in[a,b]\}.$$

$$\eta^{2}_{t}=\max\left\{\max_{\bm{x}\in\mathcal{X}}4\beta_{t}\{\sigma^{(1)}_{t,1}(\bm{x})\}^{2},\ldots,\max_{\bm{x}\in\mathcal{X}}4\beta_{t}\{\sigma^{(1)}_{t,d}(\bm{x})\}^{2},\ \max_{\bm{x}\in\mathcal{X}}\gamma_{t}\{\varsigma^{(2)}_{t}(\bm{x})\}^{2}\right\}.$$

$$\Delta f^{(1)}_{t,i}(\bm{x};\zeta)=\frac{f_{t}(\bm{x}+\zeta\bm{e}_{i})-f_{t}(\bm{x})}{\zeta}.$$

$$\{\sigma^{(1)}_{t,i}(\bm{x})\}^{2}={\rm{Var}}[f^{(1)}_{t,i}(\bm{x})]={\rm{Var}}[\Delta f^{(1)}_{t,i}(\bm{x};\zeta)]+2{\rm{Cov}}[\Delta f^{(1)}_{t,i}(\bm{x};\zeta),\tilde{f}^{(1)}_{t,i}(\bm{x};\zeta)]+{\rm{Var}}[\tilde{f}^{(1)}_{t,i}(\bm{x};\zeta)]$$

Taxonomy

Topics: Machine Learning and Algorithms · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference

Full text

Yu Inatsu RIKEN Center for Advanced Intelligence Project

Daisuke Sugita Nagoya Institute of Technology

Kazuaki Toyoura Kyoto University

Ichiro Takeuchi Center for Materials Research by Information Integration, National Institute for Materials Science (E-mail: [email protected])

1 Introduction

In many areas of science and technology, machine learning has been successfully used for uncovering unknown complex systems which are formulated as black-box functions. When the evaluation of a black-box function is expensive, it is often difficult to exhaustively investigate the function in the entire input domain. Active learning (AL) has been developed as a method for effectively selecting the input points at which the function evaluations are helpful for the target task. For example, if the target task is to find the global minimum, it is reasonable to evaluate the function at the input points which are likely to be global minima (this AL problem has been intensively studied in the context of Bayesian Optimization (BO) [9, 2, 1, 6, 14, 5]).

In this paper, we study the problem of enumerating local minima (or maxima) of a black-box function. In many applications, it is beneficial to identify the positions of local minima and/or maxima because it helps to roughly grasp the “shape” of the black-box function. Furthermore, it is often the case that each local minimum point has its own special meaning. For example, when modeling the energy space of a physical system, each local minimum point corresponds to a stable energy point of the system, which is crucially important for revealing various physical properties of the system (see §5 for an application of the proposed method to a physical problem).

A local minimum point is characterized by the first and the second derivatives of the function, i.e., an input point is a local minimum if the gradient vector is zero and the Hessian matrix is positive-definite (PD). The difficulty of this problem is due to the fact that we need to select the input points which are likely to be local minima under a situation that those derivatives cannot be directly observed. In other words, we need to select a set of input points at which the function evaluations are helpful for getting information on the zero gradient and the PD Hessian properties.

We employ Gaussian Processes (GPs) for modeling a black-box function. GPs are useful in many AL problems since they enable one to predict not only the average but also the uncertainty of the black-box function. Our basic idea is to exploit the property that the derivative of a GP is also a GP. Based on this property, we develop a method for computing the confidence intervals (CIs) of each element of the gradient vector and of the minimum eigenvalue of the Hessian matrix (a Hessian matrix is PD if and only if its minimum eigenvalue is positive). Then, these CIs are used for designing an acquisition function (AF) for efficiently enumerating all of the local minima. We call the proposed method Active learning for Local Optima Enumeration (ALOE).

Related works

BO has been intensively studied (see [10, 9] for comprehensive surveys of BO). In a few existing BO studies, the gradient of a GP is used for accelerating the BO task. For example, [15] discussed the advantage of using the gradient in a framework called Knowledge-Gradient. Furthermore, [11] demonstrated that the gradient of a GP is helpful for modeling dynamical systems. In these works, it is assumed that not only the function values but also the gradient vectors are directly observed. In contrast, we consider a setup where neither the gradient nor the Hessian is directly observed. The CI-based approach in ALOE is motivated by [4], in which the CIs of function values are used for estimating a level set of the function. Similarly, the CIs of the function values were also used for safe BO in [13]. We employ some of the theoretical techniques developed in [12, 4] for analyzing the theoretical properties of ALOE. In contrast to these existing studies, we use the CIs of the gradient and the Hessian, which are not easily available since these derivatives cannot be directly observed.

Our contribution

To the best of our knowledge, there is no existing AL method for enumerating local minima. We propose a new AL method called ALOE, in which we develop a method to compute the CIs of the gradients and the minimum eigenvalue of the Hessian, without observing these derivatives directly. Furthermore, based on these CIs, we propose a novel AF for efficiently enumerating local minima. We theoretically analyze the accuracy and the convergence of ALOE, and evaluate its empirical performance by numerical experiments with synthetic data and real application to a physical problem.

2 Preliminaries

Problem setting

Suppose that an unknown function $f:D\to\mathbb{R}$ is defined on a set $D\subseteq\mathbb{R}^{d}$ . For simplicity, we consider a finite set of input points $\mathcal{X}\subset D$ , and consider an AL method to classify whether each point $\bm{x}\in\mathcal{X}$ is a local minimum point (all the methods and theories in this paper can be extended to the case where $\mathcal{X}$ is continuous under reasonable assumptions). Let us define the following subset of points in $\mathcal{X}$ .

Definition 2.1 (The set of local minima).

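$$S:=\left\{\bm{x}\in\mathcal{X}~\Big|~\frac{\partial f}{\partial\bm{x}}=\bm{0}\text{ and }\frac{\partial^{2}f}{\partial\bm{x}\partial\bm{x}^{\top}}\succ 0\right\},\tag{2.1}$$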

where $M\succ 0$ indicates that the matrix $M$ is PD.

Note that $S$ does not contain “pathological” local minimum points at which all the eigenvalues of the Hessian are zero, e.g., $x=0$ for $f(x)=x^{4}$ . Hereafter, with a slight abuse of terminology, we refer to $S$ as the set of local minima. The goal of ALOE is to efficiently classify all the points in $\mathcal{X}$ into either $S$ or $\bar{S}:=\mathcal{X}\setminus S$ with as few function evaluations as possible.

We employ GP for modeling the unknown function $f$ . Specifically, we assume that the prior distribution of $f$ is $\mathcal{G}\mathcal{P}(0,k(\bm{x},\bm{x}^{\prime}))$ , where $k(\bm{x},\bm{x}^{\prime}):D\times D\to\mathbb{R}$ is a PD kernel. Consider the $t^{\rm th}$ step where a sequence of the input points ${\bm{x}}_{1},\ldots,{\bm{x}}_{t}$ on $D$ are selected by an AL method. Then, the joint distribution $(f(\bm{x}_{1}),\ldots,f(\bm{x}_{t}))^{\top}$ follows the $t$ -dimensional normal distribution $\mathcal{N}_{t}({\bm{\mu}}_{t},{\bm{K}}_{t})$ with the mean vector ${\bm{\mu}}_{t}=(0,\ldots,0)^{\top}\equiv{\bm{0}}_{t}$ and the covariance matrix $\bm{K}_{t}$ whose $(i,j)^{\rm th}$ element is $k(\bm{x}_{i},\bm{x}_{j})$ . The output $y_{i}$ is assumed to be obtained as $y_{i}=f(\bm{x}_{i})+\varepsilon_{i}$ , where $\varepsilon_{1},\ldots,\varepsilon_{t}$ are independent random variables from $\mathcal{N}(0,\sigma^{2})$ . Furthermore, the posterior distribution of $f$ is also represented as a GP whose mean $\mu_{t}(\bm{x})$ , variance $\sigma^{2}_{t}(\bm{x})$ and covariance $k_{t}(\bm{x},\bm{x}^{\prime})$ are given by

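$$\mu_{t}(\bm{x})={\bm{k}}_{t}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}\bm{y}_{t},\quad\sigma^{2}_{t}(\bm{x})=k_{t}(\bm{x},\bm{x}),\quad k_{t}(\bm{x},\bm{x}^{\prime})=k(\bm{x},\bm{x}^{\prime})-{\bm{k}}_{t}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}{\bm{k}}_{t}(\bm{x}^{\prime}),$$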

where ${\bm{k}}_{t}(\bm{x})=(k(\bm{x}_{1},\bm{x}),\ldots,k(\bm{x}_{t},\bm{x}))^{\top}$ , ${\bm{C}}_{t}=({\bm{K}}_{t}+\sigma^{2}{\bm{I}}_{t})$ , $\bm{y}_{t}=(y_{1},\ldots,y_{t})^{\top}$ and ${\bm{I}}_{t}$ is a $t$ -dimensional identity matrix.
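As a minimal sketch of these posterior formulas, the following NumPy code computes $\mu_{t}(\bm{x})$ and $\sigma^{2}_{t}(\bm{x})$ with ${\bm{C}}_{t}={\bm{K}}_{t}+\sigma^{2}{\bm{I}}_{t}$ . The function names `gaussian_kernel` and `gp_posterior` and the toy data are illustrative assumptions, and the kernel form is the Gaussian kernel used later in the experiments.

```python
import numpy as np

def gaussian_kernel(A, B, sigma_f2=1.0, L=1.0):
    """k(x, x') = sigma_f2 * exp(-||x - x'||^2 / L) for all row pairs of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return sigma_f2 * np.exp(-sq / L)

def gp_posterior(X_obs, y_obs, X_query, noise_var, **kern):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) at each query point."""
    C = gaussian_kernel(X_obs, X_obs, **kern) + noise_var * np.eye(len(X_obs))
    K_q = gaussian_kernel(X_obs, X_query, **kern)   # column j is k_t(x_j)
    mean = K_q.T @ np.linalg.solve(C, y_obs)        # k_t(x)^T C_t^{-1} y_t
    prior_var = np.diag(gaussian_kernel(X_query, X_query, **kern))
    var = prior_var - np.einsum("ij,ij->j", K_q, np.linalg.solve(C, K_q))
    return mean, var

# Toy data (hypothetical): three noisy-free evaluations of sin on the line.
X_obs = np.array([[0.0], [1.0], [2.0]])
y_obs = np.sin(X_obs[:, 0])
mean, var = gp_posterior(X_obs, y_obs, X_obs, noise_var=1e-6, sigma_f2=1.0, L=1.0)
mean_far, var_far = gp_posterior(X_obs, y_obs, np.array([[50.0]]),
                                 noise_var=1e-6, sigma_f2=1.0, L=1.0)
```

At the observed inputs the posterior mean reproduces the data up to the noise level, while far from the data the posterior reverts to the prior (mean 0, variance $\sigma^{2}_{f}$ ).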

GP derivatives

We assume that the kernel function $k(\bm{x},\bm{x}^{\prime})$ is differentiable up to order four. Many commonly used kernels, including Gaussian and linear kernels, satisfy this assumption. Under this assumption, it is known that the first and second derivatives of $\mathcal{G}\mathcal{P}(0,k(\bm{x},\bm{x}^{\prime}))$ are also GPs (e.g., [8], [7]). Here, let $f^{(1)}_{i}$ and $f^{(2)}_{jk}$ be the first and the second derivatives of $f$ in the $i^{\rm th}$ and $(j,k)^{\rm th}$ elements, respectively. Then, given the observations $(\bm{x}_{1},y_{1}),\ldots,(\bm{x}_{t},y_{t})$ , the posterior distribution of $f^{(1)}_{i}$ is also a GP, and its mean, variance and covariance are respectively given by

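$$\mu^{(1)}_{t,i}(\bm{x})={\bm{k}}^{(1)}_{t,i}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}\bm{y}_{t},\quad\{\sigma^{(1)}_{t,i}(\bm{x})\}^{2}=v^{(1)}_{t,i}(\bm{x},\bm{x}),\quad v^{(1)}_{t,i}(\bm{x},\bm{x}^{\prime})=v^{(1)}_{i}(\bm{x},\bm{x}^{\prime})-{\bm{k}}^{(1)}_{t,i}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}{\bm{k}}^{(1)}_{t,i}(\bm{x}^{\prime}),$$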

where the $l^{\rm th}$ element of ${\bm{k}}^{(1)}_{t,i}(\bm{x})$ is $\partial k(\bm{x}_{l},\bm{x})/\partial x_{i}$ and $v^{(1)}_{i}(\bm{x},\bm{x}^{\prime})=\partial^{2}k(\bm{x},\bm{x}^{\prime})/\partial x_{i}\partial x^{\prime}_{i}.$ Similarly, the posterior distribution of $f^{(2)}_{jk}$ is also a GP, and its mean, variance, and covariance are respectively given by

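$$\mu^{(2)}_{t,jk}(\bm{x})={\bm{k}}^{(2)}_{t,jk}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}\bm{y}_{t},\quad\{\sigma^{(2)}_{t,jk}(\bm{x})\}^{2}=v^{(2)}_{t,jk}(\bm{x},\bm{x}),\quad v^{(2)}_{t,jk}(\bm{x},\bm{x}^{\prime})=v^{(2)}_{jk}(\bm{x},\bm{x}^{\prime})-{\bm{k}}^{(2)}_{t,jk}(\bm{x})^{\top}{\bm{C}}_{t}^{-1}{\bm{k}}^{(2)}_{t,jk}(\bm{x}^{\prime}),$$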

where the $l$ th element of the second derivative ${\bm{k}}^{(2)}_{t,jk}(\bm{x})$ is given by $\partial^{2}k(\bm{x}_{l},\bm{x})/\partial x_{j}\partial x_{k}$ , and

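$$v^{(2)}_{jk}(\bm{x},\bm{x}^{\prime})=\frac{\partial^{4}k(\bm{x},\bm{x}^{\prime})}{\partial x_{j}\partial x_{k}\partial x^{\prime}_{j}\partial x^{\prime}_{k}}.$$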
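To make the first-derivative posterior concrete, here is a small sketch for the Gaussian kernel $k({\bm{x}},{\bm{x}}^{\prime})=\sigma^{2}_{f}\exp(-\|{\bm{x}}-{\bm{x}}^{\prime}\|^{2}/L)$ . For this kernel, $\partial k(\bm{x}_{l},\bm{x})/\partial x_{i}=k(\bm{x}_{l},\bm{x})\cdot 2(x_{l,i}-x_{i})/L$ and $v^{(1)}_{i}(\bm{x},\bm{x})=2\sigma^{2}_{f}/L$ . The helper names and the toy data are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def k_gauss(a, b, sigma_f2, L):
    """Gaussian kernel between two single points."""
    return sigma_f2 * np.exp(-np.sum((a - b) ** 2) / L)

def grad_posterior(X_obs, y_obs, x, i, noise_var, sigma_f2, L):
    """Posterior mean and variance of f_i^{(1)}(x) = df/dx_i at one point x."""
    K = np.array([[k_gauss(p, q, sigma_f2, L) for q in X_obs] for p in X_obs])
    C = K + noise_var * np.eye(len(X_obs))
    # l-th element of k^{(1)}_{t,i}(x): d k(x_l, x) / d x_i
    k1 = np.array([k_gauss(p, x, sigma_f2, L) * 2.0 * (p[i] - x[i]) / L
                   for p in X_obs])
    mean = k1 @ np.linalg.solve(C, y_obs)
    # prior variance v_i^{(1)}(x, x) = 2 sigma_f2 / L for this kernel
    var = 2.0 * sigma_f2 / L - k1 @ np.linalg.solve(C, k1)
    return mean, var

# Toy check (hypothetical data): the derivative of sin at x = 1 is cos(1).
X_obs = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y_obs = np.sin(X_obs[:, 0])
mean, var = grad_posterior(X_obs, y_obs, np.array([1.0]), 0,
                           noise_var=1e-4, sigma_f2=1.0, L=1.0)
```

Only function values enter the computation, yet the posterior mean of the derivative tracks the true slope and its variance drops well below the prior value $2\sigma^{2}_{f}/L$ .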

3 Proposed method

In this section, we describe the proposed ALOE method for efficiently identifying the set of local minima $S$ in (2.1). At each step, ALOE estimates whether each ${\bm{x}}\in\mathcal{X}$ is included in $S$ using the CIs of the gradients and the Hessian minimum eigenvalue. Figure 1 illustrates the behavior of ALOE.

3.1 Local minimum estimation based on the CIs of GP derivatives

For each ${\bm{x}}\in\mathcal{X}$ , we define the CIs of $f^{(1)}_{i}({\bm{x}})$ at the $t^{\rm th}$ iteration as $Q^{(1)}_{t,i}({\bm{x}})=[l^{(1)}_{t,i}({\bm{x}}),u^{(1)}_{t,i}({\bm{x}})]$ , where $l^{(1)}_{t,i}({\bm{x}})=\mu^{(1)}_{t,i}({\bm{x}})-\beta^{1/2}_{t}\sigma^{(1)}_{t,i}({\bm{x}})$ , $u^{(1)}_{t,i}({\bm{x}})=\mu^{(1)}_{t,i}({\bm{x}})+\beta^{1/2}_{t}\sigma^{(1)}_{t,i}({\bm{x}})$ and $\beta^{1/2}_{t}\geq 0$ . Then, by using an accuracy parameter $\epsilon^{(1)}_{i}>0$ we define $G^{(1)}_{t,i}$ and $\bar{G}^{(1)}_{t,i}$ as

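$$G^{(1)}_{t,i}=\left\{{\bm{x}}\in\mathcal{X}\mid-\epsilon^{(1)}_{i}<l^{(1)}_{t,i}({\bm{x}})\ \land\ u^{(1)}_{t,i}({\bm{x}})<\epsilon^{(1)}_{i}\right\},\tag{3.1}$$

$$\bar{G}^{(1)}_{t,i}=\left\{{\bm{x}}\in\mathcal{X}\mid l^{(1)}_{t,i}({\bm{x}})>0\ \lor\ u^{(1)}_{t,i}({\bm{x}})<0\right\}.\tag{3.2}$$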

Here, $G^{(1)}_{t,i}$ is the set of points at which the CIs of the gradients fall within $[-\epsilon^{(1)}_{i},\epsilon^{(1)}_{i}]$ , i.e., the set of points expected to have zero gradient with the accuracy $\epsilon^{(1)}_{i}$ . Similarly, $\bar{G}^{(1)}_{t,i}$ is the set of points at which the CIs of the gradients are sufficiently away from zero.

Next, we consider the identification of the points at which the Hessian of $f$ is PD. Note that the Hessian is PD if and only if its minimum eigenvalue is positive. Using this equivalence, we identify PD Hessians using the CI of the minimum eigenvalue. For each point ${\bm{x}}\in\mathcal{X}$ , we define the CI of the minimum eigenvalue $\lambda({\bm{x}})$ as $Q^{(2)}_{t}({\bm{x}})=[l^{(2)}_{t},u^{(2)}_{t}]$ , where $l^{(2)}_{t}=\lambda_{t}({\bm{x}})-\gamma^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})$ , $u^{(2)}_{t}=\lambda_{t}({\bm{x}})+\gamma^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})$ , $\gamma^{1/2}_{t}\geq 0$ and $\lambda_{t}({\bm{x}})$ is the minimum eigenvalue of the $d\times d$ matrix whose $(j,k)^{\rm th}$ element is $\mu^{(2)}_{t,jk}({\bm{x}})$ . Since the variance of $\lambda({\bm{x}})$ is not readily available, we instead use $\varsigma^{(2)}_{t}({\bm{x}})$ defined as $\varsigma^{(2)}_{t}({\bm{x}})=\max_{j,k\in[d]}\sigma^{(2)}_{t,jk}({\bm{x}})$ , where $[d]:=\{1,\ldots,d\}$ . As shown in Section 4, by appropriately adjusting $\gamma_{t}$ , $\lambda({\bm{x}})$ is included in $Q^{(2)}_{t}({\bm{x}})$ with high probability. Then, using an accuracy parameter $\epsilon^{(2)}>0$ we define $H^{(2)}_{t}$ and $\bar{H}^{(2)}_{t}$ as

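$$H^{(2)}_{t}=\left\{{\bm{x}}\in\mathcal{X}\mid l^{(2)}_{t}>-\epsilon^{(2)}\right\},\tag{3.3}$$

$$\bar{H}^{(2)}_{t}=\left\{{\bm{x}}\in\mathcal{X}\mid u^{(2)}_{t}<\epsilon^{(2)}\right\},\tag{3.4}$$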

where $H^{(2)}_{t}$ (resp. $\bar{H}^{(2)}_{t}$ ) is the set of points where the minimum eigenvalue is expected to be positive (resp. negative) with an accuracy $\epsilon^{(2)}$ . Then, from (3.1)–(3.2) and (3.3)–(3.4), we estimate $S$ as follows:

Definition 3.1 ( $S$ estimation).

The estimates of $S$ and $\bar{S}:=\mathcal{X}\setminus S$ are respectively defined as

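$$\widehat{S}_{t}=H^{(2)}_{t}\cap\bigcap_{i=1}^{d}G^{(1)}_{t,i},\qquad\widehat{\bar{S}}_{t}=\bar{H}^{(2)}_{t}\cup\bigcup_{i=1}^{d}\bar{G}^{(1)}_{t,i}.$$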

The set of remaining points at step $t$ is defined as $U_{t}=\mathcal{X}\setminus(\widehat{S}_{t}\cup\widehat{\bar{S}}_{t})$ . Figure 2 shows an example of CIs.
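The classification step can be sketched as follows. The inequality forms used for $G^{(1)}_{t,i}$ , $\bar{G}^{(1)}_{t,i}$ , $H^{(2)}_{t}$ and $\bar{H}^{(2)}_{t}$ are assumptions read off from the verbal descriptions above and from the guarantees of Theorem 4.1; the function name `classify_points` and the array layout are hypothetical.

```python
import numpy as np

def classify_points(mu1, sd1, mu2, sd2, beta_sqrt, gamma_sqrt, eps1, eps2):
    """Classify n candidate points from posterior moments of the GP derivatives.

    mu1, sd1 : (n, d) posterior mean / std of each gradient element
    mu2, sd2 : (n, d, d) posterior mean / std of each Hessian element
    eps1     : (d,) gradient accuracy parameters; eps2 : scalar
    Returns boolean masks (in_S_hat, in_S_bar_hat, unclassified).
    """
    l1 = mu1 - beta_sqrt * sd1
    u1 = mu1 + beta_sqrt * sd1
    lam = np.linalg.eigvalsh(mu2)[:, 0]          # min eigenvalue of the mean Hessian
    vs = sd2.reshape(len(sd2), -1).max(axis=1)   # varsigma: largest element-wise std
    l2 = lam - gamma_sqrt * vs
    u2 = lam + gamma_sqrt * vs
    G = (l1 > -eps1) & (u1 < eps1)               # CI inside (-eps, eps)  [assumed form]
    G_bar = (l1 > 0) | (u1 < 0)                  # CI excludes zero       [assumed form]
    H = l2 > -eps2                               # [assumed form]
    H_bar = u2 < eps2                            # [assumed form]
    S_hat = H & G.all(axis=1)
    S_bar_hat = H_bar | G_bar.any(axis=1)
    return S_hat, S_bar_hat, ~(S_hat | S_bar_hat)

# Three hypothetical points in d = 2: a confident minimum, a point with a
# clearly nonzero gradient, and a point whose CIs are still too wide.
mu1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
sd1 = np.array([[0.01, 0.01], [0.01, 0.01], [1.0, 1.0]])
mu2 = np.stack([np.eye(2)] * 3)
sd2 = np.stack([np.full((2, 2), 0.01), np.full((2, 2), 0.01), np.ones((2, 2))])
S_hat, S_bar_hat, U = classify_points(mu1, sd1, mu2, sd2, 3.0, 3.0,
                                      np.array([0.1, 0.1]), 0.1)
```

The third mask corresponds to $U_{t}$ , the points that still need more evidence before they can be assigned to either estimate.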

3.2 Acquisition function by predicted violations

Based on $\widehat{S}_{t}$ and $\widehat{\bar{S}}_{t}$ , we propose an AF for efficiently enumerating local minima. The proposed AF $a_{t}({\bm{x}})$ consists of two components:

$$a_{t}({\bm{x}})=r_{t}\sigma^{2}_{t}({\bm{x}})+(1-r_{t})b_{t}({\bm{x}}),\tag{3.5}$$

where $r_{t}\in\{0,1\}$ switches between the two components. The first component $\sigma^{2}_{t}({\bm{x}})$ is merely the posterior variance of $f({\bm{x}})$ . Thus, when $r_{t}=1$ , the AF reduces to that of uncertainty sampling [9]. The second component $b_{t}({\bm{x}})$ is a function designed for reducing the uncertainties in the gradients and the Hessian minimum eigenvalue for the task of enumerating local minima. In the remainder of this section, we describe the details of $b_{t}({\bm{x}})$ .

First, we define violations (see Figure 2). Recalling that the goal is to classify each point ${\bm{x}}\in\mathcal{X}$ into either $S$ or $\bar{S}$ , the violation of the CI of $f^{(1)}_{i}$ at ${\bm{x}}$ is defined as

$$V^{(1)}_{t,i}({\bm{x}})=\min\left\{\xi(u^{(1)}_{t,i}({\bm{x}})),\ \xi(-l^{(1)}_{t,i}({\bm{x}})),\ \xi(u^{(1)}_{t,i,\epsilon^{(1)}_{i}}({\bm{x}}))+\xi(l^{(1)}_{t,i,\epsilon^{(1)}_{i}}({\bm{x}}))\right\},$$

where $u^{(1)}_{t,i,\epsilon^{(1)}_{i}}({\bm{x}})=u^{(1)}_{t,i}({\bm{x}})-\epsilon^{(1)}_{i}$ and $l^{(1)}_{t,i,\epsilon^{(1)}_{i}}({\bm{x}})=-\epsilon^{(1)}_{i}-l^{(1)}_{t,i}({\bm{x}})$ . Here, $\xi(a)=a$ if $a>0$ and $\xi(a)=0$ otherwise. The $b_{t}({\bm{x}})$ in the second component of the AF is designed to select the next point such that the largest violation is maximally reduced.

Let ${\bm{x}}^{+}\in U_{t}$ be the input at which the sum of the violation $\sum^{d}_{i=1}V^{(1)}_{t,i}({\bm{x}}^{+})$ is largest, i.e.,

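$${\bm{x}}^{+}_{t}:=\operatornamewithlimits{argmax}_{{\bm{x}}\in U_{t}}\sum_{i=1}^{d}V^{(1)}_{t,i}({\bm{x}}).$$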

Unfortunately, since the gradient $f^{(1)}_{i}({\bm{x}}^{+})$ cannot be directly observed, it is not sufficient to simply select the input ${\bm{x}}^{+}$ as the next input point. Instead, we need to select the input point ${\bm{x}}^{*}\in D$ such that the predicted violation of $f_{i}^{(1)}({\bm{x}}^{+})$ can be maximally reduced by evaluating $f({\bm{x}}^{*})$ . Let $\sigma_{t,i}^{(1)}({\bm{x}};{\bm{x}}^{*})$ be the posterior standard deviation of $f_{i}^{(1)}({\bm{x}})$ when the function $f({\bm{x}}^{*})$ is newly evaluated. By replacing $\sigma_{t,i}^{(1)}({\bm{x}})$ in $Q^{(1)}_{t,i}({\bm{x}})$ with $\sigma_{t,i}^{(1)}({\bm{x}};{\bm{x}}^{*})$ , we obtain the predicted CIs $[l_{t,i}^{(1)}({\bm{x}}^{+};{\bm{x}}^{*}),u_{t,i}^{(1)}({\bm{x}}^{+};{\bm{x}}^{*})]$ (here, the mean $\mu_{t,i}^{(1)}({\bm{x}})$ is not replaced since its update is unknown before we actually evaluate $f({\bm{x}}^{*})$ ). Then, by replacing $l_{t,i,\epsilon_{i}}^{(1)}({\bm{x}})$ and $u_{t,i,\epsilon_{i}}^{(1)}({\bm{x}})$ with the similarly defined $l_{t,i,\epsilon_{i}}^{(1)}({\bm{x}};{\bm{x}}^{*})$ and $u_{t,i,\epsilon_{i}}^{(1)}({\bm{x}};{\bm{x}}^{*})$ , respectively, we define the predicted violation $V_{t,i}^{(1)}({\bm{x}}^{+};{\bm{x}}^{*})$ , which represents the violation predicted when $f({\bm{x}}^{*})$ is newly evaluated. In summary, the second component of the AF is defined as

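$$b_{t}({\bm{x}})=\sum_{i=1}^{d}\left(V^{(1)}_{t,i}({\bm{x}}^{+}_{t})-V^{(1)}_{t,i}({\bm{x}}^{+}_{t};{\bm{x}})\right).$$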

Algorithm 1 shows the flow of ALOE.
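The violation and the second AF component can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the predicted standard deviations $\sigma_{t,i}^{(1)}({\bm{x}}^{+};{\bm{x}}^{*})$ are taken as a precomputed input array `sd_plus_given`.

```python
import numpy as np

def xi(a):
    """xi(a) = a if a > 0 and 0 otherwise."""
    return np.maximum(a, 0.0)

def violation(mu, sd, beta_sqrt, eps):
    """V_{t,i}^{(1)} for gradient means mu and stds sd (last axis runs over i)."""
    l = mu - beta_sqrt * sd
    u = mu + beta_sqrt * sd
    above_zero = xi(u)                       # part of the CI lying above 0
    below_zero = xi(-l)                      # part of the CI lying below 0
    outside_eps = xi(u - eps) + xi(-eps - l) # parts lying outside [-eps, eps]
    return np.minimum(np.minimum(above_zero, below_zero), outside_eps)

def b_t(mu_plus, sd_plus, sd_plus_given, beta_sqrt, eps):
    """Predicted reduction of the summed violation at x_t^+ for each candidate.

    sd_plus_given[c, i] is the std of f_i^{(1)}(x_t^+) predicted after evaluating
    f at candidate c; the mean is kept fixed because its update is unknown.
    """
    current = violation(mu_plus, sd_plus, beta_sqrt, eps).sum()
    predicted = violation(mu_plus[None, :], sd_plus_given, beta_sqrt, eps).sum(axis=1)
    return current - predicted

# Hypothetical numbers for d = 2 and two candidates.
mu_plus = np.array([0.5, 0.0])
sd_plus = np.array([0.2, 0.2])
sd_plus_given = np.array([[0.2, 0.2],    # candidate 0: no reduction in std
                          [0.1, 0.1]])   # candidate 1: std halved
scores = b_t(mu_plus, sd_plus, sd_plus_given, beta_sqrt=3.0,
             eps=np.array([0.1, 0.1]))
```

In this toy case the current summed violation is 0.7; the candidate that halves the standard deviations lowers it to 0.3, so it receives the larger score and would be preferred when $r_{t}=0$ .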

4 Theoretical results

We provide theorems on the performance and convergence of Algorithm 1. First, the following theorem holds:

Theorem 4.1.

Let $\epsilon^{(1)}_{1},\ldots,\epsilon^{(1)}_{d},\epsilon^{(2)}$ be positive numbers, and let $\epsilon=\min\{\epsilon^{(1)}_{1},\ldots,\epsilon^{(1)}_{d},\epsilon^{(2)}\}$ . For any $\delta\in(0,1)$ , define $\beta_{t}=2\log((d+1)|\mathcal{X}|\pi^{2}t^{2}/(6\delta))$ , $\gamma_{t}=2d^{2}\log(d^{2}(d+1)|\mathcal{X}|\pi^{2}t^{2}/(6\delta))$ and

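$$\eta_{t}=\max\left\{\max_{{\bm{x}}\in\mathcal{X}}2\beta^{1/2}_{t}\sigma^{(1)}_{t,1}({\bm{x}}),\ldots,\max_{{\bm{x}}\in\mathcal{X}}2\beta^{1/2}_{t}\sigma^{(1)}_{t,d}({\bm{x}}),\ \max_{{\bm{x}}\in\mathcal{X}}\gamma^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})\right\}.$$

Then, Algorithm 1 completes classification no later than step $T$ , where $T$ is the smallest positive integer satisfying $\eta^{2}_{T}\leq\epsilon^{2}$ . Moreover, with probability at least $1-\delta$ , for any $t\geq 1$ , ${\bm{x}}\in\mathcal{X}$ and $i\in[d]$ it holds that

$${\bm{x}}\in\widehat{S}_{t}\Rightarrow-\epsilon^{(1)}_{i}<f^{(1)}_{i}({\bm{x}})<\epsilon^{(1)}_{i}\ \land\ \lambda({\bm{x}})>-\epsilon^{(2)}$$

and

$${\bm{x}}\in\widehat{\bar{S}}_{t}\Rightarrow f^{(1)}_{i}({\bm{x}})\neq 0\ \lor\ \lambda({\bm{x}})<\epsilon^{(2)}.$$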

Proof.

First, when the inequality on $\eta_{T}$ holds, the lengths of $Q^{(1)}_{T,i}({\bm{x}})$ and $Q^{(2)}_{T}({\bm{x}})$ are less than $\epsilon^{(1)}_{i}$ and $2\epsilon^{(2)}$ , respectively. Hence, from the classification rules, all the points are classified. Next, noting that the posterior of $f^{(1)}_{i}$ is also a GP, from Lemma 5.1 in [12], with probability at least (w.p.a.l.) $1-(d+1)^{-1}\delta$ it holds that $f^{(1)}_{i}({\bm{x}})\in Q^{(1)}_{t,i}({\bm{x}})$ for any $i\in[d]$ , $t\geq 1$ and ${\bm{x}}\in\mathcal{X}$ . This implies that w.p.a.l. $1-d(d+1)^{-1}\delta$ it holds that $f^{(1)}_{i}({\bm{x}})\in Q^{(1)}_{t,i}({\bm{x}})$ for all $i\in[d]$ . Similarly, by using the same argument for $f^{(2)}_{jk}$ , w.p.a.l. $1-d^{-2}(d+1)^{-1}\delta$ the inequality $|f^{(2)}_{jk}({\bm{x}})-\mu^{(2)}_{t,jk}({\bm{x}})|\leq\tilde{\gamma}^{1/2}_{t}\sigma^{(2)}_{t,jk}({\bm{x}})$ holds for any $j,k\in[d]$ , $t\geq 1$ and ${\bm{x}}\in\mathcal{X}$ , where $\tilde{\gamma}^{1/2}_{t}=d^{-1}\gamma^{1/2}_{t}$ . Thus, w.p.a.l. $1-(d+1)^{-1}\delta$ the above inequalities are simultaneously satisfied. Here, denote the Hessian matrix at ${\bm{x}}$ by ${\bm{H}}({\bm{x}})={\bm{M}}_{t}({\bm{x}})+{\bm{Z}}_{t}({\bm{x}})$ , where the $(j,k)^{\rm th}$ element of ${\bm{M}}_{t}({\bm{x}})$ is $\mu^{(2)}_{t,jk}({\bm{x}})$ and that of ${\bm{Z}}_{t}({\bm{x}})$ is normally distributed with mean $0$ and variance $\{\sigma^{(2)}_{t,jk}({\bm{x}})\}^{2}$ . Therefore, w.p.a.l. $1-(d+1)^{-1}\delta$ the absolute value of each element in ${\bm{Z}}_{t}({\bm{x}})$ is less than $\tilde{\gamma}^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})$ . Then, with ${\bm{a}}$ ranging over vectors satisfying $\|{\bm{a}}\|=1$ , it holds that

$$\lambda({\bm{x}})=\inf_{\|{\bm{a}}\|=1}{\bm{a}}^{\top}{\bm{H}}({\bm{x}}){\bm{a}}\geq\lambda_{t}({\bm{x}})+\inf_{\|{\bm{a}}\|=1}{\bm{a}}^{\top}{\bm{Z}}_{t}({\bm{x}}){\bm{a}}.$$

Furthermore, noting that $(|a_{1}|+\cdots+|a_{d}|)^{2}\leq d$ and

$$|{\bm{a}}^{\top}{\bm{Z}}_{t}({\bm{x}}){\bm{a}}|\leq\tilde{\gamma}^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})(|a_{1}|+\cdots+|a_{d}|)^{2},$$

the inequality $|{\bm{a}}^{\top}{\bm{Z}}_{t}({\bm{x}}){\bm{a}}|\leq\tilde{\gamma}^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})d=\gamma^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})$ holds. Hence, we have $\lambda({\bm{x}})\geq\lambda_{t}({\bm{x}})-\gamma^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})$ . Similarly, we also have $\lambda({\bm{x}})\leq\lambda_{t}({\bm{x}})+\gamma^{1/2}_{t}\varsigma^{(2)}_{t}({\bm{x}})$ . This implies that w.p.a.l. $1-(d+1)^{-1}\delta$ it holds that $\lambda({\bm{x}})\in Q^{(2)}_{t}({\bm{x}})$ . Therefore, w.p.a.l. $1-\delta$ it holds that $f^{(1)}_{i}({\bm{x}})\in Q^{(1)}_{t,i}({\bm{x}})$ , $i\in[d]$ and $\lambda({\bm{x}})\in Q^{(2)}_{t}({\bm{x}})$ . ∎
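The confidence parameters $\beta_{t}$ and $\gamma_{t}$ from Theorem 4.1 are simple to evaluate. The sketch below only transcribes the two formulas; the function names and the example values of $d$ , $|\mathcal{X}|$ and $\delta$ are assumptions chosen for illustration.

```python
import math

def beta_t(t, d, n_points, delta):
    """beta_t = 2 log((d+1) |X| pi^2 t^2 / (6 delta))."""
    return 2.0 * math.log((d + 1) * n_points * math.pi**2 * t**2 / (6.0 * delta))

def gamma_t(t, d, n_points, delta):
    """gamma_t = 2 d^2 log(d^2 (d+1) |X| pi^2 t^2 / (6 delta))."""
    return 2.0 * d**2 * math.log(
        d**2 * (d + 1) * n_points * math.pi**2 * t**2 / (6.0 * delta))

# Hypothetical setting: d = 2, 961 candidate points, delta = 0.05.
d, n_points, delta = 2, 961, 0.05
b10, g10 = beta_t(10, d, n_points, delta), gamma_t(10, d, n_points, delta)
```

Both parameters grow only logarithmically in $t$ , and they are linked by $\gamma_{t}=d^{2}\beta_{t}+2d^{2}\log d^{2}$ , which reflects the union bound over the $d^{2}$ Hessian elements used in the proof.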

Next, we provide an upper bound of $\eta_{t}$ . Let

[TABLE]

where ${\bm{e}}_{jk}={\bm{e}}_{j}+{\bm{e}}_{k}$ and ${\bm{e}}_{i}$ is a $d$ -dimensional vector whose $i$ th element is one and remainders are zeros. Assume the following conditions:

(A1).

There exists a positive constant $A_{0}$ such that for any $\zeta$ satisfying $|\zeta|<A_{0}$ , $({\bm{x}}+\zeta{\bm{e}}_{i})\in D$ and $({\bm{x}}+\zeta{\bm{e}}_{jk})\in D$ for any ${\bm{x}}\in\mathcal{X}$ and $i,j,k\in[d]$ .

(A2).

There exists a positive constant $C_{0}$ such that

$${\rm{Var}}[\tilde{f}^{(1)}_{0,i}({\bm{x}};\zeta)]\leq|\zeta|C_{0},\quad{\rm{Var}}[\tilde{f}^{(2)}_{0,jk}({\bm{x}};\zeta)]\leq|\zeta|C_{0},$$

for any ${\bm{x}}\in\mathcal{X}$ , $i,j,k\in[d]$ and $\zeta$ satisfying $|\zeta|<A_{0}$ , where $A_{0}$ is given in the assumption (A1).

Finally, let ${\rm{I}}({\bm{y}}_{t};f)$ be the mutual information between ${\bm{y}}_{t}$ and $f$ . Also let $\kappa_{t}$ be the maximum information gain after $t$ rounds on $D$ , defined by $\kappa_{t}=\max_{A\subset D;|A|=t}{\rm{I}}({\bm{y}}_{A};f_{A}).$ Then, the following theorem holds:

Theorem 4.2.

Let $C_{1}=2\sigma^{2}C_{2}$ , $C_{2}=\sigma^{-2}/\log(1+\sigma^{-2})$ , $R_{t}=r_{1}+\cdots+r_{t}$ , $\tilde{C}_{0}>C_{0}$ and

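$$\tilde{\eta}^{2}_{t}=\max\left\{\frac{3200\,\tilde{C}^{2}_{0}C_{1}\beta^{3}_{t}\kappa_{t}}{R_{t}\epsilon^{4}},\ \frac{1250\,\tilde{C}^{4}_{0}C_{1}\gamma^{5}_{t}\kappa_{t}}{R_{t}\epsilon^{8}}\right\}.$$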

Then, it holds that $\eta^{2}_{t}\leq\tilde{\eta}^{2}_{t}+\frac{4}{5}\epsilon^{2}.$

The proof is given in the Appendix. In addition, Srinivas et al. [12] provided the order of $\kappa_{t}$ for certain kernels under mild conditions. For example, for the Gaussian kernel, its order is $\mathcal{O}((\log t)^{d+1})$ . Hence, if we set $R_{t}=\mathcal{O}(t)$ , then $\tilde{\eta}^{2}_{t}$ converges to 0, i.e., $\eta^{2}_{T}$ satisfies $\eta^{2}_{T}<\epsilon^{2}$ for some $T$ .
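The convergence claim can be spelled out as a short order computation. The sketch below assumes fixed $d$ , $|\mathcal{X}|$ and $\delta$ , so that $\beta_{t}$ and $\gamma_{t}$ are both $\mathcal{O}(\log t)$ , and reads the condition on $R_{t}$ as linear growth, $R_{t}\geq ct$ for some constant $c>0$ (an assumption made here for the argument to go through).

```latex
\begin{align*}
\frac{\beta_t^{3}\kappa_t}{R_t}
  &= \mathcal{O}\!\left(\frac{(\log t)^{3}(\log t)^{d+1}}{t}\right)
   = \mathcal{O}\!\left(\frac{(\log t)^{d+4}}{t}\right),\\
\frac{\gamma_t^{5}\kappa_t}{R_t}
  &= \mathcal{O}\!\left(\frac{(\log t)^{5}(\log t)^{d+1}}{t}\right)
   = \mathcal{O}\!\left(\frac{(\log t)^{d+6}}{t}\right),\\
\tilde{\eta}_t^{2}
  &= \mathcal{O}\!\left(\frac{(\log t)^{d+6}}{t}\right)\longrightarrow 0
  \quad (t\to\infty).
\end{align*}
% Once \tilde{\eta}_T^2 < \epsilon^2/5, Theorem 4.2 gives
% \eta_T^2 \le \tilde{\eta}_T^2 + (4/5)\epsilon^2 < \epsilon^2.
```

Note that the bound is informative only when $R_{t}>0$ , that is, when uncertainty sampling steps ( $r_{t}=1$ ) are taken at a positive rate.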

5 Numerical experiments

We confirm the performance of ALOE by numerical experiments. We used F-score defined as $F=2\cdot Pre\cdot Rec/(Pre+Rec)$ , where precision $Pre$ and recall $Rec$ are given by

$$Pre=|S\cap\widehat{S}_{t}|/|\widehat{S}_{t}|,\quad Rec=|S\cap\widehat{S}_{t}|/|S|.$$
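For reference, the F-score can be computed from the two point sets as in the following sketch. The function name `f_score` is hypothetical, and returning 0 when there is no overlap is a convention assumed here, since the formula itself is undefined in that case.

```python
def f_score(true_minima, estimated):
    """F = 2 * Pre * Rec / (Pre + Rec) for two collections of point identifiers."""
    S, S_hat = set(true_minima), set(estimated)
    hits = len(S & S_hat)
    if hits == 0:
        return 0.0            # assumed convention when Pre and Rec are both zero
    pre = hits / len(S_hat)   # fraction of estimated points that are true minima
    rec = hits / len(S)       # fraction of true minima that were found
    return 2 * pre * rec / (pre + rec)

# Hypothetical example: 4 true minima, 3 estimated points, 2 of them correct.
score = f_score({1, 2, 3, 4}, {1, 2, 5})
```

Here $Pre=2/3$ and $Rec=1/2$ , which gives $F=4/7$ .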

Here, we compared the following seven AFs:

(Random):

Random sampling.

(US):

Uncertainty sampling, i.e., we set $r_{t}=1$ for any $t\geq 1$ in (3.5).

(LCB):

Lower confidence bound (LCB) $lcb({\bm{x}})$ , where $lcb({\bm{x}})=\mu_{t}({\bm{x}})-3\sigma_{t}({\bm{x}})$ .

(No_ $\lambda$ ):

AF for ALOE with $r_{t}=0$ for any $t\geq 1$ , and use $U_{t}=\mathcal{X}\setminus(\bigcap_{i=1}^{d}G^{(1)}_{t,i}\cup\bigcup_{i=1}^{d}\bar{G}^{(1)}_{t,i})$ .

(ALOE1):

AF for ALOE with $r_{t}=0$ for any $t\geq 1$ .

(ALOE2):

AF for ALOE with $r_{t}=1$ if $t$ is a multiple of 5 and otherwise 0.

(ALOE3):

AF for ALOE with $r_{t}=1$ if $t$ is a multiple of 10 and otherwise 0.
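The settings above differ only in the schedule of $r_{t}$ , which can be written down directly. The function names and the string labels (including `"No_lambda"`) are hypothetical; the cumulative count `R` corresponds to $R_{t}=r_{1}+\cdots+r_{t}$ from Theorem 4.2.

```python
def r_schedule(variant, t):
    """r_t for step t >= 1 under the compared settings."""
    period = {"ALOE2": 5, "ALOE3": 10}
    if variant == "US":
        return 1                      # uncertainty sampling at every step
    if variant in ("ALOE1", "No_lambda"):
        return 0                      # violation-based component only
    return 1 if t % period[variant] == 0 else 0

def R(variant, t):
    """R_t = r_1 + ... + r_t, the number of uncertainty sampling steps so far."""
    return sum(r_schedule(variant, s) for s in range(1, t + 1))

counts = {v: R(v, 200) for v in ("US", "ALOE1", "ALOE2", "ALOE3")}
```

Over 200 steps, ALOE2 and ALOE3 spend 40 and 20 evaluations on uncertainty sampling, respectively, so their $R_{t}$ grows linearly in $t$ , whereas ALOE1 keeps $R_{t}=0$ .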

Furthermore, we consider the following as a competitor. Let

$$P_{t}({\bm{x}};\alpha)=\mathbb{P}\left(f_{t}({\bm{x}})\leq f_{t}({\bm{x}}^{(\alpha)}_{l,s}),\ l\in[d],\ s\in\{1,-1\}\right),$$

where ${\bm{x}}^{(\alpha)}_{l,1}$ satisfies ${\bm{x}}-{\bm{x}}^{(\alpha)}_{l,1}=\alpha{\bm{e}}_{l}$ and ${\bm{x}}^{(\alpha)}_{l,-1}$ satisfies ${\bm{x}}-{\bm{x}}^{(\alpha)}_{l,-1}=-\alpha{\bm{e}}_{l}$ . Then, if ${\bm{x}}$ satisfies $P_{t}({\bm{x}};0.3)>0.6$ , ${\bm{x}}$ is classified into $\widehat{S}_{t}$ . When using this neighborhood-based classification rule (Neighbor), we use $Nei_{t}({\bm{x}})=\sigma^{2}_{t}({\bm{x}})P_{t}({\bm{x}};0.3)$ as the acquisition function, and the next input is selected by ${\bm{x}}_{t+1}=\operatornamewithlimits{argmax}Nei_{t}({\bm{x}})$ . In all the experiments, we use the Gaussian kernel $k({\bm{x}},{\bm{x}}^{\prime})=\sigma^{2}_{f}\exp(-\|{\bm{x}}-{\bm{x}}^{\prime}\|^{2}/L)$ . Moreover, we assume that the error variance $\sigma^{2}$ is known.

5.1 Synthetic function experiments

We considered 2-dimensional synthetic functions. Let $\mathcal{A}$ be a grid point set obtained by dividing the interval $[A,B]$ into 40 equal parts, and let $D=\mathcal{A}\times\mathcal{A}$ . Define

$$\mathcal{X}=\{{\bm{x}}=(x_{1},x_{2})\in D\mid x_{1}\in[a,b],\ x_{2}\in[a,b]\}.$$

The following three cases were considered:

(Case 1):

$f(x_{1},x_{2})=\sin(x_{1})\cos(x_{2})$ , $A=-\pi/2$ , $B=7\pi/2$ , $a=0$ , $b=3\pi$ , $\sigma^{2}_{f}=1$ , $L=4.5$ .

(Case 2):

$f(x_{1},x_{2})=18+\sum_{s=1}^{2}\{(1/4)x^{4}_{s}-(13/3)x^{3}_{s}+25x^{2}_{s}-56x_{s}\}/3$ , $A=-1$ , $B=9$ , $a=0$ , $b=8$ , $\sigma^{2}_{f}=2$ , $L=3$ .

(Case 3):

$f(x_{1},x_{2})=\sum_{s=1}^{2}(x_{s}-4)^{2}$ , $A=-1$ , $B=9$ , $a=0$ , $b=8$ , $\sigma^{2}_{f}=2$ , $L=3$ .
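The grid construction for these cases can be sketched as follows. It assumes that dividing $[A,B]$ into 40 equal parts yields 41 grid values per axis including both endpoints; the function name `build_grid` and the small tolerance used for the boundary comparison are illustrative choices.

```python
import numpy as np

def build_grid(A, B, a, b, parts=40, tol=1e-9):
    """Return D = A_grid x A_grid and the candidate set X = D restricted to [a, b]^2."""
    axis = np.linspace(A, B, parts + 1)              # 41 values when parts = 40
    g1, g2 = np.meshgrid(axis, axis, indexing="ij")
    D = np.column_stack([g1.ravel(), g2.ravel()])
    # tolerance guards against floating-point error at the boundaries a and b
    inside = np.all((D >= a - tol) & (D <= b + tol), axis=1)
    return D, D[inside]

# Case 1 and Cases 2-3 of the synthetic experiments.
D1, X1 = build_grid(-np.pi / 2, 7 * np.pi / 2, 0.0, 3 * np.pi)
D2, X2 = build_grid(-1.0, 9.0, 0.0, 8.0)
```

Under this reading, $D$ has $41^{2}=1681$ points in every case, while $\mathcal{X}$ has $31^{2}=961$ points in Case 1 and $33^{2}=1089$ points in Cases 2 and 3. The margin between $\mathcal{X}$ and the boundary of $D$ leaves room for function evaluations outside $\mathcal{X}$ .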

Furthermore, we set $\sigma^{2}=0.005$ , $\beta^{1/2}_{t}=\gamma^{1/2}_{t}=3$ , $\epsilon^{(1)}_{1}=\epsilon^{(1)}_{2}\equiv h\in\{0.35,0.45\}$ and $\epsilon^{(2)}=0.1$ . Here, one initial point was randomly determined, and based on each AF, function evaluations were performed sequentially up to step 200. This was repeated 50 times, and the average of the F-score was calculated (Fig.3). The results indicate that ALOE performs better than the other methods.

5.2 Real data experiments

We analyzed the potential energy (PE) data in the inorganic crystal AlLaO3. The data includes 3-dimensional inputs ${\bm{x}}_{i}\in D$ corresponding to 3-dimensional coordinates and PE $y^{\ast}_{i}$ , for $i=1,\ldots,5832$ . Here, $D$ is given by $D=\mathcal{A}^{3}$ and $\mathcal{A}$ is a grid point set obtained by dividing the interval $[0,r]$ into 17 equal parts, where $r\approx 3.6$ . In this experiment, a GP was first fitted using the whole data excluding outliers, and the posterior mean at each point was defined as the true function $f({\bm{x}})$ . We used this to calculate the energy at each point as $y_{i}=f({\bm{x}}_{i})+\varepsilon_{i}$ . Moreover, we defined $\mathcal{X}=\{{\bm{x}}=(x_{1},x_{2},x_{3})\in D\ |\ x_{s}\in[0.4,2],\ s=1,2,3\}$ . Furthermore, we set $\sigma^{2}=0.01$ , $L=2.5$ , $\beta^{1/2}_{t}=4$ , $\gamma^{1/2}_{t}=1$ , $\epsilon^{(1)}_{1}=\epsilon^{(1)}_{2}=\epsilon^{(1)}_{3}\equiv h\in\{0.7,0.8\}$ and $\epsilon^{(2)}=1.2$ . In addition, it is known from domain knowledge in materials science that there are six local minimum points in $\mathcal{X}$ . Therefore, these six points are defined as the members of $S$ . Here, one initial point was randomly selected, and based on each AF, function evaluations were iterated up to step 300. This was repeated 50 times, and the average of the F-score was calculated (see the bottom right plot in Fig.3). The results indicate that ALOE performs better than the other competitors, as in the synthetic experiments.

6 Conclusion

In this paper, we proposed an AL method called ALOE for enumerating local minima using GP derivatives. From the theoretical results and numerical experiments, the usefulness of ALOE was confirmed.

Acknowledgments

This work was partially supported by MEXT KAKENHI (17H00758, 16H06538), JST CREST (JPMJCR1302, JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and JST support program for starting up innovation-hub on materials research by information integration initiative.

Appendix

A Proof of Theorem 4.2

From the definition of $\eta_{t}$, $\eta^{2}_{t}$ can be written as

[TABLE]

Here, for any $i\in[d]$ , $t\geq 1$ and ${\bm{x}}\in\mathcal{X}$ , let

[TABLE]

Then, it holds that

[TABLE]

where the last inequality is derived from $2{\rm{Cov}}[X,Y]\leq{\rm{Var}}[X]+{\rm{Var}}[Y]$. Note that $(f({\bm{x}}_{1}),\ldots,f({\bm{x}}_{t}),\tilde{f}^{(1)}_{i}({\bm{x}};\zeta))^{\top}$ follows a multivariate normal distribution. Thus, from the definition of the conditional variance in the multivariate normal distribution and from the assumption (A2), we get ${\rm{Var}}[\tilde{f}^{(1)}_{t,i}({\bm{x}};\zeta)]\leq{\rm{Var}}[\tilde{f}^{(1)}_{0,i}({\bm{x}};\zeta)]$. Hence, by substituting this inequality into (A.2), we obtain

[TABLE]

Furthermore, the variance ${\rm{Var}}[\Delta f^{(1)}_{t,i}({\bm{x}};\zeta)]$ satisfies the following inequality:

[TABLE]

where the last inequality is derived from $-2{\rm{Cov}}[X,Y]\leq{\rm{Var}}[X]+{\rm{Var}}[Y]$. Therefore, by using (A.3) and (A.4), we have

[TABLE]
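The two covariance bounds invoked in this proof both follow from the non-negativity of a variance. As a short worked step, for generic random variables $X$ and $Y$ with finite variances (these symbols are placeholders, not tied to a specific term of the proof):

$$
0\leq{\rm{Var}}[X\mp Y]={\rm{Var}}[X]+{\rm{Var}}[Y]\mp 2{\rm{Cov}}[X,Y]
\quad\Longrightarrow\quad
\pm 2{\rm{Cov}}[X,Y]\leq{\rm{Var}}[X]+{\rm{Var}}[Y].
$$

The upper signs give $2{\rm{Cov}}[X,Y]\leq{\rm{Var}}[X]+{\rm{Var}}[Y]$, and the lower signs give $-2{\rm{Cov}}[X,Y]\leq{\rm{Var}}[X]+{\rm{Var}}[Y]$.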

Here, let $\tilde{C}_{0}$ be a positive number satisfying $\tilde{C}_{0}>C_{0}$ , $\epsilon^{2}(10\beta_{1}\tilde{C}_{0})^{-1}<A_{0}$ and $(2/5)\epsilon^{2}(\gamma_{1}\tilde{C}_{0})^{-1}<A_{0}$ . Then, we set

[TABLE]

Thus, noting that $\beta_{1}\leq\beta_{t}$ , we obtain $|\zeta|<A_{0}$ and

[TABLE]

Hence, by substituting (A.6) and (A.7) into (A.5), we get

[TABLE]

Here, define

[TABLE]

Moreover, from the definition of $\zeta$ and the assumption (A1), it holds that ${\bm{x}}\in D$ , $({\bm{x}}+\zeta{\bm{e}}_{i})\in D$ . Therefore, it holds that

[TABLE]

Thus, we have

[TABLE]

Next, suppose that $t_{1},\ldots,t_{l}$ are positive integers satisfying $1\leq t_{1}\leq\cdots\leq t_{l}\leq t$ and $r_{t_{s}}=1$. Then, noting the monotonicity of posterior variances and the definition (3.5), we obtain

[TABLE]

and

[TABLE]
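The monotonicity of posterior variances used in this step can be illustrated numerically. The sketch below is not the paper's implementation: it assumes a one-dimensional RBF kernel with a hypothetical length-scale `ell`, uses the noise variance $\sigma^{2}=0.005$ from Section 5.1, and applies the standard rank-one GP covariance update after each noisy observation.

```python
import math


def rbf(x, y, ell=0.5):
    """RBF kernel; the length-scale ell is a hypothetical choice."""
    return math.exp(-(x - y) ** 2 / (2 * ell ** 2))


def posterior_variances(candidates, observed, noise_var=0.005):
    """Posterior variance at every candidate after 0, 1, 2, ... observations.

    `observed` lists indices into `candidates`, in the order they are evaluated.
    """
    n = len(candidates)
    cov = [[rbf(a, b) for b in candidates] for a in candidates]
    history = [[cov[i][i] for i in range(n)]]
    for j in observed:
        denom = cov[j][j] + noise_var
        col = [cov[i][j] for i in range(n)]
        # Rank-one update: k_t(a, b) = k_{t-1}(a, b) - k_{t-1}(a, x_t) k_{t-1}(x_t, b) / (k_{t-1}(x_t, x_t) + sigma^2)
        cov = [[cov[a][b] - col[a] * col[b] / denom for b in range(n)]
               for a in range(n)]
        history.append([cov[i][i] for i in range(n)])
    return history
```

Each update subtracts a non-negative quantity from every diagonal entry, so the variance at any fixed point can only decrease as observations are added, including repeated observations at the same point.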

Therefore, from Lemma 5.3 in Srinivas et al. [12], we have

[TABLE]

Thus, substituting (A.9) into (A.8) we obtain

[TABLE]

This implies that

[TABLE]

Similarly, for any $j,k\in[d]$ , $t\geq 1$ and ${\bm{x}}\in\mathcal{X}$ , let

[TABLE]

Then, using the same arguments, we get

[TABLE]

Furthermore, we set $\tilde{\zeta}=(2/5)\epsilon^{2}(\gamma_{t}\tilde{C}_{0})^{-1}$ . Thus, noting that $|\tilde{\zeta}|<A_{0}$ , we have

[TABLE]

and

[TABLE]

Therefore, from (A.9) it holds that

[TABLE]

Hence, noting that $\{\varsigma^{(2)}_{t}({\bm{x}})\}^{2}=\max_{j,k\in[d]}\{\sigma^{(2)}_{t,jk}({\bm{x}})\}^{2}$ , we obtain

[TABLE]

Finally, by substituting (A.10) and (A.11) into (A.1), we get Theorem 4.2. ∎

B Local minima identification for infinite set $\mathcal{X}$

In this section, we consider the case that $\mathcal{X}$ is infinite. Let $\mathcal{X}$ be an infinite set, and let $\mathcal{X}^{\star}$ be a finite subset of $\mathcal{X}$ . In addition, we assume that $D$ is a compact and convex set. Moreover, we may assume $D\subset[0,r]^{d}$ , without loss of generality. Here, for each point ${\bm{x}}\in\mathcal{X}^{\star}$ , we define $Q^{(1)}_{t,i}({\bm{x}})$ and $Q^{(2)}_{t}({\bm{x}})$ . Similarly, using accuracy parameters $\epsilon^{(1)}_{i}>0$ and $\epsilon^{(2)}>0$ , we define

[TABLE]

and

[TABLE]

Moreover, for each ${\bm{a}}\in\mathcal{X}$ , let $[{\bm{a}}]$ be a point in $\mathcal{X}^{\star}$ closest to ${\bm{a}}$ . Then, from (B.1) we define the estimated sets $\widehat{S}_{t}$ and $\widehat{\bar{S}}_{t}$ as follows:

Definition B.1 (Estimated sets $\widehat{S}_{t}$ and $\widehat{\bar{S}}_{t}$ for infinite $\mathcal{X}$ ).

Estimated sets $\widehat{S}_{t}$ and $\widehat{\bar{S}}_{t}$ of $S$ and $\mathcal{X}\setminus S$ are respectively defined as

[TABLE]
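The map ${\bm{a}}\mapsto[{\bm{a}}]$, which sends a point of $\mathcal{X}$ to a closest point of the finite set $\mathcal{X}^{\star}$, can be sketched as below. This is an illustration only: it assumes Euclidean distance, the helper name `nearest_grid_point` is hypothetical, and ties are broken by taking the first closest point, which the text does not specify.

```python
import math
from itertools import product


def nearest_grid_point(a, grid):
    """Return a point of `grid` closest to `a` in Euclidean distance."""
    return min(grid, key=lambda g: math.dist(a, g))


# Example: a coarse 3 x 3 grid standing in for the finite set X*
grid = list(product([0.0, 0.5, 1.0], repeat=2))
closest = nearest_grid_point((0.4, 0.9), grid)
```

A point of $\mathcal{X}$ would then inherit the classification of its representative in $\mathcal{X}^{\star}$.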

Furthermore, we define the acquisition function $b_{t}({\bm{x}})$ as follows:

Definition B.2 (Function $b_{t}({\bm{x}})$ based on predicted violations for infinite $\mathcal{X}$ ).

Define

[TABLE]

Then, the function $b_{t}({\bm{x}})$ is defined by

[TABLE]

Finally, the flow of the proposed method when $\mathcal{X}$ is infinite is shown in Algorithm 2.

C Theoretical results for infinite $\mathcal{X}$

In this section, we provide the theorem on the performance and convergence of Algorithm 2. Hereafter, instead of the assumptions (A1) and (A2), we assume the following conditions:

(C1).

There exists a positive constant $A^{\star}_{0}$ such that for any $\zeta$ satisfying $|\zeta|<A^{\star}_{0}$, $({\bm{x}}+\zeta{\bm{e}}_{i})\in D$ and $({\bm{x}}+\zeta{\bm{e}}_{jk})\in D$ for any ${\bm{x}}\in\mathcal{X}^{\star}$ and $i,j,k\in[d]$.

(C2).

There exists a positive constant $C^{\star}_{0}$ such that

[TABLE]

for any ${\bm{x}}\in\mathcal{X}^{\star}$ , $i,j,k\in[d]$ and $\zeta$ satisfying $|\zeta|<A^{\star}_{0}$ , where $A^{\star}_{0}$ is given in the assumption (C1).

Furthermore, we also assume the following condition.

(C3).

There exist positive constants $a$ and $b$ such that

[TABLE]

and

[TABLE]

where $f^{(3)}_{ijk}({\bm{x}})$ is given by $\partial f^{(2)}_{ij}({\bm{x}})/\partial x_{k}$ .

Furthermore, let $\mathcal{X}^{\star}$ be a set which has $(\tau_{\epsilon})^{d}$ elements, and let

[TABLE]

Then, the following theorem holds:

Theorem C.1.

Let $\epsilon^{(1)}_{1},\ldots,\epsilon^{(1)}_{d},\epsilon^{(2)}$ be positive numbers, and let $\epsilon=\min\{\epsilon^{(1)}_{1},\ldots,\epsilon^{(1)}_{d},\epsilon^{(2)}\}$ . For any $\delta\in(0,1)$ , define $L=b(\log\{a(d^{2}+d^{3})/\delta\})^{1/2}$ , $\tau_{\epsilon}=\lceil d^{2}r\epsilon^{-1}L\rceil$ ,

[TABLE]

Then, Algorithm 2 completes classification no later than the smallest positive integer $T$ of trials satisfying $\eta^{2}_{T}\leq\epsilon^{2}$. Moreover, with probability at least $1-2\delta$, for any $t\geq 1$, ${\bm{x}}\in\mathcal{X}$ and $i\in[d]$ it holds that

[TABLE]

and

[TABLE]

In order to prove Theorem C.1, we consider the following lemmas and corollary.

Lemma C.1.

From the assumption (C3), it holds that

[TABLE]

Proof.

The proof follows from the same arguments as in Appendix A.2 of Srinivas et al. [12]. ∎

Furthermore, from Appendix A.2 of Srinivas et al. [12] and Lemma C.1, we obtain the following corollary.

Corollary C.1.

With probability at least $1-(d^{2}+d^{3})ae^{-(L/b)^{2}}$ , it holds that

[TABLE]

and

[TABLE]

Then, the following lemma holds:

Lemma C.2.

For any $\delta\in(0,1)$ , let $L=b(\log\{a(d^{2}+d^{3})/\delta\})^{1/2}$ and $\tau_{\epsilon}=\lceil d^{2}r\epsilon^{-1}L\rceil$ . Then, with probability at least $1-\delta$ ,

[TABLE]

and

[TABLE]

Proof.

From Corollary C.1 and (C.1), with probability at least $1-\delta$ we have

[TABLE]

and

[TABLE]

Hence, noting that $\tau^{-1}_{\epsilon}\leq\epsilon(Lrd^{2})^{-1}$ , we obtain Lemma C.2. ∎
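The choices of $L$ and $\tau_{\epsilon}$ in Lemma C.2 can be checked numerically. The sketch below uses the formulas stated in the lemma; the numerical values of `a`, `b`, `d`, `r`, `eps` and `delta` are hypothetical and chosen only for illustration. It verifies that the failure probability of Corollary C.1 equals $\delta$ for this $L$, and that $\tau^{-1}_{\epsilon}\leq\epsilon(Lrd^{2})^{-1}$.

```python
import math


def lipschitz_level(a, b, d, delta):
    """L = b * sqrt(log(a (d^2 + d^3) / delta)), as in Lemma C.2."""
    return b * math.sqrt(math.log(a * (d ** 2 + d ** 3) / delta))


def grid_resolution(d, r, eps, L):
    """tau_eps = ceil(d^2 r L / eps), as in Lemma C.2."""
    return math.ceil(d ** 2 * r * L / eps)


# Hypothetical constants, for illustration only
a, b, d, r, eps, delta = 1.0, 1.0, 3, 3.6, 0.1, 0.05

L = lipschitz_level(a, b, d, delta)
tau = grid_resolution(d, r, eps, L)

# Failure probability in Corollary C.1 evaluated at this L
failure_prob = (d ** 2 + d ** 3) * a * math.exp(-(L / b) ** 2)
```

Since $\mathcal{X}^{\star}$ has $(\tau_{\epsilon})^{d}$ elements, the grid size grows quickly as $\epsilon$ shrinks or $d$ increases.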

Using these we provide a proof of Theorem C.1.

Proof.

First, for each ${\bm{x}}\in\mathcal{X}^{\star}$, when the inequality on $\eta_{T}$ holds, the lengths of $Q^{(1)}_{T,i}({\bm{x}})$ and $Q^{(2)}_{T}({\bm{x}})$ are less than $\epsilon^{(1)}_{i}$ and $2\epsilon^{(2)}$, respectively. Hence, from the classification rules, all points in $\mathcal{X}^{\star}$ are classified. Moreover, by replacing $|\mathcal{X}|$ in $\beta_{t}$ and $\gamma_{t}$ of Theorem 4.1 with $|\mathcal{X}^{\star}|$, we get

[TABLE]

Therefore, from Theorem 4.1, with probability at least $1-\delta$, for any ${\bm{x}}\in\mathcal{X}^{\star}$, $t\geq 1$ and $i\in[d]$ it holds that

[TABLE]

Here, by combining this with Lemma C.2, the equations (C.2), (C.3), (C.4) and (C.5) hold with probability at least $1-2\delta$. In addition, letting $\tilde{{\bm{H}}}({\bm{x}})$ be the matrix whose $(j,k)^{\rm th}$ element is $f^{(2)}_{jk}({\bm{x}})-f^{(2)}_{jk}([{\bm{x}}])$, for any ${\bm{a}}$ satisfying $\|{\bm{a}}\|=1$ we obtain

[TABLE]

when (C.3) holds. Thus, noting that ${\bm{H}}({\bm{x}})={\bm{H}}([{\bm{x}}])+\tilde{{\bm{H}}}({\bm{x}})$ , we get

[TABLE]
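The decomposition ${\bm{H}}({\bm{x}})={\bm{H}}([{\bm{x}}])+\tilde{{\bm{H}}}({\bm{x}})$ controls the smallest eigenvalue at ${\bm{x}}$ through the one at $[{\bm{x}}]$. One standard way to see this, which may differ in constants from the bound displayed above, is that for $\|{\bm{a}}\|=1$ we have $|{\bm{a}}^{\top}\tilde{{\bm{H}}}{\bm{a}}|\leq d\max_{j,k}|\tilde{H}_{jk}|$, so the smallest eigenvalue drops by at most $d\max_{j,k}|\tilde{H}_{jk}|$. The sketch below checks this numerically for $d=2$ with hypothetical matrices, using the closed-form eigenvalue of a symmetric $2\times 2$ matrix.

```python
import math


def min_eig_2x2(h11, h12, h22):
    """Smallest eigenvalue of the symmetric matrix [[h11, h12], [h12, h22]]."""
    mean = (h11 + h22) / 2
    return mean - math.sqrt(((h11 - h22) / 2) ** 2 + h12 ** 2)


def perturbed_lower_bound(H, E):
    """Lower bound lambda_min(H) - d * max|E_jk| on lambda_min(H + E), for d = 2."""
    d = 2
    return min_eig_2x2(*H) - d * max(abs(e) for e in E)


# Hypothetical Hessian at the grid point [x] and perturbation (upper-triangular entries)
H = (2.0, 0.3, 1.5)
E = (0.05, -0.04, 0.02)
H_plus_E = tuple(h + e for h, e in zip(H, E))

bound = perturbed_lower_bound(H, E)
actual = min_eig_2x2(*H_plus_E)
```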

Hence, noting that $\epsilon\leq\epsilon^{(1)}_{i}$ and $\epsilon\leq\epsilon^{(2)}$ , from the definition of $\widehat{S}_{t}$ , we have

[TABLE]

Similarly, by using the same argument, we obtain

[TABLE]

∎

In addition, the following theorem holds:

Theorem C.2.

Let $C_{1}=2\sigma^{2}C_{2}$ , $C_{2}=\sigma^{-2}/\log(1+\sigma^{-2})$ , $R_{t}=r_{1}+\cdots+r_{t}$ , $\tilde{C}^{\star}_{0}>C^{\star}_{0}$ and

[TABLE]

Then, it holds that $\eta^{2}_{t}\leq\tilde{\eta}^{2}_{t}+\frac{4}{5}\epsilon^{2}.$
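As a small worked simplification of the constants in Theorem C.2, substituting $C_{2}$ into $C_{1}$ gives

$$
C_{1}=2\sigma^{2}C_{2}=2\sigma^{2}\cdot\frac{\sigma^{-2}}{\log(1+\sigma^{-2})}=\frac{2}{\log(1+\sigma^{-2})}.
$$

Assuming the natural logarithm, the noise variances used in Section 5 would give $C_{1}=2/\log 201\approx 0.377$ for $\sigma^{2}=0.005$ and $C_{1}=2/\log 101\approx 0.433$ for $\sigma^{2}=0.01$.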

Proof.

The proof is given by using the same argument as the proof of Theorem 4.2. ∎

D Sufficient conditions for assumptions

In this section, we provide sufficient conditions for the assumptions. First, the following lemma holds:

Lemma D.1.

Let $D$ be a compact set, and let ${\rm int}(D)$ be the interior of $D$. Suppose that $\mathcal{X}$ is a finite subset of ${\rm int}(D)$. Assume that the kernel function $k({\bm{x}},{\bm{x}}^{\prime})$ is five times continuously differentiable. Then, the assumptions (A1) and (A2) hold.

Proof.

For any ${\bm{x}}\in\mathcal{X}\subset{\rm int}(D)$, since ${\rm int}(D)$ is an open set, there exists a positive number $\delta_{\bm{x}}$ such that $\mathscr{N}({\bm{x}};\delta_{\bm{x}})\subset{\rm int}(D)\subset D$, where $\mathscr{N}({\bm{x}};\delta_{\bm{x}})$ is the $\delta_{\bm{x}}$-neighborhood of ${\bm{x}}$. Here, since $\mathcal{X}$ is a finite set, we can define $A_{0}:=\min_{{\bm{x}}\in\mathcal{X}}\{\delta_{\bm{x}}\}$. Hence, for any ${\bm{x}}\in\mathcal{X}$, it holds that $\mathscr{N}({\bm{x}};A_{0})\subset D$. This implies that the assumption (A1) holds.
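The construction of $A_{0}$ can be made concrete when $D$ is a box. The sketch below assumes $D=[0,r]^{d}$ (an assumption for illustration; the lemma only requires compactness) and takes $\delta_{\bm{x}}$ to be the distance from ${\bm{x}}$ to the nearest face of the box. The function names are hypothetical.

```python
def margin_to_boundary(x, r):
    """Distance from x to the nearest face of the box [0, r]^d (assumed shape of D)."""
    return min(min(c, r - c) for c in x)


def admissible_step(points, r):
    """A_0 = min over the finite set of the per-point margins delta_x."""
    return min(margin_to_boundary(x, r) for x in points)


# Hypothetical finite candidate set strictly inside [0, 3.6]^2
points = [(0.5, 1.0), (2.0, 3.0)]
A0 = admissible_step(points, r=3.6)
```

Any coordinate shift ${\bm{x}}+\zeta{\bm{e}}_{i}$ with $|\zeta|<A_{0}$ then stays inside the box, which is what (A1) asks for.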

Next, for any ${\bm{x}}\in\mathcal{X}$ , $i\in[d]$ and $\zeta$ satisfying $|\zeta|<A_{0}$ , the variance of $\tilde{f}^{(1)}_{0,i}({\bm{x}};\zeta)$ is given by

[TABLE]

Here, we put ${\bm{x}}+\zeta{\bm{e}}_{i}={{\bm{x}}}^{\ast}$. Then, $k({\bm{x}},{\bm{x}}+\zeta{\bm{e}}_{i})$ can be written as $k({\bm{x}},{\bm{x}}+\zeta{\bm{e}}_{i})=k({\bm{x}},{\bm{x}}^{\ast})$, and using Taylor’s expansion we have

[TABLE]

where ${\bm{x}}^{\star}_{1}$ is a point on a line segment connecting ${\bm{x}}^{\ast}$ and ${\bm{x}}$ . Similarly, we obtain

[TABLE]

where ${\bm{x}}^{\star}_{2}$ is also a point on a line segment connecting ${\bm{x}}^{\ast}$ and ${\bm{x}}$ . Moreover, we get

[TABLE]

and

[TABLE]

Here, noting that

[TABLE]

the covariance ${\rm{Cov}}[f^{(1)}_{0,i}({\bm{x}}),f_{0}({\bm{x}}+\zeta{\bm{e}}_{i})]$ can be expressed as

[TABLE]

By using the same argument, we also have

[TABLE]

Therefore, using (D.3), (D.4) and (D.5), we obtain

[TABLE]

Thus, by combining (D.2) and (D.8), we have

[TABLE]

Finally, by substituting (D.6), (D.7) and (D.9) into (D.1), we get

[TABLE]

Note that $k({\bm{x}},{\bm{x}}^{\prime})$ is five times continuously differentiable and $D$ is a compact set. This implies that there exists a positive constant $C_{0}$ such that $|{\rm{Var}}[\tilde{f}^{(1)}_{0,i}({\bm{x}};\zeta)]|\leq|\zeta|C_{0}.$ Similarly, using the same argument, the inequality $|{\rm{Var}}[\tilde{f}^{(2)}_{0,jk}({\bm{x}};\zeta)]|\leq|\zeta|C_{0}$ also holds. ∎
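The bound ${\rm{Var}}[\tilde{f}^{(1)}_{0,i}({\bm{x}};\zeta)]\leq|\zeta|C_{0}$ can be illustrated in closed form for a specific kernel. The sketch below rests on two assumptions that are not confirmed by the text: that $\tilde{f}^{(1)}_{0,i}({\bm{x}};\zeta)$ denotes the gap between the prior derivative and its forward difference, $f^{(1)}_{0,i}({\bm{x}})-\{f_{0}({\bm{x}}+\zeta{\bm{e}}_{i})-f_{0}({\bm{x}})\}/\zeta$, and that the kernel is a one-dimensional RBF kernel with a hypothetical length-scale `ell`.

```python
import math


def fd_error_variance(zeta, ell=1.0):
    """Prior variance of f'(x) - (f(x + zeta) - f(x)) / zeta under a 1-D RBF kernel.

    Assumed definition of the finite-difference gap; ell is a hypothetical length-scale.
    """
    u = zeta ** 2 / (2 * ell ** 2)
    k = math.exp(-u)                              # k(x, x + zeta)
    var_grad = 1.0 / ell ** 2                     # Var[f'(x)]
    var_fd = -2.0 * math.expm1(-u) / zeta ** 2    # Var[(f(x+zeta) - f(x)) / zeta] = 2 (1 - k) / zeta^2
    cov = k / ell ** 2                            # Cov[f'(x), (f(x+zeta) - f(x)) / zeta]
    return var_grad + var_fd - 2.0 * cov


for zeta in (0.5, 0.1, 0.01):
    print(zeta, fd_error_variance(zeta))
```

Under these assumptions the variance behaves like $3\zeta^{2}/(4\ell^{4})$ for small $\zeta$, so it is bounded by $|\zeta|C_{0}$ with, for example, $C_{0}=1$ when $\ell=1$ and $|\zeta|\leq 0.5$.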

By replacing $\mathcal{X}$ with $\mathcal{X}^{\star}$ , we obtain the following corollary.

Corollary D.1.

Let $D$ be a compact set, and let ${\rm int}(D)$ be the interior of $D$. Suppose that $\mathcal{X}^{\star}$ is a finite subset of ${\rm int}(D)$. Assume that the kernel function $k({\bm{x}},{\bm{x}}^{\prime})$ is five times continuously differentiable. Then, the assumptions (C1) and (C2) hold.

Finally, the following lemma holds:

Lemma D.2.

Let $D$ be a compact set. Assume that the kernel function $k({\bm{x}},{\bm{x}}^{\prime})$ is an eight times differentiable function. Then, the assumption (C3) holds.

Proof.

For a GP sample $g$, from Theorem 5 of Ghosal and Roy [3], if the kernel function $\tilde{k}({\bm{x}},{\bm{x}}^{\prime})$ of $g$ has a fourth derivative, there exist positive constants $a$ and $b$ such that

[TABLE]

Here, for the GP sample $f^{(1)}_{i}$ , its kernel function is the second derivative of the kernel function $k({\bm{x}},{\bm{x}}^{\prime})$ . Therefore, if $k({\bm{x}},{\bm{x}}^{\prime})$ has a sixth derivative, then the kernel function of $f^{(1)}_{i}$ has a fourth derivative. Similarly, if $k({\bm{x}},{\bm{x}}^{\prime})$ has an eighth derivative, then the kernel function of $f^{(2)}_{jk}$ also has a fourth derivative. ∎
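The derivative counting in this proof can be written out explicitly. As a sketch, using the standard fact that derivatives of a sufficiently smooth GP are again GPs whose kernels are mixed derivatives of $k$:

$$
{\rm{Cov}}[f^{(1)}_{i}({\bm{x}}),f^{(1)}_{i}({\bm{x}}^{\prime})]=\frac{\partial^{2}k({\bm{x}},{\bm{x}}^{\prime})}{\partial x_{i}\,\partial x^{\prime}_{i}},
\qquad
{\rm{Cov}}[f^{(2)}_{jk}({\bm{x}}),f^{(2)}_{jk}({\bm{x}}^{\prime})]=\frac{\partial^{4}k({\bm{x}},{\bm{x}}^{\prime})}{\partial x_{j}\,\partial x_{k}\,\partial x^{\prime}_{j}\,\partial x^{\prime}_{k}}.
$$

Requiring a fourth derivative of these kernels therefore consumes $2+4=6$ derivatives of $k$ for $f^{(1)}_{i}$ and $4+4=8$ derivatives of $k$ for $f^{(2)}_{jk}$, which is where the eighth-order condition of Lemma D.2 comes from.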

Bibliography

[1] Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.
[2] David K. Duvenaud, Hannes Nickisch, and Carl E. Rasmussen. Additive Gaussian processes. In Advances in Neural Information Processing Systems 24, pages 226–234. Curran Associates, Inc., 2011.
[3] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, pages 2413–2429, 2006.
[4] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1344–1350, 2013.
[5] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. J. Mach. Learn. Res., 13(1):1809–1837, June 2012.
[6] Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, and Svetha Venkatesh. Regret for expected improvement over the best-observed value and stopping condition. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77, pages 279–294, 2017.
[7] Athanasios Papoulis and S. Unnikrishna Pillai. Probability, random variables, and stochastic processes. Tata McGraw-Hill Education, 2002.
[8] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
