Harnessing Low-Fidelity Data to Accelerate Bayesian Optimization via Posterior Regularization

Bin Liu

arXiv:1902.03740·cs.LG·December 18, 2019

Open Access

TL;DR

This paper introduces a novel Bayesian optimization method that leverages low-fidelity data through posterior regularization, significantly reducing the number of expensive function evaluations needed for global optimization.

Contribution

It proposes a new framework combining low-fidelity data with posterior regularization using a dynamic weighted product of experts, enhancing BO efficiency.

Findings

01 Outperforms state-of-the-art methods on benchmark tasks

02 Reduces the number of function evaluations needed

03 Maintains high solution quality

Abstract

Bayesian optimization (BO) is a powerful paradigm for derivative-free global optimization of a black-box objective function (BOF) that is expensive to evaluate. However, the overhead of BO can still be prohibitive for problems with highly expensive function evaluations. In this paper, we investigate how to reduce the required number of function evaluations for BO without compromise in solution quality. We explore the idea of posterior regularization to harness low fidelity (LF) data within the Gaussian process upper confidence bound (GP-UCB) framework. The LF data can arise from previous evaluations of an LF approximation of the BOF or of a related optimization task. An extra GP model called LF-GP is trained to fit the LF data. We develop an operator termed dynamic weighted product of experts (DW-POE) fusion. The regularization is induced by this operator on the posterior of the BOF.…

Equations

$$\max_{x\in\chi} f(x),$$

$$S_{t}=\min_{i=1,\ldots,t}\left(f^{\ast}-f(x_{i})\right).$$

$\mu_{n}(x)$

$\kappa_{n}(x,x^{\prime})$

$$\phi_{t}(x)=\mu_{t}(x)+\beta_{t}^{1/2}\sigma_{t}(x),$$

$$p(x)=\frac{1}{Z}\prod_{i}p_{i}(x),$$

$\mu(x)$

$\sigma^{2}(x)$

$$p_{reg,t}(x)\propto p_{1,t}(x)^{1-w_{lf,t}}\,p_{2}(x)^{w_{lf,t}},$$

$\mu_{reg,t}(x)$

$\sigma_{reg,t}^{2}(x)$

$$\phi_{reg,t}(x)=\mu_{reg,t}(x)+\beta_{t}^{1/2}\sigma_{reg,t}(x).$$

$$\hat{w}_{lf,t+1}=\frac{w_{lf,t}^{\alpha}}{w_{lf,t}^{\alpha}+(1-w_{lf,t})^{\alpha}},$$

$$w_{lf,t+1}=\begin{cases}\dfrac{\hat{w}_{lf,t+1}\,l_{lf}}{\hat{w}_{lf,t+1}\,l_{lf}+(1-\hat{w}_{lf,t+1})\,l_{hf}}, & \text{if } y_{t+1}>\max(y_{1:t}),\\ \hat{w}_{lf,t+1}, & \text{otherwise.}\end{cases}$$

$$f(x)=2x^{1.2}\sin(2x)+2,$$

$$f_{l}(x)=0.7f(x)+(x^{1.3}-0.3)\cdot\sin(3x-0.5)+4\cos(2x)-5,$$

$f(x)$

$$f_{l}(x)=\frac{A+B+C+D}{4},$$

$f(x)$

$$f_{l}(x)=\left[1+\frac{\sin(x_{1})}{10}\right]f(x)-2x_{1}+x_{2}^{2}+x_{3}^{2}+0.5.$$

$$f(x)=\frac{2}{3}\exp(x_{1}+x_{2})-x_{4}\sin(x_{3})+x_{3}.$$

$$f_{l}(x)=1.2f(x)-1.$$


Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research

Methods: Gaussian Process

Full text

Harnessing Low-Fidelity Data to Accelerate Bayesian Optimization via Posterior Regularization

Bin Liu

School of Computer Science

Jiangsu Key Lab of Big Data Security & Intelligent Processing

Nanjing University of Posts and Telecommunications

Nanjing, China

[email protected]

Abstract

Bayesian optimization (BO) is a powerful paradigm for derivative-free global optimization of a black-box objective function (BOF) that is expensive to evaluate. However, the overhead of BO can still be prohibitive for problems with highly expensive function evaluations. In this paper, we investigate how to reduce the required number of function evaluations for BO without compromising solution quality. We explore the idea of posterior regularization to harness low fidelity (LF) data within the Gaussian process upper confidence bound (GP-UCB) framework. The LF data can arise from previous evaluations of an LF approximation of the BOF or of a related optimization task. An extra GP model called LF-GP is trained to fit the LF data. We develop an operator termed dynamic weighted product of experts (DW-POE) fusion. The regularization is induced by this operator on the posterior of the BOF. The impact of the LF GP model on the resulting regularized posterior is adaptively adjusted via Bayesian formalism. Extensive experimental results on benchmark BOF optimization tasks demonstrate the superior performance of the proposed algorithm over state-of-the-art methods.

Index Terms:

Bayesian optimization, Gaussian process, Upper confidence bound, multi-fidelity modeling

I Introduction

In this paper, we consider a maximization problem

$$\max_{x\in\chi} f(x), \qquad (1)$$

where $f$ : $\chi\rightarrow\mathbb{R}$ is a continuous real-valued function, $\chi$ is a Euclidean solution domain in $\mathbb{R}^{d}$ , and $d$ is the dimension of $x$ . Suppose that there exists an $x^{\ast}\in\chi$ such that $f(x)\leq f(x^{\ast})$ , $\forall x\in\chi$ . The task is to find $x^{\ast}$ based on a limited number of evaluations of $f$ . An evaluation consists of sampling an $x$ in $\chi$ , inputting it to $f$ , and then obtaining the corresponding output $y=f(x)+\epsilon$ , where $\epsilon\sim\mathcal{N}(0,\sigma^{2})$ , at the expense of a certain amount of computational resources. We focus on cases wherein $f$ is an expensive-to-evaluate black-box function with no access to its gradient. We also assume that $f$ is smooth and can be modeled by a Gaussian process (GP). Such derivative-free expensive function optimization problems arise in many fields, such as industrial design of complex engineered systems, model selection in statistics, and hyper-parameter configuration for complex machine learning systems. BO is well recognized as a powerful framework for addressing this type of problem.

Of particular interest here is how to find, or obtain a satisfactory estimate of, $x^{\ast}$ with BO using as few evaluations of $f$ as possible. In particular, we explore the idea of posterior regularization to accelerate the GP-UCB method of [1] by harnessing LF data. The accelerated BO algorithm (ABO) can be used for cases wherein the objective function is extremely expensive to evaluate and a fixed, related LF data set is available for exploitation. The regularization is induced by an expert fusion operator on the posterior of the BOF at each iteration of the BO procedure. An extra GP model, termed LF-GP, is trained to fit the LF data and then takes part in the fusion operation. The impact of LF-GP on the resulting regularized posterior is dynamically adapted via Bayesian formalism.

The basic idea underlying the proposed ABO algorithm is illustrated in Fig.2. It depicts the result obtained at one iteration of ABO when applied to the 1D pedagogical case presented in subsection IV-A. We see that ABO suggests a better next point to query than the baseline GP-UCB method, whose result is plotted in Fig.1. This is due to the posterior regularization operation embedded in the ABO algorithm, which helps to reveal more structural information about the BOF $f$ by exploiting LF data points. In Fig.2, we see that the presence of the LF point at $x=3$ makes the uncertainty band of the posterior shrink significantly in the local area around $x=3$ . The UCB of the predicted $f$ therein is reduced accordingly. In contrast, the baseline GP-UCB, which is trained with only three high fidelity (HF) points, suggests evaluating $f$ at a query point near $x=3$ . The resulting UCB curve of the baseline method is somewhat misleading because of the high uncertainty of the posterior estimate around $x=3$ and the structural information missing near $x=4$ .

I-1 Related work

Multi-fidelity optimization has recently attracted considerable research interest. Techniques such as hierarchical partitioning [2], hierarchical modeling [3] and ensemble methods [4] are used to incorporate multiple fidelities/cheap approximations of the BOF. Most relevant to this paper is the line of work on Bayesian optimization with multi-fidelity data, such as the MF-GP-UCB method in [5] and the multi-fidelity BO (MFBO) algorithm in [6]. Research topics that are conceptually close to MFBO include multi-information source optimization [7, 8], multi-task BO [9], multi-output GP [10, 11], and meta-learning based BO [12, 13].

The success of the aforementioned methods requires specific assumptions to be satisfied. For instance, the MFBO methods in [6, 14, 15] work under the basic assumption that the relationship between $f(x)$ and $f_{l}(x)$ satisfies $f(x)=\rho f_{l}(x)+n$ , where $f_{l}(x)$ denotes an LF approximation of $f(x)$ and $n$ a noise term. Extra operations or assumptions are usually needed to determine the value of the correlation parameter $\rho$ . The hierarchical modeling approach of [3] requires that the data points selected for HF evaluations come from a subset of those used for LF evaluations. The MF-GP-UCB algorithm in [5] assumes that $\|f(x)-f_{l}(x)\|_{\infty}$ is bounded and known a priori.

In contrast, the proposed ABO algorithm does not require any of the aforementioned assumptions to be satisfied. ABO embeds a flexible expert fusion operator that automatically captures and exploits the intrinsic correlation between the BOF and its LF counterpart. In addition, we consider a fixed LF data set $\mathcal{D}_{lf}$ . That is, only the HF BOF may be evaluated once the BO process has started. In contrast, in the settings of most existing MFBO methods, e.g. in [5], new LF evaluations may be performed, and the set $\mathcal{D}_{lf}$ is expanded accordingly. See subsection II-A for details on the problem setup.

II Preliminary

II-A Problem setup

The task is to maximize the BOF $f$ over the domain $\chi$ , as formulated in Eqn.(1). We search for the maximizer $x^{\ast}$ or the maximum value $f^{\ast}=f(x^{\ast})$ using an algorithm that evaluates a sequence of points $x_{1:t}\triangleq\{x_{1},\ldots,x_{t}\},t>0$ . An evaluation of $f$ at $x\in\chi$ yields an observation $y=f(x)+\epsilon$ , where $\epsilon\sim\mathcal{N}(0,\sigma^{2})$ . There are $J$ LF query points $\mathcal{D}_{l}=\{x_{l,j},y_{l,j}\}_{j=1}^{J}$ that can be exploited, where $x_{l}\in\chi$ , $y_{l}=f_{l}(x_{l})+\varepsilon$ , and $\varepsilon$ denotes a zero-mean noise term. At time $t$ , the algorithm chooses to query at $x_{t+1}$ based on $\{x_{i},y_{i}\}_{i=1}^{t}$ and $\{x_{l,j},y_{l,j}\}_{j=1}^{J}$ . The goal of the algorithm is to achieve a simple regret that is as small as possible, defined as

$$S_{t}=\min_{i=1,\ldots,t}\left(f^{\ast}-f(x_{i})\right). \qquad (2)$$

Note that we do not put any constraint on the relationship between $f$ and $f_{l}$ here; instead, our algorithm discovers and exploits their relationship automatically and implicitly in a data-driven manner.
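As a small illustration of the simple regret in Eqn.(2), the sketch below tracks $S_{t}$ along a run. It assumes noise-free values $f(x_{i})$ and a known $f^{\ast}$, which holds only for synthetic benchmarks; the function name `simple_regret` is hypothetical.

```python
import numpy as np

def simple_regret(f_star, f_values):
    """Return S_t for t = 1..T, where S_t = min_{i<=t} (f* - f(x_i)).

    Since f* is fixed, this equals f* minus the running maximum of f(x_i).
    """
    f_values = np.asarray(f_values, dtype=float)
    return f_star - np.maximum.accumulate(f_values)
```

The regret curve is non-increasing by construction, since it depends only on the best value found so far.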

II-B Gaussian process (GP)

A GP is a stochastic process. It is often used as a Bayesian nonparametric prior for a function $f$ defined over a space $\chi$ . A GP is determined by its mean function $\mu:\chi\rightarrow\mathbb{R}$ and covariance function $\kappa:\chi^{2}\rightarrow\mathbb{R}$ . Suppose that our prior belief on $f$ is modeled by a GP, denoted by $f\sim\mathcal{GP}(\mu,\kappa)$ . This is equivalent to saying that, under our prior knowledge, $f(x)$ is normally distributed as $\mathcal{N}(\mu(x),\kappa(x,x))$ , $\forall x\in\chi$ . Given $n$ observations $\mathcal{D}_{n}=\{(x_{i},y_{i})\}_{i=1}^{n}$ drawn from this GP, the posterior belief on $f$ is also a GP with an updated mean and covariance as follows

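$$\mu_{n}(x)=\mu(x)+k^{\top}(K+\sigma^{2}I_{n})^{-1}(Y-m), \qquad (3)$$

$$\kappa_{n}(x,x^{\prime})=\kappa(x,x^{\prime})-k^{\top}(K+\sigma^{2}I_{n})^{-1}k^{\prime}, \qquad (4)$$

where $Y=y_{1:n}$ , $m\in\mathbb{R}^{n}$ with $m_{i}=\mu(x_{i})$ , $K\in\mathbb{R}^{n\times n}$ with $K_{ij}=\kappa(x_{i},x_{j})$ , $I_{n}$ is the $n\times n$ identity matrix, and $k,k^{\prime}\in\mathbb{R}^{n}$ with $k_{i}=\kappa(x,x_{i})$ , $k^{\prime}_{i}=\kappa(x^{\prime},x_{i})$ . A common choice of the covariance function $\kappa$ is the squared exponential (SE) kernel, written as $\kappa(x,x^{\prime})=\kappa_{0}\exp(-\|x-x^{\prime}\|^{2}/(2h^{2}))$ . Here $\kappa_{0}$ is the scale parameter that determines the extent to which $f$ could deviate from $\mu$ . The bandwidth parameter $h\in\mathbb{R}_{+}$ determines the smoothness of the GP. The larger $h$ is, the smoother the samples drawn from the GP tend to be. See [16] for more information on the GP.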
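The GP posterior with an SE kernel can be sketched in NumPy as follows. This is a minimal illustration rather than the GPML-based implementation used in the experiments: the hyper-parameters `kappa0`, `h` and `noise_var` are fixed placeholders instead of being optimized, the prior mean is a constant (`prior_mean`), and the function names are hypothetical.

```python
import numpy as np

def se_kernel(A, B, kappa0=1.0, h=1.0):
    """SE kernel matrix: kappa0 * exp(-||a - b||^2 / (2 h^2))."""
    A = np.asarray(A, dtype=float).reshape(len(A), -1)   # (n, d)
    B = np.asarray(B, dtype=float).reshape(len(B), -1)   # (m, d)
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return kappa0 * np.exp(-sq_dist / (2.0 * h ** 2))

def gp_posterior(X, y, Xq, noise_var=1e-6, kappa0=1.0, h=1.0, prior_mean=0.0):
    """Posterior mean and variance at query points Xq given data (X, y)."""
    K = se_kernel(X, X, kappa0, h) + noise_var * np.eye(len(X))
    k = se_kernel(X, Xq, kappa0, h)                       # (n, m)
    L = np.linalg.cholesky(K)                             # K = L L^T
    resid = np.asarray(y, dtype=float) - prior_mean
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, resid))
    v = np.linalg.solve(L, k)
    mean = prior_mean + k.T @ alpha
    var = kappa0 - (v ** 2).sum(axis=0)                   # kappa(x, x) = kappa0
    return mean, np.maximum(var, 0.0)                     # clip round-off negatives
```

The Cholesky factorization is used instead of an explicit matrix inverse for numerical stability; far from all training points the prediction falls back to the prior mean with variance $\kappa_{0}$.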

II-C The baseline GP-UCB method

The GP-UCB algorithm of [1] is a typical BO method, which assigns a GP prior to $f$ and uses a UCB acquisition function to recommend new query points for evaluating $f$ . At time $t$ , the next point to query, $x_{t+1}$ , is chosen via two steps. First, calculate a UCB of the GP as follows

$$\phi_{t}(x)=\mu_{t}(x)+\beta_{t}^{1/2}\sigma_{t}(x), \qquad (5)$$

where $\mu_{t}$ and $\sigma_{t}$ are respectively the posterior mean and standard deviation of the GP conditional on $\mathcal{D}_{t}=\{(x_{i},y_{i})\}_{i=1}^{t}$ . Next, choose the next query point by maximizing $\phi_{t}$ , i.e., $x_{t+1}=\arg\max_{x\in\chi}\phi_{t}(x)$ . This optimization can be handled by off-the-shelf optimization techniques, e.g., the CMA-ES method [17]. The two components of the acquisition function, namely $\mu_{t}$ and $\sigma_{t}$ in $\phi_{t}$ , promote exploitation and exploration, respectively. The baseline GP-UCB method is summarized in Algorithm 1. For more details on GP-UCB and alternative BO methods, see [18].
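The two-step selection rule above can be sketched as a short loop. This is a simplified illustration, not the paper's Algorithm 1: it maximizes the UCB over a finite candidate set instead of running CMA-ES, it uses a constant `beta` in place of the schedule $\beta_{t}$, and `fit_posterior` is a hypothetical callable that fits a GP to the current data and returns a prediction function.

```python
import numpy as np

def ucb(mean, std, beta):
    """phi_t(x) = mu_t(x) + sqrt(beta_t) * sigma_t(x)."""
    return mean + np.sqrt(beta) * std

def gp_ucb(f, fit_posterior, candidates, X0, y0, n_iter, beta=4.0):
    """Run n_iter rounds of UCB selection over a finite candidate set.

    fit_posterior(X, y) is assumed to return a function that maps the
    candidates to (posterior mean, posterior standard deviation).
    """
    X, y = list(X0), list(y0)
    for _ in range(n_iter):
        predict = fit_posterior(np.asarray(X), np.asarray(y))
        mean, std = predict(candidates)
        x_next = candidates[int(np.argmax(ucb(mean, std, beta)))]
        X.append(x_next)          # query the expensive objective at x_next
        y.append(f(x_next))
    return np.asarray(X), np.asarray(y)
```

Passing the model as a callable keeps the acquisition step independent of how the GP is fitted, so the same loop works with any regression backend that exposes a mean and a standard deviation.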

III The proposed ABO algorithm

The ABO algorithm is built on the basis of GP-UCB [1]. Compared with GP-UCB, ABO is expected to need fewer expensive BOF evaluations to find a satisfactory solution. The basic strategy to achieve search acceleration is to exploit an LF dataset $\mathcal{D}_{lf}=\{(x_{lf,j},y_{lf,j})\}_{j=1}^{J}$ that is assumed to be available in advance. To implement this strategy, the key idea is to adjust the posterior of $f$ by letting it respect predictions made by another GP regression that uses $\mathcal{D}_{lf}$ as the training data, as shown in Fig.2. That is, we construct two GP models in total. One is embedded in the traditional GP-UCB framework, and the other, which we term LF-GP, is trained to fit the LF data and then used for making predictions of $f$ based on the LF data. In spirit, the ABO algorithm can be regarded as an application of the posterior regularization strategy [19, 20] to the GP-UCB method. We develop a dynamic weighted product of experts (DW-POE) fusion operator, which generalizes the POE model of [21] by using a technique termed dynamic model averaging [22, 23, 24]. The regularization is induced by the DW-POE fusion operator on the posterior. The impact of the LF-GP model on the resulting regularized posterior is adaptively adjusted via Bayesian formalism.

III-A The implementation of the ABO algorithm

An implementation of ABO is shown in Algorithm 2. First, we train an LF GP model to fit the LF data $\mathcal{D}_{lf}$ . This operation is carried out offline. Then, given any query $x\in\chi$ , we invoke the LF GP model to get an LF posterior mean and standard deviation of $f(x)$ , denoted by $\mu_{lf}(x)$ and $\sigma_{lf}(x)$ , respectively. In the main loop of ABO, we first train an HF GP model to fit $\mathcal{D}_{t}$ at time $t$ . We call it the HF GP to distinguish it from the LF GP model. This HF GP model gives a posterior estimate of $f(x)$ , with mean $\mu_{t}(x)$ and variance $\sigma_{t}^{2}(x)$ . We adjust this posterior via the DW-POE operator, which is described in detail in subsection III-B. Then, based on the adjusted posterior, we construct a UCB acquisition function as shown in Eqn.(12) and find $x_{t+1}$ by optimizing the acquisition function using the CMA-ES algorithm. As shown in Eqn.(9), a time-evolving weight $0\leq w_{lf,t}<1$ is assigned to the LF GP model when carrying out the DW-POE operator. The weight $w_{lf}$ is adjusted over time by Eqns.(13)-(14). We give an analysis of the above algorithm design in subsection III-C.

III-B Dynamically weighted POE (DW-POE)

We start by briefly describing the POE model of [21], which is the basis of the DW-POE operator proposed for GP posterior regularization.

POE

Given multiple probability densities $p_{i}(x)$ , $i=1,\ldots,I$ , a POE models a target probability distribution $p(x)$ as the product of the $p_{i}(x)$ as follows,

$$p(x)=\frac{1}{Z}\prod_{i=1}^{I}p_{i}(x), \qquad (6)$$

where $Z$ is a normalizing constant that makes $p(x)$ a probability distribution that integrates to 1. When $p_{i}(x)\sim\mathcal{N}(\mu_{i}(x),\sigma_{i}^{2}(x)),i=1,\ldots,I$ , $p(x)$ is still Gaussian, with mean and variance:

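$$\mu(x)=\sigma^{2}(x)\sum_{i=1}^{I}\sigma_{i}^{-2}(x)\,\mu_{i}(x), \qquad (7)$$

$$\sigma^{2}(x)=\left(\sum_{i=1}^{I}\sigma_{i}^{-2}(x)\right)^{-1}. \qquad (8)$$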

DW-POE for GP posterior regularization

We generalize the POE model for GP posterior regularization. This generalized POE is termed DW-POE. The regularization is induced by the DW-POE on the posterior of $f$ given by the HF GP model. Define $p_{1,t}(x)\sim\mathcal{N}(\mu_{t}(x),\sigma_{t}^{2}(x))$ and $p_{2}(x)\sim\mathcal{N}(\mu_{lf}(x),\sigma_{lf}^{2}(x))$ . That is, we use $p_{1,t}(x)$ and $p_{2}(x)$ to denote the posterior of $f$ at time $t$ given by the HF GP model and the LF GP model, respectively. The regularized posterior is specified to be

$$p_{reg,t}(x)\propto p_{1,t}(x)^{1-w_{lf,t}}\,p_{2}(x)^{w_{lf,t}}, \qquad (9)$$

where $0\leq w_{lf,t}<1$ denotes a time-evolving weight assigned to the LF GP model. The time-evolving rule is specified by Eqns.(13)-(14), which will be introduced later. Since $p_{1,t}(x)$ and $p_{2}(x)$ are both Gaussian, the mean and the variance of the regularized posterior can be calculated as below [25]

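$$\mu_{reg,t}(x)=\frac{w_{1}P_{1}\mu_{t}(x)+w_{2}P_{2}\mu_{lf}(x)}{w_{1}P_{1}+w_{2}P_{2}}, \qquad (10)$$

$$\sigma_{reg,t}^{2}(x)=\left(w_{1}P_{1}+w_{2}P_{2}\right)^{-1}, \qquad (11)$$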

where $w_{1}=1-w_{lf,t}$ , $w_{2}=w_{lf,t}$ , $P_{1}=(\sigma_{t}^{2}(x))^{-1}$ , $P_{2}=(\sigma_{lf}^{2}(x))^{-1}$ . The UCB of the regularized GP is

$$\phi_{reg,t}(x)=\mu_{reg,t}(x)+\beta_{t}^{1/2}\sigma_{reg,t}(x). \qquad (12)$$
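The fusion step can be sketched in a few lines. The code below follows the weighted product of two Gaussians in Eqn.(9), with weights $w_{1}=1-w_{lf,t}$ and $w_{2}=w_{lf,t}$ applied to the precisions $P_{1}$ and $P_{2}$; the function names and the constant `beta` are illustrative assumptions.

```python
import numpy as np

def dw_poe(mu_hf, var_hf, mu_lf, var_lf, w_lf):
    """Fuse HF and LF Gaussian predictions with weights (1 - w_lf, w_lf)."""
    p_hf = (1.0 - w_lf) / np.asarray(var_hf, dtype=float)   # w_1 * P_1
    p_lf = w_lf / np.asarray(var_lf, dtype=float)           # w_2 * P_2
    var_reg = 1.0 / (p_hf + p_lf)                           # regularized variance
    mu_reg = var_reg * (p_hf * mu_hf + p_lf * mu_lf)        # precision-weighted mean
    return mu_reg, var_reg

def ucb_reg(mu_reg, var_reg, beta):
    """UCB of the regularized posterior."""
    return mu_reg + np.sqrt(beta) * np.sqrt(var_reg)
```

With `w_lf = 0` the function returns the HF posterior unchanged, so the baseline GP-UCB acquisition is recovered as a special case.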

The weight $w_{lf}$ controls the influence of the LF GP model on the regularized posterior. Because $w_{lf}$ is dynamic, the DW-POE can adapt to different cases. Consider a case in which the LF GP model produces a biased mean prediction with an erroneously low predicted variance. Under the combination rule of the original POE, this can lead to a detrimental prediction of $f$ , whereas down-weighting the LF GP model helps to avoid it. On the other hand, when the HF GP model is less reliable due to a lack of training data in $\mathcal{D}_{t}$ , up-weighting the LF GP model can provide a better prediction. The key is how to adapt the value of $w_{lf}$ appropriately. We propose a data-driven approach to adapt it based on Bayesian formalism. The adaptation procedure consists of two steps. Given $w_{lf,t}$ , the first step gives a prior prediction of $w_{lf,t+1}$ as follows

$$\hat{w}_{lf,t+1}=\frac{w_{lf,t}^{\alpha}}{w_{lf,t}^{\alpha}+(1-w_{lf,t})^{\alpha}}, \qquad (13)$$

where $\alpha$ is called the forgetting factor. If we let $\alpha=1$ , then Eqn.(13) reduces to $\hat{w}_{lf,t+1}=w_{lf,t}$ , corresponding to the case in which the posterior probability of the LF GP model at iteration $t$ is adopted as the predictive prior probability at iteration $t+1$ . We set $\alpha=0.9$ , instead of 1, in our experiments, to increase the impact of the new HF observation in generating the posterior at iteration $t+1$ . Upon the arrival of the new observation $y_{t+1}$ , we update $w_{lf,t+1}$ as below

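$$w_{lf,t+1}=\begin{cases}\dfrac{\hat{w}_{lf,t+1}\,l_{lf}}{\hat{w}_{lf,t+1}\,l_{lf}+(1-\hat{w}_{lf,t+1})\,l_{hf}}, & \text{if } y_{t+1}>\max(y_{1:t}),\\ \hat{w}_{lf,t+1}, & \text{otherwise,}\end{cases} \qquad (14)$$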

where $l_{lf}=\mathcal{N}(y_{t+1}|\mu_{lf}(x_{t+1}),\sigma_{lf}^{2}(x_{t+1}))$ and $l_{hf}=\mathcal{N}(y_{t+1}|\mu_{t}(x_{t+1}),\sigma_{t}^{2}(x_{t+1}))$ are likelihoods of the GP models conditional on the observation $y_{t+1}$ . Note that the first line in Eqn.(14) is just Bayes' rule. We want only high-quality queries to update the weights of the GP models, so that low-quality queries do not mislead the update; we therefore impose the prerequisite $y_{t+1}>\max(y_{1:t})$ for updating $w_{lf,t+1}$ . As is well known, a BO algorithm alternates between two sub-tasks: (a) approximating the objective function by a GP (exploration); (b) searching for the optimum based on the learnt GP (exploitation). Eqns.(13)-(14) bias the computation toward the latter sub-task. This is similar in spirit to the annealing mechanism adopted in simulated annealing methods.
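The two-step weight adaptation in Eqns.(13)-(14) can be sketched as follows. The function names are hypothetical, and the guard against a zero denominator (both likelihoods underflowing) is an added numerical safeguard that is not part of the paper.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predict_weight(w_lf, alpha=0.9):
    """Eqn.(13): forgetting step; alpha < 1 pulls the weight toward 0.5."""
    a, b = w_lf ** alpha, (1.0 - w_lf) ** alpha
    return a / (a + b)

def update_weight(w_lf, y_new, y_best, mu_lf, var_lf, mu_hf, var_hf, alpha=0.9):
    """Eqn.(14): Bayes update, applied only if y_new improves on y_best."""
    w_hat = predict_weight(w_lf, alpha)
    if y_new <= y_best:
        return w_hat
    l_lf = gaussian_pdf(y_new, mu_lf, var_lf)   # LF GP likelihood of y_new
    l_hf = gaussian_pdf(y_new, mu_hf, var_hf)   # HF GP likelihood of y_new
    denom = w_hat * l_lf + (1.0 - w_hat) * l_hf
    return w_hat if denom == 0.0 else w_hat * l_lf / denom
```

For example, with $\alpha=0.9$ a weight of $0.8$ is pulled back to roughly $0.78$ by the forgetting step before any new evidence is applied, while a weight of $0.5$ is left unchanged.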

III-C Computational complexity analysis

We analyze the computational complexity of ABO from a purely algorithmic perspective. We do not consider the cost of the BOF evaluation in this analysis. Two GP models are involved in ABO, but one of them, the LF GP model, can be trained offline. All required predictions given by the LF GP model can also be obtained offline, before running the main loop of the algorithm. Within the main loop, the ABO algorithm has two additional operations compared with the baseline GP-UCB method, namely the posterior regularization operation (Eqns.(10)-(11)) and the weight updating operation (Eqns.(13)-(14)). They add only a small amount of computation. From the above analysis, we see that ABO has the same level of computational complexity per iteration as the GP-UCB method. As will be shown in Section IV, ABO requires fewer iterations than the GP-UCB method to find a good enough solution, which means that, in real applications, ABO will have a lower total computational cost than the baseline GP-UCB method.
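To make the cost structure concrete, here is a self-contained sketch of the ABO loop for 1D inputs. It is an illustration under several assumptions and not the paper's Algorithm 2: the acquisition is maximized over a fixed candidate `grid` rather than by CMA-ES, GP hyper-parameters (`kappa0`, `h`, `noise_var`) are fixed rather than optimized, the constant prior mean is taken as the sample mean of the data, `beta` is constant, and the weight is clipped away from 0 and 1 as a numerical safeguard.

```python
import numpy as np

def se_kernel(A, B, kappa0, h):
    sq = (A[:, None] - B[None, :]) ** 2              # 1D inputs only
    return kappa0 * np.exp(-sq / (2.0 * h ** 2))

def gp_predict(X, y, Xq, kappa0=25.0, h=0.5, noise_var=1e-4):
    """Posterior mean/variance of a constant-mean GP with fixed hyper-parameters."""
    m = y.mean()
    K = se_kernel(X, X, kappa0, h) + noise_var * np.eye(len(X))
    k = se_kernel(X, Xq, kappa0, h)
    mean = m + k.T @ np.linalg.solve(K, y - m)
    var = kappa0 - np.sum(k * np.linalg.solve(K, k), axis=0)
    return mean, np.maximum(var, 1e-12)

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def abo(f, X_lf, y_lf, X_hf, y_hf, grid, n_iter=10, beta=4.0, alpha=0.9, w_lf=0.5):
    # LF predictions on the whole grid are computed once, before the loop.
    mu_lf, var_lf = gp_predict(X_lf, y_lf, grid)
    X_hf, y_hf = np.array(X_hf, dtype=float), np.array(y_hf, dtype=float)
    for _ in range(n_iter):
        mu_hf, var_hf = gp_predict(X_hf, y_hf, grid)
        # Posterior regularization by the weighted product of two Gaussians.
        p_hf, p_lf = (1.0 - w_lf) / var_hf, w_lf / var_lf
        var_reg = 1.0 / (p_hf + p_lf)
        mu_reg = var_reg * (p_hf * mu_hf + p_lf * mu_lf)
        i = int(np.argmax(mu_reg + np.sqrt(beta * var_reg)))
        y_new = f(grid[i])                           # the only expensive call
        # Weight adaptation: forgetting step, then Bayes update on improvement.
        a, b = w_lf ** alpha, (1.0 - w_lf) ** alpha
        w_lf = a / (a + b)
        if y_new > y_hf.max():
            l_lf = normal_pdf(y_new, mu_lf[i], var_lf[i])
            l_hf = normal_pdf(y_new, mu_hf[i], var_hf[i])
            denom = w_lf * l_lf + (1.0 - w_lf) * l_hf
            if denom > 0.0:
                w_lf = w_lf * l_lf / denom
        w_lf = float(np.clip(w_lf, 1e-6, 1.0 - 1e-6))   # safeguard, not in the paper
        X_hf, y_hf = np.append(X_hf, grid[i]), np.append(y_hf, y_new)
    return X_hf, y_hf, w_lf
```

Inside the loop, the only operations beyond a plain GP-UCB iteration are a handful of element-wise array operations for the fusion and a scalar weight update, which is consistent with the per-iteration analysis above.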

IV Experiments

We compare ABO with the GP-UCB algorithm and two MFBO methods, termed MFBO-I and MFBO-II here. We consider four function optimization cases. Three of the objective functions under consideration are benchmark functions used for multi-fidelity simulation in the literature. For each case, the objective function is treated as a BOF, and the maximum number of allowed evaluations of the BOF is limited to 20. Several LF data points are generated by evaluating an LF version of the BOF at query points randomly chosen from $\chi$ . The performance metric adopted here is the simple regret, as defined in Eqn.(2).

The GP-UCB method is included here as a baseline for algorithm performance comparison. MFBO-I is adapted from [26], in which the HF GP model is initialized with the best query point suggested by the LF GP model that is trained to fit the LF data. MFBO-II is obtained by slightly adjusting the MF-GP-UCB method of [5]. The only difference between MF-GP-UCB and MFBO-II is that the former selects a fidelity level for the next query at each iteration, while the latter restricts the fidelity level of the next query to the highest one, to fit the settings considered here. We treat MFBO-II as a competing posterior regularization based method, which regularizes the posterior given by the HF GP model in a different way. Through empirical tests, we show that our proposed posterior regularization operator outperforms that used in MFBO-II.

We start by introducing the objective functions under use. We consider four functions in total. In the design of these functions, many practical issues, e.g., different relationships between the LF and the HF functions, have been considered. Therefore, we expect that experimental results revealed here can be generalized to real-life cases.

IV-A Function optimization cases under consideration

Here, with a slight abuse of notation, we use $x_{i}$ to denote the $i$ th element of the vector $x$ .

Case I

First, we considered a 1D pedagogical case in which the BOF $f$ and its LF counterpart $f_{l}$ are specified as below

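$$f(x)=2x^{1.2}\sin(2x)+2, \qquad (15)$$

$$f_{l}(x)=0.7f(x)+(x^{1.3}-0.3)\cdot\sin(3x-0.5)+4\cos(2x)-5, \qquad (16)$$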

where $x\in[0,6]$ .
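For reference, the Case I pair can be written directly in NumPy. The function names `f_hf` and `f_lf` and the grid resolution are illustrative choices; the grid search is only a quick way to locate the maximizer of the HF function for plotting or regret computation.

```python
import numpy as np

def f_hf(x):
    """HF objective of Case I on [0, 6]."""
    return 2.0 * x ** 1.2 * np.sin(2.0 * x) + 2.0

def f_lf(x):
    """LF counterpart of Case I on [0, 6]."""
    return (0.7 * f_hf(x)
            + (x ** 1.3 - 0.3) * np.sin(3.0 * x - 0.5)
            + 4.0 * np.cos(2.0 * x) - 5.0)

x = np.linspace(0.0, 6.0, 6001)
x_best = x[np.argmax(f_hf(x))]    # grid estimate of the maximizer of f
f_best = f_hf(x_best)             # grid estimate of f*
```

The grid maximizer lies close to $x=4$, which matches the region that the introduction describes as carrying the missing structural information.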

Case II

We then considered a 2D benchmark function used in [27]. It is defined as

[TABLE]

where $x_{i}\in[0,1]$ , for all $i=1,2$ . Following [28], we considered an LF approximation of $f$ as below

$$f_{l}(x)=\frac{A+B+C+D}{4}, \qquad (18)$$

where $A=f(x_{1}+0.05,x_{2}+0.05)$ , $B=f(x_{1}+0.05,\max(0,x_{2}-0.05))$ , $C=f(x_{1}-0.05,x_{2}+0.05)$ , $D=f(x_{1}-0.05,\max(0,x_{2}-0.05))$ , and $x_{i}\in[0,1]$ , for all $i=1,2$ .

Case III

Next we considered a 4D benchmark function, termed Park (1991) Function 1 [28]:

[TABLE]

where $x_{i}\in[0,1)$ , for all $i=1,2,3,4.$ Following [28], we set its LF approximation to be:

$$f_{l}(x)=\left[1+\frac{\sin(x_{1})}{10}\right]f(x)-2x_{1}+x_{2}^{2}+x_{3}^{2}+0.5, \qquad (20)$$

where $x_{i}\in[0,1)$ , for all $i=1,2,3,4.$

Case IV

The final function considered here is Park (1991) Function 2 [28]:

$$f(x)=\frac{2}{3}\exp(x_{1}+x_{2})-x_{4}\sin(x_{3})+x_{3}, \qquad (21)$$

where $x_{i}\in[0,1]$ , for all $i=1,2,3,4.$ Its LF approximation is [28]:

$$f_{l}(x)=1.2f(x)-1. \qquad (22)$$
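The Case IV pair is simple enough to state in code. The names `park2` and `park2_lf` are illustrative; the input is assumed to be a length-4 sequence with entries in $[0,1]$.

```python
import numpy as np

def park2(x):
    """Park (1991) Function 2 (HF objective of Case IV)."""
    x1, x2, x3, x4 = x
    return (2.0 / 3.0) * np.exp(x1 + x2) - x4 * np.sin(x3) + x3

def park2_lf(x):
    """LF approximation: an affine transform of the HF function."""
    return 1.2 * park2(x) - 1.0
```

Because the LF function here is an affine transform of the HF one with a positive slope, both share the same maximizer, so the LF data carry exact information about where the optimum lies.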

Note that, in Cases I and III of this subsection, the LF data are far from the HF data. Fig.3 shows that the advantage of our method is more pronounced in these two cases.

IV-B Experimental results

In the experiments, we adopted the SE kernel function and the constant mean function for GP regression, and the “minimize.m” function in the GPML toolbox [16] for hyper-parameter optimization of GP models. For all algorithms considered, we adopted CMA-ES of [17] for optimizing the acquisition function. Each algorithm is run 100 times independently to get a Monte Carlo estimate of the algorithm’s performance for each case considered. The weight of the LF GP model $w_{lf}$ is initialized at 0.5 for ABO. Using Case I, we validated the mechanism of ABO for harnessing LF data to accelerate the search by visualizing an intermediate result, as shown in Figs.1-2. Fig.3 plots the simple regrets. ABO outperforms the other methods significantly in terms of search speed in the first three cases. For the last case, ABO is much faster than the baseline GP-UCB method and MFBO-II, and it achieves a much smaller simple regret than MFBO-I. Fig.4 shows that the influence of the HF GP model increases over time as more HF evaluations of $f$ are performed, which conforms to our expectation.

V Conclusion

In this paper, we demonstrated that LF data, which may arise from previous evaluations of an LF approximation of the BOF or of a related optimization task, can be a valuable resource for accelerating BO. In particular, we presented a novel algorithm design, namely ABO, for harnessing LF data to accelerate the GP-UCB algorithm of [1]. Experimental results demonstrate that our algorithm outperforms existing state-of-the-art methods consistently over all cases under consideration.

The basic idea underlying our method is to enable the LF data to influence the GP posterior in an automatic, data-driven and theoretically sound way. We implemented this idea by generalizing the POE model of [21] via Bayesian dynamic model averaging in the context of GP-UCB. In principle, our algorithm can be regarded as an efficient approach to warm start GP-UCB by making use of related LF data. Compared with related existing methods, the presented ABO algorithm has three major features. First, it requires no specific assumptions on the correlation structure between the BOF and its LF approximation. Second, the impact of the LF data is adaptively adjusted online. Specifically, the more informative the LF data are, compared with the HF data that have already been observed, the greater their impact on suggesting the next HF data point to evaluate. Lastly, the computational complexity per iteration of ABO is roughly the same as that of GP-UCB, provided that the LF GP model has been built beforehand.

Throughout, we use GP-UCB as a running example of BO methods, but the ideas presented need not be restricted to GP-UCB. One possible direction for future work is to investigate the applicability of these ideas to accelerating other types of BO methods and to develop the corresponding algorithms.

Bibliography

[1] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” in International Conference on Machine Learning (ICML), 2010, pp. 1015–1022.
[2] R. Sen, K. Kandasamy, and S. Shakkottai, “Multi-fidelity black-box optimization with hierarchical partitions,” in International Conference on Machine Learning (ICML), 2018, pp. 4545–4554.
[3] P. Qian and J. Wu, “Bayesian hierarchical modeling for integrating low-accuracy and high-accuracy experiments,” Technometrics, vol. 50, no. 2, pp. 192–204, 2008.
[4] B. Peherstorfer, K. Willcox, and M. Gunzburger, “Survey of multifidelity methods in uncertainty propagation, inference, and optimization,” SIAM Review, vol. 60, no. 3, pp. 550–591, 2018.
[5] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Póczos, “Gaussian process bandit optimisation with multi-fidelity evaluations,” in Advances in Neural Information Processing Systems, 2016, pp. 992–1000.
[6] P. Perdikaris and G. E. Karniadakis, “Model inversion via multi-fidelity Bayesian optimization: a new paradigm for parameter estimation in haemodynamics, and beyond,” Journal of The Royal Society Interface, vol. 13, no. 118, pp. 20151107, 2016.
[7] S. F. Ghoreishi and D. Allaire, “Multi-information source constrained Bayesian optimization,” Structural and Multidisciplinary Optimization, pp. 1–15, 2018.
[8] M. Poloczek, J. Wang, and P. Frazier, “Multi-information source optimization,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 4288–4298.