Multifidelity Bayesian Optimization for Binomial Output
Leonid Matyushin, Alexey Zaytsev, Oleg Alenkin, Andrey Ustuzhanin

TL;DR
This paper introduces a multifidelity Bayesian optimization approach tailored for binomial output functions, leveraging a specialized Gaussian process model and an adaptive sampling strategy to efficiently optimize expensive binomial-based targets.
Contribution
It develops a novel Gaussian process model for binomial outputs and proposes an adaptive sampling heuristic within a multifidelity Bayesian optimization framework.
Findings
Effective optimization of binomial functions demonstrated
Adaptive sampling improves efficiency and accuracy
Model outperforms traditional Gaussian process approaches
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Advanced Multi-Objective Optimization Algorithms · Control Systems and Identification
Methods: Gaussian Process
Multifidelity Bayesian Optimization for Binomial Output
Leonid Matyushin
Alexey Zaytsev
Oleg Alenkin
Andrey Ustuzhanin
Skoltech, Moscow
IITP RAS, Moscow
National Research University Higher School of Economics, Moscow
Abstract
The key idea of Bayesian optimization is to replace an expensive target function with a cheap surrogate model. By selecting an acquisition function, we trade off exploration against exploitation. The acquisition function typically depends on the mean and the variance of the surrogate model at a given point. The most common Gaussian process-based surrogate model assumes that the target with fixed parameters is a realization of a Gaussian process. However, the target function often does not satisfy this assumption. Here we consider target functions that come from the binomial distribution with a parameter that depends on the inputs. Typically we can vary how many Bernoulli samples we obtain during each evaluation. We propose a general Gaussian process model that takes Bernoulli outputs into account. To make this work, we consider a simple acquisition function based on Expected Improvement and a heuristic strategy for choosing the number of samples at each point, thus taking into account the precision of the obtained output.
Keywords: Bayesian optimization, Gaussian processes, Binomial distribution, Multifidelity
Journal: Neurocomputing
1 Introduction
Bayesian optimization (BO) is a powerful class of optimization methods for black-box non-deterministic functions. In the vanilla approach we assume that the function is a deterministic function plus Gaussian noise, which yields an analytical treatment of the posterior mean and variance. Using the mean and variance, we can evaluate most of the acquisition functions used to select the point to evaluate at the next optimization step [1].
This assumption about the target function is sometimes inadequate. For example, for binomially distributed observations vanilla BO sometimes struggles to find a minimum because the model is misspecified. Binomially distributed observations often occur in high energy physics, e.g. in the spectrometer tracker optimization [2] and the muon shield optimization [3]. In both examples the target functions are Monte-Carlo simulations of real experiments, where the main source of randomness is quantum effects.
These two examples share many properties. The target functions are expensive to evaluate: it takes hours to get results on a modern cluster. The target functions have discrete distributions, so they are not deterministic functions with Gaussian noise. For example, the latter is naturally Binomially distributed. It is possible to choose the complexity of a simulation, which is determined by the number of simulated particles. High-fidelity simulations are accurate but expensive; low-fidelity simulations are cheaper but less accurate.
Thus we need an approach that is able to deal with this kind of problem. To create such an approach we need to propose a correct model based on generalized Gaussian process regression and then construct an acquisition function. We also need to clarify whether the availability of multifidelity evaluations can improve our models.
The paper is organized as follows. In Section 3 we give preliminary information about Bayesian optimization (BO) and Gaussian processes (GP). Section 4 is devoted to our modification of GP model construction and BO for binomial output. In Section 5 we investigate the usefulness of the proposed approach and examine its peculiarities and possible applications. We use artificial functions in our numerical experiments.
2 Related Works
Bayesian optimization is known in different areas under different names, and its range of applications is quite wide. A recent overview of Bayesian optimization is provided in [1]; see that article and the references therein. Below we cover some issues related to our specific applications.
We start with a range of applications where the output is binomial, for example hyperparameter tuning and AutoML. In [4] the authors propose an early stopping criterion combined with a modification of the EI acquisition function, in which the evaluation of a configuration is stopped if its predicted performance is worse than that of the current best configuration. Bayesian optimization was used for tuning the hyperparameters of AlphaGo [5] as well as of other deep learning based systems [6]. See also [2] and [3] for high energy physics.
As mentioned in Section 1, the values of a black box can have a Binomial distribution. This means that exact Bayesian inference fails, since the likelihood is not Gaussian. The same problem arises when Gaussian processes are adapted to classification [7] or to robust regression with a Laplace or Cauchy likelihood [8].
To use these models one can approximate the non-Gaussian posterior by a Gaussian distribution. Many approaches are used in this area, to name a few [9]: Markov chain Monte Carlo [7], Laplace approximation [10], mean field variational inference [11], and expectation propagation [12]. GP models such as the GP classifier, GP counter or GP regression use different observation likelihoods: Bernoulli, Poisson, Gaussian, Binomial, etc. All these distributions are members of the exponential family. The aim of [13] is to provide a framework that unifies the existing GP models and makes it easier to create new ones using distributions from the exponential family.
Common Bayesian optimization approaches assume single-fidelity simulations. In some cases, however, it is possible to use a cheaper, lower-fidelity calculation of the same objective. For example, in aerodynamics, hyperparameter tuning and industrial design there is an opportunity to use simulations with different fidelities [14, 15, 16, 17, 18]. The methods above address specific problems. The more general approach MF-GP-UCB [19] considers a finite number of approximations and assumes that there exists an upper bound on the difference between high and low fidelity. This model does not share information between fidelities, and each model is treated independently. There exists a generalization of this approach to the continuous fidelity case [20].
3 Overview
3.1 Bayesian optimization
We want to minimize a function $f(x)$. Suppose that it is impossible to evaluate it directly. In the classic case of Gaussian process regression we observe $y = f(x) + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. In this paper we consider observations from the Binomial distribution $\mathrm{Bin}(n, f(x))$, where $n$ is the number of evaluations at each point. We can represent vanilla BO as the following iterative scheme:
1. Train a regression model that approximates the target function via GP. Now we can evaluate an acquisition function $a(x)$ using the regression model.
2. Obtain the point that maximizes the acquisition function: $x_{k+1} = \arg\max_{x} a(x)$.
3. Evaluate the observation $y_{k+1}$ at the point $x_{k+1}$.
4. Update the available sample: $D_{k+1} = D_k \cup \{(x_{k+1}, y_{k+1})\}$.
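The iterative scheme above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the callables `evaluate`, `fit_surrogate` and `acquisition` are hypothetical placeholders, and the acquisition function is maximized over a finite candidate set instead of with a continuous optimizer.

```python
import numpy as np


def bayes_opt_loop(evaluate, fit_surrogate, acquisition, candidates, X0, y0, n_iter):
    """Generic BO loop; all callables are hypothetical placeholders.

    evaluate(x)              -> noisy observation of the black box at x
    fit_surrogate(X, y)      -> fitted surrogate model (any object)
    acquisition(model, cand) -> array of acquisition values, one per candidate
    """
    X, y = list(X0), list(y0)
    for _ in range(n_iter):
        model = fit_surrogate(np.array(X), np.array(y))  # step 1: train surrogate
        scores = acquisition(model, candidates)          # acquisition on candidates
        x_next = candidates[int(np.argmax(scores))]      # step 2: maximize it
        y_next = evaluate(x_next)                        # step 3: evaluate black box
        X.append(x_next)                                 # step 4: update the sample
        y.append(y_next)
    return np.array(X), np.array(y)
```

Any surrogate and acquisition pair with these signatures can be plugged in, which keeps the loop independent of whether the likelihood is Gaussian or Binomial.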
Now let us consider each step in more detail.
3.2 Regression Model
Gaussian process regression is a popular approach for the construction of nonlinear regression models [10] with uncertainty estimates required to perform Bayesian optimization.
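A Gaussian process on the input domain $\mathbb{X}$ is fully specified by its mean and covariance functions $m(x)$ and $k(x, x')$. Following the Bayesian approach, on the $k$-th step we put the following prior distribution over the $f_i$'s, where $f_i = f(x_i)$, $i = 1, \dots, k$:
$$p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \mathbf{m}, K),$$
where $\mathbf{m} = (m(x_1), \dots, m(x_k))^\top$ and $K = \{k(x_i, x_j)\}_{i,j=1}^{k}$. The typical way is to set $m(x) = 0$ and to take a squared exponential covariance function, $k(x, x') = \sigma_f^2 \exp\left(-\|x - x'\|^2 / (2\ell^2)\right)$.
To perform Bayesian inference we need to choose a likelihood. The classical approach is to work with a Gaussian likelihood, since in this case the prior and the likelihood are conjugate and it is possible to find an exact expression for the posterior [21]:
$$p(\mathbf{f} \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)}{p(\mathbf{y} \mid X)}.$$
This is a distribution over the already visited points. To make a prediction at a new point $x_*$ we need to compute the following integral. In the case of a Gaussian likelihood we have two conjugate distributions inside this integral, so the result can be found analytically:
$$p(f_* \mid x_*, X, \mathbf{y}) = \int p(f_* \mid \mathbf{f}, x_*, X)\, p(\mathbf{f} \mid X, \mathbf{y})\, d\mathbf{f},$$
where $p(f_* \mid \mathbf{f}, x_*, X)$ is Gaussian, since all $f$'s are jointly Gaussian distributed. The distribution over the noisy variable $y_*$ is nothing but
$$p(y_* \mid x_*, X, \mathbf{y}) = \int p(y_* \mid f_*)\, p(f_* \mid x_*, X, \mathbf{y})\, df_*,$$
where $p(y_* \mid f_*)$ is a Gaussian distribution in the case of a Gaussian likelihood, so the classical final answer is a Gaussian distribution. It follows that the posterior mean and the posterior variance have the following form:
$$\mu(x_*) = \mathbf{k}_*^\top (K + \sigma_\varepsilon^2 I)^{-1} \mathbf{y}, \qquad \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_\varepsilon^2 I)^{-1} \mathbf{k}_*,$$
where $\mathbf{k}_* = (k(x_*, x_1), \dots, k(x_*, x_k))^\top$ and $\sigma_\varepsilon^2$ is the variance of the Gaussian noise.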
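For the Gaussian-likelihood case, the posterior mean and variance can be computed with a few lines of NumPy. The sketch below assumes a zero prior mean and a squared exponential kernel; the function names and the default hyperparameters (`length_scale`, `variance`, `noise_var`) are illustrative choices, not values from the paper.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared exponential covariance between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(X, y, X_new, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP with Gaussian noise."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))     # K + sigma_eps^2 I
    K_s = rbf_kernel(X, X_new)                            # shape (n_train, n_new)
    L = np.linalg.cholesky(K)                             # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma_eps^2 I)^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_new, X_new)) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)                     # clip tiny negative values
```

The Cholesky factorization avoids forming the matrix inverse explicitly, which is the usual choice for numerical stability.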
3.3 Acquisition function
At each iteration of Bayesian optimization we select the next point at which to evaluate the target function. There are several approaches. One of the most popular choices, which we adopt here, is “Expected Improvement” (EI):
$$EI(x) = \mathbb{E}\left[\max\left(0,\; y_{\min} - f(x)\right)\right],$$
where $y_{\min}$ is the minimal value obtained up to the current iteration. In the case of a Gaussian likelihood it was shown that it converges under mild assumptions [citation TODO] and has a closed-form expression:
$$EI(x) = (y_{\min} - \mu(x))\, \Phi(z) + \sigma(x)\, \varphi(z), \qquad z = \frac{y_{\min} - \mu(x)}{\sigma(x)},$$
where $\mu(x)$ and $\sigma(x)$ are the posterior mean and standard deviation, $\Phi$ is the standard normal CDF and $\varphi$ is the standard normal PDF.
The next point is the maximizer of $EI(x)$. Expected improvement has many local maxima and regions with an almost constant function value. As we only need to evaluate the posterior mean and variance at each point, it is cheap to perform as many evaluations as required, so most global optimization methods can solve this task.
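As an illustration, the standard closed-form Expected Improvement for minimization under a Gaussian posterior can be written as below. The function name and the handling of zero posterior standard deviation (where EI reduces to the plain improvement) are choices made for this sketch.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mean, std, y_min):
    """Closed-form EI for minimization, given posterior mean and std arrays."""
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    std = np.atleast_1d(np.asarray(std, dtype=float))
    ei = np.maximum(y_min - mean, 0.0)          # limit of EI as std -> 0
    pos = std > 0
    z = (y_min - mean[pos]) / std[pos]
    ei[pos] = (y_min - mean[pos]) * norm.cdf(z) + std[pos] * norm.pdf(z)
    return ei
```

Because the function only needs the posterior mean and standard deviation, it can be evaluated on many candidate points at negligible cost compared with one black-box evaluation.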
4 Proposed approach
In this paper we consider Binomial Bayesian optimization, so the observed variable has a Binomial distribution:
$$y \sim \mathrm{Bin}(n, f(x)),$$
where $n$ is the number of Bernoulli trials and $f(x) \in [0, 1]$ is the success probability at the point $x$.
In this case we use the generalized Gaussian process model (GGPM) setting to perform Bayesian inference [13]. In particular, the posterior, the predictive integral and the predictive distribution of Section 3.2 cannot be expressed in closed form, since there are no conjugate distributions in the corresponding formulas, so one should use approximate methods such as the Laplace approximation [13]. We approximate Expected Improvement (Section 3.3) in the Binomial setting via Monte-Carlo.
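A Monte-Carlo estimate of Expected Improvement in the Binomial setting might look like the sketch below. It rests on two assumptions that are not taken from the paper: the approximate posterior of the latent GP value at a point is Gaussian (as a Laplace approximation would give), and the latent value is mapped to a success probability with a logistic link. The names `latent_mean`, `latent_std` and `p_min` are hypothetical.

```python
import numpy as np
from scipy.special import expit


def mc_expected_improvement(latent_mean, latent_std, p_min, n_samples=10_000, rng=None):
    """Monte-Carlo EI on the success probability (minimization).

    Assumes a Gaussian approximate posterior N(latent_mean, latent_std^2) for the
    latent value and a logistic link; both are assumptions of this sketch.
    """
    rng = np.random.default_rng(rng)
    g = rng.normal(latent_mean, latent_std, size=n_samples)  # latent samples
    p = expit(g)                                             # success probabilities
    return float(np.mean(np.maximum(p_min - p, 0.0)))        # average improvement
```

The estimator is unbiased for the expectation under the approximate posterior, and its noise decreases as `n_samples` grows.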
In practice we specify the parameter $n$ of the Binomial distribution before evaluating the black box. Low values of $n$ allow us to spend less computational resources, while large values of $n$ give a more exact evaluation.
The proposed method considers two different fidelities for simulation, $n_{\mathrm{low}}$ and $n_{\mathrm{high}}$. The first thing to do is to determine these fidelities. To identify a promising point, one can run a low-fidelity simulation at this point, which does not cost much. Then, using the obtained information, one can decide whether to continue the simulation at the same point or to move to the next one proposed by the acquisition function.
The decision function takes as input the result of the low-fidelity simulation at the current point $x$, the current surrogate model and some external parameters. After running the low-fidelity simulation one can calculate the posterior distribution of the objective function at this point. Since the observation $y$ is a realization of a Binomial random variable with parameters $n$ and $p$, one can estimate the posterior distribution of $p$:
$$\pi(p \mid y) \propto P(y \mid p)\, \pi(p).$$
$y$ is binomially distributed, so we assume that the prior $\pi(p)$ is a Beta distribution with parameters $\alpha = 1$, $\beta = 1$ (the uniform distribution on $[0, 1]$), since in this case we can perform Bayesian inference in closed form:
$$p \mid y \sim \mathrm{Beta}(y + 1,\; n - y + 1).$$
Let us now compute the probability of improvement of $p$ at the current point:
$$P(p < p_{\min} \mid y) = \int_0^{p_{\min}} \pi(p \mid y)\, dp,$$
where $p_{\min}$ is the best (minimal) value found so far.
This probability is the value of the Beta CDF at $p_{\min}$. Now one compares this probability with a threshold $t$ and decides whether to continue the calculation at the current point or to spend the available budget on evaluations at other points. In Section 5 we explore the performance of the proposed approach for different choices of $t$.
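The decision rule described above can be sketched with SciPy's Beta distribution. With a uniform Beta(1, 1) prior and `successes` out of `n_low` Bernoulli trials, the posterior of the success probability is Beta(successes + 1, n_low - successes + 1). The function and argument names are hypothetical, and treating "probability at least the threshold" as the signal to continue is an assumption of this sketch.

```python
from scipy.stats import beta


def continue_at_point(successes, n_low, p_best, threshold):
    """Return (probability of improvement, whether to continue at this point).

    Uses a uniform Beta(1, 1) prior, so the posterior is
    Beta(successes + 1, n_low - successes + 1).
    """
    posterior = beta(successes + 1, n_low - successes + 1)
    prob_improve = float(posterior.cdf(p_best))   # P(p < p_best | data)
    return prob_improve, bool(prob_improve >= threshold)
```

For example, 5 successes out of 10 trials give a symmetric Beta(6, 6) posterior, so the probability that the success rate is below 0.5 is exactly one half.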
5 Experiments
5.1 Methodology of comparison
We assume that the only computational expenses in the Bayesian optimization workflow are the evaluations of the black box. For each function we consider how the minimal seen value depends on the spent computational resources. We suppose that a high-fidelity evaluation is more expensive than a low-fidelity one by a fixed factor.
To compare the approach proposed in Section 4 with other methods, we test the considered algorithms extensively on different optimization problems. Each problem is characterized by a target function and the Binomial parameter $n$. An optimization method starts from a random initial design and performs several optimization steps on the given function. For the sake of comparison we rescale the optimization results with respect to the spent computational resources. As a result, for each problem we get the minimal seen values per unit of computational resources. We average these values over multiple runs to achieve stability.
5.2 Metrics
We use Dolan-Moré curves to compare the results of optimization [22, 23]. To define Dolan-Moré curves for our problem we need to specify a set of problems $P$, a set of solvers $S$, and $t_{p,s}$, the measure of success of a solver $s \in S$ on a problem $p \in P$. In our case $P$ was defined in the previous subsection: it is the set of all minimization problems for the described functions with a fixed budget.
$S$ is the set that consists of vanilla Gaussian Bayesian optimization, vanilla Binomial Bayesian optimization, and the proposed modification of Binomial Bayesian optimization with two different settings. For all methods we used the SLSQP method to maximize Expected Improvement. $t_{p,s}$ is the true value of the objective function at the point with the minimal observed value at the current step (averaged over multiple runs).
Finally, the Dolan-Moré curve is the following:
$$\rho_s(\tau) = \frac{1}{|P|} \left| \left\{ p \in P : \frac{t_{p,s}}{\min_{s' \in S} t_{p,s'}} \le \tau \right\} \right|.$$
The higher the curve, the better the corresponding solver.
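A Dolan-Moré performance profile in its standard form [22] can be computed as in the sketch below. The matrix layout (`T[p, s]` is the measure of solver `s` on problem `p`, lower is better) and the requirement that all entries are strictly positive are assumptions of this example; measures that can reach zero would need a shift before taking ratios.

```python
import numpy as np


def performance_profile(T, taus):
    """Fraction of problems on which each solver is within a factor tau of the best.

    T[p, s] > 0 is the measure of solver s on problem p (lower is better).
    Returns an array of shape (len(taus), n_solvers).
    """
    T = np.asarray(T, dtype=float)
    ratios = T / T.min(axis=1, keepdims=True)   # performance ratio per problem
    return np.array([[np.mean(ratios[:, s] <= tau) for s in range(T.shape[1])]
                     for tau in taus])
```

Each column of the result is one solver's curve; a solver whose curve reaches 1 at a small factor is close to the best solver on every problem.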
5.3 Generation of samples for evaluation
Artificial functions of different dimensions are a common choice for benchmarking optimization algorithms. In our case the objective function should lie in the interval $[0, 1]$. So, we rescale each artificial function $g$ using its minimum and maximum values:
$$f(x) = \frac{g(x) - g_{\min}}{g_{\max} - g_{\min}}.$$
The values $g_{\min}$ and $g_{\max}$ are not always known, so we obtained them using numerical optimization. In the end, $f(x)$ lies in the desired interval and can be interpreted as the probability of success at a particular point.
We used several popular artificial functions: Michalewicz, Rastrigin, Zakharov and Styblinski-Tang functions on hypercubes of three different dimensions. Observations of these functions are sampled from the Binomial distribution with parameters $n$ and $f(x)$, with $n$ specified manually.
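The sample generation described above can be illustrated with the Rastrigin function. In this sketch the bounds `f_min` and `f_max` are supplied by the caller (for instance from a numerical search over the hypercube), and the clipping to $[0, 1]$ guards against slightly inaccurate bounds; the function names and these details are assumptions of the example, not the authors' code.

```python
import numpy as np


def rastrigin(x):
    """Rastrigin function; the last axis of x indexes the coordinates."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.shape[-1] + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x), axis=-1)


def rescale(f_val, f_min, f_max):
    """Map a function value to [0, 1] using its minimum and maximum."""
    return (f_val - f_min) / (f_max - f_min)


def binomial_observation(x, n, f_min, f_max, rng):
    """Draw one Binomial(n, p) observation with p = rescaled Rastrigin value at x."""
    p = np.clip(rescale(rastrigin(x), f_min, f_max), 0.0, 1.0)
    return rng.binomial(n, p)
```

Dividing the returned count by `n` gives a noisy estimate of the rescaled function value, with noise that shrinks as `n` grows, which is what makes `n` act as a fidelity parameter.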
5.4 Results
Note that two of the considered functions, Zakharov and Styblinski-Tang, have only one global optimum, while the other two, Michalewicz and Rastrigin, have multiple extrema.
Figure 2 depicts the dynamics of regret with respect to the spent computational resources. We see that the behaviour differs between the models used. In Figure 3 we can see Dolan-Moré curves for each of these two groups. The difference is even more evident.
We observe that in the single-optimum case vanilla Binomial Bayesian optimization outperforms vanilla Gaussian Bayesian optimization. Also, multifidelity Binomial Bayesian optimization outperforms all other algorithms. In the case of multiple extrema the situation is different: vanilla Gaussian Bayesian optimization outperforms the Binomial algorithms. We conclude that the Binomial optimization approach and, especially, the multifidelity one is a good choice for single-optimum functions, while for functions with multiple extrema vanilla Gaussian Bayesian optimization is better.
6 Discussion
In this paper we consider the problem of optimization in a non-Gaussian likelihood setting. According to our findings, using a proper surrogate model during Bayesian optimization significantly improves performance. It is important for the case of complex multi-optimum functions. At the same time this approach has a drawback: using GGPM instead of the usual GP regression requires more time for fitting, since approximate inference must be used instead of closed-form calculation. As a consequence, calculating the acquisition function also becomes intractable; in this work we applied Monte-Carlo sampling to this problem. This drawback is not crucial, however, because BO is usually applied to black-box functions that are expensive to evaluate, so the time spent searching for the next point is much lower than the time of an objective function evaluation.
We also proposed modifications of the algorithm to work with variable-fidelity evaluations: high simulation cost with a low noise level, and low simulation cost with a high noise level. The proposed heuristic showed a weak improvement in comparison with single-fidelity optimization. For further improvement one can try to select the hyperparameters and the fidelity in a smarter way.
In this work the ratios for the conducted experiments were chosen close to 2. We assume that using reinforcement learning techniques to select these parameters depending on the particular function might lead to better optimization performance.
7 Conclusions
Bayesian optimization is actively used today in different fields. There are several main directions for the development of this approach. Firstly, computing not one but a batch of next points to parallelize the evaluation of a costly function. Secondly, working with different fidelities. In this work we showed that using suitable surrogate models gives a significant improvement in optimization. Moreover, we have proposed an algorithm suitable for all distributions from the exponential family.
8 Acknowledgments
The research presented in Section 5 of this paper was partially supported by the Russian Foundation for Basic Research grants 16-01-00576 A and 16-29-09649 ofi_m.
References
- [1] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE 104 (1) (2016) 148–175.
- [2] E. Van Herwijnen, T. Ruf, M. Ferro-Luzzi, H. Dijkstra, Simulation and pattern recognition for the SHiP spectrometer tracker, Tech. rep. (2015).
- [3] A. Baranov, E. Burnaev, D. Derkach, A. Filatov, N. Klyuchnikov, O. Lantwin, F. Ratnikov, A. Ustyuzhanin, A. Zaitsev, Optimising the active muon shield for the SHiP experiment at CERN, in: Journal of Physics: Conference Series, Vol. 934, IOP Publishing, 2017, p. 012050.
- [4] T. Domhan, J. T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, in: Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- [5] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, N. de Freitas, Bayesian optimization in AlphaGo, arXiv preprint arXiv:1812.06855.
- [6] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, Fast Bayesian optimization of machine learning hyperparameters on large datasets, arXiv preprint arXiv:1605.07079.
- [7] H. Nickisch, C. E. Rasmussen, Approximations for binary Gaussian process classification, Journal of Machine Learning Research 9 (Oct) (2008) 2035–2078.
- [8] M. Opper, C. Archambeau, The variational Gaussian approximation revisited, Neural Computation 21 (3) (2009) 786–792.
- [9] C. M. Bishop, Pattern recognition and machine learning, Springer, 2006.
- [10] C. K. Williams, C. E. Rasmussen, Gaussian processes for machine learning, Vol. 2, MIT Press Cambridge, MA, 2006.
- [11] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (518) (2017) 859–877.
- [12] T. P. Minka, Expectation propagation for approximate Bayesian inference, in: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 2001, pp. 362–369.
- [13] L. Shang, A. B. Chan, On approximate inference for generalized Gaussian process models, arXiv preprint arXiv:1311.6371.
- [14] A. I. Forrester, A. Sóbester, A. J. Keane, Multi-fidelity optimization via surrogate modelling, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 463 (2088) (2007) 3251–3269.
- [15] D. Huang, T. T. Allen, W. I. Notz, R. A. Miller, Sequential kriging optimization using multiple-fidelity evaluations, Structural and Multidisciplinary Optimization 32 (5) (2006) 369–382.
- [16] A. Klein, S. Bartels, S. Falkner, P. Hennig, F. Hutter, Towards efficient Bayesian optimization for big data, in: NIPS 2015 Bayesian Optimization Workshop, 2015.
- [17] A. Zaytsev, Variable fidelity regression using low fidelity function blackbox and sparsification, in: Symposium on Conformal and Probabilistic Prediction with Applications, Springer, 2016, pp. 147–164.
- [18] A. Zaytsev, Reliable surrogate modeling of engineering data with more than two levels of fidelity, in: 2016 7th International Conference on Mechanical and Aerospace Engineering (ICMAE), IEEE, 2016, pp. 341–345.
- [19] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, B. Póczos, Gaussian process bandit optimisation with multi-fidelity evaluations, in: Advances in Neural Information Processing Systems, 2016, pp. 992–1000.
- [20] K. Kandasamy, G. Dasarathy, J. Schneider, B. Póczos, Multi-fidelity Bayesian optimisation with continuous approximations, in: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017, pp. 1799–1808.
- [21] E. Burnaev, A. Zaytsev, M. Panov, P. Prihodko, Y. Yanovich, Modeling of nonstationary covariance function of Gaussian process using decomposition in dictionary of nonlinear functions, Information Technologies and Systems–2011 (2011) 2–7.
- [22] E. D. Dolan, J. J. Moré, Benchmarking optimization software with performance profiles, Mathematical Programming 91 (2) (2002) 201–213.
- [23] M. Belyaev, E. Burnaev, E. Kapushev, M. Panov, P. Prikhodko, D. Vetrov, D. Yarotsky, GTApprox: Surrogate modeling for industrial design, Advances in Engineering Software 102 (2016) 29–39.
