BCMA-ES II: revisiting Bayesian CMA-ES
Eric Benhamou, David Saltiel, Beatrice Guez, Nicolas Paris

TL;DR
This paper revisits Bayesian CMA-ES, clarifies the differences between normal and inverse Wishart priors, and introduces a mixture model to unify both approaches, supported by numerical experiments.
Contribution
It provides theoretical insights into the covariance expectations of normal and inverse Wishart priors and proposes a generalized mixture model for Bayesian CMA-ES.
Findings
Expected covariance is lower with normal Wishart prior due to convexity.
The mixture model unifies normal and inverse Wishart priors.
Numerical experiments compare the performance of both models and the generalized approach.
Abstract
This paper revisits the Bayesian CMA-ES and provides updates for the normal Wishart prior. It emphasizes the difference between a normal Wishart and a normal inverse Wishart prior. After some computation, we prove that the only difference lies, surprisingly, in the expected covariance. We prove that the expected covariance should be lower in the normal Wishart prior model because of the convexity of the inverse. We present a mixture model that generalizes both the normal Wishart and the normal inverse Wishart models. We finally present various numerical experiments to compare both methods as well as the generalized method.
Eric Benhamou (A.I Square Connect and Lamsade, France), David Saltiel (A.I Square Connect and LISIC, France), Beatrice Guez (A.I Square Connect, France) and Nicolas Paris (A.I Square Connect, France)
(2019)
Keywords: CMA-ES, Bayesian, conjugate prior, normal Wishart, normal inverse Wishart, mixture models
A.I Square Working Paper, March 2019, France. CCS concepts: Mathematics of computing, Probability and statistics.
1. Introduction
Bayesian statistics has revolutionized statistics much as quantum mechanics did Newtonian mechanics. As in that case, the usual frequentist statistics can be seen as a particular asymptotic case of the Bayesian one. Indeed, the Cox-Jaynes theorem (Cox, 1946) proves that under the four axiomatic assumptions given by:
- plausibility degrees are represented by real numbers (continuity of method),
- none of the possible data should be ignored (no retention),
- these values follow the usual common sense rules, as stated by the well-known Laplace formula: probability theory is truly common sense reduced to calculus (common sense),
- and states of equivalent knowledge should have equivalent degrees of plausibility (consistency),
there exists a probability measure, defined up to a monotonic function, that follows the usual probability calculus and the fundamental rule of Bayes, that is:
$$P(H, D) = P(H \mid D)\, P(D) = P(D \mid H)\, P(H) \tag{1}$$
where $H$ and $D$ are two members of the implied $\sigma$-algebra. The letters are not chosen by chance: $H$ stands for the hypothesis, which can be interpreted as a hypothesis on the parameters, while $D$ stands for data.
The usual frequentist probability states the probability $P(D \mid H)$ of an observation $D$ given a certain hypothesis $H$ on the state of the world. However, as equation (1) is completely symmetric, nothing prevents us from changing our point of view and asking the inverse question: given an observation of data $D$, what is the plausibility of the hypothesis $H$? Bayes' rule trivially answers this question:
$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$$
or equivalently,
$$P(H \mid D) \propto P(D \mid H)\, P(H)$$
In the above equation, $P(H)$ is called the prior probability or simply the prior, while the conditional probability $P(H \mid D)$ is called the posterior probability or simply the posterior. A few remarks are in order. First, the prior is not necessarily independent of the knowledge of the experiment; on the contrary, a prior is often determined with some knowledge of previous experiments in order to make a meaningful choice. Second, prior and posterior are not necessarily related by a chronological order but rather by a logical order.
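As a small illustration of the prior-to-posterior step described above, the following sketch applies Bayes' rule to a finite set of hypotheses. The function name `posterior` and the coin example (two hypotheses, one observed head) are assumptions made for illustration only and do not come from the paper.

```python
def posterior(prior, likelihood):
    """Return P(H|D) for each hypothesis H using Bayes' rule.

    prior: dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(D|H) for the observed data D
    """
    # Unnormalised posterior: P(D|H) * P(H)
    joint = {h: likelihood[h] * prior[h] for h in prior}
    # Evidence P(D), the normalising constant
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical example: is a coin fair or biased towards heads?
prior = {"fair": 0.5, "biased": 0.5}
likelihood_heads = {"fair": 0.5, "biased": 0.8}  # P(heads | H)
post = posterior(prior, likelihood_heads)
print(post)
```

Observing a head shifts plausibility towards the biased hypothesis, while the posterior still sums to one.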
After observing some data $D$, we revise the plausibility of $H$. It is interesting to see that the conditional probability $P(D \mid H)$, considered as a function of $H$, is indeed a likelihood for $H$. The Cox-Jaynes theorem as presented in (Jaynes, 2003) gives the foundation for Bayesian calculus. Another important result is De Finetti's theorem. Let us recall the definition of infinite exchangeability.
Definition 1.0.
(Infinite exchangeability). We say that $(x_1, x_2, \ldots)$ is an infinitely exchangeable sequence of random variables if, for any $n$, the joint probability $p(x_1, x_2, \ldots, x_n)$ is invariant to permutation of the indices. That is, for any permutation $\pi$,
$$p(x_1, x_2, \ldots, x_n) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$$
Equipped with this definition, De Finetti's theorem, stated below, says that exchangeable observations are conditionally independent relative to some latent variable.
Theorem 1.2.
(De Finetti, 1930s). A sequence of random variables $(x_1, x_2, \ldots)$ is infinitely exchangeable iff, for all $n$,
$$p(x_1, x_2, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta)\, P(d\theta)$$
for some measure $P$ on $\theta$.
The representation theorem 1.2 justifies the use of priors on parameters since, for exchangeable data, there must exist a parameter $\theta$, a likelihood $p(x \mid \theta)$ and a distribution $\pi$ on $\theta$. A proof of De Finetti's theorem is given, for instance, in (Schervish, 1996) (section 1.5). We will see that this Bayesian setting gives a powerful framework for revisiting black box optimization, which is introduced below.
2. Black box optimization
We assume that we have a real-valued $p$-dimensional function $f: \mathbb{R}^p \to \mathbb{R}$. We examine the following optimization program:
$$\min_{x \in \mathbb{R}^p} f(x)$$
In contrast to traditional convex optimization theory, we do not assume that $f$ is convex or continuous, nor that it admits a global minimum. We are interested in the so-called black box optimization (BBO) setting, where we only have access to the function $f$ and nothing else. By nothing else, we mean that we cannot, for instance, compute gradients. A practical way to do optimization in this very general and minimal setting is evolutionary optimization, and in particular the covariance matrix adaptation evolution strategy (CMA-ES) methodology. The CMA-ES (Hansen and Ostermeier, 2001) is arguably one of the most powerful real-valued derivative-free optimization algorithms, finding many applications in machine learning. It is a state-of-the-art optimizer for continuous black-box functions, as shown by the various benchmarks of the COCO (COmparing Continuous Optimisers) INRIA platform for ill-posed functions. It has led to a large number of papers and articles, and we refer the interested reader to (Hansen and Ostermeier, 2001; Auger et al., 2004; Igel et al., 2007; Auger and Hansen, 2009; Hansen and Auger, 2011; Auger and Hansen, 2012; Hansen and Auger, 2014; Akimoto et al., 2015, 2016; Ollivier et al., 2017) and (Varelas et al., 2018), to cite a few.
It has been successfully applied in many unbiased performance comparisons and numerous real-world applications. In particular, in machine learning, it has been used for direct policy search in reinforcement learning and hyper-parameter tuning in supervised learning ((Gomez et al., 2008), (Igel et al., 2009; Heidrich-Meisner and Igel, 2009; Igel, 2010) and references therein), as well as hyperparameter optimization of deep neural networks (Loshchilov and Hutter, 2016).
In a nutshell, the ($\mu$ / $\lambda$) CMA-ES is an iterative black box optimization algorithm that, in each of its iterations, samples $\lambda$ candidate solutions from a multivariate normal distribution, evaluates these solutions (sequentially or in parallel), retains $\mu$ candidates and adjusts the sampling distribution used for the next iteration to give higher probability to good samples. Each iteration can be seen individually as taking an initial guess, or prior, for the multivariate parameters, namely the mean and the covariance, and, after making an experiment by evaluating these sample points with the fitness function, updating the initial parameters accordingly. Although rethinking the CMA-ES in terms of a prior and a posterior seems natural when coming from Bayesian statistics, it is only recently that this has been explored (Benhamou et al., 2019).
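The sample, evaluate, retain, adjust loop described above can be sketched as follows. This is a deliberately simplified evolution strategy, not the actual CMA-ES update rules (no evolution paths, step-size control or recombination weights). The function name `simple_es`, its parameters and the sphere test objective are assumptions made for illustration.

```python
import numpy as np


def simple_es(f, mean, cov, n_samples=20, n_keep=5, n_iter=60, seed=0):
    """Toy sample-select-update loop (NOT the full CMA-ES update rules)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    dim = mean.size
    for _ in range(n_iter):
        # 1. Sample candidates from the current multivariate normal
        X = rng.multivariate_normal(mean, cov, size=n_samples)
        # 2. Evaluate them and retain the n_keep best (lowest objective)
        fitness = np.array([f(x) for x in X])
        elite = X[np.argsort(fitness)[:n_keep]]
        # 3. Adjust the distribution: covariance of the selected steps
        #    around the OLD mean, then move the mean to the elite average
        steps = elite - mean
        cov = steps.T @ steps / n_keep + 1e-12 * np.eye(dim)
        mean = elite.mean(axis=0)
    return mean, cov


def sphere(x):
    return float(np.sum(x ** 2))


best_mean, best_cov = simple_es(sphere, mean=[3.0, -2.0], cov=np.eye(2))
print(best_mean)
```

Estimating the covariance around the previous mean, rather than around the elite average, lengthens the distribution along the direction of progress. This is one simple way to avoid shrinking too early.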
Historically, the CMA-ES has been developed heuristically. It was done mainly by conducting experimental research and validating intuitions empirically.
Research was done without much focus on theoretical foundations because of the apparent complexity of this algorithm. It was only recently that (Akimoto et al., 2010), (Glasmachers et al., 2010) and (Ollivier et al., 2017) made a breakthrough and provided a theoretical justification of the CMA-ES updates thanks to information geometry. They proved that CMA-ES performs a natural gradient descent in the Fisher information metric. The Bayesian formulation of the CMA-ES came much later and has so far only been done with the normal inverse Wishart prior.
In this paper, we revisit the Bayesian CMA-ES formulation and show that there is in fact an infinite family of conjugate priors, given by the convex combinations of a normal Wishart and a normal inverse Wishart Gaussian prior. We first prove that the normal Wishart and normal inverse Wishart Gaussian priors have the same update equations except for the mean of the covariance matrix. We provide a theoretical argument showing that the expected covariance under the normal Wishart prior should be lower than under the normal inverse Wishart Gaussian prior. We then introduce a new prior given by a mixture of the normal Wishart and normal inverse Wishart Gaussian priors and derive its update equations. Finally, in section 5, we give numerical results to compare all these methods.
3. Conjugate priors
A key concept in Bayesian statistics is that of conjugate priors, which make the computation easy and are described below.
Definition 3.0.
A prior distribution $\pi(\theta)$ is said to be a conjugate prior if the posterior distribution
$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{\int p(x \mid \theta)\, \pi(\theta)\, d\theta}$$
remains in the same distribution family as the prior.
At this stage, it is relevant to introduce exponential family distributions, as this higher level of abstraction, which encompasses the multivariate normal, trivially solves the issue of finding conjugate priors. This will be very helpful for inferring conjugate priors for the multivariate Gaussian used in CMA-ES.
Definition 3.0.
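A distribution is said to belong to the exponential family if it can be written (in its canonical form) as:
$$p(x \mid \eta) = h(x) \exp\big(\eta \cdot T(x) - A(\eta)\big)$$
where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function and $h(x)$ is the base measure. $\eta$ and $T(x)$ may be vector-valued. Here $a \cdot b$ denotes the inner product of $a$ and $b$.
The log-partition function is defined by the integral:
$$A(\eta) = \log \int h(x) \exp\big(\eta \cdot T(x)\big)\, dx$$
Also, $\eta \in \Omega = \{\eta : A(\eta) < +\infty\}$, where $\Omega$ is the natural parameter space. Moreover, $\Omega$ is a convex set and $A(\cdot)$ is a convex function on $\Omega$.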
Remark 3.1.
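Not surprisingly, the normal distribution with mean $\mu$ and covariance matrix $\Sigma$ belongs to the exponential family, but with a different parametrisation. Its exponential family form is given by:
$$\eta(\mu, \Sigma) = \begin{pmatrix} \Sigma^{-1}\mu \\ \mathrm{Vec}(\Sigma^{-1}) \end{pmatrix}, \qquad T(x) = \begin{pmatrix} x \\ \mathrm{Vec}\big(-\tfrac{1}{2}\, x x^{T}\big) \end{pmatrix} \tag{8a}$$
$$h(x) = 1, \qquad A(\eta(\mu, \Sigma)) = \tfrac{1}{2}\, \mu^{T} \Sigma^{-1} \mu + \tfrac{1}{2} \log |2\pi\Sigma| \tag{8b}$$
where in equation (8a), the notation $\mathrm{Vec}(\cdot)$ means that we have vectorized the matrix, stacking each column on top of each other, and hence can equivalently write, for two matrices $A$ and $B$, the trace $\mathrm{Tr}(A^{T} B)$ as the scalar product of their vectorizations $\mathrm{Vec}(A) \cdot \mathrm{Vec}(B)$ (see 7.2). We can remark that the canonical parameters are very different from the traditional (also called moment) parameters. We can notice that slightly changing the sufficient statistic $T(x)$ changes the corresponding canonical parameters $\eta$. In equation (8b), the notation $|\cdot|$ denotes the determinant of the matrix.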
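The vectorization remark can be checked numerically. The sketch below verifies that the trace of a product equals the scalar product of the column-stacked matrices, and that a Gaussian quadratic form can be written as a scalar product between the vectorized precision matrix and the vectorized outer product of the observation. The helper `vec` and the random test matrices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def vec(M):
    """Stack the columns of M on top of each other (column-major order)."""
    return M.reshape(-1, order="F")


A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# Tr(A^T B) equals the scalar product of the vectorizations
trace_form = np.trace(A.T @ B)
vec_form = vec(A) @ vec(B)

# Quadratic form of a Gaussian: x^T P x = Vec(P) . Vec(x x^T)
x = rng.normal(size=3)
P = A @ A.T + np.eye(3)  # a symmetric positive definite "precision"
quad_form = x @ P @ x
quad_vec_form = vec(P) @ vec(np.outer(x, x))

print(trace_form, vec_form)
print(quad_form, quad_vec_form)
```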
For an exponential family distribution, it is particularly easy to form a conjugate prior.
Proposition 3.3.
If the observations have a density of the exponential family form $p(x \mid \theta, \kappa) = h(x) \exp\big(\eta(\theta, \kappa)^{T} T(x) - n A(\eta(\theta, \kappa))\big)$, with $\kappa$ a set of hyper-parameters, then the prior with likelihood defined by $\pi(\theta) \propto \exp\big(\mu_1 \cdot \eta(\theta, \kappa) - \mu_0 A(\eta(\theta, \kappa))\big)$ with $\mu \triangleq (\mu_0, \mu_1)$ is a conjugate prior.
The proof is given in appendix subsection 7.1. As we can vary the parameterisation of the likelihood, we can obtain multiple conjugate priors. Because of conjugacy, if the initial parameters of the multivariate Gaussian follow the prior, the posterior is the true distribution given the information and stays in the same family, making the update of the parameters easy. Said differently, with a conjugate prior, we make the optimal update.
A consequence of proposition 3.3 is that the various conjugate priors of the multi variate normal that belong to the exponential family can be determined. This is the subject of the corollary below.
Corollary 3.4.
The conjugate priors of the multivariate normal that belong to the exponential family are necessarily of the form:
- normal inverse Wishart distribution if the multivariate normal is described in terms of its mean vector $\mu$ and covariance matrix $\Sigma$,
- normal Wishart distribution if the multivariate normal is described in terms of its mean vector $\mu$ and precision matrix $\Sigma^{-1}$.
The proof is given in appendix subsection 7.3. As they are conjugate priors, the posteriors of the two distributions identified in corollary 3.4 are easy to derive and are given by the following proposition.
Proposition 3.5.
For a likelihood of points distributed according to a multivariate normal distribution whose parameters are given by the priors below:
- (1) the normal inverse Wishart distribution:
- (2) the normal Wishart distribution:
- (3) the mixture of a normal inverse and normal Wishart with same parameters: with
The posterior is given by:
- (1) the normal inverse Wishart distribution
[TABLE]
- (2) the normal Wishart distribution
[TABLE]
- (3) the mixture of a normal inverse and normal Wishart with same parameters:
where is the sample mean, the sample covariance and .
The proof is given in appendix subsection 7.4.
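To make the conjugate update concrete, here is a minimal sketch of the textbook normal inverse Wishart posterior update for Gaussian observations. The parameter names (`mu0`, `kappa0`, `nu0`, `psi0`) follow the common textbook convention and are assumptions here; the paper's own notation and exact parametrisation may differ.

```python
import numpy as np


def niw_posterior(mu0, kappa0, nu0, psi0, X):
    """Textbook NIW posterior parameters after observing the rows of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n = X.shape[0]
    xbar = X.mean(axis=0)                 # sample mean
    centered = X - xbar
    scatter = centered.T @ centered       # n times the (biased) sample covariance
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0)[:, None]
    psi_n = psi0 + scatter + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, psi_n


# Hypothetical data: 12 points in dimension 2
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
mu_n, kappa_n, nu_n, psi_n = niw_posterior(np.zeros(2), 1.0, 4.0, np.eye(2), X)
print(mu_n, kappa_n, nu_n)
```

Because the update is an exact Bayesian posterior, feeding the data in two batches (using the first posterior as the prior for the second batch) gives the same result as a single update. This is the property the algorithm relies on when it reuses the previous posterior as the next prior.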
4. Algorithm
The idea behind the algorithm is, at each step, to use the previous iteration's posterior as a prior, draw the likelihood, and then update the posterior according to proposition 3.5. In full generality, the prior is a distribution, so we would need to do a Monte Carlo of Monte Carlo. To reduce the variance induced by this Monte Carlo of Monte Carlo, we make the simplification of using the mean value of the prior distribution. These values are given as follows:
- (1) for the normal inverse Wishart distribution, and
- (2) for the normal Wishart distribution, and for .
- (3) for the mixture of the normal inverse and normal Wishart with same parameters, and
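The difference between averaging a covariance and inverting an averaged precision can be illustrated by Monte Carlo. The sketch below draws Wishart-distributed precision matrices and compares the mean of their inverses with the inverse of their mean. The scale matrix `V`, the degrees of freedom `nu` and the number of draws are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nu, n_draws = 2, 8, 20000
V = np.array([[1.0, 0.3], [0.3, 0.5]])  # hypothetical Wishart scale matrix
L = np.linalg.cholesky(V)

# A Wishart(V, nu) draw is the sum of nu outer products of N(0, V) vectors
G = rng.normal(size=(n_draws, nu, d)) @ L.T      # each row is N(0, V)
precisions = np.einsum("sni,snj->sij", G, G)     # shape (n_draws, d, d)

mean_precision = precisions.mean(axis=0)
mean_of_inverses = np.linalg.inv(precisions).mean(axis=0)
inverse_of_mean = np.linalg.inv(mean_precision)

# Jensen-type gap: E[P^{-1}] - (E[P])^{-1}, expected positive definite
gap = mean_of_inverses - inverse_of_mean
gap = (gap + gap.T) / 2                          # symmetrise rounding noise
print(np.linalg.eigvalsh(gap))
```

For this setting the theoretical gap is $V^{-1}\big(\tfrac{1}{\nu - d - 1} - \tfrac{1}{\nu}\big)$, so the estimated eigenvalues should be positive up to Monte Carlo noise.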
The expected value of the covariance matrix under the normal inverse Wishart prior should be above the one under the normal Wishart prior, because the inverse of a matrix is a convex function on the domain of symmetric positive definite matrices: by Jensen's inequality, the expectation of the inverse dominates the inverse of the expectation. A proof of the convexity is given in 7.5. To recover the true minimum, we design two strategies.
- We design a strategy where we rebuild our normal distribution using sorted information of our $X_i$'s, weighted by their normal density, to ensure this is a true normal corrected from the Monte Carlo bias. We need to compute the weights explicitly. For each simulated point $X_i$, we compute its assumed density, denoted by $d_i = \mathcal{N}(\mu_t, \Sigma_t)(X_i)$, where $\mathcal{N}(\mu_t, \Sigma_t)(\cdot)$ denotes the p.d.f. of the multivariate Gaussian. We divide these densities by their sum to get weights that are positive and sum to one: $w_i^{d} = d_i / \sum_{j=1}^{k} d_j$. Hence, for $k$ simulated points, we get $k$ uplets $(X_i, w_i^{d})_{i=1..k}$. We reorder the uplets (points and densities) jointly in terms of their weights, in decreasing order. To emphasize that we take sorted values in decreasing order with respect to the weights, we denote the order statistics by $w_i^{d,\downarrow}$. This first sorting leads to $k$ new uplets. Using a stable sort (which keeps the order of the densities), we sort the uplets (points and weights) jointly according to their objective function value (in increasing order this time) and get $k$ new uplets $(X_i^{f,\uparrow}, w_i^{d,\downarrow})_{i=1..k}$. We can now compute a new mean as follows:
$$\hat{\mu}_{t+1} = \sum_{i=1}^{k} w_i^{d,\downarrow} X_i^{f,\uparrow} - \Big( \sum_{i=1}^{k} w_i^{d} X_i - \mu_t \Big) \tag{11}$$
The intuition of equation (11) is to compute, in the left term, the Monte Carlo mean using points reordered according to their objective value, and to correct this computation by the Monte Carlo bias, computed as the right term and equal to the initial Monte Carlo mean minus the real mean. We call this strategy one.
- If we think for a minute about strategy one, we get the intuition that it may not be optimal when starting the minimization. This is because the weights are proportional to $\exp\big(-\tfrac{1}{2}(X - \mu)^{T} \Sigma^{-1} (X - \mu)\big)$. When we start the algorithm, we use a large search space, hence a large covariance matrix, which leads to weights that are quite similar. Hence, even if we sort candidates by their fit, ranking them according to the value of $f$ in increasing order, we will move our theoretical multivariate Gaussian only little by little. A better solution is to move the center of our multivariate Gaussian abruptly to the best candidate seen so far, as follows:
$$\hat{\mu}_{t+1} = X_1^{f,\uparrow} = \arg\min_{X_i,\; i=1..k} f(X_i) \tag{12}$$
We call this strategy two. Intuitively, strategy two should be best when starting the algorithm, while strategy one should be better once we are close to the solution.
To recover the true variance, we can adapt what we did in strategy one as follows:
$$\hat{\Sigma}_{t+1} = \Sigma_t + \sum_{i=1}^{k} w_i^{d,\downarrow} \big(X_i^{f,\uparrow} - \hat{\mu}_{X^{f,\uparrow}}\big)\big(X_i^{f,\uparrow} - \hat{\mu}_{X^{f,\uparrow}}\big)^{T} - \sum_{i=1}^{k} w_i^{d} \big(X_i - \hat{\mu}_{X}\big)\big(X_i - \hat{\mu}_{X}\big)^{T} \tag{13}$$
where $\hat{\mu}_{X^{f,\uparrow}}$ and $\hat{\mu}_{X}$ are respectively the means of the sorted and non-sorted points.
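The two mean-update strategies can be sketched in code as follows. This is one possible reading of the verbal description, not the authors' implementation: weights are normalised Gaussian densities, the largest weight is paired with the point of lowest objective value, and the Monte Carlo bias is subtracted. The names `gaussian_density`, `mean_strategy_one` and `mean_strategy_two`, as well as the sphere objective, are assumptions for illustration.

```python
import numpy as np


def gaussian_density(X, mean, cov):
    """Multivariate normal p.d.f. evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    solved = np.linalg.solve(cov, diff.T).T
    quad = np.einsum("ij,ij->i", diff, solved)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm


def mean_strategy_one(X, fvals, mean, cov):
    """Reweighted mean corrected by the Monte Carlo bias (assumed reading)."""
    dens = gaussian_density(X, mean, cov)
    w = dens / dens.sum()                           # positive, sums to one
    w_desc = np.sort(w)[::-1]                       # weights, decreasing
    X_by_f = X[np.argsort(fvals, kind="stable")]    # points, increasing f
    mc_sorted_mean = w_desc @ X_by_f
    mc_bias = w @ X - mean                          # MC mean minus real mean
    return mc_sorted_mean - mc_bias


def mean_strategy_two(X, fvals):
    """Move the mean directly to the best candidate of the sample."""
    return X[np.argmin(fvals)]


rng = np.random.default_rng(0)
mean, cov = np.array([2.0, 2.0]), np.eye(2)
X = rng.multivariate_normal(mean, cov, size=30)
fvals = np.sum(X ** 2, axis=1)                      # sphere objective
m1 = mean_strategy_one(X, fvals, mean, cov)
m2 = mean_strategy_two(X, fvals)
print(m1, m2)
```

A useful sanity check of this reading: if the objective ranks points exactly as the density does (best point is the most likely one), the reordering changes nothing and strategy one leaves the mean where it was.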
5. Numerical results
5.1. Functions examined
We have examined five functions to stress test our algorithm. They are listed in increasing order of complexity for our algorithm and correspond to different types of functions. They are all generalized functions that can be defined for any dimension. For each, we present the corresponding equation for a variable of arbitrary dimension. Code is provided in the supplementary materials. We have frozen the seeds to make the results reproducible.
5.1.1. Cone
The simplest function to optimize is the quadratic cone, whose equation is given by (14) and which is represented in figure 1. It is also the standard Euclidean norm. It is obviously convex and is a good test of the performance of an optimization method.
$$f(x) = \lVert x \rVert_2 = \sqrt{\sum_{i} x_i^2} \tag{14}$$
5.1.2. Schwefel 2 function
A slightly more complicated function is the Schwefel 2 function, whose equation is given by (15) and which is represented in figure 2. It is a piecewise linear function and validates that the algorithm can cope with non-convex functions.
[TABLE]
5.1.3. Rastrigin
The Rastrigin function, first proposed by (Rastrigin, 1974) and generalized by (Mühlenbein et al., 1991), is more difficult than the Cone and the Schwefel 2 functions. Its equation is given by (16) and it is represented in figure 3. It is a non-convex function often used as a performance test problem for optimization algorithms. It is a typical example of a non-linear multi-modal function. Finding its minimum is considered a good stress test for an optimization algorithm, due to its large search space and its large number of local minima.
$$f(x) = 10\, p + \sum_{i=1}^{p} \big[ x_i^2 - 10 \cos(2 \pi x_i) \big] \tag{16}$$
where $p$ is the dimension of $x$.
5.1.4. Schwefel 1 function
The Schwefel 1 function, whose equation is given by (17), is a tricky function to optimize. It is represented in figure 4. It is sometimes only defined on a bounded domain. The Schwefel 1 function shares similarities with the Rastrigin function. It is continuous, not convex, multi-modal and has a large number of local minima. The extra difficulty compared to the Rastrigin function is that the local minima are more pronounced local bowls, making the optimization even harder.
[TABLE]
5.1.5. Eggholder function
The Eggholder function, whose equation is given by (18), is a difficult function to optimize because of its large number of local minima. It is sometimes only defined on a bounded domain. It shares similarities with the Schwefel 1 function: it is continuous, not convex and multi-modal.
[TABLE]
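For readers who want to experiment, three of the test functions described above can be written in a few lines using their commonly used textbook definitions. These are assumptions: the paper's own generalized formulas are not reproduced here, and in particular the function named `schwefel` below is the widely used Schwefel benchmark, which is only assumed to correspond to the paper's Schwefel 1.

```python
import numpy as np


def cone(x):
    """Euclidean norm of x; minimum 0 at the origin."""
    return float(np.linalg.norm(np.asarray(x, dtype=float)))


def rastrigin(x):
    """Standard Rastrigin function; minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return float(10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))


def schwefel(x):
    """Common Schwefel benchmark, usually restricted to [-500, 500]^n.

    Its minimum is close to 0, near x_i = 420.9687 in every coordinate.
    """
    x = np.asarray(x, dtype=float)
    return float(418.9829 * x.size - np.sum(x * np.sin(np.sqrt(np.abs(x)))))


print(cone([3.0, 4.0]))
print(rastrigin(np.zeros(3)))
print(schwefel(np.full(2, 420.9687)))
```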
5.2. Convergence
For each of the functions, we compared our method using strategy one, entitled B-CMA-ES S1 (update of $\mu$ and $\Sigma$ using (11) and (13)), in orange, with strategy two, B-CMA-ES S2 (same update but using (12) and (13)), in blue, and with standard CMA-ES as provided by the open source Python package pycma, in green. We clearly see that strategy two outperforms standard CMA-ES and Bayesian CMA-ES S1. The convergence plots, which show the error compared to the minimum, are given:
- for the cone function by figure 6 (case of a convex function), with initial point
- for the Schwefel 2 function in figure 7 (case of piecewise linear function), with initial point
- for the Rastrigin function in figure 8 (case of a non convex function with multiple local minima), with initial point
- and for the Schwefel 1 function in figure 9 (case of a non convex function with multiple large bowl local minima), with initial point
For functions that are convex, our method performs similarly to standard CMA-ES. For functions with harder local minima, the Bayesian CMA-ES is able to perform better. We conjecture that this is due to the contraction-dilatation mechanism, which helps avoid being trapped in a local minimum.
6. Conclusion
In this paper, we have revisited the CMA-ES algorithm and provided a Bayesian version of it. Taking conjugate priors, we can find the optimal update for the mean and covariance of the multivariate normal. We have provided the corresponding algorithm, which is a new version of CMA-ES. First numerical experiments show that this new version is comparable to standard CMA-ES on traditional functions such as cone, Schwefel 1, Rastrigin and Schwefel 2. The similar convergence can be explained on the theoretical side by the optimal update of the prior (thanks to the Bayesian update) and the use of the best candidate seen at each simulation to shift the mean of the multivariate Gaussian likelihood. We envisage further work to benchmark our algorithm against traditional CMA-ES and other evolutionary algorithms, in particular using the COCO platform to provide more meaningful tests and confirm the theoretical intuition of good performance of this new version of CMA-ES, and to test the importance of the prior choice.
7. Appendix
7.1. Conjugate priors
Proof.
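Consider $n$ independent and identically distributed (IID) measurements $\mathcal{X} \triangleq \{x_j \mid 1 \le j \le n\}$ and assume that these variables have an exponential family density. The likelihood $p(\mathcal{X} \mid \theta, \kappa)$ writes simply as the product of each individual likelihood:
$$p(\mathcal{X} \mid \theta, \kappa) = \Big( \prod_{j=1}^{n} h(x_j) \Big) \exp\Big( \eta(\theta, \kappa)^{T} \sum_{j=1}^{n} T(x_j) - n A(\eta(\theta, \kappa)) \Big) \tag{19}$$
If we start with a prior $\pi(\theta)$ of the form $\pi(\theta) \propto \exp(F(\theta))$ for some function $F(\cdot)$, its posterior writes:
$$\pi(\theta \mid \mathcal{X}) \propto p(\mathcal{X} \mid \theta) \exp(F(\theta)) \propto \exp\Big( \eta(\theta, \kappa) \cdot \sum_{j=1}^{n} T(x_j) - n A(\eta(\theta, \kappa)) + F(\theta) \Big) \tag{20}$$
It is easy to check that the posterior (20) is in the same exponential family as the prior iff $F(\cdot)$ is in the form
$$F(\theta) = \mu_1 \cdot \eta(\theta, \kappa) - \mu_0 A(\eta(\theta, \kappa))$$
for some $\mu \triangleq (\mu_0, \mu_1)$, such that
$$\pi(\theta \mid \mathcal{X}) \propto \exp\Big( \big( \mu_1 + \sum_{j=1}^{n} T(x_j) \big)^{T} \eta(\theta, \kappa) - (n + \mu_0) A(\eta(\theta, \kappa)) \Big)$$
Hence, the conjugate prior for the likelihood (19) is parametrized by $\mu$ and given by
$$p(\theta) = \frac{1}{Z} \exp\big( \mu_1 \cdot \eta(\theta, \kappa) - \mu_0 A(\eta(\theta, \kappa)) \big)$$
where $Z = \int \exp\big( \mu_1 \cdot \eta(\theta, \kappa) - \mu_0 A(\eta(\theta, \kappa)) \big)\, d\theta$. ∎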
7.2. Multivariate Canonical form
In the case of the multi variate normal, the canonical form for this distribution writes as
[TABLE]
which gives the following moment and canonical parameters:
[TABLE]
7.3. Conjugate priors determination
Using proposition 3.3 and the exponential family formulation of the multi variate normal (equations (25)), we have that any conjugate prior for the multi variate normal that belongs to the exponential family is given by
[TABLE]
If we write and , we get
[TABLE]
The first term is a normal multi variate distribution. Its parameters are and .
In the second term, we can recognize the proportional term of an inverse Wishart , with parameters .
This shows the conjugate prior of the multi variate normal given by its mean vector and covariance matrix is a normal inverse Wishart. Its parameters are ∎
If the multi variate normal is parametrized by its mean vector and its precision matrix , the same reasoning gives
[TABLE]
The second term is a multi variate normal distribution given by while the first one is the term of a Wishart distribution that is proportional to whose parameters are . This shows that the conjugate prior of the multi variate normal described by its mean vector and precision matrix is a normal Wishart distribution ∎
7.4. Posterior update
The posterior update is quite straightforward and very similar for the two cases: NIW and NW. We will detail only the calculation for the NIW case as it is very similar for the NW. Recall that the probability density function of a Normal inverse Wishart random variable is expressed as the product of a Normal and an Inverse Wishart probability density functions. Denoting by the dimension of the covariance matrix and using the Bayes rules, the posterior is proportional to the product of the prior and likelihood:
[TABLE]
First of all, we can regroup all terms in as follows
[TABLE]
and use the following remarkable identity:
[TABLE]
where we have used the commutativity property of the trace operator and that for a real number, the number is equal to its trace and written the sample covariance. Going further, we have
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Hence, we can compute explicitly the posterior as follows:
[TABLE]
[TABLE]
which are exactly the equations provided in (9) ∎
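The key algebraic step in this kind of posterior computation is the decomposition of the sum of Gaussian quadratic forms into a trace term involving the sample scatter matrix and a term involving the sample mean. The sketch below checks the standard form of this identity numerically; whether it matches the paper's exact intermediate equation is an assumption, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))            # hypothetical observations
mu = rng.normal(size=d)                # arbitrary mean parameter
M = rng.normal(size=(d, d))
Sigma = M @ M.T + d * np.eye(d)        # symmetric positive definite covariance
P = np.linalg.inv(Sigma)

xbar = X.mean(axis=0)
scatter = (X - xbar).T @ (X - xbar)    # n times the (biased) sample covariance

# Sum of quadratic forms over all observations
lhs = sum((x - mu) @ P @ (x - mu) for x in X)
# Trace term (uses cyclicity of the trace) plus sample-mean term
rhs = np.trace(P @ scatter) + n * (xbar - mu) @ P @ (xbar - mu)
print(lhs, rhs)
```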
7.5. Convexity of the inverse of a matrix
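We give here six different proofs of the convexity of the inverse of a matrix on the domain of symmetric positive definite matrices $S_{++}^{n}$. The first and second proofs rely on the fact that the result is a consequence of proving that the matrix fractional function $f(x, Y) = x^{T} Y^{-1} x$ is convex on the domain $\mathbb{R}^{n} \times S_{++}^{n}$. The implication comes from the fact that, for $A, B \in S_{++}^{n}$ and $\theta \in [0, 1]$, convexity of $f$ in $Y$ for fixed $x$ gives
$$x^{T} \Big[ \theta A^{-1} + (1 - \theta) B^{-1} - \big( \theta A + (1 - \theta) B \big)^{-1} \Big] x \ge 0$$
Since $x$ is arbitrary, this implies the matrix within the square bracket in equation (7.5) is positive semi-definite. It is interesting to notice that the matrix fractional function is in a sense an extension of the fact that the quadratic-over-linear function defined as $f(x, y) = x^{2} / y$ is convex on $\mathbb{R} \times \mathbb{R}_{++}$.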
Proof.
The first proof uses the property that the minimum of a convex function over a convex set is convex. For , and for we can consider the quadratic function defined by
[TABLE]
As , this function is obviously convex (a quadratic function whose quadratic coefficient is given by a positive definite matrix). Hence its minimum over a convex set is convex. It is easy to minimize a quadratic function and find its minimum, given by the stationary point of its gradient , which concludes the proof. ∎
Proof.
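A second proof is to show that the epigraph of the matrix fractional function $f(x, Y) = x^{T} Y^{-1} x$, denoted by $\mathrm{epi} f$, is convex thanks to the link between positive semi-definite cones and Schur complements. We have that
$$\mathrm{epi} f = \big\{ (x, Y, t) \mid Y \succ 0,\; x^{T} Y^{-1} x \le t \big\} = \Big\{ (x, Y, t) \mid \begin{pmatrix} Y & x \\ x^{T} & t \end{pmatrix} \succeq 0,\; Y \succ 0 \Big\}$$
This concludes the proof, as the epigraph of $f$ is convex, being the inverse image of the positive semi-definite cone under an affine mapping, where the equivalence of the two descriptions follows from the Schur complement. ∎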
Proof.
A third proof relies on the fundamental identity of the inverse of a matrix : , where is the identity matrix with rows (or columns). Take two positive definite symmetric matrices and . Take . and are obviously symmetric positive definite. Denote by the derivative with respect to . We have:
[TABLE]
Notice that , since is linear in . Differentiate one more time to get:
[TABLE]
For any non-zero random vector , define and . Equations (42) says that
[TABLE]
since is positive definite. As the second order derivative is positive, we conclude that is a convex function for over . As a result, for any , we have:
[TABLE]
Since is arbitrary, this implies the matrix within the square bracket in (7.5) is positive semi-definite and hence:
[TABLE]
Please note that when is invertible, is non-zero for non-zero . The inequalities in (43) and (7.5) become strict and the matrix within the square bracket in (7.5) is positive definite instead of positive semi-definite. ∎
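The convexity statement can be checked numerically: for symmetric positive definite matrices, the convex combination of the inverses minus the inverse of the convex combination should be positive semi-definite. The sketch below tests this for one pair of random matrices and several mixing weights. The helper `random_spd` and the choice of dimension are assumptions for illustration, and a numerical check of this kind is of course not a proof.

```python
import numpy as np


def random_spd(rng, d):
    """Random symmetric positive definite matrix (hypothetical helper)."""
    M = rng.normal(size=(d, d))
    return M @ M.T + np.eye(d)


rng = np.random.default_rng(2)
A, B = random_spd(rng, 4), random_spd(rng, 4)

min_eigs = []
for theta in np.linspace(0.0, 1.0, 11):
    mix_of_inverses = theta * np.linalg.inv(A) + (1 - theta) * np.linalg.inv(B)
    inverse_of_mix = np.linalg.inv(theta * A + (1 - theta) * B)
    gap = mix_of_inverses - inverse_of_mix
    gap = (gap + gap.T) / 2                 # remove rounding asymmetry
    min_eigs.append(np.linalg.eigvalsh(gap).min())

# Smallest eigenvalue over all weights: non-negative up to rounding error
print(min(min_eigs))
```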
Proof.
A fourth proof is to derive the convexity of the inverse of a matrix from the convexity of the function for . Let . We want to prove that
[TABLE]
where in inequality (45), we have left- and right- multiplied both sides by . As is positive definite, it can be unitary diagonalised and hence without loss of generality, we can assume that it is a diagonal matrix. So, the inequality reduces down to the scalar case , which is true using the fact that the function is convex for ∎
The last two proofs relies on the fact that the result is also implied by the fact that the function is convex for for any . This comes from the nice property that the Trace operator can commute and that the trace of a real number is itself.
Proof.
The fifth proof uses the fact that a positive second order derivative along any line is enough to prove convexity. Consider where and are symmetric positive definite. It is enough to show that We have
[TABLE]
So
[TABLE]
But where and is positive definite, so is positive semi definite, which implies , which concludes the proof ∎
Proof.
A final, sixth proof relates this to eigenvalues. We can notice that the function is indeed the sum of the inverses of the eigenvalues, denoted by .
[TABLE]
We know that the function that associates to a diagonal matrix with strictly positive terms its kth element (which turns out to be one of its eigen values but not necessarily its kth one) is linear, hence convex and concave. By the composition rules for convex function, with , we can conclude that the inverse of the kth elements is convex for diagonal matrices with strictly positive term. Thus, the sum of the inverse of eigen values (defined as a sum of convex functions) is convex on the set of diagonal matrix with strictly positive term. We can conclude using the diagonalisation result of definite positive matrix (with , an orthonormal matrix, a diagonal matrix with strictly positive term and ) to extend the convexity property to the set of and use also that ∎
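The eigenvalue argument relies on the fact that the trace of the inverse of a symmetric positive definite matrix equals the sum of the inverses of its eigenvalues. A quick numerical check, with an arbitrary random matrix as an assumption, is:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)                  # symmetric positive definite

eigenvalues = np.linalg.eigvalsh(A)      # all strictly positive
trace_of_inverse = np.trace(np.linalg.inv(A))
sum_of_inverse_eigs = np.sum(1.0 / eigenvalues)
print(trace_of_inverse, sum_of_inverse_eigs)
```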
- Akimoto, Y., Auger, A., and Hansen, N. (2015). Continuous optimization and CMA-ES. GECCO 2015, Madrid, Spain, 1:313–344.
- Akimoto, Y., Auger, A., and Hansen, N. (2016). CMA-ES and advanced adaptation mechanisms. GECCO, Denver, 2016:533–562.
- Akimoto, Y., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Bidirectional relation between CMA evolution strategies and natural evolution strategies. PPSN, XI(1):154–163.
- Auger, A. and Hansen, N. (2009). Benchmarking the (1+1)-CMA-ES on the BBOB-2009 noisy testbed. Companion Material, GECCO 2009:2467–2472.
- Auger, A. and Hansen, N. (2012). Tutorial CMA-ES: evolution strategies and covariance matrix adaptation. Companion Material Proceedings, 2012(12):827–848.
- Auger, A., Schoenauer, M., and Vanhaecke, N. (2004). LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. PPSN VIII, 8th International Conference, Birmingham, UK, September 18-22, 2004, Proceedings, 2004(2004):182–191.
- Benhamou, E., Saltiel, D., Verel, S., and Teytaud, F. (2019). BCMA-ES: A Bayesian approach to CMA-ES. arXiv e-prints, arXiv:1904.01401.
- Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(2):1–13.
