BCMA-ES II: revisiting Bayesian CMA-ES

Eric Benhamou; David Saltiel; Beatrice Guez; Nicolas Paris

arXiv:1904.01466·cs.LG·April 10, 2019

TL;DR

This paper revisits Bayesian CMA-ES, clarifies the differences between normal and inverse Wishart priors, and introduces a mixture model to unify both approaches, supported by numerical experiments.

Contribution

It provides theoretical insights into the covariance expectations of normal and inverse Wishart priors and proposes a generalized mixture model for Bayesian CMA-ES.

Findings

(1) The expected covariance is lower under the normal Wishart prior than under the normal inverse Wishart prior, because the matrix inverse is convex.

(2) The mixture model unifies the normal Wishart and normal inverse Wishart priors.

(3) Numerical experiments compare the performance of both models and the generalized approach.

Abstract

This paper revisits the Bayesian CMA-ES and provides updates for the normal Wishart prior. It emphasizes the difference between a normal Wishart and a normal inverse Wishart prior. After some computation, we prove that, surprisingly, the only difference lies in the expected covariance. We prove that the expected covariance should be lower in the normal Wishart prior model because of the convexity of the inverse. We present a mixture model that generalizes both the normal Wishart and the normal inverse Wishart models. Finally, we present various numerical experiments to compare both methods as well as the generalized method.

Equations

$$\mathbb{P}(H,D)=\mathbb{P}(H|D)\,\mathbb{P}(D)=\mathbb{P}(D|H)\,\mathbb{P}(H)$$

$$\mathbb{P}(H|D)=\frac{\mathbb{P}(D|H)\,\mathbb{P}(H)}{\mathbb{P}(D)}$$

$$\mathbb{P}(H|D)\propto\mathbb{P}(D|H)\,\mathbb{P}(H)$$

$$p(x_{1},x_{2},...,x_{n})=p(x_{\pi(1)},x_{\pi(2)},...,x_{\pi(n)})$$

$$p(x_{1},x_{2},...,x_{n})=\int\prod_{i=1}^{n}p(x_{i}|\theta)\,P(d\theta)$$

$$\min_{x\in\mathbb{R}^{p}}f(x)$$

$$\pi(\theta|x)\propto p(x|\theta)\,\pi(\theta)$$

$$p(x|\eta)=h(x)\exp\big(\eta\cdot T(x)-A(\eta)\big)$$

$$A(\eta)\triangleq\log\int_{\mathcal{X}}h(x)\exp\big(\eta\cdot T(x)\big)\,\mathrm{d}x$$

$$\eta(\mu,\Sigma)$$

$$h(x)$$

$$NIW_{1}=NIW\Big(\frac{\lambda_{0}\mu_{0}+n\overline{x}}{\lambda_{0}+n},\;\lambda_{0}+n,\;\nu_{0}+n,\;\Psi_{0}+nC+nD\Big)$$

$$NW_{1}=NW\Big(\frac{\lambda_{0}\mu_{0}+n\overline{x}}{\lambda_{0}+n},\;\lambda_{0}+n,\;\nu_{0}+n,\;\big(W_{0}^{-1}+nC+nD\big)^{-1}\Big)$$

$$\hat{\mu}=\underbrace{\sum_{i=1}^{k}w_{(i),w\downarrow}\cdot X_{(i),f\uparrow}}_{\text{MC mean for }X_{f\uparrow}}-\underbrace{\Big(\sum_{i=1}^{k}w_{i}X_{i}-\hat{\mu}\Big)}_{\text{MC bias for }X}$$

$$\hat{\mu}=\underset{X\in\mathcal{X}}{\arg\min}\;f(X)$$

$$\hat{\Sigma}=$$

$$f(x)=\Big(\sum_{i=1}^{n}x_{i}^{2}\Big)^{1/2}=\|x\|_{2}$$

$$f(x)=\sum_{i=1}^{n}|x_{i}|+\prod_{i=1}^{n}|x_{i}|$$

$$f(x)=10\times n+\sum_{i=1}^{n}\big[x_{i}^{2}-10\cos(2\pi x_{i})\big]$$

$$f(x)=418.9829\times n-\sum_{i=1}^{n}\Big[x_{i}\sin\big(\sqrt{|x_{i}|}\big)\mathbf{1}_{|x_{i}|<500}+500\sin\big(\sqrt{500}\big)\mathbf{1}_{|x_{i}|\geq 500}\Big]$$

$$f(x,y)=-(y+47)\sin\sqrt{\Big|\frac{x}{2}+(y+47)\Big|}-x\sin\sqrt{|x-(y+47)|}$$

$$p(\mathcal{X}|\theta,\kappa)=\Big(\prod_{j=1}^{n}h(\mathbf{x}^{j})\Big)\exp\Big(\eta(\theta,\kappa)^{T}\sum_{j=1}^{n}T(x^{j})-nA(\eta(\theta,\kappa))\Big)$$

$$\pi(\theta|\mathcal{X})\propto\exp\Big(\eta(\theta,\kappa)\cdot\sum_{j=1}^{n}T(x^{j})-nA(\eta(\theta,\kappa))+\mathcal{F}(\theta)\Big)$$

$$\mathcal{F}(\theta)=\lambda_{1}\cdot\eta(\theta,\kappa)-\lambda_{0}A(\eta(\theta,\kappa))$$

$$\pi(\theta|\mathcal{X})\propto\exp\Big(\Big(\lambda_{1}+\sum_{j=1}^{n}T(x^{j})\Big)^{T}\eta(\theta,\kappa)-(n+\lambda_{0})A(\eta(\theta,\kappa))\Big)$$

$$\pi(\theta)=\frac{1}{Z}\exp\big(\lambda_{1}\cdot\eta(\theta,\kappa)-\lambda_{0}A(\eta(\theta,\kappa))\big)$$

$$\frac{1}{\sqrt{(2\pi)^{d}\det(\Sigma)}}\exp\Big(-\frac{(X-\mu)^{T}\Sigma^{-1}(X-\mu)}{2}\Big)$$

$\theta$

$T(X)$

$\eta(\theta)$

$A(\eta(\theta))$

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

BCMA-ES II: revisiting Bayesian CMA-ES

Eric Benhamou (A.I Square Connect and Lamsade, France), David Saltiel (A.I Square Connect and LISIC, France), Beatrice Guez (A.I Square Connect, France) and Nicolas Paris (A.I Square Connect, France)

(2019)

Keywords: CMA-ES, Bayesian, conjugate prior, normal Wishart, normal inverse Wishart, mixture models

A.I Square Working Paper, March 2019, France. CCS concepts: Mathematics of computing, Probability and statistics.

1. Introduction

Bayesian statistics has revolutionized statistics much as quantum mechanics revolutionized Newtonian mechanics. Like Newtonian mechanics, the usual frequentist statistics can be seen as a particular asymptotic case of the newer theory. Indeed, the Cox-Jaynes theorem (Cox, 1946) proves that under the following four axiomatic assumptions:

• plausibility degrees are represented by real numbers (continuity of method),

• none of the possible data should be ignored (no retention),

• these values follow the usual rules of common sense, in the spirit of Laplace's well-known remark that probability theory is nothing but common sense reduced to calculus (common sense),

• and states of equivalent knowledge should have equivalent degrees of plausibility (consistency),

there exists a probability measure, defined up to a monotonic function, that follows the usual probability calculus and the fundamental rule of Bayes, that is:

$$\mathbb{P}(H,D)=\mathbb{P}(H|D)\,\mathbb{P}(D)=\mathbb{P}(D|H)\,\mathbb{P}(H)\tag{1}$$

where $H$ and $D$ are two members of the implied $\sigma$-algebra. The letters are not chosen by chance: $H$ stands for the hypothesis, which can be interpreted as a hypothesis on the parameters, while $D$ stands for data.

The usual frequentist approach gives the probability of an observation $D$ under a certain hypothesis $H$ on the state of the world, that is, $\mathbb{P}(D|H)$. However, as equation (1) is completely symmetric, nothing prevents us from changing our point of view and asking the inverse question: given an observation of data $D$, what is the plausibility of the hypothesis $H$? Bayes' rule answers this question directly:

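$$\mathbb{P}(H|D)=\frac{\mathbb{P}(D|H)\,\mathbb{P}(H)}{\mathbb{P}(D)}$$

or equivalently,

$$\mathbb{P}(H|D)\propto\mathbb{P}(D|H)\,\mathbb{P}(H)$$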

In the above equation, $\mathbb{P}(H)$ is called the prior probability, or simply the prior, while the conditional probability $\mathbb{P}(H|D)$ is called the posterior probability, or simply the posterior. A few remarks are in order. First, the prior is not necessarily independent of knowledge of the experiment; on the contrary, a prior is often determined using knowledge of previous experiments in order to make a meaningful choice. Second, prior and posterior do not necessarily refer to a chronological order but rather to a logical order.
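As a minimal numerical illustration of the rule that the posterior is proportional to the likelihood times the prior, the following Python sketch updates a discrete prior over two hypotheses after one observation. The hypotheses, the numbers and the function name `posterior` are hypothetical and are not taken from the paper.

```python
def posterior(prior, likelihood):
    """Return P(H | D) for each hypothesis H, given P(H) and P(D | H)."""
    # Unnormalized posterior: P(D | H) * P(H) for every hypothesis.
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    # P(D) is the normalizing constant obtained by summing over hypotheses.
    evidence = sum(unnormalized.values())
    return {h: u / evidence for h, u in unnormalized.items()}


# Hypothetical example: is a coin fair or biased, after observing one head?
prior = {"fair": 0.5, "biased": 0.5}
likelihood = {"fair": 0.5, "biased": 0.9}  # P(head | H), assumed values
post = posterior(prior, likelihood)
print(post)
```

Observing a head shifts plausibility towards the hypothesis under which a head was more likely, while the posterior still sums to one.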

After observing some data $D$, we revise the plausibility of $H$. It is interesting to see that the conditional probability $\mathbb{P}(D|H)$, considered as a function of $H$, is indeed a likelihood for $H$. The Cox-Jaynes theorem as presented in (Jaynes, 2003) gives the foundation for Bayesian calculus. Another important result is De Finetti's theorem. Let us recall the definition of infinite exchangeability.

Definition 1.1.

(Infinite exchangeability). We say that $(x_{1},x_{2},...)$ is an infinitely exchangeable sequence of random variables if, for any $n$, the joint probability $p(x_{1},x_{2},...,x_{n})$ is invariant to permutation of the indices. That is, for any permutation $\pi$,

$$p(x_{1},x_{2},...,x_{n})=p(x_{\pi(1)},x_{\pi(2)},...,x_{\pi(n)})$$

Equipped with this definition, De Finetti's theorem, stated below, says that exchangeable observations are conditionally independent given some latent variable.

Theorem 1.2.

(De Finetti, 1930s). A sequence of random variables $(x_{1},x_{2},...)$ is infinitely exchangeable iff, for all $n$,

$$p(x_{1},x_{2},...,x_{n})=\int\prod_{i=1}^{n}p(x_{i}|\theta)\,P(d\theta),$$

for some measure $P$ on $\theta$.

The representation theorem 1.2 justifies the use of priors on parameters since, for exchangeable data, there must exist a parameter $\theta$, a likelihood $p(x|\theta)$ and a distribution $\pi$ on $\theta$. A proof of De Finetti's theorem is given, for instance, in (Schervish, 1996) (section 1.5). We will see that this Bayesian setting gives a powerful framework for revisiting black box optimization, which is introduced below.

2. Black box optimization

We assume that we have a real-valued function of $p$ variables, $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$. We examine the following optimization program:

$$\min_{x\in\mathbb{R}^{p}}f(x)$$

In contrast to traditional convex optimization theory, we do not assume that $f$ is convex or continuous, nor that it admits a global minimum. We are interested in the so-called black box optimization (BBO) setting, where we only have access to evaluations of the function $f$ and nothing else; in particular, we cannot compute gradients. A practical way to optimize in this very general and minimal setting is evolutionary optimization, and in particular the covariance matrix adaptation evolution strategy (CMA-ES). The CMA-ES (Hansen and Ostermeier, 2001) is arguably one of the most powerful real-valued derivative-free optimization algorithms, with many applications in machine learning. It is a state-of-the-art optimizer for continuous black-box functions, as shown by the various benchmarks of the COCO (COmparing Continuous Optimisers) INRIA platform for ill-posed functions. It has led to a large number of papers and articles, and we refer the interested reader to (Hansen and Ostermeier, 2001; Auger et al., 2004; Igel et al., 2007; Auger and Hansen, 2009; Hansen and Auger, 2011; Auger and Hansen, 2012; Hansen and Auger, 2014; Akimoto et al., 2015, 2016; Ollivier et al., 2017) and (Varelas et al., 2018), to cite a few.

It has been successfully applied in many unbiased performance comparisons and numerous real-world applications. In particular, in machine learning, it has been used for direct policy search in reinforcement learning and hyper-parameter tuning in supervised learning (Gomez et al., 2008; Igel et al., 2009; Heidrich-Meisner and Igel, 2009; Igel, 2010, and references therein), as well as for hyperparameter optimization of deep neural networks (Loshchilov and Hutter, 2016).

In a nutshell, the ($\mu$/$\lambda$) CMA-ES is an iterative black box optimization algorithm that, in each of its iterations, samples $\lambda$ candidate solutions from a multivariate normal distribution, evaluates these solutions (sequentially or in parallel), retains the $\mu$ best candidates and adjusts the sampling distribution used for the next iteration to give higher probability to good samples. Each iteration can be seen individually as taking an initial guess, or prior, for the parameters of the multivariate normal, namely the mean and the covariance, running an experiment by evaluating the sample points with the fitness function, and then updating the initial parameters accordingly. Although rethinking the CMA-ES in terms of a prior and a posterior seems natural from the point of view of Bayesian statistics, this has been explored only recently (Benhamou et al., 2019).
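The sample, evaluate, select and update loop described above can be sketched in a few lines of Python. This is a deliberately simplified evolution strategy under stated assumptions, not the actual CMA-ES (which also adapts a global step size and uses evolution paths and weighted recombination). The function name `simple_es`, its parameters and the plain elite-averaging update are illustrative choices that do not appear in the paper.

```python
import numpy as np


def simple_es(f, mean, cov, n_samples=20, n_keep=5, n_iter=50, seed=0):
    """Simplified sample/evaluate/select/update loop (not full CMA-ES)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    dim = mean.shape[0]
    for _ in range(n_iter):
        # Sample lambda = n_samples candidates from the current Gaussian.
        X = rng.multivariate_normal(mean, cov, size=n_samples)
        values = np.array([f(x) for x in X])
        # Retain the mu = n_keep best candidates (lowest objective value).
        elite = X[np.argsort(values)[:n_keep]]
        # Covariance of the elite around the OLD mean, so that the variance
        # grows along the direction in which the mean is moving.
        steps = elite - mean
        cov = steps.T @ steps / n_keep + 1e-12 * np.eye(dim)
        mean = elite.mean(axis=0)
    return mean, cov


# Hypothetical usage on a sphere function, starting away from the optimum.
best_mean, best_cov = simple_es(lambda x: float(np.sum(x**2)),
                                mean=[3.0, -2.0], cov=np.eye(2))
print(best_mean)
```

The covariance is estimated around the previous mean rather than around the elite mean, an assumption borrowed from rank-based covariance updates, to limit premature shrinking of the search distribution.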

Historically, the CMA-ES has been developed heuristically. It was done mainly by conducting experimental research and validating intuitions empirically.

Research was done without much focus on theoretical foundations because of the apparent complexity of this algorithm. Only recently did (Akimoto et al., 2010), (Glasmachers et al., 2010) and (Ollivier et al., 2017) make a breakthrough and provide a theoretical justification of the CMA-ES updates thanks to information geometry. They proved that CMA-ES performs a natural gradient descent in the Fisher information metric. The Bayesian formulation of the CMA-ES came much later and has so far only been done with the normal inverse Wishart prior.

In this paper, we revisit the Bayesian CMA-ES formulation and show that there exists in fact an infinity of conjugate priors, given by the convex combinations of a normal Wishart and a normal inverse Wishart prior. We first prove that the normal Wishart and normal inverse Wishart priors have the same update equations except for the mean of the covariance matrix. We provide a theoretical argument to show that the expected covariance should be lower under the normal Wishart prior than under the normal inverse Wishart prior. We then introduce a new prior given by a mixture of a normal Wishart and a normal inverse Wishart prior, and likewise derive its update equations. Finally, in section 5, we give numerical results to compare all these methods.

3. Conjugate priors

A key concept in Bayesian statistics is that of conjugate priors, which make the computation of the posterior really easy. It is described below.

Definition 3.1.

A prior distribution $\pi(\theta)$ is said to be a conjugate prior if the posterior distribution

$$\pi(\theta|x)\propto p(x|\theta)\,\pi(\theta)$$

remains in the same distribution family as the prior.

At this stage, it is relevant to introduce exponential family distributions, as this higher level of abstraction, which encompasses the multivariate normal, trivially solves the issue of finding conjugate priors. This will be very helpful for inferring conjugate priors for the multivariate Gaussian used in CMA-ES.

Definition 3.2.

A distribution is said to belong to the exponential family if it can be written (in its canonical form) as:

$$p(x|\eta)=h(x)\exp\big(\eta\cdot T(x)-A(\eta)\big),$$

where $\eta$ is the natural parameter, $T({\mathbf{x}})$ is the sufficient statistic, $A(\eta)$ is the log-partition function and $h({\mathbf{x}})$ is the base measure. $\eta$ and $T({\mathbf{x}})$ may be vector-valued. Here $a\cdot b$ denotes the inner product of $a$ and $b$.

The log-partition function is defined by the integral:

$$A(\eta)\triangleq\log\int_{\mathcal{X}}h(x)\exp\big(\eta\cdot T(x)\big)\,\mathrm{d}x.$$

Also, $\eta\in\Omega=\{\eta\in\mathbb{R}^{m}|A(\eta)<+\infty\}$, where $\Omega$ is the natural parameter space. Moreover, $\Omega$ is a convex set and $A(\cdot)$ is a convex function on $\Omega$.

Remark 3.1.

Not surprisingly, the normal distribution $\mathcal{N}({\mathbf{x}};\mu,\Sigma)$ with mean $\mu\in\mathbb{R}^{d}$ and covariance matrix $\Sigma$ belongs to the exponential family but with a different parametrisation. Its exponential family form is given by:

[TABLE]

where, in equations (8a), the notation $\mathrm{vec}(\cdot)$ means that we have vectorized the matrix, stacking its columns on top of each other. For two matrices $a$ and $b$, we can hence equivalently write the trace $\operatorname{Tr}(a^{\mathrm{T}}b)$ as the scalar product of their vectorizations $\mathrm{vec}(a)\cdot\mathrm{vec}(b)$ (see subsection 7.2). Note that the canonical parameters are very different from the traditional (also called moment) parameters, and that slightly changing the sufficient statistic $T(x)$ changes the corresponding canonical parameters $\eta$. In equation (8b), the notation $|\Sigma|$ means the determinant of the matrix, $\det(\Sigma)$.
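To make the exponential family form concrete, the sketch below checks it numerically in the one-dimensional case, assuming the common parametrisation $T(x)=(x,x^{2})$, $\eta=(\mu/\sigma^{2},-1/(2\sigma^{2}))$, $h(x)=1/\sqrt{2\pi}$ and $A(\eta)=\mu^{2}/(2\sigma^{2})+\frac{1}{2}\log\sigma^{2}$. This choice of sufficient statistic is an assumption and may differ from the one used in equations (8a) and (8b) by a rescaling; the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate normal density in its usual (moment) parametrisation."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def gaussian_expfam_pdf(x, mu, sigma2):
    """Same density written as h(x) * exp(eta . T(x) - A(eta))."""
    eta = (mu / sigma2, -1.0 / (2.0 * sigma2))      # natural parameters
    T = (x, x * x)                                  # sufficient statistic
    h = 1.0 / math.sqrt(2.0 * math.pi)              # base measure
    A = mu * mu / (2.0 * sigma2) + 0.5 * math.log(sigma2)  # log-partition
    return h * math.exp(eta[0] * T[0] + eta[1] * T[1] - A)


# Compare both forms at a few arbitrary (x, mu, sigma^2) triples.
checks = [(0.3, 1.0, 2.0), (-1.5, 0.5, 0.7), (4.0, 4.0, 1.0)]
max_gap = max(abs(gaussian_pdf(*c) - gaussian_expfam_pdf(*c)) for c in checks)
print(max_gap)
```

Expanding $\eta\cdot T(x)-A(\eta)$ gives $-(x-\mu)^{2}/(2\sigma^{2})-\frac{1}{2}\log\sigma^{2}$, which is why the two functions agree up to rounding error.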

For an exponential family distribution, it is particularly easy to form a conjugate prior.

Proposition 3.3.

If the observations have a density of the exponential family form $p(x|\theta,\kappa)=h(x)\exp\Big(\eta(\theta,\kappa)^{T}T(x)-A(\eta(\theta,\kappa))\Big)$, with $\kappa$ a set of hyper-parameters, then the prior defined by $\pi(\theta)\propto\exp\left(\lambda_{1}\cdot\eta(\theta,\kappa)-\lambda_{0}A(\eta(\theta,\kappa))\right)$ with $\lambda\triangleq(\lambda_{0},\lambda_{1})$ is a conjugate prior.

The proof is given in appendix subsection 7.1. As we can vary the parameterisation of the likelihood, we can obtain multiple conjugate priors. Because of conjugacy, if the initial parameters of the multivariate Gaussian follow the prior, the posterior is the true distribution given the information $\mathcal{X}$ and stays in the same family, which makes the update of the parameters really easy. Said differently, with a conjugate prior, we make the optimal update.

A consequence of proposition 3.3 is that the conjugate priors of the multivariate normal that belong to the exponential family can be determined. This is the subject of the corollary below.

Corollary 3.4.

The conjugate priors of the multivariate normal that belong to the exponential family are necessarily of the form:

• normal inverse Wishart distribution $NIW(\mu_{0},\lambda_{0},\nu_{0},\Psi_{0})$ if the multivariate normal is described in terms of its mean vector $\mu$ and covariance matrix $\Sigma$,

• normal Wishart distribution $NW(\mu_{0},\lambda_{0},\nu_{0},W_{0})$ if the multivariate normal is described in terms of its mean vector $\mu$ and precision matrix $\Lambda$.

The proof is given in appendix subsection 7.3. Since these two distributions are conjugate priors, their posteriors are easy to derive and are given by the following proposition.

Proposition 3.5.

Consider a likelihood of $n$ points $(x_{i})_{i=1..n}$ distributed according to a multivariate normal distribution whose parameters follow one of the priors below:

(1) the normal inverse Wishart distribution: $NIW_{0}=NIW(\mu_{0},\lambda_{0},\nu_{0},\Psi_{0})$

(2) the normal Wishart distribution: $NW_{0}=NW(\mu_{0},\lambda_{0},\nu_{0},W_{0})$

(3) the mixture of a normal inverse Wishart and a normal Wishart with the same parameters: $wNIW_{0}+(1-w)NW_{0}$ with $0\leq w\leq 1$

The posterior is given by:

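(1) the normal inverse Wishart distribution

$$NIW_{1}=NIW\Big(\frac{\lambda_{0}\mu_{0}+n\overline{x}}{\lambda_{0}+n},\;\lambda_{0}+n,\;\nu_{0}+n,\;\Psi_{0}+nC+nD\Big)\tag{9}$$

(2) the normal Wishart distribution

$$NW_{1}=NW\Big(\frac{\lambda_{0}\mu_{0}+n\overline{x}}{\lambda_{0}+n},\;\lambda_{0}+n,\;\nu_{0}+n,\;\big(W_{0}^{-1}+nC+nD\big)^{-1}\Big)\tag{10}$$

(3) the mixture of a normal inverse Wishart and a normal Wishart with the same parameters: $wNIW_{1}+(1-w)NW_{1}$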

where $\overline{x}=1/n\sum_{i=1}^{n}x_{i}$ is the sample mean, $C=1/n\sum_{i=1}^{n}(x_{i}-\overline{x})(x_{i}-\overline{x})^{T}$ the sample covariance and $D=\frac{\lambda_{0}\,n}{n(\lambda_{0}+n)}(\overline{x}-\mu_{0})(\overline{x}-\mu_{0})^{T}$ .

The proof is given in appendix subsection 7.4.
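A minimal Python sketch of the normal inverse Wishart update of proposition 3.5, written directly from the definitions of $\overline{x}$, $C$ and $D$ above, could look as follows. The function name `niw_posterior` and the toy data are hypothetical; the normal Wishart case is not shown.

```python
import numpy as np


def niw_posterior(mu0, lam0, nu0, psi0, X):
    """Posterior NIW parameters after observing the rows of X."""
    mu0 = np.asarray(mu0, dtype=float)
    psi0 = np.asarray(psi0, dtype=float)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    xbar = X.mean(axis=0)                          # sample mean
    centered = X - xbar
    C = centered.T @ centered / n                  # sample covariance (1/n)
    diff = (xbar - mu0).reshape(-1, 1)
    D = lam0 / (lam0 + n) * (diff @ diff.T)        # prior-mean correction
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    return mu_n, lam0 + n, nu0 + n, psi0 + n * C + n * D


# Toy data (assumed values): two points in dimension 2.
mu_n, lam_n, nu_n, psi_n = niw_posterior(
    mu0=[0.0, 0.0], lam0=1.0, nu0=4.0, psi0=np.eye(2),
    X=[[1.0, 1.0], [3.0, 1.0]],
)
print(mu_n, lam_n, nu_n)
print(psi_n)
```

The posterior mean is a weighted average of the prior mean and the sample mean, with weights $\lambda_{0}$ and $n$, and the scale matrix accumulates the sample scatter plus a term penalizing the distance between $\overline{x}$ and $\mu_{0}$.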

4. Algorithm

The idea behind the algorithm is, at each step, to use the posterior of the previous iteration as a prior, draw from the likelihood, and then update the posterior according to proposition 3.5. In full generality, the prior is a distribution over the parameters, so we would need to do a Monte Carlo of Monte Carlo. To reduce the variance that this nested Monte Carlo would introduce, we make the simplification of using the mean value of the prior distribution. These values are given as follows:

(1) for the normal inverse Wishart distribution, $\hat{\mu}=\mathbb{E}[\mu]=\mu_{n}$ and $\hat{\Sigma}=\mathbb{E}[\Sigma]=\Psi_{n}/(\nu_{n}-p-1)$

(2) for the normal Wishart distribution, $\hat{\mu}=\mathbb{E}[\mu]=\mu_{n}$ and $\hat{\Sigma}=(\mathbb{E}[\Lambda])^{-1}=\Psi_{n}/\nu_{n}$ for $\Psi_{n}=W_{n}^{-1}$

(3) for the $w$ mixture of the normal inverse Wishart and normal Wishart with the same parameters, $\hat{\mu}=\mathbb{E}[\mu]=\mu_{n}$ and $\hat{\Sigma}=w\frac{\Psi_{n}}{\nu_{n}-p-1}+(1-w)\frac{\Psi_{n}}{\nu_{n}}=\frac{\nu_{n}-p-1+wp+w}{\nu_{n}(\nu_{n}-p-1)}\Psi_{n}$
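The three plug-in covariances listed above differ only by a scalar factor in front of $\Psi_{n}$. The short Python sketch below computes them as a weighted combination and checks the result against the closed-form mixture coefficient. The function name `expected_covariance` and the numerical values are hypothetical.

```python
import numpy as np


def expected_covariance(psi_n, nu_n, w):
    """Plug-in covariance: w=1 gives the NIW value, w=0 the NW value."""
    psi_n = np.asarray(psi_n, dtype=float)
    p = psi_n.shape[0]
    if nu_n <= p + 1:
        raise ValueError("nu_n must exceed p + 1 for the NIW mean to exist")
    niw = psi_n / (nu_n - p - 1)   # mean of the inverse Wishart
    nw = psi_n / nu_n              # inverse of the Wishart mean
    return w * niw + (1.0 - w) * nw


# Assumed values: dimension p = 2, nu_n = 6, equal-weight mixture.
psi_n, nu_n, w, p = np.eye(2), 6.0, 0.5, 2
sigma_mix = expected_covariance(psi_n, nu_n, w)
closed_form = (nu_n - p - 1 + w * p + w) / (nu_n * (nu_n - p - 1)) * psi_n
print(sigma_mix)
```

Since $\nu_{n}-p-1<\nu_{n}$, the value for $w=1$ always dominates the value for $w=0$, and intermediate weights interpolate linearly between the two.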

The expected value of the covariance matrix under the normal inverse Wishart prior, $\hat{\Sigma}=\mathbb{E}[\Sigma]$, is above the one obtained under the normal Wishart prior. Indeed, the matrix inverse $Inv:S\rightarrow S^{-1}$ is a convex function on the domain $\mathcal{S}^{p}_{++}$ of symmetric positive definite matrices, so that, with $\Sigma=\Lambda^{-1}$, Jensen's inequality gives $\mathbb{E}[\Lambda^{-1}]\succeq(\mathbb{E}[\Lambda])^{-1}$. A proof of the convexity is given in subsection 7.5. To recover the true minimum, we design two strategies.

• We design a strategy where we rebuild our normal distribution using sorted information about our $X$'s, weighted by their normal density, to ensure that this is a true normal corrected for the Monte Carlo bias. We need to compute the weights explicitly. For each simulated point $X_{i}$, we compute its assumed density, denoted by $d_{i}=\mathcal{N}(\hat{\mu},\hat{\Sigma})(X_{i})$, where $\mathcal{N}(\hat{\mu},\hat{\Sigma})(.)$ denotes the p.d.f. of the multivariate Gaussian. We divide these densities by their sum to get weights $(w_{i})_{i=1..k}$ that are positive and sum to one: $w_{j}=d_{j}/\sum_{i=1}^{k}d_{i}$. Hence, for $k$ simulated points, we get $\{X_{i},w_{i}\}_{i=1..k}$. We jointly reorder the tuples (points and weights) by decreasing weight. To emphasize that we take values sorted in decreasing order with respect to the weights $(w_{i})_{i=1..k}$, we denote the order statistics by $(i),w\downarrow$. This first sorting leads to $k$ new tuples $\{X_{(i),w\downarrow},w_{(i),w\downarrow}\}_{i=1..k}$. Using a stable sort (one that keeps the order of the densities), we sort jointly the tuples (points and weights) according to their objective function value (in increasing order this time) and get $k$ new tuples $\{X_{(i),f\uparrow},w_{(i),w\downarrow}\}_{i=1..k}$. We can now compute a new mean as follows:

$$\hat{\mu}=\underbrace{\sum_{i=1}^{k}w_{(i),w\downarrow}\cdot X_{(i),f\uparrow}}_{\text{MC mean for }X_{f\uparrow}}-\underbrace{\Big(\sum_{i=1}^{k}w_{i}X_{i}-\hat{\mu}\Big)}_{\text{MC bias for }X}\tag{11}$$

The intuition behind equation (11) is to compute, in the left term, the Monte Carlo mean using points reordered according to their objective value, and to correct this computation by the Monte Carlo bias, computed as the right term and equal to the initial Monte Carlo mean minus the real mean. We call this strategy one.

• If we think for a minute about strategy one, we get the intuition that it may not be optimal at the start of the minimization. This is because the weights are proportional to $\exp(-\frac{1}{2}(X-\hat{\mu})^{T}\hat{\Sigma}^{-1}(X-\hat{\mu}))$. When we start the algorithm, we use a large search space, hence a large covariance matrix $\hat{\Sigma}$, which leads to weights that are all quite similar. Hence, even if we sort candidates by their fitness, ranking them according to the value of $f$ in increasing order, our theoretical multivariate Gaussian will only move little by little. A better solution is to move the center of our multivariate Gaussian abruptly to the best candidate seen so far, as follows:

$$\hat{\mu}=\underset{X\in\mathcal{X}}{\arg\min}\;f(X)\tag{12}$$

We call this strategy two. Intuitively, strategy two should be best when starting the algorithm, while strategy one should be better once we are close to the solution.
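The two mean updates described above can be sketched in Python as follows. This is one possible reading of the text, under the assumption that the $i$-th largest weight is paired with the $i$-th best point; the function names and the toy sample are hypothetical, and the covariance update is not covered.

```python
import numpy as np


def gaussian_weights(X, mu, sigma):
    """Normalized Gaussian densities of the rows of X under N(mu, sigma)."""
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    log_d = -0.5 * maha               # the normalizing constant cancels out
    d = np.exp(log_d - log_d.max())   # shift for numerical stability
    return d / d.sum()


def mean_strategy_one(X, f_values, mu, sigma):
    """Reordered Monte Carlo mean minus the Monte Carlo bias."""
    w = gaussian_weights(X, mu, sigma)
    w_sorted = np.sort(w)[::-1]                       # weights, decreasing
    X_by_f = X[np.argsort(f_values, kind="stable")]   # points, increasing f
    mc_mean_sorted = w_sorted @ X_by_f
    mc_bias = w @ X - mu
    return mc_mean_sorted - mc_bias


def mean_strategy_two(X, f_values):
    """Jump to the best candidate of the current sample."""
    return X[np.argmin(f_values)]


# Toy sample (assumed values): three points on a line, objective ||x||^2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
f_values = np.sum(X**2, axis=1)
mu, sigma = np.array([1.0, 0.0]), np.eye(2)
new_mean_s1 = mean_strategy_one(X, f_values, mu, sigma)
new_mean_s2 = mean_strategy_two(X, f_values)
print(new_mean_s1, new_mean_s2)
```

On this toy sample, strategy one moves the mean only part of the way towards the best point, whereas strategy two jumps directly to it, which matches the intuition given in the text.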

To recover the true variance, we can adapt what we did in strategy one as follows:

•

[TABLE]

where $\overline{X}_{(.),f\uparrow}=\sum_{i=1}^{k}w_{(i),w\downarrow}X_{(i),f\uparrow}$ and $\overline{X}=\sum_{i=1}^{k}w_{i}X_{i}$ are respectively the mean of the sorted and non sorted points.

5. Numerical results

5.1. Functions examined

We have examined five functions to stress test our algorithm. They are listed in increasing order of complexity for our algorithm and correspond to different types of functions. They are all generalized functions that can be defined for any dimension $n$. For each, we present the corresponding equation for a variable $x=(x_{1},x_{2},..,x_{n})$ of dimension $n$. Code is provided in supplementary materials. We have frozen the random seeds to make the results reproducible.

5.1.1. Cone

The simplest function to optimize is the quadratic cone, whose equation is given by (14) and which is represented in figure 1. It is also the standard Euclidean norm. It is obviously convex and is a good first test of the performance of an optimization method.

$$f(x)=\Big(\sum_{i=1}^{n}x_{i}^{2}\Big)^{1/2}=\|x\|_{2}\tag{14}$$

5.1.2. Schwefel 2 function

A slightly more complicated function is the Schwefel 2 function, whose equation is given by (15) and which is represented in figure 2. It is a piecewise linear function and checks that the algorithm can cope with non-convex functions.

$$f(x)=\sum_{i=1}^{n}|x_{i}|+\prod_{i=1}^{n}|x_{i}|\tag{15}$$

5.1.3. Rastrigin

The Rastrigin function, first proposed by (Rastrigin, 1974) and generalized by (Mühlenbein et al., 1991), is more difficult than the cone and the Schwefel 2 function. Its equation is given by (16) and it is represented in figure 3. It is a non-convex function often used as a performance test problem for optimization algorithms. It is a typical example of a non-linear multi-modal function. Finding its minimum is considered a good stress test for an optimization algorithm, due to its large search space and its large number of local minima.

$$f(x)=10\times n+\sum_{i=1}^{n}\big[x_{i}^{2}-10\cos(2\pi x_{i})\big]\tag{16}$$

5.1.4. Schwefel 1 function

The Schwefel 1 function, whose equation is given by (17), is a tricky function to optimize. It is represented in figure 4. It is sometimes only defined on $\left[-500,500\right]^{n}$. The Schwefel 1 function shares similarities with the Rastrigin function: it is continuous, not convex, multi-modal and has a large number of local minima. The extra difficulty compared to the Rastrigin function is that the local minima are more pronounced local bowls, making the optimization even harder.

$$f(x)=418.9829\times n-\sum_{i=1}^{n}\Big[x_{i}\sin\big(\sqrt{|x_{i}|}\big)\mathbf{1}_{|x_{i}|<500}+500\sin\big(\sqrt{500}\big)\mathbf{1}_{|x_{i}|\geq 500}\Big]\tag{17}$$

5.1.5. Eggholder function

The Eggholder function, whose equation is given by (18), is a difficult function to optimize because of its large number of local minima. It is sometimes only defined on $\left[-512,512\right]^{n}$. It shares similarities with the Schwefel 1 function: it is continuous, not convex and multi-modal.

$$f(x,y)=-(y+47)\sin\sqrt{\Big|\frac{x}{2}+(y+47)\Big|}-x\sin\sqrt{|x-(y+47)|}\tag{18}$$
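For reference, the five test functions can be written in plain Python as below. This is a sketch, not the authors' supplementary code: the square roots inside the sine terms of the Schwefel 1 and Eggholder functions follow the standard benchmark definitions and are an assumption here, and the function names are hypothetical.

```python
import math


def cone(x):
    """Euclidean norm of x."""
    return math.sqrt(sum(xi * xi for xi in x))


def schwefel2(x):
    """Sum plus product of absolute values."""
    abs_x = [abs(xi) for xi in x]
    return sum(abs_x) + math.prod(abs_x)


def rastrigin(x):
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def schwefel1(x):
    """Schwefel function, held constant outside [-500, 500] per coordinate."""
    def term(xi):
        if abs(xi) < 500:
            return xi * math.sin(math.sqrt(abs(xi)))
        return 500 * math.sin(math.sqrt(500))
    return 418.9829 * len(x) - sum(term(xi) for xi in x)


def eggholder(x, y):
    """Two-dimensional Eggholder function."""
    return (-(y + 47) * math.sin(math.sqrt(abs(x / 2 + (y + 47))))
            - x * math.sin(math.sqrt(abs(x - (y + 47)))))


print(cone([3.0, 4.0]), schwefel2([1.0, -2.0]), rastrigin([0.0, 0.0]))
```

With these definitions, the cone, Schwefel 2 and Rastrigin functions vanish at the origin, while the Schwefel 1 function is close to zero near $x_{i}\approx 420.9687$ in every coordinate.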

5.2. Convergence

For each of the functions, we compared three methods: our method with strategy one, entitled B-CMA-ES S1 (update of $\hat{\mu}$ and $\hat{\Sigma}$ using (11) and (13)), in orange; our method with strategy two, entitled B-CMA-ES S2 (same update but using (12) and (13)), in blue; and standard CMA-ES as provided by the open source Python package pycma, in green. We clearly see that strategy two outperforms standard CMA-ES and Bayesian CMA-ES S1. The convergence graphs, which show the error compared to the minimum, are represented

• for the cone function in figure 6 (case of a convex function), with initial point $(10,10)$,

• for the Schwefel 2 function in figure 7 (case of a piecewise linear function), with initial point $(10,10)$,

• for the Rastrigin function in figure 8 (case of a non-convex function with multiple local minima), with initial point $(10,10)$,

• and for the Schwefel 1 function in figure 9 (case of a non-convex function with multiple large bowl local minima), with initial point $(10,10)$.

For functions that are convex, our method performs similarly to standard CMA-ES. For functions with harder local minima, the Bayesian CMA-ES is able to perform better. We conjecture that this is due to a contraction-dilatation mechanism that helps avoid being trapped in a local minimum.

6. Conclusion

In this paper, we have revisited the CMA-ES algorithm and provided a Bayesian version of it. Taking conjugate priors, we can find the optimal update for the mean and covariance of the multivariate normal. We have provided the corresponding algorithm, which is a new version of CMA-ES. First numerical experiments show that this new version is comparable to standard CMA-ES on traditional functions such as the cone, Schwefel 1, Rastrigin and Schwefel 2 functions. The similar convergence can be explained on the theoretical side by the optimal update of the prior (thanks to the Bayesian update) and by the use of the best candidate seen at each simulation to shift the mean of the multivariate Gaussian likelihood. We envisage further work to benchmark our algorithm against traditional CMA-ES and other evolutionary algorithms, in particular using the COCO platform to provide more meaningful tests and confirm the theoretical intuition of good performance of this new version of CMA-ES, and to test the importance of the prior choice.

7. Appendix

7.1. Conjugate priors

Proof.

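Consider $n$ independent and identically distributed (IID) measurements $\mathcal{X}\triangleq\{{\mathbf{x}}^{j}\in\mathbb{R}^{d}|1\leq j\leq n\}$ and assume that these variables have an exponential family density. The likelihood $p(\mathcal{X}|\theta,\kappa)$ is simply the product of the individual likelihoods:

$$p(\mathcal{X}|\theta,\kappa)=\Big(\prod_{j=1}^{n}h(\mathbf{x}^{j})\Big)\exp\Big(\eta(\theta,\kappa)^{T}\sum_{j=1}^{n}T(x^{j})-nA(\eta(\theta,\kappa))\Big).\tag{19}$$

If we start with a prior $\pi(\theta)$ of the form $\pi(\theta)\propto\exp(\mathcal{F}(\theta))$ for some function $\mathcal{F}(\cdot)$, its posterior is:

$$\pi(\theta|\mathcal{X})\propto p(\mathcal{X}|\theta,\kappa)\,\pi(\theta)\propto\exp\Big(\eta(\theta,\kappa)\cdot\sum_{j=1}^{n}T(x^{j})-nA(\eta(\theta,\kappa))+\mathcal{F}(\theta)\Big).\tag{20}$$

It is easy to check that the posterior (20) is in the same exponential family as the prior iff $\mathcal{F}(\cdot)$ is of the form

$$\mathcal{F}(\theta)=\lambda_{1}\cdot\eta(\theta,\kappa)-\lambda_{0}A(\eta(\theta,\kappa))$$

for some $\lambda\triangleq(\lambda_{0},\lambda_{1})$, such that

$$\pi(\theta|\mathcal{X})\propto\exp\Big(\Big(\lambda_{1}+\sum_{j=1}^{n}T(x^{j})\Big)^{T}\eta(\theta,\kappa)-(n+\lambda_{0})A(\eta(\theta,\kappa))\Big).$$

Hence, the conjugate prior for the likelihood (19) is parametrized by $\lambda$ and given by

$$\pi(\theta)=\frac{1}{Z}\exp\big(\lambda_{1}\cdot\eta(\theta,\kappa)-\lambda_{0}A(\eta(\theta,\kappa))\big),$$

where $Z={\int{\exp\left(\lambda_{1}\cdot\eta(\theta,\kappa)-\lambda_{0}A(\eta(\theta,\kappa))\right)\;\mathrm{d}\theta}}$. ∎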

7.2. Multivariate Canonical form

In the case of the multi variate normal, the canonical form for this distribution writes as

[TABLE]

which gives the following moment and canonical parameters:

[TABLE]

7.3. Conjugate priors determination

Using proposition 3.3 and the exponential family formulation of the multi variate normal (equations (25)), we have that any conjugate prior for the multi variate normal that belongs to the exponential family is given by

[TABLE]

If we write $\lambda_{1}=(\lambda_{0}\,\mu_{0},\lambda_{2})$ and $\Psi_{0}=-2(\lambda_{2}+\frac{\lambda_{0}}{2}\mu_{0}\mu_{0}^{T})$ , we get

[TABLE]

The first term is a multivariate normal distribution. Its parameters are $\mu_{0}$ and $\frac{\Sigma}{\lambda_{0}}$.

In the second term, we can recognize the proportional term of an inverse Wishart distribution, $\exp\left(-\frac{1}{2}\operatorname{Tr}(\Psi_{0}\Sigma^{-1})\right)$, with parameters $\nu_{0},\Psi_{0}$.

This shows that the conjugate prior of the multivariate normal described by its mean vector $\mu$ and covariance matrix $\Sigma$ is a normal inverse Wishart distribution $NIW(\mu_{0},\lambda_{0},\nu_{0},\Psi_{0})$. ∎

If the multi variate normal is parametrized by its mean vector $\mu$ and its precision matrix $\Lambda$ , the same reasoning gives

[TABLE]

The second term is a multivariate normal distribution given by $N(\mu_{0},(\lambda_{0}\Lambda)^{-1})$, while the first one is the term of a Wishart distribution, proportional to $\exp\left(-\frac{1}{2}\operatorname{Tr}(W_{0}^{-1}\Lambda)\right)$, whose parameters are $\mathcal{W}(W_{0},\nu_{0})$. This shows that the conjugate prior of the multivariate normal described by its mean vector $\mu$ and precision matrix $\Lambda$ is a normal Wishart distribution $NW(\mu_{0},\lambda_{0},\nu_{0},W_{0})$. ∎

7.4. Posterior update

The posterior update is quite straightforward and very similar for the two cases, NIW and NW. We detail only the calculation for the NIW case, as the NW case is very similar. Recall that the probability density function of a normal inverse Wishart random variable is the product of a normal and an inverse Wishart probability density function. Denoting by $p\times p$ the dimension of the covariance matrix $\Sigma$ and using Bayes' rule, the posterior is proportional to the product of the prior and the likelihood:

[TABLE]

First of all, we can regroup all terms in $x_{i}$ as follows

[TABLE]

and use the following remarkable identity:

[TABLE]

where we have used the cyclic property of the trace operator, $\operatorname{Tr}(AB)=\operatorname{Tr}(BA)$ , the fact that a real number is equal to its own trace, and written $C=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(x_{i}-\overline{x}\right)^{T}$ for the sample covariance. Going further, we have

[TABLE]
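As a numerical sanity check of the regrouping step, the sketch below verifies that $\sum_{i}(x_{i}-\mu)^{T}\Sigma^{-1}(x_{i}-\mu)=n(\overline{x}-\mu)^{T}\Sigma^{-1}(\overline{x}-\mu)+n\operatorname{Tr}(\Sigma^{-1}C)$ on random data. It assumes the sample covariance $C$ uses the $\frac{1}{n}$ normalisation and outer products; the function name `split_quadratic_form` and the test data are illustrative and not taken from the paper.

```python
import numpy as np

def split_quadratic_form(x, mu, sigma):
    """Return both sides of the identity for samples x (n rows, p columns)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    precision = np.linalg.inv(sigma)
    x_bar = x.mean(axis=0)
    # Left side: sum of the n quadratic forms centred on mu.
    lhs = sum((xi - mu) @ precision @ (xi - mu) for xi in x)
    # Sample covariance with the 1/n normalisation (assumption, see lead-in).
    c = (x - x_bar).T @ (x - x_bar) / n
    # Right side: term in the sample mean plus trace term.
    rhs = n * (x_bar - mu) @ precision @ (x_bar - mu) + n * np.trace(precision @ c)
    return lhs, rhs

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 3))
sigma = a @ a.T + 3 * np.eye(3)   # arbitrary positive definite matrix
x = rng.normal(size=(50, 3))
mu = rng.normal(size=3)
lhs, rhs = split_quadratic_form(x, mu, sigma)
print(np.isclose(lhs, rhs))
```

The cross terms vanish because the deviations $x_{i}-\overline{x}$ sum to zero, which is why only the mean term and the trace term remain.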

Hence, we can compute explicitly the posterior as follows:

[TABLE]

which are exactly the equations provided in (9) ∎
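The update can be sketched in code. The block below implements the standard textbook normal inverse Wishart posterior update, which we assume coincides with equations (9); the function name `niw_posterior` and the prior values are hypothetical. It then checks that updating on all samples at once gives the same hyperparameters as updating in two successive batches, as expected for a conjugate prior.

```python
import numpy as np

def niw_posterior(mu0, lam0, nu0, psi0, x):
    """Standard NIW update (assumed to match equations (9)); x has one sample per row."""
    mu0 = np.asarray(mu0, dtype=float)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    n = x.shape[0]
    x_bar = x.mean(axis=0)
    centred = x - x_bar
    scatter = centred.T @ centred            # sum of outer products around the sample mean
    lam_n = lam0 + n
    nu_n = nu0 + n
    mu_n = (lam0 * mu0 + n * x_bar) / lam_n  # precision-weighted mean
    diff = (x_bar - mu0)[:, None]
    psi_n = psi0 + scatter + (lam0 * n / lam_n) * (diff @ diff.T)
    return mu_n, lam_n, nu_n, psi_n

rng = np.random.default_rng(0)
p = 3
x = rng.normal(size=(20, p))
prior = (np.zeros(p), 1.0, p + 2.0, np.eye(p))  # illustrative prior hyperparameters
batch = niw_posterior(*prior, x)
step = niw_posterior(*niw_posterior(*prior, x[:8]), x[8:])
print(all(np.allclose(b, s) for b, s in zip(batch, step)))
```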

7.5. Convexity of the inverse of a matrix

We give here six different proofs of the convexity of the inverse of a matrix on the domain of symmetric positive definite matrices $\mathcal{S}^{p}_{++}$ . The first and second proofs rely on the fact that the result follows from the convexity of the matrix fractional function $f(X,y)=y^{T}X^{-1}y$ on the domain $\mathrm{dom}f=\mathcal{S}^{p}_{++}\times\mathbb{R}^{p}$ . The implication comes from the fact that

[TABLE]

Since $y$ is arbitrary, this implies that the matrix within the square bracket in equation (7.5) is positive semi-definite. It is interesting to notice that the matrix fractional function is, in a sense, an extension of the quadratic-over-linear function $f(x,y)=x^{2}/y$ , which is convex on $\mathbb{R}\times\mathbb{R}_{++}$ (that is, for $y>0$ ).
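The matrix inequality being proved, $(1-\lambda)M^{-1}+\lambda N^{-1}\succeq\left((1-\lambda)M+\lambda N\right)^{-1}$ , can be checked numerically before going through the proofs. The sketch below draws two random symmetric positive definite matrices and verifies that the smallest eigenvalue of the gap is non-negative up to rounding. The helper names `random_spd` and `convexity_gap` are illustrative assumptions.

```python
import numpy as np

def random_spd(rng, p):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.normal(size=(p, p))
    return a @ a.T + p * np.eye(p)

def convexity_gap(m, n, lam):
    """(1-lam) M^{-1} + lam N^{-1} - ((1-lam) M + lam N)^{-1}."""
    mix = (1 - lam) * m + lam * n
    return (1 - lam) * np.linalg.inv(m) + lam * np.linalg.inv(n) - np.linalg.inv(mix)

rng = np.random.default_rng(1)
m, n = random_spd(rng, 4), random_spd(rng, 4)
min_eigs = [np.linalg.eigvalsh(convexity_gap(m, n, lam)).min()
            for lam in np.linspace(0.0, 1.0, 11)]
# The gap should be positive semi-definite for every lam in [0, 1].
print(min(min_eigs) >= -1e-12)
```

The same inequality, applied to an average instead of a two-point mixture, says that the mean of inverses dominates the inverse of the mean, which is the mechanism behind the lower expected covariance under the normal Wishart prior.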

Proof.

The first proof uses the property that the minimum of a convex function over a convex set is convex. For $\Sigma\in S_{++}^{n}$ , and for $u,y\in\mathbb{R}^{n}$ we can consider the quadratic function $f(u)$ defined by

[TABLE]

As $\Sigma\in S_{++}^{n}$ , this function is obviously convex (it is a quadratic function whose quadratic coefficient is a positive definite matrix). Hence its minimum $\inf_{u\in\mathbb{R}^{n}}f(u)$ over a convex set is convex. It is easy to minimize a quadratic function: its minimum is attained at the stationary point where the gradient vanishes and is given by $\frac{1}{2}y^{T}\Sigma^{-1}y$ , which concludes the proof. ∎
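One standard way to make this argument concrete is the variational form $y^{T}\Sigma^{-1}y=\sup_{u}\left(2y^{T}u-u^{T}\Sigma u\right)$ . For fixed $u$ the bracket is affine in $(\Sigma,y)$ , so the supremum is convex in $(\Sigma,y)$ . This particular quadratic is our assumption and may differ from the paper's $f(u)$ by sign and scaling. The sketch checks that the optimum is attained at $u^{*}=\Sigma^{-1}y$ with value $y^{T}\Sigma^{-1}y$ .

```python
import numpy as np

def g(u, sigma, y):
    """Assumed quadratic 2 y^T u - u^T Sigma u (concave in u, affine in (Sigma, y))."""
    return 2 * y @ u - u @ sigma @ u

rng = np.random.default_rng(2)
a = rng.normal(size=(4, 4))
sigma = a @ a.T + 4 * np.eye(4)       # arbitrary positive definite matrix
y = rng.normal(size=4)

u_star = np.linalg.solve(sigma, y)    # stationary point: gradient 2y - 2 Sigma u = 0
best = g(u_star, sigma, y)
target = y @ np.linalg.solve(sigma, y)

# Random perturbations of u_star should never do better than the stationary point.
others = [g(u_star + rng.normal(size=4), sigma, y) for _ in range(1000)]
print(np.isclose(best, target), max(others) <= best)
```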

Proof.

A second proof is to show that the epigraph of $f$ , denoted by $\text{epi}(f)$ , is convex, thanks to the link between positive semidefinite cones and Schur complements. We have that

[TABLE]

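This concludes the proof: the epigraph of $f$ is convex as the inverse image of the positive semidefinite cone $S_{+}^{n+1}$ under the affine mapping $(X,y,t)\mapsto\begin{pmatrix}X&y\\ y^{T}&t\end{pmatrix}$ , whose positive semidefiniteness is characterized by the Schur complement. ∎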
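The Schur complement characterization can be illustrated numerically. In the sketch below, a point $(X,y,t)$ with $X$ positive definite lies in the epigraph, $y^{T}X^{-1}y\leq t$ , exactly when the block matrix built from $X$ , $y$ and $t$ is positive semidefinite. The helper names `in_epigraph` and `block_is_psd` and the tolerance are illustrative assumptions.

```python
import numpy as np

def in_epigraph(x, y, t):
    """True when y^T X^{-1} y <= t."""
    return bool(y @ np.linalg.solve(x, y) <= t)

def block_is_psd(x, y, t, tol=1e-10):
    """True when [[X, y], [y^T, t]] is positive semidefinite (up to tol)."""
    block = np.block([[x, y[:, None]], [y[None, :], np.array([[t]])]])
    return bool(np.linalg.eigvalsh(block).min() >= -tol)

rng = np.random.default_rng(4)
a = rng.normal(size=(3, 3))
x = a @ a.T + 3 * np.eye(3)           # arbitrary positive definite matrix
y = rng.normal(size=3)
f_val = y @ np.linalg.solve(x, y)

# One point strictly inside the epigraph, one strictly outside.
for t in (f_val + 0.5, f_val - 0.5):
    print(in_epigraph(x, y, t), block_is_psd(x, y, t))
```

Both tests agree on each point, which is the equivalence the proof relies on.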

Proof.

A third proof relies on the fundamental identity of the inverse of a matrix $X$ : $XX^{-1}=I_{p}$ , where $I_{p}$ is the $p\times p$ identity matrix. Take two symmetric positive definite matrices $M,N$ and $\lambda\in[0,1]$ . Define $P_{\lambda}=(1-\lambda)M+\lambda N$ . $P_{\lambda}$ and $P_{\lambda}^{-1}$ are obviously symmetric positive definite. Denote by $(.)^{\prime}$ the derivative with respect to $\lambda$ . We have:

[TABLE]

Notice that $P^{\prime\prime}=0_{p}$ , since $P$ is affine in $\lambda$ . Differentiate one more time to get:

[TABLE]

For any arbitrary non-zero vector $y$ , define $v_{\lambda}=P^{\prime}_{\lambda}P^{-1}_{\lambda}y$ and $\varphi_{\lambda}=y^{T}P^{-1}_{\lambda}y$ . Equation (42) says that

[TABLE]

since $P^{-1}_{\lambda}$ is positive definite. As the second order derivative is non-negative, we conclude that $\varphi_{\lambda}$ is a convex function of $\lambda$ over $[0,1]$ . As a result, for any $\lambda\in(0,1)$ , we have:

[TABLE]

Since $y$ is arbitrary, this implies the matrix within the square bracket in (7.5) is positive semi-definite and hence:

[TABLE]

Note that when $P^{\prime}=N-M$ is invertible, $v_{\lambda}$ is non-zero for non-zero $y$ . The inequalities in (43) and (7.5) then become strict, and the matrix within the square bracket in (7.5) is positive definite instead of positive semi-definite. ∎
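The key step of this proof, $\varphi^{\prime\prime}_{\lambda}=2\,v_{\lambda}^{T}P^{-1}_{\lambda}v_{\lambda}\geq 0$ with $v_{\lambda}=P^{\prime}_{\lambda}P^{-1}_{\lambda}y$ , follows from $(P^{-1})^{\prime}=-P^{-1}P^{\prime}P^{-1}$ and $P^{\prime\prime}=0$ . The sketch below compares this closed form with a central finite difference of $\varphi_{\lambda}$ . Function names, the step size and the test matrices are illustrative assumptions.

```python
import numpy as np

def random_spd(rng, p):
    """Random symmetric positive definite matrix (illustrative helper)."""
    a = rng.normal(size=(p, p))
    return a @ a.T + p * np.eye(p)

def phi(lam, m, n, y):
    """phi_lambda = y^T P_lambda^{-1} y with P_lambda = (1-lam) M + lam N."""
    p_lam = (1 - lam) * m + lam * n
    return y @ np.linalg.solve(p_lam, y)

def phi_second_derivative(lam, m, n, y):
    """Closed form 2 v^T P^{-1} v with v = P' P^{-1} y and P' = N - M."""
    p_lam = (1 - lam) * m + lam * n
    v = (n - m) @ np.linalg.solve(p_lam, y)
    return 2 * v @ np.linalg.solve(p_lam, v)

rng = np.random.default_rng(3)
m, n = random_spd(rng, 4), random_spd(rng, 4)
y = rng.normal(size=4)
lam, h = 0.3, 1e-3
finite_diff = (phi(lam + h, m, n, y) - 2 * phi(lam, m, n, y) + phi(lam - h, m, n, y)) / h**2
exact = phi_second_derivative(lam, m, n, y)
print(np.isclose(finite_diff, exact, rtol=1e-4, atol=1e-8), exact >= 0)
```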

Proof.

A fourth proof is to derive the convexity of the inverse of a matrix from the convexity of the function $f(t)=\frac{1}{t}$ for $t>0$ . Let $P={X}^{-1/2}{Y}{X}^{-1/2}$ . We want to prove that

[TABLE]

where in inequality (45), we have left- and right-multiplied both sides by ${X}^{1/2}$ . As $P$ is symmetric positive definite, it can be unitarily diagonalised, and hence without loss of generality we can assume that it is a diagonal matrix. The inequality then reduces to the scalar case $(1-\lambda)+\lambda p_{ii}^{-1}\geq((1-\lambda)+\lambda p_{ii})^{-1}$ , which holds because the function $f(t)=\frac{1}{t}$ is convex for $t>0$ ∎
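The reduction to the scalar case can be reproduced numerically: form $P=X^{-1/2}YX^{-1/2}$ , take its eigenvalues $p_{ii}$ , and check the scalar inequality on each of them. The helper names `inv_sqrt` and `scalar_gaps` are illustrative and not from the paper.

```python
import numpy as np

def inv_sqrt(x):
    """Inverse square root of a symmetric positive definite matrix via eigendecomposition."""
    w, q = np.linalg.eigh(x)
    return q @ np.diag(w ** -0.5) @ q.T

def scalar_gaps(x, y, lam):
    """Eigenvalues of P and the gaps (1-lam) + lam/p - 1/((1-lam) + lam*p)."""
    r = inv_sqrt(x)
    p_eigs = np.linalg.eigvalsh(r @ y @ r)
    gaps = (1 - lam) + lam / p_eigs - 1.0 / ((1 - lam) + lam * p_eigs)
    return p_eigs, gaps

rng = np.random.default_rng(5)
a, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = a @ a.T + 4 * np.eye(4)           # arbitrary positive definite matrices
y = b @ b.T + 4 * np.eye(4)
p_eigs, gaps = scalar_gaps(x, y, 0.4)
# All eigenvalues of P are positive and every scalar gap is non-negative.
print(p_eigs.min() > 0, gaps.min() >= -1e-12)
```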

The last two proofs rely on the fact that the result is also implied by the convexity of the function $f(X)=\operatorname{Tr}(X^{-1}yy^{T})=\operatorname{Tr}(y^{T}X^{-1}y)$ on $X\in\mathcal{S}^{p}_{++}$ for any $y\in\mathbb{R}^{p}$ . The equality comes from the cyclic property of the trace operator, and the trace of a real number is the number itself.

Proof.

The fifth proof uses the fact that a non-negative second order derivative along any line is enough to prove convexity. Consider $S(t)=U+tV$ where $U$ is symmetric positive definite and $V$ is symmetric. It is enough to show that $\left.\dfrac{d^{2}}{dt^{2}}\text{Tr}(y^{T}S(t)^{-1}y)\right|_{t=0}\geq 0$ . We have

[TABLE]

So

[TABLE]

But $U^{-1}VU^{-1}VU^{-1}=WU^{-1}W^{T}$ where $W=U^{-1}V$ , and $U^{-1}$ is positive definite, so $WU^{-1}W^{T}$ is positive semidefinite, which implies $\text{Tr}(y^{T}WU^{-1}W^{T}y)\geq 0$ and concludes the proof ∎

Proof.

A final sixth proof relates the result to eigenvalues. We can notice that the function $f(X)=\operatorname{Tr}(X^{-1}yy^{T})$ is indeed the sum of the inverses of the eigenvalues, denoted by $\lambda_{i}$ .

[TABLE]

We know that the function that maps a diagonal matrix with strictly positive terms to its kth element (which is one of its eigenvalues, but not necessarily its kth one) is linear, hence both convex and concave. By the composition rules for convex functions, with $g(x)=1/x$ , we can conclude that the inverse of the kth element is convex on diagonal matrices with strictly positive terms. Thus, the sum of the inverses of the eigenvalues (a sum of convex functions) is convex on the set of diagonal matrices with strictly positive terms. We can conclude by using the diagonalisation of positive definite matrices (with $S=UDU^{T}$ , $U$ an orthonormal matrix, $D$ a diagonal matrix with strictly positive terms and $S\in S_{++}^{n}$ ) to extend the convexity property to the whole set $S_{++}^{n}$ , together with $\operatorname{Tr}(AB)=\operatorname{Tr}(BA)$ ∎
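The link with eigenvalues can be made explicit in a short check. With $X=Q\,\mathrm{diag}(\lambda_{i})\,Q^{T}$ , the trace form equals $\sum_{i}(q_{i}^{T}y)^{2}/\lambda_{i}$ , a weighted sum of inverse eigenvalues; when $yy^{T}$ is replaced by the identity, all weights equal one and it reduces to the plain sum $\sum_{i}1/\lambda_{i}$ . The weighted form is stated here as our reading of the argument, and the function name `trace_form_via_eigen` is illustrative.

```python
import numpy as np

def trace_form_via_eigen(x, y):
    """Tr(X^{-1} y y^T) computed from the eigendecomposition X = Q diag(w) Q^T."""
    w, q = np.linalg.eigh(x)
    coeffs = q.T @ y                  # coordinates of y in the eigenbasis of X
    return np.sum(coeffs ** 2 / w)

rng = np.random.default_rng(6)
a = rng.normal(size=(4, 4))
x = a @ a.T + 4 * np.eye(4)           # arbitrary positive definite matrix
y = rng.normal(size=4)

direct = np.trace(np.linalg.inv(x) @ np.outer(y, y))
via_eigen = trace_form_via_eigen(x, y)
# Unweighted case: the trace of the inverse is the sum of inverse eigenvalues.
trace_inv = np.trace(np.linalg.inv(x))
sum_inv_eigs = np.sum(1.0 / np.linalg.eigvalsh(x))
print(np.isclose(direct, via_eigen), np.isclose(trace_inv, sum_inv_eigs))
```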

Bibliography

1. Akimoto, Y., Auger, A., and Hansen, N. (2015). Continuous optimization and CMA-ES. GECCO 2015, Madrid, Spain, 1:313–344.
2. Akimoto, Y., Auger, A., and Hansen, N. (2016). CMA-ES and advanced adaptation mechanisms. GECCO, Denver, 2016:533–562.
3. Akimoto, Y., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Bidirectional relation between cma evolution strategies and natural evolution strategies. PPSN, XI(1):154–163.
4. Auger, A. and Hansen, N. (2009). Benchmarking the (1+1)-CMA-ES on the BBOB-2009 noisy testbed. Companion Material, GECCO 2009:2467–2472.
5. Auger, A. and Hansen, N. (2012). Tutorial CMA-ES: evolution strategies and covariance matrix adaptation. Companion Material Proceedings, 2012(12):827–848.
6. Auger, A., Schoenauer, M., and Vanhaecke, N. (2004). LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. PPSN VIII, 8th International Conference, Birmingham, UK, September 18-22, 2004, Proceedings, 2004(2004):182–191.
7. Benhamou, E., Saltiel, D., Verel, S., and Teytaud, F. (2019). BCMA-ES: A Bayesian approach to CMA-ES. arXiv e-prints, arXiv:1904.01401.
8. Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(2):1–13.