Diagonal Acceleration for Covariance Matrix Adaptation Evolution   Strategies

Youhei Akimoto; Nikolaus Hansen

arXiv:1905.05885·cs.NE·May 16, 2019

Diagonal Acceleration for Covariance Matrix Adaptation Evolution Strategies

Youhei Akimoto, Nikolaus Hansen

PDF

Open Access

TL;DR

This paper introduces dd-CMA, an acceleration technique for CMA-ES that uses adaptive diagonal decoding to improve performance on both separable and non-separable problems, especially in high dimensions.

Contribution

The paper presents a novel diagonal acceleration method for CMA-ES, enhancing its speed and scalability without sacrificing performance on complex, non-separable functions.

Findings

01

dd-CMA-ES outperforms original CMA-ES on ill-conditioned functions

02

Significant speedup observed in high-dimensional optimization

03

Effective even with large population sizes up to dimension squared

Abstract

We introduce an acceleration for covariance matrix adaptation evolution strategies (CMA-ES) by means of adaptive diagonal decoding (dd-CMA). This diagonal acceleration endows the default CMA-ES with the advantages of separable CMA-ES without inheriting its drawbacks. Technically, we introduce a diagonal matrix D that expresses coordinate-wise variances of the sampling distribution in DCD form. The diagonal matrix can learn a rescaling of the problem in the coordinates within linear number of function evaluations. Diagonal decoding can also exploit separability of the problem, but, crucially, does not compromise the performance on non-separable problems. The latter is accomplished by modulating the learning rate for the diagonal matrix based on the condition number of the underlying correlation matrix. dd-CMA-ES not only combines the advantages of default and separable CMA-ES, but may…

Tables1

Table 1. Table 1: Test function definitions and initial conditions. The unit vector 𝒖 𝒖 \bm{u} is either 𝒆 1 = ( 1 , 0 , … , 0 ) subscript 𝒆 1 1 0 … 0 \bm{e}_{1}=(1,0,\dots,0) for separable scenarios, or random vectors drawn uniformly on the unit sphere for non-separable scenarios or for Ell-Cig and Ell-Dis. A vector 𝒛 𝒛 \textstyle z represents an orthogonal transformation 𝐑 𝒙 𝐑 𝒙 \mathbf{R}\mathchoice{\mbox{\boldmath$\displaystyle x$}}{\mbox{\boldmath$\textstyle x$}}{\mbox{\boldmath$\scriptstyle x$}}{\mbox{\boldmath$\scriptscriptstyle x$}} of the input vector 𝒙 𝒙 \textstyle x , where the orthogonal matrix 𝐑 𝐑 \mathbf{R} is the identity matrix for separable scenarios, or is constructed by generating normal random elements and applying the Gram-Schmidt procedure. The diagonal matrix 𝐃 ell = diag ⁡ ( 1 , ⋯ , 10 i − 1 n − 1 , ⋯ , 10 ) subscript 𝐃 ell diag 1 ⋯ superscript 10 𝑖 1 𝑛 1 ⋯ 10 \mathbf{D}_{\mathrm{ell}}=\operatorname{diag}(1,\cdots,10^{\frac{i-1}{n-1}},\cdots,10) represents a coordinate-wise scaling and the vector 𝐲 𝐲 \mathbf{y} is a coordinate-wisely transformed input vector 𝒚 = 𝐃 ell 2 𝒙 𝒚 superscript subscript 𝐃 ell 2 𝒙 \mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}=\mathbf{D}_{\mathrm{ell}}^{2}\mathchoice{\mbox{\boldmath$\displaystyle x$}}{\mbox{\boldmath$\textstyle x$}}{\mbox{\boldmath$\scriptstyle x$}}{\mbox{\boldmath$\scriptscriptstyle x$}} .

	$f (𝒙)$	$𝒎^{(0)}$	$σ^{(0)}$
Sphere	${∥ 𝒙 ∥}^{2}$	$3 \cdot 𝟏$	$1$
Cigar	${⟨ 𝒖, 𝒙 ⟩}^{2} + 10^{6} ({∥ 𝒙 ∥}^{2} - {⟨ 𝒖, 𝒙 ⟩}^{2})$	$3 \cdot 𝟏$	$1$
Discus	$10^{6} {⟨ 𝒖, 𝒙 ⟩}^{2} + ({∥ 𝒙 ∥}^{2} - {⟨ 𝒖, 𝒙 ⟩}^{2})$	$3 \cdot 𝟏$	$1$
Ellipsoid	${∥ 𝐃_{ell}^{3} 𝒛 ∥}^{2}$	$3 \cdot 𝟏$	$1$
TwoAxes	$10^{6} \sum_{i = 1}^{n / 2} {[𝒛]}_{i}^{2} + \sum_{i = n / 2 + 1}^{n} {[𝒛]}_{i}^{2}$	$3 \cdot 𝟏$	$1$
Ell-Cig	$10^{- 4} {⟨ 𝒖, 𝒚 ⟩}^{2} + ({∥ 𝒚 ∥}^{2} - {⟨ 𝒖, 𝒚 ⟩}^{2})$	$3 \cdot 𝟏$	$1$
Ell-Dis	$10^{4} {⟨ 𝒖, 𝒚 ⟩}^{2} + ({∥ 𝒚 ∥}^{2} - {⟨ 𝒖, 𝒚 ⟩}^{2})$	$3 \cdot 𝟏$	$1$
Rosenbrock	$\sum_{i = 1}^{n - 1} 100 {({[𝒛]}_{i}^{2} - {[𝒛]}_{i + 1})}^{2} + {({[𝒛]}_{i} - 1)}^{2}$	$𝟎$	$0.1$
Bohachevsky	$\sum_{i = 1}^{n - 1} ({[𝒛]}_{i}^{2} + 2 {[𝒛]}_{i + 1}^{2} - 0.3 \cos (3 π {[𝒛]}_{i}) \dots$	$𝒩 (𝟎, 8^{2} 𝐈)$	$7$
	$- 0.4 \cos (4 π {[𝒛]}_{i + 1}) + 0.7)$
Rastrigin	$\sum_{i = 1}^{n} {[𝒛]}_{i}^{2} + 10 (1 - \cos (2 π {[𝒛]}_{i}))$	$𝒩 (𝟎, 3^{2} 𝐈)$	$2$

Equations83

\begin{split}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i}&\sim\mathcal{N}(\bm{0},\mathbf{I})\enspace,\\ \mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i}&=\sqrt{\mathbf{C}}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i}\sim\mathcal{N}(\bm{0},\mathbf{C})\enspace,\\ \mathchoice{\mbox{\boldmath$\displaystyle x$}}{\mbox{\boldmath$\textstyle x$}}{\mbox{\boldmath$\scriptstyle x$}}{\mbox{\boldmath$\scriptscriptstyle x$}}_{i}&=\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t)}+\sigma^{(t)}\mathbf{D}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i}\sim\mathcal{N}(\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t)},(\sigma^{(t)})^{2}\mathbf{D}\mathbf{C}\mathbf{D})\enspace.\end{split}

\begin{split}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i}&\sim\mathcal{N}(\bm{0},\mathbf{I})\enspace,\\ \mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i}&=\sqrt{\mathbf{C}}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i}\sim\mathcal{N}(\bm{0},\mathbf{C})\enspace,\\ \mathchoice{\mbox{\boldmath$\displaystyle x$}}{\mbox{\boldmath$\textstyle x$}}{\mbox{\boldmath$\scriptstyle x$}}{\mbox{\boldmath$\scriptscriptstyle x$}}_{i}&=\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t)}+\sigma^{(t)}\mathbf{D}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i}\sim\mathcal{N}(\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t)},(\sigma^{(t)})^{2}\mathbf{D}\mathbf{C}\mathbf{D})\enspace.\end{split}

\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t+1)}=\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t)}+c_{m}\sum_{i=1}^{\mu}w_{i}(\mathchoice{\mbox{\boldmath$\displaystyle x$}}{\mbox{\boldmath$\textstyle x$}}{\mbox{\boldmath$\scriptstyle x$}}{\mbox{\boldmath$\scriptscriptstyle x$}}_{i:\lambda}-\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}})\enspace,

\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t+1)}=\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}}^{(t)}+c_{m}\sum_{i=1}^{\mu}w_{i}(\mathchoice{\mbox{\boldmath$\displaystyle x$}}{\mbox{\boldmath$\textstyle x$}}{\mbox{\boldmath$\scriptstyle x$}}{\mbox{\boldmath$\scriptscriptstyle x$}}_{i:\lambda}-\mathchoice{\mbox{\boldmath$\displaystyle m$}}{\mbox{\boldmath$\textstyle m$}}{\mbox{\boldmath$\scriptstyle m$}}{\mbox{\boldmath$\scriptscriptstyle m$}})\enspace,

\displaystyle\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{\sigma}^{(t+1)}

\displaystyle\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{\sigma}^{(t+1)}

γ_{σ}^{(t + 1)}

\ln\sigma^{(t+1)}=\ln\sigma^{(t)}+\frac{c_{\sigma}}{d_{\sigma}}\left(\frac{\lVert\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{\sigma}^{(t+1)}\rVert}{\chi_{n}}-\sqrt{\gamma_{\sigma}^{(t+1)}}\right)\enspace,

\ln\sigma^{(t+1)}=\ln\sigma^{(t)}+\frac{c_{\sigma}}{d_{\sigma}}\left(\frac{\lVert\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{\sigma}^{(t+1)}\rVert}{\chi_{n}}-\sqrt{\gamma_{\sigma}^{(t+1)}}\right)\enspace,

c_{σ}

c_{σ}

d_{σ}

\mathbf{C}^{(t+1)}=\left(1-c_{1}\gamma_{c}^{(t+1)}-c_{\mu}\sum_{i=1}^{\mu}w_{i}\right)\mathbf{C}^{(t)}\\ +c_{1}\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)^{\mathsf{T}}+c_{\mu}\sum_{i=1}^{\mu}w_{i}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}^{\mathsf{T}}\enspace,

\mathbf{C}^{(t+1)}=\left(1-c_{1}\gamma_{c}^{(t+1)}-c_{\mu}\sum_{i=1}^{\mu}w_{i}\right)\mathbf{C}^{(t)}\\ +c_{1}\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)^{\mathsf{T}}+c_{\mu}\sum_{i=1}^{\mu}w_{i}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}^{\mathsf{T}}\enspace,

\displaystyle\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}

\displaystyle\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}

γ_{c}^{(t + 1)}

h_{\sigma}^{(t+1)}=\begin{cases}1&\text{if }\lVert\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{\sigma}^{(t+1)}\rVert^{2}/\gamma_{\sigma}^{(t+1)}<\big{(}2+\frac{4}{n+1}\big{)}n\\ 0&\text{otherwise}\end{cases}.

h_{\sigma}^{(t+1)}=\begin{cases}1&\text{if }\lVert\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{\sigma}^{(t+1)}\rVert^{2}/\gamma_{\sigma}^{(t+1)}<\big{(}2+\frac{4}{n+1}\big{)}n\\ 0&\text{otherwise}\end{cases}.

[\mathbf{D}^{(t+1)}]_{k,k}^{2}=[\mathbf{D}^{(t)}]_{k,k}^{2}\Bigg{(}1-c_{1}\gamma_{c}^{(t+1)}-c_{\mu}\sum_{i=1}^{\mu}w_{i}\\ +c_{1}[\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}]_{k}^{2}/[\mathbf{D}^{(t)}]_{k,k}^{2}+c_{\mu}\sum_{i=1}^{\mu}w_{i}\left[\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\right]_{k}^{2}\Bigg{)}\enspace.

[\mathbf{D}^{(t+1)}]_{k,k}^{2}=[\mathbf{D}^{(t)}]_{k,k}^{2}\Bigg{(}1-c_{1}\gamma_{c}^{(t+1)}-c_{\mu}\sum_{i=1}^{\mu}w_{i}\\ +c_{1}[\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}]_{k}^{2}/[\mathbf{D}^{(t)}]_{k,k}^{2}+c_{\mu}\sum_{i=1}^{\mu}w_{i}\left[\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\right]_{k}^{2}\Bigg{)}\enspace.

w_{i} = ⎩ ⎨ ⎧ 1/ μ 0 - 1/ μ (i ⩽ μ) (μ < i ⩽ λ - μ) (λ - μ < i)

w_{i} = ⎩ ⎨ ⎧ 1/ μ 0 - 1/ μ (i ⩽ μ) (μ < i ⩽ λ - μ) (λ - μ < i)

\mathbf{C}^{(t+1)}=\left(1-c_{1}\gamma_{c}^{(t+1)}-c_{\mu}\sum_{i=1}^{\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\lambda$}}\,}w_{i}\right)\mathbf{C}^{(t)}\\ +c_{1}\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)^{\mathsf{T}}+c_{\mu}\sum_{i=1}^{\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\lambda$}}\,}w_{i}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}^{\mathsf{T}}\enspace,

\mathbf{C}^{(t+1)}=\left(1-c_{1}\gamma_{c}^{(t+1)}-c_{\mu}\sum_{i=1}^{\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\lambda$}}\,}w_{i}\right)\mathbf{C}^{(t)}\\ +c_{1}\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)\left(\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)^{\mathsf{T}}+c_{\mu}\sum_{i=1}^{\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\lambda$}}\,}w_{i}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}^{\mathsf{T}}\enspace,

C^{(t + 1)} = C^{(t)} exp (Δ) C^{(t)} .

C^{(t + 1)} = C^{(t)} exp (Δ) C^{(t)} .

\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}=\begin{cases}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}&w_{i}\geqslant 0\\ \frac{\sqrt{n}}{\lVert\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\rVert}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}&w_{i}<0,\end{cases}\quad\text{and analogously,}\quad\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}=\begin{cases}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}&w_{i}\geqslant 0\\ \frac{\sqrt{n}}{\lVert\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\rVert}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}&w_{i}<0.\end{cases}

\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}=\begin{cases}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}&w_{i}\geqslant 0\\ \frac{\sqrt{n}}{\lVert\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\rVert}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}&w_{i}<0,\end{cases}\quad\text{and analogously,}\quad\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}=\begin{cases}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}&w_{i}\geqslant 0\\ \frac{\sqrt{n}}{\lVert\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\rVert}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}&w_{i}<0.\end{cases}

\displaystyle\mathbf{C}^{(t+1)}=\mathbf{C}^{(t)}+\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\alpha^{(t)}$}}\,\Bigg{(}

\displaystyle\mathbf{C}^{(t+1)}=\mathbf{C}^{(t)}+\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\alpha^{(t)}$}}\,\Bigg{(}

\displaystyle+c_{\mu}\sum_{i=1}^{\lambda}w_{i}\bigg{[}\,\text{\hbox{\pagecolor{red!15!white}$\displaystyle\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle y$}}{\mbox{\boldmath$\textstyle y$}}{\mbox{\boldmath$\scriptstyle y$}}{\mbox{\boldmath$\scriptscriptstyle y$}}_{i:\lambda}^{\mathsf{T}}$}}\,-\mathbf{C}^{(t)}\bigg{]}\Bigg{)}\enspace,

\mathbf{C}^{(t+1)}=\sqrt{\mathbf{C}}\Bigg{[}\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)}\Bigg{]}\sqrt{\mathbf{C}}

\mathbf{C}^{(t+1)}=\sqrt{\mathbf{C}}\Bigg{[}\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)}\Bigg{]}\sqrt{\mathbf{C}}

\mathbf{Z}^{(t)}=c_{1}\left[\left(\sqrt{\mathbf{C}}^{-1}\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)\left(\sqrt{\mathbf{C}}^{-1}\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)^{\mathsf{T}}-\gamma_{c}^{(t+1)}\mathbf{I}\right]\\ +c_{\mu}\sum_{i=1}^{\lambda}w_{i}\bigg{[}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}^{\mathsf{T}}-\mathbf{I}\bigg{]}\enspace.

\mathbf{Z}^{(t)}=c_{1}\left[\left(\sqrt{\mathbf{C}}^{-1}\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)\left(\sqrt{\mathbf{C}}^{-1}\mathbf{D}^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c}^{(t+1)}\right)^{\mathsf{T}}-\gamma_{c}^{(t+1)}\mathbf{I}\right]\\ +c_{\mu}\sum_{i=1}^{\lambda}w_{i}\bigg{[}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}^{\mathsf{T}}-\mathbf{I}\bigg{]}\enspace.

d_{1}(\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)})\mathbf{C}\succcurlyeq\sqrt{\mathbf{C}}\big{[}\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)}\big{]}\sqrt{\mathbf{C}}\succcurlyeq d_{n}(\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)})\mathbf{C}\enspace,

d_{1}(\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)})\mathbf{C}\succcurlyeq\sqrt{\mathbf{C}}\big{[}\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)}\big{]}\sqrt{\mathbf{C}}\succcurlyeq d_{n}(\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)})\mathbf{C}\enspace,

\mathbf{C}^{(t+t_{\mathrm{eig}})}=\sqrt{\mathbf{C}}\Bigg{[}\mathbf{I}+\alpha^{(t+t_{\mathrm{eig}})}\sum_{k=t}^{t+t_{\mathrm{eig}}-1}\mathbf{Z}^{(k)}\Bigg{]}\sqrt{\mathbf{C}}\enspace.

\mathbf{C}^{(t+t_{\mathrm{eig}})}=\sqrt{\mathbf{C}}\Bigg{[}\mathbf{I}+\alpha^{(t+t_{\mathrm{eig}})}\sum_{k=t}^{t+t_{\mathrm{eig}}-1}\mathbf{Z}^{(k)}\Bigg{]}\sqrt{\mathbf{C}}\enspace.

\alpha^{(t+t_{\mathrm{eig}})}=\min\left(\frac{0.75}{\left\lvert d_{n}\big{(}\sum_{k=t}^{t+t_{\mathrm{eig}}-1}\mathbf{Z}^{(k)}\big{)}\right\rvert},\ 1\right)\enspace.

\alpha^{(t+t_{\mathrm{eig}})}=\min\left(\frac{0.75}{\left\lvert d_{n}\big{(}\sum_{k=t}^{t+t_{\mathrm{eig}}-1}\mathbf{Z}^{(k)}\big{)}\right\rvert},\ 1\right)\enspace.

C^{(t + 1)}

C^{(t + 1)}

\displaystyle=\sqrt{\mathbf{C}}\left[\left(1-c_{1}-c_{\mu}\sum_{i=1}^{\lambda}w_{i}\right)\mathbf{I}-c_{\mu}\sum_{i:\ w_{i}<0}n\lvert w_{i}\rvert\frac{\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}^{\mathsf{T}}}{\lVert\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\rVert^{2}}\right]\sqrt{\mathbf{C}}

≽ C [(1 - c_{1} - c_{μ} i = 1 \sum λ w_{i} - n c_{μ} i : w_{i} < 0 \sum ∣ w_{i} ∣) I] C

= (1 - c_{1} - c_{μ} i = 1 \sum λ w_{i} - n c_{μ} i : w_{i} < 0 \sum ∣ w_{i} ∣) C^{(t)} .

α = \frac{1/ t _{eig} - ( c _{1} + c _{μ} )}{n c _{μ} \sum _{i : w_{i} < 0} ∣ w _{i}^{'} ∣} .

α = \frac{1/ t _{eig} - ( c _{1} + c _{μ} )}{n c _{μ} \sum _{i : w_{i} < 0} ∣ w _{i}^{'} ∣} .

w_{i} \propto - E [N_{i : λ}] for all i = 1, \dots, λ,

w_{i} \propto - E [N_{i : λ}] for all i = 1, \dots, λ,

w_{i}^{'} = ln \frac{λ + 1}{2} - ln i for i = 1, \dots, λ .

w_{i}^{'} = ln \frac{λ + 1}{2} - ln i for i = 1, \dots, λ .

μ_{w} = \frac{( \sum _{i : w_{i}^{'} > 0} ∣ w _{i}^{'} ∣ ) ^{2}}{\sum _{i : w_{i}^{'} > 0} ∣ w _{i}^{'} ∣ ^{2}} and μ_{w}^{-} = \frac{( \sum _{i : w_{i}^{'} < 0} ∣ w _{i}^{'} ∣ ) ^{2}}{\sum _{i : w_{i}^{'} < 0} ∣ w _{i}^{'} ∣ ^{2}} .

μ_{w} = \frac{( \sum _{i : w_{i}^{'} > 0} ∣ w _{i}^{'} ∣ ) ^{2}}{\sum _{i : w_{i}^{'} > 0} ∣ w _{i}^{'} ∣ ^{2}} and μ_{w}^{-} = \frac{( \sum _{i : w_{i}^{'} < 0} ∣ w _{i}^{'} ∣ ) ^{2}}{\sum _{i : w_{i}^{'} < 0} ∣ w _{i}^{'} ∣ ^{2}} .

w_{i} = ⎩ ⎨ ⎧ \frac{w _{i}^{'}}{\sum _{j : w_{j}^{'} > 0} ∣ w _{j}^{'} ∣} \frac{w _{i}^{'}}{\sum _{j : w_{j}^{'} < 0} ∣ w _{j}^{'} ∣} \times min (1 + \frac{c _{1}}{c _{μ}}, 1 + \frac{2 μ _{w}^{-}}{μ _{w} + 2}) for i with w_{i}^{'} ⩾ 0, for i with w_{i}^{'} < 0.

w_{i} = ⎩ ⎨ ⎧ \frac{w _{i}^{'}}{\sum _{j : w_{j}^{'} > 0} ∣ w _{j}^{'} ∣} \frac{w _{i}^{'}}{\sum _{j : w_{j}^{'} < 0} ∣ w _{j}^{'} ∣} \times min (1 + \frac{c _{1}}{c _{μ}}, 1 + \frac{2 μ _{w}^{-}}{μ _{w} + 2}) for i with w_{i}^{'} ⩾ 0, for i with w_{i}^{'} < 0.

\displaystyle[\Delta_{D}]_{k,k}=c_{1,D}\big{(}\big{[}\sqrt{\mathbf{C}}^{-1}(\mathbf{D}^{(t)})^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c,D}\big{]}_{k}^{2}-\gamma_{c,D}\big{)}+c_{\mu,D}\sum_{i=0}^{\lambda}w_{i,D}\big{(}\big{[}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\big{]}_{k}^{2}-1\big{)}

\displaystyle[\Delta_{D}]_{k,k}=c_{1,D}\big{(}\big{[}\sqrt{\mathbf{C}}^{-1}(\mathbf{D}^{(t)})^{-1}\mathchoice{\mbox{\boldmath$\displaystyle p$}}{\mbox{\boldmath$\textstyle p$}}{\mbox{\boldmath$\scriptstyle p$}}{\mbox{\boldmath$\scriptscriptstyle p$}}_{c,D}\big{]}_{k}^{2}-\gamma_{c,D}\big{)}+c_{\mu,D}\sum_{i=0}^{\lambda}w_{i,D}\big{(}\big{[}\tilde{}\mathchoice{\mbox{\boldmath$\displaystyle z$}}{\mbox{\boldmath$\textstyle z$}}{\mbox{\boldmath$\scriptstyle z$}}{\mbox{\boldmath$\scriptscriptstyle z$}}_{i:\lambda}\big{]}_{k}^{2}-1\big{)}

D^{(t + 1)} \leftarrow D^{(t)} \cdot exp (\frac{Δ _{D}}{2 β ^{(t)}}),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBlind Source Separation Techniques · Neural Networks and Applications · Metaheuristic Optimization Algorithms Research

Full text

Diagonal Acceleration for Covariance Matrix Adaptation Evolution Strategies

\nameY. Akimoto \[email protected]

\addrUniversity of Tsukuba, 1-1-1 Tennodai, Tsukuba, JAPAN \AND\nameN. Hansen \[email protected]

\addrInria, RandOpt Team, CMAP, Ecole Polytechnique, Palaiseau, FRANCE

Abstract

We introduce an acceleration for covariance matrix adaptation evolution strategies (CMA-ES) by means of adaptive diagonal decoding (dd-CMA). This diagonal acceleration endows the default CMA-ES with the advantages of separable CMA-ES without inheriting its drawbacks. Technically, we introduce a diagonal matrix $D$ that expresses coordinate-wise variances of the sampling distribution in $DCD$ form. The diagonal matrix can learn a rescaling of the problem in the coordinates within linear number of function evaluations. Diagonal decoding can also exploit separability of the problem, but, crucially, does not compromise the performance on non-separable problems. The latter is accomplished by modulating the learning rate for the diagonal matrix based on the condition number of the underlying correlation matrix. dd-CMA-ES not only combines the advantages of default and separable CMA-ES, but may achieve overadditive speedup: it improves the performance, and even the scaling, of the better of default and separable CMA-ES on classes of non-separable test functions that reflect, arguably, a landscape feature commonly observed in practice.

The paper makes two further secondary contributions: we introduce two different approaches to guarantee positive definiteness of the covariance matrix with active CMA, which is valuable in particular with large population size; we revise the default parameter setting in CMA-ES, proposing accelerated settings in particular for large dimension.

All our contributions can be viewed as independent improvements of CMA-ES, yet they are also complementary and can be seamlessly combined. In numerical experiments with dd-CMA-ES up to dimension 5120, we observe remarkable improvements over the original covariance matrix adaptation on functions with coordinate-wise ill-conditioning. The improvement is observed also for large population sizes up to about dimension squared.

Keywords

Evolution strategies, covariance matrix adaptation, adaptive diagonal decoding, active covariance matrix update, default strategy parameters.

1 Introduction

In real world applications of continuous optimization involving simulations such as physics or chemical simulations, the input-output relation between a candidate solution and its objective function value is barely expressible in explicit mathematical formula. The objective function value is computed through a complex simulation with a candidate solution as an input. In such scenarios, we gain the information of the problem only through the evaluation of the objective function value of a given candidate solution. A continuous optimization of $f:\mathbb{R}^{n}\to\mathbb{R}$ is referred to as black-box continuous optimization if we gain the information of the problem only through the evaluation $\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}\mapsto f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}})$ of a given candidate solution $\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}\in\mathbb{R}^{n}$ .

Black-box continuous optimization arises widely in real world applications such as model parameter calibration and design of robot controller. It often involves computationally expensive simulation to evaluate the quality of candidate solutions. The search cost of black-box continuous optimization is therefore the number of simulations, i.e., the number of objective function calls, and a search algorithm is desired to locate good quality solutions with as few $f$ -calls as possible. Practitioners need to choose one or a few search algorithms to solve their problems and tune their hyper-parameters based on the prior knowledge into their problems. However, prior knowledge is often limited in the black-box situation due to the black-box relation between $\textstyle x$ and $f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}})$ . Hence, algorithm selection, as well as hyper-parameter tuning, is a tedious task for practitioners who are typically not experts in search algorithms.

Covariance Matrix Adaptation Evolution Strategy (CMA-ES), developed by Hansen and Ostermeier, (2001); Hansen et al., (2003); Hansen and Kern, (2004); Jastrebski and Arnold, (2006), is recognized as a state-of-the-art derivative-free search algorithm for difficult continuous optimization problems (Rios and Sahinidis,, 2013). CMA-ES is a stochastic and comparison-based search algorithm that maintains the multivariate normal distribution as a sampling distribution of candidate solutions. The distribution parameters such as the mean vector and the covariance matrix are updated at each iteration based on the candidate solutions and their objective value ranking, so that the sampling distribution will produce promising candidate solutions more likely in the next algorithmic iteration. The update of the distribution parameters is partially found as the natural gradient ascent on the manifold of the distribution parameter space equipped with the Fisher metric (Akimoto et al.,, 2010; Glasmachers et al.,, 2010; Ollivier et al.,, 2017), thereby revealing the connection to natural evolution strategies Wierstra et al., (2008); Sun et al., (2009); Glasmachers et al., (2010); Wierstra et al., (2014), whose parameter update is derived explicitly from the natural gradient principle.

Invariance (Hansen,, 2000; Hansen et al.,, 2011) is one of the governing principles of the design of CMA-ES and the essence of its success on difficult continuous optimization problems consisting of ruggedness, ill-conditioning, and non-separability. CMA-ES exhibits several invariance properties such as invariance to order preserving transformation of the objective function, invariance to translation, rotation and coordinate-wise scaling of the search space (Hansen and Auger,, 2014). Invariance guarantees the algorithm to work identically on an original problem and its transformed version. Thanks to its invariance properties, CMA-ES works, after an adaptation phase, equally well on separable and well-conditioned functions, which are easy for most search algorithms, and on non-separable and ill-conditioned functions produced by an affine coordinate transformation of the former, which are considered difficult for many other search algorithms. This also contributes to allow default hyper-parameter values to depend solely on the search space dimension and the population size, whereas many other search algorithms need tuning depending on problem difficulties to make the algorithms efficient (Karafotias et al.,, 2015).

On the other hand, exploiting problem structure, if possible, is beneficial for optimization speed. Different variants of CMA-ES have been proposed to exploit problem structure such as separability and limited variable dependency. They aim to achieve a better scaling with the dimension (Knight and Lunacek,, 2007; Ros and Hansen,, 2008; Akimoto et al.,, 2014; Akimoto and Hansen,, 2016; Loshchilov,, 2017). However, they lose some of the invariance properties that CMA-ES has and compromise the performance on problems where their specific, more or less restrictive assumptions on the problem structure do not hold. For instance, the separable CMA-ES (Ros and Hansen,, 2008) reduces the degrees of freedom of the covariance matrix from $(n^{2}+n)/2$ to $n$ by adapting only the diagonal elements of the covariance matrix. It scales better in terms of internal time and space complexity and in terms of number of function evaluations to adapt the coordinate-wise variance of the search distribution. Good results are hence observed on functions with weak variable dependencies. However, unsurprisingly, the convergence speed of the separable CMA is significantly deteriorated on non-separable and ill-conditioned functions, where the shape of the level sets of the objective function can not be reasonably well approximated by the equi-probability ellipsoid defined by the normal distribution with diagonal (co)variance matrix.

In this paper, we aim to improve the performance of CMA-ES on a relatively wide class of problems by exploiting problem structure, however crucially, without compromising the performance on more difficult problems without this structure.111Any covariance matrix, $\bm{\Sigma}$ , can be uniquely decomposed into $\bm{\Sigma}=\mathbf{D}\mathbf{C}\mathbf{D}$ , where $\mathbf{D}$ is a diagonal matrix and $\mathbf{C}$ is a correlation matrix. The addressed problem class can be characterized in that for the best problem approximation $\bm{\Sigma}=\mathbf{D}\mathbf{C}\mathbf{D}$ both matrices, $\mathbf{C}$ and $\mathbf{D}$ , have non-negligible condition number, say no less than $100$ .

The first mechanism we are concerned with is the so-called active covariance matrix update, which was originally proposed for the $(\mu,\lambda)$ -CMA-ES with intermediate recombination (Jastrebski and Arnold,, 2006), and later incorporated into the $(1+1)$ -CMA-ES (Arnold and Hansen,, 2010). It utilizes unpromising solutions to actively decrease the eigenvalues of the covariance matrix. The active update consistently improves the adaptation speed of the covariance matrix in particular on functions where a low dimensional subspace dominates the function value. The positive definiteness of the covariance matrix is however not guaranteed when the active update is utilized. Practically, a small enough learning rate of the covariance matrix is sufficient to keep the covariance matrix positive definite with overwhelming probability, however, we would like to increase the learning rate when the population size is large. We propose two novel schemes that guarantee the positive definiteness of the covariance matrix, so that we take advantage of the active update even when a large population size is desired, e.g., when the objective function evaluations are distributed on many CPUs or when the objective function is rugged.

The main contribution of this paper is the diagonal acceleration of CMA by means of adaptive diagonal decoding, referred to as dd-CMA. We introduce a coordinate-wise variance matrix, $\mathbf{D}^{2}$ , of the sampling distribution alongside the positive definite symmetric matrix $\mathbf{C}$ , such that the resulting covariance matrix of the sampling distribution is represented by $\mathbf{D}\mathbf{C}\mathbf{D}$ . We call $\mathbf{D}$ the diagonal decoding matrix. We update $\mathbf{C}$ with the original CMA, whereas $\mathbf{D}$ is updated similarly to separable CMA. An important point is that we want to update $\mathbf{D}$ faster than $\mathbf{C}$ , by setting higher learning rates for the $\mathbf{D}$ update. However, when $\mathbf{C}$ contains non-zero covariances, the update of $\mathbf{D}$ can result in a drastic change of the sampling distribution and disturb the adaptation of $\mathbf{C}$ . To resolve this issue, we introduce an adaptive damping mechanism for the $\mathbf{D}$ update, so that the difference (e.g., Kullback-Leibler divergence) between the distributions before and after the update remains sufficiently small. With this damping, $\mathbf{D}$ is updated as fast as by separable CMA on a separable function if the correlation matrix of the sampling distribution is close to the identity, and it suppresses the $\mathbf{D}$ update when the correlation matrix is away from the identity, i.e., variable dependencies have been learned.

The update of $\mathbf{D}$ breaks the rotation invariance of the original CMA, hence we lose a mathematical guarantee of the original CMA that it performs identical on functions in an equivalence group defined by this invariance property. The ultimate aim of this work is to gain significant speed-up in some situations and to preserve the performance of the original CMA in the worst case. Functions where we expect that the diagonally accelerated CMA outperforms the original one have variables with different sensitivities, i.e., coordinate-wise ill-conditioning. Such functions may often appear in practice, since variables in a black-box continuous optimization problem can have distinct meanings. Diagonal acceleration however can even be superior to the optimal additive portfolio of the original CMA and the separable CMA. We demonstrate in numerical experiments that dd-CMA outperforms the original CMA not only on separable functions but also on non-separable ill-conditioned functions with coordinate-wise ill-conditioning that separable CMA can not efficiently solve.

The last contribution is a set of improved and simplified default parameter settings for the covariance matrix adaptation and for the cumulation factor for the so-called evolution path. These learning rates, whose default values have been previously expressed with independent formulas, are reformulated so that their dependencies are clearer. The new default learning rates also improve the adaptation speed of the covariance matrix on high dimensional problems without compromising stability.

The rest of this paper is organized as follows. We introduce the original and separable CMA-ES in Section 2. The active update of the covariance matrix with positive definiteness guarantee is proposed in Section 3. The adaptive diagonal decoding mechanism is introduced in Section 4. Section 5 is devoted to explain the renewed default hyper-parameter values for the covariance matrix adaptation and provides an algorithm summary of dd-CMA-ES and a link to publicly available Python code. Numerical experiments are conducted in Section 6 to see how effective each component of CMA with diagonal acceleration works in different situations. We conclude the paper in Section 7.

2 Evolution Strategy with Covariance Matrix Adaptation

We summon up the ( $\mu_{w}$ , $\lambda$ )-CMA-ES consisting of weighted recombination, cumulative step-size adaptation, and rank-one and rank- $\mu$ covariance matrix adaptation.

The CMA-ES maintains the multivariate normal distribution, $\mathcal{N}(\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}},\sigma^{2}\mathbf{D}\mathbf{C}\mathbf{D})$ , where $\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}}\in\mathbb{R}^{n}$ is the mean vector that represents the center of the search distribution, $\sigma\in\mathbb{R}_{+}$ is the so-called step-size that represents the scaling factor of the distribution spread, and $\mathbf{C}\in\mathbb{R}^{n\times n}$ is a positive definite symmetric matrix that represents the shape of the distribution ellipsoid. Though the covariance matrix of the sampling distribution is $\sigma^{2}\mathbf{D}\mathbf{C}\mathbf{D}$ , we often call $\mathbf{C}$ the covariance matrix in the context of CMA-ES. In this paper, we apply this terminology, and we will call $\sigma^{2}\mathbf{D}\mathbf{C}\mathbf{D}$ the covariance matrix of the sampling distribution to distinguish them when necessary. The positive definite diagonal matrix $\mathbf{D}\in\mathbb{R}^{n\times n}$ is regarded as a diagonal decoding matrix, which represents the scaling factor of each design variable. It is fixed during the optimization, usually $\mathbf{D}=\mathbf{I}$ , and does not appear in the standard terminology. However, it plays an important role in this paper, and we define the CMA-ES with $\mathbf{D}$ for the later modification.

2.1 Original CMA-ES

The CMA-ES repeats the following steps until it meets one or more termination criteria:

Sample $\lambda$ candidate solutions, $\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i}$ , independently from $\mathcal{N}(\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}},\sigma^{2}\mathbf{D}\mathbf{C}\mathbf{D})$ ; 2. 2.

Evaluate candidate solutions on the objective, $f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i})$ , and sort them in ascending order222Ties, $f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i:\lambda})=\cdots=f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i+k-1:\lambda})$ , are treated by redistributing the averaged recombination weights $\sum_{l=0}^{k-1}w_{i+l}/k$ to tied solutions $\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i:\lambda},\dots,\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i+k-1:\lambda}$ ., $f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{1:\lambda})\leq\cdots\leq f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{\lambda:\lambda})$ , where $i:\lambda$ is the index of the $i$ -th best candidate solution among $\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{1},\dots,\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{\lambda}$ ; 3. 3.

Update the distribution parameters, $\textstyle m$ , $\sigma$ , and $\mathbf{C}$ .

Sampling and Recombination

To generate candidate solutions, we compute the (unique, symmetric) square root of the covariance matrix $\sqrt{\mathbf{C}}$ obeying $\mathbf{C}^{(t)}=\big{(}\sqrt{\mathbf{C}}\big{)}^{2}$ . The candidate solutions are the affine transformation of independent and standard normally distributed random vectors,

[TABLE]

The default population size is $\lambda=4+\lfloor 3\ln n\rfloor$ . To reduce the time complexity per $f$ -call without compromising the performance, we compute the matrix decomposition $\mathbf{C}^{(t)}=\big{(}\sqrt{\mathbf{C}}\big{)}^{2}$ every $t_{\text{eig}}=\max\big{(}1,\big{\lfloor}(\beta_{\mathrm{eig}}(c_{1}+c_{\mu}))^{-1}\big{\rfloor}\big{)}$ iteration, where $c_{1}$ and $c_{\mu}$ are the learning rates for the covariance matrix adaptation that appear later, and $\beta_{\mathrm{eig}}=10n$ . If the learning rates are small enough ( $c_{1}+c_{\mu}\leqslant(2\beta_{\mathrm{eig}})^{-1}$ ), the covariance matrix is regarded as insignificantly changing in each iteration and we stall the decomposition.

The mean vector $\textstyle m$ is updated by taking the weighted average of the promising candidate directions,

[TABLE]

where $c_{m}$ is the learning rate for the mean vector update, usually $c_{m}=1$ . The number of promising candidate solutions are denoted by $\mu$ , and $\big{(}w_{i}\big{)}_{i=1}^{\mu}$ are recombination weights satisfying $w_{i}>0$ for $i\leqslant\mu$ . A standard choice is $w_{i}\propto\ln\frac{\lambda+1}{2}-\ln i$ for $i=1,\dots,\mu$ and $\mu=\lfloor\lambda/2\rfloor$ and $\sum_{i=1}^{\mu}w_{i}=1$ .

Step-Size Adaptation

The cumulative step-size adaptation (CSA) updates the step-size $\sigma$ based on the length of the evolution path that accumulates the mean shift normalized by the current distribution covariance, i.e.,333When $\mathbf{D}$ is not the identity, (3) is not exactly equivalent to the original CSA (Hansen and Ostermeier,, 2001): $\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}_{i:\lambda}=(\mathbf{D}\sqrt{\mathbf{C}})^{-1}(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i:\lambda}-\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}})/\sigma$ in this paper whereas originally $\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}_{i:\lambda}=\sqrt{\mathbf{D}\mathbf{C}\mathbf{D}}^{-1}(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i:\lambda}-\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}})/\sigma$ . This difference results in rotating the second term of the RHS of (3) at each iteration with a different orthogonal matrix, and ends up in a different $\lVert\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{\sigma}\rVert$ . Krause et al., (2016) have theoretically studied the effect of this difference and argued that this systematic difference becomes small if the parameterization of the covariance matrix of the sampling distribution is unique and it converges up to scale. If $\mathbf{D}$ is fixed, the parameterization is unique. Later in this paper, we update both $\mathbf{D}$ and $\mathbf{C}$ but we force the parameterization to be unique by (34) and (35). Hence the systematic difference is expected to be small. See Krause et al., (2016) for details.

[TABLE]

where $\mu_{w}=1/\sum_{i=1}^{\mu}w_{i}^{2}$ is the so-called variance effective selection mass, and $c_{\sigma}$ is the inverse time horizon for the $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{\sigma}$ update, aka cumulation factor. The scalar $\gamma_{\sigma}^{(t+1)}$ is a correction factor for small $t$ and converges to $1$ , where $\gamma_{\sigma}^{(0)}=\|\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{\sigma}^{(0)}\|^{2}/n=0$ .444An elegant alternative to introducing $\gamma_{\sigma}$ is to use $c_{\sigma}^{(t)}=\max(c_{\sigma},1/t)$ in place of $c_{\sigma}$ in (3), assuming the first $t=1$ . This resembles a simple average while $t\leq 1/c_{\sigma}$ and only afterwards discounts older information by the original decay of $1-c_{\sigma}$ .

The log step-size is updated as

[TABLE]

where $d_{\sigma}$ is the damping factor for the $\sigma$ update and $\chi_{n}=\sqrt{n}\big{(}1-\frac{1}{4n}+\frac{1}{21n^{2}}\big{)}$ is an approximation of the expected value of the norm of an $n$ -dimensional standard normally distributed random vector, $\sqrt{2}\Gamma\big{(}\frac{n+1}{2}\big{)}/\Gamma\big{(}\frac{n}{2}\big{)}$ . The default values for $c_{\sigma}$ and $d_{\sigma}$ are

[TABLE]

The damping parameter $d_{\sigma}$ is introduced to stabilize the step-size adaptation when the population size is large (Hansen and Kern,, 2004). When the step-size becomes too small by accident or is initialized so, the norm of the evolution path will become $O\big{(}\sqrt{\mu_{w}/n}\big{)}$ , which results in a quick increase of $\sigma$ if $d_{\sigma}=1$ (Akimoto et al.,, 2008). For large $\mu_{w}$ , the chosen damping factor $d_{\sigma}$ prevents an unreasonable increase of $\sigma$ at the price of a reduced convergence speed. In case of $\mu_{w}\gg n$ , the covariance matrix adaptation is the main component decreasing the overall variance of the sampling distribution, while the CSA is still effective to increase $\sigma$ when necessary.

Covariance Matrix Adaptation

The covariance matrix $\mathbf{C}$ is updated by the following formula that combines rank-one and rank- $\mu$ update

[TABLE]

where $c_{1}$ and $c_{\mu}$ are the learning rates for the rank-one update (2nd term) and the rank- $\mu$ update (3rd term), respectively, $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}$ is the evolution path that accumulates the successive mean movements and $\gamma_{c}$ is the correction factor for the rank-one update, which are updated as

[TABLE]

where $c_{c}$ is the inverse time horizon for the $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}$ update. The Heaviside function $h_{\sigma}^{(t+1)}$ is introduced to stall the update of $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}$ if $\lVert\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{\sigma}\rVert$ is large, i.e., when the step-size is rapidly increasing. It is defined as

[TABLE]

The default parameters for $c_{1}$ , $c_{\mu}$ , and $c_{c}$ are smaller than one and presented later.

2.2 Separable Covariance Matrix Adaptation

The separable covariance matrix adaptation (sep-CMA, Ros and Hansen, (2008)) adapts only the coordinate-wise variance of the sampling distribution, i.e., the diagonal elements of the covariance matrix in the same way as (8), but with larger learning rates. In our notation scheme, we keep $\mathbf{C}$ to be the identity and describe sep-CMA by updating $\mathbf{D}$ . The update of coordinate $k$ follows

[TABLE]

The learning rates $c_{1}$ and $c_{\mu}$ are set differently from those used for the $\mathbf{C}$ update.

One advantage of the separable CMA is that all operations can be done in linear time per $f$ -call. Therefore, it is promising if $f$ -calls are cheap. The other advantage is that one can set the learning rate greater than those used for the $\mathbf{C}$ update, since the degrees of freedom of the covariance matrix of the sampling distribution is $n$ , rather than $n(n+1)/2$ . The adaptation speed of the covariance matrix is faster than for the original CMA. However, if the problem is non-separable and has strong variable dependencies, adapting the coordinate-wise scaling is not enough to make the search efficient. More concisely, if the inverse Hessian of the objective function, $\text{Hess}(f)^{-1}$ , is not well approximated by a diagonal matrix, the convergence speed will be $O(1/\operatorname{Cond}(\mathbf{D}^{2}\text{Hess}(f)))$ , which is empirically observed on convex quadratic functions as well as theoretically deduced in Akimoto et al., (2018). In practice, it is rarely known in advance whether the separable CMA is appropriate or not. Ros and Hansen, (2008) propose to use the separable CMA for hundred times dimension function evaluations and then switch to the original CMA afterwards. Such an algorithm has been benchmarked in Ros, (2009), where the first $1+100n/\sqrt{\lambda}$ iterations are used for the separable CMA.

3 Active covariance matrix update with guarantee of positive definiteness

Active covariance matrix adaptation (referred to as Active-CMA, Jastrebski and Arnold,, 2006) utilizes the information of unpromising solutions, $\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}}_{i:\lambda}$ for $i>\mu$ , to update the covariance matrix. The modification is rather simple. We prepare $\lambda$ recombination weights $(w_{i})_{i=1}^{\lambda}$ for the active covariance matrix update, where $w_{i}$ are not anymore restricted to be positive. For example, the recombination weights used in Jastrebski and Arnold, (2006) are

[TABLE]

The update formula (8) is then replaced with

[TABLE]

where shaded areas depict the difference to (8). The only difference is that we use all $\lambda$ candidate solutions to update the covariance matrix with positive and negative recombination weights.555As of 2018, many implementations of CMA-ES feature the active update of $\mathbf{C}$ and it should be considered as default.

Each component of the covariance matrix adaptation, rank-one update, rank- $\mu$ update, and active update, produces complementary effects. The rank-one update of the covariance matrix (the second term on the RHS of (14), Hansen and Ostermeier,, 2001) accumulates the successive steps of the distribution mean and increases the eigenvalues of the covariance matrix in the direction of the successive movements. It excels at learning one long axis of the covariance matrix, and is effective on functions with a subspace of a relatively small dimension where the function value is less sensitive than in its orthogonal subspace. Figure 1(a) visualizes an example of such function. See also Figure 7 in Hansen and Auger, (2014) for numerical results. On the other hand, since the update is always of rank one, the learning rate $c_{1}$ needs to be sufficiently small to keep the covariance matrix regular and stable.

The rank- $\mu$ update (the third term on the RHS of (14) with positive $w_{i}$ , Hansen et al.,, 2003) utilizes the information of $\mu$ successful candidate solutions in a different way than the rank-one update. It computes the empirical covariance matrix of successful mutation steps $\mathchoice{\mbox{\boldmath$ \displaystyle y $}}{\mbox{\boldmath$ \textstyle y $}}{\mbox{\boldmath$ \scriptstyle y $}}{\mbox{\boldmath$ \scriptscriptstyle y $}}_{i}$ . The update matrix is with probability one of rank $\min(n,\mu)$ , allowing to set a relatively large learning rate $c_{\mu}$ . It reduces the number of iterations to adapt the covariance matrix when $\lambda\gg 1$ .

Both, rank-one and rank- $\mu$ update, try to increase the variances in the subspace of successful mutation steps. The eigenvalues corresponding to unsuccessful mutation steps only passively fade out. The active update actively decreases such eigenvalues. It consistently accelerates covariance matrix adaptation, and the improvement is particularly pronounced on functions with a small number of dominating eigenvalues of the Hessian of the objective function. Figure 1(c) is an example of such function.

A disadvantage of the active update with negative recombination weights such as (13) is to have no guarantee of the positive definiteness of the covariance matrix anymore. It is easy to see that the rank-one and rank- $\mu$ update of CMA in (8) guarantee that the minimal eigenvalue of $\mathbf{C}^{(t+1)}$ is no smaller than the minimal eigenvalue of $\mathbf{C}^{(t)}$ times $1-c_{1}-c_{\mu}\sum_{i=1}^{\mu}w_{i}$ , since the second and third terms only increase the eigenvalues. However, the introduction of negative recombination weights can violate the positive definiteness because the third term may decrease the minimum eigenvalue arbitrarily. Jastrebski and Arnold, (2006) set a sufficiently small learning rate for the active update, i.e., the absolute values of the negative recombination weights sum up to a smaller value than one. It will prevent the covariance matrix from being non-positive with high probability, but it does not guarantee positive definiteness. Moreover, it becomes ineffective when the population size is relatively large and a greater learning rate is desired.

Krause and Glasmachers, (2015) apply the active update with positive definiteness guarantee by introducing the exponential covariance matrix update, called xCMA. Instead of updating the covariance matrix in an additive way as in (14), the covariance matrix is updated as

[TABLE]

where $\Delta$ is a symmetric matrix. Since the eigenvalues of the matrix exponential are $e^{\delta_{i}}$ where $\delta_{i}$ are the eigenvalues of $\Delta$ , the positive definiteness is naturally guaranteed. Arnold and Hansen, (2010) achieve the positive definiteness guarantee in the $(1+1)$ -CMA-ES by rescaling the negative recombination weights depending on the norm of unsuccessful mutation steps $\lVert\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}\rVert$ . In this paper we introduce two strategies that are both considered as generalization of this idea to the $(\mu_{w},\lambda)$ -CMA-ES.666There are variants of CMA-ES that update a factored matrix $\mathbf{A}$ satisfying $\mathbf{C}=\mathbf{A}\mathbf{A}^{\mathsf{T}}$ (e.g., eNES by Sun et al.,, 2009). No matter how $\mathbf{A}$ is updated, the positive semi-definiteness of $\mathbf{C}$ is guaranteed since $\mathbf{A}\mathbf{A}^{\mathsf{T}}$ is always positive semi-definite. However this approach has a potential drawback that a negative update may end up increasing a variance. To see this, consider the case $\mathbf{A}\leftarrow\mathbf{A}(\mathbf{I}-\eta\bm{e}\bm{e}^{\mathsf{T}})$ , where $\bm{e}$ is some vector and $\eta>0$ represents the learning rate times the recombination weight. Then, the covariance matrix follows $\mathbf{C}\leftarrow\mathbf{A}(\mathbf{I}-\eta\bm{e}\bm{e}^{\mathsf{T}})^{2}\mathbf{A}^{\mathsf{T}}$ . This update shrinks a variance if $\eta$ is sufficiently small ( $\eta<1/\lVert\bm{e}\rVert^{2}$ ), however, it increases the variance if $\eta$ is large ( $\eta>1/\lVert\bm{e}\rVert^{2}$ ). Hence, a negative update with a long vector $\bm{e}$ and/or a large $\eta$ will cause an opposite effect. Therefore, the factored matrix update is not a conclusive solution to the positive definiteness issue.

3.1 Method 1: Scaling Down the Update Factor

To guarantee the positive definiteness of the covariance matrix, we rescale the update factor of the covariance matrix so that the changes of the minimum eigenvalue of the covariance matrix is bounded. We start by introducing the rescaling of unpromising solutions,

[TABLE]

This rescaling results in projecting unpromising solutions onto the surface of the hyper-ellipsoid defined by $\mathchoice{\mbox{\boldmath$ \displaystyle y $}}{\mbox{\boldmath$ \textstyle y $}}{\mbox{\boldmath$ \scriptstyle y $}}{\mbox{\boldmath$ \scriptscriptstyle y $}}^{\mathsf{T}}\mathbf{C}^{-1}\mathchoice{\mbox{\boldmath$ \displaystyle y $}}{\mbox{\boldmath$ \textstyle y $}}{\mbox{\boldmath$ \scriptstyle y $}}{\mbox{\boldmath$ \scriptscriptstyle y $}}=n$ . By this projection we achieve three desired effects. First, the update becomes bounded which makes it easier to control positive definiteness. Second, short steps are elongated, enhancing their effect in the update. Third, long steps are shortened, reducing their effect in the update. The two latter effects counter the correlation between rank and length of steps in unfavorable directions. For any given unfavorable direction, longer steps in this direction are most likely ranked worse than shorter steps.

With these scaled solutions, the covariance matrix is updated as

[TABLE]

where shaded areas highlight the difference to (14) and $\alpha^{(t)}\leqslant 1$ is the scaling factor to guarantee the positive definiteness of the covariance matrix. Note that if $\alpha^{(t)}=1$ , it is equivalent to (14) except for the rescaling of the unpromising samples. Importantly, the rescaling of unpromising solutions does not affect the stationarity of the parameter update, i.e., $\mathbb{E}[\mathbf{C}^{(t+1)}\mid\mathbf{C}^{(t)}]=\mathbf{C}^{(t)}$ under a random function such as $f(\mathchoice{\mbox{\boldmath$ \displaystyle x $}}{\mbox{\boldmath$ \textstyle x $}}{\mbox{\boldmath$ \scriptstyle x $}}{\mbox{\boldmath$ \scriptscriptstyle x $}})\sim\mathcal{U}(0,1)$ . This is shown as follows. First, given $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}^{(t)}\sim\mathcal{N}(\bm{0},\gamma_{c}^{(t)}\mathbf{D}\mathbf{C}^{(t)}\mathbf{D})$ , it is easy to see that $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}^{(t+1)}\sim\mathcal{N}(\bm{0},\gamma_{c}^{(t+1)}\mathbf{D}\mathbf{C}^{(t)}\mathbf{D})$ . Second, using $\mathbb{E}[\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}^{\mathsf{T}}/\lVert\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}\rVert^{2}]=\mathbf{I}/n$ (Lemma 1 of Heijmans,, 1999) we have $\mathbb{E}\big{[}\sum_{i=1}^{\lambda}w_{i}\tilde{}\mathchoice{\mbox{\boldmath$ \displaystyle y $}}{\mbox{\boldmath$ \textstyle y $}}{\mbox{\boldmath$ \scriptstyle y $}}{\mbox{\boldmath$ \scriptscriptstyle y $}}_{i:\lambda}\tilde{}\mathchoice{\mbox{\boldmath$ \displaystyle y $}}{\mbox{\boldmath$ \textstyle y $}}{\mbox{\boldmath$ \scriptstyle y $}}{\mbox{\boldmath$ \scriptscriptstyle y $}}_{i:\lambda}^{\mathsf{T}}\mid\mathbf{C}^{(t)}\big{]}=\big{(}\sum_{i=1}^{\lambda}w_{i}\big{)}\mathbf{C}^{(t)}$ . Finally combining them, we obtain $\mathbb{E}[\mathbf{C}^{(t+1)}\mid\mathbf{C}^{(t)}]=\mathbf{C}^{(t)}$ .

The main idea is to scale down, if necessary, the update of the covariance matrix by setting $\alpha^{(t)}<1$ in (17). To provide a better intuition, we start by considering the case $t_{\mathrm{eig}}=1$ , i.e., $\mathbf{C}^{(t)}=\sqrt{\mathbf{C}}^{2}$ for every iteration $t$ . Equation (17) can be written as

[TABLE]

with

[TABLE]

Then, the scaling down parameter $\alpha^{(t)}$ in (18) is taken so that $\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)}$ is positive definite. Then, with maximum and minimum eigenvalue of a matrix $\mathbf{A}$ denoted by $d_{1}(\mathbf{A})$ and $d_{n}(\mathbf{A})$ , respectively, we have that

[TABLE]

where $\mathbf{A}\succcurlyeq\mathbf{B}$ mean that $\mathbf{A}-\mathbf{B}$ is positive semi-definite (i.e., all eigenvalues are non-negative)777Some references (e.g., Harville, (2008)) use the term “positive semi-definite matrix” for a matrix with positive and zero eigenvalues and “non-negative definite matrix” is used for matrices with non-negative eigenvalues, whereas in other references these terms are used for matrices with non-negative eigenvalues. In this paper, we apply the latter terminology. for any two square matrices $\mathbf{A}$ and $\mathbf{B}$ . Moreover, $d_{n}(\mathbf{I}+\alpha^{(t)}\mathbf{Z}^{(t)})=1+\alpha^{(t)}d_{n}(\mathbf{Z}^{(t)})$ . Therefore, if $\alpha^{(t)}<1/\lvert d_{n}(\mathbf{Z}^{(t)})\rvert$ , the resulting covariance matrix is guaranteed to be positive definite.

For the general case ( $t_{\mathrm{eig}}\geqslant 1$ ), let $t$ be the iteration at which the last eigen decomposition was performed, i.e., $\mathbf{C}^{(t)}=\sqrt{\mathbf{C}}^{2}$ . For iterations $k\in[t,t+t_{\mathrm{eig}})$ , we store the intermediate update matrix $\mathbf{Z}^{(k)}$ . The covariance matrix is updated only when the eigen decomposition is required, i.e., at iteration $t+t_{\mathrm{eig}}$ , as

[TABLE]

Analogously to the above argument, the resulting covariance matrix is positive definite if the inside of the brackets is positive definite. We set the scaling down parameter as

[TABLE]

The first argument guarantees that the minimum eigenvalue of $\mathbf{C}^{(t+t_{\mathrm{eig}})}$ is greater than or equal to $1/4$ th of the minimum eigenvalue of $\mathbf{C}^{(t)}$ . More concisely, we have $\mathbf{C}^{(t+t_{\mathrm{eig}})}\succcurlyeq\frac{1}{4}\mathbf{C}^{(t)}$ . The last argument, which is the most frequent case in practice, implies that the covariance matrix update does not need to be scaled down.

3.2 Method 2: Scaling Down Negative Weights

An alternative way to guarantee the positive definiteness of the covariance matrix is to scale down the sum of the absolute values of the negative weights, combined with the projection of unpromising solutions introduced above. We use the same update formula (21), but $\alpha^{(t+t_{\mathrm{eig}})}=1$ .

The positive definiteness of the covariance matrix is guaranteed under the condition $1/t_{\mathrm{eig}}>c_{1}+c_{\mu}\sum_{i=1}^{\lambda}w_{i}+nc_{\mu}\sum_{i:\ w_{i}<0}\lvert w_{i}\rvert$ . More precisely, we have the following claim.

Theorem 3.1.

If $1/t_{\mathrm{eig}}>c_{1}+c_{\mu}\sum_{i=1}^{\lambda}w_{i}+nc_{\mu}\sum_{i:\ w_{i}<0}\lvert w_{i}\rvert$ , then $\mathbf{C}^{(t+t_{\mathrm{eig}})}\succcurlyeq\big{(}1-t_{\mathrm{eig}}\big{(}c_{1}+c_{\mu}\sum_{i=1}^{\lambda}w_{i}+nc_{\mu}\sum_{i:\ w_{i}<0}\lvert w_{i}\rvert\big{)}\big{)}\mathbf{C}^{(t)}$ .

Proof of Theorem 3.1.

First, we consider the case $t_{\mathrm{eig}}=1$ . Using the fact that the self outer product $\mathbf{v}\mathbf{v}^{\mathsf{T}}$ of a vector $\mathbf{v}$ is a matrix of rank $1$ with the only nonzero eigenvalue of $\lVert\mathbf{v}\rVert^{2}$ , we have

[TABLE]

The analogous argument holds for $t_{\mathrm{eig}}>1$ as well. This completes the proof. ∎

We show how to construct the recombination weights so that they satisfy the sufficient condition of Theorem 3.1 as follows. Let $(w_{i}^{\prime})_{i=1}^{\lambda}$ be the pre-defined weights that are non-increasing. Without loss of generality, we assume that the first $\mu$ weights are positive and sum to one. The recombination weights are $w_{i}=w_{i}^{\prime}$ for $w_{i}^{\prime}\geqslant 0$ and $w_{i}=\alpha w_{i}^{\prime}$ for $w_{i}^{\prime}<0$ , where $\alpha\in(0,1]$ . Then, the sufficient condition in Theorem 3.1 reads $1/t_{\mathrm{eig}}>c_{1}+c_{\mu}+(n-1)c_{\mu}\alpha\sum_{i:\ w_{i}<0}\lvert w_{i}^{\prime}\rvert$ . It holds if we set for example $t_{\mathrm{eig}}<(c_{1}+c_{\mu})^{-1}$ , as satisfied by the default choice $t_{\text{eig}}=\max\big{(}1,\big{\lfloor}(\beta_{\mathrm{eig}}(c_{1}+c_{\mu}))^{-1}\big{\rfloor}\big{)}$ , and

[TABLE]

Then, it is guaranteed that $\mathbf{C}^{(t+t_{\mathrm{eig}})}\succcurlyeq\frac{1}{n}\mathbf{C}^{(t)}$ .

This method is simpler than the method described in Section 3.1 since it does not require an additional eigenvalue computation. However, to guarantee the positive definiteness in this way, we bound the unrealistic worst case where all unpromising candidate solutions are sampled on a line. Therefore, the scaling down factor is set much smaller than the value that is actually needed in practice. Figure 2 visualizes the correction factor $\alpha$ under our choice of the default pre-weights and learning rates described in Section 5. The sum of the negative recombination weights needs to decrease as the population size increases. For $n\geqslant 320$ , the factor is less than $0.1$ for $\lambda\geqslant n^{1.5}$ . Unless the internal computational time becomes a critical bottleneck, we prefer the method described in Section 3.1 over the method in Section 3.2.

3.3 Choice of Recombination Weights

We first review the rationale for the choice of the positive recombination weights. The default positive recombination weights, $\ln\frac{\lambda+1}{2}-\ln i$ for $i\leqslant\lambda/2$ , are based on the quality gain analysis of the weighted recombination ES with isotropic Gaussian distribution (i.e., $\mathbf{C}\propto\mathbf{I}$ ). Arnold, (2006) studied the quality gain on a spherical function, and derived the optimal recombination weights for the infinite dimensional case. They are

[TABLE]

where $\mathcal{N}_{1:\lambda}\leqslant\cdots\leqslant\mathcal{N}_{\lambda:\lambda}$ is the ordered statistics from the independent and standard normally distributed random variables $\mathcal{N}_{1},\dots,\mathcal{N}_{\lambda}$ . Recently, it has been shown that the same recombination weights are optimal for a general convex quadratic function with Hessian matrix $\mathbf{A}$ satisfying $\operatorname{Tr}(\mathbf{A}^{2})/\operatorname{Tr}(\mathbf{A})^{2}\ll 1$ for $n$ large (Akimoto et al.,, 2018). The value $\ln\frac{\lambda+1}{2}-\ln i$ approximates the first half of the optimal recombination weights.888Interesting results are derived from the quality gain (and progress rate) analysis, see e.g. Section 4.2 of Hansen et al., (2015). Comparing the ( $1+1$ )-ES and ( $\mu$ , $\lambda$ )-ES with intermediate recombination, we realize that they reach the same quality gain in the limit for $n$ to infinity when $\mu$ is the optimal value, $\mu\approx 0.27\lambda$ (Beyer,, 1995). That is, a non-elitist strategy with optimal $\mu/\lambda$ for intermediate recombination can reach the same speed as the elitist strategy. With the optimal recombination weights above, the weighted recombination ES is $2.475$ times faster than those algorithms. If we employ only the positive half of the optimal recombination weights, the speed will be halved, yet it is faster by the factor of $1.237$ .

We propose to use the recombination weights constructed as follows. Let

[TABLE]

The variance effective selection mass for the positive and negative weights is computed as

[TABLE]

The learning rates $c_{1}$ and $c_{\mu}$ may be computed depending on the above quantities. The default values for the learning rates are discussed in Section 5. The recombination weights $w_{i}$ are set as follows

[TABLE]

The positive recombination weights are unchanged from the default settings without active CMA. They are approximately proportional to the positive half of the optimal recombination weights. However, this is not the case for the negative recombination weights. The optimal recombination weights are symmetric about zero, i.e. $w_{i}=-w_{\lambda-i+1}$ , whereas our negative weights tend to level out for the following reasons. The above mentioned quality gain results consider only the mean update. The obtained optimal values are not necessarily optimal for the covariance matrix adaptation. Since we use only positive recombination weights for the mean vector update, the negative weights do not need to correspond to these optimal recombination weights. Furthermore, the optimal negative weights—greater absolute values for worse steps—counteract our motivation for rescaling of unpromising steps discussed in Section 3.1. Our choice of the negative recombination weights is a natural extension of the default positive recombination weights. The shape of our recombination weights is somewhat similar to the one in xCMA-ES (Krause and Glasmachers,, 2015), where the first half is $w_{i}-1/\lambda$ and the last half is $-1/\lambda$ . A minor difference is that xCMA-ES assigns negative values even for some of the better half of the candidate solutions, whereas our setting assigns positive values only for the better half and negative values only for the worse half.

Positive and negative recombination weights are scaled differently. Positive weights are scaled to sum to one, whereas negative weights are scaled so that the sum of their absolute values is the minimum of $1+\frac{c_{1}}{c_{\mu}}$ and $1+\frac{2\mu_{w}^{-}}{\mu_{w}+2}$ . The latter corrects for unbalanced positive and negative weights, but is usually greater than the former with our default values. The former makes the decay factor $1-c_{1}-c_{\mu}\sum_{i=1}^{\lambda}w_{i}$ of the covariance matrix update (see (14)) to $1$ . That is, the covariance matrix does not passively shrink, and only the negative update can shrink the covariance matrix. Krause and Glasmachers, (2015) mention to have no passive shrinkage by setting the sum of the weights to zero, but the effect of the rank-one update was not taken into account and thus in xCMA the passive decay factor of $1-c_{1}$ remains.

4 Adaptive Diagonal Decoding (dd-CMA)

The separable CMA-ES (Ros and Hansen,, 2008) enables to adapt a diagonal covariance matrix faster than the standard CMA-ES because the learning rates are $O(n)$ times greater than in standard CMA-ES. This works well on separable objective functions, where separable CMA-ES adapts the (co)variance matrix by a factor of $O(n)$ faster than CMA-ES. To accelerate standard CMA, we combine separable CMA and standard CMA. We adapt both $\mathbf{D}$ and $\mathbf{C}$ at the same time999Considering the eigen decomposition of the covariance matrix, $\mathbf{E}\mathbf{\Lambda}\mathbf{E}^{\mathsf{T}}$ , instead of the decomposition $\mathbf{D}\mathbf{C}\mathbf{D}$ , our attempts to additionally adapt $\mathbf{\Lambda}$ were unsuccessful. We attribute the failure to the strong interdependency between $\mathbf{\Lambda}$ and $\mathbf{E}$ . Compared to the eigen decomposition, the diagonal decoding model $\mathbf{D}\mathbf{C}\mathbf{D}$ also has the advantage to be interpretable as a rescaling of variables. We never considered the decomposition $\sqrt{\mathbf{C}}\mathbf{D}^{2}\sqrt{\mathbf{C}}$ as proposed by one of the reviewers of this paper, however, at first sight, we do not see any particular advantage in favor of this decomposition. , where $\mathbf{D}$ is updated with adaptive learning rates which can be much greater than those used to update $\mathbf{C}$ .

4.1 Algorithm

The update of $\mathbf{D}$ is similar to separable CMA, but is done in local exponential coordinates. The update of $\mathbf{D}$ is as follows

[TABLE]

where $[\Delta_{D}]_{k,k}$ is the $k$ -th diagonal element of the diagonal matrix $\Delta_{D}$ , the evolution path $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c,D}$ feeds the rank-one $\mathbf{D}$ update, $c_{1,D}$ and $c_{\mu,D}$ are the learning rates for the rank-one and rank- $\mu$ $\mathbf{D}$ updates, respectively, the recombination weights $w_{i,D}$ are computed by (27) using $c_{1,D}/c_{\mu,D}$ instead of $c_{1}/c_{\mu}$ , $\tilde{}\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}_{i:\lambda}$ is defined in (16), and $\beta^{(t)}$ is the damping factor for the diagonal matrix update given in (31) below. The evolution path $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c,D}$ and its normalization factor $\gamma_{c,D}$ are updated in the same way as (9) and (10) with the cumulation factor $c_{c,D}$ rather than $c_{c}$ . The matrix exponential of a diagonal matrix is $\exp(\operatorname{diag}(d_{1},\dots,d_{n}))=\operatorname{diag}(\exp(d_{1}),\dots,\exp(d_{n}))$ .

Using the correlation matrix of the covariance matrix $\mathbf{C}$ of the last time the matrix was decomposed,

[TABLE]

where $\operatorname{diag}(\mathbf{C})$ is the diagonal matrix whose diagonal elements are the diagonal elements of $\mathbf{C}$ , the damping factor is based on the square root of the condition number of this correlation matrix as

[TABLE]

where $\beta_{\mathrm{thresh}}:=2$ is the threshold parameter at which $\beta^{(t)}$ becomes larger than one. We remark that $\beta^{(t)}$ changes only every $t_{\text{eig}}$ iterations.

The $\mathbf{D}$ -update is multiplicative as in Ostermeier et al., (1994) or Krause and Glasmachers, (2015) to be unbiased on the log-scale and to simply guarantee the positive definiteness of $\mathbf{D}$ . Note that the first order approximation of the update formula $\mathbf{D}^{(t)}\exp((2\beta)^{-1}\Delta_{D})$ gives

[TABLE]

which is the update in separable CMA.

Importantly, we set the learning rates for the $\mathbf{D}$ update, $c_{1,D}$ and $c_{\mu,D}$ , to be about $n$ times greater than the learning rates for the $\mathbf{C}$ update, $c_{1}$ and $c_{\mu}$ . Moreover, we maintain an additional evolution path, $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c,D}$ , for the rank-one update of $\mathbf{D}$ since an adequate cumulation factor, $c_{c,D}$ , may be different from the one for $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}$ .

4.2 Damping factor

The dynamic damping factor $\beta^{(t)}$ is the crucial novelty that prevents diagonal decoding to disrupt the adaptation of the covariance matrix $\mathbf{C}$ in CMA-ES. The damping factor is introduced to prevent the sampling distributions from drastically changing due to the diagonal matrix update. Figure 3 visualizes three example cases with different $\mathbf{C}$ . When the diagonal decoding matrix changes from the identity matrix to $\mathbf{D}$ , the change of the distribution from $\mathcal{N}(\bm{0},\mathbf{C})$ to $\mathcal{N}(\bm{0},\mathbf{D}\mathbf{C}\mathbf{D})$ is minimal when $\mathbf{C}$ is diagonal, and is comparatively large if $\mathbf{C}$ is non-diagonal. It will be greater as the condition number of the correlation matrix $\mathrm{corr}(\mathbf{C})$ increases. In the worst case in Figure 3, we see that two distributions are overlapping only at the center of distribution.

The damping factor $\beta^{(t)}$ computed in (31) is motivated from the KL divergence between the sampling distributions before and after the $\mathbf{D}$ update. The KL divergence between two multivariate normal distributions that have the same mean vectors is equivalent to the half of the Log-Determinant divergence between the covariance matrices, defined as

[TABLE]

where $\mathbf{A}$ and $\mathbf{B}$ are two positive definite symmetric matrices. Let $\mathbf{A}=\mathbf{D}^{\prime}\mathbf{C}\mathbf{D}^{\prime}$ and $\mathbf{B}=\mathbf{D}\mathbf{C}\mathbf{D}$ , i.e., $\mathbf{A}$ and $\mathbf{B}$ are the covariance matrices after and before the $\mathbf{D}$ update, respectively. The divergence $D_{ld}$ is asymmetric, and the value changes if we exchange $\mathbf{A}$ and $\mathbf{B}$ , but we will obtain a symmetric approximation in the end. To analyze the divergence between $\mathbf{A}$ and $\mathbf{B}$ due to the difference in shape and put aside the effect of a global scaling, we scale both to be of determinant one. We denote the resulting divergence by $\bar{D}_{ld}(\mathbf{A},\mathbf{B})=D_{ld}(\mathbf{A}/\det(\mathbf{A}),\mathbf{B}/\det(\mathbf{B}))$ . Then, we obtain

[TABLE]

where $\bar{\Delta}_{D}=\Delta_{D}-(\operatorname{Tr}(\Delta_{D})/n)\mathbf{I}$ . With the approximation $\exp\left(\frac{1}{2\beta}\bar{\Delta}_{D}\right)\approx\mathbf{I}+\left(\frac{1}{2\beta}\right)\bar{\Delta}_{D}+\frac{1}{2}\left(\frac{1}{2\beta}\right)^{2}\bar{\Delta}_{D}^{2}$ and neglecting $(\Delta_{D}/\beta)^{3}$ and higher terms, we have101010Let $\mathbf{A}$ and $\mathbf{B}$ be an arbitrary positive definite symmetric matrix and an arbitrary symmetric matrix, respectively. Let $\otimes$ and $\mathrm{vec}$ denote the Kronecker product of two matrices and the matrix-vector rearrangement operator that successively stacks the columns of a matrix. From Theorem 16.2.2 of Harville, (2008), we have $\operatorname{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^{-1}\mathbf{B})=\mathrm{vec}(\mathbf{B})^{\mathsf{T}}(\mathbf{A}\otimes\mathbf{A}^{-1})\mathrm{vec}(\mathbf{B})$ . The eigenvalues of the Kronecker product $\mathbf{A}\otimes\mathbf{A}^{-1}$ are the products of the eigenvalues of $\mathbf{A}$ and $\mathbf{A}^{-1}$ (Theorem 21.11.1 of Harville, (2008)). They are all positive and upper bounded by the condition number of $\mathbf{A}$ . Therefore, we have $\mathrm{vec}(\mathbf{B})^{\mathsf{T}}(\mathbf{A}\otimes\mathbf{A}^{-1})\mathrm{vec}(\mathbf{B})\leqslant\operatorname{Cond}(\mathbf{A})\lVert\mathrm{vec}(\mathbf{B})\rVert^{2}=\operatorname{Cond}(\mathbf{A})\operatorname{Tr}(\mathbf{B}^{2})$ . Letting $\mathbf{A}=\mathrm{corr}(\mathbf{C})$ and $\mathbf{B}=\bar{\Delta}_{D}$ , we have $\operatorname{Tr}\left(\bar{\Delta}_{D}\mathbf{C}\bar{\Delta}_{D}\mathbf{C}^{-1}\right)=\operatorname{Tr}(\mathbf{A}\mathbf{B}\mathbf{A}^{-1}\mathbf{B})\leqslant(\beta+\beta_{\mathrm{thresh}}-1)^{2}\operatorname{Tr}(\bar{\Delta}_{D}^{2})$ .

[TABLE]

Note that $\bar{D}_{ld}((\mathbf{D}^{(t+1)})^{2},(\mathbf{D}^{(t)})^{2})\approx 2^{-1}\operatorname{Tr}\left(\bar{\Delta}_{D}^{2}\right)$ if $\mathbf{C}$ is diagonal and $\beta=1$ . Therefore, with $\beta$ computed in (31), we bound the realized divergence approximately by the divergence of $\mathbf{D}$ like

[TABLE]

for large $\beta$ .

In a nutshell, we have quantified the distribution change from changing $\mathbf{D}$ by measuring its KL-divergence. We derive that $\beta$ in (31) upper bounds the KL-divergence due to changes of $\mathbf{D}$ approximately by the KL-divergence from the same change of $\mathbf{D}$ when $\mathbf{C}=\mathbf{I}$ . The latter is directly determined by the learning rates $c_{1,D}$ and $c_{\mu,D}$ .

4.3 Implementation Remark

To avoid unnecessary numerical errors and additional computational effort for eigen decompositions in the adaptive diagonal decoding mechanism (28), (29), (31), we force the diagonal elements of $\mathbf{C}$ to be all one by assigning

[TABLE]

These lines are performed just before the eigen decomposition. It means that $\mathbf{D}$ and $\mathbf{C}$ are the variance matrix and the correlation matrix of the sampling distribution, respectively. This reparametrization does not change the algorithm itself, it only improves the numerical stability of the implementation. Note that if $\mathbf{C}$ is constrained to be a correlation matrix, then $\mathbf{D}\mathbf{C}\mathbf{D}$ is a unique parametrization for the covariance matrix.

The eigen decomposition of $\mathbf{C}^{(t)}$ is performed every $t_{\text{eig}}$ iterations. We keep $\sqrt{\mathbf{C}}$ and $\sqrt{\mathbf{C}}^{-1}$ until the next matrix decomposition is performed. Suppose that the eigen decomposition $\mathbf{C}^{(t)}=\mathbf{E}\Lambda\mathbf{E}^{\mathsf{T}}$ is performed at iteration $t$ . Then, the above matrices are $\sqrt{\mathbf{C}}=\mathbf{E}\Lambda^{\frac{1}{2}}\mathbf{E}^{\mathsf{T}}$ and $\sqrt{\mathbf{C}}^{-1}=\mathbf{E}\Lambda^{-\frac{1}{2}}\mathbf{E}^{\mathsf{T}}$ . Then, we can compute $\beta$ in (31) as $\beta=\max(1,(\max_{i}([\Lambda]_{i,i})/\min_{i}([\Lambda]_{i,i}))^{\frac{1}{2}}-\beta_{\mathrm{thresh}}+1)$ . This $\beta$ is kept until the next eigen decomposition is performed. Then, the additional computational cost for adaptive diagonal decoding is $O(n^{2}/t_{\mathrm{eig}}+\lambda n)$ , which is smaller than the computational cost for the other parts of the CMA-ES, $O(n^{3}/t_{\mathrm{eig}}+\lambda n^{2})$ per iteration.

One might think that maintaining $\mathbf{D}$ is not necessary and $\mathbf{C}$ could be updated by pre- and post-multiplying by a diagonal matrix absorbing the effect of the $\mathbf{D}$ update. However, because diagonal decoding may lead to a fast change of the distribution shape, the decomposition of $\mathbf{C}$ then needs to be done every iteration. The chosen parametrization circumvents frequent matrix decompositions despite changing the sampling distribution rapidly.

5 Algorithm Summary and Strategy Parameters

The dd-CMA-ES combines weighted recombination, active covariance matrix update with positive definiteness guarantee (described in Section 3.1), adaptive diagonal decoding (described in Section 4) and CSA, and is summarized in Algorithm 1.111111Its python code is available at

https://gist.github.com/youheiakimoto/1180b67b5a0b1265c204cba991fa8518

Providing good default values for the strategy parameters (aka hyper parameters) is, needless to say, essential for the success of a novel algorithm. Especially in a black-box optimization scenario, parameter tuning for an application relies on trial-and-error and is usually prohibitively time consuming. The computation of the default parameter values is summarized in Algorithm 2. Since we have $c_{1}/c_{\mu}=c_{1,D}/c_{\mu,D}$ as long as $c_{1}+c_{\mu}<1$ (see below), the recombination weights for the update of $\mathbf{C}$ and of $\mathbf{D}$ are the same, $w_{i}=w_{i,D}$ .

The learning rates for $\mathbf{D}$ and $\mathbf{C}$ are modified to improve the adaptation speed especially in high dimension. Let $m$ be the degrees of freedom of the matrix to be adapted, i.e., $m=(n+1)n/2$ for the $\mathbf{C}$ -update and $m=n$ for the $\mathbf{D}$ -update. The learning rate parameters and the cumulation factors are set to the following values

[TABLE]

The first important change is the scaling of the learning rate with the dimension $n$ . In previous works (Ros and Hansen,, 2008; Hansen and Auger,, 2014; Akimoto and Hansen,, 2016), the default learning rates were set inversely proportional to $m$ , i.e., $\Theta(n^{-2})$ for the original CMA and $\Theta(n^{-1})$ for the separable CMA. Our choice is based on empirical observations that $\Theta(n^{-2})$ is exceedingly conservative for higher dimensional problems, say $n>100$ : in experiments to identify $c_{\mu}$ for the original CMA-ES, we vary $c_{\mu}$ and investigate the number of evaluations to reach a given target. We find that the location of the minimum and the shape of the dependency remains for a wide range of different dimensions roughly the same when the evaluations are taken against $c_{\mu}m/n^{0.35}$ (see also Figure 5 below). The observation suggests in particular that $c_{\mu}=\Theta(\sqrt{n}/m)$ is likely to fail with increasing dimension, whereas $c_{\mu}=\Theta(n^{0.35}/m)$ is a stable setting such that the new setting of $\Theta(n^{0.25}/m)$ (replacing $\Theta(n^{0}/m)$ ) is still sufficiently conservative. Figure 4 depicts the difference between the default learning rate values in Hansen and Auger, (2014) and the above formulas.121212The learning rate for the rank- $\mu$ update is usually $\mu^{\prime}$ times greater than the learning rate for the rank-one update, where $\mu^{\prime}\in(\mu_{w}-2,\mu_{w}-1/2)$ is monotonous in $\mu_{w}$ and approaches $\mu_{w}-3/2$ for $\lambda\to\infty$ . Using $\mu^{\prime}$ instead of $\mu_{w}$ is a correction for small $\mu_{w}$ . When $\mu_{w}=1$ , we have $\mu^{\prime}<1/2$ (the first three terms cancel out). In this case, because $\mu=1$ , the rank-one update contains more information than the rank- $\mu$ update as the evolution path accumulates information over time. Using $\mu_{w}$ instead of $\mu^{\prime}$ would result in the same learning rates for both updates, whereas the learning rate for the rank- $\mu$ update should be smaller in this case.

Another important change in the default parameter values from previous studies (Ros and Hansen,, 2008; Hansen and Auger,, 2014) is in the cumulation factor $c_{c}$ . Previously, the typical choices were $c_{c}=\frac{4}{n+4}$ or $c_{c}=\frac{4+\mu_{w}/n}{n+4+2\mu_{w}/n}$ .

The new and simpler default value for the cumulation factor131313The appearance of $\sqrt{\mu_{w}}$ in the cumulation factor is motivated as follows. Hansen and Ostermeier, (2001) analyzed the situation where the selection vector $\sqrt{\mu_{w}}\sum_{i=1}^{\mu}w_{i}\mathchoice{\mbox{\boldmath$ \displaystyle z $}}{\mbox{\boldmath$ \textstyle z $}}{\mbox{\boldmath$ \scriptstyle z $}}{\mbox{\boldmath$ \scriptscriptstyle z $}}_{i:\lambda}$ alternates between two vectors $\bm{v}$ and $\bm{w}$ over iterations and showed the squared Euclidean norm of the evolution path, i.e. the eigenvalue of $\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}^{\mathsf{T}}$ , remains in $O(\lVert\bm{v}+\bm{w}\rVert^{2}/c_{c})$ if $\lVert\bm{v}+\bm{w}\rVert>0$ . On the other hand, Akimoto et al., (2008) showed that the above selection vector can potentially be proportional to $\sqrt{\mu_{w}}$ when $\sigma$ is (too) small. Altogether, we have $c_{1}\lVert\mathchoice{\mbox{\boldmath$ \displaystyle p $}}{\mbox{\boldmath$ \textstyle p $}}{\mbox{\boldmath$ \scriptstyle p $}}{\mbox{\boldmath$ \scriptscriptstyle p $}}_{c}\rVert^{2}\in O(c_{1}\mu_{w}/c_{c})$ , where $c_{1}\mu_{w}\to 2$ and $c_{c}\to\sqrt{2}/2$ as $\mu_{w}\to\infty$ . Therefore, the eigenvalue added by the rank-one update is bounded by a constant with overwhelming probability. is motivated by an experiment on the Cigar function, investigating the dependency of $c_{c}$ on $c_{1}$ . The effect of the rank-one update of the covariance matrix $\mathbf{C}$ is most pronounced when the covariance matrix needs to increase a single eigenvalue. To skip over the usual initial phase where sigma decreases without changing the covariance matrix and emphasize on the effect of $c_{c}$ , the mean vector is initialized at $\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}}^{(0)}=(1,0,\dots,0)$ and $\sigma^{(0)}=10^{-6}$ . The rank- $\mu$ update is off, i.e., $c_{\mu}=c_{\mu,D}=0$ . We compared the number of $f$ -calls until the target function value of $10^{-2}$ is reached. Note that $\sigma^{(0)}$ is chosen to avoid $\sigma$ adaptation at the beginning and the target value is chosen so that we stop the optimization once the covariance matrix learned the right scaling. Figure 5 compares the number of $f$ -calls spent by the original CMA and the separable CMA on the $20$ dimensional Cigar function. Because the $x$ -axis is normalized with the new default value for $c_{c}$ , the shape of the graph and the location of the minimum remains roughly the same for different values of $c_{1}$ and $\lambda$ . The adaptation speed of the covariance matrix tends to be the best around $c_{c}=\sqrt{\mu_{w}c_{1}}/2$ or $\sqrt{\mu_{w}c_{1,D}}/2$ . The minimum is more pronounced for smaller $c_{1}$ (or $c_{1,D}$ ). Nevertheless, we observe little improvement with the new setting in practice since more $f$ -calls are usually spent to adapt the overall covariance matrix and even to adapt the step-size to converge. We have done the same experiments in dimension $80$ and observe the same trend.

None of the strategy parameters are meant to be tuned by users, except the population size $\lambda$ . A larger population size typically results in finding a better local minimum on rugged functions such as multimodal or noisy functions (Hansen and Kern,, 2004). It is also advantageous when the objective function values are evaluated in parallel (Hansen et al.,, 2003). Restart strategies that perform independent restarts with increasing population size such as IPOP strategy (Harik and Lobo,, 1999; Auger and Hansen, 2005b, ) or BIPOP strategy (Hansen,, 2009) or NIPOP strategy (Loshchilov et al.,, 2012) are handy policies that automate the parameter tuning. They can be applied to the CMA-ES with diagonal acceleration in a straight-forward way. We leave research in this line as future work.

6 Experiments

We conduct numerical simulations to see the effect of the algorithmic components of CMA-ES with diagonal acceleration, dd-CMA-ES—namely the active update with positive definiteness guarantee and the adaptive diagonal decoding. Since the effect of the active covariance matrix update with default population size has been investigated by Jastrebski and Arnold, (2006) and our contribution is a positive definiteness guarantee for the covariance matrix especially for large population size, we investigate how the effect of the active update scales as the population size increases. Our main focus is however on the effect of adaptive diagonal decoding. Particularly, we investigate (i) how much better dd-CMA scales on separable functions than the CMA without diagonal decoding (plain CMA), (ii) how closely dd-CMA matches the performance of separable CMA on separable functions, and (iii) how the scaling of dd-CMA compares to the plain CMA on various non-separable functions. Moreover, we investigated how effective dd-CMA is when the population size is increased.

6.1 Common Setting

Table 1 summarizes the test functions used in the following experiments together with the initial $\mathchoice{\mbox{\boldmath$ \displaystyle m $}}{\mbox{\boldmath$ \textstyle m $}}{\mbox{\boldmath$ \scriptstyle m $}}{\mbox{\boldmath$ \scriptscriptstyle m $}}^{(0)}$ and $\sigma^{(0)}$ . The initial covariance matrix $\mathbf{C}^{(0)}$ and the diagonal decoding matrix $\mathbf{D}^{(0)}$ are always set to the identity matrix. The random vectors and the random matrix appearing in Table 1 are initialized randomly for each problem instance, but the same values are used for different algorithms for fair comparison. A trial is considered as success if the target function value of $10^{-8}$ is reached before the algorithm spends $5\times 10^{4}n$ function evaluations, otherwise regarded as failure. For $n\leqslant 40$ we conducted $20$ independent trials for each setting, $10$ for $80\leqslant n\leqslant 320$ , and $3$ for $n\geqslant 640$ . When the computational effort of evaluating the test function scales worse than linear with the dimension (i.e., on rotated functions) we may omit dimensions larger than $320$ .

We compare the following CMA variants:

Plain CMA

updates $\mathbf{C}$ while $\mathbf{D}$ is kept the identity matrix, with or without active update described in Section 3 (Algorithm 1 without $\mathbf{D}$ -update);

Separable CMA

updates $\mathbf{D}$ as described in Section 4 while $\mathbf{C}$ is kept the identity matrix, with active update or without active update, where negative recombination weights are set to zero (Algorithm 1 without $\mathbf{C}$ -update);

dd-CMA

as summarized in Algorithm 1, updates $\mathbf{C}$ as described in Section 2.1 with active update described in Section 4, and updates $\mathbf{D}$ as described in Section 4.

All strategy parameters such as the learning rates are set to their default value presented in Section 5. Note that the separable CMA is different from the original publication (Ros and Hansen,, 2008) in that the matrix is updated in multiplicative form which however barely affects the performance.

6.2 Active CMA

First, the effect of the active update is investigated. The plain and separable CMA with and without active update are compared.

Figure 6 shows the number of iterations spent by each algorithm plotted against population size $\lambda$ . The number of iterations to reach the target function value decreases as $\lambda$ increases and tends to level out. The plain CMA is consistently faster with active update than without. As expected from the results of Jastrebski and Arnold, (2006), the effect of active covariance matrix update is most pronounced on functions with a small number of sensitive variables such as the Discus function. The advantage of the active update diminishes as $\lambda$ increases in plain CMA, whereas the speed-up in separable CMA becomes even slightly more pronounced.

6.3 Adaptive Diagonal Decoding

Next, we compare dd-CMA with the plain and the separable CMA with active update.

Figure 7 compares dd-CMA with the plain CMA on the rotated Cigar, Ellipsoid, and Discus functions, and with the separable CMA on the separable Cigar, Ellipsoid, and Discus functions. Note that the plain CMA scales equally well on separable functions and rotated functions, while the separable CMA will not find the target function value on rotated functions before the maximum budget is exhausted (Ros and Hansen,, 2008). Displayed are the number of function evaluations to reach the target function value on each problem instance. The results of dd-, plain and separable CMA on the Sphere function are displayed for reference.

On the Sphere function, no covariance matrix adaptation is required, and all algorithms perform quite similarly. On the Cigar function, the rank-one covariance matrix update is known to be rather effective, and even the plain CMA scales linearly on the rotated Cigar function. On both, separable and rotated Cigar functions, dd-CMA is competitive with separable and plain CMA thereby combining the better performance of the two.

The discrepancy between plain and separable CMA is much more pronounced on Ellipsoid and Discus functions, where the plain CMA scales super-linearly, whereas the separable CMA exhibits linear or slightly worse than linear scaling on separable instances and fails to find the optimum within the given budget on rotated instances. On the rotated functions, dd-CMA is competitive with the plain CMA, whereas it significantly reduces the number of function evaluations on the separable functions compared to plain CMA, e.g., ten times less $f$ -calls are required on the $160$ dimensional separable Ellipsoid function.

The dd-CMA is competitive with separable CMA on the separable Discus function up to $5120$ dimension, which is the ideal situation since we can not expect faster adaptation of the distribution shape on separable functions. On the separable Ellipsoid function, the performance curve of dd-CMA starts deviating from that of the separable CMA around $n=320$ , and scales more or less the same as the plain CMA afterwards, yet it requires ten times less $f$ -calls. To improve the scaling of dd-CMA on the separable Ellipsoid function, we might need to set $\beta_{\mathrm{thresh}}$ depending on $n$ , or use another monotone transformation of the condition number of the correlation matrix in (31). Performing the eigen decomposition of $\mathbf{C}$ less frequently (i.e., setting $\beta_{\mathrm{eig}}$ smaller) is another possible way to improve the performance of dd-CMA on the separable Ellipsoid function, while compromising the performance on the rotated Ellipsoid function141414If we set $\beta_{\text{eig}}=10$ instead of $\beta_{\text{eig}}=10n$ , dd-CMA spends about $1.5$ times more FEs on the $5120$ -dimensional separable Ellipsoid function than separable CMA as displayed in Figure 7(d), whereas it spends about $15\%$ more FEs on the $320$ -dimensional rotated Ellipsoid function than plain CMA as displayed in Figure 7(d). This might be a more practical choice, but further investigation is required..

Figures 8(a) and 8(b) show the results on Ell-Cig and Ell-Dis functions, which are non-separable ill-conditioned functions with additional coordinate-wise scaling, Figure 8(c) shows the results on non-rotated and rotated Rosenbrock functions. The separable CMA locates the target $f$ -value within the given budget only on the non-rotated Rosenbrock function in smaller dimension. The experiment on the Ell-Cig function reveals that diagonal decoding can improve the scaling of the plain CMA even on non-separable functions. The improvement on the Ell-Dis function is much less pronounced. The reason why, compared to the plain CMA, dd-CMA is more suitable for Ell-Cig than for Ell-Dis is discussed in relation with Figure 9 below. On the non-rotated Rosenbrock function, dd-CMA becomes advantageous as the dimension increases, just as the separable CMA is faster than the plain CMA when $n>80$ .

Figure 9 visualizes insights in the typical behaviour of dd-CMA on different functions. On the separable Ellipsoid, the correlation matrix deviates from the identity matrix due to stochastics and the initial parameter setting, and the inverse damping decreases marginally. This effect is more pronounced for $n>320$ and is the reason for the impaired scaling of dd-CMA on the separable Ellipsoid in Figure 7(d). The diagonal decoding matrix, i.e., variance matrix, adapts the coordinate-wise scaling efficiently as long as the condition number of the correlation matrix $\mathbf{C}$ is not too high. On the rotated Ellipsoid function, where the diagonal elements of the inverse Hessian of the objective function, $\frac{1}{2}\mathbf{R}^{\mathsf{T}}\mathbf{D}_{\mathrm{ell}}^{-6}\mathbf{R}$ , likely have similar values, the diagonal decoding matrix remains close to the identity and the correlation matrix learns the ill-conditioning. The inverse damping parameter decreases so that the adaptation of the diagonal decoding matrix does not disturb the adaptation of the distribution shape. Figure 9(b), below, shows that adaptive diagonal decoding without the dynamic damping factor (31) severely disturbs the adaptation of $\mathbf{C}$ on the rotated Ellipsoid.

Ideally, the diagonal decoding matrix $\mathbf{D}$ adapts the coordinate-wise standard deviation and $\mathbf{C}$ adapts the correlation matrix of the sampling distribution. If the correlation matrix $\mathbf{C}$ is easier to adapt than the full covariance matrix $\mathbf{D}\mathbf{C}\mathbf{D}$ , we expect a speed-up of the adaptation of the distribution shape. The functions Ell-Cig and Ell-Dis in Figures 9(c) and 9(d) are such examples. Once $\mathbf{D}$ becomes inversely proportional to $\mathbf{D}_{\text{ell}}^{2}$ , the problems become rotated Cigar and rotated Discus functions, respectively. The run on the Ell-Cig function in 9(c) depicts the ideal case. The coordinate-wise scaling is first adapted by $\mathbf{D}$ and then the correlation matrix learns a long Cigar axis. The inverse damping factor is kept at a high value similar to the one observed on the separable Ellipsoid and it starts decreasing only after $\mathbf{D}$ is adapted. On the Ell-Dis function (9(d)) however, the short non-coordinate axis is learned (too) quickly by the correlation matrix. Therefore, the inverse damping factor decreases before $\mathbf{D}$ has adapted the full coordinate-wise scaling. Since the function value is more sensitive in the subspace corresponding to great eigenvalues of the Hessian of the objective and the ranking of candidate solutions is mostly determined by this subspace, the CMA-ES first shrinks the distribution in this subspace. This is the main reason why dd-CMA is more efficient on Ell-Cig than on Ell-Dis.

Figure 10 shows, similar to Figure 6, the number of iterations versus the population size $\lambda$ on three functions in dimension 40. For all larger populations sizes up to $13312$ , dd-CMA performs on par with the better of plain and separable CMA, as ideally to be expected.

We remark that dd-CMA with default $\lambda=13$ has a relatively large variance in performance on the separable TwoAxes function. Increasing $\lambda$ from its default value reduces this variance and even reduces the number of $f$ -calls to reach the target. This defect of dd-CMA is even more pronounced for higher dimensions. It may suggest to increase the default $\lambda$ for dd-CMA. We leave this to be addressed in future work.

Figure 11 shows the number of iterations in successful runs, the success rate and the average number of evaluations in successful runs divided by the success rate (Price,, 1997; Auger and Hansen, 2005a, ) on two multimodal $40$ -dimensional functions. The maximum numbers of $f$ -calls are $5n\times 10^{4}$ on Bohachevsky and $2n\times 10^{5}$ on Rastrigin. Comparing separable to dd-CMA on separable functions and plain to dd-CMA on rotated functions, dd-CMA tends to spend less iterations but to require greater $\lambda$ to reach the same success rate. The latter is attributed to doubly shrinking the overall variance by updating $\mathbf{C}$ and $\mathbf{D}$ . For $\lambda\gg n$ , we have $c_{\mu}\approx 1$ and $c_{\mu,D}\approx 1$ and the effect of shrinking the overall variance due to $\mathbf{C}$ and $\mathbf{D}$ updates is not negligible. Then, the overall variance is smaller than in plain and separable CMA, which also leads to faster convergence. This effect disappears if we force the $\mathbf{D}$ update to keep its determinant unchanged, which can be realized by replacing $\Delta_{D}$ in (29) with $\Delta_{D}-(\operatorname{Tr}(\Delta_{D})/n)\mathbf{I}$ .

7 Summary and Conclusion

In this paper we have put forward several techniques to improve multi-recombinative non-elitistic covariance matrix adaptation evolution strategies without changing their internal computation effort.

The first component is concerned with the active covariance matrix update, which utilizes unsuccessful candidate solutions to actively shrink the sampling distribution in unpromising directions. We propose two ways to guarantee the positive definiteness of the covariance matrix by rescaling unsuccessful candidate solutions.

The second and main component is a diagonal acceleration of CMA by adaptive diagonal decoding, dd-CMA. The covariance matrix adaptation is accelerated by adapting a coordinate-wise scaling separately from the positive definite symmetric covariance matrix. This drastically accelerates the adaptation speed on high-dimensional functions with coordinate-wise ill-conditioning, whereas it does not significantly influence the adaptation of the covariance matrix on highly ill-conditioned non-separable functions without coordinate-wise scaling component.

The last component is a set of improved default parameters for CMA. The scaling of the learning rates are relaxed from $\Theta(1/n^{2})$ to $\Theta(1/n^{1.75})$ for default CMA and from $\Theta(1/n)$ to $\Theta(1/n^{0.75})$ for separable CMA. This contributes to accelerate the learning in all investigated CMA variants on high dimensional problems, say $n\geqslant 100$ .

Algorithm selection as well as hyper-parameter tuning of an algorithm is a troublesome issue in black-box optimization. For CMA-ES, we needed to make a decision whether we use the plain CMA or the separable CMA, based on the limited knowledge on a problem of interest. If we select the separable CMA but the objective function is non-separable and highly ill-conditioned, we will not obtain a reasonable solution within most given function evaluation budgets. Therefore, plain CMA is a safer choice, though it may be slower on nearly separable functions. The dd-CMA automizes this decision and achieves faster adaptation speed on functions with ill-conditioned variables usually without compromising the performance on non-separable and highly ill-conditioned functions. Moreover, the advantage of dd-CMA is not limited to separable problems. We found a class of non-separable problems with variables of different sensitivity on which dd-CMA-ES decidedly outperforms CMA-ES and separable CMA-ES (see Figure 8(a)). As dd-CMA-ES improves the performance on a relatively wide range of problems with mis-scaling of variables, it is a prime candidate to become the default CMA-ES variant.

8 Acknowledgements

The authors thank the associate editor and the anonymous reviewers for their valuable comments and suggestions. This work is partly supported by JSPS KAKENHI Grant Number 19H04179.

Bibliography38

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Akimoto et al., (2014) Akimoto, Y., Auger, A., and Hansen, N. (2014). Comparison-based natural gradient optimization in high dimension. In Proceedings of Genetic and Evolutionary Computation Conference , pages 373–380.
2Akimoto et al., (2018) Akimoto, Y., Auger, A., and Hansen, N. (2018). Quality gain analysis of the weighted recombination evolution strategy on general convex quadratic functions. Theoretical Computer Science .
3Akimoto and Hansen, (2016) Akimoto, Y. and Hansen, N. (2016). Projection-based restricted covariance matrix adaptation for high dimension. In Genetic and Evolutionary Computation Conference, GECCO 2016, Denver, Colorado, USA, July 20-24, 2016 , pages 197–204. ACM.
4Akimoto et al., (2010) Akimoto, Y., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Bidirectional relation between CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving from Nature - PPSN XI , number 6238 in LNCS, pages 154–163. Springer-Verlag.
5Akimoto et al., (2008) Akimoto, Y., Sakuma, J., Ono, I., and Kobayashi, S. (2008). Functionally specialized CMA-ES: a modification of CMA-ES based on the specialization of the functions of covariance matrix adaptation and step size adaptation. In GECCO ’08: Proceedings of the 10th annual conference on Genetic and evolutionary computation . ACM.
6Arnold, (2006) Arnold, D. V. (2006). Weighted multirecombination evolution strategies. Theoretical Computer Science , 361:18–37.
7Arnold and Hansen, (2010) Arnold, D. V. and Hansen, N. (2010). Active covariance matrix adaptation for the (1+1)-CMA-ES. In Proceedings of Genetic and Evolutionary Computation Conference , pages 385–392, Portland, Oregon. ACM.
8(8) Auger, A. and Hansen, N. (2005 a). Performance evaluation of an advanced local search evolutionary algorithm. In 2005 IEEE Congress on Evolutionary Computation , pages 1777–1784. Ieee.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Diagonal Acceleration for Covariance Matrix Adaptation Evolution Strategies

Abstract

1 Introduction

2 Evolution Strategy with Covariance Matrix Adaptation

2.1 Original CMA-ES

Sampling and Recombination

Step-Size Adaptation

Covariance Matrix Adaptation

2.2 Separable Covariance Matrix Adaptation

3 Active covariance matrix update with guarantee of positive definiteness

3.1 Method 1: Scaling Down the Update Factor

3.2 Method 2: Scaling Down Negative Weights

Theorem 3.1**.**

Proof of Theorem 3.1.

3.3 Choice of Recombination Weights

4 Adaptive Diagonal Decoding (dd-CMA)

4.1 Algorithm

4.2 Damping factor

4.3 Implementation Remark

5 Algorithm Summary and Strategy Parameters

6 Experiments

6.1 Common Setting

6.2 Active CMA

6.3 Adaptive Diagonal Decoding

7 Summary and Conclusion

8 Acknowledgements

Theorem 3.1.