Diagonal Acceleration for Covariance Matrix Adaptation Evolution Strategies
Youhei Akimoto, Nikolaus Hansen

TL;DR
This paper introduces dd-CMA, an acceleration technique for CMA-ES that uses adaptive diagonal decoding to improve performance on both separable and non-separable problems, especially in high dimensions.
Contribution
The paper presents a novel diagonal acceleration method for CMA-ES, enhancing its speed and scalability without sacrificing performance on complex, non-separable functions.
Findings
dd-CMA-ES outperforms original CMA-ES on ill-conditioned functions
Significant speedup observed in high-dimensional optimization
Effective even with large population sizes up to dimension squared
Abstract
We introduce an acceleration for covariance matrix adaptation evolution strategies (CMA-ES) by means of adaptive diagonal decoding (dd-CMA). This diagonal acceleration endows the default CMA-ES with the advantages of separable CMA-ES without inheriting its drawbacks. Technically, we introduce a diagonal matrix D that expresses coordinate-wise variances of the sampling distribution in DCD form. The diagonal matrix can learn a rescaling of the problem in the coordinates within linear number of function evaluations. Diagonal decoding can also exploit separability of the problem, but, crucially, does not compromise the performance on non-separable problems. The latter is accomplished by modulating the learning rate for the diagonal matrix based on the condition number of the underlying correlation matrix. dd-CMA-ES not only combines the advantages of default and separable CMA-ES, but mayā¦
| Sphere | |||
|---|---|---|---|
| Cigar | |||
| Discus | |||
| Ellipsoid | |||
| TwoAxes | |||
| Ell-Cig | |||
| Ell-Dis | |||
| Rosenbrock | |||
| Bohachevsky | |||
| Rastrigin |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBlind Source Separation Techniques Ā· Neural Networks and Applications Ā· Metaheuristic Optimization Algorithms Research
Diagonal Acceleration for Covariance Matrix Adaptation Evolution Strategies
\nameY. Akimoto \[email protected]
\addrUniversity of Tsukuba, 1-1-1 Tennodai, Tsukuba, JAPAN \AND\nameN. Hansen \[email protected]
\addrInria, RandOpt Team, CMAP, Ecole Polytechnique, Palaiseau, FRANCE
Abstract
We introduce an acceleration for covariance matrix adaptation evolution strategies (CMA-ES) by means of adaptive diagonal decoding (dd-CMA). This diagonal acceleration endows the default CMA-ES with the advantages of separable CMA-ES without inheriting its drawbacks. Technically, we introduce a diagonal matrix that expresses coordinate-wise variances of the sampling distribution in form. The diagonal matrix can learn a rescaling of the problem in the coordinates within linear number of function evaluations. Diagonal decoding can also exploit separability of the problem, but, crucially, does not compromise the performance on non-separable problems. The latter is accomplished by modulating the learning rate for the diagonal matrix based on the condition number of the underlying correlation matrix. dd-CMA-ES not only combines the advantages of default and separable CMA-ES, but may achieve overadditive speedup: it improves the performance, and even the scaling, of the better of default and separable CMA-ES on classes of non-separable test functions that reflect, arguably, a landscape feature commonly observed in practice.
The paper makes two further secondary contributions: we introduce two different approaches to guarantee positive definiteness of the covariance matrix with active CMA, which is valuable in particular with large population size; we revise the default parameter setting in CMA-ES, proposing accelerated settings in particular for large dimension.
All our contributions can be viewed as independent improvements of CMA-ES, yet they are also complementary and can be seamlessly combined. In numerical experiments with dd-CMA-ES up to dimension 5120, we observe remarkable improvements over the original covariance matrix adaptation on functions with coordinate-wise ill-conditioning. The improvement is observed also for large population sizes up to about dimension squared.
Keywords
Evolution strategies, covariance matrix adaptation, adaptive diagonal decoding, active covariance matrix update, default strategy parameters.
1 Introduction
In real world applications of continuous optimization involving simulations such as physics or chemical simulations, the input-output relation between a candidate solution and its objective function value is barely expressible in explicit mathematical formula. The objective function value is computed through a complex simulation with a candidate solution as an input. In such scenarios, we gain the information of the problem only through the evaluation of the objective function value of a given candidate solution. A continuous optimization of is referred to as black-box continuous optimization if we gain the information of the problem only through the evaluation \mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}\mapsto f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}) of a given candidate solution \mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}\in\mathbb{R}^{n}.
Black-box continuous optimization arises widely in real world applications such as model parameter calibration and design of robot controller. It often involves computationally expensive simulation to evaluate the quality of candidate solutions. The search cost of black-box continuous optimization is therefore the number of simulations, i.e., the number of objective function calls, and a search algorithm is desired to locate good quality solutions with as few -calls as possible. Practitioners need to choose one or a few search algorithms to solve their problems and tune their hyper-parameters based on the prior knowledge into their problems. However, prior knowledge is often limited in the black-box situation due to the black-box relation between and f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}). Hence, algorithm selection, as well as hyper-parameter tuning, is a tedious task for practitioners who are typically not experts in search algorithms.
Covariance Matrix Adaptation Evolution Strategy (CMA-ES), developed by Hansen and Ostermeier, (2001); Hansen etĀ al., (2003); Hansen and Kern, (2004); Jastrebski and Arnold, (2006), is recognized as a state-of-the-art derivative-free search algorithm for difficult continuous optimization problems (Rios and Sahinidis,, 2013). CMA-ES is a stochastic and comparison-based search algorithm that maintains the multivariate normal distribution as a sampling distribution of candidate solutions. The distribution parameters such as the mean vector and the covariance matrix are updated at each iteration based on the candidate solutions and their objective value ranking, so that the sampling distribution will produce promising candidate solutions more likely in the next algorithmic iteration. The update of the distribution parameters is partially found as the natural gradient ascent on the manifold of the distribution parameter space equipped with the Fisher metric (Akimoto etĀ al.,, 2010; Glasmachers etĀ al.,, 2010; Ollivier etĀ al.,, 2017), thereby revealing the connection to natural evolution strategies Wierstra etĀ al., (2008); Sun etĀ al., (2009); Glasmachers etĀ al., (2010); Wierstra etĀ al., (2014), whose parameter update is derived explicitly from the natural gradient principle.
Invariance (Hansen,, 2000; Hansen etĀ al.,, 2011) is one of the governing principles of the design of CMA-ES and the essence of its success on difficult continuous optimization problems consisting of ruggedness, ill-conditioning, and non-separability. CMA-ES exhibits several invariance properties such as invariance to order preserving transformation of the objective function, invariance to translation, rotation and coordinate-wise scaling of the search space (Hansen and Auger,, 2014). Invariance guarantees the algorithm to work identically on an original problem and its transformed version. Thanks to its invariance properties, CMA-ES works, after an adaptation phase, equally well on separable and well-conditioned functions, which are easy for most search algorithms, and on non-separable and ill-conditioned functions produced by an affine coordinate transformation of the former, which are considered difficult for many other search algorithms. This also contributes to allow default hyper-parameter values to depend solely on the search space dimension and the population size, whereas many other search algorithms need tuning depending on problem difficulties to make the algorithms efficient (Karafotias etĀ al.,, 2015).
On the other hand, exploiting problem structure, if possible, is beneficial for optimization speed. Different variants of CMA-ES have been proposed to exploit problem structure such as separability and limited variable dependency. They aim to achieve a better scaling with the dimension (Knight and Lunacek,, 2007; Ros and Hansen,, 2008; Akimoto etĀ al.,, 2014; Akimoto and Hansen,, 2016; Loshchilov,, 2017). However, they lose some of the invariance properties that CMA-ES has and compromise the performance on problems where their specific, more or less restrictive assumptions on the problem structure do not hold. For instance, the separable CMA-ES (Ros and Hansen,, 2008) reduces the degrees of freedom of the covariance matrix from to by adapting only the diagonal elements of the covariance matrix. It scales better in terms of internal time and space complexity and in terms of number of function evaluations to adapt the coordinate-wise variance of the search distribution. Good results are hence observed on functions with weak variable dependencies. However, unsurprisingly, the convergence speed of the separable CMA is significantly deteriorated on non-separable and ill-conditioned functions, where the shape of the level sets of the objective function can not be reasonably well approximated by the equi-probability ellipsoid defined by the normal distribution with diagonal (co)variance matrix.
In this paper, we aim to improve the performance of CMA-ES on a relatively wide class of problems by exploiting problem structure, however crucially, without compromising the performance on more difficult problems without this structure.111Any covariance matrix, , can be uniquely decomposed into , where is a diagonal matrix and is a correlation matrix. The addressed problem class can be characterized in that for the best problem approximation both matrices, Ā and , have non-negligible condition number, say no less than .
The first mechanism we are concerned with is the so-called active covariance matrix update, which was originally proposed for the -CMA-ES with intermediate recombination (Jastrebski and Arnold,, 2006), and later incorporated into the -CMA-ES (Arnold and Hansen,, 2010). It utilizes unpromising solutions to actively decrease the eigenvalues of the covariance matrix. The active update consistently improves the adaptation speed of the covariance matrix in particular on functions where a low dimensional subspace dominates the function value. The positive definiteness of the covariance matrix is however not guaranteed when the active update is utilized. Practically, a small enough learning rate of the covariance matrix is sufficient to keep the covariance matrix positive definite with overwhelming probability, however, we would like to increase the learning rate when the population size is large. We propose two novel schemes that guarantee the positive definiteness of the covariance matrix, so that we take advantage of the active update even when a large population size is desired, e.g., when the objective function evaluations are distributed on many CPUs or when the objective function is rugged.
The main contribution of this paper is the diagonal acceleration of CMA by means of adaptive diagonal decoding, referred to as dd-CMA. We introduce a coordinate-wise variance matrix, , of the sampling distribution alongside the positive definite symmetric matrix , such that the resulting covariance matrix of the sampling distribution is represented by . We call the diagonal decoding matrix. We update with the original CMA, whereas is updated similarly to separable CMA. An important point is that we want to update faster than , by setting higher learning rates for the update. However, when contains non-zero covariances, the update of can result in a drastic change of the sampling distribution and disturb the adaptation of . To resolve this issue, we introduce an adaptive damping mechanism for the update, so that the difference (e.g., Kullback-Leibler divergence) between the distributions before and after the update remains sufficiently small. With this damping, is updated as fast as by separable CMA on a separable function if the correlation matrix of the sampling distribution is close to the identity, and it suppresses the update when the correlation matrix is away from the identity, i.e., variable dependencies have been learned.
The update of breaks the rotation invariance of the original CMA, hence we lose a mathematical guarantee of the original CMA that it performs identical on functions in an equivalence group defined by this invariance property. The ultimate aim of this work is to gain significant speed-up in some situations and to preserve the performance of the original CMA in the worst case. Functions where we expect that the diagonally accelerated CMA outperforms the original one have variables with different sensitivities, i.e., coordinate-wise ill-conditioning. Such functions may often appear in practice, since variables in a black-box continuous optimization problem can have distinct meanings. Diagonal acceleration however can even be superior to the optimal additive portfolio of the original CMA and the separable CMA. We demonstrate in numerical experiments that dd-CMA outperforms the original CMA not only on separable functions but also on non-separable ill-conditioned functions with coordinate-wise ill-conditioning that separable CMA can not efficiently solve.
The last contribution is a set of improved and simplified default parameter settings for the covariance matrix adaptation and for the cumulation factor for the so-called evolution path. These learning rates, whose default values have been previously expressed with independent formulas, are reformulated so that their dependencies are clearer. The new default learning rates also improve the adaptation speed of the covariance matrix on high dimensional problems without compromising stability.
The rest of this paper is organized as follows. We introduce the original and separable CMA-ES in SectionĀ 2. The active update of the covariance matrix with positive definiteness guarantee is proposed in SectionĀ 3. The adaptive diagonal decoding mechanism is introduced in SectionĀ 4. SectionĀ 5 is devoted to explain the renewed default hyper-parameter values for the covariance matrix adaptation and provides an algorithm summary of dd-CMA-ES and a link to publicly available Python code. Numerical experiments are conducted in SectionĀ 6 to see how effective each component of CMA with diagonal acceleration works in different situations. We conclude the paper in SectionĀ 7.
2 Evolution Strategy with Covariance Matrix Adaptation
We summon up the (, )-CMA-ES consisting of weighted recombination, cumulative step-size adaptation, and rank-one and rank- covariance matrix adaptation.
The CMA-ES maintains the multivariate normal distribution, \mathcal{N}(\mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}},\sigma^{2}\mathbf{D}\mathbf{C}\mathbf{D}), where \mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}}\in\mathbb{R}^{n} is the mean vector that represents the center of the search distribution, is the so-called step-size that represents the scaling factor of the distribution spread, and is a positive definite symmetric matrix that represents the shape of the distribution ellipsoid. Though the covariance matrix of the sampling distribution is , we often call the covariance matrix in the context of CMA-ES. In this paper, we apply this terminology, and we will call the covariance matrix of the sampling distribution to distinguish them when necessary. The positive definite diagonal matrix is regarded as a diagonal decoding matrix, which represents the scaling factor of each design variable. It is fixed during the optimization, usually , and does not appear in the standard terminology. However, it plays an important role in this paper, and we define the CMA-ES with for the later modification.
2.1 Original CMA-ES
The CMA-ES repeats the following steps until it meets one or more termination criteria:
Sample candidate solutions, \mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i}, independently from \mathcal{N}(\mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}},\sigma^{2}\mathbf{D}\mathbf{C}\mathbf{D}); 2. 2.
Evaluate candidate solutions on the objective, f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i}), and sort them in ascending order222Ties, f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i:\lambda})=\cdots=f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i+k-1:\lambda}), are treated by redistributing the averaged recombination weights to tied solutions \mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i:\lambda},\dots,\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i+k-1:\lambda}., f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{1:\lambda})\leq\cdots\leq f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{\lambda:\lambda}), where is the index of the -th best candidate solution among \mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{1},\dots,\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{\lambda}; 3. 3.
Update the distribution parameters, , , and .
Sampling and Recombination
To generate candidate solutions, we compute the (unique, symmetric) square root of the covariance matrix obeying \mathbf{C}^{(t)}=\big{(}\sqrt{\mathbf{C}}\big{)}^{2}. The candidate solutions are the affine transformation of independent and standard normally distributed random vectors,
[TABLE]
The default population size is . To reduce the time complexity per -call without compromising the performance, we compute the matrix decomposition \mathbf{C}^{(t)}=\big{(}\sqrt{\mathbf{C}}\big{)}^{2} every t_{\text{eig}}=\max\big{(}1,\big{\lfloor}(\beta_{\mathrm{eig}}(c_{1}+c_{\mu}))^{-1}\big{\rfloor}\big{)} iteration, where and are the learning rates for the covariance matrix adaptation that appear later, and . If the learning rates are small enough (), the covariance matrix is regarded as insignificantly changing in each iteration and we stall the decomposition.
The mean vector is updated by taking the weighted average of the promising candidate directions,
[TABLE]
where is the learning rate for the mean vector update, usually . The number of promising candidate solutions are denoted by , and \big{(}w_{i}\big{)}_{i=1}^{\mu} are recombination weights satisfying for . A standard choice is for and and .
Step-Size Adaptation
The cumulative step-size adaptation (CSA) updates the step-size based on the length of the evolution path that accumulates the mean shift normalized by the current distribution covariance, i.e.,333When is not the identity, (3) is not exactly equivalent to the original CSA (Hansen and Ostermeier,, 2001): \mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}_{i:\lambda}=(\mathbf{D}\sqrt{\mathbf{C}})^{-1}(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i:\lambda}-\mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}})/\sigma in this paper whereas originally \mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}_{i:\lambda}=\sqrt{\mathbf{D}\mathbf{C}\mathbf{D}}^{-1}(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i:\lambda}-\mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}})/\sigma. This difference results in rotating the second term of the RHS of (3) at each iteration with a different orthogonal matrix, and ends up in a different \lVert\mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{\sigma}\rVert. Krause etĀ al., (2016) have theoretically studied the effect of this difference and argued that this systematic difference becomes small if the parameterization of the covariance matrix of the sampling distribution is unique and it converges up to scale. If is fixed, the parameterization is unique. Later in this paper, we update both and but we force the parameterization to be unique by (34) and (35). Hence the systematic difference is expected to be small. See Krause etĀ al., (2016) for details.
[TABLE]
where is the so-called variance effective selection mass, and is the inverse time horizon for the \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{\sigma} update, aka cumulation factor. The scalar is a correction factor for small and converges to , where \gamma_{\sigma}^{(0)}=\|\mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{\sigma}^{(0)}\|^{2}/n=0.444An elegant alternative to introducing is to use in place of Ā in (3), assuming the first . This resembles a simple average while and only afterwards discounts older information by the original decay of .
The log step-size is updated as
[TABLE]
where is the damping factor for the update and \chi_{n}=\sqrt{n}\big{(}1-\frac{1}{4n}+\frac{1}{21n^{2}}\big{)} is an approximation of the expected value of the norm of an -dimensional standard normally distributed random vector, \sqrt{2}\Gamma\big{(}\frac{n+1}{2}\big{)}/\Gamma\big{(}\frac{n}{2}\big{)}. The default values for and are
[TABLE]
The damping parameter is introduced to stabilize the step-size adaptation when the population size is large (Hansen and Kern,, 2004). When the step-size becomes too small by accident or is initialized so, the norm of the evolution path will become O\big{(}\sqrt{\mu_{w}/n}\big{)}, which results in a quick increase of if (Akimoto etĀ al.,, 2008). For large , the chosen damping factor prevents an unreasonable increase of at the price of a reduced convergence speed. In case of , the covariance matrix adaptation is the main component decreasing the overall variance of the sampling distribution, while the CSA is still effective to increase when necessary.
Covariance Matrix Adaptation
The covariance matrix is updated by the following formula that combines rank-one and rank- update
[TABLE]
where and are the learning rates for the rank-one update (2nd term) and the rank- update (3rd term), respectively, \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c} is the evolution path that accumulates the successive mean movements and is the correction factor for the rank-one update, which are updated as
[TABLE]
where is the inverse time horizon for the \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c} update. The Heaviside function is introduced to stall the update of \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c} if \lVert\mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{\sigma}\rVert is large, i.e., when the step-size is rapidly increasing. It is defined as
[TABLE]
The default parameters for , , and are smaller than one and presented later.
2.2 Separable Covariance Matrix Adaptation
The separable covariance matrix adaptation (sep-CMA, Ros and Hansen, (2008)) adapts only the coordinate-wise variance of the sampling distribution, i.e., the diagonal elements of the covariance matrix in the same way as (8), but with larger learning rates. In our notation scheme, we keep Ā to be the identity and describe sep-CMA by updating . The update of coordinate follows
[TABLE]
The learning rates and are set differently from those used for the update.
One advantage of the separable CMA is that all operations can be done in linear time per -call. Therefore, it is promising if -calls are cheap. The other advantage is that one can set the learning rate greater than those used for the update, since the degrees of freedom of the covariance matrix of the sampling distribution is , rather than . The adaptation speed of the covariance matrix is faster than for the original CMA. However, if the problem is non-separable and has strong variable dependencies, adapting the coordinate-wise scaling is not enough to make the search efficient. More concisely, if the inverse Hessian of the objective function, , is not well approximated by a diagonal matrix, the convergence speed will be , which is empirically observed on convex quadratic functions as well as theoretically deduced in Akimoto etĀ al., (2018). In practice, it is rarely known in advance whether the separable CMA is appropriate or not. Ros and Hansen, (2008) propose to use the separable CMA for hundred times dimension function evaluations and then switch to the original CMA afterwards. Such an algorithm has been benchmarked in Ros, (2009), where the first iterations are used for the separable CMA.
3 Active covariance matrix update with guarantee of positive definiteness
Active covariance matrix adaptation (referred to as Active-CMA, Jastrebski and Arnold,, 2006) utilizes the information of unpromising solutions, \mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}}_{i:\lambda} for , to update the covariance matrix. The modification is rather simple. We prepare recombination weights for the active covariance matrix update, where are not anymore restricted to be positive. For example, the recombination weights used in Jastrebski and Arnold, (2006) are
[TABLE]
The update formula (8) is then replaced with
[TABLE]
where shaded areas depict the difference to (8). The only difference is that we use all candidate solutions to update the covariance matrix with positive and negative recombination weights.555As of 2018, many implementations of CMA-ES feature the active update of Ā and it should be considered as default.
Each component of the covariance matrix adaptation, rank-one update, rank- update, and active update, produces complementary effects. The rank-one update of the covariance matrix (the second term on the RHS of (14), Hansen and Ostermeier,, 2001) accumulates the successive steps of the distribution mean and increases the eigenvalues of the covariance matrix in the direction of the successive movements. It excels at learning one long axis of the covariance matrix, and is effective on functions with a subspace of a relatively small dimension where the function value is less sensitive than in its orthogonal subspace. FigureĀ 1(a) visualizes an example of such function. See also FigureĀ 7 in Hansen and Auger, (2014) for numerical results. On the other hand, since the update is always of rank one, the learning rate needs to be sufficiently small to keep the covariance matrix regular and stable.
The rank- update (the third term on the RHS of (14) with positive , Hansen etĀ al.,, 2003) utilizes the information of successful candidate solutions in a different way than the rank-one update. It computes the empirical covariance matrix of successful mutation steps \mathchoice{\mbox{\boldmath\displaystyle y}}{\mbox{\boldmath\textstyle y}}{\mbox{\boldmath\scriptstyle y}}{\mbox{\boldmath\scriptscriptstyle y}}_{i}. The update matrix is with probability one of rank , allowing to set a relatively large learning rate . It reduces the number of iterations to adapt the covariance matrix when .
Both, rank-one and rank- update, try to increase the variances in the subspace of successful mutation steps. The eigenvalues corresponding to unsuccessful mutation steps only passively fade out. The active update actively decreases such eigenvalues. It consistently accelerates covariance matrix adaptation, and the improvement is particularly pronounced on functions with a small number of dominating eigenvalues of the Hessian of the objective function. FigureĀ 1(c) is an example of such function.
A disadvantage of the active update with negative recombination weights such as (13) is to have no guarantee of the positive definiteness of the covariance matrix anymore. It is easy to see that the rank-one and rank- update of CMA in (8) guarantee that the minimal eigenvalue of is no smaller than the minimal eigenvalue of times , since the second and third terms only increase the eigenvalues. However, the introduction of negative recombination weights can violate the positive definiteness because the third term may decrease the minimum eigenvalue arbitrarily. Jastrebski and Arnold, (2006) set a sufficiently small learning rate for the active update, i.e., the absolute values of the negative recombination weights sum up to a smaller value than one. It will prevent the covariance matrix from being non-positive with high probability, but it does not guarantee positive definiteness. Moreover, it becomes ineffective when the population size is relatively large and a greater learning rate is desired.
Krause and Glasmachers, (2015) apply the active update with positive definiteness guarantee by introducing the exponential covariance matrix update, called xCMA. Instead of updating the covariance matrix in an additive way as in (14), the covariance matrix is updated as
[TABLE]
where is a symmetric matrix. Since the eigenvalues of the matrix exponential are where are the eigenvalues of , the positive definiteness is naturally guaranteed. Arnold and Hansen, (2010) achieve the positive definiteness guarantee in the -CMA-ES by rescaling the negative recombination weights depending on the norm of unsuccessful mutation steps \lVert\mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}\rVert. In this paper we introduce two strategies that are both considered as generalization of this idea to the -CMA-ES.666There are variants of CMA-ES that update a factored matrix satisfying (e.g., eNES by Sun etĀ al.,, 2009). No matter how is updated, the positive semi-definiteness of is guaranteed since is always positive semi-definite. However this approach has a potential drawback that a negative update may end up increasing a variance. To see this, consider the case , where is some vector and represents the learning rate times the recombination weight. Then, the covariance matrix follows . This update shrinks a variance if is sufficiently small (), however, it increases the variance if is large (). Hence, a negative update with a long vector and/or a large will cause an opposite effect. Therefore, the factored matrix update is not a conclusive solution to the positive definiteness issue.
3.1 Method 1: Scaling Down the Update Factor
To guarantee the positive definiteness of the covariance matrix, we rescale the update factor of the covariance matrix so that the changes of the minimum eigenvalue of the covariance matrix is bounded. We start by introducing the rescaling of unpromising solutions,
[TABLE]
This rescaling results in projecting unpromising solutions onto the surface of the hyper-ellipsoid defined by \mathchoice{\mbox{\boldmath\displaystyle y}}{\mbox{\boldmath\textstyle y}}{\mbox{\boldmath\scriptstyle y}}{\mbox{\boldmath\scriptscriptstyle y}}^{\mathsf{T}}\mathbf{C}^{-1}\mathchoice{\mbox{\boldmath\displaystyle y}}{\mbox{\boldmath\textstyle y}}{\mbox{\boldmath\scriptstyle y}}{\mbox{\boldmath\scriptscriptstyle y}}=n. By this projection we achieve three desired effects. First, the update becomes bounded which makes it easier to control positive definiteness. Second, short steps are elongated, enhancing their effect in the update. Third, long steps are shortened, reducing their effect in the update. The two latter effects counter the correlation between rank and length of steps in unfavorable directions. For any given unfavorable direction, longer steps in this direction are most likely ranked worse than shorter steps.
With these scaled solutions, the covariance matrix is updated as
[TABLE]
where shaded areas highlight the difference to (14) and is the scaling factor to guarantee the positive definiteness of the covariance matrix. Note that if , it is equivalent to (14) except for the rescaling of the unpromising samples. Importantly, the rescaling of unpromising solutions does not affect the stationarity of the parameter update, i.e., under a random function such as f(\mathchoice{\mbox{\boldmath\displaystyle x}}{\mbox{\boldmath\textstyle x}}{\mbox{\boldmath\scriptstyle x}}{\mbox{\boldmath\scriptscriptstyle x}})\sim\mathcal{U}(0,1). This is shown as follows. First, given \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c}^{(t)}\sim\mathcal{N}(\bm{0},\gamma_{c}^{(t)}\mathbf{D}\mathbf{C}^{(t)}\mathbf{D}), it is easy to see that \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c}^{(t+1)}\sim\mathcal{N}(\bm{0},\gamma_{c}^{(t+1)}\mathbf{D}\mathbf{C}^{(t)}\mathbf{D}). Second, using \mathbb{E}[\mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}\mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}^{\mathsf{T}}/\lVert\mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}\rVert^{2}]=\mathbf{I}/n (LemmaĀ 1 of Heijmans,, 1999) we have \mathbb{E}\big{[}\sum_{i=1}^{\lambda}w_{i}\tilde{}\mathchoice{\mbox{\boldmath\displaystyle y}}{\mbox{\boldmath\textstyle y}}{\mbox{\boldmath\scriptstyle y}}{\mbox{\boldmath\scriptscriptstyle y}}_{i:\lambda}\tilde{}\mathchoice{\mbox{\boldmath\displaystyle y}}{\mbox{\boldmath\textstyle y}}{\mbox{\boldmath\scriptstyle y}}{\mbox{\boldmath\scriptscriptstyle y}}_{i:\lambda}^{\mathsf{T}}\mid\mathbf{C}^{(t)}\big{]}=\big{(}\sum_{i=1}^{\lambda}w_{i}\big{)}\mathbf{C}^{(t)}. Finally combining them, we obtain .
The main idea is to scale down, if necessary, the update of the covariance matrix by setting in (17). To provide a better intuition, we start by considering the case , i.e., for every iteration . Equation (17) can be written as
[TABLE]
with
[TABLE]
Then, the scaling down parameter in (18) is taken so that is positive definite. Then, with maximum and minimum eigenvalue of a matrix denoted by and , respectively, we have that
[TABLE]
where mean that is positive semi-definite (i.e., all eigenvalues are non-negative)777Some references (e.g., Harville, (2008)) use the term āpositive semi-definite matrixā for a matrix with positive and zero eigenvalues and ānon-negative definite matrixā is used for matrices with non-negative eigenvalues, whereas in other references these terms are used for matrices with non-negative eigenvalues. In this paper, we apply the latter terminology. for any two square matrices and . Moreover, . Therefore, if , the resulting covariance matrix is guaranteed to be positive definite.
For the general case (), let be the iteration at which the last eigen decomposition was performed, i.e., . For iterations , we store the intermediate update matrix . The covariance matrix is updated only when the eigen decomposition is required, i.e., at iteration , as
[TABLE]
Analogously to the above argument, the resulting covariance matrix is positive definite if the inside of the brackets is positive definite. We set the scaling down parameter as
[TABLE]
The first argument guarantees that the minimum eigenvalue of is greater than or equal to th of the minimum eigenvalue of . More concisely, we have . The last argument, which is the most frequent case in practice, implies that the covariance matrix update does not need to be scaled down.
3.2 Method 2: Scaling Down Negative Weights
An alternative way to guarantee the positive definiteness of the covariance matrix is to scale down the sum of the absolute values of the negative weights, combined with the projection of unpromising solutions introduced above. We use the same update formula (21), but .
The positive definiteness of the covariance matrix is guaranteed under the condition . More precisely, we have the following claim.
Theorem 3.1**.**
If , then \mathbf{C}^{(t+t_{\mathrm{eig}})}\succcurlyeq\big{(}1-t_{\mathrm{eig}}\big{(}c_{1}+c_{\mu}\sum_{i=1}^{\lambda}w_{i}+nc_{\mu}\sum_{i:\ w_{i}<0}\lvert w_{i}\rvert\big{)}\big{)}\mathbf{C}^{(t)}.
Proof of TheoremĀ 3.1.
First, we consider the case . Using the fact that the self outer product of a vector is a matrix of rank with the only nonzero eigenvalue of , we have
[TABLE]
The analogous argument holds for as well. This completes the proof. ā
We show how to construct the recombination weights so that they satisfy the sufficient condition of TheoremĀ 3.1 as follows. Let be the pre-defined weights that are non-increasing. Without loss of generality, we assume that the first weights are positive and sum to one. The recombination weights are for and for , where . Then, the sufficient condition in TheoremĀ 3.1 reads . It holds if we set for example , as satisfied by the default choice t_{\text{eig}}=\max\big{(}1,\big{\lfloor}(\beta_{\mathrm{eig}}(c_{1}+c_{\mu}))^{-1}\big{\rfloor}\big{)}, and
[TABLE]
Then, it is guaranteed that .
This method is simpler than the method described in SectionĀ 3.1 since it does not require an additional eigenvalue computation. However, to guarantee the positive definiteness in this way, we bound the unrealistic worst case where all unpromising candidate solutions are sampled on a line. Therefore, the scaling down factor is set much smaller than the value that is actually needed in practice. FigureĀ 2 visualizes the correction factor under our choice of the default pre-weights and learning rates described in SectionĀ 5. The sum of the negative recombination weights needs to decrease as the population size increases. For , the factor is less than for . Unless the internal computational time becomes a critical bottleneck, we prefer the method described in SectionĀ 3.1 over the method in SectionĀ 3.2.
3.3 Choice of Recombination Weights
We first review the rationale for the choice of the positive recombination weights. The default positive recombination weights, for , are based on the quality gain analysis of the weighted recombination ES with isotropic Gaussian distribution (i.e., ). Arnold, (2006) studied the quality gain on a spherical function, and derived the optimal recombination weights for the infinite dimensional case. They are
[TABLE]
where is the ordered statistics from the independent and standard normally distributed random variables . Recently, it has been shown that the same recombination weights are optimal for a general convex quadratic function with Hessian matrix satisfying for large (Akimoto etĀ al.,, 2018). The value approximates the first half of the optimal recombination weights.888Interesting results are derived from the quality gain (and progress rate) analysis, see e.g.Ā SectionĀ 4.2 of Hansen etĀ al., (2015). Comparing the ()-ES and (, )-ES with intermediate recombination, we realize that they reach the same quality gain in the limit for to infinity when is the optimal value, (Beyer,, 1995). That is, a non-elitist strategy with optimal for intermediate recombination can reach the same speed as the elitist strategy. With the optimal recombination weights above, the weighted recombination ES is times faster than those algorithms. If we employ only the positive half of the optimal recombination weights, the speed will be halved, yet it is faster by the factor of .
We propose to use the recombination weights constructed as follows. Let
[TABLE]
The variance effective selection mass for the positive and negative weights is computed as
[TABLE]
The learning rates and may be computed depending on the above quantities. The default values for the learning rates are discussed in SectionĀ 5. The recombination weights are set as follows
[TABLE]
The positive recombination weights are unchanged from the default settings without active CMA. They are approximately proportional to the positive half of the optimal recombination weights. However, this is not the case for the negative recombination weights. The optimal recombination weights are symmetric about zero, i.e.Ā , whereas our negative weights tend to level out for the following reasons. The above mentioned quality gain results consider only the mean update. The obtained optimal values are not necessarily optimal for the covariance matrix adaptation. Since we use only positive recombination weights for the mean vector update, the negative weights do not need to correspond to these optimal recombination weights. Furthermore, the optimal negative weightsāgreater absolute values for worse stepsācounteract our motivation for rescaling of unpromising steps discussed in SectionĀ 3.1. Our choice of the negative recombination weights is a natural extension of the default positive recombination weights. The shape of our recombination weights is somewhat similar to the one in xCMA-ES (Krause and Glasmachers,, 2015), where the first half is and the last half is . A minor difference is that xCMA-ES assigns negative values even for some of the better half of the candidate solutions, whereas our setting assigns positive values only for the better half and negative values only for the worse half.
Positive and negative recombination weights are scaled differently. Positive weights are scaled to sum to one, whereas negative weights are scaled so that the sum of their absolute values is the minimum of and . The latter corrects for unbalanced positive and negative weights, but is usually greater than the former with our default values. The former makes the decay factor of the covariance matrix update (see (14)) to . That is, the covariance matrix does not passively shrink, and only the negative update can shrink the covariance matrix. Krause and Glasmachers, (2015) mention to have no passive shrinkage by setting the sum of the weights to zero, but the effect of the rank-one update was not taken into account and thus in xCMA the passive decay factor of remains.
4 Adaptive Diagonal Decoding (dd-CMA)
The separable CMA-ES (Ros and Hansen,, 2008) enables to adapt a diagonal covariance matrix faster than the standard CMA-ES because the learning rates are times greater than in standard CMA-ES. This works well on separable objective functions, where separable CMA-ES adapts the (co)variance matrix by a factor of faster than CMA-ES. To accelerate standard CMA, we combine separable CMA and standard CMA. We adapt both and at the same time999Considering the eigen decomposition of the covariance matrix, , instead of the decomposition , our attempts to additionally adapt Ā were unsuccessful. We attribute the failure to the strong interdependency between Ā and . Compared to the eigen decomposition, the diagonal decoding model also has the advantage to be interpretable as a rescaling of variables. We never considered the decomposition as proposed by one of the reviewers of this paper, however, at first sight, we do not see any particular advantage in favor of this decomposition. , where is updated with adaptive learning rates which can be much greater than those used to update .
4.1 Algorithm
The update of is similar to separable CMA, but is done in local exponential coordinates. The update of is as follows
[TABLE]
where is the -th diagonal element of the diagonal matrix , the evolution path \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c,D} feeds the rank-one update, and are the learning rates for the rank-one and rank- updates, respectively, the recombination weights are computed by (27) using instead of , \tilde{}\mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}_{i:\lambda} is defined in (16), and is the damping factor for the diagonal matrix update given in (31) below. The evolution path \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c,D} and its normalization factor are updated in the same way as (9) and (10) with the cumulation factor rather than . The matrix exponential of a diagonal matrix is .
Using the correlation matrix of the covariance matrix of the last time the matrix was decomposed,
[TABLE]
where is the diagonal matrix whose diagonal elements are the diagonal elements of , the damping factor is based on the square root of the condition number of this correlation matrix as
[TABLE]
where is the threshold parameter at which becomes larger than one. We remark that changes only every iterations.
The -update is multiplicative as in Ostermeier etĀ al., (1994) or Krause and Glasmachers, (2015) to be unbiased on the log-scale and to simply guarantee the positive definiteness of . Note that the first order approximation of the update formula gives
[TABLE]
which is the update in separable CMA.
Importantly, we set the learning rates for the update, and , to be about times greater than the learning rates for the update, and . Moreover, we maintain an additional evolution path, \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c,D}, for the rank-one update of since an adequate cumulation factor, , may be different from the one for \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c}.
4.2 Damping factor
The dynamic damping factor is the crucial novelty that prevents diagonal decoding to disrupt the adaptation of the covariance matrix in CMA-ES. The damping factor is introduced to prevent the sampling distributions from drastically changing due to the diagonal matrix update. FigureĀ 3 visualizes three example cases with different . When the diagonal decoding matrix changes from the identity matrix to , the change of the distribution from to is minimal when is diagonal, and is comparatively large if is non-diagonal. It will be greater as the condition number of the correlation matrix increases. In the worst case in FigureĀ 3, we see that two distributions are overlapping only at the center of distribution.
The damping factor computed in (31) is motivated from the KL divergence between the sampling distributions before and after the update. The KL divergence between two multivariate normal distributions that have the same mean vectors is equivalent to the half of the Log-Determinant divergence between the covariance matrices, defined as
[TABLE]
where and are two positive definite symmetric matrices. Let and , i.e., and are the covariance matrices after and before the update, respectively. The divergence is asymmetric, and the value changes if we exchange and , but we will obtain a symmetric approximation in the end. To analyze the divergence between and due to the difference in shape and put aside the effect of a global scaling, we scale both to be of determinant one. We denote the resulting divergence by . Then, we obtain
[TABLE]
where . With the approximation and neglecting and higher terms, we have101010Let and be an arbitrary positive definite symmetric matrix and an arbitrary symmetric matrix, respectively. Let and denote the Kronecker product of two matrices and the matrix-vector rearrangement operator that successively stacks the columns of a matrix. From TheoremĀ 16.2.2Ā of Harville, (2008), we have . The eigenvalues of the Kronecker product are the products of the eigenvalues of and (TheoremĀ 21.11.1Ā of Harville, (2008)). They are all positive and upper bounded by the condition number of . Therefore, we have . Letting and , we have .
[TABLE]
Note that if is diagonal and . Therefore, with computed in (31), we bound the realized divergence approximately by the divergence of Ā like
[TABLE]
for large .
In a nutshell, we have quantified the distribution change from changing Ā by measuring its KL-divergence. We derive that in (31) upper bounds the KL-divergence due to changes of Ā approximately by the KL-divergence from the same change of Ā when . The latter is directly determined by the learning rates Ā and .
4.3 Implementation Remark
To avoid unnecessary numerical errors and additional computational effort for eigen decompositions in the adaptive diagonal decoding mechanism (28), (29), (31), we force the diagonal elements of to be all one by assigning
[TABLE]
These lines are performed just before the eigen decomposition. It means that and are the variance matrix and the correlation matrix of the sampling distribution, respectively. This reparametrization does not change the algorithm itself, it only improves the numerical stability of the implementation. Note that if is constrained to be a correlation matrix, then is a unique parametrization for the covariance matrix.
The eigen decomposition of is performed every iterations. We keep and until the next matrix decomposition is performed. Suppose that the eigen decomposition is performed at iteration . Then, the above matrices are and . Then, we can compute in (31) as . This is kept until the next eigen decomposition is performed. Then, the additional computational cost for adaptive diagonal decoding is , which is smaller than the computational cost for the other parts of the CMA-ES, per iteration.
One might think that maintaining is not necessary and could be updated by pre- and post-multiplying by a diagonal matrix absorbing the effect of the update. However, because diagonal decoding may lead to a fast change of the distribution shape, the decomposition of then needs to be done every iteration. The chosen parametrization circumvents frequent matrix decompositions despite changing the sampling distribution rapidly.
5 Algorithm Summary and Strategy Parameters
The dd-CMA-ES combines weighted recombination, active covariance matrix update with positive definiteness guarantee (described in SectionĀ 3.1), adaptive diagonal decoding (described in SectionĀ 4) and CSA, and is summarized in AlgorithmĀ 1.111111Its python code is available at
https://gist.github.com/youheiakimoto/1180b67b5a0b1265c204cba991fa8518
Providing good default values for the strategy parameters (aka hyper parameters) is, needless to say, essential for the success of a novel algorithm. Especially in a black-box optimization scenario, parameter tuning for an application relies on trial-and-error and is usually prohibitively time consuming. The computation of the default parameter values is summarized in AlgorithmĀ 2. Since we have as long as (see below), the recombination weights for the update of and of are the same, .
The learning rates for and are modified to improve the adaptation speed especially in high dimension. Let be the degrees of freedom of the matrix to be adapted, i.e., for the -update and for the -update. The learning rate parameters and the cumulation factors are set to the following values
[TABLE]
The first important change is the scaling of the learning rate with the dimensionĀ . In previous works (Ros and Hansen,, 2008; Hansen and Auger,, 2014; Akimoto and Hansen,, 2016), the default learning rates were set inversely proportional to , i.e., for the original CMA and for the separable CMA. Our choice is based on empirical observations that is exceedingly conservative for higher dimensional problems, say : in experiments to identify for the original CMA-ES, we vary and investigate the number of evaluations to reach a given target. We find that the location of the minimum and the shape of the dependency remains for a wide range of different dimensions roughly the same when the evaluations are taken against (see also FigureĀ 5 below). The observation suggests in particular that is likely to fail with increasing dimension, whereas is a stable setting such that the new setting of (replacing ) is still sufficiently conservative. FigureĀ 4 depicts the difference between the default learning rate values in Hansen and Auger, (2014) and the above formulas.121212The learning rate for the rank- update is usually times greater than the learning rate for the rank-one update, where is monotonous in Ā and approaches for . Using instead of is a correction for small . When , we have (the first three terms cancel out). In this case, because , the rank-one update contains more information than the rank- update as the evolution path accumulates information over time. Using instead of would result in the same learning rates for both updates, whereas the learning rate for the rank- update should be smaller in this case.
Another important change in the default parameter values from previous studiesĀ (Ros and Hansen,, 2008; Hansen and Auger,, 2014) is in the cumulation factor . Previously, the typical choices were or .
The new and simpler default value for the cumulation factor131313The appearance of in the cumulation factor is motivated as follows. Hansen and Ostermeier, (2001) analyzed the situation where the selection vector \sqrt{\mu_{w}}\sum_{i=1}^{\mu}w_{i}\mathchoice{\mbox{\boldmath\displaystyle z}}{\mbox{\boldmath\textstyle z}}{\mbox{\boldmath\scriptstyle z}}{\mbox{\boldmath\scriptscriptstyle z}}_{i:\lambda} alternates between two vectors and over iterations and showed the squared Euclidean norm of the evolution path, i.e.Ā the eigenvalue of \mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c}\mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c}^{\mathsf{T}}, remains in if . On the other hand, Akimoto etĀ al., (2008) showed that the above selection vector can potentially be proportional to when is (too) small. Altogether, we have c_{1}\lVert\mathchoice{\mbox{\boldmath\displaystyle p}}{\mbox{\boldmath\textstyle p}}{\mbox{\boldmath\scriptstyle p}}{\mbox{\boldmath\scriptscriptstyle p}}_{c}\rVert^{2}\in O(c_{1}\mu_{w}/c_{c}), where and as . Therefore, the eigenvalue added by the rank-one update is bounded by a constant with overwhelming probability. is motivated by an experiment on the Cigar function, investigating the dependency of on . The effect of the rank-one update of the covariance matrix is most pronounced when the covariance matrix needs to increase a single eigenvalue. To skip over the usual initial phase where sigma decreases without changing the covariance matrix and emphasize on the effect of , the mean vector is initialized at \mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}}^{(0)}=(1,0,\dots,0) and . The rank- update is off, i.e., . We compared the number of -calls until the target function value of is reached. Note that is chosen to avoid adaptation at the beginning and the target value is chosen so that we stop the optimization once the covariance matrix learned the right scaling. FigureĀ 5 compares the number of -calls spent by the original CMA and the separable CMA on the dimensional Cigar function. Because the -axis is normalized with the new default value for , the shape of the graph and the location of the minimum remains roughly the same for different values of and . The adaptation speed of the covariance matrix tends to be the best around or . The minimum is more pronounced for smaller (or ). Nevertheless, we observe little improvement with the new setting in practice since more -calls are usually spent to adapt the overall covariance matrix and even to adapt the step-size to converge. We have done the same experiments in dimension and observe the same trend.
None of the strategy parameters are meant to be tuned by users, except the population size . A larger population size typically results in finding a better local minimum on rugged functions such as multimodal or noisy functions (Hansen and Kern,, 2004). It is also advantageous when the objective function values are evaluated in parallel (Hansen etĀ al.,, 2003). Restart strategies that perform independent restarts with increasing population size such as IPOP strategy (Harik and Lobo,, 1999; Auger and Hansen, 2005b, ) or BIPOP strategy (Hansen,, 2009) or NIPOP strategy (Loshchilov etĀ al.,, 2012) are handy policies that automate the parameter tuning. They can be applied to the CMA-ES with diagonal acceleration in a straight-forward way. We leave research in this line as future work.
6 Experiments
We conduct numerical simulations to see the effect of the algorithmic components of CMA-ES with diagonal acceleration, dd-CMA-ESānamely the active update with positive definiteness guarantee and the adaptive diagonal decoding. Since the effect of the active covariance matrix update with default population size has been investigated by Jastrebski and Arnold, (2006) and our contribution is a positive definiteness guarantee for the covariance matrix especially for large population size, we investigate how the effect of the active update scales as the population size increases. Our main focus is however on the effect of adaptive diagonal decoding. Particularly, we investigate (i) how much better dd-CMA scales on separable functions than the CMA without diagonal decoding (plain CMA), (ii) how closely dd-CMAĀ matches the performance of separable CMA on separable functions, and (iii) how the scaling of dd-CMA compares to the plain CMA on various non-separable functions. Moreover, we investigated how effective dd-CMA is when the population size is increased.
6.1 Common Setting
TableĀ 1 summarizes the test functions used in the following experiments together with the initial \mathchoice{\mbox{\boldmath\displaystyle m}}{\mbox{\boldmath\textstyle m}}{\mbox{\boldmath\scriptstyle m}}{\mbox{\boldmath\scriptscriptstyle m}}^{(0)} and . The initial covariance matrix and the diagonal decoding matrix are always set to the identity matrix. The random vectors and the random matrix appearing in TableĀ 1 are initialized randomly for each problem instance, but the same values are used for different algorithms for fair comparison. A trial is considered as success if the target function value of is reached before the algorithm spends function evaluations, otherwise regarded as failure. For we conducted independent trials for each setting, for , and for . When the computational effort of evaluating the test function scales worse than linear with the dimension (i.e., on rotated functions) we may omit dimensions larger than .
We compare the following CMA variants:
Plain CMA
updates while is kept the identity matrix, with or without active update described in SectionĀ 3 (AlgorithmĀ 1 without -update);
Separable CMA
updates as described in SectionĀ 4 while is kept the identity matrix, with active update or without active update, where negative recombination weights are set to zero (AlgorithmĀ 1 without -update);
dd-CMA
as summarized in AlgorithmĀ 1, updates as described in SectionĀ 2.1 with active update described in SectionĀ 4, and updates as described in SectionĀ 4.
All strategy parameters such as the learning rates are set to their default value presented in SectionĀ 5. Note that the separable CMA is different from the original publication (Ros and Hansen,, 2008) in that the matrix is updated in multiplicative form which however barely affects the performance.
6.2 Active CMA
First, the effect of the active update is investigated. The plain and separable CMA with and without active update are compared.
FigureĀ 6 shows the number of iterations spent by each algorithm plotted against population size . The number of iterations to reach the target function value decreases as increases and tends to level out. The plain CMA is consistently faster with active update than without. As expected from the results of Jastrebski and Arnold, (2006), the effect of active covariance matrix update is most pronounced on functions with a small number of sensitive variables such as the Discus function. The advantage of the active update diminishes as increases in plain CMA, whereas the speed-up in separable CMA becomes even slightly more pronounced.
6.3 Adaptive Diagonal Decoding
Next, we compare dd-CMA with the plain and the separable CMA with active update.
FigureĀ 7 compares dd-CMA with the plain CMA on the rotated Cigar, Ellipsoid, and Discus functions, and with the separable CMA on the separable Cigar, Ellipsoid, and Discus functions. Note that the plain CMA scales equally well on separable functions and rotated functions, while the separable CMA will not find the target function value on rotated functions before the maximum budget is exhausted (Ros and Hansen,, 2008). Displayed are the number of function evaluations to reach the target function value on each problem instance. The results of dd-, plain and separable CMA on the Sphere function are displayed for reference.
On the Sphere function, no covariance matrix adaptation is required, and all algorithms perform quite similarly. On the Cigar function, the rank-one covariance matrix update is known to be rather effective, and even the plain CMA scales linearly on the rotated Cigar function. On both, separable and rotated Cigar functions, dd-CMA is competitive with separable and plain CMA thereby combining the better performance of the two.
The discrepancy between plain and separable CMA is much more pronounced on Ellipsoid and Discus functions, where the plain CMA scales super-linearly, whereas the separable CMA exhibits linear or slightly worse than linear scaling on separable instances and fails to find the optimum within the given budget on rotated instances. On the rotated functions, dd-CMA is competitive with the plain CMA, whereas it significantly reduces the number of function evaluations on the separable functions compared to plain CMA, e.g., ten times less -calls are required on the dimensional separable Ellipsoid function.
The dd-CMA is competitive with separable CMA on the separable Discus function up to dimension, which is the ideal situation since we can not expect faster adaptation of the distribution shape on separable functions. On the separable Ellipsoid function, the performance curve of dd-CMA starts deviating from that of the separable CMA around , and scales more or less the same as the plain CMA afterwards, yet it requires ten times less -calls. To improve the scaling of dd-CMA on the separable Ellipsoid function, we might need to set depending on , or use another monotone transformation of the condition number of the correlation matrix in (31). Performing the eigen decomposition of less frequently (i.e., setting smaller) is another possible way to improve the performance of dd-CMA on the separable Ellipsoid function, while compromising the performance on the rotated Ellipsoid function141414If we set instead of , dd-CMA spends about times more FEs on the -dimensional separable Ellipsoid function than separable CMA as displayed in FigureĀ 7(d), whereas it spends about more FEs on the -dimensional rotated Ellipsoid function than plain CMA as displayed in FigureĀ 7(d). This might be a more practical choice, but further investigation is required..
FiguresĀ 8(a) and 8(b) show the results on Ell-Cig and Ell-Dis functions, which are non-separable ill-conditioned functions with additional coordinate-wise scaling, FigureĀ 8(c) shows the results on non-rotated and rotated Rosenbrock functions. The separable CMA locates the target -value within the given budget only on the non-rotated Rosenbrock function in smaller dimension. The experiment on the Ell-Cig function reveals that diagonal decoding can improve the scaling of the plain CMA even on non-separable functions. The improvement on the Ell-Dis function is much less pronounced. The reason why, compared to the plain CMA, dd-CMA is more suitable for Ell-Cig than for Ell-Dis is discussed in relation with FigureĀ 9 below. On the non-rotated Rosenbrock function, dd-CMA becomes advantageous as the dimension increases, just as the separable CMA is faster than the plain CMA when .
FigureĀ 9 visualizes insights in the typical behaviour of dd-CMA on different functions. On the separable Ellipsoid, the correlation matrix deviates from the identity matrix due to stochastics and the initial parameter setting, and the inverse damping decreases marginally. This effect is more pronounced for and is the reason for the impaired scaling of dd-CMA on the separable Ellipsoid in FigureĀ 7(d). The diagonal decoding matrix, i.e., variance matrix, adapts the coordinate-wise scaling efficiently as long as the condition number of the correlation matrix is not too high. On the rotated Ellipsoid function, where the diagonal elements of the inverse Hessian of the objective function, , likely have similar values, the diagonal decoding matrix remains close to the identity and the correlation matrix learns the ill-conditioning. The inverse damping parameter decreases so that the adaptation of the diagonal decoding matrix does not disturb the adaptation of the distribution shape. FigureĀ 9(b), below, shows that adaptive diagonal decoding without the dynamic damping factor (31) severely disturbs the adaptation of on the rotated Ellipsoid.
Ideally, the diagonal decoding matrix adapts the coordinate-wise standard deviation and adapts the correlation matrix of the sampling distribution. If the correlation matrix is easier to adapt than the full covariance matrix , we expect a speed-up of the adaptation of the distribution shape. The functions Ell-Cig and Ell-Dis in FiguresĀ 9(c) and 9(d) are such examples. Once becomes inversely proportional to , the problems become rotated Cigar and rotated Discus functions, respectively. The run on the Ell-Cig function in 9(c) depicts the ideal case. The coordinate-wise scaling is first adapted by and then the correlation matrix learns a long Cigar axis. The inverse damping factor is kept at a high value similar to the one observed on the separable Ellipsoid and it starts decreasing only after is adapted. On the Ell-Dis function (9(d)) however, the short non-coordinate axis is learned (too) quickly by the correlation matrix. Therefore, the inverse damping factor decreases before has adapted the full coordinate-wise scaling. Since the function value is more sensitive in the subspace corresponding to great eigenvalues of the Hessian of the objective and the ranking of candidate solutions is mostly determined by this subspace, the CMA-ES first shrinks the distribution in this subspace. This is the main reason why dd-CMA is more efficient on Ell-Cig than on Ell-Dis.
FigureĀ 10 shows, similar to FigureĀ 6, the number of iterations versus the population size on three functions in dimension 40. For all larger populations sizes up to , dd-CMA performs on par with the better of plain and separable CMA, as ideally to be expected.
We remark that dd-CMA with default has a relatively large variance in performance on the separable TwoAxes function. Increasing from its default value reduces this variance and even reduces the number of -calls to reach the target. This defect of dd-CMA is even more pronounced for higher dimensions. It may suggest to increase the default for dd-CMA. We leave this to be addressed in future work.
FigureĀ 11 shows the number of iterations in successful runs, the success rate and the average number of evaluations in successful runs divided by the success rate (Price,, 1997; Auger and Hansen, 2005a, ) on two multimodal -dimensional functions. The maximum numbers of -calls are on Bohachevsky and on Rastrigin. Comparing separable to dd-CMA on separable functions and plain to dd-CMA on rotated functions, dd-CMA tends to spend less iterations but to require greater to reach the same success rate. The latter is attributed to doubly shrinking the overall variance by updating and . For , we have and and the effect of shrinking the overall variance due to and updates is not negligible. Then, the overall variance is smaller than in plain and separable CMA, which also leads to faster convergence. This effect disappears if we force the update to keep its determinant unchanged, which can be realized by replacing in (29) with .
7 Summary and Conclusion
In this paper we have put forward several techniques to improve multi-recombinative non-elitistic covariance matrix adaptation evolution strategies without changing their internal computation effort.
The first component is concerned with the active covariance matrix update, which utilizes unsuccessful candidate solutions to actively shrink the sampling distribution in unpromising directions. We propose two ways to guarantee the positive definiteness of the covariance matrix by rescaling unsuccessful candidate solutions.
The second and main component is a diagonal acceleration of CMA by adaptive diagonal decoding, dd-CMA. The covariance matrix adaptation is accelerated by adapting a coordinate-wise scaling separately from the positive definite symmetric covariance matrix. This drastically accelerates the adaptation speed on high-dimensional functions with coordinate-wise ill-conditioning, whereas it does not significantly influence the adaptation of the covariance matrix on highly ill-conditioned non-separable functions without coordinate-wise scaling component.
The last component is a set of improved default parameters for CMA. The scaling of the learning rates are relaxed from to for default CMA and from to for separable CMA. This contributes to accelerate the learning in all investigated CMA variants on high dimensional problems, say .
Algorithm selection as well as hyper-parameter tuning of an algorithm is a troublesome issue in black-box optimization. For CMA-ES, we needed to make a decision whether we use the plain CMA or the separable CMA, based on the limited knowledge on a problem of interest. If we select the separable CMA but the objective function is non-separable and highly ill-conditioned, we will not obtain a reasonable solution within most given function evaluation budgets. Therefore, plain CMA is a safer choice, though it may be slower on nearly separable functions. The dd-CMA automizes this decision and achieves faster adaptation speed on functions with ill-conditioned variables usually without compromising the performance on non-separable and highly ill-conditioned functions. Moreover, the advantage of dd-CMA is not limited to separable problems. We found a class of non-separable problems with variables of different sensitivity on which dd-CMA-ES decidedly outperforms CMA-ES and separable CMA-ES (see FigureĀ 8(a)). As dd-CMA-ES improves the performance on a relatively wide range of problems with mis-scaling of variables, it is a prime candidate to become the default CMA-ES variant.
8 Acknowledgements
The authors thank the associate editor and the anonymous reviewers for their valuable comments and suggestions. This work is partly supported by JSPS KAKENHI Grant Number 19H04179.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Akimoto et al., (2014) Akimoto, Y., Auger, A., and Hansen, N. (2014). Comparison-based natural gradient optimization in high dimension. In Proceedings of Genetic and Evolutionary Computation Conference , pages 373ā380.
- 2Akimoto et al., (2018) Akimoto, Y., Auger, A., and Hansen, N. (2018). Quality gain analysis of the weighted recombination evolution strategy on general convex quadratic functions. Theoretical Computer Science .
- 3Akimoto and Hansen, (2016) Akimoto, Y. and Hansen, N. (2016). Projection-based restricted covariance matrix adaptation for high dimension. In Genetic and Evolutionary Computation Conference, GECCO 2016, Denver, Colorado, USA, July 20-24, 2016 , pages 197ā204. ACM.
- 4Akimoto et al., (2010) Akimoto, Y., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Bidirectional relation between CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving from Nature - PPSN XI , number 6238 in LNCS, pages 154ā163. Springer-Verlag.
- 5Akimoto et al., (2008) Akimoto, Y., Sakuma, J., Ono, I., and Kobayashi, S. (2008). Functionally specialized CMA-ES: a modification of CMA-ES based on the specialization of the functions of covariance matrix adaptation and step size adaptation. In GECCO ā08: Proceedings of the 10th annual conference on Genetic and evolutionary computation . ACM.
- 6Arnold, (2006) Arnold, D. V. (2006). Weighted multirecombination evolution strategies. Theoretical Computer Science , 361:18ā37.
- 7Arnold and Hansen, (2010) Arnold, D. V. and Hansen, N. (2010). Active covariance matrix adaptation for the (1+1)-CMA-ES. In Proceedings of Genetic and Evolutionary Computation Conference , pages 385ā392, Portland, Oregon. ACM.
- 8(8) Auger, A. and Hansen, N. (2005 a). Performance evaluation of an advanced local search evolutionary algorithm. In 2005 IEEE Congress on Evolutionary Computation , pages 1777ā1784. Ieee.
