Deep Reinforcement Learning Based Parameter Control in Differential Evolution

Mudita Sharma; Alexandros Komninos; Manuel López-Ibáñez; Dimitar Kazakov

arXiv:1905.08006·cs.NE·May 21, 2019


TL;DR

This paper introduces a Deep Reinforcement Learning approach using Double Deep Q-Learning to adaptively control mutation strategies in Differential Evolution, improving performance on benchmark functions.

Contribution

It presents a novel AOS method based on DDQN for DE, trained offline to predict optimal mutation strategies during optimization.

Findings

01 DE-DDQN outperforms non-adaptive DE algorithms on benchmark functions.

02 The method achieves results comparable to top CEC2005 competition winners.

03 Training on diverse features enables effective strategy selection across different problems.


Tables

Table 1. State features

Index	Feature	Notes
1	$\frac{f ({\vec{x}}_{i}) - f_{bsf}}{f_{wsf} - f_{bsf}}$	${\vec{x}}_{i}$ denotes the $i$ -th solution of the population and $f ({\vec{x}}_{i})$ denotes its fitness; $f_{bsf}$ and $f_{wsf}$ denote the best-so-far and worst-so-far fitness values found up to this step within a single run
2	$\frac{\sum_{j = 1}^{N P} \frac{f ({\vec{x}}_{j})}{N P} - f_{bsf}}{f_{wsf} - f_{bsf}}$	$N P$ is the population size
3	$\frac{{std}_{j = 1, \dots, N P} (f ({\vec{x}}_{j}))}{{std}^{max}}$	$std (\cdot)$ calculates the standard deviation and ${std}^{max}$ is the value when $N P / 2$ solutions have fitness $f_{wsf}$ and the other half have fitness $f_{bsf}$
4	$\frac{F E^{max} - t}{F E^{max}}$	$F E^{max}$ is the maximum number of function evaluations per run, and $F E^{max} - t$ gives the remaining number of evaluations at step $t$
5	$\frac{{dim}_{f}}{{dim}^{max}}$	${dim}_{f}$ is the dimension of the benchmark function $f$ being optimised, and ${dim}^{max}$ is the maximum dimension among all training functions
6	$\frac{stagcount}{F E^{max}}$	stagcount is the stagnation counter, i.e., the number of function evaluations (steps) without improving $f_{bsf}$
7-11	$\frac{dist ({\vec{x}}_{i} - {\vec{x}}_{j})}{{dist}^{max}}$, $\forall j \in \{r_{1}, r_{2}, r_{3}, r_{4}, r_{5}\}$	$dist (\cdot)$ is the Euclidean distance between two solutions; ${dist}^{max}$ is the maximum distance possible, calculated between the lower and upper bounds of the decision space; $\{r_{1}, r_{2}, r_{3}, r_{4}, r_{5}\}$ are random indexes
12	$\frac{dist ({\vec{x}}_{i} - {\vec{x}}_{best})}{{dist}^{max}}$	${\vec{x}}_{best}$ is the best parent in the current population
13-17	$\frac{f ({\vec{x}}_{i}) - f ({\vec{x}}_{j})}{f_{wsf} - f_{bsf}}$, $\forall j \in \{r_{1}, r_{2}, r_{3}, r_{4}, r_{5}\}$
18	$\frac{f ({\vec{x}}_{i}) - f ({\vec{x}}_{best})}{f_{wsf} - f_{bsf}}$
19	$\frac{dist ({\vec{x}}_{i} - {\vec{x}}_{bsf})}{{dist}^{max}}$	${\vec{x}}_{bsf}$ denotes the solution with fitness $f_{bsf}$
20-35	$\sum_{g = 1}^{gen} \frac{N_{m}^{succ} (g, op)}{N^{tot} (g, op)}$	For each op and $m \in \{1, 2, 3, 4\}$ and normalised over all operators; gen is the number of recent generations recorded; $N_{m}^{succ} (g, op)$ and $N^{tot} (g, op)$ are successful and total applications of op according to ${OM}_{m}$ at generation $g$
36-51	$\frac{\sum_{g = 1}^{gen} \sum_{k = 1}^{N_{m}^{succ} (g, op)} {OM}_{m} (g, k, op)}{\sum_{g = 1}^{gen} N^{tot} (g, op)}$
52-67	$\frac{{OM}_{m}^{best} (gen, op) - {OM}_{m}^{best} (gen - 1, op)}{{OM}_{m}^{best} (gen - 1, op) \cdot \| N^{tot} (gen, op) - N^{tot} (gen - 1, op) \|}$	For each op and $m \in \{1, 2, 3, 4\}$ and normalised over all operators; ${OM}_{m}^{best} (g, op)$ is the maximum value of ${OM}_{m} (g, k, op)$
68-83	$\sum_{g = 1}^{gen} {OM}_{m}^{best} (g, op)$	For each op and $m \in \{1, 2, 3, 4\}$ and normalised over all operators
84-99	$\sum_{w = 1}^{W} {OM}_{m} (w, op)$	For each op and $m \in \{1, 2, 3, 4\}$ and normalised over all operators; ${OM}_{m} (w, op)$ is the $w$-th value in the window generated by op

Table 2. Hyperparameter values of DE-DDQN

Training and online parameters	Parameter value
Scaling factor ( $F$ )	$0.5$
Crossover rate ( $C R$ )	$1.0$
Population size ( $N P$ )	$100$
$F E^{max}$ per function	$10^{4}$ function evaluations
Max. generations (gen)	$10$
Window size ( $W$ )	$50$
Type of neural network	Multi layer perceptron
Hidden layers	$4$
Hidden nodes	$100$ per hidden layer
Activation function	Rectified linear (Relu) (Nair and Hinton, 2010)
Batch size	$64$
Training only parameters	Parameter value
Training policy	$ϵ$ -greedy ( $ϵ = 0.1$ )
Discount factor ( $γ$ )	$0.99$
Target network synchronised ( $C$ )	every $1 e 3$ steps
Observation memory capacity	$10^{5}$
Warm-up size	$10^{4}$
NN training algorithm	Adam (learning rate: $10^{- 4}$ )
Online phase parameters	Parameter value
Online policy	Greedy

Table 3. Mean (and standard deviation in parentheses) of function error values obtained by 25 runs for each function on the test set. The first five rows are dimension 10 and the last five are dimension 30. We refer to DE-DDQN as DDQN. Bold entry is the minimum mean error found by any method for each function.

Function	Random	DE1	DE2	DE3	DE4	AdapSS	FAUC	RecPM	LR	IPOP	DDQN1	DDQN2	DDQN3
$F_{3}$-10	2.34e+8 (1.06e+8)	2.78e+8 (1.30e+8)	2.26e+8 (1.10e+8)	2.38e+8 (1.23e+8)	2.63e+8 (1.42e+8)	3.37e+4 (3.62e+5)	3.53e+5 (1.65e+4)	3.08e+4 (2.64e+4)	4.94e-9 (1.45e-9)	5.60e-9 (1.93e-9)	3.98e+3 (1.91e+3)	7.38e+0 (3.59e+0)	2.12e+1 (1.14e+1)
$F_{9}$-10	1.20e+2 (1.32e+1)	1.18e+2 (1.20e+1)	1.22e+2 (1.88e+1)	1.16e+2 (1.44e+1)	1.22e+2 (1.71e+1)	4.10e+1 (6.36e+0)	4.36e+1 (5.99e+0)	3.79e+1 (6.33e+0)	8.60e+1 (3.84e+1)	6.21e+0 (2.10e+0)	4.19e+1 (6.21e+0)	3.68e+1 (4.64e+0)	3.86e+1 (7.66e+0)
$F_{16}$-10	6.46e+2 (1.02e+2)	6.50e+2 (9.65e+1)	6.31e+2 (1.15e+2)	5.91e+2 (1.07e+2)	6.33e+2 (9.97e+1)	1.90e+2 (2.21e+1)	2.05e+2 (1.41e+1)	1.89e+2 (1.25e+1)	1.49e+2 (8.01e+1)	1.11e+2 (1.66e+1)	1.93e+2 (1.24e+1)	1.79e+2 (2.05e+1)	1.88e+2 (1.41e+1)
$F_{18}$-10	1.33e+3 (1.16e+2)	1.36e+3 (8.81e+1)	1.39e+3 (1.11e+2)	1.36e+3 (1.09e+2)	1.36e+3 (9.67e+1)	6.13e+2 (1.67e+2)	6.94e+2 (1.93e+2)	6.48e+2 (1.82e+2)	8.40e+2 (2.17e+2)	6.02e+2 (2.76e+2)	5.20e+2 (1.93e+2)	5.81e+2 (2.47e+2)	5.98e+2 (2.61e+2)
$F_{23}$-10	1.49e+3 (5.16e+1)	1.51e+3 (6.71e+1)	1.51e+3 (6.03e+1)	1.51e+3 (5.58e+1)	1.49e+3 (4.97e+1)	6.66e+2 (1.99e+2)	7.73e+2 (2.05e+2)	6.37e+2 (1.23e+2)	1.22e+3 (5.16e+2)	9.49e+2 (3.52e+2)	6.18e+2 (1.40e+2)	6.56e+2 (1.57e+2)	6.90e+2 (1.35e+2)
$F_{3}$-30	2.48e+9 (6.60e+8)	2.68e+9 (7.84e+8)	2.50e+9 (9.04e+8)	2.65e+9 (6.69e+8)	2.51e+9 (8.22e+8)	1.52e+7 (5.50e+7)	6.44e+7 (5.88e+6)	1.31e+7 (6.84e+6)	1.28e+6 (7.13e+5)	6.11e+6 (3.79e+6)	1.52e+7 (9.07e+6)	3.06e+6 (2.54e+6)	5.72e+6 (1.30e+7)
$F_{9}$-30	5.33e+2 (3.09e+1)	5.27e+2 (3.40e+1)	5.42e+2 (3.73e+1)	5.19e+2 (4.53e+1)	5.41e+2 (3.43e+1)	2.54e+2 (2.69e+1)	2.88e+2 (1.72e+1)	2.53e+2 (1.26e+1)	4.19e+2 (1.02e+2)	4.78e+1 (1.15e+1)	2.73e+2 (1.97e+1)	2.39e+2 (1.52e+1)	2.73e+2 (2.24e+1)
$F_{16}$-30	1.19e+3 (1.36e+2)	1.18e+3 (1.72e+2)	1.18e+3 (1.16e+2)	1.21e+3 (1.35e+2)	1.20e+3 (1.63e+2)	3.11e+2 (6.26e+1)	3.48e+2 (5.27e+1)	2.97e+2 (3.00e+1)	2.52e+2 (2.08e+2)	1.96e+2 (1.45e+2)	3.18e+2 (4.22e+1)	3.74e+2 (9.03e+1)	3.39e+2 (8.41e+1)
$F_{18}$-30	1.41e+3 (5.70e+1)	1.43e+3 (4.70e+1)	1.41e+3 (6.47e+1)	1.42e+3 (4.59e+1)	1.42e+3 (5.54e+1)	9.65e+2 (5.59e+1)	1.02e+3 (2.37e+1)	9.71e+2 (2.31e+1)	9.64e+2 (1.46e+2)	9.08e+2 (2.76e+0)	1.04e+3 (2.27e+1)	9.45e+2 (1.42e+1)	9.48e+2 (3.25e+1)
$F_{23}$-30	1.58e+3 (4.64e+1)	1.57e+3 (4.05e+1)	1.55e+3 (4.51e+1)	1.57e+3 (4.14e+1)	1.57e+3 (5.15e+1)	9.43e+2 (1.40e+2)	1.10e+3 (1.01e+2)	9.67e+2 (1.30e+2)	7.51e+2 (3.30e+2)	6.92e+2 (2.38e+2)	1.17e+3 (6.30e+1)	9.74e+2 (1.69e+2)	9.64e+2 (1.70e+2)

Table 4. Average ranking of all methods.

Algo	IPOP	DDQN2	DDQN3	RecPM	LR	AdapSS	DDQN1	FAUC	Random	DE3	DE2	DE4	DE1
Rank	2.3	3.3	4.1	4.4	4.4	4.9	5.4	7.2	10.5	10.8	10.8	11.4	11.5

Table 5. Post-hoc (Li) using DE-DDQN2 as control method.

Comparison	Statistic	Adjusted p-value	Result
DDQN2 vs DE1	4.70819	0.00001	H0 is rejected
DDQN2 vs DE4	4.65077	0.00008	H0 is rejected
DDQN2 vs DE2	4.30627	0.00005	H0 is rejected
DDQN2 vs DE3	4.30627	0.00005	H0 is rejected
DDQN2 vs Random	4.13402	0.00010	H0 is rejected
DDQN2 vs FAUC	2.23926	0.06630	H0 is not rejected
DDQN2 vs DDQN1	1.20576	0.39166	H0 is not rejected
DDQN2 vs AdapSS	0.91867	0.50299	H0 is not rejected
DDQN2 vs Rec-PM	0.63159	0.59848	H0 is not rejected
DDQN2 vs LR	0.63159	0.59848	H0 is not rejected
DDQN2 vs IPOP	0.57417	0.61515	H0 is not rejected
DDQN2 vs DDQN3	0.45934	0.64599	H0 is not rejected

Equations

$\max\{f(\vec{x}_{i}) - f(\vec{u}_{i}),\ 0\}$

$\max\left\{\frac{f(\vec{x}_{i}) - f(\vec{u}_{i})}{f(\vec{u}_{i}) - f_{optimum}},\ 0\right\}$


Code & Models

Repositories

mudita11/DE-DDQN (official)


Full text

Deep Reinforcement Learning Based Parameter Control in Differential Evolution

Mudita Sharma, University of York, York, U.K.

Alexandros Komninos, University of York, York, U.K.

Manuel López-Ibáñez, University of Manchester, Manchester, U.K.

Dimitar Kazakov, University of York, York, U.K.

(2019)

Abstract.

Adaptive Operator Selection (AOS) is an approach that controls discrete parameters of an Evolutionary Algorithm (EA) during the run. In this paper, we propose an AOS method based on Double Deep Q-Learning (DDQN), a Deep Reinforcement Learning method, to control the mutation strategies of Differential Evolution (DE). The application of DDQN to DE requires two phases. First, a neural network is trained offline by collecting data about the DE state and the benefit (reward) of applying each mutation strategy during multiple runs of DE tackling benchmark functions. We define the DE state as the combination of 99 different features and we analyze three alternative reward functions. Second, when DDQN is applied as a parameter controller within DE to a different test set of benchmark functions, DDQN uses the trained neural network to predict which mutation strategy should be applied to each parent at each generation according to the DE state. Benchmark functions for training and testing are taken from the CEC2005 benchmark with dimensions 10 and 30. We compare the results of the proposed DE-DDQN algorithm to several baseline DE algorithms using no online selection, random selection and other AOS methods, and also to the two winners of the CEC2005 competition. The results show that DE-DDQN outperforms the non-adaptive methods for all functions in the test set; while its results are comparable with the last two algorithms.

Parameter Control, Reinforcement Learning, Differential Evolution

Genetic and Evolutionary Computation Conference (GECCO ’19), July 13–17, 2019, Prague, Czech Republic. DOI: 10.1145/3321707.3321813. ISBN: 978-1-4503-6111-8/19/07. CCS concepts: Computing methodologies, Bio-inspired approaches; Computing methodologies, Reinforcement learning.

1. Introduction

Evolutionary algorithms for numerical optimization come in many variants involving different operators, such as mutation strategies and types of crossover. In the case of differential evolution (DE) (Storn and Price, 1997), experimental analysis has shown that different mutation strategies perform better for specific optimization problems (Mezura-Montes et al., 2006) and that choosing the right mutation strategy at specific stages of an optimization process can further improve the performance of DE (Fialho et al., 2010a). As a result, there has been great interest in methods for controlling or selecting the value of discrete parameters while solving a problem, also called adaptive operator selection (AOS).

In the context of DE, there is a finite number of mutation strategies (operators) that can be applied at each generation to produce new solutions from existing (parent) solutions. An AOS method will decide, at each generation, which operator should be applied, measure the effect of this application and adapt future choices according to some reward function. An inherent difficulty is that we do not know which operator is the most useful at each generation to solve a previously unseen problem. Moreover, different operators may be useful at different stages of an algorithm’s run.

There are multiple AOS methods proposed in the literature (Karafotias et al., 2015b; Aleti and Moser, 2016; Gong et al., 2010) and several of them are based on reinforcement learning (RL) techniques such as probability matching (Fialho et al., 2010b; Sharma et al., 2018), multi-armed bandits (Gong et al., 2010), $Q(\lambda)$ learning (Pettinger and Everson, 2002) and SARSA (Chen et al., 2005; Eiben et al., 2006; Sakurai et al., 2010), among others (Karafotias et al., 2014). These RL methods use one or a few features to capture the state of the algorithm at each generation, select an operator to be applied and calculate a reward from this application. Typical state features are fitness standard deviation, fitness improvement from parent to offspring, best fitness, and mean fitness (Eiben et al., 2006; Karafotias et al., 2014). Typical reward functions measure the improvement achieved over the previous generation (Karafotias et al., 2014). Other parameter control methods use an offline training phase to collect more data about the algorithm than what is available within a single run. For example, Kee et al. (2001) use two types of learning: table-based and rule-based. The learning is performed during an offline training phase that is followed by an online execution phase where the learned tables or rules are used for choosing parameter values. More recently, Karafotias et al. (2012) train offline a feed-forward neural network with no hidden layers to control the numerical parameter values of an evolution strategy. To the best of our knowledge, none of the AOS methods that use offline training are based on reinforcement learning.

In this paper, we adapt Double Deep Q-Network (DDQN) (van Hasselt et al., 2016), a deep reinforcement learning technique that uses a deep neural network as a prediction model, as an AOS method for DE. The main differences between DDQN and other RL methods are the possibility of training DDQN offline on large amounts of data and of using a larger number of features to define the current state. When applied as an AOS method within DE, we first run the proposed DE-DDQN algorithm many times on training benchmark problems, collecting data on $99$ features, such as the relative fitness of the current generation, mean and standard deviation of the population fitness, dimension of the problem, number of function evaluations, stagnation, distance among solutions in decision space, etc. After this training phase, the DE-DDQN algorithm can be applied to unseen problems. It will observe the run-time values of these features and predict which mutation strategy should be used at each generation. DE-DDQN also requires the choice of a suitable reward definition to facilitate learning of a prediction model. Some RL-based AOS methods calculate rewards per individual (Pettinger and Everson, 2002; Chen et al., 2005), while others calculate them per generation (Sakurai et al., 2010). Moreover, reward functions can be designed in different ways depending on the problem at hand. For example, Karafotias et al. (2015a) define and compare four per-generation reward definitions for RL-based AOS methods. Here, we also find that the reward definition has a strong effect on the performance of DE-DDQN and, hence, we analyze three alternative reward definitions that assign a reward for each application of a mutation strategy.

As an experimental benchmark, we use functions from the CEC2005 special session on real-parameter optimization (Suganthan et al., 2005). In particular, the proposed DE-DDQN method is first trained on 16 functions for both dimensions 10 and 30, i.e., a total of 32 training functions. Then, we run the trained DE-DDQN on a different set of 5 functions, also for dimensions 10 and 30, i.e., a total of 10 test functions. We also run on these 10 test functions the following algorithms for comparison: four DE variants, each using a single specific mutation strategy, DE with a random selection among mutation strategies at each generation, DE using various AOS methods (PM-AdapSS (Fialho et al., 2010b), F-AUC (Gong et al., 2010), and RecPM-AOS (Sharma et al., 2018)), and the two winners of the CEC2005 competition (Suganthan et al., 2005), which are both variants of CMAES: LR-CMAES (LR) (Auger and Hansen, 2005a) and IPOP-CMAES (IPOP) (Auger and Hansen, 2005b).

Our experimental results show that the DE variants using AOS completely outperform the DE variants using a fixed mutation strategy or a random selection. Although a non-parametric post-hoc test does not find that the differences between the CMAES algorithms and the AOS-enabled DE algorithms (including DE-DDQN) are statistically significant, DE-DDQN is the second best approach, behind IPOP-CMAES, in terms of mean rank.

The paper is structured as follows. First, we give a brief introduction to DE, mutation strategies and deep reinforcement learning. In Sect. 3, we introduce our proposed DE-DDQN algorithm, and explain its training and online (deployment) phases. Section 4 introduces the state features and reward functions used in the experiments, which are described in Sect. 5. We summarise our conclusions in Sect. 6.

2. Background

2.1. Differential Evolution

Differential Evolution (DE) (Price et al., 2005) is a population-based algorithm that uses a mutation strategy to create an offspring solution $\vec{u}$ . A mutation strategy is a linear combination of three or more parent solutions $\vec{x}_{i}$ , where $i$ is the index of a solution in the current population. Some mutation strategies are good at exploration and others at exploitation, and it is well-known that no single strategy performs best for all problems and for all stages of a single run. In this paper, we consider these frequently used mutation strategies:

[TABLE]

where $F$ is a scaling factor, $\vec{u}_{i}$ and $\vec{x}_{i}$ are the $i$ -th offspring and parent solution vectors in the population, respectively, $\vec{x}_{\text{best}}$ is the best parent in the population, and $r_{1}$ , $r_{2}$ , $r_{3}$ , $r_{4}$ , and $r_{5}$ are randomly generated indexes within $[1,NP]$ , where $NP$ is the population size. An additional numerical parameter, the crossover rate ( $CR\in[0,1]$ ), determines whether the mutation strategy is applied to each dimension of $\vec{x}_{i}$ to generate $\vec{u}_{i}$ . At least one dimension of each $\vec{x}_{i}$ vector is mutated.
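As a minimal sketch of how one offspring is produced, the snippet below applies the textbook DE/rand/1 mutation, $\vec{u}_{i} = \vec{x}_{r_{1}} + F(\vec{x}_{r_{2}} - \vec{x}_{r_{3}})$, followed by binomial crossover. DE/rand/1 is used here only as a familiar example; the function name `make_offspring`, the NumPy array layout and the choice of distinct random indexes different from $i$ are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def make_offspring(pop, i, F=0.5, CR=1.0, rng=None):
    """Create offspring u_i from parent x_i (illustrative DE/rand/1 + binomial crossover)."""
    rng = np.random.default_rng() if rng is None else rng
    NP, dim = pop.shape
    # Three distinct random parents, all different from i (common DE convention, assumed here).
    candidates = [j for j in range(NP) if j != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    mutant = pop[r1] + F * (pop[r2] - pop[r3])
    # Binomial crossover: each dimension takes the mutant value with probability CR.
    mask = rng.random(dim) < CR
    # At least one dimension of x_i is always mutated.
    mask[rng.integers(dim)] = True
    return np.where(mask, mutant, pop[i])
```

With $CR = 1.0$, as in Table 2, every dimension comes from the mutant; with smaller values the offspring keeps part of the parent.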

2.2. Deep Reinforcement Learning

In RL (Sutton and Barto, 1998), an agent takes actions in an environment that returns the reward and the next state. The goal is to maximize the cumulative reward over the steps taken. RL estimates the value of an action given a state, called the Q-value, in order to learn a policy that returns an action given a state. A variety of different techniques are used in RL to learn this policy and some of them are applicable only when the set of actions is finite.

When the features that define a state are continuous or the set of states is very large, the policy becomes a function that implicitly maps between state features and actions, as opposed to keeping an explicit map in the form of a lookup table. In deep reinforcement learning, this function is approximated by a deep neural network and the weights of the network are optimized to maximize the cumulative reward.

Deep Q-network (DQN) (Mnih et al., 2015) is a deep RL technique that extends Q-learning to continuous features by approximating a non-linear Q-value function of the state features using a neural network (NN). The classical DQN algorithm sometimes overestimates the Q-values of the actions, which leads to poor policies. Double DQN (DDQN) (van Hasselt et al., 2016) was proposed as a way to overcome this limitation and enhance the stability of the Q-values. DDQN employs two neural networks: a primary network selects an action and a target network generates a target Q-value for that action. The target Q-values are used to compute the loss function for every action during training. The weights of the target network are fixed, and only periodically or slowly updated to the primary Q-network's values.

In this work, we integrate DDQN into DE as an AOS method that selects a mutation strategy at each generation.

3. DE-DDQN

When integrated with DE as an AOS method, DDQN is adapted as follows. The environment of DDQN becomes the DE algorithm performing an optimization run for a maximum of $FE^{\text{max}}$ function evaluations. A state $s_{t}$ is a collection of features that measure static or run time features of the problem being solved or of DE at step $t$ (function evaluation or generation counter). The actions that DDQN may take are the set of mutation strategies available (Sect. 2.1), and $a_{t}$ is the strategy selected and applied at step $t$ . Once a mutation strategy is applied, a reward function returns the estimated benefit (reward) $r_{t}$ of applying action $a_{t}$ , and the DE run reaches a new state, $s_{t+1}$ . We refer to the tuple $\langle s_{t}$ , $a_{t}$ , $r_{t}$ , $s_{t+1}\rangle$ as an observation.

Our proposed DE-DDQN algorithm operates in two phases. In the first training phase, the two deep neural networks of DDQN are trained on observations by running the DE-DDQN algorithm multiple times on several benchmark functions. In a second online (or deployment) phase, the trained DDQN is used to select which mutation strategy should be applied at each generation of DE when tackling unseen (or test) problems not considered during the training phase. We describe these two phases in detail next.

3.1. Training phase

In the training phase, DDQN uses two deep neural networks (NNs), namely primary NN and target NN. The primary NN predicts the Q-values $Q(s_{t},a;\theta)$ that are used to select an action $a$ given state $s_{t}$ at step $t$ , while the target NN estimates the target Q-values $\hat{Q}(s_{t},a;\hat{\theta})$ after the action $a$ has been applied, where $\theta$ and $\hat{\theta}$ are the weights of the primary and target NNs, respectively, $s_{t}$ is the state vector of DE, and $a$ is a mutation strategy.

The goal of the training phase is to train the primary NN of DDQN so that it learns to approximate the target $\hat{Q}$ function. The training data is a memory of observations that is collected by running DE-DDQN several times on training benchmark functions. Training the primary NN involves finding its weights $\theta$ through gradient optimization.

The training process of DE-DDQN is shown in Algorithm 1. Training starts by running DE with random selection of mutation strategy for a fixed number of steps (warm-up size) that generates observations to populate a memory of capacity $N$, which can be different from the warm-up size (line 2). This memory stores a fixed number $N$ of recent observations; old ones are removed as new ones are added. Once the warm-up phase is over, DE is executed $M$ times, and each run is stopped after $FE^{\text{max}}$ function evaluations or when the known optimum of the training problem is reached (line 7). For each solution in the population, the $\epsilon$-greedy policy is used to select a mutation strategy, i.e., with probability $\epsilon$ a random mutation strategy is selected, otherwise the mutation strategy with maximum Q-value is selected. Using the current DE state $s_{t}$, the primary NN is responsible for generating a Q-value per possible mutation strategy (line 12). The use of an $\epsilon$-greedy policy forces the primary NN to explore mutation strategies that are currently predicted to be less than optimal. The selected mutation strategy is applied (line 13) and a new state $s_{t+1}$ is reached (line 14). A reward value $r_{t}$ is computed by measuring the performance progress made at this step.
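The two mechanisms described above, the bounded observation memory and $\epsilon$-greedy selection, can be sketched as follows. This is an illustration under stated assumptions: the helper name `epsilon_greedy`, the use of `collections.deque` and the tiny capacity of 5 are hypothetical choices (Table 2 lists a capacity of $10^{5}$ and $\epsilon = 0.1$).

```python
import random
from collections import deque

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Return a mutation-strategy index: random with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))

# Bounded memory of observations <s_t, a_t, r_t, s_{t+1}>.
# Once full, appending a new observation drops the oldest one.
memory = deque(maxlen=5)  # tiny capacity, for illustration only
for t in range(8):
    memory.append((f"s{t}", 0, 0.0, f"s{t + 1}"))
```

After the loop, only the five most recent observations remain in `memory`, which mirrors the "old ones are removed as new ones are added" behaviour.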

To prevent the primary NN from only learning about the immediate state of this DE run, mini-batches of observations are randomly drawn from memory (line 16) to perform a step of gradient optimization. Training the primary NN with randomly drawn observations helps it learn to perform the task robustly.

The primary NN is used to predict the next mutation strategy $\hat{a}_{t+1}$ (line 20) and its reward (line 21), without actually applying the mutation. A target reward value $r^{\text{target}}$ is used to train the primary NN, i.e., finding the weights $\theta$ that minimise the loss function $(r^{\text{target}}-Q(s_{j},a_{j};\theta))^{2}$ (line 22). If the run terminates, i.e., if the budget assigned to the problem is finished, $r^{\text{target}}$ is the same as the reward $r_{t}$ . Otherwise, $r^{\text{target}}$ is estimated (line 21) as a linear combination of the current reward $r_{t}$ and the predicted future reward $\gamma\hat{Q}(s_{t+1},\hat{a}_{t+1})$ , where $\hat{Q}$ is the (predicted) target Q-value and $\gamma$ is the discount factor that makes the training focus more on immediate results compared to future rewards.
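The target computation described above can be sketched in NumPy as follows. The two networks are represented by plain callables `q_primary` and `q_target` that map a batch of states to a matrix of Q-values; these names, the function `ddqn_targets` and the averaging of the squared loss over the mini-batch are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def ddqn_targets(rewards, next_states, dones, q_primary, q_target, gamma=0.99):
    """Compute r_target for a mini-batch of observations.

    q_primary / q_target: callables mapping a (batch, n_features) array to a
    (batch, n_actions) array of Q-values (stand-ins for the two NNs).
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    # The primary NN predicts the next mutation strategy ...
    next_actions = np.argmax(q_primary(next_states), axis=1)
    # ... and the target NN provides the Q-value of that strategy.
    next_q = q_target(next_states)[np.arange(len(rewards)), next_actions]
    # If the run has terminated, the target is the reward alone.
    return np.where(dones, rewards, rewards + gamma * next_q)

def squared_loss(targets, q_pred):
    """Mean of (r_target - Q(s_j, a_j))^2 over the mini-batch (averaging is assumed)."""
    return float(np.mean((np.asarray(targets) - np.asarray(q_pred)) ** 2))
```

Separating action selection (primary NN) from action evaluation (target NN) is the defining feature of DDQN compared with DQN.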

Finally, the primary and target NNs are synchronised periodically by copying the weights $\theta$ from the primary NN to the $\hat{\theta}$ of the target NN every fixed number of $C$ training steps (line 23). That is, the target NN uses an older set of weights to compute the target Q-value, which keeps the target value $r^{\text{target}}$ from changing too quickly. At every step of training (line 22), the Q-values generated by the primary NN shift. If we are using a constantly shifting set of values to calculate $r^{\text{target}}$ (line 21) and adjust the NN weights (line 22), then the target value estimations can easily become unstable by falling into feedback loops between $r^{\text{target}}$ and the (target) Q-values used to calculate $r^{\text{target}}$ . In order to mitigate that risk, the target NN is used to generate target Q-values ( $\hat{Q}$ ) that are used to compute $r^{\text{target}}$ , which is used in the loss function for training the primary NN. While the primary NN is trained, the weights of the target NN are fixed.
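The periodic synchronisation can be illustrated with a small sketch in which the network weights are plain dictionaries of NumPy arrays. The class name `TwoNetworks` and its methods are hypothetical; a real implementation would copy the weights of whichever neural-network library is in use.

```python
import copy

import numpy as np

class TwoNetworks:
    """Holds primary weights (theta) and target weights (theta_hat)."""

    def __init__(self, theta, sync_every=1000):
        self.theta = theta
        self.theta_hat = copy.deepcopy(theta)  # target starts as a copy
        self.sync_every = sync_every           # C in the text
        self.steps = 0

    def after_training_step(self):
        """Call once per gradient step; copies theta to theta_hat every C steps."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.theta_hat = copy.deepcopy(self.theta)
```

Between synchronisations the target weights stay fixed, so the targets used in the loss do not move with every gradient step.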

3.2. Online phase

Once the learning is finished, the weights of the primary NN are frozen. In the testing phase, the mutation strategy is selected online during an optimization run on an unseen function. The online AOS with DE is shown in Algorithm 2. Since the weights of the NN are not updated in this phase, we do not maintain a memory of observations or compute rewards. As a new state $s_{t}$ is observed, the Q-values per mutation strategy are calculated and a new mutation strategy is chosen according to the greedy policy (line 7).

4. State features and reward

In this section we describe the new state features and reward definitions explored for the proposed DE-DDQN method.

4.1. State representation

The state representation needs to provide sufficient information so that the NN can decide which action is more suitable at the current step. We propose a state vector consisting of various features capturing properties of the landscape and the history of operator performance. Each feature is normalised to the range $[0,1]$ by design in order to abstract absolute values specific to particular problems and help generalisation. Features are summarised in Table 1.

Our state needs to encode information about how the current solutions in the population are distributed in the decision space and their differences in fitness values. The fitness of the current parent $f(\vec{x}_{i})$ is given to the NN as a first state feature. The next feature is the mean of the fitness of the current population. The first two features are normalised by the difference between the worst and best fitness values seen so far. The third feature calculates the standard deviation of the population fitness values. Feature 4 measures the remaining budget of function evaluations. Feature 5 is the dimension of the function being solved. The training set includes benchmark functions with different dimensions in the hope that the NN is able to generalise to functions of any dimension within the training range. Feature 6, stagnation count, calculates the number of function evaluations since the last improvement of the best fitness found for this run (normalised by $FE^{\text{max}}$).
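Features 1 to 6 can be computed as in the following sketch, which follows the formulas of Table 1 and assumes minimisation. The function name `basic_features`, the argument names and the guard against a zero fitness range are assumptions for illustration. The value ${std}^{max}$ is taken as $(f_{wsf} - f_{bsf})/2$, which is the population standard deviation when half of the solutions have fitness $f_{wsf}$ and the other half $f_{bsf}$.

```python
import numpy as np

def basic_features(fitness, i, f_bsf, f_wsf, t, fe_max, dim_f, dim_max, stagcount):
    """Return features 1-6 of the state for parent i (illustrative sketch)."""
    fitness = np.asarray(fitness, dtype=float)
    span = f_wsf - f_bsf
    if span <= 0:
        span = 1.0  # assumption: avoid division by zero before any spread is seen
    std_max = span / 2.0  # std when half the population is at f_wsf, half at f_bsf
    return np.array([
        (fitness[i] - f_bsf) / span,      # 1: relative fitness of the parent
        (fitness.mean() - f_bsf) / span,  # 2: relative mean fitness
        fitness.std() / std_max,          # 3: normalised fitness std (population std)
        (fe_max - t) / fe_max,            # 4: remaining budget
        dim_f / dim_max,                  # 5: relative dimension
        stagcount / fe_max,               # 6: stagnation counter
    ])
```

For example, a population with fitness values `[0, 10, 0, 10]`, $f_{bsf} = 0$ and $f_{wsf} = 10$ reaches the maximum spread, so feature 3 equals 1.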

The next set of feature values describes the relation between the current parent and the six solutions used by the various mutation strategies, i.e., the five random indexes ($r_{1}$, $r_{2}$, $r_{3}$, $r_{4}$, $r_{5}$) and the best parent in the population ($\vec{x}_{\text{best}}$). Features 7–12 measure the Euclidean distance in decision space between the current parent $\vec{x}_{i}$ and the six solutions. These six Euclidean distances help the NN learn to select the strategy that best combines these solutions. Features 13–18 use the same six solutions to calculate the fitness difference w.r.t. $f(\vec{x}_{i})$. Feature 19 measures the normalised Euclidean distance in decision space between $\vec{x}_{i}$ and the best solution seen so far. We use distances instead of positions to make the state representation independent of the dimensionality of the solution space.
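A minimal sketch of the distance features (7–12 and 19) is given below. The function name `distance_features` and its arguments are hypothetical; ${dist}^{max}$ is computed as the Euclidean distance between the lower and upper bounds of the decision space, as stated in Table 1.

```python
import numpy as np

def distance_features(pop, i, rand_idx, best_idx, x_bsf, lower, upper):
    """Normalised distances from parent i to x_r1..x_r5, x_best and x_bsf."""
    dist_max = np.linalg.norm(np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float))
    # Order: five random solutions (features 7-11), best parent (12), best-so-far (19).
    others = [pop[j] for j in rand_idx] + [pop[best_idx], np.asarray(x_bsf, dtype=float)]
    return np.array([np.linalg.norm(pop[i] - x) / dist_max for x in others])
```

Because each distance is a single scalar, the length of this part of the state is the same for 10-dimensional and 30-dimensional problems.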

Describing the current population is not sufficient to select the best strategy. Reinforcement learning requires the state to be Markov, i.e., to include all necessary information for selecting an action. To this end, we enhance the state with features about the run-time history. Using historical information has been shown to be useful in our previous work (Sharma et al., 2018). In addition to the remaining budget and the stagnation counter described above, we also store four metric values $\textit{OM}_{m}(g,k,\textit{op})$ after the application of op at generation $g$:

(1) $\textit{OM}_{1}(g,k,\textit{op})=f(\vec{x}_{i})-f(\vec{u}_{i})$ , that is, the $k$ -th fitness improvement of offspring $\vec{u}_{i}$ over parent $\vec{x}_{i}$ ;

(2) $\textit{OM}_{2}(g,k,\textit{op})$ , the $k$ -th fitness improvement of offspring over $\vec{x}_{\text{best}}$ , the best parent in the current population;

(3) $\textit{OM}_{3}(g,k,\textit{op})$ , the $k$ -th fitness improvement of offspring over $\vec{x}_{\text{bsf}}$ , the best so far solution; and

(4) $\textit{OM}_{4}(g,k,\textit{op})$ , the $k$ -th fitness improvement of offspring over the median fitness of the parent population.
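The four metrics listed above can be sketched for a single offspring as below. This assumes minimisation; the function name `offspring_metrics` is hypothetical, and returning raw differences (where only positive values count as improvements) is an illustrative choice.

```python
from statistics import median

def offspring_metrics(f_parent, f_offspring, f_pop_best, f_bsf, pop_fitness):
    """Return (OM1, OM2, OM3, OM4) as raw differences (hypothetical helper).

    A positive value means the offspring improves on the reference:
    the parent, the best parent in the population, the best-so-far
    solution, and the median fitness of the parent population.
    """
    references = (f_parent, f_pop_best, f_bsf, median(pop_fitness))
    return tuple(ref - f_offspring for ref in references)
```

An offspring can therefore be a success under one metric and a failure under another, e.g. better than its parent but worse than the best-so-far solution.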

For each $\textit{OM}_{m}$ , the total number of fitness improvements (successes) is given by $N^{\text{succ}}_{m}(g,\textit{op})$ , that is, the index $k$ is always $1\leq k\leq N^{\text{succ}}_{m}(g,\textit{op})$ . The counter $N^{\text{tot}}(g,\textit{op})$ gives the total number of applications of op at generation $g$ . We store this historical information for the last gen number of generations.

With the information above, we compute the sum of success rates over the last gen generations, where each success rate is the number of successful applications of operator op, i.e., mutation strategy, in generation $g$ that improve metric $\textit{OM}_{m}$ divided by the total number of applications of op in the same generation. For each metric $\textit{OM}_{m}$ , the values for an operator are normalised by the sum of all values of all operators. A different success rate is calculated for each combination of $\textit{OM}_{m}$ ( $m\in\{1,2,3,4\}$ ) and op (four mutation strategies) resulting in features 20–35.
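As an illustration of features 20–35, the sketch below sums per-generation success rates and normalises them across operators for each metric. The data layout of `history` and the metric-major ordering of the output are assumptions, not taken from the paper.

```python
def success_rate_features(history, n_ops=4, n_metrics=4):
    """Summed success rates per (metric, operator), normalised over operators.

    history: one entry per stored generation; history[g][op] is a pair
             (n_tot, n_succ) where n_succ[m] counts applications of op
             in generation g that improved metric m (assumed layout).
    """
    features = []
    for m in range(n_metrics):
        sums = []
        for op in range(n_ops):
            total = 0.0
            for generation in history:
                n_tot, n_succ = generation[op]
                if n_tot > 0:                 # skip generations where op was unused
                    total += n_succ[m] / n_tot
            sums.append(total)
        norm = sum(sums)                      # sum over all operators for metric m
        features.extend(s / norm if norm > 0 else 0.0 for s in sums)
    return features
```

With four metrics and four mutation strategies, the function returns 16 values, and the values for each metric sum to one whenever at least one success was recorded.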

We also compute the sum of fitness improvements for each $\textit{OM}_{m}$ divided by the total number of applications of op over the last gen generations (features 36–51). Features 52–67 are defined in terms of the best fitness improvement of a mutation strategy op according to metric $\textit{OM}_{m}$ over a given generation $g$ , that is, $\textit{OM}^{\text{best}}_{m}(g,\textit{op})=\max_{1\leq k\leq N^{\text{succ}}_{m}(g,\textit{op})}\textit{OM}_{m}(g,k,\textit{op})$ . In this case, we calculate the relative difference in best improvement of the last generation with respect to the previous one, divided by the difference in the number of applications between the last two generations (gen and $\textit{gen}-1$ ). Any zero value in the denominator is ignored. The sum of the best improvements seen for each combination of operator and metric gives features 68–83.

Features 84–99 are calculated by maintaining a fixed-size window $W$ where each element is a tuple of the four metric values $OM_{m},m\in\{1,2,3,4\}$ and $f(\vec{u}_{i})$ resulting from the application of a mutation strategy to $\vec{x}_{i}$ that generates $\vec{u}_{i}$ . Initially, the window is filled with $OM_{m}$ values as new improved offspring are produced. Once it is full, new elements replace existing ones generated by the same mutation strategy according to the First-In First-Out (FIFO) rule. If the window contains no element produced by that operator, the element with the worst (highest) $f(\vec{u}_{i})$ is replaced. Each feature is the sum of $OM_{m}$ values within the window for each $m$ and each operator. The difference between features extracted from recent generations (68–83) and from the fixed-size window (84–99) is that the window captures the best solutions for each operator, and the number of solutions present per operator varies. In a sense, solutions compete to be part of the window, whereas when computing features from the last gen generations, all successful improvements per generation are captured and there is no competition among elements. As the most recent history is the most useful, we use small values: the last $\textit{gen}=10$ generations and a window size of $W=50$ .
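The window replacement rule can be sketched as a small class. The class name `OperatorWindow`, the tuple layout and the reading of FIFO as "drop the oldest element of the same operator" are assumptions made for illustration.

```python
class OperatorWindow:
    """Fixed-size window of (op, OM values, offspring fitness) tuples."""

    def __init__(self, size=50):
        self.size = size
        self.items = []  # oldest element first

    def add(self, op, om_values, f_offspring):
        item = (op, tuple(om_values), f_offspring)
        if len(self.items) >= self.size:
            same_op = [i for i, it in enumerate(self.items) if it[0] == op]
            if same_op:
                # FIFO among elements produced by the same operator
                del self.items[same_op[0]]
            else:
                # otherwise evict the element with the worst (highest) fitness
                worst = max(range(len(self.items)),
                            key=lambda i: self.items[i][2])
                del self.items[worst]
        self.items.append(item)

    def features(self, n_ops=4, n_metrics=4):
        """Sum of OM_m values in the window for each metric and operator."""
        sums = [[0.0] * n_ops for _ in range(n_metrics)]
        for op, om_values, _ in self.items:
            for m in range(n_metrics):
                sums[m][op] += om_values[m]
        return [sums[m][op] for m in range(n_metrics) for op in range(n_ops)]
```

An operator that has not yet entered the window can only do so by displacing the currently worst element, which is what makes the elements compete for a place.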

4.2. Reward definitions

While we only know the true reward of a sequence of actions after a full run of DE is completed, i.e., the best fitness found, such sparse rewards provide a very weak signal and can slow down training. Instead, we calculate rewards after every action has been taken, i.e., a new offspring $\vec{u}_{i}$ is produced from parent $\vec{x}_{i}$ . In this paper, we explore three reward definitions, each one using different information related to fitness improvement:

[TABLE]

R1 is the fitness difference between parent and offspring when an improvement is seen. This definition has been commonly used in the literature for parameter control (Pettinger and Everson, 2002; Chen et al., 2005; Sakurai et al., 2010). R2 assigns a higher reward to an improvement over the best so far solution than to an improvement over the parent. Finally, R3 is a variant of R1 relative to the difference between the offspring fitness and the optimal fitness, i.e., it maximises the fitness difference between parent and offspring and minimises the fitness difference between offspring and optimal solution. This definition can only be used when the optimum values of the functions used for training are known in advance.
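The three definitions can be sketched from the prose descriptions as follows, assuming minimisation. These are illustrative reconstructions rather than the paper's exact formulas: the constants 10 and 1 in `reward_r2` are assumptions chosen to match the ten-to-one ratio described for R2, the ratio form of `reward_r3` is one plausible reading of "relative to the difference between the offspring fitness and the optimal fitness", and the guard `eps` is hypothetical.

```python
def reward_r1(f_parent, f_offspring):
    """R1: fitness improvement of offspring over parent, zero otherwise."""
    return max(f_parent - f_offspring, 0.0)

def reward_r2(f_parent, f_offspring, f_bsf):
    """R2: fixed rewards; the values 10 and 1 are assumed, see lead-in."""
    if f_offspring < f_bsf:
        return 10.0   # improves on the best-so-far solution
    if f_offspring < f_parent:
        return 1.0    # improves on the parent only
    return 0.0

def reward_r3(f_parent, f_offspring, f_opt, eps=1e-12):
    """R3: improvement over parent scaled by the offspring's distance to the optimum."""
    return max((f_parent - f_offspring) / (f_offspring - f_opt + eps), 0.0)
```

Note that only `reward_r2` is independent of the scale of the fitness values; the other two grow with the range of the function being optimised.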

5. Experimental design

In our implementation of DE-DDQN, the primary and target NNs are multi-layer perceptrons. We integrate the three reward definitions R1, R2 and R3 into DE-DDQN and the resulting methods are denoted DE-DDQN1, DE-DDQN2 and DE-DDQN3, respectively. For each of these methods, we trained four NNs using batch sizes 64 or 128 and 3 or 4 hidden layers, and we picked the best combination of batch size and number of hidden layers according to the total accumulated reward during the training phase. In all cases, the most successful configuration was batch size 64 with 4 hidden layers. Results of other configurations are not shown in the paper.

The rest of the parameters are not tuned but set to typical values. In the training phase, we applied an $\epsilon$ -greedy policy with $\epsilon=10\%$ , that is, 10% of the actions are selected randomly and the rest according to the highest Q-value. In the warm-up phase during training, we set the capacity of the memory of observations larger than the warm-up size so that 90% of the memory is filled up with observations from random actions and the rest with actions selected by the NN. The gradient descent algorithm used to update the weights of the NN during training is Adam (Kingma and Ba, 2014). Table 2 shows all hyperparameter values.
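The $\epsilon$ -greedy selection described above is the standard rule and can be sketched in a few lines. The function name `epsilon_greedy` and the `rng` argument are illustrative; `q_values` stands for the Q-values predicted by the NN for the four mutation strategies.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# Usage: index of the mutation strategy to apply to the current parent.
action = epsilon_greedy([0.2, 1.3, -0.4, 0.9], epsilon=0.1)
```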

We compared the three proposed DE-DDQN variants with ten baselines: random selection of mutation strategies (Random), four different fixed-strategy DEs (DE1–DE4), PM-AdapSS (AdapSS) (Fialho et al., 2010b), F-AUC (FAUC) (Gong et al., 2010), RecPM-AOS (RecPM) (Sharma et al., 2018) and the two winners of the CEC2005 competition, which are both variants of CMAES: LR-CMAES (LR) (Auger and Hansen, 2005a) and IPOP-CMAES (IPOP) (Auger and Hansen, 2005b). Among all these alternatives, AdapSS, FAUC and RecPM are AOS methods that were proposed to adaptively select mutation strategies. The parameters of these AOS methods were previously tuned with the offline configurator irace (Sharma et al., 2018), and the tuned hyperparameter values (parameters of the AOS methods, not of DE) have been used in the experiments. The first eight baselines involve the DE algorithm with the following parameter values: population size ( $\textit{NP}=100$ ), scaling factor ( $F=0.5$ ) and crossover rate ( $CR=1.0$ ). This choice for parameter $F$ has shown good results (Fialho, 2010). We chose $CR=1.0$ to see the full potential of mutation strategies to evolve each dimension of each parent. For the comparison, the results of LR and IPOP are taken from their original papers from the CEC2005 competition.

5.1. Training and testing

In order to force the NN to learn a general policy, we train on different classes of functions. From the 25 functions of the cec2005 benchmark suite (Suganthan et al., 2005), we excluded non-deterministic functions and functions without bounds (functions $F4$ , $F7$ , $F17$ and $F25$ ). The remaining 21 functions can be divided into four classes: unimodal functions $F1$ – $F5$ ; basic multimodal functions $F6$ – $F12$ ; expanded multimodal functions $F13$ – $F14$ ; and hybrid composition functions $F15$ – $F24$ . We split these 21 functions into roughly $75\%$ training and $25\%$ testing sets, that is, $16$ functions ( $F1$ , $F2$ , $F5$ , $F6$ , $F8$ , $F10$ – $F15$ , $F19$ – $F22$ and $F24$ ) are assigned to the training set and the rest ( $F3$ , $F9$ , $F16$ , $F18$ and $F23$ ) are assigned to the test set. According to the above classification, the training set contains at least two functions from each class and the test set contains at least one function from each class except for expanded multimodal functions, as both functions of this class are included in the training set. For each function, we consider both dimensions $10$ and $30$ , giving a total of $32$ problems for training and $10$ problems for testing.
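The split described above can be checked mechanically. The sketch below only encodes the function indices and class boundaries stated in the text; the variable and helper names are illustrative.

```python
# Function indices follow the text; F4, F7, F17 and F25 are excluded.
EXCLUDED = {4, 7, 17, 25}
TRAIN = {1, 2, 5, 6, 8, *range(10, 16), *range(19, 23), 24}
TEST = {3, 9, 16, 18, 23}

# Class boundaries as stated in the text (inclusive index ranges).
CLASSES = {
    "unimodal": range(1, 6),
    "basic multimodal": range(6, 13),
    "expanded multimodal": range(13, 15),
    "hybrid composition": range(15, 25),
}

def per_class_counts(functions):
    """Count how many of the given function indices fall in each class."""
    return {name: len(functions & set(idx)) for name, idx in CLASSES.items()}

# Each function is used in dimensions 10 and 30.
train_problems = [(f, d) for f in sorted(TRAIN) for d in (10, 30)]
test_problems = [(f, d) for f in sorted(TEST) for d in (10, 30)]
```

The per-class counts are 3, 5, 2 and 6 functions for training and 1, 1, 0 and 3 for testing, which matches the statement that the test set has no expanded multimodal function.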

During training, we cycle through the 32 training problems multiple times and keep track of the mean reward achieved in each cycle. We overwrite the weights of the NN if the mean reward is better than what we have observed in previous cycles. We found this measure of progress was better than comparing rewards after individual runs, because different problems vary in difficulty making rewards incomparable. After each cycle, the 32 problems are shuffled before being used again. The mean reward stopped improving after 1890 cycles (60480 problems, $6048\times 10^{5}$ FEs) which indicated the convergence of the learning process.
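The cycle-level checkpointing described above can be sketched as a plain loop. The callables `run_episode` (one DE training run on a problem, returning its reward) and `save_weights` are hypothetical placeholders for the actual DE-DDQN training code.

```python
import random

def train_cycles(problems, run_episode, save_weights, n_cycles, seed=0):
    """Cycle through all problems, keeping the weights of the best cycle."""
    rng = random.Random(seed)
    order = list(problems)
    best_mean = float("-inf")
    for _ in range(n_cycles):
        rewards = [run_episode(problem) for problem in order]
        mean_reward = sum(rewards) / len(rewards)
        if mean_reward > best_mean:   # compare whole cycles, not single runs
            best_mean = mean_reward
            save_weights()
        rng.shuffle(order)            # reshuffle before the next cycle
    return best_mean
```

As a check on the reported figures, 1890 cycles of 32 problems give 1890 × 32 = 60480 runs, and $6048\times 10^{5}$ FEs divided by 60480 runs corresponds to $10^{4}$ FEs per run.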

Although the computational cost of the training phase is significant compared to a single run of DE, this cost is incurred offline, i.e., one time on known benchmark functions before solving any unseen function, and it can be significantly reduced by means of parallelisation and GPUs. On the other hand, we conjecture that training on even more data from different classes of functions should allow the application of DE-DDQN to a larger range of unknown functions.

After training, the NN weights were saved and used for the testing (online) phase. (The weights obtained after training are available on GitHub (Sharma et al., 2019) together with the source code, and can be used for testing on similar functions including expanded multimodal. The code may be adapted to train or test using other benchmark suites such as bbob with functions of up to dimension $50$ .) For testing, each DE-DDQN variant was independently run 25 times on each test problem, and each run was stopped when either the absolute error with respect to the optimum was smaller than $10^{-8}$ or $10^{4}$ function evaluations were exhausted. The mean and standard deviation of the final error values achieved by the 25 runs are reported in Table 3.
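The stopping rule and the per-problem summary can be sketched as follows. The helper names are illustrative, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def should_stop(f_best, f_opt, fe_used, tol=1e-8, fe_max=10**4):
    """Stop when the error is below tol or the evaluation budget is spent."""
    return abs(f_best - f_opt) < tol or fe_used >= fe_max

def summarise(final_errors):
    """Mean and (sample) standard deviation of the final errors of all runs."""
    return mean(final_errors), stdev(final_errors)
```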

5.2. Discussion of results

The average rankings of each method among the 10 test problem instances are shown in Table 4. The differences among the 13 algorithms are significant ( $p<.01$ ) according to the non-parametric Friedman test. We conducted a post-hoc analysis using the best performing method (DE-DDQN2) among the newly proposed ones as the control method for pairwise comparisons with the other methods. The p-values adjusted for multiple comparisons (Li, 2008) are shown in Table 5. The differences between DE-DDQN2 and the five baselines, random selection of operators and single strategy DEs (DE1-DE4), are significant while differences with other methods are not. The analysis makes clear that the proposed method learns to adaptively select the strategy at different stages of a DE run.
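For reference, average ranks and the Friedman statistic can be computed as in the sketch below. It uses the textbook formula without tie correction and is not the authors' analysis script; the function names and the `errors` layout are illustrative. The p-value would be obtained by comparing the statistic against a chi-squared distribution with $k-1$ degrees of freedom, which is omitted here.

```python
def average_ranks(errors):
    """errors[p][a]: error of algorithm a on problem p (lower is better)."""
    n_alg = len(errors[0])
    totals = [0.0] * n_alg
    for row in errors:
        order = sorted(range(n_alg), key=row.__getitem__)
        i = 0
        while i < n_alg:
            j = i
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1                    # extend the group of tied values
            rank = (i + j) / 2 + 1        # mean of 1-based positions i+1..j+1
            for k in range(i, j + 1):
                totals[order[k]] += rank
            i = j + 1
    return [t / len(errors) for t in totals]

def friedman_statistic(errors):
    """Friedman chi-squared statistic from average ranks (no tie correction)."""
    n, k = len(errors), len(errors[0])
    ranks = average_ranks(errors)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks)
                                     - k * (k + 1) ** 2 / 4)
```

In the setting above, `errors` would have 10 rows (test problems) and 13 columns (algorithms).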

While differences between the three reward definitions are not statistically significant, the rankings provide some evidence that R2 performs better than the other two definitions. R2, being a simple definition that assigns fixed reward values, is not affected by the function range, whereas R1 and R3, which involve raw function values, may mislead the NN when dealing with functions with different fitness ranges. R2 assigns ten times more reward when the offspring improves over the best so far solution than when it improves over its parent. Thus, DE-DDQN2 may learn to generate offspring that not only tend to improve over the parent but also improve the best fitness seen so far. In contrast, R1 considers the improvement of offspring over parent only and is less informative than R3, which considers improvement over parent and optimum value. The improvement can be small or large when function values with different ranges are considered. As a result, R1 and R3 become less informative about choosing operators that will solve the problem within the given number of function evaluations. Although R3 scales fitness improvement with distance from the optimum, which partially mitigates the effect of different ranges among functions, inconsistent ranges are still problematic. The R2 definition encourages the generation of offspring that are better than the best so far candidate, and it is invariant to differences in function ranges. Comparing with other methods proposed in the literature shows that DE variants with a suitable operator selection strategy can perform similarly to CMAES variants, which are known to be the best performing methods for this class of problems.

To further analyze the difference between DE-DDQN and other AOS methods, we provide boxplots of the results of 25 runs of DE-DDQN2, PM-AdapSS and RecPM-AOS on each function (Fig. 1). We observe that the overall minimum function value found across the 25 runs is lower for DE-DDQN2 in all problems except $F9$ -10 and $F16$ -30. As seen in the boxplots, for $F18$ and $F23$ with dimension 10, DE-DDQN2 often gets stuck at local optima, but manages to find a better overall solution compared to the other methods. Other methods find high variance solutions in these cases. At the same time, the median values of solutions found are better for six out of ten problems. This observation suggests that incorporating restart strategies similar to those used by IPOP-CMAES can be particularly useful for DE-DDQN and gives us a direction for future work. DE-DDQN2 performs well consistently for the unimodal $F3$ with both 10 and 30 dimensions, while the other AOS methods find relatively higher error solutions with high variance. We interpret this as an indication that DE-DDQN can identify this type of problem and apply a more suitable AOS strategy than RecPM-AOS and PM-AdapSS. On the other hand, we see that for $F16$ -30 and $F23$ -30, DE-DDQN2 exhibits higher variance of solutions, which suggests that higher dimensional multimodal functions often confuse the NN, leading it to suboptimal behaviour.

6. Conclusion

We presented DE-DDQN, a Deep-RL-based operator selection method that learns to select online the mutation strategies of DE. DE-DDQN has two phases: offline training and online evaluation. During training, we collected data from DE runs using a reward metric to assess the performance of the selected mutation action and 99 features to evaluate the state of the DE. Features and reward values are used to optimise the weights of a neural network to learn the most rewarding mutation given the DE state. The weights learned during training are then used during the online phase to predict the mutation strategy to use when solving a new problem. Experiments were run using 21 functions from the cec2005 benchmark suite, each evaluated with dimensions 10 and 30. A set of 32 problems was used for training, and we ran the online phase on a different test set of 10 problems.

All three proposed methods outperform all the non-AOS baselines based on mean error seen in 25 runs on test functions. This shows that the proposed methods can learn to select the right strategy at different stages of the algorithm. Our statistical analysis suggests that differences between the best proposed method and the AOS methods from the literature are not significant, but the best performing version of our model, DE-DDQN2, was ranked overall second after IPOP-CMAES. The R2 reward function, which assigns fixed reward values when better solutions are found, is more helpful for learning an AOS strategy.

For future work, we want to explore applications of Deep RL for learning to control more parameters of evolutionary algorithms, including combinations of discrete and continuous parameters. We also expect that an extensive tuning of state features and hyperparameter values will further improve performance of the method.

Bibliography


Aleti and Moser (2016) A. Aleti and I. Moser. 2016. A systematic literature review of adaptive parameter control methods for evolutionary algorithms. Comput. Surveys 49, 3, Article 56 (Oct. 2016), 35.
Auger and Hansen (2005a) A. Auger and N. Hansen. 2005a. Performance evaluation of an advanced local search evolutionary algorithm. In Proceedings of the 2005 Congress on Evolutionary Computation (CEC 2005). IEEE Press, Piscataway, NJ, 1777–1784.
Auger and Hansen (2005b) A. Auger and N. Hansen. 2005b. A restart CMA evolution strategy with increasing population size. In Proceedings of the 2005 Congress on Evolutionary Computation (CEC 2005). IEEE Press, Piscataway, NJ, 1769–1776.
Chen et al. (2005) F. Chen, Y. Gao, Z.-q. Chen, and S.-f. Chen. 2005. SCGA: Controlling genetic algorithms with Sarsa(0). In Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on, Vol. 1. IEEE, 1177–1183.
Eiben et al. (2006) A. E. Eiben, M. Horvath, W. Kowalczyk, and M. C. Schut. 2006. Reinforcement learning for online control of evolutionary algorithms. In International Workshop on Engineering Self-Organising Applications. Springer, 151–160.
Fialho (2010) Á. Fialho. 2010. Adaptive operator selection for optimization. Ph.D. Dissertation. Université Paris Sud-Paris XI.
Fialho et al. (2010a) Á. Fialho, R. Ros, M. Schoenauer, and M. Sebag. 2010a. Comparison-based adaptive strategy selection with bandits in differential evolution. In Parallel Problem Solving from Nature, PPSN XI, R. Schaefer et al. (Eds.). Lecture Notes in Computer Science, Vol. 6238. Springer, Heidelberg, Germany, 194–203.