A Reinforcement Learning Perspective on the Optimal Control of Mutation Probabilities for the (1+1) Evolutionary Algorithm: First Results on the OneMax Problem

Luca Mossina; Emmanuel Rachelson; Daniel Delahaye

arXiv:1905.03726·cs.NE·May 10, 2019

TL;DR

This paper explores using Reinforcement Learning to dynamically control mutation probabilities in a (1+1) evolutionary algorithm on the OneMax problem, demonstrating how RL can optimize algorithm parameters without prior knowledge of transition probabilities.

Contribution

It introduces a novel RL-based approach for parameter control in evolutionary algorithms, combining model-based and model-free methods to improve optimization performance.

Findings

1. RL can effectively optimize mutation probabilities in evolutionary algorithms.

2. The Q-Learning approach does not require explicit transition probabilities.

3. The method allows expert knowledge to be integrated into parameter control.

Abstract

We study how Reinforcement Learning can be employed to optimally control parameters in evolutionary algorithms. We control the mutation probability of a (1+1) evolutionary algorithm on the OneMax function. This problem is modeled as a Markov Decision Process and solved with Value Iteration via the known transition probabilities. It is then solved via Q-Learning, a Reinforcement Learning algorithm, where the exact transition probabilities are not needed. This approach also allows previous expert or empirical knowledge to be included into learning. It opens new perspectives, both formally and computationally, for the problem of parameter control in optimization.

Tables

Table 1: Empirical time-to-termination, 2000 runs

Policy	Constant $\theta=1/n$	$\pi(s)=1/(s+1)$	Optimal (MDP)
Average	442	430	412
Standard deviation	163	165	164

Equations

p_{Z^{\prime}}(z)=\begin{cases}\sum_{i=0}^{n_{0}}p_{W}(i+z;n_{0},\theta)\,p_{L}(i;n_{1},\theta), & z\geq 0\\ \sum_{i=0}^{n_{1}}p_{W}(i;n_{0},\theta)\,p_{L}(i-z;n_{1},\theta), & z<0\end{cases}

V_{n+1}(s)=\max_{\theta}\left[r(s,\theta)+\sum_{s^{\prime}}\mathbb{P}(s^{\prime}\mid s,\theta)\,V_{n}(s^{\prime})\right]

Q(s,\theta)\leftarrow Q(s,\theta)+\alpha\left[r+\max_{\theta^{\prime}}Q(s^{\prime},\theta^{\prime})-Q(s,\theta)\right]

Taxonomy

Topics: Evolutionary Algorithms and Applications · Reinforcement Learning in Robotics · Metaheuristic Optimization Algorithms Research

Methods: Q-Learning

Full text

Luca Mossina (1), Emmanuel Rachelson (1), Daniel Delahaye (2)

(1) ISAE-SUPAERO, Université de Toulouse

(2) ENAC, Université de Toulouse

1 Problem statement

We maximize the OneMax function $OM(x)=\sum_{i=1}^{n}x_{i}$, with $x_{i}\in\{0,1\}$, via the (1+1) Evolutionary Algorithm (EA). Given a random initialization of $x\in\{0,1\}^{n}$, at every iteration each bit is flipped (mutated) independently with probability $\theta$, yielding a solution candidate $x^{\prime}$. If $OM(x^{\prime})>OM(x)$, $x^{\prime}$ replaces $x$; otherwise $x$ is kept. We proceed until the terminal condition $OM(x)=n$ is met. We regard the evolution of $x$ as a stochastic process, conditioned at each step by $\theta$. This yields a Markov Decision Process (MDP; Puterman, 2014), whose optimal control policy can be found via Dynamic Programming when the transition probabilities are known, and via Reinforcement Learning (RL) when only experience data is available. (An online compendium with proofs and code to replicate our results is available at https://github.com/**********.)
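The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `max_iters` safeguard, and the demo values (`n = 20`, 200 runs) are all hypothetical choices made here.

```python
import random

def onemax(x):
    """OneMax fitness: the number of ones in the bit string."""
    return sum(x)

def one_plus_one_ea(n, theta, rng, max_iters=1_000_000):
    """Run the (1+1) EA with a fixed mutation probability theta.

    Returns the number of iterations performed until OM(x) == n
    (or max_iters, a safeguard that is an assumption of this sketch).
    """
    x = [rng.randint(0, 1) for _ in range(n)]  # random initialization
    t = 0
    while onemax(x) < n and t < max_iters:
        # Flip each bit independently with probability theta.
        candidate = [1 - b if rng.random() < theta else b for b in x]
        # Keep the candidate only if it is strictly better.
        if onemax(candidate) > onemax(x):
            x = candidate
        t += 1
    return t

rng = random.Random(0)
n = 20  # illustrative problem size
runs = [one_plus_one_ea(n, 1.0 / n, rng) for _ in range(200)]
print("average time-to-termination:", sum(runs) / len(runs))
```

Here `theta` is held constant during a run; a parameter control policy would instead recompute it from `onemax(x)` at every iteration.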

2 Related Work

Recent work (Karafotias et al., 2015) has proposed new mechanisms to control parameters in evolutionary algorithms dynamically, as opposed to tuning and fixing them prior to optimization. Some theoretical results (Bottcher et al., 2010; Doerr and Wagner, 2018; Doerr and Doerr, 2015; Giessen and Witt, 2015) have confirmed the intuition (e.g. the 1/5th rule) that adaptive parameters can perform substantially better than static tuning, and can even produce optimal behaviours in some cases. When exact analyses are not possible, we propose to use RL (Sutton et al., 1998) to estimate such optimal behaviours. Indeed, promising results (Karafotias et al., 2014; Buzdalova et al., 2014) have hinted at the potential of the generic use of RL in EAs.

3 Markov Decision Process

During the execution of the EA, we want to sequentially change $\theta$ to minimize the expected termination time. This problem can be formulated as an MDP, with states $S=\{0,1,2,\dots,n\}$, where $s=OM(x)$, and actions $A=\{\theta_{1},\theta_{2},\dots\}=\{0.01,0.02,\dots,0.99,1\}$ (a discretization of the mutation probability $\theta\in[0,1]$). At each step, a reward $r(s,\theta)$ is obtained, where $r(s,\theta)=0$ if the terminal state $s=n$ is reached, and $r(s,\theta)=-1$ otherwise. An optimal parameter control policy $\pi(s)=\theta$ maximizes $\mathbb{E}\left(\sum_{t=0}^{\infty}r_{t}\right)$ for any initial state $s$. (For readers used to MDP notation: this total reward criterion, with no discount factor $\gamma$, is well defined for Stochastic Shortest Path problems such as the one considered here.)

3.1 Transition Probabilities

The transition matrix $P=[\mathbb{P}\left(s^{\prime}\mid s,\theta\right)]_{\forall(s,\theta)\in(S,A)}$ describes the probability of transitioning to a state $s^{\prime}=OM(x^{\prime})$ from any $s=OM(x)$ given any action $\theta\in A$. At any iteration $t$, $x_{t}\in\{0,1\}^{n}$ has $OM(x_{t})=n_{1}$ ones and $n_{0}=n-n_{1}$ zeros. Let $W\sim Bin(n_{0},\theta)$ (the binomial distribution with parameters $(n_{0},\theta)$) be the random variable (r.v.) describing the ones gained at the end of an iteration, and let $L\sim Bin(n_{1},\theta)$ be the r.v. for the ones lost. $Z^{\prime}=W-L\in\{-n_{1},-n_{1}+1,\dots,n_{0}-1,n_{0}\}$ is the r.v. for the net gain after a mutation. Note that $Z^{\prime}$ is the difference of two independent binomial random variables. By convolution, it follows that the probability mass function of $Z^{\prime}$ is:

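$$p_{Z^{\prime}}(z)=\begin{cases}\sum_{i=0}^{n_{0}}p_{W}(i+z;n_{0},\theta)\,p_{L}(i;n_{1},\theta), & z\geq 0\\ \sum_{i=0}^{n_{1}}p_{W}(i;n_{0},\theta)\,p_{L}(i-z;n_{1},\theta), & z<0,\end{cases}$$

where $p_{W}$ and $p_{L}$ denote the binomial probability mass functions of $W$ and $L$, taken to be zero outside their supports.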

Under the (1+1) EA, if $Z^{\prime}\leq 0$ the solution candidate $x^{\prime}$ is rejected, so the accepted net gain is never negative. The r.v. $Z^{\prime}_{s}$ describing the actual change of state, $s^{\prime}=s+Z^{\prime}_{s}$, thus takes values $Z^{\prime}_{s}\in\{0,1,2,\dots,n_{0}-1,n_{0}\}$, where $\mathbb{P}(Z^{\prime}_{s}=0)=\mathbb{P}(Z^{\prime}<0)+\mathbb{P}(Z^{\prime}=0)$ and $\mathbb{P}(Z^{\prime}_{s}=k)=\mathbb{P}(Z^{\prime}=k)\ \forall k>0$.
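As a hedged illustration of this construction, the sketch below builds one row of the transition matrix from the two binomial distributions. The function names are hypothetical, and the net-gain probability is written as a single sum over the number of ones lost, which covers both signs of $z$ at once.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); zero outside the support."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def net_gain_pmf(z, n0, n1, theta):
    """P(W - L = z) for independent W ~ Bin(n0, theta), L ~ Bin(n1, theta)."""
    # Condition on the number of ones lost; the ones gained must be lost + z.
    return sum(binom_pmf(lost + z, n0, theta) * binom_pmf(lost, n1, theta)
               for lost in range(n1 + 1))

def transition_row(s, n, theta):
    """P(s' | s, theta) for s' = 0..n under the (1+1) EA acceptance rule."""
    n1, n0 = s, n - s
    row = [0.0] * (n + 1)
    for gain in range(1, n0 + 1):
        row[s + gain] = net_gain_pmf(gain, n0, n1, theta)
    # Every candidate with net gain <= 0 is rejected: the state is unchanged.
    row[s] = max(0.0, 1.0 - sum(row))
    return row

print(transition_row(1, 2, 0.5))
```

For $n=2$, $s=1$ and $\theta=0.5$, the only improving event is gaining the missing one without losing the existing one, which has probability $0.5\times 0.5=0.25$.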

4 Optimal Parameter Control

We briefly introduce the two main methods used to compute the optimal policy: one based on Dynamic Programming (Bellman, 1957), the other relying on $Q$-Learning (Watkins, 1989).

4.1 Dynamic Programming

The function $V^{\pi}:s\mapsto\mathbb{E}\left(\sum_{t=0}^{\infty}r_{t}\mid s_{0}=s\right)$ (called $\pi$'s value function) maps each state $s$ to (minus) its expected time-to-termination under $\pi$. The optimal policy's value function $V^{*}$ is defined recursively by Equation 1. Value Iteration is the Dynamic Programming algorithm that repeatedly applies Equation 1 until convergence to $V^{*}$.

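$$V_{n+1}(s)=\max_{\theta}\left[r(s,\theta)+\sum_{s^{\prime}}\mathbb{P}(s^{\prime}\mid s,\theta)\,V_{n}(s^{\prime})\right]\qquad(1)$$

Here the subscript of $V$ counts Value Iteration sweeps and is unrelated to the problem size $n$.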
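A minimal sketch of Value Iteration for this MDP is given below. It is not the authors' implementation: the function names, the stopping tolerance, the small problem size and the coarse action grid are assumptions made here to keep the example fast, and the reward is taken to be $-1$ for every iteration spent in a non-terminal state, with the terminal value fixed at $0$.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); zero outside the support."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def transition_row(s, n, theta):
    """P(s' | s, theta) for s' = 0..n; non-improving candidates are rejected."""
    n1, n0 = s, n - s
    row = [0.0] * (n + 1)
    for gain in range(1, n0 + 1):
        row[s + gain] = sum(
            binom_pmf(lost + gain, n0, theta) * binom_pmf(lost, n1, theta)
            for lost in range(n1 + 1))
    row[s] = max(0.0, 1.0 - sum(row))
    return row

def value_iteration(n, actions, tol=1e-9, max_sweeps=100_000):
    """Repeat the Bellman backup until successive sweeps differ by < tol."""
    rows = {(s, a): transition_row(s, n, a)
            for s in range(n) for a in actions}

    def backup(s, a, V):
        # Assumed reward convention: -1 per iteration in a non-terminal state.
        return -1.0 + sum(p * v for p, v in zip(rows[s, a], V))

    V = [0.0] * (n + 1)  # V[n] is the terminal state and stays at 0
    for _ in range(max_sweeps):
        new_V = [max(backup(s, a, V) for a in actions)
                 for s in range(n)] + [0.0]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [max(actions, key=lambda a: backup(s, a, V)) for s in range(n)]
    return V, policy

n = 6                                     # illustrative problem size
actions = [k / 20 for k in range(1, 21)]  # coarse grid 0.05, ..., 1.0
V, policy = value_iteration(n, actions)
print([round(v, 2) for v in V])
print(policy)
```

With this sign convention, `-V[s]` is the expected number of iterations to reach $s=n$ from state $s$ under the greedy policy returned alongside it.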

Figure 1 reports the computed value functions for the following three policies (plotted in Figure 2):

• the constant $\theta=\frac{1}{n}$ commonly used in the (1+1) EA,

• the $\pi(s)=\frac{1}{1+s}$ policy from (Bottcher et al., 2010) (originally designed for the LeadingOnes function),

• the optimal policy found via Value Iteration.

In Figure 1 the marks correspond to the empirical average time-to-termination $T$ over 2000 runs, initialized respectively at $S_{init}=\{5,10,22,45\}$. Table 1 reports the average $T$ for a random starting state.
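For a fixed policy, the expected time-to-termination can also be computed exactly rather than estimated from runs. The sketch below is one possible way to do so and is not taken from the paper: because the state never decreases, $\mathbb{E}[T\mid s]$ can be obtained by a backward recursion from $s=n-1$ down to $s=0$. The function names and the problem size are assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); zero outside the support."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def transition_row(s, n, theta):
    """P(s' | s, theta) for s' = 0..n; non-improving candidates are rejected."""
    n1, n0 = s, n - s
    row = [0.0] * (n + 1)
    for gain in range(1, n0 + 1):
        row[s + gain] = sum(
            binom_pmf(lost + gain, n0, theta) * binom_pmf(lost, n1, theta)
            for lost in range(n1 + 1))
    row[s] = max(0.0, 1.0 - sum(row))
    return row

def expected_time(n, policy):
    """E[T | s0 = s] for s = 0..n under a fixed policy s -> theta.

    Assumes the policy always leaves a positive probability of improving,
    so that the self-loop probability is strictly below 1.
    """
    T = [0.0] * (n + 1)  # T[n] = 0: already terminated
    for s in range(n - 1, -1, -1):
        row = transition_row(s, n, policy(s))
        upward = sum(row[k] * T[k] for k in range(s + 1, n + 1))
        # T[s] = 1 + row[s] * T[s] + upward, solved for T[s].
        T[s] = (1.0 + upward) / (1.0 - row[s])
    return T

n = 20  # illustrative problem size
T_const = expected_time(n, lambda s: 1.0 / n)
T_adapt = expected_time(n, lambda s: 1.0 / (s + 1))
for s in (5, 10, 15):
    print(s, round(T_const[s], 1), round(T_adapt[s], 1))
```

The value function of the policy is then $V^{\pi}(s)=-\mathbb{E}[T\mid s]$, under the assumption that each iteration from a non-terminal state costs $-1$.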

4.2 $Q$-Learning

Although one can explicitly compute the transition probabilities for the parameter control problem based on the OneMax function, such probabilities are generally not available. Learning mechanisms such as $Q$-Learning make it possible to obtain $\pi^{*}$ from sampled transitions, without explicitly requiring $P$.

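$$Q(s,\theta)\leftarrow Q(s,\theta)+\alpha\left[r+\max_{\theta^{\prime}}Q(s^{\prime},\theta^{\prime})-Q(s,\theta)\right]\qquad(2)$$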

To that end, it learns the optimal state-action value function $Q^{*}(s,\theta)=r(s,\theta)+\mathbb{E}_{s^{\prime}}\left(V^{*}(s^{\prime})\right)$. $Q$-Learning is a stochastic approximation process: it repeats the operation of Equation 2 in all states and actions until convergence to $Q^{*}$ (which boils down to solving Equation 1). The optimal policy is then the greedy policy $\pi^{*}(s)=\arg\max_{\theta}Q^{*}(s,\theta)$.
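The following sketch shows a tabular, undiscounted version of this update applied to sampled transitions only. It is an illustration under assumptions rather than the authors' setup: the $\epsilon$-greedy exploration, the constant step size `alpha` (convergence guarantees normally call for a decaying one), the zero initialization, the random start states and all names and numeric values are choices made here.

```python
import random

def sample_next_state(s, n, theta, rng):
    """Sample one (1+1) EA iteration at the level of the state s = OM(x)."""
    gained = sum(rng.random() < theta for _ in range(n - s))  # zeros flipped
    lost = sum(rng.random() < theta for _ in range(s))        # ones flipped
    return s + gained - lost if gained > lost else s

def q_learning(n, actions, episodes, alpha, epsilon, rng):
    """Tabular Q-Learning with reward -1 per iteration and Q(n, .) = 0."""
    Q = [[0.0] * len(actions) for _ in range(n + 1)]
    for _ in range(episodes):
        s = rng.randrange(n)  # random non-terminal start state
        while s < n:
            if rng.random() < epsilon:
                a = rng.randrange(len(actions))  # explore
            else:
                a = max(range(len(actions)), key=lambda i: Q[s][i])  # exploit
            s_next = sample_next_state(s, n, actions[a], rng)
            # Undiscounted target; the terminal row Q[n] is never updated.
            target = -1.0 + max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

rng = random.Random(0)
n = 6                                  # illustrative problem size
actions = [0.1, 0.2, 0.3, 0.4, 0.5]    # coarse illustrative action grid
Q = q_learning(n, actions, episodes=3000, alpha=0.1, epsilon=0.1, rng=rng)
greedy = [actions[max(range(len(actions)), key=lambda i: Q[s][i])]
          for s in range(n)]
print(greedy)
```

Prior expert or empirical knowledge could be injected by replacing the zero initialization of `Q` with an informed guess, for instance values derived from a known reasonable policy.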

5 Discussion

The approach presented above generalizes straightforwardly to other problems and algorithms. Our goal in this contribution was to illustrate how an RL perspective on (optimal) Parameter Control can help bring new contributions to the Optimization field. Extending this contribution to a larger class of problems opens new challenges:

• Continuous actions (parameters) are a common difficulty in RL, generally addressed with Policy Gradient methods.

• The state of an optimization process is problem and algorithm specific and might not always define a Markov process, thus leading to partial observability and/or approximations.

• The curse of dimensionality is a crucial issue in RL, and introducing expert knowledge in the learning process can greatly help convergence.

• Because the optimization process can be sampled at will, this freedom can be exploited to speed up convergence to an optimal parameter control policy.

• Minimizing the expected termination time is not the only relevant criterion. For instance, a natural alternative would be to maximize the time-discounted value function improvements (an approach close to the idea of regret minimization).

Bibliography

1. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
2. Bottcher, S., Doerr, B., and Neumann, F. (2010). Optimal fixed and adaptive mutation rates for the LeadingOnes problem. In Parallel Problem Solving from Nature, PPSN XI.
3. Buzdalova, A., Kononov, V., and Buzdalov, M. (2014). Selecting evolutionary operators using reinforcement learning: Initial explorations. In Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation.
4. Doerr, B. and Doerr, C. (2015). Optimal parameter choices through self-adjustment: Applying the 1/5-th rule in discrete settings. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation.
5. Doerr, C. and Wagner, M. (2018). On the effectiveness of simple success-based parameter selection mechanisms for two classical discrete black-box optimization benchmark problems. arXiv preprint arXiv:1803.01425.
6. Giessen, C. and Witt, C. (2015). Population size vs. mutation strength for the (1+λ) EA on OneMax. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation.
7. Karafotias, G., Eiben, A. E., and Hoogendoorn, M. (2014). Generic parameter control with reinforcement learning. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation.
8. Karafotias, G., Hoogendoorn, M., and Eiben, Á. E. (2015). Parameter control in evolutionary algorithms: Trends and challenges. IEEE Transactions on Evolutionary Computation, 19(2):167–187.
