The Runtime of the Compact Genetic Algorithm on Jump Functions

Benjamin Doerr

arXiv:1908.06527·cs.NE·October 12, 2021


TL;DR

This paper improves the understanding of the runtime of the compact genetic algorithm on jump functions, showing it can efficiently optimize small jump sizes and establishing tight exponential bounds for larger jumps.

Contribution

It significantly refines previous runtime bounds for the cGA on jump functions, especially for small jump sizes, and proves exponential lower bounds for large jumps, highlighting the algorithm's capabilities and limitations.

Findings

01. For small jump sizes, the cGA achieves $O(n \log n)$ runtime.

02. For large jump sizes, the exponential runtime bound is tight and unavoidable.

03. The cGA can cross fitness valleys efficiently for small jumps, unlike many classic evolutionary algorithms.

Abstract

In the first and so far only mathematical runtime analysis of an estimation-of-distribution algorithm (EDA) on a multimodal problem, Hasenöhrl and Sutton (GECCO 2018) showed for any $k=o(n)$ that the compact genetic algorithm (cGA) with any hypothetical population size $\mu=\Omega(ne^{4k}+n^{3.5+\varepsilon})$ with high probability finds the optimum of the $n$-dimensional jump function with jump size $k$ in time $O(\mu n^{1.5}\log n)$. We significantly improve this result for small jump sizes $k\leq\frac{1}{20}\ln n-1$. In this case, already for $\mu=\Omega(\sqrt{n}\log n)\cap\operatorname{poly}(n)$ the runtime of the cGA with high probability is only $O(\mu\sqrt{n})$. For the smallest admissible values of $\mu$, our result gives a runtime of $O(n\log n)$, whereas the previous one only shows $O(n^{5+\varepsilon})$. Since it is known that the cGA with high probability needs at…



Full text

The Runtime of the Compact Genetic Algorithm on Jump Functions

Note: Extended version of results that appeared at GECCO 2019 [Doe19c] and FOGA 2019 [Doe19b]. It contains as a new result the $\Omega(\mu\sqrt{n}+n\log n)$ lower bound. All other results have been significantly rewritten, both to polish the arguments and to give a more unified treatment of the two previous works. In this process, the GECCO 2019 results were extended to subjump functions and the FOGA 2019 results were extended to superjump functions, two natural extensions of the jump function class. This work was supported by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with the Gaspard Monge Program for optimization, operations research and their interactions with data sciences.

Benjamin Doerr

Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France

Abstract

In the first and so far only mathematical runtime analysis of an estimation-of-distribution algorithm (EDA) on a multimodal problem, Hasenöhrl and Sutton (GECCO 2018) showed for any $k=o(n)$ that the compact genetic algorithm (cGA) with any hypothetical population size $\mu=\Omega(ne^{4k}+n^{3.5+\varepsilon})$ with high probability finds the optimum of the $n$ -dimensional jump function with jump size $k$ in time $O(\mu n^{1.5}\log n)$ .

We significantly improve this result for small jump sizes $k\leq\frac{1}{20}\ln n-1$. In this case, already for $\mu=\Omega(\sqrt{n}\log n)\cap\operatorname{poly}(n)$ the runtime of the cGA with high probability is only $O(\mu\sqrt{n})$. For the smallest admissible values of $\mu$, our result gives a runtime of $O(n\log n)$, whereas the previous one only shows $O(n^{5+\varepsilon})$. Since it is known that the cGA with high probability needs at least $\Omega(\mu\sqrt{n})$ iterations to optimize the unimodal OneMax function, our result shows that the cGA, in contrast to most classic evolutionary algorithms, is here able to cross moderate-sized valleys of low fitness at no extra cost.

For large $k$, we show that the exponential (in $k$) runtime guarantee of Hasenöhrl and Sutton is tight and cannot be improved, not even by using a smaller hypothetical population size. We prove that any choice of the hypothetical population size leads to a runtime that, with high probability, is at least exponential in the jump size $k$. This result might be the first non-trivial exponential lower bound for EDAs that holds for arbitrary parameter settings.

To complete the picture, we show that the cGA with hypothetical population size $\mu=\Omega(\log n)$ with high probability needs $\Omega(\mu\sqrt{n}+n\log n)$ iterations to optimize any $n$-dimensional jump function. This bound was known for OneMax, but, as we also show, the usual domination arguments do not allow us to extend lower bounds on the performance of the cGA on OneMax to arbitrary functions with unique optimum.

As a side result, we provide a simple general method based on parallel runs that, under mild conditions, (i) overcomes the need to specify a suitable population size and still gives a performance close to the one stemming from the best-possible population size, and (ii) transforms EDAs with high-probability performance guarantees into EDAs with similar bounds on the expected runtime.

1 Introduction

Estimation-of-distribution algorithms (EDAs) [LL02, PHL15] are a particular class of evolutionary algorithms. Whereas typical classic evolutionary algorithms evolve a population of (hopefully good) solutions, EDAs evolve a probabilistic model of the search space, that is, a probability distribution over the set of all solutions. The target is to obtain distributions from which good solutions for the optimization problem at hand can be sampled easily.

While the mathematical analysis of classical evolutionary algorithms (EAs) has produced a plethora of insightful results, see, e.g., [NW10, AD11, Jan13, DN20], the rigorous understanding of EDAs is much less developed, see, e.g., the recent survey [KW20a]. Obviously, this is due to the highly complex stochastic processes that describe the runs of such algorithms. In consequence, despite significant efforts and deep results [Dro06, SW19, LSW18], not even the runtime of the compact genetic algorithm (cGA) on the OneMax benchmark function is fully understood (here we would argue that the cGA is the simplest EDA and that the unimodal OneMax function, counting the number of ones in a bit string, is the easiest optimization problem with unique global optimum). It is therefore not surprising that many questions which are well understood for EAs are only beginning to be understood for EDAs.

One such question is how EDAs optimize objective functions that are not unimodal. In the first and, prior to this work, only runtime analysis of an EDA on a multimodal problem, Hasenöhrl and Sutton [HS18] study the optimization time of the cGA on the jump function class. These functions are unimodal apart from having a valley of low fitness of scalable size $k$ around the global optimum. For a sufficiently large constant $C$ and any constant $\varepsilon>0$, they show [HS18, Theorem 3.3] that the cGA with hypothetical population size $\mu\geq\max\{Cne^{4k},n^{3.5+\varepsilon}\}$ (in the paper, this is stated as the minimum of the two terms, but from the proofs it is clear that it should be the maximum) with probability $1-o(1)$ finds the optimum of any jump function with jump size $k=o(n)$ in $O(\mu n^{1.5}\log n)$ generations (which is also the number of fitness evaluations, since the cGA evaluates only two search points in each iteration).

This result is remarkable in that it shows that the cGA with the right choice of $\mu$ and for $k\geq 6$ is more efficient on jump functions than most evolutionary algorithms, which have a runtime of at least $\Omega(n^{k})$ ; see Section 2.3.

1.1 An Improved Upper Bound for Small Jump Sizes

When the jump size $k$ is small, the runtime guarantee given by Hasenöhrl and Sutton [HS18] is still relatively large. We note that even when choosing the smallest possible population size $\mu=n^{3.5+\varepsilon}$ , the runtime guarantee becomes at least $\Omega(n^{5+\varepsilon})$ . While clearly a polynomial runtime, and thus efficient in the classic complexity theory view, this is a runtime that is not practical in many applications. Also, this runtime guarantee is weaker than the $O(n^{k})$ bound for simple mutation-based EAs such as the $(1+1)$ EA when $k\leq 5$ . Hence one could feel that the result of Hasenöhrl and Sutton shows the superiority of EDAs rather for problem instances for which both the runtime of typical EAs and the performance guarantee for the cGA are prohibitively large. In a similar vein, one has to question if a practitioner would run the cGA with a hypothetical population size of more than $n^{3.5}$ when solving a problem defined over bit strings of length $n$ .

Our first main result is that these potential weaknesses of the cGA are not real and that the cGA performs in fact much better than what the previous work shows. We prove rigorously that the cGA with hypothetical population size $\mu\geq K\sqrt{n}\log n$, $K$ a sufficiently large constant, and $\mu$ polynomially bounded in $n$, with high probability (that is, with probability $1-o(1)$, where the asymptotics are in $n$ for a fixed $k$, which might be a function of $n$) optimizes any $n$-dimensional jump function with jump size $k\leq\frac{1}{20}\ln n-1$ in only $O(\mu\sqrt{n})$ iterations. Hence we both improve the runtime guarantee in terms of $n$ and we enlarge the range of admissible values for $\mu$. For the smallest admissible population size $\mu=\Theta(\sqrt{n}\log n)$, we obtain a runtime guarantee of $O(n\log n)$.
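As a worked illustration of the two guarantees compared here, one can substitute the smallest admissible population sizes into the respective runtime bounds. This is plain arithmetic on the bounds quoted above, not an additional result:

$$\mu=n^{3.5+\varepsilon}:\quad O(\mu n^{1.5}\log n)=O(n^{3.5+\varepsilon}\cdot n^{1.5}\log n)=O(n^{5+\varepsilon}\log n),$$

$$\mu=\Theta(\sqrt{n}\log n):\quad O(\mu\sqrt{n})=O(\sqrt{n}\log n\cdot\sqrt{n})=O(n\log n).$$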

From a broader perspective our result yields that the cGA (and we expect similar results to hold for other EDAs) does not suffer from moderate-size valleys of low fitness. We recall that Sudholt and Witt [SW19] have shown that the cGA with any hypothetical population size (polynomial in $n$ ) with high probability needs $\Omega(\mu\sqrt{n})$ iterations to optimize the OneMax function. Hence our result shows that adding a valley of low fitness to the OneMax function does not worsen the asymptotic performance of the cGA as long as the fitness valley has a width of at most $\frac{1}{20}\ln n-1$ .

On the technical side, our work makes some arguments of [HS18] more rigorous. In particular, we observe that the progress of the cGA cannot be estimated by taking the progress one would have when no fitness valley were present and correcting this estimate by inverting the progress with the probability that a search point is sampled in the fitness valley. This argument ignores the stochastic dependencies between the absolute value of the progress and the event that a solution in the fitness valley is sampled. These dependencies are real and have, in fact, a negative impact on the progress as discussed in more detail before Lemma 17.

We note that the approach of intentionally ignoring some dependencies to make a mathematical analysis tractable, often called mean-field analysis, is common in some scientific areas, most notably statistical physics, and has also been used in evolutionary computation, e.g., [DZ20c]. This approach, however, needs an additional justification, e.g., via specific experiments, why the omission of the dependencies should not change the matter substantially. In any case, such mean-field approaches do not lead to results fully proven with mathematical rigor. In this sense, we hope that our work also provides methods that help in future analyses of EDAs on multimodal optimization problems.

1.2 An Exponential Lower Bound

When $k$ is larger, say $k=\omega(\log n)$ , then the runtime guarantee given in [HS18] is exponential in $k$ , simply because $\mu$ has to be at least exponential in $k$ to fulfill the assumptions of the result. It is clear that with an exponential hypothetical population size, the runtime must be exponential as well (for the sake of completeness, we shall make this elementary argument precise in Lemma 1). What is not immediately clear is if by choosing a smaller hypothetical population size the cGA can optimize jump functions more efficiently.

Our second main result is a negative answer to this question. In Theorem 22 we show that, regardless of the hypothetical population size, the runtime of the cGA on a jump function with jump size $k$ with high probability is at least exponential in $k$. Interestingly, not only is our result a uniform lower bound independent of the hypothetical population size, but our proof is also “uniform” in the sense that it needs case distinctions neither w.r.t. the hypothetical population size nor w.r.t. the different reasons for the lower bound. Here we recall that the existing runtime analyses, see, e.g., again [Dro06, SW19, LSW18], find two reasons why an EDA can be inefficient. (i) The hypothetical population size is large and consequently it takes long to move the frequencies into the direction of the optimum. (ii) The hypothetical population size is small and thus, in the absence of a strong fitness signal, the random walk of the frequencies brings some frequencies close to the boundaries of the frequency spectrum (this effect is known as genetic drift, see [DZ20b] for a recent discussion and relatively precise quantification); from there they are hard to move back into the game.

We avoid such potentially tedious case distinctions via an elegant drift argument on the sum of the frequencies. Ignoring some technicalities here, we show that, regardless of the hypothetical population size, the frequency sum overshoots a value of $n-\frac{1}{4}k$ only after an expected number of $\exp(\Omega(k))$ iterations. However, in an iteration where the frequency sum is below $n-\frac{1}{4}k$ , the optimum is sampled only with probability $\exp(-\Omega(k))$ . These two arguments prove our lower bound of order $\exp(\Omega(k))$ .

1.3 A Lower Bound for Small Jump Sizes

Since the exponential lower bound just discussed is not very strong for small jump sizes $k$ , we also prove a lower bound of $\Omega(\mu\sqrt{n}+n\log n)$ for the performance of the cGA with hypothetical population size $\mu=\Omega(\log n)$ on any jump function. This lower bound was shown before for the OneMax function [SW19]. While it is not surprising that the cGA is not more efficient on jump functions than on OneMax, this is not trivial to show. As we also observe in Section 6, a result like “OneMax is an easiest function with a unique global optimum”, which is true for many other evolutionary algorithms, cannot be proven with the usual arguments for the cGA. In fact, we currently have no indication that such a result is true for the cGA, nor do we have a counter-example.

1.4 Expected Runtimes of EDAs vs. Bounds with High Probability

As a side result, triggered by the fact that we “only” show an upper bound that holds with high probability, but not a bound on the expected runtime, we provide in Section 2.4 a general approach to transform an EDA using a population size parameter $\mu$ into an algorithm that does not require the specification of such a parameter, but has a performance similar to the one of the EDA with optimally chosen parameter. This performance guarantee also holds for the expected runtime, even if for the EDA only a with-high-probability runtime guarantee is known.

2 Preliminaries

2.1 The Compact Genetic Algorithm

The compact genetic algorithm (cGA) is an estimation-of-distribution algorithm (EDA) proposed by Harik, Lobo, and Goldberg [HLG99] for the maximization of pseudo-Boolean functions $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ . Being a univariate EDA, it develops a probabilistic model described by a frequency vector $f\in[0,1]^{n}$ . This frequency vector describes a probability distribution on the search space $\{0,1\}^{n}$ . If $X=(X_{1},\dots,X_{n})\in\{0,1\}^{n}$ is a search point sampled according to this distribution—we write

$$X\sim\operatorname{Sample}(f)$$

to indicate this—then we have $\Pr[X_{i}=1]=f_{i}$ independently for all $i\in[1..n]\coloneqq\{1,\dots,n\}$ . In other words, the probability that $X$ equals some fixed search point $y$ is

$$\Pr[X=y]=\prod_{i:y_{i}=1}f_{i}\prod_{i:y_{i}=0}(1-f_{i}).$$

In each iteration, the cGA updates this probabilistic model as follows. It samples two search points $x^{1},x^{2}\sim\operatorname{Sample}(f)$ , computes the fitness of both, and defines $(y^{1},y^{2})=(x^{1},x^{2})$ when $x^{1}$ is at least as fit as $x^{2}$ and $(y^{1},y^{2})=(x^{2},x^{1})$ otherwise. Consequently, $y^{1}$ is the better search point of the two (if not both have the same fitness). We then define a preliminary frequency vector by $f^{\prime}\coloneqq f+\frac{1}{\mu}(y^{1}-y^{2})$ , where $\mu$ is an algorithm parameter called hypothetical population size. This definition ensures that, when $y^{1}$ and $y^{2}$ differ in some bit position $i$ , the $i$ -th preliminary frequency moves by a step of $\frac{1}{\mu}$ into the direction of $y^{1}_{i}$ , which we hope to be the right direction since $y^{1}$ is the better of the two search points. The hypothetical population size $\mu$ controls how strong this update is.

To avoid a premature convergence, we ensure that the new frequency vector is in $[\frac{1}{n},1-\frac{1}{n}]^{n}$ by capping too small or too large values at the corresponding boundaries. More precisely, for all $\ell\leq u$ and all $r\in\mathbb{R}$ we define

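$$\operatorname{minmax}(\ell,r,u)\coloneqq\max\{\ell,\min\{r,u\}\}=\begin{cases}\ell&\mbox{if $r<\ell$}\\ r&\mbox{if $r\in[\ell,u]$}\\ u&\mbox{if $r>u$}\end{cases}$$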

and we lift this notation to vectors by reading it component-wise. Now the new frequency vector is $\operatorname{minmax}(\frac{1}{n}\mathbf{1}_{n},f^{\prime},(1-\frac{1}{n})\mathbf{1}_{n})$ .

This iterative frequency development is pursued until some termination criterion is met. Since we aim at analyzing the time (number of iterations) it takes to sample the optimal solution (this is what we call the runtime of the cGA), we do not specify a termination criterion and pretend that the algorithm runs forever.

The pseudo-code for the cGA is given in Algorithm 1. We shall use the notation given there frequently in our proofs. For the frequency vector $f_{t}$ obtained at the end of iteration $t$, we denote its $i$-th component by $f_{i,t}$ or, when there is no risk of ambiguity, by $f_{it}$. We shall frequently argue with the sum of the frequencies, which can be written as $\|f_{t}\|_{1}\coloneqq\sum_{i=1}^{n}|f_{it}|$ since the frequencies are non-negative. With a slight abuse of notation, we extend this common notation also to preliminary frequency vectors $f^{\prime}$ and thus write $\|f^{\prime}\|_{1}\coloneqq\sum_{i=1}^{n}f^{\prime}_{it}$ when there is no danger of confusion. Where there could be a chance of a critical misunderstanding, we use the much less common notation $\sum[v]\coloneqq\sum_{i=1}^{n}v_{i}$ to denote the sum of the entries of an $n$-dimensional vector $v\in\mathbb{R}^{n}$.
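The iteration described in this subsection can be sketched in Python as follows. This is a minimal illustrative sketch, not the paper's Algorithm 1: the names `clamp`, `sample`, `cga`, `max_iters`, and `optimum_value` are hypothetical, the stopping rule (stop once a sampled point has the known optimal fitness, or after `max_iters` iterations) is an assumption added so that the loop terminates, and the initial frequencies of $\frac{1}{2}$ are taken from the proof of Lemma 1.

```python
import random


def clamp(lo, r, hi):
    """minmax(lo, r, hi) = max(lo, min(r, hi)) for scalars."""
    return max(lo, min(r, hi))


def sample(f, rng):
    """Draw X ~ Sample(f): bit i is 1 with probability f[i], independently."""
    return [1 if rng.random() < fi else 0 for fi in f]


def cga(fitness, n, mu, max_iters, optimum_value, rng=None):
    """Run the cGA with frequency boundaries 1/n and 1 - 1/n.

    Returns the first iteration in which one of the two sampled points has
    fitness optimum_value, or None if that does not happen within max_iters
    iterations (the stopping rule is an assumption of this sketch).
    """
    if rng is None:
        rng = random.Random()
    f = [0.5] * n  # all frequencies start at 1/2
    for t in range(1, max_iters + 1):
        x1, x2 = sample(f, rng), sample(f, rng)
        v1, v2 = fitness(x1), fitness(x2)
        if max(v1, v2) == optimum_value:
            return t
        # y1 is the better of the two points (ties keep the sampling order).
        y1, y2 = (x1, x2) if v1 >= v2 else (x2, x1)
        # Preliminary update by (y1 - y2) / mu, then cap at the boundaries.
        f = [clamp(1 / n, fi + (a - b) / mu, 1 - 1 / n)
             for fi, a, b in zip(f, y1, y2)]
    return None
```

For instance, `cga(sum, n=50, mu=200, max_iters=10**6, optimum_value=50)` would run this sketch on OneMax; the returned iteration count is random and varies between runs.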

Well-behaved frequency assumption: For the hypothetical population size $\mu$, we take the common assumption that any two frequencies that can occur in a run of the cGA differ by a multiple of $\frac{1}{\mu}$. We call this the well-behaved frequency assumption. This assumption was implicitly already made in [HLG99] by using even $\mu$ in all experiments (note that the hypothetical population size is denoted by $n$ in [HLG99]). This assumption was made explicit in [Dro06] by requiring $\mu$ to be even. Neither work uses the frequency boundaries $\frac{1}{n}$ and $1-\frac{1}{n}$, so an even value for $\mu$ ensures well-behaved frequencies.

For the case with frequency boundaries, the well-behaved frequency assumption is equivalent to $(1-\frac{2}{n})$ being an even multiple of the update step size $\frac{1}{\mu}$ . In this case, $n_{\mu}=(1-\frac{2}{n})\mu\in 2\mathbb{N}$ and the set of frequencies that can occur is

$$F\coloneqq F_{\mu}\coloneqq\left\{\tfrac{1}{n}+\tfrac{i}{\mu}\,\middle|\,i\in[0..n_{\mu}]\right\}.$$

This assumption was made, e.g., in the papers [FKKS17] (see the last paragraph of Section II.C) and [LSW18] (see the paragraph following Lemma 2.1) as well as in the proof of Theorem 2 in [SW19].
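To make the well-behaved frequency assumption concrete, the following sketch enumerates the frequency set for given $n$ and $\mu$ using exact rational arithmetic. The function name `frequency_set` is hypothetical, and the example values of $n$ and $\mu$ are chosen purely for illustration.

```python
from fractions import Fraction


def frequency_set(n, mu):
    """Return the sorted list {1/n + i/mu : i in [0..n_mu]} as exact fractions.

    Raises ValueError if n_mu = (1 - 2/n) * mu is not an even integer,
    i.e. if the well-behaved frequency assumption is violated.
    """
    n_mu = (1 - Fraction(2, n)) * mu
    if n_mu.denominator != 1 or n_mu % 2 != 0:
        raise ValueError("(1 - 2/n) * mu must be an even integer")
    return [Fraction(1, n) + Fraction(i, mu) for i in range(int(n_mu) + 1)]


# Example: n = 10, mu = 10 gives n_mu = 8 and the nine values 1/10, ..., 9/10.
F = frequency_set(10, 10)
```

Because $n_{\mu}$ is even, the middle element $\frac{1}{n}+\frac{n_{\mu}/2}{\mu}$ equals $\frac{1}{2}$, so the initial frequency lies in the set, and the largest element is $1-\frac{1}{n}$.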

A trivial lower bound: We finish this subsection on the cGA with the following very elementary remark, which shows that the cGA with hypothetical population size $\mu$ with probability $1-\exp(-\Omega(n))$ has a runtime of at least $\min\{\frac{\mu}{4},\exp(\Theta(n))\}$ on any $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ with a unique global optimum (and also on all functions with a sufficiently small exponential number of optima). This shows, in particular, that the cGA with the parameter value $\mu=\exp(\Omega(k))$ used to optimize jump functions with gap size $k\in\omega(\log n)\cap o(n)$ in time $\exp(O(k))$ in [HS18] cannot have a runtime better than exponential in $k$ .

Lemma 1.

Let $\alpha,\beta\geq 0$ be constants such that $\alpha\beta<\frac{4}{3}$ . Let $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ have at most $\alpha^{n}$ optima. The probability that the cGA generates an optimum of $\mathcal{F}$ in $T=\min\{\frac{\mu}{4},\beta^{n}\}$ iterations is at most $2(\alpha\beta\frac{3}{4})^{n}=\exp(-\Omega(n))$ .

Proof.

By the definition of the cGA, the frequency vector $f$ used in iteration $t=1,2,3,\dots$ satisfies $f\in[\frac{1}{2}-\frac{t-1}{\mu},\frac{1}{2}+\frac{t-1}{\mu}]^{n}$ . Consequently, the probability that a fixed one of the two search points which are generated in this iteration is a fixed solution, is at most $(\frac{1}{2}+\frac{t-1}{\mu})^{n}$ . For $t\leq\frac{\mu}{4}$ , this is at most $(\frac{3}{4})^{n}$ . Hence by a simple union bound (over time and the global optima), the probability that an optimum is generated in the first $T=\min\{\frac{\mu}{4},\beta^{n}\}$ iterations, is at most $2\alpha^{n}T(\frac{3}{4})^{n}\leq 2(\alpha\beta\frac{3}{4})^{n}=\exp(-\Omega(n))$ . ∎
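The union bound in this proof can be evaluated numerically. The sketch below computes the time horizon $T$, the union bound $2\alpha^{n}T(\frac{3}{4})^{n}$, and the closed form $2(\alpha\beta\frac{3}{4})^{n}$ for given parameters; the function name `lemma1_bound` and the example values are illustrative assumptions, not values used in the paper.

```python
def lemma1_bound(n, mu, alpha, beta):
    """Return (T, union_bound, closed_form) for the bound of Lemma 1.

    T           = min(mu / 4, beta**n)
    union_bound = 2 * alpha**n * T * (3/4)**n
    closed_form = 2 * (alpha * beta * 3/4)**n
    """
    if alpha * beta >= 4 / 3:
        raise ValueError("Lemma 1 requires alpha * beta < 4/3")
    T = min(mu / 4, beta ** n)
    union_bound = 2 * alpha ** n * T * 0.75 ** n
    closed_form = 2 * (alpha * beta * 0.75) ** n
    return T, union_bound, closed_form


# Illustrative values: unique optimum (alpha = 1), beta = 1.2, n = 100.
T, union_bound, closed_form = lemma1_bound(n=100, mu=10**6, alpha=1.0, beta=1.2)
```

Since $T\leq\beta^{n}$, the union bound never exceeds the closed form, and both decay exponentially in $n$ whenever $\alpha\beta<\frac{4}{3}$.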

2.2 Runtime Analysis for the cGA

In this subsection, we briefly describe the relevant previous runtime analyses for the cGA. For simplicity, we shall always assume that the hypothetical population size is at most polynomial in the problem size $n$, that is, that there is a constant $c$ such that $\mu\leq n^{c}$. This is justified, among other reasons, by Lemma 1, which shows that a super-polynomial hypothetical population size immediately leads to a super-polynomial runtime on any objective function with at most $\alpha^{n}$ optima, where $\alpha$ can be any constant less than $\frac{4}{3}$.

The first to conduct a rigorous runtime analysis for the cGA was Droste in his seminal work [Dro06]. He regarded the cGA without frequency boundaries, that is, he just took $f_{t+1}\coloneqq f^{\prime}_{t+1}$ in our notation. He showed that this algorithm with $\mu\geq n^{1/2+\varepsilon}$ , $\varepsilon>0$ any positive constant, finds the optimum of the OneMax function defined by

$$\operatorname{\textsc{OneMax}}(x)=\|x\|_{1}=\sum_{i=1}^{n}x_{i}$$

for all $x\in\{0,1\}^{n}$ with probability at least $\frac{1}{2}$ in $O(\mu\sqrt{n})$ iterations [Dro06, Theorem 8].

Droste also showed that this cGA for any objective function $\mathcal{F}$ with unique optimum has an expected runtime of $\Omega(\mu\sqrt{n})$ when conditioning on no premature convergence [Dro06, Theorem 6]. It is easy to see that his proof of the lower bound can be extended to the cGA with frequency boundaries, that is, to Algorithm 1. For this, it suffices to deduce from his drift argument the result that the first time $T_{n/4}$ that the frequency distance $D=\sum_{i=1}^{n}(1-f_{it})$ is less than $\frac{n}{4}$ satisfies $E[T_{n/4}]\geq\frac{\sqrt{2}}{4}\mu\sqrt{n}$ . Since the probability to sample the optimum from a frequency distance of at least $\frac{n}{4}$ is at most $\exp(-\frac{n}{4})$ , see Lemma 9, the algorithm with high probability does not find the optimum before time $T_{n/4}$ .
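The $\exp(-\frac{n}{4})$ estimate used here follows from the elementary inequality $1+x\leq e^{x}$. As a short sketch, assuming the optimum is the all-ones string (as for OneMax and jump functions) and not claiming to reproduce the paper's proof of Lemma 9:

$$\Pr[x=(1,\dots,1)]=\prod_{i=1}^{n}f_{it}=\prod_{i=1}^{n}\bigl(1-(1-f_{it})\bigr)\leq\prod_{i=1}^{n}\exp\bigl(-(1-f_{it})\bigr)=\exp\Bigl(-\sum_{i=1}^{n}(1-f_{it})\Bigr)=\exp(-D).$$

In particular, $D\geq\frac{n}{4}$ gives a sampling probability of at most $\exp(-\frac{n}{4})$.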

Around ten years after Droste’s work, Sudholt and Witt [SW19] showed that the $O(\mu\sqrt{n})$ upper bound also holds for the cGA with frequency boundaries. There (but the same should be true for the cGA without boundaries) a hypothetical population size of $\mu=\Omega(\sqrt{n}\log n)$ suffices (recall that Droste required $\mu=\Omega(n^{1/2+\varepsilon})$ ). The technically biggest progress with respect to upper bounds most likely lies in the fact that the analysis in [SW19] also holds for the expected optimization time, which means that it also includes the rare case that frequencies reach the lower boundary (see our discussion of the relation of expectations and tail bounds for runtimes of EDAs in Section 2.4). Sudholt and Witt also show that the cGA with frequency boundaries with high probability (and thus also in expectation) needs at least $\Omega(\mu\sqrt{n}+n\log n)$ iterations to optimize OneMax. While the $\Omega(\mu\sqrt{n})$ lower bound could have been also obtained with methods similar to Droste’s (in Lemma 15 we do something very similar), the innocent-looking $\Omega(n\log n)$ bound is surprisingly difficult to prove.

Not much is known for hypothetical population sizes below the order of $\sqrt{n}$. It is clear that then the frequencies will reach the lower boundary of the frequency range, so working with a non-trivial lower boundary like $\frac{1}{n}$ is necessary to prevent premature convergence. The recent lower bound $\Omega(\mu^{1/3}n)$, valid for $\mu=O(\frac{\sqrt{n}}{\log n\log\log n})$ [LSW18], indicates that already a little below the $\sqrt{n}$ regime significantly larger runtimes occur, but with no upper bounds this regime remains poorly understood.

We refer the reader to the recent survey [KW20a] for more results on the runtime of the cGA on classic unimodal test functions like LeadingOnes and BinVal. Interestingly, nothing was known for multimodal functions before the recent work of Hasenöhrl and Sutton [HS18] on jump functions, which we discussed already in the introduction.

The general topic of lower bounds on runtimes of EDAs remains poorly understood. Apart from the lower bounds for the cGA on OneMax discussed above, the following is known. Krejca and Witt [KW20b] prove a lower bound for the UMDA on OneMax, which is of a similar flavor as the lower bound for the cGA of Sudholt and Witt [SW19]: For $\lambda=(1+\beta)\mu$, where $\beta>0$ is a constant, and $\lambda$ polynomially bounded in $n$, the expected runtime of the UMDA on OneMax is $\Omega(\mu\sqrt{n}+n\log n)$. For the binary value function BinVal, Droste [Dro06] and Witt [Wit18] together give a lower bound of $\Omega(\min\{n^{2},\mu n\})$ for the runtime of the cGA. Apart from these sparse results, we are not aware of any lower bounds for EDAs. Of course, the black-box complexity of the problem is a lower bound for any black-box algorithm, hence also for EDAs, but these bounds are often lower than the true complexity of a given algorithm. For example, the black-box complexities of OneMax, LeadingOnes, and jump functions with jump size $k\leq\frac{1}{2}n-n^{\varepsilon}$, $\varepsilon>0$ any constant, are $\Theta(\frac{n}{\log n})$ [DJW06, AW09], $\Theta(n\log\log n)$ [AAD+19], and $\Theta(\frac{n}{\log n})$ [BDK16], respectively.

2.3 Runtime Results for Jump Functions

To complete the picture, we briefly describe some typical runtimes of evolutionary algorithms on jump functions. We recall that the $n$ -dimensional jump function with jump size $k\geq 1$ is defined by

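$$\operatorname{\textsc{Jump}}_{nk}(x)=\begin{cases}\|x\|_{1}+k&\text{if }\|x\|_{1}\in[0..n-k]\cup\{n\},\\ n-\|x\|_{1}&\text{if }\|x\|_{1}\in[n-k+1..n-1].\end{cases}$$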

Hence for $k=1$, we have a fitness landscape identical to the one of OneMax apart from all fitness values being larger by one. For larger values of $k$, we still have a fitness landscape identical to OneMax apart from constant shifts when only regarding the lowest $n-k$ fitness levels of the OneMax function; however, there is now a fitness valley (“gap”)

$$G_{nk}\coloneqq\{x\in\{0,1\}^{n}\mid n-k<\|x\|_{1}<n\}\tag{2}$$

consisting of the $k-1$ highest sub-optimal fitness levels of the OneMax function.
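The definition can be made concrete with a minimal Python sketch. It assumes the standard form of the jump function from [DJW02] (value $\|x\|_{1}+k$ outside the gap and $n-\|x\|_{1}$ inside it); the names `jump` and `in_gap` are illustrative and not taken from the paper.

```python
def jump(x, k):
    """Jump function with jump size k for a bit string x (sequence of 0/1 values)."""
    n = len(x)
    ones = sum(x)
    if ones <= n - k or ones == n:
        return ones + k      # OneMax shifted by k
    return n - ones          # inside the gap: fitness decreases towards the optimum


def in_gap(x, k):
    """True if x lies in the gap region, that is, n - k < ||x||_1 < n."""
    n = len(x)
    return n - k < sum(x) < n


# Fitness and gap membership as a function of the number of ones (n = 6, k = 3).
n, k = 6, 3
for ones in range(n + 1):
    x = [1] * ones + [0] * (n - ones)
    print(ones, jump(x, k), in_gap(x, k))
```

For $k=1$ the gap is empty and the sketch returns the OneMax value plus one for every input, in line with the remark above.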

This valley is hard to cross via standard-bit mutation with mutation rate $\frac{1}{n}$ . Consequently, as proven in the classic paper [DJW02], the $(1+1)$ EA has an expected optimization time of at least $n^{k}$ on $\operatorname{\textsc{Jump}}_{nk}(x)$ . This lower bound also holds for the $(\mu+\lambda)$ EA for all values of $\mu$ and $\lambda$ as well as, more surprisingly, for the $(\mu,\lambda)$ EA for large ranges of the population sizes [Doe20a]. By using larger mutation rates or a heavy-tailed mutation operator, a $k^{\Theta(k)}$ runtime improvement for the runtime of the $(1+1)$ EA can be obtained [DLMN17], but the runtime remains $\Omega(n^{k})$ for $k$ constant (and this is also true for the variation of the heavy-tailed mutation rate proposed in [FQW18]). The runtime stemming from the optimal mutation rate can be automatically obtained (apart from constant factors) via a self-adjusting choice of the mutation rate [RW20].

Asymptotically better runtimes can be achieved when using crossover, though this is harder than expected. The first work in this direction [JW02], among other results, could show that a simple $(\mu+1)$ genetic algorithm using uniform crossover with rate $p_{c}=O(\frac{1}{kn})$ obtains an $O(\mu n^{2}k^{3}+2^{2k}p_{c}^{-1})$ runtime when the population size is at least $\mu=\Omega(k\log n)$. A shortcoming of this result, already noted by the authors, is that it only applies to uncommonly small crossover rates. Using a different algorithm that first always applies crossover and then mutation, a runtime of $O(n^{k-1}\log n)$ was achieved by Dang et al. [DFK+18, Theorem 2]. For $k\geq 3$, the logarithmic factor in the runtime can be removed by using a higher mutation rate. With additional diversity mechanisms, the runtime can be further reduced up to $O(n\log n+4^{k})$, see [DFK+16]. In the light of this last result, the insight stemming from the previous work [HS18] and ours is that the cGA apparently without further modifications supplies the necessary diversity to obtain a runtime of $O(n\log n+2^{O(k)})$.

With a three-parent majority vote crossover, among other results, a runtime of $O(n\log n)$ could be obtained via a suitable island model for all $k=O(n^{\frac{1}{2}-\varepsilon})$ [FKK+16]. Via a hybrid genetic algorithm using as variation operators only local search and a deterministic voting crossover, an $O(n)$ runtime for $k=O(\log n)$ was obtained in [WVHM18]. Via a different voting mechanism, an $O(n\log n)$ runtime was obtained even for $k$ as large as $O(n)$ [RA19]. With the right static or heavy-tailed parameters, the $(1+(\lambda,\lambda))$ GA optimizes jump functions in time roughly $n^{(k+O(1))/2}$ [ADK20, AD20]. However, when using the parameterization developed for OneMax [DDE15], several self-adjusting versions of the $(1+(\lambda,\lambda))$ GA cannot beat mutation-based EAs, as shown in [FS20].

Finally, we note that runtimes of $O(n\binom{n}{k})$ and $O(k\log(n)\binom{n}{k})$ were shown for the $(1+1)$ IAhyp and the $(1+1)$ Fast-IA artificial immune systems, respectively [COY17, COY18].

2.4 Expected Runtimes versus Guarantees with High Probability

We note that our main upper bound result as well as the previous one [HS18] for this problem give runtime bounds that hold with high probability, that is, with probability $1-o(1)$. However, we do not show a bound on the expected runtime. Let us quickly argue what the differences are, why we chose to prove a high-probability statement, and how to transform EDAs with high-probability guarantees into EDAs with guarantees on the expected runtime. We note that Wegener [Weg05, Section 3], with different arguments, also suggests preferring high-probability guarantees over expected runtimes.

For most evolutionary algorithms a high-probability guarantee can easily be turned into a bound on the expected runtime. If we know that a certain algorithm from any initial state finds the optimum in time $T$ with at least constant probability, then by splitting time into consecutive segments of length $T$ we see that after time $\gamma T$ the probability that the algorithm has not succeeded is at most $\exp(-\Omega(\gamma))$ . Consequently, the runtime is stochastically dominated (see Section 3.2 for the definition of this notation) by $T$ times a geometric random variable with constant success rate, and consequently, the expected runtime is $O(T)$ . The same argument gives a scalable tail bound of type “for all $\gamma>1$ , the probability that the runtime is more than $\gamma T$ is at most $\exp(-\Omega(\gamma))$ .”

For EDAs, it is usually much harder to show a good performance for any initial situation since there are some states which are particularly unfavorable (usually when all frequencies are close to the wrong boundary value). This does not rule out that the expected runtime and the time that is obtained with high probability are of the same order, but proving the bound on the expected runtime needs stronger arguments. The analysis of the expected runtime of the cGA on OneMax in [SW19] is an example for such a result.

This additional proof complexity raises the question if this effort is justified if the hardest part is dealing with states of the algorithm that are rarely reached (in [SW19] with probability $O(n^{-c})$ only, where $c$ can be any positive constant). While we think that it was very valuable that the work [SW19] showed how to compute expected runtimes for EDAs, we feel that such results are not always needed, both because of the difficulty to obtain such results and because, in some sense, they are a mildly unnatural remedy to the deeper problem.

As said, the main reason why guarantees for the expected runtime of an EDA can be difficult to show is that the EDA with small probability can end up in a state from which the optimum is hard to reach. When in such a state, however, instead of spending much time to leave the unfavorable state, it would be more efficient and more natural to simply restart the algorithm and have a new good chance for a fast optimization process. While we cannot expect the algorithm to detect that it is in an unfavorable state, the following simple parallel-run strategy under mild assumptions can do this automatically. More precisely, via suitable parallel runs we obtain an expected runtime that is only a logarithmic factor above the runtime the EDA would have with high probability when using the optimal population size. Hence this approach both obtains expected runtimes and optimizes the value of the parameter $\mu$ . We note that the “noise-oblivious scheme” proposed in [FKKS17, Algorithm 4] can also be used to optimize the parameter $\mu$ , however only under the much stronger assumption that the runtime (or an upper bound, which influences the runtime of the scheme) is known. In this case, a simple restart scheme with multiplicatively increasing $\mu$ values does the job.

We now proceed with detailing our parallel-run strategy. In the remainder, we shall assume the following.

General assumption: Let $\mathcal{A}$ be an EDA (or any other randomized search heuristic) with a parameter $\mu$ and let $\mathcal{P}$ be a problem instance we want to solve. We assume that there are unknown values $\tilde{\mu}$ and $T$ such that $\mathcal{A}$ with any parameter value $\mu\geq\tilde{\mu}$ solves $\mathcal{P}$ in time $\mu T$ with probability at least $\frac{3}{4}$ .

For this situation, we propose the following strategy.

Parallel EDA runs with exponentially growing population size: We propose the following strategy to solve $\mathcal{P}$ via parallel runs of $\mathcal{A}$ with different parameter values. We start with no process running. In round $i=1,2,\dots$ of our strategy, we let all running processes (which are process $1$ to $i-1$ ) use a computational budget of $2^{i-1}$ ; further, we start process $i$ with parameter $\mu=2^{i-1}$ and let it use a budget of $\sum_{j=0}^{i-1}2^{j}$ . These processes can be run in parallel or sequentially in any order. The pseudocode for this strategy is given in Algorithm 2.

Analysis: We observe that at the end of round $i$ , processes $1$ to $i$ are running and have each spent a budget of $\sum_{j=0}^{i-1}2^{j}=2^{i}-1$ up to this point in time. Consequently, the total budget spent in the first $i$ rounds is less than $i2^{i}$ .
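The budget accounting of this strategy can be sketched in Python as follows. This is an illustrative sketch only: the interface `make_process(mu)` with a method `advance(budget)` and the class `ToyProcess` are hypothetical stand-ins for runs of $\mathcal{A}$ and do not appear in the paper.

```python
def budgets_after_round(rounds):
    """Budget used by each process after the given number of rounds."""
    spent = []
    for i in range(1, rounds + 1):
        # Processes 1..i-1 each continue for 2**(i-1) steps.
        spent = [s + 2 ** (i - 1) for s in spent]
        # Process i is started and catches up with a budget of sum_{j<i} 2**j.
        spent.append(sum(2 ** j for j in range(i)))
    return spent


def parallel_runs(make_process, max_rounds):
    """Run the strategy; returns (round, mu, total budget granted) or None.

    make_process(mu) must return an object whose method advance(budget)
    runs it for `budget` more steps and returns True once the problem is solved.
    """
    processes = []          # list of (mu, process) pairs
    total = 0
    for i in range(1, max_rounds + 1):
        for mu, proc in processes:
            total += 2 ** (i - 1)
            if proc.advance(2 ** (i - 1)):
                return i, mu, total
        mu = 2 ** (i - 1)
        proc = make_process(mu)
        processes.append((mu, proc))
        total += 2 ** i - 1
        if proc.advance(2 ** i - 1):
            return i, mu, total
    return None


class ToyProcess:
    """Hypothetical run that succeeds once mu >= 4 and 3 * mu steps were used."""

    def __init__(self, mu):
        self.mu, self.used = mu, 0

    def advance(self, budget):
        self.used += budget
        return self.mu >= 4 and self.used >= 3 * self.mu


print(budgets_after_round(4))   # each entry equals 2**4 - 1
print(parallel_runs(ToyProcess, 10))
```

In the toy example, which corresponds to $\tilde{\mu}=4$ and $T=3$, the process with $\mu=4$ succeeds in round $4=1+\lceil\log_{2}4\rceil+\lfloor\log_{2}3\rfloor$, and the total budget granted stays below $i2^{i}$.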

Note that after round $i_{0}\coloneqq 1+\lceil\log_{2}\tilde{\mu}\rceil+\lfloor\log_{2}T\rfloor$ , the process started with parameter value $\mu=\mu_{0}\coloneqq 2^{\lceil\log_{2}\tilde{\mu}\rceil}\geq\tilde{\mu}$ has started and has used a time budget of

$$2^{i_{0}}-1=2^{1+\lceil\log_{2}\tilde{\mu}\rceil+\lfloor\log_{2}T\rfloor}-1\geq\mu_{0}T.$$

Consequently, with probability $\frac{3}{4}$ this process has found the optimum at that time. With the same type of computation, we see that after round $i_{0}+j$ , the process with parameter value $\mu=2^{j}\mu_{0}$ is finished with probability $\frac{3}{4}$ . Consequently, the round in which we find the solution is stochastically dominated (see Section 3.2) by $i_{0}-1$ plus a geometric distribution (on $1,2,\ldots$ ) with success rate $\frac{3}{4}$ . The expected time taken by this strategy to solve $\mathcal{P}$ thus is at most

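$$\sum_{j=0}^{\infty}\tfrac{3}{4}\left(\tfrac{1}{4}\right)^{j}(i_{0}+j)\,2^{i_{0}+j}=\tfrac{3}{4}\,2^{i_{0}}\sum_{j=0}^{\infty}2^{-j}(i_{0}+j)=\tfrac{3}{2}\,2^{i_{0}}(i_{0}+1),$$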

using the well-known equality $\sum_{j=0}^{\infty}j\,2^{-j}=2$ . We continue estimating the expected runtime of our parallel-run strategy by

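$$\tfrac{3}{2}\,2^{i_{0}}(i_{0}+1)\leq\tfrac{3}{2}\cdot 4\tilde{\mu}T\,(\log_{2}\tilde{\mu}+\log_{2}T+3)=6\tilde{\mu}T\,(\log_{2}(\tilde{\mu}T)+3)=O(\tilde{\mu}T\log(\tilde{\mu}T)).$$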

We note that if the values of $\tilde{\mu}$ and $T$ were known in advance, then restarting the EDA with $\mu=\tilde{\mu}$ and with a budget of $T$ until the problem is solved would immediately give an algorithm with expected runtime at most $T^{*}=\frac{4}{3}\tilde{\mu}T$ . This is the best-possible expected runtime that can be deduced from our assumptions. Consequently, our parallel-run strategy with its $O(T^{*}\log T^{*})$ expected runtime obtains the optimal expected runtime apart from a logarithmic factor.
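The geometric-series step in this analysis can be checked numerically. The sketch below assumes that the budget through round $i$ is bounded by $i2^{i}$ and that the round in which the solution is found is distributed as $i_{0}-1$ plus a geometric random variable with success rate $\frac{3}{4}$ (the dominating distribution); the closed form $\frac{3}{2}(i_{0}+1)2^{i_{0}}$ and the bound $6\tilde{\mu}T(\log_{2}(\tilde{\mu}T)+3)$ are our own evaluation of this series, and all function names are illustrative.

```python
import math


def series_bound(i0, terms=200):
    """Truncated sum over j >= 0 of (3/4) (1/4)^j (i0 + j) 2^(i0 + j)."""
    return sum(0.75 * 0.25 ** j * (i0 + j) * 2 ** (i0 + j) for j in range(terms))


def closed_form(i0):
    """Value of the series, using sum_j 2^-j = 2 and sum_j j 2^-j = 2."""
    return 1.5 * (i0 + 1) * 2 ** i0


def i_zero(mu_tilde, T):
    """The round i0 = 1 + ceil(log2(mu_tilde)) + floor(log2(T))."""
    return 1 + math.ceil(math.log2(mu_tilde)) + math.floor(math.log2(T))


i0 = i_zero(100, 1000)
print(i0, series_bound(i0), closed_form(i0))
print(6 * 100 * 1000 * (math.log2(100 * 1000) + 3))
```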

In summary, we have shown the following result.

Theorem 2.

Under the general assumptions made above, with $i_{0}\coloneqq 1+\lceil\log_{2}\tilde{\mu}\rceil+\lfloor\log_{2}T\rfloor$ , the parallel run strategy described above has the following performance.

•

The expected time until $\mathcal{P}$ is solved is at most

$$6\tilde{\mu}T\,(\log_{2}(\tilde{\mu}T)+3)=O(T^{*}\log T^{*}),$$

where $T^{*}=\frac{4}{3}\tilde{\mu}T$ is the best expected runtime that can be achieved via restarts of $\mathcal{A}$ under the general assumptions.

•

For all $j=0,1,2,\ldots$, the probability that a runtime of $2^{i_{0}+j}(i_{0}+j)$ does not suffice to solve $\mathcal{P}$ is at most $4^{-j-1}$.

We remark that a logarithmic factor performance loss over the optimal strategy (requiring the precise values of $\tilde{\mu}$ and $T$ ) is not a lot compared to what can be lost by choosing a wrong algorithm parameter, in particular, when the parameter is hard to guess. We note here that the recent work [LSW18] suggests that already for the simple OneMax function, the hypothetical population size has a non-obvious influence on the runtime: Sufficiently small values give an $O(n\log n)$ runtime, in a middle regime the runtime increases to $\tilde{\Omega}(n^{7/6})$ before dropping again to $O(n\log n)$ and then increasing linearly with $\mu$ . In the light of such results, a logarithmic overhead for automatically finding a near-optimal rate appears to be a good trade-off.

Finally, we remark without further proof that when our general assumption is fulfilled with some failure probability $p$ instead of $\frac{1}{4}$ , then tail probabilities in the second item of Theorem 2 are of order $p^{j+1}$ instead of $(\frac{1}{4})^{j+1}$ . This could potentially be interesting when the performance of $\mathcal{A}$ is strongly concentrated so that the general assumptions hold with some $p=o(1)$ . We also note that our strategy could be adjusted to deal with smaller success probabilities than $\frac{3}{4}$ , either by increasing the $\mu$ value by a smaller factor than $2$ or by having several processes using the same $\mu$ value. We spare the details.

Finally, we note that recently a similar approach was proposed in [DZ20a]. The main difference to ours is that runs were stopped after a time that was based on a mathematical analysis of when genetic drift could become problematic. From the implementation point of view, in this approach the runs with different values of $\mu$ can be conducted one after the other. From the theoretical perspective, this approach has the advantage that with the right choice of the hyperparameters the $\Theta(\log(\tilde{\mu}T))$ factor in the runtime bound of Theorem 2 can be saved. The experimental results in [DZ20a] suggest that their approach is superior when the hyperparameters are chosen suitably, which is however non-trivial. We note here that there is a general agreement in the community that genetic drift leads to an undesired behavior of the EDA. Genetic drift can lead to catastrophic runtimes (compare the results of [LN19, DK20b]), but does not always do so ([LN17, Wit19] show that the UMDA, already with a population size of $\Theta(\log n)$ and thus clearly in the genetic drift regime, can optimize OneMax in time $O(n\log n)$, the best runtime known for this algorithm).

3 Technical Tools

In this section, we collect a number of technical results that will be used in our main proofs. These include standard arguments like elementary estimates, Chernoff bounds, and drift theorems, as well as original arguments for the analysis of the cGA which might be of general interest, such as a tool to quantify the effect of frequencies being capped at the boundaries (Lemma 8), an upper and a lower bound for the probability of sampling the optimum given the $\ell_{1}$-distance between the current frequency vector and the optimum (Lemmas 9 and 10), an estimate for the time taken to sample a search point close to the current frequency vector (Lemma 11), and a lower bound on the probability to sample two different search points in one iteration (Lemma 12).

3.1 Standard Tools

The following estimate seems well-known (e.g., it was used in [JJW05] without proof or reference). Gießen and Witt [GW17, Lemma 3] give a proof via estimates of binomial coefficients and the binomial identity. A more elementary proof can be found in [Doe20c, Lemma 1.10.37].

Lemma 3.

Let $X\sim\operatorname{Bin}(n,p)$ . Let $k\in[0..n]$ . Then

$$\Pr[X\geq k]\leq\binom{n}{k}p^{k}.$$

We regularly use the following well-known multiplicative Chernoff bounds, which can be derived from [Hoe63], see, e.g., Theorems 1.10.1 and 1.10.5 together with Section 1.10.1.8 in [Doe20c].

Theorem 4.

Let $X_{1},\ldots,X_{n}$ be independent random variables taking values in $[0,1]$ . Let $X=\sum_{i=1}^{n}X_{i}$ . Let $\mu^{+}\geq E[X]$ and $\mu^{-}\leq E[X]$ . Let $\delta\geq 0$ and $\tilde{\delta}\in[0,1]$ . Then

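$$\Pr[X\geq(1+\delta)\mu^{+}]\leq\exp\!\left(-\tfrac{1}{3}\min\{\delta^{2},\delta\}\,\mu^{+}\right),\qquad\Pr[X\leq(1-\tilde{\delta})\mu^{-}]\leq\exp\!\left(-\tfrac{1}{2}\tilde{\delta}^{2}\mu^{-}\right).$$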

A direct consequence of these Chernoff bounds are the following estimates, which state that the OneMax fitness of a search point sampled from $\operatorname{Sample}(f)$ is close to the expected OneMax fitness $\|f\|_{1}$ . Since we mostly need such results for frequency vectors close to $(1,\ldots,1)$ , we formulate this result in terms of distances to the maximum value $n$ .

Lemma 5.

Let $f\in[0,1]^{n}$ , $D\coloneqq n-\|f\|_{1}$ , $D^{-}\leq D\leq D^{+}$ , $x\sim\operatorname{Sample}(f)$ , and $d(x)\coloneqq n-\|x\|_{1}$ . Then for all $\delta\geq 0$ and $\tilde{\delta}\in[0,1]$ , we have

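$$\Pr[d(x)\geq(1+\delta)D^{+}]\leq\exp\!\left(-\tfrac{1}{3}\min\{\delta^{2},\delta\}\,D^{+}\right),\qquad\Pr[d(x)\leq(1-\tilde{\delta})D^{-}]\leq\exp\!\left(-\tfrac{1}{2}\tilde{\delta}^{2}D^{-}\right).$$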

Proof.

The random variable $n-\|x\|_{1}$ can be written as a sum $n-\|x\|_{1}=\sum_{i=1}^{n}Z_{i}\eqqcolon Z$ of $n$ independent binary random variables $Z_{1},\dots,Z_{n}$ such that $\Pr[Z_{i}=1]=1-f_{i}$ . By definition, $E[Z]=D$ . The claims follow directly from Theorem 4. ∎

We need the lemma above in particular to argue that the probability to sample a search point in the gap region of the $\operatorname{\textsc{Jump}}$ function is small. For the $\operatorname{\textsc{Jump}}_{nk}$ function, we observe that when $D\coloneqq n-\|f\|_{1}$ is at least $2k$ , then the probability that $x\sim\operatorname{Sample}(f)$ lies in the gap, that is, satisfies $n-k<\|x\|_{1}<n$ , is $e^{-\Omega(k)}$ . This result is sufficient for our purposes. We note that we could also obtain a low constant probability for sampling in the gap when $D\geq k+\Omega(\sqrt{k})$ with large implicit constant. In [HS18, Lemma 3.2], a gap probability of at most $1-\frac{1}{\sqrt{2}}\leq 0.293$ is claimed already when $D\geq k+c$ for $c$ a sufficiently large constant and $k=o(n)$ , but we are skeptical that this is true. Note that when $f=\frac{n-k-c}{n}\mathbf{1}_{n}$ , then $X=n-\|x\|_{1}$ with $x\sim\operatorname{Sample}(f)$ follows a binomial distribution with parameters $n$ and $\frac{k+c}{n}$ . Hence if $k$ is large compared to $c$ , then $\Pr[X<k]=\Pr[X<E[X]-c]\approx\frac{1}{2}$ .
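The example with $f=\frac{n-k-c}{n}\mathbf{1}_{n}$ can be evaluated exactly, since the number of zeros in the sample is then binomially distributed. The following Python sketch does this for concrete values chosen purely for illustration ($n=10000$, $k=400$, $c=5$); the function names are our own.

```python
import math


def binom_pmf(n, p, j):
    """Pr[Bin(n, p) = j], computed in log-space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
               + j * math.log(p) + (n - j) * math.log1p(-p))
    return math.exp(log_pmf)


def gap_probability_uniform(n, k, c):
    """Pr[n - k < ||x||_1 < n] for x ~ Sample(f) with all f_i = (n - k - c) / n."""
    p = (k + c) / n     # probability that a single bit is sampled as 0
    # X = n - ||x||_1 ~ Bin(n, p); the gap corresponds to 0 < X < k.
    return sum(binom_pmf(n, p, j) for j in range(1, k))


print(gap_probability_uniform(10000, 400, 5))
```

For these values the gap probability lies clearly above $0.293$, which illustrates why a bound of $1-\frac{1}{\sqrt{2}}$ cannot hold for all $D\geq k+c$ with constant $c$.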

At one point, in the proof of Lemma 19, we need an additive Chernoff bound not only for the sum of independent random variables, but also for all partial sums. Such bounds are less known despite the fact that many classical Chernoff bounds hold equally well in this more demanding fashion. The following result is from Hoeffding [Hoe63, Theorem 2 together with (2.17)]. It can also be found in [Doe20c], Theorems 1.10.9 and 1.10.31.

Theorem 6.

Let $X_{1},\ldots,X_{n}$ be independent random variables such that for all $i\in[1..n]$ , the variable $X_{i}$ takes values in some interval $[a_{i},b_{i}]$ and has expectation $E[X_{i}]=0$ . Then for all $\lambda\geq 0$ , we have

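$$\Pr\Big[\max_{k\in[1..n]}\sum_{i=1}^{k}X_{i}\geq\lambda\Big]\leq\exp\!\Big(-\frac{2\lambda^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\Big).$$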

Finally, we state the additive drift theorem of He and Yao [HY01] (see also the recent survey [Len20]), which allows to translate an expected progress (or bounds on it) into bounds for expected hitting times.

Theorem 7.

Let $S\subseteq\mathbb{R}_{\geq 0}$ be finite and $0\in S$ . Let $X_{0},X_{1},\ldots$ be a random process taking values in $S$ . Let $\delta>0$ . Let $T=\inf\{t\geq 0\mid X_{t}=0\}$ .

(i)

If for all $t\geq 0$ and all $s\in S\setminus\{0\}$ we have $E[X_{t}-X_{t+1}\mid X_{t}=s]\geq\delta$, then $E[T]\leq\frac{E[X_{0}]}{\delta}$.

(ii)

If for all $t\geq 0$ and all $s\in S\setminus\{0\}$ we have $E[X_{t}-X_{t+1}\mid X_{t}=s]\leq\delta$ , then $E[T]\geq\frac{E[X_{0}]}{\delta}$ .
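As a small illustration of both parts of the theorem, consider a toy process (our own example, not from the paper) on $S=\{0,1,\dots,m\}$ that moves from state $s$ to $s-1$ with probability $p_{s}$ and stays put otherwise. Its drift in state $s$ is exactly $p_{s}$, and its expected hitting time can be computed exactly and compared with the two bounds.

```python
from fractions import Fraction


def expected_hitting_times(probs):
    """probs[s-1] is the probability to move from state s to s-1 (otherwise stay).

    Returns h with h[s] = E[T | X_0 = s], obtained from the recurrence
    h[s] = 1 + p_s * h[s-1] + (1 - p_s) * h[s], i.e. h[s] = h[s-1] + 1 / p_s.
    """
    h = [Fraction(0)]
    for p in probs:
        h.append(h[-1] + 1 / Fraction(p))
    return h


# Drift p_s = 1 / (s + 1) in state s, for s = 1, ..., 5.
probs = [Fraction(1, s + 1) for s in range(1, 6)]
h = expected_hitting_times(probs)
delta_min, delta_max = min(probs), max(probs)
print(h[5], 5 / delta_max, 5 / delta_min)
```

With the smallest drift $\delta=\frac{1}{6}$, part (i) gives the upper bound $5/\delta=30$; with the largest drift $\delta=\frac{1}{2}$, part (ii) gives the lower bound $10$. The exact value $20$ lies in between.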

3.2 Tools for the Analysis of the cGA

In this section, we prove a number of general arguments for the analysis of the cGA. Since we expect that they are helpful for other runtime analyses of EDAs, we fix no general notation apart from the one defined in Algorithm 1 (at the price of occasionally restating a notation).

We recall the notion of stochastic domination, which will be used several times in this work. For two random variables $X$ and $Y$, not necessarily defined over the same probability space, we say that $Y$ stochastically dominates $X$, written as $X\preceq Y$, if for all $\lambda\in\mathbb{R}$ we have

$$\Pr[X\leq\lambda]\geq\Pr[Y\leq\lambda].$$

Stochastic domination is a strong way of saying that $Y$ is not smaller than $X$ . It implies that $E[X]\leq E[Y]$ . We refer to [Doe19a] for more details.

Boundary effects: When, in the notation of Algorithm 1, the current frequency vector $f_{t}$ is such that $f_{it}\in\{\frac{1}{n},1-\frac{1}{n}\}$ for some $i\in[1..n]$ , then it may happen that $f^{\prime}_{t+1}\notin[\frac{1}{n},1-\frac{1}{n}]$ and consequently $f_{t+1}$ does not satisfy the nice relation $f_{t+1}=f_{t}+\frac{1}{\mu}(y^{1}-y^{2})$ . The following lemma quantifies these discrepancies. We here recall the common definition that for an $n$ -dimensional vector $x$ and a subset $L\subseteq[1..n]$ of its index set, $x_{|L}$ denotes the restriction of $x$ to $L$ , that is, the vector $(x_{\ell})_{\ell\in L}$ .

Lemma 8.

Let $P=2\frac{1}{n}(1-\frac{1}{n})$ . Let $t\geq 0$ . Using the notation given in Algorithm 1, consider iteration $t+1$ of a run of the cGA started with a fixed frequency vector $f_{t}\in[\frac{1}{n},1-\frac{1}{n}]^{n}$ .

(i)

Let $L=\{i\in[1..n]\mid f_{it}=\frac{1}{n}\}$ , $\ell=|L|$ , and $M=\{i\in L\mid x^{1}_{i}\neq x^{2}_{i}\}$ . Then $|M|\sim\operatorname{Bin}(\ell,P)$ and

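$$\|f_{t+1}\|_{1}-\|f^{\prime}_{t+1}\|_{1}\preceq\|(f_{t+1})_{|L}\|_{1}-\|(f^{\prime}_{t+1})_{|L}\|_{1}\preceq\tfrac{1}{\mu}|M|\preceq\tfrac{1}{\mu}\operatorname{Bin}(n,\tfrac{2}{n}).$$

(ii)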

Let $L=\{i\in[1..n]\mid f_{it}=1-\frac{1}{n}\}$ , $\ell=|L|$ , and $M=\{i\in L\mid x^{1}_{i}\neq x^{2}_{i}\}$ . Then $|M|\sim\operatorname{Bin}(\ell,P)$ and

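$$\|f^{\prime}_{t+1}\|_{1}-\|f_{t+1}\|_{1}\preceq\|(f^{\prime}_{t+1})_{|L}\|_{1}-\|(f_{t+1})_{|L}\|_{1}\preceq\tfrac{1}{\mu}|M|\preceq\tfrac{1}{\mu}\operatorname{Bin}(n,\tfrac{2}{n}).$$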

Proof.

By symmetry, it suffices to prove the first part. For an $i\in L$ , we have $\Pr[x^{1}_{i}\neq x^{2}_{i}]=2\frac{1}{n}(1-\frac{1}{n})=P$ . Since the bits of $x^{1}$ and $x^{2}$ were sampled independently, we have $|M|\sim\operatorname{Bin}(\ell,P)$ .

By the well-behaved frequency assumption and the fact that $f^{\prime}_{t+1}=f_{t}+\frac{1}{\mu}(y^{1}-y^{2})$ for binary vectors $y^{1}$ and $y^{2}$ , we can have $f^{\prime}_{i,t+1}<\frac{1}{n}$ and thus $f_{i,t+1}>f^{\prime}_{i,t+1}$ only when $f_{it}=\frac{1}{n}$ and $x^{1}_{i}\neq x^{2}_{i}$ , that is, when $i\in M$ . This shows $\|f_{t+1}\|_{1}-\|f^{\prime}_{t+1}\|_{1}\preceq\|(f_{t+1})_{|L}\|_{1}-\|(f^{\prime}_{t+1})_{|L}\|_{1}$ .

Since $f_{i,t+1}>f^{\prime}_{i,t+1}$ implies $f_{i,t+1}=f^{\prime}_{i,t+1}+\frac{1}{\mu}$ , we also have $\|(f_{t+1})_{|L}\|_{1}-\|(f^{\prime}_{t+1})_{|L}\|_{1}\preceq\frac{1}{\mu}|M|\preceq\tfrac{1}{\mu}\operatorname{Bin}(n,\tfrac{2}{n})$ . ∎
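The boundary effect can be made tangible with a short Python sketch of a single cGA iteration. Algorithm 1 is not reproduced in this section, so the sketch relies on the update rule $f^{\prime}_{t+1}=f_{t}+\frac{1}{\mu}(y^{1}-y^{2})$ followed by capping into $[\frac{1}{n},1-\frac{1}{n}]$; the tie-breaking rule and all function names are assumptions made for illustration.

```python
import random


def sample(f, rng):
    """Sample a bit string from the product distribution with marginals f."""
    return [1 if rng.random() < fi else 0 for fi in f]


def cga_step(f, mu, fitness, rng):
    """One cGA iteration with frequency boundaries 1/n and 1 - 1/n.

    Returns (f_next, f_prime), where f_prime = f + (y1 - y2) / mu is the
    uncapped update and f_next is f_prime capped into [1/n, 1 - 1/n].
    """
    n = len(f)
    x1, x2 = sample(f, rng), sample(f, rng)
    # y1 is the sample with the better fitness (ties keep the sampling order).
    y1, y2 = (x1, x2) if fitness(x1) >= fitness(x2) else (x2, x1)
    f_prime = [fi + (a - b) / mu for fi, a, b in zip(f, y1, y2)]
    f_next = [min(max(v, 1 / n), 1 - 1 / n) for v in f_prime]
    return f_next, f_prime


rng = random.Random(1)
n, mu = 20, 40
f = [1 / n] * 10 + [0.5] * 10      # ten frequencies sit at the lower boundary
f_next, f_prime = cga_step(f, mu, sum, rng)   # sum is the OneMax fitness
print(sum(f_next) - sum(f_prime))  # discrepancy caused by capping
```

In this configuration only the lower boundary can be hit, so the discrepancy $\|f_{t+1}\|_{1}-\|f^{\prime}_{t+1}\|_{1}$ is non-negative and each capped frequency contributes exactly $\frac{1}{\mu}$ to it.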

Sampling a particular solution: The following two elementary estimates give an upper and a lower bound on the probability to sample a particular search point $x^{*}$ . Note that the quantity $D=\|x^{*}-f\|_{1}$ is our usual distance measure $D=n-\|f\|_{1}$ when $x^{*}=(1,\dots,1)$ .

Lemma 9.

Let $x^{*}\in\{0,1\}^{n}$ . Let $f\in[0,1]^{n}$ and $D=\|x^{*}-f\|_{1}$ . Let $x\sim\operatorname{Sample}(f)$ . Then $\Pr[x=x^{*}]\leq\exp(-D)$ .

Proof.

The probability to sample $x^{*}$ is

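$$\Pr[x=x^{*}]=\prod_{i:x^{*}_{i}=1}f_{i}\prod_{i:x^{*}_{i}=0}(1-f_{i})=\prod_{i=1}^{n}(1-|x^{*}_{i}-f_{i}|)\leq\prod_{i=1}^{n}\exp(-|x^{*}_{i}-f_{i}|)=\exp(-\|x^{*}-f\|_{1})=\exp(-D).$$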

∎

To ease reading, we formulate the following estimate only for $x^{*}=(1,\dots,1)$, but it is clear that by symmetry analogous statements hold for arbitrary $x^{*}$ (when $\|x^{*}-f\|_{\infty}\leq 1-c$).

Lemma 10.

Let $0<c<1$ , $f\in[c,1]^{n}$ , and $D=n-\|f\|_{1}$ . Let $x\sim\operatorname{Sample}(f)$ . Then $\Pr[x=(1,\dots,1)]\geq c^{D/(1-c)}$ .

Proof.

For $i\in[1..n]$ , let $\alpha_{i}\coloneqq\frac{1-f_{i}}{1-c\phantom{{}_{i}}}$ . Then $f_{i}=\alpha_{i}c+(1-\alpha_{i})1$ is the unique representation of $f_{i}$ as convex combination of $c$ and $1$ . Since the logarithm is concave, we have

$$\ln(f_{i})=\ln(\alpha_{i}c+(1-\alpha_{i})1)\geq\alpha_{i}\ln(c)+(1-\alpha_{i})\ln(1)=\alpha_{i}\ln(c).$$

Since the logarithm is monotonically increasing, this inequality implies ${f_{i}\geq c^{\alpha_{i}}}$ . Consequently,

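$$\Pr[x=(1,\dots,1)]=\prod_{i=1}^{n}f_{i}\geq\prod_{i=1}^{n}c^{\alpha_{i}}=c^{\sum_{i=1}^{n}\alpha_{i}}=c^{D/(1-c)}.$$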

∎
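The two bounds of Lemma 9 and Lemma 10 can be compared with the exact sampling probability on a concrete frequency vector. The following Python sketch does so for $x^{*}=(1,\dots,1)$; the example vector and the function names are chosen for illustration only.

```python
import math


def prob_all_ones(f):
    """Exact probability that x ~ Sample(f) equals (1, ..., 1)."""
    return math.prod(f)


def lemma9_upper(f):
    """Upper bound exp(-D) with D = n - ||f||_1."""
    return math.exp(-(len(f) - sum(f)))


def lemma10_lower(f, c):
    """Lower bound c^(D / (1 - c)), valid when all f_i lie in [c, 1] and 0 < c < 1."""
    return c ** ((len(f) - sum(f)) / (1 - c))


f = [0.9, 0.5, 1.0, 0.75, 0.4]
c = 1 / 3
print(lemma10_lower(f, c), prob_all_ones(f), lemma9_upper(f))
```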

Time to sample a search point when close: We use the above lower bound on the probability to sample $(1,\dots,1)$ to prove the following result. It shows that when $\mu$ is large enough and $D_{t}\coloneqq n-\|f_{t}\|_{1}$ is small enough, then regardless of the fitness function we sample the search point $(1,\ldots,1)$ quickly with high probability. The main argument is that when $\mu$ is sufficiently large, then $D_{t}$ stays small sufficiently long.

Lemma 11.

Let $\mu$ be at least $\sqrt{n}$ , but polynomially bounded in $n$ . Consider a run of the cGA with hypothetical population size $\mu$ on an arbitrary fitness function. Assume that at some time $t_{0}$ , we have $D_{t_{0}}:=n-\|f_{t_{0}}\|_{1}\leq\frac{1}{10}\ln n$ and $f_{t_{0}}\in[\frac{1}{3},1]^{n}$ . Then with probability at least $1-n^{-\omega(1)}$ , the search point $(1,\dots,1)$ is sampled in the next $\frac{D_{t_{0}}\mu}{2\ln(n)^{2}}$ iterations.

Proof.

We first argue that if at some time $t$ we have $D_{t}\leq\ln(n)^{2}$ , then $\Pr[D_{t+1}\geq D_{t}+\frac{2}{\mu}\ln(n)^{2}]\leq n^{-\omega(1)}$ . By Lemma 5, we have

$$\Pr\big[\|x^{j}\|_{1}\leq n-2\ln(n)^{2}\big]\leq\exp(-\Omega(\ln(n)^{2}))=n^{-\omega(1)}$$

for $j=1,2$ . Consequently, with probability $1-n^{-\omega(1)}$ , we have both $\|x^{1}\|_{1}>n-2\ln(n)^{2}$ and $\|x^{2}\|_{1}>n-2\ln(n)^{2}$ . Now regardless of how $x^{1}$ and $x^{2}$ are sorted into $(y^{1},y^{2})$ , less than $2\ln(n)^{2}$ frequencies are decreased in the frequency update of this iteration. We conclude that $D_{t+1}<D_{t}+\frac{2}{\mu}\ln(n)^{2}$ .

Let $L=\lfloor\frac{D_{t_{0}}}{(2/\mu)\ln(n)^{2}}\rfloor$ . By a union bound, with probability

$$1-L\,n^{-\omega(1)}=1-n^{-\omega(1)}$$

we have $D_{t+1}\leq D_{t}+\frac{2}{\mu}\ln(n)^{2}$ in all iterations $t=t_{0},\dots,t_{0}+L-1$ that start with $D_{t}\leq 2D_{t_{0}}$ . Let us condition on this in the following. Then by induction, we have $D_{t}\leq D_{t_{0}}+(t-t_{0})\frac{2}{\mu}\ln(n)^{2}\leq 2D_{t_{0}}$ throughout these $L$ iterations.

Note that $L=O(\frac{\mu}{\log(n)})$ , hence throughout this period we also have $f_{it}\geq\tfrac{1}{3}-\tfrac{1}{\mu}L\geq 0.32$ (assuming $n$ to be sufficiently large) for all ${i\in[1..n]}$ . By Lemma 10, the probability that a fixed search point sampled in this period equals $(1,\dots,1)$ , is at least

$$0.32^{2D_{t_{0}}/0.68}\geq 0.32^{0.2\ln(n)/0.68}=n^{0.2\ln(0.32)/0.68}.$$

Since $0.2\ln(0.32)/0.68>-0.34$ and $L=\Omega(\frac{\mu}{\log(n)^{2}})$ , the probability that $(1,\dots,1)$ is not sampled in this period is at most

$$(1-n^{-0.34})^{L}\leq\exp(-Ln^{-0.34})=\exp\big(-\Omega(n^{0.16}/\log(n)^{2})\big)=n^{-\omega(1)}.$$

∎

Sampling search points with different $1$ -norm: To argue that the cGA makes at least some small progress, we shall use the following blunt estimate for the probability that two bit strings $x,y\sim\operatorname{Sample}(f)$ sampled from the same product distribution have a different distance from the all-ones string (and, by symmetry, from any other string, but this is a statement which we do not need here).

Lemma 12.

Let $n\in\mathbb{N}$ , $m\in[\frac{n}{2}..n]$ , and $f\in[\frac{1}{n},1-\frac{1}{n}]^{m}$ . Let $x^{1},x^{2}\sim\operatorname{Sample}(f)$ be independent. Then $\Pr[\|x^{1}\|_{1}\neq\|x^{2}\|_{1}]\geq\frac{1}{16}$ .

Proof.

For all $v\in\mathbb{R}^{m}$ and $a,b\in[1..m]$ with $a\leq b$ we use the abbreviation $v_{[a..b]}\coloneqq\sum_{i=a}^{b}v_{i}$ . We first argue that by symmetry, we can assume that $f_{[1..m]}\leq\frac{m}{2}$ . Indeed, let $f_{[1..m]}>\frac{m}{2}$ and assume the claim shown for the case that $f_{[1..m]}\leq\frac{m}{2}$ . Let $\bar{f}=\mathbf{1}_{m}-f$ and $\bar{x}^{1},\bar{x}^{2}\sim\operatorname{Sample}(\bar{f})$ independent. Then

$$\Pr[\|x^{1}\|_{1}\neq\|x^{2}\|_{1}]=\Pr[\|\bar{x}^{1}\|_{1}\neq\|\bar{x}^{2}\|_{1}]\geq\tfrac{1}{16},$$

where the last estimate follows from our assumption and the fact that $\bar{f}_{[1..m]}\leq\frac{m}{2}$ . This shows the claim for $f$ and justifies that in the remainder, we assume $f_{[1..m]}\leq\frac{m}{2}$ .

Without loss of generality, we may further assume that $f_{i}\leq f_{i+1}$ for all $i\in[1..m-1]$ . We have $f_{\lfloor m/4\rfloor}\leq\frac{2}{3}$ as otherwise

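$$f_{[1..m]}\geq f_{[\lfloor m/4\rfloor..m]}>\tfrac{2}{3}\big(m-\lfloor\tfrac{m}{4}\rfloor+1\big)>\tfrac{2}{3}\cdot\tfrac{3}{4}m=\tfrac{m}{2},$$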

contradicting our assumption.

Let $\ell$ be minimal such that $f_{[1..\ell]}\geq\frac{1}{8}$ . Since $\ell\leq\frac{n}{8}\leq\frac{m}{4}$ , we have $f_{\ell}\leq\frac{2}{3}$ and thus $f_{[1..\ell]}\leq\frac{1}{8}+\frac{2}{3}=\frac{19}{24}$ .

For $j\in\{0,1\}$ let $q_{j}=\Pr[x^{1}_{[1..\ell]}=j]=\Pr[x^{2}_{[1..\ell]}=j]$ . We compute

[TABLE]

the latter by the inequality of the arithmetic and geometric mean. Using Bernoulli’s inequality, we estimate coarsely

[TABLE]

Since the function $z\mapsto z(1-z)^{2}$ is unimodal in $[0,1]$ , the minimum in any subinterval of $[0,1]$ is necessarily found at a boundary of the interval. We thus obtain

[TABLE]

∎
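For a given frequency vector, the probability in Lemma 12 can be computed exactly, since $\|x\|_{1}$ follows a Poisson binomial distribution. The following Python sketch uses a simple dynamic program for this; it is a numerical illustration with function names of our choosing and plays no role in the proof.

```python
def ones_distribution(f):
    """Distribution of ||x||_1 for x ~ Sample(f), by dynamic programming."""
    dist = [1.0]                       # distribution of the empty sum
    for fi in f:
        new = [0.0] * (len(dist) + 1)
        for j, pj in enumerate(dist):
            new[j] += pj * (1 - fi)    # next bit is 0
            new[j + 1] += pj * fi      # next bit is 1
        dist = new
    return dist


def prob_different_norm(f):
    """Pr[||x1||_1 != ||x2||_1] for independent x1, x2 ~ Sample(f)."""
    dist = ones_distribution(f)
    return 1.0 - sum(p * p for p in dist)


n = 16
print(prob_different_norm([1 / n] * n))        # all frequencies at the lower boundary
print(prob_different_norm([1 - 1 / n] * n))    # all frequencies at the upper boundary
print(prob_different_norm([0.5] * n))
```

In these examples the probability stays well above $\frac{1}{16}$, which reflects that the lemma is a deliberately blunt estimate.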

4 An Upper Bound for the Runtime of the cGA on Jump Functions

In this section, we state precisely and prove our $O(\mu\sqrt{n})$ upper bound for the runtime of the cGA on jump functions with small jump size $k$. With the smallest admissible hypothetical population size $\mu=\Theta(\sqrt{n}\log n)$, it gives a runtime guarantee of $O(n\log n)$. Our result includes the trivial case $k=1$, that is, the OneMax benchmark function, a result that was known before.

Our results are true not only for jump functions, but for the larger class of all functions that (apart from a uniform additive constant) agree with OneMax on $\{0,1\}^{n}\setminus G_{nk}$ and that have $(1,\ldots,1)$ as an optimum (recall that $G_{nk}$ was defined to be the gap region of $\operatorname{\textsc{Jump}}_{nk}$, see (2)). This observation is interesting in its own right, e.g., it yields that our result also holds for the plateau functions defined in [AD18]. It also helps us formulate the proofs in a more concise manner since we can now assume that $k$ is sufficiently large (since a jump function with small jump parameter $k^{\prime}$ is included in this larger class for all $k\geq k^{\prime}$).

To make things precise, for all $n\in\mathbb{N}$ and $k\geq 1$ we define the class of subjump functions with jump size $k$ as the class of all functions $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ such that there is a $K\in\mathbb{N}$ such that

•

$\mathcal{F}(x)=\|x\|_{1}+K$ if $\|x\|_{1}\in[0..n-k]\cup\{n\}$ ,

•

$\mathcal{F}(x)\leq n+K$ if $\|x\|_{1}\in[n-k+1..n-1]$

for all $x\in\{0,1\}^{n}$. Here the prefix sub is to be understood in the sense that these functions are seen as at most as hard as the true jump function with jump size $k$ or that, if $\mathcal{F}$ is a jump function, its jump size is at most $k$. It is not to be understood in the sense that a subjump function is pointwise less than or equal to the corresponding jump function (which is not true).
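For small $n$, membership in the class of subjump functions can be checked by brute force. The Python sketch below does this under two assumptions of ours: that $k\leq n$, so that the first condition at the all-zeros string forces $K=\mathcal{F}(0,\dots,0)$, and that $K=0$ is allowed (which is needed for OneMax itself to be included). The helper `jump` assumes the standard jump function; all names are illustrative.

```python
from itertools import product


def jump(x, k):
    """Standard jump function with jump size k."""
    n, ones = len(x), sum(x)
    return ones + k if (ones <= n - k or ones == n) else n - ones


def is_subjump(F, n, k):
    """Brute-force check of the two defining conditions over all of {0,1}^n."""
    K = F((0,) * n)     # forced by the first condition at the all-zeros string
    if K != int(K) or K < 0:
        return False
    for x in product((0, 1), repeat=n):
        ones = sum(x)
        if ones <= n - k or ones == n:
            if F(x) != ones + K:        # must agree with OneMax + K outside the gap
                return False
        elif F(x) > n + K:              # inside the gap: at most the optimal value
            return False
    return True


n = 6
print(is_subjump(lambda x: jump(x, 2), n, 3))   # smaller jump size
print(is_subjump(sum, n, 3))                    # OneMax
print(is_subjump(lambda x: jump(x, 4), n, 3))   # larger jump size
```

The first two calls illustrate that jump functions with smaller jump size and OneMax belong to the class, whereas a jump function with larger jump size does not.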

As said before, subjump functions have an optimum at $(1,\dots,1)$ , but there could be others in the gap region $G_{nk}$ . We continue to call $G_{nk}$ the gap even though this is not for all subjump functions a fitness valley. We note that the class of subjump functions with jump size $k$ includes all subjump functions with jump size smaller than $k$ , in particular, all jump functions with smaller jump size and thus also the OneMax function (to ensure this property, we needed the additional variable $K$ in the definition).

Without further details, we note that many previously proven upper bounds for runtimes on jump functions are also valid for subjump functions. However, this might not be true for results that heavily exploit the particular structure of the true jump functions, say the symmetry of the set of local optima, such as the results on crossover-based algorithms.

The main result of this section is the following runtime guarantee for subjump functions.

Theorem 13.

Let $k\leq\frac{1}{20}\ln(n)-1$. For a sufficiently large constant $c_{\mu}$, let $\mu\geq c_{\mu}\sqrt{n}\ln(n)$, but polynomially bounded in $n$. Then the cGA with frequency boundaries (Algorithm 1) with hypothetical population size $\mu$ with probability $1-o(1)$ finds the optimum of any $n$-dimensional subjump function with jump size $k$ in time $O(\mu\sqrt{n})$. This time is $O(n\log n)$ when $\mu=\Theta(\sqrt{n}\ln(n))$.

We start by giving a rough overview of the proof in the following subsection and then state the formal proof in the two subsequent subsections.

4.1 Proof Overview

We now give a brief overview of our runtime analysis and show how the different partial results work together. We leave it to the reader to read this section now or after the presentation of the partial results (or twice).

In our analysis, we roughly distinguish three phases of the optimization process. The first phase, analyzed in Lemma 16, lasts until for the first time the frequency distance $D_{t}\coloneqq n-\|f_{t}\|_{1}$ is $O(\log n)$ with a large implicit constant. During this phase, by Lemma 5 and a union bound, with high probability we will never sample a solution in the gap. Consequently, we can pretend that we are optimizing the OneMax function and use our analysis of Lemma 15, which reuses arguments of the classic result by Droste [Dro06] including Lemma 14.

The second phase, analyzed in Lemma 18, then lasts until we have a $D_{t}$ value of less than $2k$ (or less than some constant in the case of a very small $k$ ). In this phase, we use the drift computed in Lemma 17. We profit from the fact that in this phase we only need to obtain a moderate decrease of $D_{t}$ and apply the additive drift theorem (Theorem 7 (i)) with the smallest drift that can occur in this phase, which is $\Omega(\frac{1}{\mu})$ . Since this phase is so short, a simple Markov bound suffices to show that the phase ends with high probability in due time.

Once we have reached a $D_{t}$ value of $O(k)$ , we have a reasonable chance to sample the optimum as shown in Lemma 11. Since in this third phase samples in the gap occur frequently, we have less control over $D_{t}$ , in particular, we cannot exhibit an expected decrease of $D_{t}$ . We therefore pessimistically estimate $D_{t}$ as if $D_{t}$ would always increase, which gives (apart from the boundary effects described in Lemma 8) an increase of $|\|x^{1}\|_{1}-\|x^{2}\|_{1}|$ . Since $D_{t}$ is small, these increases are small as well, as again ensured by Lemma 5. With this observation, we can argue that we have a $D_{t}$ value of $O(k)$ for almost $\mu$ iterations, which together with Lemma 10 shows that we sample the optimum with high probability.

All the arguments above need that the frequencies are bounded away from the lower boundary of $\frac{1}{n}$ , more precisely, that they are $\Omega(1)$ at all times. In the first two phases, we ensure this via Lemma 19, our general result for random processes that are not Markov processes. To this aim, we estimate the probabilities of certain frequency changes by adjusting this data from the OneMax process (Lemma 20, taken from Sudholt and Witt [SW19]) via a pessimistic estimate of the negative influence of search points sampled in the gap. For the third phase, the fact that this phase only lasts $o(\mu)$ iterations implies that frequencies change by at most $o(1)$ , hence the $\Omega(1)$ lower bound remains intact.

4.2 Proof Ingredients

In this section, we prove separately the main arguments needed in our final proof. We also state some known results on how the cGA optimizes OneMax.

The following result is a weaker form of what was shown in the proof of Lemma 5 in [Dro06]. The result of Lemma 5 in [Dro06], bounding the expected progress instead of showing that a certain progress can be observed with constant probability, is not sufficient for our purposes, see the discussion below.

Lemma 14 ([Dro06]).

There is a constant $C>0$ such that the following holds. Let $n\in\mathbb{N}$ and $D\in\mathbb{N}$ . Let $f\in[\frac{1}{3},1]^{n}$ such that $\|f\|_{1}\leq n-D$ . Let $x^{1},x^{2}\sim\operatorname{Sample}(f)$ independent. Then

$$\Pr\big[|\|x^{1}\|_{1}-\|x^{2}\|_{1}|\geq\tfrac{1}{5}\sqrt{D}\big]\geq C.$$
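The statement of Lemma 14 can be explored empirically. The sketch below estimates, by plain Monte Carlo sampling, the probability that two independent samples differ in their number of ones by at least $\frac{1}{5}\sqrt{D}$, which is the form in which the lemma is used in the proof of Lemma 15. The function name is hypothetical, and $D$ is taken as $n-\|f\|_{1}$ exactly rather than as an integer lower bound.

```python
import math
import random


def estimate_spread_probability(f, trials, rng):
    """Estimate Pr[ | |x1|_1 - |x2|_1 | >= sqrt(D) / 5 ] for D = n - sum(f)."""
    n = len(f)
    threshold = math.sqrt(n - sum(f)) / 5.0
    hits = 0
    for _ in range(trials):
        # two independent samples from the product distribution given by f
        ones1 = sum(rng.random() < fi for fi in f)
        ones2 = sum(rng.random() < fi for fi in f)
        if abs(ones1 - ones2) >= threshold:
            hits += 1
    return hits / trials
```

Calling `estimate_spread_probability([0.5] * 100, 2000, random.Random(1))` gives an estimate for the initial frequency vector. An estimate of this kind only illustrates that a constant lower bound is plausible for one particular $f$; it does not replace the proof.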

We use this lemma to now conduct a (partial) runtime analysis of the cGA on OneMax. Such an analysis is helpful for our purposes since the optimization process of the cGA on a subjump function is identical to the one on the OneMax function as long as no search point in the gap region is sampled.

Our analysis on OneMax differs from Droste’s analysis of the cGA on OneMax [Dro06, Theorem 8] in several respects. First, we aim at a guarantee that holds with high probability. For this reason, we cannot use the approach via additive drift, and this is the reason why we need Lemma 14 instead of Lemma 5 from [Dro06].

We note that Droste’s drift argument is also not perfectly complete. In each of his $\Theta(n)$ relatively short phases, he uses additive drift to estimate from the expected progress the time to reach the phase target, but he ignores the fact that his expected progress also takes into account progress beyond the phase target. This could lead to an overestimation of the progress. This problem does not occur in the additive drift theorem as stated in Theorem 7, since there the process lives in the non-negative numbers and the process target is zero. We have no doubt, though, that this technical gap can be fixed with additional arguments.

We regard the cGA with non-trivial boundaries, which requires additional arguments as the capping of the frequencies can change the drift of the frequency sum (albeit by not a lot, as our proof shows). We note that without these extra arguments, our proof also applies to the setting without boundaries.

We only regard the time needed to reach a frequency vector with constant distance to the all-ones vector. We note that our analysis can be extended to also give a bound for the time to sample an optimal solution, but we do not need such a result (and in fact, such a result is implied by our main result). Also, a simplified version of our proof would apply to the cGA without boundaries.

Lemma 15.

Let $C$ be the constant from Lemma 14. Consider a run of the cGA with $\mu\geq\log_{2}n$ on the OneMax benchmark function. Let $D_{t}\coloneqq n-\|f_{t}\|_{1}$ for all $t$ . Let $K$ be a sufficiently large constant. Let $T$ be the first time that $D_{t}\leq K$ or that there is an $i\in[1..n]$ with $f_{it}<\frac{1}{3}$ . Then

[TABLE]

We formulated the result above in the slightly cumbersome manner of giving a time guarantee for the event of reaching a near-optimal frequency vector or reaching a frequency below $\frac{1}{3}$ . By Lemma 19 we will be able to rule out the latter event via a simple union bound over the failure probabilities. This approach is technically simpler than conditioning on the frequencies to not go below $\tfrac{1}{3}$ and then working in the conditional probability space.

Proof of Lemma 15.

Define $D^{\prime}_{t}\coloneqq n-\|f^{\prime}_{t}\|_{1}$ for all $t\geq 1$ . For $i=1,2,\dots$ let $d_{i}=2^{-i}n$ . Without loss of generality, we may assume that $K=2^{-\ell-1}n$ for some $\ell\in\mathbb{N}$ . Note that $\ell\leq\log_{2}n$ . We say that the optimization process enters Phase $i$ (and thus leaves its current phase) when for the first time $D_{t}\leq d_{i}$ . Note that we stay in Phase $i$ even when after entering this phase $D_{t}$ increases beyond $d_{i}$ . Note further that, by definition, the process starts in Phase 1. We also say that the current phase ends when a frequency reaches a value below $\frac{1}{3}$ .

We analyze the time spent in each Phase $i\leq\ell$ (when assuming that all frequencies are at least $\frac{1}{3}$ at the start of the phase) and show that this time, with probability at least $1-\exp(-\Omega(\mu))$ , is at most $T_{i}=\lceil 20\frac{1}{C}\mu\sqrt{d_{i+1}}\rceil$ . Let $t^{\prime}$ be the iteration in which the process enters Phase $i$ . To ease the argument, we now consider exactly $T_{i}$ iterations. In case the phase ends earlier, we shall from that point on regard an artificial process, with a slight abuse of notation also denoted by $D_{t}$ and $D^{\prime}_{t}$ , that satisfies the conditions

[TABLE]

Such an artificial extension of a process was, to the best of our knowledge, in the theory of evolutionary algorithms first used in [DHK11].

When all frequencies are at least $\frac{1}{3}$ , by Lemma 14 we have $\Pr[|\|x^{1}\|_{1}-\|x^{2}\|_{1}|\geq\tfrac{1}{5}\sqrt{D_{t}}]\geq C$ . Since we have $\|y^{1}\|_{1}\geq\|y^{2}\|_{1}$ when optimizing OneMax, we have that $D^{\prime}_{t+1}$ with probability at least $C$ satisfies $D^{\prime}_{t+1}\leq D_{t}-\tfrac{1}{5\mu}\sqrt{D_{t}}\leq D_{t}-\tfrac{1}{5\mu}\sqrt{d_{i+1}}$ . We call this a success. Note that the probability for a success is at least $C$ regardless of what happened before in this phase. Consequently, in $T_{i}$ iterations, we not only have an expected number of at least $20\mu\sqrt{d_{i+1}}$ successes, but, using the multiplicative Chernoff bounds (Theorem 4) and the fact that “sequential independence” suffices for Chernoff bounds to be admissible (Lemma 11 in [DJ10] or Section 1.10.2.1 in [Doe20c]), we also have at least $10\mu\sqrt{d_{i+1}}$ successes with probability at least $1-\exp(-\tfrac{5}{2}\mu\sqrt{d_{i+1}})$ . Note that with probability one we have $D^{\prime}_{t+1}\leq D_{t}$ , again because $\|y^{1}\|_{1}\geq\|y^{2}\|_{1}$ .

By Lemma 8 (ii), we have $D_{t+1}\preceq D^{\prime}_{t+1}+\frac{1}{\mu}\operatorname{Bin}(n,\frac{2}{n})$ , again regardless of what happened in earlier iterations. Consequently, the total number of times we increase $D_{t}$ by $\frac{1}{\mu}$ due to reaching an upper frequency boundary can be estimated by a sum of $T_{i}n$ independent binary random variables with success probability $\frac{2}{n}$ . Hence the expectation of this number is at most $2T_{i}\leq 40\frac{1}{C}\mu\sqrt{d_{i+1}}+2$ and, by Theorem 4, with probability at least $1-\exp(-\frac{2T_{i}}{3})\geq 1-\exp(-\frac{40}{3}\frac{1}{C}\mu\sqrt{d_{i+1}})$ this number is at most $4T_{i}\leq 80\frac{1}{C}\mu\sqrt{d_{i+1}}+4$ .

Taking these two observations together, we see that with probability

[TABLE]

we have

[TABLE]

Since $K=2^{-\ell-1}n\leq d_{i+1}$ was chosen sufficiently large, we can assume that $-2d_{i+1}+\frac{80}{C}\sqrt{d_{i+1}}+\frac{4}{\mu}\leq-d_{i+1}$ and thus $D_{t^{\prime}+T_{i}}\leq D_{t^{\prime}}-d_{i+1}$ , that is, $D_{t^{\prime}+T_{i}}$ belongs to a later phase already. Consequently, we have that with probability at least $1-\exp(-\Omega(\mu))$ , at most $T_{i}$ rounds are spent in Phase $i$ .

We finally show our claim first by noting that there are only $O(\log n)$ phases, hence with probability at least $1-O(\log n)\exp(-\Omega(\mu))=1-\exp(-\Omega(\mu))$ no phase takes longer than the desired $T_{i}$ iterations, and second by computing

[TABLE]

∎
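The constant $\frac{10(2+\sqrt{2})}{C}$ that reappears in Lemma 16 is the value of a geometric series over the phase budgets $20\frac{1}{C}\mu\sqrt{d_{i+1}}$ with $d_{i}=2^{-i}n$. The following sketch checks this numerically. It ignores the rounding in $T_{i}$, which adds at most one iteration per phase, and the value passed for `C` is a placeholder, since the constant of Lemma 14 is not made explicit.

```python
import math


def total_phase_budget(n, mu, C, num_phases):
    """Sum of the unrounded phase budgets (20 / C) * mu * sqrt(d_{i+1})."""
    return sum(
        20.0 / C * mu * math.sqrt(n * 2.0 ** -(i + 1))
        for i in range(1, num_phases + 1)
    )


def closed_form(n, mu, C):
    """Limit of the series: 10 * (2 + sqrt(2)) / C * mu * sqrt(n)."""
    return 10.0 * (2.0 + math.sqrt(2.0)) / C * mu * math.sqrt(n)
```

The identity behind this is $\sum_{i\geq 1}2^{-(i+1)/2}=\frac{1}{2}\cdot\frac{1}{1-2^{-1/2}}=\frac{2+\sqrt{2}}{2}$, so any finite number of phases stays below the closed form.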

Lemma 15 can be extended to give a time bound for subjump functions as long as the target distance from the optimum is sufficiently large.

Lemma 16.

Let $C$ be the constant from Lemma 14 and let $C_{\mu}$ be any constant. Consider a run of the cGA with $\mu=\omega(\log n)$ and $\mu\leq n^{C_{\mu}}$ on a subjump function $\mathcal{F}$ with jump size $k\leq\tfrac{1}{20}\ln n$ . Let $D_{t}\coloneqq n-\|f_{t}\|_{1}$ for all $t$ . Let $K=(8C_{\mu}+12)\ln n$ . Then with probability $1-O(\frac{1}{n})$ , there is a $t\leq T\coloneqq\frac{10(2+\sqrt{2})}{C}\,\mu\sqrt{n}$ such that $D_{t}\leq K$ or $f_{it}<\tfrac{1}{3}$ for some $i\in[1..n]$ .

Proof.

We regard the modified optimization process where we start with a run of the cGA on $\mathcal{F}$ , but change the fitness function to OneMax when for the first time $D_{t}\leq K$ . Clearly, this modified process satisfies our claim if and only if the original process on $\mathcal{F}$ does.

We now couple the modified process to the optimization process of the cGA with same $\mu$ value on the OneMax function. We construct this coupling as follows. For each $t=1,2,\ldots$ and each $j\in[1..2]$ we let $r^{tj}\in[0,1]^{n}$ be a vector chosen uniformly at random. If $f_{t}$ is the frequency vector of the modified or the OneMax process, then the $j$ -th sample $x^{tj}\in\{0,1\}^{n}$ in iteration $t$ of this process is defined by $x^{tj}_{i}=1$ if and only if $r^{tj}_{i}\leq f_{it}$ . Clearly, the two (marginal) processes defined this way are identically distributed to the two processes we wanted to couple. More interestingly, the two processes in the coupling are identical up to the point where the modified subjump process samples a search point in the gap region before it has changed the fitness to OneMax. If we denote the probability of this event happening within the first $T$ iterations by $p$ , then by Lemma 15 and a union bound over the two failure probabilities, we have that with probability at least $1-(\exp(-\Omega(\mu))+p)$ , within $T$ iterations the modified process has reached a $D_{t}$ value of at most $K$ or has reached a frequency below $\frac{1}{3}$ .
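The coupling can be sketched in code: both processes draw their samples from the same uniform vectors $r^{tj}$ and therefore stay identical until the jump process samples a gap point. The sketch makes several simplifying assumptions that are not in the text: the classic jump function stands in for the subjump function, the switch to OneMax at $D_{t}\leq K$ is omitted, frequencies are capped at $\frac{1}{n}$ and $1-\frac{1}{n}$, and all function names are hypothetical.

```python
import random


def onemax(x):
    return sum(x)


def jump(x, k):
    n, ones = len(x), sum(x)
    return k + ones if ones == n or ones <= n - k else n - ones


def sample(f, r):
    """Coupled sampling: bit i is 1 exactly when r[i] <= f[i]."""
    return [1 if ri <= fi else 0 for ri, fi in zip(r, f)]


def update(f, x1, x2, fitness, mu):
    """One cGA frequency update with boundaries 1/n and 1 - 1/n."""
    n = len(f)
    lo, hi = 1.0 / n, 1.0 - 1.0 / n
    y1, y2 = (x1, x2) if fitness(x1) >= fitness(x2) else (x2, x1)
    return [min(hi, max(lo, fi + (a - b) / mu)) for fi, a, b in zip(f, y1, y2)]


def coupled_run(n, k, mu, iters, rng):
    """Return (first iteration after which the frequency vectors differ,
    first iteration in which the jump process samples a gap point)."""
    f_one, f_jump = [0.5] * n, [0.5] * n
    first_diff, first_gap = None, None
    for t in range(1, iters + 1):
        # shared randomness r^{t1}, r^{t2} for both processes
        r1 = [rng.random() for _ in range(n)]
        r2 = [rng.random() for _ in range(n)]
        a1, a2 = sample(f_one, r1), sample(f_one, r2)
        b1, b2 = sample(f_jump, r1), sample(f_jump, r2)
        if first_gap is None and any(n - k < sum(b) < n for b in (b1, b2)):
            first_gap = t
        f_one = update(f_one, a1, a2, onemax, mu)
        f_jump = update(f_jump, b1, b2, lambda x: jump(x, k), mu)
        if first_diff is None and f_one != f_jump:
            first_diff = t
    return first_diff, first_gap
```

In any run, the two frequency vectors can only start to differ in or after the iteration in which the first gap point is sampled, because outside the gap the jump function orders search points exactly as OneMax does.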

Hence it remains to show that $p$ is sufficiently small. For this we note that by Lemma 5, the probability that in the modified subjump process before the switch to the OneMax fitness a particular search point $x^{tj}$ lies in the gap, is at most

[TABLE]

where we wrote $d(x^{tj}):=n-\|x^{tj}\|_{1}$ as earlier in this work. By a union bound, $p\leq 2Tn^{-C_{\mu}}n^{-1.5}=O(\frac{1}{n})$ . ∎
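As a rough sanity check of the exponents, the following is a sketch under the assumption that Lemma 5 is applied with $\delta=\frac{1}{2}$ and hence gives a bound of the form $\exp(-\frac{1}{8}D_{t})$, the same form that appears in the proof of Lemma 21. With $D_{t}>K=(8C_{\mu}+12)\ln n$ before the switch and $T=O(\mu\sqrt{n})=O(n^{C_{\mu}+0.5})$, the constants fit together as

$$\exp(-\tfrac{1}{8}K)=\exp\big(-(C_{\mu}+\tfrac{3}{2})\ln n\big)=n^{-C_{\mu}}n^{-1.5},\qquad 2T\,n^{-C_{\mu}}n^{-1.5}=O(n^{C_{\mu}+0.5})\cdot n^{-C_{\mu}-1.5}=O(\tfrac{1}{n}).$$

The factor $2T$ counts the two samples in each of the $T$ iterations.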

We now analyze the drift in $D_{t}$ when we are so close to the gap that we cannot assume anymore that we never sample a search point in the gap. We recall the definition of the gap by

[TABLE]

and we further define $G^{+}\coloneqq G\cup\{(1,\dots,1)\}$ .

A difficulty here, which was not treated fully rigorously in [HS18, Lemma 3.1], is that the event $G_{t}$ that $x^{1}$ or $x^{2}$ lie in the gap and the random variable $|\|x^{1}\|_{1}-\|x^{2}\|_{1}|$ are not independent. Consequently, the estimate $E[D_{t}-D_{t+1}\mid D_{t}]=\frac{1}{\mu}|\|x^{1}\|_{1}-\|x^{2}\|_{1}|(1-2\Pr[G_{t}])$ is not correct. In fact, the correlation is indeed not in our favor. When $|\|x^{1}\|_{1}-\|x^{2}\|_{1}|$ is large, the probability that a search point in the gap was sampled (and thus the frequency update is done in the unwanted direction) is higher. We solve this difficulty by computing an estimate for $|\|x^{1}\|_{1}-\|x^{2}\|_{1}|$ conditional on that at least one of the search points lies in the gap.

Lemma 17.

Let $\mu$ be arbitrary (but, as always in this work, satisfying the well-behaved frequency assumption). Let $k\in[1..\frac{1}{2}n-1]$ . Consider an iteration $t$ of the cGA optimizing a subjump function with jump size $k$ started with a frequency vector $f_{t}$ such that $D_{t}\coloneqq n-\|f_{t}\|_{1}\geq 2k$ and $f_{t}\in[\frac{1}{3},1-\frac{1}{n}]^{n}$ . Then

[TABLE]

where $C$ is the constant from Lemma 14.

Proof.

From the definition of the cGA, we note that when $x^{1}$ and $x^{2}$ are both not in $G^{+}$ , then $D^{\prime}_{t+1}\coloneqq n-\|f^{\prime}_{t+1}\|_{1}$ satisfies $D^{\prime}_{t+1}=D_{t}-\frac{1}{\mu}|\|x^{1}\|_{1}-\|x^{2}\|_{1}|$ as if we were optimizing OneMax. In all other cases, we have $D^{\prime}_{t+1}\leq D_{t}+\frac{1}{\mu}|\|x^{1}\|_{1}-\|x^{2}\|_{1}|$ . Consequently,

[TABLE]

When the frequencies are all at least $\frac{1}{3}$ , we conclude from Lemma 14 that $E[|\|x^{1}\|_{1}-\|x^{2}\|_{1}|]\geq\frac{1}{5}C\sqrt{D_{t}}$ .

For the contribution when search points are in $G^{+}$ , we first note that the second bound of Lemma 5 (with $\delta=\frac{1}{2}$ and $D^{-}=D_{t}$ ) and $D_{t}\geq 2k$ yield

[TABLE]

Then, exploiting the symmetry between $x^{1}$ and $x^{2}$ , counting the case $x^{1},x^{2}\in G^{+}$ twice, and using again $\frac{1}{2}D_{t}\geq k$ , we compute

[TABLE]

In summary, we have

[TABLE]

By Lemma 8, we further have $E[\mu D_{t+1}-\mu D^{\prime}_{t+1}]\leq 2$ . Consequently, recalling that the linearity of expectation holds also for dependent random variables, we have

[TABLE]

∎

From Lemma 17, we obtain the following coarse estimate for the time to reach a frequency distance $D_{t}$ below $2k$ (or at least below some constant).

Lemma 18.

Let $k\in[1..\frac{n}{2}-1]$ . Consider a run of the cGA with arbitrary hypothetical population size $\mu$ (satisfying the well-behaved frequency assumption) and started with a fixed frequency vector $f_{0}$ instead of the usual initialization $f_{0}=(\frac{1}{2},\ldots,\frac{1}{2})$ . For all $t\geq 0$ , let $D_{t}:=n-\|f_{t}\|_{1}$ . Let $D^{\prime\prime}\geq 2k$ and at least some sufficiently large constant (depending on the constant $C$ from Lemma 14). Let $T$ be the first time $t$ that this run reaches a frequency vector $f_{t}$ with $D_{t}<D^{\prime\prime}$ or that there is a frequency $f_{it}$ that is less than $\frac{1}{3}$ . Then

$$E[T]\leq\mu D_{0}.$$

Proof.

Based on the run of the cGA, we define a random process $\tilde{D}_{t}$ as follows. If for some $s\in[0..t]$ we have $D_{s}<D^{\prime\prime}$ or there is an $i\in[1..n]$ with $f_{is}<\frac{1}{3}$ , then $\tilde{D}_{t}=0$ . Otherwise, we let $\tilde{D}_{t}=D_{t}$ . In other words, the process $(\tilde{D}_{t})$ agrees with $(D_{t})$ while $(D_{t})$ is at least $D^{\prime\prime}$ and there is no frequency below $\frac{1}{3}$ , and then is constant zero.

By Lemma 17 and using our assumption that $D^{\prime\prime}$ is at least a sufficiently large constant, we have $E[\tilde{D}_{t}-\tilde{D}_{t+1}]\geq\frac{1}{\mu}$ whenever $D_{t}\geq D^{\prime\prime}$ and $f_{t}\in[\frac{1}{3},1]^{n}$ , that is, we have $E[\tilde{D}_{t}-\tilde{D}_{t+1}\mid\tilde{D}_{t}>0]\geq\frac{1}{\mu}$ for all $t\geq 0$ .

Since $T=\inf\{t\geq 0\mid\tilde{D}_{t}=0\}$ , the additive drift theorem (Theorem 7 (i)) yields $E[T]\leq\frac{D_{0}}{1/\mu}$ . ∎

We end this section of preliminary results with an argument showing that the frequencies stay away from the lower boundary for a decent amount of time. On the formal level, this argument will be used to argue that at the times $T$ estimated in Lemmas 16 and 18, we have the desired small $D_{t}$ value and not the case that a frequency went below $\frac{1}{3}$ . On the intuitive level, this argument is necessary for two reasons. On the one hand, as can be seen from the proof or via simple counter-examples, a lower bound for the probability of sampling the optimum such as Lemma 10 is no longer true if arbitrarily small frequencies are allowed. On the other hand, small frequencies support that the two offspring sampled in one iteration agree in the corresponding bit. In this case, no change of the frequency is possible, which slows down the progress and rules out progress guarantees such as Lemma 17.

A guarantee that all frequencies stay away from the lower boundary in a run of the cGA on jump functions was also given in [HS18, Lemma 2.4]. Unfortunately, the proof appears not complete to us. It seems to us that the main technical prerequisite of this result, Lemma 2.2 in [HS18] with a proof of a little over one page in the condensed proceedings style, is not correct for two reasons. Since the proof of Lemma 2.2 never refers to the frequency boundaries, it is not clear if it is applicable for the cGA with these boundaries. Rather, a frequency vector having one entry $f_{it}=\frac{1}{n}$ and another one $f_{jt}=1-\frac{1}{n}$ seems to be a counter-example (note that the frequency vector is called $p_{t}$ instead of $f_{t}$ in [HS18]). However, also for the case without boundaries counter-examples seem to exist for all values of $\mu$ , e.g., the frequency vector $f_{t}=(\frac{1}{100},\frac{1}{2})$ .

We did not see how to repair the otherwise elegant argument via the Azuma-Hoeffding inequality. For this reason, using a sequence of elementary reductions, we argue that the true random process of a frequency, which is not a Markov process when regarding one frequency in isolation, can be pessimistically replaced by a fair random walk on an unbounded frequency domain. For the analysis of the latter, classic Chernoff bounds can be used. This general approach was also taken in [Dro06], however in the easier situation that there are no frequency boundaries (apart from the trivial boundaries, which are absorbing). For this reason, some additional arguments are necessary in our situation.

Lemma 19.

Let $\mu$ be arbitrary (except that it satisfies the well-behaved frequency assumption). Let $\varepsilon>0$ . Let $Z_{0},Z_{1},\dots$ be any random process on $F_{\mu}$ (defined in (1)) such that

(i) $Z_{0}=\frac{1}{2}$ ,

(ii) for all $t=0,1,\dots$ such that $Z_{t}\geq\frac{1}{2}-\varepsilon$ there are numbers $p_{t},q_{t},r_{t}\in[0,1]$ , depending on $Z_{0},Z_{1},\dots,Z_{t}$ , such that $p_{t}+q_{t}+r_{t}=1$ and

[TABLE]

We further assume that $q_{t}\geq r_{t}$ when $Z_{t}\neq 1-\frac{1}{n}$ .

Then for all $T\in\mathbb{N}$ ,

[TABLE]

Proof.

For the ease of the argument, we can without loss of generality assume that condition (ii) also holds when $Z_{t}<\frac{1}{2}-\varepsilon$ . We conduct a sequence of reductions to a fair unbiased random walk on an infinite line. We first observe that we can assume $p_{t}=0$ for all $t$ . The event $Z_{t+1}=Z_{t}$ that the process does not move only slows down the process in the sense that it visits fewer states, and thus is less likely to approach the lower boundary.

We now argue that w.l.o.g. we can assume that $q_{t}=r_{t}=\tfrac{1}{2}$ for all $t\in[0..T-1]$ except in the cases $Z_{t}\in\{\frac{1}{n},1-\frac{1}{n}\}$ . To make this argument formal, we inductively modify $q_{t}$ and $r_{t}$ in the time interval $t\in[0..T-1]$ . The modified process will be denoted by $(\tilde{Z}_{t})_{t=0,\ldots,T}$ and described via $\tilde{q}_{t}$ and $\tilde{r}_{t}$ , $t\in[0..T-1]$ , which again are functions of $\tilde{Z}_{0},\ldots,\tilde{Z}_{t}$ . We start with $(\tilde{Z}_{t})$ being a copy of $(Z_{t})$ . We denote by

[TABLE]

the “failure probabilities” of both processes given a particular starting point and time.

Assume that $(\tilde{Z}_{t})$ is such that for some $t_{0}\in[1..T]$ we have that for all $s\in[t_{0}..T-1]$

(i) $\tilde{q}_{s}=\tilde{r}_{s}=\frac{1}{2}$ regardless of $\tilde{Z}_{0},\ldots,\tilde{Z}_{s}$ (except in the boundary cases $\tilde{Z}_{s}\in\{\frac{1}{n},1-\frac{1}{n}\}$ );

(ii) $P_{is}\leq\tilde{P}_{is}$ for all $i\in F_{\mu}$ .

Note that our initial copy $(\tilde{Z}_{t})$ satisfies these conditions for $t_{0}=T$ . We now modify $(\tilde{Z}_{t})$ so that the new process satisfies these conditions already for $t_{0}-1$ . To this end, let $(Z^{\prime}_{t},q^{\prime}_{t},r^{\prime}_{t})$ be a copy of $(\tilde{Z}_{t},\tilde{q}_{t},\tilde{r}_{t})$ except that we define $q^{\prime}_{t_{0}-1}=r^{\prime}_{t_{0}-1}=\frac{1}{2}$ (except in the boundary cases $Z^{\prime}_{t_{0}-1}\in\{\frac{1}{n},1-\frac{1}{n}\}$ ). Since from time $t_{0}$ on $(Z^{\prime}_{t})$ equals $(\tilde{Z}_{t})$ , we have $P^{\prime}_{is}=\tilde{P}_{is}\geq P_{is}$ for all $i\in F_{\mu}$ . Since further from time $t_{0}$ on $(Z^{\prime}_{t})$ and $(\tilde{Z}_{t})$ are fair random walks with reflecting boundaries, a simple coupling argument shows $P^{\prime}_{i-1/\mu,t_{0}}\geq P^{\prime}_{i+1/\mu,t_{0}}$ for all $i\in F_{\mu}\setminus\{\frac{1}{n},1-\frac{1}{n}\}$ . From this, $\tilde{r}_{t_{0}-1}\leq\tilde{q}_{t_{0}-1}$ , and $\tilde{r}_{t_{0}-1}+\tilde{q}_{t_{0}-1}=1$ , we obtain for all $i\in F_{\mu}\setminus\{\frac{1}{n},1-\frac{1}{n}\}$ that

[TABLE]

In the boundary cases, we trivially have $P^{\prime}_{1/n,t_{0}-1}=1=P_{1/n,t_{0}-1}$ and $P^{\prime}_{1-1/n,t_{0}-1}=P^{\prime}_{1-1/n-1/\mu,t_{0}}=\tilde{P}_{1-1/n-1/\mu,t_{0}}\geq P_{1-1/n-1/\mu,t_{0}}=P_{1-1/n,t_{0}-1}$ . This proves our claim. An elementary induction gives a process $(\tilde{Z}_{t})$ that satisfies (i), (ii) above from $t_{0}=0$ on. This process, then, is a simple unbiased random walk with reflecting boundaries. From (ii) we see that such an unbiased random walk is not better than the original process $(Z_{t})$ in terms of avoiding to go below $\frac{1}{2}-\varepsilon$ .

Hence we can now assume that $(Z_{t})$ is an unbiased random walk on $F_{\mu}$ with reflecting boundaries. We shall show that

[TABLE]

Being interested in the event that the process reaches a state outside ${[\tfrac{1}{2}-\varepsilon,\tfrac{1}{2}+\varepsilon]}$ at least once, we can also drop the boundary conditions and assume that we have $Z_{t+1}\in\{Z_{t}-\frac{1}{\mu},Z_{t}+\frac{1}{\mu}\}$ uniformly at random at all times $t$ . We can now rewrite the $Z_{t}$ as follows. Let $X_{1},\dots,X_{T}$ be independent random variables uniformly distributed on $\{-\frac{1}{\mu},\frac{1}{\mu}\}$ . Define $Z^{\prime\prime}_{t}\coloneqq\frac{1}{2}+\sum_{i=1}^{t}X_{i}$ for all $t\in[0..T]$ . Then $(Z_{0},\ldots,Z_{T})$ and $(Z^{\prime\prime}_{0},\ldots,Z^{\prime\prime}_{T})$ are identically distributed. Consequently, we can apply to $(Z_{t})$ and $(Z^{\prime\prime}_{t})$ the additive Chernoff bound in the sharper version working also for partial sums, Theorem 6, and obtain

[TABLE]

∎
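The last step of the proof reduces everything to a fair random walk with steps $\pm\frac{1}{\mu}$ started at $\frac{1}{2}$. The sketch below simulates this walk and compares the empirical probability of leaving $[\frac{1}{2}-\varepsilon,\frac{1}{2}+\varepsilon]$ within $T$ steps with a maximal Hoeffding-type bound $2\exp(-\frac{\varepsilon^{2}\mu^{2}}{2T})$. This form of the bound is an assumption on our side: it is the one that yields the term $2\exp(-\frac{\mu^{2}}{72T})$ used in the proof of Lemma 21 for $\varepsilon=\frac{1}{6}$. The function names are hypothetical.

```python
import math
import random


def exit_probability(mu, eps, T, trials, rng):
    """Estimate the probability that a fair walk with steps +-1/mu, started
    at 1/2, leaves [1/2 - eps, 1/2 + eps] at least once within T steps."""
    exits = 0
    for _ in range(trials):
        pos = 0  # position in multiples of 1/mu, relative to the start 1/2
        for _ in range(T):
            pos += 1 if rng.random() < 0.5 else -1
            if abs(pos) > eps * mu:
                exits += 1
                break
    return exits / trials


def hoeffding_bound(mu, eps, T):
    """Assumed two-sided maximal bound 2 * exp(-(eps * mu)^2 / (2 T))."""
    return 2.0 * math.exp(-((eps * mu) ** 2) / (2.0 * T))
```

For instance, `exit_probability(64, 1 / 6, 50, 4000, random.Random(0))` can be compared with `hoeffding_bound(64, 1 / 6, 50)`. The simulation tracks integer positions to avoid rounding issues with multiples of $\frac{1}{\mu}$.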

To apply Lemma 19, we need a deeper understanding of the random process describing a single frequency. For this, we build on the following estimate of the expected change of a frequency that is not affected by the boundaries in the OneMax process. This result was proven in [SW19, Lemma 3].

Lemma 20.

Let $\mu$ be arbitrary (but satisfying the well-behaved frequency assumption). Consider a run of the cGA optimizing OneMax. Consider an iteration starting with a frequency vector $f_{t}$ . Let $i\in[1..n]$ be such that $\frac{1}{n}+\frac{1}{\mu}\leq f_{it}\leq(1-\frac{1}{n})-\frac{1}{\mu}$ . Then

[TABLE]

From Lemmas 19 and 20, we now obtain the following lower bound guarantee for the frequencies in the optimization process on subjump functions. Regarding the restriction $k\geq 17$ , we recall that a subjump function with jump size smaller than $17$ also is a subjump function with jump size $17$ . The lemma thus applies also to these (in a suitable manner). We could have alternatively formulated the lemma for all $k$ and defined $D^{\prime\prime}=\max\{2k+1,35\}$ .

Lemma 21.

Let $k\in[1..n]$ be arbitrary. Consider the run of the cGA with hypothetical population size $\mu$ on a subjump function with jump size $k\geq 17$ . Let $D_{t}=n-\|f_{t}\|_{1}$ for all $t$ . Let $D^{\prime\prime}=2k+1$ and $T^{\prime\prime}=\inf\{t\geq 0\mid D_{t}\leq D^{\prime\prime}\}$ . Then for all $T\in\mathbb{N}$ , with $T^{\prime\prime\prime}\coloneqq\min\{T^{\prime\prime},T\}$ , we have

[TABLE]

Proof.

Consider some time $t$ such that $f_{t}\in[\frac{1}{3},1]^{n}$ and $D_{t}\geq D^{\prime\prime}$ . Consider a fixed bit $i\in[1..n]$ such that $f_{it}\neq 1-\frac{1}{n}$ . If we were optimizing the OneMax function, then by Lemma 20,

[TABLE]

Regardless of whether we optimize OneMax or a subjump function, the events $f_{i,t+1}=f_{it}+\tfrac{1}{\mu}$ and $f_{i,t+1}=f_{it}-\tfrac{1}{\mu}$ can only occur when the two search points sampled in this iteration satisfy $x^{1}_{i}\neq x^{2}_{i}$ . The definition of $f_{i,t+1}$ in the subjump case differs from the OneMax case at most when at least one of $x^{1}$ and $x^{2}$ lie in the gap $G_{nk}$ . Hence the following coarse correction of the above estimate is valid for the optimization of subjump functions of jump size $k$ .

[TABLE]

We now estimate this correction term. We note that

[TABLE]

By symmetry and the union bound, we have

[TABLE]

Conditional on $x^{1}_{i}\neq x^{2}_{i}$ , the bit string $x^{1}$ is sampled from $\operatorname{Sample}(f_{t})$ , however, conditional on the $i$ -th bit being zero or one. In either case, to have $x^{1}\in G_{nk}$ , we need that $\tilde{D}=\sum_{j\neq i}(1-x^{1}_{j})$ is at most $k\leq\frac{1}{2}(D_{t}-1)$ , where we recall that $D_{t}\geq D^{\prime\prime}=2k+1$ . Since $E[\tilde{D}]=D_{t}-(1-f_{it})\geq D_{t}-1$ , by Lemma 5 with $\delta=\frac{1}{2}$ this event happens with probability at most $\exp(-\tfrac{1}{8}(D_{t}-1))$ . Together with $\Pr[x^{1}_{i}\neq x^{2}_{i}]=2f_{it}(1-f_{it})$ , we obtain

[TABLE]

which is non-negative since $D_{t}\geq D^{\prime\prime}=2k+1\geq 35$ .

Consequently, the process $(f_{it})_{t}$ satisfies the assumptions of Lemma 19 up to time $T^{\prime\prime}$ . If $T^{\prime\prime}<T$ , we artificially extend the process (for the following argument only) by setting $f_{it}=f_{iT^{\prime\prime}}$ for all $t\in[T^{\prime\prime}+1..T]$ . We apply Lemma 19 to this extended process and obtain that up to time $T$ , the $i$ -th frequency is always at least $\frac{1}{3}$ with probability $1-2\exp(-\frac{\mu^{2}}{72T})$ . With a union bound over the $n$ frequencies, we have $f_{t}\in[\frac{1}{3},1]^{n}$ up to time $T$ with probability at least $1-2n\exp(-\frac{\mu^{2}}{72T})$ in the extended process, and up to time $T^{\prime\prime\prime}$ in the true process. ∎

4.3 Proof of Theorem 13

We are now ready to formulate the full proof of our main upper bound result.

Proof of Theorem 13.

To allow the reader to easily check that all implicit constants can be chosen in a way that they give the claimed result, we make these constants explicit in the following proof, but note that for most of them it just suffices to choose them sufficiently large.

We consider the optimization of a subjump function $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ with jump size $k\leq\frac{1}{20}\ln(n)-1$ . Without loss of generality, we can assume that $k\geq 17$ . (Footnote 3: In fact, we could just assume that $k=\lfloor\frac{1}{20}\ln(n)\rfloor-1$ , but we find it more insightful to present the proof in a way that the arguments are adjusted to the true value of $k$ , assuming it to be at least $17$ .)

Let $\mu\geq c_{\mu}\sqrt{n}\ln(n)$ for a constant $c_{\mu}$ to be defined in a moment. Assume further that for some constant $C_{\mu}$ we have $\mu\leq n^{C_{\mu}}$ . Without loss of generality, we assume that $C_{\mu}\geq 1$ .

Consider a run of the cGA with hypothetical population size $\mu$ on $\mathcal{F}$ . Let $D_{t}\coloneqq n-\|f_{t}\|_{1}$ for all $t\geq 0$ .

Let $D^{\prime}\coloneqq C_{D^{\prime}}\ln n$ , where $C_{D^{\prime}}\geq 8C_{\mu}+12$ is a constant. Let $T^{\prime}$ be the first time that $D_{t}\leq D^{\prime}$ or that there is a frequency $f_{it}$ that is less than $\frac{1}{3}$ . By Lemma 16, with probability at least $1-O(\frac{1}{n})$ we have $T^{\prime}\leq\frac{10(2+\sqrt{2})}{C}\mu\sqrt{n}$ , where $C$ is the constant from Lemma 14.

Let $D^{\prime\prime}\coloneqq\max\{2k+1,C_{D^{\prime\prime}}\}$ , where $C_{D^{\prime\prime}}$ is a sufficiently large constant (that depends only on the constant $C$ from Lemma 14). Let $T^{\prime\prime}$ be the first time that $D_{t}<D^{\prime\prime}$ or that there is a frequency $f_{it}$ that is less than $\frac{1}{3}$ . By Lemma 18, we have $E[T^{\prime\prime}-T^{\prime}]=O(\mu\log n)$ . Hence a simple Markov bound gives $T^{\prime\prime}\leq T^{\prime}+\mu n^{0.4}\ln n$ with probability $1-O(n^{-0.4})$ .
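Spelled out, the Markov bound for the second phase reads as follows. This is a sketch in which the $O(\mu\log n)$ expectation is taken from Lemma 18 applied with a starting distance of at most $D^{\prime}=C_{D^{\prime}}\ln n$:

$$\Pr\big[T^{\prime\prime}-T^{\prime}\geq\mu n^{0.4}\ln n\big]\leq\frac{E[T^{\prime\prime}-T^{\prime}]}{\mu n^{0.4}\ln n}=\frac{O(\mu\log n)}{\mu n^{0.4}\ln n}=O(n^{-0.4}).$$

Since $\mu n^{0.4}\ln n=o(\mu\sqrt{n})$ , this additional time does not change the order of the total runtime.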

Finally, let $c_{T}\coloneqq\frac{10(2+\sqrt{2})}{C}+1$ and assume that $c_{\mu}\geq 144c_{T}$ . Using our assumption that $k\geq 17$ , we first invoke Lemma 21 with $T=c_{T}\mu\sqrt{n}$ and obtain that up to time $T^{\prime\prime\prime}=\min\{T^{\prime\prime},T\}$ , all frequencies are at least $\frac{1}{3}$ with probability $1-2n\exp(-\frac{\mu^{2}}{72T})\geq 1-2n\exp(-\frac{\mu}{72c_{T}\sqrt{n}})\geq 1-2n\exp(-\frac{c_{\mu}}{72c_{T}}\ln n)=1-O(\frac{1}{n})$ by choice of $c_{\mu}$ .
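The interplay of $c_{T}$ and $c_{\mu}$ can be checked numerically. The sketch below evaluates $2n\exp(-\frac{\mu^{2}}{72T})$ for the smallest admissible $\mu=c_{\mu}\sqrt{n}\ln(n)$ with $c_{\mu}=144c_{T}$ and $T=c_{T}\mu\sqrt{n}$. The value passed for `C` is a placeholder, because the constant from Lemma 14 is not made explicit, and the function name is hypothetical.

```python
import math


def frequency_failure_bound(n, C, c_mu_factor=144.0):
    """Evaluate 2 n exp(-mu^2 / (72 T)) with mu = c_mu sqrt(n) ln(n),
    c_mu = c_mu_factor * c_T and T = c_T * mu * sqrt(n)."""
    c_T = 10.0 * (2.0 + math.sqrt(2.0)) / C + 1.0
    c_mu = c_mu_factor * c_T
    mu = c_mu * math.sqrt(n) * math.log(n)
    T = c_T * mu * math.sqrt(n)
    return 2.0 * n * math.exp(-(mu ** 2) / (72.0 * T))
```

With the factor $144$, the exponent equals $2\ln n$ regardless of $C$, so the bound evaluates to $\frac{2}{n}$. A larger $c_{\mu}$ only makes it smaller.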

Putting these three arguments together, we see that with probability $1-O(\frac{1}{n})-O(n^{-0.4})-O(\frac{1}{n})=1-O(n^{-0.4})$ , there is a time $t=O(\mu\sqrt{n})$ such that $D_{t}\leq D^{\prime\prime}\leq\frac{1}{10}\ln(n)$ and $f_{t}\in[\frac{1}{3},1]^{n}$ . By Lemma 11, we now find the optimum in $O(\frac{\mu}{\log n})$ iterations with probability $1-n^{-\omega(1)}$ . This shows that the total runtime is $O(\mu\sqrt{n})$ with probability $1-O(n^{-0.4})-n^{-\omega(1)}=1-O(n^{-0.4})$ . ∎

Let us remark that we did not try to optimize the implicit constants, nor did we try to find the largest constant $C_{k}$ such that the $O(n\log n)$ runtime guarantee holds for all $k\leq C_{k}\ln(n)-1$ . We further note that all but one argument in the above proof, by choosing the constants right, would give a success probability of $1-n^{-c}$ , where $c$ can be any constant. This is not true for the Markov bound argument in the analysis of the time to reach a $D_{t}$ value of at most $D^{\prime\prime}$ . Without further details, we note that also for this phase an arbitrary inverse-polynomial failure probability could be obtained with stronger methods.

Finally, we note that by taking $k=1$ , our result also applies to the OneMax function.

4.4 General Insights From This Proof

Our result that the cGA can cross small fitness valleys at no extra cost, whereas many EAs pay an $\Omega(n^{k})$ price for this, raises the question why these algorithms differ that significantly. From our proof, we obtain the following insight.

To ease the presentation, we take as point of comparison the simple $(1+1)$ EA, but as discussed earlier, similar behaviors are observed for many other mutation-based EAs. Again, when talking about the cGA, we measure the progress via the frequency distance $D_{t}=n-\|f_{t}\|_{1}$ , which is the expected fitness distance of a sample. For the $(1+1)$ EA, naturally, we regard the Hamming distance $d(x_{t})=n-\textsc{OneMax}(x_{t})$ of the current solution $x_{t}$ from the optimum.

We observe that both algorithms easily reach a distance of $O(k)$ . For the cGA this is “only” $O(k)$ and for the $(1+1)$ EA this is exactly $k$ , but this difference is not important. The important difference is that from such a state, the $(1+1)$ EA samples the optimum only with probability $O(n^{-k})$ , whereas the cGA does so with probability $\exp(-O(k))$ , at least when $f_{t}\in[\frac{1}{3},1]^{n}$ .
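The gap between the two probabilities is easy to see numerically. The sketch below computes the probability that standard bit mutation with rate $p$ flips exactly the $k$ missing bits, and the probability that a sample from a frequency vector $f$ is the all-ones string. The uniform frequency vector with $D_{t}=2k$ used in the usage note is a hypothetical choice for illustration, and the function names are our own.

```python
def ea_jump_probability(n, k, p):
    """Probability that standard bit mutation with rate p turns a search
    point at Hamming distance k from the optimum into the optimum."""
    return p ** k * (1.0 - p) ** (n - k)


def cga_optimum_probability(f):
    """Probability that a sample from the frequency vector f is all-ones."""
    prob = 1.0
    for fi in f:
        prob *= fi
    return prob
```

For $n=1000$ and $k=3$, `ea_jump_probability(1000, 3, 1 / 1000)` is of order $n^{-k}$, whereas `cga_optimum_probability([1 - 6 / 1000] * 1000)` is roughly $e^{-2k}$, several orders of magnitude larger. Even the rate $p=\frac{k}{n}$, which maximizes the first expression, does not close the gap.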

A first observation is that the cGA samples solutions with higher variance. This is most easily seen from Lemma 14, which implies that with constant probability the distance $d(y)$ of a sample $y$ is $\Omega(\sqrt{D_{t}})$ away from the expected distance $E[d(y)]=D_{t}$ .

For the $(1+1)$ EA, the sampling variance is much smaller. Since the number of bits that are flipped in a mutation follows a binomial distribution with parameters $n$ and $\frac{1}{n}$, which is asymptotically a Poisson distribution with parameter $\lambda=1$, we see that larger fitness changes can only occur with relatively small probability (e.g., a super-constant fitness change happens only with probability $o(1)$, a fitness change of $\delta$ happens with probability at most $\delta^{-\Omega(\delta)}$).

The reason for this low sampling variance of the $(1+1)$ EA, obviously, is the small mutation rate of $\frac{1}{n}$ usually employed. However, raising the mutation rate does not solve the problem and, in fact, creates new problems. When using a larger mutation rate, then the expected OneMax fitness of the offspring gets worse. If $x$ is a search point with distance $d(x)=k=O(\log n)$ and $y$ is obtained from $x$ via standard-bit mutation with mutation rate $p$ , then the expected distance of $y$ from the optimum is $E[d(y)]=d(x)+pn(1-2d(x)/n)$ .
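The formula for the expected offspring distance can be checked directly. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the exact expectation is computed from the two binomial counts of flipped zero-bits and flipped one-bits.

```python
from math import comb

def binom_pmf(m, p, j):
    """Probability that a Bin(m, p) variable equals j."""
    return comb(m, j) * p**j * (1 - p)**(m - j)

def expected_offspring_distance(n, d, p):
    """Exact E[d(y)] when each of the n bits of x (d of them zero) flips with probability p."""
    # Flipped zero-bits reduce the distance, flipped one-bits increase it.
    gain = sum(j * binom_pmf(d, p, j) for j in range(d + 1))
    loss = sum(j * binom_pmf(n - d, p, j) for j in range(n - d + 1))
    return d - gain + loss

def closed_form(n, d, p):
    """The closed form d(x) + p n (1 - 2 d(x) / n) stated in the text."""
    return d + p * n * (1 - 2 * d / n)
```

For instance, with $n=100$, $d(x)=5$, and $p=0.03$ both functions give an expected distance of about $7.7$, so the offspring is on average further away from the optimum than the parent.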

Clearly, worsening the expected quality of the offspring can only make sense if there is a clear gain from this. Unfortunately, there is no such gain. Indeed, when using a larger mutation rate $p$, then the distance $d(y)$ has a larger variance. However, this variance mostly works in the wrong direction. When not only looking at the first or second moment, but at the precise distribution, then we see that the distance gain or loss is distributed as $d(x)-d(y)\sim-X_{n-k,p}+X_{k,p}$, where $X_{n-k,p}$ and $X_{k,p}$ are independent random variables following binomial laws with parameters $(n-k,p)$ and $(k,p)$, respectively. Consequently, a positive gain can only stem from the $X_{k,p}$ part, which (unless $p$ is ridiculously large) again has a small variance since $k$ is small.

In summary, we see that regardless of how we set the mutation rate, the $(1+1)$ EA only with relatively small probability reduces the distance by a larger amount. This is caused by a generally small sampling variance when $p$ is small, say $p=\frac{1}{n}$ , or by the fact that the distribution of the distance change is highly asymmetric in the way that true distance reductions are unlikely (when $p$ is larger).

For the cGA, things are different. Assuming for simplicity a frequency vector $f_{t}=(1-\frac{2k}{n})\textbf{1}_{n}$, we have $D_{t}=2k$, and the fitness gain of a sample $y$ over the expectation is distributed like $D_{t}-d(y)\sim 2k-X_{n,\frac{2k}{n}}$, where again $X_{n,\frac{2k}{n}}$ denotes a random variable following a binomial law with parameters $n$ and $\frac{2k}{n}$. While this distribution is not perfectly symmetric, it is not too strongly concentrated in either direction and thus allows larger improvements with reasonable probability, in particular, sampling the optimum with probability $\exp(-O(k))$. This substantially different way in which solutions are sampled seems to be the key to the significantly better performance of the cGA on jump functions.
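To make the comparison concrete, the two probabilities of generating the optimum can be computed in closed form. The sketch below is illustrative only; the function names are hypothetical. It assumes a parent at Hamming distance exactly $k$ for the $(1+1)$ EA and the simplified frequency vector $(1-\frac{2k}{n})\textbf{1}_{n}$ for the cGA.

```python
def ea_hit_probability(n, k, p):
    """Probability that standard bit mutation with rate p flips exactly the k wrong bits."""
    return p**k * (1 - p)**(n - k)

def cga_hit_probability(n, k):
    """Probability that one sample from f = (1 - 2k/n) * 1_n is the all-ones string."""
    return (1 - 2 * k / n)**n

if __name__ == "__main__":
    n, k = 1000, 3
    print("EA, p = 1/n:", ea_hit_probability(n, k, 1 / n))
    print("EA, p = k/n:", ea_hit_probability(n, k, k / n))
    print("cGA sample :", cga_hit_probability(n, k))
```

Even the mutation rate $p=\frac{k}{n}$, which maximizes $p^{k}(1-p)^{n-k}$, leaves the $(1+1)$ EA with a polynomially small success probability, whereas the cGA value is roughly $e^{-2k}$ and thus independent of $n$.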

5 An Exponential Lower Bound

We now prove that the cGA, regardless of the value of the parameter $\mu$ , optimizes jump functions in a time that is at least exponential in the jump size $k$ .

As for our upper bound result, also this lower bound is valid for a broader class of functions. We say that a function $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ is a superjump function with jump size $k$ if it has a unique global maximum $x^{*}$ and for all $r\in[1..k-1]$ and $x,y\in\{0,1\}^{n}$ with $H(x,x^{*})=r$ and $H(y,x^{*})=r+1$ we have $\mathcal{F}(x)<\mathcal{F}(y)$ ; here we recall the definition

$$H(x,y)\coloneqq|\{i\in[1..n]\mid x_{i}\neq y_{i}\}|$$

of the Hamming distance between the bit strings $x$ and $y$ of length $n$ . In other words, $\mathcal{F}$ has a unique global maximum and is fully deceptive in a ball of radius $k$ around this optimum: search points closer to the optimum have a lower fitness. Clearly, all jump functions with jump size $k$ or larger are superjump functions with jump size $k$ . Also, by arbitrarily modifying a jump function outside the gap region and in a way that the global optimum remains the unique global optimum, we obtain superjump functions.
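For small $n$, the superjump property can be verified by brute force. The following sketch is an illustration under stated assumptions: the function names are hypothetical, the optimum is taken to be the all-ones string, and the jump function uses the convention that gap points receive the negated number of ones (only the ranking of the search points matters here).

```python
from itertools import product

def jump(x, k):
    """Jump function; the negative values in the gap are an assumed convention."""
    n, ones = len(x), sum(x)
    if ones <= n - k or ones == n:
        return ones
    return -ones

def is_superjump(F, n, k):
    """Brute-force check of the superjump property with optimum x* = (1,...,1)."""
    points = list(product((0, 1), repeat=n))
    values = {x: F(x) for x in points}
    opt = (1,) * n
    # The all-ones string must be the unique global maximum.
    if any(values[x] >= values[opt] for x in points if x != opt):
        return False
    by_dist = {}
    for x in points:
        by_dist.setdefault(n - sum(x), []).append(values[x])
    # Every point at distance r must be strictly worse than every point at distance r + 1.
    return all(max(by_dist[r]) < min(by_dist[r + 1]) for r in range(1, k))
```

In this sketch, a jump function with jump size $3$ or $4$ passes the check for $k=3$, while OneMax fails it for every $k\geq 2$.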

We now show the following result.

Theorem 22.

There are constants $\alpha_{1},\alpha_{2}>0$ such that for any $n$ sufficiently large and any $k\in[1..n]$ , regardless of the hypothetical population size $\mu$ , the runtime of the cGA on any superjump function with jump size $k$ with probability $1-\exp(-\alpha_{1}k)$ is at least $\exp(\alpha_{2}k)$ . In particular, the expected runtime is exponential in $k$ .

We note that we intentionally prove a runtime bound that holds with high probability. The reason is that, as discussed in Section 2.4, EDAs may with small probability reach states from which they find it very hard to reach the optimum. Such a situation could lead to a very high expected runtime even when the EDA with high probability is very efficient. For that reason, lower bounds that hold with high probability are particularly desirable for EDAs. Needless to say, a lower bound that holds with high probability immediately implies an asymptotically identical lower bound on the expected runtime.

We also note that the cGA is treating all bit-positions and the two bit values zero and one in a symmetric fashion (this property was called unbiased in [LW12]). Consequently, in any runtime analysis of the cGA on a pseudo-Boolean function we can assume that $(1,\ldots,1)$ is an optimum. Since further the actions of the cGA do not depend on absolute fitness values, but only on relative ones (this property was called ranking-based in [DW14]), its performance is invariant under monotonic rescalings of the fitness function. For this reason, it suffices to regard superjump functions that agree with the jump function of jump size $k$ on all search points $x$ with $\|x\|_{1}\geq n-k$ (and thus also have $(1,\dots,1)$ as unique global optimum). To ease the presentation, we shall take this assumption in the remainder without further notice. This also allows us to continue to use the definition

[TABLE]

of the gap.

Before stating the formal proof, we briefly describe the main proof arguments on a more intuitive level. As in the previous section, we will regard the stochastic process $D_{t}\coloneqq n-\|f_{t}\|_{1}$ , that is, the distance between the sum of the frequencies and its ideal value $n$ . Our general argument is that this process with probability $1-\exp(-\Omega(k))$ stays above $\frac{1}{4}k$ for $\exp(\Omega(k))$ iterations. In each iteration with $D_{t}\geq\frac{1}{4}k$ , the probability that the optimum is sampled is only $\exp(-\Omega(k))$ , see Lemma 9. Hence there is a $T=\exp(\Omega(k))$ such that with probability $1-\exp(-\Omega(k))$ , the optimum is not sampled in the first $T$ iterations.

The heart of the proof is an analysis of the process $(D_{t})$ . It is intuitively clear that once the process is below $k$ , then often the two search points sampled in one iteration both lie in the gap region, which gives $D_{t}$ a positive drift (that is, a decrease of the average frequency). To turn this drift away from the target (a small $D_{t}$ value) into an exponential lower bound on the runtime, we consider the process

$$Y_{t}\coloneqq\exp\big(c\min\{\tfrac{1}{2}k-D_{t},\tfrac{1}{4}k\}\big),$$

that is, an exponential rescaling of $D_{t}$ . Such a rescaling has recently also been used in [ADY19]. We note that the usual way to prove exponential lower bounds is the negative drift theorem of Oliveto and Witt [OW12]. We did not immediately see how to use it for our purposes, though, since in our process we do not have very strong bounds on the one-step differences. E.g., when $D_{t}=\frac{1}{2}k$ , then the underlying frequency vector may be such that $D_{t+1}\geq D_{t}+\sqrt{k}$ happens with constant probability. We also note that after the submission of this work, a negative multiplicative drift theorem was proposed [Doe20b], which would be applicable to our setting as well. It would, however, not greatly simplify the proof as the main work, estimating the drift of the process $(Y_{t})$ , would still be needed.

We shall show that the process $Y_{t}$ has at most a constant point-wise drift, more precisely, that

$$E[Y_{t+1}-Y_{t}\mid Y_{t}=y]\leq 2\qquad(3)$$

holds for all $y<Y_{\max}\coloneqq\exp(\frac{c}{4}k)$ . From this statement, the lower bound version of the additive drift theorem (Theorem 7 (ii)) would immediately show that the expected time to reach a $D_{t}$ value of $\frac{k}{4}$ or less is at least exponential in $k$ . However, since we aim at a runtime bound that holds with high probability, we take a different (and, in fact, more elementary) route. We regard the process $(\tilde{Y}_{t})$ which is identical to $Y$ until $Y$ first reaches $Y_{\max}$ and then stays constant at $Y_{\max}$ . This process satisfies $E[\tilde{Y}_{t+1}-\tilde{Y}_{t}]\leq 2$ for all times $t$ . From this and $\tilde{Y}_{0}=Y_{0}<1$ we obtain $E[\tilde{Y}_{t}]\leq 1+2t$ . Hence for $T=\exp(\Omega(k))$ sufficiently small, we have

[TABLE]

and a simple Markov bound argument is enough to show that ${\Pr[\tilde{Y}_{T}=Y_{\max}]}=\exp(-\Omega(k))$ . Note that $\tilde{Y}_{T}<Y_{\max}$ is equivalent to $Y_{t}<Y_{\max}$ for all $t\in[0..T]$ .

The main work in the following proof is showing (3). The difficulty here is hidden in a small detail. When $D_{t}\in[\frac{1}{4}k,\frac{3}{4}k]$, and this is the most interesting case (case 2 in the formal proof), then we have $\|f^{\prime}_{t+1}\|_{1}\leq\|f_{t}\|_{1}$ whenever the two search points sampled lie in the gap region, and hence with probability $1-\exp(-\Omega(k))$; from Lemma 12 we obtain, in addition, a true decrease, that is, $\|f^{\prime}_{t+1}\|_{1}\leq\|f_{t}\|_{1}-\frac{1}{\mu}$, with constant probability. This progress of $f^{\prime}_{t+1}$ over $f_{t}$ would be perfectly fine for our purposes. Hence the true difficulty arises from the capping of the frequencies into the interval $[\frac{1}{n},1-\frac{1}{n}]$, that is, from the fact that the new frequency vector is $f_{t+1}\coloneqq\operatorname{minmax}(\frac{1}{n}\textbf{1}_{n},f^{\prime}_{t+1},(1-\frac{1}{n})\textbf{1}_{n})$. This appears to be a minor problem, among others, because only a capping at the lower bound $\frac{1}{n}$ can have an adverse effect on our process, and there are at most $O(k)$ frequencies sufficiently close to the lower boundary. Things become difficult due to the exponential scaling, which can let rare events still have a significant influence on the expected change of the process.
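The distinction between the intermediate vector $f^{\prime}_{t+1}$ and the capped vector $f_{t+1}$ is easiest to see in code. The following is a minimal sketch of one cGA iteration as described above, not the author's implementation: the function names and the explicit random generator argument are assumptions made for illustration.

```python
import random

def cga_step(f, fitness, mu, rng):
    """One cGA iteration: sample two points, shift frequencies toward the better one, cap."""
    n = len(f)
    x1 = [1 if rng.random() < fi else 0 for fi in f]
    x2 = [1 if rng.random() < fi else 0 for fi in f]
    # y1 is the sample with the higher (or equal) fitness.
    y1, y2 = (x1, x2) if fitness(x1) >= fitness(x2) else (x2, x1)
    # Intermediate frequency vector f' before capping.
    f_prime = [fi + (a - b) / mu for fi, a, b in zip(f, y1, y2)]
    # Capping into [1/n, 1 - 1/n] gives the new frequency vector.
    lo, hi = 1 / n, 1 - 1 / n
    f_next = [min(max(v, lo), hi) for v in f_prime]
    return f_next, f_prime, (y1, y2)

def frequency_distance(f):
    """The quantity D_t = n - ||f_t||_1."""
    return len(f) - sum(f)
```

Returning both vectors makes it possible to observe the effect discussed above: the sum of `f_next` can exceed the sum of `f_prime` only through capping at the lower boundary $\frac{1}{n}$.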

We now make these arguments precise and prove Theorem 22. We recall that, while the theorem refers to arbitrary superjump functions with jump size $k$ , we can always assume that we regard a fitness function that agrees with the classic jump function with parameter $k$ on all search points $x$ with $\|x\|_{1}\geq n-k$ , including the unique global optimum $(1,\dots,1)$ .

Proof.

Since we are aiming at an asymptotic statement, we can assume in the following that $n$ is sufficiently large.

To ease the presentation of the main part of the proof, let us first give a basic argument for the case of small $k$ and then assume that $k\geq w(n)$ for some function $w:\mathbb{N}\to\mathbb{N}$ with $\lim_{n\to\infty}w(n)=\infty$ .

We first note that with probability $f_{it}^{2}+(1-f_{it})^{2}\geq\frac{1}{2}$ , the two search points $x^{1}$ and $x^{2}$ generated in the $t$ -th iteration agree in the $i$ -th bit, which in particular implies that $f_{i,t+1}=f_{it}$ . Hence with probability at least $2^{-T}$ , this happens for the first $T$ iterations, and thus $f_{it}=\frac{1}{2}$ for all $t\in[0..T]$ . Let us call such a bit position $i$ idle.

Note that the events of being idle are independent for all $i\in[1..n]$ . Hence, taking $T=\lfloor\frac{1}{2}\log_{2}n\rfloor$ , we see that the number $X$ of idle positions has an expectation of $E[X]\geq n2^{-T}\geq\sqrt{n}$ , and by a simple Chernoff bound (Theorem 4), we have $\Pr[X\geq\frac{1}{2}\sqrt{n}]\geq 1-\exp(-\Omega(\sqrt{n}))$ .

Conditional on having at least $\tfrac{1}{2}\sqrt{n}$ idle bit positions, the probability that a particular search point sampled in the first $T$ iterations is the optimum is at most $2^{-\frac{1}{2}\sqrt{n}}$ . By a simple union bound argument, the probability that at least one of the search points generated in the first $T$ iterations is the optimum is at most $2T2^{-\frac{1}{2}\sqrt{n}}=\exp(-\Omega(\sqrt{n}))$ . In summary, we have that with probability at least $1-\exp(-\Omega(\sqrt{n}))$ , the runtime of the cGA on any function with unique optimum (and in particular any superjump function) is greater than $T=\frac{1}{2}\log_{2}n$ . This implies the claim of this theorem for any $k\leq C\log\log n$ , where $C$ is a sufficiently small constant, and, as discussed above, $n$ is sufficiently large.
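The quantities in this small-$k$ argument can be evaluated numerically. The sketch below is purely illustrative, with a hypothetical function name; it computes $T=\lfloor\frac{1}{2}\log_{2}n\rfloor$, the lower bound $n2^{-T}$ on the expected number of idle positions, and the union bound $2T2^{-\frac{1}{2}\sqrt{n}}$.

```python
from math import floor, log2, sqrt

def idle_bit_bound(n):
    """Return T, the bound n * 2^(-T) on E[X], and the union bound 2T * 2^(-sqrt(n)/2)."""
    T = floor(0.5 * log2(n))
    expected_idle_lower = n * 2.0**(-T)
    union_bound = 2 * T * 2.0**(-0.5 * sqrt(n))
    return T, expected_idle_lower, union_bound

if __name__ == "__main__":
    for n in (10**4, 10**6):
        print(n, idle_bit_bound(n))
```

For $n=10^{6}$ this gives $T=9$ and an expected number of idle positions of at least $n2^{-9}\approx 1953\geq\sqrt{n}$.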

With this, we can now safely assume that $k=\omega(1)$ . Since $k\leq n$ and our result is invariant under constant-factor changes of $k$ , we can assume that $k\leq\frac{n}{320}$ .

Let $D_{t}\coloneqq n-\|f_{t}\|_{1}=n-\sum_{i=1}^{n}f_{it}$ be the distance of the sum of the frequencies from the ideal value $n$ .

Our intuition (which will be made precise) is that the process $(D_{t})$ finds it hard to go significantly below $k$ because there we will typically sample individuals in the gap, which leads to a decrease of the sum of frequencies (when the two individuals have different distances from the optimum). To obtain an exponential lower bound on the runtime, we suitably rescale the process by defining, for a sufficiently small constant $c$ ,

$$Y_{t}\coloneqq\exp\big(c\min\{\tfrac{1}{2}k-D_{t},\tfrac{1}{4}k\}\big).$$

Observe that $Y_{t}$ attains its maximal value $Y_{\max}=\exp(\frac{1}{4}ck)$ precisely when $D_{t}\leq\frac{1}{4}k$ . Also, $Y_{t}\leq 1$ for $D_{t}\geq\frac{1}{2}k$ .
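One rescaling with the two properties just stated is $\exp(c\min\{\frac{1}{2}k-D_{t},\frac{1}{4}k\})$. The sketch below assumes this form (it is inferred from the stated properties, and the function name is hypothetical) and shows how the potential behaves as a function of $D_{t}$.

```python
from math import exp

def rescaled_potential(D, k, c):
    """Assumed form of Y_t: exp(c * min(k/2 - D, k/4))."""
    return exp(c * min(0.5 * k - D, 0.25 * k))

if __name__ == "__main__":
    k, c = 40, 0.1
    for D in (0, 10, 15, 20, 30, 500):
        print(D, rescaled_potential(D, k, c))
```

The potential is capped at $Y_{\max}=\exp(\frac{1}{4}ck)$ for all $D_{t}\leq\frac{1}{4}k$, decreases in $D_{t}$ beyond that point, and is at most $1$ once $D_{t}\geq\frac{1}{2}k$.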

To argue that we have $D_{t}>\tfrac{1}{4}k$ for a long time, we now show that for all $y<Y_{\max}$ the drift $E[Y_{t+1}-Y_{t}\mid Y_{t}=y]$ is at most constant. To this aim, we condition on a fixed value of $f_{t}$ , which also determines $D_{t}$ . We treat separately the two cases that $D_{t}\geq\frac{3}{4}k$ and that $\frac{3}{4}k>D_{t}>\frac{1}{4}k$ .

Case 1: Assume first that $D_{t}\geq\frac{3}{4}k$ . By Lemma 5, with probability $1-\exp(-\Omega(D_{t}))\geq 1-\exp(-\Omega(k))$ , the two search points $x^{1},x^{2}$ sampled in iteration $t+1$ both satisfy

[TABLE]

Here and in the following, when writing $\Omega(k)$ we mean that there is a positive constant $C$ , independent of $n$ , $k$ , and $c$ , such that the expression is at least $Ck$ . Let us call $A$ the event described in (4). In this case, we argue as follows. We recall the notation $\sum[v]\coloneqq\sum_{i=1}^{n}v_{i}$ to denote the sum of the elements of an $n$ -dimensional vector $v$ and we recall further that, with a slight abuse of notation, we defined $\|f^{\prime}\|_{1}\coloneqq\sum[f^{\prime}]$ for intermediate frequency vectors $f^{\prime}$ . Let $\{y^{1},y^{2}\}=\{x^{1},x^{2}\}$ such that $\mathcal{F}(y^{1})\geq\mathcal{F}(y^{2})$ . Then

[TABLE]

We still need to consider the possibility that $f_{i,t+1}>f^{\prime}_{i,t+1}$ for some $i\in[1..n]$ . By Lemma 8, not conditioning on $A$ , we have that $\|f_{t+1}\|_{1}-\|f^{\prime}_{t+1}\|_{1}\preceq\frac{1}{\mu}\operatorname{Bin}(\ell,P)\preceq\operatorname{Bin}(\ell,P)$ for some $\ell\in[1..n]$ and $P=2\frac{1}{n}(1-\frac{1}{n})$ .

Let us call $B$ the event that $\|f_{t+1}\|_{1}-\|f^{\prime}_{t+1}\|_{1}<\frac{1}{6}k$ . Note that $A\cap B$ implies $\|f_{t+1}\|_{1}<n-\frac{1}{2}k$ and thus $Y_{t+1}\leq 1$ . By Lemma 3 and the estimate $\binom{a}{b}\leq(\frac{ea}{b})^{b}$ , we have

[TABLE]

We conclude that the event $A\cap B$ holds with probability $1-\exp(-\Omega(k))$ ; in this case $Y_{t}\leq 1$ and $Y_{t+1}\leq 1$ . In all other cases, we bluntly estimate $Y_{t+1}-Y_{t}\leq Y_{\max}$ . This gives

[TABLE]

By choosing the constant $c$ in the definition of $(Y_{t})$ sufficiently small and taking $n$ sufficiently large, we have $E[Y_{t+1}-Y_{t}]\leq 2$ .

Case 2: Assume now that $\frac{3}{4}k>D_{t}>\frac{1}{4}k$ . Let $x^{1},x^{2}$ be the two search points sampled in iteration $t+1$ and let $y^{1},y^{2}$ be such that $\{y^{1},y^{2}\}=\{x^{1},x^{2}\}$ and $\mathcal{F}(y^{1})\geq\mathcal{F}(y^{2})$ . By Lemma 5 again, we have ${k>n-\|x^{i}\|_{1}>0}$ with probability $1-\exp(-\Omega(k))$ for both $i\in\{1,2\}$ . Let us call this event $A$ . Note that if $A$ holds, then both offspring lie in the gap region. Consequently, $\|y^{1}\|_{1}\leq\|y^{2}\|_{1}$ and thus $\|f^{\prime}_{t+1}\|_{1}\leq\|f_{t}\|_{1}$ .

Let $L=\{i\in[1..n]\mid f_{it}=\frac{1}{n}\}$ , $\ell=|L|$ , and $M=\{i\in L\mid x^{1}_{i}\neq x^{2}_{i}\}$ as in Lemma 8. Note that by definition, $D_{t}\geq(1-\frac{1}{n})\ell$ , hence from $D_{t}<\frac{3}{4}k$ and $n\geq 4$ we obtain $\ell<k$ .

Let $B_{0}$ be the event that $|M|=0$ , that is, $x^{1}_{|L}=x^{2}_{|L}$ . Note that in this case, we have $f_{t+1}\leq f^{\prime}_{t+1}$ (component-wise) and thus

[TABLE]

By Lemma 8, Bernoulli’s inequality, and $\ell\leq k$ , we have

$$\Pr[B_{0}]=(1-P)^{\ell}\geq 1-\ell P\geq 1-\frac{2k}{n}.$$

Since $\ell<k\leq\frac{n}{320}<\frac{n}{2}$ , by Lemma 12, we have $\|x^{1}_{|[n]\setminus L}\|_{1}\neq\|x^{2}_{|[n]\setminus L}\|_{1}$ with probability at least $\frac{1}{16}$ . This event, called $C$ in the following, is independent of $B_{0}$ . We have

[TABLE]

If $A\cap B_{0}\cap C$ holds, then $\|f_{t+1}\|_{1}\leq\|f^{\prime}_{t+1}\|_{1}\leq\|f_{t}\|_{1}-\frac{1}{\mu}$ . If $A\cap B_{0}\cap\overline{C}$ holds, then we still have $\|f_{t+1}\|_{1}\leq\|f^{\prime}_{t+1}\|_{1}\leq\|f_{t}\|_{1}$ .

Let us now, for $j\in[1..\ell]$ , denote by $B_{j}$ the event that $|M|=j$ , that is, that $x^{1}_{|L}$ and $x^{2}_{|L}$ differ in exactly $j$ bits. By Lemma 8 again, we have $\Pr[B_{j}]=\Pr[\operatorname{Bin}(\ell,P)=j]$ .

The event $A\cap B_{j}$ implies $\|f_{t+1}\|_{1}\leq\|f^{\prime}_{t+1}\|_{1}+\frac{j}{\mu}\leq\|f_{t}\|_{1}+\frac{j}{\mu}$ and occurs with probability $\Pr[A\cap B_{j}]\leq\Pr[B_{j}]=\Pr[\operatorname{Bin}(\ell,P)=j]$ .

Taking these observations together, we compute

[TABLE]

We note that the second and third term amount to $Y_{t}E[\exp(\frac{cZ}{\mu})]$ , where $Z\sim\operatorname{Bin}(\ell,P)$ . Writing $Z=\sum_{i=1}^{\ell}Z_{i}$ as a sum of $\ell$ independent binary random variables with $\Pr[Z_{i}=1]=P$ , we obtain

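$$E\big[\exp(\tfrac{cZ}{\mu})\big]=\prod_{i=1}^{\ell}E\big[\exp(\tfrac{cZ_{i}}{\mu})\big]=\big(1-P+P\exp(\tfrac{c}{\mu})\big)^{\ell}.$$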

By assuming $c\leq 1$ and using the elementary estimate $e^{x}\leq 1+2x$ valid for $x\in[0,1]$ , see, e.g., Lemma 1.4.2(b) in [Doe20c], we have

$$1-P+P\exp(\tfrac{c}{\mu})\leq 1-P+P\big(1+\tfrac{2c}{\mu}\big)=1+\tfrac{2Pc}{\mu}.$$

Hence with $P\leq\frac{2}{n}$ , $\mu\geq 1$ , and $\ell\leq\frac{n}{320}$ , we obtain

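$$E\big[\exp(\tfrac{cZ}{\mu})\big]\leq\big(1+\tfrac{2Pc}{\mu}\big)^{\ell}\leq\exp\big(\tfrac{2Pc\ell}{\mu}\big)\leq\exp\big(\tfrac{c}{80\mu}\big)\leq 1+\tfrac{c}{40\mu},$$

again by using $e^{x}\leq 1+2x$. The second and third term of (5) thus add up to at most $(1+\frac{c}{40\mu})Y_{t}$.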

In the first term of (5), we again assume that $c$ is sufficiently small to ensure that $\exp(-\Omega(k))Y_{\max}=\exp(-\Omega(k))\exp(\frac{1}{4}ck)\leq 1$ . Recalling that $k\leq\frac{n}{320}$ and assuming $k$ sufficiently large (since $k=\omega(1)$ and $n$ is large), we finally estimate in the last term $\tfrac{1}{16}(1-\tfrac{2k}{n})-\exp(-\Omega(k))\geq\tfrac{1}{20}$ and, more interestingly, $1-\exp(-\tfrac{c}{\mu})\geq\tfrac{c}{\mu}(1-\frac{1}{e})$ using the estimate $e^{-x}\leq 1-x(1-\frac{1}{e})$ valid for all $x\in[0,1]$ , which stems simply from the convexity of the exponential function.
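Both elementary estimates used in this case hold on the whole interval $[0,1]$, which can be confirmed on a fine grid. The following check is only a numerical sanity test with a hypothetical function name, not a proof; a small tolerance absorbs floating-point rounding at $x=1$, where the second estimate holds with equality.

```python
from math import e, exp

def check_elementary_estimates(steps=1000):
    """Check e^x <= 1 + 2x and e^(-x) <= 1 - x(1 - 1/e) on a grid over [0, 1]."""
    for i in range(steps + 1):
        x = i / steps
        if exp(x) > 1 + 2 * x + 1e-12:
            return False
        if exp(-x) > 1 - x * (1 - 1 / e) + 1e-12:
            return False
    return True
```

Both bounds replace the exponential function by the chord through the endpoints of $[0,1]$ (for the first one, a slightly weaker line), so they follow from convexity as noted above.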

With these estimates we obtain

[TABLE]

and thus $E[Y_{t+1}-Y_{t}]\leq 1$ .

In summary, we have now shown that for all $y<Y_{\max}$ and at all times $t$ the process $(Y_{t})$ satisfies $E[Y_{t+1}-Y_{t}\mid Y_{t}=y]\leq 2$ . We note that $Y_{0}\leq 1$ with probability one. For the sake of the argument, let us artificially modify the process from the point on when it has reached a state of at least $Y_{\max}$ . So we define $(\tilde{Y}_{t})$ by setting $\tilde{Y}_{t}=Y_{t}$ , if $Y_{t}<Y_{\max}$ or if $Y_{t}\geq Y_{\max}$ and $Y_{t-1}<Y_{\max}$ , and $\tilde{Y}_{t}=\tilde{Y}_{t-1}$ otherwise. In other words, $(\tilde{Y}_{t})$ is a copy of $(Y_{t})$ until it reaches a state of at least $Y_{\max}$ and then does not move anymore. With this trick, we have $E[\tilde{Y}_{t+1}-\tilde{Y}_{t}]\leq 2$ for all $t$ .

A simple induction and the initial condition $\tilde{Y}_{0}\leq 1$ show that ${E[\tilde{Y}_{t}]\leq 2t+1}$ for all $t$. In particular, for $T=\frac{1}{2}\exp(\frac{1}{8}ck)-1$, we have $E[\tilde{Y}_{T}]\leq\exp(\frac{1}{8}ck)$ and, by Markov’s inequality,

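$$\Pr[\tilde{Y}_{T}\geq Y_{\max}]\leq\frac{E[\tilde{Y}_{T}]}{Y_{\max}}\leq\frac{\exp(\frac{1}{8}ck)}{\exp(\frac{1}{4}ck)}=\exp(-\tfrac{1}{8}ck).$$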

Hence with probability $1-\exp(-\tfrac{1}{8}ck)$, we have $\tilde{Y}_{T}<Y_{\max}$. We now condition on this event. By construction of $(\tilde{Y}_{t})$, we have $Y_{t}<Y_{\max}$, equivalently $D_{t}>\frac{1}{4}k$, for all $t\in[0..T]$. If $D_{t}>\frac{1}{4}k$, then by Lemma 9 the probability that a sample generated in this iteration is the optimum is at most $\exp(-\tfrac{1}{4}k)$. Assuming $c\leq 1$ again, we see that the probability that the optimum is generated in one of the first $T$ iterations is at most $2T\exp(-\frac{1}{4}k)\leq\exp(\frac{1}{8}ck)\exp(-\frac{1}{4}k)\leq\exp(-\frac{1}{8}k)$. This shows the claim. ∎
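The final bookkeeping of the proof can be replayed numerically. The sketch below uses a hypothetical function name and takes the per-sample bound $\exp(-\frac{1}{4}k)$ from Lemma 9 as given; for concrete $k$ and $c\leq 1$ it computes the time horizon $T$ together with the Markov bound and the union bound over the $2T$ samples.

```python
from math import exp

def lower_bound_parameters(k, c):
    """Return T, the Markov bound on Pr[Y~_T >= Y_max], and the union bound over 2T samples."""
    T = 0.5 * exp(c * k / 8) - 1
    y_max = exp(c * k / 4)
    markov_bound = (2 * T + 1) / y_max      # uses E[Y~_T] <= 2T + 1
    sampling_bound = 2 * T * exp(-k / 4)    # per-sample bound exp(-k/4) from Lemma 9
    return T, markov_bound, sampling_bound

if __name__ == "__main__":
    print(lower_bound_parameters(80, 0.5))
```

For $k=80$ and $c=\frac{1}{2}$ this gives $T\approx 73$, a Markov bound just below $e^{-5}$, and a union bound far below $e^{-10}$, in line with the two failure probabilities $\exp(-\frac{1}{8}ck)$ and $\exp(-\frac{1}{8}k)$.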

6 An $\Omega(n\log n)$ Lower Bound

With the exponential lower bound proven in the previous section, the runtime of the cGA on jump functions is well understood, except that the innocent-looking lower bound $\Omega(n\log n)$, matching the corresponding upper bound for $k\leq\frac{1}{20}\ln n-1$ and optimal choice of $\mu$, is still missing. Since Sudholt and Witt [SW19] have proven an $\Omega(n\log n)$ lower bound for the simple unimodal function OneMax, which for many EAs is known to be one of the easiest functions with unique global optimum [DJW12, Sud13, Wit13, Doe19a], it would be very surprising if this lower bound did not hold for jump functions as well.

In this section, we first argue why, unlike for many other algorithms, it is hard to show that a lower bound on the runtime of the cGA on OneMax extends to a lower bound for any other function with unique optimum. We then analyze in detail the proof of the $\Omega(n\log n)$ lower bound for OneMax [SW19] and argue that the same arguments can be applied in the case of jump functions (but not superjump functions).

6.1 Domination Arguments Fail

The true reason why OneMax is the easiest optimization problem for many evolutionary algorithms $\mathcal{A}$ , implicit in all such proofs and explicit in [Doe19a], is that when comparing a run of $\mathcal{A}$ on OneMax and on some other function $\mathcal{F}$ with unique global optimum, then at all times the Hamming distance between the current-best solution and the optimum in the OneMax process is stochastically dominated by the same quantity in the other process. This follows by induction and a coupling argument from the following key insight (here formulated for the $(1+1)$ EA only).

Lemma 23.

Let $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ be some function with unique global optimum $x^{*}$ and let OneMax be the $n$ -dimensional OneMax function with unique global optimum $y^{*}=(1,\dots,1)$ . Let $x,y\in\{0,1\}^{n}$ such that $H(x,x^{*})\geq H(y,y^{*})$ , where $H(\cdot,\cdot)$ denotes the Hamming distance. Consider one iteration of the $(1+1)$ EA optimizing $\mathcal{F}$ , started with $x$ as parent individual, and denote by $x^{\prime}$ the parent in the next iteration. Define $y^{\prime}$ analogously for OneMax and $y$ . Then $H(x^{\prime},x^{*})\succeq H(y^{\prime},y^{*})$ .

As a side remark, note that the lemma applied in the special case $\mathcal{F}=\textsc{OneMax}$ shows that the intuitive rule “the closer a search point is to the optimum, the shorter is the optimization time when starting from this search point” holds for optimizing OneMax via the $(1+1)$ EA.

We now show that a statement like Lemma 23 is not true for the cGA. Since the states of a run of the cGA are the frequency vectors $f$ , the natural extension of the Hamming distance quality measure above is the $\ell_{1}$ -distance $d(f,x^{*})=\|f-x^{*}\|_{1}=\sum_{i=1}^{n}|f_{i}-x^{*}_{i}|$ . Note that for $x^{*}=(1,\ldots,1)$ , we have $d(f,x^{*})=n-\|f\|_{1}$ , the distance measure regarded in many of the other proofs in this work.

Lemma 24.

Let $n$ be even. We consider running the cGA with hypothetical population size $\mu=n$ . Then there are a fitness function $\mathcal{F}:\{0,1\}^{n}\to\mathbb{R}$ with unique global optimum $x^{*}=(1,\ldots,1)$ and frequency vectors $f,g\in(F_{\mu})^{n}$ such that the following holds. Let $\tilde{f}$ be the frequency vector obtained after one iteration of optimizing $\mathcal{F}$ via the cGA started with frequency vector $f$ . Let $\tilde{g}$ be the frequency vector obtained after one iteration running the cGA on OneMax (with unique global optimum $y^{*}\coloneqq x^{*}$ ) started with $g$ . Then $d(f,x^{*})\geq d(g,y^{*})$ , but $d(\tilde{f},x^{*})\not\succeq d(\tilde{g},y^{*})$ .

Proof.

Let $\mathcal{F}$ be any subjump function with jump size $k\leq\frac{n}{4}$ . Let $f=\frac{1}{2}\textbf{1}_{n}$ . Let $g\in[0,1]^{n}$ be such that half the entries of $g$ are equal to $\frac{1}{n}+\frac{1}{\mu}=\frac{2}{n}$ and the other half are equal to $1-\frac{1}{n}-\frac{1}{\mu}=1-\frac{2}{n}$ .

We obviously have $d(f,x^{*})\geq d(g,y^{*})$, since both numbers are equal to $\frac{n}{2}$. Since with probability $1-\exp(-\Omega(n))$, both search points sampled in the jump process have between $\frac{n}{4}$ and $\frac{3n}{4}$ ones, their jump fitnesses equal their OneMax fitnesses. Consequently, we may apply Lemma 5 from [Dro06] (or, with one more argument, Lemma 14) and see that $E[d(\tilde{f},x^{*})]\leq\frac{n}{2}-\Omega(\frac{1}{\mu}\sqrt{n})$. For the OneMax process started in $g$, however, denoting the two search points generated in this iteration by $x^{1}$ and $x^{2}$, we have

[TABLE]

From this and $d(\tilde{g},y^{*})=\|\tilde{g}-y^{*}\|_{1}=\|\tilde{g}-g+g-y^{*}\|_{1}\geq\|g-y^{*}\|_{1}-\|\tilde{g}-g\|_{1}$ , we obtain

[TABLE]

Since thus $E[d(\tilde{f},x^{*})]<E[d(\tilde{g},y^{*})]$ for $n$ sufficiently large, we cannot have $d(\tilde{f},x^{*})\succeq d(\tilde{g},y^{*})$. ∎

We note that a second imaginable domination result is also not true, namely that, roughly speaking, the frequency vector arising from one iteration started with a better initial frequency vector dominates the result of starting with a worse initial frequency vector. More precisely, we have the following.

Lemma 25.

Let $\mu$ be an arbitrary hypothetical population size for all cGAs considered here. There are frequency vectors $f,g\in(F_{\mu})^{n}$ with $f\leq g$ (componentwise) such that the following holds. Let $\mathcal{F}$ be any subjump function with jump size at most $\frac{n}{2}$ (including the OneMax function). Let $\tilde{f}$ be the frequency vector resulting from optimizing $\mathcal{F}$ for one iteration with the cGA started with frequency vector $f$ . Let $\tilde{g}$ be the frequency vector resulting from optimizing OneMax for one iteration with the cGA started with frequency vector $g$ . Then we do not have $\tilde{f}_{i}\preceq\tilde{g}_{i}$ for all $i\in[1..n]$ .

Proof.

Let $f=(\frac{1}{2},\frac{1}{n},\dots,\frac{1}{n})$ and $g=\frac{1}{2}\textbf{1}_{n}$ . Clearly, $f\leq g$ .

When performing one iteration of the cGA on $\mathcal{F}$ started with $f$ , and denoting the two samples by $x^{1}$ and $x^{2}$ and their quality difference in all but the first bit by $\Delta=\|x^{1}_{|[2..n]}\|_{1}-\|x^{2}_{|[2..n]}\|_{1}$ , then the argument that with probability $1-\exp(-\Omega(n))$ this iteration equals an iteration with OneMax as objective function shows that the resulting frequency vector $\tilde{f}$ satisfies

[TABLE]

Since $\Pr[\Delta\in\{-1,0\}]\geq\Pr[\|x^{1}_{|[2..n]}\|_{1}=\|x^{2}_{|[2..n]}\|_{1}=0]=(1-\frac{1}{n})^{2(n-1)}\geq\frac{1}{e^{2}}$ , we have $\Pr[\tilde{f}_{1}=\tfrac{1}{2}+\tfrac{1}{\mu}]\geq\frac{1}{4}+\frac{1}{4e^{2}}-\exp(-\Omega(n))$ .

When starting the iteration with $g$ , the resulting frequency vector $\tilde{g}$ satisfies an equation analogous to (6), but now $\Delta$ is the difference of two binomial distributions with parameters $n-1$ and $\frac{1}{2}$ . Hence, we have $\Pr[\Delta\in\{-1,0\}]=O(n^{-1/2})$ , see, e.g., [Doe20c, Lemma 1.4.13] for this elementary estimate, and thus $\Pr[\tilde{g}_{1}=\tfrac{1}{2}+\tfrac{1}{\mu}]=\frac{1}{4}+o(1)$ , disproving that $\tilde{f}_{1}\preceq\tilde{g}_{1}$ . ∎
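The two probabilities $\Pr[\Delta\in\{-1,0\}]$ compared in this proof can be computed exactly for moderate $n$. The sketch below uses hypothetical function names and treats $\Delta$ as the difference of two independent binomial variables with $n-1$ trials, with success probability $\frac{1}{n}$ for the start vector $f$ and $\frac{1}{2}$ for $g$.

```python
from math import comb

def binom_pmf_list(m, p):
    """Probability mass function of Bin(m, p) as a list indexed by the outcome."""
    return [comb(m, j) * p**j * (1 - p)**(m - j) for j in range(m + 1)]

def prob_delta_in_minus1_0(m, p):
    """Pr[A - B in {-1, 0}] for independent A, B ~ Bin(m, p)."""
    pmf = binom_pmf_list(m, p)
    same = sum(q * q for q in pmf)                              # A = B
    minus_one = sum(pmf[j] * pmf[j + 1] for j in range(m))      # A = j, B = j + 1
    return same + minus_one

if __name__ == "__main__":
    n = 400
    print("start f:", prob_delta_in_minus1_0(n - 1, 1 / n))
    print("start g:", prob_delta_in_minus1_0(n - 1, 0.5))
```

For $n=400$ the first value is about $0.52$, well above the bound $e^{-2}$ used in the proof, while the second is below $0.06$, consistent with the $O(n^{-1/2})$ estimate.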

In summary, the richer mechanism of building a probabilistic model of the search space in the cGA (as opposed to using a population in EAs) makes it hard to argue that OneMax is the easiest function for the cGA. This, in particular, has the consequence that lower bounds for the runtime of the cGA on OneMax cannot be easily extended to other functions with a unique global optimum.

6.2 Imitating the OneMax Proof

Above, we have seen that a simple, general argument why a lower bound for the runtime of the cGA on OneMax should extend to jump functions appears hard to find. For this reason, we now analyze the proof of the lower bound given in [SW19] and observe, fortunately, that its main arguments apply equally well to jump functions. Since the full proof in [SW19] is relatively long, namely more than twelve pages, we apologize to the reader that we cannot give a self-contained version of the proof, but that instead we only argue why the arguments given in [SW19] remain valid in our case.

We show the following result, which is independent of the jump size $k$. This result, in particular, shows that our upper bound of Theorem 13 is asymptotically tight. We note that this result is proven only for jump functions, but not also for superjump functions. This is due to the fact that the lower bound in [SW19] is only proven for OneMax and not for all functions with unique global optimum.

Theorem 26.

Let $c>0$ be an arbitrary constant. Let $C$ be a constant that is sufficiently large compared to $c$ . Let $\mu\geq C\log n$ and $\mu\leq n^{c}$ . Then with probability $1-o(1)$ , the runtime of the cGA with hypothetical population size $\mu$ on any $n$ -dimensional jump function is at least $\Omega(\mu\sqrt{n}+n\log n)$ .

Proof.

When $k$ is $\Omega(n)$ , then Theorem 22 gives a lower bound of $\exp(\Omega(n))$ with high probability. For this reason, we can now conveniently assume that $k\leq\kappa n$ for an arbitrarily small constant $\kappa>0$ .

As announced, we argue that the main arguments of the proof of the corresponding result in [SW19], Theorem 8, remain valid. The proof of this Theorem 8 mostly consists of Lemma 10 to 15 (in [SW19]). There is nothing to show for Lemma 10 as it refers only to iterations in which a fixed bit is performing a random-walk step (in which the fitness function is irrelevant). Lemma 11 is a statement on sums of independent random variables and does not refer to the cGA at all. In Lemma 12, a lower bound on the probability of a non-random-walk step is given. Informally speaking, a non-random-walk step for a particular bit means that in this iteration, the particular bit has an influence on how the two offspring are sorted before the frequency update. Since two search points have the same OneMax value if and only if they have the same objective value w.r.t. some jump function, this probability for a non-random-walk step is the same for OneMax and the jump function. Lemma 13, while formulated in the language of the cGA, is a statement on independent parallel unbiased random walks. The basic argument in the proof of Lemma 14 is that when $n^{\varepsilon}$ frequencies have reached the lower boundary, then with high probability at least one of them will not move for $\Omega(n\log n)$ iterations, simply because the two offspring generated in each iteration always agree in this bit. The claim of Lemma 15 includes that $\Omega(n)$ frequencies stay in the interval $[\frac{1}{6},\frac{5}{6}]$ for a given time frame $T$ . To sample a search point in the gap, since $k$ is sufficiently small, at least a constant fraction of these bits have to be sampled as one. By a simple Chernoff bound (Theorem 4), this happens only with probability $\exp(-\Omega(n))$ in one iteration. 
Since Lemma 15 gives a statement with probability $1-\operatorname{poly}(n)2^{-\Omega(\min\{\mu,n\})}$ only, the probabilities of sampling a search point in the gap do not affect the failure probability of $\operatorname{poly}(n)2^{-\Omega(\min\{\mu,n\})}$. The main proof of Theorem 8 consists mostly of applications of these intermediate results. Only the last two paragraphs discuss what happens after the time frame $T$, which was analyzed in Lemma 15. These two paragraphs, however, again only use general properties of the cGA that are independent of the particular fitness function. (Footnote 4: To be very precise, the argument that a frequency at the lower boundary leaves this boundary only with probability $O(n^{-3/2})$ in one iteration is not correct, but the authors of [SW19] convinced us that also with the correct estimate of $O(\frac{1}{n})$ and setting the implicit constants right, at least $\sqrt{n}$ frequencies remain at the lower boundary at the end of the first $T$ iterations. This is enough to apply Lemma 14.) In summary, all arguments given in the proof of Theorem 8 in [SW19] are equally valid for the optimization of a jump function with $k\leq\kappa n$ instead of the OneMax function. This proves our claim. ∎

We note that the proof above (and thus our result) applies not only to jump functions, but to all functions where Theorem 22 can be employed and, more interestingly, to all functions that agree with OneMax on all search points $x$ with $\frac{n}{6}\leq\|x\|_{1}\leq\frac{5n}{6}$. This restriction is necessary to use the arguments of [SW19]. Overcoming this restriction is most likely non-trivial. It would most likely immediately imply a general lower bound of $\Omega(n\log n)$ for the runtime of the cGA on any function with unique global optimum, which is a major open problem in the field.
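The proof above uses two elementary facts about jump functions, which can be checked by brute force. The sketch below assumes the standard definition of the jump function (value $\|x\|_{1}+k$ if $\|x\|_{1}\leq n-k$ or $\|x\|_{1}=n$, and $n-\|x\|_{1}$ otherwise). The function names and the example bound $k\leq n/6$ are our own illustrative choices, not notation from the paper.

```python
def jump_value(ones, n, k):
    """Jump value of a length-n bit string with `ones` one-bits.

    Assumes the standard jump function definition; it depends on the
    bit string only through its number of ones.
    """
    if ones <= n - k or ones == n:
        return ones + k
    return n - ones  # search points in the gap


def ties_match_onemax(n, k):
    """True iff jump is injective as a function of the number of ones.

    In that case two search points have equal jump value exactly when
    they have equal OneMax value.
    """
    values = [jump_value(m, n, k) for m in range(n + 1)]
    return len(set(values)) == n + 1


def shifted_onemax_in_middle(n, k):
    """True iff jump equals OneMax + k for all n/6 <= ones <= 5n/6."""
    lo = -(-n // 6)      # ceil(n / 6) in integer arithmetic
    hi = (5 * n) // 6    # floor(5n / 6)
    return all(jump_value(m, n, k) == m + k for m in range(lo, hi + 1))
```

Under this definition, `ties_match_onemax` holds for every jump size, whereas `shifted_onemax_in_middle` holds only while the gap lies above $\frac{5n}{6}$, that is, for $k\leq\frac{n}{6}$. Agreement with OneMax is meant here up to the additive shift $k$, which does not change how two offspring are ranked.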

7 Conclusion

This study (including the preliminary versions [Doe19b, Doe19c]) is, to the best of our knowledge, after [HS18] only the second mathematical analysis of an EDA on a multimodal optimization problem. Our two main results are

(i) that the cGA can optimize jump functions with logarithmic jump sizes with asymptotically the same efficiency as the simple OneMax function; it thus does not suffer from the fitness valleys present in these objective functions;

(ii) an $\exp(\Omega(k))$ lower bound for the runtime of the cGA on jump functions with jump size $k$, regardless of the hypothetical population size $\mu$. This result shows, in particular, that the corresponding upper bound by Hasenöhrl and Sutton [HS18] cannot be improved by running the cGA with a hypothetical population size that is sub-exponential in $k$.
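Readers who wish to observe the behavior described in (i) empirically can use a minimal sketch of the cGA on a jump function such as the following. It assumes the usual formulation of the cGA (frequencies initialized to $\frac{1}{2}$, update step $\frac{1}{\mu}$, frequencies capped to $[\frac{1}{n},1-\frac{1}{n}]$) and the standard jump function. The names `jump` and `cga_on_jump` and the parameters `max_iters` and `seed` are hypothetical and serve only this illustration.

```python
import random


def jump(x, k):
    """Standard jump function with jump size k (assumed definition)."""
    n, ones = len(x), sum(x)
    if ones <= n - k or ones == n:
        return ones + k
    return n - ones


def cga_on_jump(n, k, mu, max_iters, seed=0):
    """Run a simple cGA on jump_{n,k}.

    Returns the first iteration in which the all-ones optimum is
    sampled, or None if this does not happen within max_iters.
    """
    rng = random.Random(seed)
    lower, upper = 1.0 / n, 1.0 - 1.0 / n
    freq = [0.5] * n
    for t in range(1, max_iters + 1):
        # Sample two offspring from the current frequency vector.
        x1 = [1 if rng.random() < p else 0 for p in freq]
        x2 = [1 if rng.random() < p else 0 for p in freq]
        if sum(x1) == n or sum(x2) == n:
            return t
        # Sort so that x1 is the offspring with the better jump value.
        if jump(x1, k) < jump(x2, k):
            x1, x2 = x2, x1
        # Shift each frequency by 1/mu towards the better offspring
        # and cap it to the interval [1/n, 1 - 1/n].
        for i in range(n):
            shifted = freq[i] + (x1[i] - x2[i]) / mu
            freq[i] = min(upper, max(lower, shifted))
    return None
```

For example, `cga_on_jump(n=50, k=2, mu=200, max_iters=100_000)` returns the iteration count of one run. Such a single run is only an illustration and does not replace the asymptotic statements above.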

The obvious question arising from this work is whether similar results hold for other EDAs and other optimization problems, or whether this result is a particularity of the cGA and jump functions. Natural candidates for other EDAs could be the UMDA, for which several rigorous runtime results exist, see [KW20a], and the significance-based cGA [DK20a], which might profit from using only the three frequencies $\frac{1}{n}$, $\frac{1}{2}$, and $1-\frac{1}{n}$. Candidates for optimization problems leading to a multimodal fitness landscape include the maximum matching problem [GW03, GW04] or the minimum vertex cover problem [OHY09, JOZ13].

We also proved an $\Omega(n\log n)$ lower bound for jump functions in Section 6, and did so by arguing that this lower bound is witnessed in the OneMax process at a time up to which the cGA most likely has not sampled a search point that lies in the gap of a jump function. For this reason, the proof of [SW19] extends to jump functions as well. This argument was sufficient for our purposes, but left the real (and most likely very difficult) question untouched, namely whether $\Omega(n\log n)$ is a lower bound for the cGA optimizing any function with unique global optimum. We do not dare to speculate what the answer is.

Bibliography

[AAD+19] Peyman Afshani, Manindra Agrawal, Benjamin Doerr, Carola Doerr, Kasper Green Larsen, and Kurt Mehlhorn. The query complexity of a permutation-based variant of Mastermind. Discrete Applied Mathematics, 260:28–50, 2019.
[AD11] Anne Auger and Benjamin Doerr, editors. Theory of Randomized Search Heuristics. World Scientific Publishing, 2011.
[AD18] Denis Antipov and Benjamin Doerr. Precise runtime analysis for plateaus. In Parallel Problem Solving From Nature, PPSN 2018, Part II, pages 117–128. Springer, 2018.
[AD20] Denis Antipov and Benjamin Doerr. Runtime analysis of a heavy-tailed $(1+(\lambda,\lambda))$ genetic algorithm on jump functions. In Parallel Problem Solving From Nature, PPSN 2020. Springer, 2020. To appear.
[ADK20] Denis Antipov, Benjamin Doerr, and Vitalii Karavaev. The $(1+(\lambda,\lambda))$ GA is even faster on multimodal problems. In Genetic and Evolutionary Computation Conference, GECCO 2020, pages 1259–1267. ACM, 2020.
[ADY19] Denis Antipov, Benjamin Doerr, and Quentin Yang. The efficiency threshold for the offspring population size of the $(\mu,\lambda)$ EA. In Genetic and Evolutionary Computation Conference, GECCO 2019, pages 1461–1469. ACM, 2019.
[AW09] Gautham Anil and R. Paul Wiegand. Black-box search by elimination of fitness functions. In Foundations of Genetic Algorithms, FOGA 2009, pages 67–78. ACM, 2009.
[BDK16] Maxim Buzdalov, Benjamin Doerr, and Mikhail Kever. The unrestricted black-box complexity of jump functions. Evolutionary Computation, 24:719–744, 2016.
