TL;DR
This paper conducts an empirical study on the hyper-parameters of the self-adjusting $(1+(\lambda,\lambda))$ GA, revealing that slight modifications can significantly improve performance and that theoretical parameter settings may extend to dynamic variants.
Contribution
It provides the first detailed empirical analysis of hyper-parameter effects in the self-adjusting $(1+(\lambda,\lambda))$ GA, including a new setup that reduces runtime and insights on parameter transferability.
Findings
15% reduction in average runtime with modified parameters
Non-identical offspring population sizes improve efficiency
Theoretical parameter settings extend to non-static variants
| n | 20% quantile | 25% quantile | 50% quantile | 75% quantile | 98% quantile | mean | rsd | A | b | success rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 3,009 | 3,048 | 3,239 | 3,493 | 4,214 | 3,296 | 11.5 | 1.11 | 0.65 | 5.05 |
| 1,000 | 6,118 | 6,207 | 6,534 | 6,876 | 7,719 | 6,573 | 7.9 | 1.07 | 0.79 | 4.52 |
| 1,500 | 9,682 | 9,764 | 10,199 | 10,646 | 11,775 | 10,267 | 6.6 | 1.08 | 0.63 | 6.68 |
| 2,000 | 12,764 | 12,897 | 13,349 | 13,877 | 15,082 | 13,411 | 5.6 | 1.08 | 0.73 | 5.29 |
| 2,500 | 15,879 | 16,038 | 16,650 | 17,241 | 18,649 | 16,683 | 5.5 | 1.05 | 0.85 | 4.11 |
| 3,000 | 19,733 | 19,896 | 20,653 | 21,458 | 23,924 | 20,778 | 6.2 | 1.12 | 0.63 | 4.92 |
| 3,500 | 22,675 | 22,808 | 23,458 | 24,196 | 25,902 | 23,537 | 4.3 | 1.06 | 0.79 | 5.02 |
| 4,000 | 26,573 | 26,730 | 27,518 | 28,378 | 30,688 | 27,639 | 4.7 | 1.11 | 0.64 | 5.40 |
| 4,500 | 29,368 | 29,649 | 30,354 | 31,289 | 33,506 | 30,494 | 4.3 | 1.07 | 0.78 | 4.71 |
| 5,000 | 33,243 | 33,454 | 34,358 | 35,607 | 38,232 | 34,601 | 4.6 | 1.10 | 0.63 | 5.74 |
| 6,000 | 40,279 | 40,543 | 41,406 | 42,373 | 45,166 | 41,535 | 3.7 | 1.09 | 0.65 | 6.01 |
| 7,000 | 46,469 | 46,712 | 47,840 | 48,992 | 52,307 | 47,977 | 3.8 | 1.08 | 0.74 | 5.01 |
| 8,000 | 53,206 | 53,529 | 54,666 | 55,956 | 58,903 | 54,807 | 3.5 | 1.07 | 0.76 | 5.08 |
| 9,000 | 59,949 | 60,206 | 61,279 | 62,618 | 66,868 | 61,547 | 3.4 | 1.07 | 0.76 | 4.96 |
| 10,000 | 65,761 | 66,144 | 67,315 | 68,767 | 71,414 | 67,444 | 2.8 | 1.04 | 0.86 | 4.89 |
| n | 20% quantile | 25% quantile | 50% quantile | 75% quantile | 98% quantile | mean | rsd | α | β | γ | A | b | success rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 2,659 | 2,683 | 2,807 | 2,972 | 3,328 | 2,835 | 8.1 | 0.522 | 1.854 | 0.876 | 1.229 | 0.598 | 3.497 |
| 1,000 | 5,414 | 5,462 | 5,680 | 5,949 | 6,447 | 5,711 | 6.1 | 0.428 | 1.438 | 1.053 | 1.157 | 0.740 | 3.063 |
| 1,500 | 8,281 | 8,336 | 8,568 | 8,870 | 9,365 | 8,600 | 4.6 | 0.378 | 1.727 | 1.004 | 1.143 | 0.736 | 3.288 |
| 2,000 | 11,193 | 11,278 | 11,616 | 12,009 | 12,795 | 11,652 | 4.5 | 0.414 | 1.380 | 1.125 | 1.153 | 0.660 | 3.913 |
| 2,500 | 13,973 | 14,064 | 14,432 | 14,801 | 15,865 | 14,472 | 4.1 | 0.473 | 1.494 | 1.150 | 1.145 | 0.723 | 3.391 |
| 3,000 | 17,333 | 17,428 | 17,782 | 18,206 | 19,065 | 17,822 | 3.2 | 0.504 | 2.524 | 0.619 | 1.255 | 0.526 | 3.824 |
| 3,500 | 19,702 | 19,855 | 20,296 | 20,822 | 21,861 | 20,336 | 3.6 | 0.441 | 1.686 | 0.842 | 1.160 | 0.702 | 3.386 |
| 4,000 | 22,679 | 22,762 | 23,262 | 23,811 | 25,133 | 23,325 | 3.3 | 0.426 | 1.720 | 0.896 | 1.168 | 0.675 | 3.539 |
| 4,500 | 25,473 | 25,566 | 26,095 | 26,676 | 27,788 | 26,133 | 3.1 | 0.363 | 1.429 | 1.202 | 1.149 | 0.719 | 3.372 |
| 5,000 | 28,454 | 28,572 | 29,162 | 29,670 | 31,114 | 29,165 | 2.9 | 0.359 | 1.413 | 1.238 | 1.167 | 0.691 | 3.391 |
| 6,000 | 34,238 | 34,436 | 34,978 | 35,694 | 37,000 | 35,056 | 2.7 | 0.373 | 1.654 | 1.070 | 1.164 | 0.711 | 3.255 |
| 7,000 | 40,065 | 40,260 | 40,982 | 41,733 | 43,422 | 41,021 | 2.7 | 0.342 | 1.187 | 1.227 | 1.109 | 0.738 | 3.934 |
| 8,000 | 45,477 | 45,660 | 46,410 | 47,178 | 48,546 | 46,412 | 2.3 | 0.490 | 1.606 | 0.954 | 1.110 | 0.783 | 3.352 |
| 9,000 | 51,284 | 51,464 | 52,176 | 52,995 | 54,696 | 52,248 | 2.3 | 0.447 | 1.447 | 1.109 | 1.106 | 0.779 | 3.482 |
| 10,000 | 57,852 | 58,064 | 59,013 | 59,931 | 62,026 | 59,033 | 2.4 | 0.435 | 1.271 | 1.111 | 1.141 | 0.722 | 3.475 |
| n | 20% quantile | 25% quantile | 50% quantile | 75% quantile | 98% quantile | mean | rsd | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 3,142 | 3,181 | 3,382 | 3,652 | 4,386 | 3,437 | 11.2 | 6 | 49 | 7 | 0.0151 |
| 1,000 | 6,599 | 6,702 | 7,102 | 7,650 | 9,124 | 7,225 | 10.3 | 5 | 60 | 7 | 0.0143 |
| 1,500 | 10,321 | 10,428 | 11,048 | 11,880 | 14,492 | 11,277 | 10.7 | 6 | 62 | 5 | 0.0125 |
| 2,000 | 13,951 | 14,178 | 14,884 | 15,930 | 18,409 | 15,130 | 9.6 | 5 | 67 | 7 | 0.0117 |
| 2,500 | 18,056 | 18,228 | 19,178 | 20,376 | 23,125 | 19,398 | 8.6 | 6 | 58 | 7 | 0.0158 |
| 3,000 | 21,545 | 21,867 | 23,049 | 24,551 | 29,181 | 23,373 | 9.5 | 5 | 66 | 7 | 0.0109 |
| 3,500 | 25,946 | 26,218 | 27,258 | 28,670 | 33,538 | 27,677 | 7.9 | 6 | 76 | 7 | 0.0121 |
| 4,000 | 29,619 | 29,950 | 31,698 | 33,432 | 39,096 | 32,034 | 9 | 6 | 66 | 6 | 0.013 |
| 4,500 | 33,727 | 34,072 | 35,502 | 37,298 | 43,695 | 36,049 | 8.2 | 6 | 63 | 8 | 0.0124 |
| 5,000 | 37,728 | 38,200 | 39,900 | 42,145 | 49,722 | 40,540 | 8.6 | 6 | 61 | 7 | 0.013 |
| 6,000 | 46,126 | 46,656 | 48,946 | 51,592 | 58,755 | 49,475 | 8 | 6 | 78 | 6 | 0.0124 |
| 7,000 | 53,793 | 54,275 | 56,802 | 59,802 | 67,465 | 57,401 | 7.5 | 5 | 64 | 10 | 0.0108 |
| 8,000 | 61,621 | 62,201 | 64,872 | 68,728 | 79,383 | 65,957 | 8.1 | 6 | 43 | 9 | 0.0153 |
| 9,000 | 71,811 | 72,596 | 75,198 | 79,792 | 92,431 | 76,500 | 7.8 | 7 | 71 | 6 | 0.0126 |
| 10,000 | 79,349 | 80,324 | 83,984 | 88,606 | 102,910 | 85,087 | 8.6 | 5 | 75 | 7 | 0.0112 |
| configuration | n | 20% quantile | 25% quantile | 50% quantile | 75% quantile | 98% quantile | mean | rsd | mean/n |
|---|---|---|---|---|---|---|---|---|---|
| dyn(C) | 500 | 2626.8 | 2662.75 | 2789.5 | 2941.5 | 3361.82 | 2810.562 | 8.32 | 5.62 |
| dyn(C2) | 500 | 2714 | 2747 | 2894 | 3033 | 3392.06 | 2904.076 | 7.69 | 5.81 |
| dyn(default) | 500 | 2981 | 3024.75 | 3222 | 3467.25 | 4166.72 | 3278.202 | 11.04 | 6.56 |
| dyn(C) | 1000 | 5417.8 | 5454.5 | 5686 | 5926.75 | 6398.26 | 5695.144 | 5.76 | 5.70 |
| dyn(C2) | 1000 | 5580.8 | 5645.75 | 5869 | 6112.25 | 6668.9 | 5893.592 | 6.04 | 5.89 |
| dyn(default) | 1000 | 6232 | 6299.5 | 6652 | 6980.25 | 7847.54 | 6671.378 | 7.86 | 6.67 |
| dyn(C) | 1500 | 8203.4 | 8277.75 | 8586 | 8880.75 | 9528.1 | 8591.36 | 5.21 | 5.73 |
| dyn(C2) | 1500 | 8546 | 8603.5 | 8928 | 9206.75 | 9868.02 | 8919.98 | 4.96 | 5.95 |
| dyn(default) | 1500 | 9528 | 9610.75 | 10068 | 10554.75 | 11866.46 | 10164.5 | 7.55 | 6.78 |
| dyn(C) | 2000 | 11099.4 | 11172 | 11465 | 11805 | 12716.14 | 11495.294 | 4.66 | 5.75 |
| dyn(C2) | 2000 | 11450.8 | 11532.5 | 11881 | 12223.5 | 12951.24 | 11893.274 | 4.21 | 5.95 |
| dyn(default) | 2000 | 12919.4 | 13010.75 | 13514 | 14079.25 | 15930.84 | 13636.004 | 6.9 | 6.82 |
| dyn(C) | 2500 | 13985.8 | 14076.25 | 14502 | 14910 | 15772.26 | 14509.296 | 4.1 | 5.80 |
| dyn(C2) | 2500 | 14431.6 | 14556.75 | 14903.5 | 15330.5 | 16292.36 | 14942.674 | 4.01 | 5.98 |
| dyn(default) | 2500 | 16342.4 | 16452.75 | 16948 | 17717.75 | 19403.6 | 17071.06 | 5.78 | 6.83 |
| dyn(C) | 3000 | 16861 | 16962.25 | 17376.5 | 17821 | 18880.14 | 17408.882 | 3.7 | 5.80 |
| dyn(C2) | 3000 | 17486.2 | 17582 | 17997 | 18444.5 | 19379.24 | 18021.416 | 3.6 | 6.01 |
| dyn(default) | 3000 | 19606 | 19750.75 | 20374 | 21125.25 | 23078.9 | 20492.278 | 5.4 | 6.83 |
| dyn(C) | 3500 | 19759 | 19844.5 | 20286.5 | 20736.75 | 21958.32 | 20329.532 | 3.34 | 5.81 |
| dyn(C2) | 3500 | 20526.2 | 20650 | 21029.5 | 21518.25 | 22560.68 | 21080.554 | 3.35 | 6.02 |
| dyn(default) | 3500 | 22966.2 | 23138.75 | 23958 | 24812.5 | 27550.58 | 24090.904 | 5.4 | 6.88 |
| dyn(C) | 4000 | 22674.8 | 22785.5 | 23333.5 | 23799.5 | 24974.34 | 23300.608 | 3.24 | 5.83 |
| dyn(C2) | 4000 | 23394.8 | 23511.5 | 24028 | 24476.75 | 25979.2 | 24031.834 | 3.32 | 6.01 |
| dyn(default) | 4000 | 26475.6 | 26681.5 | 27404.5 | 28250.75 | 30781.46 | 27543.488 | 4.81 | 6.89 |
| dyn(C) | 4500 | 25492.2 | 25614.25 | 26165.5 | 26790.25 | 28047.02 | 26225.81 | 3.21 | 5.83 |
| dyn(C2) | 4500 | 26358.2 | 26465.25 | 27053.5 | 27593.25 | 28873.2 | 27039.558 | 3.09 | 6.01 |
| dyn(default) | 4500 | 29849.8 | 30119.5 | 30932 | 31905.75 | 34210.14 | 31120.334 | 5.06 | 6.92 |
| dyn(C) | 5000 | 28394.6 | 28506 | 29045.5 | 29662.25 | 31293.42 | 29118.248 | 3.14 | 5.82 |
| dyn(C2) | 5000 | 29372 | 29533.75 | 30082.5 | 30720 | 32104.12 | 30148.46 | 3.01 | 6.03 |
| dyn(default) | 5000 | 33250.6 | 33419.75 | 34246.5 | 35297.25 | 38290.32 | 34445.674 | 4.43 | 6.89 |
| dyn(C) | 6000 | 34199.8 | 34371.75 | 34931 | 35606.5 | 37094.12 | 35010.002 | 2.71 | 5.84 |
| dyn(C2) | 6000 | 35347.6 | 35558.75 | 36206 | 36879.25 | 38274.04 | 36215.806 | 2.64 | 6.04 |
| dyn(default) | 6000 | 40115.2 | 40376.25 | 41403 | 42590 | 46005.26 | 41583.822 | 4.34 | 6.93 |
| dyn(C) | 7000 | 39937.8 | 40122.5 | 40832 | 41505.75 | 43196.24 | 40855.698 | 2.6 | 5.84 |
| dyn(C2) | 7000 | 41354.4 | 41556.75 | 42195 | 42814.75 | 44653.86 | 42235.238 | 2.48 | 6.03 |
| dyn(default) | 7000 | 46882.8 | 47156.25 | 48378.5 | 49780.5 | 53550.36 | 48597.422 | 4.11 | 6.94 |
| dyn(C) | 8000 | 45862 | 46084 | 46886.5 | 47539 | 49443.74 | 46865.952 | 2.48 | 5.86 |
| dyn(C2) | 8000 | 47440.8 | 47651.75 | 48352.5 | 49104.5 | 51046.26 | 48391.546 | 2.44 | 6.05 |
| dyn(default) | 8000 | 53844.8 | 54192 | 55451 | 56728 | 60514.12 | 55551.484 | 3.69 | 6.94 |
| dyn(C) | 9000 | 51605.6 | 51799.75 | 52525.5 | 53357.75 | 55259.4 | 52609.86 | 2.33 | 5.85 |
| dyn(C2) | 9000 | 53401 | 53591.5 | 54402.5 | 55290.5 | 56978.2 | 54434.674 | 2.26 | 6.05 |
| dyn(default) | 9000 | 60961 | 61244.5 | 62592.5 | 64145.75 | 67863.7 | 62824.338 | 3.52 | 6.98 |
| dyn(C) | 10000 | 57456.6 | 57635.25 | 58522 | 59468 | 61273.56 | 58563.928 | 2.19 | 5.86 |
| dyn(C2) | 10000 | 59399 | 59634.25 | 60561 | 61362.75 | 63161.1 | 60536.716 | 2.12 | 6.05 |
| dyn(default) | 10000 | 67604.2 | 68112 | 69470.5 | 71157.75 | 74865.24 | 69679.228 | 3.41 | 6.97 |
| dyn(C) | 20000 | 116124.4 | 116421.5 | 117812.5 | 119032.25 | 121907.32 | 117795.328 | 1.72 | 5.89 |
| dyn(C2) | 20000 | 119667.6 | 119940.25 | 121234 | 122663 | 125482.96 | 121326.13 | 1.65 | 6.07 |
| dyn(default) | 20000 | 137199 | 137519.5 | 139915.5 | 142270.25 | 149804.72 | 140264.274 | 2.77 | 7.01 |
| dyn(C) | 30000 | 174611.8 | 175053.75 | 176968.5 | 178318 | 181736 | 176833.916 | 1.39 | 5.89 |
| dyn(C2) | 30000 | 180407.8 | 180792 | 182259 | 183875.75 | 186964.2 | 182249.43 | 1.3 | 6.07 |
| dyn(default) | 30000 | 207066.8 | 207579.75 | 210926.5 | 214517.25 | 224366.5 | 211485.7 | 2.57 | 7.05 |
Hyper-Parameter Tuning for the $(1+(\lambda,\lambda))$ GA
Nguyen Dang¹ and Carola Doerr²
¹ University of St Andrews, School of Computer Science, St Andrews, Scotland, UK
² Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, Paris, France
Abstract
It is known that the $(1+(\lambda,\lambda))$ Genetic Algorithm (GA) with self-adjusting parameter choices achieves a linear expected optimization time on OneMax if its hyper-parameters are suitably chosen. However, it is not well understood how the hyper-parameter settings influence the overall performance of the $(1+(\lambda,\lambda))$ GA. Analyzing such multi-dimensional dependencies precisely is at the edge of what running time analysis can offer. To make a step forward on this question, we present an in-depth empirical study of the self-adjusting $(1+(\lambda,\lambda))$ GA and its hyper-parameters. We show, among many other results, that a 15% reduction of the average running time is possible with a slightly different setup, which allows non-identical offspring population sizes in the mutation and crossover phases and more flexibility in the choice of mutation rate and crossover bias, a generalization which may be of independent interest. We also find indications that the parametrization of mutation rate and crossover bias derived by theoretical means for the static variant of the $(1+(\lambda,\lambda))$ GA extends to the non-static case.
1 Introduction
The $(1+(\lambda,\lambda))$ Genetic Algorithm (GA) is a crossover-based evolutionary algorithm that was introduced in [DDE15] to demonstrate that the idea of recombining previously evaluated solutions can be beneficial even on very smooth functions. More precisely, it was proven in [DDE15, DD18a] that the $(1+(\lambda,\lambda))$ GA achieves an $O(n)$ expected optimization time on OneMax, the problem of maximizing functions of the type $\text{OM}_z: \{0,1\}^n \to \mathbb{R},\ x \mapsto |\{1 \le i \le n : x_i = z_i\}|$. All purely mutation-based algorithms, in contrast, are known to require $\Omega(n \log n)$ function evaluations, on average, to optimize these functions [LW12, DDY16].
The $(1+(\lambda,\lambda))$ GA has three parameters: the population size $\lambda$ of the mutation and crossover phases, the mutation rate $p$, and the crossover bias $c$. It was shown in [DDE15] that an asymptotically optimal linear expected running time can be achieved by the $(1+(\lambda,\lambda))$ GA when these parameters are chosen in an optimal way, which depends on the fitness of a current-best solution. This result was extended in [DD18a] to a self-adjusting variant of the $(1+(\lambda,\lambda))$ GA, which uses a fixed parametrization $p = \lambda/n$, $c = 1/\lambda$, and an adaptive success-based choice of $\lambda$. More precisely, in the self-adjusting $(1+(\lambda,\lambda))$ GA the parameter $\lambda$ is chosen according to a one-fifth success rule, which decreases $\lambda$ to $\lambda/F$ if an iteration has produced a strictly better solution, and increases $\lambda$ to $\lambda F^{1/4}$ otherwise. This linear runtime result proven in [DD18a] was the first example where a self-adjusting choice of the parameter values could be rigorously shown to outperform any possible static setting.
Despite these theoretically appealing results, the performances reported in the original work introducing this algorithm [DDE15] are rather disappointing in that they are much worse than those of Randomized Local Search for all tested problem dimensions up to . It was pointed out in [CD18] that this is partially due to a sub-optimal implementation; the average optimization times reduce drastically when enforcing that at least one bit is flipped in the mutation phase. In this case, the self-adjusting GA starts to outperform RLS already for dimensions around . Another possible reason lies in the fact that the hyper-parameters of the self-adjusting GA had not been optimized. In [DDE15] the authors had taken some default values from the literature, and show only some very basic sensitivity analysis with respect to the update strength, but not with respect to any of the other parameters such as the success rate. In [DD18a] some general advice on choosing the hyper-parameters is given, but their influence on the explicit running time is not discussed, mostly due to missing precision in the available results, which state the asymptotic linear order only, but not the leading constants or lower order terms. Also the update strength for which the linear running time is obtained is only shown to exist, but not made explicit in [DD18a].
To shed light on the question how much performance can be gained by choosing the hyper-parameters of the GA with more care, we present in this work a detailed empirical evaluation of this parameter tuning question. Our first finding is that the default setting studied in [DDE15], which uses update strength and the mentioned -th success rule is almost optimal. More precisely, we show that for all tested problem dimensions between and only marginal gains are possible by choosing different update strengths and/or a success rule different from 1/5.
We then introduce a more general variant of the $(1+(\lambda,\lambda))$ GA, in which the offspring population sizes of the mutation and crossover phases need not be identical, and in which more flexible choices of mutation strength and crossover bias are possible. This leaves us with a five-dimensional hyper-parameter tuning problem, which we address with the irace software [LDC+16]. We thereby find configurations whose average optimization times are around 15% better than that of the default self-adjusting $(1+(\lambda,\lambda))$ GA, for each of the tested dimensions. The configurations achieving these advantages are quite stable across all dimensions, so that we are able to derive configurations achieving these gains for all dimensions. We furthermore show that the relative advantage also extends to dimensions $n = 20{,}000$ and $n = 30{,}000$, for which we did not perform any hyper-parameter tuning. This five-dimensional variant of the $(1+(\lambda,\lambda))$ GA is also of independent interest, since it allows much greater flexibility than the standard versions introduced in [DDE15, DD18a].
We finally study whether hyper-parameter tuning of a similarly extended static $(1+(\lambda,\lambda))$ GA can give similar results, or whether the asymptotic discrepancy between non-static and static parameter settings proven in [DD18a] also applies to relatively small dimensions. We show that indeed already for the smallest tested dimension, , the average optimization time of the best static setting identified by our methods is around 5% worse than the standard self-adjusting $(1+(\lambda,\lambda))$ GA from [DDE15, DD18a], and 22% worse than the best found five-dimensional configuration. This disadvantage increases to and in dimension , respectively, showing not only that the advantage of the self-adjusting $(1+(\lambda,\lambda))$ GA kicks in already for small dimensions, but also that the relative advantage increases with increasing problem dimension.
Apart from introducing the new $(1+(\lambda,\lambda))$ GA variants, which offer much greater flexibility than the standard versions, our work significantly enhances our understanding of the hyper-parameter setting in the $(1+(\lambda,\lambda))$ GA, paving the way for a precise rigorous theoretical analysis. In particular, the stable performance of the tuned configurations indicates that a precise running time analysis might be possible. We furthermore learn from our work that the parametrization of the mutation rate and the crossover bias, which was suggested and proven to be asymptotically optimal for the static case in [DD18a], seems to be optimal also in the non-static case with self-adjusting parameter choices. Finally, we also observe that for the generalized dynamic setting success rules with success rates between 3 and 4 seem to be slightly better than the classic one-fifth success rule with success rate 5.
Broader Context: Parameter Control and Hyper-Parameter Tuning. All iterative optimization heuristics such as EAs, GAs, local search variants, etc. are parametrized algorithms. Choosing the right parameter values is a tedious but important task, frequently called the “Achilles’ heel of evolutionary computation” [FCSS10]. It is well known that different parameter settings can result in very different performance. Extreme cases in which a small constant change in the mutation rate results in super-polynomial performance gaps were shown, for example, in [DJS+13, Len18].
To guide the user in the parameter selection task, two main approaches have been developed: parameter tuning and parameter control. Parameter tuning aims at developing tools that automate the process of identifying reasonable parameter values, cf. [HHLB11, LDC+16, LJD+17, HHLBS09, AMS+15] for examples. Parameter control, in contrast, aims to not only identify such good values, but to also track the evolution of good configurations during the whole optimization process, thereby achieving additional performance gains over an optimally tuned static configuration, cf. [KHE15, AM16, DD18b] for surveys. In practice, parameter control mechanisms are parametrized themselves, thus introducing hyper-parameters, which again need to be chosen by the user or one of the tuning tools mentioned above. This is also the route taken in the present work: in Sections 2 and 3 we use the iterated racing algorithm irace [LDC+16] to tune two different sets of hyper-parameters of the self-adjusting $(1+(\lambda,\lambda))$ GA, a two-dimensional and a five-dimensional one. In Section 4 we then tune the four parameters of a generalized static $(1+(\lambda,\lambda))$ GA variant. By comparing the results of these tuning steps, we obtain the mentioned estimates for the relative advantage of the self-adjusting over the best tuned static parameter configuration.
Reproducibility, Raw Data, and Computational Resources. We concentrate on reporting average values to match the available theoretical and empirical results. We recall that in theoretical works the expected optimization time dominates all other performance measures. Selected boxplots for the most relevant configurations are provided in Section 5. Information about the selected parameter values and the empirical quantiles of the running times can be found in the appendix. Source code, additional performance statistics, summarizing plots, heatmaps with different colormaps, and raw data can be found in our GitHub repository at [DD19]. All experiments were run on the HPCaVe cluster [aSU], each node of which consists of two 12-core Intel Xeon E5 2.3 GHz processors with 128 GB of memory.
2 Tuning the default $(1+(\lambda,\lambda))$ GA
Our main interest is in tuning the self-adjusting variant of the $(1+(\lambda,\lambda))$ GA proposed in [DDE15] and analyzed in [DD18a]. As in these works, we regard the performance of this algorithm on the OneMax problem. The OneMax problem is one of the most classic benchmark problems in the evolutionary computation literature. It asks to find a secret string $z \in \{0,1\}^n$ via calls to the function $\text{OM}_z: \{0,1\}^n \to \mathbb{R},\ x \mapsto |\{1 \le i \le n : x_i = z_i\}|$ and is thus identical to the problem of minimizing the Hamming distance to an unknown string $z$. It is referred to as “OneMax” in evolutionary computation, since the performance of most EAs (including the $(1+(\lambda,\lambda))$ GA) is identical on any of the functions $\text{OM}_z$, and it therefore suffices to study the instance $\text{OM}_{(1,\ldots,1)}$.
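As a concrete illustration of this problem definition, a minimal Python sketch follows. The function names `onemax` and `hamming_distance` are illustrative and are not taken from the code at [DD19]. The final check shows that maximizing the function is the same as minimizing the Hamming distance to the target.

```python
def onemax(x, z=None):
    """Count the positions in which x agrees with the target string z.

    With z=None the target is the all-ones string, so the value is simply
    the number of ones in x.
    """
    if z is None:
        return sum(x)
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    return sum(1 for xi, zi in zip(x, z) if xi == zi)


def hamming_distance(x, z):
    """Count the positions in which x and z differ."""
    return sum(1 for xi, zi in zip(x, z) if xi != zi)


if __name__ == "__main__":
    z = [0, 0, 1, 1]
    x = [1, 0, 1, 1]
    # Agreement count plus Hamming distance always equals the string length.
    print(onemax(x, z), hamming_distance(x, z), len(x))
```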
It is known that the best possible mutation-based (i.e., formally, the best unary unbiased) black-box algorithms have an expected optimization time on OneMax of order $n \log n$ [LW12, DDY16]. The $(1+(\lambda,\lambda))$ GA, in contrast, achieves a linear expected optimization time if its parameters are suitably chosen [DDE15, DD18a]. Parameter control, i.e., a non-static choice of these parameters, is essential for the linear performance, since the $(1+(\lambda,\lambda))$ GA with static parameter values cannot have an expected optimization time that is of better order than $n\sqrt{\log(n)\log\log\log(n)/\log\log(n)}$, which is super-linear.
2.1 The dynamic $(1+(\lambda,\lambda))$ GA
The GA is a binary unbiased algorithm, i.e., it applies crossover but uses only variation operators that are invariant with respect to the problem representation. We present the pseudo-code of the GA in Algorithm 1, in which we denote by the nearest integer function, i.e., if and otherwise.
The GA has two phases, a mutation phase and a crossover phase, followed by a selection step. In the mutation phase offspring are evaluated. Each of them is sampled by the operator uniformly at random (u.a.r.) from all the points at a radius around the current-best solution . The radius is sampled from the conditional binomial distribution , which assigns to each positive integer the probability . Following the reasoning made in [CD18] we deviate here from the GA variants investigated in [DDE15], to avoid useless iterations. The variants analyzed in [DDE15, DD18a] allow , which is easily seen to create copies of the parent only. As it cannot advance the search, we enforce .
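The conditional binomial distribution used for the mutation radius can be sketched as follows. This is only an illustration under the usual definition of conditioning on a positive outcome, i.e., the binomial probabilities for positive values are divided by $1-(1-p)^n$. The names `cond_binomial_pmf` and `sample_radius` are hypothetical, and the exact probability computation is meant for small $n$ only.

```python
import math
import random


def cond_binomial_pmf(n, p):
    """Return a list pmf with pmf[k] = Pr[Bin(n, p) = k | Bin(n, p) > 0].

    Index 0 is kept for convenience and holds probability 0.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    norm = 1.0 - (1.0 - p) ** n  # probability that at least one bit flips
    pmf = [0.0]
    for k in range(1, n + 1):
        pmf.append(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) / norm)
    return pmf


def sample_radius(n, p, rng):
    """Sample from the same distribution by rejecting the outcome 0."""
    while True:
        ell = sum(rng.random() < p for _ in range(n))
        if ell > 0:
            return ell


if __name__ == "__main__":
    rng = random.Random(0)
    print(cond_binomial_pmf(5, 0.2)[1:])
    print([sample_radius(20, 1 / 20, rng) for _ in range(10)])
```

Rejection sampling keeps the sketch short; since the probability of drawing 0 is at most about $1/e$ for $p \ge 1/n$, only a few repetitions are needed on average.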
In the crossover phase, offspring are evaluated. They are sampled by the crossover operator , which creates an offspring by copying with probability , independently for each position, the entry of the second argument, and by copying from the first argument otherwise. We refer to the parameter as the crossover bias. Again following [CD18], we evaluate only those offspring that differ from both their two parents; i.e., offspring that are merely copies of or do not count towards the cost of the algorithm, since their function values are already known.
In the selection step, we replace the parent by its best offspring if the latter is at least as good. When a strict improvement has been found, the value of is updated to . It is increased to otherwise.
Note that in the description above and Algorithm 1 we have deviated from the commonly used representation of the GA, in that we have parametrized the mutation rate as , the offspring population size of the crossover phase as , the crossover bias as , and in that we allow more flexible update strengths and . We thereby obtain a more general variant of the GA, which we will show to outperform the standard self-adjusting one considerably. In this present section, however, we only generalize the update rule, not yet the other parameters. That is, we work in this section only with the GA variant , which uses , , and .
In our implementation we always ensure that and are at least and at most , by capping these values if needed. Slightly better performances may be obtained by allowing even smaller -values, but we put this question aside for this present work.
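The following Python sketch puts the mutation phase, the crossover phase, the selection step, and the update rule together. It is not the authors' implementation from [DD19], and several details are assumptions, because the formulas are not legible in this copy: the mutation rate is taken as `alpha * lam / n`, the crossover population size as `beta * lam`, the crossover bias as `gamma / lam`; mutation rate and crossover bias are clamped to `[1/n, 0.99]`; the winner of the mutation phase also competes in the selection step; and the defaults `A = (3/2)**(1/4)` and `b = 2/3` are chosen to give a one-fifth success rule.

```python
import math
import random


def nearest_int(r):
    """Round to the nearest integer, rounding ties up."""
    return math.floor(r + 0.5)


def self_adjusting_ga(n, alpha=1.0, beta=1.0, gamma=1.0,
                      A=1.5 ** 0.25, b=2.0 / 3.0, budget=150_000, seed=None):
    """Run one self-adjusting GA on OneMax; return (evaluations, solved)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, evals, lam = sum(x), 1, 1.0

    def clamp(v):
        # Assumed bounds for mutation rate and crossover bias.
        return min(0.99, max(1.0 / n, v))

    while fx < n and evals < budget:
        lam1 = max(1, nearest_int(lam))          # mutation offspring
        lam2 = max(1, nearest_int(beta * lam))   # crossover offspring
        p, c = clamp(alpha * lam / n), clamp(gamma / lam)

        # Mutation phase: one radius ell > 0 shared by all lam1 offspring.
        ell = 0
        while ell == 0:
            ell = sum(rng.random() < p for _ in range(n))
        xp, fxp = None, -1
        for _ in range(lam1):
            y = x[:]
            for i in rng.sample(range(n), ell):
                y[i] = 1 - y[i]
            fy = sum(y)
            evals += 1
            if fy > fxp:
                xp, fxp = y, fy

        # Crossover phase: take each bit from xp with probability c.
        best, fbest = xp, fxp  # assumption: mutation winner also competes
        for _ in range(lam2):
            y = [xp[i] if rng.random() < c else x[i] for i in range(n)]
            if y == x or y == xp:
                continue  # copies of a parent are not evaluated
            fy = sum(y)
            evals += 1
            if fy > fbest:
                best, fbest = y, fy

        # Update rule and selection step.
        lam = max(b * lam, 1.0) if fbest > fx else min(A * lam, float(n - 1))
        if fbest >= fx:
            x, fx = best, fbest
    return evals, fx == n


if __name__ == "__main__":
    print(self_adjusting_ga(200, seed=42))
```

The budget is only checked between iterations, so a run may exceed it by the evaluations of one iteration. The sketch requires `n >= 2`.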
2.2 Influence of the Update Strengths
As mentioned above, in our first set of experiments we focus on investigating the influence of the update strengths and , i.e., we fix in the notation of Algorithm 1. In [DDE15] it was suggested to set and . These settings had previously been suggested in [Aug09, KMH*+*04] in a much different context, but seemed to work well enough for the purposes of [DDE15] and was hence not questioned further in that work (apart from a simple evaluation showing that for the influence of varying the update strength within the interval is not very pronounced). Note that the choices of and correspond to an implicit one-fifth success rule, in the sense that the value of is stable if one out of five iterations is successful. The success rate (five in this case) can be computed as . We emphasize that for notational convenience we prefer to speak of a success rate instead of a -th success rule.
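The success rate implied by a pair of update strengths can be worked out as follows. The population size is stable when one successful iteration (factor $b$) is balanced by the remaining unsuccessful ones (factor $A$ each), i.e., when $b \cdot A^{s-1} = 1$, which gives $s = 1 - \ln(b)/\ln(A)$. The symbols $A$, $b$, and $s$ and the default values $A = (3/2)^{1/4}$, $b = 2/3$ are assumptions made for this sketch; the formula does reproduce the success-rate column of the appendix tables up to rounding of the reported update strengths.

```python
import math


def success_rate(A, b):
    """Success rate s solving b * A**(s - 1) == 1, for A > 1 and 0 < b < 1."""
    if not (A > 1.0 and 0.0 < b < 1.0):
        raise ValueError("need A > 1 and 0 < b < 1")
    return 1.0 - math.log(b) / math.log(A)


if __name__ == "__main__":
    # Assumed default update strengths: a one-fifth success rule.
    print(success_rate(1.5 ** 0.25, 2.0 / 3.0))
    # Update strengths from one row of the appendix tables (rounded values).
    print(success_rate(1.229, 0.598))
```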
The heatmap in Figure 1 shows the average running time of the self-adjusting GA in dependence of the update strengths and . We considered all combinations of 50 equally spaced values for and for (2 500 hyper-parameter settings). For each setting, we performed 100 independent runs of the algorithm . Each run has a maximum budget of 150 000 function evaluations. Our results are for problem dimension . To show more structure, we cap in Figure 1. (a) the values at . A zoom into the interesting region of combinations achieving an average optimization time can be found in Figure 1.(b). More versions with different color schemes and cappings are available at [DD19].
The best configuration is with an estimated average optimization time of . This configuration has a success rate of . The average optimization time of the default variant from [DDE15], denoted by in the following, over 500 runs is , and thus only worse than . of the tested configurations have a smaller average optimization time than , all of them with -values at most and -value at least . 106 configurations are worse by at most 3%, and 188 by at most 5%.
For a more stable comparison, we also ran 500 times, and its average optimization time increased to for these 500 independent runs, reducing the relative advantage over to . Boxplots with information about the runtime distributions can be found in Section 5.
In Figure 2 we plot the average optimization time for different success rates, sorted by the value . Note that for each tested -value we have averaged here over all configurations using the same rounded (by ) success rate. The performance of success rates 1 and 2 is much worse than and is therefore not plotted. We plot only results for success rates at most 10, for readability purposes. We see that success rates 4 and 5 are particularly efficient, given the proper values of . The performance curves for success rates seem to be roughly U-shaped with different values of in which the minimum is obtained. It could be worthwhile to extend the mathematical analysis of the presented in [DD18a] in order to identify the precise relationship.
2.3 Tuning with irace
The computation of the heatmaps presented above is quite resource-consuming, around 286 CPU days for the full heatmap with parameter combinations for . Since we are interested in studying the quality of the $(1+(\lambda,\lambda))$ GA also for other problem dimensions, we therefore investigate how well automated tuning tools approximate the best known configuration. To this end, we run the configuration tool irace [LDC+16] with adaptive capping [CLHS17] enabled. This mechanism was recently added to irace to make its search procedure more efficient when optimizing running time or time-compatible performance measures. We use irace to optimize the configuration of the for values of between and , and values of between and . The allocated budget is 10 and 20 hours of walltime on one 24-core cluster node for and , respectively. This time budget is only a fraction of the one required by the heatmaps (around for ).
For irace suggests to use configuration , which is similar to the one showing best performance in the heatmap. The average optimization time of this configuration is (this number, like all numbers for the configurations suggested by irace are simulated from 500 independent runs each), and thus identical to the best one from the heatmap computations. The suggested configuration corresponds to a success rate.
Confident that irace is capable of identifying good parameter settings, we then run irace for various problem dimensions between and . The average optimization times of the suggested configurations, normalized by $n$, are reported in Figure 4 in column . The chosen -values are between and and the -values are between and , with corresponding success rates between and , cf. Figure 3. We observe a quite stable suggestion for the parameter values. The suggested configurations, along with statistical data, can be found in Table 1 in the appendix.
In Figure 4 we also display, in column , the normalized average optimization times of the default setting . The relative disadvantage of the over the ranges from to . The negative values (in four dimensions) may be due to a suboptimal suggestion of irace, or due to the variance of the algorithms; the relative standard deviation is between and , cf. also the boxplots in Section 5.
We also observe that the normalized average optimization times of increase slightly with increasing problem dimension. Note, however, that this does not necessarily tell us something about the constant factor in the linear running time of this algorithm, although the results indicate that this factor might be larger than . Already for the has a smaller average optimization time than RLS, the relative advantage of is around , and increases to around for .
3 5-dimensional Parameter Tuning
Next we turn our attention to the five-dimensional GA variant , in which not only the update strengths and are configurable, but also the dependence of , , . The dependencies of the parameters on are based on a theoretical result proven in [DD18a], where it is shown that any static configuration with (i.e., ) that achieves optimal asymptotic expected performance must necessarily satisfy and .
To investigate how much performance can be gained by this flexibility, and what reasonable parameter values look like, we again run irace, this time using the following parameter ranges: , , , and . The allocated budget is the same as for the , i.e., 240 CPU hours for and 480 CPU hours for .
The normalized average running times of the suggested configurations are presented in Column in Figure 4. We observe that the parametrization of , , and consistently allows to decrease the average optimization time by around 14%, when measured against the best variant.
3.1 Suggested Hyper-Parameters
The suggested parameter values are displayed in Figure 5. We observe that these are quite stable, in particular when ignoring the and dimensional configurations. More precisely, irace consistently suggests configurations with , , , , and , with corresponding success rates between and . These stable values suggest that the parametrization chosen in Algorithm 1 (and originally derived in [DD18a] for the static GA) is indeed suitable also for the non-static setting.
In Figure 6 we plot the average optimization time of the configurations tested by irace for as a function of each of the five hyper-parameters and as a function of the success rate . Note that the number of runs differs from point to point, depending on how many evaluations irace has performed for each of these configurations. It is important to note that the capping procedure may stop an algorithm before it has found an optimal solution, in order to save time for the evaluation of more promising configurations. The plotted values are the averages of the successful runs only. An exception to this rule is the chart on the lower right, which shows the whole range of all tested configurations; these values are the average time after which the configurations had either found the optimum or were stopped by the capping procedure. We thus see that irace has indeed tested across the whole range of admitted parameter values. Around of all runs were stopped before an optimum had been found. However, we already see here that for each parameter there are configurations which use a good value for this parameter, but which show quite poor overall performance. These results indicate that no parameter alone explains the performance, but that interaction between different parameter values is indeed highly relevant; we will discuss this aspect in more detail below.
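The two kinds of averages described above (successful runs only versus all runs including capped ones) can be sketched as follows. The record format, a list of `(evaluations, found_optimum)` pairs per configuration, is a hypothetical one chosen for illustration and does not reflect irace's actual log format.

```python
from statistics import mean


def summarize_runs(runs):
    """Summarize the runs of one configuration.

    runs: list of (evaluations, found_optimum) pairs, where found_optimum is
    False for runs that were stopped early by the capping procedure.
    """
    successful = [t for t, ok in runs if ok]
    all_times = [t for t, _ in runs]
    return {
        "n_runs": len(runs),
        "n_successful": len(successful),
        # Average over successful runs only (None if there is none).
        "avg_successful": mean(successful) if successful else None,
        # Average time until the optimum was found or the run was capped.
        "avg_all": mean(all_times) if all_times else None,
    }


if __name__ == "__main__":
    print(summarize_runs([(3000, True), (3400, True), (5000, False)]))
```

Note that the average over successful runs alone is optimistic for configurations with many capped runs, which is one reason to report the number of successful runs alongside it.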
Out of the tested configurations only configurations had at least one successful run. The averages of all successful runs are plotted in the upper right chart of Figure 6. We observe that the well-performing region of values for each parameter is quite concentrated. The charts on the left and in the middle column zoom into those configurations which had an average optimization time smaller than . These plots give a good indication where the interesting regions for each parameter are. We also plot the average optimization time in dependence of the success rate and see good performance for success rates between 3 and 4.
For tested configurations only successful runs were reported; i.e., for these configurations none of the runs had been stopped before it had found an optimal solution. When restricting the zoomed plots in Figure 6 to only those configurations, we obtain a very similar picture. We omit a detailed discussion but note that these plots can be found in our repository [DD19].
The final configuration suggested by irace, has an average optimization time of in the 500 independent runs conducted for the values reported in Figure 4. During the irace optimization the estimated average was (across runs).
We see that some of the configurations in Figure 6 have a smaller average optimization time than this latter value. In fact, there are 292 such configurations with at least one successful run and configurations with only successful runs. As we can see from the plots in Figure 6, all these configurations have very similar parameter values. This observation nevertheless raises the question why irace has not suggested one of these presumably better configurations instead. To understand this behavior, we investigate in more detail the working principles of irace, and find two main reasons. One is that the time budget did not allow a further investigation of these configurations, so that the statistical evidence that they are indeed superior to the suggested one was not sufficient. A second reason is that the capping suggested in [CLHS17] resulted in a somewhat harsh selection of “surviving” configurations. We leave for future work the question whether any of the 292 configurations would have been significantly better than the suggested one. Overall, our investigation suggests that some adjustments to irace’s default setting might be useful for applications similar to ours, where the performance measure may potentially suffer from high variance.
We next investigated the influence of each parameter on the overall running time. To this end, we have applied the functional analysis of variance (fANOVA) [HLB14] to the performance data given by irace. fANOVA can efficiently recognize the importance of both individual algorithm parameters and their interactions through the percentage that they contribute to the total performance variance. The software PyImp [aFU] is used for the analysis. The results obtained are quite consistent across different dimensions. The most important parameter is , which explains on average of the total variance. The second most important parameter is , explaining around of the total variance, on average. Other important effects include pairwise interaction between and or . Individual parameters and their pairwise interaction effects are able to explain almost of the total variance, so that there is no need to consider higher-order interactions.
In light of the quite stable parameter values suggested by irace (Figure 5) one might hope to obtain even better results when restricting the ranges of possible parameter values further. To investigate this question we run irace again on the configuration problem, this time with restricted parameter ranges , , , and . The normalized average running time results are reported in column dyn-restr. of Figure 4. We observe that the advantage is negligible, and in four of the tested dimensions the suggested configurations even have a slightly worse average optimization time. This effect is likely to be caused by the randomness of the running times and/or the irace procedure itself.
Finally, we derive from the suggested parameter values two configurations that we investigate in more detail: and , which we abbreviate as dyn(C) and dyn(C2), respectively. While dyn(C) consistently shows better performance than dyn(C2), the latter might be easier to analyze by theoretical means. Their normalized average optimization times across all tested dimensions can be found again in Figure 4. They are considerably better than those of , between and across all tested dimensions for dyn(C) and between and for dyn(C2). dyn(C2) is between and worse than the (for each dimension independently tuned) best suggested configuration. For dyn(C) we even observe that the average running times for the 500 runs are smaller than those of for 10 out of the 15 tested dimensions. The advantages of dyn(C) and dyn(C2) over also translate to larger dimensions, for which we did not perform hyper-parameter tuning. For and the advantage of dyn(C) over is 16% each, and for dyn(C2) a relative advantage of 14% is observed.
3.2 Fixed-Target Analysis
Finally, we address the question where the advantage of the self-adjusting GA over RLS originates from. To this end we perform an empirical fixed-target runtime analysis for two selected configurations, the default configuration and the configuration dyn(C) mentioned above.
The fixed-target running times have been computed with IOHprofiler [DWY+18], a recently announced tool which automates the performance analysis of iterative optimization heuristics. The average results of independent runs for are shown in Figure 7. We observe that RLS is significantly better for almost all target values. In fact, the configuration dyn(C) has better first hitting times than RLS only for OneMax values greater than , i.e., only for the last 22 target values. We recall from Figure 4 that the average optimization time of dyn(C) is better than that of RLS by around for . To study at which point dyn(C) starts to perform better than RLS, we compute the gradient of the curves plotted in Figure 7, showing that this happens around target value . For the default configuration the situation is as follows: It has a smaller first hitting time than RLS only for target values , although its overall average running time is smaller by around 23%. The gradient of is better than that of RLS starting at target value around . Finally, dyn(C) has a smaller average hitting time than for Om-values at least , and a better gradient starting at around . We show in Figure 7 the hypothetical running times of an algorithm that runs RLS until target value and then switches to dyn(C). Its average running time is smaller than that of dyn(C), raising the interesting question how to detect such switching points on the fly.
We also want to understand how the value of evolves during the optimization process. We plot in Figure 8 for each target value the average value of in the iteration in which for the first time a solution of this quality has been sampled. More precisely, we display in Figure 8 the logarithm of these parameter values. These results have again been computed with IOHprofiler [DWY+18]. We also show in this plot the logarithm of , the value for which the standard dynamic GA (with ) was first shown to have linear optimization time, cf. [DDE15]. We observe that for both configurations the average value of increases with already obtained function value, with a final value of for and for dyn(C). We recall that the number of offspring generated in each iteration is for and around for dyn(C).
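The fixed-target perspective can be illustrated with a small sketch. It is not the IOHprofiler API. It assumes a hypothetical trajectory format, a list of `(evaluations, best_so_far_fitness)` pairs sorted by evaluations, and it assumes that the hypothetical hybrid curve is obtained by adding to the average RLS hitting time of the switching target the remaining average time of the GA. This estimate ignores that the GA's population size at the switching point would differ when it is started late.

```python
def first_hitting_times(trajectory, targets):
    """Map each target to the evaluation count at which it was first reached.

    trajectory: list of (evaluations, best_so_far_fitness), sorted by
    evaluations. Targets that are never reached map to None.
    """
    return {
        v: next((t for t, f in trajectory if f >= v), None)
        for v in targets
    }


def hybrid_time(avg_rls, avg_ga, switch, n):
    """Estimated average time of running RLS up to target `switch`,
    followed by the GA's average remaining time from `switch` to n.

    avg_rls, avg_ga: dicts mapping target values to average hitting times.
    """
    return avg_rls[switch] + (avg_ga[n] - avg_ga[switch])


if __name__ == "__main__":
    run = [(1, 3), (4, 5), (9, 6), (20, 8)]
    print(first_hitting_times(run, [5, 6, 7, 8]))
    print(hybrid_time({6: 10, 8: 40}, {6: 18, 8: 30}, switch=6, n=8))
```

Averaging the per-run hitting times over all runs for each target yields fixed-target curves of the kind discussed above; differences between consecutive targets approximate the gradient.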
4 Tuning the Static GA
We had concentrated in the previous sections on optimizing dynamic versions of the GA, since the theoretical results guarantee configurations for which linear expected running time can be obtained. In contrast, the best possible expected running time that can be achieved with static parameters , and arbitrary and is of order [DD18a]. While this rules out the possibility that there exists a static configuration that performs similarly well as across all dimensions, it is not known to date whether for concrete problem dimensions there exist static configurations that are similar in performance to the dynamic variants , dyn(C), or even . We next show that for the tested problem dimensions between 500 and this does not seem to be the case.
We study the four-dimensional variant presented in Algorithm 2. Following [DDE15], we enforce again that the mutation strength is strictly greater than zero, by sampling from the conditional distribution in line 2. We also allow , which was not the case in [DDE15]. In line with suggestions from [DDE15, DD18a] we set , and optimize for integer . We allow the same range for and . The crossover bias is optimized within the range .
The normalized average running times of the best configurations that irace has been able to identify with its given budget are reported in column of Figure 4. We observe that these running times are significantly larger than those of the dynamic GA variants. The relative disadvantage against the default dynamic variant monotonically increases from around for to around for . Against the best dynamic variant this relative disadvantage increases from around to around .
We also see from the results in Figure 4 that, with few exceptions, the normalized average running time increases with the problem dimension. This is in line with what the super-linear lower bound proven in [DD18a] suggests (note, however, that the theoretical results for the static GA assume ). The relative increase of the normalized average running time is smaller than for RLS, again in line with the known theoretical results. The comparison with RLS also shows that the static GA variants start to outperform RLS at problem dimension . For the relative advantage of over RLS is around .
Finally, we study in Figure 9 the parameter values of the configurations suggested by irace. We observe that across all dimensions is significantly smaller than , which was different for the dynamic GA variants. Both and are relatively stable, with values ranging between 5 and 7 for and between 5 and 10 for . The values of fluctuate significantly more, between 43 and 78. The crossover rate is always within the range , and thus also quite stable. Since in the original works is assumed, we also note that for both and the factor between the minimal and maximal value is as small as and , respectively, with no clear monotonic relationship.
5 Runtime Distribution
In all figures mentioned above we have only considered average values, to obtain results that are more easily comparable with existing theoretical and empirical works. With Figure 10 we address the question how the running times are distributed. This figure provides boxplots for all tested dimensions . The plots confirm the performance advantages of the five-dimensional dynamic GA variants and dyn(C) over the 2-dimensional versions and . All adaptive versions perform consistently better than the best static version in terms of both median values and variance. These advantages become more visible as the problem size increases. We also perform two types of statistical tests (paired Student t-test and Wilcoxon signed-rank test) between those versions. The results confirm that the differences between them are statistically significant at a confidence level of 99.9%.
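The distributional quantities discussed here can be computed from raw running times with a few lines of standard-library Python. This is a minimal sketch: the function name is illustrative, the relative standard deviation is assumed to be the sample standard deviation divided by the mean (in percent), and the quartiles use the default interpolation of `statistics.quantiles`, which need not match the convention used for the boxplots.

```python
import statistics


def runtime_summary(times, n):
    """Summary statistics for the running times of one configuration.

    times: evaluations until the optimum was found, one entry per run.
    n: problem dimension, used to normalize the mean.
    """
    q1, median, q3 = statistics.quantiles(times, n=4)
    avg = statistics.mean(times)
    return {
        "normalized_mean": avg / n,
        "median": median,
        "iqr": q3 - q1,
        # Relative standard deviation in percent (sample standard deviation).
        "rsd_percent": 100.0 * statistics.stdev(times) / avg,
    }


if __name__ == "__main__":
    print(runtime_summary([3000, 3100, 3200, 3300, 3400, 3500, 3600], n=500))
```

Comparing these summaries across algorithm variants gives a numerical counterpart to the boxplots; the significance tests themselves would require a statistics library and are not sketched here.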
6 Conclusion
We have presented a very detailed study of the hyper-parameters of the static and the self-adjusting GA on the OneMax problem. Among other results, we have seen that the self-adjusting GA gains only around in average optimization time with optimized update strengths and . We have then introduced a more flexible variant, the , in which the offspring population sizes of mutation and crossover phase need not be identical, and which offers more flexibility in the choice of the mutation rate and the crossover bias. This has reduced the average optimization times by another 15%. Interestingly, the parameter values by which these performance gains are achieved are quite consistent across all tested dimensions. We then analyzed a configuration in which we fixed the hyper-parameters according to the suggestions made by the tuning in lower dimensions to , and show that it performs very well also on the and dimensional OneMax problem.
Our results suggest that the GA can gain performance by introducing the additional hyper-parameters. We plan on investigating the gains for other problems, in particular the MaxSAT instances studied in [BD17]. Since all results shown in this work are quite consistent across all dimensions, we also plan on analyzing the advantages of the by rigorous means, both in terms of optimization time and in terms of more general fixed-target running times. As we have demonstrated in Section 3.2, the latter reveal that the advantage of the GA over RLS lies in the very final phases of the OneMax optimization problem, i.e., when finding improving moves is hard. Efficiently switching between the two algorithms at the time at which the GA starts to outperform RLS carries the potential to reduce the optimization time further. Automating such online algorithm selection is another line of research that we plan to investigate further. Techniques from the literature on parameter control [KHE15, DD18b], adaptive operator selection [FCSS10], and hyper-heuristics [BGH+13] might prove useful in this context.
On a meta-level, we have demonstrated with this work that hyper-parameter tuning provides useful insights that help us understand the working principles of randomized search heuristics. As mentioned, we are confident that we can leverage the empirical findings of this work for a precise theoretical analysis of the self-adjusting GA. Similar to the tuning in the loop approach suggested in [dOAS11], we thus see that there is important room for tuning in the theory loop. On the other hand, we also show an example that raises the question of how to adjust the default setting of current hyper-parameter tuning methods when a priori knowledge about the scenario is given (e.g., high variance in performance measure in our case). This is another line of research raised by this work.
Acknowledgments.
This work was supported by the Paris Ile-de-France Region, by a public grant as part of the Investissement d’avenir project ANR-11-LABX-0056-LMH, LabEx LMH, by the European Cooperation in Science and Technology (COST) action CA15140, and by the UK EPSRC grant EP/P015638/1. The simulations were performed at the HPCaVe at UPMC-Sorbonne Université.
Appendix A Selected Optimization Time Statistics
References
- [aFU] ML4AAD Group at Freiburg University. PyImp. https://github.com/automl/ParameterImportance.
- [AM16] Aldeida Aleti and Irene Moser. A systematic literature review of adaptive parameter control methods for evolutionary algorithms. ACM Computing Surveys, 49:56:1–56:35, 2016.
- [AMS+15] Carlos Ansótegui, Yuri Malitsky, Horst Samulowitz, Meinolf Sellmann, and Kevin Tierney. Model-based genetic algorithms for algorithm configuration. In Proc. of International Conference on Artificial Intelligence (IJCAI’15), pages 733–739. AAAI Press, 2015.
- [aSU] HPCaVe Cluster at Sorbonne University. http://hpcave.upmc.fr/index.php/resources/mesu-beta/.
- [Aug09] Anne Auger. Benchmarking the (1+1) evolution strategy with one-fifth success rule on the BBOB-2009 function testbed. In Companion Material for Proc. of Genetic and Evolutionary Computation Conference (GECCO’09), pages 2447–2452. ACM, 2009.
- [BD17] Maxim Buzdalov and Benjamin Doerr. Runtime analysis of the $(1+(\lambda,\lambda))$ Genetic Algorithm on random satisfiable 3-CNF formulas. In Proc. of Genetic and Evolutionary Computation Conference (GECCO’17), pages 1343–1350. ACM, 2017.
- [BGH+13] Edmund K. Burke, Michel Gendreau, Matthew R. Hyde, Graham Kendall, Gabriela Ochoa, Ender Özcan, and Rong Qu. Hyper-heuristics: a survey of the state of the art. Journal of the Operational Research Society, 64:1695–1724, 2013.
- [CD18] Eduardo Carvalho Pinto and Carola Doerr. A simple proof for the usefulness of crossover in black-box optimization. In Proc. of Parallel Problem Solving from Nature (PPSN’18), volume 11102 of Lecture Notes in Computer Science, pages 29–41. Springer, 2018. Full version available at http://arxiv.org/abs/1812.00493.
