Lexicase Selection of Specialists

Thomas Helmuth; Edward Pantridge; Lee Spector

arXiv:1905.09372·cs.NE·January 3, 2020

TL;DR

This paper investigates how lexicase selection's ability to select specialists influences its effectiveness, showing that specialists contribute significantly to its performance and diversity in evolving solutions.

Contribution

It reveals that selecting specialists is crucial for lexicase selection's success, providing insights into its advantages over error-aggregating methods.

Findings

01. Lexicase selection's performance drops when specialists are excluded.
02. Specialists help maintain diversity and drive evolution toward global solutions.
03. Lexicase selection favors specialists more than tournament selection.

Abstract

Lexicase parent selection filters the population by considering one random training case at a time, eliminating any individuals with errors for the current case that are worse than the best error in the selection pool, until a single individual remains. This process often stops before considering all training cases, meaning that it will ignore the error values on any cases that were not yet considered. Lexicase selection can therefore select specialist individuals that have poor errors on some training cases, if they have great errors on others and those errors come near the start of the random list of cases used for the parent selection event in question. We hypothesize here that selecting these specialists, which may have poor total error, plays an important role in lexicase selection's observed performance advantages over error-aggregating parent selection methods such as tournament…

Tables

Table 1. PushGP system parameters and the usage rates of genetic operators.

Parameter	Value
population size	1000
max number of generations	300
tournament size for tournament selection	7
Genetic Operator Rates	Prob
alternation	0.2
uniform mutation	0.2
uniform close mutation	0.1
alternation followed by uniform mutation	0.5

Table 2. Theoretical probability of tournament selection selecting an individual that would be removed by $X\%$ elitist survival. For example, the probability of selecting an individual removed by 50% elitist survival is 0.00781, meaning that individuals with total error worse than the median make up less than 0.8% of the parents when using tournament selection.

% Elitist Survival	Probability of Selecting a Removed Individual
10	0.47829
20	0.20971
30	0.08235
40	0.02799
50	0.00781
60	0.00163
70	0.00021
80	0.00001
90	0.0000001
100	0
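
The values in Table 2 can be reproduced, to within rounding, from the rank-based selection probability $p(i)$ listed under Equations. The following is a minimal sketch rather than the authors' code. It assumes that $i$ is an individual's rank by total error with rank 1 the best, that $P$ is the population size (1000, from Table 1), that $t$ is the tournament size (7, from Table 1), that ties are ignored, and that tournament entrants are sampled with replacement. The function names are hypothetical.

```python
def rank_selection_probability(i: int, pop_size: int, tournament_size: int) -> float:
    """p(i): chance that one tournament selects the individual of rank i.

    Assumption: rank 1 is the best total error and rank pop_size the worst.
    """
    P, t = pop_size, tournament_size
    return ((P - i + 1) ** t - (P - i) ** t) / P ** t


def prob_selecting_removed(survival_percent: int, pop_size: int = 1000,
                           tournament_size: int = 7) -> float:
    """Total p(i) over the individuals that X% elitist survival would remove."""
    survivors = pop_size * survival_percent // 100
    # Ranks survivors+1 .. pop_size are the removed (worst) individuals.
    return sum(rank_selection_probability(i, pop_size, tournament_size)
               for i in range(survivors + 1, pop_size + 1))


if __name__ == "__main__":
    for percent in range(10, 101, 10):
        print(percent, f"{prob_selecting_removed(percent):.7f}")
```

Under these assumptions the sum telescopes to the removed fraction of the population raised to the power $t$, for example $0.5^{7} \approx 0.00781$ for 50% elitist survival.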

Equations

$p(i) = \frac{(P - i + 1)^{t} - (P - i)^{t}}{P^{t}}$

Code & Models

Repositories

thelmuth/Clojush (official)

Full text

Lexicase Selection of Specialists

Thomas Helmuth (ORCID 0000-0002-2330-6809), Hamilton College, Clinton, New York, USA

Edward Pantridge (ORCID 0000-0003-0535-5268), Swoop, Inc., Cambridge, Massachusetts, USA

Lee Spector (ORCID 0000-0001-5299-4797), Hampshire College, Amherst, Massachusetts, USA

(2019)

Abstract.

Lexicase parent selection filters the population by considering one random training case at a time, eliminating any individuals with errors for the current case that are worse than the best error in the selection pool, until a single individual remains. This process often stops before considering all training cases, meaning that it will ignore the error values on any cases that were not yet considered. Lexicase selection can therefore select specialist individuals that have poor errors on some training cases, if they have great errors on others and those errors come near the start of the random list of cases used for the parent selection event in question. We hypothesize here that selecting these specialists, which may have poor total error, plays an important role in lexicase selection’s observed performance advantages over error-aggregating parent selection methods such as tournament selection, which select specialists much less frequently. We conduct experiments examining this hypothesis, and find that lexicase selection’s performance and diversity maintenance degrade when we deprive it of the ability to select specialists. These findings help explain the improved performance of lexicase selection compared to tournament selection, and suggest that specialists help drive evolution under lexicase selection toward global solutions.

Keywords: genetic programming, lexicase selection, specialization

Published in: Genetic and Evolutionary Computation Conference (GECCO ’19), July 13–17, 2019, Prague, Czech Republic. DOI: 10.1145/3321707.3321875. ISBN: 978-1-4503-6111-8/19/07. CCS concepts: Computing methodologies → Genetic programming.

1. Introduction

Most parent selection methods used in genetic programming, and in genetic algorithms more generally, select individuals on the basis of scalar fitness values. For problems that involve multiple training cases, these fitness values are aggregated over all of the training cases, often by summing them. By contrast, lexicase selection selects parents on the basis of performance on un-aggregated training-case errors (Spector, 2012; Helmuth et al., 2015b; La Cava et al., 2018). It does this by considering training cases one at a time, in a different random order for each parent selection event. For each parent selection event it creates a pool that initially contains the entire population, and then, for each training case in turn, it filters the pool to retain only the individuals with the best error on that case. If the pool is reduced to a single individual, then that individual is the selected parent. If more than one individual survives filtering by all of the training cases, then a randomly chosen survivor is designated as the selected parent.

Prior work has shown that lexicase selection often works well in practice, but the reasons that it does so, and the contexts in which it does and doesn’t work well, are still topics of active investigation. In the present paper we address one hypothesis regarding the efficacy of lexicase selection: that selecting specialists is important for solving problems. By “specialists” we mean individuals with relatively low errors on a subset of the training cases but high errors on other training cases, and consequently poor total error relative to the rest of the population. In contrast to specialists, generalists perform approximately the same on all training cases, not doing particularly well on any training cases while having overall good total error.

Our motivation for the present study stems from anecdotal evidence observed in an earlier study, which suggested that specialists might contribute in important ways to the evolution of solutions (McPhee et al., 2015). This prior work also suggested that the selection of specialists might explain, to a significant degree, the better problem-solving performance of lexicase selection relative to other parent selection methods.

More specifically, in this prior work we examined the lineage leading to a solution to the “Replace Space with Newline” software synthesis problem, evolved with a PushGP genetic programming system. In the run that we examined, the generation in which a solution first appeared actually contained 45 distinct solutions. All of these solutions were children of the same parent in the previous generation, and both this parent and its parent (that is, the grandparent of all of the solutions) had total error values that were in the worst quartile of their respective generations by total error. The grandparent of every solution had nearly the worst total error of its generation. Nonetheless, both the grandparent and the parent produced large numbers of offspring, including large numbers of solutions in the final generation.

A later study using a larger set of benchmark problems observed lexicase selection selecting individuals with high total error significantly more frequently than tournament selection (Pantridge et al., 2018). This study also observed that lexicase selection rarely utilizes a majority of the training cases when selecting parents.

These observations motivated the present study, but anecdotal evidence is not sufficient to ground scientific understanding or to guide engineering practice. Systematic studies are required to determine the extent to which the selection of specialists is truly important, and the contexts in which this is the case. In this paper we document such a study, providing the first clear evidence supporting the hypothesis that the selection of specialists is responsible, in large measure, for the superiority of lexicase selection to tournament selection.

In the following sections we present background on lexicase selection and then the design, results, and analysis of our new experiments.

2. Background on Lexicase Selection

The basic and most commonly used version of the lexicase selection algorithm proceeds as follows each time a parent is required:

(1) A collection of candidates is set initially to contain the entire population.

(2) A collection of cases is set initially to contain all of the training cases, shuffled in random order.

(3) Until a parent has been designated, loop:

    (a) Discard all individuals in candidates except those with exactly the lowest error for the first case in cases.

    (b) If just a single individual remains in candidates, then designate it as the parent.

    (c) If only a single item remains in cases, then designate a randomly chosen individual from candidates as the parent.

    (d) Otherwise, remove the first item from cases.
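
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the Clojush implementation used in the experiments. The function name and the representation of the population as a matrix `errors[i][c]` (the error of individual `i` on training case `c`) are assumptions made for the example.

```python
import random
from typing import Sequence


def lexicase_select(errors: Sequence[Sequence[float]], rng: random.Random) -> int:
    """Return the index of the parent chosen by one lexicase selection event.

    errors[i][c] is the error of individual i on training case c
    (a hypothetical representation chosen for this sketch).
    """
    candidates = list(range(len(errors)))      # step (1): the whole population
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)                         # step (2): random case order
    while True:                                # step (3)
        case = cases[0]
        best = min(errors[i][case] for i in candidates)
        # (a) keep only candidates with exactly the lowest error on this case
        candidates = [i for i in candidates if errors[i][case] == best]
        if len(candidates) == 1:               # (b) a single survivor
            return candidates[0]
        if len(cases) == 1:                    # (c) no further cases to filter on
            return rng.choice(candidates)
        cases.pop(0)                           # (d) move on to the next case


if __name__ == "__main__":
    population_errors = [[0, 5], [5, 0], [3, 3]]
    print(lexicase_select(population_errors, random.Random(0)))
```

In the small usage example, the third individual is never selected because it is not the best on either case, even though its total error is not the worst of the three.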

Lexicase selection has been studied in several settings, and several variants of the basic algorithm have been proposed (for example, Spector et al., 2018). Among the most significant of these variations is epsilon lexicase selection, in which “exactly the lowest error” in the description of the algorithm is replaced with “within epsilon of the lowest error” for a suitably defined epsilon; this has proven to be particularly effective on problems with floating-point errors (La Cava et al., 2016; La Cava et al., 2018). Additionally, lexicase selection has been effectively used to solve problems in areas such as boolean logic and finite algebras (Helmuth et al., 2015b; Helmuth and Spector, 2013; Liskowski et al., 2015), evolutionary robotics (Moore and Stanton, 2017), and boolean constraint satisfaction using genetic algorithms (Metevier et al., 2019).

Lexicase selection often produces and maintains particularly diverse populations, and this has been hypothesized to be responsible, in part, for its problem-solving power (Helmuth et al., 2015a, 2016a). If lexicase selection does in fact select specialists more often than other parent selection techniques, this may contribute to its effects on diversity, regardless of effects on problem-solving performance.

Populations evolving by lexicase selection are also often observed to exhibit hyperselection, in which single individuals in one generation are used as parents for many, sometimes most or nearly all, of the children in the next generation. The causal connections between hyperselection and problem-solving power are complex (Helmuth et al., 2016b), but in any case this may also be relevant to the interpretation of experimental results on specialist selection, since the presence or absence of specialists may influence the frequency and patterns of hyperselection.

An additional aspect of lexicase selection that bears consideration is the fact that selected individuals will always be nondominated in their populations and elite with respect to at least one training case, a property that has been characterized as inhabiting the “corners” of the Pareto front (La Cava et al., 2018). This too should be considered in the interpretation of results on specialist selection.

3. Specialists in Genetic Programming

A specialist is an individual that achieves low errors on a subset of training cases while having high errors on other training cases. The total, or aggregated, error of a specialist individual is often relatively high compared to the rest of the population, since a poor error on a few training cases can dominate the sum of the errors. In contrast to specialists, a generalist is an individual that performs approximately the same on all training cases, achieving neither particularly good nor particularly poor results on any training case, and often achieving relatively good total error. Consider the following training cases for the function $y=(x_{1})^{2}-x_{2}$.

The following two tables describe the actual output ($\hat{y}$) and expected output ($y$) of a generalist and a specialist on each training case.

The generalist has similar error values across all training cases while the specialist has a near zero error on one training case but high errors on the other training cases. (On an actual problem with many training cases, a specialist will likely perform well on a subset of the training cases, not just one of them.) Notice that the specialist has received a penalty error of one million on the second training case because it could not be evaluated on the given set of inputs.

The total error of the specialist is drastically higher than that of the generalist. However, the generalist was not able to achieve a near-zero error on any of the training cases. In an evolutionary population that is ranked by total error, generalists will therefore tend to be ranked ahead of specialists. On the other hand, the specialist may have discovered something truly useful about solving the problem, as indicated by its one (or more, with real problems) nearly perfect output, and might be worth selecting to pass on its genetic material to the next generation.
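
The contrast between aggregated and per-case errors can be made concrete with a small worked example. The error values below are hypothetical and chosen only for illustration; they are not the values from the paper's tables. The only detail taken from the text is the penalty error of one million on the second training case.

```python
# Hypothetical per-case absolute errors on three training cases.
generalist_errors = [2.0, 3.0, 2.5]
# Near-zero error on the first case, a penalty of one million on the second
# (as described in the text), and a poor error on the third.
specialist_errors = [0.01, 1_000_000.0, 40.0]

# Aggregated (total) error: the quantity used by error-aggregating selection.
total_generalist = sum(generalist_errors)
total_specialist = sum(specialist_errors)

# Training cases on which the specialist has strictly lower error.
specialist_wins = [
    case
    for case, (g, s) in enumerate(zip(generalist_errors, specialist_errors))
    if s < g
]

if __name__ == "__main__":
    print("total error, generalist:", total_generalist)
    print("total error, specialist:", total_specialist)
    print("cases where the specialist is better:", specialist_wins)
```

By total error the generalist is ranked far ahead of the specialist, yet the specialist would survive the first filtering step of any lexicase selection event whose shuffled case list begins with the first case.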

4. Experimental design

In Section 1 we described a single run that featured an individual in the bottom quartile of the population (when sorted by total error) that was the parent of 45 solution programs. Later, in Section 6 we will show that specialists make up large portions of the individuals selected by lexicase selection compared to tournament selection. Still, this does not answer the question of whether selecting specialists is an important component of lexicase selection’s improved performance compared to tournament selection and other selection methods, or whether it is a side effect that has little bearing on the trajectory of evolution.

Does lexicase selection perform well because it selects specialists, or can it maintain good performance without selecting individuals with poor total error? We hypothesize that lexicase selection’s ability to select specialist individuals with poor total error allows it to more effectively explore the search space than if it were limited to selecting individuals with good performance when measured by total error. We do not expect tournament selection to exhibit similar decreases in performance when limited to selecting individuals with good total error, since it does not often select individuals with poor total error. Additionally, we expect that limiting lexicase selection to individuals with better total error will decrease population diversity.

To test our hypotheses, we propose an experiment where parent selection cannot select individuals with poor total error relative to the population. We devised a new survival selection step to run before parent selection called elitist survival selection. During elitist survival selection, we sort the population by total error and only allow the best $X\%$ of the population to “survive” to be available to make children. We call the percent of the population that survives this step the elitist survival rate. We then conduct parent selection using this reduced population as normal. With 100% elitist survival we would keep the entire population (i.e. no individuals are removed); 30% elitist survival would keep only the best 300 individuals sorted by total error (out of a population of 1000) to be available for parent selection. If our hypothesis holds, we would expect to see decreased performance with lexicase selection but not with tournament selection.
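
The elitist survival step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function name, the `errors[i][c]` representation, and the handling of ties in total error (broken by original population order) are all choices made for the example.

```python
from typing import List, Sequence


def elitist_survival(errors: Sequence[Sequence[float]],
                     survival_percent: int) -> List[int]:
    """Return indices of the individuals allowed to become parents.

    errors[i][c] is the error of individual i on training case c.
    Individuals are sorted by total error and only the best
    survival_percent% are kept. Ties keep their original order, which is
    an assumption of this sketch.
    """
    ranked = sorted(range(len(errors)), key=lambda i: sum(errors[i]))
    n_survivors = len(errors) * survival_percent // 100
    return ranked[:n_survivors]


if __name__ == "__main__":
    # Ten hypothetical individuals whose total error equals their index.
    population_errors = [[float(i), 0.0] for i in range(10)]
    print(elitist_survival(population_errors, 30))
```

Parent selection, whether lexicase or tournament, would then be run only over the returned indices. With a population of 1000 and a 30% elitist survival rate, the function returns the 300 individuals with the lowest total error.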

4.1. Benchmark Problems

The problems used in the experiments described here were taken from a benchmark suite of software synthesis problems, which were derived from exercises in introductory computer science textbooks (Helmuth and Spector, 2015). Solving these problems requires general-purpose programming features, such as multiple data types (strings, integers, floats, Booleans, vectors, etc.) and various control flow techniques. These problems have been addressed in several studies with multiple genetic programming systems that use lexicase selection, including PushGP (Helmuth and Spector, 2015; Helmuth, 2015; McPhee et al., 2015; Helmuth et al., 2015a, 2016b, 2016a; Helmuth et al., 2017; McPhee et al., 2017) and grammar guided GP (Forstenlechner et al., 2017, 2018a, 2018b, 2018c), as well as at least one non-evolutionary program synthesis technique (Rosin, 2018).

We selected 8 out of the 29 benchmark problems to use in this study to reflect a wide range of requirements and difficulties. The specific problems addressed in this study are Last Index of Zero, Mirror Image, Negative to Zero, Replace Space with Newline, String Lengths Backwards, Syllables, Vector Average, and X-Word Lines. Some of these problems have been solved with genetic programming using lexicase selection over 75 times out of 100, while others have solution rates around 25%.

In this study, we follow the lead of the benchmark suite in how to determine whether a run is successful or not (Helmuth and Spector, 2015). Each GP run uses a different randomly-generated set of training cases, as well as a larger set of unseen test cases used to assess generalization. Once a program has evolved that passes all of the training cases, we test it on the unseen test set—if it passes those as well, it counts as a solution. In this paper we additionally automatically simplify the programs that pass the training data before testing them for generalization, a process that shrinks program size without changing the behavior of the program on the training set. Previous work has shown that automatic simplification effectively increases generalization on these benchmark problems (Helmuth et al., 2017).

4.2. Push and PushGP

The experiments conducted in this study were run using a PushGP genetic programming system, which evolves stack-based programs expressed in the Push programming language (Spector and Robinson, 2002; Spector et al., 2005; Pantridge and Spector, 2017). The key feature of Push for the experiments presented here is its multi-stack architecture, which includes a stack for each data type and instructions that always take their arguments from the correct stacks and push their results to the correct stacks. This facilitates the evolution of programs that use multiple, nontrivial data and control structures, making it suitable for solving the benchmark problems described above. In addition, a wealth of prior data on the performance of PushGP on these problems can provide context for the results obtained in different experimental conditions (Helmuth, 2015; McPhee et al., 2015; Helmuth et al., 2015a, 2016b, 2016a; Helmuth et al., 2017; McPhee et al., 2017). We use the Clojure implementation of PushGP (https://github.com/lspector/Clojush), which was also used in the aforementioned studies.

The parameters and configurations of the PushGP system that we used for the experiments here are the same as those described in the original benchmark description (Helmuth and Spector, 2015). Table 1 presents the key parameters. The version of the code used in our experiments is made available here: https://github.com/thelmuth/Clojush/releases/tag/GECCO-Lexicase-Selection-Of-Specialists.

5. Specialists Under Tournament Selection

Tournament selection displays an inherent pressure to select generalists due to its use of an aggregate error metric, such as RMSE, classification accuracy, or total error. To compute these kinds of error metrics, an individual’s errors on all training cases must be considered. If an individual performs particularly poorly on any subset of training cases, its aggregated error will rise and its probability of being selected will decrease.
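
For comparison with lexicase selection, a single tournament selection event over total error can be sketched as below. This is an illustrative sketch, not the Clojush implementation. It assumes that entrants are drawn uniformly at random with replacement and that the entrant with the lowest total error wins; the function name is hypothetical.

```python
import random
from typing import Sequence


def tournament_select(total_errors: Sequence[float], tournament_size: int,
                      rng: random.Random) -> int:
    """Return the index of the parent chosen by one tournament.

    total_errors[i] is the aggregated error of individual i.
    Assumption: entrants are sampled with replacement.
    """
    entrants = [rng.randrange(len(total_errors)) for _ in range(tournament_size)]
    # The entrant with the lowest aggregated error wins the tournament.
    return min(entrants, key=lambda i: total_errors[i])


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical totals: index 2 plays the role of a specialist whose total
    # error is dominated by a single large penalty.
    totals = [7.5, 6.0, 1_000_040.01]
    picks = [tournament_select(totals, 7, rng) for _ in range(1000)]
    print({index: picks.count(index) for index in range(len(totals))})
```

Because only the aggregated error enters the comparison, an individual with one very large error can win only when every entrant in the tournament is at least as bad, regardless of how well it performs on the remaining training cases.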

Bibliography

Bäck (1994) Thomas Bäck. 1994. Selective pressure in evolutionary algorithms: a characterization of selection mechanisms. In Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on. 57–62 vol.1. https://doi.org/10.1109/ICEC.1994.350042
Blickle and Thiele (1995) Tobias Blickle and Lothar Thiele. 1995. A Mathematical Analysis of Tournament Selection. In Proceedings of the 6th International Conference on Genetic Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 9–16. http://dl.acm.org/citation.cfm?id=645514.658088
Forstenlechner et al. (2017) Stefan Forstenlechner, David Fagan, Miguel Nicolau, and Michael O’Neill. 2017. A Grammar Design Pattern for Arbitrary Program Synthesis Problems in Genetic Programming. In EuroGP 2017: Proceedings of the 20th European Conference on Genetic Programming (LNCS), Mauro Castelli, James McDermott, and Lukas Sekanina (Eds.), Vol. 10196. Springer Verlag, Amsterdam, 262–277. https://doi.org/10.1007/978-3-319-55696-3_17
Forstenlechner et al. (2018a) Stefan Forstenlechner, David Fagan, Miguel Nicolau, and Michael O’Neill. 2018a. Extending Program Synthesis Grammars for Grammar-Guided Genetic Programming. In 15th International Conference on Parallel Problem Solving from Nature (LNCS), Anne Auger, Carlos M. Fonseca, Nuno Lourenco, Penousal Machado, Luis Paquete, and Darrell Whitley (Eds.), Vol. 11101. Springer, Coimbra, Portugal, 197–208. https://doi.org/10.1007/978-3-319-99253-2_16
Forstenlechner et al. (2018b) Stefan Forstenlechner, David Fagan, Miguel Nicolau, and Michael O’Neill. 2018b. Towards effective semantic operators for program synthesis in genetic programming. In GECCO ’18: Proceedings of the Genetic and Evolutionary Computation Conference. ACM, Kyoto, Japan, 1119–1126. https://doi.org/10.1145/3205455.3205592
Forstenlechner et al. (2018c) Stefan Forstenlechner, David Fagan, Miguel Nicolau, and Michael O’Neill. 2018c. Towards Understanding and Refining the General Program Synthesis Benchmark Suite with Genetic Programming. In 2018 IEEE Congress on Evolutionary Computation (CEC), Marley Vellasco (Ed.). IEEE, Rio de Janeiro, Brazil.
Helmuth et al. (2017) Thomas Helmuth, Nicholas Freitag McPhee, Edward Pantridge, and Lee Spector. 2017. Improving Generalization of Evolved Programs Through Automatic Simplification. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’17). ACM, Berlin, Germany, 937–944. https://doi.org/10.1145/3071178.3071330