Exploration and Exploitation in Symbolic Regression using   Quality-Diversity and Evolutionary Strategies Algorithms

J.-P. Bruneton; L. Cazenille; A. Douin; V. Reverdy

arXiv:1906.03959·cs.NE·June 11, 2019

Exploration and Exploitation in Symbolic Regression using Quality-Diversity and Evolutionary Strategies Algorithms

J.-P. Bruneton, L. Cazenille, A. Douin, V. Reverdy

PDF

Open Access

TL;DR

This paper presents a hybrid approach combining Genetic Programming, MAP-Elites, and Covariance Matrix Adaptation Evolution Strategy to improve exploration and success rates in symbolic regression, effectively solving multiple benchmark problems.

Contribution

The paper introduces a novel combination of algorithms that enhances exploration and diversity in symbolic regression, achieving high success rates on standard benchmarks.

Findings

01

High success rates in symbolic regression benchmarks

02

Effective exploration and diversity preservation

03

Successful evaluation of multiple targets from literature

Abstract

By combining Genetic Programming, MAP-Elites and Covariance Matrix Adaptation Evolution Strategy, we demonstrate very high success rates in Symbolic Regression problems. MAP-Elites is used to improve exploration while preserving diversity and avoiding premature convergence and bloat. Then, a Covariance Matrix Adaptation-Evolution Strategy is used to evaluate free scalars through a non-gradient-based black-box optimizer. Although this evaluation approach is not computationally scalable to high dimensional problems, our algorithm is able to find exactly most of the $31$ targets extracted from the literature on which we evaluate it.

Tables5

Table 1. Table 1. Comparison between plain GP and Map-Elites with only free scalars being “ 1 1 1 ” and “ 2 2 2 ”. See Section 3.3 for more details. Based on 100 independent runs. N 𝑁 N -eval is the average number of individuals evaluated when the solution is exactly found. MAP-Elites shows a slight improvement over plain GP, although it requires a bit more evaluations before convergence. The training and validation set intervals are the same as the ones specified in Table 2.

Target name	Target formula	Hit rate - GP	N-eval	Hit rate - Map-Elite	N-eval
Nguyen-2	$x + x^{2} + x^{3} + x^{4}$	67 %	22292	93 %	28969
Koza-3	$x^{6} - 2 x^{4} + x^{2}$	24 %	34454	50 %	42143
Meier-3*	$x^{2} y^{2} / (x + y)$	94 %	29251	100 %	26320
Meier-4*	$x^{5} y^{- 3}$	57 %	41364	52 %	59896
Nguyen-9*	$\sin (x) + \sin (y^{2})$	87 %	20121	85 %	56461
Burks	$4 x^{4} + 3 x^{3} + 2 x^{2} + x$	2 %	46881	34 %	74735

Table 2. Table 2. Results for small targets with L = 25 𝐿 25 L=25 for the first step (no CMA-ES) and L = 15 𝐿 15 L=15 with CMA-ES. Starred targets correspond to targets where the intervals for x 𝑥 x , y 𝑦 y (and z 𝑧 z if any) are the same. Based on 20 independent runs. When the run fails after P = 150 𝑃 150 P=150 iterations for the first step and Q = 150 𝑄 150 Q=150 iterations for second step, computation time is around 20 minutes on a 40-cores computer (for one target).

Target name	Target formula	Training set	Test Set	Hits - no CMA-ES	N-eval	Hits (CMA-ES)	N-eval	Hits (total)
Nguyen-2	$x + x^{2} + x^{3} + x^{4}$	U [0, 1, 20]	U [0, 2, 200]	95 %	24624	5 %	4221	100 %
Koza-3	$x^{6} - 2 x^{4} + x^{2}$	U [0, 1, 20]	U [0, 2, 200]	40 %	48722	45 %	16899	85 %
Meier-3*	$x^{2} y^{2} / (x + y)$	U [0, 1, 20]	U [0, 2, 50]	100 %	27948	-	-	100 %
Meier-4*	$x^{5} y^{- 3}$	U [0, 1, 20]	U [0, 2, 50]	80 %	41217	20 %	3957	100 %
Nguyen-9*	$\sin (x) + \sin (y^{2})$	U [0, 1, 20]	U [0, 2, 100]	90 %	56254	10 %	2184	100 %
Keijzer-1	$0.3 x \sin (2 π x)$	E [0, 1, 0.05]	E [0, 10, 0.05]	0 %	-	95 %	5704	95 %
Keijzer-2	$0.3 x \sin (2 π x)$	E [0, 2, 0.05]	E [0, 4, 0.05]	0 %	-	100 %	5611	100 %
Keijzer-3	$0.3 x \sin (2 π x)$	E [0, 3, 0.05]	E [0, 4, 0.05]	0 %	-	100 %	3717	100 %
Nguyen-5	$\sin (x^{2}) \cos (x) - 1$	U [0, 1, 20]	U [0, 1.2, 200]	20 %	46194	60 %	19551	80 %
Nguyen-6	$\sin (x) + \sin (x + x^{2})$	U [0, 1, 20]	U [0, 1.2, 200]	60 %	48362	35 %	13898	95 %
Sine	$\sin (x) + \sin (x + x^{2})$	E [0, 6.2, 0.1]	U [0, 10, 100]	90 %	34619	10 %	10417	100 %
Koza-2	$x^{5} - 2 x^{3} + x$	U [0, 1, 20]	U [0, 2, 200]	45 %	57392	50 %	17520	95 %

Table 3. Table 3. Results for mid-sized targets. Based on averaging 20 independent runs. Burks, Keijzer-14 and Nguyen-3 are run with maximum L 𝐿 L of 20 for the CMA-ES method, Nguyen-7 with L = 25 𝐿 25 L=25 , and the remaining ones with L = 30 𝐿 30 L=30 . One run per target takes at most one hour for L = 20 𝐿 20 L=20 ( i.e. when it fails), and up to 5 hours for L = 30 𝐿 30 L=30 .

Target name	Target formula	Training set	Test Set	Hits	N-eval	Hits	N-eval	Hits
				(no CMA-ES)		(CMA-ES)		(total)
Burks	$4 x^{4} + 3 x^{3} + 2 x^{2} + x$	U [0, 1, 20]	U [0, 3, 200]	35 %	79033	60 %	16350	95 %
Keijzer-14*	$\frac{8}{2 + x^{2} + y^{2}}$	U [0, 3, 20]	E [0, 4, 0.1]	30 %	139554	65 %	6644	95 %
Nguyen-3	$x + x^{2} + x^{3} + x^{4} + x^{5}$	U [0, 1, 20]	U [0, 2, 200]	60 %	65082	30 %	17523	90 %
Nguyen-7	$\ln (1 + x) + \ln (1 + x^{2})$	U [0, 2, 20]	U [0, 3, 200]	0 %	-	20 %	42459	20 %
R1	${(x + 1)}^{3} / (x^{2} - x + 1)$	E [0, 2, 0.1]	U [0, 3, 100]	5 %	143850	90 %	50741	95 %
R2	$(x^{5} - 3 x^{3} + 1) / (x^{2} + 1)$	E [0, 2, 0.1]	U [0, 4, 400]	0 %	-	85 %	73009	85 %
Keijzer-5	$30 x z / ((x - 10) y^{2})$	$\frac{x, y : U [0, 2, 5]}{z : U [1, 5, 10]}$	$\frac{x, y : U [0, 3, 20]}{z : U [0, 5, 30]}$	5 %	348979	95 %	14983	100 %
Keijzer-12*	$x^{4} - x^{3} + 0.5 y^{2} - y$	U [0, 3, 20]	E [0, 4, 0.1]	30 %	259639	70 %	47086	100 %
Keijzer-15*	$\frac{x^{3}}{5} + \frac{y^{3}}{2} - y - x$	U [0, 3, 20]	E [0, 4, 0.1]	0 %	-	100 %	35894	100 %
Keijzer-11	$x y + \sin ((x - 1) (y - 1))$	U [0, 3, 20]	E [0, 4, 0.1]	0 %	-	15 %	71605	15 %
Nguyen-4	$x + x^{2} + x^{3} + x^{4} + x^{5} + x^{6}$	U [0, 1, 40]	U [0, 1.5, 200]	40 %	181816	55 %	41728	95 %
Pagie-1	$1 / (1 + x^{- 4}) + 1 / (1 + y^{- 4})$	E [0, 5, 0.2]	U [0, 6, 20]	15 %	233542	85 %	48647	100%

Table 4. Table 4. Description of difficult targets.

Target name	Target formula	Training set	Test Set	Maximal Length used (CMA-ES)
R3	$\frac{x^{6} + x^{5}}{x^{4} + x^{3} + x^{2} + x + 1}$	E [0, 1, 0.05]	U [0, 2, 100]	35
Vladislavleva-1	$e^{- {(x - 1)}^{2}} / ((1.2 + {(y - 2.5)}^{2})$	(x,y) : U [0.3, 4, 20]	(x,y) : E [0, 8, 0.1]	35
Keijzer-4	$x^{3} e^{- x} \cos (x) \sin (x) (\cos (x) \sin {(x)}^{2} - 1)$	E [0, 10, 0.1]	U [0, 14, 1000]	40
Nonic	$\sum_{i = 1}^{i = 9} x^{i}$	E [0, 1, 0.05]	U [0, 2, 100]	40
Vladislavleva-3	$x^{3} e^{- x} \cos (x) \sin (x) (\cos (x) \sin {(x)}^{2} - 1) (y - 5)$	x : E [0.05, 10, 0.1]	x : U [0, 10, 50]	45
		y : E [0.05, 10.05, 2]	y : U [0, 10, 10]

Table 5. Table 5. Results for difficult targets, based on 10 independent runs. The first step of the algorithm (without CMA-ES) can see more than 500 000 times 500000 absent 500\,000\text{\,} different equations. This is however not enough to hit the target. Computational time is around two days per target per run.

Target name	Hits - no CMA-ES	N-eval	Hits (CMA-ES)	N-eval	Hits (total)
R3	20 %	325312	70 %	64319	90%
Vladislavleva-1	0 %	-	30 %	101360	30 %
Keijzer-4	0 %	-	40 %	105969	40 %
Nonic	0 %	-	20 %	170928	20 %
Vladislavleva-3	0 %	-	20 %	246756	20 %

Equations10

{\emptyset, x, y, z, sin, cos, exp, ln, \times, +, -, /, \land}

{\emptyset, x, y, z, sin, cos, exp, ln, \times, +, -, /, \land}

C = ∣ f_{target} - \tilde{f} ∣ + i \sum ∣ \partial_{i} (f_{target} - \tilde{f}) ∣

C = ∣ f_{target} - \tilde{f} ∣ + i \sum ∣ \partial_{i} (f_{target} - \tilde{f}) ∣

f = (x - 3) (y - 3) + 2 sin ((x - 4) (y - 4))

f = (x - 3) (y - 3) + 2 sin ((x - 4) (y - 4))

f = \frac{30 ( x - 1 ) ( z - 1 )}{( x - 10 ) y ^{2}},

f = \frac{30 ( x - 1 ) ( z - 1 )}{( x - 10 ) y ^{2}},

\tilde{f}

\tilde{f}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsEvolutionary Algorithms and Applications · Metaheuristic Optimization Algorithms Research · Advanced Multi-Objective Optimization Algorithms

Full text

Exploration and Exploitation in Symbolic Regression using Quality-Diversity and Evolutionary Strategies Algorithms

J.-P. Bruneton

LIED, Université de Paris

CNRS/UMR 8236ParisFranceF-75013

,

L. Cazenille

Ochanomizu UniversityTokyoJapan

,

A. Douin

LIED, Université de Paris

CNRS/UMR 8236ParisFranceF-75013

and

V. Reverdy

University of Illinois at Urbana-ChampaignAstronomy Department, 1002 West Green StreetIllinoisUSA

Abstract.

By combining Genetic Programming, MAP-Elites and Covariance Matrix Adaptation Evolution Strategy, we demonstrate very high success rates in Symbolic Regression problems. MAP-Elites is used to improve exploration while preserving diversity and avoiding premature convergence and bloat. Then, a Covariance Matrix Adaptation-Evolution Strategy is used to evaluate free scalars through a non-gradient-based black-box optimizer. Although this evaluation approach is not computationally scalable to high dimensional problems, our algorithm is able to find exactly most of the $31$ targets extracted from the literature on which we evaluate it.

1. Introduction

Symbolic Regression (SR) aims at building mathematical models of numerical, and possibly experimental, data. Given data of the form $(y_{i},\vec{x}_{i})$ , the goal is to discover automatically the analytical relationship between $y$ and $x$ , as a function of a mathematical language that usually includes basic operators like $(+,-,\times,/)$ , possibly also non-algebraic functions such as sine, exp, …, some free scalars (pure numbers), and the variables $\vec{x}$ .

If such an analytic relation exists by construction, i.e. when we feed the program with data $y_{i}=f_{\textrm{target}}(x_{i})$ , then the goal is to write candidate equations $\tilde{f}(\textrm{vocabulary})$ until one hits the target $\tilde{f}=f_{\textrm{target}}$ . If not, the goal is to find a good approximation of the target function on the provided data (the training set) with good generalization properties, meaning that we want the discovered functions to behave well on previously unseen data, known as the test set or validation set.

Besides being a difficult machine learning problem that is interesting on its own, SR can also be used to provide accurate models of physical systems that are too complex to be theoretically modelled. Any complex phenomena emerging from the underlying dynamics of a large number of degrees of freedom typically fall in this category. This happens in particular in the fields of meteorology, climatology, material properties, heat transfers, astrophysics, economy, financial data, complex systems, etc. Notice that the vocabulary can include derivatives with respect to the variables so that dynamics can also be discovered provided the data has some temporal component. Even in the case where the outcome of the program is not a perfect fit, finding accurate solutions may guide the researcher towards a better understanding of the system’s underlying physics. In this regard, the interpretability of the fittest candidate equations is important. We shall comment on this later on.

SR has been studied quite extensively along these lines. For instance, references (Bongard9943, ; Schmidt81, ) try to recover physical laws and invariants of some mechanical systems, (quade2016prediction, ) focuses on real-world complex systems data sets, while, related to our concern, (2018arXiv181010525W, ) aims at building an ”automated physicist”.

One of the main approach to SR is Genetic Programming (GP) which is based on a computer simulated Darwinian evolution111Notice that GP can of course be used to solve other problems than SR., see, e.g., textbooks (koza1992genetic, ; poli2008field, ). In this field, the vocabulary is known as the primitive set, while the individuals built from correspond to candidate equations. Some number of individuals are created initially, and then selected according to their fitness (i.e. some metric) with regards to the problem at hand, and then evolved by genetic operations, namely mutations of the equation, or crossovers between two equations. This scheme then iterates until the target is found or some computational budget is reached.

Although GP has proven very successful for finding highly fit individuals in large search spaces, it has some long standing issues regarding in particular the recurrent loss of diversity during evolution (see Section 2.1). Recently however, a new paradigm emerged of evolutionary algorithms that aims at exploring both quality and diversity of individuals. These algorithms are not looking for the fittest individual only, but rather look for a set of high performing ones given their behavior with respect to hand-designed features (see Section 2.2). This so-called MAP-Elites algorithm (DBLP:journals/corr/MouretC15, ) has been used as an improvement of GP in algebraic problems (dolson2019exploring, ), path-finding (pugh2016quality, ; 2018arXiv180702397G, ), design discovery (gaier2017data, ), robotics (duarte2018evolution, ), and is available as a Python library (qdpy, ). As far as we are aware, it has not been applied to SR yet. Although later improvements have been proposed to the algorithm, e.g. (pugh2016searching, ; cully2018quality, ), we shall restrict ourselves here to its original version as published in (DBLP:journals/corr/MouretC15, ).

In this paper, we will apply this enhanced exploration algorithm to the problem of Symbolic Regression. However, we are not only concerned here with maintaining diversity, but also with a better exploitation of the results. One striking issue concern the way free scalars in SR are handled. In most of the published literature, free scalars that appear while building an individual can either be picked up randomly from a given integer set, for instance $\{-2,-1,0,1,2\}$ , or be randomly chosen floating-point numbers in a predefined and fixed interval – the so-called ”ephemeral constants” (Davidson:2003, ).

This, we believe, is not quite satisfactory. If we limit ourselves to integer scalars only, given that the equation has a maximal size, we cannot build all real-valued scalars this way. On the other hand, using ephemeral constants requires by construction many iterations of the evolution scheme before finding a value that is accurate enough.

Instead, we will write candidate equations with not yet assigned free scalars under forms such as $f=A_{1}\times\exp(A_{2}\times x)$ and then find the best scalars $A_{i}$ . However, achieving this cannot rely on common gradient-based techniques since they would often converge to a local optimum and miss the global one222Still we note that (kommenda2013effects, ; quade2016prediction, ) try to fit numerical constants with gradient descent and show that it is already an improvement over the use of ephemeral constants. On the other hand, (cerny2008using, ) uses instead another non-gradient based method, but limits itself to very simple targets only (at most order three polynomials).. Instead, we use another evolution algorithm, namely a Covariance Matrix Adaptation-Evolution Strategy algorithm (hansen2003reducing, ) (CMA-ES) in order to look for the best fit for the free scalars. Details of the method can be found in Section 2. Because CMA-ES is computationally heavy, it could not be reasonably applied to very long equations (say of more than 60 elements) with many scalars. However, it represents a substantial improvement that is worth considering for mid-sized targets.

To summarize, our model relies on an improved exploration of the fitness landscape via the evolutionary Quality-Diversity algorithm, and then fits the best scalars with another evolutionary algorithm (CMA-ES). This last technique is specific to SR and shall not apply to other types of problems that GP usually deals with. These two methods combined result in a very high success rate on many targets found in the literature. Moreover, even when the algorithm fails to find the exact target, it returns very accurate fits thanks to the CMA-ES method (although generalization may be poor in this case). Section 2 provides some background to both plain GP and its limitations and to the MAP-Elites algorithm. It finally gives a quick overview of how CMA-ES method works. Section 3 describes the entire model by putting together all these pieces, and Section 4 shows experimental results.

2. Background

2.1. Tree-based GP for SR

Before going to the GP algorithm and its improvements, let us first quickly outline how SR is usually implemented in so-called tree-based GP. Mathematical expressions are created and modified either as strings of symbols, usually in infix or postfix (reverse Polish) notation, as Abstract Syntax Trees (ASTs), or as a combination of the two depending on which representation fits best each section of the SR algorithm. ASTs are especially convenient to run genetic operators such as crossovers and point-wise mutations. The SR algorithm is given a primitive set of symbols, including $\varnothing$ that serves as a halt symbol to terminate the expression. For example, in the postfix notation that we use, $\left(x+y\right)\times x$ is represented as $x\,y\,+\,x\,\times\,\varnothing$ and corresponds to the tree given in Fig. 1.

Basic mathematical rules can easily be encoded in ASTs. While the literature usually imposes a limit on the tree depth, we will instead impose a hard limit on the length $L$ of the mathematical expressions produced by the algorithm. This corresponds to a parsimony measure (koza1992genetic, ; iba1994genetic, ) of the generated equations. This choice was motivated by the fact that parsimony is taken directly into account into our methodology, see next subsection.

GP algorithms require the following ingredients, whose relationship are summarized in Fig. (2): a population of individuals and its initialization, fitness evaluation, selection, genetic mutations, population update, and meta-parameters. A very crude view of GP is the following. After the creation of an initial population of $N$ random equations (see (koza1992genetic, ) for variations of initialization techniques), individuals are evaluated with respect to the target on the training set; then some are selected either randomly or in relation to their fitness. These individuals go through genetic operations, basically mutations and crossovers; these new individuals are evaluated, and the population is updated to keep only the $N$ best equations. The algorithm then iterates over this scheme. Termination occurs when one individual is accurate enough with respect to the validation set, see Section 3.2 for details, or when the computational budget is exhausted.

Many recent papers focus on improving one or more of these steps. For instance, one may not want to only target the best fit individual, but other features as well. This has led to multi-objective optimization and Pareto-front exploitation where fitness evaluation considers several objectives at the same time, see e.g. (6791888, ; laumanns2002archiving, ; smits2005pareto, ). While the algorithm runs, a single population of individuals is kept in memory. However, it might be useful to decompose hard problems into smaller, easier sub-problems. Therefore this scheme can also be tweaked to incorporate problem decomposition, by keeping small blocks of expressions that have proven useful during the training, see, e.g., (koza1994genetic, ; arnaldo2014multiple, ; astarabadi2018decomposition, ). As we shall see, MAP-Elites includes a sort of problem decomposition when remembering the small but fit individuals.

GP still has long standing issues, however. One of them is the bloat that happens when no hard limits are set on the length of expressions; then, crossovers tend to create longer individuals while their fitness no longer improves. Many techniques have been designed to counter bloat, in particular setting hard limits, or setting a ”soft limit” by disadvantaging long genomes (i.e. individuals), see, e.g., (Bloat, ), or even variations of this (poli2003simple, ). As we shall see, the MAP-Elites algorithm can automatically counter this problem.

Another issue is the premature convergence or diversity loss during evolution. This may be prevented by increasing population size, modifying genetic operations, and/or selection mechanism. As a first guide to the improvement of genetic operations besides the basic point-wise mutation and crossovers, we refer the reader to (poli2008field, ) and references therein.

2.2. MAP-Elites

MAP-Elites algorithm belongs to the class of Quality-Diversity algorithms (pugh2016quality, ; cully2018quality, ) that also includes, for example, Novelty Search with Local Competition (NS-LC) (lehman2011evolving, ; lehman2013effective, ). The algorithm is grid-based and stores the best-fit individuals in a grid of elites; the grid being $N$ -dimensional with $N$ features chosen by the user. As far as equations are concerned, quite natural features one may think of includes the length of the equation (or other metrics of its complexity), the number of free scalars, the number of nested non-algebraic functions such as $\sin(\sin(...))$ , the number of trigonometric functions, and the order of non-linearity of (vladislavleva2009order, ).

The algorithm is then quite simple. After the generation of initial random expressions, individuals are evaluated. The individuals produced are then sorted in terms of grid bins. Inside a given bin, only the best individual is kept. Once the grid has been populated, an iteration consists in producing new equations by applying genetic operators between the elements of the grid – and only them – chosen at random (uniform selection, as in (DBLP:journals/corr/MouretC15, )). These new states are then evaluated, and the grid is updated: some of these new equations may replace some previous best individuals in several bins of the grid, see Fig. 3. Termination is similar to the pure GP case.

The algorithm, and its relation to other standard evolutionary methods such as, e.g. Pareto-Front optimization or NS-LC is discussed at length in (DBLP:journals/corr/MouretC15, ). We want to emphasize that using such a feature map shall address in many ways the aforementioned issues of GP for SR. First of all, the preservation of diversity is kind of built-in in the method, while addressing at the same time the quest for multi-objective regression. Secondly, the bloat can be addressed by choosing as a feature the length of an equation (or its complexity). This will indeed force the algorithm to remember small individuals and automatically counter the bloat. Small individuals may not be excellent ones, but still are the best seen so far of that given size or complexity. As such, it also acts as a kind of problem decomposition, where small individuals can be seen as relevant blocks for later building a larger and better equation. Regarding now the parsimony and Occam’s razor, it is clear that when two individuals have similar fitness, the one using less free scalar should be considered as better than the other one. Therefore, it is natural to use as a feature the free scalars count of the equation. On top of increasing the diversity, it provides a way to rank the best solutions by the number of free parameters used, which is good practice when dealing with analytic models of physical phenomena (compare to approach of (de2018greedy, )).

We believe that using both length (or complexity), and the number of free scalars in an equation, are two inescapable choices in MAP-Elites-based Symbolic Regression. Additional features might be included, either to increase the grid size, or to bring some more relevant diversity. To determine which additional features to use is quite a non-trivial question. We have chosen a 3-D grid (see Section 3) based on the length, the number of scalars, and the number of non-algebraic functions such as $\sin$ or $\exp$ . This last choice is arguable, but does increase the grid size and thus boosts the exploration.

2.3. CMA-ES

CMA-ES method is described at length in a series of papers, see, e.g., (hansen2003reducing, ) and available as a library in many programming languages. Although details are quite complex, the main idea is that the algorithm will browse the landscape in an evolutionary way. Say one wants to minimize a function $f(A_{1},\ldots,A_{n})$ . First, a population of candidates for $\vec{A}$ are sampled along a normalized multivariate Gaussian; then the best individuals (in a mean-squared error sense with respect to the actual target) are sub-sampled, and this in turn defines a new multivariate normal with a shifted mean and covariant matrix, according to which new candidates are drawn, etc. The method requires many such iterations (from $1000\text{\,}$ to $10\,000\text{\,}$ ), and each iteration relies on some number of function evaluation resulting in a quite slow, but powerful method. It can be trapped in local optima, of course, but is also able to capture quite often the global optimum. See also Section 4.3 for an explicit example. Meta-parameters include, amongst other things, the population size, the maximum number of iterations, and a time limit that we have modified with respect to default values – see next section for details.

As an example, consider the target ”Korns-7” (as named in (DBLP:journals/corr/abs-1805-10365, ), i.e. $f_{\textrm{target}}=213.80940889\,(1-e^{-0.54723748542\,x})$ on the range $x\in[-50,50]$ . It is formally quite a trivial target, but is also understandably difficult to find exactly without an appropriate method for finding these two scalars. Here CMA-ES does trivialize finding such an equation. In fact, such a simple equation is likely to be already present in the initial random population. In that case, applying then the CMA-ES method to find the best-fit for the $A$ ’s will terminate the algorithm in only one step. As a matter of fact, the target Korns-7 was always found very easily. Also, because of its triviality for our combined method MAP-Elites + CMA-ES, we decided to remove it from our target list.

3. The model

3.1. General specifications

The model shall run on all the targets of Tables 1–5 with the same following primitive set:

[TABLE]

where $\wedge$ stands for exponentiation. We limited the set to three variables ( $x,y,z$ ) at most for reasons to be discussed later. The dictionary is then completed by pure numbers (also referred to as scalars in the following). Then two options exist. First the model can be run with integer scalars (namely ’1’ and ’2’ only in the following): this model is used to compare plain GP to MAP-Elites SR, see Section 3.3. Second, the full model can be run with free floating-point scalars $A$ that are fitted by the CMA-ES method at the end, as detailed in Section 3.4.

Basic simplifications on the fly are also included, such as $\exp(\ln(x))=x$ , etc. As we do not rely on existing simplification packages, the algorithm is not equipped with a full expand-refactor simplification engine. Instead, it is limited to a set of basic hard-coded rules, which is sufficient for the application described in this paper. Note that when using formal scalars $A$ , an expression like $A\times A\times\exp(x)$ can be simplified to $A\times\exp(x)$ prior to CMA-ES evaluation. Some simplifications require to add some special symbols to our dictionary, namely the neutral element – if not already present–, the zero, and infinity.

We do not use protected divide of any kind. When infinity occurs in simplifications such as $1/(x-x)\to 1/0\to\infty$ , the equation is discarded prior to evaluation. Because our simplification engine is not comprehensive however, the program occasionally creates zero division errors, in which case the maximum penalty is attributed to the equation. The same applies for other kind of exceptions such as overflow errors.

We restrict ourselves to very basic genetic operations, namely point-wise mutations between symbols of equal arities, and basic crossovers. By this we mean nodes for crossovers are chosen randomly amongst internal nodes and leaves. We do not try to improve this by weighting probabilities for choosing internal versus terminal nodes, for instance, or other refinements such as the ones described in (poli2008field, ). Because we set a hard limit on the length of an equation, crossovers are tried but discarded as soon as the resulting equation does not fit inside this limit. This is a bit different from the literature where usually a hard limit on max depth of the tree is set up (using a maximum length $L$ instead is more relevant when using a MAP-Elites grid).

Also, and this can be seen as the only expert knowledge we do implement, we limit ourselves to a maximum number $K$ of nested functions such as $\sin(\sin(\exp(...)))$ for interpretability reasons. In particular, runs were made with $K=1$ ( $K=2$ also works fine). Again, crossovers leading to out-of-bound equations are discarded. Alternatively, the number of nested functions could have been used as yet another feature for the MAP-Elites grid.

Regarding genetic operations, states are chosen randomly among the population (both for plain GP and MAP-Elites), and in $40\%$ of the cases, one symbol only is mutated, in $40\%$ of the cases, two random states go through a crossover and returns two offspring, and in the last $20\%$ , two states are chosen at random and go through both a mutation and a crossover. Also, when a non-algebraic function is chosen for mutation, there is a $30\%$ probability to simply drop the symbol as in $x\times\sin(x)\rightarrow x^{2}$ . This choice was made in order to limit the number of non-algebraic functions and helps fighting the growing number of nested functions that crossover usually produce (unless hard limit on such terms is set).

As said in Section 2, the MAP-Elites grid that we use is three dimensional and uses as features the length of the equation (in post-fix notation), the number of free scalar parameters, and the number of non-algebraic functions such a $\sin,\exp,\ldots$ involved in an expression. We thus use a grid with one bin for each length of the equation from $1$ to $L$ , one bin for each number of any non-algebraic function from [math] to $8$ (if an equation has more than $8$ functions, then it enters the last bin), and one bin for each free scalar from [math] to $L/2$ (which is the maximum possible). In order to give orders of magnitude, the grid is usually small with around $150$ elements for equations with a maximum size of $L=15$ , and may be as large as around a thousand individuals for larger equations with $L\geq 35$ .

Finally, because CMA-ES is not a perfect method, once it has returned a recommendation for the best $A_{i}$ s, we apply thereafter a least square method to descend to the closest minimum, if not exactly found previously. This usually increases the method’s precision. As a result, for a (trivial) target like, say, $f=x^{3}$ , our method might well return the exact result $f=x\times x\times x$ or, also, $f=x^{A}$ . In the latter case, the CMA-ES plus the least-square fit combined will in general return $A=2.99999(\ldots)$ . In this case, we consider that the target has been exactly found, see next section for termination criterion. Note that because non-integral exponents are permitted, we need to restrict the sampled values for the variables to positive ranges, see Section 4.1.

We have implemented the model in Python, using CMA-ES for Python (cmapackage, ), and scipy module for least squares. Everything else has been coded from scratch.

3.2. Cost function and termination criterion

Following common practices of SR literature, the program write equations $\tilde{f}=\tilde{f}(\textrm{primitive set})$ , where the right-hand side cannot333This prevents finding polynomial equations in $f$ that would first require a numerical solver which is a more complex task and left for future work. have terms depending on $\tilde{f}$ . Differential equations of the form $(d/dx)^{n}\tilde{f}=\ldots$ could also be produced using the same approach, but we restrict ourselves to non-differential equations in this paper.

Once the best free scalars $A$ in $\tilde{f}$ are found, we simply compare the training set’s right-hand side with the target. Similarly of what can be found in multi-objective regression papers, we found more effective to also take into account a ”derivative cost”. We thus use the cost $C$ :

[TABLE]

i.e., using the L1 norm on both the distance to the target, and the distance of the first derivatives to the target. A sum is used for the derivative cost when the function has more than one variable. The cost is then properly normalized in order to return a reward (or fitness) scaled between $-1$ and $1$ . By running some preliminary tests, we found that including the derivative cost speeds up the convergence.

Because CMA-ES is so accurate however, termination criterion can be subtle. Indeed, the program quite often produces spectacular fits (accurate to $10^{-5}$ in relative values) to the target on its training set, even if the formal equation is not the expected one. See for instance Fig. 7. Therefore, in our target list in Section 4 taken from the literature, we have been careful to often increase the range of the validation set to prevent the algorithm from stopping early on false positives. (We recall that termination criterion is based on the validation set only). Going back to the trivial example where the target is $f_{\textrm{target}}=x^{3}$ , our method might return $f=x^{A}$ with $A=2.99999(\ldots)$ . We consider that the target is exactly found in this case, in the following sense: one computes the NRMSE444As defined in Eq. (6) in (miranda2017noisy, ). (Normalized Root Mean Square Error) which in that case would be typically close to $10^{-16}$ due to numerical precision. We defined our termination criterion as $NRMSE\leq 10^{-6}$ .

3.3. Comparison of plain GP versus MAP-Elites

As said, we made some preliminary runs to check whether MAP-Elites improves GP approaches. In order to do so, we used a vocabulary with two integer scalars ’1’ and ’2’ (i.e. no CMA-ES), on targets of Table 1. Genetic operations have the same parameters in both runs; GP runs with a population size of 1000 individuals. Selection is made with a tournament555Two individuals are drawn randomly from the population. Only the best one is kept for mutation, otherwise they go through a crossover. of size 2, and genetic operations are done until 2000 new individuals are produced. These ones are then evaluated, and the 1000 bests of these 3000 individuals become the new population. Regarding the MAP-Elites run, 4000 individuals are first randomly created, evaluated, and binned in the grid. Then, when the grid has $N$ elements, $2N$ new individuals are created by genetic operations, evaluated, binned, and the grid is updated.

The program either stops when the target is hit, or after $10^{5}$ evaluations of the fitness function. We made 100 such runs for the five targets of Table 1. It shows that MAP-Elites is a slight improvement over GP although it requires slightly more fitness evaluations.

3.4. Full algorithm MAP-Elites + CMA-ES

We could have simply used a MAP-Elites + CMA-ES algorithm on an initial collection of random equations. Literature often goes for the half/half method for initializing the population. In fact, as we have realized that CMA-ES is a slow method while using integer scalars is very fast, our full algorithm is rather three-steps:

•

First, we create a collection of 4000 random equations. Then we evolve for $P=150$ iterations the grid of equations by using a grid with no free scalar $A$ , but only scalars ”1” and ”2”. The maximum size is set to $L+10$ . In the case where the target includes no floating-point number, it may already be found at this step, and quite often is. This is step 1 of Fig. 4.

•

If not, promising equations from the grid such as $\tilde{f}=\sin(2)x/(1+x)$ are transformed into their free scalar counterpart, namely: $\tilde{f}=\sin(A_{1})x/(A_{2}+x)$ , i.e. $\tilde{f}=A_{1}\ x/(A_{2}+x)$ after simplification. This is step 2 of Fig. 4. They are then evaluated by CMA-ES, and stored inside a new grid.

•

This grid serves as an initialization for $Q=150$ iterations using MAP-Elites with free scalars $A$ and CMA-ES, see step 3 of Fig. 4.

In other words, we use the MAP-Elites method with ephemeral constants (drawn randomly from the integer set $(1,2)$ ) to generate an initial population for CMA-ES which is a priori much more relevant than random equations. This means that we do have two distinct vocabularies and two distinct sets of simplification rules in our code.

The full algorithm can be summarized by the following diagram:

As a remark, even if it is not required in a strict sense, simplification often reduces the number of free parameters per equation. For instance $\tilde{f}=A\times A\times A\times(x+A)$ is equivalent to $\tilde{f}=A_{1}\times x+A_{2}$ . Being able to make this simplification is a huge advantage because the fewer free scalars, the faster CMA-ES method executes. Next subsection gives order of magnitude about execution time.

3.5. Execution time

The first step of the algorithm described above runs way faster than the second one, even in mono-CPU implementation. The execution time is in fact no different than the one of standard SR packages like DEAP (DeRainville:2012:DPF:2330784.2330799, ).

The CMA-ES method is very much time-consuming. On mid-sized equations (say of length $30$ ) with around 5 free scalars, the CMA-ES method takes around 30 seconds on a single CPU. In practice, we ran CMA-ES on a 40 cores machine. But even in this case, for difficult targets with maximum size $L=40$ or $45$ , it took almost two days to produce around $200\,000\text{\,}$ CMA-ES evaluations. The runs we report in Table 5 for difficult targets typically took between one and three days per target per run. Clearly, our method can not be generalized to very long expressions.

The details of the execution time required for each run are reported in the captions of Tables 1 to 5.

4. Experimental Results

4.1. Targets description

As already said, we use the same vocabulary for all our targets. The only change from one to another is the maximum length an individual can have. The table of targets we used can be found on the last page of Ref. (DBLP:journals/corr/abs-1805-10365, ), which is in itself a compilation of many targets from the literature (see also (white2013better, )). From this original list of 63 targets, we discarded the ones with more than three variables as well as the ones that are too simple for our implementation666Namely Keijzer-7, Keijzer-8, Keijzer-13, Korns-4, Korns-5, Korns-6, Korns-7, Nguyen-1, Nguyen-8, Vladisleva-6.

We ended up with the 31 targets that are listed in result Tables 1 to 5. Table 2 lists ”easy targets” for which a sentence length of 15 was enough (although the success rate might improve with slightly longer maximal size), while Table 3 reports ”reasonably difficult ones” with length around $L=30$ , and Table 4 and 5 list difficult targets for which our success rate dropped significantly.

As in Ref. (DBLP:journals/corr/abs-1805-10365, ), the notation $x:E[0,1,0.05]$ means that the variable $x$ is sampled in the interval $[0,1]$ with constant steps of size $0.05$ , while $x:U[0,1,100]$ means for instance sampling 100 points for $x$ in the interval $[0,1]$ according to uniform distribution.

Our success rate is in most cases greater than $80\%$ . One might object that fixing the maximal length $L$ from the start is expert-knowledge. But in practice $L$ is just a meta-parameter that can be adjusted by the user, starting from small values that lead to shorter execution times, and progressively increasing it when the success rate is too small. As a matter of fact, we did proceed in this way for some of the targets of Table 3 that we first thought would be easy in small length, but turned out to be harder than expected. In theory, nothing would prevent a more generic version of the algorithm presented in this paper to auto-adjust this parameter.

Since the CMA-ES method process floating-point numbers, one must avoid expressions that are not defined on $\mathbb{R}$ , such as $x^{2.999}$ for $x\in\mathbb{R}^{-}$ . Therefore, compared to the table (DBLP:journals/corr/abs-1805-10365, ), the training and validation set ranges were systematically cut to subsets of $\mathbb{R}^{+}$ . The same number of points were given for the training sets, but on a smaller range. Yet, it did not prevent us to achieve high success rates.

Also, we give ourselves a much smaller training set for some of the multidimensional targets. Consider for example Keijzer-5 (see Table 2). Ref. (DBLP:journals/corr/abs-1805-10365, ) reports a training set $U[-1,1,1000]$ for $x,y$ and $U[-1,1,10000]$ for $z$ . This would mean providing way too many points to the CMA-ES method, which already requires many iterations. This would translate into a very large execution time. For this reason, we limited the training set for targets of this sort. In particular, for this target, we give ourselves a training set of $5\times 5\times 10$ points only. Again, this does not prevent the method to achieve $100\%$ success rate here on the validation set.

However, this is also a clear drawback of the method for high dimensional targets. Remind that CMA-ES optimizes the $A_{i}$ ’s in $\tilde{f}(\vec{A},\vec{x})$ with respect to the quadratic cost $\sum_{\vec{x}}(f_{\textrm{target}}(\vec{x})-\tilde{f}(\vec{A},\vec{x}))^{2}$ . Therefore it can not really do so without taking too long when $\vec{x}$ is of the order of a thousand points. Thus, in practice, we could only experiment with this method up to 4-dimensional targets with 5 points along each dimensions ( $5^{4}=625$ ). For this reason, we explore at most three-dimensional targets in this paper. This limitation may however be overcome by using another optimizer than CMA-ES.

4.2. Result tables description

Table 1 is self explanatory. Regarding results reported on Tables 2, 3, and 5 of the combined method MAP-Elites + CMA-ES of Section 3.4, the first column corresponds to the target name listed in (DBLP:journals/corr/abs-1805-10365, ). The second column gives the formula, while the third column details the training set. As said, we only reduced the range to positive values with respect to (DBLP:journals/corr/abs-1805-10365, ), and sometimes we reduced the number of points provided to the evolutionary algorithm, but never increased them. The fourth column corresponds to the validation set, usually greater than the one given in this reference, for reasons already explained in Section 3.2.

Then, the fifth column gives the hit rate for the first step of the algorithm with integer scalars $1$ and $2$ . Given that only a few of these targets involve floating-point numbers, this first step is already able to reach the target, especially the easy ones of Table 2. On the contrary, it can never hit Keijzer 1,2,3 which are $0.3x\sin(2\pi x)$ on various training sets. Column 6 gives the number of evaluations that were done before actually hitting the target, and when it was hit, averaged over the number of independent runs.

Column 7 and 8 are similar to 5 and 6, but this time for the second step of the algorithm involving CMA-ES, and after conversion of integer scalars to free scalars $A_{i}$ . The last column is the sum of both hit rates, i.e. our main experimental result.

The following Tables have only 29 targets, and not 31: this is because we shall not report on Vladislavleva-7

[TABLE]

and Vladislavleva-5

[TABLE]

for which we have no success at all (although quite good NRMSE).

4.3. Illustration of MAP-Elites Grid

As an illustration, next Fig. 5 shows the population on the MAP-Elites grid for a successful run for target ’Burks’, as described in Table 1 and 2. The $x$ -axis corresponds to the number of free scalars from 0 to 16 and $y$ -axis is the length (from 0 on the top to 30). This is not the full 3-D grid, but only a slice corresponding to a number of functions less or equal to 1. The grid is not (and actually can’t be) populated on the top right corner. It shows that small individuals perform quite poorly, as well as individuals with too many or too few free scalars.

4.4. Discussion

It appears that target involving $sin$ functions are the most difficult ones. This is can be illustrated by Keijzer 1, 2, 3, $f=0.3x\sin(2\pi x)$ that require many fitness evaluations before convergence although the target looks quite trivial (and see also the low hit rate for target Keijzer-11). As a matter of fact, the CMA-ES method has a very small success rate on fitting the exact equation $\tilde{f}=A_{1}x\sin(A_{2}x)$ . The success rate with default CMA parameters is only around $2\%$ based on 1000 runs for Keijzer-1. This means that CMA-ES will often miss the right target.

This is because the landscape is quite deceptive in this case. Next figure shows projections of that landscape as a function of $A_{2}$ (on the horizontal axis) alone of the mean squared error (used in the CMA-ES method) between the actual target and a test function $f=0.3x\sin(A_{2}x)$ for the test range of Keijzer-3. We see that the global minimum in $A_{2}=2\pi$ is kind of lost in many local optima, see Fig. 6.

In fact, when the target is found, it is usually with a more complex formula, e.g. $\tilde{f}=A_{1}x\sin(A_{2}x+A_{3})$ or even $\tilde{f}=A_{1}x\sin(A_{2}x+A_{3})+A_{4}$ , for which CMA-ES success rate increases significantly up to around $17\%$ . Interestingly enough then, the algorithm tries many variations around the correct formal expression until CMA-ES hits the right scalar parameters. Moreover CMA-ES seems to be able to do better on landscapes that have extra spurious dimensions. Most presumably, the landscape gets easier to browse in this extra dimensional embedding. This also means that simplification may be counter-productive. However, we did runs with and without simplifications and the results are overall similar. (Simplification is not used from Table 1 to 4, and used in Table 5).

In order to increase CMA-ES success rate, each CMA-ES instance is called with an initial Gaussian with a random mean varying between $-1$ and $1$ , and a random initial variance chosen varying between 1 and 5.

The poor success rate on target Nguyen-7 (Table 3) is of different nature. It seems, looking at the detailed result, that the target on that range can be very easily approximated by rational fractions, so that the program does not get incentives to look for $\log$ solutions. Moreover, the provided range is very small; the success rate increases a lot and reaches easily $100\%$ if we give a training set of e.g., $x\in[0,40]$ . Given that we give ourselves only a small sample of the function, the difficult task here is more about finding good generalizations. In this regards, the algorithm actually performs very well. As an example, one of the program’s result is the following (with a NRMSE of $5.6\,10^{-5}$ ):

[TABLE]

which is highly accurate not only on the validation set $x\in[0,3]$ but actually on a much larger range $x\in[0,40]$ as shown in Fig. 7.

5. Conclusion

We described an approach to the Symbolic Regression problem combining a MAP-Elites exploration scheme together with a evolutionary optimizer. The optimizer, namely CMA-ES, allowed us to find ephemeral constants involved in symbolic expression with high accuracy. Starting from the same primitive set, we demonstrated high success rates on a large sample of reference targets taken from the literature.

We use the MAP-Elites (DBLP:journals/corr/MouretC15, ) algorithm to search for equations that are both accurate and diverse, across a range of selected topological features (parsimony, number of free scalars, number of non-algebraic functions). Additionally, other features could be taken into account to increase the diversity of the explored equations, catering either to equation topology (*e.g.*number of trigonometric functions) or to the mathematical properties of the target function (*e.g.*number of modes of the optimized equation compared to the target function, error measures on the first-order derivative, etc). An associated difficulty is the increase of number of bins in the MAP-Elites grid when more features are taken into account. This can make the algorithm focus too much on diversity which would reduce convergence speed. This can be prevented by using the CVT-MAP-Elites algorithm (vassiliades2017using, ).

The use of CMA-ES as ephemeral constants optimizer is a computationally intensive technique. Thus, it may difficult to scale our methodology to higher-dimensional or more complex target functions, as CMA-ES would require larger evaluations budgets. To overcome these limitations, alternative optimization techniques will have a key role to play, such as CMA-ES with several populations (hansen2009benchmarking, ) or variants of CMA-ES capable of handling large number of dimensions (loshchilov2018large, ). Moreover, to put the algorithm in practice on noisy targets (see (miranda2017noisy, )), further work will need to be done on the performance of black-box optimizers facing noise. In particular, because of its accuracy, CMA-ES is likely to return a large set of possible solutions that all seem to perfectly fit to the data. In this context, the handling of error bars throughout the entire SR process will be critical. Alternatively, we could employ optimization techniques that are inherently tolerant to noise, like Bayesian Optimization (frazier2018tutorial, ; pelikan2002scalability, ).

The set of genetic operations was intentionally left in its most basic form. We wanted to explore how the combination of several methods can perform on SR problems. Our algorithms leaves a lot of room for improvement and we hope that state-of-the-art GP techniques together with refined analyses of feature selection will allow to achieve better performance.

With very few adjustments, the algorithm we described here can handle systems of partial derivative equations. We leave this aspect and the automated discovery of coupled differential equations in physical systems for future work.

Acknowledgments

The work of Vincent Reverdy has been supported by the National Science Foundation under the grants SI2-SSE-1642411 and CCF-1647432. The work of Leo Cazenille has been supported by Grant-in-Aid for JSPS Fellows JP19F19722.

Bibliography42

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Josh Bongard and Hod Lipson. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy of Sciences , 104(24):9943–9948, 2007.
2[2] Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science , 324(5923):81–85, 2009.
3[3] Markus Quade, Markus Abel, Kamran Shafi, Robert K Niven, and Bernd R Noack. Prediction of dynamical systems by symbolic regression. Physical Review E , 94(1):012214, 2016.
4[4] Tailin Wu and Max Tegmark. Toward an AI Physicist for Unsupervised Learning. ar Xiv e-prints , page ar Xiv:1810.10525, Oct 2018.
5[5] John R Koza and John R Koza. Genetic programming: on the programming of computers by means of natural selection , volume 1. MIT press, 1992.
6[6] Riccardo Poli, William B Langdon, Nicholas F Mc Phee, and John R Koza. A field guide to genetic programming . Lulu. com, 2008.
7[7] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. Co RR , abs/1504.04909, 2015.
8[8] Emily Dolson, Alexander Lalejini, and Charles Ofria. Exploring genetic programming systems with map-elites. In Genetic Programming Theory and Practice XVI , pages 1–16. Springer, 2019.