How to Estimate the Ability of a Metaheuristic Algorithm to Guide Heuristics During Optimization
Miloš Simić (University of Belgrade, Belgrade, Serbia)

TL;DR
This paper introduces a methodology to evaluate how effectively a metaheuristic guides heuristics during optimization by comparing it to a naive placebo version, providing insights into the metaheuristic's specific contribution.
Contribution
It proposes a simple, effective approach using distribution comparison and introduces BER measures to quantify the metaheuristic's guiding ability.
Findings
The methodology successfully distinguishes guiding effectiveness in simulated experiments.
BER measures provide practical significance thresholds for performance differences.
Application to Simulated Annealing and SAT problems demonstrates the approach's utility.
Miloš Simić (ORCID: orcid.org/0000-0003-1506-3728)
University of Belgrade
Studentski trg 1, 11000 Belgrade
Abstract
Metaheuristics are general methods that guide the application of concrete heuristic(s) to problems that are too hard to solve using exact algorithms. However, even though a growing body of literature has been devoted to their statistical evaluation, the approaches proposed so far are able to assess only the coupled effects of metaheuristics and heuristics. They reveal nothing about how efficient the examined metaheuristic is at guiding its subordinate heuristic(s), nor do they tell us how much the heuristic component of the combined algorithm contributes to the overall performance. In this paper, we propose a simple yet effective methodology for doing so by deriving a naive, placebo metaheuristic from the one being studied and comparing the distributions of chosen performance metrics for the two methods. We propose three measures of difference between the two distributions. Those measures, which we call BER values (benefit, equivalence, risk), are based on a preselected threshold of practical significance which represents the minimal difference between two performance scores required for them to be considered practically different. We illustrate the usefulness of our methodology on the example of Simulated Annealing, the Boolean Satisfiability Problem, and the Flip heuristic.
Keywords: Algorithm Analysis, Metaheuristics, Heuristics, Simulated Annealing, Boolean Satisfiability
1 Introduction
Metaheuristics and heuristics are widely accepted optimization tools within the operations research community (Caserta & Voß 2010). They are used to approximately, but efficiently, solve problems that are too hard to be solved using exact algorithms (Nesmachnow 2014).
Heuristics are problem-specific techniques that, in general, quickly find good solutions to given problems, although there are no guarantees that those solutions will always be optimal. Heuristics can be used only to solve problems for which they have been specifically designed. Metaheuristics, on the other hand, have so far been utilized in two ways (Caserta & Voß 2010):
- as general-purpose optimization methods ready to apply to any problem without any modification, and
- as higher-order methods which guide how problem-specific heuristics are applied to instances belonging to a particular problem class.
Over time, it has been noticed that the latter approach yields better results (Caserta & Voß 2010). However, once a researcher evaluates such a method, they assess the combined performance of the metaheuristic and its subordinate heuristic(s). Although that gives insight into performance of the method as a whole, it does not provide answers to the following questions:
- Is it possible that the performance score has been achieved mostly or solely by the heuristic(s)?
- How much does the guiding logic of the metaheuristic contribute to total performance?
The answers to these questions are important because, if performance comes mostly or solely from the heuristics, it would be wrong to attribute the score to the metaheuristic and claim that a new solver for the specific class of problems has been found. The goal of this paper is to present a sound methodological framework to answer these questions. To the best of our knowledge, this is the first attempt to formulate such a technique.
The rest of the paper is organized as follows. The proposed methodology is described in Section 2. In Section 3, we present an example of its application to Simulated Annealing, Boolean Satisfiability Problem and the Flip heuristic. Finally, we discuss it and draw our conclusions in Section 4.
2 Proposed Methodology
Let be the metaheuristic being examined, and let denote a single heuristic or a group of heuristics intended to be executed one after another, known to work well on the problem class of interest. The performance metric of the combined method in which guides the application of can be modeled as a random variable whose distribution is given by:
[TABLE]
where and are random variables representing the instance of the problem class to which is applied and the seed for the random number generator, whereas and denote the parameters of and , respectively. For now, we assume that is a univariate variable, i.e. that the metric is a single value (the objective function to optimize, execution time, etc.). Its distribution is not known in advance and researchers estimate it by first tuning and and then evaluating the method on a number of problem instances , repeating evaluation several times for different choices of the seeds for the random number generator.
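The estimation procedure described above (evaluate the tuned method on several instances, repeating each evaluation for several seeds) can be sketched as follows. This is only an illustration: the names `collect_scores` and `toy_solver`, and the `solver(instance, seed)` signature, are assumptions that do not come from the paper, and the toy solver merely stands in for a tuned combination of a metaheuristic and its heuristic(s).

```python
import random

def collect_scores(solver, instances, seeds):
    """Return a score matrix: one row per instance, one column per seed."""
    return [[solver(instance, seed) for seed in seeds] for instance in instances]

def toy_solver(instance, seed):
    # Hypothetical stand-in for a tuned metaheuristic + heuristic:
    # the score depends on the instance and on the seeded random stream.
    rng = random.Random(seed)
    return instance * rng.random()

scores = collect_scores(toy_solver, instances=[1, 2, 3], seeds=[10, 20])
```

Fixing the seed per column makes every run reproducible, which is what later allows two methods to be compared entry by entry.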
As stated in the introduction, the metric measured in this manner represents an estimate of the performance of . In order to assess how good is at guiding , we can introduce an additional variable to Equation 1, which now becomes:
[TABLE]
assuming a more general form for the performance of a metaheuristic () guiding for the problem of a given class. The variable will denote the metaheuristic component and will be understood to have two levels: and , where the latter denotes what we will call a naive or placebo metaheuristic henceforth. It is a metaheuristic which is based on no purposeful logic and has no components other than random decisions. It is such that acts as an algorithm where is guided randomly, as if no metaheuristic has been used to guide it. Then, to answer the question:
- How good is at guiding ?
we should estimate
[TABLE]
and compare it to:
[TABLE]
The difference reveals the effect of changing the metaheuristic from naive random search, which has no guiding logic, to . If the effect is negligible, then it indicates that using , which may have sophisticated and complicated logic, to guide how is applied, is the same as using a naive metaheuristic with no logic to guide . In fact, that would mean that any score has achieved comes from using and has nothing or little to do with . After all, if the logic of guides similarly or identically to random search, then we cannot justify use of in that particular setting.
This method is similar to the one used in a typical scenario where there are two factors, and , and a researcher wants to estimate the linear effect of on a yield variable when is fixed to a certain value. The way to do so is to define the low and high levels of and then estimate how changes when is increased from its low to its high level. The effect that we are estimating is called the simple effect of at the chosen level of . This is precisely what we are trying to do in our case. We want to estimate how the performance metric changes when , the metaheuristic component, is changed from its low level with no logic (), to its high level, the metaheuristic being examined, with the heuristic component fixed to . The method that we propose in this Section achieves just that. Another example analogous to our case comes from pharmacological studies. When a new medicine is tested, one group of patients, called the control group, is given a placebo, while the other is given the drug. If the effect of the medicine is significantly better than that of the placebo, the drug is deemed effective. Otherwise, there is no justification to produce and use the medicine, as it is no more effective than a simple placebo. If we use terminology from that example in our study, we will say that the acts as the placebo and takes the role of the medicine.
We have exposed the core of our methodology and the rationale behind it, but there are still several issues that we must address:
1. Is the same for every and why can we not simply apply without plugging it into a metaheuristic?
2. Should the parameters be tuned to yield the maximal performance prior to evaluation of and or drawn randomly from a predefined space of allowed values?
3. How to compare to ?
We answer all those questions in the remainder of this section.
2.1 The Naive Metaheuristic
The main idea of our method is to see if guiding with no logic is the same as guiding it with the logic of , the metaheuristic being examined. The rationale behind this is that each metaheuristic is a specific set of rules, and that if using those rules gives the same results as not using any rules at all, then the observed performance is achieved by heuristics alone and the logic of is neither effective nor efficient. We have referred to guiding heuristics with no logic as the naive or placebo metaheuristic, . The reason why, in general, has to be a naive metaheuristic, and not just a mere application of , is that has to invest the same computational effort as in order for comparison of the corresponding distributions to be fair. This means that if is a population metaheuristic (e.g., a Genetic Algorithm), must be too. Similarly, if is a single-solution metaheuristic (e.g., Simulated Annealing), so must be . Moreover, in the former case, if the population in consists of individuals, the same must hold for .
In general, we do not need to state explicitly. We can derive from by removing all ’s unique algorithmic components and leaving only naive, random operations. An example in Section 3 will clarify this step.
2.2 Choice of Parameters
The choice of parameters is crucial to the performance of a (meta)heuristic. If tuned appropriately, they can greatly improve performance. If not set to appropriate values, they can degrade it. The question that naturally arises in our case is whether the parameters should be tuned prior to evaluation or treated as random variables and randomly set before each run of and during their evaluation. Both alternatives are viable, but are related to essentially different goals. If we opt for randomly setting the parameters, we would be aiming to assess the intrinsic guiding capability of that does not depend on the choice of parameters and is present in all its applications. However, does such capability exist? Different parameter settings can lead to diametrically opposite results. Besides, before is applied to real problems in practice, it is always tuned. Practitioners and researchers are interested in the best performance can give for a class of problems, not any performance for random parameter settings. Hence, we argue for tuning the parameters of prior to its evaluation. We can use some of them as the parameters of (for example, the number of individuals to ensure the populations are of the same sizes in and ), and then tune the parameters of , if any.
2.3 Comparison of Distributions
In the literature, the most common way to compare two metaheuristics is to compare their expected values of , the chosen performance metric, approximated by means of measurements of on the selected problem instances for different, but random choices of the seed for the random number generator. However, we argue against using means to compare distributions of . What must be understood is that the mean, even when accompanied by the standard deviation, may not be representative of the distribution (Gunawardena 2014). Therefore, a difference in means may not be informative and inference based on it may be invalid. Another, unfortunately common practice that we argue against is using p values as definite proofs to accept or reject tested hypotheses. One reason is that significance at the desired level can always be achieved by using sufficiently large samples (Demidenko 2016). In our case, by evaluating algorithms on a large number of problem instances and repeating the process many times, we can make p values as small as desired. The other reason is, as Fraser & Reid (2016) explained, that “[p value] can guide the judgments about scientific conclusions, but cannot replace them.”
Knowing this, we ask what is the appropriate way to compare the distribution of the performance metric for , , with that for , ? Let us assume, without loss of generality, that lower scores of the metric signify superior performance. If works better than , we should expect the distribution of to be located to the left of . A measure of how far the former is to the left of the latter is
[TABLE]
the probability that the score of a run of is lower, i.e. better, than the score of a run of for some randomly selected problem instance. However, we should not limit ourselves to testing only if is located to the left of . For example, suppose that the scores of lie in the range and those of in , but score differences lower than are practically negligible. Then would be equal to 1 and indicate complete superiority of over , which would be true from a purely statistical point of view, but false from the standpoint of practical importance. We therefore first need to set some threshold to define a minimal difference between two performance scores required for them to be considered practically different. Hence, instead of estimating , we should focus on:
[TABLE]
The probability quantifies practical benefit, with respect to , of guiding with , hence the name . The converse probability
[TABLE]
represents the risk of using instead of , that is the probability that guides practically worse than . What remains is the probability that and are practically equivalent:
[TABLE]
Those quantities, which we will call BER values (benefit, equivalence, risk) from now onwards, express the size of the effect of using instead of on the probability scale, simultaneously taking into account the chosen definition of practical meaningfulness. The BER values are related to ROC curves (Gonçalves et al. 2014). More specifically, when , the is the area under the ROC curve (AUROC) associated with and , whereas is equal to the area above the curve (Demidenko 2016). This is not a new idea for testing for difference between two distributions. We refer interested readers to (Wolfe & Hogg 1971, Zhou 2008, Newcombe 2006a, b, Demidenko 2016) for more details about ROC curves, computational techniques for estimating AUROC, and application of the method to discriminate between distributions. What is new in our approach is , the threshold of practical significance, which should be set in advance according to the theory and empirical knowledge of the optimization-problem class for which is being developed, and the distinction between and values, as opposed to Demidenko (2016), who does not distinguish between them.
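For readers who prefer formulas, the three probabilities described in words above can be sketched as follows. The notation is an assumption introduced here for illustration only and is not necessarily the paper's own: $Y_M$ and $Y_0$ stand for the performance metric of the examined and of the placebo method, $\delta \ge 0$ for the threshold of practical significance, and lower scores are better. The strict inequalities are one possible convention for scores that differ by exactly $\delta$.

```latex
\begin{align*}
B_\delta &= P\left(Y_M < Y_0 - \delta\right), \\
R_\delta &= P\left(Y_M > Y_0 + \delta\right), \\
E_\delta &= P\left(\lvert Y_M - Y_0 \rvert \le \delta\right), \\
B_\delta + E_\delta + R_\delta &= 1 .
\end{align*}
```

Under this convention, for $\delta = 0$ and continuous scores the equivalence term vanishes and benefit and risk sum to one, which matches the interpretation as the areas under and above the ROC curve.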
Finally, we have to address calculation of , , and and their interpretation. Let us assume that we have run and on problem instances , repeating evaluation on each instance times using seeds , , . Let and denote matrices where the results of and are stored. The obvious way to calculate empirical value, denoted as , is to compare the corresponding entries in the result matrices:
[TABLE]
where is the indicator function that takes the value 1 when its underlying condition evaluates to true, and 0 otherwise.
The explanation of Equation 9 is as follows. The result of the sums in Equation 9 is equal to the number of times that produced better solutions than for the problems . The denominator is the total number of comparisons. Therefore, their ratio is an estimate of the probability that for a random instance from the problem class to which belong, will produce a better solution than . Better in this context means “lower by at least ”.
The empirical equivalence and risk are calculated analogously:
[TABLE]
[TABLE]
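The entry-by-entry comparison of the two result matrices can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `ber_values`, the argument names, and the handling of differences that are exactly equal to the threshold (counted as equivalence here) are assumptions, and lower scores are taken to be better.

```python
def ber_values(scores_m, scores_0, delta):
    """Empirical (benefit, equivalence, risk) from paired score matrices."""
    benefit = equivalence = risk = total = 0
    for row_m, row_0 in zip(scores_m, scores_0):      # one row per instance
        for y_m, y_0 in zip(row_m, row_0):            # one column per seed
            diff = y_m - y_0
            if diff < -delta:
                benefit += 1        # examined method is practically better
            elif diff > delta:
                risk += 1           # examined method is practically worse
            else:
                equivalence += 1    # difference is within the threshold
            total += 1
    return benefit / total, equivalence / total, risk / total

# Made-up scores for two instances and two seeds, for illustration only.
scores_m = [[0.00, 0.02], [0.05, 0.01]]
scores_0 = [[0.03, 0.02], [0.01, 0.01]]
b, e, r = ber_values(scores_m, scores_0, delta=0.02)
```

Because every comparison falls into exactly one of the three branches, the three returned values always sum to one.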
In general, if is a good choice for guiding , we should expect that and . If , then no meaningful difference between and has been found, suggesting that the score of is achieved by . The greater the value of , the stronger the evidence that guides in a way that it deteriorates the effect of the heuristic component. The closer is to , the stronger the evidence in favor of being able to efficiently guide .
Finally, we must stress that BER values, just like p values, cannot replace a researcher’s own reasoning. Just as we should not base conclusions solely on p values being lower or greater than the usual significance thresholds of 0.05 and 0.01, we should not regard empirical benefit, risk, and equivalence as a definite answer to the question concerning the examined metaheuristic’s efficiency in guiding its heuristic(s). After all, the nature of statistical research is such that only through replications of experiments can a certain hypothesis be accepted or rejected. So, researchers should always plot and against one another to visually inspect the empirical distributions. Moreover, similar plots should be made for each problem instance. Only when all that is taken into account should researchers formulate their conclusions.
2.4 Assumptions
We will conclude this section by briefly stating the assumptions of the methodology which we proposed:
- A1: The heuristic component is known to work well.
- A2: The performance metric is univariate (its value is a single score, not a tuple).
- A3: The metric is measured for each run.
We can see that they are fairly general and easy to meet in practice. In Section 4, we discuss the cases of their violation.
3 Experimental Example
In this Section, we will describe how we applied the method presented in Section 2 to a variant of Boolean Satisfiability Problem, 3-SAT, Simulated Annealing (SA), and the SAT heuristic known as Flip. The problem and the algorithms are presented in Sections 3.1-3.6. We describe benchmarks in Section 3.7, tuning in Section 3.8, and the results of comparing SA[Flip] to [Flip] in Section 3.9.
The repository with the code and data can be downloaded from https://osf.io/f2m9w/.
3.1 Boolean Satisfiability Problem
The Boolean Satisfiability Problem, SAT for short, is an NP-complete problem (Cook 1971) formulated as follows:
Definition 1
Given a Boolean formula with propositional letters , find a valuation of them under which evaluates to true.
Any Boolean formula can be converted to conjunctive normal form (CNF), i.e. a conjunction of clauses that are themselves disjunctions of literals (propositional letters or their negations):
[TABLE]
where for some . If all , then we say that is in its -CNF and refer to the SAT problem as -SAT. Since 3-SAT is also NP-complete and all Boolean formulae can be converted to 3-CNF (Cook 1971), we will focus on 3-SAT.
3.2 Solution Representation and Objective Function
A solution to a (3-)SAT instance is a valuation of its propositional letters , i.e. a mapping from to . Encoding false as 0 and true as 1, we can represent solutions as integer arrays of zeros and ones. The goal is to find a solution under which all the clauses in the formula are satisfied. We can formulate the objective function as the percentage of satisfied clauses and aim to maximize it, or as the percentage of unsatisfied clauses and try to minimize it. The two objective functions are equivalent and both return values between 0 and 1. We opt for the minimization alternative in this paper (i.e. we will minimize the ratio of the number of unsatisfied clauses to the total number of clauses) and use it as the performance metric . Obviously, if the metric equals 0, the optimal solution has been found and satisfiability of the formula in question has been proven. The reason why we use percentages rather than numbers of unsatisfied clauses is that we want the performance metric to be on the same scale for all formulae, no matter how many clauses they consist of.
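The minimization objective described above can be sketched as follows. The clause encoding is an assumption made for this example (signed integers in the style of the DIMACS format, where `3` means the third letter and `-3` its negation), as is the function name `unsatisfied_ratio`.

```python
def unsatisfied_ratio(clauses, assignment):
    """Fraction of clauses not satisfied by a 0/1 assignment (0 means all satisfied)."""
    def literal_true(lit):
        value = assignment[abs(lit) - 1]
        return value == 1 if lit > 0 else value == 0

    unsatisfied = sum(
        1 for clause in clauses if not any(literal_true(lit) for lit in clause)
    )
    return unsatisfied / len(clauses)

# (x1 or x2 or not x3) and (not x1 or x2 or x3) and (not x1 or not x2 or not x3)
clauses = [[1, 2, -3], [-1, 2, 3], [-1, -2, -3]]
```

Dividing by the number of clauses keeps the score on the same scale for formulae of different sizes, which is the reason given above for preferring a ratio over a raw count.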
3.3 Flip Heuristic
Let be a -CNF formula. The Flip heuristic receives a possible solution ( for ) and iteratively flips one of its elements at a time if doing so improves the objective function, until no further improvement is possible. The heuristic is presented in Algorithm 1 (Marchiori & Rossi 1999).
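One possible reading of the description above is sketched below. It is not a transcription of Algorithm 1: the scan order (left to right), the requirement of a strict improvement, and the names `count_unsat` and `flip` are assumptions, and the original heuristic of Marchiori & Rossi may differ in these details. Clauses use the same assumed signed-integer encoding (`-3` is the negation of the third letter).

```python
def count_unsat(clauses, a):
    # A literal k > 0 is true when a[k-1] == 1; k < 0 is true when a[-k-1] == 0.
    return sum(
        1 for c in clauses
        if not any((a[abs(lit) - 1] == 1) == (lit > 0) for lit in c)
    )

def flip(clauses, assignment):
    """Flip one variable at a time while that strictly reduces unsatisfied clauses."""
    a = list(assignment)                 # work on a copy
    best = count_unsat(clauses, a)
    improved = True
    while improved:                      # stop after a sweep with no improvement
        improved = False
        for i in range(len(a)):
            a[i] ^= 1                    # tentatively flip variable i
            score = count_unsat(clauses, a)
            if score < best:
                best, improved = score, True
            else:
                a[i] ^= 1                # undo a non-improving flip
    return a, best

clauses = [[1, 2, -3], [-1, 2, 3], [-1, -2, -3]]
solution, unsat = flip(clauses, [1, 1, 1])
```

Since each accepted flip strictly lowers the number of unsatisfied clauses, the loop is guaranteed to terminate.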
3.4 Simulated Annealing
Simulated Annealing (SA) is a well-known and widely used metaheuristic whose history dates back to the 1980s, when Kirkpatrick et al. (1983) and Černý (1985) published the first papers on the algorithm. Simulated Annealing is inspired by the process of physical annealing with solids, “in which a crystalline solid is heated and then allowed to cool very slowly until it achieves its most regular possible crystal lattice configuration (i.e., its minimum lattice energy state), and thus is free of crystal defects.” (Nikolaev & Jacobson 2010). The algorithm starts with an initial solution and processes it iteratively. Each iteration consists of several steps, and at each step, the algorithm compares the current solution to one of its neighbors. The current solution is always replaced with a better neighbor. If the neighbor is worse, the replacement occurs with a probability which depends on the current temperature. The algorithm receives the initial temperature and the cooling schedule at the beginning and decreases the temperature at each iteration according to the schedule. The pseudo-code of Simulated Annealing is outlined in Algorithm 2 (Nikolaev & Jacobson 2010).
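The acceptance step can be illustrated with the commonly used Metropolis criterion. This is an assumption: the text above only says that the probability of accepting a worse neighbor depends on the temperature, so the exponential form, the treatment of equally good neighbors, and the name `accept` are choices made for this sketch.

```python
import math
import random

def accept(current_score, neighbor_score, temperature, rng):
    """Metropolis-style acceptance for a minimization problem."""
    if neighbor_score <= current_score:
        return True   # a neighbor that is not worse is always taken
    # A worse neighbor is taken with a probability that shrinks as T decreases.
    worsening = neighbor_score - current_score
    return rng.random() < math.exp(-worsening / temperature)

rng = random.Random(0)
hot = sum(accept(0.1, 0.2, 10.0, rng) for _ in range(1000))
cold = sum(accept(0.1, 0.2, 0.001, rng) for _ in range(1000))
```

At a high temperature almost every worse neighbor is accepted, whereas near zero temperature the rule degenerates into pure descent.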
3.5 The SA[Flip] Algorithm for the SAT problem
The combination of Simulated Annealing and the Flip heuristic for SAT, named SA[Flip], is presented in Algorithm 3. In it, the heuristic specific to SAT is applied to the initial solution at the beginning of the algorithm, and once to each neighbor proposed to replace the current solution. Even though there may be other ways to combine the two algorithms, the goal of our study is not to find the best combination of them all, but to show how we can assess if the overall result of the combination being examined is due to the heuristic alone. The same procedure can be carried out for any metaheuristic and the heuristic(s) it guides.
We used a geometric cooling schedule, in which for some constant . As the stopping criterion we used the following compound condition:
- the objective function (ratio of the number of unsatisfied clauses to the total number of clauses) of the current solution is equal to 0, or
- the number of iterations performed is equal to the maximal number of iterations allowed, which is specified as a parameter of SA[Flip].
We checked for the stopping condition at each iteration as well as after each step.
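The cooling schedule and the compound stopping condition can be sketched as follows. The parameter names `t0`, `alpha`, and `max_iterations` are assumptions for this illustration, and the sketch takes the usual reading of a geometric schedule, in which each temperature is the previous one multiplied by a constant factor between 0 and 1.

```python
def geometric_temperatures(t0, alpha, max_iterations):
    """Yield t0, alpha * t0, alpha**2 * t0, ... for a constant 0 < alpha < 1."""
    t = t0
    for _ in range(max_iterations):
        yield t
        t *= alpha

def should_stop(unsat_ratio, iteration, max_iterations):
    """Stop when all clauses are satisfied or the iteration budget is spent."""
    return unsat_ratio == 0 or iteration >= max_iterations

temperatures = list(geometric_temperatures(10.0, 0.5, 4))
```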
Also, we kept track of the best solution encountered during execution of the algorithm and output it when SA[Flip] stops. We decided to do so because it may happen that the algorithm finds the optimal solution, but replaces it with a neighbor that is worse than it.
Even though the definition of neighborhoods can be treated as an additional parameter to calibrate, we chose not to do so, but to adopt one neighborhood definition in advance in order to reduce the number of parameters and simplify the demonstration of our methodology. Of course, we advise researchers to experimentally determine the best definition of a neighborhood, as in (Simić 2017). The one that we adopted and used throughout the experiment is as follows:
Definition 2
Two solutions to the same instance of the 3-SAT problem are neighbors of each other if and only if their Hamming distance is equal to 1.
This means that a neighbor of a solution differs from it in valuation of a single propositional letter.
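Generating a neighbor under this definition amounts to flipping one position of the 0/1 array. In the sketch below, choosing the position uniformly at random and the names `random_neighbor` and `hamming` are assumptions made for illustration.

```python
import random

def random_neighbor(solution, rng):
    """Return a copy of `solution` that differs from it in exactly one position."""
    neighbor = list(solution)
    neighbor[rng.randrange(len(neighbor))] ^= 1
    return neighbor

def hamming(a, b):
    """Number of positions in which two equally long arrays differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
solution = [0, 1, 1, 0]
neighbor = random_neighbor(solution, rng)
```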
3.6 Derivation of [Flip]
As said in Section 2, we need to compare SA[Flip] to [Flip] in order to estimate how good SA is at guiding Flip. We do not need to explicitly state as an actual algorithm. It is sufficient to remove all SA’s components from SA[Flip] and leave only naive operations at the metaheuristic level: random generation of the initial solution, random generation of neighbors, and their random acceptance. The parameters inherited from SA[Flip] are and and they should be set to the same values as for SA[Flip] in order to ensure that [Flip] can invest the same computational effort as SA[Flip].
We present [Flip] in Algorithm 4.
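To make the derivation concrete, the following sketch shows what a placebo built on Flip could look like. It is not a transcription of Algorithm 4: the 50% acceptance probability, the parameter names `max_iterations` and `steps`, the signed-integer clause encoding, and the helper implementations are all assumptions made for this illustration.

```python
import random

def count_unsat(clauses, a):
    # Literal k > 0 is true when a[k-1] == 1; k < 0 is true when a[-k-1] == 0.
    return sum(1 for c in clauses
               if not any((a[abs(lit) - 1] == 1) == (lit > 0) for lit in c))

def flip(clauses, a):
    # Local search: keep single-variable flips that strictly reduce unsatisfied clauses.
    a, best, improved = list(a), count_unsat(clauses, a), True
    while improved:
        improved = False
        for i in range(len(a)):
            a[i] ^= 1
            score = count_unsat(clauses, a)
            if score < best:
                best, improved = score, True
            else:
                a[i] ^= 1
    return a, best

def placebo_flip(clauses, n_vars, max_iterations, steps, seed):
    """Naive guidance: random start, random neighbor, quality-blind acceptance."""
    rng = random.Random(seed)
    start = [rng.randint(0, 1) for _ in range(n_vars)]
    current, cur_score = flip(clauses, start)
    best, best_score = current, cur_score
    for _ in range(max_iterations):
        for _ in range(steps):
            if best_score == 0:                    # all clauses satisfied
                return best, 0.0
            neighbor = list(current)
            neighbor[rng.randrange(n_vars)] ^= 1   # random neighbor at Hamming distance 1
            neighbor, nb_score = flip(clauses, neighbor)
            if rng.random() < 0.5:                 # acceptance ignores solution quality
                current, cur_score = neighbor, nb_score
            if cur_score < best_score:             # remember the best solution seen
                best, best_score = current, cur_score
    return best, best_score / len(clauses)

clauses = [[1, 2, -3], [-1, 2, 3], [-1, -2, -3]]
solution, ratio = placebo_flip(clauses, 3, max_iterations=5, steps=3, seed=1)
```

The loop structure mirrors that of a single-solution metaheuristic with the same iteration and step budget, so that the placebo can invest the same computational effort; only the temperature-driven acceptance rule is replaced by a coin toss.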
3.7 Benchmarks
Even though 3-SAT constitutes a class of problems of its own, we did not aim to cover all the possible subclasses of 3-SAT problems. Instead, we focused on those 3-SAT instances which are in the so-called phase transition. Those are the formulae with approximately clauses (Gent & Walsh 1994), where is the number of propositional letters that appear in them. Such instances are computationally hardest to solve and the probability of them being satisfiable is approximately equal to the probability that they are not.
We also limited since it is impossible to conduct an experiment involving all possible numbers of propositional letters and our computational resources were limited. We chose the range because the corresponding solution spaces are sufficiently large but not too much for our testing machine. For , we downloaded corresponding instances from SATLIB (http://www.cs.ubc.ca/~hoos/SATLIB/benchm.html) (Hoos & Stützle 2000). They are all satisfiable and in the phase transition. We split the formulae into training and test sets. The former contained formulae, for each , while the latter included the rest.
3.8 Tuning the Parameters
We tuned the parameters following the methodology of Simić (2017) as it rigorously employs statistical techniques from Design of Experiments (Montgomery 2000).
First, we screened the parameters , , , and to identify the influential ones. To do so, we defined their low, medium, and high levels (see Table 1). We used a Box-Behnken design for four three-level factors (Oehlert 2000, Box & Behnken 1960) and evaluated SA[Flip] thirty times on each formula in the training set, blocking the design for seeds. It turned out that and had substantial main effects and that there were second-order interactions between and , on one hand, and and on the other. Hence, we had to calibrate all four parameters. The estimated effects are presented in Figure 1.
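For orientation, the coded runs of a Box-Behnken design for four three-level factors can be generated as sketched below, with -1, 0, and 1 standing for the low, medium, and high levels. The function name `box_behnken` and the number of center points are assumptions; the sketch does not reproduce the blocking for seeds used in the experiment.

```python
from itertools import combinations, product

def box_behnken(n_factors, center_points=1):
    """Coded runs: every pair of factors at (+-1, +-1), all other factors at 0."""
    runs = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * n_factors
            run[i], run[j] = a, b
            runs.append(run)
    # Center runs keep every factor at its medium level.
    runs.extend([0] * n_factors for _ in range(center_points))
    return runs

design = box_behnken(4, center_points=3)
```

For four factors this pairwise construction gives 24 edge runs plus the chosen number of center runs, far fewer than the 81 runs of a full three-level factorial.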
Then, we calibrated the parameters iteratively, following Response Surface Methodology and evaluating SA[Flip] thirty times on each benchmark, but without blocking the design for seeds. The design that we used in this phase was fractional factorial. The reason why we used such a simple design is that the response (average performance) can be approximated with a linear model if the portion of the search space is sufficiently small. To ensure that, we used small but effective half-distances for the parameters. We present them in Table 2. The starting configuration was: , , and , because screening indicated that it might give very good results. We stopped the procedure once the values of and were such that the maximal number of applications of Flip exceeded . The resulting settings are: , , and .
3.9 Results and Their Interpretation
We evaluated SA[Flip] and [Flip] on each test benchmark thirty times. We made sure that the algorithms used the same sets of seeds for each formula to allow for a fair comparison.
In general, both methods achieved very good results, as can be seen in Tables 3-6. Their performance scores, , deteriorate as increases, which is also the case with their success rates - the percentages of successful runs, i.e. the runs where the output .
We calculated BER values for each as well as for the whole test set. We used three different values for in our analysis: [math], , and . The results are presented in Tables 7-9 and depicted in Figure 2.
Overall, the equivalence value turned out to dominate the other two by large margins for all choices of and each , , , and . This implies that SA[Flip] is effectively the same as [Flip], i.e. that SA guides Flip as effectively as the corresponding naive metaheuristic. We can also observe that drops whereas and increase with for . It is probable that such a trend continues for and that SA[Flip] and [Flip] become effectively distinct at some point. Therefore, future research could focus on investigating this hypothesis.
Plots of distributions of for SA[Flip] and [Flip] are presented in Figure 3. By visual inspection, we conclude that the distributions are almost indistinguishable, which confirms what has indicated: that SA is not better than at guiding Flip (for this set of problems). In turn, that means that the observed performance of SA[Flip] is most probably due to Flip alone. Had we not tested SA[Flip] in this manner, we would not have discovered that the efficiency of SA[Flip] came from the heuristic component. We were able to find it out only because we compared SA[Flip] to [Flip], which underscores the importance of the methodology proposed in this paper and its usefulness in research in this field.
There is one more issue that needs to be discussed. Another explanation for the observed equivalence of SA[Flip] and [Flip] could be that the benchmarks we used are too easy, so both algorithms performed really well and no difference between them could be found in the first place. This is known as the ceiling effect (Bartz-Beielstein & Preuß 2014). However, we do not think that the effect occurred in this experiment because we deliberately used benchmarks that are in the phase transition and hence the most challenging and difficult to solve. In addition to this, the numbers of propositional letters were not low and there are studies which evaluated solvers on the same groups of benchmarks but reported worse results, e.g. (Djenouri et al. 2016). Still, researchers who decide to follow our methodology to evaluate their metaheuristics should exercise caution and make sure that their benchmarks do not cause ceiling (or floor) effects.
Finally, our conclusion is as follows. The success of our SA[Flip] for the class of random 3-SAT formulae in the phase transition with at most propositional letters comes from the Flip heuristic. For this class of Boolean formulae, guiding Flip with Simulated Annealing has the same effect as using a naive random metaheuristic with essentially no logic to guide the application of Flip. The more similar SAT problems are to those used in our study, the higher the probability that the same effects will be detected.
4 Discussion and Conclusions
In this paper, we have proposed a methodology to empirically estimate how efficient a metaheuristic algorithm is at guiding specific heuristic(s). The proposed technique was applied to Simulated Annealing (SA), the Boolean Satisfiability Problem and the Flip heuristic. The experiment revealed that the performance score of the combination of SA and Flip was due to the heuristic, a result that we would not have been able to obtain without our methodology.
The methodology itself is mathematically well-founded, intuitive, easy to apply and relies on practical significance rather than solely on statistical significance. It directly compares empirical distributions of the chosen performance metrics, not just sample means, which provides better insight into the metaheuristic in question. By comparing the hybrid algorithm to its placebo counterpart, the technique allows us to estimate the effect of using the metaheuristic to guide the heuristic. The introduced BER values quantify that effect on the probability scale and, accompanied by visual comparison, reveal whether it is justified to guide application of the heuristic with the metaheuristic. Without investigating whether the performance of the hybrid comes mostly or entirely from the heuristic, we can easily draw wrong conclusions and claim that we have discovered novel solvers when, in fact, we have done nothing more than wrap an efficient heuristic solver in a metaheuristic whose contribution is negligible. If we propose the hybrid as a new solver, we must show that there is something that makes it worthwhile to guide the heuristic with the metaheuristic, and the technique studied and demonstrated in this paper offers a way to do precisely that.
Comparison of the hybrid algorithm to its placebo is the core of our approach. We defined the placebo to be a naive metaheuristic, which performs only naive operations in guiding the heuristic. We argued that completely random decisions constitute naive moves and that such a placebo represents the “low” level of in Equation 2, equivalent to no guiding logic. Can there be naive operations other than completely random decisions? Can greedy moves be thought of as naive? Those would be operations that always select the best solution from a group of candidates and discard the rest. Although they may seem naive, they do follow some logic, no matter how simple it is. Therefore, a placebo that included greedy moves would actually be a plain greedy algorithm. If the hybrid fails to beat it, we can say that there are no reasons to use the hybrid when a simple solver achieves the same or better results, but we would not be able to test whether the performance of the hybrid is achieved by the heuristic alone or whether the logic of the metaheuristic contributes to it significantly. Therefore, we argue that the placebo should contain only random operations.
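To make the distinction between guiding logic and naive moves concrete, the following minimal Python sketch contrasts three acceptance rules: a Metropolis-style rule of the kind commonly used in Simulated Annealing, a placebo rule that decides by a coin toss, and a greedy rule. The function names, the minimization convention and the 0.5 acceptance probability are illustrative assumptions, not the implementation used in Section 3.

```python
import math
import random


def sa_accept(current_cost, candidate_cost, temperature, rng):
    """Metropolis-style rule (minimization): always accept improvements,
    accept worse candidates with probability exp(-delta / temperature).
    Assumes temperature > 0."""
    delta = candidate_cost - current_cost
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def placebo_accept(current_cost, candidate_cost, temperature, rng):
    """Placebo rule: ignores costs and temperature entirely and accepts
    with an assumed probability of 0.5 (a purely random decision)."""
    return rng.random() < 0.5


def greedy_accept(current_cost, candidate_cost, temperature, rng):
    """Greedy rule: simple, but it still follows a logic, so under the
    argument above it would not qualify as a placebo."""
    return candidate_cost <= current_cost
```

All three functions share one signature, so a single search loop could be run with any of them; only the placebo discards every piece of information about solution quality.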
One may also ask why we do not compare the hybrid algorithm to the metaheuristic run with no heuristic at all. The reason is that the difference between those two can reveal whether the heuristic contributes anything to the performance of the whole method, not whether the metaheuristic is able to guide it efficiently.
The proposed approach is not without limitations, though. First of all, it requires evaluation of an additional algorithm (the placebo), which is derived from the metaheuristic being examined. Even though this prolongs the research, it also provides information that we would not get otherwise, as demonstrated in the example in Section 3, and without which scientific conclusions could be flawed. Therefore, we find that taking more time to complete this step pays off. Another limitation is that it enables us to reason about our metaheuristic only with respect to a chosen class of optimization problems. However, this limitation is not unique to this methodology and is inherent to all techniques for analyzing numerical experiments involving stochastic optimization algorithms.
Also, we must ask whether the hybrid algorithm is the only (sensible) combination of the metaheuristic and the heuristic, because if that is the case, then we can be sure that our method sheds light on the general ability of the metaheuristic to guide the heuristic. However, there may be more than one way to guide the heuristic with the metaheuristic. For instance, had we used a Genetic Algorithm (GA) (Holland 1992) instead of SA in the example in Section 3, we could have applied Flip after mutation (as did Marchiori & Rossi (1999)), but we could also have done it before mutation, immediately after performing crossovers. In such cases, rather than estimating the general ability of the metaheuristic to guide the heuristic, we are assessing the efficiency of the specific strategy, based on the metaheuristic, for guiding the heuristic. By testing with our method whether its effect is approximately equal to that of the placebo, we can determine whether guiding the heuristic in that particular way is justified. Moreover, even though it is possible to plug a heuristic into a metaheuristic between any two operations, is it really sensible to arbitrarily intertwine their logics? In each metaheuristic there is a point where the quality of a solution, i.e. the value of the objective function, is computed. It is the only step in the execution of metaheuristic methods where they need to evaluate a problem-specific function. All the other operations that they perform are based on specific optimization ideas or metaphors and constitute a logical unity. In our opinion, the moment just before evaluating a solution is a good time for applying a heuristic because that keeps problem-specific operations (evaluation of the objective function and application of the heuristic) in one place, allowing the logic of the metaheuristic to execute without interruptions and as originally designed. This way, comparing the hybrid to its placebo is as close to revealing the general effect of using the metaheuristic to guide the heuristic as it gets.
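The placement argued for above can be sketched as a thin wrapper that applies the heuristic immediately before the objective function is evaluated, leaving the rest of the metaheuristic untouched. This is only a minimal Python illustration: `make_guided_evaluator`, `flip_first_zero` and `count_ones` are hypothetical toy names and do not reproduce the Flip heuristic or the SAT objective from Section 3.

```python
def make_guided_evaluator(heuristic, objective):
    """Return an evaluation step that first applies `heuristic` and then
    scores the improved solution with `objective`."""
    def evaluate(solution):
        improved = heuristic(solution)
        return improved, objective(improved)
    return evaluate


def flip_first_zero(bits):
    """Toy stand-in heuristic: set the first 0 bit to 1 (returns a copy)."""
    result = list(bits)
    for i, bit in enumerate(result):
        if bit == 0:
            result[i] = 1
            break
    return result


def count_ones(bits):
    """Toy objective to maximize: the number of 1 bits."""
    return sum(bits)


evaluate = make_guided_evaluator(flip_first_zero, count_ones)
print(evaluate([0, 1, 0]))  # ([1, 1, 0], 2)
```

Because the metaheuristic only ever calls `evaluate`, the same wrapper could be handed unchanged to both the metaheuristic under study and its placebo, which keeps the comparison between the two focused on the guiding logic.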
We also need to discuss the assumptions of our methodology as well as the effects of violating them. One of the assumptions is that the heuristic is efficient. If it is new and its efficiency has not been confirmed, then the heuristic must be tested prior to application of the proposed methodology. Another assumption is that the performance metric is univariate, i.e. a single value, not a tuple of values. If several metrics are of interest, we can compare the hybrid to its placebo once for each of them and then analyze the results per metric. Then, what if we want to use a metric calculated as an aggregated value of the results of several runs on each problem instance? Let us suppose that we have stored the results in matrices and . The actual metric scores that we are interested in are then calculated as and (), where is the aggregating function. The methodology can still be applied, but the formulae for empirical BER values would need to be modified and the results could not be interpreted in quite the same way. Instead of comparing to , we would essentially be comparing to and the value would answer the following question:
- What is the probability that, for a randomly chosen instance from the problem class of interest, the hybrid's score will be practically better than that of its placebo when aggregated over several runs?
This is different from the meaning of the value as originally defined in Section 2 for the non-aggregated case. The corresponding formula for would then be:
[TABLE]
with analogous modifications being in place for and . Those differences are simple, but subtle, so we need to point them out.
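As an illustration of the aggregated case, the Python sketch below shows one plausible empirical estimator of the probability described in the question above. It is not the formula from this paper: the function name, the row-per-instance layout of the result matrices, the "higher is better" convention, the choice of the median as aggregating function and the threshold name `delta` are all assumptions made for the example.

```python
from statistics import median


def aggregated_better_rate(hybrid_runs, placebo_runs, delta, aggregate=median):
    """Fraction of instances on which the hybrid's aggregated score exceeds
    the placebo's aggregated score by more than `delta` (higher = better).

    Each argument is a list of rows, one row of run results per instance;
    row i of both matrices is assumed to refer to the same instance.
    """
    if len(hybrid_runs) != len(placebo_runs) or not hybrid_runs:
        raise ValueError("need the same non-zero number of instances")
    wins = 0
    for h_row, p_row in zip(hybrid_runs, placebo_runs):
        if aggregate(h_row) - aggregate(p_row) > delta:
            wins += 1
    return wins / len(hybrid_runs)


hybrid = [[0.9, 0.8, 1.0], [0.5, 0.6, 0.55], [0.7, 0.7, 0.7]]
placebo = [[0.6, 0.7, 0.65], [0.5, 0.58, 0.56], [0.9, 0.8, 0.85]]
rate = aggregated_better_rate(hybrid, placebo, delta=0.05)
```

In this toy data the per-instance medians differ by 0.25, -0.01 and -0.15, so only the first instance counts as a practical win and `rate` is one third. Note that the unit of comparison here is the instance, not the individual run, which is exactly why the interpretation differs from the non-aggregated case.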
Finally, the BER values that we define and propose in order to quantify the degree to which two distributions are not just statistically but practically different can be used to compare any two stochastic algorithms, not just a hybrid and its placebo. Moreover, since numerical and practical significance are confounded in BER values through , we find them suitable to detect important effects not just in the field of metaheuristics, but in science in general.
We hope that other researchers will see merit in our idea, adopt it in their own studies and improve it further to the benefit of the whole research community.
Possible directions of future research are:
- Developing a methodology that would simultaneously test both the efficiency of the heuristic and the metaheuristic's ability to guide it;
- Formulating a technique capable of estimating the general ability of the metaheuristic to guide any heuristic for the problem at hand, not just the selected one.
References
- Bartz-Beielstein, T. & Preuß, M. (2014), Experimental analysis of optimization algorithms: Tuning and beyond, in ‘Theory and Principled Methods for the Design of Metaheuristics’, Springer, pp. 205–245.
- Box, G. E. P. & Behnken, D. W. (1960), ‘Some new three level designs for the study of quantitative variables’, Technometrics 2(4), 455–475. https://www.tandfonline.com/doi/abs/10.1080/00401706.1960.10489912
- Caserta, M. & Voß, S. (2010), Matheuristics: Hybridizing Metaheuristics and Mathematical Programming, Springer US, Boston, MA, chapter Metaheuristics: Intelligent Problem Solving, pp. 1–38.
- Černý, V. (1985), ‘Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm’, Journal of Optimization Theory and Applications 45(1), 41–51.
- Cook, S. A. (1971), The complexity of theorem-proving procedures, in ‘Proceedings of the Third Annual ACM Symposium on Theory of Computing’, STOC ’71, ACM, New York, NY, USA, pp. 151–158. http://doi.acm.org/10.1145/800157.805047
- Demidenko, E. (2016), ‘The p-value you can’t buy’, The American Statistician 70(1), 33–38.
- Djenouri, Y., Habbas, Z. & Aggoune-Mtalaa, W. (2016), Bees swarm optimization metaheuristic guided by decomposition for solving max-sat, in ‘Proceedings of the 8th International Conference on Agents and Artificial Intelligence’, SCITEPRESS-Science and Technology Publications, Lda, pp. 472–479.
