Stochastic Global Optimization Algorithms: A Systematic Formal Approach
As we know, some global optimization problems cannot be solved using
analytic methods, so numeric/algorithmic approaches are used to find
near-optimal solutions for them. A stochastic global optimization
algorithm (SGoal) is an iterative algorithm that generates
a new population (a set of candidate solutions) from a previous population
using stochastic operations. Although some research works have formalized
SGoals using Markov kernels, such formalization is not general
and sometimes is blurred. In this paper, we propose a comprehensive
and systematic formal approach for studying SGoals. First,
we present the required theory of probability (σ-algebras,
measurable functions, kernels, Markov chains, products, convergence
and so on) and prove that some algorithmic functions like swapping
and projection can be represented by kernels. Then, we introduce the
notion of join-kernel as a way of characterizing the combination of
stochastic methods. Next, we define the optimization space, a formal
structure (a set with a σ-algebra that contains strict ϵ-optimal
states) for studying SGoals, and we develop kernels, like sort
and permutation, on such structure. Finally, we present some popular
SGoals in terms of the developed theory, we introduce sufficient
conditions for convergence of a SGoal, and we prove convergence
of some popular SGoals.
Department of Computer Science and Engineering, Universidad Nacional
de Colombia, Bogotá, Colombia
1 Stochastic Global Optimization
A global optimization problem is formulated in terms of finding a
point x in a subset Ω⊆Φ where a certain
function f:Φ→R attains its best/optimal
value (minimum or maximum) [1]. In the
optimization field, Ω, Φ, and f are called
the feasible region, the solution space, and the objective function,
respectively. The optimal value of the objective function (denoted
as f*∈R) is assumed to exist and to be unique (R
is totally ordered). In this paper, the global optimization problem
is considered as the minimization problem described by equation
1.
[TABLE]
Since a global optimization problem cannot be solved, in general,
using analytic methods, numeric methods are largely applied in this
task [2, 3]. Some numeric
methods are deterministic, like Cutting plane techniques [4],
and Branch and Bound [5] approaches
while others are stochastic like Hill Climbing and Simulated Annealing.
Many stochastic methods, like Evolutionary Algorithms and Differential
Evolution, are based on heuristics and metaheuristics [6, 7, 8, 9].
In this paper, we concentrate on stochastic global optimization
methods (algorithms). A Stochastic Global Optimization Algorithm (SGoal)
is an iterative algorithm that generates a new candidate set of solutions
(called a population) from a given population using a stochastic operation,
see Algorithm 1.
In Algorithm 1, n is the number of individuals in
the population (the population's size). Although parameters that control
the stochastic process, like the size of the population, can be adapted
(adjusted) during the execution of a SGoal, in this paper we only consider
SGoals with fixed control parameters (including the population's size).
Furthermore, P_t∈Ω^n is the population at iteration
t≥0, InitPop: N→Ω^n
is a function that generates the initial population (according
to some distribution), NextPop: Ω^n→Ω^n
is a stochastic method that generates the next population from the
current one (the stochastic search), End: Ω^n×N→Bool
is a predicate that defines when the SGoal(n) process is
stopped, and Best: Ω^n→Ω
is a function that obtains the best candidate solution (individual)
in the population according to the optimization problem under consideration,
see equation 2.
[TABLE]
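The generic loop described for Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: the names `sgoal`, `init_pop`, `next_pop`, `end`, and `best` are hypothetical stand-ins for InitPop, NextPop, End, and Best, a population is a plain Python list, and minimization is assumed.

```python
from typing import Callable, List, TypeVar

X = TypeVar("X")  # type of an individual (an element of the feasible region)


def best(pop: List[X], f: Callable[[X], float]) -> X:
    """Best: the individual with the smallest objective value (minimization)."""
    return min(pop, key=f)


def sgoal(n, init_pop, next_pop, end, f):
    """Generic SGoal loop with a fixed population size n (a sketch)."""
    t = 0
    pop = init_pop(n)        # P_0 = InitPop(n)
    while not end(pop, t):   # End(P_t, t) decides when to stop
        pop = next_pop(pop)  # P_{t+1} = NextPop(P_t)
        t += 1
    return best(pop, f)
```

Concrete SGoals differ mainly in the function passed as `next_pop`; the loop itself stays the same.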
Although there are several different SGoal models, such models
mainly vary on the definition of the NextPop function. Sections
1.1 to 1.3
present three popular SGoals reported in the literature.
1.1 Hill Climbing (HC)
The hill climbing algorithm (HC), see Algorithm 2,
is a SGoal that uses a single individual as population (n=1),
generates a new individual from it (using the stochastic method Variate: Ω→Ω),
and maintains the best individual among them (line 2). Notice that
HC allows neutral mutations to be introduced if the greater-or-equal
operator (≥) is used in line 2 (a neutral mutation is a variation in the
individual that does not change the value of the objective function [10]). In
order to maintain more than one individual in the population, the
HC algorithm can be parallelized using Algorithm 3.
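A minimal sketch of one HC step and of its parallel version is given below. It is not the paper's Algorithm 2 or 3; the function names and the `neutral` flag are assumptions made for illustration, and minimization is assumed, so the candidate replaces the current individual when it is better (or equally good, when neutral mutations are allowed).

```python
def hc_step(x, f, variate, neutral=True):
    """One hill-climbing step on a single individual (minimization)."""
    y = variate(x)  # stochastic variation of x
    if f(y) < f(x) or (neutral and f(y) == f(x)):
        return y    # accept improvements, and ties if neutral mutations are allowed
    return x


def parallel_hc_step(pop, f, variate, neutral=True):
    """Apply an independent HC step to every individual of the population."""
    return [hc_step(x, f, variate, neutral) for x in pop]
```

With `neutral=False` the step only accepts strict improvements, which corresponds to using a strict comparison in the replacement rule.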
1.2 Genetic Algorithms (Ga)
Genetic algorithms (Gas) are optimization techniques based
on the principles of natural evolution [6]. Although
there are several different versions of Gas, such as Generational
Genetic algorithms (GGa) and Steady State Genetic
algorithms (SSGa), in general, all Gas have
the same structure. The major differences between them are in the encoding
scheme, in the evolution mechanism, and in the replacement mechanism.
Algorithms 4 and 5 present the GGa
and SSGa, respectively. There, PickParents: Ω^n→N^2
picks two individuals (indices) as parents, XOver: Ω^2→Ω^2
combines both of them and produces two new individuals, Mutate: Ω^2→Ω^2
produces two individuals (offspring) that are mutations of such two
new individuals, Best2: Ω^4→Ω^2
picks the best two individuals among parents and offspring, and
Bernoulli(r) generates a true value following a Bernoulli
distribution with probability r.
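The operators listed above can be combined into one steady-state step as in the sketch below. This is an assumption-laden illustration, not the paper's Algorithm 5: it assumes that the Bernoulli draw gates the crossover, that the two selected parents are replaced in place by the best two of parents and offspring, and that the problem is a minimization one. All function and parameter names are hypothetical.

```python
import random


def ssga_step(pop, f, pick_parents, xover, mutate, cr, rng=random):
    """One steady-state GA step (sketch, minimization)."""
    i, j = pick_parents(pop)  # indices of the two parents
    parents = (pop[i], pop[j])
    # Assumed role of Bernoulli(cr): decide whether crossover is applied.
    children = xover(*parents) if rng.random() < cr else parents
    children = tuple(mutate(*children))
    # Best2: keep the best two among parents and offspring.
    a, b = sorted(parents + children, key=f)[:2]
    new_pop = list(pop)
    new_pop[i], new_pop[j] = a, b
    return new_pop
```

A generational variant would instead build a complete new population from repeated applications of the same operators.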
1.3 Differential Evolution (DE)
The Differential Evolution (DE) algorithm is an optimization technique
for linear spaces, based on the idea of using vector differences for
perturbing a candidate solution, see Algorithm 6. Here,
Ω is a d-dimensional linear search space, PickDifParents: N×N→N^3
gets three individuals (indices a, b, and c) that are different
from each other and different from the individual under consideration
(i), 0≤CR≤1 is a crossover rate, and 0≤F≤2
is the difference weight.
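The following sketch shows one DE iteration in the common DE/rand/1/bin form. Whether the paper's Algorithm 6 uses exactly this variant is an assumption; in particular, the forced index `j_rand` and the elitist replacement are standard choices that are assumed here, and individuals are represented as lists of floats.

```python
import random


def de_step(pop, f, cr, F, rng=random):
    """One DE/rand/1/bin iteration (sketch, minimization, needs len(pop) >= 4)."""
    n, d = len(pop), len(pop[0])
    new_pop = []
    for i, x in enumerate(pop):
        # PickDifParents: three distinct indices, all different from i.
        a, b, c = rng.sample([k for k in range(n) if k != i], 3)
        j_rand = rng.randrange(d)  # assumed: one dimension is always perturbed
        y = [
            pop[a][j] + F * (pop[b][j] - pop[c][j])
            if (rng.random() < cr or j == j_rand)
            else x[j]
            for j in range(d)
        ]
        new_pop.append(y if f(y) <= f(x) else x)  # keep the better of x and y
    return new_pop
```

Because each individual is replaced only by a candidate that is at least as good, the objective value of every position in the population never increases.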
2 Measure and Probability Theory
In this section, we introduce the basic measure and probability theory
concepts that are required for a formal treatment of SGoals.
First, we will concentrate on the concept of family of sets, required
for formalizing concepts like event and random observation (subsections
2.1 and 2.2). Next, we
will cover the concepts of measurable, measure, and probability measure
functions (subsection 2.3) that are required for
defining and studying the notion of Kernel (subsection 2.4),
notion that will be used as formal characterization of stochastic
methods used by SGoals. Then, we will present the concept of
Markov chains (subsection 2.5), concept that
is used for a formal treatment of SGoals. Finally, we introduce
two concepts of random sequence convergence (subsection 2.6)
for studying the convergence properties of a SGoal.
2.1 Family of Sets
Probability theory starts by defining the space of elementary events
(a nonempty set Ω) and the system of observable events
(a family of subsets of Ω). In the case of a formal treatment
of a SGoal, the space of elementary events is the set of possible
populations while the system of observable events is defined by any
subset of populations that can be generated, starting from a single
population, through the set of stochastic methods used by the SGoal.
In the rest of this paper, let Ω≠∅ be a non-empty
set (unless otherwise stated).
Definition 1.
(Power Set) Let Ω be a set. The power set of Ω,
denoted as 2^Ω, is the family of all subsets of Ω,
i.e., 2^Ω={A∣A⊆Ω}.
Clearly, the system of observable events is a subset 𝒜
of 2^Ω that satisfies some properties. Here, we introduce
families of sets with some properties that are required for that purpose.
Then, we establish a relation between two of them.
Definition 2.
Let 𝒜⊆2^Ω be a family of subsets of Ω.
1. df (disjoint family) 𝒜 is a disjoint family if A∩B=∅ for any pair A≠B∈𝒜.
2. cf (countable family) 𝒜 is a countable family if 𝒜={A_i}_{i∈I} for some countable set I.
3. cdf (countable disjoint family) 𝒜 is a countable disjoint family if 𝒜 is cf and df.
4. c (closed under complements) 𝒜 is closed under complements if A∈𝒜 implies A^c≡Ω∖A∈𝒜.
5. pd (closed under proper differences) 𝒜 is closed under proper differences if A,B∈𝒜 and A⊂B imply B∖A∈𝒜.
6. cdu (closed under countable disjoint unions) 𝒜 is closed under countable disjoint unions if ⋃_{i∈I}A_i∈𝒜 for every cdf {A_i∈𝒜}_{i∈I}.
7. cu (closed under countable unions) 𝒜 is closed under countable unions if ⋃_{i∈I}A_i∈𝒜 for every cf {A_i∈𝒜}_{i∈I}.
8. ci (closed under countable intersections) 𝒜 is closed under countable intersections if ⋂_{i∈I}A_i∈𝒜 for every cf {A_i∈𝒜}_{i∈I}.
9. π (π-system) 𝒜 is a π-system if it is closed under finite intersections, i.e., if A,B∈𝒜 then A∩B∈𝒜.
10. λ (λ-system) 𝒜 is called a λ-system iff (λ.1) ∅∈𝒜, (λ.2) 𝒜 is pd, and (λ.3) 𝒜 is cdu.
Lemma 3.
Let 𝒜⊆2^Ω.
1. (λ→pd) If 𝒜 is a λ-system then 𝒜 is pd.
2. (c→(cu↔ci)) If 𝒜 is c then 𝒜 is cu iff 𝒜 is ci.
Proof.
[1. λ→pd]
If A,B∈𝒜 with A⊂B then B^c∈𝒜 (λ.2: 𝒜
is c). Clearly, A∩B^c=∅
(A⊂B) and A∪B^c∈𝒜 (λ.3: 𝒜
is cdu). So, (A∪B^c)^c∈𝒜
(λ.2: 𝒜 is c),
i.e., A^c∩B∈𝒜 (De Morgan's law). Therefore, B∖A∈𝒜
(def. proper difference). [2. c→(cu↔ci)]
If {A_i}_{i∈I} is cf
in 𝒜, then {A_i^c}_{i∈I}
is cf in 𝒜 (𝒜
is c). Now, if
𝒜 is cu then ⋂_{i∈I}A_i=(⋃_{i∈I}A_i^c)^c∈𝒜
(De Morgan's law and 𝒜 is c),
so 𝒜 is ci. Finally,
if 𝒜 is ci
then ⋃_{i∈I}A_i=(⋂_{i∈I}A_i^c)^c∈𝒜
(De Morgan's law and 𝒜 is c),
so 𝒜 is cu.
∎
2.2 σ-algebras
Although each family of sets, in definition 2, is
very interesting on its own, none of them allows by itself to define,
in a consistent manner, a notion of probability. As we will see, σ-algebras
play this role in a natural way.
Definition 4.
(σ-algebra) A family of sets Σ⊆2^Ω
is called a σ-algebra over Ω iff (σ.1)
Ω∈Σ, (σ.2) Σ is c,
and (σ.3) Σ is cu.
Now, we can establish some relations between σ-algebras and
some of the previously defined families of sets. These relations are
very useful when dealing with notions like measure, measurable function, and
kernel.
Lemma 5.
Let Σ be a σ-algebra over Ω.
1. ∅∈Σ.
2. Σ is ci.
3. Σ is a λ-system.
4. Σ is pd.
Proof.
[1] Ω∈Σ (σ.1), then ∅∈Σ
(σ.2: Σ is c).
[2] Follows from σ.2, σ.3 and lemma 3.
[3] λ.1 follows from (1); λ.2
and λ.3 follow from σ.2 and σ.3, respectively.
[4] Follows from (3) and lemma 3. ∎
Proposition 6.
Let Ω be a set.
1. 2^Ω is a σ-algebra.
2. If {Σ_i}_{i∈I} is a family of σ-algebras over Ω then ⋂_{i∈I}Σ_i is a σ-algebra over Ω.
3. If 𝒜⊆2^Ω is an arbitrary family of subsets of Ω then the minimum σ-algebra generated by 𝒜 is σ(𝒜)=⋂{Σ ∣ 𝒜⊆Σ and Σ is a σ-algebra}.
Proof.
[1] Obvious. [2]Ω∈Σi
for all i∈I (σ.1) then Ω∈⋂i∈IΣi
(def. ⋂). If A∈⋂i∈IΣi then A∈Σi
for all i∈I, then Ac∈Σi for all i∈I (σ.2:Σi
is \mboxc), therefore Ac∈⋂i∈IΣi
(def. ⋂). If {Aj∈⋂i∈IΣi}j∈J
is cf then Aj∈Σi for all j∈J
and i∈I, then {Aj∈Σi}j∈J
is cf in Σi for all i∈I, therefore
⋃_{j∈J}A_j∈Σ_i for all i∈I (σ.3:
Σ_i is cu). So, ⋃_{j∈J}A_j∈⋂_{i∈I}Σ_i
(def. ⋂). [3] Follows from (1) and (2).∎
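For a finite set Ω, the generated σ-algebra σ(𝒜) can be computed directly by closing the family under complements and unions until nothing new appears. The sketch below is only an illustration for the finite case (where countable unions reduce to finite ones); the function name `generate_sigma_algebra` is hypothetical.

```python
def generate_sigma_algebra(omega, family):
    """Smallest σ-algebra over a finite set omega that contains `family`."""
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(a) for a in family}
    changed = True
    while changed:  # iterate until closed under complement and union
        changed = False
        for a in list(sigma):
            candidates = [omega - a] + [a | c for c in sigma]
            for b in candidates:
                if b not in sigma:
                    sigma.add(b)
                    changed = True
    return sigma
```

For example, over Ω={1,2,3,4} the family {{1},{2}} generates the σ-algebra whose atoms are {1}, {2}, and {3,4}, which has eight members.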
Theorem 7.
(Dynkin π-λ theorem)
Let 𝒜 be a λ-system and let ℰ⊆𝒜
be a π-system, then σ(ℰ)⊆𝒜.
Proof.
A proof of this theorem can be found on page 6 of Klenke's book [11]
(Theorem 1.19).
∎
Now, notions of measure and probability measure are defined on the
real numbers (R), usually equipped with the Euclidean
distance, so we need to define an appropriate σ-algebra on
it. Such an appropriate σ-algebra can be defined as a special
case of a σ-algebra for topological spaces.
Definition 8.
(Borel σ-algebra) Let
(Ω,τ) be a topological space. The σ-algebra
ℬ(Ω)≡ℬ(Ω,τ)≡σ(τ)
is called the Borel σ-algebra on Ω and every A∈ℬ(Ω,τ)
is called a Borel (measurable) set.
Proposition 9.
If ℬ(R)
is the Borel σ-algebra where R is equipped with
the Euclidean distance, then ℬ(R)=σ(ℰ_7)
with ℰ_7={(α,β]∣α,β∈Q, α<β}.
Proof.
A proof of this proposition can be found on page 9 of Klenke's book
[11] (Theorem 1.23).
∎
Now, we are ready to define the basic mathematical structure used
by probability theory.
Definition 10.
(measurable space) If Σ is a σ-algebra over
a set Ω then the pair (Ω,Σ)
is called a measurable space. Sets in Σ are called measurable
sets on Ω.
2.3 Functions
Having the playground defined (space of elementary events and observable
events), probability theory defines operations over them (functions).
Such functions will allow us to characterize the stochastic methods used
by a SGoal. First, we introduce the concepts of set
function and inverse function, which are used when working with σ-algebras.
Definition 11.
(set functions) Let f:Ω_1→Ω_2
be a function.
1. The power set function of f is defined as
\[ f : 2^{\Omega_1} \to 2^{\Omega_2}, \qquad f(A) = \{\, f(x) \mid x \in A \,\} \]
2. The inverse function of f is defined as
\[ f^{-1} : 2^{\Omega_2} \to 2^{\Omega_1}, \qquad f^{-1}(B) = \{\, x \in \Omega_1 \mid f(x) \in B \,\} \]
Next, we study the measurable functions, structure-preserving maps
(homomorphisms between measurable spaces). A measurable function guarantees
that observable events are obtained by applying the function to observable
events. For SGoals, a measurable function (stochastic methods)
basically means that any generated subset of populations must be obtained
by applying the stochastic methods to some generated subset of populations.
Definition 12.
(measurable function) Let (Ω1,Σ1)
and (Ω2,Σ2) be measurable spaces
and f:Ω1→Ω2 be a function.
Function f is called Σ1−Σ2 measurable if for
every measurable set B∈Σ2, its inverse image is a measurable
set in (Ω1,Σ1), i.e.,
f−1(B)∈Σ1.
Corollary 13.
Let (Ω_1,Σ_1), (Ω_2,Σ_2)
and (Ω_3,Σ_3) be measurable spaces.
If f:Ω_1→Ω_2 is Σ_1−Σ_2
measurable and g:Ω_2→Ω_3 is
Σ_2−Σ_3 measurable then g∘f:Ω_1→Ω_3
is Σ_1−Σ_3 measurable.
Proof.
If A∈Σ_3 then g^{−1}(A)∈Σ_2 (g
is Σ_2−Σ_3 measurable), therefore f^{−1}(g^{−1}(A))∈Σ_1
(f is Σ_1−Σ_2 measurable). Clearly, f^{−1}(g^{−1}(A))=(g∘f)^{−1}(A)∈Σ_1
(def. inverse). In this way, g∘f is Σ_1−Σ_3
measurable.∎
Definition 14.
(isomorphism of measurable spaces) Let (Ω1,Σ1)
and (Ω2,Σ2) be measurable spaces
and φ:Ω1→Ω2 be a bijective
function. φ is called (Ω1,Σ1)-(Ω2,Σ2)
isomorphism ((Ω1,Σ1) and (Ω2,Σ2)
are called isomorphic) if φ is Σ1−Σ2 measurable
and φ−1 is Σ2−Σ1 measurable.
After that, we consider measure functions, functions that quantify,
in some way, how large observable events are. This concept of measure
function is the starting point for defining probability measure functions.
In the following, we just write "f is measurable" instead of "f
is Σ_1−Σ_2 measurable" if the associated σ-algebras
can be inferred from the context.
Definition 15.
(measure function) Let (Ω,Σ) be
a measurable space and μ:Σ→R̄
be a function from Σ to the extended reals (R̄=R∪{−∞,∞}).
Function μ is called a measure if it satisfies the following three
conditions:
1. μ.1 (nullity) μ(∅)=0.
2. μ.2 (non-negativity) μ(A)≥0 for all A∈Σ.
3. μ.3 (σ-additivity) μ(⋃_{i∈I}A_i)=∑_{i∈I}μ(A_i) for every cdf {A_i∈Σ}_{i∈I}.
Then, we consider probability measure functions, functions that quantify,
how probable observable events are. For SGoals, a probability
function will quantify how probable a set of populations can be generated
using stochastic methods.
Definition 16.
Let (Ω,Σ) be a measurable space and μ:Σ→R̄
be a measure.
1. (finite measure) μ is a finite measure if μ(A)<∞ for all A∈Σ.
2. (probability measure) μ is a probability measure if μ(Ω)=1.
Now, we are ready to define the mathematical structure used by probability
theory.
Definition 17.
Let (Ω,Σ) be a measurable space and μ:Σ→R̄
be a measure function.
1. (measure space) (Ω,Σ,μ) is called a measure space.
2. (probability space) If μ is a probability measure, (Ω,Σ,μ) is called a probability space.
Finally, we can define the concept of random variable, a function
that preserves observable events and quantifies how probable an observable
event is.
Definition 18.
(random variable) Let (Ω_1,Σ_1,Pr)
be a probability space and (Ω_2,Σ_2)
be a measurable space. If X:Ω_1→Ω_2
is a measurable function then X is called a random variable with
values in (Ω_2,Σ_2).
For A∈Σ_2, we denote {X∈A}≡X^{−1}(A)
and Pr[X∈A]≡Pr[X^{−1}(A)].
In particular, if (Ω_2,Σ_2)=(R,ℬ(R)),
then X is called a real random variable and we let Pr[X≥0]≡Pr[X^{−1}([0,∞))].
2.4 Kernel
As pointed out by Breiman in Section 4.3 of [12], the kernel
is a regular conditional probability K(x,A)=P(x,A)=Pr[X_t∈A∣X_{t−1}=x].
For SGoals, a kernel will be used for characterizing the stochastic
process carried out iteration by iteration (generating a population
from a population).
Definition 19.
(Markov kernel) Let (Ω_1,Σ_1)
and (Ω_2,Σ_2) be measurable spaces.
A function K:Ω_1×Σ_2→[0,1]
is called a (Markov) kernel if the following two conditions hold:
1. K.1 Function K_{x,•}: A↦K(x,A) is a probability measure for each fixed x∈Ω_1, and
2. K.2 Function K_{•,A}: x↦K(x,A) is a measurable function for each fixed A∈Σ_2.
Remark 20.
As noticed by Klenke in Remark 8.26,
page 181 of [11], it is sufficient to check K.2 in
definition 19 for sets A from a π-system ℰ
that generates Σ_2 and that either contains A or a sequence
A_n↑A. Indeed, in this case, 𝒟={A∈Σ_2 ∣ K_{•,A} is measurable}
is a λ-system. Since ℰ⊂𝒟, by
the Dynkin π−λ theorem (theorem 7),
𝒟=σ(ℰ)=Σ_2.
If the transition density K:Ω_1×Ω_2→[0,1]
exists, then the transition kernel can be defined using equation 3.
In the rest of this paper, we will consider kernels having transition
densities.
[TABLE]
Kernels that will play a main role in a systematic development of
a formal theory for SGoals are those associated with deterministic
methods that are used by a particular SGoal, like selecting
any/the best individual in a population, or sorting a population.
Theorem 21 provides a sufficient condition for
characterizing deterministic methods as kernels.
Theorem 21.
(deterministic kernel) Let (Ω_1,Σ_1)
and (Ω_2,Σ_2) be measurable spaces,
and f:Ω_1→Ω_2 be Σ_1−Σ_2
measurable. The function 1_f:Ω_1×Σ_2→[0,1]
defined as follows is a kernel.
\[ 1_f(x, A) = \begin{cases} 1 & \text{if } f(x) \in A \\ 0 & \text{otherwise} \end{cases} \]
Proof.
[well-defined] Obvious: it is defined using the membership
predicate and takes only two values, 0 and 1. [K.1]
Let x∈Ω_1. Clearly 1_f(x,A)≥0
for all A∈Σ_2, so 1_f(x,•) is non-negative.
Now, 1_f(x,∅)=0, so it satisfies μ.1.
Let {A_i∈Σ_2}_{i∈I} be a cdf.
If f(x)∈⨄_{i∈I}A_i then 1_f(x,⨄_{i∈I}A_i)=1
(def. 1_f), and there is k∈I such that f(x)∈A_k
(def. ⨄). Therefore, f(x)∉A_i for all i≠k∈I
({A_i∈Σ_2}_{i∈I} is df),
so 1_f(x,A_k)=1 and 1_f(x,A_i)=0
for all i≠k∈I (def. 1_f). Clearly, 1_f(x,⨄_{i∈I}A_i)=1=∑_{i∈I}1_f(x,A_i).
A similar argument applies when f(x)∉⨄_{i∈I}A_i;
in this case 1_f(x,⨄_{i∈I}A_i)=0=∑_{i∈I}1_f(x,A_i).
Then, 1_f(x,•) is σ-additive, so it is
a measure. Since 1_f(x,Ω_2)=1 (obvious), 1_f(x,•)
is a probability measure for each fixed x∈Ω_1. [K.2]
Let A∈Σ_2 and α∈Q^+. If α<1
then 1_f(•,A)^{−1}((0,α])=∅
(1_f(x,A)∈{0,1} and neither value lies in (0,α]),
so 1_f(•,A)^{−1}((0,α])∈Σ_1
(lemma 5.1). Now, if α≥1 then
1_f(•,A)^{−1}((0,α])={x∈Ω_1∣1_f(x,A)=1}
(1 is the only value of 1_f in (0,α]),
i.e., 1_f(•,A)^{−1}((0,α])={x∈Ω_1∣f(x)∈A}=f^{−1}(A)
(def. f^{−1}), so 1_f(•,A)^{−1}((0,α])∈Σ_1
(f is measurable). Therefore, 1_f(•,A) is measurable. ∎
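On a finite or countable state space, where every subset is measurable, the deterministic kernel of Theorem 21 reduces to a membership test. The sketch below illustrates this with a sorting function as the deterministic method; the name `deterministic_kernel` and the representation of an event A as a Python set are assumptions made for illustration.

```python
def deterministic_kernel(f):
    """Return the kernel 1_f, with 1_f(x, A) = 1 if f(x) is in A and 0 otherwise."""
    def kernel(x, A):
        return 1.0 if f(x) in A else 0.0
    return kernel


def sort_pop(pop):
    """A deterministic method on populations: sort the tuple."""
    return tuple(sorted(pop))


sort_kernel = deterministic_kernel(sort_pop)
```

For a fixed x, the map A ↦ `sort_kernel(x, A)` is the point mass at `sort_pop(x)`, which is why it is additive over disjoint events.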
Corollary 22.
(Indicator kernel) Let (Ω,Σ)
be a measurable space. The indicator function 1:Ω×Σ→[0,1]
defined as 1(x,A)=1_{id}(x,A),
with id(x)=x, is a kernel.
Proof.
According to theorem 21, it is sufficient to
prove that id is a measurable function. This is obvious: we have
that A=id(A)=id^{−1}(A) (def. id), so
id^{−1}(A)∈Σ if A∈Σ.
∎
Transition probabilities (kernels) also represent linear operators
over infinite-dimensional vector spaces [13, 14]. Therefore,
operations like kernel multiplication and convex combinations of kernels
can be used while preserving the Markov property of the resulting
transition kernel (sometimes called update mechanism).
2.4.1 Random Scan (Mixing)
The random scan (mixing) update mechanism follows the idea of picking
one update mechanism (among a collection of predefined update mechanisms)
and then applying it. Such an update mechanism is picked according to
a weight associated with each of the update mechanisms. Following
this idea, the mixing update mechanism is built using kernel addition
and kernel multiplication by a scalar.
In order to maintain the Markov property (kernel
addition and kernel multiplication by a scalar do not, in general,
preserve it), a convex combination of kernels is considered.
Definition 23.
(mixing) The mixing update mechanism
of a set of n Markov transition kernels K_1, …, K_n,
each of them with a probability of being picked p_1,p_2,…,p_n
(∑p_i=1), is defined by equation 4.
[TABLE]
Since the integral in equation 4 is a linear
operator, the mixing operation can be defined by equation 5.
[TABLE]
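On a finite state space a kernel is a row-stochastic matrix, and mixing becomes a convex combination of matrices. The sketch below illustrates only this finite case; the function name `mix` and the list-of-lists matrix representation are assumptions.

```python
def mix(kernels, probs):
    """Convex combination sum_i p_i * K_i of row-stochastic matrices."""
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("mixing weights must sum to 1")
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [
        [sum(p * K[x][y] for p, K in zip(probs, kernels)) for y in range(cols)]
        for x in range(rows)
    ]
```

Since every row of every K_i sums to 1 and the weights sum to 1, every row of the mixture also sums to 1, so the result is again a Markov kernel.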
2.4.2 Composition
The composition update mechanism follows the idea of applying an update
mechanism (kernel) followed by another update mechanism, and so on. Following
this idea, the composition update mechanism is built using the kernel
multiplication operator.
Definition 24.
(composition) The composition of two kernels K_1, K_2
is defined by equation 6.
[TABLE]
Since kernel multiplication is an associative operation (using
the conditional Fubini theorem, see Theorem 2 of Chapter 22, page
431 of the book of Fristedt and Gray [15]), the
composition of update mechanisms that corresponds to a set of n
transition kernels K_1, …, K_n is defined as the product
kernel K_n∘K_{n−1}∘…∘K_1.
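In the finite case, composing kernels amounts to multiplying row-stochastic matrices, with the kernel applied first on the left of the product. The sketch below assumes this finite setting; `compose` and `compose_all` are hypothetical names.

```python
from functools import reduce


def compose(K2, K1):
    """(K2 ∘ K1)(x, z) = sum_y K1(x, y) * K2(y, z): apply K1 first, then K2."""
    inner, cols = len(K2), len(K2[0])
    return [
        [sum(row[y] * K2[y][z] for y in range(inner)) for z in range(cols)]
        for row in K1
    ]


def compose_all(kernels):
    """K_n ∘ ... ∘ K_1 for kernels given in application order [K_1, ..., K_n]."""
    return reduce(lambda acc, K: compose(K, acc), kernels)
```

Associativity of the matrix product is what allows `compose_all` to fold the list in any grouping.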
2.4.3 Transition’s Kernel Iteration
The transition probability t-th iteration (application) of a Markovian
kernel K, given by equation 7, describes
the probability to transit to some set A∈Σ within t steps
when starting at state x∈Ω.
[TABLE]
If p:Σ→[0,1] is the initial distribution
of subsets, then the probability that the Markov process is in set
A∈Σ at step t≥0 is given by equation 8.
[TABLE]
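For a finite state space, the t-step transition probabilities and the distribution of the chain at step t can be computed with repeated matrix products. The following sketch assumes that setting; `iterate` and `distribution_at` are hypothetical names, and the initial distribution is given as a list of point probabilities.

```python
def _matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for a in A
    ]


def iterate(K, t):
    """t-step kernel: K^(1) = K and K^(t) = K^(t-1) followed by K, for t >= 1."""
    result = K
    for _ in range(t - 1):
        result = _matmul(result, K)
    return result


def distribution_at(p, K, t):
    """Distribution of X_t given the initial distribution p of X_0."""
    if t == 0:
        return list(p)
    return _matmul([list(p)], iterate(K, t))[0]
```

The probability of an event A at step t is then the sum of the entries of `distribution_at(p, K, t)` indexed by the states in A.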
2.5 Markov Chains
Definition 25.
(Markov chain) A discrete-time stochastic process X_0,
X_1, X_2, …, taking values in an arbitrary state
space Ω is a Markov chain if it satisfies:
1. (Markov property) The conditional distribution of X_t given X_0, X_1, …, X_{t−1} is the same as the conditional distribution of X_t given only X_{t−1}.
2. (stationarity property) The conditional distribution of X_t given X_{t−1} does not depend on t.
Clearly, transition probabilities of the chain are specified by the
conditional distribution of X_t given X_{t−1} (kernel), while
the probability law of the chain is completely specified by the kernel
together with the initial distribution of X_0. Moreover, many SGoals may be characterized
by Markov chains.
2.6 Convergence
Definition 26.
Let (D_t) be a random sequence, i.e., a sequence of
random variables defined on a probability space (Ω,Σ,P).
Then (D_t) is said to
1. Converge completely to zero, denoted as D_t →c 0, if equation 9 holds for every ϵ>0.
\[ \lim_{t \to \infty} \sum_{i=1}^{t} P\{ |D_i| > \epsilon \} < \infty \qquad (9) \]
2. Converge in probability to zero, denoted as D_t →p 0, if equation 10 holds for every ϵ>0.
\[ \lim_{t \to \infty} P\{ |D_t| > \epsilon \} = 0 \qquad (10) \]
Notice that convergence in probability to zero (equation 10)
is a necessary condition for complete convergence to zero (equation
9).
3 Probability Theory on Cartesian Products
Since we are working on populations (finite tuples of individuals
in the search space), we need to consider probability theory on generalized
Cartesian products of the search space (subsection 3.1).
By considering some mathematical properties of the generalized Cartesian
product (we will move from tuples of tuples to just a single tuple),
some mathematical proofs can be simplified and an appropriate σ-algebra
for populations can be defined (subsection 3.2).
These accessory definitions, propositions and theorems will allow us
to define a kernel by joining some simple kernels (subsection 3.3).
Therefore, we will be able to work with stochastic methods in SGoals
that are defined as joins of stochastic methods that produce subpopulations
(sub-tuples) of the newly generated complete population (single tuple).
3.1 Generalized Cartesian Product
Definition 27.
(Cartesian product) Let L={Ω_1,Ω_2,…,Ω_n}
be an ordered list of n∈N sets. The Cartesian product
of L is the set of ordered n-tuples: ∏_{i=1}^{n}Ω_i={(a_1,a_2,…,a_n)∣a_i∈Ω_i for all i=1,2,…,n}.
If Ω=Ω_i for all i=1,2,…,n then ∏_{i=1}^{n}Ω_i
is denoted Ω^n and is called the n-fold
Cartesian product of the set Ω.
[1] Functions h_L:(Ω_1×Ω_2)×Ω_3→Ω_1×Ω_2×Ω_3
and h_R:Ω_1×(Ω_2×Ω_3)→Ω_1×Ω_2×Ω_3
such that h_L((a,b),c)=(a,b,c)
and h_R(a,(b,c))=(a,b,c) are
equivalence functions. [2] Function r:A×B→B×A
for any A,B such that r(a,b)=(b,a) is
a bijective function.∎
Corollary 29.
Let {n_i∈N^+}_{i=1,2,…,m} be
an ordered list of m∈N positive natural numbers.
1. ∏_{i=1}^{m}∏_{j=1}^{n_i}Ω_{i,j}≡Ω_{1,1}×…×Ω_{1,n_1}×…×Ω_{m,1}×…×Ω_{m,n_m} with Ω_{i,j} a set for all i=1,2,…,m and j=1,2,…,n_i.
2. ∏_{i=1}^{m}Ω^{n_i}≡Ω^n with n=∑_{i=1}^{m}n_i.
3.2 Product σ-algebra
Product σ-algebras allow us to define an appropriate σ-algebra
for generalized Cartesian products. If we are provided with a σ-algebra
associated with the feasible region of a SGoal, then we can define
a σ-algebra for its populations.
Definition 30.
Let L={Σ_1,Σ_2,…,Σ_n}
be an ordered list of n∈N families of sets Σ_i⊆2^{Ω_i}.
1. (generalized family product) The generalized product of L is ∏_{i=1}^{n}Σ_i={∏_{i=1}^{n}A_i ∣ A_i∈Σ_i for all i=1,2,…,n}.
2. (product σ-algebra) If Σ_i is a σ-algebra for all i=1,2,…,n, then the product σ-algebra of L is the σ-algebra ⨂_{i=1}^{n}Σ_i=σ(∏_{i=1}^{n}Σ_i) defined over the set ∏_{i=1}^{n}Ω_i.
Lemma 31.
If L={(Σ_1,Ω_1),(Σ_2,Ω_2),…,(Σ_n,Ω_n)}
is a finite (n∈N) ordered list of measurable spaces
then ∏_{i=1}^{n}Σ_i is a π-system.
Proof.
If U,V∈∏i=1nΣi then for all i=1,2,…,n
exist Ui,Vi∈Σi such that U=∏i=1nUi
and V=∏i=1nVi (def. ∏i=1nΣi).
Clearly, Ui⋂Vi∈Σi (lemma 5).
Therefore, ∏i=1n(Ui⋂Vi)∈∏i=1nΣi
(def. ∏i=1nΣi). Let z=(z1,z2,…,zn)∈∏i=1nΩi.
Clearly, z∈U⋂V iff z∈U and z∈V (def ⋂)
iff zi∈Ui and zi∈Vi for all i=1,2,…,n
(def U and V) iff zi∈Ui⋂Vi for all i=1,2,…,n
(def ⋂) iff z∈∏i=1n(Ui⋂Vi)
(def ∏). Therefore, U⋂V=∏i=1n(Ui⋂Vi)∈∏i=1nΣi.
∎
Proposition 32 will
allow us to move from the product σ-algebra of product σ-algebras
to a single product σ-algebra (as we move from tuples of tuples
to just a single tuple).
Proposition 32.
(associativity
of the σ-algebra product) Let Σ_i be a σ-algebra
defined over a set Ω_i for all i=1,2,3, then (Σ_1⊗Σ_2)⊗Σ_3≡Σ_1⊗(Σ_2⊗Σ_3)≡Σ_1⊗Σ_2⊗Σ_3.
Proof.
[(Σ1⊗Σ2)⊗Σ3≡Σ1⊗Σ2⊗Σ3]
[⊆] We will use the Dynkin π−λ theorem
7 here. So, we need to find a π-system
that is contained in a λ-system. Consider A_3∈Σ_3
and define 𝒜_{A_3}={X∈Σ_1⊗Σ_2 ∣ X×A_3∈σ(Σ_1×Σ_2×Σ_3)}.
[π-system] A_1×A_2×A_3∈Σ_1×Σ_2×Σ_3
for all A_1∈Σ_1 and A_2∈Σ_2 (A_3∈Σ_3),
then Σ_1×Σ_2⊂𝒜_{A_3}.
[λ-system] [λ.1]Ωi∈Σi
for i=1,2,3 (Σiσ-algebra) then Ω1×Ω2×A3∈Σ1×Σ2×Σ3.
Therefore, Ω1×Ω2∈AA3
(def AA3). [λ.2] Let X∈AA3
then Xc×A3=((Ω1×Ω2)∖X)×A3
(def. complement). Clearly, Xc×A3=((Ω1×Ω2)×A3)⋂(X×A3)c
(Distribution and Morgan’s law). Now, (X×A3)∈σ(Σ1×Σ2×Σ3)
(def AA3), then (X×A3)c∈σ(Σ1×Σ2×Σ3)
(σ-algebra). Moreover, (Ω1×Ω2)×A3∈σ(Σ1×Σ2×Σ3)
(part 1 and def AA3) therefore, Xc×A3∈σ(Σ1×Σ2×Σ3)
(lemma 5.2). So, Xc∈AA3.
[λ.3] Let {Xi}i∈I⊆AA3
be cdf of AA3. Clearly,
⋃i∈I(Xi×A3)=(⋃i∈IXi)×A3
(sets algebra), and (⋃i∈IXi)∈Σ1⊗Σ2
(Xi∈Σ1⊗Σ2 for all i∈I). then,
⋃i∈I(Xi×A3)∈AA3.
Therefore, AA3 is a λ-system. In this
way, Σ1⊗Σ2=σ(Σ1×Σ2)⊆AA3
(Dynkin π−λ theorem 7), and
σ(Σ1×Σ2)×A3⊆σ(Σ1×Σ2×Σ3)
(def AA3), i.e., (Σ1⊗Σ2)×A3⊆Σ1⊗Σ2⊗Σ3.
Because, (Σ1⊗Σ2)×A3⊆Σ1⊗Σ2⊗Σ3
for all A3∈Σ3 then (Σ1⊗Σ2)×Σ3⊆Σ1⊗Σ2⊗Σ3
and σ((Σ1⊗Σ2)×Σ3)=(Σ1⊗Σ2)⊗Σ3⊆Σ1⊗Σ2⊗Σ3
(def σ(⋅)). [⊇] It
is clear that Σ1×Σ2⊆Σ1⊗Σ2
(def ⊗), so Σ1×Σ2×Σ3⊆(Σ1⊗Σ2)×Σ3,
therefore, σ(Σ1×Σ2×Σ3)⊆σ((Σ1⊗Σ2)×Σ3)
(def σ(⋅)), i.e., (Σ1⊗Σ2)⊗Σ3⊇Σ1⊗Σ2⊗Σ3.
[Σ1⊗(Σ2⊗Σ3)≡Σ1⊗Σ2⊗Σ3]
A similar proof to the (Σ1⊗Σ2)⊗Σ3≡Σ1⊗Σ2⊗Σ3
is carried on. ∎
Corollary 33.
If Σ is a σ-algebra defined over a set Ω
and {⨂_{k=1}^{n_i}Σ}_{i=1,2,…,m}
is an ordered list of m∈N product σ-algebras
(n_i∈N^+ for all i=1,2,…,m) then ⨂_{i=1}^{n}Σ≡⨂_{i=1}^{m}(⨂_{k=1}^{n_i}Σ)
with n=∑_{i=1}^{m}n_i.
In the rest of this paper, we will denote ∏_{i=1}^{n}𝒜≡𝒜^n
for any 𝒜⊆2^Ω, and Σ^{⊗n}≡⨂_{i=1}^{n}Σ
for any σ-algebra Σ on Ω.
3.3 Kernels on product σ-algebras
Now, we are in the position of defining a kernel that characterizes
a deterministic method that is commonly used by SGoals (as
part of individual’s selection methods based on fitness): the Swap
method.
Definition 34.
(Swap) Let Ω_1 and
Ω_2 be two sets. The swap function (↔)
is defined as follows (we will use the notation z^↔≡↔(z)
for any z∈Ω_1×Ω_2 and A^↔≡↔(A)
for any A⊆Ω_1×Ω_2):
\[ \leftrightarrow : \Omega_1 \times \Omega_2 \to \Omega_2 \times \Omega_1, \qquad (x, y) \mapsto (y, x) \]
Lemma 35.
Let Ω_1 and Ω_2 be two sets.
1. ∅^↔=∅ and (Ω_1×Ω_2)^↔=Ω_2×Ω_1.
2. z∈A iff z^↔∈A^↔.
3. (A^↔)^↔=A for all A⊆Ω_1×Ω_2.
4. (B∖A)^↔=B^↔∖A^↔ for all A,B⊆Ω_1×Ω_2.
5. (⋃_{i∈I}A_i)^↔=⋃_{i∈I}A_i^↔ for any family {A_i⊆Ω_1×Ω_2}_{i∈I}.
Proof.
[1, 2, 3] Are obvious (just applying def. swap). [4]
Let A,B⊆Ω_1×Ω_2. Now, (B∖A)^↔={z^↔∣z∈B∖A}
(def. swap), so (B∖A)^↔={z^↔∣z∈B∧z∉A}
(def. proper diff). Clearly, (B∖A)^↔={z^↔∣z^↔∈B^↔∧z^↔∉A^↔}
(2), i.e., (B∖A)^↔=B^↔∖A^↔
(def. proper diff). [5] Let {A_i⊆Ω_1×Ω_2}_{i∈I}
be a family of sets; z^↔∈(⋃_{i∈I}A_i)^↔ iff z∈⋃_{i∈I}A_i iff
∃i∈I such that z∈A_i iff ∃i∈I such that z^↔∈A_i^↔ iff z^↔∈⋃_{i∈I}A_i^↔.∎
Proposition 36.
Let (Ω_1,Σ_1) and
(Ω_2,Σ_2) be measurable spaces, then
A∈Σ_1⊗Σ_2 iff A^↔∈Σ_2⊗Σ_1.
Proof.
[→] We will apply the Dynkin π-λ
theorem (theorem 7). Let 𝒜={A∈Σ_1⊗Σ_2∣A^↔∈Σ_2⊗Σ_1}.
[λ.1] Obvious, ∅^↔=∅∈Σ_2⊗Σ_1
(lemma 35.1 and lemma 5.1).
[λ.2] Let A,B∈𝒜 such that A⊂B.
Since (B∖A)^↔=B^↔∖A^↔
(lemma 35.4), and A^↔,B^↔∈Σ_2⊗Σ_1
(def. 𝒜), then (B∖A)^↔=B^↔∖A^↔∈Σ_2⊗Σ_1
(lemma 5.4), i.e., 𝒜 is
pd. [λ.3]
Let {A_i∈𝒜}_{i∈I} be a cdf;
(⋃_{i∈I}A_i)^↔=⋃_{i∈I}A_i^↔
(lemma 35.5) and A_i^↔∈Σ_2⊗Σ_1
(def. 𝒜), then (⋃_{i∈I}A_i)^↔∈Σ_2⊗Σ_1
(σ.3). Therefore, 𝒜 is a λ-system. Now,
let A∈Σ_1×Σ_2, then there are A_1∈Σ_1
and A_2∈Σ_2 such that A=A_1×A_2 (def. Σ_1×Σ_2).
Clearly, A^↔=A_2×A_1∈Σ_2×Σ_1
(def. swap and Σ_2×Σ_1), i.e., Σ_1×Σ_2⊂𝒜
(def. 𝒜). Because Σ_1⊗Σ_2=σ(Σ_1×Σ_2),
Σ_1×Σ_2 is a π-system (lemma 31),
and Σ_1×Σ_2⊂𝒜, then A^↔∈Σ_2⊗Σ_1 for every A∈Σ_1⊗Σ_2.
[←] If A^↔∈Σ_2⊗Σ_1,
we have that (A^↔)^↔∈Σ_1⊗Σ_2
(→), therefore, A∈Σ_1⊗Σ_2 ((A^↔)^↔=A). ∎
Corollary 37.
1. (bijectivity of swap) The swap function is bijective.
2. (measurability of swap) The swap function
is measurable.
3. (swap kernel) The function 1_↔ is a
kernel (we will use the ambiguous notation 1_↔≡↔
in the rest of this paper).
Proof.
[1] The swap function is a bijective function. [2]
Follows from (1) and proposition 36. [3]
Follows from (2) and theorem 21. ∎
Moreover, we can define kernels for deterministic methods that select
a group of individuals from the population (projections).
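Lemma 38.
(projection) Let L={(Σ_1,Ω_1),(Σ_2,Ω_2),…,(Σ_n,Ω_n)}
be a finite (n∈N) ordered list of measurable spaces
and I={k_1,k_2,…,k_m}⊆{1,…,n}
be a set of indices, i.e., k_i<k_{i+1} and m≤n. The function
π_I defined as follows is ⨂_{i=1}^{n}Σ_i−⨂_{i=1}^{m}Σ_{k_i}
measurable.
\[ \pi_I : \prod_{i=1}^{n} \Omega_i \to \prod_{i=1}^{m} \Omega_{k_i}, \qquad (x_1, x_2, \ldots, x_n) \mapsto (x_{k_1}, x_{k_2}, \ldots, x_{k_m}) \]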
Proof.
Because ∏i=1nΩi≡∏i=1mΩki×∏i=1n−mΩli
with Ic={l1,l2,…,ln−m} the complement
set of indices of I (by applying many times lemma 28),
we can “rewrite” (under equivalences) πI as πI(x,y)=x
with y∈∏i=1n−mΩli and x∈∏i=1mΩki.
Now, ⨂i=1nΣi≡(⨂i=1mΣki)⨂(⨂i=1n−mΣli)
(by applying many times proposition 32
and corollary 37). Thus,
for any A∈⨂i=1mΣki we have that πI−1(A)=A×∏i=1n−mΩli.
Clearly, πI−1(A)∈(⨂i=1mΣki)⨂(⨂i=1n−mΣli)
(def. product σ-algebra), therefore, πI is measurable.∎
Corollary 39.
The function 1_{π_I} as defined in theorem
21 is a kernel (we will use the ambiguous notation 1_{π_I}≡π_I in
the rest of this paper).
Finally, we are able to define a kernel for a stochastic method that
is the join of several stochastic methods (methods that generate a
subpopulation of the next population).
Theorem 40.
(product probability measure) Let $\{(\Omega_i,\Sigma_i,\mu_i)\mid i=1,\ldots,n\}$ be an ordered list of $n\in\mathbb{N}$ probability spaces. There exists a unique probability measure $\mu:\bigotimes_{i=1}^{n}\Sigma_i\to\mathbb{R}$ such that $\mu\left(\prod_{i=1}^{n}A_i\right)=\prod_{i=1}^{n}\mu_i(A_i)$ for all $A_i\in\Sigma_i$, $i=1,2,\ldots,n$. In this case $\mu$ is called the product probability measure of the probability measures $\mu_i$ and is denoted $\bigotimes_{i=1}^{n}\mu_i$.
Proof.
This theorem is the version of theorem 14.14 on page 277 of Klenke's book [11], when considering $\Sigma_i$ not just a ring but a σ-algebra. In this case, any probability measure is a finite measure and any finite measure is a σ-finite measure ($\Omega_i\in\Sigma_i$). ∎
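Theorem 41.
(join-kernel) Let $(\Omega',\Sigma')$ be a measurable space, and let $\{(\Omega_i,\Sigma_i)\}$ and $\{K_i:\Omega'\times\Sigma_i\to[0,1]\}$ be ordered lists of $n\in\mathbb{N}$ measurable spaces and kernels, respectively. The following function is a kernel.
$$\circledast K:\Omega'\times\bigotimes_{i=1}^{n}\Sigma_i\to[0,1],\qquad\circledast K(x,A)=\left[\bigotimes_{i=1}^{n}K_i(x,\bullet)\right](A)$$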
Proof.
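[well-defined and K.1] Let $x\in\Omega'$. Since $K_i(x,\bullet)$ is a probability measure for all $i=1,2,\ldots,n$ (K.1 for the kernel $K_i$), then $\circledast K(x,\bullet)=\bigotimes_{i=1}^{n}K_i(x,\bullet)$ is a probability measure $\bigotimes_{i=1}^{n}\Sigma_i\to[0,1]$ (theorem 40); thus it is well defined for any $A\in\bigotimes_{i=1}^{n}\Sigma_i$.
[K.2] Using remark 20, we just need to prove that $\circledast K(\bullet,A)$ is measurable for any $A\in\mathcal{E}$ with $\mathcal{E}\subseteq2^{\prod\Omega_i}$ a π-system that generates $\bigotimes_{i=1}^{n}\Sigma_i$. Because $\prod_{i=1}^{n}\Sigma_i$ is a π-system (lemma 31) and $\bigotimes_{i=1}^{n}\Sigma_i$ is the σ-algebra generated by $\prod_{i=1}^{n}\Sigma_i$ (def. 30), we just need to prove that $\circledast K(\bullet,A)$ is measurable for any $A=\prod_{i=1}^{n}A_i$ with $A_i\in\Sigma_i$ for all $i=1,2,\ldots,n$. By definition, $\circledast K(x,\prod_{i=1}^{n}A_i)=\left[\bigotimes_{i=1}^{n}K_i(x,\bullet)\right](\prod_{i=1}^{n}A_i)$ and, according to theorem 40, $\circledast K(x,\prod_{i=1}^{n}A_i)=\prod_{i=1}^{n}K_i(x,A_i)$ ($A_i\in\Sigma_i$). Clearly, $\circledast K(\bullet,\prod_{i=1}^{n}A_i)=\prod_{i=1}^{n}K_i(\bullet,A_i)$. Now, $K_i(\bullet,A_i)$ is a measurable function for all $i=1,2,\ldots,n$ (K.2 for the kernel $K_i$), so their product is a measurable function (see Theorem 1.91, page 37 in Klenke's book [11]). Therefore, $\circledast K(\bullet,\prod_{i=1}^{n}A_i)$ is a measurable function. ∎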
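On finite spaces the join-kernel reduces to multiplying the rows of the component kernels. The following is a minimal sketch under that finiteness assumption; the dictionary representation `{x: {y: prob}}` and the example kernels `k1`, `k2` are hypothetical and only illustrate the rectangle identity used in the proof.

```python
from itertools import product
from math import isclose, prod


def join_kernel(kernels):
    """Join finite kernels given as {x: {y: prob}}; returns x -> {(y_1..y_n): prob}."""
    def joined(x):
        rows = [kernel[x] for kernel in kernels]
        return {
            ys: prod(row[y] for row, y in zip(rows, ys))
            for ys in product(*(row.keys() for row in rows))
        }
    return joined


# Two hypothetical kernels sharing the source space {0, 1}.
k1 = {0: {"a": 0.25, "b": 0.75}, 1: {"a": 0.5, "b": 0.5}}
k2 = {0: {"u": 0.1, "v": 0.2, "w": 0.7}, 1: {"u": 0.3, "v": 0.3, "w": 0.4}}

joined = join_kernel([k1, k2])
row0 = joined(0)

# Mass of the rectangle {a} x {u, v}; it should equal K1(0,{a}) * K2(0,{u,v}).
rect_mass = sum(p for (y1, y2), p in row0.items()
                if y1 == "a" and y2 in {"u", "v"})
```

Each component sees the same input `x`, which mirrors the requirement that all $K_i$ share the source space $\Omega'$.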
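Corollary 42.
If $(\Omega,\Sigma)$ is a measurable space and $\{n_i\in\mathbb{N}^{+}\}_{i=1,2,\ldots,m}$ are such that $\Omega_i=\Omega^{n_i}$ and $\Sigma_i=\Sigma^{\otimes n_i}$ for all $i=1,2,\ldots,m$ in theorem 41, then $\circledast K:\Omega'\times\Sigma^{\otimes n}\to[0,1]$, with $n=\sum_{i=1}^{m}n_i$, is a kernel.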
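Proposition 43.
(permutation) Let $(\Omega,\Sigma)$ be a measurable space and $I=[i_1,i_2,\ldots,i_n]$ be a fixed permutation of the set $\{1,2,\ldots,n\}$; then the function $K_I:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$ defined as $K_I=\circledast_{k=1}^{n}\pi_{i_k}$ is a kernel.
Let $(\Omega,\Sigma)$ be a measurable space and $\mathcal{P}$ be the set of permutations of the set $\{1,2,\ldots,n\}$. The function $K_{\mathcal{P}}:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$ defined as $K_{\mathcal{P}}=\frac{1}{|\mathcal{P}|}\sum_{I\in\mathcal{P}}\pi_I$ is a kernel.
Proof.
$K_{\mathcal{P}}$ is a mixing update mechanism of $|\mathcal{P}|$ kernels (subsection 2.4.1 and proposition 44).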
∎
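The uniform mixture over permutation kernels can be enumerated exactly for small tuples. This is a sketch only: it assumes 0-based index lists in place of the paper's 1-based permutations, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def permute(x, order):
    """Deterministic K_I: reorder tuple x by the (0-based) permutation `order`."""
    return tuple(x[i] for i in order)


def uniform_permutation_kernel(x):
    """Distribution K_P(x, .) as a dict {outcome: probability}."""
    orders = list(permutations(range(len(x))))
    weight = Fraction(1, len(orders))
    dist = {}
    for order in orders:
        y = permute(x, order)
        dist[y] = dist.get(y, Fraction(0)) + weight
    return dist


distinct = uniform_permutation_kernel(("a", "b", "c"))
repeated = uniform_permutation_kernel(("a", "a", "b"))
```

When the population contains repeated individuals, several permutations map to the same outcome, so the mixture adds their weights.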
4 Characterization of a SGoal using Probability Theory
Following the description of a SGoal (see Algorithm 1),
the initial population P0 is chosen according to some initial
distribution p(⋅) and the population Pt at
step t>0 is generated using a stochastic method (NextPop)
on the previous population Pt−1. If such NextPop method
can be characterized by a Markov kernel, the stochastic sequence (Pt:t≥0)
becomes a Markov chain. In order to develop this characterization,
first we define appropriate measurable spaces and Markov kernels of
stochastic methods, and then we define some properties of stochastic
methods that cover many popular SGoals reported in the literature.
Since a SGoal consists of a population of n individuals
on the feasible region Ω, it is clear that the state space
is defined on Ωn. Moreover, the initial population P0∈Ωn
is chosen according to some initial distribution p(⋅).
Now, the σ-algebra must allow us to determine convergence
properties on the kernel. In this paper, we will extend the convergence
approach proposed by Günter Rudolph in [16]
to SGoals. In the following, we call a function $f:\Phi\to\mathbb{R}$ an objective function if it has an optimal value (denoted as $f^{*}\in\mathbb{R}$) in the feasible region.
4.1 ϵ-optimal states
We define and study the set of strict ϵ-optimal states (the optimal elements according to Rudolph’s notation), i.e., a set that includes any candidate population whose best individual has an objective function value within $\epsilon\in\mathbb{R}^{+}$ of the optimum objective function value. We also introduce two new natural definitions that we will use in some proofs (the ϵ-optimal states and ϵ-states) and study some properties of sets defined upon these concepts.
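Definition 45.
Let $\Omega\subseteq\Phi$ be a set, $f:\Phi\to\mathbb{R}$ be an objective function, $\epsilon>0$ be a real number and $x\in\Omega^{m}$.
1. (optimality) $d(x)=f(\mathrm{Best}(x))-f^{*}$ (here $f^{*}$ is the optimal value of $f$ in $\Omega$),
2. (strict ϵ-optimum state) $x$ is a strict ϵ-optimum element if $d(x)<\epsilon$,
3. (ϵ-optimum state) $x$ is an ϵ-optimum element if $d(x)\le\epsilon$, and
4. (ϵ-state) $x$ is an ϵ-element if $d(x)=\epsilon$.
Sets $\Omega_{\epsilon}^{m}=\{x\in\Omega^{m}:d(x)<\epsilon\}$, $\Omega_{\overline{\epsilon}}^{m}=\{x\in\Omega^{m}:d(x)\le\epsilon\}$, and $\Omega_{\mathring{\epsilon}}^{m}=\{x\in\Omega^{m}:d(x)=\epsilon\}$ are called the sets of strict ϵ-optimal states, ϵ-optimal states, and ϵ-states, respectively. We will denote $\Omega_{\epsilon}=\Omega_{\epsilon}^{1}$, $\Omega_{\overline{\epsilon}}=\Omega_{\overline{\epsilon}}^{1}$ and $\Omega_{\mathring{\epsilon}}=\Omega_{\mathring{\epsilon}}^{1}$.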
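Remark 46.
Notice that
1. $(\Omega_{\epsilon}^{m})^{c}=\{x\in\Omega^{m}:\epsilon\le d(x)\}$ (complement of the set where $d(x)<\epsilon$), and
2. $(\Omega_{\overline{\epsilon}}^{m})^{c}=\{x\in\Omega^{m}:\epsilon<d(x)\}$ (complement of the set where $d(x)\le\epsilon$).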
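Lemma 47.
$\Omega_{\overline{\epsilon}}^{m}=\bigcap_{n=1}^{\infty}\Omega_{\epsilon+\frac{1}{n}}^{m}$ for all $\epsilon>0$ and $m\in\mathbb{N}^{+}$, where $\Omega_{\overline{\epsilon}}^{m}=\{x\in\Omega^{m}:d(x)\le\epsilon\}$.
Proof.
[⊆] If $x\in\Omega_{\overline{\epsilon}}^{m}$ then $d(x)\le\epsilon$ (def $\Omega_{\overline{\epsilon}}^{m}$). Now, $\epsilon<\epsilon+\frac{1}{n}$ for all $n>0$, so $d(x)<\epsilon+\frac{1}{n}$ for all $n>0$. Therefore, $x\in\Omega_{\epsilon+\frac{1}{n}}^{m}$ for all $n>0$, so $x\in\bigcap_{n=1}^{\infty}\Omega_{\epsilon+\frac{1}{n}}^{m}$.
[⊇] Let $x\in\bigcap_{n=1}^{\infty}\Omega_{\epsilon+\frac{1}{n}}^{m}$. If $x\notin\Omega_{\overline{\epsilon}}^{m}$ then $d(x)>\epsilon$ (def $\Omega_{\overline{\epsilon}}^{m}$); therefore, $\exists\delta>0$ such that $d(x)=\epsilon+\delta$. We have that $\exists n\in\mathbb{N}$ such that $0<\frac{1}{n}<\delta$ (Archimedean property of the real numbers). Clearly, $\epsilon+\frac{1}{n}<\epsilon+\delta=d(x)$, so $x\notin\Omega_{\epsilon+\frac{1}{n}}^{m}$, and then $x\notin\bigcap_{n=1}^{\infty}\Omega_{\epsilon+\frac{1}{n}}^{m}$ (contradiction). Therefore, $x\in\Omega_{\overline{\epsilon}}^{m}$.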
∎
4.2 Optimization space
We define the optimization σ-algebra property (a σ-algebra
containing the family of sets of strict ϵ-optimal states)
and show that such property is preserved by the product σ-algebra.
Definition 48**.**
(f−optimization σ-algebra) Let f:Φ→R
be an objective function, and Ω⊆Φ. A σ-algebra
Σ on Ω is called f−optimization σ-algebra
iff {Ωϵ}ϵ>0⊆Σ.
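Lemma 49.
Let $\Sigma$ be an f-optimization σ-algebra on $\Omega$; then $\{\Omega_{\overline{\epsilon}}\}_{\epsilon>0}\subseteq\Sigma$ and $\{\Omega_{\mathring{\epsilon}}\}_{\epsilon>0}\subseteq\Sigma$, i.e., the sets of ϵ-optimal states ($d(x)\le\epsilon$) and of ϵ-states ($d(x)=\epsilon$) also belong to $\Sigma$.
Proof.
[$\{\Omega_{\overline{\epsilon}}\}_{\epsilon>0}\subseteq\Sigma$] It follows from the facts that $\{\Omega_{\epsilon+\frac{1}{n}}\}_{n\in\mathbb{N}^{+}}\subseteq\{\Omega_{\epsilon}\}_{\epsilon>0}\subseteq\Sigma$ ($\Sigma$ optimization σ-algebra), $\Omega_{\overline{\epsilon}}=\bigcap_{n=1}^{\infty}\Omega_{\epsilon+\frac{1}{n}}$ (lemma 47) and $\Sigma$ is closed under countable intersections (part 2, lemma 5). [$\{\Omega_{\mathring{\epsilon}}\}_{\epsilon>0}\subseteq\Sigma$] It follows from the fact that $\Omega_{\mathring{\epsilon}}=\Omega_{\overline{\epsilon}}\setminus\Omega_{\epsilon}$ for all $\epsilon>0$ and $\Sigma$ is closed under proper differences (part 4, lemma 5). ∎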
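Proposition 50.
$\Omega_{\epsilon}^{m}=\bigcup_{i=1}^{m}\left[\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}\right]$ for all $\epsilon>0$ and $m\in\mathbb{N}^{+}$.
Proof.
[⊆] Let $x\in\Omega_{\epsilon}^{m}$; then $d(x)<\epsilon$ (def $\Omega_{\epsilon}^{m}$) and $f(\mathrm{Best}(x))-f^{*}<\epsilon$ (def $d(x)$). Writing $x=(x_1,\ldots,x_m)$, it is clear that $f(x_i)-f^{*}<\epsilon$ for some $i=1,2,\ldots,m$ (def Best), so $d(x_i)<\epsilon$ (def $d(x)$). Therefore, $x_i\in\Omega_{\epsilon}$ (def $\Omega_{\epsilon}$), so $x\in\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}$ and $x\in\bigcup_{i=1}^{m}\left[\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}\right]$.
[⊇] If $x\in\bigcup_{i=1}^{m}\left[\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}\right]$ then $x\in\left[\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}\right]$ for some $i=1,2,\ldots,m$. Clearly, $f(\mathrm{Best}(x))\le f(x_i)$ (def Best), so $d(x)\le d(x_i)$. Now, $d(x)\le d(x_i)<\epsilon$ ($x_i\in\Omega_{\epsilon}$); therefore $x\in\Omega_{\epsilon}^{m}$ (def $\Omega_{\epsilon}^{m}$). ∎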
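The identity in proposition 50 can be checked by brute force on a small finite feasible region. The sketch below assumes that Best selects a minimizer of $f$; the region `omega`, the objective `f` and the values of `eps` and `m` are hypothetical.

```python
from itertools import product


def d(x, f, f_star):
    """Optimality gap of population x, assuming Best minimizes f."""
    return min(f(xi) for xi in x) - f_star


omega = range(-3, 4)          # hypothetical finite feasible region
f = lambda z: z * z           # hypothetical objective, optimum f* = 0
f_star, eps, m = 0, 2, 3

# Left-hand side: strict eps-optimal populations of size m.
lhs = {x for x in product(omega, repeat=m) if d(x, f, f_star) < eps}

# Right-hand side: union of cylinders with one strict eps-optimal coordinate.
omega_eps = [z for z in omega if f(z) - f_star < eps]
rhs = set()
for i in range(m):
    factors = [omega] * i + [omega_eps] + [omega] * (m - i - 1)
    rhs.update(product(*factors))
```

The cylinders overlap whenever more than one individual is strict ϵ-optimal, so the union is not a disjoint one.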
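Corollary 51.
Let $\Sigma$ be an f-optimization σ-algebra on $\Omega$; then $\Sigma^{\otimes m}$ is an f-optimization σ-algebra on $\Omega^{m}$ for all $m\in\mathbb{N}^{+}$.
Proof.
$\Omega_{\epsilon}\in\Sigma$ (optimization σ-algebra) and $\Omega^{i}\in\Sigma^{\otimes i}$ for all $i=1,2,\ldots,m$ (universality of σ-algebra); then $\left[\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}\right]\in\Sigma^{\otimes m}$ (def product σ-algebra), so $\bigcup_{i=1}^{m}\left[\Omega^{i-1}\times\Omega_{\epsilon}\times\Omega^{m-i}\right]\in\Sigma^{\otimes m}$ ($\Sigma^{\otimes m}$ is cdu). Therefore, $\Omega_{\epsilon}^{m}\in\Sigma^{\otimes m}$ for all $\epsilon>0$ (proposition 50), i.e., $\Sigma^{\otimes m}$ is an optimization σ-algebra.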
∎
Now, we are ready to define the mathematical structure that we use
for characterizing a SGoal.
Definition 52**.**
(optimization space) If
f:Φ→R is an objective function,
n∈N and Σ is an f−optimization σ-algebra
over a set Ω then the triple (Ωn,Σ⊗n,f)
is called optimization space.
4.3 Kernels on optimization spaces
If we are provided with an optimization space $(\Omega^{n},\Sigma^{\otimes n},f)$, we can represent the population used by the SGoal as an individual $x\in\Omega^{n}$, while we can characterize the NextPop (a stochastic method $\mathrm{f}:\Omega^{n}\to\Omega^{n}$) as a Markov kernel $K:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$. Because such NextPop can be defined in terms of more general stochastic methods $\mathrm{f}:\Omega^{\eta}\to\Omega^{\upsilon}$, we study such general kernels.
4.3.1 Join Stochastic Methods
Definition 53.
(Join) A stochastic method $\mathrm{f}:\Omega^{\eta}\to\Omega^{\upsilon}$ is called join if it is defined as the join of $m\in\mathbb{N}$ stochastic methods ($\mathrm{f}_i:\Omega^{\eta}\to\Omega^{\upsilon_i}$), each method generating a subpopulation of the population, i.e., $\mathrm{f}=\prod_{i=1}^{m}\mathrm{f}_i$ (see Algorithm 7). Here, $\upsilon_i\in\mathbb{N}^{+}$ is the size of the $i$-th sub-population, $i=1,2,\ldots,m$, and $s_m=\upsilon$.
Example 54.
The NextPop method of a GGa (see Algorithm 4) is a joined stochastic method: a stochastic method $\mathrm{NextSubPop}_{\mathrm{GGa}}:\Omega^{n}\to\Omega^{2}$ that generates two new candidate solutions (by selecting two parents from the population, recombining them and mutating the offspring) is applied $\frac{n}{2}$ times in order to generate the next population, see equation 11.
$$\mathrm{NextPop}_{\mathrm{GGa}}(P)=\prod_{i=1}^{n/2}\mathrm{NextSubPop}_{\mathrm{GGa}}(P)\qquad(11)$$
Example 55.
The NextPop method of a DE (see Algorithm 6) is a joined stochastic method: $n$ stochastic methods $\mathrm{NextInd}_{\mathrm{DE},i}:\Omega^{n}\to\Omega$, each one generating the $i$th individual of the new population (by selecting three extra parents from the population and recombining each dimension using differences if required), are applied, see equation 12.
$$\mathrm{NextPop}_{\mathrm{DE}}(P)=\prod_{i=1}^{n}\mathrm{NextInd}_{\mathrm{DE},i}(P)\qquad(12)$$
Example 56.
The NextPop method of a PHC (see Algorithm 3) is a joined stochastic method: $n$ stochastic methods $\mathrm{NextInd}_{\mathrm{PHC},i}:\Omega^{n}\to\Omega$, each one generating the $i$th individual of the new population ($\mathrm{NextInd}_{\mathrm{PHC},i}(P)=\mathrm{NextPop}_{\mathrm{HC}}(P_i)$), see equation 13.
$$\mathrm{NextPop}_{\mathrm{PHC}}(P)=\prod_{i=1}^{n}\mathrm{NextInd}_{\mathrm{PHC},i}(P)\qquad(13)$$
We are now in the position of providing a sufficient condition for
characterizing PHC, GGa, and DE algorithms.
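Proposition 57.
Let $\{\mathrm{f}_i:\Omega^{\eta}\to\Omega^{\upsilon_i}\}_{i=1,2,\ldots,m}$ be a finite family of stochastic methods, each one characterized by a kernel $K_i:\Omega^{\eta}\times\Sigma^{\otimes\upsilon_i}\to[0,1]$; then the join stochastic method $\mathrm{f}=\prod_{i=1}^{m}\mathrm{f}_i$ is characterized by the kernel $\circledast K:\Omega^{\eta}\times\Sigma^{\otimes\upsilon}\to[0,1]$ with $\upsilon=\sum_{k=1}^{m}\upsilon_k$.
Corollary 58.
Each of the $\mathrm{NextPop}_{\mathrm{PHC}}$, $\mathrm{NextPop}_{\mathrm{GGa}}$ and $\mathrm{NextPop}_{\mathrm{DE}}$ stochastic methods can be characterized by a kernel if each of the stochastic methods $\mathrm{NextPop}_{\mathrm{HC}}$, $\mathrm{NextSubPop}_{\mathrm{GGa}}$, and $\mathrm{NextInd}_{\mathrm{DE}}$ can be characterized by a kernel.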
4.3.2 Sorting Methods
Although the result of sorting a population is, in general, a non
stochastic method, we can model it as a kernel. We start by modeling
the sorting of two elements according to their fitness value.
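Definition 59.
(Sort-Two) Let $d:\Omega\to\mathbb{R}$. The sort-two function $\mathrm{s}_2:\Omega^{2}\to\Omega^{2}$ is defined as follows:
$$\mathrm{s}_2(x,y)=\begin{cases}(x,y) & \text{if } d(x)<d(y)\\(y,x) & \text{otherwise}\end{cases}$$
In order to model the $\mathrm{s}_2$ method as a kernel, we need to define sets that capture some notions of sorted couples.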
Definition 60.
(sorted couples sets) Let $x,y\in\Omega$.
1. (mM) The set $\mathrm{mM}=\{(x,y)\mid d(x)<d(y)\}$ is called the min-max sorted couples set.
2. (Mm) The set $\mathrm{Mm}=\{(x,y)\mid d(y)<d(x)\}$ is called the max-min sorted couples set.
3. (m) The set $\mathrm{m}=\{(x,y)\mid d(y)=d(x)\}$ is called the equivalent couples set.
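Lemma 61.
The following set of equations holds, where $\Omega_{r}=\{x:d(x)<r\}$ and $\Omega_{\overline{r}}^{c}=\{x:r<d(x)\}$.
1. $\mathrm{mM}=\bigcup_{r\in\mathbb{Q}}\Omega_{r}\times\Omega_{\overline{r}}^{c}$
2. $\mathrm{Mm}=\bigcup_{r\in\mathbb{Q}}\Omega_{\overline{r}}^{c}\times\Omega_{r}$
3. $\mathrm{m}=(\mathrm{mM}\biguplus\mathrm{Mm})^{c}$
Proof.
[1][⊆] Let $(x,y)\in\mathrm{mM}$; then $d(x)<d(y)$ (def. mM), therefore $\exists r\in\mathbb{Q}$ such that $d(x)<r<d(y)$ (density of the rational numbers in the real numbers). Clearly, $x\in\Omega_{r}$ and $y\in\Omega_{\overline{r}}^{c}$ (def $\Omega_{\epsilon}$, $\Omega_{\overline{\epsilon}}$ and remark 46), then $(x,y)\in\Omega_{r}\times\Omega_{\overline{r}}^{c}$, so $(x,y)\in\bigcup_{r\in\mathbb{Q}}\Omega_{r}\times\Omega_{\overline{r}}^{c}$. [⊇] Let $(x,y)\in\bigcup_{r\in\mathbb{Q}}\Omega_{r}\times\Omega_{\overline{r}}^{c}$; then $\exists r\in\mathbb{Q}$ such that $(x,y)\in\Omega_{r}\times\Omega_{\overline{r}}^{c}$, therefore $x\in\Omega_{r}$ and $y\in\Omega_{\overline{r}}^{c}$. Clearly, $d(x)<r$ and $r<d(y)$ (def $\Omega_{\epsilon}$, $\Omega_{\overline{\epsilon}}$ and remark 46), then $d(x)<d(y)$, so $(x,y)\in\mathrm{mM}$ (def mM).
[2] The proof is similar to the proof of part [1].
[3] Obvious: for any $(x,y)\in\Omega\times\Omega$ we have that $d(x)<d(y)\vee d(y)<d(x)\vee d(x)=d(y)$ ($\mathbb{R}$ is a total order). Clearly, mM, Mm, and m are pairwise disjoint sets. Then $\Omega\times\Omega=\mathrm{mM}\biguplus\mathrm{Mm}\biguplus\mathrm{m}$, so $\mathrm{m}=(\Omega\times\Omega)\setminus(\mathrm{mM}\biguplus\mathrm{Mm})=(\mathrm{mM}\biguplus\mathrm{Mm})^{c}$. ∎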
Lemma 62.
$\mathrm{mM},\mathrm{Mm},\mathrm{m},\mathrm{mM}^{c},\mathrm{Mm}^{c},\mathrm{m}^{c}\in\Sigma^{\otimes2}$ if $\Sigma$ is an optimization σ-algebra.
Proof.
Each product set in the unions of lemma 61 belongs to $\Sigma^{\otimes2}$ (its factors belong to $\Sigma$, def product σ-algebra), and mM and Mm are unions of such sets over $r\in\mathbb{Q}$ (lemma 61); then $\mathrm{mM},\mathrm{Mm}\in\Sigma^{\otimes2}$ ($\Sigma^{\otimes2}$ is cdu). Clearly, $(\mathrm{mM}\biguplus\mathrm{Mm})\in\Sigma^{\otimes2}$ ($\Sigma^{\otimes2}$ is cdu), so $\mathrm{m}\in\Sigma^{\otimes2}$ ($\Sigma^{\otimes2}$ is c). Finally, $\mathrm{mM}^{c},\mathrm{Mm}^{c},\mathrm{m}^{c}\in\Sigma^{\otimes2}$ (σ.2). ∎
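Proposition 63.
$\mathrm{s}_2:\Omega^{2}\to\Omega^{2}$ is measurable.
Proof.
Let $A\in\Sigma^{\otimes2}$ and $z=(x,y)\in\Omega^{2}$. Then $z\in\mathrm{s}_2^{-1}(A)$ iff $[z\in A\wedge d(x)<d(y)]\vee[\mathrm{swap}(z)\in A\wedge d(y)\le d(x)]$ (def $\mathrm{s}_2$) iff $[z\in A\wedge z\in\mathrm{mM}]\vee[z\in\mathrm{swap}(A)\wedge z\in\mathrm{mM}^{c}]$ (def mM) iff $z\in(A\cap\mathrm{mM})\biguplus(\mathrm{swap}(A)\cap\mathrm{mM}^{c})$ (def ⋃ and ⋂). Since $A\in\Sigma^{\otimes2}$, then $A,\mathrm{swap}(A),\mathrm{mM},\mathrm{mM}^{c}\in\Sigma^{\otimes2}$ (proposition 36 and lemma 62). Therefore, $\mathrm{s}_2^{-1}(A)\in\Sigma^{\otimes2}$ ($\Sigma^{\otimes2}$ is c, cu, and ci). Clearly, $\mathrm{s}_2$ is measurable. ∎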
Corollary 64.
$1_{\mathrm{s}_2}:\Omega^{2}\times\Sigma^{\otimes2}\to[0,1]$ as defined in theorem 21 is a kernel.
Having defined the kernel for $\mathrm{s}_2$, we define a kernel $\mathrm{s}_{n,n-1}:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$ for characterizing an n-tuple sorting method.
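Proposition 65.
The following functions are kernels.
1. $\mathrm{w}_{n,k}:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$ defined as $\mathrm{w}_{n,k}=\pi_{\{1,\ldots,k-1\}}\circledast[\mathrm{s}_2\circ\pi_{\{k,k+1\}}]\circledast\pi_{\{k+2,\ldots,n\}}$ for $k=1,\ldots,n-1$.
2. $\mathrm{t}_{n,k}:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$ defined as $\mathrm{t}_{n,1}=\mathrm{w}_{n,1}$, and $\mathrm{t}_{n,k}=\mathrm{w}_{n,k}\circ\mathrm{t}_{n,k-1}$ for $k=2,\ldots,n-1$.
3. $\mathrm{s}_{n,k}:\Omega^{n}\times\Sigma^{\otimes n}\to[0,1]$ defined as $\mathrm{s}_{n,1}=\mathrm{t}_{n,1}$, and $\mathrm{s}_{n,k}=\mathrm{t}_{n,k}\circ\mathrm{s}_{n,k-1}$ for $k=2,\ldots,n-1$.
Proof.
Obvious, all functions are defined in terms of composition and/or join of kernels. ∎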
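The construction builds a sorting method out of adjacent sort-two steps. The sketch below is written in that spirit but is not a transcription of the exact indexing of $\mathrm{t}_{n,k}$ and $\mathrm{s}_{n,k}$: the order in which the adjacent steps are applied (right to left within a pass) is an assumption made here so that each pass moves a best individual to the front, and all function names are illustrative.

```python
def sort_two(x, y, d):
    """s2: keep (x, y) if d(x) < d(y), otherwise return (y, x)."""
    return (x, y) if d(x) < d(y) else (y, x)


def w(pop, k, d):
    """Adjacent step: apply s2 to positions k and k+1 (1-based), keep the rest."""
    pop = list(pop)
    pop[k - 1], pop[k] = sort_two(pop[k - 1], pop[k], d)
    return tuple(pop)


def pass_to_front(pop, d):
    """One pass of steps n-1, ..., 1: moves a best individual to position 1."""
    for k in range(len(pop) - 1, 0, -1):
        pop = w(pop, k, d)
    return pop


def sort_population(pop, d, passes=None):
    """Apply `passes` passes (default n-1, which sorts the whole tuple)."""
    n = len(pop)
    for _ in range(n - 1 if passes is None else passes):
        pop = pass_to_front(pop, d)
    return pop


d = lambda z: abs(z)                  # hypothetical optimality gap
population = (3, -1, 4, 0, -2)
fully_sorted = sort_population(population, d)
best_two = sort_population(population, d, passes=2)[:2]
```

Stopping after two passes and projecting on the first two positions gives the two best individuals, which is the role a Best2-style selection plays in a steady-state algorithm.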
Corollary 66.
The Best2 function used by the SSGa (line 5, algorithm 5) can be characterized by the kernel $\mathrm{b}_{2,4}:\Omega^{4}\times\Sigma^{\otimes2}\to[0,1]$ defined as $\mathrm{b}_{2,4}=\pi_{\{1,2\}}\circ\mathrm{s}_{n,2}$.
Proposition 67.
If lines 3-4 in the SSGa (see algorithm 5) can be modeled by a kernel $\mathrm{v}:\Omega^{2}\times\Sigma^{\otimes2}\to[0,1]$, the stochastic method $\mathrm{NextPop}_{\mathrm{SSGa}}$ can be characterized by the following kernel.
Many SGoals are defined as two-step stochastic processes: first, a stochastic method generates $\varpi\in\mathbb{N}$ new individuals in order to “explore” the search space; then, a stochastic method selects candidate solutions among the current individuals and the new individuals in order to “improve” the quality of candidate solutions.
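Definition 68.
(Variation-Replacement) A stochastic method $\mathrm{f}:\Omega^{\eta}\to\Omega^{\upsilon}$ is called Variation-Replacement (VR) if there are two stochastic methods, $\mathrm{v}:\Omega^{\eta}\to\Omega^{\varpi}$ and $\mathrm{r}:\Omega^{\eta+\varpi}\to\Omega^{\upsilon}$, such that $\mathrm{f}(P)=\mathrm{r}(P,\mathrm{v}(P))$ or $\mathrm{f}(P)=\mathrm{r}(\mathrm{v}(P),P)$ for all $P\in\Omega^{\eta}$.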
Example 69.
The NextPop method of HC with neutral mutations (see Algorithm 2) is a VR stochastic method, see equations 14 and 15. The HC algorithm will not consider neutral mutations just by changing the order of the arguments in the replacement stochastic method $\mathrm{R}_{\mathrm{HC}}$, i.e., $\mathrm{R}_{\mathrm{HC}}(\mathrm{Variate}(x),x)$.
[TABLE]
[TABLE]
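Proposition 70.
If $\mathrm{v}:\Omega^{\eta}\to\Omega^{\varpi}$ and $\mathrm{r}:\Omega^{\eta+\varpi}\to\Omega^{\upsilon}$ are stochastic methods characterized by kernels $K_{\mathrm{v}}:\Omega^{\eta}\times\Sigma^{\otimes\varpi}\to[0,1]$ and $K_{\mathrm{r}}:\Omega^{\eta+\varpi}\times\Sigma^{\otimes\upsilon}\to[0,1]$, respectively, then $K_{\mathrm{f}}=K_{\mathrm{r}}\circ[1_{\Omega^{\eta}}\circledast K_{\mathrm{v}}]$ and $K_{\mathrm{f}}=K_{\mathrm{r}}\circ[K_{\mathrm{v}}\circledast1_{\Omega^{\eta}}]$ are kernels that characterize the VR stochastic methods $\mathrm{f}(P)=\mathrm{r}(P,\mathrm{v}(P))$ and $\mathrm{f}(P)=\mathrm{r}(\mathrm{v}(P),P)$ with $P\in\Omega^{\eta}$, respectively.
Proof.
Clearly, $[1_{\Omega^{\eta}}\circledast K_{\mathrm{v}}]$ and $[K_{\mathrm{v}}\circledast1_{\Omega^{\eta}}]$ are kernels (theorem 41 and lemma 21). Therefore, $K_{\mathrm{f}}=K_{\mathrm{r}}\circ[1_{\Omega^{\eta}}\circledast K_{\mathrm{v}}]$ and $K_{\mathrm{f}}=K_{\mathrm{r}}\circ[K_{\mathrm{v}}\circledast1_{\Omega^{\eta}}]$ are kernels by composition of kernels, see Section 2.4.2. ∎
We are now in the position of defining a kernel that characterizes the replacement method of a HC algorithm. Before doing that, notice that $\mathrm{r}_{\mathrm{HC}}(x,y)=\pi_1(\mathrm{s}_2(x,y))$.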
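A sampling view of the variation-replacement structure may help. The sketch below implements one possible hill-climbing step as $\mathrm{r}(x,\mathrm{v}(x))$ with the replacement $\pi_1\circ\mathrm{s}_2$; the uniform-perturbation `variate`, the objective gap `d` and the step size are assumptions chosen for illustration, not the paper's Variate method.

```python
import random


def sort_two(pair, d):
    """s2: keep (x, y) if d(x) < d(y), otherwise swap."""
    x, y = pair
    return (x, y) if d(x) < d(y) else (y, x)


def r_hc(x, y, d):
    """Replacement r_HC = pi_1 o s2: keeps x only if it is strictly better."""
    return sort_two((x, y), d)[0]


def variate(x, rng, step=1.0):
    """Hypothetical variation method: uniform perturbation of x."""
    return x + rng.uniform(-step, step)


def next_pop_hc(x, d, rng):
    """VR step f(x) = r(x, v(x)); ties go to the new point (neutral mutations)."""
    return r_hc(x, variate(x, rng), d)


def run_hc(x0, d, steps, seed=0):
    rng = random.Random(seed)
    trace = [x0]
    for _ in range(steps):
        trace.append(next_pop_hc(trace[-1], d, rng))
    return trace


d = lambda x: x * x                   # hypothetical gap f(x) - f*, optimum at 0
trace = run_hc(5.0, d, steps=50)
```

Swapping the arguments of `r_hc` in `next_pop_hc` would reject ties, which corresponds to the variant without neutral mutations.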
Lemma 71.
The function $\mathrm{r}_{\mathrm{HC}}=\pi_1\circ\mathrm{s}_2$ is measurable and $K_{\mathrm{r}_{\mathrm{HC}}}\equiv1_{\mathrm{r}_{\mathrm{HC}}}$ as defined in theorem 21 is a kernel.
Proof.
Follows from the fact that $\mathrm{r}_{\mathrm{HC}}=\pi_1\circ\mathrm{s}_2$ is measurable (composition of measurable functions is measurable) and theorem 21. ∎
Corollary 72.
The Hill Climbing algorithm shown in Algorithm 2 can be characterized by a kernel if its $\mathrm{Variate}_{\mathrm{HC}}$ stochastic method can be characterized by a kernel.
Proof.
Follows from example 69, proposition 70
and lemma 71.∎
Corollary 73.
The Parallel Hill Climbing algorithm shown in Algorithm 3 can be characterized by a kernel if the $\mathrm{Variate}_{\mathrm{HC}}$ stochastic method of the parallelized HC can be characterized by a kernel.
Proof.
Follows from example 56, corollary 72
and proposition 57.∎
Lemma 74.
The NextPop stochastic method of the SSGa shown in Algorithm 5 can be characterized by the composition of two kernels $\mathrm{v}_{\mathrm{SSGa}}=[[\mathrm{v}\circ\pi_{\{1,2\}}]\circledast1]\circ K_{\mathcal{P}}$ and $\mathrm{r}_{\mathrm{SSGa}}=[\mathrm{b}_{2,4}\circ\pi_{\{1,\ldots,4\}}]\circledast\pi_{\{5,\ldots,n+2\}}$ if lines 3-4 can be characterized by a kernel $\mathrm{v}:\Omega^{2}\times\Sigma^{\otimes2}\to[0,1]$.
Proof.
Follows from composition of kernels and proposition 67.
∎
4.3.4 Elitist Stochastic Methods
Some SGoals use elitist stochastic methods, i.e., methods for which the best candidate solution obtained after applying the method is at least as good as the best candidate solution before applying it, in order to capture the notion of “improving” the solution.
Definition 75.
(elitist method) A stochastic method $\mathrm{f}:\Omega^{\eta}\to\Omega^{\upsilon}$ is called elitist if $f(\mathrm{Best}(\mathrm{f}(P)))\le f(\mathrm{Best}(P))$.
Example 76.
The NextPop methods of the following algorithms are elitist stochastic methods. (Footnote 6: Here we just present the examples when such algorithms consider neutral mutations, but it is also valid when those do not consider neutral mutations; we just need to reverse the product order.) Here, we will denote $\mathrm{Q}_{\mathrm{A}}(P)\equiv\mathrm{NextPop}_{\mathrm{A}}(P)$.
1. SSGa: $\mathrm{Best}(\mathrm{Q}_{\mathrm{SSGa}}(P))=\mathrm{Best}(c_1\times c_2\times P)$ (see Algorithm 5). Then, $f(\mathrm{Best}(c_1\times c_2\times P))\le f(\mathrm{Best}(P))$.
2. PHC: Let $k\in[1,n]$ be the index of the best individual in population $P$; then $f(\mathrm{Best}(P))=f(P_k)$. Since $\mathrm{Q}_{\mathrm{PHC}}(P)_i=\mathrm{Q}_{\mathrm{HC}}(P_i)$ for all $i=1,2,\ldots,n$ (see Algorithm 3), it is clear that $f(\mathrm{Q}_{\mathrm{PHC}}(P)_k)\le f(\mathrm{Best}(P))$ ($\mathrm{Q}_{\mathrm{HC}}$ is elitist). Then, $f(\mathrm{Best}(\mathrm{Q}_{\mathrm{PHC}}(P)))\le f(\mathrm{Q}_{\mathrm{PHC}}(P)_k)\le f(\mathrm{Best}(P))$.
Definition 77**.**
(elitist kernel) A kernel K:Ωη×Σ⊗υ→[0,1]
is called elitist if K(x,A)=0 for each A∈Σ⊗υ
such that d(x)<d(y) for all y∈A.
Proposition 78.
Kernels $\mathrm{r}_{\mathrm{HC}}$ and $\mathrm{r}_{\mathrm{SSGa}}$ are elitist kernels.
Proof.
Let $(x,y)\in\Omega^{2}$ and $A\in\Sigma$ such that $d(x,y)<d(z)$ for all $z\in A$. Now, $\mathrm{r}_{\mathrm{HC}}(x,y)=\pi_1\circ\mathrm{s}_2(x,y)$ (def $\mathrm{r}_{\mathrm{HC}}$); clearly, $d(\mathrm{r}_{\mathrm{HC}}(x,y))\le d(x,y)$ (def $d$), therefore $\mathrm{r}_{\mathrm{HC}}(x,y)\notin A$ (def $A$). In this way, $\mathrm{r}_{\mathrm{HC}}((x,y),A)=0$ (def kernel $\mathrm{r}_{\mathrm{HC}}$ and theorem 21). Therefore, $\mathrm{r}_{\mathrm{HC}}$ is elitist (def elitist kernel). A similar proof is carried out for $\mathrm{r}_{\mathrm{SSGa}}$. ∎
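Lemma 79.
If $K:\Omega^{\eta}\times\Sigma^{\otimes\upsilon}\to[0,1]$ is elitist then
1. $K(x,(\Omega_{\overline{d(x)}}^{\upsilon})^{c})=0$ and $K(x,\Omega_{\overline{d(x)}}^{\upsilon})=1$, where $\Omega_{\overline{d(x)}}^{\upsilon}=\{y\in\Omega^{\upsilon}:d(y)\le d(x)\}$.
2. Let $x\in\Omega^{\eta}$; if $d(x)<\alpha\in\mathbb{R}$ then $K(x,(\Omega_{\alpha}^{\upsilon})^{c})=0$ and $K(x,\Omega_{\alpha}^{\upsilon})=1$.
Proof.
[1] Let $y\in(\Omega_{\overline{d(x)}}^{\upsilon})^{c}$; then $\neg(d(y)\le d(x))$ (def complement, $\Omega_{\overline{d(x)}}$), i.e., $d(x)<d(y)$. Therefore, $K(x,(\Omega_{\overline{d(x)}}^{\upsilon})^{c})=0$ ($K$ elitist) and $K(x,\Omega_{\overline{d(x)}}^{\upsilon})=1$ ($K(x,\bullet)$ probability measure). [2] If $d(x)<\alpha$ then $\Omega_{\overline{d(x)}}^{\upsilon}\subseteq\Omega_{\alpha}^{\upsilon}$ (def $\Omega_{\epsilon}$) and $(\Omega_{\alpha}^{\upsilon})^{c}\subseteq(\Omega_{\overline{d(x)}}^{\upsilon})^{c}$ (def complement). Clearly, $K(x,(\Omega_{\alpha}^{\upsilon})^{c})\le K(x,(\Omega_{\overline{d(x)}}^{\upsilon})^{c})=0$ and $K(x,\Omega_{\alpha}^{\upsilon})=1$ ($K(x,\bullet)$ measure). ∎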
Definition 80**.**
(optimal strictly bounded from zero) A kernel K:Ωη×Σ⊗υ→[0,1]
is called optimal strictly bounded from zero iff K(x,Ωϵ)≥δ(ϵ)>0
for all ϵ>0.
5 Convergence of a SGoal
We will follow the approach proposed by Günter Rudolph in [16],
to determine the convergence properties of a SGoal. In the
rest of this paper, Σ is an optimization σ-algebra.
First, Rudolph defines a convergence property for a SGoal in
terms of the objective function.
Definition 81**.**
(SGoal convergence). Let Pt∈Ωn
be the population maintained by a SGoal A at iteration
t. Then A converges to the global optimum if the random
sequence (Dt=d(Pt):t≥0)
converges completely to zero.
Then, Rudolph proposes a sufficient condition on the kernel when applied
to the set of strict ϵ-optimal states in order to attain
such convergence.
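Lemma 82.
(Lemma 1 in [16]) If $K(x,\Omega_{\epsilon})\ge\delta>0$ for all $x\in\Omega_{\epsilon}^{c}$ and $K(x,\Omega_{\epsilon})=1$ for all $x\in\Omega_{\epsilon}$, then equation 16 holds for $t\ge1$.
$$K^{(t)}(x,\Omega_{\epsilon})\ge1-(1-\delta)^{t}\qquad(16)$$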
Proof.
In [16], Rudolph uses induction on $t$ in order to demonstrate lemma 82. For $t=1$ we have that $K^{(t)}(x,\Omega_{\epsilon})=K(x,\Omega_{\epsilon})$ (equation 7), so $K^{(t)}(x,\Omega_{\epsilon})\ge\delta$ (condition of the lemma); therefore $K^{(t)}(x,\Omega_{\epsilon})\ge1-(1-\delta)^{t}$ ($t=1$ and numeric operations). Here, we will use the notation $K^{(t)}(y,\Omega_{\epsilon})=K_{y}^{(t)}(\Omega_{\epsilon})$ to reduce the visual length of the equations.
[TABLE]
∎
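The bound $1-(1-\delta)^{t}$ can be observed numerically on a toy chain. The sketch below uses a hypothetical three-state transition matrix in which state 0 stands for the set of strict ϵ-optimal states; it illustrates the inequality and is not part of Rudolph's proof.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical 3-state chain; state 0 plays the role of the strict
# eps-optimal set: it is absorbing, and every other state reaches it in
# one step with probability at least delta.
K = [
    [1.0, 0.0, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
delta = 0.1

checks = []
K_t = K
for t in range(1, 31):
    bound = 1.0 - (1.0 - delta) ** t
    worst = min(K_t[x][0] for x in range(3))   # smallest t-step mass on the target
    checks.append((t, worst, bound))
    K_t = mat_mul(K_t, K)
```

Every entry of `checks` pairs the worst-case $t$-step probability of reaching the target with the lower bound of equation 16.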
Using lemma 82, Rudolph is able to state a theorem for the convergence of evolutionary algorithms (we rewrite it in terms of SGoals). However, Rudolph’s proof is not correct: by definition of $\Omega_{\epsilon}$, $\Pr\{d(P_t)<\epsilon\}=\Pr\{P_t\in\Omega_{\epsilon}\}$ for $t\ge0$, whereas Rudolph wrongly assumed that $\Pr\{d(P_t)\le\epsilon\}=\Pr\{P_t\in\Omega_{\epsilon}\}$. Here, we correct the proof proposed by Rudolph (see step 7 in our demonstration).
Theorem 83**.**
(Theorem 1 in Rudolph [16])
A SGoal, whose stochastic kernel satisfies the precondition
of lemma 82, will converge to the global optimum
(f∗) of a real valued function f:Φ→R
with f>−∞, defined in an arbitrary space Ω⊆Φ,
regardless of the initial distribution p(⋅).
Proof.
The idea is to show that the random sequence (d(Pt):t≥0)
converges completely to zero under the pre-condition of lemma 82
[16].
[TABLE]
Since $(1-\delta)^{t}\to0$ as $t\to\infty$, then $\Pr\{d(P_t)>\epsilon\}\to0$ as $t\to\infty$, so $D_t\xrightarrow{p}0$. Now,
[TABLE]
Therefore, (d(Pt):t≥0) converges completely
to zero.
∎
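The step from convergence in probability to complete convergence rests on a geometric series. As a hedged sketch, assume the bound $\Pr\{d(P_t)>\epsilon\}\le(1-\delta)^{t}$ for $t\ge1$, which follows from lemma 82 because $\{d(P_t)>\epsilon\}\subseteq\{P_t\notin\Omega_{\epsilon}\}$, and assume $0<\delta\le1$:

```latex
\sum_{t=1}^{\infty}\Pr\{d(P_t)>\epsilon\}
  \le \sum_{t=1}^{\infty}(1-\delta)^{t}
  = \frac{1-\delta}{\delta}
  < \infty .
```

For instance, $\delta=0.1$ gives the finite bound $0.9/0.1=9$ for every $\epsilon>0$, which is the summability that complete convergence requires.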
5.1 Convergence of a VR-SGoal
We follow the approach proposed by Günter Rudolph in [16],
to determine the convergence properties of a VR-SGoals but
we formalize it in terms of kernels (both variation and replacement).
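Theorem 84.
A VR-SGoal with $K_{\mathrm{v}}$ an optimal strictly bounded from zero variation kernel and $K_{\mathrm{r}}$ an elitist replacement kernel will converge to the global optimum of the objective function.
Proof.
If we prove that $K=K_{\mathrm{r}}\circ[K_{\mathrm{v}}\circledast1_{\Omega^{\eta}}]$ satisfies the precondition of lemma 82, then the VR-SGoal will converge to the global optimum of the objective function (theorem 83). We use the notation $\omega=\eta+\upsilon$ in this proof.
[1] Notice that if $y\in\Omega^{\upsilon}\times\{x\}$ then $d(y)\le d(x)$ (def $d$), and if $y\in\Omega_{\epsilon}^{\omega}$ then $d(y)<\epsilon$ (def $\Omega_{\epsilon}^{\omega}$); therefore $K_{\mathrm{r}}(y,\Omega_{\epsilon}^{\eta})=1$ (lemma 79.2).
[2. $K(x,\Omega_{\epsilon}^{\eta})\ge\delta(\epsilon)>0$ for all $x\in\Omega^{\eta}$]
[TABLE]
Clearly, $K(x,\Omega_{\epsilon}^{\eta})\ge\delta(\epsilon)>0$ for all $x\notin\Omega_{\epsilon}^{\eta}$ ($K_{\mathrm{v}}$ optimal strictly bounded from zero).
[3. $K(x,\Omega_{\epsilon}^{\eta})=1$ if $x\in\Omega_{\epsilon}^{\eta}$] If $x\in\Omega_{\epsilon}^{\eta}$ then $d(x)<\epsilon$ (def $\Omega_{\epsilon}^{\eta}$). Clearly, $d(y)<\epsilon$ for every $y\in\Omega^{\upsilon}\times\{x\}$ (transitivity), therefore $K(x,(\Omega_{\epsilon}^{\eta})^{c})=0$ (lemma 79.2) and $K(x,\Omega_{\epsilon}^{\eta})=1$ ($K(x,\bullet)$ probability measure).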
∎
Corollary 85.
Algorithms HC and SSGa will converge to the global optimum of the objective function if kernels $\mathrm{v}_{\mathrm{HC}}$ and $\mathrm{v}_{\mathrm{SSGa}}$ are optimal strictly bounded from zero kernels.
Developing a comprehensive and formal approach to stochastic global optimization algorithms (SGoals) is not an easy task due to the large number of different SGoals reported in the literature (we formalize and characterize only three classic SGoals in this paper). However, such SGoals are defined as joins, compositions and/or random scans of some common deterministic and stochastic methods that can be represented as kernels on an appropriate structure (measurable spaces with some special property and provided with additional structure).
Such special structure is the optimization space (defined in this
paper). On this structure, we are able to characterize several SGoals
as special cases of variation/replacement strategies, join strategies,
elitist strategies and we are able to inherit some properties of their
associated kernels. Moreover, we are able to prove convergence properties
(following Rudolph's approach [16]) of SGoals.
Since the optimization σ-algebra property of the structure
is preserved by product σ-algebras, our formal approach can
be applicable to both single point SGoals and population based
SGoals.
Although the theory developed in this paper covers only SGoals with fixed parameters (like population size and variation rates), it is a good starting point for studying adaptive SGoals (SGoals that adapt/vary some search parameters as they iterate). The central concept for doing that will be the join of kernels (if we consider the space of the parameter values as part of the σ-algebra). However, such a study is beyond the scope of this paper.
Our future work will concentrate on including in this formalization as many as possible of the selection mechanisms that are used in SGoals, and on extending and developing the theory required for characterizing both adaptive and mixing SGoals.
Bibliography
[1] L. Liberti, “Introduction to global optimization,” 2008.
[2] G. P. Rangaiah, Stochastic Global Optimization: Techniques and Applications in Chemical Engineering. River Edge, NJ, USA: World Scientific Publishing Co., Inc., 2010.
[3] A. Zhigljavsky and A. Zilinskas, Stochastic global optimization. Springer Optimization and Its Applications, Springer, 2010.
[4] J. J. E. Kelley, “The cutting-plane method for solving convex programs,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 4, pp. 703–712, 1960.
[5] D. R. Morrison, S. H. Jacobson, J. J. Sauppe, and E. C. Sewell, “Branch-and-bound algorithms,” Discret. Optim., vol. 19, pp. 79–102, Feb. 2016.
[6] J. H. Holland, Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.
[7] K. De Jong, An analysis of the behavior of a class of genetic adaptive systems. PhD thesis, University of Michigan, 1975.
[8] A. E. Eiben, R. Hinterding, and Z. Michalewicz, “Parameter control in evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 3, no. 2, pp. 124–141, 1999.