Double Descent and Intermittent Color Diffusion for Global Optimization   and landscape exploration

Luca Dieci; Manuela Manetta; Haomin Zhou

arXiv:1706.05285·math.NA·June 19, 2017

Double Descent and Intermittent Color Diffusion for Global Optimization and landscape exploration

Luca Dieci, Manuela Manetta, Haomin Zhou

PDF

Open Access

TL;DR

This paper introduces a novel optimization method combining double descent and intermittent colored diffusion to effectively explore smooth landscapes and locate global minimizers, demonstrated through numerical experiments.

Contribution

The paper presents a new approach that integrates double descent and basin-escaping via intermittent colored diffusion for improved global landscape exploration.

Findings

01

Effective in locating global minima in smooth potentials

02

Numerical results demonstrate the method's performance

03

Combines double descent with basin-escaping techniques

Abstract

In this work, we present a method to explore the landscape of a smooth potential in the search of global minimizers,combining a double-descent technique and a basin-escaping technique based on intermittent colored diffusion. Numerical results illustrate the performance of the method.

Tables5

Table 1. Table 1 . Camel critical points

$x$	$y$	$f (x, y)$	spectrum of H
0.0898	-0.7127	-1.0316	7.6822 16.4932
-0.0898	0.7127	-1.0316	7.6823 16.4932
1.6071	0.5687	2.1043	7.1215 10.0216
-1.6071	-0.5687	2.1043	7.1215 10.0216
1.7036	-0.7961	-0.2155	18.8171 22.6975
-1.7036	0.7961	-0.2155	18.8171 22.6975
1.2302	0.1623	2.4963	-8.0149 -5.9537
-1.2302	-0.1623	2.4963	-8.0149 -5.9537
0	0	0	-8.0623 8.0623
1.1092	-0.7683	0.5437	-7.9026 20.3667
-1.1092	0.7683	0.5437	-7.9026 20.3667
1.2961	0.6051	2.2295	-6.1772 9.6376
-1.2961	-0.6051	2.2295	-6.1772 9.6376
1.6381	0.2287	2.2294	-5.5458 12.4367

Table 2. Table 2 . Lennard Jones global minima

$d$	N	global minima
2	1	-1
3	3	-3
4	6	-6
5	9	-9.104
6	12	-12.712
7	15	-16.505
8	18	-19.821
9	21	-24.113
10	24	-28.422
11	27	-32.766
12	30	-37.968
13	33	-44.327
14	36	-47.845

Table 3. Table 3 . Morse global minima ( d = 11 𝑑 11 d=11 , N = 27 𝑁 27 N=27 )

$ρ$	global minima
3	-37.930817
6	-31.521880
10	-30.265230
14	-29.596054

Table 4. Table 4 . Results obtained using Matlab global optimization toolbox ( GlobalSearch and simulannealbnd ) on Morse Potential ( d = 11 𝑑 11 d=11 , N = 27 𝑁 27 N=27 ).

$ρ$	GlobalSearch	simulannealbnd
3	-22.0429	-11.1367
6	-16.2076	-3.1483
10	-9	-1
14	-9	-1

Table 5. Table 5 . Critical points - Boggs system

$x$	$y$	$g (x, y)$	id
0	1	0
-1	2	0	global minima
-0.7071	1.5	0
-2.1530	5.9055	0.7139
0.1301	-0.3768	1.2161	local minima
0.1890	-0.3663	1.1941
-0.8898	1.7671	0.0013
-0.3319	1.1830	0.0038	saddle points
0.4555	2.4926	1.5111
-0.3277	4.3927	6.0502

Equations60

P(x,x_{new},T)=\left\{\begin{array}[]{ll}1,&\text{ if }g(x_{new})<g(x),\\ exp\left(\frac{-[g(x_{new})-g(x)]}{T}\right),&\text{ if }g(x_{new})>g(x),\end{array}\right.

P(x,x_{new},T)=\left\{\begin{array}[]{ll}1,&\text{ if }g(x_{new})<g(x),\\ exp\left(\frac{-[g(x_{new})-g(x)]}{T}\right),&\text{ if }g(x_{new})>g(x),\end{array}\right.

d x (t, w) = - \nabla g (x (t, w)) d t + σ (t) d W (t), t \in [0, + \infty]

d x (t, w) = - \nabla g (x (t, w)) d t + σ (t) d W (t), t \in [0, + \infty]

σ (t) = i = 1 \sum N σ_{i} I_{[S_{j}, T_{j}]} (t)

σ (t) = i = 1 \sum N σ_{i} I_{[S_{j}, T_{j}]} (t)

x_{k + 1} = x_{k} - h \nabla g (x_{k}) + + h σ (t_{k}) W,

x_{k + 1} = x_{k} - h \nabla g (x_{k}) + + h σ (t_{k}) W,

ν (H) = {n_{+} (H), n_{0} (H), n_{-} (H)},

ν (H) = {n_{+} (H), n_{0} (H), n_{-} (H)},

P := {x \in R^{n} : n_{0} (H (x)) = 0, n_{+} (H (x)) and n_{-} (H (x)) constant \neq = 0},

P := {x \in R^{n} : n_{0} (H (x)) = 0, n_{+} (H (x)) and n_{-} (H (x)) constant \neq = 0},

g (x_{3}) = Σ \in Γ in f x \in Σ max g (x),

g (x_{3}) = Σ \in Γ in f x \in Σ max g (x),

v^{T} \nabla G < 0 and v^{T} \nabla g < 0 .

v^{T} \nabla G < 0 and v^{T} \nabla g < 0 .

v^{T} H \nabla g < 0 and v^{T} \nabla g < 0 .

v^{T} H \nabla g < 0 and v^{T} \nabla g < 0 .

v^{T} H \nabla g < 0 and v^{T} \nabla g < 0 .

v^{T} H \nabla g < 0 and v^{T} \nabla g < 0 .

v^{T} \nabla g = - \nabla g (x_{0})^{T} (H_{+} (x_{0}))^{†} \nabla g (x_{0}) < 0

v^{T} \nabla g = - \nabla g (x_{0})^{T} (H_{+} (x_{0}))^{†} \nabla g (x_{0}) < 0

v^{T} H \nabla g

v^{T} H \nabla g

x_{1} = x_{0} + α v_{1} (x_{0}) v_{1}^{T} (x_{0}) W, W \in N (0, 1)^{n} .

x_{1} = x_{0} + α v_{1} (x_{0}) v_{1}^{T} (x_{0}) W, W \in N (0, 1)^{n} .

\begin{split}x_{k+1}\ &=\ x_{k}-h\bigl{(}H^{\dagger}(x_{k})\nabla g(x_{k})\bigr{)}+\alpha\sqrt{h}\sigma(x_{k})W\ ,\text{where}\\ \sigma(x_{k})\ &=\ -v_{1}(x_{k})v_{1}(x_{k})^{T}\ ,\quad\text{and}\quad W\in N(0,1)^{n}\ .\end{split}

\begin{split}x_{k+1}\ &=\ x_{k}-h\bigl{(}H^{\dagger}(x_{k})\nabla g(x_{k})\bigr{)}+\alpha\sqrt{h}\sigma(x_{k})W\ ,\text{where}\\ \sigma(x_{k})\ &=\ -v_{1}(x_{k})v_{1}(x_{k})^{T}\ ,\quad\text{and}\quad W\in N(0,1)^{n}\ .\end{split}

g (x_{0} + ϵy) = g (x_{0}) + ϵ \nabla g (x_{0})^{T} y + \frac{1}{2} ϵ^{2} y^{T} H (x_{0}) y + \dots

g (x_{0} + ϵy) = g (x_{0}) + ϵ \nabla g (x_{0})^{T} y + \frac{1}{2} ϵ^{2} y^{T} H (x_{0}) y + \dots

x_{1} = x_{0} + α v_{n} (x_{0}) v_{n}^{T} (x_{0}) W, W \in N (0, 1)^{n} .

x_{1} = x_{0} + α v_{n} (x_{0}) v_{n}^{T} (x_{0}) W, W \in N (0, 1)^{n} .

g (x_{0} + ϵy) = g (x_{0}) + ϵ \nabla g (x_{0})^{T} y + \frac{1}{2} ϵ^{2} y^{T} H (x_{0}) y + \dots

g (x_{0} + ϵy) = g (x_{0}) + ϵ \nabla g (x_{0})^{T} y + \frac{1}{2} ϵ^{2} y^{T} H (x_{0}) y + \dots

\begin{split}x_{k+1}\ &=\ x_{k}-h\bigl{(}H_{+}^{\dagger}(x_{k})\nabla g(x_{k})\bigr{)}+\alpha\sqrt{h}\sigma(x_{k})W\ \ ,\text{where}\\ \sigma(x_{k})\ &=\ -v_{n}(x_{k})v_{n}(x_{k})^{T}\ ,\quad\text{and}\quad W\in N(0,1)^{n}\ .\end{split}

\begin{split}x_{k+1}\ &=\ x_{k}-h\bigl{(}H_{+}^{\dagger}(x_{k})\nabla g(x_{k})\bigr{)}+\alpha\sqrt{h}\sigma(x_{k})W\ \ ,\text{where}\\ \sigma(x_{k})\ &=\ -v_{n}(x_{k})v_{n}(x_{k})^{T}\ ,\quad\text{and}\quad W\in N(0,1)^{n}\ .\end{split}

x_{k + 1} σ (x_{k}) = x_{k} - h \nabla g (x_{k}) + α h σ (x_{k}) W, where = - v_{n} (x_{k}) v_{n} (x_{k})^{T}, and W \in N (0, 1)^{n} .

x_{k + 1} σ (x_{k}) = x_{k} - h \nabla g (x_{k}) + α h σ (x_{k}) W, where = - v_{n} (x_{k}) v_{n} (x_{k})^{T}, and W \in N (0, 1)^{n} .

g (x, y) = (x^{2} - 1)^{2} + (x^{2} + y - 1)^{2} .

g (x, y) = (x^{2} - 1)^{2} + (x^{2} + y - 1)^{2} .

g(x,y)=\left(\sum_{i=1}^{5}i\,\cos\big{[}(i+1)\,x+i\big{]}\right)\left(\sum_{i=1}^{5}i\,\cos\big{[}(i+1)\,y+i\big{]}\right),

g(x,y)=\left(\sum_{i=1}^{5}i\,\cos\big{[}(i+1)\,x+i\big{]}\right)\left(\sum_{i=1}^{5}i\,\cos\big{[}(i+1)\,y+i\big{]}\right),

g (X) = i = 1 \sum 10 (e^{- t_{i} x_{1}} - 5 e^{- t_{i} x_{2}} - y_{i})^{2}

g (X) = i = 1 \sum 10 (e^{- t_{i} x_{1}} - 5 e^{- t_{i} x_{2}} - y_{i})^{2}

g (x, y) = (4 - 2.1 x^{2} + \frac{x ^{4}}{3}) x^{2} + x y + 4 (y^{2} - 1) y^{2}

g (x, y) = (4 - 2.1 x^{2} + \frac{x ^{4}}{3}) x^{2} + x y + 4 (y^{2} - 1) y^{2}

g (x) = i = 1 \sum N - 1 [100 (x_{i + 1} - x_{i}^{2})^{2} + (x_{i} - 1)^{2}],

g (x) = i = 1 \sum N - 1 [100 (x_{i + 1} - x_{i}^{2})^{2} + (x_{i} - 1)^{2}],

X = (x_{2}, x_{3}, y_{3}, x_{4}, y_{4}, z_{4}, \dots, x_{d}, y_{d}, z_{d}),

X = (x_{2}, x_{3}, y_{3}, x_{4}, y_{4}, z_{4}, \dots, x_{d}, y_{d}, z_{d}),

V (r) = 4 ε i < j \sum d [(\frac{σ}{r _{ij}})^{12} - (\frac{σ}{r _{ij}})^{6}],

V (r) = 4 ε i < j \sum d [(\frac{σ}{r _{ij}})^{12} - (\frac{σ}{r _{ij}})^{6}],

V_{ρ} (r) = i < j \sum d [e^{ρ (1 - r_{ij})} (e^{ρ (1 - r_{ij})} - 2)]

V_{ρ} (r) = i < j \sum d [e^{ρ (1 - r_{ij})} (e^{ρ (1 - r_{ij})} - 2)]

S (x) = 0, \mbox w i t h x \in R^{n},

S (x) = 0, \mbox w i t h x \in R^{n},

g (x) = \frac{1}{2} S (x)^{T} S (x)

g (x) = \frac{1}{2} S (x)^{T} S (x)

[x^{2} - y + 1 x - cos (\frac{π}{2} y)] = 0,

[x^{2} - y + 1 x - cos (\frac{π}{2} y)] = 0,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Optimization Algorithms Research · Iterative Methods for Nonlinear Equations · Fractional Differential Equations Solutions

Full text

Double Descent and Intermittent Color Diffusion

for Global Optimization and landscape exploration

Luca Dieci, Manuela Manetta, Haomin Zhou

Abstract.

In this work, we present a method to explore the landscape of a smooth potential in the search of global minimizers, combining a double-descent technique and a basin-escaping technique based on intermittent colored diffusion. Numerical results illustrate the performance of the method.

Key words and phrases:

Double descent, color noise, intermittent diffusion, optimization

1991 Mathematics Subject Classification:

65H99, 65K99

The second and third authors work was supported under NSF Awards DMS-1419027, DMS-1620345 and ONR Award N000141310408.

1. Introduction

Let $g:\mathbb{R}^{n}\to\mathbb{R}$ , $n\geq 1$ , be a sufficiently smooth function (say, ${\mathcal{C}}^{\infty}$ ); we call $g$ the “potential,” or “objective function.” Let $\nabla g$ be the gradient of $g$ , and $H$ be the Hessian. Finally, we also let $G:\mathbb{R}^{n}\to\mathbb{R}^{+}$ be defined as $G=\frac{1}{2}(\nabla g)^{T}(\nabla g)$ ; we call $G$ the “auxiliary potential.” Our goal is to minimize $g$ .

Finding global minimizers for a general objective function $g$ is one of the oldest and most challenging problems in applied mathematics. Whereas it is at times possible to exploit a-priori knowledge for specific potentials, it remains an outstanding task to devise effective general optimization strategies which can be applied to a general problem. In the literature, one finds extensive collections of real-world potentials, arising from chemistry, physics, mathematics, etc., as well as artificial problems. Many challenging problems in the first group are obtained as models of interatomic forces, and are distinguished by having a reduced region of interest, expensive computation of the potential and its gradient, and full (not sparse) Hessians. Problems in the second group are useful to validate an optimization technique, to illustrate it, and to have objective functions with selectively distinguished features: a single global minimum, multiple global minima in the presence of many local minima (in which case any deterministic technique will be trapped in the basin of attraction of a minimizer without being able to escape it), long narrow valleys (which will slow down the search process), and flat surfaces. Unsurprisingly, some methods perform well on some problems, and poorly on others, and -aside from knowing ahead of time what method to use on a specific potential- one is left wandering on what to use for a given problem.

We are often confronted with this frustrating state of affairs when teaching a course on numerical methods for optimization. Even absolutely marvelous textbooks (e.g., [5]) are ultimately having to accept some uncertainties, and to deal with fine-tuning of parameters, and empirical choices. To be fair, these difficulties are intrinsic to the task at hand, and surely not the result of negligence. So, when we teach such a course, we end up teaching local techniques, maybe continuation and embedding techniques, emphasize gradient descent and Newton’s method and their variants, stress convex or maybe polynomial functions, but in the end we fail at providing rigorous answers to the recurring questions of alert students relative to a general smooth function $g$ : “how do we know that we found the global minimum? how do we know that we have visited the interesting regions of configuration space?” We don’t know, and most likely we will not know for the foreseeable future. Indeed, borrowing a painstaking and extensive search of the configuration space, we have few hints to offer to our students for answering their questions above. Motivated by our classroom experience, one of our purposes in this work is to present methods and ideas that can be taught in a numerical optimization course. That said, quite honestly, we have no pretense that our work is an answer to the above questions, but we are hopeful to be taking a (small) step in the right direction.

Let us immediately stress that we are putting forth some ideas for a general purpose method, one which does not rely on the specific properties of the potential. With this purpose in mind, we may recall that the main components of a global minimization algorithm are: to explore the landscape, and to locate minima. The shape of the level sets of $g$ is of course dictating the nature of the landscape: flat, rough, predictable, crowded with minima. At the same time, the shape of the level sets of $G$ is more directly responsible for properly identifying the basins of attraction of the minima of $g$ for important techniques, such as Newton’s method. Naturally, the level sets of $g$ and of $G$ are often of a very different nature; for example, in 1 we show them relative to the function of Example 6.1.1.

Regardless, unfortunately, graphical insight provided by the level sets is all but unavailable for problems in several variables, and exploration of the landscape remains a mix of randomization and subdivision ideas. Indeed, the most widely adopted exploratory technique appears to be Monte-Carlo, which is then combined with localization methods such as gradient descent, Newton’s method, and their modifications. For example, the popular “basin-hopping” methods randomly perturb a minimum and search for other minima by gradient descent (e.g., see [11]). And, of course, the best known instance of a method exploiting stochastic effects is simulated annealing, which makes use of a probability function to decide how to move in the search space; briefly, the method can be described as follows: call $x$ the current state, $x_{new}$ a randomly selected neighbor of $x$ , and let the probability function be given by

[TABLE]

where $T$ is called the temperature and is a function of the ratio between the current iteration number and the total number of iterations allowed. Given this setting, a random number $r$ is generated: if the probability to move from $x$ to $x_{new}$ is greater than $r$ , then the new state is accepted, otherwise it is rejected; note that a downhill direction will always be accepted, though one may also take uphill steps. This method is inherently better suited for discrete problems, and it is sequential in nature (see [2] for more details on simulated annealing).

To a large extent, everyone needs to deal with the key two aspects above: how to explore the landscape, and how to locate minima once the decision is made that there is one in the vicinity of the iterates. Our goal is to do so while avoiding a purely Monte Carlo technique, but rather using the information gathered at the critical points to move in an educated way in the landscape exploration.

In this work we introduce a method which uses a combination of new techniques, namely a double-descent method to search for minima, and an intermittent colored diffusion technique to escape the basins of attraction of critical points.

All along, we will tacitly assume that critical points (i.e., values $x$ where $G(x)=0$ ) are simple, in the sense that the Hessian is invertible there. In particular, at minima, the associated Hessian will be positive definite. Nevertheless, the proposed method will also be able to solve problems where at the critical point the Hessian has eigenvalues equal to [math].

We conclude this introduction with some practical considerations.

(i)

Although we are considering an unconstrained minimization problem, the case of constrained optimization is of course also important, and we expect to consider it in the future.

(ii)

Of the many minimization methods proposed during the years, some use only gradient information, some also Hessian, and some use only functional evaluations (so-called direct search algorithms). Whereas, of course, the specific problem at hand may inhibit using the gradient and/or the Hessian, we will assume that these are available to us. In fact, in our technique, we make use of repeated eigen-decomposition of the Hessian. Of course, this is an expense which may be prohibitive for truly large problems, though by today standards it is easily doable for dimensions of up to a few hundreds. It is not by coincindence that a lot of people have been concerned with efficient updating of Hessian factorizations (e.g., the BFGS (Broyden-Fletcher-Goldfarb-Shanno) or the DFP (Davidon-Fletcher-Powell) updates); see [7].

(iii)

The prevailing wisdom (e.g., see the well known Levenberg-Marquardt algorithm, trust-region methods, and the discussion in [5] and [10]) is to do Newton’s method near a minimizer. Our technique is designed to automatically do Newton’s method as well, as we reach the neighborhood of a minimize, or of another critical point.

(iv)

Many recent advances in global optimization (e.g., genetic algorithms, direct search techniques, multiple random initializations) have found their place in public domain software; e.g., see [8] and the Matlab Global Optimization Toolbox. In particular, the latter contains two routines which we have used for cross-comparison of our results: GlobalSearch and simulannealbnd. The function simulannealbnd is the Matlab implementation of simulated annealing. Instead, GlobalSearch finds minimizers at different stages: first a local search (carried out by the function fmincon) starts from an initial point $x_{0}$ provided by the user, and then a list of trial point is generated as potential starting points, taking into account penalty functions, spherical basins of attractions, and run-time, to eventually perform the local search from a large number of initial points. (For fair comparison with the results of our method, we used this function by providing gradient and Hessian information).

(v)

Finally, we must stress that it is very delicate to implement any method, and that methods that look good on paper will not deliver according to expectations, if not properly implemented. For this reason, we will detail our implementation choices so that our results may be replicated.

A plan of this paper is as follows. In section 2 we give some background material. In section 3 we introduce the double-descent technique, and in section 4 we give the combined “double-descent-color-intermittent-diffusion” method (DD-CID, for short). An overview of our numerical method is in section 5, and several numerical experiments are reported in section 6.

2. Preliminaries

2.1. Intermittent Diffusion

In the recent work [3], the authors proposed a general methodology, called intermittent diffusion (ID, for short), motivated by the fact that the most widely used stochastic technique available for global optimization, the simulated annealing mentioned before, needs a deterministic part to speed up the convergence towards the minimizers. In order to do so, ID alternates between gradient descent and diffusion processes, by turning on and off a white noise term. In mathematical terms, the ID methodology can be summarized by the following stochastic differential equation:

[TABLE]

where $W(t)$ is Brownian motion in $\mathbb{R}^{n}$ , $w$ is a random path in the Wiener space and $\sigma(t)$ is a piecewise constant function of time alternating between positive and zero values. In particular, when the noise is off ( $\sigma(t)=0$ ), the method reduces to the gradient descent technique, leading the trajectory towards a minimizer of the potential, when the noise is on ( $\sigma(t)>0$ ), the trajectory should leave a neighborhood of the minimizer and, eventually, reach the basins of attraction of different minima.

The discontinous diffusion term is given in [3] as

[TABLE]

where $0=S_{1}<T_{1}<S_{2}<T_{2}\dots<S_{N}<T_{N}<S_{N+1}=T$ and $I_{[S_{j},T_{j}]}$ is the characteristic function of the interval $[S_{j},T_{j}]$ . In [3], the length of the intervals $T_{j}-S_{j}$ , and the constants $\sigma_{j}$ are supposedly chosen randomly for all $j=1,\dots,N$ ; therefore, when the characteristic function is $1$ , the minimizer is perturbed by a positive random number for a certain amount of time, and when the noise is off, namely $I_{[S_{j},T_{j}]}=0$ , the method reduces to gradient descent and slowly converges to a local minimum.

As originally proposed, the ID approach is a general methodology, but to make it become a practical method requires making a lot of choices; for example, to decide how long the diffusion process needs to be carried out. In our experimentation of this technique, we used the discrete analog of (1):

[TABLE]

where $W\in N(0,1)^{n}$ . However, when using this technique, we faced the need to adjust too many parameters based on the potential we were trying to minimize, and realized that there were some key aspects to be addressed:

•

the local convergence towards minima, using the gradient descent, was too slow, and a faster method (eventually, Newton-like) was desirable;

•

white noise based diffusion did not account for the local landscape of the potential, and we eventually wanted to modify this with color noise diffusion;

•

criteria were needed to replace choosing the interval length randomly, finding instead a deterministic criterion to switch the noise on and off.

We addressed all of the above concerns in the present paper.

In order to achieve our goal to build up a method which automatically adjusts to the optimization problem, we resorted to exploiting the Hessian’s spectral information.

2.2. $\mathbb{R}^{n}$ via the Hessian

Below, we clarify the structure of regions of $\mathbb{R}^{n}$ where the eigenvalues of $H$ have a specified signature (inertia).

Definition 1.

Given a symmetric matrix $H$ , the inertia of $H$ is the triplet

[TABLE]

where $n_{+},n_{0},n_{-}$ , are the number of positive, zero, or negative, eigenvalues of $H$ , counted with their multiplicities. $H$ is called hyperbolic if $n_{0}(H)=0$ .

Observe that $H:\mathbb{R}^{n}\to\mathbb{R}^{n\times n}$ is a smooth symmetric function of $n$ parameters, hence the reasonings below are valid.

We will always order the eigenvalues of $H$ as $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{n}$ , and $v_{1},\dots,v_{n}$ , will be the corresponding orthogonal eigenvectors. According to $\nu(H)$ , we will also use the notation $V=[v_{1},\ \dots\ ,\ v_{n}]=[V_{+},\ V_{0},\ V_{-}]$ , and will call $V_{+}$ the basis for the positively dominant subspace, or simply (with improper language) the dominant subspace, etc..

Lemma 2.

Consider the set ${\mathcal{P}}:=\{x\in\mathbb{R}^{n}\ :\ y^{T}H(x)y>0\ ,\,\forall y\in\mathbb{R}^{n}\}$ . We have the following properties:

(1)

${\mathcal{P}}$ * is open.* 2. (2)

${\mathcal{P}}=\cup_{i}{\mathcal{P}}_{i}$ , where each ${\mathcal{P}}_{i}$ is open and connected and ${\mathcal{P}}_{i}\cap{\mathcal{P}}_{j}=\emptyset$ for $i\neq j$ . 3. (3)

Each ${\mathcal{P}}_{i}$ is path connected.

Proof.

(1) follows from these considerations.

(i)

The eigenvalues of the function $H$ are continuous functions of $x$ . (This is a standard result).

(ii)

If $A$ is a symmetric positive definite function, and $B=B^{T}$ is such that $\|A-B\|_{2}<\lambda_{n}(A)$ , then $B$ is positive definite. (This is also well known).

Thus, if we take a value $x_{0}$ where $H(x_{0})$ is pos-def, consider the smallest eigenvalue of $H$ as a function of $x$ , and use the continuity of $\lambda_{n}(\cdot)$ , then $H(\cdot)$ must remain positive definite for $x\in B_{r}(x_{0})$ (an open ball centered at $x_{0}$ , of radius $r$ ): $B_{r}(x_{0})=\{y\in\mathbb{R}^{n}\ :\,\|x-y\|<r\}$ , or also $B_{r}(x_{0})=\{x\in\mathbb{R}^{n}\ :\,x=x_{0}+\rho w,\ \|w\|=1,\ \rho<r\}$ .

As far as (2), the observation is that the function $H$ ceases to be positive definite when $\lambda_{n}(x)=0$ . So, we define a set ${\mathcal{P}}_{i}$ as that component of ${\mathcal{P}}$ such that for any two $x,y\in{\mathcal{P}}_{i}$ there is a curve joining $x$ and $y$ such that along this curve $\lambda_{n}>0$ . As above, ${\mathcal{P}}_{i}$ is open, and thus the set ${\mathcal{P}}$ is separated into open connected components ${\mathcal{P}}_{i}$ ’s, and ${\mathcal{P}}_{i}\cap{\mathcal{P}}_{j}=\emptyset$ , for $i\neq j$ .

(3) follows from a classical result in topology, telling that “open connected sets in $\mathbb{R}^{n}$ are path connected”. [It is also possible to argue directly, since, given that the ${\mathcal{P}}_{i}$ ’s are open, an open ball centered at any point $x\in{\mathcal{P}}$ must intersect all coordinate directions]. ∎

Remark 3.

Properties similar to (1)-(2)-(3) above are still true in case the Hessian is hyperbolic. Indeed, considering the set

[TABLE]

*this set is open and the union of (path) connected components. The reason is that perturbation of a hyperbolic Hessian renders a hyperbolic one, with same inertia, as a consequence of the fact that invertible matrices form an open set. *

2.3. Courant’s theorem

As we will see in the following, a main idea of our method is to escape the basin of attraction of a minimizer by searching for a saddle point. Now, it is well-known that if the potential is a function of one real variable, $g:\mathbb{R}\to\mathbb{R}$ , and $x_{1}$ and $x_{2}$ are two strict minima, that is $g^{\prime\prime}(x_{1,2})>0$ , then $g$ must have another critical point $x_{3}$ between $x_{1}$ and $x_{2}$ . However, as soon as we consider a real-valued function of two variables, a similar result does not hold, in general111As an example, consider the function $g_{1}(x,y)=(x^{2}y-x-1)^{2}+(x^{2}-1)^{2}$ : it has two local minima at $(1,2)$ and $(-1,0)$ , and no other critical point.. Nevertheless, under certain conditions the existence of other non-extremal critical points has been proved, and this result, due to Courant, dates back to 1950 (see [9, p.49], where $g$ is only assumed to be $\mathcal{C}^{1}$ ).

Theorem 4.

Suppose that $g$ is coercive222Recall $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is coercive if $\lim_{\|x\|\rightarrow\infty}g(x)=+\infty$ , that is for any constant $M$ there is a constant $R_{M}$ such that $\|g(x)\|>M$ whenever $\|x\|>R_{M}$ . and possesses two distinct strict relative minima $x_{1}$ and $x_{2}$ . Then $g$ possesses a third critical point distinct from $x_{1}$ and $x_{2}$ , characterized by

[TABLE]

*where $\Gamma=\{\Sigma\subset\mathbb{R}^{N};\Sigma\mbox{ is compact and connected and }x_{1},x_{2}\in\Sigma\}$ .

Moreover, $x_{3}$ is not a relative minimizer; that is in every neighborhood of $x_{3}$ , there exists a point $x$ such that $g(x)<g(x_{3})$ .*

Thus, $x_{3}$ in Theorem 4 can be viewed as the set which topologically separates $x_{1}$ and $x_{2}$ .

Theorem 4 is part of “mountain pass theory.” An accessible introduction to this theory and its applications is in [1], a comprehensive treatment is [9], and the report [12] and the work [13] propose numerical methods to approximate mountain pass points (here, the authors use the characterization of mountain pass points as critical points where the (nonsingular) Hessian has exactly one negative eigenvalue).

3. Descent directions and the double-descent

Here we introduce the double descent direction. First, we recall the definition of (gradient) descent and Newton’s directions.

(a)

(Descent direction). Assuming that $\nabla g(x_{0})\neq 0$ , any direction $v$ such that $g(x_{0}+\alpha v)<g(x_{0})$ , for some $\alpha>0$ , is called a direction of descent for the potential $g$ . A trivial computation shows that a direction of descent $v$ must satisfy $(\nabla g(x_{0}))^{T}v<0$ . The classic choice is $v=-\nabla g(x_{0})$ (the so-called gradient descent choice).

(b)

(Newton’s direction). This is the direction resulting from using Newton’s method to solve the problem $\nabla g(x)=0$ . In other words, it is the direction (assuming that $\nabla g(x_{0})\neq 0$ and that the Hessian is invertible) given by $v(x_{0})=-H(x_{0})^{-1}\nabla g(x_{0})$ . We note that this is a descent direction for the functional $G$ at $x_{0}$ , since $\nabla G=H\nabla g$ .

Remark 5.

Of course, we can always normalize a descent (and/or Newton’s) direction to be a vector of norm $1$ .

3.1. Descent direction within a positive definite region

The following result, which is both fundamental and well known (see [5, p.114]), serves as motivation for some of our later algorithmic choices.

Lemma 6.

Let $x_{0}$ be a point where $\nabla g\neq 0$ and let $H(x_{0})$ be positive definite. Then, the Newton’s direction is a direction of descent for $g$ .

Proof.

We need to show that –at $x_{0}$ – we have $(\nabla g)^{T}v<0$ when $v=-H^{-1}\nabla g$ . But this is obvious since $H$ is positive definite. ∎

With the help of Lemma 6, the following is immediate.

Proposition 7.

Suppose that $x_{0}\in\Omega_{0}$ , where $\Omega_{0}$ is a path-connected component where $H$ is positive definite. Then, either $\nabla g(x_{0})=0$ , or there exists a direction $v\in\mathbb{R}^{n}$ , and a scalar $\alpha>0$ , such that both $G(x_{0}+\alpha v)<G(x_{0})$ and $g(x_{0}+\alpha v)<g(x_{0})$ . Further, one can also choose $\tau$ so that both potentials decrease and $H(x_{0}+\tau v)$ is positive definite.

Proof.

We want $v$ such that both of these relations hold at $x_{0}$ :

[TABLE]

Since $\nabla G=H\nabla g$ , the sought relations rewrite as

[TABLE]

Each of the above inequalities defines an open half space, and the Newton’s direction is in both of these. Therefore, the existence of a unit vector $v$ giving us the sought decrease is established, and there exists a scalar $\alpha$ , positive, such that both $G(x_{0}+\alpha v)<G(x_{0})$ and $g(x_{0}+\alpha v)<g(x_{0})$ .

Further, since $H(x_{0})$ is positive definite, then there is an open ball centered at $x_{0}$ and of radius $r>0$ , $B_{r}(x_{0})$ , such that $H(x)$ remains positive definite for any $x\in B_{r}(x_{0})$ . Therefore, there exist $\tau$ so that $H(x)$ is positive definite for $x=x_{0}+\tau v$ . ∎

Remark 8.

Because of Lemma 6, in Proposition 7 we can choose $v$ to be the Newton’s direction.

We note right away that it is often not advisable to select the step-length $\tau$ so that the Hessian remains positive definite. Indeed, in our numerical experiments, doing so often resulted in a severe restriction of the step-length and inefficient computations, and it was quite preferable to allow a decrease in the potentials without forcing a fixed inertia for the Hessian. For this reason, we now define the double-descent direction allowing for the Hessian to be indefinite.

3.2. Descent direction within an indefinite region

Here, we generalize the above result to the case of regions with different Hessian’s inertia.

Proposition 9.

*Let $\Omega$ be a path-connected region where $\nu(H)=\{n_{+},n_{0},n_{-}\}$ for all $x\in\Omega$ , with $n_{+}\geq 1$ . Let $x_{0}\in\Omega$ , and let $V_{+}=\text{span}\{v_{1},\dots,v_{n_{+}}\}$ be the subspace spanned by the first $n_{+}$ eigenvectors of $H(x_{0})$ .

Then, if $V_{+}^{T}\nabla g(x_{0})\neq 0$ , there exists a direction $v\in\mathbb{R}^{n}$ , and a scalar $\alpha>0$ , such that both $G(x_{0}+\alpha v)<G(x_{0})$ and $g(x_{0}+\alpha v)<g(x_{0})$ . Further, if $n_{0}=0$ , i.e. $H(x_{0})$ is hyperbolic, then there exists $\tau>0$ so that $\nu(H(x_{0}+\tau v))=\nu(H(x_{0}))$ .

Finally, in all cases above, the direction $v$ can be taken as $v=-(H_{+}(x_{0}))^{\dagger}\nabla g(x_{0})$ , where $H_{+}$ is the closest positive semidefinite matrix to $H$ ; that is, if $H=V\Lambda V^{T}$ , and $\Lambda=\text{diag}(\Lambda_{+},\Lambda_{0},\Lambda_{-})$ , then $H_{+}=V\Lambda^{+}V^{T}$ , with $\Lambda^{+}=\text{diag}(\Lambda_{+},0_{n_{0}},0_{n_{-}})$ , and thus $(H_{+})^{\dagger}=V_{+}(\Lambda_{+})^{-1}V_{+}^{T}$ .*

Proof.

We want $v$ such that both of these relations hold at $x_{0}$ :

[TABLE]

Considering the direction $v$ given in the statement, we have

[TABLE]

since $V_{+}^{T}\nabla g(x_{1})\neq 0$ . For the same reason, we also have

[TABLE]

Therefore, the existence of a unit vector $v$ giving us the sought decrease is established, and there exists a scalar $\alpha$ , positive, such that both $G(x_{0}+\alpha v)<G(x_{0})$ and $g(x_{0}+\alpha v)<g(x_{0})$ .

Further, if $H(x_{0})$ is hyperbolic, then $\nu(H(x_{0}))=\{n_{+},n_{-}\}$ . Therefore, there is an open ball centered at $x_{0}$ and of radius $r>0$ , $B_{r}(x_{0})$ , such that $\nu(H(x))=\nu(H(x_{0}))$ for any $x\in B_{r}(x_{0})$ . Thus, we can choose $\tau>0$ such that $\nu(H(x_{0}+\tau v))=\nu(H(x_{0}))$ . ∎

Remark 10.

The direction $v=-(H_{+}(x_{0}))^{\dagger}\nabla g(x_{0})$ of Theorem 9 is effectively the Newton’s direction restricted to the subspace associated to the positive eigenvalues.

Summary 11.

To sum up, as long as the point $x_{0}$ is in a region where the Hessian has at least one positive eigenvalue, and $\nabla g$ has a non-trivial component in the subspace spanned by the eigenvectors corresponding to the positive eigenvalues, we can always find a direction $v$ which is of descent for both $G$ and $g$ . If $H(x_{0})$ has no [math] eigenvalue, we can also maintain the inertia of $H$ by taking a step in the direction $v$ ; however, this may be counterproductive (as our computations showed), since it may unduly restrict the step $\tau$ , and it is much more desirable to let the iterate enter and exit regions of different Hessian’s inertia while decreasing the potentials $g$ and $G$ .

Remark 12.

*One more comment is needed about the condition $V_{+}^{T}(x)\nabla g(x)\neq 0$ .

The dimension of the subspace $\text{span}(V_{+})$ is $n_{+}$ , while $\nabla g$ is a vector in $\mathbb{R}^{n}$ . Therefore, in general, the requirement $V_{+}^{T}\nabla g=0$ would define a set of $n_{+}$ equations in the $n$ variables $x\in\mathbb{R}^{n}$ . Generally, these define an $n-n_{+}$ dimensional manifold immersed in $\mathbb{R}^{n}$ . Therefore, we should expect that, at any $x$ , the vector $\nabla g$ will have a nontrivial component in $V_{+}$ . This is the truer the larger is $n_{+}$ . In the limiting case of $n_{+}=n$ , Lemma 6 already told us that. At the same time, if $n_{+}=0$ , then obviously there is no direction $v_{+}$ to begin with. In this case, there is no double descent direction to begin with, and our method (see below) will revert to using the gradient direction. *

4. Double-descent colored-diffusion method

The main idea of our technique is to take advantage of the knowledge of the Hessian inertia, in order to explore the landscape going from a saddle point to a minimum and vice versa, avoiding being trapped in the basins of attraction of the critical points while following an educated path.

There are two types of processes we use: “local zoom-in,” and “basin escaping” methods.

4.1. Reaching a critical point: local search

This part is based on the developments of Section 3. We distinguish between the cases of searching for a minimum or a saddle. In the former case, we have a host of possibilities: the double descent algorithm, gradient-descent, Newton’s method (damped); in the latter, we use a (damped) Newton’s method. Still, we must make some careful implementation choices.

For example, when searching for a minimum, beside the usual concerns on how to choose the step-length (see [5]), we also accounted for these aspects when implementing the double descent method.

(i)

When using the double-descent direction, we demand that both $g$ and $G$ have appreciably decreased.

(ii)

To declare that $\nabla g$ has no meaningful component in the subspace spanned by $V_{+}$ , we used the criterion $\|V_{+}^{T}(x_{k})\nabla g(x_{k})/\|\nabla g(x_{k})\|\|\leq\sqrt{n}/10$ . When this happens, we reverted to the direction $v$ given by the simpler gradient-descent direction, and kept this descent strategy for 5 steps before retrying the double descent direction. Likewise, we reverted to the gradient descent direction when too many damping steps are taken with the double descent direction.

(iii)

Another important consideration is about the stopping criterion (both when searching for a minimum or a saddle). In our implementation we chose the following stopping criterion (always within the maximum allowed number of iterations). We iterate as long as:

$\|\nabla g(x_{k})\|\geq{\mathtt{atol}}+\|\nabla g(x_{0})\|{\mathtt{rtol}}\,\ \text{and}\,\ \|x_{k}-x_{k-1}\|\geq{\mathtt{atol}}+\|x_{0}\|{\mathtt{rtol}}\ .$

4.2. Basin escaping by Color Diffusion

Here we adopt (some of) the ideas in Section 2.1 in order to leave the basin of attraction (for Newton’s method) of a critical point. In particular, compare (4), (5), and (6), with (3); naturally, just as (3) can be viewed as a discretization of the SDE (1), also our equations (4), (5), and (6), can be thought as discretizations of an underlying SDE, in regions where the Hessian inertia and the dominant (respectively, smallest) eigenvalue are not changing.

As seen in Section 3, the double-descent method is designed to lead to a minimum when the Hessian at the starting point is either positive definite or indefinite, and -when properly implemented- it will really be Newton’s method close to a minimum. Instead, when the initial value lies in a region in which both $n_{+}(H)$ and $n_{-}(H)$ are nonzero, we will presume that (damped) Newton method will converge to a saddle point (or perhaps to a maximum); this expectation has effectively been borne out in practice for the vast majority of our experiments.

Regardless, if we are at either a minimum or at a saddle, we need to leave the respective basins of attraction for (damped) Newton method. To do this, we implemented a colored intermittent diffusion method as follows, by reflecting the choices made above to look for either a saddle or a minimum, and the discussion in Section 2.1.

(a)

From a min $x_{0}$ , trying to go to a saddle. Let us first assume that, $\lambda_{1}>\lambda_{2}\geq\cdots\geq\lambda_{n}$ along our iterates. There are three basic steps.

(i)

Select $\alpha$ (usually, $\alpha=1$ ) and generate

[TABLE]

(ii)

Find $h$ such that $|G(\hat{x}_{k+1})|<|G(x_{k})|$ , with $\hat{x}_{k+1}\ =\ x_{k}-h\bigl{(}H^{\dagger}(x_{k})\nabla g(x_{k})\bigr{)}$ , and then

[TABLE]

(iii)

Continue with the diffused damped Newton’s above until the Hessian has some negative eigenvalues or the maximum number of diffusive iteration has been exceeded. At that point, use (damped) Newton’s method. Hence, select the Newton direction $v$ and the step length $h$ to decrease the auxiliary potential $G$ ; say, $G(x_{k}+hv)<G(x_{k})$ . If $g_{m}$ denotes the value of the potential $g$ at the minimum from which we started, we observed that consistently $g(x_{k}+hv)>g_{m}$ which betrays that we are not going back to the starting minimum.

The rationale for the colored noise diffusive step is to move away as quickly as possible from the basin of attraction of the minimum. If $x_{0}$ is a minimum, the standard quadratic approximation in a $\epsilon$ -ball around $x_{0}$ will give:

[TABLE]

and therefore, with $\|y\|=1$ , the fastest increase is for $y=v_{1}$ . In the (very unlikely) case that the dominant eigenvalue has multiplicity greater than $1$ , we select a random vector in the span of the dominant eigenvectors.

(b)

From a saddle $x_{0}$ , trying to go to a min. Let us first assume that $\lambda_{1}\geq\cdots\geq\lambda_{n-1}>\lambda_{n}$ . Even here there are two basic steps, getting out of the saddle and going to a minimum. The second step, see below, can be carried out with the double descent method, or with gradient descent, or possibly with a (damped) Newton approach. In all cases, first we use colored diffusion steps to move out of the saddle.

(i)

Select $\alpha$ (usually, $\alpha=1$ ) and generate

[TABLE]

As before, the choice of the colored noise diffusive step is to move away as quickly as possible from the basin of attraction of the saddle, while decreasing the potential $g$ . If $x_{0}$ is a saddle, in a $\epsilon$ -ball around $x_{0}$ we have:

[TABLE]

and therefore, with $\|y\|=1$ , the fastest decrease is for $y=v_{n}$ .

(ii)

Double descent. If $\nabla g(x_{k})$ has a meaningful component in the direction of $V_{+}(x_{k})$ , do (i-a), otherwise do (i-b).

(ii-a)

Find $h$ such that both $|G(\hat{x}_{k+1})|<|G(x_{k})|$ , and $|g(\hat{x}_{k+1})|<|g(x_{k})|$ , with $\hat{x}_{k+1}\ =\ x_{k}-h\bigl{(}H_{+}^{\dagger}(x_{k})\nabla g(x_{k})\bigr{)}$ , and then

[TABLE]

(ii-b)

Find $h$ such that $|g(\hat{x}_{k+1})|<|g(x_{k})|$ , with $\hat{x}_{k+1}\ =\ x_{k}-h\ \nabla g(x_{k})$ , and then

[TABLE]

(iii)

Continue with the diffused Double Descent above until the Hessian has all positive eigenvalues or the maximum number of diffusive iteration has been exceeded. At that point, use Double Descent method. Hence, select the direction $v$ and the step length $h$ to decrease the potential $g$ auxiliary potential $G$ ; say, $g(x_{k}+hv)<g(x_{k})$ and $G(x_{k}+hv)<G(x_{k})$ .

Again, in the (very unlikely) event that the smallest eigenvalue has multiplicity greater than $1$ , we select a random unit vector in the corresponding subspace.

5. The method at a glance

In the previous sections, we presented only the two key components of the method, namely, the local search and the basin escaping. Here, we give a broader idea of the method.

(0)

The very first initial datum $x_{0}$ is randomly chosen (within a region of interest). A local search for a minimum starts with the double descent method, and the point found is stored in a Table of critical points.

(1)

A point from the Table is randomly selected. Colored diffusion to escape the basin of this critical point is performed (see Figure 2), followed by a local search for the next critical point. The new point is stored in the Table 333. The same point can thus appear multiple times in the Table., and step (1) is thus repeated until a predefined number of critical points is found.

5.1. Sketch of the algorithm

(1)

Choose a random point $x_{0}$ in the search region. 2. (2)

Look for a minimum $x^{min}$ by using the double-descent method. 3. (3)

Store $x^{min}$ in the list of critical points. 4. (4)

LOOP BEGINS - to be repeated for a preassigned number of iterations.

(a)

Randomly choose a point $x$ from the list. 2. (b)

If $x$ is a minimum, start diffusion according to (4) until $n_{-}(H)\neq 0$ (see Figure 3(a)) or maximim number of diffusive steps is exceeded.

Apply (damped) Newton method to find a saddle point.

Store the saddle in the list of critical points. 3. (c)

If $x$ is a saddle point, take diffusive steps according to (5) (or (6)), until $n_{-}(H)=0$ (see Figure 3(b)) or the maximum number of diffusive steps is exceeded. Apply double descent to find a minimizer.

Store the minimum in the list of critical points.

LOOP ENDS

5.2. Computational considerations

A main drawback of Newton’s type technique, hence also of the double descent method, is the need to form, evaluate, and decompose, the Hessian. Except for problems where the Hessian is simple to evaluate, and very structured (e.g., tridiagonal), this can be very expensive and it restricts applicability of Hessian based techniques to small dimension (say, up to a few hundred variables on a typical laptop). We also note that for some problems evaluating the Hessian is itself an expensive task; e.g., this is the case for interatomic potentials, such as the Morse and Lennard-Jones potentials (see Section 6). Although our purpose in this work has not been to deal specifically with efficient implementations, but rather to give ways to explore the phase space (the landscape), for large (possibly sparse) problems we have experimented with Lanczos techniques, and a subspace version of Newton’s method, whereby we project the Hessian in the direction of the most dominant eigenvalues (positive and negative). We will report on these apects in other works.

An important consideration pertains to the colored noise diffusion. To perform this diffusion, and to monitor when to stop it, it is straightforward to bypass the Hessian factorization. In fact, to form the color noise, and to decide when to stop diffusion, we only need the two eigen-directions $v_{1}$ and $v_{n}$ . These are inexpensive to obtain with a well designed Lanczos technique (e.g., eigs in Matlab), by asking for (respectively) the largest and smallest eigenvalues. This feature is particularly useful when using just a damped Newton’s method with color noise (as in our basic intermittent diffusion method from a min to a saddle), since the linear systems arising during the iteration are then solved without resorting to a full eigen-decomposition. To elucidate, and to account for the possibility of singular Hessian, we first form the QR factorization with column pivoting of the Hessian: $HP=QR$ with diagonal of $R$ ordered in decreasing magnitude. Then solve the resulting triangular system, possibly for the minimum norm solution (if the Hessian was singular, which is betrayed by $R_{nn}=0$ ).

Finally, as seen in Section 4.2, we use a variable stepsize controlled through the requirement of moving in the descent direction(s). The initial stepsize is set to $1$ , and the stepsize is always required to remain in $[2^{-26},2^{5}]$ , where $2^{-26}=\sqrt{\tt{eps}}$ , the square root of the machine precision. When one step is taken in the desired direction, and the computation is immediately accepted, then the stepsize is doubled; if the computation is rejected, the stesize is halved. If we reach the minimum allowed stepsize, the algorithm halts and restarts from a different critical point in the Table (or a different random point, the very first time).

6. Applications and examples

In this section, we show performance of our method on several problems, both standard model problems, with an illustrative purpose and to validate the method on different landscape features, and those arising from chemical potentials.

6.1. Test Problems

6.1.1. An illustrative example

Consider the following elementary potential:

[TABLE]

It has 2 minimizers, located at $(-1,0)$ and $(1,0)$ , and a saddle point at $(0,1)$ . Starting from a random point $x_{0}$ , the Double Descent technique quickly leads to a minimizer, and the diffusion combined with Newton method allows to find the saddle point, from which the algorithm looks again for a minimizer. By using our technique, and asking the algorithm for at most 4 critical points, we were able to find, in a single run, the two minimizers and the saddle point. Indeed, the method is behaving exactly like we were hoping: first, it converges to $(-1,0)$ , then it goes through the saddle point and from there localizes the other minimum at $(1,0)$ , and then it goes back to the saddle point. On average we counted 3 diffusive steps and $8.25$ local search iterations.

6.1.2. Shubert Function

The Shubert function is a highly multimodal potential: it has several local minima and many global ones. Naturally, the function presents many saddle points and maxima. Moreover, the global minima and the global maxima are extremely close, and this is one of the reason why it may be difficult to find the global minimizers.

Although our algorithm is designed to find minima and saddles, it could end up finding maxima as well, due to the Newton’s basins of attractions, which are nontrivial. Schubert’s potential,

[TABLE]

is represented in Figure 4(a). Figure 4(b) is a zoom of the contour plot around the global minimizer, and the points are the minimizers found by just applying the double descent technique, starting from random initial values. While finding the global minimizer by applying a deterministic technique requires a starting point in its neighborhood, the ability to explore the landscape eliminates this necessity. One single run of our technique, asking for 100 possible critical points, gave us 45 minima, 3 of which were global, at different locations. On average, for attempt, we counted one diffusive step and 5.9 local search iterations.

6.1.3. Biggs Function

Let us consider the following function:

[TABLE]

where $t_{i}=0.1\,i$ , $y_{i}=e^{-t_{i}}-5e^{10\,t_{i}}$ .

There are two critical points: a minimum at $(1,10)$ and a saddle at $(16.7047,16.7047)$ , as shown in figure 5.

The challenges in this problem are the flat landscape of the potential and the presence of narrow regions in which the Hessian is positive (negative) definite, but that do not contain a minimum (maximum), as shown in Figure 6.

Asking for 2 critical points, we were able to find, in a single run, both the minimimum and the saddle point. On average, we performed 2.5 diffusive steps and $18$ local search iterations.

6.1.4. Camel Function

Let us consider the following function:

[TABLE]

This function is a standard test function for global optimization, but it is also useful as a test for the mountain passes’ search (see [12]). In fact, our algorithm can be also used to compute mountain passes. As in [12], these are characterized as critical points that satisfy the Palais-Smale condition and whose Hessian has exactly one negative eigenvalue, that is $n_{-}(H)=1$ .

Consider the region $[-2,2]\times[-1,1]$ . Here, there are $14$ critical points: $6$ mountain passes, $6$ minima and $2$ maxima. Our method has no difficulty in computing all of these points in one single execution. Results are summarized in Figure 7 and in Table 1.

6.1.5. Rosenbrock Function

This function is given by

[TABLE]

and it has a global minimum value of [math] at at $(1,1,...,1)$ , for any value of $N$ . The Hessian is very inexpensive to compute and factor, being tridiagonal. The global minimum lies inside a long, narrow, parabolic shaped flat valley; while finding the valley is trivial, detecting the global minimizer is not.

We take $N=50$ . A Monte-Carlo gradient-descent technique using $20$ random initial guesses did not find the global minimizer. A single implementation of our technique with the possibility to find at most $20$ critical points, found $16$ minimizers, $9$ of which were global. On average, per attempt to find a critical point, we counted $1.8$ diffusive steps and $136$ local search steps. For comparison sake, the Matlab routine GlobalSearch found the global minimizer, whereas the Matlab simulated annealing routine simulannealbnd gave a best value for the minimum of $40.7188$ .

6.2. Chemical Potentials

An interesting application of global optimization is protein folding. Mathematically, this consists in finding the equilibrium configuration of $d$ atoms, assuming that the forces between the atoms are known. In the end, one has to find the minimizer of a potential energy function depending on $3d$ variables.

The Lennard-Jones and Morse clusters are two well-known systems of this kind and they have been extensively studied, and the minima tabulated. For example, the (currently best) global minima for Lennard-Jones and Morse potential can be found at the database [4]. These results were obtained with the methods presented in [6] and [11]. Both are “basin-hopping” techniques; they exploit the funneling structure of the potentials (that is, the global minimizer lies at the bottom of a monotonically descending sequence of minimizers), and make a number of choices explicitly based on the specific problem at hand. For example, the authors perform a continuation based upon an optimal configuration reached with $d$ atoms to initiate the search for $(d+1)$ atoms.

With no pretense of comparing to these other results, below we present some of the results we obtained applying our general technique on both Morse and Lennard-Jones potentials. These potentials depend on the mutual distances (in $\mathbb{R}^{3}$ ) between the atoms, namely $r_{ij}=\|P_{j}-P_{i}\|$ , with $1\leq i<j\leq d$ and $P_{k}=(x_{k},y_{k},z_{k})$ , for all $k=1,...,d$ .

Given obvious symmetries in the problem, we imposed the following location constraints: we fix one atom at the origin ( $P_{1}=(0,0,0)$ ), another one on the $x$ axis ( $P_{2}=(x_{2},0,0)$ ), and a third one in the $xy$ -plane ( $P_{3}=(x_{3},y_{3},0)$ ). All other atoms are unconstrained. With this setup, each configuration will be identified by the vector of coordinates

[TABLE]

and the dimension of the problem becomes $N=3d-6$ .

In our experiments we compute the gradient analytically, and the Hessian numerically, by forward finite differences.

6.2.1. Lennard-Jones Potential

This is defined as follows:

[TABLE]

where $\varepsilon$ and $2^{1/6}\sigma$ are the pair equilibrium well depth and separation respectively. We take $\varepsilon=\sigma=1$ .

This is a problem where a simple gradient descent technique, coupled with a Monte-Carlo randomization, performs reasonably well; as a matter of fact, our own double-descent method quite often automatically reverts to gradient descent. That said, there are specific configurations where the problem is much more difficult with our method than with a simple gradient descent (for example, $d=38$ , a value where the optimal configuration does not have a regular (Mackay) hycosahedral structure). On this problem, the basin-hopping techniques of [11] is a more effective way to find the global minima, since the knowledge of the potential landscape is exploited in the algorithm itself; our method is really a landscape exploration approach, and our method failed, for example, for $d=38$ . Nevertheless, the method worked well for smaller values of $d$ , as reported in Table 2.

6.2.2. Morse Potential

The Morse potential is defined as follows:

[TABLE]

where $\rho$ is a parameter which determines the width of the well. We treat this problem as truly unconstrained, and this may create difficulties to descent techniques, since descent directions may well identify points “at infinity” (i.e., some coordinates grow unbounded); e.g., this happens to the Matlab code GlobalSearch. A further difficulty is that global minima become harder to locate when $\rho$ increases. In Table 3, we report the results of our method for $11$ atoms and $\rho=3,6,10,14$ ; our minima match those of [4].

For the sake of comparison, we remark that neither Matlab function GlobalSearch nor simulannealbnd gave acceptable results. Namely, they produced the results in Table 4.

6.3. Nonlinear Systems

Our technique can also be used to solve nonlinear systems. Indeed, a nonlinear system

[TABLE]

can be reformulated as an optimization problem, simply by considering the objective function given by

[TABLE]

In this case, $\nabla g(x)=J_{S}(x)^{T}S(x)$ , where $J_{S}$ indicates the Jacobian of $S$ . Therefore, the critical points of the objective function $g$ , correspond to both the zeros of $S(x)$ , and the points for which $S(x)$ is in the left null space of the Jacobian.

6.3.1. Boggs system

Given the nonlinear system

[TABLE]

we construct the objective function according to (15).

The solutions of the problem are $(-1,2)$ , $(0,1)$ and $(-\sqrt{2}/2,3/2)$ , illustrated in Figure 8(a).

Numerical results from a single run of the algorithm, asking for at most 20 critical points, are shown in Table 5 and illustrated in Figure 8.

7. Conclusions

In this work, we presented a method apt at exploring the landscape of a smooth (at least $\mathcal{C}^{2}$ ) potential, in order to locate global minima. The new components of our method are a double-descent technique (to locate minima) and a colored intermittent diffusion (to escape basin of attraction of minima and other critical points). The idea of the technique is to use Hessian information in order to bias the exploration of the landscape. We illustrated performance of our technique on several problems from the literature, observing that our method is able, in most cases, to adapt to different features of the potential. We believe that our method can be easily taught in an optimization course, along with other well established techniques.

Bibliography13

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] J. Bisgard, Mountain Passes and Saddle Points , Siam Review, 57(2), pp.275-292, 2015
2[2] D. Bertsimas and J. Tsitsiklis, Simulated Annealing , Statistical Science, 8(1), pp. 10-15, 1993.
3[3] S.N.Chow and T.S.Yang and H.M.Zhou, Global Optimizations by Intermittent Diffusion , Chaos, CNN, Memristors and Beyond: A Festschrift for Leon Chua, Edited by Adamatzky et al. Published by World Scientific Publishing Co. Ltd., 2013.
4[4] D.J. Wales, J.P.K. Doye, A. Dullweber, M.P. Hodges, F.Y. Naumkin, F. Calvo, J. Hernandez-Rojas and T.F. Middleton, The Cambridge Cluster Database , http://www-wales.ch.cam.ac.uk/CCD.html
5[5] J.E. Dennis, R.B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations . SIAM, Classics in Applied Mathematics 16. 1996.
6[6] Doye, J.P.K., Leary, R.H., Locatelli, M., Schoen, F.: Global optimization of Morse clusters by potential energy transformations. INFORMS J. Comput. 16(4), 371-379, 2004.
7[7] P. E. Gill, W. Murray, M. H. Wright. Practical Optimization . Academic Press, London, 1982.
8[8] K. Holmström. New Optimization Algorithms and Software. Theory of Stochastic Processes, 5(21), 55-63, 1999.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Double Descent and Intermittent Color Diffusion

Abstract.

Key words and phrases:

1991 Mathematics Subject Classification:

1. Introduction

2. Preliminaries

2.1. Intermittent Diffusion

2.2. Rn\mathbb{R}^{n}Rn via the Hessian

Definition 1**.**

Lemma 2**.**

Proof.

Remark 3**.**

2.3. Courant’s theorem

Theorem 4**.**

3. Descent directions and the double-descent

Remark 5**.**

3.1. Descent direction within a positive definite region

Lemma 6**.**

Proof.

Proposition 7**.**

Proof.

Remark 8**.**

3.2. Descent direction within an indefinite region

Proposition 9**.**

Proof.

Remark 10**.**

Summary 11**.**

Remark 12**.**

4. Double-descent colored-diffusion method

4.1. Reaching a critical point: local search

4.2. Basin escaping by Color Diffusion

5. The method at a glance

5.1. Sketch of the algorithm

5.2. Computational considerations

6. Applications and examples

6.1. Test Problems

6.1.1. An illustrative example

6.1.2. Shubert Function

6.1.3. Biggs Function

6.1.4. Camel Function

6.1.5. Rosenbrock Function

6.2. Chemical Potentials

6.2.1. Lennard-Jones Potential

6.2.2. Morse Potential

6.3. Nonlinear Systems

6.3.1. Boggs system

7. Conclusions

2.2. $\mathbb{R}^{n}$ via the Hessian

Definition 1.

Lemma 2.

Remark 3.

Theorem 4.

Remark 5.

Lemma 6.

Proposition 7.

Remark 8.

Proposition 9.

Remark 10.

Summary 11.

Remark 12.