Optimization of the Sherrington-Kirkpatrick Hamiltonian
Andrea Montanari

TL;DR
This paper introduces a new message-passing algorithm that efficiently approximates the maximum of the Sherrington-Kirkpatrick Hamiltonian, closely matching the optimal value with high probability for large systems.
Contribution
It presents a novel message-passing algorithm with quadratic time complexity that achieves near-optimal solutions for the SK model, extending its applicability to low-temperature regimes.
Findings
Algorithm achieves (1-ε) approximation of the optimum
Time complexity is quadratic in system size
Constructs approximate solutions to TAP equations at low temperature
Abstract
Let be a symmetric random matrix with independent and identically distributed Gaussian entries above the diagonal. We consider the problem of maximizing over binary vectors . In the language of statistical physics, this amounts to finding the ground state of the Sherrington-Kirkpatrick model of spin glasses. The asymptotic value of this optimization problem was characterized by Parisi via a celebrated variational principle, subsequently proved by Talagrand. We give an algorithm that, for any , outputs such that is at least of the optimum value, with probability converging to one as…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\stackMath
Optimization of the Sherrington-Kirkpatrick Hamiltonian
Andrea Montanari Department of Electrical Engineering and Department of Statistics, Stanford University
Abstract
Let be a symmetric random matrix with independent and identically distributed Gaussian entries above the diagonal. We consider the problem of maximizing over binary vectors . In the language of statistical physics, this amounts to finding the ground state of the Sherrington-Kirkpatrick model of spin glasses. The asymptotic value of this optimization problem was characterized by Parisi via a celebrated variational principle, subsequently proved by Talagrand. We give an algorithm that, for any , outputs such that is at least of the optimum value, with probability converging to one as . The algorithm’s time complexity is . It is a message-passing algorithm, but the specific structure of its update rules is new.
As a side result, we prove that, at (low) non-zero temperature, the algorithm constructs approximate solutions of the Thouless-Anderson-Palmer equations.
1 Introduction and main result
Let be a random matrix from the ensemble. Namely, and is a collection of independent random variables with and for . We are concerned with the following optimization problem (here is the standard scalar product)
[TABLE]
From a worst-case perspective, this problem is NP-hard and indeed hard to approximate within a sublogarithmic factor [ABE*+*05]. For random data , the energy function is also known as the Sherrington-Kirkpatrick model [SK75]. Its properties have been intensely studied in statistical physics and probability theory for over 40 years as a prototypical example of complex energy landscape and a mean field model for spin glasses [MPV87, Tal10, Pan13b]. Generalizations of this model have been used to understand structural glasses, random combinatorial problems, neural networks, and a number of other systems [EVdB01, MPZ02, WL12, Nis01, MM09].
In this paper we consider the computational problem of finding a vector that is a near optimum, namely such that . Under a widely believed assumption about the structure of the associated Gibbs measure (more precisely, on the support of the asymptotic overlap distribution) we prove that, for any there exists an algorithm with complexity that –with high probability– outputs such a vector.
In order to state our assumption, we need to take a detour and introduce Parisi’s variational formula for the value of the optimization problem (1.1). Let be the space of probability measures on the interval endowed with the topology of weak convergence. For , we will write (with a slight abuse of notation) for its distribution function. For , consider the following parabolic partial differential equation (PDE) on
[TABLE]
It is understood that this is to be solved backward in time with the given final condition at . Existence and uniqueness where proved in [JT16]. We will also write to emphasize the dependence of the solution on the measure . The Parisi functional is then defined as
[TABLE]
The relation between this functional and the original optimization problem is given by a remarkable variational principle, first proposed by Parisi [Par79] and established rigorously, more than twenty-five years later, by Talagrand [Tal06b], and Panchenko [Pan13a].
Theorem 1** (Talagrand [Tal06b]).**
Consider the partition function . Then we have, almost surely (and in )
[TABLE]
The following consequence for the optimization problem (1.1) is elementary, see e.g. [DMS17].
Corollary 1.1**.**
We have, almost surely
[TABLE]
Remark 1.1**.**
The limit on the right-hand side of Eq. (1.5) can be removed by defining a new variational principle directly ‘at ’. Namely, the right-hand side of Eq. (1.5) can be replaced by where is a modification of and the minimum is taked over a suitable functional space [AC17]. In this paper we use the formulation, but it should be possible to work directly at .
Existence and uniqueness of the minimizer of were proved in [AC15] and [JT16], which also proved that is strongly convex. We will denote by the unique minimizer, and refer to it as the ‘Parisi measure’ or ‘overlap distribution’ at inverse temperature . Our key assumption will be that –at large enough – the support of is an interval .
Assumption 1**.**
There exist such that, for any , the function is strictly increasing on , where and .
This assumption (sometimes referred to as ‘continuous replica symmetry breaking’ or ‘full replica symmetry breaking’) is widely believed to be true (with ) within statistical physics [MPV87]. In particular, this conjecture is supported by high precision numerical solutions of the variational problem for [CR02, OSS07, SO08]. Rigorous evidence was recently obtained in [ACZ17]. Addressing this conjecture goes beyond the scope of the present paper.
We are now in position to state our main result.
Theorem 2**.**
Under Assumption 1, for any there exists an algorithm that takes as input the matrix , and outputs , such that the following hold:
The complexity (floating point operations) of the algorithm is at most .
We have , with high probability (with respect to ).
In other words, on average, the optimization problem (1.1) is much easier than in worst case. Of course, this is far from being the only example of this phenomenon (a gap between worst case and average case complexity). However, it is a rather surprising example given the complexity of the energy landscape . Its proof uses in a crucial way a fine property of the associated Gibbs measure, namely the support overlap distribution.
Remark 1.2** (Computation model).**
For the sake of simplicity, we measure complexity in floating point operations. However, all operations in our algorithm appear to be stable and it should be possible to translate this result to weaker computation models.
We also assume that we can choose one value of the inverse temperature , and query the distribution and the PDE solution as well as its derivatives , at specified points , with each query costing operations.
This is a reasonable model for two reasons: The PDE (1.2) is independent of the instance, and can be solved to a desired degree of accuracy only once. This solution can be used every time a new instance of the problem is presented. The function is uniformly continuous [Gue03] and strongly convex [AC15, JT16]. Further the PDE solution is continuous in and can be characterized as fixed point of a certain contraction [JT16]. Because of these reasons we expect that an oracle to compute , , to accuracy can be implemented in operations, for a constant.
Beyond Theorem 2, our general analysis allows to prove an additional fact that is of independent interest. Namely, for any , our message passing iteration constructs an approximate solution of the celebrated Thouless, Anderson, Palmer (TAP) equations [MPV87, Tal10].
The bulk of this paper is devoted to the case of Gaussian matrices . However, the class of algorithms we use enjoys certain universality properties, first established in [BLM15]. These properties can be used to generalize Theorem 2 to the case of symmetric matrices with independent subgaussian entries. We refer to Section 5 for a statement of of this universality result, and limit ourselves to state a consequence of Theorem 2 for the MAXCUT problem.
Let be an Erdös-Renyi random graph with edge probability {\mathbb{P}}\big{\{}(i,j)\in E_{n}\big{\}}=p. A random balanced partition of the vertices (which we encode as a vector ) achieves a cut , and simple concentration argument implies that the MAXCUT has size . In fact, it follows from [DMS17] that111In [DMS17], the same result is shown to hold for sparser graphs, as long as the average degree diverges: . , where is the prediction of Parisi’s formula (i.e. the right-hand side of ((1.4))). In other words, MAXCUT on dense Erdös-Renyi random graphs is non-trivial only once we subtract the baseline value . As a corollary of Theorem 2 we can approximate this subtracted value arbitrarily well.
Corollary 1.2**.**
Under Assumption 1, for any there exists an algorithm (with complexity at most ), that takes as input an Erdös-Renyi random graph , and outputs , such that
[TABLE]
The rest of this section provides further background. In Section 2 we describe and analyze a general message passing algorithm, which we call incremental approximate message passing (IAMP). We believe this algorithm is of independent interest and can be applied beyond the Sherrington-Kirkpatrick model. In Section 3 we use this approach to prove Theorem 2. In Section 4 we show that the same message passing algorithm of Section 2 produces approximate solutions of the TAP equations. Finally, Section 5 discusses a generalization of Theorem 2 using universality. The impatient reader, who is interested in a succinct description of the algorithm (with some technical bells and whistles removed), is urged to read Appendix B.
1.1 Further background
As mentioned above –under suitable complexity theory assumptions– there is mo polynomial-time algorithm that approximates the quadratic program (1.1) better than within a factor , for some [ABE*+*05]. Little is known on average-case hardness, when is drawn from one of the random matrix distributions considered here. As an exception, Gamarnik [Gam18] proved that exact computation of the partition function is hard on average.
A natural approach to the quadratic program (1.1) would be to use a convex relaxation. A spectral relaxation yields , and hence is not tight for large . This can be compared to a numerical evaluation of Parisi’s formula which yields [CR02, Sch08]. Rounding the spectral solution yields a . Somewhat surprisingly, the simplest semidefinite programming relaxation (degree of the sum-of-squares hyerarchy), does not yield any improvement (for large ) over the spectral one [MS16]. After an initial version of this paper was posted, [BKW19] obtained rigorous evidence that higher order relaxations fail as well.
Theorem 2 was conjectured by the author in 2016 [Mon16], based on insights from statistical physics [CK94, BCKM98]. The same presentation also outlined the basic strategy followed in the present paper, which uses an iterative ‘approximate message passing’ (AMP) algorithm. This type of algorithms were first proposed in the context of signal processing and compressed sensing [Kab03, DMM09]. Their rigorous analysis was developed by Bolthausen [Bol14] and subsequently generalized in several papers [BM11, JM13, BLM15, BMN17]. In this paper we introduce a specific class of AMP algorithms (‘incremental AMP’) whose specific properties allow us to match the result predicted by Parisi’s formula.
The fundamental phenomenon studied here is expected to be quite general. Namely objective functions with overlap distribution having support of the form are expected to be easy to optimize. In contrast, if the support has a gap (for instance, has the form for some ), this is considered as an indication of average case hardness. This intuition originates within spin glass theory [MPV87]. Roughly speaking, the structure of the overlap distribution should reflect the connectivity properties of the level sets . This intuition was exploited in some cases to prove the failure of certain classes of algorithms in problems with a gap in the overlap distribution, see e.g. [GS14].
Important progress towards clarifying this connection was achieved recently in two remarkable papers [ABM18, Sub18].
Addario-Berry and Maillard [ABM18] study an abstract optimization problem that is thought to capture some key features of the the energy landscape of the Sherrington-Kirkpatrick model, the so-called ‘continuous random energy model.’ They prove that an approximate optimum can be found in time polynomial in the problem dimensions. From an optimization perspective, the random energy model is somewhat un-natural, in that specifying an instance requires memory that is exponential in the problem dimensions.
Subag [Sub18] considers the -spin spherical spin glass. Roughly speaking, this can be described as the problem of optimizing a random smooth function (which can be taken to be a low-degree polynomial) over the unit sphere. Subag relaxes this problem by extending the optimization over the unit ball, and proves that this objective function can be optimized efficiently by following the positive directions of the Hessian. The solution thus constructed lies on the unit sphere and thus solves the un-relaxed problem. The mathematical insight of [Sub18] is beautifully simple, but uses in a crucial way the spherical geometry. While it might be possible to generalize the same argument to the hypercube case (e.g., using the generalized TAP free energy of [MV85, CPS18]) this extension is far from obvious. In particular, uniform control of the Hessian is not as straightforward as in [Sub18].
The algorithm presented here is partially inspired by [Sub18] (in particular, a key role is played by approximate orthogonality of the updates), but its specific structure is dictated by the message passing viewpoint. Thanks to the technique of [Bol14, BM11, JM13, BMN17], its analysis does not require uniform control and is relatively simple.
1.2 Notations
Given vectors , we denote by their scalar product and by the associated norm. Given a function , and vectors we write for the vector in with components . The empirical distribution of the coordinates of a vector of vectors is the probability measure on defined by
[TABLE]
In other words, if we arrange the vectors in a matrix in , denotes the probability distribution of a uniformly random row of . In the case of a single vector (i.e. for ), this reduces to the standard empirical distribution of the entries of . We say that a function is pseudo-Lipschitz if .
Given two probability measures , on , we recall that their Wasserstein distance is defined as
[TABLE]
where the infimum is taken over all the couplings of and (i.e. joint distributions on whose first marginal coincides with , and second with . For a sequence of probability measures , and on , we say that converges in Wasserstein distance to (and write ) if . It is well known that if and only if for all bounded Lipschitz functions , and for [Vil08, Theorem 6.9]. Given a sequence of random variables , we write or to state that converge in probability to .
We will sometimes be interested in double limits of sequences of random variables. If is a sequence indexed by and is a constant,
[TABLE]
whenever converges in probability to a non-random quantity as , and .
2 A general message passing algorithm
Our algorithm is based on the following approximate message passing (AMP) iteration.
AMP iteration
Consider a sequence of (weakly differentiable) functions , and a non-random initialization and additional vector with (where is any probability distribution on with finite second moment ). The AMP iteration is defined by letting, for ,
[TABLE]
It will be understood throughout that for .
Proposition 2.1**.**
Consider the AMP iteration (2.1), and assume to be Lipschitz continuous. Then for any , and any pseudo-Lipschitz function , we have
[TABLE]
Here is a centered Gaussian process independent of with covariance determined recursively via
[TABLE]
This proposition follows immediately from the general analysis of AMP algorithms developed in [JM13, BMN17], cf. Appendix A.
We next consider a special case of the general AMP setting.
Incremental AMP (IAMP)
Fix , and functions , , . We consider the general iteration (2.1), with the following choice of functions (independent of ):
[TABLE]
where, for , . Following our convention for , we set for .
We note that, by Eq. (2.4), is indeed a function of , and therefore is a function of as stated.
Lemma 2.2** (State evolution for Incremental AMP).**
Consider the incremental AMP iteration, and assume to be Lipschitz continuous and bounded. Then for any , and any pseudo-Lipschitz function , we have
[TABLE]
(The double limit is to be interpreted as defined in the Notations section.) Here is a centered Gaussian process independent of , with independent entries, with variance given recursively by
[TABLE]
Proof.
Consider Eqs. (2.4), (2.5), and note that, for any , is a bounded Lipschitz function of (because bounded Lipschitz functions are closed under sum, product, and composition). Hence defined in (2.4) is Lipschitz continuous and we can therefore apply Proposition 2.1 to get
[TABLE]
Here is a Gaussian process with covariance determined by Eq. (2.3). We next claim the following:
for (and we set ). 2. 2.
for each as .
With these two claims, the statement of the lemma follows by dominated convergence.
To prove claim 1 note that, by symmetry we only have to consider the case . The proof is by induction over . For there is nothing to prove. Assume next that the claim holds up to a certain , and consider for . By Eq. (2.3) we have (dropping for simplicity the superscripts from the random variables)
[TABLE]
Here the second equality follows from the induction hypothesis.
To prove claim 2, note that satisfies the recursion that follows from Eq. (2.3), namely
[TABLE]
Also note that for some polynomial independent of . Hence the claim follows by applying recursively dominated convergence. ∎
Remark 2.1**.**
The use of truncation in the definition (2.4) is dictated by the need to ensure that is Lipschitz, and to be able to apply Proposition 2.1. We believe that the conclusion of Proposition 2.1 holds under weaker assumptions (e.g. locally Lipschitz with polynomial growth). Such a generalization would allow to replace by in Eq. (2.4), and hence get rid of the parameter in our algorithm.
We are now in position of defining our candidate for a near optimum of the problem (1.1). We fix and define (recalling the definition of in Eqs. (2.4), (2.5))
[TABLE]
Note that this vector depends on parameters , and on the functions . Parameters and must be taken (respectively) small enough and large enough (but independent of ). The next section will be devoted to choosing and the functions . In this section we will establish some general properties of (for small and large ).
Lemma 2.3**.**
Consider the incremental AMP iteration, and assume to be Lipschitz continuous and bounded. Further assume , , to exist and be Lipschitz continuous. Define the random variable
[TABLE]
Then we have, for any pseudo-Lipschitz function ,
[TABLE]
Proof.
Equation (2.15) follows immediately from Lemma 2.2 upon noticing that is a pseudo-Lipschitz function of , …, .
In order to prove Eq. (2.16), we will write , and . We further notice that, for ,
[TABLE]
Here and below the random variables are defined as in the proof of Lemma 2.2. On the other hand
[TABLE]
Note that we applied Lemma 2.2 to a non-Lipschitz function. The limit holds nevertheless by a standard weak convergence argument (namely, using upper and lower Lipschitz approximations of the indicator function). We therefore conclude that (using , as ):
[TABLE]
Next notice that, for ,
[TABLE]
By a similar argument, always for ,
[TABLE]
On the other hand
[TABLE]
By the AMP iteration, we know that . Hence, using the above limits, for ,
[TABLE]
We finally can compute
[TABLE]
∎
In the case of models with full replica symmetry breaking, it is natural to consider the limit of small step size . This limit is described by a stochastic differential equation (SDE) described below.
SDE description.
Consider Lipschitz functions , with . Let be a standard Brownian motion. We define the process via
[TABLE]
with initial condition . Equivalently
[TABLE]
where the integral is understood in Ito’s sense. Existence and uniqueness of strong solutions of this SDE is given –for instance– in [Øks03, Theorem 5.2.1].
Lemma 2.4**.**
Given Lipschitz functions , with and bounded, let be the process defined above. Assume for all . Further consider the state evolution iteration of Eq. (2.7), whereby is defined recursively via
[TABLE]
Then, there exists a coupling of and such that
[TABLE]
(Here is a constant depending only on the bounds on , and on . Further the error is bounded as for the same constant.)
Proof.
Throughout this proof, we will write and denote by a generic constant that depends on the bounds on , and can change from line to line. Note that, by construction, for all , and therefore . Hence we can construct the discrete and continuous processes on the same space by letting .
We then decompose the difference between the two processes as
[TABLE]
By taking the second moment, and using the fact that is measurable on and is measurable on , we get
[TABLE]
Next notice that by the boundedness of , we have . Let . Assuming without loss of generality ,
[TABLE]
The same bound holds for . Substituting above, we get
[TABLE]
This implies bound as stated in (2.38).
In order to prove Eq. (2.39), note that
[TABLE]
Hence
[TABLE]
Let , and write
[TABLE]
Therefore
[TABLE]
The bound of Eq. (2.39) follows since
[TABLE]
Finally, Eq. (2.40) follows by the same estimates. ∎
We now collect the main findings of this section in a theorem. This characterizes the values of the objective function achievable by the above algorithm.
Theorem 3**.**
Let be Lipschitz continuous, with and bounded, and define the process using the SDE (2.35) with initial condition . Assume for all . Further assume to exist and be Lipschitz continuous.
Define the incremental AMP iteration , and let be given by Eq. (2.13). Finally, let be a pseudo-Lipschitz function. Then, for any there exist , and for any there exist such that, if and , we have
[TABLE]
(Further the above limits in probability are non-random quantities.)
Proof.
This follows immediately from Lemma 2.3 and Lemma 2.4. ∎
3 Proof of the main theorem
3.1 Choosing the nonlinearities
In view of Theorem 3, we need to choose the coefficients in the SDE (2.35) as to satisfy two conflicting requirements: maximize (the energy value achieved by our algorithm); keep (we want a solution in the hypercube).
Throughout this section we set as per Assumption 1. We also set and the unique minimizer of the Parisi functional. We also fix to be the solution of the PDE (1.2) with .
There is a natural SDE associated with the Parisi’s variational principle, that was first introduced in physics [MPV87], and recently studied in the probability theory literature [AC15, JT16]:
[TABLE]
Unless otherwise stated, it is understood that we set the initial condition to . Motivated by this, we set the coefficients as follows
[TABLE]
We collect below a few useful regularity properties of , which have been proved in the literature.
Lemma 3.1**.**
* exists and is continuous for all .*
For all ,
[TABLE]
* for all .*
, are Lipschitz continuous on .
Proof.
Points and are Theorem 4 in [JT16]. Point is Proposition 2. in [AC15]. Finally, point follows immediately from points , . ∎
This Lemma implies that the choice (3.2) satisfies the regularity assumptions in Theorem 3. We next have to check the normalization condition, and compute the resulting distribution.
Lemma 3.2**.**
We have
[TABLE]
In particular for all .
Proof.
By Lemma 2 in [AC15], we have, for any
[TABLE]
which is exactly Eq. (3.4). Lemma 3.1. implies almost surely. ∎
Lemma 3.3**.**
For all , we have
[TABLE]
Proof.
Equation (3.6) is Proposition 1 in [Che17]. For Eq. (3.7) note that by Eq. (39) in the same paper, we have, for any
[TABLE]
and therefore the claim follows from Eq. 3.6. ∎
Lemma 3.4**.**
For any , we have
[TABLE]
Proof.
Consider a continuity point of . Then the proof of Lemma 16 in [JT16] yields
[TABLE]
Taking expectation and using Fubini’s alongside Eq. (3.6), we get
[TABLE]
The claim follows also for not a continuity point because the right hand side is obviously continuous in . The left hand side is continuous because is Lipschitz (cf. Lemma 3.1) and because the coefficients of the SDE are bounded Lipschitz. ∎
We summarize the results of this section in the following theorem. Here and below, for , , we let .
Theorem 4**.**
Under Assumption 1 let be defined as per Eq. (3.2), and set for . Further let
[TABLE]
Define the incremental AMP iteration via Eqs. (2.1), (2.4), (2.5), with given by Eq. (2.37), and let be given by Eq. (2.13). Then, for any there exist , and for any there exist such that, if and , we have
[TABLE]
(Further the above limits in probability are non-random quantities.)
Proof.
First notice that with a pseudo-Lipschitz function. Further, integration by parts yields
[TABLE]
Hence the claims of this theorem follow immediately from Theorem 3 upon checking those assumptions using the lemmas given in this section. ∎
3.2 Sequential rounding and putting everything together
Theorem 4 constructs a vector . It is not difficult to round this to a vector with entries in , as detailed in the next lemma.
Lemma 3.5**.**
There exist an algorithm with complexity , and an absolute constant such that the following happens with probability at least . Given and a vector such that . Then there algorithm returns a vector such that
[TABLE]
Proof.
Recall the definition of Hamiltonian (which we view as a function on ). We also define .
We construct in two steps. First we let to be the projection of onto the hypercube (i.e. is such that ). Note that this can be constructed in time (simply by projecting each coordinate onto ).
Second, note that the function is linear in each coordinate of . Namely, for each , where and . We then construct a sequence as follows. Set and, for each :
[TABLE]
Finally we set . This procedure takes operations.
The lemma then follows straightforwardly from the following three claims:
.
, with probability at least .
with probability at least .
Claim is immediate since for each .
Claim holds since, for any ,
[TABLE]
Now we have , and is a Lipschitz function of the Gaussian vector . hence the desired bounds follow by Gaussian concentration.
For claim , let and note that (denoting by the maximum eigenvalue of )
[TABLE]
The desired probability bound follows by concentration of the largest eigenvalue of matrices [AGZ09]. ∎
We finally need to show that the quantity of Theorem 4 converges to the asymptotic optimum value, for large . This is achieved in the two lemmas below.
Lemma 3.6**.**
Let . Then, almost surely,
[TABLE]
Proof.
By Gaussian concentration, it is sufficient to consider the expectation (recall that . Recall the definition of partition function , and define the associated Gibbs measure and free energy density . A standard thermodynamic identity [MM09] yields , where is the Shannon entropy of the probability distribution . Further and as . Hence
[TABLE]
On the other hand, . Since by Theorem 1, are convex with differentiable [Tal06a], it follows that
[TABLE]
(The last equality is proved in [Tal06a], with a difference in normalization of .) ∎
Lemma 3.7**.**
For any ,
[TABLE]
Proof.
The PDE (1.2) can be solved for using the Cole-Hopf transformation . This yields , whence and . Substituting in Eqs. (3.6), (3.6), we get
[TABLE]
Hence
[TABLE]
∎
The proof our main result, Theorem 2, follows quite easily from the findings of this section.
Proof of Theorem 2.
Let . This limit exists by Corollary 1.1, and we further have (this can be proved by the same thermodynamic argument as in the proof of Lemma 3.6, noting that for [Pan13b]). It is therefore sufficient to output such that, with high probability, .
Let . By Lemma 3.6 and Lemma 3.7, we have . Applying the algorithm of Theorem 4 thus we obtain, with high probability, a vector such that and . The proof is completed by using the rounding procedure of Lemma 3.5. ∎
4 Relation with the TAP equations
In this section we prove that the algorithm described in Section 2, when used in conjunction with the specific choice of functions , , in Section 3 actually constructs an approximate solution of the TAP equations (under Assumption 1). As in the previous section, we set , , , , and
[TABLE]
Using these settings, we recall that and are given by
[TABLE]
Finally, we will repeatedly use the fact that the PDE (1.2) can be solved on using the Cole-Hopf transformation, which yields .
Lemma 4.1**.**
Setting , we have
[TABLE]
Proof.
By Lemma 2.2, we have
[TABLE]
On the other hand, using Lemma 2.4, we obtain
[TABLE]
where the last identity follows from Eq. (3.5). ∎
Lemma 4.2**.**
Setting , we have
[TABLE]
Proof.
Throughout the proof, we will write . By the basic iteration (2.1), we have
[TABLE]
Using Eqs. (2.19) and (2.22), together with the fact that , are bounded by Lemma 2.2, we get
[TABLE]
Next, using again Lemma 2.4, we have , and
[TABLE]
where in the last step we used Lemma 3.4. By Fubini’s theorem
[TABLE]
where in the last step we used once more Eq. (3.5). Substituting these limits in Eq. (4.11), we get
[TABLE]
Where we used the fact that solves te SDE (3.1), and . ∎
We can therefore state our result about constructing solutions to the TAP equations.
Theorem 5** (Constructing solutions to the TAP equations).**
Under Assumption 1 let be defined as per Eq. (3.2), and set for . Define the incremental AMP iteration via Eqs. (2.1), (2.4), (2.5), with given by Eq. (2.37), and let be given by Eq. (2.13). (The same iteration is given explicitly in Eqs. (4.2), (4.3).)
Set . Then, for any there exist , and for any there exist such that, if and , we have, with high probability
[TABLE]
Proof.
The theorem follows immediately from Lemma 4.1 and Lemma 4.2, using the fact that, with high probability, has operator norm bounded by [AGZ09]. ∎
5 Universality
In this section we use the universality results of [BLM15] to generalize Theorem 2 to other random matrix distributions. Namely, we will work under the following assumption:
Assumption 2**.**
The matrix is symmetric with and a collection of independent random variables, satisfying , . Further, the entries are subgaussian, with common subgaussian parameter . (Namely, for all .)
Using [BLM15, Theorem 4], and proceeding exactly as for Proposition 2.1, we obtain the following.
Proposition 5.1**.**
Consider the AMP iteration (2.1), with satisfying Assumption 2. Further, assume to be a fixed polynomial (independent of ). Then for any , and any pseudo-Lipschitz function , we have
[TABLE]
Here is a centered Gaussian process independent of with covariance determined recursively via
[TABLE]
Notice an important difference with respect to Proposition (2.1): instead of Lipschitz functions, we require the functions to be polynomials. However, this result is strong enough to allow us prove the following generalization of Theorem 2.
Theorem 6**.**
Let , be random matrices satisfying Assumption 2. Under Assumption 1, for any there exists an algorithm that takes as input the matrix , and outputs , such that the following hold: The complexity (floating point operations) of the algorithm is at most . We have .
Proof.
Let , , be defined as in the proof of Theorem 2 for . For each , and each , we construct a polynomial which approximates the dynamics defined by , , , in a sense that we will make precise below.
We define the IAMP iteration, analogously to (2.4), (2.5)
[TABLE]
We then claim that we can construct these polynomial approximations so that, for any , and any pseudo-Lipschitz function , we have
[TABLE]
where the independent random variables are defined as in Lemma 2.2. Given this claim, the rest of the proof of Theorem 2 can be applied verbatimly to this –slightly different– algorithm.
In order to prove the claim (5.4), we proceed as in the proof of Lemma 2.2. Namely, by applying Proposition 5.1, we get
[TABLE]
where is a centered Gaussian process. Using the same argument as in Lemma 2.2, we obtain that the Gaussian random variables are independent. Further, letting , Proposition 5.1 yields the following recursion
[TABLE]
The claim (5.4) follows by showing the we can choose polynomials so that for each . This can be done by induction over . As a preliminary, notice that there is sufficiently small so that, for the sequence of random variables defined recursively via Eq. (2.7), we have for all (the existence of such can also be shown by induction over using the fact that are bounded Lipschitz).
The basis of the induction is trivial. Then assume that the induction claim is true for all . Without loss of generality we can consider that, for any we have . Indeed by the induction hypothesis this holds for all large enough, and we can always renumber the polynomials so that it holds for all . Then notice that the random variable of Eq. (2.7) can be written as for a certain function that is bounded by a polynomial. We then choose the polynomials so that
[TABLE]
Such polynomials can be constructed, for instance, by considering the expansion of in the basis of multivariate Hermite polynomials (suitably rescaled as to form an orthonormal basis with in , where is the joint distribution of .) The variance bound is used in controlling the error term.
The induction claim then follows by
[TABLE]
where the last equality holds by dominated convergence. ∎
Corollary 1.2 follows by applying Theorem 6 with a suitably centered and normalized adjacency matrix.
Proof of Corollary 1.2.
Given a graph , construct the matrix , by setting and, for :
[TABLE]
It is easy to verify that this matrix satisfies Assumption 2. Further, we have
[TABLE]
Recall that we know from [DMS17] . Let denote the output of the algorithm of Theorem 6, on input . Applying this theorem and Lemma 5.1, we get
[TABLE]
We construct by balancing . Namely, if , we obtain by flipping entries of so that . We then have, with high probability
[TABLE]
(Here denotes the operator norm of matrix .) Therefore, since , and with high probability [AGZ09], we get
[TABLE]
which completes the proof. ∎
Acknowledgements
I am grateful to Eliran Subag for an inspiring presentation of his work [Sub18] delivered at the workshop ‘Advances in Asymptotic Probability’ in Stanford, and for a stimulating conversation. This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162 and ONR N00014-18-1-2729.
Appendix A Proof of Proposition 2.1
As mentioned in the main text, Proposition 2.1 is a consequence of the general analysis of AMP algorithms available in the literature. In particular it can be obtained from a reduction to the setting of [JM13, Theorem 1]. Let us briefly recall the class of algorithms considered in [JM13], adapting the notations to the present ones. (we limit ourselves to consider the ‘one-block’ case in the language of [JM13]).
Fixing consider a sequence of Lipschitz functions
[TABLE]
Given two matrices , , we let be the matrix whose -th row is given by (where is the -th row of and is the -th row of ).
Then [JM13] analyzes the following AMP iteration, which produces a sequence of iterates
[TABLE]
Here is a matrix with entries defined by
[TABLE]
Under the assumption that are independent of , and converges in , [JM13, Theorem 1] determines the asymptotic empirical distribution of .
Proposition 2.1 can be recast as a special case of this setting. First notice that we can always choose an -independent such that the time horizon in Eq. (2.2) satisfies . We then consider the iteration (A.1) with initialization , data vectors , and update functions given by
[TABLE]
With this setting, the vector coincides with as given in Eq. (2.1), for all . The recursion of Eq. (2.3) follows from the analogous recursion in [JM13, Theorem 1].
Appendix B A simplified version of the algorithm
In this appendix we provide a simplified version of the algorithm of Theorem 2, for the reader’s convenience. In this presentation we simplify certain technical details that have been introduced in the main text to simplify the proof. In the pseudo-code below denotes entrywise multiplication between vectors. Further, when a scalar function is applied to a vector, it is understood to be applied componentwise. In particular, note that is the norm of the vector whose -th component is .
Notice that this pseudo-code does not describe how to minimize the Parisi functional and to solve the PDE (1.2). As discussed in the introduction, we believe this can be done efficiently because of the strong convexity and continuity of . Indeed highly accurate numerical solutions (albeit with no rigorous analysis) were developed already in [CR02, OSS07, SO08].
Further, the pseudo-code does not specify the rounding procedure, which is given below.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[ABE + 05] Sanjeev Arora, Eli Berger, Hazan Elad, Guy Kindler, and Muli Safra, On non-approximability for quadratic programs , Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, IEEE, 2005, pp. 206–215.
- 2[ABM 18] Louigi Addario-Berry and Pascal Maillard, The algorithmic hardness threshold for continuous random energy models , ar Xiv:1810.05129 (2018).
- 3[AC 15] Antonio Auffinger and Wei-Kuo Chen, The Parisi formula has a unique minimizer , Communications in Mathematical Physics 335 (2015), no. 3, 1429–1444.
- 4[AC 17] , Parisi formula for the ground state energy in the mixed p 𝑝 p -spin model , The Annals of Probability 45 (2017), no. 6b, 4617–4631.
- 5[ACZ 17] Antonio Auffinger, Wei-Kuo Chen, and Qiang Zeng, The SK model is Full-step Replica Symmetry Breaking at zero temperature , ar Xiv:1703.06872 (2017).
- 6[AGZ 09] Greg W. Anderson, Alice Guionnet, and Ofer Zeitouni, An introduction to random matrices , Cambridge University Press, 2009.
- 7[BCKM 98] Jean-Philippe Bouchaud, Leticia F Cugliandolo, Jorge Kurchan, and Marc Mézard, Out of equilibrium dynamics in spin-glasses and other glassy systems , Spin glasses and random fields (1998), 161–223.
- 8[BKW 19] Afonso S Bandeira, Dmitriy Kunisky, and Alexander S Wein, Computational hardness of certifying bounds on constrained pca problems , ar Xiv:1902.07324 (2019).
