Optimization of the Sherrington-Kirkpatrick Hamiltonian

Andrea Montanari

arXiv:1812.10897·math.PR·April 8, 2019·SIAM J. Comput.

TL;DR

This paper introduces a new message-passing algorithm that efficiently approximates the maximum of the Sherrington-Kirkpatrick Hamiltonian, closely matching the optimal value with high probability for large systems.

Contribution

It presents a novel message-passing algorithm with quadratic time complexity that achieves near-optimal solutions for the SK model, extending its applicability to low-temperature regimes.

Findings

01

Algorithm achieves (1-ε) approximation of the optimum

02

Time complexity is quadratic in system size

03

Constructs approximate solutions to TAP equations at low temperature

Abstract

Let $A \in \mathbb{R}^{n \times n}$ be a symmetric random matrix with independent and identically distributed Gaussian entries above the diagonal. We consider the problem of maximizing $\langle \sigma, A \sigma \rangle$ over binary vectors $\sigma \in \{+1, -1\}^{n}$ . In the language of statistical physics, this amounts to finding the ground state of the Sherrington-Kirkpatrick model of spin glasses. The asymptotic value of this optimization problem was characterized by Parisi via a celebrated variational principle, subsequently proved by Talagrand. We give an algorithm that, for any $\varepsilon > 0$ , outputs $\sigma_{*} \in \{-1, +1\}^{n}$ such that $\langle \sigma_{*}, A \sigma_{*} \rangle$ is at least $(1 - \varepsilon)$ of the optimum value, with probability converging to one as…

Equations

\begin{split}\mbox{maximize}\quad&\langle\sigma,A\sigma\rangle\,,\\ \mbox{subject to}\quad&\sigma\in\{+1,-1\}^{n}\,.\end{split}

\begin{split}&\partial_{t}\Phi(t,x)+\frac{1}{2}\beta^{2}\partial_{xx}\Phi(t,x)+\frac{1}{2}\beta^{2}\mu(t)\big{(}\partial_{x}\Phi(t,x)\big{)}^{2}=0\,,\\ &\Phi(1,x)=\log 2\cosh x\,.\end{split}

{\sf P}_{\beta}(\mu)\equiv\Phi_{\mu}(0,0)-\frac{1}{2}\beta^{2}\int_{0}^{1}t\,\mu(t)\,{\rm d}t\,.

\lim_{n\to\infty}\frac{1}{n}\log Z_{n}(\beta)=\min_{\mu\in\mathscrsfs{P}([0,1])}{\sf P}_{\beta}(\mu)\,.

\lim_{n\to\infty}\frac{1}{2n}\max_{\sigma\in\{+1,-1\}^{n}}\langle\sigma,A\sigma\rangle=\lim_{\beta\to\infty}\frac{1}{\beta}\min_{\mu\in\mathscrsfs{P}([0,1])}{\sf P}_{\beta}(\mu)\,.

\Big({\sf CUT}_{G}(\sigma_{*})-\frac{|E_{n}|}{2}\Big)\geq(1-\varepsilon)\max_{\sigma\in\{+1,-1\}^{n}}\Big({\sf CUT}_{G}(\sigma)-\frac{|E_{n}|}{2}\Big)\,.

\hat{p}_{x_{1},\dots,x_{k}}\equiv\frac{1}{n}\sum_{i=1}^{n}\delta_{(x_{1,i},\dots,x_{k,i})}\,.

W_{2}(\mu,\nu)\equiv\Big\{\inf_{\gamma\in{\mathcal{C}}(\mu,\nu)}\int|x-y|^{2}\,\gamma({\rm d}x,{\rm d}y)\Big\}^{1/2}\,,

\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}X_{n,M}=x_{*}\,,

\begin{split}u^{k+1}&=A\,f_{k}(u^{0},\dots,u^{k};y)-\sum_{j=1}^{k}{\sf b}_{k,j}\,f_{j-1}(u^{0},\dots,u^{j-1};y)\,,\\ {\sf b}_{k,j}&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_{k}}{\partial u_{i}^{j}}(u_{i}^{0},\dots,u_{i}^{k};y_{i})\,.\end{split}

\frac{1}{n}\sum_{i=1}^{n}\psi(u_{i}^{0},\dots,u_{i}^{k};y_{i})\stackrel{p}{\longrightarrow}{\mathbb{E}}\,\psi(U_{0},\dots,U_{k};Y)\,.

\displaystyle{\widehat{Q}}_{k+1,j+1}={\mathbb{E}}\big{\{}f_{k}(U_{0},\dots,U_{k};Y)f_{j}(U_{0},\dots,U_{j};Y)\big{\}}\,.

f_{k}(u_{0},\dots,u_{k})

x_{k}

\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\psi(u_{i}^{0},\dots,u_{i}^{k})={\mathbb{E}}\,\psi(U_{0}^{\delta},\dots,U_{k}^{\delta})\,.

\begin{split}{\widehat{q}}_{k+1}&={\mathbb{E}}\big\{\widehat{g}_{k}(X_{k-1}^{\delta})^{2}\big\}\cdot{\widehat{q}}_{k}\,,\\ X_{k}^{\delta}&=X_{k-1}^{\delta}+v(X_{k-1}^{\delta};k\delta)\,\delta+s(X_{k-1};k\delta)\,U_{k}^{\delta}\sqrt{\delta}\,.\end{split}

\frac{1}{n}\sum_{i=1}^{n}\psi(u_{i}^{0},\dots,u_{i}^{k})\stackrel{p}{\longrightarrow}{\mathbb{E}}\,\psi(U_{0}^{\delta,M},\dots,U_{k}^{\delta,M})\,.

\displaystyle{\widehat{Q}}^{M}_{j,k+1}={\mathbb{E}}\big{\{}\widehat{g}_{j-1}(X_{j-2})[U_{j-1}]_{M}\,\widehat{g}_{k}(X_{k-1})\big{\}}{\mathbb{E}}\big{\{}[U_{k}]_{M}\big{\}}=0\,.

{\widehat{q}}^{M}_{k+1}

X_{k}^{\delta,M}

z=\sqrt{\delta}\sum_{k=1}^{\lfloor\overline{q}/\delta\rfloor}f_{k}(u_{0},\dots,u_{k})\,.

Z^{\delta}\equiv\sqrt{\delta}\sum_{k=1}^{\lfloor\overline{q}/\delta\rfloor}\widehat{g}_{k}(X_{k-1})\,U_{k}^{\delta}\,.

\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\psi(z_{i})={\mathbb{E}}\big\{\psi(Z^{\delta})\big\}\,,

\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}\frac{1}{2n}\langle z,Az\rangle=\delta\sum_{k=1}^{\lfloor\overline{q}/\delta\rfloor-1}{\mathbb{E}}\big\{(U_{k}^{\delta})^{2}\big\}\,{\mathbb{E}}\big\{\widehat{g}_{k}(X_{k-1}^{\delta})\big\}\,{\mathbb{E}}\big\{\widehat{g}_{k}(X_{k-1}^{\delta})^{2}\big\}\,.

\begin{split}\operatorname*{p-lim}_{n\to\infty}{\sf b}_{k,j}&={\mathbb{E}}\Big\{\frac{\partial g}{\partial u_{j}}(X_{k-1}^{\delta,M};k\delta)\,[U_{k}^{\delta,M}]_{M}\Big\}\\ &={\mathbb{E}}\Big\{\frac{\partial g}{\partial u_{j}}(X_{k-1}^{\delta,M};k\delta)\Big\}\,{\mathbb{E}}\big\{[U_{k}^{\delta,M}]_{M}\big\}=0\,.\end{split}

\operatorname*{p-lim}_{n\to\infty}{\sf b}_{k,k}={\mathbb{E}}\big\{\widehat{g}_{k}(X_{k-1}^{\delta,M})\big\}\,{\mathbb{P}}\big(U_{k}^{\delta,M}\in[-M,M]\big)\,.

\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}{\sf b}_{k,k}

\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\langle f_{j},f_{k}\rangle={\mathbb{E}}\big{\{}\widehat{g}_{j}(X^{M,\delta}_{j-1})\,[U_{j}^{M,\delta}]_{M}\widehat{g}_{k}(X^{M,\delta}_{k-1})\big{\}}{\mathbb{E}}\big{\{}[U_{k}^{M,\delta}]_{M}\big{\}}=0\,.

\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\langle f_{j},u^{k}\rangle

Full text

Optimization of the Sherrington-Kirkpatrick Hamiltonian

Andrea Montanari Department of Electrical Engineering and Department of Statistics, Stanford University

Abstract

Let $\boldsymbol{A}\in{\mathbb{R}}^{n\times n}$ be a symmetric random matrix with independent and identically distributed Gaussian entries above the diagonal. We consider the problem of maximizing $\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle$ over binary vectors ${\boldsymbol{\sigma}}\in\{+1,-1\}^{n}$ . In the language of statistical physics, this amounts to finding the ground state of the Sherrington-Kirkpatrick model of spin glasses. The asymptotic value of this optimization problem was characterized by Parisi via a celebrated variational principle, subsequently proved by Talagrand. We give an algorithm that, for any ${\varepsilon}>0$ , outputs ${\boldsymbol{\sigma}}_{*}\in\{-1,+1\}^{n}$ such that $\langle{\boldsymbol{\sigma}}_{*},\boldsymbol{A}{\boldsymbol{\sigma}}_{*}\rangle$ is at least $(1-{\varepsilon})$ of the optimum value, with probability converging to one as $n\to\infty$ . The algorithm’s time complexity is $C({\varepsilon})\,n^{2}$ . It is a message-passing algorithm, but the specific structure of its update rules is new.

As a side result, we prove that, at (low) non-zero temperature, the algorithm constructs approximate solutions of the Thouless-Anderson-Palmer equations.

1 Introduction and main result

Let $\boldsymbol{A}\in{\mathbb{R}}^{n\times n}$ be a random matrix from the ${\sf GOE}(n)$ ensemble. Namely, $\boldsymbol{A}=\boldsymbol{A}^{{\sf T}}$ and $(A_{ij})_{i\leq j\leq n}$ is a collection of independent random variables with $A_{ii}\sim{\sf N}(0,2/n)$ and $A_{ij}\sim{\sf N}(0,1/n)$ for $i<j$ . We are concerned with the following optimization problem (here $\langle{\boldsymbol{u}},{\boldsymbol{v}}\rangle=\sum_{i\leq n}u_{i}v_{i}$ is the standard scalar product)

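$$\begin{split}\mbox{maximize}\quad&\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle\,,\\ \mbox{subject to}\quad&{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}\,.\end{split}\tag{1.1}$$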
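To make the setup concrete, here is a minimal sketch (not part of the paper) that samples a matrix with the variances stated above and solves problem (1.1) by exhaustive search for a very small $n$. The function names `sample_goe` and `brute_force_max` are illustrative choices, and the enumeration is only feasible for $n$ up to about 20.

```python
import itertools

import numpy as np


def sample_goe(n, rng):
    """Symmetric matrix with A_ii ~ N(0, 2/n) and A_ij ~ N(0, 1/n) for i < j."""
    g = rng.normal(size=(n, n)) / np.sqrt(n)
    return (g + g.T) / np.sqrt(2.0)


def brute_force_max(a):
    """Maximize <sigma, A sigma> over {+1,-1}^n by enumeration (tiny n only)."""
    n = a.shape[0]
    best_val, best_sigma = -np.inf, None
    # The objective is invariant under sigma -> -sigma, so fix sigma_1 = +1.
    for rest in itertools.product((-1.0, 1.0), repeat=n - 1):
        sigma = np.array((1.0,) + rest)
        val = float(sigma @ a @ sigma)
        if val > best_val:
            best_val, best_sigma = val, sigma
    return best_val, best_sigma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = sample_goe(12, rng)
    val, sigma = brute_force_max(a)
    print("max <sigma, A sigma> / (2n) =", val / (2 * a.shape[0]))
```

At such small sizes the normalized optimum fluctuates noticeably around its limiting value, so this is only a check of the definitions, not of the asymptotics.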

From a worst-case perspective, this problem is NP-hard and indeed hard to approximate within a sublogarithmic factor [ABE+05]. For random data $\boldsymbol{A}$ , the energy function $H_{n}({\boldsymbol{\sigma}})=\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle/2$ is also known as the Sherrington-Kirkpatrick model [SK75]. Its properties have been intensely studied in statistical physics and probability theory for over 40 years as a prototypical example of a complex energy landscape and a mean field model for spin glasses [MPV87, Tal10, Pan13b]. Generalizations of this model have been used to understand structural glasses, random combinatorial problems, neural networks, and a number of other systems [EVdB01, MPZ02, WL12, Nis01, MM09].

In this paper we consider the computational problem of finding a vector ${\boldsymbol{\sigma}}_{*}\in\{+1,-1\}^{n}$ that is a near optimum, namely such that $H_{n}({\boldsymbol{\sigma}}_{*})\geq(1-{\varepsilon})\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}H_{n}({\boldsymbol{\sigma}})$ . Under a widely believed assumption about the structure of the associated Gibbs measure (more precisely, on the support of the asymptotic overlap distribution) we prove that, for any ${\varepsilon}>0$ there exists an algorithm with complexity $O(n^{2})$ that –with high probability– outputs such a vector.

In order to state our assumption, we need to take a detour and introduce Parisi’s variational formula for the value of the optimization problem (1.1). Let $\mathscrsfs{P}([0,1])$ be the space of probability measures on the interval $[0,1]$ endowed with the topology of weak convergence. For $\mu\in\mathscrsfs{P}([0,1])$ , we will write (with a slight abuse of notation) $\mu(t)=\mu([0,t])$ for its distribution function. For $\beta\in{\mathbb{R}}_{\geq 0}$ , consider the following parabolic partial differential equation (PDE) on $(t,x)\in[0,1]\times{\mathbb{R}}$

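$$\begin{split}&\partial_{t}\Phi(t,x)+\frac{1}{2}\beta^{2}\partial_{xx}\Phi(t,x)+\frac{1}{2}\beta^{2}\mu(t)\big(\partial_{x}\Phi(t,x)\big)^{2}=0\,,\\ &\Phi(1,x)=\log 2\cosh x\,.\end{split}\tag{1.2}$$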

It is understood that this is to be solved backward in time with the given final condition at $t=1$ . Existence and uniqueness were proved in [JT16]. We will also write $\Phi_{\mu}$ to emphasize the dependence of the solution on the measure $\mu$ . The Parisi functional is then defined as

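$${\sf P}_{\beta}(\mu)\equiv\Phi_{\mu}(0,0)-\frac{1}{2}\beta^{2}\int_{0}^{1}t\,\mu(t)\,{\rm d}t\,.\tag{1.3}$$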
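As a sanity check on these definitions, here is a worked special case that is not taken from the paper. If $\mu=\delta_{0}$ , so that $\mu(t)=1$ for all $t\in[0,1]$ , the substitution $u=e^{\Phi}$ turns the PDE into a backward heat equation and the Parisi functional can be evaluated in closed form. The symbol $G\sim{\sf N}(0,1)$ is an auxiliary standard Gaussian introduced only for this computation.

```latex
\begin{aligned}
&u=e^{\Phi}:\qquad \partial_{t}u+\tfrac{1}{2}\beta^{2}\,\partial_{xx}u=0\,,\qquad u(1,x)=2\cosh x\,,\\
&u(t,x)={\mathbb E}\big\{2\cosh\big(x+\beta\sqrt{1-t}\,G\big)\big\}=2\cosh(x)\,e^{\beta^{2}(1-t)/2}\,,\\
&\Phi_{\delta_{0}}(0,0)=\log 2+\tfrac{1}{2}\beta^{2}\,,\\
&{\sf P}_{\beta}(\delta_{0})=\log 2+\tfrac{1}{2}\beta^{2}-\tfrac{1}{2}\beta^{2}\int_{0}^{1}t\,{\rm d}t=\log 2+\tfrac{1}{4}\beta^{2}\,.
\end{aligned}
```

This value coincides with the annealed quantity $\frac{1}{n}\log{\mathbb E}Z_{n}(\beta)$ for the partition function $Z_{n}(\beta)=\sum_{{\boldsymbol{\sigma}}}\exp\{\beta H_{n}({\boldsymbol{\sigma}})\}$ , since $H_{n}({\boldsymbol{\sigma}})$ is a centered Gaussian with variance $n/2$ for every fixed ${\boldsymbol{\sigma}}$ .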

The relation between this functional and the original optimization problem is given by a remarkable variational principle, first proposed by Parisi [Par79] and established rigorously, more than twenty-five years later, by Talagrand [Tal06b], and Panchenko [Pan13a].

Theorem 1 (Talagrand [Tal06b]).

Consider the partition function $Z_{n}(\beta)=\sum_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}\exp\{\beta H_{n}({\boldsymbol{\sigma}})\}$ . Then we have, almost surely (and in $L^{1}$ )

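$$\lim_{n\to\infty}\frac{1}{n}\log Z_{n}(\beta)=\min_{\mu\in\mathscrsfs{P}([0,1])}{\sf P}_{\beta}(\mu)\,.\tag{1.4}$$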

The following consequence for the optimization problem (1.1) is elementary, see e.g. [DMS17].

Corollary 1.1.

We have, almost surely

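$$\lim_{n\to\infty}\frac{1}{2n}\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle=\lim_{\beta\to\infty}\frac{1}{\beta}\min_{\mu\in\mathscrsfs{P}([0,1])}{\sf P}_{\beta}(\mu)\,.\tag{1.5}$$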

Remark 1.1.

The limit $\beta\to\infty$ on the right-hand side of Eq. (1.5) can be removed by defining a new variational principle directly ‘at $\beta=\infty$ ’. Namely, the right-hand side of Eq. (1.5) can be replaced by $\min_{\gamma}\hat{{\sf P}}(\gamma)$ where $\hat{{\sf P}}$ is a modification of ${\sf P}$ and the minimum is taken over a suitable functional space [AC17]. In this paper we use the $\beta<\infty$ formulation, but it should be possible to work directly at $\beta=\infty$ .

Existence and uniqueness of the minimizer of ${\sf P}_{\beta}(\,\cdot\,)$ were proved in [AC15] and [JT16], which also proved that $\mu\mapsto{\sf P}_{\beta}(\mu)$ is strongly convex. We will denote by $\mu_{\beta}$ the unique minimizer, and refer to it as the ‘Parisi measure’ or ‘overlap distribution’ at inverse temperature $\beta$ . Our key assumption will be that –at large enough $\beta$ – the support of $\mu_{\beta}$ is an interval $[0,q_{*}(\beta)]$ .

Assumption 1.

There exists $\beta_{0}<\infty$ such that, for any $\beta>\beta_{0}$ , the function $t\mapsto\mu_{\beta}([0,t])$ is strictly increasing on $[0,q_{*}]$ , where $q_{*}=q_{*}(\beta)$ and $\mu_{\beta}([0,q_{*}])=1$ .

This assumption (sometimes referred to as ‘continuous replica symmetry breaking’ or ‘full replica symmetry breaking’) is widely believed to be true (with $\beta_{0}=1$ ) within statistical physics [MPV87]. In particular, this conjecture is supported by high precision numerical solutions of the variational problem for ${\sf P}_{\beta}$ [CR02, OSS07, SO08]. Rigorous evidence was recently obtained in [ACZ17]. Addressing this conjecture goes beyond the scope of the present paper.

We are now in a position to state our main result.

Theorem 2.

Under Assumption 1, for any ${\varepsilon}>0$ there exists an algorithm that takes as input the matrix $\boldsymbol{A}\in{\mathbb{R}}^{n\times n}$ , and outputs ${\boldsymbol{\sigma}}_{*}={\boldsymbol{\sigma}}_{*}(\boldsymbol{A})\in\{+1,-1\}^{n}$ , such that the following hold:

$(i)$

The complexity (floating point operations) of the algorithm is at most $C({\varepsilon})n^{2}$ .

$(ii)$

We have $\langle{\boldsymbol{\sigma}}_{*},\boldsymbol{A}{\boldsymbol{\sigma}}_{*}\rangle\geq(1-{\varepsilon})\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle$ , with high probability (with respect to $\boldsymbol{A}\sim{\sf GOE}(n)$ ).

In other words, on average, the optimization problem (1.1) is much easier than in the worst case. Of course, this is far from being the only example of this phenomenon (a gap between worst case and average case complexity). However, it is a rather surprising example given the complexity of the energy landscape $H_{n}({\boldsymbol{\sigma}})$ . Its proof uses in a crucial way a fine property of the associated Gibbs measure, namely the support of the overlap distribution.

Remark 1.2 (Computation model).

For the sake of simplicity, we measure complexity in floating point operations. However, all operations in our algorithm appear to be stable and it should be possible to translate this result to weaker computation models.

We also assume that we can choose one value of the inverse temperature $\beta$ , and query the distribution $\mu_{\beta}(t)$ and the PDE solution $\Phi(t,x)$ as well as its derivatives $\partial_{x}\Phi(t,x)$ , $\partial_{xx}\Phi(t,x)$ at specified points $(t,x)$ , with each query costing $O(1)$ operations.

This is a reasonable model for two reasons: $(i)$ The PDE (1.2) is independent of the instance, and can be solved to a desired degree of accuracy only once. This solution can be used every time a new instance of the problem is presented. $(ii)$ The function $\mu\mapsto{\sf P}_{\beta}(\mu)$ is uniformly continuous [Gue03] and strongly convex [AC15, JT16]. Further, the PDE solution $\Phi$ is continuous in $\mu$ and can be characterized as the fixed point of a certain contraction [JT16]. For these reasons we expect that an oracle to compute $\Phi(t,x)$ , $\partial_{x}\Phi(t,x)$ , $\partial_{xx}\Phi(t,x)$ to accuracy $\eta$ can be implemented in $O(\eta^{-C})$ operations, for $C$ a constant.

Beyond Theorem 2, our general analysis allows us to prove an additional fact that is of independent interest. Namely, for any $\beta>\beta_{0}$ , our message passing iteration constructs an approximate solution of the celebrated Thouless, Anderson, Palmer (TAP) equations [MPV87, Tal10].

The bulk of this paper is devoted to the case of Gaussian matrices $\boldsymbol{A}$ . However, the class of algorithms we use enjoys certain universality properties, first established in [BLM15]. These properties can be used to generalize Theorem 2 to the case of symmetric matrices with independent subgaussian entries. We refer to Section 5 for a statement of this universality result, and limit ourselves to stating a consequence of Theorem 2 for the MAXCUT problem.

Let $G_{n}=([n],E_{n})\sim{\mathcal{G}}(n,p)$ be an Erdös-Renyi random graph with edge probability ${\mathbb{P}}\big{\{}(i,j)\in E_{n}\big{\}}=p$ . A random balanced partition of the vertices (which we encode as a vector ${\boldsymbol{\sigma}}\in\{+1,-1\}^{n}$ ) achieves a cut ${\sf CUT}_{G}({\boldsymbol{\sigma}})=|E_{n}|/2+O(n)=n^{2}p/4+O(n)$ , and a simple concentration argument implies that the MAXCUT has size $\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}{\sf CUT}_{G}({\boldsymbol{\sigma}})=|E_{n}|/2+O(n^{3/2}p^{1/2})$ . In fact, it follows from [DMS17] that $\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}{\sf CUT}_{G}({\boldsymbol{\sigma}})=|E_{n}|/2+(n^{3}p(1-p)/2)^{1/2}{\sf P}_{*}+o(n^{3/2})$ , where ${\sf P}_{*}$ is the prediction of Parisi’s formula (i.e. the right-hand side of (1.4)). (In [DMS17], the same result is shown to hold for sparser graphs, as long as the average degree diverges: $np_{n}\to\infty$ .) In other words, MAXCUT on dense Erdös-Renyi random graphs is non-trivial only once we subtract the baseline value $|E_{n}|/2$ . As a corollary of Theorem 2 we can approximate this subtracted value arbitrarily well.

Corollary 1.2.

Under Assumption 1, for any ${\varepsilon}>0$ there exists an algorithm (with complexity at most $C({\varepsilon})\,n^{2}$ ), that takes as input an Erdös-Renyi random graph $G_{n}=([n],E_{n})\sim{\mathcal{G}}(n,p)$ , and outputs ${\boldsymbol{\sigma}}_{*}={\boldsymbol{\sigma}}_{*}(G)\in\{+1,-1\}^{n}$ , such that

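$$\Big({\sf CUT}_{G}({\boldsymbol{\sigma}}_{*})-\frac{|E_{n}|}{2}\Big)\geq(1-{\varepsilon})\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}\Big({\sf CUT}_{G}({\boldsymbol{\sigma}})-\frac{|E_{n}|}{2}\Big)\,.$$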
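The link between the subtracted cut value and a quadratic form over $\{+1,-1\}^{n}$ rests on the elementary identity ${\sf CUT}_{G}({\boldsymbol{\sigma}})-|E_{n}|/2=-\langle{\boldsymbol{\sigma}},\boldsymbol{B}{\boldsymbol{\sigma}}\rangle/4$ , where $\boldsymbol{B}$ denotes the adjacency matrix of $G$ (a symbol introduced here for illustration). The sketch below, with hypothetical helper names, checks this identity numerically on one sampled graph; it does not implement the algorithm of Corollary 1.2.

```python
import numpy as np


def sample_gnp_adjacency(n, p, rng):
    """Adjacency matrix of an Erdos-Renyi graph G(n, p), no self-loops."""
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(float)


def cut_size(adj, sigma):
    """Number of edges {i, j} whose endpoints get different labels."""
    differs = sigma[:, None] != sigma[None, :]
    return float(np.sum(np.triu(adj, k=1) * differs))


def cut_via_quadratic_form(adj, sigma):
    """Same quantity computed as |E|/2 - <sigma, B sigma>/4."""
    num_edges = float(np.sum(np.triu(adj, k=1)))
    return num_edges / 2.0 - float(sigma @ adj @ sigma) / 4.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    adj = sample_gnp_adjacency(200, 0.5, rng)
    sigma = rng.choice([-1.0, 1.0], size=200)
    print(cut_size(adj, sigma), cut_via_quadratic_form(adj, sigma))
```

Maximizing the subtracted cut is therefore the same as maximizing $\langle{\boldsymbol{\sigma}},(-\boldsymbol{B}){\boldsymbol{\sigma}}\rangle$ , which is a problem of the form (1.1) with a non-Gaussian symmetric matrix.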

The rest of this section provides further background. In Section 2 we describe and analyze a general message passing algorithm, which we call incremental approximate message passing (IAMP). We believe this algorithm is of independent interest and can be applied beyond the Sherrington-Kirkpatrick model. In Section 3 we use this approach to prove Theorem 2. In Section 4 we show that the same message passing algorithm of Section 2 produces approximate solutions of the TAP equations. Finally, Section 5 discusses a generalization of Theorem 2 using universality. The impatient reader, who is interested in a succinct description of the algorithm (with some technical bells and whistles removed), is urged to read Appendix B.

1.1 Further background

As mentioned above –under suitable complexity theory assumptions– there is no polynomial-time algorithm that approximates the quadratic program (1.1) better than within a factor $O((\log n)^{c})$ , for some $c>0$ [ABE+05]. Little is known on average-case hardness, when $\boldsymbol{A}$ is drawn from one of the random matrix distributions considered here. As an exception, Gamarnik [Gam18] proved that exact computation of the partition function $Z_{n}(\beta)$ is hard on average.

A natural approach to the quadratic program (1.1) would be to use a convex relaxation. A spectral relaxation yields $\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}H_{n}({\boldsymbol{\sigma}})/n\leq\lambda_{1}(\boldsymbol{A})/2=1+o_{n}(1)$ , and hence is not tight for large $n$ . This can be compared to a numerical evaluation of Parisi’s formula which yields ${\sf P}_{*}\approx 0.763166$ [CR02, Sch08]. Rounding the spectral solution yields $H_{n}({\boldsymbol{\sigma}}_{\mbox{\tiny sp}})/n=2/\pi+o_{n}(1)\approx 0.636619$ . Somewhat surprisingly, the simplest semidefinite programming relaxation (degree $2$ of the sum-of-squares hierarchy) does not yield any improvement (for large $n$ ) over the spectral one [MS16]. After an initial version of this paper was posted, [BKW19] obtained rigorous evidence that higher order relaxations fail as well.
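The spectral bound and the rounded spectral solution from this paragraph can be checked on a single sample. The following is a minimal sketch under the assumption that "rounding" means taking entrywise signs of the top eigenvector; the helper names are illustrative, and at moderate $n$ the printed values only approximate the limits $1$ and $2/\pi$ .

```python
import numpy as np


def sample_goe(n, rng):
    """Symmetric matrix with A_ii ~ N(0, 2/n) and A_ij ~ N(0, 1/n) for i < j."""
    g = rng.normal(size=(n, n)) / np.sqrt(n)
    return (g + g.T) / np.sqrt(2.0)


def spectral_rounding(a):
    """Return (lambda_1(A), sign vector of the corresponding eigenvector)."""
    eigvals, eigvecs = np.linalg.eigh(a)  # eigenvalues in ascending order
    top = eigvecs[:, -1]
    sigma = np.where(top >= 0, 1.0, -1.0)
    return float(eigvals[-1]), sigma


def energy_per_site(a, sigma):
    """H_n(sigma) / n with H_n(sigma) = <sigma, A sigma> / 2."""
    return float(sigma @ a @ sigma) / (2.0 * a.shape[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = sample_goe(1000, rng)
    lam, sigma = spectral_rounding(a)
    print("spectral upper bound lambda_1 / 2:", lam / 2.0)
    print("rounded spectral solution H_n / n:", energy_per_site(a, sigma))
```

The gap between the two printed numbers brackets the Parisi value ${\sf P}_{*}$ quoted above, which is the target of the algorithm in Theorem 2.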

Theorem 2 was conjectured by the author in 2016 [Mon16], based on insights from statistical physics [CK94, BCKM98]. The same presentation also outlined the basic strategy followed in the present paper, which uses an iterative ‘approximate message passing’ (AMP) algorithm. Algorithms of this type were first proposed in the context of signal processing and compressed sensing [Kab03, DMM09]. Their rigorous analysis was developed by Bolthausen [Bol14] and subsequently generalized in several papers [BM11, JM13, BLM15, BMN17]. In this paper we introduce a specific class of AMP algorithms (‘incremental AMP’) whose specific properties allow us to match the result predicted by Parisi’s formula.

The fundamental phenomenon studied here is expected to be quite general. Namely objective functions with overlap distribution having support of the form $[0,q_{*}]$ are expected to be easy to optimize. In contrast, if the support has a gap (for instance, has the form $[0,q_{1}]\cup[q_{2},q_{*}]$ for some $q_{1}<q_{2}$ ), this is considered as an indication of average case hardness. This intuition originates within spin glass theory [MPV87]. Roughly speaking, the structure of the overlap distribution should reflect the connectivity properties of the level sets ${\cal L}_{n}({\varepsilon})\equiv\{{\boldsymbol{\sigma}}:\;H_{n}({\boldsymbol{\sigma}})\geq(1-{\varepsilon})\max_{{\boldsymbol{\sigma}}^{\prime}}H_{n}({\boldsymbol{\sigma}}^{\prime})\}$ . This intuition was exploited in some cases to prove the failure of certain classes of algorithms in problems with a gap in the overlap distribution, see e.g. [GS14].

Important progress towards clarifying this connection was achieved recently in two remarkable papers [ABM18, Sub18].

Addario-Berry and Maillard [ABM18] study an abstract optimization problem that is thought to capture some key features of the energy landscape of the Sherrington-Kirkpatrick model, the so-called ‘continuous random energy model.’ They prove that an approximate optimum can be found in time polynomial in the problem dimensions. From an optimization perspective, the random energy model is somewhat un-natural, in that specifying an instance requires memory that is exponential in the problem dimensions.

Subag [Sub18] considers the $p$ -spin spherical spin glass. Roughly speaking, this can be described as the problem of optimizing a random smooth function (which can be taken to be a low-degree polynomial) over the unit sphere. Subag relaxes this problem by extending the optimization over the unit ball, and proves that this objective function can be optimized efficiently by following the positive directions of the Hessian. The solution thus constructed lies on the unit sphere and thus solves the un-relaxed problem. The mathematical insight of [Sub18] is beautifully simple, but uses in a crucial way the spherical geometry. While it might be possible to generalize the same argument to the hypercube case (e.g., using the generalized TAP free energy of [MV85, CPS18]) this extension is far from obvious. In particular, uniform control of the Hessian is not as straightforward as in [Sub18].

The algorithm presented here is partially inspired by [Sub18] (in particular, a key role is played by approximate orthogonality of the updates), but its specific structure is dictated by the message passing viewpoint. Thanks to the technique of [Bol14, BM11, JM13, BMN17], its analysis does not require uniform control and is relatively simple.

1.2 Notations

Given vectors ${\boldsymbol{x}},{\boldsymbol{y}}\in{\mathbb{R}}^{n}$ , we denote by $\langle{\boldsymbol{x}},{\boldsymbol{y}}\rangle$ their scalar product and by $|{\boldsymbol{x}}|\equiv\langle{\boldsymbol{x}},{\boldsymbol{x}}\rangle^{1/2}$ the associated $\ell_{2}$ norm. Given a function $f:{\mathbb{R}}^{k}\to{\mathbb{R}}$ , and $k$ vectors ${\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k}\in{\mathbb{R}}^{n}$ we write $f({\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k})$ for the vector in ${\mathbb{R}}^{n}$ with components $f({\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k})_{i}=f(x_{1,i},\dots,x_{k,i})$ . The empirical distribution of the coordinates of a vector of vectors $({\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k})\in({\mathbb{R}}^{n})^{k}$ is the probability measure on ${\mathbb{R}}^{k}$ defined by

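$$\hat{p}_{{\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k}}\equiv\frac{1}{n}\sum_{i=1}^{n}\delta_{(x_{1,i},\dots,x_{k,i})}\,.$$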

In other words, if we arrange the vectors ${\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k}$ in a matrix $\boldsymbol{X}=[{\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k}]\in{\mathbb{R}}^{n\times k}$ , $\hat{p}_{{\boldsymbol{x}}_{1},\dots,{\boldsymbol{x}}_{k}}$ denotes the probability distribution of a uniformly random row of $\boldsymbol{X}$ . In the case of a single vector ${\boldsymbol{x}}\in{\mathbb{R}}^{n}$ (i.e. for $k=1$ ), this reduces to the standard empirical distribution of the entries of ${\boldsymbol{x}}$ . We say that a function $f:{\mathbb{R}}^{d}\to{\mathbb{R}}$ is pseudo-Lipschitz if $|f({\boldsymbol{x}})-f({\boldsymbol{y}})|\leq C(1+|{\boldsymbol{x}}|+|{\boldsymbol{y}}|)|{\boldsymbol{x}}-{\boldsymbol{y}}|$ .

Given two probability measures $\mu$ , $\nu$ on ${\mathbb{R}}^{d}$ , we recall that their Wasserstein $W_{2}$ distance is defined as

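$$W_{2}(\mu,\nu)\equiv\Big\{\inf_{\gamma\in{\mathcal{C}}(\mu,\nu)}\int|{\boldsymbol{x}}-{\boldsymbol{y}}|^{2}\,\gamma({\rm d}{\boldsymbol{x}},{\rm d}{\boldsymbol{y}})\Big\}^{1/2}\,,$$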

where the infimum is taken over all the couplings of $\mu$ and $\nu$ (i.e. joint distributions on ${\mathbb{R}}^{d}\times{\mathbb{R}}^{d}$ whose first marginal coincides with $\mu$ , and second with $\nu$ ). For a sequence of probability measures $(\mu_{n})_{n\geq 1}$ , and $\mu$ on ${\mathbb{R}}^{d}$ , we say that $\mu_{n}$ converges in Wasserstein distance to $\mu$ (and write $\mu_{n}\stackrel{{\scriptstyle W_{2}}}{{\longrightarrow}}\mu$ ) if $\lim_{n\to\infty}W_{2}(\mu_{n},\mu)=0$ . It is well known that $\mu_{n}\stackrel{{\scriptstyle W_{2}}}{{\longrightarrow}}\mu$ if and only if $\lim_{n\to\infty}\int\psi({\boldsymbol{x}})\mu_{n}({\rm d}{\boldsymbol{x}})=\int\psi({\boldsymbol{x}})\mu({\rm d}{\boldsymbol{x}})$ for all bounded Lipschitz functions $\psi$ , and for $\psi({\boldsymbol{x}})=|{\boldsymbol{x}}|^{2}$ [Vil08, Theorem 6.9]. Given a sequence of random variables $X_{n}$ , we write $X_{n}\stackrel{{\scriptstyle p}}{{\longrightarrow}}X_{\infty}$ or $\operatorname*{p-lim}_{n\to\infty}X_{n}=X_{\infty}$ to state that $X_{n}$ converges in probability to $X_{\infty}$ .
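For intuition about the $W_{2}$ distance between empirical distributions, the one-dimensional case with equally many atoms admits a closed form: the optimal coupling matches sorted values. The sketch below uses that standard fact; the function name `w2_empirical_1d` is an illustrative choice, and the restriction to $d=1$ and equal sample sizes is an assumption of this example, not of the paper.

```python
import numpy as np


def w2_empirical_1d(x, y):
    """W_2 distance between the empirical distributions of two 1-d samples.

    Assumes both samples have the same number of points with uniform weights,
    in which case the optimal coupling pairs the sorted values.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    ys = np.sort(np.asarray(y, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("samples must have the same length")
    return float(np.sqrt(np.mean((xs - ys) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    # Shifting every atom by a constant c moves the distribution by |c|.
    print(w2_empirical_1d(x, x + 0.5))
    # Two independent samples of N(0, 1) are close but not identical.
    print(w2_empirical_1d(x, rng.normal(size=1000)))
```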

We will sometimes be interested in double limits of sequences of random variables. If $X_{n,M}$ is a sequence indexed by $n,M$ and $x_{*}$ is a constant,

$$\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}X_{n,M}=x_{*}\,,$$

whenever $X_{n,M}$ converges in probability to a non-random quantity $x_{M}$ as $n\to\infty$ , and $\lim_{M\to\infty}x_{M}=x_{*}$ .

2 A general message passing algorithm

Our algorithm is based on the following approximate message passing (AMP) iteration.

AMP iteration

Consider a sequence of (weakly differentiable) functions $f_{k}:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ , and a non-random initialization ${\boldsymbol{u}}^{0}\in{\mathbb{R}}^{n}$ and additional vector ${\boldsymbol{y}}\in{\mathbb{R}}^{n}$ with $\hat{p}_{{\boldsymbol{u}}^{0},{\boldsymbol{y}}}\stackrel{{\scriptstyle W_{2}}}{{\longrightarrow}}p_{U_{0},Y}$ (where $p_{U_{0},Y}$ is any probability distribution on ${\mathbb{R}}^{2}$ with finite second moment $\int(u_{0}^{2}+y^{2})p_{U_{0},Y}({\rm d}u_{0},{\rm d}y)<\infty$ ). The AMP iteration is defined by letting, for $k\geq 0$ ,

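$$\begin{split}{\boldsymbol{u}}^{k+1}&=\boldsymbol{A}\,f_{k}({\boldsymbol{u}}^{0},\dots,{\boldsymbol{u}}^{k};{\boldsymbol{y}})-\sum_{j=1}^{k}{\sf b}_{k,j}\,f_{j-1}({\boldsymbol{u}}^{0},\dots,{\boldsymbol{u}}^{j-1};{\boldsymbol{y}})\,,\\ {\sf b}_{k,j}&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_{k}}{\partial u_{i}^{j}}(u_{i}^{0},\dots,u_{i}^{k};y_{i})\,.\end{split}\tag{2.1}$$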

It will be understood throughout that $f_{j}=0$ for $j<0$ .
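To illustrate the structure of this iteration, here is a minimal sketch for a deliberately simple special case that is not the paper's choice of nonlinearities: every $f_{k}$ is taken to be $\tanh$ applied to the latest iterate only, with no dependence on ${\boldsymbol{y}}$ . Then ${\sf b}_{k,j}=0$ for $j<k$ and only the term ${\sf b}_{k,k}f_{k-1}$ survives in the correction sum. The function names are hypothetical.

```python
import numpy as np


def sample_goe(n, rng):
    """Symmetric matrix with A_ii ~ N(0, 2/n) and A_ij ~ N(0, 1/n) for i < j."""
    g = rng.normal(size=(n, n)) / np.sqrt(n)
    return (g + g.T) / np.sqrt(2.0)


def amp_tanh(a, u0, num_iter):
    """AMP iteration with f_k(u^0, ..., u^k) = tanh(u^k) (illustrative choice).

    Returns the list of iterates [u^0, u^1, ..., u^num_iter].
    """
    iterates = [np.asarray(u0, dtype=float)]
    f_prev = np.zeros_like(iterates[0])  # convention: f_j = 0 for j < 0
    for _ in range(num_iter):
        f_k = np.tanh(iterates[-1])
        # b_{k,k} = (1/n) sum_i d f_k / d u_i^k, and tanh' = 1 - tanh^2.
        b_kk = float(np.mean(1.0 - f_k ** 2))
        iterates.append(a @ f_k - b_kk * f_prev)
        f_prev = f_k
    return iterates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2000
    a = sample_goe(n, rng)
    us = amp_tanh(a, np.ones(n), num_iter=5)
    for k in range(5):
        # Empirical second moment of u^{k+1} versus that of f_k(u^k).
        print(k + 1, np.mean(us[k + 1] ** 2), np.mean(np.tanh(us[k]) ** 2))
```

The printed columns compare $|{\boldsymbol{u}}^{k+1}|^{2}/n$ with $|f_{k}({\boldsymbol{u}}^{k})|^{2}/n$ . The two should roughly agree for large $n$ , which is the diagonal case of the covariance recursion stated in Proposition 2.1 below.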

Proposition 2.1.

Consider the AMP iteration (2.1), and assume $f_{k}:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ to be Lipschitz continuous. Then for any $k\in{\mathbb{N}}$ , and any pseudo-Lipschitz function $\psi:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ , we have

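$$\frac{1}{n}\sum_{i=1}^{n}\psi(u_{i}^{0},\dots,u_{i}^{k};y_{i})\stackrel{p}{\longrightarrow}{\mathbb{E}}\,\psi(U_{0},\dots,U_{k};Y)\,.\tag{2.2}$$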

Here $(U_{j})_{j\geq 1}$ is a centered Gaussian process independent of $(U_{0},Y)$ with covariance ${\boldsymbol{\widehat{Q}}}=({\widehat{Q}}_{kj})_{k,j\geq 1}$ determined recursively via

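$${\widehat{Q}}_{k+1,j+1}={\mathbb{E}}\big\{f_{k}(U_{0},\dots,U_{k};Y)f_{j}(U_{0},\dots,U_{j};Y)\big\}\,.\tag{2.3}$$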

This proposition follows immediately from the general analysis of AMP algorithms developed in [JM13, BMN17], cf. Appendix A.

We next consider a special case of the general AMP setting.

Incremental AMP (IAMP)

Fix $\delta,M>0$ , and functions $\widehat{g}_{k}:{\mathbb{R}}\to{\mathbb{R}}$ , $k\in{\mathbb{N}}$ , $s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ . We consider the general iteration (2.1), with the following choice of functions $f_{k}$ (independent of $y$ ):

[TABLE]

where, for $u\in{\mathbb{R}}$ , $[u]_{M}=\max(-M,\min(u,M))$ . Following our convention for $f_{j}$ , we set $\widehat{g}_{j}=0$ for $j<0$ .
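The truncation $[u]_{M}$ is an entrywise clipping. A minimal sketch with illustrative function names shows the map and its weak derivative, which is the indicator of $[-M,M]$ ; the value assigned at the two kink points $u=\pm M$ is a convention chosen here and does not matter under a Gaussian law.

```python
import numpy as np


def truncate(u, m):
    """Entrywise truncation [u]_M = max(-M, min(u, M))."""
    return np.clip(u, -m, m)


def truncate_weak_derivative(u, m):
    """Weak derivative of u -> [u]_M: 1 on [-M, M], 0 outside."""
    return (np.abs(u) <= m).astype(float)


if __name__ == "__main__":
    u = np.array([-3.0, -1.0, 0.0, 2.5, 4.0])
    print(truncate(u, 2.0))
    print(truncate_weak_derivative(u, 2.0))
```

Because the clipped map is bounded and 1-Lipschitz, composing it with bounded Lipschitz functions keeps the resulting nonlinearities Lipschitz, which is the property the truncation is introduced to guarantee.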

We note that, by Eq. (2.4), $x_{k}$ is indeed a function of $u_{0},\dots,u_{k}$ , and therefore $f_{k}$ is a function of $u_{0},\dots,u_{k}$ as stated.

Lemma 2.2 (State evolution for Incremental AMP).

Consider the incremental AMP iteration, and assume $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ to be Lipschitz continuous and bounded. Then for any $k\in{\mathbb{N}}$ , and any pseudo-Lipschitz function $\psi:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ , we have

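$$\lim_{M\to\infty}\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\psi(u_{i}^{0},\dots,u_{i}^{k})={\mathbb{E}}\,\psi(U_{0}^{\delta},\dots,U_{k}^{\delta})\,.$$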

(The double limit is to be interpreted as defined in the Notations section.) Here $(U^{\delta}_{j})_{j\geq 1}$ is a centered Gaussian process independent of $U^{\delta}_{0}=U_{0}$ , with independent entries, with variance ${\rm Var}(U^{\delta}_{k})={\widehat{q}}_{k}$ given recursively by

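$$\begin{split}{\widehat{q}}_{k+1}&={\mathbb{E}}\big\{\widehat{g}_{k}(X_{k-1}^{\delta})^{2}\big\}\cdot{\widehat{q}}_{k}\,,\\ X_{k}^{\delta}&=X_{k-1}^{\delta}+v(X_{k-1}^{\delta};k\delta)\,\delta+s(X_{k-1};k\delta)\,U_{k}^{\delta}\sqrt{\delta}\,.\end{split}$$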

Proof.

Consider Eqs. (2.4), (2.5), and note that, for any $k$ , $x_{k-1}$ is a bounded Lipschitz function of $u_{0},\dots,u_{k-1}$ (because bounded Lipschitz functions are closed under sum, product, and composition). Hence $f_{k}$ defined in (2.4) is Lipschitz continuous and we can therefore apply Proposition 2.1 to get

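$$\frac{1}{n}\sum_{i=1}^{n}\psi(u_{i}^{0},\dots,u_{i}^{k})\stackrel{p}{\longrightarrow}{\mathbb{E}}\,\psi(U_{0}^{\delta,M},\dots,U_{k}^{\delta,M})\,.$$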

Here $(U^{\delta,M}_{j})_{j\geq 1}$ is a Gaussian process with covariance ${\boldsymbol{\widehat{Q}}}^{M}$ determined by Eq. (2.3). We next claim the following:

1. ${\widehat{Q}}^{M}_{j,k}=0$ for $k\neq j$ (and we set ${\widehat{q}}^{M}_{k}\equiv{\widehat{Q}}^{M}_{k,k}$ ).

2. ${\widehat{q}}^{M}_{k}\to{\widehat{q}}_{k}$ for each $k$ as $M\to\infty$ .

With these two claims, the statement of the lemma follows by dominated convergence.

To prove claim 1 note that, by symmetry we only have to consider the case $j<k$ . The proof is by induction over $k$ . For $k=1$ there is nothing to prove. Assume next that the claim holds up to a certain $k$ , and consider ${\widehat{Q}}^{M}_{j,k+1}$ for $1\leq j\leq k$ . By Eq. (2.3) we have (dropping for simplicity the superscripts $\delta,M$ from the random variables)

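$$\begin{split}{\widehat{Q}}^{M}_{j,k+1}&={\mathbb{E}}\big\{\widehat{g}_{j-1}(X_{j-2})[U_{j-1}]_{M}\,\widehat{g}_{k}(X_{k-1})[U_{k}]_{M}\big\}\\ &={\mathbb{E}}\big\{\widehat{g}_{j-1}(X_{j-2})[U_{j-1}]_{M}\,\widehat{g}_{k}(X_{k-1})\big\}\,{\mathbb{E}}\big\{[U_{k}]_{M}\big\}=0\,.\end{split}$$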

Here the second equality follows from the induction hypothesis.

To prove claim 2, note that ${\widehat{q}}_{k}^{M}$ satisfies the recursion that follows from Eq. (2.3), namely

[TABLE]

Also note that $|\widehat{g}_{k}(X^{\delta,M}_{k-1})|\leq F_{k}(U_{0},U_{1}^{\delta,M},\dots,U_{k-1}^{\delta,M})$ for some polynomial $F_{k}$ independent of $M$ . Hence the claim follows by applying recursively dominated convergence. ∎

Remark 2.1.

The use of truncation $[u_{k}]_{M}$ in the definition (2.4) is dictated by the need to ensure that $f_{k}$ is Lipschitz, and to be able to apply Proposition 2.1. We believe that the conclusion of Proposition 2.1 holds under weaker assumptions (e.g. $f_{k}$ locally Lipschitz with polynomial growth). Such a generalization would allow us to replace $[u_{k}]_{M}$ by $u_{k}$ in Eq. (2.4), and hence get rid of the parameter $M$ in our algorithm.

We are now in a position to define our candidate for a near optimum of the problem (1.1). We fix $\overline{q}>0$ and define (recalling the definition of $f_{k}$ in Eqs. (2.4), (2.5))

[TABLE]

Note that this vector depends on parameters $\delta,M,\overline{q}$ , and on the functions $g,s,v$ . Parameters $\delta$ and $M$ must be taken (respectively) small enough and large enough (but independent of $n$ ). The next section will be devoted to choosing $\overline{q}$ and the functions $g,s,v$ . In this section we will establish some general properties of ${\boldsymbol{z}}$ (for small $\delta$ and large $M$ ).

Lemma 2.3.

Consider the incremental AMP iteration, and assume $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ to be Lipschitz continuous and bounded. Further assume $\partial_{x}\widehat{g}_{k}(x)$ , $\partial_{x}s(x,t)$ , $\partial_{x}v(x,t)$ to exist and be Lipschitz continuous. Define the random variable

[TABLE]

Then we have, for any pseudo-Lipschitz function $\psi:{\mathbb{R}}\to{\mathbb{R}}$ ,

[TABLE]

Proof.

Equation (2.15) follows immediately from Lemma 2.2 upon noticing that $\psi(z_{i})$ is a pseudo-Lipschitz function of $u_{0,i}$ , …, $u_{k,i}$ .

In order to prove Eq. (2.16), we will write ${\boldsymbol{f}}_{k}=f_{k}({\boldsymbol{u}}_{0},\dots,{\boldsymbol{u}}_{k})$ , and $K=\lfloor\overline{q}/\delta\rfloor$ . We further notice that, for $j<k$ ,

[TABLE]

Here and below the random variables $U_{k}^{\delta,M},X_{k}^{\delta,M}$ are defined as in the proof of Lemma 2.2. On the other hand

[TABLE]

Note that we applied Lemma 2.2 to a non-Lipschitz function. The limit holds nevertheless by a standard weak convergence argument (namely, using upper and lower Lipschitz approximations of the indicator function). We therefore conclude that (using $U_{k}^{\delta,M}\sim{\widehat{q}}_{k}^{M}$ , ${\widehat{q}}_{k}^{M}\to{\widehat{q}}_{k}$ as $M\to\infty$ ):

[TABLE]

Next notice that, for $j<k$ ,

[TABLE]

By a similar argument, always for $j<k$ ,

[TABLE]

On the other hand

[TABLE]

By the AMP iteration, we know that $\boldsymbol{A}{\boldsymbol{f}}_{k}={\boldsymbol{u}}^{k+1}+\sum_{\ell=1}^{k}{\sf b}_{k,\ell}{\boldsymbol{f}}_{\ell-1}$ . Hence, using the above limits, for $j\leq k$ ,

[TABLE]

We finally can compute

[TABLE]

∎

In the case of models with full replica symmetry breaking, it is natural to consider the limit of small step size $\delta\to 0$ . This limit is described by the stochastic differential equation (SDE) introduced below.

SDE description.

Consider Lipschitz functions $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ , with $|s(x,t)|+|v(x,t)|\leq C(1+|x|)$ . Let $(B_{t})_{t\geq 0}$ be a standard Brownian motion. We define the process $(X_{t},Z_{t})_{t\geq 0}$ via

[TABLE]

with initial condition $X_{0}=Z_{0}=0$ . Equivalently

[TABLE]

where the integral is understood in Itô’s sense. Existence and uniqueness of strong solutions of this SDE is given, for instance, in [Øks03, Theorem 5.2.1].
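As an illustration, the process $(X_{t},Z_{t})$ can be simulated with an Euler-Maruyama scheme. The sketch below assumes that the SDE has drift $v$ and diffusion coefficient $s$ for $X_{t}$ , and that $Z_{t}$ integrates $g(X_{t},t)$ against the same Brownian motion, i.e. ${\rm d}X_{t}=v(X_{t},t)\,{\rm d}t+s(X_{t},t)\,{\rm d}B_{t}$ , ${\rm d}Z_{t}=g(X_{t},t)\,{\rm d}B_{t}$ . The function name `simulate_sde` and the example coefficients are hypothetical choices made only for illustration.

```python
import numpy as np

def simulate_sde(v, s, g, t_max, n_steps, n_paths, seed=0):
    """Euler-Maruyama sketch for dX = v(X,t) dt + s(X,t) dB, dZ = g(X,t) dB.

    Assumption: X and Z are driven by the same Brownian motion B, and
    X_0 = Z_0 = 0. Returns the terminal values (X_T, Z_T) for each path.
    """
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    x = np.zeros(n_paths)
    z = np.zeros(n_paths)
    for k in range(n_steps):
        t = k * dt
        db = np.sqrt(dt) * rng.standard_normal(n_paths)  # Brownian increment
        # All coefficients are evaluated at the left endpoint (Ito convention).
        dx = v(x, t) * dt + s(x, t) * db
        z = z + g(x, t) * db
        x = x + dx
    return x, z

# Hypothetical bounded Lipschitz coefficients, chosen only for illustration.
x_T, z_T = simulate_sde(
    v=lambda x, t: np.tanh(x),
    s=lambda x, t: np.ones_like(x),
    g=lambda x, t: 1.0 / np.cosh(x) ** 2,
    t_max=1.0, n_steps=200, n_paths=5000,
)
```

For this particular choice of coefficients, Itô's formula gives $Z_{t}=\tanh(X_{t})$ , so the simulated `z_T` should agree with `np.tanh(x_T)` up to discretization error.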

Lemma 2.4.

Given Lipschitz functions $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ , with $v$ and $s$ bounded, let $(X_{t},Z_{t})$ be the process defined above. Assume ${\mathbb{E}}\{g(X_{t},t)^{2}\}=1$ for all $t\geq 0$ . Further consider the state evolution iteration of Eq. (2.7), whereby $\widehat{g}_{k}$ is defined recursively via

[TABLE]

Then, there exists a coupling of $(X^{\delta}_{k})_{k\geq 0}$ and $(X_{t})_{t\geq 0}$ such that

[TABLE]

(Here $C$ is a constant depending only on the bounds on $g,v,s$ , and on $\overline{q}$ . Further the $O(\delta^{1/4})$ error is bounded as $|O(\delta^{1/4})|\leq C\delta^{1/4}$ for the same constant.)

Proof.

Throughout this proof, we will write $t_{k}=k\delta$ and denote by $C$ a generic constant that depends on the bounds on $g,s,v$ , and can change from line to line. Note that, by construction, ${\widehat{q}}_{j}=1$ for all $j$ , and therefore $(U^{\delta}_{j})_{j\geq 0}\sim_{iid}{\sf N}(0,1)$ . Hence we can construct the discrete and continuous processes on the same space by letting $\sqrt{\delta}U^{\delta}_{j}=B_{t_{j+1}}-B_{t_{j}}$ .

We then decompose the difference between the two processes as

[TABLE]

By taking the second moment, and using the fact that $X_{t}$ is measurable on $(B_{s})_{s\leq t}$ and $X^{\delta}_{j}$ is measurable on $(B_{s})_{s\leq t_{j}}$ , we get

[TABLE]

Next notice that by the boundedness of $s,v$ , we have ${\mathbb{E}}\{|X_{t}-X_{s}|^{2}\}\leq C|t-s|$ . Let $\Delta_{k}\equiv{\mathbb{E}}(|X_{t_{k}}-X^{\delta}_{k}|^{2})$ . Assuming without loss of generality $\delta<1$ ,

[TABLE]

The same bound holds for ${\mathbb{E}}\{[s(X_{t},t)-s(X_{j}^{\delta},t_{j+1})]^{2}\}$ . Substituting above, we get

[TABLE]

This implies the bound ${\mathbb{E}}(|X_{t_{k}}-X^{\delta}_{k}|^{2})\leq C\delta$ as stated in (2.38).

In order to prove Eq. (2.39), note that

[TABLE]

Hence

[TABLE]

Let $K=\lfloor\overline{q}/\delta\rfloor$ , and write

[TABLE]

Therefore

[TABLE]

The bound of Eq. (2.39) follows since

[TABLE]

Finally, Eq. (2.40) follows by the same estimates. ∎
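The coupling used in this proof can be explored numerically. The sketch below builds coarse Gaussian increments by summing the increments of a single fine Brownian path (so that both recursions live on the same probability space), runs the discrete recursion on both grids, and reports the mean squared gap. It assumes the recursion has the form $X_{k+1}=X_{k}+v(X_{k},t_{k})\delta+s(X_{k},t_{k})(B_{t_{k+1}}-B_{t_{k}})$ and uses the fine grid as a proxy for the continuous process; the helper names and the coefficients are hypothetical.

```python
import numpy as np

def euler_path_end(v, s, increments, dt):
    """Run X_{k+1} = X_k + v(X_k, t_k) dt + s(X_k, t_k) dB_k from X_0 = 0.

    `increments` has shape (n_paths, n_steps); column k holds B_{t_{k+1}} - B_{t_k}.
    """
    x = np.zeros(increments.shape[0])
    for k in range(increments.shape[1]):
        t = k * dt
        x = x + v(x, t) * dt + s(x, t) * increments[:, k]
    return x

def coupling_error(v, s, t_max, n_coarse, n_fine, n_paths, seed=0):
    """Mean squared gap between a coarse recursion and a fine-grid proxy for X_t."""
    assert n_fine % n_coarse == 0
    rng = np.random.default_rng(seed)
    dt_fine = t_max / n_fine
    fine = np.sqrt(dt_fine) * rng.standard_normal((n_paths, n_fine))
    # Coupling: each coarse increment is the sum of the fine increments it covers.
    coarse = fine.reshape(n_paths, n_coarse, n_fine // n_coarse).sum(axis=2)
    x_fine = euler_path_end(v, s, fine, dt_fine)
    x_coarse = euler_path_end(v, s, coarse, t_max / n_coarse)
    return float(np.mean((x_fine - x_coarse) ** 2))

v = lambda x, t: np.tanh(x)               # hypothetical bounded Lipschitz drift
s = lambda x, t: 1.0 + 0.5 * np.sin(x)    # hypothetical bounded Lipschitz diffusion
errors = {n: coupling_error(v, s, 1.0, n, 1024, 2000) for n in (8, 64)}
```

Here `errors[n]` is the empirical second moment of the gap for step size $\delta=1/n$ ; one expects it to shrink as $\delta$ decreases, in line with the $C\delta$ bound.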

We now collect the main findings of this section in a theorem. This characterizes the values of the objective function achievable by the above algorithm.

Theorem 3.

Let $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ be Lipschitz continuous, with $v$ and $s$ bounded, and define the process $(X_{t},Z_{t})$ using the SDE (2.35) with initial condition $X_{0}=Z_{0}=0$ . Assume ${\mathbb{E}}\{g(X_{t},t)^{2}\}=1$ for all $t\geq 0$ . Further assume $\partial_{x}g(x,t)$ , $\partial_{x}s(x,t)$ , $\partial_{x}v(x,t)$ to exist and be Lipschitz continuous.

Define the incremental AMP iteration $({\boldsymbol{u}}^{k})_{k\geq 0}$ , and let ${\boldsymbol{z}}$ be given by Eq. (2.13). Finally, let $\psi:{\mathbb{R}}\to{\mathbb{R}}$ be a pseudo-Lipschitz function. Then, for any ${\varepsilon}>0$ there exists $\delta_{*}({\varepsilon})>0$ , and for any $\delta\leq\delta_{*}({\varepsilon})$ there exists $M_{*}({\varepsilon},\delta)<\infty$ such that, if $\delta\leq\delta_{*}({\varepsilon})$ and $M\geq M_{*}({\varepsilon},\delta)$ , we have

[TABLE]

(Further the above limits in probability are non-random quantities.)

Proof.

This follows immediately from Lemma 2.3 and Lemma 2.4. ∎

3 Proof of the main theorem

3.1 Choosing the nonlinearities

In view of Theorem 3, we need to choose the coefficients $g,s,v$ in the SDE (2.35) so as to satisfy two conflicting requirements: $(i)$ maximize $\int_{0}^{\overline{q}}{\mathbb{E}}\{g(X_{t},t)\}\,{\rm d}t$ (the energy value achieved by our algorithm); $(ii)$ keep ${\mathbb{P}}(Z_{q_{*}}\in[-1,1])=1$ (we want a solution in the hypercube).

Throughout this section we set $\beta>\beta_{0}$ as per Assumption 1. We also set $q_{*}=q_{*}(\beta)$ and let $\mu=\mu_{\beta}$ be the unique minimizer of the Parisi functional. Finally, we fix $\Phi$ to be the solution of the PDE (1.2) with $\mu=\mu_{\beta}$ .

There is a natural SDE associated with Parisi’s variational principle, which was first introduced in physics [MPV87], and recently studied in the probability theory literature [AC15, JT16]:

[TABLE]

Unless otherwise stated, it is understood that we set the initial condition to $X_{0}=0$ . Motivated by this, we set the coefficients $g,s,v$ as follows

[TABLE]

We collect below a few useful regularity properties of $\Phi$ , which have been proved in the literature.

Lemma 3.1.

$(i)$

$\partial_{x}^{j}\Phi(t,x)$ exists and is continuous for all $j\geq 1$ .

$(ii)$

For all $(t,x)\in[0,1]\times{\mathbb{R}}$ ,

[TABLE]

$(iii)$

$\partial_{t}\partial_{x}^{j}\Phi(t,x)\in L^{\infty}([0,1]\times{\mathbb{R}})$ for all $j\geq 0$ .

$(iv)$

$\partial_{x}\Phi(t,x)$ , $\partial_{x}^{2}\Phi(t,x)$ are Lipschitz continuous on $[0,1]\times{\mathbb{R}}$ .

Proof.

Points $(i)$ and $(iii)$ are Theorem 4 in [JT16]. Point $(ii)$ is Proposition 2.$(ii)$ in [AC15]. Finally, point $(iv)$ follows immediately from points $(ii)$ , $(iii)$ . ∎

This lemma implies that the choice (3.2) satisfies the regularity assumptions in Theorem 3. We next have to check the normalization condition, and compute the resulting distribution.

Lemma 3.2.

We have

[TABLE]

In particular ${\mathbb{P}}(Z_{t}\in[-1,1])=1$ for all $t$ .

Proof.

By Lemma 2 in [AC15], we have, for any $t_{1}<t_{2}$

[TABLE]

which is exactly Eq. (3.4). Lemma 3.1. $(ii)$ implies $|Z_{t}|\leq 1$ almost surely. ∎

Lemma 3.3.

For all $0\leq t\leq q_{*}$ , we have

[TABLE]

Proof.

Equation (3.6) is Proposition 1 in [Che17]. For Eq. (3.7) note that by Eq. (39) in the same paper, we have, for any $t_{1}<t_{2}\leq q_{*}$

[TABLE]

and therefore the claim follows from Eq. (3.6). ∎

Lemma 3.4.

For any $0\leq t\leq q_{*}$ , we have

[TABLE]

Proof.

Consider $t\in[0,q_{*}]$ a continuity point of $\mu$ . Then the proof of Lemma 16 in [JT16] yields

[TABLE]

Taking expectation and using Fubini’s theorem alongside Eq. (3.6), we get

[TABLE]

The claim follows also for $t$ not a continuity point because the right hand side is obviously continuous in $t$ . The left hand side is continuous because $\partial_{xx}\Phi$ is Lipschitz (cf. Lemma 3.1) and ${\mathbb{E}}\{|X_{t}-X_{s}|^{2}\}\leq C|t-s|$ because the coefficients of the SDE are bounded Lipschitz. ∎

We summarize the results of this section in the following theorem. Here and below, for ${\boldsymbol{x}}\in{\mathbb{R}}^{n}$ , $S\subseteq{\mathbb{R}}^{n}$ , we let $d({\boldsymbol{x}},S)\equiv\inf\{|{\boldsymbol{x}}-{\boldsymbol{y}}|\,:\;{\boldsymbol{y}}\in S\}$ .

Theorem 4.

Under Assumption 1 let $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ be defined as per Eq. (3.2), and set $\overline{q}=q_{*}(\beta)$ for $\beta>\beta_{0}$ . Further let

[TABLE]

Define the incremental AMP iteration $({\boldsymbol{u}}^{k})_{k\geq 0}$ via Eqs. (2.1), (2.4), (2.5), with $\widehat{g}_{k}$ given by Eq. (2.37), and let ${\boldsymbol{z}}$ be given by Eq. (2.13). Then, for any ${\varepsilon}>0$ there exists $\delta_{*}({\varepsilon})>0$ , and for any $\delta\leq\delta_{*}({\varepsilon})$ there exists $M_{*}({\varepsilon},\delta)<\infty$ such that, if $\delta\leq\delta_{*}({\varepsilon})$ and $M\geq M_{*}({\varepsilon},\delta)$ , we have

[TABLE]

(Further the above limits in probability are non-random quantities.)

Proof.

First notice that $d({\boldsymbol{z}},[-1,1]^{n})^{2}=\sum_{i=1}^{n}\psi(z_{i})$ with $\psi(z_{i})=d(z_{i},[-1,1])^{2}$ a pseudo-Lipschitz function. Further, integration by parts yields

[TABLE]

Hence the claims of this theorem follow immediately from Theorem 3 upon checking those assumptions using the lemmas given in this section. ∎

3.2 Sequential rounding and putting everything together

Theorem 4 constructs a vector ${\boldsymbol{z}}\in{\mathbb{R}}^{n}$ . It is not difficult to round this to a vector with entries in $\{+1,-1\}$ , as detailed in the next lemma.

Lemma 3.5.

There exist an algorithm with complexity $O(n^{2})$ and an absolute constant $C>0$ such that the following happens with probability at least $1-e^{-n}$ . Given $\boldsymbol{A}\sim{\sf GOE}(n)$ and a vector ${\boldsymbol{x}}\in{\mathbb{R}}^{n}$ such that $d({\boldsymbol{x}},[-1,1]^{n})^{2}\leq n\,{\varepsilon}_{0}$ , the algorithm returns a vector ${\boldsymbol{\sigma}}_{*}\in\{+1,-1\}^{n}$ such that

[TABLE]

Proof.

Recall the definition of Hamiltonian $H_{n}({\boldsymbol{x}})\equiv\langle{\boldsymbol{x}},\boldsymbol{A}{\boldsymbol{x}}\rangle/2$ (which we view as a function on ${\mathbb{R}}^{n}$ ). We also define $\tilde{H}_{n}({\boldsymbol{x}})=H_{n}({\boldsymbol{x}})-\sum_{i=1}^{n}A_{ii}x_{i}^{2}/2=\sum_{i<j\leq n}A_{ij}x_{i}x_{j}$ .

We construct ${\boldsymbol{\sigma}}_{*}$ in two steps. First we let $\tilde{\boldsymbol{z}}$ be the projection of ${\boldsymbol{z}}$ onto the hypercube $[-1,+1]^{n}$ (i.e. $\tilde{\boldsymbol{z}}\in[-1,+1]^{n}$ is such that $|\tilde{\boldsymbol{z}}-{\boldsymbol{z}}|^{2}=d({\boldsymbol{z}},[-1,+1]^{n})^{2}\leq n\,{\varepsilon}_{0}$ ). Note that this can be constructed in $O(n)$ time (simply by projecting each coordinate $z_{i}$ onto $[-1,+1]$ ).

Second, note that the function $\tilde{H}_{n}({\boldsymbol{x}})$ is linear in each coordinate of ${\boldsymbol{x}}$ . Namely, for each $\ell$ , $\tilde{H}_{n}({\boldsymbol{x}})=x_{\ell}h_{1,\ell}({\boldsymbol{x}}_{\sim\ell};\boldsymbol{A})+h_{0,\ell}({\boldsymbol{x}}_{\sim\ell};\boldsymbol{A})$ , where ${\boldsymbol{x}}_{\sim\ell}=(x_{i})_{i\in[n]\setminus\ell}$ and $h_{1,\ell}({\boldsymbol{x}}_{\sim\ell};\boldsymbol{A})=\sum_{j\neq\ell}A_{\ell j}x_{j}$ . We then construct a sequence $\tilde{\boldsymbol{z}}(0),\dots,\tilde{\boldsymbol{z}}(n)$ as follows. Set $\tilde{\boldsymbol{z}}(0)=\tilde{\boldsymbol{z}}$ and, for each $1\leq\ell\leq n$ :

[TABLE]

Finally we set ${\boldsymbol{\sigma}}_{*}=\tilde{\boldsymbol{z}}(n)$ . This procedure takes $O(n^{2})$ operations.
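The two-step rounding can be sketched as follows. The sketch assumes that step $\ell$ sets the $\ell$ -th coordinate to the sign of $h_{1,\ell}$ and leaves the other coordinates unchanged, which is the choice that cannot decrease $\tilde{H}_{n}$ given its linearity in $x_{\ell}$ . The function names and the tie-breaking rule are hypothetical.

```python
import numpy as np

def h_tilde(A, x):
    """Off-diagonal part of the Hamiltonian: sum_{i<j} A_ij x_i x_j (A symmetric)."""
    return 0.5 * (x @ A @ x - np.sum(np.diag(A) * x ** 2))

def sequential_round(A, z):
    """Project z onto [-1, 1]^n, then round one coordinate at a time.

    Coordinate l is set to the sign of h_{1,l} = sum_{j != l} A_lj x_j, which
    cannot decrease h_tilde because h_tilde is linear in x_l.
    Ties (h_{1,l} = 0) are broken towards +1, an arbitrary choice.
    """
    x = np.clip(np.asarray(z, dtype=float), -1.0, 1.0)  # coordinatewise projection
    z_proj = x.copy()
    for l in range(x.size):
        h1 = A[l] @ x - A[l, l] * x[l]   # O(n) work per coordinate, O(n^2) overall
        x[l] = 1.0 if h1 >= 0 else -1.0
    return z_proj, x

# Small random symmetric instance; the scaling is only for illustration.
rng = np.random.default_rng(0)
n = 200
G = rng.standard_normal((n, n)) / np.sqrt(n)
A = (G + G.T) / np.sqrt(2)
z = 1.5 * rng.uniform(-1, 1, size=n)
z_proj, sigma = sequential_round(A, z)
```

Comparing `h_tilde(A, sigma)` with `h_tilde(A, z_proj)` gives a direct check of claim $(i)$ below on any given instance.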

The lemma then follows straightforwardly from the following three claims:

$(i)$

$\tilde{H}_{n}({\boldsymbol{\sigma}}_{*})\geq\tilde{H}_{n}(\tilde{\boldsymbol{z}})$ .

$(ii)$

$|\tilde{H}_{n}(\tilde{\boldsymbol{z}})-H_{n}(\tilde{\boldsymbol{z}})|\leq 20\sqrt{n}$ , $|\tilde{H}_{n}({\boldsymbol{\sigma}}_{*})-H_{n}({\boldsymbol{\sigma}}_{*})|\leq 20\sqrt{n}$ with probability at least $1-e^{-2n}$ .

$(iii)$

$|H_{n}({\boldsymbol{z}})-H_{n}(\tilde{\boldsymbol{z}})|\leq 20n\sqrt{{\varepsilon}_{0}}$ with probability at least $1-e^{-2n}$ .

Claim $(i)$ is immediate since $\tilde{H}_{n}(\tilde{\boldsymbol{z}}(\ell+1))\geq\tilde{H}_{n}(\tilde{\boldsymbol{z}}(\ell))$ for each $\ell$ .

Claim $(ii)$ holds since, for any ${\boldsymbol{x}}\in[-1,+1]^{n}$ ,

[TABLE]

Now we have ${\mathbb{E}}\tau(\boldsymbol{A})=\sqrt{n/\pi}$ , and $\tau$ is a Lipschitz function of the Gaussian vector $(A_{ii})_{i\leq n}$ . Hence the desired bounds follow by Gaussian concentration.

For claim $(iii)$ , let ${\boldsymbol{v}}={\boldsymbol{z}}-\tilde{\boldsymbol{z}}$ and note that (denoting by $\lambda_{\max}(\boldsymbol{A})$ the maximum eigenvalue of $\boldsymbol{A}$ )

[TABLE]

The desired probability bound follows by concentration of the largest eigenvalue of ${\sf GOE}$ matrices [AGZ09]. ∎

We finally need to show that the quantity ${\mathcal{E}}(\beta)$ of Theorem 4 converges to the asymptotic optimum value, for large $\beta$ . This is achieved in the two lemmas below.

Lemma 3.6.

Let ${\mathcal{E}}_{0}(\beta)\equiv(\beta/2)(1-\int_{0}^{1}t^{2}\,\mu_{\beta}({\rm d}t))$ . Then, almost surely,

[TABLE]

Proof.

By Gaussian concentration, it is sufficient to consider the expectation $E_{n}={\mathbb{E}}\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}H_{n}({\boldsymbol{\sigma}})/n$ (recall that $H_{n}({\boldsymbol{\sigma}})=\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle/2$ ). Recall the definition of partition function $Z_{n}(\beta)=\sum_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}\exp(\beta H_{n}({\boldsymbol{\sigma}}))$ , and define the associated Gibbs measure $\nu_{\beta}({\boldsymbol{\sigma}})=\exp(\beta H_{n}({\boldsymbol{\sigma}}))/Z_{n}(\beta)$ and free energy density $F_{n}(T)\equiv(T/n){\mathbb{E}}\log Z_{n}(\beta=1/T)$ . A standard thermodynamic identity [MM09] yields $F_{n}(T)={\mathbb{E}}\nu_{1/T}(H_{n}({\boldsymbol{\sigma}}))+TS(\nu_{1/T})$ , where $S(q)$ is the Shannon entropy of the probability distribution $q$ . Further $F^{\prime}_{n}(T)=S(\nu_{1/T})\geq 0$ and $F_{n}(T)\to E_{n}$ as $T\to 0$ . Hence

[TABLE]

On the other hand, $\partial_{\beta}(\beta F_{n}(\beta))={\mathbb{E}}\nu_{\beta}(H_{n}({\boldsymbol{\sigma}}))$ . Since $\beta F_{n}(\beta)\to{\sf P}_{\beta}(\mu_{\beta})$ by Theorem 1, $F_{n}(\beta),{\sf P}_{\beta}(\mu_{\beta})$ are convex with ${\sf P}_{\beta}(\mu_{\beta})$ differentiable [Tal06a], it follows that

[TABLE]

(The last equality is proved in [Tal06a], with a difference in normalization of $\beta$ .) ∎
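The zero-temperature limit used in this proof rests on an elementary sandwich that holds for any fixed matrix: $\max_{{\boldsymbol{\sigma}}}H_{n}({\boldsymbol{\sigma}})\leq T\log Z_{n}(1/T)\leq\max_{{\boldsymbol{\sigma}}}H_{n}({\boldsymbol{\sigma}})+Tn\log 2$ . The sketch below checks it by brute-force enumeration on a small instance; the function name is hypothetical and the enumeration is only feasible for small $n$ .

```python
import itertools
import numpy as np

def brute_force_free_energy(A, temperature):
    """Return (max_sigma H(sigma), T * log Z) by enumerating all 2^n sign vectors.

    H(sigma) = <sigma, A sigma> / 2 and Z = sum_sigma exp(H(sigma) / T).
    Meant as a sanity check for small n, not as an algorithm.
    """
    n = A.shape[0]
    sigmas = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
    energies = 0.5 * np.einsum("si,ij,sj->s", sigmas, A, sigmas)
    h_max = energies.max()
    # Stable log-sum-exp: factor out the largest term before exponentiating.
    log_z = h_max / temperature + np.log(np.sum(np.exp((energies - h_max) / temperature)))
    return h_max, temperature * log_z

rng = np.random.default_rng(1)
n = 10
G = rng.standard_normal((n, n)) / np.sqrt(n)
A = (G + G.T) / np.sqrt(2)
results = {T: brute_force_free_energy(A, T) for T in (1.0, 0.1, 0.01)}
```

As `T` decreases, the second entry of `results[T]` approaches the first, which is the unnormalized, single-instance counterpart of $F_{n}(T)\to E_{n}$ .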

Lemma 3.7.

For any $\beta>\beta_{0}$ ,

[TABLE]

Proof.

The PDE (1.2) can be solved for $t\in(q_{*},1]$ using the Cole-Hopf transformation $\Phi=\log u$ . This yields $\Phi(q_{*},x)=((1-q_{*})/2)+\log 2\cosh x$ , whence $\partial_{x}\Phi(q_{*},x)=\tanh(x)$ and $\partial_{xx}\Phi(q_{*},x)=1-\tanh(x)^{2}$ . Substituting in Eqs. (3.6), (3.6), we get

[TABLE]

Hence

[TABLE]

∎
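The derivative identities used in this proof are easy to check numerically. The sketch below uses central finite differences to confirm that $x\mapsto\log 2\cosh x$ has first derivative $\tanh(x)$ and second derivative $1-\tanh(x)^{2}$ , and that $\partial_{xx}\Phi+(\partial_{x}\Phi)^{2}$ is then identically $1$ . Under the assumption that $\mu(t)=1$ on $(q_{*},1]$ , this means the $x$ -dependent part of the PDE (1.2) is constant in $x$ , so the solution differs from $\log 2\cosh x$ only by a function of $t$ . Variable names are hypothetical.

```python
import numpy as np

def log_2cosh(x):
    """Numerically stable log(2 cosh x) = |x| + log(1 + exp(-2|x|))."""
    ax = np.abs(x)
    return ax + np.log1p(np.exp(-2.0 * ax))

x = np.linspace(-5.0, 5.0, 2001)
h = 1e-3
# Central finite differences for the first and second x-derivatives.
d1 = (log_2cosh(x + h) - log_2cosh(x - h)) / (2 * h)
d2 = (log_2cosh(x + h) - 2 * log_2cosh(x) + log_2cosh(x - h)) / h ** 2

err_first = np.max(np.abs(d1 - np.tanh(x)))
err_second = np.max(np.abs(d2 - (1.0 - np.tanh(x) ** 2)))
# With mu = 1, the x-dependent part of the PDE is d2 + d1**2, which should equal 1.
err_pde = np.max(np.abs(d2 + d1 ** 2 - 1.0))
```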

The proof of our main result, Theorem 2, follows quite easily from the findings of this section.

Proof of Theorem 2.

Let $E_{*}\equiv\lim_{n\to\infty}\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}H_{n}({\boldsymbol{\sigma}})/n$ . This limit exists by Corollary 1.1, and we further have $E_{*}\geq 1/2$ (this can be proved by the same thermodynamic argument as in the proof of Lemma 3.6, noting that $(1/n)\log Z_{n}(\beta)\to\log 2+(\beta^{2}/4)$ for $\beta\leq 1$ [Pan13b]). It is therefore sufficient to output ${\boldsymbol{\sigma}}_{*}$ such that, with high probability, $H_{n}({\boldsymbol{\sigma}}_{*})/n\geq E_{*}-({\varepsilon}/3)$ .

Let $\beta=10/{\varepsilon}$ . By Lemma 3.6 and Lemma 3.7, we have ${\mathcal{E}}(\beta)\geq E_{*}-({\varepsilon}/5)$ . Applying the algorithm of Theorem 4, we thus obtain, with high probability, a vector ${\boldsymbol{z}}\in{\mathbb{R}}^{n}$ such that $H_{n}({\boldsymbol{z}})/n\geq E_{*}-{\varepsilon}/4$ and $d({\boldsymbol{z}},[-1,1]^{n})^{2}\leq n\,{\varepsilon}^{2}/10^{6}$ . The proof is completed by using the rounding procedure of Lemma 3.5. ∎

4 Relation with the TAP equations

In this section we prove that the algorithm described in Section 2, when used in conjunction with the specific choice of functions $g_{k}$ , $s$ , $v$ in Section 3, actually constructs an approximate solution of the TAP equations (under Assumption 1). As in the previous section, we set $\overline{q}=q_{*}$ , $v(x,t)=\beta^{2}\mu(t)\partial_{x}\Phi(t,x)$ , $s(x,t)=\beta$ , $g(x,t)=\beta\partial_{xx}\Phi(t,x)$ , and

[TABLE]

Using these settings, we recall that ${\boldsymbol{x}}^{k}$ and ${\boldsymbol{z}}$ are given by

[TABLE]

Finally, we will repeatedly use the fact that the PDE (1.2) can be solved on $(q_{*},1]$ using the Cole-Hopf transformation, which yields $\Phi(q_{*},x)=\log 2\cosh(x)+(1-q_{*})/2$ .

Lemma 4.1.

Setting $k_{*}=\lfloor q_{*}/\delta\rfloor$ , we have

[TABLE]

Proof.

By Lemma 2.2, we have

[TABLE]

On the other hand, using Lemma 2.4, we obtain

[TABLE]

where the last identity follows from Eq. (3.5). ∎

Lemma 4.2.

Setting $k_{*}=\lfloor q_{*}/\delta\rfloor$ , we have

[TABLE]

Proof.

Throughout the proof, we will write ${\boldsymbol{f}}_{k}\equiv f_{k}({\boldsymbol{u}}_{0},\dots,{\boldsymbol{u}}_{k})$ . By the basic iteration (2.1), we have

[TABLE]

Using Eqs. (2.19) and (2.22), together with the fact that $|{\boldsymbol{f}}_{k}|^{2}/n$ , $|{\boldsymbol{u}}^{k}|^{2}/n$ are bounded by Lemma 2.2, we get

[TABLE]

Next, using again Lemma 2.4, we have $\sqrt{\delta}\sum_{k=1}^{k_{*}}U^{\delta}_{k+1}\stackrel{{\scriptstyle L_{2}}}{{\longrightarrow}}B_{q_{*}}$ , $X^{\delta}_{k_{*}}\stackrel{{\scriptstyle L_{2}}}{{\longrightarrow}}X_{q_{*}}$ and

[TABLE]

where in the last step we used Lemma 3.4. By Fubini’s theorem

[TABLE]

where in the last step we used once more Eq. (3.5). Substituting these limits in Eq. (4.11), we get

[TABLE]

where we used the fact that $X_{t}$ solves the SDE (3.1), and $\Phi(q_{*},x)=\log 2\cosh(x)+(1-q_{*})/2$ . ∎

We can therefore state our result about constructing solutions to the TAP equations.

Theorem 5 (Constructing solutions to the TAP equations).

Under Assumption 1 let $g,s,v:{\mathbb{R}}\times{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}$ be defined as per Eq. (3.2), and set $\overline{q}=q_{*}(\beta)$ for $\beta>\beta_{0}$ . Define the incremental AMP iteration $({\boldsymbol{u}}^{k})_{k\geq 0}$ via Eqs. (2.1), (2.4), (2.5), with $\widehat{g}_{k}$ given by Eq. (2.37), and let ${\boldsymbol{z}}$ be given by Eq. (2.13). (The same iteration is given explicitly in Eqs. (4.2), (4.3).)

Set $k_{*}=\lfloor q_{*}/\delta\rfloor$ . Then, for any ${\varepsilon}>0$ there exists $\delta_{*}({\varepsilon})>0$ , and for any $\delta\leq\delta_{*}({\varepsilon})$ there exists $M_{*}({\varepsilon},\delta)<\infty$ such that, if $\delta\leq\delta_{*}({\varepsilon})$ and $M\geq M_{*}({\varepsilon},\delta)$ , we have, with high probability

[TABLE]

Proof.

The theorem follows immediately from Lemma 4.1 and Lemma 4.2, using the fact that, with high probability, $\boldsymbol{A}$ has operator norm bounded by $2+{\varepsilon}$ [AGZ09]. ∎

5 Universality

In this section we use the universality results of [BLM15] to generalize Theorem 2 to other random matrix distributions. Namely, we will work under the following assumption:

Assumption 2.

The matrix $\boldsymbol{A}=\boldsymbol{A}(n)$ is symmetric with $A_{ii}=0$ and $(A_{ij})_{1\leq i<j\leq n}$ a collection of independent random variables, satisfying ${\mathbb{E}}\{A_{ij}\}=0$ , ${\mathbb{E}}\{A_{ij}^{2}\}=1/n$ . Further, the entries are subgaussian, with common subgaussian parameter $C_{*}/n$ . (Namely, ${\mathbb{E}}\{\exp(\lambda A_{ij})\}\leq\exp(C_{*}\lambda^{2}/2n)$ for all $i<j\leq n$ .)

Using [BLM15, Theorem 4], and proceeding exactly as for Proposition 2.1, we obtain the following.

Proposition 5.1.

Consider the AMP iteration (2.1), with $\boldsymbol{A}=\boldsymbol{A}(n)$ satisfying Assumption 2. Further, assume $f_{k}:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ to be a fixed polynomial (independent of $n$ ). Then for any $k\in{\mathbb{N}}$ , and any pseudo-Lipschitz function $\psi:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ , we have

[TABLE]

Here $(U_{j})_{j\geq 1}$ is a centered Gaussian process independent of $(U_{0},Y)$ with covariance ${\boldsymbol{\widehat{Q}}}=({\widehat{Q}}_{kj})_{k,j\geq 1}$ determined recursively via

[TABLE]

Notice an important difference with respect to Proposition 2.1: instead of Lipschitz functions, we require the functions $f_{k}$ to be polynomials. However, this result is strong enough to allow us to prove the following generalization of Theorem 2.

Theorem 6.

Let $\boldsymbol{A}=\boldsymbol{A}(n)$ , $n\geq 1$ be random matrices satisfying Assumption 2. Under Assumption 1, for any ${\varepsilon}>0$ there exists an algorithm that takes as input the matrix $\boldsymbol{A}\in{\mathbb{R}}^{n\times n}$ , and outputs ${\boldsymbol{\sigma}}_{*}={\boldsymbol{\sigma}}_{*}(\boldsymbol{A})\in\{+1,-1\}^{n}$ , such that the following hold: $(i)$ The complexity (floating point operations) of the algorithm is at most $C({\varepsilon})n^{2}$ . $(ii)$ We have $\langle{\boldsymbol{\sigma}}_{*},\boldsymbol{A}{\boldsymbol{\sigma}}_{*}\rangle\geq(1-{\varepsilon})\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle$ .

Proof.

Let $\widehat{g}_{k}(x)$ , $v(x,t)$ , $s(x,t)$ be defined as in the proof of Theorem 2 for $k\leq 1/\delta$ . For each $M\in{\mathbb{Z}}$ , and each $k\leq 1/\delta$ , we construct a polynomial $\hat{p}_{k,M}:{\mathbb{R}}^{k-1}\to{\mathbb{R}}$ which approximates the dynamics defined by $\widehat{g}_{k}(\,\cdot\,)$ , $v(\,\cdot\,,k\delta)$ , $s(\,\cdot\,,k\delta)$ , in a sense that we will make precise below.

We define the IAMP iteration, analogously to (2.4), (2.5)

[TABLE]

We then claim that we can construct these polynomial approximations $\hat{p}_{k,M}$ so that, for any $k\leq 1/\delta$ , and any pseudo-Lipschitz function $\psi:{\mathbb{R}}^{k+2}\to{\mathbb{R}}$ , we have

[TABLE]

where the independent random variables $(U^{\delta}_{\ell})_{\ell\geq 0}$ are defined as in Lemma 2.2. Given this claim, the rest of the proof of Theorem 2 can be applied verbatim to this slightly different algorithm.

In order to prove the claim (5.4), we proceed as in the proof of Lemma 2.2. Namely, by applying Proposition 5.1, we get

[TABLE]

where $(U^{\delta,M}_{\ell})_{\ell\geq 0}$ is a centered Gaussian process. Using the same argument as in Lemma 2.2, we obtain that the Gaussian random variables $(U^{\delta,M}_{\ell})_{\ell\geq 0}$ are independent. Further, letting ${\widehat{q}}^{M}_{\ell}\equiv{\mathbb{E}}\{(U^{\delta,M}_{\ell})^{2}\}$ , Proposition 5.1 yields the following recursion

[TABLE]

The claim (5.4) follows by showing that we can choose polynomials $(\hat{p}_{\ell,M})_{\ell\geq 0}$ so that $\lim_{M\to\infty}{\widehat{q}}^{M}_{\ell}={\widehat{q}}_{\ell}$ for each $\ell\leq 1/\delta$ . This can be done by induction over $k$ . As a preliminary, notice that there is $c_{0}=c_{0}(\delta)>0$ sufficiently small so that, for the sequence of random variables defined recursively via Eq. (2.7), we have $2c_{0}\leq{\widehat{q}}_{k}\leq 1/(2c_{0})$ for all $k\leq 1/\delta$ (the existence of such $c_{0}>0$ can also be shown by induction over $k$ using the fact that $\widehat{g}_{k},v,s$ are bounded Lipschitz).

The basis of the induction $\lim_{M\to\infty}{\widehat{q}}^{M}_{0}={\widehat{q}}_{0}$ is trivial. Then assume that the induction claim is true for all $\ell\leq k$ . Without loss of generality we can consider that, for any $M\geq 1$ we have $c_{0}\leq{\widehat{q}}_{1}^{M},\dots,{\widehat{q}}_{k}^{M}\leq 1/c_{0}$ . Indeed by the induction hypothesis this holds for all $M$ large enough, and we can always renumber the polynomials $\hat{p}_{\ell,M}(\,\cdots\,)$ so that it holds for all $M\geq 1$ . Then notice that the random variable $X^{\delta}_{k}$ of Eq. (2.7) can be written as $X^{\delta}_{k}=h_{k}(U_{0},U_{1}^{\delta},\dots,U_{k-1}^{\delta})$ for a certain function $h_{k}$ that is bounded by a polynomial. We then choose the polynomials $\hat{p}_{k,M}(\,\cdot\,)$ so that

[TABLE]

Such polynomials can be constructed, for instance, by considering the expansion of $h_{k}$ in the basis of multivariate Hermite polynomials (suitably rescaled so as to form an orthonormal basis in $L^{2}({\mathbb{R}}^{k-1},\mu_{k})$ , where $\mu_{k}$ is the joint distribution of $U_{0},U^{\delta,M}_{1},\dots,U^{\delta,M}_{k-1}$ ). The variance bound $c_{0}\leq{\widehat{q}}_{1}^{M},\dots,{\widehat{q}}_{k}^{M}\leq 1/c_{0}$ is used in controlling the error term.
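To make the Hermite construction concrete, the following one-dimensional sketch projects a bounded function onto polynomials of a given degree in $L^{2}$ with respect to the standard Gaussian measure, and measures the resulting error. It is a simplification of the multivariate setting above: the target function $\tanh$ , the unit variance, and the helper names are assumptions chosen for illustration.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_projection(h, degree, n_quad=80):
    """Coefficients of the L^2(N(0,1)) projection of h onto polynomials of given degree.

    Uses probabilists' Hermite polynomials He_j, for which E[He_j(U)^2] = j!.
    """
    nodes, weights = He.hermegauss(n_quad)          # weight function exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)        # normalise to the N(0,1) density
    coeffs = np.empty(degree + 1)
    for j in range(degree + 1):
        basis_j = He.hermeval(nodes, [0.0] * j + [1.0])   # He_j evaluated at the nodes
        coeffs[j] = np.sum(weights * h(nodes) * basis_j) / math.factorial(j)
    return coeffs

def l2_error(h, coeffs, n_quad=80):
    """Root mean squared error E[(h(U) - p(U))^2]^{1/2} for U ~ N(0,1), by quadrature."""
    nodes, weights = He.hermegauss(n_quad)
    weights = weights / np.sqrt(2.0 * np.pi)
    return float(np.sqrt(np.sum(weights * (h(nodes) - He.hermeval(nodes, coeffs)) ** 2)))

errors = [l2_error(np.tanh, hermite_projection(np.tanh, d)) for d in (1, 3, 7)]
```

The entries of `errors` decrease as the degree grows, which is the one-dimensional analogue of choosing $\hat{p}_{k,M}$ with vanishing $L^{2}$ error as $M\to\infty$ .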

The induction claim then follows by

[TABLE]

where the last equality holds by dominated convergence. ∎

Corollary 1.2 follows by applying Theorem 6 with $\boldsymbol{A}$ a suitably centered and normalized adjacency matrix.

Proof of Corollary 1.2.

Given a graph $G\sim{\mathcal{G}}(n,p)$ , construct the matrix $\boldsymbol{A}=\boldsymbol{A}^{{\sf T}}\in{\mathbb{R}}^{n\times n}$ , by setting $A_{ii}=0$ and, for $i\neq j$ :

[TABLE]

It is easy to verify that this matrix satisfies Assumption 2. Further, we have

[TABLE]
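The link between the cut value and the quadratic form can be checked on a small random graph. The sketch below assumes the sign convention $A_{ij}=(p-G_{ij})/\sqrt{np(1-p)}$ for $i\neq j$ , where $G_{ij}\in\{0,1\}$ indicates an edge; this gives mean zero and variance $1/n$ as in Assumption 2, and makes a large $\langle{\boldsymbol{\sigma}},\boldsymbol{A}{\boldsymbol{\sigma}}\rangle$ correspond to a large cut. The identity in the code is derived for this convention, and the function names are hypothetical.

```python
import numpy as np

def centered_adjacency(adj, p):
    """Centre and rescale a 0/1 adjacency matrix so off-diagonal entries have
    mean 0 and variance 1/n under G(n, p). The sign convention (p - adj) is an
    assumption of this sketch: it makes large <sigma, A sigma> favour large cuts.
    """
    n = adj.shape[0]
    A = (p - adj) / np.sqrt(n * p * (1.0 - p))
    np.fill_diagonal(A, 0.0)
    return A

def cut_value(adj, sigma):
    """Number of edges whose endpoints receive different signs."""
    disagree = (1.0 - np.outer(sigma, sigma)) / 2.0
    return float(np.sum(np.triu(adj * disagree, k=1)))

rng = np.random.default_rng(0)
n, p = 60, 0.3
upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
adj = upper + upper.T                       # symmetric, zero diagonal
sigma = rng.choice([-1.0, 1.0], size=n)
A = centered_adjacency(adj, p)

num_edges = adj.sum() / 2.0
# CUT = |E|/2 + sqrt(n p (1-p))/4 * <sigma, A sigma> - p/4 * (<sigma, 1>^2 - n).
rhs = (num_edges / 2.0
       + np.sqrt(n * p * (1.0 - p)) / 4.0 * (sigma @ A @ sigma)
       - p / 4.0 * (np.sum(sigma) ** 2 - n))
```

The last term vanishes up to $O(n)$ when ${\boldsymbol{\sigma}}$ is balanced, which is why the balancing step matters in what follows.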

Recall that we know from [DMS17] $\max_{{\boldsymbol{\sigma}}\in\{+1,-1\}^{n}}{\sf CUT}_{G}({\boldsymbol{\sigma}})=|E_{n}|/2+(n^{3}p(1-p)/2)^{1/2}{\sf P}_{*}+o(n^{3/2})$ . Let ${\boldsymbol{\sigma}}_{1}$ denote the output of the algorithm of Theorem 6, on input $\boldsymbol{A}$ . Applying this theorem and Lemma 5.1, we get

[TABLE]

We construct ${\boldsymbol{\sigma}}_{*}$ by balancing ${\boldsymbol{\sigma}}_{1}$ . Namely, if $|\langle{\boldsymbol{\sigma}}_{1},{\boldsymbol{1}}\rangle|=\ell$ , we obtain ${\boldsymbol{\sigma}}_{*}$ by flipping $\lfloor\ell/2\rfloor$ entries of ${\boldsymbol{\sigma}}_{1}$ so that $|\langle{\boldsymbol{\sigma}}_{*},{\boldsymbol{1}}\rangle|\leq 1$ . We then have, with high probability

[TABLE]

(Here $\|\boldsymbol{A}\|_{\mbox{\tiny\rm op}}$ denotes the operator norm of matrix $\boldsymbol{A}$ .) Therefore, since $|\langle{\boldsymbol{\sigma}}_{1},{\boldsymbol{1}}\rangle|/n=\ell/n\stackrel{{\scriptstyle p}}{{\longrightarrow}}0$ , and $\|\boldsymbol{A}\|_{\mbox{\tiny\rm op}}\leq 2.01$ with high probability [AGZ09], we get

[TABLE]

which completes the proof. ∎
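The balancing step in this proof is simple to implement. A minimal sketch, assuming that any $\lfloor\ell/2\rfloor$ entries of the majority sign may be flipped (the proof does not depend on which ones); the function name is hypothetical.

```python
import numpy as np

def balance(sigma):
    """Flip floor(l/2) entries of the majority sign, where l = |sum(sigma)|."""
    sigma = np.array(sigma, dtype=int)
    total = int(sigma.sum())
    n_flips = abs(total) // 2
    majority = 1 if total > 0 else -1
    # Which majority entries get flipped is arbitrary; here we take the first ones.
    idx = np.flatnonzero(sigma == majority)[:n_flips]
    sigma[idx] = -majority
    return sigma, n_flips

sigma_1 = np.array([1, 1, 1, 1, 1, -1, 1, -1, 1])
sigma_star, n_flips = balance(sigma_1)
```

Each flip changes $\langle{\boldsymbol{\sigma}},{\boldsymbol{1}}\rangle$ by $2$ , so after $\lfloor\ell/2\rfloor$ flips the remaining imbalance is at most $1$ in absolute value.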

Acknowledgements

I am grateful to Eliran Subag for an inspiring presentation of his work [Sub18] delivered at the workshop ‘Advances in Asymptotic Probability’ in Stanford, and for a stimulating conversation. This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162 and ONR N00014-18-1-2729.

Appendix A Proof of Proposition 2.1

As mentioned in the main text, Proposition 2.1 is a consequence of the general analysis of AMP algorithms available in the literature. In particular it can be obtained from a reduction to the setting of [JM13, Theorem 1]. Let us briefly recall the class of algorithms considered in [JM13], adapting the notations to the present ones. (We limit ourselves to the ‘one-block’ case in the language of [JM13].)

Fixing $T\geq 1$ consider a sequence of Lipschitz functions

[TABLE]

Given two matrices ${\boldsymbol{x}}\in{\mathbb{R}}^{n\times(T+1)}$ , ${\boldsymbol{z}}\in{\mathbb{R}}^{n\times 2}$ , we let $F_{t}({\boldsymbol{x}};{\boldsymbol{z}})\in{\mathbb{R}}^{n\times(T+1)}$ be the matrix whose $i$ -th row is given by $F_{t}({\boldsymbol{x}}_{i},{\boldsymbol{z}}_{i})$ (where ${\boldsymbol{x}}_{i}$ is the $i$ -th row of ${\boldsymbol{x}}$ and ${\boldsymbol{z}}_{i}$ is the $i$ -th row of ${\boldsymbol{z}}$ ).

Then [JM13] analyzes the following AMP iteration, which produces a sequence of iterates ${\boldsymbol{x}}^{t}\in{\mathbb{R}}^{n\times(T+1)}$

[TABLE]

Here ${\sf B}_{t}\in{\mathbb{R}}^{T\times T}$ is a matrix with entries defined by

[TABLE]

Under the assumption that ${\boldsymbol{x}}^{0},{\boldsymbol{z}}$ are independent of $\boldsymbol{A}$ , and $\hat{p}_{{\boldsymbol{x}}^{0},{\boldsymbol{z}}}\equiv n^{-1}\sum_{i=1}^{n}\delta_{{\boldsymbol{x}}^{0}_{i},{\boldsymbol{z}}_{i}}$ converges in $W_{2}$ , [JM13, Theorem 1] determines the asymptotic empirical distribution of ${\boldsymbol{x}}^{t},{\boldsymbol{z}}$ .

Proposition 2.1 can be recast as a special case of this setting. First notice that we can always choose an $n$ -independent $T$ such that the time horizon $k$ in Eq. (2.2) satisfies $k\leq T$ . We then consider the iteration (A.1) with initialization ${\boldsymbol{x}}^{0}=\boldsymbol{0}$ , data vectors ${\boldsymbol{z}}=({\boldsymbol{u}}^{0},{\boldsymbol{y}})$ , and update functions given by

[TABLE]

With this setting, the vector $(x^{t}_{i,\ell})_{i\leq n}\in{\mathbb{R}}^{n}$ coincides with ${\boldsymbol{u}}^{\ell}$ as given in Eq. (2.1), for all $t\geq\ell$ . The recursion of Eq. (2.3) follows from the analogous recursion in [JM13, Theorem 1].

Appendix B A simplified version of the algorithm

In this appendix we provide a simplified version of the algorithm of Theorem 2, for the reader’s convenience. In this presentation we simplify certain technical details that have been introduced in the main text to simplify the proof. In the pseudo-code below $\odot$ denotes entrywise multiplication between vectors. Further, when a scalar function is applied to a vector, it is understood to be applied componentwise. In particular, note that $|\partial_{xx}\Phi(k\delta,{\boldsymbol{x}}^{k})|$ is the $\ell_{2}$ norm of the vector whose $i$ -th component is $\partial_{xx}\Phi(k\delta,x_{i}^{k})$ .

Notice that this pseudo-code does not describe how to minimize the Parisi functional and to solve the PDE (1.2). As discussed in the introduction, we believe this can be done efficiently because of the strong convexity and continuity of $\mu\mapsto{\sf P}_{\beta}(\mu)$ . Indeed highly accurate numerical solutions (albeit with no rigorous analysis) were developed already in [CR02, OSS07, SO08].

Further, the pseudo-code does not specify the rounding procedure, which is given below.

Bibliography

[ABE+05] Sanjeev Arora, Eli Berger, Elad Hazan, Guy Kindler, and Muli Safra, On non-approximability for quadratic programs, Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, IEEE, 2005, pp. 206–215.
[ABM18] Louigi Addario-Berry and Pascal Maillard, The algorithmic hardness threshold for continuous random energy models, arXiv:1810.05129 (2018).
[AC15] Antonio Auffinger and Wei-Kuo Chen, The Parisi formula has a unique minimizer, Communications in Mathematical Physics 335 (2015), no. 3, 1429–1444.
[AC17] Antonio Auffinger and Wei-Kuo Chen, Parisi formula for the ground state energy in the mixed $p$-spin model, The Annals of Probability 45 (2017), no. 6b, 4617–4631.
[ACZ17] Antonio Auffinger, Wei-Kuo Chen, and Qiang Zeng, The SK model is Full-step Replica Symmetry Breaking at zero temperature, arXiv:1703.06872 (2017).
[AGZ09] Greg W. Anderson, Alice Guionnet, and Ofer Zeitouni, An introduction to random matrices, Cambridge University Press, 2009.
[BCKM98] Jean-Philippe Bouchaud, Leticia F. Cugliandolo, Jorge Kurchan, and Marc Mézard, Out of equilibrium dynamics in spin-glasses and other glassy systems, Spin glasses and random fields (1998), 161–223.
[BKW19] Afonso S. Bandeira, Dmitriy Kunisky, and Alexander S. Wein, Computational hardness of certifying bounds on constrained PCA problems, arXiv:1902.07324 (2019).
