Injectivity of ReLU networks: perspectives from statistical physics

Antoine Maillard; Afonso S. Bandeira; David Belius; Ivan Dokmanić; Shuta Nakajima

arXiv:2302.14112 · cond-mat.dis-nn · December 13, 2024


TL;DR

This paper investigates the injectivity of single-layer ReLU neural networks in high dimensions, connecting it to statistical physics models and challenging existing geometric conjectures with analytical and rigorous methods.

Contribution

It introduces a novel perspective linking neural network injectivity to spin glass models, deriving analytical thresholds and refuting previous geometric predictions.

Findings

01. Derived analytical equations for injectivity thresholds using replica symmetry-breaking theory.

02. Proved that a replica-symmetric upper bound contradicts the Euler characteristic conjecture.

03. Established a connection between spin glasses and integral geometry in the context of neural networks.

Abstract

When can the input of a ReLU neural network be inferred from its output? In other words, when is the network injective? We consider a single layer, $x \mapsto ReLU (W x)$ , with a random Gaussian $m \times n$ matrix $W$ , in a high-dimensional setting where $n, m \to \infty$ . Recent work connects this problem to spherical integral geometry giving rise to a conjectured sharp injectivity threshold for $α = \frac{m}{n}$ by studying the expected Euler characteristic of a certain random set. We adopt a different perspective and show that injectivity is equivalent to a property of the ground state of the spherical perceptron, an important spin glass model in statistical physics. By leveraging the (non-rigorous) replica symmetry-breaking theory, we derive analytical equations for the threshold whose solution is at odds with that from the Euler characteristic. Furthermore, we use…

Equations (580)

$\varphi_{\textbf{W}}({\textbf{x}})_{\mu}$

$p_{m,n}$

$n\to\infty,\quad m/n\to\alpha.$

$E_{\textbf{W}}({\textbf{x}})$

$\displaystyle p_{m,n}=\mathbb{P}_{\textbf{W}}\Big[\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}E_{\textbf{W}}({\textbf{x}})\geq n\Big].$

$\mathrm{d}\mathbb{P}_{\beta,{\textbf{W}}}({\textbf{x}})$

$\Phi_{n}({\textbf{W}},\beta)$

$q_{m,n}\coloneqq\mathbb{E}[\chi_{\mathcal{S}}(V\cap C_{m,n})],$

$\displaystyle\begin{cases}\limsup_{n\to\infty}\frac{1}{n}\log q_{m,n}<0&\textrm{for }\alpha<\alpha_{\mathrm{inj}}^{\rm Euler},\\ \liminf_{n\to\infty}\frac{1}{n}\log q_{m,n}>0&\textrm{for }\alpha>\alpha_{\mathrm{inj}}^{\rm Euler},\end{cases}$

$\Big(\alpha\leq 3.3\Rightarrow\lim_{n\to\infty}p_{m,n}=0\Big)\quad\textrm{and}\quad\Big(\alpha\geq 9.091\Rightarrow\lim_{n\to\infty}p_{m,n}=1\Big).$

$\displaystyle p_{m,n}=\mathbb{P}_{\textbf{W}}\Big[\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}e_{\textbf{W}}({\textbf{x}})\geq 1\Big].$

$\displaystyle-\frac{\Phi_{n}({\textbf{W}},\beta)}{\beta}\geq\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}e_{\textbf{W}}({\textbf{x}}),$

$\mathbb{P}_{\textbf{W}}\big[|\Phi_{n}({\textbf{W}},\beta)-\mathbb{E}_{\textbf{W}}\Phi_{n}({\textbf{W}},\beta)|\geq t\big]$

$\displaystyle\lim_{\beta\to\infty}\Big[-\frac{\Phi(\alpha,\beta)}{\beta}\Big]$

$\displaystyle\begin{dcases}\lim_{\beta\to\infty}\Big[-\frac{\Phi(\alpha,\beta)}{\beta}\Big]&<1\Rightarrow\lim_{n\to\infty}p_{m,n}=0,\\ \lim_{\beta\to\infty}\Big[-\frac{\Phi(\alpha,\beta)}{\beta}\Big]&>1\Rightarrow\lim_{n\to\infty}p_{m,n}=1.\end{dcases}$

$\displaystyle\min_{{\textbf{x}}\in\{\pm 1\}^{n}}e_{\textbf{W}}({\textbf{x}})\leq-\frac{\Phi_{n}({\textbf{W}},\beta)}{\beta}$

$\displaystyle\Phi_{\mathrm{FRSB}}(\alpha,\beta)=\inf_{q\in\mathcal{F}}\mathcal{P}[q;\alpha,\beta],$

$\displaystyle\lim_{n\to\infty}\mathbb{E}_{\textbf{W}}\Big[\mathbb{E}_{({\textbf{x}},{\textbf{x}}^{\prime})\sim\mathbb{P}_{\beta,{\textbf{W}}}^{\otimes 2}}f({\textbf{x}}\cdot{\textbf{x}}^{\prime})\Big]$

$\alpha_{\mathrm{inj}}$

$\displaystyle\Phi\overset{\mathrm{conj.}}{=}\Phi_{\mathrm{FRSB}}=\lim_{k\to\infty}\Phi_{k\textrm{-RSB}}\leq\dots\leq\Phi_{k\textrm{-RSB}}\leq\dots\leq\Phi_{\mathrm{1RSB}}\leq\Phi_{\mathrm{RS}},$

$\displaystyle\alpha_{\mathrm{inj}}\overset{\mathrm{conj.}}{=}\alpha_{\mathrm{inj}}^{\mathrm{FRSB}}\leq\dots\leq\alpha_{\mathrm{inj}}^{(k+1)\textrm{-RSB}}\leq\alpha_{\mathrm{inj}}^{k\textrm{-RSB}}\leq\dots\leq\alpha_{\mathrm{inj}}^{\mathrm{1RSB}}\leq\alpha_{\mathrm{inj}}^{\mathrm{RS}},$

$\mathbb{E}_{\textbf{W}}\Phi_{n}({\textbf{W}},\beta)$

$\mathbb{E}\log X$

$\Phi(\alpha,\beta)$

$\Phi(\alpha,\beta;r)$

$\displaystyle{}=\lim_{n\to\infty}\frac{1}{n}\log\int\prod_{a=1}^{r}\mu_{n}(\mathrm{d}{\textbf{x}}^{a})\,\mathbb{E}_{\textbf{W}}\Bigg\{\prod_{a=1}^{r}\exp\Big\{-\beta\sum_{\mu=1}^{m}\theta\Big[({\textbf{W}}{\textbf{x}}^{a})_{\mu}\Big]\Big\}\Bigg\}.$

$\displaystyle\mathbb{E}_{\textbf{W}}\prod_{a=1}^{r}e^{-\beta\sum_{\mu=1}^{m}\theta[({\textbf{W}}{\textbf{x}}^{a})_{\mu}]}$

$\displaystyle I_{\beta}({\textbf{Q}})\coloneqq\int_{\mathbb{R}^{r}}\frac{\mathrm{d}{\textbf{z}}}{(2\pi)^{r/2}\sqrt{\det{\textbf{Q}}}}\,e^{-\frac{1}{2}{\textbf{z}}^{\intercal}{\textbf{Q}}^{-1}{\textbf{z}}}\,e^{-\beta\sum_{a=1}^{r}\theta(z^{a})}.$


Code & Models

Repositories

anmaillard/injectivity_relu_layer (JAX, official)


Taxonomy

Topics: Neural Networks and Applications · Face and Expression Recognition · Topological and Geometric Data Analysis

Full text

Injectivity of ReLU networks: perspectives from statistical physics

Antoine Maillard⋆,⋄, Afonso S. Bandeira⋆, David Belius♯, Ivan Dokmanić♯,♭, Shuta Nakajima▷

Abstract

When can the input of a ReLU neural network be inferred from its output? In other words, when is the network injective? We consider a single layer, $x\mapsto\mathrm{ReLU}(Wx)$ , with a random Gaussian $m\times n$ matrix $W$ , in a high-dimensional setting where $n,m\to\infty$ . Recent work connects this problem to spherical integral geometry giving rise to a conjectured sharp injectivity threshold for $\alpha=\frac{m}{n}$ by studying the expected Euler characteristic of a certain random set. We adopt a different perspective and show that injectivity is equivalent to a property of the ground state of the spherical perceptron, an important spin glass model in statistical physics. By leveraging the (non-rigorous) replica symmetry-breaking theory, we derive analytical equations for the threshold whose solution is at odds with that from the Euler characteristic. Furthermore, we use Gordon’s min–max theorem to prove that a replica-symmetric upper bound refutes the Euler characteristic prediction. Along the way we aim to give a tutorial-style introduction to key ideas from statistical physics in an effort to make the exposition accessible to a broad audience. Our analysis establishes a connection between spin glasses and integral geometry but leaves open the problem of explaining the discrepancies.

$\star$ Department of Mathematics, ETH Zürich, Switzerland.

$\sharp$ Department of Mathematics and Computer Science, University of Basel, Switzerland.

$\flat$ Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, USA.

$\triangleright$ Graduate School of Science and Technology, Meiji University, Kanagawa, Japan.

$\diamond$ To whom correspondence shall be sent: [email protected].

1 Introduction
1.1 Injectivity and (random) neural networks
1.2 Injectivity and random geometry
1.3 Statistical physics and the spherical perceptron
1.4 Related work
1.5 Main results
1.5.1 Relating the free entropy to injectivity
1.5.2 Predictions of full replica symmetry breaking theory
1.5.3 Additional bounds
1.6 Structure of the paper and open problems
2 The replica hierarchy of upper bounds
2.1 General principles of the replica method
2.2 First steps of the replica method
2.3 The replica-symmetric solution
2.4 The overlap distribution and replica symmetry breaking
2.5 One-step replica symmetry breaking
3 The full-RSB solution: exact injectivity threshold
3.1 The full-RSB prediction for the free entropy
3.2 Zero-temperature limit and algorithm for the injectivity threshold
A Proofs
A.1 Proof of Proposition 1.1
A.2 Proof of Lemma 1.2
A.3 Proof of Theorem 1.4
A.4 Proof of Corollary 1.5
A.5 Proof of Theorem 1.8
A.6 Proof of Lemmas A.5, A.6 and A.7
B Replica-symmetric supplementary calculations
B.1 Zero-temperature limit of the replica-symmetric solution
B.2 Stability of the replica-symmetric solution
B.2.1 The derivatives of $G_{1,r}$
B.2.2 The derivatives of $G_{2,r}$
B.2.3 de Almeida-Thouless condition for replica-symmetric stability
B.2.4 The $\beta\to\infty$ limit
C A replica-symmetric lower bound
D One-step replica symmetry breaking
D.1 Derivation of the 1-RSB free entropy
D.2 Zero-temperature limit
D.3 Numerical procedure
E Details of the FRSB computation
E.1 Entropic contribution
E.2 Energetic contribution
E.3 Recovering the RS result from the full RSB equations
F Technicalities of the algorithmic FRSB procedure
F.1 Technicalities of the derivation of the algorithmic procedure
F.2 Numerical results of the procedure
F.3 Some details on the implementation of convolutions
F.3.1 Convolutions via DFTs
F.3.2 Taking a large enough value of $N$
F.4 Bounds on the injectivity threshold
G An upper bound from Gordon’s “escape through a mesh” theorem

1 Introduction

We ask the following question: when is a randomly-initialized ReLU neural network injective? For $n,m\geq 1$ we consider a single layer at initialization, that is the map $\varphi_{\textbf{W}}$ defined as

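$$\varphi_{\textbf{W}}({\textbf{x}})_{\mu}\coloneqq\sigma\big(({\textbf{W}}{\textbf{x}})_{\mu}\big),\qquad\mu\in\{1,\dots,m\},$$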

with ${\textbf{x}}\in\mathbb{R}^{n}$ and $\sigma(x)\coloneqq\max(0,x)$ , the ReLU activation. The weights at initialization are random; concretely, we let $W_{\mu i}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$ in what follows, although we expect some of our results to generalize to W with independent entries with zero mean, unit variance, and uniformly bounded third moment; see the discussion on universality in Section 1.3.

Earlier work studied this question in the proportional growth asymptotics, where $n\to\infty$ and the aspect ratio $\frac{m}{n}\to\alpha>0$. Puthawala et al. proved that there exist values $\alpha_{l}$ and $\alpha_{h}$, with $\alpha_{l}<\alpha_{h}$, such that the probability $p_{m,n}$ that the map $\varphi_{\textbf{W}}$ is injective converges to $1$ for $\alpha>\alpha_{h}$ and to $0$ for $\alpha<\alpha_{l}$, suggesting that interesting transitions may appear precisely in this proportional scaling [PKL+22]. Indeed, by studying the expected Euler characteristic of the intersection of a random subspace with a union of orthants with sufficiently many negative coordinates, Clum, Paleka, Bandeira, and Mixon conjectured a sharp injectivity threshold at the value $\alpha_{\mathrm{inj}}^{\rm Euler}\simeq 8.34$ [Pal21, Clu22, CPBM22]. We adopt this setting and propose an alternative derivation of the injectivity threshold, by making a connection with a spin glass model known as the spherical perceptron.

1.1 Injectivity and (random) neural networks

Our focus is on framing injectivity as a statistical physics problem and exploring parallels and discrepancies with the mentioned conjecture based on integral geometry. (Footnote 1: If formal injectivity were itself the goal, we could simply replace ReLU by Leaky ReLU and reduce the problem to injectivity of matrices. In that case the interesting quantity to study may be the inverse Lipschitz constant.) But a study of injectivity has a variety of motivations in contemporary machine learning. Inferring ${\textbf{x}}$ from $\varphi_{{\textbf{W}}}({\textbf{x}})$ is an ill-posed problem unless $\varphi_{{\textbf{W}}}$ is injective. The question thus arises naturally when applying neural networks to model forward and inverse maps in inverse problems [PKL+22, AMÖS19]. There has been considerable interest in inverting generative models on their range to regularize ill-posed inverse problems [BJPD17] and in building injective generative models [BC20, KKdHD21, RC21]. Normalizing flows are designed to be invertible with efficiently computable inverses; similar feats can be achieved with injective maps, even with ReLU activations, while retaining favorable approximation-theoretic properties [PKL+22, PLDDH22, KKdHD21]. In finite dimension injective maps are (locally) Lipschitz [SU09]. There is significant work on estimating and controlling the Lipschitz constants of deep neural networks; see for example [FRH+19, JD20, GFPC21] and references therein.

Applications abound beyond inverse problems: certain injective generative models can provably be trained with sample complexity which is polynomial in image dimension [BMR18]; a message-passing graph neural network is as powerful as the Weisfeiler–Lehman test, but only if the aggregation function is injective [XHLJ18]; injective ReLU networks are universal approximators of any map with a sufficiently high-dimensional output space [PKL+22] as well as of densities on manifolds [PLDDH22].

There is an analogy between random neural networks and random matrices. Just as results for random matrices help us understand general matrices and have implications throughout mathematics, physics, engineering, and computer science, random neural networks yield insight into general neural networks. This is the perspective of recent work on “nonlinear random matrix theory” for machine learning [PW17, LLC18]. We mention two other examples from this emerging line of research: neural networks at initialization have been used to theoretically study batch normalization [DKB+20] and properties of gradients in deep networks [HN20].

1.2 Injectivity and random geometry

Notation –

$\mathbb{N}^{\star}=\mathbb{Z}_{>0}$ denotes the positive integers. We say that an event occurs with high probability (w.h.p.) when its probability is $1-o(1)$ as the dimension $n\to\infty$. We denote by $\mu_{n}$ the uniform probability measure on the Euclidean unit sphere $\mathcal{S}^{n-1}$ in $\mathbb{R}^{n}$. The symbol $\overset{\mathbb{P}}{\to}$ refers to convergence in probability, and $\mathcal{D}\xi$ is the standard Gaussian measure on $\mathbb{R}$, as usual in physics.

Our first tool is a proposition proved in Appendix A.1, stated as Proposition 4.10 in [Pal21] and Proposition 37 in [Clu22], which is a simple consequence of Theorem 1 of [PKL+22]. It connects injectivity to random geometry:

Proposition 1.1 (Injectivity and random geometry).

The probability $p_{m,n}$ that $\varphi_{\textbf{W}}$ is injective is

$$p_{m,n}=\mathbb{P}_{V}\big[V\cap C_{m,n}=\{0\}\big],\tag{2}$$

where $V$ is a uniformly random $n$ -dimensional subspace of $\mathbb{R}^{m}$ , and $C_{m,n}$ is the set of vectors in $\mathbb{R}^{m}$ with strictly less than $n$ strictly positive coordinates.

Remark – Since $V\cap C_{m,n}$ is a cone, we can equivalently ask in eq. (2) that $V\cap C_{m,n}\cap\mathcal{S}^{m-1}$ be an empty set.

Recall that we will study injectivity for large matrices W in the proportional growth asymptotics,

$$n\to\infty,\quad m/n\to\alpha.$$

In what follows we will only consider the case $m\geq n$ (and therefore $\alpha\geq 1$ ). For $m<n$ even ${\textbf{x}}\mapsto{\textbf{W}}{\textbf{x}}$ is not injective, implying that $p_{m,n}=0$ .

The expression on the right-hand side of (2) immediately evokes Gordon’s “escape through a mesh” theorem, which bounds the probability that a uniformly-sampled random subspace intersects a closed subset of the sphere $A\subseteq\mathcal{S}^{m-1}$ in terms of the Gaussian width of $A$ [Gor88]. Applying Gordon’s theorem is natural but we could only use it to show that $p_{m,n}\to 1$ when $\alpha\geq\alpha_{\mathrm{inj}}^{\mathrm{mesh}}\simeq 23.54$, which is suboptimal, given that previous work [PKL+22] proves injectivity w.h.p. when $\alpha\geq 9.091$; see Appendix G for details. A more refined analysis of the random subspace–set intersection based on the phase transition in the expected Euler characteristic yields a sharp injectivity threshold prediction of $\alpha_{\mathrm{inj}}^{\mathrm{Euler}}\simeq 8.34$ [Pal21], see Section 1.4. Here we refute this prediction and conjecture a new threshold based on a different geometric intuition.

1.3 Statistical physics and the spherical perceptron

Injectivity as energy minimization –

The random subspace $V$ of Proposition 1.1 is constructed as the column space of the random matrix ${\textbf{W}}$, which has dimension $n$ with probability $1$ when $m\geq n$. If $V^{\prime}\coloneqq{\textbf{W}}(\mathcal{S}^{n-1})$ is the image of the $n$-dimensional unit sphere, we have that $\mathbb{P}[V\cap C_{m,n}=\{0\}]=\mathbb{P}[V^{\prime}\cap C_{m,n}=\emptyset]$. Moreover, for any ${\textbf{x}}\in\mathcal{S}^{n-1}$ we can define $E_{\textbf{W}}({\textbf{x}})$ as the total number of positive coordinates of ${\textbf{W}}{\textbf{x}}$, and $e_{\textbf{W}}({\textbf{x}})$ as a normalization of this quantity:

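$$E_{\textbf{W}}({\textbf{x}})\coloneqq\sum_{\mu=1}^{m}\theta\big[({\textbf{W}}{\textbf{x}})_{\mu}\big],\qquad e_{\textbf{W}}({\textbf{x}})\coloneqq\frac{E_{\textbf{W}}({\textbf{x}})}{n},\tag{3}$$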

where $\theta(x)=\mathds{1}(x>0)$ is the Heaviside step function, with the convention $\theta(0)=0$. Since $C_{m,n}$ is the set of all vectors in $\mathbb{R}^{m}$ with strictly less than $n$ (strictly) positive coordinates, one has immediately that ${\textbf{W}}{\textbf{x}}\in C_{m,n}\Leftrightarrow E_{\textbf{W}}({\textbf{x}})<n$. Therefore, by Proposition 1.1, $p_{m,n}$ can be rewritten as follows (footnote 2: the minimum is always reached since $E_{\textbf{W}}(\mathcal{S}^{n-1})$ is a finite set):

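$$p_{m,n}=\mathbb{P}_{\textbf{W}}\Big[\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}E_{\textbf{W}}({\textbf{x}})\geq n\Big].\tag{4}$$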

Eqs. (2) and (4) express two different geometric intuitions. The former one lives in $\mathbb{R}^{m}$ (recall that $m\geq n$ ) and it is about an intersection of a random $n$ -dimensional subspace and a certain nonconvex union of orthants. The latter one lives in $\mathbb{R}^{n}$ and it is about the existence of a halfspace which contains less than $n$ (out of $m$ ) random vectors. The two intuitions naturally encourage different analytic tools.
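The energy picture can be probed numerically at small sizes. The following is a minimal sketch, not the authors' code: the helper names `energy` and `min_energy_upper_bound` and the sizes $n=20$, $m=30$ are illustrative assumptions. It evaluates $E_{\textbf{W}}({\textbf{x}})$ on random directions and their antipodes. Random search can only upper-bound the minimum over the sphere, so it can certify non-injectivity (some ${\textbf{x}}$ with $E_{\textbf{W}}({\textbf{x}})<n$) but never injectivity.

```python
import numpy as np

def energy(W, x):
    """E_W(x): number of strictly positive coordinates of W @ x (theta(0) = 0)."""
    return int(np.sum(W @ x > 0))

def min_energy_upper_bound(W, num_samples=1000, seed=0):
    """Smallest E_W over random directions x and their antipodes -x.

    Random search only upper-bounds the true minimum over the sphere.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[1]
    X = rng.standard_normal((num_samples, n))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform points on S^{n-1}
    X = np.concatenate([X, -X])                    # E_W(x) + E_W(-x) = m almost surely
    energies = np.sum(X @ W.T > 0, axis=1)
    return int(energies.min())

rng = np.random.default_rng(1)
n, m = 20, 30                                      # illustrative sizes, alpha = 1.5
W = rng.standard_normal((m, n))
best = min_energy_upper_bound(W)
is_certified_non_injective = best < n              # some x has E_W(x) < n
print(best, is_certified_non_injective)
```

Because $E_{\textbf{W}}({\textbf{x}})+E_{\textbf{W}}(-{\textbf{x}})=m$ almost surely, one of the two antipodes always has energy at most $m/2$, so this naive search already certifies non-injectivity whenever $\alpha<2$. Larger aspect ratios require a genuine minimization over the sphere.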

Statistical physics of disordered systems –

The right-hand side of eq. (4) is reminiscent of quantities that theoretical physicists have been tackling since the 1970s, in the field of physics of disordered systems. In these disordered models (also known as spin glasses), one wishes to minimize an energy function like $E_{\textbf{W}}$ , which is itself a function of random interactions (also called quenched disorder), represented in our case by W. We recommend the famous book by Mézard, Parisi, and Virasoro for a beautiful review of the early breakthroughs of the physics of spin glasses [MPV87].

Given this short description, we can see that eq. (4) fits the framework of these studies: the energy function given in eq. (3) defines a model known in the statistical physics literature as the spherical perceptron (sometimes referred to as the Gardner–Derrida perceptron [GD88] when $E_{\textbf{W}}({\textbf{x}})$ is given by eq. (3)). We adopt this perhaps unexpected point of view on the injectivity of random layers in neural networks.

Cover’s theorem and the bound $\alpha_{\mathrm{inj}}\geq 3$ –

Cover’s theorem [Cov65] leads to a first natural bound for $\alpha_{\mathrm{inj}}$ . It proves that for $\alpha<2$ , there exists with high probability (as $n\to\infty$ ) ${\textbf{x}}\in\mathcal{S}^{n-1}$ s.t. $E_{\textbf{W}}({\textbf{x}})=0$ (that is, the constraint satisfaction problem $E_{\textbf{W}}({\textbf{x}})=0$ is satisfiable w.h.p.). This can be shown to imply that $\alpha_{\mathrm{inj}}\geq 3$ :

Lemma 1.2 (Cover’s lower bound for injectivity).

Assume $\alpha<3$. Then as $n,m\to\infty$ the ReLU layer is non-injective with high probability, i.e., $\lim_{n\to\infty}p_{m,n}=0$.

Such arguments are classical, and we detail the proof of Lemma 1.2 for completeness in Appendix A.2. (Footnote 3: In a nutshell, by Lemma 1.2 there is always an x at obtuse angle with the top $2n$ rows of W. Even if all the remaining $m-2n$ rows form acute angles with x, we need at least $n$ such rows for injectivity.) Results about the perceptron based on Cover’s theorem were greatly extended by Gardner and Derrida [Gar88, GD88] using non-rigorous tools, and then later rigorously justified by Scherbina and Tirozzi [ST02, ST03] and Stojnic [Sto13a]. In the constraint satisfaction problem (CSP) view on the perceptron, $\alpha=2$ is sometimes referred to as the Gardner capacity, which marks the limit between the satisfiable (SAT) and unsatisfiable (UNSAT) phases.
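The SAT/UNSAT transition at $\alpha=2$ can be illustrated with the classical counting formula of Wendel and Cover, which is not reproduced in the text above and is quoted here as an assumption: for $m$ i.i.d. points with a symmetric distribution in general position in $\mathbb{R}^{n}$, the probability that they all lie in a common open half-space through the origin is $2^{-(m-1)}\sum_{k=0}^{n-1}\binom{m-1}{k}$. For Gaussian rows this is the probability that some ${\textbf{x}}$ satisfies $E_{\textbf{W}}({\textbf{x}})=0$. The function name `halfspace_probability` is hypothetical.

```python
from math import comb

def halfspace_probability(m, n):
    """P[m i.i.d. symmetric points in general position in R^n lie in a common
    open half-space through the origin] = 2^{-(m-1)} * sum_{k<n} C(m-1, k)."""
    return sum(comb(m - 1, k) for k in range(n)) / 2 ** (m - 1)

n = 100  # illustrative dimension
for alpha in (1.5, 2.0, 2.5):
    m = int(alpha * n)
    print(alpha, halfspace_probability(m, n))
```

At $m=2n$ the probability is exactly $1/2$; it approaches $1$ below and $0$ above this ratio as $n$ grows, which is the finite-size counterpart of the Gardner capacity $\alpha=2$.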

Thermal relaxation: the Gibbs–Boltzmann distribution –

Statistical physicists characterize the landscape of the (random) energy function $E_{\textbf{W}}({\textbf{x}})$ by considering the Gibbs–Boltzmann distribution $\mathbb{P}_{\beta,{\textbf{W}}}$ , defined for any inverse temperature $\beta\geq 0$ as

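$$\mathrm{d}\mathbb{P}_{\beta,{\textbf{W}}}({\textbf{x}})\coloneqq\frac{e^{-\beta E_{\textbf{W}}({\textbf{x}})}}{\mathcal{Z}_{n}({\textbf{W}},\beta)}\,\mu_{n}(\mathrm{d}{\textbf{x}}),\qquad\mathcal{Z}_{n}({\textbf{W}},\beta)\coloneqq\int_{\mathcal{S}^{n-1}}\mu_{n}(\mathrm{d}{\textbf{x}})\,e^{-\beta E_{\textbf{W}}({\textbf{x}})}.\tag{5}$$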

Informally, the parameter $\beta\geq 0$ interpolates between two extremes: the infinite-temperature ($\beta=0$) regime, in which the Gibbs measure is uniform on the sphere, and the zero-temperature ($\beta\to\infty$) limit, in which the Gibbs measure is concentrated on the global minima of the energy function $E_{\textbf{W}}({\textbf{x}})$. Studying the properties of the Gibbs measure for $n\to\infty$ at various $\beta$ (remaining finite when $n\to\infty$) yields deep insight about the landscape of the corresponding energy function [Ell06] (footnote 4: the Gibbs distribution is also the invariant measure of stochastic optimization procedures such as Langevin dynamics). In particular, many of our results will be based on an analysis of the large $n$ limit of the free entropy, which is defined as the logarithm of the normalization in eq. (5):

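$$\Phi_{n}({\textbf{W}},\beta)\coloneqq\frac{1}{n}\log\int_{\mathcal{S}^{n-1}}\mu_{n}(\mathrm{d}{\textbf{x}})\,e^{-\beta E_{\textbf{W}}({\textbf{x}})}.\tag{6}$$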
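To make the definition concrete, here is a toy Monte Carlo sketch of the free entropy at small $n$. It assumes that $\Phi_{n}({\textbf{W}},\beta)$ is $\frac{1}{n}$ times the logarithm of the average of $e^{-\beta E_{\textbf{W}}({\textbf{x}})}$ over the uniform measure on the sphere. The function name `free_entropy_mc` and all sizes are illustrative assumptions. With a fixed number of samples the estimate is unreliable at large $\beta$, where rare low-energy configurations dominate, so this is for intuition only.

```python
import numpy as np

def free_entropy_mc(W, beta, num_samples=5000, seed=0):
    """Monte Carlo estimate of Phi_n(W, beta) = (1/n) log E_x[exp(-beta E_W(x))],
    with x uniform on the sphere; also returns the smallest sampled e_W(x)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    X = rng.standard_normal((num_samples, n))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # uniform points on S^{n-1}
    E = np.sum(X @ W.T > 0, axis=1)                  # E_W(x) for each sample
    a = -beta * E
    a_max = a.max()
    log_mean = a_max + np.log(np.mean(np.exp(a - a_max)))  # stable log-mean-exp
    return log_mean / n, E.min() / n

rng = np.random.default_rng(0)
n, m = 10, 40                                        # illustrative sizes
W = rng.standard_normal((m, n))
for beta in (0.0, 0.5, 2.0, 8.0):
    phi, e_min = free_entropy_mc(W, beta)
    print(beta, phi, e_min)
```

At $\beta=0$ the estimate is zero, since the Gibbs measure is then uniform. For $\beta>0$ the estimate is non-positive because $E_{\textbf{W}}\geq 0$, and $-\Phi/\beta$ decreases toward the smallest sampled energy density as $\beta$ grows.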

Universality of the free entropy –

Following classical arguments based on the Lindeberg exchange method [Cha06], one can show that the free entropy $\Phi(\alpha,\beta)$ is universal for all matrices W with independent zero-mean entries with unit variance and uniformly bounded third moment. In particular, all our conjectures and theorems on the free entropy can be stated in this more general case. We note that in a recent line of work, similar universality properties have been generalized to matrices with independent rows (see, e.g., [MS22a, GKL+22] and references therein) under a “one-dimensional CLT” condition. In particular, [GKL+22] conjectures that the ground state energy $f^{\star}(\alpha)=\lim_{n\to\infty}\{\min_{\textbf{x}}e_{\textbf{W}}({\textbf{x}})\}$ (shown in Fig. 1) is universal with respect to the distribution of W in a much wider class than matrices with independent elements: we leave the investigation of this conjecture and its implications on injectivity for future work.

1.4 Related work

Average Euler characteristic prediction –

We follow here closely the presentation of [Pal21] (see also [Clu22]). Proposition 1.1 is reminiscent of the kinematic formulas in integral geometry [SW08], which allow one to compute expressions of the type $\mathbb{E}[F(V\cap C)]$, when $V$ is a uniformly-sampled random $n$-dimensional subspace, and

$(i)$ $C$ is a finite union of convex cones.

$(ii)$ $F$ is an additive function, i.e., it satisfies for any $A,B\subseteq\mathbb{R}^{m}$ that $F(A\cup B)+F(A\cap B)=F(A)+F(B)$.

Recall that we can write eq. (2) as $p_{m,n}=\mathbb{E}[\mathds{1}_{\mathcal{S}}(V\cap C_{m,n})]$ , with $\mathds{1}_{\mathcal{S}}(A)\coloneqq\mathds{1}\{A\cap\mathcal{S}^{m-1}\neq\emptyset\}$ the indicator function of the sphere. While $C_{m,n}$ is indeed a finite union of orthants (and thus of convex cones), $\mathds{1}_{\mathcal{S}}$ is not additive. However, it follows from Groemer’s extension theorem [SW08] that the unique additive function defined on finite unions of convex cones to agree with $\mathds{1}_{\mathcal{S}}$ on convex cones is the (spherical) Euler characteristic $\chi_{\mathcal{S}}(A)\coloneqq\chi(A\cap\mathcal{S}^{m-1})$ . A possible heuristic is thus to approximate $p_{m,n}=\mathbb{E}[\mathds{1}_{\mathcal{S}}(V\cap C_{m,n})]$ by

$$q_{m,n}\coloneqq\mathbb{E}[\chi_{\mathcal{S}}(V\cap C_{m,n})],$$

in order to apply the kinematic formulas. We refer to [Pal21] for more discussion on the validity of this heuristic. In particular, let us note that this strategy has also been used to estimate the probability of excursions of random fields, see [AT07]. Using the kinematic formulas, one can obtain an explicit formula for $q_{m,n}$ . Estimating its limit as $n,m\to\infty$ is involved, and a non-rigorous calculation performed in [Pal21] leads to the conjecture:

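$$\begin{cases}\limsup_{n\to\infty}\frac{1}{n}\log q_{m,n}<0&\textrm{for }\alpha<\alpha_{\mathrm{inj}}^{\rm Euler},\\ \liminf_{n\to\infty}\frac{1}{n}\log q_{m,n}>0&\textrm{for }\alpha>\alpha_{\mathrm{inj}}^{\rm Euler},\end{cases}$$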

for a sharp threshold $\alpha_{\mathrm{inj}}^{\rm Euler}\simeq 8.34$ , which we will call the average Euler characteristic prediction for injectivity. Checking the validity of this heuristic approach as a prediction for the behavior of $p_{m,n}$ was one of the motivations of our work.

Physics and mathematics of the perceptron –

Motivated in particular by the relation of the perceptron to continuous constraint satisfaction problems (e.g. to soft sphere packing), studies of the spherical perceptron in physics and mathematics are numerous. Without aiming at being exhaustive, and rather primarily referring to works relevant for our presentation, these studies include [GD88, Gar88, FPS+17] in the physics literature, while the spherical perceptron has also been studied with mathematically rigorous techniques [ST02], [Tal10, Chapter 3], [Tal11, Chapter 8], [Sto13a, Sto13b, MZZ21]. In particular, the satisfiability threshold $\alpha=2$ has been rigorously determined. The techniques however do not apply to the unsatisfiable (UNSAT) regime, which is the one that is relevant in this paper. One reason for this is that the satisfiability question can be formulated in terms of a convex Hamiltonian, while in the unsatisfiable regime one is interested in a Hamiltonian given by the number of half-spaces a point is contained in, which is not convex. This precludes the straightforward use of these rigorous techniques to study the injectivity question. A rigorous sharp characterization of the unsatisfiable phase remains an important open problem. We refer to [BNSX22] for a summary of current advances on the spherical perceptron, from both the physics and the mathematics points of view.

Other related work –

Puthawala et al. derived a suite of results on injectivity of neural networks, including a simple analysis of random ReLU layers [PKL+22]. By combining ideas related to Cover’s theorem with union bounds over row selections from W and concentration of measure, they proved upper and lower bounds on the injectivity threshold, the upper bound being later improved by Paleka [Pal21] and Clum [Clu22]. We summarize them in the following theorem:

Theorem 1.3 (Known bounds for injectivity [PKL+22, Pal21, Clu22]).

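$$\Big(\alpha\leq 3.3\Rightarrow\lim_{n\to\infty}p_{m,n}=0\Big)\quad\textrm{and}\quad\Big(\alpha\geq 9.091\Rightarrow\lim_{n\to\infty}p_{m,n}=1\Big).$$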

By Proposition 1.1, the injectivity threshold can be characterized as a phase transition in the probability that a random subspace intersects a certain union of orthants. Similar characterizations arise in the study of convex relaxations of sparse linear regression and other high-dimensional convex optimization problems with random data. Amelunxen et al. connect the probability of success of these optimization problems to random convex constraint satisfaction problems, namely the probability that two random convex cones have a common ray [ALMT14]. They prove that this probability exhibits a sharp phase transition in terms of scalar values known as the statistical dimension of the cones. Unfortunately, these results are limited to convex cones, whereas the union of orthants from Proposition 1.1 is non-convex.

1.5 Main results

Recall that we study a high-dimensional regime in which $n\to\infty$ and $m=m(n)$ satisfies $m(n)/n\to\alpha>0$ . We will sometimes use the notation $\alpha_{n}\coloneqq m(n)/n$ . The proofs of the rigorous statements in this section are given in Appendix A.

1.5.1 Relating the free entropy to injectivity

Our starting point is eq. (4) in Proposition 1.1 (recall that $e_{\textbf{W}}({\textbf{x}})=E_{\textbf{W}}({\textbf{x}})/n$ ):

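$$p_{m,n}=\mathbb{P}_{\textbf{W}}\Big[\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}e_{\textbf{W}}({\textbf{x}})\geq 1\Big].$$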

Recall the definition of the free entropy in eq. (6). We immediately have

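$$-\frac{\Phi_{n}({\textbf{W}},\beta)}{\beta}\geq\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}e_{\textbf{W}}({\textbf{x}}),\tag{9}$$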

which formalizes the fact that the Gibbs distribution is a relaxation of the uniform distribution on the global minima of $E_{\textbf{W}}$ . Our strategy is to use eq. (9) to characterize injectivity. This involves two challenging steps:

$(i)$ Making the inequality of eq. (9) as tight as possible: as we explain below, conjecturally, when taking $n\to\infty$ and then $\beta\to\infty$, eq. (9) becomes an equality. While we are not able to prove this statement, we will use it to conjecture a sharp transition for injectivity in terms of the aspect ratio $\alpha$. Moreover, without assuming that this conjecture holds, we will also use eq. (9) to prove upper bounds on the injectivity threshold.

$(ii)$ Computing the large system limit $n\to\infty$ of $\Phi_{n}({\textbf{W}},\beta)$ on the left-hand side of eq. (9). This is a central object in the physics of disordered systems, and we will provide a conjecture for its limiting value, as well as rigorous upper bounds. Our results leverage a long line of work combining probability theory with heuristic predictions of statistical physics.
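For completeness, the bound of eq. (9) follows in one line. The sketch below assumes the definition of the free entropy as $\frac{1}{n}$ times the logarithm of the Gibbs normalization, and uses only that $\mu_{n}$ is a probability measure and that the integrand is at most its value at the energy minimum:

```latex
\begin{align*}
\Phi_{n}({\textbf{W}},\beta)
&=\frac{1}{n}\log\int_{\mathcal{S}^{n-1}}\mu_{n}(\mathrm{d}{\textbf{x}})\,e^{-\beta E_{\textbf{W}}({\textbf{x}})}\\
&\leq\frac{1}{n}\log\exp\Big\{-\beta\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}E_{\textbf{W}}({\textbf{x}})\Big\}
=-\beta\min_{{\textbf{x}}\in\mathcal{S}^{n-1}}e_{\textbf{W}}({\textbf{x}}).
\end{align*}
```

Dividing by $-\beta<0$ reverses the inequality and gives $-\Phi_{n}({\textbf{W}},\beta)/\beta\geq\min_{{\textbf{x}}}e_{\textbf{W}}({\textbf{x}})$ for every $\beta>0$ and every realization of ${\textbf{W}}$.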

The following statement is classical in the theory of disordered systems and a direct consequence of celebrated concentration inequalities [BLM13]. It bounds the probability that the free entropy deviates from its mean (with respect to the disorder W):

Theorem 1.4 (Free entropy concentration).

For any $\beta\geq 0$ and $n\geq 1$ , we have,

[TABLE]

Combined with the bound of eq. (9), this already allows us to state a sufficient condition for non-injectivity with high probability. We summarize this in the following corollary, proved in Appendix A.4.

Corollary 1.5 (Sufficient condition for non-injectivity).

We denote $\Phi(\alpha,\beta)=\liminf_{n\to\infty}\mathbb{E}_{\textbf{W}}\Phi_{n}({\textbf{W}},\beta)$ . It has the following properties:

$(i)$ $\beta\mapsto-\Phi(\alpha,\beta)/\beta$ is a positive non-increasing function of $\beta>0$.

$(ii)$ Its limit as $\beta\to\infty$ satisfies

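$$\lim_{\beta\to\infty}\Big[-\frac{\Phi(\alpha,\beta)}{\beta}\Big]<1\ \Rightarrow\ \lim_{n\to\infty}p_{m,n}=0,$$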

that is, the limit being smaller than 1 implies non-injectivity w.h.p. as $n,m\to\infty$ (footnote 5: the proof actually shows that $p_{m,n}$ goes to zero exponentially fast in $n$, see eq. (65)).
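One standard way to see the monotonicity in property $(i)$, offered here as a sketch and not necessarily the argument of Appendix A.4, is Jensen's inequality. The shorthand $\mathcal{Z}_{n}(\beta)$ for the Gibbs normalization is introduced only for this illustration. For $0<\beta_{1}<\beta_{2}$, the map $t\mapsto t^{\beta_{2}/\beta_{1}}$ is convex on $[0,\infty)$, so

```latex
\begin{align*}
\mathcal{Z}_{n}(\beta)&\coloneqq\int_{\mathcal{S}^{n-1}}\mu_{n}(\mathrm{d}{\textbf{x}})\,e^{-\beta E_{\textbf{W}}({\textbf{x}})},\\
\mathcal{Z}_{n}(\beta_{1})^{\beta_{2}/\beta_{1}}
&\leq\int_{\mathcal{S}^{n-1}}\mu_{n}(\mathrm{d}{\textbf{x}})\,\big(e^{-\beta_{1}E_{\textbf{W}}({\textbf{x}})}\big)^{\beta_{2}/\beta_{1}}
=\mathcal{Z}_{n}(\beta_{2}),\\
\text{hence}\quad
-\frac{\Phi_{n}({\textbf{W}},\beta_{1})}{\beta_{1}}
&\geq-\frac{\Phi_{n}({\textbf{W}},\beta_{2})}{\beta_{2}}.
\end{align*}
```

The inequality holds for every $n$ and every ${\textbf{W}}$, so it survives taking the expectation over ${\textbf{W}}$ and the $\liminf$ in $n$. Non-negativity of $-\Phi(\alpha,\beta)/\beta$ follows from $E_{\textbf{W}}\geq 0$, which gives $\mathcal{Z}_{n}(\beta)\leq 1$.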

Existence of the limit –

While we expect the limit of $\mathbb{E}_{\textbf{W}}\Phi_{n}({\textbf{W}},\beta)$ as $n\to\infty$ to exist, or, in other words, $\Phi(\alpha,\beta)$ to be defined not only as a $\liminf$, this fact is far from trivial. In the spin glass literature, this has historically been shown using interpolation methods due to Guerra, by showing sub-additivity of the free entropy in the system size [GT02, Tal10] for mean-field spin glass models possessing certain convexity properties. Guerra’s technique, however, fails beyond this setting, e.g. in bipartite (or other multi-species) spin glass models [Pan15]. On the other hand, even in some mean-field spin glasses, including spherical $p$-spins, the existence of the limit was only shown as a corollary of a much stronger, asymptotically tight two-sided bound, which relates the value of the limit precisely to the Parisi formula, i.e. the prediction of statistical physics [Tal06a, Che13]. (Footnote 6: However, an approximate sub-additivity property has recently been shown to be enough to deduce the convergence of the free entropy in this case [Sub22].) In the spherical perceptron we consider here, the existence of this limit is, to the best of our knowledge, still a conjecture.

Following the statistical physics intuition about the asymptotic tightness of eq. (9), we conjecture the following.

Conjecture 1.6 (Tightness of the free entropy bound).

The bound of Corollary 1.5 is tight, i.e.,

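$\lim_{\beta\to\infty}\Big[-\frac{\Phi(\alpha,\beta)}{\beta}\Big]>1\ \Longrightarrow\ \lim_{n\to\infty}p_{m,n}=1.$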

A generalized conjecture –

Conjecture 1.6 is a weakened version of a more general conjecture one can make from the definition of $\Phi(\alpha,\beta)$, which largely motivates the study of free entropies in statistical physics. First, assume that the limit defining $\Phi(\alpha,\beta)$ is well defined, so that $\Phi(\alpha,\beta)=\lim_{n\to\infty}\mathbb{E}_{\textbf{W}}\Phi_{n}({\textbf{W}},\beta)$. As $\beta\to\infty$, we expect the configurations that have dominating mass under the Gibbs measure of eq. (5) to have the smallest energy, i.e., to be the ground state configurations. Therefore, the stronger conjecture that motivates our use of $\Phi(\alpha,\beta)$ to characterize injectivity is that as $\beta\to\infty$, the bound of eq. (9) is actually an equality. In a nutshell, this conjecture can be stated as ($\operatorname*{p-lim}$ denotes limit in probability):

[TABLE]

Note that such a statement also assumes the concentration of the ground state energy on a value independent of W as $n\to\infty$. Generally, the concentration of the intensive energy $e_{\textbf{W}}({\textbf{x}})$ under the Gibbs measure at any given $\beta\geq 0$ can be deduced from the existence of the limiting free entropy and its differentiability in $\beta$ [AC18] (unfortunately, proving these properties often requires the full power of the so-called Parisi formula for the limit of the free entropy, which must first be proven, as we discuss later).

A remark on discretization –

A subtlety in establishing eq. (10) arises from the continuous nature of the variable x: one needs to discard the existence of sets with “super-exponentially” small volume that might contain the global minima of $e_{\textbf{W}}$ . In discrete models this issue is often not present. For example, replacing $\int_{\mathcal{S}^{n-1}}\mu_{n}(\mathrm{d}{\textbf{x}})$ by $2^{-n}\sum_{{\textbf{x}}\in\{\pm 1\}^{n}}$ in eq. (6) yields a model called the binary (or Ising) perceptron, for which it is easy to see that

[TABLE]

so that the generalized conjecture of eq. (10) follows from the concentration and existence of the limit of the free entropy. In our spherical model one could hope to approximate $\mathcal{S}^{n-1}$ by a sufficiently fine $\varepsilon$-net, so that the value of $\Phi_{n}({\textbf{W}},\beta)$ is well approximated by averaging over the points of this net, and such that a two-sided bound similar to eq. (11) holds. Let us briefly describe such an approach. Considering an arbitrary fixed vector ${\textbf{x}}\in\mathcal{S}^{n-1}$, it is clear that with high probability there exists $\mu\in[n]$ s.t. $|{\textbf{W}}_{\mu}\cdot{\textbf{x}}|\leq 1$ (since $\{{\textbf{W}}_{\mu}\cdot{\textbf{x}}\}_{\mu=1}^{n}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1)$). From this, one easily deduces that there exists a small rotation ${\textbf{y}}={\textbf{R}}{\textbf{x}}$ of x (in the direction of $\pm{\textbf{W}}_{\mu}/\|{\textbf{W}}_{\mu}\|$), with angle $\mathcal{O}(1/\sqrt{n})$, such that $({\textbf{y}}\cdot{\textbf{W}}_{\mu})({\textbf{x}}\cdot{\textbf{W}}_{\mu})<0$, while $\|{\textbf{y}}-{\textbf{x}}\|_{2}\lesssim 1/\sqrt{n}$. This (very) rough estimate shows that $\varepsilon\lesssim 1/\sqrt{n}$ is necessary to approximate the minimum of $e_{\textbf{W}}$ over $\mathcal{S}^{n-1}$ by the minimum on a Euclidean-distance net. However, it is well known that such a net needs to have cardinality at least $(1/\varepsilon)^{n}$ [vH14]. Thus, under this discretization, the term $\log 2/\beta$ in the upper bound of eq. (11) becomes $\Omega(\log\varepsilon^{-1}/\beta)=\Omega(\log n/\beta)$. Therefore we would need to consider diverging inverse temperatures $\beta=\beta(n)\gtrsim\log n$ in the discretized system for its free entropy to provably approximate the ground state energy. A rigorous computation of the free entropy on this net with diverging $\beta$ would be challenging: since our results are based on heuristic methods of statistical physics assuming Conjecture 1.6 (with the exception of a rigorous upper bound), we leave the analysis of a possible discretization for future work.
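The origin of the $\log 2/\beta$ term in the binary case can be seen by bounding the sum over $\{\pm 1\}^{n}$ by its largest term: $2^{-n}e^{-\beta\min E}\leq\mathcal{Z}\leq e^{-\beta\min E}$, which gives $\min e\leq-\Phi_{n}/\beta\leq\min e+\log 2/\beta$. The brute-force sketch below checks this on a small instance. It is an illustration under assumptions: the function name and the sizes are arbitrary, and the energy is taken to be the number of strictly positive components of ${\textbf{W}}{\textbf{x}}$ (the inequality itself holds for any energy function).

```python
import itertools
import math
import random

def binary_perceptron_check(n=10, m=25, beta=3.0, seed=0):
    """Return (min intensive energy, upper bound, lower bound) for a small binary perceptron."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    energies = []
    for x in itertools.product((-1, 1), repeat=n):
        # Assumed energy: number of strictly positive components of W x.
        E = sum(1 for row in W if sum(w * xi for w, xi in zip(row, x)) > 0)
        energies.append(E)
    E0 = min(energies)
    # log Z with Z = 2^{-n} sum_x exp(-beta E(x)), computed stably.
    log_Z = (-beta * E0
             + math.log(sum(math.exp(-beta * (E - E0)) for E in energies))
             - n * math.log(2))
    phi_n = log_Z / n
    upper = -phi_n / beta
    lower = upper - math.log(2) / beta
    return E0 / n, upper, lower

e_min, upper, lower = binary_perceptron_check()
print(f"{lower:.4f} <= min e = {e_min:.4f} <= {upper:.4f}")
```

For a net of cardinality $(1/\varepsilon)^{n}$ the same largest-term argument replaces $\log 2$ by $\log(1/\varepsilon)$, which is the source of the $\log n/\beta$ term mentioned above.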

1.5.2 Predictions of full replica symmetry breaking theory

Computing $\Phi(\alpha,\beta)$ is in general intractable rigorously. We will show in Theorem 1.8 that we can still derive meaningful rigorous bounds, but before describing that result we first introduce another conjecture, stemming from non-rigorous methods of statistical physics. This conjecture, which we call a Parisi formula as usual in spin glass models, and that we derive in Section 3 using the non-rigorous replica method of statistical physics, gives us a (heuristic) means to exactly compute $\Phi(\alpha,\beta)$ .

Conjecture 1.7 (Parisi formula).

$\Phi(\alpha,\beta)$ is given by the full replica symmetry breaking (FRSB) prediction of statistical physics, discussed in Section 3. More precisely, we have $\Phi(\alpha,\beta)=\Phi_{\rm FRSB}(\alpha,\beta)$, cf. eq. (47), with the following interpretation:

$(i)$

We have the “Parisi formula”:

$\Phi_{\rm FRSB}(\alpha,\beta)=\inf_{q\in\mathcal{F}}\mathcal{P}[q;\alpha,\beta],$

(12)

with $\mathcal{F}$ the set of non-decreasing functions from $[0,1]$ to $[0,1]$ , and $\mathcal{P}[q;\alpha,\beta]$ a functional of $q$ , whose expression is given in eq. (47).

$(ii)$

The infimum in eq. (12) is attained at a $q^{\star}\in\mathcal{F}$ that is the functional inverse of the CDF of a probability distribution $\rho^{\star}$ on $[0,1]$ , such that for any continuous bounded function $f$ we have

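$\int\rho^{\star}(\mathrm{d}q)\,f(q)=\lim_{n\to\infty}\mathbb{E}_{\textbf{W}}\,\mathbb{E}_{({\textbf{x}},{\textbf{x}}^{\prime})\sim\mathbb{P}_{\beta,{\textbf{W}}}^{\otimes 2}}\big[f({\textbf{x}}\cdot{\textbf{x}}^{\prime})\big],$

(13)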

where $\mathbb{P}_{\beta,{\textbf{W}}}$ is the Gibbs measure defined in eq. (5).

Eq. (13) shows that in the Parisi formula of eq. (12), the functional parameter $q\in\mathcal{F}$ can be interpreted as the average overlap distribution of the system. Intuitively speaking, the “alignment” ${\textbf{x}}\cdot{\textbf{x}}^{\prime}$ of two independent draws ${\textbf{x}},{\textbf{x}}^{\prime}$ of the Gibbs measure $\mathbb{P}_{\beta,{\textbf{W}}}$ (sharing the same matrix W) will, on average, be distributed according to $\rho^{\star}$ as $n\to\infty$ . The fact that the large-size limit of the system is characterized by this overlap distribution (called therefore an “order parameter” in statistical physics) is one of the most important predictions of the replica symmetry breaking theory of Parisi, and we will further discuss this theory in the following.

Rigorous approaches –

While the most general full replica symmetry breaking framework is widely believed to yield exact predictions in the asymptotic limit, proving these predictions is a field of probability theory in itself. Indeed, significant progress has been made in some mean-field spin glass models, see e.g. [Tal10, Pan14], or in the context of inference problems and the study of computational-to-statistical gaps [BPW18, BKM+19], but proving the validity of the replica symmetry breaking procedure in more generality remains one of the important open problems in a rigorous description of the physics of disordered systems. In particular, in the spherical perceptron considered here, the general full-RSB prediction is still a conjecture beyond the satisfiable phase.

Based on Conjectures 1.6 and 1.7, we can design a statistical physics program to characterize the injectivity of the ReLU layer:

$(i)$

For any $\beta\geq 0$ , compute $\Phi(\alpha,\beta)=\lim_{n\to\infty}\mathbb{E}_{\textbf{W}}\log\mathcal{Z}_{n}({\textbf{W}},\beta)/n$ , as given by the Parisi formula of Conjecture 1.7.

$(ii)$

Compute analytically the limit $f^{\star}(\alpha)\coloneqq-\lim_{\beta\to\infty}\Phi(\alpha,\beta)/\beta$ . As we discussed above, this is a non-decreasing function of $\alpha$ , and moreover $f^{\star}(2)=0$ .

$(iii)$

$\varphi_{\textbf{W}}$ is (typically) injective if $f^{\star}(\alpha)>1$ , and non-injective if $f^{\star}(\alpha)<1$ . In particular, if $f^{\star}$ is continuous and strictly increasing (which we numerically observe), the injectivity threshold $\alpha_{\mathrm{inj}}$ is characterized by

$f^{\star}(\alpha_{\mathrm{inj}})=1.$

(14)
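Once $f^{\star}$ can be evaluated, step $(iii)$ is a one-dimensional root-finding problem. The following is a minimal sketch, assuming only that $f^{\star}$ is continuous and increasing on the bracket. The function `toy_f_star` is a made-up placeholder with a known root, not the function computed in the paper, and `find_threshold` is a hypothetical helper name.

```python
def find_threshold(f_star, lo, hi, tol=1e-10):
    """Bisection for f_star(alpha) = 1, assuming f_star is continuous and increasing on [lo, hi]."""
    if not (f_star(lo) < 1.0 < f_star(hi)):
        raise ValueError("the bracket [lo, hi] must satisfy f_star(lo) < 1 < f_star(hi)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_star(mid) < 1.0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return 0.5 * (lo + hi)

def toy_f_star(alpha):
    # Hypothetical stand-in: increasing, vanishes at alpha = 2, crosses 1 at alpha = 5.
    return (alpha - 2.0) / 3.0

alpha_toy = find_threshold(toy_f_star, 2.0, 10.0)
print(alpha_toy)
```

Bisection is chosen here because it needs no derivative of $f^{\star}$ and its accuracy is controlled directly by the bracket width.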

We perform this procedure in detail in Section 3, and it yields the main result of this section.

Result 1.1 (“Full-RSB” conjecture).

Assume Conjectures 1.6 and 1.7 hold. Denote by $A$ the event “$\varphi_{\textbf{W}}$ (cf. eq. (1)) is injective”, and let $p_{m,n}=\mathbb{P}_{\textbf{W}}[A]$. There exists a constant $\alpha_{\mathrm{inj}}^{{\rm FRSB}}\in(6.6979,6.6982)$, obtained via the Full-RSB prediction of Conjecture 1.7, such that:

$(i)$

If $\limsup_{n\to\infty}(m/n)<\alpha_{\mathrm{inj}}^{\rm FRSB}$ , then $\lim_{n\to\infty}p_{m,n}=0$ .

$(ii)$

If $\liminf_{n\to\infty}(m/n)>\alpha_{\mathrm{inj}}^{\rm FRSB}$ , then $\lim_{n\to\infty}p_{m,n}=1$ .

1.5.3 Additional bounds

The replica hierarchy of upper bounds –

In Conjecture 1.7, the FRSB prediction is given as

$\Phi_{\rm FRSB}(\alpha,\beta)=\inf_{q\in\mathcal{F}}\mathcal{P}[q;\alpha,\beta],$

(15)

and we saw that the function $q:[0,1]\to[0,1]$ could be interpreted in terms of an overlap distribution. Restricting the infimum to atomic overlap distributions with $k+1$ atoms (or, equivalently, letting $x\mapsto q(x)$ be a step function with $k+1$ steps) yields a sequence of upper bounds indexed by $k\geq 0$ :

$\Phi_{\rm FRSB}\leq\cdots\leq\Phi_{k-{\rm RSB}}\leq\cdots\leq\Phi_{1-{\rm RSB}}\leq\Phi_{\rm RS},$

(16)

in which the “ $k-{\rm RSB}$ ” functional is given by eq. (15), with the infimum restricted to step functions with $k+1$ steps, and we suppressed the dependence of all quantities on $\alpha$ and $\beta$ to lighten notation. Let $k^{\star}$ be the smallest $k$ such that $\Phi_{\rm FRSB}(\alpha,\beta)=\Phi_{k^{\star}-{\rm RSB}}(\alpha,\beta)$ . If $1\leq k^{\star}<\infty$ , we say that the system is $k^{\star}$ -th step replica symmetry breaking; if $k^{\star}=0$ the system is called replica-symmetric (RS); if $k^{\star}$ does not exist the system is said to exhibit full replica symmetry breaking. We will clarify the meaning of “replica symmetry breaking” in Section 2. Finally, note that by using Corollary 1.5 and Conjecture 1.6, eq. (16) transfers into a hierarchy of upper bounds for the injectivity threshold:

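$\alpha_{\mathrm{inj}}^{\rm FRSB}\leq\cdots\leq\alpha_{\mathrm{inj}}^{k-{\rm RSB}}\leq\cdots\leq\alpha_{\mathrm{inj}}^{1-{\rm RSB}}\leq\alpha_{\mathrm{inj}}^{\rm RS},$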

in which $\alpha_{\mathrm{inj}}^{k-{\rm RSB}}$ is the value of $\alpha$ at which $f^{\star}_{k-{\rm RSB}}(\alpha)\coloneqq\lim_{\beta\to\infty}[-\Phi_{k-{\rm RSB}}(\alpha,\beta)/\beta]$ crosses $1$ . In particular, as we will see in Section 2, one can compute $\alpha_{\mathrm{inj}}^{\rm RS}\simeq 7.65$ . Note that increasing $k$ and in particular going to the full-RSB solution, which conjecturally solves the problem, only takes us further from the Euler characteristic prediction $\alpha_{\mathrm{inj}}^{\mathrm{Euler}}\simeq 8.34$ . In Fig. 1 we illustrate the predictions at the RS, 1-RSB, and Full-RSB levels.

Proving the replica-symmetric bound –

We can state another rigorous characterization. Using Gordon’s min-max theorem [Gor85, TAH18], we can prove that the replica-symmetric prediction is an upper bound on the injectivity threshold:

Theorem 1.8 (Replica-symmetric upper bound for the injectivity threshold).

Assume that $\alpha>\alpha_{\mathrm{inj}}^{\rm RS}\simeq 7.65$ . Then $p_{m,n}\to 1$ as $n,m\to\infty$ , that is, $\varphi_{\textbf{W}}$ is injective w.h.p.

Note that for the computation of the Gardner capacity of the so-called “positive” perceptron, the replica-symmetric prediction has been shown to be tight also using Gordon’s inequality [Sto13a], because one can rewrite the associated min-max problem using a convex function on convex sets. In the unsatisfiable phase we consider here the solution is conjecturally full-RSB, and the replica-symmetric bound of Theorem 1.8 is not expected to be tight.

Theorem 1.8 is proved in Appendix A.5. It improves upon the earlier upper bounds of Theorem 1.3, and it disproves the Euler characteristic based threshold prediction $\alpha_{\mathrm{inj}}^{\rm Euler}\simeq 8.34$ [Pal21, Clu22, CPBM22]. Finally, let us mention two other bounds one can obtain on the injectivity threshold.

The annealed bound –

A classical approach in statistical physics to upper-bound the free entropy $\Phi(\alpha,\beta)$ is an annealed calculation. Namely, one uses Jensen’s inequality to write

[TABLE]

This gives us an additional upper bound $\Phi(\alpha,\beta)\leq\Phi_{\rm ann.}(\alpha,\beta)$ (which, however, never improves over the replica-symmetric one, since it is a general fact that $\Phi_{\rm RS}(\alpha,\beta)\leq\Phi_{\rm ann.}(\alpha,\beta)$) and a corresponding upper bound for the injectivity threshold $\alpha_{\mathrm{inj}}\leq\alpha_{\mathrm{inj}}^{\rm ann.}$. We leave to the reader the exercise to show $\Phi_{\rm ann.}(\alpha,\beta)=\alpha\log[(1+e^{-\beta})/2]$. In particular, we have $-\Phi_{{\rm ann.}}(\alpha,\beta)/\beta\to 0$ as $\beta\to\infty$ for any $\alpha>0$, and therefore $\alpha_{\mathrm{inj}}^{\rm ann.}=+\infty$: here, the result of the annealed calculation is completely uninformative.
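For completeness, one possible route to the annealed formula is sketched here. It assumes that the partition function has the form $\mathcal{Z}_{n}({\textbf{W}},\beta)=\int\mu_{n}(\mathrm{d}{\textbf{x}})\,e^{-\beta\sum_{\mu}\theta({\textbf{W}}_{\mu}\cdot{\textbf{x}})}$ with $\theta(z)=\mathds{1}\{z>0\}$, and uses that for a fixed unit vector the variables ${\textbf{W}}_{\mu}\cdot{\textbf{x}}$ are i.i.d. standard Gaussians.

```latex
% Assumptions: Z_n(W, beta) = \int mu_n(dx) exp(-beta sum_mu theta(W_mu . x)),
% theta(z) = 1{z > 0}, and W_mu . x ~ N(0,1) i.i.d. over mu for any fixed unit x.
\begin{align*}
\mathbb{E}_{\mathbf{W}}\,\mathcal{Z}_{n}(\mathbf{W},\beta)
  &= \int\mu_{n}(\mathrm{d}\mathbf{x})\prod_{\mu=1}^{m}
     \mathbb{E}_{z\sim\mathcal{N}(0,1)}\big[e^{-\beta\theta(z)}\big]
   = \Big(\frac{1+e^{-\beta}}{2}\Big)^{m},\\
\Phi_{\rm ann.}(\alpha,\beta)
  &= \lim_{n\to\infty}\frac{1}{n}\log\mathbb{E}_{\mathbf{W}}\,\mathcal{Z}_{n}(\mathbf{W},\beta)
   = \alpha\log\frac{1+e^{-\beta}}{2},\\
0\leq-\frac{\Phi_{\rm ann.}(\alpha,\beta)}{\beta}
  &= \frac{\alpha}{\beta}\log\frac{2}{1+e^{-\beta}}
   \leq\frac{\alpha\log 2}{\beta}\xrightarrow[\beta\to\infty]{}0.
\end{align*}
```

The last line makes the failure explicit: the annealed free entropy saturates at $-\alpha\log 2$ instead of decreasing linearly in $\beta$, so its ratio to $\beta$ can never reach the value $1$.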

An additional lower bound –

In Fig. 1 we also show in green a region that is discarded for the injectivity threshold by a non-rigorous lower bound $\alpha_{\mathrm{inj}}\geq\alpha_{{\rm dAT}}\simeq 5.32$ , based on the de Almeida-Thouless criterion [dAT78] of statistical physics. We detail its origin in Section 2.3, and its calculation in Appendix C. While it is not mathematically rigorous, proving it would follow from a rigorous computation of the free entropy in the “high-temperature” (or small $\beta$ ) phase in which replica symmetry is conjectured to hold. In many models, this turned out to be possible to handle more easily than the complete full-RSB conjecture, so we mention it for completeness. We also note that it improves over the lower bound $\alpha_{\mathrm{inj}}\geq 3.4$ of Theorem 1.3.

1.6 Structure of the paper and open problems

Section 2 has in great part a pedagogical purpose, to introduce the unacquainted reader to the (mostly non-rigorous) results of statistical physics known in the spin glass literature under the umbrella of replica method and replica symmetry breaking. We detail there the replica computation in the spherical perceptron and the arising of replica symmetry breaking, and derive the replica symmetric and one-step replica symmetry breaking predictions for the injectivity threshold. In Section 3 we discuss the full-replica symmetry breaking prediction for the free entropy, and we derive an efficient algorithmic procedure to solve the zero-temperature full-RSB equations. We discuss the numerical behavior of this algorithm, and use it to derive the numerical estimate of the injectivity threshold in Result 1.1. As mentioned, the proofs of our rigorous results (in particular Theorem 1.8) are given in Appendix A, and other analytical or numerical details and technical arguments will be deferred to the other appendices.

Let us finally mention a few open directions that stem from our analysis.

Deep networks –

A natural extension of our results would be to analyze a composition of multiple ReLU layers. Denoting still by $n$ the input dimension, Theorem 1.8 guarantees injectivity w.h.p. if the size $k_{L}$ of the $L$-th layer satisfies $k_{L}>n(\alpha_{\mathrm{inj}}^{\rm RS})^{L}$. However, this is far from optimal: leveraging the structure of the image space of a ReLU layer, [Pal21, Clu22] have shown that $k_{L}\geq n(C_{1}+C_{2}L\log L)$ (for some constants $C_{1},C_{2}>0$) is enough to guarantee injectivity; this may be further improved using arguments based on random projections [PKL+22]. An interesting open question is whether the techniques we develop here (and in particular the replica symmetry breaking framework) can be extended to predict exact injectivity transitions in the multi-layer case.

Stability of the inverse –

While the injectivity question is limited to non-injective $\sigma$ – such as ReLU – in eq. (1), a natural extension would be to estimate the Lipschitz constant of the inverse of $\varphi_{\textbf{W}}$ on its range, either in the injective phase we described for $\sigma=\mathrm{ReLU}$ , or for any $\alpha>0$ when $\sigma$ is injective. Whether this question can be tackled using statistical physics tools similar to the ones we used here is an interesting open direction.

Improvement over Theorem 1.8 –

One can consider a closely related model called the negative perceptron by replacing $\theta(x)=\mathds{1}\{x>0\}$ by $\mathds{1}\{x\geq\kappa\}$ with $\kappa<0$ in the energy of eq. (3). In this model, even computing the Gardner capacity conjecturally requires the full-RSB prediction. However, [Sto13b, MZZ21] have made a refined use of Gordon’s inequality to improve over the replica-symmetric upper bound for the capacity. While similar ideas might be able to improve the upper bound of Theorem 1.8, implementing them is not immediate, since the method used in [MZZ21] relies on the min-max problem being formulated over unit-norm vectors, which is not the case here. Since Theorem 1.8 already allows us to disprove the Euler characteristic prediction, we leave such an improvement for later work.

Large deviations of sublevel sets –

The non-validity of the average Euler characteristic prediction also leads to interesting predictions on the energy landscape of the perceptron. Indeed, the quantity $q_{m,n}$ of eq. (7) is the mean Euler characteristic of a sublevel set $U$ of the perceptron, more precisely $q_{m,n}=\mathbb{E}_{\textbf{W}}[\chi(U)]$ , with $U\coloneqq\{{\textbf{x}}\in\mathcal{S}^{n-1}\,:\,e_{\textbf{W}}({\textbf{x}})\leq 1\}$ . Recall that $\alpha_{\mathrm{inj}}^{\rm Euler}\simeq 8.34$ while $\alpha_{\mathrm{inj}}^{\rm RS}\simeq 7.65$ . According to Theorem 1.8, for all $\alpha\in(\alpha_{\mathrm{inj}}^{\rm RS},\alpha_{\mathrm{inj}}^{\rm Euler})$ (and conjecturally in $(\alpha_{\mathrm{inj}}^{\rm FRSB},\alpha_{\mathrm{inj}}^{\rm Euler})$ ) the set $U$ is typically empty: however its average Euler characteristic is exponentially large! A possible explanation for this discrepancy is that there exist large deviations events with probability $\exp(-nI_{1})$ in which the set $U$ is not only non-empty, but has Euler characteristic $\exp(nI_{2})$ . A natural conjecture is that $I_{2}>I_{1}$ for $\alpha<\alpha_{\mathrm{inj}}^{\rm Euler}$ and $I_{2}<I_{1}$ for $\alpha>\alpha_{\mathrm{inj}}^{\rm Euler}$ . Exploring further these large deviations could thus explain the error made in the Euler characteristic approach.
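The mechanism proposed above can be summarized in a two-scenario caricature. This is a heuristic sketch, not a statement from the paper: it assumes that only the typical event ($U$ empty, Euler characteristic $0$) and a single family of rare events contribute to the mean.

```latex
% Heuristic caricature: typical event (U empty) versus one rare event of probability exp(-n I_1).
\[
\mathbb{E}_{\mathbf{W}}[\chi(U)]
  \approx \underbrace{\big(1-e^{-nI_{1}}\big)\cdot 0}_{U=\emptyset\ \text{(typical)}}
  + \underbrace{e^{-nI_{1}}\cdot e^{nI_{2}}}_{\text{rare event}}
  = e^{n(I_{2}-I_{1})}.
\]
```

Under this caricature the mean is exponentially large exactly when $I_{2}>I_{1}$, even though $U$ is empty with probability $1-e^{-nI_{1}}$.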

Numerical code and reproducibility –

All figures and numerical results in this paper are fully reproducible. The JAX [BFH+18] code is available in a GitHub repository [Mai23].

Acknowledgments –

A.M. and A.B. thank D. Paleka, C. Clum, and D. Mixon for several discussions related to this paper. A.M. is grateful to F. Krzakala, L. Zdeborová, B. Loureiro and P. Urbani for insightful discussions.

2 The replica hierarchy of upper bounds

2.1 General principles of the replica method

The replica method is based on the replica trick, a heuristic use of the following formula, for any random variable $X>0$ (assuming that all the moments written hereafter are well-defined):

[TABLE]

While the replica trick is most often described as the first equality in eq. (18), we will here use the second (and equivalent) equality. Assuming that $\Phi(\alpha,\beta)\coloneqq\lim_{n\to\infty}\mathbb{E}\,\Phi_{n}({\textbf{W}},\beta)$ is well defined, we reach:

[TABLE]

So far, eq. (19) is not really surprising. The replica method is based on several heuristics, and leverages the fact that it is often possible to compute the RHS of eq. (19) for integer $r$ . More precisely, the replica method proceeds as follows:

Replica method

$(i)$

Assume that the limits $n\to\infty$ and $r\to 0$ can be inverted in eq. (19), i.e. that we have $\Phi(\alpha,\beta)=\partial_{r}[\Phi(\alpha,\beta;r)]_{r=0}$ , with

$\displaystyle\Phi(\alpha,\beta;r)$ $\displaystyle\coloneqq\lim_{n\to\infty}\frac{1}{n}\log\mathbb{E}_{\textbf{W}}\big{\{}\mathcal{Z}_{n}({\textbf{W}},\beta)^{r}\big{\}}.$

(20)

$(ii)$

Compute $\Phi(\alpha,\beta;r)$ for integer $r$ , i.e. the asymptotics of the moments of $\mathcal{Z}_{n}({\textbf{W}},\beta)$ .

$(iii)$

Use these values to analytically continue $\{\Phi(\alpha,\beta;r)\}_{r\in\mathbb{N}}$ to all $r\geq 0$.

$(iv)$

Compute $\Phi(\alpha,\beta)=\partial_{r}[\Phi(\alpha,\beta;r)]_{r=0}$ from the analytic continuation above.
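The logic of steps $(ii)$ to $(iv)$ can be illustrated on a case where nothing is heuristic. For a log-normal variable $X$ with $\log X\sim\mathcal{N}(\mu,\sigma^{2})$, the moments $\mathbb{E}[X^{r}]=\exp(r\mu+r^{2}\sigma^{2}/2)$ are known in closed form, the formula holds for all real $r$, and $\mathbb{E}\log X=\mu$. The sketch below (with hypothetical parameter values `MU` and `SIGMA`) recovers $\mu$ both from $(\mathbb{E}[X^{r}]-1)/r$ as $r\to 0$ and from the derivative of $r\mapsto\log\mathbb{E}[X^{r}]$ at $r=0$.

```python
import math

MU, SIGMA = 0.7, 1.3  # hypothetical parameters: log X ~ N(MU, SIGMA^2)

def log_moment(r):
    """log E[X^r] for log-normal X; valid for integer r and for all real r."""
    return r * MU + 0.5 * (r * SIGMA) ** 2

def replica_estimate(r):
    """(E[X^r] - 1) / r, which tends to E[log X] as r -> 0."""
    return math.expm1(log_moment(r)) / r

def derivative_at_zero(h=1e-6):
    """Central finite difference of r -> log E[X^r] at r = 0."""
    return (log_moment(h) - log_moment(-h)) / (2 * h)

estimates = [replica_estimate(r) for r in (1e-1, 1e-2, 1e-3, 1e-4)]
slope = derivative_at_zero()
print(estimates, slope)  # both approach E[log X] = MU
```

In this toy case the continuation from integer to real $r$ is given by an explicit formula. In spin glass models only the integer moments are accessible, and choosing the continuation is the delicate step.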

Note that step $(i)$, although a priori non-rigorous, can sometimes be put on rigorous ground using convexity arguments, cf. e.g. page 146 of [Tal10] in the context of the Sherrington-Kirkpatrick (or SK) model. The arguably “most heuristic” step is $(iii)$, as there is in general no guarantee for the uniqueness of the analytic continuation (and it is often not unique!). The choice of the conjecturally correct continuation was proposed by Parisi in a remarkable series of papers [Par79, Par80a, Par80b], one of the most important contributions for which he earned a Nobel prize in Physics in 2021, and we will describe this choice in the following sections. In the SK model originally studied by Parisi, his prediction was ultimately proven to be correct by Talagrand [Tal06b] and generalized by Panchenko [Pan14], leveraging notably interpolation techniques that originated with Guerra [Gue03]. An actual rigorous treatment of the replica method itself remains out of reach, and in the spherical perceptron considered here replica predictions have not been proven, with the exception of the satisfiable phase [ST03, Sto13a].

2.2 First steps of the replica method

Let us now perform step $(ii)$ of the replica method. From now on, we relax the level of rigor and sometimes adopt notations closer to the theoretical physics literature, since the core of the method is heuristic. Fixing $r\in\mathbb{N}^{\star}$ we have (recall eq. (6)):

[TABLE]

We have used Fubini’s theorem in eq. (2.2). We see appearing a set $\{{\textbf{x}}^{a}\}_{a=1}^{r}$ of independent samples from the Gibbs measure $\mathbb{P}_{\beta,{\textbf{W}}}$ , with the same realization of the matrix W: we call such independent samples replicas, following the statistical physics nomenclature. The expectation with respect to W in eq. (2.2) can be performed, since at fixed $\{{\textbf{x}}^{a}\}$ , ${\textbf{z}}^{a}\coloneqq{\textbf{W}}{\textbf{x}}^{a}$ are jointly Gaussian vectors with covariance $\mathbb{E}[z^{a}_{\mu}z^{b}_{\nu}]=\delta_{\mu\nu}Q^{ab}$ , where we introduced the overlap matrix $Q^{ab}\coloneqq{\textbf{x}}^{a}\cdot{\textbf{x}}^{b}$ (note that $Q^{aa}=1$ , $Q^{ab}=Q^{ba}$ , and that the matrix $\{{\textbf{x}}^{a}\cdot{\textbf{x}}^{b}\}$ is almost surely invertible under $\mu_{n}^{\otimes r}$ ). Therefore we have:

[TABLE]

in which we defined

[TABLE]
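The covariance structure of the vectors ${\textbf{z}}^{a}$ used above follows from a one-line computation, written out here under the assumption that the entries of W are i.i.d. standard Gaussians:

```latex
% Assumption: W has i.i.d. N(0,1) entries, so E[W_{mu i} W_{nu j}] = delta_{mu nu} delta_{ij}.
\[
\mathbb{E}\big[z^{a}_{\mu}z^{b}_{\nu}\big]
  = \sum_{i,j=1}^{n}\mathbb{E}\big[W_{\mu i}W_{\nu j}\big]\,x^{a}_{i}x^{b}_{j}
  = \delta_{\mu\nu}\sum_{i=1}^{n}x^{a}_{i}x^{b}_{i}
  = \delta_{\mu\nu}\,Q^{ab}.
\]
```

In particular the $m$ rows decouple, which is why the expectation over W factorizes into $m$ identical $r$-dimensional Gaussian integrals that depend on the replicas only through Q.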

One can thus write eq. (2.2) as:

[TABLE]

with $J({\textbf{Q}})$ defined as the PDF of the overlap matrix ${\textbf{Q}}(\{{\textbf{x}}^{a}\})$ (for $\{{\textbf{x}}^{a}\}\sim\mu_{n}^{\otimes r}$ ) evaluated in Q:

[TABLE]

in which we used that $Q^{aa}=1$ and we re-normalized ${\textbf{x}}^{a}$ by $\sqrt{n}$ . One way to compute the numerator in eq. (23) is to use an exponential tilting method, by the following argument: for any symmetric ${\boldsymbol{\Lambda}}\in\mathbb{R}^{r\times r}$ positive-definite, we have

[TABLE]

The idea is to pick ${\boldsymbol{\Lambda}}$ so that under the probability distribution $P_{\boldsymbol{\Lambda}}(\{{\textbf{x}}^{a}\})\propto\exp\{-\frac{1}{2}\sum_{a,b}\Lambda^{ab}{\textbf{x}}^{a}\cdot{\textbf{x}}^{b}\}$ , we have with high probability ${\textbf{x}}^{a}\cdot{\textbf{x}}^{b}/n\to Q^{ab}$ as $n\to\infty$ . Since $P_{\boldsymbol{\Lambda}}$ is Gaussian, one finds ${\boldsymbol{\Lambda}}={\textbf{Q}}^{-1}$ as the correct choice. Heuristically, the argument then goes as follows: for ${\boldsymbol{\Lambda}}={\textbf{Q}}^{-1}$ , the constraint terms in eq. (2.2) are satisfied as $n\to\infty$ , so that we can remove the Dirac deltas without affecting the asymptotic value of the integral. Performing the same calculation in the denominator (for which ${\boldsymbol{\Lambda}}=\mathrm{I}_{r}$ is now the correct choice), one reaches:

[TABLE]
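The choice ${\boldsymbol{\Lambda}}={\textbf{Q}}^{-1}$ can be checked numerically in the simplest case of $r=2$ replicas. Under $P_{\boldsymbol{\Lambda}}$ the $n$ coordinates are independent, and each pair $(x^{1}_{i},x^{2}_{i})$ is Gaussian with covariance ${\boldsymbol{\Lambda}}^{-1}={\textbf{Q}}$, so the law of large numbers gives ${\textbf{x}}^{a}\cdot{\textbf{x}}^{b}/n\to Q^{ab}$. The sketch below is a Monte Carlo illustration; the function name, the value $q=0.6$ and the sample size are arbitrary assumptions.

```python
import math
import random

def empirical_overlaps(q, n, seed=0):
    """Sample two replicas with per-coordinate covariance Q = [[1, q], [q, 1]]
    (i.e. precision Lambda = Q^{-1}) and return (x1.x1/n, x1.x2/n, x2.x2/n)."""
    rng = random.Random(seed)
    s11 = s12 = s22 = 0.0
    c = math.sqrt(1.0 - q * q)
    for _ in range(n):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Cholesky factor of Q applied to independent standard Gaussians.
        x1, x2 = g1, q * g1 + c * g2
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
    return s11 / n, s12 / n, s22 / n

q11, q12, q22 = empirical_overlaps(q=0.6, n=200_000)
print(q11, q12, q22)  # close to 1, 0.6 and 1 up to fluctuations of order n^{-1/2}
```

The fluctuations of the empirical overlaps are of order $n^{-1/2}$, which is the sense in which the Dirac constraints become asymptotically free under the tilted measure.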

Such “exponential tilting” arguments can be made rigorous, and are classical e.g. in the theory of large deviations [DZ98]. Another (equivalent) way to obtain eq. (25) is to introduce the Fourier transform of the Dirac delta in eq. (23), and perform a saddle-point method over the parameters of the Fourier integral, see e.g. [CC05] or [Urb18]. The Gaussian integral in eq. (25) can be computed:

[TABLE]

This yields:

[TABLE]

with

[TABLE]

It is crucial that in many physical models, the average of the replicated partition function can be written as in eq. (26), as a function of a low-dimensional parameter (recall that Q is an $r\times r$ matrix, and that $r$ is a fixed positive integer). In physics, one refers to the overlap matrix Q as the order parameter of the problem: a low-dimensional quantity that allows one to characterize the macroscopic behavior of our high-dimensional system (similarly to the average magnetization in a ferromagnet for instance).

Applying Laplace’s method to the integral in eq. (26), we finally reach:

[TABLE]

where the supremum is over $r\times r$ symmetric positive-definite matrices such that $Q^{aa}=1$, and recall that $I_{\beta}({\textbf{Q}})$ is defined in eq. (22). Note that we completely removed the high dimensionality of the problem! The remaining task is to perform step $(iii)$ of the replica method, i.e. to analytically continue $\Phi(\alpha,\beta;r)$ to any $r>0$. This is the crucial difficulty of the replica method (and the main reason why it is ill-posed mathematically in general), which was solved by Parisi [Par79, Par80a, Par80b].

2.3 The replica-symmetric solution

The functional in eq. (27) is symmetric: one can permute the different replicas of the system (and correspondingly swap the rows and columns of Q) without changing the value of the functional. This has led physicists to first assume that the supremum in eq. (27) is attained by a matrix Q that is also invariant under permutations, i.e. that satisfies $Q^{ab}=q$ for all $a\neq b$. This replica-symmetric assumption was historically the first one considered to find a solution to the SK model [SK75]. We will see how it allows us to complete the final steps of the replica method.

Note that replica symmetry can be put on a firmer mathematical ground, using the following characterization.

Replica symmetry (RS) – Let $Q({\textbf{x}},{\textbf{x}}^{\prime})\coloneqq{\textbf{x}}\cdot{\textbf{x}}^{\prime}$ , and recall the Gibbs distribution $\mathbb{P}_{\beta,{\textbf{W}}}$ of eq. (5). Replica symmetry amounts to assuming that the random variable $Q({\textbf{x}},{\textbf{x}}^{\prime})$ concentrates when ${\textbf{x}},{\textbf{x}}^{\prime}$ are sampled independently from $\mathbb{P}_{\beta,{\textbf{W}}}$ (with the same W), in the following sense:

$\lim_{n\to\infty}\mathbb{E}_{\textbf{W}}\Big{[}\mathbb{E}_{({\textbf{x}},{\textbf{x}}^{\prime})\sim\mathbb{P}_{\beta,{\textbf{W}}}^{\otimes 2}}\Big{\{}(Q({\textbf{x}},{\textbf{x}}^{\prime})-\mathbb{E}Q)^{2}\Big{\}}\Big{]}=0,$

(28)

with the shorthand $\mathbb{E}Q\coloneqq\mathbb{E}_{\textbf{W}}[\mathbb{E}_{({\textbf{x}},{\textbf{x}}^{\prime})\sim\mathbb{P}_{\beta,{\textbf{W}}}^{\otimes 2}}(Q({\textbf{x}},{\textbf{x}}^{\prime}))]$ .

In particular, under the RS ansatz, we can write the off-diagonal elements of the overlap matrix appearing in eq. (27) as $Q^{ab}=q=\mathbb{E}_{\textbf{W}}[\mathbb{E}_{({\textbf{x}},{\textbf{x}}^{\prime})\sim\mathbb{P}_{\beta,{\textbf{W}}}^{\otimes 2}}({\textbf{x}}\cdot{\textbf{x}}^{\prime})]$, in which ${\textbf{x}},{\textbf{x}}^{\prime}$ are two independent samples under the Gibbs measure with quenched noise W (two replicas of the system), and $a\neq b$. Therefore, we also have $q=\mathbb{E}_{\textbf{W}}[\|\mathbb{E}_{{\textbf{x}}\sim\mathbb{P}_{\beta,{\textbf{W}}}}({\textbf{x}})\|^{2}]$, which implies in particular that $q\in[0,1]$.

Let us now finish the replica calculation under a replica symmetric assumption, going back to eq. (27). By simple linear algebra calculations, the RS ansatz implies, for all $a\neq b$ :

[TABLE]

and moreover

[TABLE]
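The linear algebra behind the RS ansatz can be verified directly. For ${\textbf{Q}}=(1-q)\mathrm{I}_{r}+q\mathbf{1}\mathbf{1}^{\top}$ the eigenvalues are $1+(r-1)q$ (once) and $1-q$ ($r-1$ times), and the inverse has the same two-parameter structure. The sketch below checks these standard facts for one arbitrary choice of $(r,q)$; the helper names are illustrative, and the normalization may differ from the one used in the paper's equations.

```python
import math

def rs_matrix(r, q):
    """Replica-symmetric overlap matrix: 1 on the diagonal, q elsewhere."""
    return [[1.0 if a == b else q for b in range(r)] for a in range(r)]

def rs_inverse(r, q):
    """Closed-form inverse, with constant diagonal and constant off-diagonal entries."""
    off = -q / ((1.0 - q) * (1.0 + (r - 1) * q))
    diag = 1.0 / (1.0 - q) + off
    return [[diag if a == b else off for b in range(r)] for a in range(r)]

def rs_log_det(r, q):
    """log det Q from the eigenvalues 1 + (r-1) q (once) and 1 - q (r-1 times)."""
    return math.log(1.0 + (r - 1) * q) + (r - 1) * math.log(1.0 - q)

def log_det_by_elimination(M):
    """Reference log-determinant by Gaussian elimination (no pivoting; M positive-definite)."""
    A = [row[:] for row in M]
    n, total = len(A), 0.0
    for k in range(n):
        total += math.log(A[k][k])
        for i in range(k + 1, n):
            factor = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= factor * A[k][j]
    return total

r, q = 5, 0.3
Q, Q_inv = rs_matrix(r, q), rs_inverse(r, q)
product = [[sum(Q[a][c] * Q_inv[c][b] for c in range(r)) for b in range(r)] for a in range(r)]
print(rs_log_det(r, q), log_det_by_elimination(Q))
```

Both closed forms are explicit functions of $r$, so they extend to non-integer $r$; for instance $\log\det{\textbf{Q}}=r[\log(1-q)+q/(1-q)]+\mathcal{O}(r^{2})$ as $r\to 0$.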

Plugging the form of ${\textbf{Q}}^{-1}$ we have:

[TABLE]

Recall that $\mathcal{D}\xi$ is the standard Gaussian measure on $\mathbb{R}$ , and we have used the identity $\exp(x^{2}/2)=\int\mathcal{D}\xi\exp(x\xi)$ . Plugging eqs. (29) and (2.3) in eq. (27) we reach, for $r\in\mathbb{N}^{\star}$ :

[TABLE]

One can now begin to see how the replica-symmetric ansatz yields an analytical continuation of $\Phi_{\mathrm{RS}}(\alpha,\beta;r)$ for all $r>0$. A final non-trivial (and very non-rigorous) technicality of the replica method, which we do not detail here, is that when we analytically continue the function above to $r<1$, theoretical physicists argue that maximizers of eq. (31) are continued into minima of $\Phi_{\rm RS}(r,q)$. We refer to [MPV87] for a detailed discussion: a rough intuition is that one performs a Laplace method over the $r(r-1)/2$ variables $\{Q^{ab}\}_{a<b}$, which becomes a negative number of variables (!) for $r<1$, turning the supremum into an infimum. In the present case this phenomenon can be easily observed, see Fig. 2.

Under the replica-symmetric ansatz, we therefore obtain:

[TABLE]

The inner integral is easy to work out:

[TABLE]

where $H(x)\coloneqq\int_{x}^{\infty}\mathcal{D}u=[1-\mathrm{erf}(x/\sqrt{2})]/2$ . In particular, $H^{\prime}(x)=-e^{-x^{2}/2}/\sqrt{2\pi}$ . Then:

[TABLE]
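The Gaussian tail function $H$ that enters the replica-symmetric formulas is easy to evaluate numerically. The following minimal sketch implements the definition given above through the complementary error function and checks the stated derivative by finite differences; the function names are illustrative.

```python
import math

def H(x):
    """Gaussian tail H(x) = P[N(0,1) > x] = [1 - erf(x / sqrt(2))] / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def H_prime(x):
    """Derivative of H: minus the standard Gaussian density."""
    return -math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def finite_difference(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, H(x), H_prime(x), finite_difference(H, x))
```

Using `math.erfc` rather than `1 - math.erf` avoids cancellation for large positive arguments, where $H(x)$ is exponentially small.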

For any $\beta\geq 0$ , the minimizing $q$ is thus given by the solution to

[TABLE]

that minimizes the functional of eq. (32). The quantity $e^{\star}(\alpha,\beta)\coloneqq-\partial_{\beta}\Phi_{\rm RS}(\alpha,\beta)$ is called the average intensive energy: as can be seen from eq. (6), $ne^{\star}(\alpha,\beta)$ is the average number of negative components of x when sampled from the Gibbs measure of eq. (5). At the replica-symmetric level it is given by:

[TABLE]

In particular, one sees that for $\beta=0$ we have $q=0$ and $e^{\star}(\alpha,\beta=0)=\alpha/2$ , which is the typical number of negative components of a random $m$ -dimensional vector (divided by $n$ ).

The zero-temperature limit – For any $\alpha>2$ (i.e. in the UNSAT phase), one can check from eq. (33) that $q\to 1$ as $\beta\to\infty$. This means that the replica-symmetric ansatz predicts that, as $\beta\to\infty$, the Gibbs measure concentrates on the global minima of $E_{\textbf{W}}({\textbf{x}})$, and that (at fixed W) the distance between any two such minima goes to $0$ as $n\to\infty$ (as we will see in Section 3, while the replica-symmetry assumption turns out to be wrong, this prediction remains correct!). One can also see from this equation (cf. [GD88, FPS+17]) that the expansion of the solution $q$ is of the type:

$q=1-\frac{\chi_{\rm RS}}{\beta}+\mathcal{O}(\beta^{-2}),$

where $\chi_{\rm RS}$ is the so-called zero-temperature susceptibility. Plugging this expansion in the equations above, we recover the result of [GD88] (we detail the computations in Appendix B.1). We find that $\chi_{\rm RS}$ is the unique solution to:

[TABLE]

and $f^{\star}_{\rm RS}(\alpha)=\lim_{\beta\to\infty}[-\Phi_{\rm RS}(\alpha,\beta)/\beta]$ is given as (recall $H(x)=\int_{x}^{\infty}\mathcal{D}u$ ):

[TABLE]

Replica-symmetric prediction for $\alpha_{\mathrm{inj}}$ – Recall the criterion of eq. (14) for the injectivity threshold. Eqs. (36) and (37) are easy to analyze numerically, and they yield that $f^{\star}_{\rm RS}(\alpha)=1$ for:

$\alpha_{\mathrm{inj}}^{\rm RS}\simeq 7.65,$

in which $\mathrm{RS}$ stands for the replica-symmetric assumption.

Instability of the replica-symmetric solution and the need for a different ansatz – An important check of the validity of the replica-symmetric ansatz is that it indeed is a maximum of the functional given in eq. (27) (or a minimum when $r<1$ as we discussed). This can be verified locally, by considering the Hessian of this function, and looking at the sign of its eigenvalues when $r\to 0$ . The stability criterion is called the de Almeida-Thouless (dAT) condition [dAT78], and we derive it in Appendix B.2 for any inverse temperature $\beta\geq 0$ , cf. eq. (98). However, we also show that this condition is never satisfied in the limit $\beta\to\infty$ , for any $\alpha>2$ . This suggests that the correct solution actually breaks the replica symmetry! Formally, the functional of eq. (27) exhibits a well-known physical phenomenon known as spontaneous symmetry breaking: while the function to maximize is invariant under the group of permutations of the $r$ replicas, any particular maximum is not invariant under this symmetry.

A replica-symmetric lower bound – In Appendix C, we detail a way to use the replica-symmetric prediction at finite $\beta\geq 0$ , combined with the stability analysis of Appendix B.2, to obtain a lower bound on $\alpha_{\mathrm{inj}}$ :

[TABLE]

in which the definition and calculation of $\alpha_{\rm dAT}$ can be deduced solely from the replica-symmetric calculation: we refer to Appendix C for more details on this bound, which is shown as a light green area in Fig. 1.

2.4 The overlap distribution and replica symmetry breaking

Since we must go beyond replica symmetry, one has to understand what could happen if the overlap concentration of eq. (28) is not satisfied. We define $q\equiv{\textbf{x}}\cdot{\textbf{x}}^{\prime}$ , in which ${\textbf{x}},{\textbf{x}}^{\prime}$ are independent samples under the Gibbs measure of eq. (5), with the same quenched noise W, and we will study the law of $q$ averaged over W, which we will denote $\rho_{n}(q)$ .

A natural possibility is that, while the random variable $q$ no longer concentrates, its average distribution $\rho_{n}(q)$ still converges (weakly) to an asymptotic law $\rho(q)$ (for $q\in[0,1]$) as $n\to\infty$. Replica symmetry then corresponds to the case $\rho(q)=\delta(q-q_{0})$. But how does an arbitrary $\rho(q)$ translate into an $r\times r$ overlap matrix ${\textbf{Q}}$ maximizing eq. (27)? Actually, the other direction (going from ${\textbf{Q}}$ to $\rho(q)$) is easier to formalize. Indeed, for the same ${\textbf{W}}$, let us draw two independent samples ${\textbf{x}},{\textbf{x}}^{\prime}$ under the Gibbs measure (two “replicas”). On average, their overlap is distributed as the off-diagonal elements of the overlap matrix, i.e. we have (one can formalize this argument, see e.g. [MS22b])

[TABLE]

However, recall that our physical system is not represented by the overlap matrix Q at finite $r$ , but rather by its $r\to 0$ limit, so we should take this limit as well to get the $\rho(q)$ that describes our original physical system (even though taking the $r\to 0$ limit of a $r\times r$ matrix shatters much of our intuition!). More concretely, the overlap distribution $\rho(q)$ is related to the overlap matrix Q by:

[TABLE]

One-step replica symmetry breaking – To build back our intuition a bit, let us look at the simplest possible $\rho(q)$ beyond the RS ansatz, that is, let us assume that $\rho(q)=m\delta(q-q_{0})+(1-m)\delta(q-q_{1})$ , with $m\in[0,1]$ , and $q_{0}\leq q_{1}$ . One brilliant realization of Parisi [Par79, Par80a, Par80b] was that this distribution arises from an ultrametric overlap matrix Q, i.e. that has the following form:

[TABLE]

Let us detail how to go from the Q shown in eq. (41) to the $\rho(q)$ that we want. We denote $x\in\{1,\cdots,r\}$ the size of the diagonal blocks in this matrix Q. Then:

[TABLE]

Now arises an issue: since we take the $r\downarrow 0$ limit, and $x\in\{1,\cdots,r\}$ is an integer, how should we proceed? Comparing eq. (42) with our target $\rho(q)$ gives us a possible answer (which turns out to be the correct one [MPV87]): relaxing the constraint that $x\in\{1,\cdots,r\}$ , and taking the limit $r\downarrow 0$ independently of $x$ , we reach:

[TABLE]

i.e. exactly the $\rho(q)$ we wanted to build, with $x=m$, which has now become a real parameter in $[0,1]$. This type of distribution $\rho(q)$ (and by extension the corresponding ${\textbf{Q}}$ in eq. (41)) is called One-Step Replica Symmetry Breaking (1-RSB).
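The counting behind this construction can be checked directly at integer $r$. The sketch below (helper names are ours, not from the paper) builds an $r\times r$ matrix with diagonal blocks of size $x$, measures the fraction of off-diagonal entries equal to $q_{0}$ and $q_{1}$, and compares with the rational expressions $(r-x)/(r-1)$ and $(x-1)/(r-1)$, which we assume is the content of the elided counting formula. Evaluating these expressions at $r=0$ gives weights $x$ and $1-x$, consistent with the target $\rho(q)$ with $m=x$.

```python
from fractions import Fraction


def one_rsb_matrix(r, x, q0, q1):
    """r x r overlap matrix: 1 on the diagonal, q1 inside diagonal blocks of size x, q0 elsewhere."""
    assert r % x == 0, "the block size must divide r"
    return [[1 if a == b else (q1 if a // x == b // x else q0) for b in range(r)]
            for a in range(r)]


def offdiag_weights(Q, q0, q1):
    """Exact fractions of off-diagonal entries equal to q0 and to q1."""
    r = len(Q)
    pairs = [(a, b) for a in range(r) for b in range(r) if a != b]
    w0 = Fraction(sum(Q[a][b] == q0 for a, b in pairs), len(pairs))
    w1 = Fraction(sum(Q[a][b] == q1 for a, b in pairs), len(pairs))
    return w0, w1


def continued_weights(r, x):
    """The same two weights as rational functions of (r, x), usable for non-integer values."""
    return (r - x) / (r - 1), (x - 1) / (r - 1)
```

For instance, `continued_weights(0.0, 0.4)` returns weights close to `(0.4, 0.6)`, i.e. mass $m=0.4$ on $q_{0}$ and $1-m=0.6$ on $q_{1}$.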

General replica symmetry breaking – More generally, one can represent a distribution with a finite support of $(k+1)$ elements as $\rho(q)=\sum_{i=0}^{k}(m_{i}-m_{i-1})\delta(q-q_{i})$ , with weights $m_{0}\leq m_{1}\leq\cdots\leq m_{k-1}\leq m_{k}$ , using the conventions $m_{-1}=0,m_{k}=1$ . This distribution is called “ $k$ -step replica symmetry breaking” ( $k$ -RSB), and in this ansatz, the overlap matrix $\{Q_{ab}\}$ can be written as a hierarchical generalization of eq. (41) (with the convention $q_{-1}=0$ and $q_{k+1}=1$ ):

[TABLE]

with ${\textbf{J}}_{m}^{(r)}$ the block-diagonal matrix with $r/m$ blocks of size $m$ , each diagonal block being the all-ones matrix. Once again, the integers $\{m_{i}\}_{i=0}^{k}$ become elements of $[0,1]$ in the $r\downarrow 0$ limit. As in the replica-symmetric case discussed above, the limit $r\downarrow 0$ also turns the maximum over $\{m_{i},q_{i}\}$ into an infimum [MPV87]. In the end, the $k$ -RSB prediction for the free entropy is of the form:

[TABLE]

It is common to represent the right hand side as a function of a step function $q(x)$ for $x\in[0,1]$ , uniquely defined by $\{m_{i}\}$ and $\{q_{i}\}$ , see Fig. 3 (left, blue curve). We write then the argument of the RHS of eq. (43) as $\mathcal{P}[\{q(x)\};\alpha,\beta]$ .

This allows one to consider completely generic distributions $\rho(q)$ (or equivalently functions $q(x)$), by taking the $k\to\infty$ limit of eq. (43). This generic procedure is called “Full Replica Symmetry Breaking” (Full RSB), and was introduced by Parisi in [Par79]. It yields for the free entropy a formula of the type:

[TABLE]

Such formulas are usually called Parisi formulas in the spin glass literature. Note that in many disordered models, the overlap distribution $\rho(q)$ has been observed to have two points with positive mass, at the edges of its bulk (see Fig. 3, right). This leads one to characterize the function $q(x)$ generically as (see Fig. 3 right, red curve):

[TABLE]

This is purely a convention that often turns out to be convenient and does not remove any generality as one can always set $x_{m}=0$ and $x_{M}=1$ .

Relation between $\rho(q)$ and $q(x)$ – For an overlap distribution with a well-defined density $\rho(q)$ , one has the relation $\rho(q)=x^{\prime}(q)$ , with $x(q)\in[0,1]$ the CDF of the overlap, and $x\mapsto q(x)$ is then the functional inverse of $q\mapsto x(q)$ .
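For a $k$-RSB distribution with atoms $q_{i}$ and cumulative weights $m_{i}$, both the CDF $x(q)$ and its generalized inverse $q(x)$ are step functions. The sketch below (function names are ours) constructs them from the lists $\{m_{i}\}$ and $\{q_{i}\}$, and computes $\langle q\rangle=\int_{0}^{1}\mathrm{d}u\,q(u)$, which coincides with the mean of $\rho(q)$. We assume the convention $q(x)=q_{i}$ for $m_{i-1}<x\leq m_{i}$.

```python
import bisect


def make_step_functions(m, q):
    """m = [m_0, ..., m_k] non-decreasing with m_k = 1; q = [q_0, ..., q_k] non-decreasing."""

    def x_of_q(value):
        # CDF of rho: total weight of the atoms q_i <= value.
        j = bisect.bisect_right(q, value)
        return m[j - 1] if j > 0 else 0.0

    def q_of_x(x):
        # Generalized inverse of the CDF: q_i on the interval (m_{i-1}, m_i].
        i = bisect.bisect_left(m, x)
        return q[min(i, len(q) - 1)]

    return x_of_q, q_of_x


def mean_overlap(m, q):
    """<q> = sum_i (m_i - m_{i-1}) q_i, with the convention m_{-1} = 0."""
    previous, total = 0.0, 0.0
    for m_i, q_i in zip(m, q):
        total += (m_i - previous) * q_i
        previous = m_i
    return total
```

With `m = [0.3, 0.8, 1.0]` and `q = [0.1, 0.5, 0.9]` this describes a 2-RSB distribution with weights $0.3$, $0.5$ and $0.2$ on the three atoms.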

RSB and the form of the Gibbs measure – Interestingly, one can interpret the level of RSB as an assumption on the structure of the level sets of the Gibbs measure (or the global minima of the energy, when $\beta=\infty$). Roughly speaking, 1-RSB corresponds to an organization of the mass of the Gibbs measure into clusters. Inside each cluster two solutions typically have overlap $q_{1}$, while solutions belonging to two different clusters have a typical overlap $q_{0}$. This hierarchy can be iterated inside each cluster, which gives rise to the 2-RSB structure. Iterating even further, the level of RSB corresponds to the depth of this hierarchical structure of clusters, which is known as ultrametric [MPS+84, Pan13]. Ultrametricity and RSB provide a beautiful mathematical representation of the free energy landscape of spin glass models, which also allows one to design efficient algorithms [EAM20, EAMS21, Sub21, Mon21, AMS22].
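The ultrametric property can be made concrete on a small hierarchical matrix at integer $r$: for any three replicas $a,b,c$, one has $Q_{ab}\geq\min(Q_{ac},Q_{cb})$, so the two smallest overlaps in any triangle coincide. The following sketch (names and the nested-block parametrization are ours) builds such a matrix from nested block sizes and tests this inequality by brute force.

```python
from itertools import permutations


def hierarchical_overlaps(r, sizes, qs):
    """Overlap matrix with nested blocks.

    sizes: decreasing block sizes, each dividing the previous one and r.
    qs:    len(sizes) + 1 increasing overlaps; qs[d] is assigned to pairs of
           distinct replicas sharing exactly the d outermost block levels.
    """
    def overlap(a, b):
        if a == b:
            return 1.0
        depth = 0
        for s in sizes:
            if a // s != b // s:
                break
            depth += 1
        return qs[depth]

    return [[overlap(a, b) for b in range(r)] for a in range(r)]


def is_ultrametric(Q):
    """Check Q[a][b] >= min(Q[a][c], Q[c][b]) over all triples of distinct replicas."""
    r = len(Q)
    return all(Q[a][b] >= min(Q[a][c], Q[c][b])
               for a, b, c in permutations(range(r), 3))
```

A matrix with a single depth level (`sizes = [x]`) reproduces the 1-RSB structure of clusters described above, and two levels give the 2-RSB structure.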

A thorough description of all the consequences of replica symmetry breaking would be beyond our scope: the major reference on this topic is [MPV87], and we invite the reader to read as well [Tal10], and the very recent lecture notes [MS22b], for discussions in a more mathematically-friendly language.

2.5 One-step replica symmetry breaking

We start by generalizing the calculation we made in Section 2.3 to the more general one-RSB ansatz we described above. We give the results here, while the calculation is detailed in Appendix D. The final result is given as an infimum over three parameters $\{m,q_{0},q_{1}\}$ (see Fig. 3 for their interpretation):

[TABLE]

Note that when $q_{1}=q_{0}$ or when $m=1$ , the overlap distribution $\rho(q)$ reduces to a single delta peak, and we consistently retrieve the replica-symmetric solution of eq. (32).

The zero-temperature limit and the injectivity threshold – In Appendix D.2, we detail how to take the $\beta\to\infty$ limit in $\Phi_{\rm 1RSB}(\alpha,\beta)$ , and to obtain the function $f^{\star}_{\rm 1RSB}(\alpha)\coloneqq\lim_{\beta\to\infty}[-\Phi_{\rm 1RSB}(\alpha,\beta)/\beta]$ . In Appendix D.3 we present the numerical procedure we used to solve the resulting equations. We reach the light blue curve in Fig. 1 for $f^{\star}_{\rm 1RSB}(\alpha)$ , and in particular we have

[TABLE]

Validity of the 1-RSB assumption – While the 1-RSB ansatz is a natural extension of the previous replica-symmetric assumption, the results of [FPS+17] (which study the same model with a slightly different energy function) strongly suggest that for any $\alpha>2$, at low enough temperatures the system undergoes a continuous transition from an RS to a Full RSB phase, without any finite level of RSB at intermediate temperatures (footnote 11: in particular, a stability analysis of the 1-RSB ansatz, similar to what we did in Appendix B.2, would yield that it becomes unstable at the same temperature as the RS ansatz). This motivates us to compute the complete Full RSB picture in Section 3. Nevertheless, we will see that the 1-RSB prediction of eq. (46) is already very accurate.

3 The full-RSB solution: exact injectivity threshold

3.1 The full-RSB prediction for the free entropy

The full-RSB calculation is detailed in Appendix E, and quite closely follows a similar derivation presented in [FPS+17, Urb18].

Notations – Before stating the result, let us introduce some notation. For any $\sigma\geq 0$ , we let $\gamma_{\sigma^{2}}(h)=\exp\{-h^{2}/(2\sigma^{2})\}/\sqrt{2\pi\sigma^{2}}$ the PDF of $\mathcal{N}(0,\sigma^{2})$ . For two functions $a,b:\mathbb{R}\to\mathbb{R}$ , we denote $(a\star b)(h)=\int\mathrm{d}u\,a(u)b(h-u)$ their convolution. For a function $f(x,h)$ with $x\in[0,1]$ and $h\in\mathbb{R}$ , we always consider convolutions in the $h$ variable, e.g. the notation $\gamma_{\sigma^{2}}\star f(x,h)$ denotes the function $(\gamma_{\sigma^{2}}\star f)(x,h)=\int\mathrm{d}u\,\gamma_{\sigma^{2}}(u)f(x,h-u)$ . Moreover, we denote with a dot derivatives in the $x$ variable, and with a prime derivatives in the $h$ variable, e.g. $\dot{f}=\partial_{x}f$ and $f^{\prime\prime}=\partial^{2}_{h}f$ .
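To make the convolution notation concrete, the sketch below evaluates $(\gamma_{\sigma^{2}}\star f)(h)$ at a single point by a plain midpoint quadrature. This is only an illustration with names of our choosing; it is not the FFT-based scheme used for the actual numerics (Appendix F.3). Two checks are natural: the semigroup property $\gamma_{a}\star\gamma_{b}=\gamma_{a+b}$, and $(\gamma_{\sigma^{2}}\star f)(h)=h^{2}+\sigma^{2}$ for $f(h)=h^{2}$.

```python
import math


def gaussian_pdf(h: float, var: float) -> float:
    """Density of N(0, var) at h, i.e. gamma_var(h), for var > 0."""
    return math.exp(-h * h / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gaussian_convolve(f, var: float, h: float,
                      half_width: float = 10.0, steps: int = 4000) -> float:
    """(gamma_var * f)(h) = integral of gamma_var(u) f(h - u) du.

    Midpoint rule on |u| <= half_width * sqrt(var); the Gaussian weight makes
    the truncation error negligible when f grows at most polynomially.
    """
    cutoff = half_width * math.sqrt(var)
    du = 2.0 * cutoff / steps
    total = 0.0
    for i in range(steps):
        u = -cutoff + (i + 0.5) * du
        total += gaussian_pdf(u, var) * f(h - u)
    return total * du
```

For a function $f(x,h)$ of two variables, the same routine applies with $x$ held fixed, e.g. `gaussian_convolve(lambda u: f(x, u), var, h)`.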

Let us now state the results of the full-RSB calculation. We obtain the following formula for the free entropy:

[TABLE]

Here, we denoted $\langle q\rangle=\int_{0}^{1}\mathrm{d}u\,q(u)$ and we defined the auxiliary function:

[TABLE]

Moreover, $f(x,h)$ is taken to be the solution of the Parisi PDE:

[TABLE]

Similar equations were derived and analyzed in [FPS+17, Urb18]. These works followed a long series of important papers on the spherical perceptron and its connection to the packing of hard spheres [CKP+14, FPUZ15, RUYZ15]. Note that these works consider a shift $\sigma$ in the perceptron activation, so that here we are in the $\sigma=0$ setting of their results. Moreover, their energy function is slightly different from eq. (3), as it contains a multiplicative quadratic term.

The positive-temperature FRSB equations – In order to impose the Parisi PDE constraint on the function $f(x,h)$ in eq. (47), we use a functional Lagrange multiplier $\Lambda(x,h)$. This yields that the free entropy $\Phi_{\rm FRSB}(\alpha,\beta)$ is given by the extremization with respect to $q(x),\Lambda(x,h),f(x,h)$ of:

[TABLE]

Differentiating this functional with respect to $\Lambda(x,h)$ yields the Parisi PDE of eq. (49) (as it should), while differentiating with respect to $q(x)$ and $f(x,h)$ yields, respectively:

[TABLE]

Finally, differentiation w.r.t. $\beta$ yields the average energy:

[TABLE]

A sanity check: the RS solution – In the RS assumption, we have $q(x)=q_{0}$ for all $x$ . In particular, this implies that $\dot{q}(x)=0$ , and $q(0)=\langle q\rangle=q_{0}$ . Moreover, it is easy to see that in this case, since $\dot{q}(x)=0$ , we have $f(x,h)=f(1,h)=\log\gamma_{1-q_{0}}\star e^{-\beta\theta(h)}$ for all $x$ , and similarly $\Lambda(x,h)=\gamma_{q_{0}}(h)$ . Therefore, eq. (51a) becomes:

[TABLE]

One can check (the derivation is presented in Appendix E.3) that this equation is equivalent to eq. (33): we recover the RS solution!

3.2 Zero-temperature limit and algorithm for the injectivity threshold

The zero-temperature limit – In the zero-temperature limit, the scaling of the FRSB equations in the “UNSAT” phase of a slightly different spherical perceptron has been shown in [FPS+17] to be very similar to that of the SK model. We conjecture that this scaling remains the same in our model. More precisely, in the $\beta\to\infty$ limit, letting $\lambda(q)\coloneqq\lambda[x(q)]$ and $f(q,h)\coloneqq f(x(q),h)$, we assume:

[TABLE]

Moreover, $\Lambda(q,h)\coloneqq\Lambda(x(q),h)$ remains finite. In particular, since $x(q=1)=1$ by definition (see Fig. 3), we have that $x_{\infty}(q)$ now extends up to $+\infty$ . We define $q_{\infty}(x)$ as the inverse function to $x_{\infty}(q)$ , and then we can define all functions in terms of $x$ , e.g. $f_{\infty}(x,h)\coloneqq f_{\infty}(q_{\infty}(x),h)$ . In this limit, all eqs. (51a),(51b),(51c) scale very naturally, and the Parisi PDE of eq. (49) as well. The only non-trivial part is the boundary condition at $x=1$ , which becomes

[TABLE]

The scaling of the right hand-side can be worked out exactly:

[TABLE]

Similarly, we can work out the zero-temperature limit of eq. (52), and we get:

[TABLE]

Algorithmic procedure – In this paragraph, for the clarity of the presentation, all quantities are considered in the zero-temperature limit, and we drop the $\infty$ subscripts.

The procedure we use is relatively similar to the finite-temperature one described in Appendix B of [FPS+17], but is carried out at zero temperature, and at fixed $x$ rather than fixed $q$ (as we found this choice to be numerically more stable). In order to increase numerical precision, we rescale $h$ and use $t=h/\sqrt{2\chi}$, which allows us to handle small values of the susceptibility $\chi$. Our algorithmic procedure is as follows:

$\bullet$ Before starting – Pick $k$ large enough, $x_{\mathrm{max}}\gg 1$ large enough, and a grid $0<x_{0}<x_{1}<\cdots<x_{k-1}=x_{\mathrm{max}}<x_{k}=\infty$.

$\bullet$ Initialization – Start from a guess $\chi>0$ and $0\leq q_{0}\leq q_{1}\leq\cdots\leq q_{k-1}<q_{k}=1$.

$(i)$ Find the functions $f(x_{i},t)$ via the procedure:

[TABLE]

$(ii)$ Find $\Lambda(q_{i},t)$ via the procedure:

[TABLE]

$(iii)$ Compute $q^{-1}_{i}$ (the hierarchical elements of ${\textbf{Q}}^{-1}$, not $1/q_{i}$) using, for all $i\in\{0,\cdots,k\}$:

[TABLE]

$(iv)$ Update $\lambda_{i}=\lambda(q_{i})$ via

[TABLE]

$(v)$ Update $\{q_{i}\}_{i=0}^{k}$ with $q_{k}=1$ and

[TABLE]

$(vi)$ Update $\chi$ by solving the equation (with $q_{-1}=0$ and $x_{-1}=0$):

[TABLE]

$\bullet$ Iterate steps $(i)\to(vi)$ until convergence.

$\bullet$ Final value for the energy – We then compute the ground state energy as:

[TABLE]

The procedure is run with $k$ large enough that the result no longer varies with $k$ and approaches the $k\to\infty$ limit. Steps $(i)$ and $(ii)$ are a discretization of the zero-temperature limits of the PDEs of eqs. (49) and (51b), arising from the $k$-RSB ansatz (see Appendix E). We give more details on the derivation of steps $(iii)-(vi)$ in Appendix F.1, leveraging results of [FPS+17].

The different convolutions with Gaussians are done using an analytical formula for the Discrete Fourier Transform (DFT) of a Gaussian under a Shannon-Whittaker interpolation, and fast Fourier transform techniques. More details on this point are given in Appendix F.3.

Implementation and results – We present our results for $f^{\star}_{\rm FRSB}$ and the zero-temperature susceptibility $\chi$ in Fig. 1, and the zero-temperature overlap distribution function $q(x)$ for various values of $\alpha$ in Fig. 4.

In particular, the full-RSB prediction for the injectivity threshold is $\alpha_{\mathrm{inj}}^{\mathrm{FRSB}}\simeq 6.698$ . We ran a more precise binary search procedure for computing the value of this transition, which we detail in Appendix F.4. A summary of its result is presented in Fig. 5, and it yields the bound we conjecture in Result 1.1:

[TABLE]

Note that this bound is compatible with the hierarchy described in eq. (17). Moreover, the 1-RSB predictions are found to be very close (but not equal) to the exact FRSB results. This can be intuitively visualized by the fact that $q(x)$ is relatively well approximated by a step function, which corresponds to the 1-RSB ansatz, cf. Fig. 4. We emphasize however that the full-RSB algorithmic procedure above does not allow one to directly recover the 1-RSB result, even when used with $k=1$: indeed, it implicitly relies on taking $k$ large enough that one need not optimize over the variables $x_{1},\cdots,x_{k}$, which can then be kept fixed.

Remark: convexity of the Parisi functional – In the context of mixed $p$-spin models, the so-called Parisi functional, i.e. the functional whose infimum we take in eq. (47), has been shown to be strictly convex, and thus to have a unique minimizer [AC15]. This is conjectured to hold in our setting as well, although there is no rigorous guarantee that our iterative procedure converges to a global minimizer. Our numerical simulations are nevertheless compatible with this conjecture: as we detail in Appendix F.2, we find the iterative procedure to converge to a consistent solution for all initializing points. Moreover, our procedure exhibits polynomial convergence (see Fig. 6): this suggests that there is an accumulation of near-zero eigenvalues in the Hessian of the Parisi functional close to the minimum (otherwise we would observe exponential convergence), and thus that the Parisi functional is strictly but not strongly convex.
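The link between convergence rate and strong convexity can be illustrated on a scalar toy problem, which is only an analogy and not the iterative procedure of this section. Gradient descent on the strongly convex $f(x)=x^{2}/2$ contracts the error by a constant factor at each step (exponential convergence), whereas on the strictly but not strongly convex $f(x)=x^{4}/4$, whose second derivative vanishes at the minimum, the error decays only like $k^{-1/2}$ after $k$ steps.

```python
def gradient_descent(grad, x0: float, step: float, n_steps: int):
    """Plain gradient descent; returns the list of iterates x_0, ..., x_{n_steps}."""
    x = x0
    history = [x]
    for _ in range(n_steps):
        x = x - step * grad(x)
        history.append(x)
    return history


# f(x) = x^2 / 2: strongly convex, the error shrinks by a factor 0.9 per step.
quadratic = gradient_descent(lambda x: x, 1.0, 0.1, 200)

# f(x) = x^4 / 4: strictly but not strongly convex, the error decays polynomially.
quartic = gradient_descent(lambda x: x ** 3, 1.0, 0.1, 200)
```

After 200 steps the quadratic iterate is below $10^{-8}$, while the quartic iterate is still of order $10^{-1}$, the signature of a flat direction at the minimizer.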

Appendix A Proofs

A.1 Proof of Proposition 1.1

Note that if $m<n$, then ${\textbf{x}}\in\mathbb{R}^{n}\mapsto{\textbf{W}}{\textbf{x}}\in\mathbb{R}^{m}$ is not injective, $C_{m,n}=\mathbb{R}^{m}$, and eq. (2) holds trivially. We thus assume $m\geq n$. Since ${\textbf{x}}\in\mathbb{R}^{n}\mapsto{\textbf{W}}{\textbf{x}}\in\mathbb{R}^{m}$ is then a.s. injective, ${\textbf{W}}\mathbb{R}^{n}$ is a (random) $n$-dimensional subspace of $\mathbb{R}^{m}$. Moreover, by rotation invariance of the Gaussian distribution, it is uniformly sampled. Thus it is enough to show that $p_{m,n}=\mathbb{P}[({\textbf{W}}\mathbb{R}^{n})\cap C_{m,n}=\{0\}]$. In the end, it suffices to show the following lemma; elements of its proof can be found in [PKL+22, Pal21, Clu22], and we repeat them for completeness:

Lemma A.1.

Almost surely under the law of W, the following two statements are equivalent:

$(i)$ $\varphi_{\textbf{W}}$ is injective.

$(ii)$ $({\textbf{W}}\mathbb{R}^{n})\cap C_{m,n}=\{0\}$.

Proof of Lemma A.1 – In the following, we assume that the following event stands:

[TABLE]

It is easy to see that $\mathbb{P}[E({\textbf{W}})]=1$ , since every set of $n$ independent standard Gaussian vectors in $\mathbb{R}^{n}$ is linearly independent almost surely.

Let us show first that $(ii)\Rightarrow(i)$. Recall $\mathrm{ReLU}(x)=\max(0,x)$. Note that for any $a\leq b\in\mathbb{R}$, $\mathrm{ReLU}(a)=\mathrm{ReLU}(b)$ implies that ReLU is constant on $(a,b)$. Assume that $\varphi_{\textbf{W}}({\textbf{x}})=\varphi_{\textbf{W}}({\textbf{y}})$. Let us consider ${\textbf{z}}=({\textbf{x}}+{\textbf{y}})/2$, then $\varphi_{\textbf{W}}({\textbf{z}})=\varphi_{\textbf{W}}({\textbf{x}})$ by the note above. Moreover, for all $\mu\in[m]$ such that $({\textbf{W}}{\textbf{z}})_{\mu}>0$, we have $({\textbf{W}}{\textbf{x}})_{\mu}=({\textbf{W}}{\textbf{y}})_{\mu}=({\textbf{W}}{\textbf{z}})_{\mu}$. On the other hand, if $({\textbf{W}}{\textbf{z}})_{\mu}=0$, then necessarily $({\textbf{W}}{\textbf{x}})_{\mu}=({\textbf{W}}{\textbf{y}})_{\mu}=0$, since $({\textbf{W}}{\textbf{x}})_{\mu}\leq 0\Leftrightarrow({\textbf{W}}{\textbf{y}})_{\mu}\leq 0$. By $(ii)$, ${\textbf{W}}{\textbf{z}}$ has at least $n$ non-negative coordinates, so the argument above implies that there exist at least $n$ values of $\mu\in[m]$ such that $({\textbf{W}}{\textbf{x}})_{\mu}=({\textbf{W}}{\textbf{y}})_{\mu}$. Since $E({\textbf{W}})$ holds, this shows that ${\textbf{x}}={\textbf{y}}$.

Let us now show $(i)\Rightarrow(ii)$. We divide $\mathbb{R}^{n}$ into equivalence classes defined by the relation ${\textbf{x}}\sim{\textbf{y}}\Leftrightarrow\forall\mu\in[m],({\textbf{W}}{\textbf{x}})_{\mu}>0\Leftrightarrow({\textbf{W}}{\textbf{y}})_{\mu}>0$. These equivalence classes $\mathcal{R}_{S}$ are defined by a subset $S$ of $[m]$, so that for all ${\textbf{x}}\in\mathcal{R}_{S}$, $({\textbf{W}}{\textbf{x}})_{\mu}>0\Leftrightarrow\mu\in S$. Assume that there exists ${\textbf{x}}\in\mathbb{R}^{n}$ such that ${\textbf{W}}{\textbf{x}}\neq 0$ and ${\textbf{W}}{\textbf{x}}$ has strictly fewer than $n$ positive coordinates, i.e. ${\textbf{x}}\in\mathcal{R}_{S}$ with $|S|<n$. On $\mathcal{R}_{S}$, $\varphi_{\textbf{W}}$ is a linear transformation with $|S|<n$ linearly independent rows $\{{\textbf{W}}_{\mu}\}_{\mu\in S}$. Its image thus has dimension smaller than $n$. Therefore, the following result, which implies that $\mathcal{R}_{S}$ has dimension $n$, then implies that $\varphi_{\textbf{W}}$ is not injective on $\mathcal{R}_{S}$. Having proved the contrapositive, we can then infer $(i)\Rightarrow(ii)$. $\square$

Lemma A.2.

The following statement is true almost surely: for all $S\subseteq[m]$ , either $\mathcal{R}_{S}=\{0\}$ , or there exists ${\textbf{x}}\in\mathcal{R}_{S}$ and $\varepsilon>0$ such that $B_{2}({\textbf{x}},\varepsilon)\subseteq\mathcal{R}_{S}$ .

Proof of Lemma A.2 – By a union bound over all $S\subseteq[m]$, it suffices to show this statement a.s. for any fixed $S\subseteq[m]$. Let us assume that $\mathcal{R}_{S}\neq\{0\}$. The following statement implies the conclusion of Lemma A.2:

[TABLE]

Indeed, one can then a.s. find an element ${\textbf{x}}\in\mathcal{R}_{S}$ such that $({\textbf{W}}{\textbf{x}})_{\mu}>0$ for all $\mu\in S$ and $({\textbf{W}}{\textbf{x}})_{\mu}<0$ for all $\mu\notin S$ , therefore $B_{2}({\textbf{x}},\varepsilon)\subseteq\mathcal{R}_{S}$ for sufficiently small $\varepsilon$ . We now show eq. (64).

Assume that there exists ${\textbf{x}}\in\mathcal{R}_{S}\backslash\{0\}$, with $\nu_{1},\cdots,\nu_{k_{\textbf{x}}}\in[m]$ all indices such that ${\textbf{W}}_{\nu}\cdot{\textbf{x}}=0$, that satisfies $k_{\textbf{x}}\geq 1$. Note that by $E({\textbf{W}})$ (which holds a.s.) and since ${\textbf{x}}\neq 0$ we must have $k_{\textbf{x}}<n$. Thus, since $\{{\textbf{W}}_{\nu_{i}}\}_{i=1}^{k_{\textbf{x}}}$ are linearly independent on $E({\textbf{W}})$, we can then fix ${\textbf{y}}\in\big(\{{\textbf{W}}_{\nu_{i}}\}_{i=1}^{k_{\textbf{x}}-1}\big)^{\perp}$ such that ${\textbf{W}}_{\nu_{k_{\textbf{x}}}}\cdot{\textbf{y}}<0$. Consider ${\textbf{x}}^{\prime}={\textbf{x}}+\delta{\textbf{y}}$ with arbitrary $\delta>0$. By hypothesis, ${\textbf{W}}_{\nu_{i}}\cdot{\textbf{x}}^{\prime}=0$ for all $i\in[k_{\textbf{x}}-1]$. Moreover, for $\delta$ small enough, ${\textbf{W}}_{\mu}\cdot{\textbf{x}}^{\prime}$ has the same sign as ${\textbf{W}}_{\mu}\cdot{\textbf{x}}$ if $\mu\notin\{\nu_{1},\cdots,\nu_{k_{\textbf{x}}}\}$. Finally, ${\textbf{W}}_{\nu_{k_{\textbf{x}}}}\cdot{\textbf{x}}^{\prime}=\delta{\textbf{W}}_{\nu_{k_{\textbf{x}}}}\cdot{\textbf{y}}<0$. In the end, taking $\delta$ small enough, we have found ${\textbf{x}}^{\prime}\in\mathcal{R}_{S}$ with $k_{{\textbf{x}}^{\prime}}=k_{{\textbf{x}}}-1$. Iterating this procedure, we have shown that a.s. there exists a point ${\textbf{x}}\in\mathcal{R}_{S}$ such that $k_{\textbf{x}}=0$, which implies eq. (64). $\square$
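The criterion of Lemma A.1 can be tested exhaustively in dimension $n=2$. The number of positive coordinates of ${\textbf{W}}{\textbf{x}}$ depends only on the direction of ${\textbf{x}}$ and is piecewise constant on the circle, changing only at the angles where some row is orthogonal to ${\textbf{x}}$. The sketch below (names and the tolerance `eps` are our choices) evaluates this count at those critical angles and at the midpoints between them, and returns its minimum; for rows in general position, $\varphi_{\textbf{W}}$ is injective exactly when this minimum is at least $n=2$.

```python
import math


def relu_layer(W, x):
    """phi_W(x) = ReLU(W x) for a matrix W given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def positive_count(W, angle, eps=1e-12):
    """Number of rows with W_mu . x > 0, for x = (cos angle, sin angle)."""
    c, s = math.cos(angle), math.sin(angle)
    return sum(1 for a, b in W if a * c + b * s > eps)


def min_positive_count_2d(W):
    """Minimum over the unit circle of the number of positive coordinates of W x."""
    critical = []
    for a, b in W:
        t = math.atan2(a, -b) % (2.0 * math.pi)  # direction orthogonal to the row (a, b)
        critical += [t, (t + math.pi) % (2.0 * math.pi)]
    critical.sort()
    candidates = list(critical)
    following = critical[1:] + [critical[0] + 2.0 * math.pi]
    for t, t_next in zip(critical, following):
        candidates.append(0.5 * (t + t_next))  # one point inside each arc
    return min(positive_count(W, t) for t in candidates)


# Three rows: some direction sees a single positive coordinate, so injectivity fails.
W_bad = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]

# Five equally spaced unit rows: every direction sees at least two positive coordinates.
W_good = [(math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5)) for k in range(5)]
```

For `W_bad`, the inputs $(-1,-1)$ and $(-2,0)$ are an explicit collision: both are mapped to $(0,0,2)$.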

A.2 Proof of Lemma 1.2

Let us first recall Cover’s theorem [Cov65]. We use the $\mathrm{sign}(x)$ function, with the convention $\mathrm{sign}(0)=0$. We say that a set of vectors $\{{\textbf{W}}_{1},\cdots,{\textbf{W}}_{m}\}$ in $\mathbb{R}^{n}$ is in general position if it has no linearly dependent subset of size at most $n$. Cover’s theorem is an exact formula for the number of dichotomies (footnote 12: a dichotomy is a binary labeling of the vectors) of this set that are realizable by a linear separation:

Theorem A.3 (Cover [Cov65]).

Let ${\textbf{W}}_{1},\cdots,{\textbf{W}}_{m}\in\mathbb{R}^{n}$ be in general position. Then

[TABLE]

Let us now show that Theorem A.3 implies Lemma 1.2. We assume $\alpha<3$ , so in particular we can fix $\delta>0$ such that $m\leq(3-\delta)(n-1)$ for $n$ large enough. We denote $\tilde{m}=m-(n-1)\leq(2-\delta)n$ . Since W is a Gaussian matrix, the set $\{{\textbf{W}}_{\mu}\}_{\mu\in[\tilde{m}]}$ is a.s. in general position. Moreover, by sign invariance, for any ${\boldsymbol{\varepsilon}}\in\{\pm 1\}^{m}$ we have:

[TABLE]

For ${\boldsymbol{\varepsilon}}$ uniformly sampled in $\{\pm 1\}^{m}$ (independently of ${\textbf{W}}$), we denote $\mathbb{P}_{{\textbf{W}},{\boldsymbol{\varepsilon}}}$ the joint probability law of $({\textbf{W}},{\boldsymbol{\varepsilon}})$, and $\mathbb{P}_{\boldsymbol{\varepsilon}}$ the law of ${\boldsymbol{\varepsilon}}$. The previous remark on sign invariance allows us to deduce:

[TABLE]

by Theorem A.3. Since $\tilde{m}\leq(2-\delta)n$ , it is then elementary to check that this implies

[TABLE]

The proof is then finished by noticing that if ${\textbf{x}}$ satisfies $\mathrm{sign}({\textbf{W}}_{\mu}\cdot{\textbf{x}})=-1$ for all $\mu\in[\tilde{m}]$, it must satisfy $E_{\textbf{W}}({\textbf{x}})\leq m-\tilde{m}<n$, and using eq. (4).
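The counting argument can be explored numerically. The sketch below assumes the standard form of Cover's count for homogeneous linear separations, $2\sum_{k=0}^{n-1}\binom{m-1}{k}$ realizable dichotomies for $m$ vectors in general position in $\mathbb{R}^{n}$, which we take to be the formula of Theorem A.3; the function names are ours. The fraction of realizable dichotomies equals $1$ for $m\leq n$, is exactly $1/2$ at $m=2n$, and is close to $1$ when $m\leq(2-\delta)n$, which is the regime used in the proof above.

```python
from math import comb


def cover_count(m: int, n: int) -> int:
    """Dichotomies of m >= 1 points in general position in R^n that are
    realizable by a hyperplane through the origin (standard form of Cover's formula)."""
    return 2 * sum(comb(m - 1, k) for k in range(n))


def separable_fraction(m: int, n: int) -> float:
    """Probability that a uniformly random dichotomy of the m points is realizable."""
    return cover_count(m, n) / 2 ** m
```

For example, `separable_fraction(150, 100)` is within $10^{-3}$ of $1$, while `separable_fraction(20, 10)` is exactly `0.5`.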

A.3 Proof of Theorem 1.4

It is easy to see that if ${\textbf{W}},{\textbf{W}}^{\prime}$ are two matrices for which ${\textbf{W}}^{\prime}_{\nu}={\textbf{W}}_{\nu}$ for all $\nu\in[m]\backslash\{\mu\}$ , then

[TABLE]

The theorem is then a simple consequence of McDiarmid’s inequality (see e.g. Theorem 6.2 of [BLM13]).

A.4 Proof of Corollary 1.5

By a dominated convergence argument, we have:

[TABLE]

Therefore

[TABLE]

Recall the definition of the Gibbs measure $\mathbb{P}_{\beta,{\textbf{W}}}$ in eq. (5). It is easy to see that the previous equation relates directly to the entropy of $\mathbb{P}_{\beta,{\textbf{W}}}$ , i.e.

[TABLE]

In the language of statistical physics, this is a rewriting of the fact that the temperature derivative of the free energy is given by (minus) the entropy. In particular, for any $n$ , $\beta\mapsto-\mathbb{E}_{\textbf{W}}\Phi_{n}({\textbf{W}},\beta)/\beta$ is non-increasing, and in the limit this shows that $\beta\mapsto-\Phi(\alpha,\beta)/\beta$ is non-increasing. The positivity of this function follows from $\Phi_{n}({\textbf{W}},\beta)\leq 0$ , since $E_{\textbf{W}}({\textbf{x}})\geq 0$ .
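The monotonicity statement can be visualized on a toy model with finitely many states and a uniform reference measure, which stands in for the uniform measure on the sphere. In the sketch below (names and the toy energies are ours), $\Phi(\beta)=\log\mathbb{E}[e^{-\beta E}]$ and $\beta\mapsto-\Phi(\beta)/\beta$ decreases from the average energy at small $\beta$ to the minimal energy as $\beta\to\infty$.

```python
import math


def phi(beta: float, energies) -> float:
    """Free entropy log E[exp(-beta * E)] under the uniform measure on a finite
    toy state space, computed with a shift by min(E) for numerical stability."""
    shift = min(energies)
    total = sum(math.exp(-beta * (e - shift)) for e in energies)
    return -beta * shift + math.log(total / len(energies))


def minus_phi_over_beta(beta: float, energies) -> float:
    """The quantity -Phi(beta) / beta, for beta > 0."""
    return -phi(beta, energies) / beta


toy_energies = [3.0, 5.0, 5.0, 7.0, 9.0, 2.0]
betas = [0.1, 0.5, 1.0, 2.0, 5.0, 20.0, 100.0]
curve = [minus_phi_over_beta(b, toy_energies) for b in betas]
```

The list `curve` is non-increasing and stays between the minimal energy $2$ and the mean energy $31/6$, with the gap to the minimum closing like $\log(6)/\beta$ at large $\beta$.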

Let us now assume that there exists some $\beta<\infty$ such that $-\Phi(\alpha,\beta)<\beta$. In particular, there exists $\delta>0$ such that, for $n$ large enough, we have $\mathbb{E}_{{\textbf{W}}}\Phi_{n}({\textbf{W}},\beta)\geq-\beta+\delta$. Using Theorem 1.4, for large enough $n$, this implies

[TABLE]

for some $C(\alpha,\beta)>0$ . In particular, using eq. (9) and Proposition 1.1:

[TABLE]

The claim follows.

A.5 Proof of Theorem 1.8

Remark – In what follows we usually consider $m=\alpha n$ with $\alpha>0$ , and the proof can be straightforwardly generalized to the original assumption $m/n\to\alpha>0$ . For lightness of the presentation, we assume the simplified statement we described.

First, note that given Lemma 1.2, we can assume $\alpha\geq 3$ in what follows. Using Proposition 1.1, we want to characterize

[TABLE]

in which ${\textbf{W}}=\{{\textbf{W}}_{\mu}\}_{\mu=1}^{m}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\mathrm{I}_{n})$ . The minimum of this function is reached since it takes discrete values. Introducing an auxiliary variable $z_{\mu}\coloneqq{\textbf{W}}_{\mu}\cdot{\textbf{x}}$ , and a Lagrange multiplier ${\boldsymbol{\lambda}}\in\mathbb{R}^{m}$ to fix this relation, the problem is equivalent by strong duality to

[TABLE]

Note that the infimum over z in eq. (A.5) is actually done over ${\textbf{z}}\in{\textbf{W}}\mathcal{S}^{n-1}$ , since the supremum over ${\boldsymbol{\lambda}}$ becomes $+\infty$ for ${\textbf{z}}\neq{\textbf{W}}{\textbf{x}}$ . Letting $\|{\textbf{W}}\|_{\rm op}\coloneqq\max_{{\textbf{x}}\in\mathcal{S}^{n-1}}\|{\textbf{W}}{\textbf{x}}\|_{2}$ , we know by classical concentration inequalities (see e.g. [Ver18] - Theorem 4.4.5) and since $m=\alpha n$ , that $\mathbb{P}[\|{\textbf{W}}\|_{\mathrm{op}}\geq K\sqrt{n}]\leq e^{-n}$ , for some constant $K>1$ (that might depend on $\alpha$ ). Let us denote $B(K)\coloneqq\{{\textbf{z}}\in\mathbb{R}^{m}\,:\|{\textbf{z}}\|_{2}\leq K\sqrt{n}\}$ . By the argument above and the law of total probability, for all $t>0$ ,

[TABLE]

with

[TABLE]

Moreover, we will approximate $\mathds{1}\{z>0\}$ by continuous functions; we let, for any $\delta\geq 0$ :

[TABLE]

Since $\ell_{\delta}(x)\leq\mathds{1}\{x>0\}$ , it is clear that:

[TABLE]

We now make use of the Gaussian min-max theorem [Gor88, TOH15]:

Proposition A.4 (Gaussian min-max theorem).

Let ${\textbf{W}}\in\mathbb{R}^{m\times n}$ be an i.i.d. standard normal matrix, and ${\textbf{g}}\in\mathbb{R}^{m},{\textbf{h}}\in\mathbb{R}^{n}$ two independent vectors with i.i.d. $\mathcal{N}(0,1)$ coordinates. Let $\mathcal{S}_{\textbf{v}},\mathcal{S}_{\textbf{u}}$ be two compact subsets respectively of $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$ , and let $\psi:\mathcal{S}_{\textbf{v}}\times\mathcal{S}_{\textbf{u}}\to\mathbb{R}$ be a continuous function. We define the two optimization problems:

[TABLE]

Then, for all $t\in\mathbb{R}$ , one has

[TABLE]

Remark I – It is easy to see from the proof of [Gor88, TOH15] that the statement of the theorem also holds if W is a block matrix of the form

[TABLE]

with ${\textbf{W}}_{1}\in\mathbb{R}^{m_{1}\times n_{1}}$ having i.i.d. $\mathcal{N}(0,1)$ elements. Denoting ${\textbf{u}}^{\intercal}{\textbf{W}}{\textbf{v}}={\textbf{u}}_{1}^{\intercal}{\textbf{W}}_{1}{\textbf{v}}_{1}$ , the definition of the auxiliary problem that appears in the theorem is then modified as:

[TABLE]

for ${\textbf{g}}\sim\mathcal{N}(0,\mathrm{I}_{m_{1}})$ , ${\textbf{h}}\sim\mathcal{N}(0,\mathrm{I}_{n_{1}})$ .

Remark II – The full result of [TOH15] actually includes the proof of a converse bound to eq. (71) when the function $\psi$ is convex-concave, and the sets $\mathcal{S}_{\textbf{u}},\mathcal{S}_{\textbf{v}}$ are convex. Here, we do not expect such a converse bound to be true, since the solution is conjecturally described by the full-RSB equations, and we will see that the upper bound of eq. (71) corresponds to the replica-symmetric (RS) solution.

Let us first state a lemma that simplifies the auxiliary problem:

Lemma A.5 (Auxiliary problem simplification).

For any $\delta>0$ , and any $A\in(0,\infty]$ , we define the auxiliary optimization problem, for ${\textbf{g}}\in\mathbb{R}^{m},{\textbf{h}}\in\mathbb{R}^{n}$ :

[TABLE]

Then $A\mapsto\mathcal{C}_{A,\delta}({\textbf{g}},{\textbf{h}})$ is non-decreasing and one has:

[TABLE]

Note that we added a constraint over $\|{\boldsymbol{\lambda}}\|$ in the auxiliary problem, so that the set of ${\boldsymbol{\lambda}}$ considered is compact. This allows us to deduce, using Proposition A.4 (and Remark I following it) in eqs. (70) and (67):

Lemma A.6.

For all $\delta>0$ , and all $t\in\mathbb{R}$ , one has

[TABLE]

with $\mathcal{C}_{\delta}({\textbf{g}},{\textbf{h}})$ the RHS of eq. (74), and ${\textbf{g}},{\textbf{h}}$ vectors with i.i.d. $\mathcal{N}(0,1)$ coordinates.

Lemmas A.5 and A.6 are proven in Section A.6. We are now ready to prove Theorem 1.8. Note that by weak duality:

[TABLE]

Therefore $\mathbb{P}[\mathcal{C}_{\delta}({\textbf{g}},{\textbf{h}})\leq t]\leq\mathbb{P}[\mathcal{M}_{\delta}({\textbf{g}},{\textbf{h}})\leq t]$ . Moreover, by $\|{\textbf{z}}\|^{2}=\sum z_{\mu}^{2}$ , one has:

[TABLE]

Let us show

[TABLE]

We can assume $\|{\textbf{h}}\|^{2}/n\geq 1/2$, an event that has probability $1-o_{n}(1)$. Denoting $f(\kappa,{\textbf{g}},{\textbf{h}})$ the maximized function in eq. (75), we then have $f(\kappa,{\textbf{g}},{\textbf{h}})\leq 2\alpha-\kappa/2$ (footnote 13: indeed, $\inf_{z\in\mathbb{R}}\{\kappa z^{2}+\ell_{\delta}(x-z)\}\leq\ell_{\delta}(x)\leq 1$) and $f(0,{\textbf{g}},{\textbf{h}})\geq 0$, so that we can write $\mathcal{M}_{\delta}({\textbf{g}},{\textbf{h}})=\max_{0\leq\kappa\leq 4\alpha}f(\kappa,{\textbf{g}},{\textbf{h}})$. Letting

[TABLE]

we have then for all $\kappa\in[0,4\alpha]$ :

[TABLE]

Note that $\{X_{\mu}\}$ are i.i.d. random variables, and one shows easily that $X_{\mu}\in[0,1]$ , so by Hoeffding’s inequality, for all $t>0$ :

[TABLE]

Plugging it in eq. (77) and using the concentration of $\|{\textbf{h}}\|^{2}/n$ , we reach that for all $t>0$ :

[TABLE]

It is elementary to check that this implies $\max_{0\leq\kappa\leq 4\alpha}f(\kappa,{\textbf{g}},{\textbf{h}})\overset{\mathbb{P}}{\to}\max_{0\leq\kappa\leq 4\alpha}f_{\infty}(\kappa)$ , and therefore eq. (76). By Lemma A.6 we have then shown that for any $t,\delta>0$ ,

[TABLE]
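The concentration step above relies on Hoeffding's inequality for i.i.d. variables taking values in $[0,1]$. As a sanity check, the following minimal sketch compares the standard two-sided bound $2e^{-2mt^{2}}$ with an empirical deviation frequency. It assumes uniform variables on $[0,1]$ as a stand-in for the $X_{\mu}$ (whose actual law is not reproduced here), and the function names are illustrative.

```python
import math
import random

def hoeffding_bound(m, t):
    """Two-sided Hoeffding bound for the mean of m i.i.d. variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * t * t)

def empirical_deviation_frequency(m, t, trials, rng):
    """Fraction of trials where the empirical mean deviates from 1/2 by >= t.
    Uses Uniform[0, 1] samples as a stand-in for the bounded variables."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(m)) / m
        if abs(mean - 0.5) >= t:
            count += 1
    return count / trials

rng = random.Random(0)
m, t = 200, 0.1
freq = empirical_deviation_frequency(m, t, trials=2000, rng=rng)
bound = hoeffding_bound(m, t)
print(f"empirical frequency = {freq:.4f}, Hoeffding bound = {bound:.4f}")
```

The bound is typically far from tight for a fixed distribution, but it is uniform over all laws supported on $[0,1]$, which is all the argument needs.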

We will then conclude by considering the limit $\delta\to 0$ :

Lemma A.7.

We have $\lim_{\delta\to 0}\mathcal{M}_{\delta}=\mathcal{M}$ , with

[TABLE]

Moreover, the maximum in eq. (A.7) is reached in $\kappa^{\star}$ such that:

[TABLE]

And the limit is then given by:

[TABLE]

We recognize the replica-symmetric prediction of eq. (37), with $\kappa=(2\chi_{\rm RS})^{-1}$ ! By Lemma A.7 and eq. (78), we showed that $\mathcal{M}>t$ implies that $\mathbb{P}[G({\textbf{W}})>t]\to 1$ as $n\to\infty$ . Applying it for $t=1$ ends the proof of Theorem 1.8. ∎

A.6 Proof of Lemmas A.5, A.6 and A.7

Proof of Lemma A.5 – First note that in eq. (73), writing ${\boldsymbol{\lambda}}=\tau{\textbf{e}}$ with $\|{\textbf{e}}\|=1$ , one can perform the supremum over e:

[TABLE]

The maximum over $\tau\in[0,A]$ and the minimum over x can be carried out explicitly:

[TABLE]

Letting ${\textbf{z}}^{\prime}={\textbf{g}}-{\textbf{z}}$ , this yields:

[TABLE]

We now show that:

[TABLE]

which ends the proof. Notice that the LHS of eq. (80) is obviously a non-decreasing function of $A$ , so that it indeed has a limit (possibly $+\infty$ ). Moreover, we can restrict the infimum to $\|{\textbf{z}}\|\leq\|{\textbf{h}}\|+\alpha/A$ , since trivially for all $A$ one has (recall $\ell_{\delta}\leq 1$ ):

[TABLE]

in which we used in $(\rm a)$ and $(\rm b)$ that $\ell_{\delta}(x)\in[0,1]$ . We let $\varepsilon>0$ , and for all $A>0$ we fix $\tilde{{\textbf{z}}}^{(A)}\in\mathbb{R}^{m}$ with $\|\tilde{{\textbf{z}}}^{(A)}\|\in(\|{\textbf{h}}\|,\|{\textbf{h}}\|+\alpha/A]$ such that:

[TABLE]

Since $\|\tilde{{\textbf{z}}}^{(A)}\|\leq\|{\textbf{h}}\|+\alpha/A$ , we can extract a converging subsequence ${\textbf{z}}^{(k)}=\tilde{{\textbf{z}}}^{(A(k))}$ such that $\exists\lim_{k\to\infty}{\textbf{z}}^{(k)}=:{\textbf{z}}^{*}$ , and $\|{\textbf{h}}\|<\|{\textbf{z}}^{(k)}\|\leq\|{\textbf{h}}\|+\alpha/A(k)$ , with $A(k)\to\infty$ . Therefore $\|{\textbf{z}}^{*}\|=\|{\textbf{h}}\|$ . Moreover:

[TABLE]

Letting $\varepsilon>0$ be arbitrarily small, the claim of eq. (80) follows. $\square$

Proof of Lemma A.6 – Recall eq. (70). In particular, for any $A,\delta,K>0$ we have:

[TABLE]

By eq. (67) and eq. (81), we have:

[TABLE]

Using Proposition A.4 (since all sets are compact and functions involved are continuous, see in particular Remark I below it) we have, for all $t\in\mathbb{R}$ :

[TABLE]

in which $\mathcal{C}_{A,\delta,K}({\textbf{g}},{\textbf{h}})$ is defined as in eq. (73), restricting furthermore the infimum to ${\textbf{z}}\in B(K)$ . In particular, $\mathcal{C}_{A,\delta,K}({\textbf{g}},{\textbf{h}})\geq\mathcal{C}_{A,\delta}({\textbf{g}},{\textbf{h}})$ . Therefore by eq. (82):

[TABLE]

Note that $\mathbb{P}_{{\textbf{g}},{\textbf{h}}}[\mathcal{C}_{A,\delta}({\textbf{g}},{\textbf{h}})\leq t]=\mathbb{E}_{{\textbf{g}},{\textbf{h}}}[\mathds{1}\{\mathcal{C}_{A,\delta}({\textbf{g}},{\textbf{h}})\leq t\}]$ , and moreover by Lemma A.5 (we use here the fact that $A\mapsto\mathcal{C}_{A,\delta}({\textbf{g}},{\textbf{h}})$ is non-decreasing):

[TABLE]

Taking the $A\to\infty$ limit in eq. (83) and using the dominated convergence theorem ends the proof of Lemma A.6. $\square$

Proof of Lemma A.7 – For $\delta\geq 0$ , we define

[TABLE]

so that $\mathcal{M}_{\delta}=\sup_{\kappa\geq 0}f_{\delta}(\kappa)$ for $\delta>0$ , and $\mathcal{M}=\sup_{\kappa\geq 0}f_{0}(\kappa)$ . Notice first that eq. (A.7) follows from the following identity, that can be easily checked:

[TABLE]

Lemma A.7 will follow if we can show:

[TABLE]

Notice that for all $\delta>0$ and all $x\in\mathbb{R}$ , we have $\mathds{1}\{x>\delta\}\leq\ell_{\delta}(x)\leq\mathds{1}\{x>0\}$ . In particular,

[TABLE]

One computes easily the left and right sides of this inequality:

[TABLE]

Therefore we reach:

[TABLE]

which goes to $0$ as $\delta\to 0$ , uniformly in $\kappa$ . This ends the proof. $\square$

Appendix B Replica-symmetric supplementary calculations

B.1 Zero-temperature limit of the replica-symmetric solution

In this section, we derive eqs. (36) and (37). Our arguments will sometimes be informal, and a rigorous treatment would demand more care.

Recall that we have the expansion of eq. (35), with $\chi_{\rm RS}$ the zero-temperature susceptibility of the system. In this section, we often drop the ${\rm RS}$ subscript on quantities to lighten the notations. We use the expansion of $H(x)=\int_{x}^{\infty}\mathcal{D}u$ for large $x\gg 1$ :

[TABLE]
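As a quick numerical illustration of this large-$x$ regime, the sketch below compares $H(x)$ with the standard leading-order asymptotics $e^{-x^{2}/2}/(x\sqrt{2\pi})$ and its first correction factor $1-1/x^{2}$. This is the classical expansion of the Gaussian tail, assumed here to be the one meant by eq. (85); the helper names are illustrative.

```python
import math

def H(x):
    """Gaussian tail H(x) = int_x^infty Du, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def H_leading(x):
    """Leading-order large-x asymptotics exp(-x^2/2) / (x sqrt(2 pi))."""
    return math.exp(-0.5 * x * x) / (x * math.sqrt(2.0 * math.pi))

for x in (3.0, 5.0, 8.0):
    ratio = H(x) / H_leading(x)
    # The ratio should approach 1 - 1/x^2 up to a correction of order 1/x^4.
    print(f"x = {x}: H/H_leading = {ratio:.6f}, 1 - 1/x^2 = {1 - 1 / x**2:.6f}")
```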

Computation of $f^{\star}_{\rm RS}(\alpha)$ – We start by deriving eq. (37). As one can check from eq. (32) that $\Phi_{\rm RS}(\alpha,\beta)$ is a differentiable function of $\beta$ , by L’Hospital’s rule we have $f^{\star}_{\rm RS}(\alpha)=\lim_{\beta\to\infty}e^{\star}_{\rm RS}(\alpha,\beta)$ , with $e^{\star}_{\rm RS}(\alpha,\beta)\coloneqq-\partial_{\beta}\Phi(\alpha,\beta)$ . We have from eq. (35):

[TABLE]

We compute the limit of the integrand in eq. (34) (changing variables $\xi\to-\xi$ ):

[TABLE]

We separate three cases, and use the expansion of eq. (85) to reach that at leading order in $\beta$ :

[TABLE]

Using the pointwise limit above, we reach (as we mentioned above, a more careful argument would need to be carried out to make this expansion rigorous)

[TABLE]

In the end, we reach eq. (37):

[TABLE]

Computing $\chi$ – There now remains to find $\chi$ as a function of $\alpha$ , from eq. (33). Plugging in the expansion of eq. (35) we find (changing $\xi\to-\xi$ ):

[TABLE]

In the same way as in eq. (86), we can show:

[TABLE]

Therefore, we reach from eq. (87) that, as $\beta\to\infty$ :

[TABLE]

which is eq. (36).

B.2 Stability of the replica-symmetric solution

In this section we follow Appendix 4 of [EVdB01] (see also e.g. [Urb18]) to characterize the stability of the RS solution in replica space. This gives rise to the so-called de Almeida-Thouless condition [dAT78, GD88], a stability criterion expressed in terms of the so-called replicon eigenvalues.

We start again from the general expression of eq. (27): $\Phi(\alpha,\beta;r)=\sup_{{\textbf{Q}}}G_{r}({\textbf{Q}})$ , with

[TABLE]

In what follows, we compute the Hessian of $G_{r}({\textbf{Q}})$ taken at the replica-symmetric point.

B.2.1 The derivatives of $G_{1,r}$

The derivatives of $G_{1,r}({\textbf{Q}})$ can be worked out in terms of the matrix elements of ${\textbf{Q}}^{-1}$ (here $a<b$ and $c<d$ ):

[TABLE]

Recall that at the replica symmetric point with $Q_{ab}=q$ and $Q_{aa}=1$ we have

[TABLE]

Therefore (taking the notations of [EVdB01]):

[TABLE]

in which $P_{1},Q_{1},R_{1}$ are defined as:

[TABLE]

We now take the limit $r\downarrow 0$ . With an abuse of notation, we still denote the limits $P_{1},Q_{1},R_{1}$ :

[TABLE]

B.2.2 The derivatives of $G_{2,r}$

We now turn to $G_{2,r}({\textbf{Q}})$ , that we rewrite using a Gaussian transformation:

[TABLE]

This form is more suitable for computing the Hessian with respect to Q. In order to write the formulas compactly, we introduce the following average for any function of $\{v^{a}\}$ :

[TABLE]

With this definition, we have from eq. (93):

[TABLE]

in which $a<b$ and $c<d$ . One can easily see that this Hessian has the same “replica-symmetric” structure as the one of $G_{1,r}$ :

[TABLE]

We compute these three terms separately in the limit $r\to 0$ . In order to simplify the results, we introduce the notation $\mathbb{E}\langle g(v)\rangle$ , with $\mathbb{E}$ the expectation over $\xi\sim\mathcal{N}(0,1)$ , and

[TABLE]

From this definition and eq. (94), one can check (using the same trick to decouple the replicas we used in the RS calculation, cf. Section 2.3) that we have, as $r\to 0$ :

[TABLE]

B.2.3 de Almeida-Thouless condition for replica-symmetric stability

Classical replica studies [EVdB01] show that for a Hessian having the form of eqs. (91) or (95), the linear stability of the RS local maximum is given by the sign of the “replicon” eigenvalue $P-2Q+R$ . More precisely, the AT condition for the stability of the RS solution in replica space reads here:

[TABLE]

By eqs. (92) and (96) we get:

[TABLE]

In order to make eq. (97) more explicit, we compute the right-hand side using the identity $\langle v^{2}\rangle-\langle v\rangle^{2}=-q^{-1}\partial^{2}_{\xi}\log\mathcal{Z}(\xi)$ , with

[TABLE]

This integral is easy to work out:

[TABLE]

Let us define $f_{\beta}(h)\coloneqq\log(1-(1-e^{-\beta})H[-h/\sqrt{1-q}])$ , so that $\log\mathcal{Z}(\xi)=f_{\beta}(\sqrt{q}\xi)$ . Then $\langle v^{2}\rangle-\langle v\rangle^{2}=-f_{\beta}^{\prime\prime}(\sqrt{q}\xi)$ . The AT condition for the stability of the replica-symmetric solution is then expressed easily as a function of $(\alpha,q)$ at any $\beta\geq 0$ as

[TABLE]
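The replicon eigenvalue $P-2Q+R$ invoked at the start of this subsection can be checked directly on a small example. The sketch below assumes the usual replica-symmetric Hessian structure over pairs $a<b$ (entry $P$ for identical pairs, $Q$ for pairs sharing exactly one replica index, $R$ for disjoint pairs), and verifies that a vector satisfying $\sum_{b}v_{ab}=0$ for all $a$ is an eigenvector with that eigenvalue. The numerical values of $r,P,Q,R$ are arbitrary and purely illustrative.

```python
from itertools import combinations

def rs_hessian(r, P, Q, R):
    """Matrix indexed by pairs a<b: P on the diagonal, Q if the two pairs
    share exactly one replica index, R if they are disjoint."""
    pairs = list(combinations(range(r), 2))
    M = []
    for p in pairs:
        row = []
        for s in pairs:
            shared = len(set(p) & set(s))
            row.append(P if shared == 2 else Q if shared == 1 else R)
        M.append(row)
    return pairs, M

def replicon_vector(pairs):
    """A vector with sum_b v_ab = 0 for every replica a (requires r >= 4)."""
    values = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): -1.0, (1, 3): -1.0}
    return [values.get(p, 0.0) for p in pairs]

r, P, Q, R = 6, 2.0, 0.7, 0.3  # arbitrary illustrative values
pairs, M = rs_hessian(r, P, Q, R)
v = replicon_vector(pairs)
Mv = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
lam = P - 2 * Q + R
is_replicon = all(abs(Mv[i] - lam * v[i]) < 1e-12 for i in range(len(v)))
print(is_replicon)
```

The check works because, for such a vector, the pairs sharing one index with $(ab)$ contribute $-2v_{ab}$ and the disjoint pairs contribute $+v_{ab}$.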

B.2.4 The $\beta\to\infty$ limit

We now take the limit $\beta\to\infty$ in eq. (98), introducing the zero-temperature susceptibility $\chi_{\rm RS}=\chi$ (cf. eq. (35)). Using the same expansions as in eqs. (86) and (88) we have as $\beta\to\infty$ :

[TABLE]

Therefore, we have at large $\beta$ , that $f_{\beta}^{\prime\prime}(h)\simeq-\beta\chi^{-1}\mathds{1}\{h\in(0,\sqrt{2\chi})\}$ . Since $(1-q)^{2}\simeq\chi^{2}/\beta^{2}$ , the RS stability condition (98) becomes, in the $\beta\to\infty$ limit:

[TABLE]

However, recall that in the zero-temperature limit, the RS susceptibility $\chi$ is given by the solution to eq. (36):

[TABLE]

which can be turned easily by integration by parts into:

[TABLE]

in which the inequality holds throughout the “UNSAT” phase $\alpha>2$ for which $\chi<\infty$ . Therefore, eq. (100) is never satisfied for any $\alpha>2$ : at zero temperature, the replica-symmetric solution is never linearly stable!
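The large-$\beta$ behavior of $f_{\beta}$ used in this subsection can be probed numerically. The sketch below assumes the scaling $1-q=\chi/\beta$ and compares $f_{\beta}(h)/\beta$ with the piecewise limit $-\min(1,h^{2}/(2\chi))\,\mathds{1}\{h>0\}$, which is the form consistent with $f_{\beta}^{\prime\prime}(h)\simeq-\beta\chi^{-1}\mathds{1}\{h\in(0,\sqrt{2\chi})\}$. The values of $\beta$ and $\chi$ are arbitrary, and the limit function is written here as an assumption rather than copied from eq. (99).

```python
import math

def H(x):
    """Gaussian tail H(x) = int_x^infty Du."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def f_beta(h, beta, chi):
    """log(1 - (1 - e^{-beta}) H[-h / sqrt(1 - q)]) with 1 - q = chi / beta.
    Uses H(-y) = 1 - H(y) to rewrite the argument of the log as
    e^{-beta} + (1 - e^{-beta}) H(h / sqrt(1 - q)), which avoids cancellation."""
    sigma = math.sqrt(chi / beta)
    eb = math.exp(-beta)
    return math.log(eb + (1.0 - eb) * H(h / sigma))

def f_limit(h, chi):
    """Assumed pointwise limit of f_beta(h) / beta as beta -> infinity."""
    return -min(1.0, h * h / (2.0 * chi)) if h > 0 else 0.0

def second_derivative(func, h, eps=1e-2):
    """Centered finite-difference estimate of func''(h)."""
    return (func(h + eps) - 2.0 * func(h) + func(h - eps)) / (eps * eps)

beta, chi = 400.0, 1.0  # arbitrary illustrative values
for h in (-1.0, 0.5, 1.0, 2.0):
    print(h, f_beta(h, beta, chi) / beta, f_limit(h, chi))
d2 = second_derivative(lambda h: f_beta(h, beta, chi), 0.7)
print("f''/beta at h = 0.7:", d2 / beta, "expected about", -1.0 / chi)
```

The residual differences are of order $\log\beta/\beta$, as expected from the subleading terms in the expansion of $H$.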

Appendix C A replica-symmetric lower bound

High-dimensional concentration on energy level sets – Let us first describe physical reasons (at a heuristic level) for the concentration of the energy under the Gibbs measure of eq. (5), for any $\beta\geq 0$ . As we mentioned in Section 1, proving this property is highly non-trivial. While we did not need to assume this concentration to hold in the rest of the paper, it will be important in this part to describe the derivation of a replica-symmetric lower bound for the injectivity threshold.

At fixed W, the distribution of intensive energies is described by a probability density $P_{\beta}(e)$ given by:

[TABLE]

One can show (using properties of the uniform measure $\mu_{n}$ and the fact that the energy is extensive) that the “entropic” term on the right scales exponentially with $n$ , that is that for any $e\in[0,\alpha]$ , one has a well-defined $F(e)\coloneqq\lim_{n\to\infty}(1/n)\log\int\mu_{n}(\mathrm{d}{\textbf{x}})\,\delta(E_{\textbf{W}}({\textbf{x}})/n-e)\in[-\infty,0]$ . In mathematical terms, for ${\textbf{x}}\sim\mu_{n}$ , $E_{\textbf{W}}({\textbf{x}})/n$ satisfies a large deviation principle in the scale $n$ , with rate function $-F(e)$ . Therefore, by eq. (101), $P_{\beta}(e)$ has large deviations in the scale $n$ around a value $e^{\star}(\beta)$ , i.e. we have the following behavior:

[TABLE]

In particular, the probability (under the Gibbs measure) of having a configuration with energy $e$ such that $|e-e^{\star}(\beta)|>\varepsilon$ is exponentially small in $n$ for any $\varepsilon>0$ . Therefore, we expect that at any $\beta\geq 0$ , all the mass of the Gibbs measure concentrates (as $n\to\infty$ ) around the level set with intensive energy $e^{\star}(\beta)$ , given thus also by the mean energy under the Gibbs measure [Ell06]. Note that we discarded the dependency of $e^{\star}$ on W: the concentration with respect to W can be justified (but not proven!) using the concentration of $\Phi_{n}({\textbf{W}},\beta)$ in Theorem 1.4. Indeed, note that the average energy under the Gibbs measure is precisely given by a derivative of the free energy:

[TABLE]

Therefore, one expects that the concentration of the free energy transfers to the derivatives, and thus that the energy level also concentrates as a function of W. Summing up, the function $e^{\star}(\alpha,\beta)$ (we make its dependency on $\alpha$ and $\beta$ explicit) is – conjecturally – equal to the following limit:

[TABLE]

in which the limit is again in probability over the randomness induced by W. Furthermore, in the limit $\beta\to\infty$ – as we argued in the main text – we expect the Gibbs measure to concentrate its mass around the global minima of $E_{\textbf{W}}$ , and therefore that

[TABLE]

The lower bound – By eq. (4) and eq. (103), $e^{\star}(\alpha,\beta=\infty)\geq 1\Leftrightarrow\alpha\geq\alpha_{\mathrm{inj}}$ (assuming that $\alpha\mapsto e^{\star}(\alpha,\beta=\infty)$ is continuous and strictly increasing, which we always observe, see Fig. 1). Since $e^{\star}(\alpha,\beta)$ is a decreasing function of $\beta$ and $e^{\star}(\alpha,\beta=0)=\alpha/2$ , this implies that for any $\alpha\in(2,\alpha_{\mathrm{inj}}]$ there exists $\beta^{\star}(\alpha)\in[0,\infty]$ such that

[TABLE]
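The infinite-temperature value $e^{\star}(\alpha,\beta=0)=\alpha/2$ is easy to observe in simulation. The sketch below assumes that $E_{\textbf{W}}({\textbf{x}})$ counts the positive coordinates of ${\textbf{W}}{\textbf{x}}$ and that at $\beta=0$ the Gibbs measure is uniform on the sphere; the sizes $n$, the number of trials, and the function names are illustrative choices.

```python
import math
import random

def intensive_energy(W, x):
    """E_W(x)/n, assuming E_W(x) counts the positive coordinates of W x."""
    n = len(x)
    positives = sum(1 for row in W if sum(w * xi for w, xi in zip(row, x)) > 0)
    return positives / n

def sample_energy(n, alpha, rng):
    """Draw a Gaussian m x n matrix (m = alpha n) and a uniform point on the sphere."""
    m = int(alpha * n)
    W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    x = [gi / norm for gi in g]  # normalized Gaussian vector: uniform on the unit sphere
    return intensive_energy(W, x)

rng = random.Random(1)
n, alpha, trials = 50, 3.0, 40
mean_e = sum(sample_energy(n, alpha, rng) for _ in range(trials)) / trials
print(f"mean energy / n = {mean_e:.3f}, alpha / 2 = {alpha / 2:.3f}")
```

Each coordinate of ${\textbf{W}}{\textbf{x}}$ is a centered Gaussian, hence positive with probability $1/2$, which gives $m/2$ positive coordinates on average.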

Moreover, it is easy to see that $\beta^{\star}(\alpha)$ is a non-decreasing function of $\alpha$ . In particular, if for all $\beta<\beta^{\star}(\alpha)$ the RS solution is stable (in the sense of the dAT condition described above, and derived in Appendix B.2), then the replica-symmetric ansatz will yield the exact solution for all $\beta\in[0,\beta^{\star}(\alpha))$ . We denote $\alpha_{\rm dAT}$ the largest such $\alpha$ :

[TABLE]

Recall that $\alpha_{\mathrm{inj}}$ is, according to our criterion, equal to:

[TABLE]

However, we know that for $\beta\to\infty$ the RS solution is never stable for $\alpha>2$ (see Appendix B.2), so for all $\alpha>2$ , if $\beta^{\star}(\alpha)=\infty$ then $\alpha>\alpha_{\rm dAT}$ . In particular, we get the lower bound $\alpha_{\rm dAT}\leq\alpha_{\mathrm{inj}}$ . On the other hand, an important property of $\alpha_{\rm dAT}$ is that it can be computed solely from the RS solution (and its stability analysis). Recall that the stability condition for the RS solution is given by eq. (98), in which $q$ is the overlap given by the RS calculation, i.e. by eq. (33). A numerical evaluation of eq. (98), available in the attached code [Mai23], yields $\alpha_{\rm dAT}\simeq 5.3238$ , which implies the lower bound presented in eq. (39).

Appendix D One-step replica symmetry breaking

D.1 Derivation of the 1-RSB free entropy

We perform here, for completeness of our presentation, the textbook calculation of the spherical perceptron free entropy at the one-RSB level. We start again from eq. (27), which we rewrite using a Gaussian transformation as:

[TABLE]

We assume a 1RSB ansatz given in eq. (41), with $q_{1}>q_{0}$ , and $m\in\{1,\cdots,r\}$ with $m\,|\,r$ the Parisi parameter (i.e. the size of the diagonal blocks in the ultrametric Q). More precisely we have, with $k\coloneqq r/m$ :

[TABLE]

The entropic contribution – We focus on the first term of eq. (104). It is elementary algebra to check that under the ansatz of eq. (105), the spectrum of Q is:

[TABLE]

In particular, this yields:

[TABLE]

And thus:

[TABLE]
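The spectrum used in this entropic computation can be verified on a small instance. The sketch below assumes the standard reading of the 1RSB ansatz (ones on the diagonal, $q_{1}$ inside diagonal blocks of size $m$, $q_{0}$ elsewhere), for which the eigenvalues are $1-q_{1}+m(q_{1}-q_{0})+rq_{0}$ (multiplicity $1$), $1-q_{1}+m(q_{1}-q_{0})$ (multiplicity $r/m-1$), and $1-q_{1}$ (multiplicity $r-r/m$). It checks one representative eigenvector of each family; the parameter values are arbitrary.

```python
def one_rsb_matrix(r, m, q0, q1):
    """r x r matrix: 1 on the diagonal, q1 inside diagonal blocks of size m,
    q0 outside the blocks (m must divide r)."""
    assert r % m == 0
    return [[1.0 if a == b else q1 if a // m == b // m else q0
             for b in range(r)] for a in range(r)]

def matvec(Q, v):
    return [sum(qab * vb for qab, vb in zip(row, v)) for row in Q]

def is_eigenpair(Q, v, lam, tol=1e-12):
    """Check Q v = lam v entrywise."""
    return all(abs(y - lam * x) < tol for y, x in zip(matvec(Q, v), v))

r, m, q0, q1 = 12, 3, 0.2, 0.6  # arbitrary illustrative values
Q = one_rsb_matrix(r, m, q0, q1)

# (i) constant vector
v_all, lam_all = [1.0] * r, 1 - q1 + m * (q1 - q0) + r * q0
# (ii) constant on blocks, summing to zero across blocks
v_blocks, lam_blocks = [1.0] * m + [-1.0] * m + [0.0] * (r - 2 * m), 1 - q1 + m * (q1 - q0)
# (iii) summing to zero inside a single block
v_inner, lam_inner = [1.0, -1.0] + [0.0] * (r - 2), 1 - q1

checks = [is_eigenpair(Q, v_all, lam_all),
          is_eigenpair(Q, v_blocks, lam_blocks),
          is_eigenpair(Q, v_inner, lam_inner)]
print(checks)
```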

The interaction contribution – We focus now on the second term $\alpha G_{2,r}({\textbf{Q}})$ in eq. (104), with:

[TABLE]

Under the 1-RSB ansatz, it becomes:

[TABLE]

Introducing Gaussian transformations based on the formula $e^{-x^{2}/2}=\int\mathcal{D}z\,e^{-izx}$ to decouple the replicas, we obtain:

[TABLE]

Using this Gaussian transformation trick, we were able to decouple replicas and therefore obtain an expression that is analytic in $r$ . This allows us to take the $r\downarrow 0$ limit (keeping $m$ fixed), and to reach:

[TABLE]

Performing the Gaussian integrals, and recalling the definition of $H(x)\coloneqq\int_{x}^{\infty}\mathcal{D}u$ , we reach:

[TABLE]

Combining eq. (106) and eq. (107), we reach eq. (2.5).

D.2 Zero-temperature limit

Recall that in the $\beta\to\infty$ limit we have the scaling (see e.g. [FPS+17])

[TABLE]

while $q_{0}$ has a limit in $(0,1)$ as $\beta\to\infty$ . In the rest of this section, we write $\chi_{\rm 1RSB}=\chi$ to lighten the notation. The asymptotics of the determinant term in eq. (2.5) can be worked out:

[TABLE]

The limit of the other term in eq. (2.5) can also be computed, using that:

[TABLE]

with $u\coloneqq\sqrt{q_{0}}\xi_{0}+\sqrt{1-q_{0}}\xi_{1}$ , and $f_{\beta}(h)\coloneqq\log(1-(1-e^{-\beta})H[-h/\sqrt{1-q_{1}}])$ . We described the expansion of $f_{\beta}(h)/\beta$ for large $\beta$ in eq. (99) (simply replacing $\chi_{\rm RS}$ by $\chi_{\rm 1RSB}$ ). We reach then:

[TABLE]

Anticipating on what follows, we introduce the auxiliary functions (in which $u=u(\xi_{0},\xi_{1})\coloneqq\sqrt{q_{0}}\xi_{0}+\sqrt{1-q_{0}}\xi_{1}$ ):

[TABLE]

By integration by parts, all these functions can be expressed in terms of elementary functions and $H(x)\coloneqq\int_{x}^{\infty}\mathcal{D}\xi$ . Moreover, note that we have the identities:

[TABLE]

Eqs. (111b) and (111c) can be obtained directly from the definition of eq. (110). For eq. (111a), we found it more convenient to differentiate the finite- $\beta$ integral one can write for $n(\xi_{0})$ using eq. (109), and then take its large $\beta$ limit. We leave the derivation of these equations to the reader. Using the expansion of eq. (109), we obtain the limit of the second term of eq. (2.5):

[TABLE]

In the end, we have computed the limit of the free energy at the 1-RSB level, i.e. $f^{\star}_{\rm 1RSB}(\alpha)\coloneqq-\lim_{\beta\to\infty}\Phi_{\rm 1RSB}(\alpha,\beta)/\beta$ :

[TABLE]

in which one must implicitly maximize over $(c_{m},q_{0},\chi)$ . Note that an equivalent expression can be obtained using the limit of the average energy, since $e^{\star}_{\rm 1RSB}(\alpha,\beta=\infty)=\lim_{\beta\to\infty}[-\partial_{\beta}\Phi_{\rm 1RSB}(\alpha,\beta)]=f^{\star}_{\rm 1RSB}(\alpha)$ . Performing expansions in a similar way to the RS computations described in Appendix B.1, we reach:

[TABLE]

Let us emphasize that eq. (113) is an identity involving the parameters $(c_{m},q_{0},\chi)$ , which have to be found by maximizing eq. (112).

D.3 Numerical procedure

Let us summarize here the equations that allow one to find the $1$ -RSB prediction for the injectivity threshold, using the set of auxiliary functions of eq. (110). One simply differentiates the limit of the free energy functional given in eq. (112) with respect to $(q_{0},\chi,c_{m})$ , using eq. (111c). More precisely, at a given value of $\alpha>2$ , one must find $q_{0}\in(0,1)$ and $\chi,c_{m}>0$ satisfying the following set of three equations:

[TABLE]

Once one has found the solution to eq. (114), we can obtain the large- $\beta$ limit of the energy either from eq. (112) or eq. (113).

Following the statistical physics folklore, in order to implement an iterative scheme to solve eq. (114), we use auxiliary variables. Namely, we iterate the first two equations of eq. (114) as:

[TABLE]

in which we added a time subscript for the auxiliary functions to highlight their dependency on $q_{0}^{t},\chi^{t},c_{m}^{t}$ . Moreover, the functions $F_{1},F_{2}$ are defined as the only roots (in $q_{0},\chi$ ) of the equations

[TABLE]

such that $q_{0}\in(0,1)$ and $\chi\geq 0$ . Note that this implies that

[TABLE]

Therefore, in order for the solution to exist we must have $A_{0}<A_{1}$ , and then the solution satisfies $q_{0}>A_{0}/A_{1}$ . The remaining equation on $q_{0}$ can be written as:

[TABLE]

We solve eq. (117) on $q_{0}$ with a polynomial equations solver, and consider the unique solution in $(0,1)$ such that the corresponding $\chi$ in eq. (116) satisfies $\chi\geq 0$ , i.e. such that $q_{0}>A_{0}/A_{1}$ .

At a given iteration $t$ , we iterate eq. (115) for the value $c_{m}=c_{m}^{t}$ . We then do a binary search to solve the last equation of eq. (114) and find $c_{m}^{t+1}$ . We found this procedure to converge very quickly (see the attached code [Mai23]), and it yields the 1RSB curves in Fig. 1 and the prediction of eq. (46).

Appendix E Details of the FRSB computation

In this section we derive the full RSB conjecture for the free entropy $\Phi(\alpha,\beta)$ . Our computation is extremely close to the one of [FPS+17], and we refer to this work (and the lecture notes [Urb18]) for more details on the technicalities of the derivation.

Recall the form of the $r$ -th moment of the partition function, written as a function of the overlap matrix (without any assumption on the form of the saddle point), that is eq. (27). Note that by using the Gaussian integration formula, we can rewrite $I_{\beta}({\textbf{Q}})$ so as to obtain:

[TABLE]

Let us now perform the replica method under the full-RSB ansatz described in Fig. 3.

E.1 Entropic contribution

We start with the first “entropic” term in eq. (118). Its expression under a full-RSB ansatz is given in eq. (23) of [FPS+17], itself taken from Appendix A.II of [MP91]. However, the derivation is itself very interesting and will be useful for the other term in eq. (118), so we first detail it here.

Derivation for $r>0$ – We focus on the entropic term, which we may write as:

[TABLE]

We fix a $k$ -RSB ansatz, cf. Fig. 3, and we will take in the end the limit $k\to\infty$ . We have in the $r\downarrow 0$ limit $m_{-1}\coloneqq r\leq m_{0}\leq m_{1}\leq\cdots\leq m_{k-1}\leq m_{k}=1$ , and the parameters $q_{0}\leq q_{1}\leq\cdots\leq q_{k}\leq q_{k+1}=1$ . Recall that in this ansatz, the hierarchical overlap matrix $\{Q_{ab}\}$ can be written as:

[TABLE]

with $J_{m}^{(r)}$ the block-diagonal matrix with $r/m$ blocks of size $m$ , each diagonal block being the all-ones matrix. In order to compute the integral of eq. (119), we use a simple yet very powerful identity introduced in [Dup81], and valid for any matrix (not necessarily a hierarchical RSB matrix) $\{Q_{ab}\}$ :

[TABLE]

This identity can be shown by Taylor-expanding the exponential involving the differential operator. Using it in eq. (119) we get:

[TABLE]

Note that u does not appear in the differential operator, so that one can exchange the differential operator and the integral over u. Integrating with respect to u then yields, using the Fourier representation of the delta distribution (we denote $\partial_{a}=\partial/\partial h_{a}$ ):

[TABLE]

We use now the crucial identity, for any $\omega\geq 0$ and smooth function $f$ , and which can be shown by Taylor-expanding $f$ around $h$ inside the integral on the right hand side:

[TABLE]

Here we denoted $\gamma_{\omega}(x)=e^{-x^{2}/(2\omega)}/\sqrt{2\pi\omega}$ , and $\gamma_{0}(x)=\delta(x)$ . Using eq. (123) inside eq. (E.1) we reach:

[TABLE]
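To make the role of eq. (123) concrete, the sketch below checks it on polynomials, assuming that the identity reads $\exp[(\omega/2)\,\partial_{h}^{2}]f(h)=(\gamma_{\omega}\star f)(h)$ (the equation itself is not reproduced here, so this form is an assumption). For a polynomial the Taylor series of the exponential terminates, and the Gaussian convolution can be computed exactly from the moments $\mathbb{E}[z^{2i}]=(2i-1)!!$ of $z\sim\mathcal{N}(0,1)$. All function names are illustrative.

```python
import math

def poly_eval(coeffs, h):
    """Evaluate sum_j coeffs[j] * h**j."""
    return sum(c * h ** j for j, c in enumerate(coeffs))

def second_derivative(coeffs):
    """Coefficients of the second derivative of a polynomial."""
    return [j * (j - 1) * c for j, c in enumerate(coeffs)][2:]

def exp_operator(coeffs, omega):
    """Apply exp((omega/2) d^2/dh^2) to a polynomial; the Taylor series of the
    exponential terminates because high derivatives vanish."""
    result = [0.0] * len(coeffs)
    term, k = list(coeffs), 0
    while term:
        scale = (omega / 2.0) ** k / math.factorial(k)
        for j, c in enumerate(term):
            result[j] += scale * c
        term, k = second_derivative(term), k + 1
    return result

def gaussian_average(coeffs, h, omega):
    """(gamma_omega * f)(h) = E[f(h + sqrt(omega) z)], z ~ N(0,1),
    computed exactly with E[z^(2i)] = (2i-1)!!."""
    total = 0.0
    for j, c in enumerate(coeffs):
        for i in range(j // 2 + 1):
            double_fact = math.prod(range(1, 2 * i, 2))  # (2i-1)!!, equal to 1 for i = 0
            total += c * math.comb(j, 2 * i) * h ** (j - 2 * i) * omega ** i * double_fact
    return total

coeffs = [1.0, -2.0, 0.0, 0.0, 1.0]  # f(h) = h^4 - 2h + 1
h, omega = 0.7, 0.3
lhs = poly_eval(exp_operator(coeffs, omega), h)
rhs = gaussian_average(coeffs, h, omega)
print(lhs, rhs)
```

For $f(h)=h^{4}$ both sides equal $h^{4}+6\omega h^{2}+3\omega^{2}$, which is the Taylor-expansion argument mentioned above in its simplest non-trivial instance.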

We will iteratively apply the differential operator in the exponential, starting from $i=0$ up to $i=k$ . We will make use of another important identity, which is just a consequence of simple differential calculus combined with eq. (123), and valid for any $p,n\in\mathbb{N}$ and smooth $R(h_{1},\cdots,h_{n})$ :

[TABLE]

Let us now come back to eq. (124). We separate the term $i=0$ , and we have, with eq. (123):

[TABLE]

with $\Xi({\textbf{h}})$ defined as:

[TABLE]

Note that $\Xi({\textbf{h}})$ factorizes over the inner diagonal blocks of size $m_{0}$ , and we have $\Xi(h,\cdots,h)=\zeta(h)^{r/{m_{0}}}$ , with

[TABLE]

Therefore, putting it back into eq. (E.1) and using eq. (125), we have:

[TABLE]

with $\zeta(h)$ defined in eq. (127). This procedure can then be repeated iteratively on the diagonal blocks, all the way to the innermost ones. Eq. (124) then becomes:

[TABLE]

with the functions $g(m_{i},h)$ iteratively defined as:

[TABLE]

We now take the $k\to\infty$ (Full RSB) limit in eq. (129). In this limit we can approximate any function $q(x)$ (see Fig. 3), and taking for $(m_{i})_{i=0}^{k}$ a regular grid on $x\in[0,1]$ , we have $m_{0}\to 0$ , $m_{k-1}\to 1$ , and for all $i=0,\cdots,k-1$ , we have $m_{i}\to x$ and $m_{i}-m_{i-1}=\mathrm{d}x$ . Moreover $q_{k}\to q(1)$ , $q_{0}\to q(0)$ , and $q_{i+1}-q_{i}=\dot{q}(x)\mathrm{d}x$ . We sometimes also use the notation $q_{m}=q(0)$ , $q_{M}=q(1)$ . To make things clearer, we will denote derivatives w.r.t. $x$ with dots, and the ones w.r.t. $h$ with the usual prime. The second line of eq. (129) becomes, at first order in $\mathrm{d}x$ (recall the crucial eq. (123)), for $x\in(0,1)$ :

[TABLE]

Comparing the terms at first order in $\mathrm{d}x$ , we reach the PDE:

[TABLE]

It is convenient to rewrite eq. (130) in terms of $f(x,h)\coloneqq(1/x)\log g(x,h)$ , which yields the Parisi PDE:

[TABLE]

The boundary condition in the first line was given by eq. (129): $g(1,h)=\gamma_{1-q(1)}(h)$ .

Remark: universality of the Parisi PDE – As can be already hinted by the calculation above and the method of [Dup81], the Parisi PDE described in eq. (131) is actually extremely general: the specificities of the term that we wish to compute only appear in the boundary conditions at $x=1$ , while the evolution equation is only dependent on the ultrametric structure of the problem. We will see a clear example of this when computing the energetic contribution to the free entropy.

The $r\to 0$ limit – Taking the $r\to 0$ limit in eq. (128) yields finally:

[TABLE]

Solution to the Parisi PDE for the entropic contribution – Fortunately, with the boundary condition that we have here, the Parisi PDE of eq. (131) is analytically solvable. Indeed $g(x,h)$ always remains (up to a scaling) a centered Gaussian function of $h$ , or equivalently we can look for a solution in the form

[TABLE]

with $\omega(1)=1-q(1)$ and $C(1)=1$ . This yields after some algebra simple ODEs on $\omega,C$ that are easily verified to be solved by:

[TABLE]

We took the notation $\lambda(x)$ defined in eq. (48). In particular, for every $x\in(0,1)$ , we have:

[TABLE]

We can take the limit of this equation as $x\to 0$ . With our notations, we have $\lambda(0)=1-\langle q\rangle$ , $\lambda(1)=1-q_{M}$ and $\dot{\lambda}(u)=-u\dot{q}(u)$ . By integration by parts, we reach:

[TABLE]

Final result for the entropic contribution – Therefore, taking the limit $x\to 0$ , we have

[TABLE]

Note that eq. (132) is also equivalent to a formula given in Appendix II of [MP91] as can be seen by integration by parts:

[TABLE]

In the integration by parts, one uses $\lambda(0)=1-\langle q\rangle$ , $\lambda(1)=1-q_{M}$ , and $\lambda(u)=\lambda(0)+\mathcal{O}(u^{2})$ .

E.2 Energetic contribution

The second part of the free entropy is the energetic contribution, i.e. $\alpha G_{2,r}({\textbf{Q}})$ , with

[TABLE]

Again using the identity of eq. (120), we have:

[TABLE]

One can notice that this equation is extremely similar to eq. (E.1), but the function $\delta(h)$ has been replaced with $e^{-\beta\theta(h)}$ . However, the whole procedure that we described above to obtain the Parisi PDE does not change at all, since it did not depend on the specifics of this function: the PDE itself remains the same, only the boundary condition at $x=1$ will be different. In the end, this yields:

[TABLE]

with $f(x,h)$ given as the solution to the Parisi PDE with specific boundary condition at $x=1$ :

[TABLE]

Note that one can equivalently write this PDE in terms of the parameter $q$ rather than $x$ by a change of variable $q=q(x)$ , as described e.g. in [Urb18].

E.3 Recovering the RS result from the full RSB equations

In this paragraph we show that eq. (53) is equivalent to eq. (33). We denote $q_{0}=q$ , consistently with the RS computation. One computes easily that

[TABLE]

In particular, we have:

[TABLE]

Therefore eq. (53) reads:

[TABLE]

or equivalently:

[TABLE]

Since $H^{\prime}(x)=-e^{-x^{2}/2}/\sqrt{2\pi}$ , letting $f(\xi)\coloneqq 1-(1-e^{-\beta})H[\xi\sqrt{q/(1-q)}]$ we can rewrite eq. (135) and use an integration by parts:

[TABLE]

which is equivalent to eq. (33).

Appendix F Technicalities of the algorithmic FRSB procedure

F.1 Technicalities of the derivation of the algorithmic procedure

We give here some details on the origin of eqs. (59)-(62). Recall that here all the quantities are considered at zero temperature, with the scaling of eq. (54).

$\bullet$ Eq. (59) is a general relation between $q^{-1}$ , $f$ and $\Lambda$ when eq. (51c) is satisfied. It is explained for instance in [FPS+17], see eq. (B.4).

$\bullet$ Eq. (60) is a consequence of the general relation between $q^{-1}(x)$ (the function corresponding to the overlap matrix ${\textbf{Q}}^{-1}$ ) and $q(x)$ in the full RSB ansatz, which is (see e.g. eq. (B.9) in [FPS+17]):

[TABLE]

Discretization of this relation yields eq. (60).

$\bullet$ One can invert eq. (48) to obtain $q(x)$ as a function of $\lambda(x)$ via:

[TABLE]

It is the discretization of this equation that yields eq. (61).

$\bullet$ Eq. (62) is a consequence of the boundary condition of eq. (51a) taken at $x=1$ , followed by a change of variable from $x$ to $q$ in the parameters of the functions $\Lambda,f,\lambda$ . After these procedures, eq. (51a) becomes, for the unrescaled variables and any $\beta\geq 0$ :

[TABLE]

After taking the $\beta\to\infty$ limit, this yields for the variables that are rescaled as $\beta\to\infty$ according to eq. (54) (dropping the $\infty$ subscript):

[TABLE]

Moreover, $f^{\prime}(1,h)=-(h/\chi)\mathds{1}\{h\in(0,\sqrt{2\chi})\}$ . Therefore, rescaling then $t=h/\sqrt{2\chi}$ (and using the abusive notation $\Lambda(1,h)=\Lambda(1,t)$ ) we have:

[TABLE]

We focus on the left-hand side of this last equation, in the $k$ -RSB ansatz. We first use that $\lambda(q)=\chi+\int_{q}^{1}\mathrm{d}p\,x(p)$ , a simple consequence of eq. (48), after change of variables and rescaling. Therefore, we have:

[TABLE]

Using the convention $q_{-1}=0$ and $x_{-1}=0$ , we therefore reach from eq. (136) that:

[TABLE]

which is precisely eq. (62).

F.2 Numerical results of the procedure

In Fig. 6 we present the results of typical iterations of the algorithmic procedure described above. For different values of $\alpha$ and the RSB parameter $k$ we show the evolution of the estimates of $f^{\star}(\alpha)$ , the susceptibility $\chi$ , and the function $q(x)$ , along the iterations. In all the cases implemented we see power-law convergence to the solution, and very consistent results when varying the parameters used in the algorithm (in particular increasing $k$ ).

F.3 Some details on the implementation of convolutions

F.3.1 Convolutions via DFTs

In order to implement the algorithmic procedure of Section 3.2, we use a discrete Fourier transform approach. We refer to [Get13] for a review on Gaussian convolution algorithms. The goal is to compute the convolution of a centered Gaussian $\gamma_{\omega}$ with variance $\omega>0$ and a function $f(h)$ :

[TABLE]

We fix $N\in\mathbb{N}^{\star}$ and $H>0$ , and we consider a grid $h_{\mu}=\mu H/N$ , with $\mu\in\{-N,\cdots,N\}$ . In order to leverage analytical formulas for the DFT of the Gaussian, we use a Shannon-Whittaker interpolation for $f$ , i.e. we approximate $f$ as:

[TABLE]

with $\varphi(x)=\mathrm{sinc}(x)=\sin(\pi x)/(\pi x)$ . Since $\varphi(\nu)=0$ for all $\nu\in\mathbb{Z}^{\star}$ , and $\varphi(0)=1$ , we have $f_{\mu}=f(h_{\mu})$ . This approximation transfers into an approximation for $\gamma_{\omega}\star f$ as:

[TABLE]

with $\varphi_{\nu}(h)\coloneqq\varphi(Nh/H-\nu)$ . Thus we have, with $(\gamma_{\omega}\star f)_{\mu}=\gamma_{\omega}\star f(h_{\mu})$ , and using $\varphi_{\nu}(x)=\varphi_{0}(x-h_{\nu})$ :

[TABLE]
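The Shannon-Whittaker interpolation underlying these formulas can be illustrated with a short sketch. It assumes the interpolant $f(h)\approx\sum_{\mu=-N}^{N}f_{\mu}\,\varphi(Nh/H-\mu)$ on the grid $h_{\mu}=\mu H/N$ with $\varphi=\mathrm{sinc}$, and uses a Gaussian test function chosen only for illustration (it is smooth and negligible at $|h|=H$, so the interpolation is essentially exact).

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def whittaker_shannon(samples, N, H, h):
    """Interpolant sum_mu f_mu * sinc(N h / H - mu) on the grid h_mu = mu H / N,
    mu in {-N, ..., N}; samples[mu + N] stores f_mu."""
    return sum(samples[mu + N] * sinc(N * h / H - mu) for mu in range(-N, N + 1))

def f(h):
    """Illustrative test function (not taken from the paper)."""
    return math.exp(-0.5 * h * h)

N, H = 40, 10.0
samples = [f(mu * H / N) for mu in range(-N, N + 1)]

on_grid = whittaker_shannon(samples, N, H, 3 * H / N)   # a grid point
off_grid = whittaker_shannon(samples, N, H, 0.3)        # between grid points
print(on_grid - f(3 * H / N), off_grid - f(0.3))
```

On the grid the interpolant reproduces the samples, since $\varphi$ vanishes at non-zero integers; off the grid the accuracy depends on how well the grid spacing $H/N$ resolves the function.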

Note that we naturally extended $(\gamma_{\omega}\star\varphi_{0})_{\mu}$ to all $\mu\in\mathbb{Z}$ , since these coefficients have an analytic expression. In the same way, we extend $f_{\nu}=0$ if $|\nu|>N$ . For a general sequence $(f_{\nu})_{\nu=-N}^{N}$ , we define its Discrete Fourier Transform (DFT) as, for $k\in\{0,\cdots,2N\}$ :

[TABLE]

Taking the DFT of eq. (137), one finds:

[TABLE]

Moreover, we define the Fourier transform as $\tilde{f}(\xi)\coloneqq\int\mathrm{d}x\,f(x)\,e^{-2i\pi x\xi}$, for which one easily checks that $\tilde{\varphi}_{0}(\xi)=(H/N)\mathds{1}\{|\xi|\leq N/(2H)\}$. With this convention, the Fourier transform of a convolution is $\widetilde{f\star g}(\xi)=\tilde{f}(\xi)\tilde{g}(\xi)$. This yields, via inverse Fourier transformation:

[TABLE]

Therefore, we have by eq. (138):

[TABLE]

Taking $N\gg 1$ , the term on the right is well approximated by the Dirac comb:

[TABLE]

However, since $|\xi|\leq 1/2$ and $k\in\{0,\cdots,2N\}$ , this implies:

[TABLE]

Plugging it back into eq. (139), we finally obtain the formula we use for the DFT of the convolution $\gamma_{\omega}\star f$ :

[TABLE]
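A minimal sketch of a Gaussian convolution computed through the DFT is given below. It assumes, in line with the derivation above, that $f$ is negligible outside $[-H,H]$ and that the grid spacing $H/N$ is much smaller than $\sqrt{\omega}$. Under the Fourier convention used here, the transform of $\gamma_{\omega}$ is $e^{-2\pi^{2}\omega\xi^{2}}$, and the sketch multiplies the DFT of the samples by this factor at the DFT frequencies. The function name and the use of NumPy's FFT routines are illustrative choices; the prefactors and index conventions of the final formula may differ from this sketch.

```python
import numpy as np

def gaussian_convolve_dft(f_vals, H, omega):
    """Approximate (gamma_omega * f)(h_mu) on the grid h_mu = mu*H/N, mu = -N..N."""
    M = f_vals.shape[0]                    # M = 2N + 1 samples
    N = (M - 1) // 2
    spacing = H / N
    xi = np.fft.fftfreq(M, d=spacing)      # DFT frequencies, in cycles per unit of h
    # Analytic Fourier transform of a centered Gaussian density with variance omega.
    gauss_hat = np.exp(-2.0 * np.pi**2 * omega * xi**2)
    # Multiplying in Fourier space is a circular convolution on the grid, so f
    # must be (numerically) zero near the edges to avoid wrap-around effects.
    return np.real(np.fft.ifft(np.fft.fft(f_vals) * gauss_hat))

# Check on a case with a known answer: N(0, 1) convolved with N(0, omega)
# has the density of N(0, 1 + omega).
N, H, omega = 400, 20.0, 0.5
grid = np.arange(-N, N + 1) * H / N
f_vals = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)
conv = gaussian_convolve_dft(f_vals, H, omega)
exact = np.exp(-grid**2 / (2.0 * (1.0 + omega))) / np.sqrt(2.0 * np.pi * (1.0 + omega))
max_error = np.max(np.abs(conv - exact))
```

Using the analytic transform of the Gaussian, rather than the DFT of a sampled Gaussian, keeps very narrow kernels from being represented by only a handful of grid points.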

F.3.2 Taking a large enough value of $N$

Note that in order for the Gaussian convolutions to be numerically well defined, we need the spacing in the grid we take on $h$ to be much smaller than the standard deviation of the Gaussians, that is we need for any $(q(x),q(x)+\mathrm{d}x\,\dot{q}(x))$ :

[TABLE]

Note that, as shown in [FPS+17], and as one can also verify from Fig. 4, we have the following scaling as $x\to\infty$: $q(x)\sim 1-A/x^{2}$, with $A>0$. Therefore $\dot{q}(x_{\mathrm{max}})\sim 2A/x_{\mathrm{max}}^{3}$. Since we take $\mathrm{d}x\sim x_{\mathrm{max}}/k$ in our numerical procedure, we have that in order for our procedure to be valid we need

[TABLE]

In practice, we typically find $\chi/A\sim 10^{-1}$, so that we impose $N\gg N_{0}$, with

[TABLE]

We therefore take $N=cN_{0}$ with a constant $c\gg 1$ (we often take $c=30$), so as to be well into the regime $N\gg N_{0}$ while keeping a reasonable computational time.
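The order of magnitude of the required grid size can be sketched as follows. This is a simplified illustration, not the definition of $N_{0}$: it assumes that the narrowest Gaussian has variance $\mathrm{d}x\,\dot{q}(x_{\mathrm{max}})$ with $\mathrm{d}x=x_{\mathrm{max}}/k$ and $\dot{q}(x_{\mathrm{max}})=2A/x_{\mathrm{max}}^{3}$, and it ignores any further rescaling (for instance by $\chi$) entering the actual condition. The function names, the safety factor, and the value $A=1$ are hypothetical.

```python
import math

def smallest_gaussian_std(A, k, x_max):
    """Std of the narrowest Gaussian, assuming variance dx * q'(x_max)."""
    dx = x_max / k                      # step in x, as in the text
    q_dot = 2.0 * A / x_max**3          # from the scaling q(x) ~ 1 - A / x**2
    return math.sqrt(dx * q_dot)

def grid_points_needed(H, A, k, x_max, safety=30.0):
    """Smallest N such that the spacing H / N is at most std / safety."""
    return math.ceil(safety * H / smallest_gaussian_std(A, k, x_max))

# Hypothetical values: A = 1 is a placeholder, not a fitted constant.
std = smallest_gaussian_std(A=1.0, k=200, x_max=10.0)
n_points = grid_points_needed(H=40.0, A=1.0, k=200, x_max=10.0)
```

Under these assumptions the required grid size grows like $x_{\mathrm{max}}\sqrt{k}$, which is why larger values of $k$ and $x_{\mathrm{max}}$ are more expensive.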

F.4 Bounds on the injectivity threshold

Let us detail the results of our numerical computation of $\alpha_{\mathrm{inj}}^{\rm FRSB}$, illustrated in Fig. 5. For a given value of all parameters of the algorithm detailed in Section 3.2, we ran Brent’s method to find the zero of $f^{\star}_{\rm FRSB}(\alpha)-1$, with a tolerance of $10^{-4}$ on the value of $\alpha$. For all values of $\alpha$, we iterated the FRSB equations until $\|q^{t+1}-q^{t}\|_{\infty}\leq\epsilon=10^{-5}$. We ran this procedure for different values of $k\in\{30,50,100,200\}$, $x_{\mathrm{max}}\in\{10,11,12,13,14,15\}$, $H\in\{40,60\}$, and $c=30$ (recall that $H$ and $c$ are defined in Section F.3). In Fig. 5 the runs with different values of $H$ and $c$ are aggregated.
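The two nested loops described above (an inner fixed-point iteration with a sup-norm stopping rule, and an outer root search by Brent’s method) can be sketched as follows. The functions `toy_update` and `toy_f_star` are hypothetical stand-ins with no relation to the FRSB equations or to $f^{\star}_{\rm FRSB}$, and the location of the toy root is arbitrary; only the control flow is meant to be illustrative. The sketch assumes SciPy is available.

```python
import numpy as np
from scipy.optimize import brentq

def iterate_to_fixed_point(update, q0, eps=1e-5, max_iter=10_000):
    """Iterate q -> update(q) until the sup-norm of the step is at most eps."""
    q = np.asarray(q0, dtype=float)
    for _ in range(max_iter):
        q_next = update(q)
        if np.max(np.abs(q_next - q)) <= eps:
            return q_next
        q = q_next
    raise RuntimeError("no convergence within max_iter iterations")

def find_threshold(f_star, alpha_lo, alpha_hi, alpha_tol=1e-4):
    """Zero of f_star(alpha) - 1 on [alpha_lo, alpha_hi] by Brent's method."""
    return brentq(lambda a: f_star(a) - 1.0, alpha_lo, alpha_hi, xtol=alpha_tol)

def toy_update(q):
    """Hypothetical contraction whose fixed point solves q = cos(q)."""
    return 0.5 * (q + np.cos(q))

def toy_f_star(alpha):
    """Hypothetical monotone function crossing 1 at alpha = 5 (arbitrary)."""
    return (alpha / 5.0) ** 0.5

q_fixed = iterate_to_fixed_point(toy_update, np.zeros(4))
alpha_root = find_threshold(toy_f_star, 2.0, 10.0)
```

In an actual run, each evaluation of the function passed to the root search would itself call the inner fixed-point iteration at the given value of $\alpha$.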

Appendix G An upper bound from Gordon’s “escape through a mesh” theorem

After reformulating the injectivity question in Proposition 1.1, a first natural attempt to bound $p_{m,n}$ is to use Gordon’s “escape through a mesh” theorem [Gor88], which gives upper bounds on the probability of a random set intersecting another fixed set. We record this approach here, although it turns out to be looser than previous analyses [PKL+22].

For $m\geq 1$ , we denote $a_{m}\coloneqq\mathbb{E}[\|{\textbf{g}}\|]$ , for ${\textbf{g}}\sim\mathcal{N}(0,\mathrm{I}_{m})$ . One can easily show the bound [Ver18]:

[TABLE]
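For reference, $a_{m}$ has the closed form $a_{m}=\sqrt{2}\,\Gamma((m+1)/2)/\Gamma(m/2)$, which makes such bounds easy to check numerically. The sketch below assumes that the bound in question is the standard two-sided estimate $m/\sqrt{m+1}\leq a_{m}\leq\sqrt{m}$; the function name is an illustrative choice.

```python
import math

def expected_gaussian_norm(m):
    """a_m = E||g|| for g ~ N(0, I_m): sqrt(2) * Gamma((m+1)/2) / Gamma(m/2)."""
    # Work with log-Gamma to avoid overflow for large m.
    log_ratio = math.lgamma((m + 1) / 2.0) - math.lgamma(m / 2.0)
    return math.sqrt(2.0) * math.exp(log_ratio)

# Compare a_m with the assumed lower and upper bounds for a few dimensions.
checks = {
    m: (m / math.sqrt(m + 1), expected_gaussian_norm(m), math.sqrt(m))
    for m in (1, 2, 10, 1000)
}
```

For large $m$ both bounds agree with $\sqrt{m}$ up to $O(1/\sqrt{m})$ corrections, which is all that is used when taking $m,n\to\infty$.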

Moreover, for a closed set $S\subseteq\mathbb{R}^{m}$ , we define its Gaussian width as

[TABLE]

We are now ready to state Gordon’s theorem:

Theorem G.1 ([Gor88]).

Let $S\subseteq\mathcal{S}^{m-1}$ be a closed subset such that $\omega(S)<a_{m-n}$. Let $V$ be a uniformly-sampled random $n$-dimensional subspace of $\mathbb{R}^{m}$. Then

[TABLE]

Applying this theorem to Proposition 1.1, we directly reach

Corollary G.2.

Assume that $a_{m-n}-\omega(C_{m,n}\cap\mathcal{S}^{m-1})\to\infty$ as $m,n\to\infty$ . Then $p_{m,n}\to 1$ , i.e. $\varphi_{\textbf{W}}$ is injective w.h.p.

We therefore aim for an upper bound on $\omega(C_{m,n}\cap\mathcal{S}^{m-1})$ . We can prove the following

Lemma G.3.

Let $H(p)\coloneqq-p\log p-(1-p)\log(1-p)$ (for $p\in(0,1)$ ) denote the binary entropy function. Assume $\alpha>2$ . Then

[TABLE]
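The binary entropy typically enters arguments of this kind through the standard estimate $\binom{m}{k}\leq e^{mH(k/m)}$ on the number of subsets of a given size. It is an assumption of this sketch that this is how $H$ arises from the union bound over subsets in the proof; the code only checks the estimate numerically, with illustrative function names.

```python
import math

def binary_entropy(p):
    """H(p) = -p log p - (1 - p) log(1 - p), natural logarithm, for 0 < p < 1."""
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def log_binomial(m, k):
    """Natural logarithm of the binomial coefficient C(m, k)."""
    return math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)

# log C(m, k) against m * H(k / m) for a few illustrative pairs (m, k).
pairs = [(10, 3), (100, 20), (1000, 100)]
comparison = [(log_binomial(m, k), m * binary_entropy(k / m)) for m, k in pairs]
```

The two sides agree to leading order in $m$ when $k/m$ is held fixed, so the estimate loses only subexponential factors.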

Since $a_{m-n}\geq\sqrt{\alpha-1}\sqrt{n}(1+o(1))$ by eq. (140), Lemma G.3 and Corollary G.2 imply that $p_{m,n}\to 1$ whenever

[TABLE]

A numerical application yields that eq. (141) is satisfied whenever $\alpha>\alpha_{\mathrm{inj}}^{\mathrm{mesh}}$, with $\alpha_{\mathrm{inj}}^{\mathrm{mesh}}\simeq 23.54$. Note that this bound is suboptimal with respect to previous works [PKL+22] and to Theorem 1.8.

Proof of Lemma G.3 – We denote $\omega=\omega(C_{m,n}\cap\mathcal{S}^{m-1})$ . By weak duality, we have

[TABLE]

An element ${\textbf{x}}\in C_{m,n}$ can be parametrized by a subset $S\subseteq[m]$ with $|S|<n$ , and a set of values $\{x_{\mu}\}$ , with $x_{\mu}>0$ for $\mu\in S$ and $x_{\mu}\leq 0$ for $\mu\notin S$ . This yields:

[TABLE]

We used that $\inf_{\lambda>0}[\lambda+T/\lambda]=2\sqrt{T}$, for $T>0$. Since the law of ${\textbf{g}}$ is invariant under permutation of the indices, we can assume $g_{1}\geq g_{2}\geq\cdots\geq g_{p}\geq 0>g_{p+1}\geq\cdots\geq g_{m}$, with $p=p({\textbf{g}})\in[m]$. It is then straightforward to check that:

[TABLE]

Therefore

[TABLE]

Note that a more careful analysis than the argument presented so far would show that the first inequality of eq. (142) is actually an equality, although this is not needed in what follows. We also used Jensen’s inequality in the second inequality. The second term inside the square root can be computed easily:

[TABLE]

Moreover, for any $t\in(0,1)$ we have by Jensen’s inequality:

[TABLE]

using the union bound in ${\rm(a)}$ . Therefore, we reach

[TABLE]

Combining it with eqs. (142) and (143) concludes the proof. $\square$

Bibliography

[AC15] Antonio Auffinger and Wei-Kuo Chen. The Parisi formula has a unique minimizer. Communications in Mathematical Physics, 335(3):1429–1444, 2015.
[AC18] Antonio Auffinger and Wei-Kuo Chen. On concentration properties of disordered Hamiltonians. Proceedings of the American Mathematical Society, 146(4):1807–1815, 2018.
[ALMT14] Dennis Amelunxen, Martin Lotz, Michael B McCoy, and Joel A Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.
[AMÖS19] Simon Arridge, Peter Maass, Ozan Öktem, and Carola-Bibiane Schönlieb. Solving inverse problems using data-driven models. Acta Numerica, 28:1–174, 2019.
[AMS22] Antonio Auffinger, Andrea Montanari, and Eliran Subag. Optimization of random high-dimensional functions: Structure and algorithms. arXiv preprint arXiv:2206.10217, 2022.
[AT07] Robert J Adler and Jonathan E Taylor. Random fields and geometry, volume 80. Springer, 2007.
[BC20] Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density estimation. Advances in Neural Information Processing Systems, 33:442–453, 2020.
[BFH+18] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
