Properties of the geometry of solutions and capacity of multi-layer neural networks with Rectified Linear Units activations

Carlo Baldassi; Enrico M. Malatesta; Riccardo Zecchina

arXiv:1907.07578·cond-mat.dis-nn·May 6, 2024


TL;DR

This paper analytically investigates how ReLU activations influence the capacity and geometry of solution spaces in two-layer neural networks, revealing finite capacity and unique solution clustering properties.

Contribution

It provides the first analytical insights into ReLU effects on network capacity and solution landscape, contrasting with threshold units and highlighting robustness features.

Findings

01

Network capacity remains finite as hidden layer size increases.

02

Existence of dense, robust solution regions in the solution space.

03

Solutions are mostly isolated but some form large, stable clusters.


Equations

$$\sigma_{\text{out}}^{\mu}=\mathrm{sgn}\left(\frac{1}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,\tau_{l}^{\mu}\right)=\mathrm{sgn}\left(\frac{1}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,g\left(\lambda_{l}^{\mu}\right)\right)$$

$$Z=\int\mathrm{d}\mu\left(W\right)\,\mathbb{X}_{\xi,\sigma}\left(W\right)$$

$$\mathcal{F}=\mathcal{G}_{S}+\alpha\,\mathcal{G}_{E}$$

$$\mathcal{G}_{E}=\int Dz_{0}\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_{2}-\Delta}}\,z_{0}\right)$$

$$\alpha_{c}=\frac{\frac{\hat{q}}{2}\left(1-q\right)-\int Du\,\ln\left(2\cosh\left(\sqrt{\hat{q}}\,u\right)\right)}{\int Dz_{0}\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_{2}-\Delta}}\,z_{0}\right)}$$

$$\alpha_{c}^{\text{RS}}=\frac{2\,\delta\Delta}{\Delta_{2}-\Delta_{-1}}$$

$$\mathcal{F}_{\text{FP}}\left(S\right)=\frac{1}{N}\left\langle\frac{\int\mathrm{d}\mu\left(\tilde{W}\right)\mathbb{X}_{\xi,\sigma}\left(\tilde{W}\right)\ln\mathcal{N}_{\xi,\sigma}\left(\tilde{W},S\right)}{\int\mathrm{d}\mu\left(\tilde{W}\right)\mathbb{X}_{\xi,\sigma}\left(\tilde{W}\right)}\right\rangle_{\xi,\sigma}$$

$$\mathcal{F}_{\text{LD}}\left(q_{1}\right)=\mathcal{G}_{S}\left(q_{1}\right)+\alpha\,\mathcal{G}_{E}\left(q_{1}\right)$$

$$\tau_{l}^{\mu}=g\left(\frac{1}{\sqrt{N/K}}\sum_{i}W_{li}\,\xi_{li}^{\mu}\right)$$

$$\mathbb{X}_{\xi,\sigma}\left(W\right)=\prod_{\mu}\Theta\left(\frac{\sigma^{\mu}}{\sqrt{K}}\sum_{l}c_{l}\,\tau_{l}^{\mu}\right)$$

$$\mathcal{G}_{S}^{\text{sph}}=-\frac{1}{2nK}\sum_{l}\sum_{a,b}q_{l}^{ab}\hat{q}_{l}^{ab}+\frac{1}{nK}\sum_{l}\log\int\prod_{a}\mathrm{d}W^{a}\,e^{\frac{1}{2}\sum_{a,b}\hat{q}_{l}^{ab}W^{a}W^{b}}$$

$$\mathcal{G}_{S}^{\text{bin}}=-\frac{1}{2nK}\sum_{l=1}^{K}\sum_{a\neq b}q_{l}^{ab}\hat{q}_{l}^{ab}+\frac{1}{nK}\sum_{l=1}^{K}\log\sum_{W^{a}=\pm 1}e^{\frac{1}{2}\sum_{a\neq b}\hat{q}_{l}^{ab}W^{a}W^{b}}$$

$$\Delta_{-1}^{\text{sgn}}=0;\qquad\Delta^{\text{sgn}}=1-\frac{2}{\pi}\arccos\left(q\right);\qquad\Delta_{2}^{\text{sgn}}=1$$

$$\Delta_{-1}^{\text{ReLU}}=\frac{1}{2\pi};\qquad\Delta^{\text{ReLU}}=\frac{\sqrt{1-q^{2}}}{2\pi}+\frac{q}{\pi}\arctan\sqrt{\frac{1+q}{1-q}};\qquad\Delta_{2}^{\text{ReLU}}=\frac{1}{2}$$

$$\alpha_{c}=\frac{\ln\left(1+\tilde{m}\left(1-q_{0}\right)\right)+\frac{\tilde{m}\,q_{0}}{1+\tilde{m}\left(1-q_{0}\right)}}{2f\left(q_{0},\tilde{m}\right)}$$

$$f\left(q_{0},\tilde{m}\right)=\int Dz_{0}\,\ln\left[\sqrt{\frac{\delta\Delta}{\delta\Delta+\left(\Delta_{2}-\Delta_{0}\right)\tilde{m}}}\;e^{-\frac{\left(\Delta_{0}-\Delta_{-1}\right)\tilde{m}}{\delta\Delta+\left(\Delta_{2}-\Delta_{0}\right)\tilde{m}}\frac{z_{0}^{2}}{2}}\,H\left(-\sqrt{\frac{\Delta_{0}-\Delta_{-1}}{\Delta_{2}-\Delta_{0}}}\sqrt{\frac{\delta\Delta}{\delta\Delta+\left(\Delta_{2}-\Delta_{0}\right)\tilde{m}}}\,z_{0}\right)+H\left(\sqrt{\frac{\Delta_{0}-\Delta_{-1}}{\Delta_{2}-\Delta_{0}}}\,z_{0}\right)\right]$$

$$\mathcal{F}_{\text{FP}}=\mathcal{G}_{S}+\alpha\,\mathcal{G}_{E}$$

Full text


Carlo Baldassi

Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy

Enrico M. Malatesta

Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy

Riccardo Zecchina

Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy

Abstract

Rectified Linear Units (ReLU) have become the main model for the neural units in current deep learning systems. This choice has been originally suggested as a way to compensate for the so called vanishing gradient problem which can undercut stochastic gradient descent (SGD) learning in networks composed of multiple layers. Here we provide analytical results on the effects of ReLUs on the capacity and on the geometrical landscape of the solution space in two-layer neural networks with either binary or real-valued weights. We study the problem of storing an extensive number of random patterns and find that, quite unexpectedly, the capacity of the network remains finite as the number of neurons in the hidden layer increases, at odds with the case of threshold units in which the capacity diverges. Possibly more important, a large deviation approach allows us to find that the geometrical landscape of the solution space has a peculiar structure: While the majority of solutions are close in distance but still isolated, there exist rare regions of solutions which are much more dense than the similar ones in the case of threshold units. These solutions are robust to perturbations of the weights and can tolerate large perturbations of the inputs. The analytical results are corroborated by numerical findings.

Artificial Neural Networks (ANN) have been studied for decades, and yet only recently have they started to reveal their potential in performing different types of massive learning tasks (lecun2015deep, ). Their current denomination is Deep Neural Networks (DNN), in reference to the choice of architectures, which typically involve multiple interconnected layers of neuronal units. Learning in ANN is in principle a very difficult optimization problem, in which “good” minima of the learning loss function in the high-dimensional space of the connection weights need to be found. Luckily enough, DNN models have evolved rapidly, overcoming some of the computational barriers that for many years have limited their efficiency. Important components of this evolution have been the availability of computational power and the stockpiling of extremely rich data sets.

The features on which the various modeling strategies have intersected, besides the architectures, are the choice of the loss functions, the transfer functions for the neural units and the regularization techniques. These improvements have been found to help the convergence of the learning processes, typically based on Stochastic Gradient Descent (SGD) (bottou2010large, ), and to lead to solutions which can often avoid overfitting even in overparametrized regimes. All these results pose basic conceptual questions which need to find a clear explanation in terms of the optimization landscape.

Here we study the effects that the choice of Rectified Linear Units (ReLU) for the neurons (hahnloser2000digital, ) has on the geometrical structure of the learning landscape. ReLU is one of the most popular non-linear activation functions and it has been extensively used to train DNN, since it is known to dramatically reduce the training time for a typical algorithm (glorot2011deep, ). Another major benefit of ReLU is that it does not produce the vanishing gradient problem, unlike other common transfer functions (e.g. $\tanh$) (hochreiter1991untersuchungen, ; glorot2011deep, ). We study ANN models with one hidden layer storing random patterns, for which we derive analytical results that are corroborated by numerical findings. At variance with what happens in the case of threshold units, we find that models built on ReLU functions present a critical capacity, i.e. the maximum number of patterns per weight which can be learned, that does not diverge as the number of neurons in the hidden layer is increased. At the same time we find that below the critical capacity they also present wider dense regions of solutions. These regions are defined in terms of the volume of the weights around a minimizer which do not lead to an increase of the loss value (e.g. number of errors) (baldassi_unreasonable_2016, ). For discrete weights this notion reduces to the so-called Local Entropy (baldassi2015subdominant, ) of a minimizer. We also check analytically and numerically the improvement in the robustness of these solutions with respect to both weight and input perturbations.

Together with the recent results on the existence of such wide flat minima and on the effect of choosing particular loss functions to drive the learning processes toward them (baldassi2019shaping, ), our result contributes to creating a unified framework for the learning theory of DNN, which relies on the large-deviation geometrical features of the accessible solutions in the overparametrized regime.

The model. We will consider a two-layer neural network with $N$ input units, $K$ neurons in the hidden layer and one output. The mapping from the input to the hidden layer is realized by $K$ non-overlapping perceptrons, each having $N/K$ weights. Given $p=\alpha N$ inputs $\xi^{\mu}$ labeled by the index $\mu\in\left\{1,\dots,p\right\}$, the output of the network for each input $\mu$ is computed as

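$$\sigma_{\text{out}}^{\mu}=\mathrm{sgn}\left(\frac{1}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,\tau_{l}^{\mu}\right)=\mathrm{sgn}\left(\frac{1}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,g\left(\lambda_{l}^{\mu}\right)\right),\tag{1}$$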

where $\lambda_{l}^{\mu}$ is the input of the $l$-th hidden unit, i.e. $\lambda_{l}^{\mu}=\sqrt{\frac{K}{N}}\sum_{i=1}^{N/K}W_{li}\xi_{li}^{\mu}$, $\tau_{l}^{\mu}\equiv g\left(\lambda_{l}^{\mu}\right)$ is its output, and $W_{li}$ is the weight connecting the input unit $i$ to the hidden unit $l$. $c_{l}$ is the weight connecting hidden unit $l$ with the output; $g$ is a generic activation function. In the following we will mainly consider two particular choices of activation functions and of the weights $c_{l}$. In the first one we take the sign activation $g\left(\lambda\right)=\mathrm{sgn}\left(\lambda\right)$ and we fix all the weights $c_{l}$ to 1 (in general their sign can be absorbed into the weights $W_{li}$). The $K=1$ version of this model is the well-known perceptron, which has been extensively studied since the ’80s by means of the replica and cavity methods (gardner1988The, ; gardner1988optimal, ; mezard1989space, ) used in spin glass theory (BOOKNishimori2001, ). The $K>1$ case is known as the tree-committee machine and was studied in the ’90s (barkai1992broken, ; engel1992storage, ). In the second model we will use the ReLU activation function, defined as $g\left(\lambda\right)=\max\left(0,\lambda\right)$; since the output of this transfer function is always non-negative, we will fix half of the weights $c_{l}$ to $+1$ and the remaining half to $-1$. Given a training set composed of random i.i.d. patterns $\xi^{\mu}\in\left\{-1,1\right\}^{N}$ and labels $\sigma^{\mu}\in\left\{-1,1\right\}$, and defining $\mathbb{X}_{\xi,\sigma}\left(W\right)\equiv\prod_{\mu}\theta\left(\frac{\sigma^{\mu}}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,\tau_{l}^{\mu}\right)$, the weights that correctly classify the patterns are those for which $\mathbb{X}_{\xi,\sigma}\left(W\right)=1$. Their volume (or number) is therefore (gardner1988The, ; gardner1988optimal, )

$$Z=\int\mathrm{d}\mu\left(W\right)\,\mathbb{X}_{\xi,\sigma}\left(W\right),\tag{2}$$

where $\mathrm{d}\mu(W)$ is a measure over the weights $W$. In this study two constraints over the weights will be considered. The first is the spherical constraint, where for every $l\in\left\{1,\dots,K\right\}$ we have $\sum_{i}W_{li}^{2}=N/K$, i.e. every sub-perceptron has weights that live on the hypersphere of radius $\sqrt{N/K}$. The second is the binary constraint, where for every $l\in\left\{1,\dots,K\right\}$ and $i\in\left\{1,\dots,N/K\right\}$ we have $W_{li}\in\left\{-1,1\right\}$. We are interested in the large $K$ limit, for which we will be able to compute analytically the capacity of the model for different choices of the transfer function, to study the typical distances between absolute minima and to perform the large deviation study giving the local volumes associated with the wider flat minima.
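The model described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the helper names `sample_weights` and `network_output` are hypothetical, the convention $\mathrm{sgn}(0)=+1$ is an assumption, and the sizes are toy values chosen for readability.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sign(x):
    # Convention assumed here (not stated in the text): sgn(0) = +1.
    return 1.0 if x >= 0 else -1.0


def sample_weights(K, n, kind, rng):
    """Draw K sub-perceptrons with n = N/K weights each (hypothetical helper)."""
    W = []
    for _ in range(K):
        if kind == "binary":
            w = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        else:
            # Spherical constraint: rescale a Gaussian vector to squared norm n.
            w = [rng.gauss(0.0, 1.0) for _ in range(n)]
            norm = math.sqrt(sum(v * v for v in w))
            w = [v * math.sqrt(n) / norm for v in w]
        W.append(w)
    return W


def network_output(W, xi, c, g):
    """sgn(K^{-1/2} sum_l c_l g(lambda_l)), with lambda_l = n^{-1/2} W_l . xi_l."""
    K, n = len(W), len(W[0])
    total = 0.0
    for l in range(K):
        lam = sum(w * x for w, x in zip(W[l], xi[l])) / math.sqrt(n)
        total += c[l] * g(lam)
    return sign(total / math.sqrt(K))


rng = random.Random(0)
K, n = 4, 5  # toy sizes: N = K * n = 20 inputs
c_relu = [1.0] * (K // 2) + [-1.0] * (K // 2)  # half +1, half -1 for ReLU
W = sample_weights(K, n, "spherical", rng)
xi = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(K)]
print(network_output(W, xi, c_relu, relu))
```

For the sign activation one would instead pass `g=sign` and `c=[1.0] * K`, matching the choice of output weights made in the text.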

Critical Capacity. We will analyze the problem in the limit of a large number $N$ of input units. The standard scenario in this limit is that there is a sharp threshold $\alpha_{c}$ such that for $\alpha<\alpha_{c}$ the probability of finding a solution is 1, while for $\alpha>\alpha_{c}$ the set of solutions is empty. $\alpha_{c}$ is therefore called the critical capacity, since it is the maximum number of patterns per synapse that one can store in a neural network. The critical capacity of the model, for a generic choice of the transfer function, can be evaluated by computing the free entropy $\mathcal{F}\equiv\frac{1}{N}\left\langle\ln Z\right\rangle_{\xi,\sigma}$, where $\left\langle\cdot\right\rangle_{\xi,\sigma}$ denotes the average over the patterns, using the replica method; one finds

$$\mathcal{F}=\mathcal{G}_{S}+\alpha\,\mathcal{G}_{E}.\tag{3}$$

$\mathcal{G}_{S}$ is the entropic term, which represents the logarithm of the volume at $\alpha=0$, where there are no constraints induced by the training set; this quantity is independent of $K$ and it is affected only by the binary or spherical nature of the weights. $\mathcal{G}_{E}$ is the energetic term and it represents the logarithm of the fraction of solutions. It depends on the order parameters $q_{l}^{ab}\equiv\frac{K}{N}\sum_{i}W_{li}^{a}W_{li}^{b}$, which represent the overlap between sub-perceptrons $l$ of two different replicas $a$ and $b$ of the machine. Using a replica-symmetric (RS) ansatz, in which we assume $q_{l}^{ab}=q$ for all $a\neq b$ and all $l$, and in the large $K$ limit, $\mathcal{G}_{E}$ is

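$$\mathcal{G}_{E}=\int Dz_{0}\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_{2}-\Delta}}\,z_{0}\right)\tag{4}$$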

where $Dz\equiv\frac{\mathrm{d}z}{\sqrt{2\pi}}e^{-z^{2}/2}$ and $H\left(x\right)\equiv\int_{x}^{\infty}Dz$. This expression is equivalent to that of the perceptron (i.e. $K=1$), the only difference being that the order parameters are replaced by effective ones that depend on the general activation function used in the machine (barkai1992broken, ). In (4) we have denoted these effective order parameters by $\Delta_{-1}$, $\Delta$ and $\Delta_{2}$ (see appendix B.1 for details); in the perceptron they are $0$, $q$ and $1$ respectively.
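As a rough numerical illustration of the energetic term, the sketch below evaluates $H$ through the complementary error function and integrates $\ln H$ against the Gaussian measure with a plain trapezoidal rule. The function names, the integration cutoff `z_max` and the asymptotic switch at $x>20$ are assumptions made for this example, not part of the paper. A useful sanity check is the perceptron at $q=1/2$: there the argument of $H$ is $-z_{0}$, and since $H(-z)$ is uniformly distributed on $(0,1)$ for Gaussian $z$, the integral should equal $-1$.

```python
import math


def H(x):
    """Gaussian tail H(x) = int_x^inf Dz, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def log_H(x):
    # For large x, erfc underflows; use the leading asymptotic form instead.
    if x > 20.0:
        return -0.5 * x * x - math.log(x * math.sqrt(2.0 * math.pi))
    return math.log(H(x))


def energetic_term(d_m1, d, d_2, z_max=8.0, steps=4000):
    """Trapezoidal estimate of int Dz ln H(-sqrt((d - d_m1) / (d_2 - d)) z)."""
    ratio = math.sqrt((d - d_m1) / (d_2 - d))
    h = 2.0 * z_max / steps
    total = 0.0
    for i in range(steps + 1):
        z = -z_max + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        gauss = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * gauss * log_H(-ratio * z)
    return total * h


# Perceptron effective parameters are (0, q, 1).
print(energetic_term(0.0, 0.0, 1.0))  # q = 0: should be close to ln(1/2)
print(energetic_term(0.0, 0.5, 1.0))  # q = 1/2: should be close to -1
```

A production calculation would more likely use Gauss-Hermite quadrature; the trapezoidal rule is used here only to keep the example dependency-free.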

In the binary case the critical capacity is always smaller than 1 and it is identified with the point where the RS free entropy $\mathcal{F}$ vanishes. This condition requires

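$$\alpha_{c}=\frac{\frac{\hat{q}}{2}\left(1-q\right)-\int Du\,\ln\left(2\cosh\left(\sqrt{\hat{q}}\,u\right)\right)}{\int Dz_{0}\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_{2}-\Delta}}\,z_{0}\right)}\tag{5}$$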

where $\hat{q}$ is the conjugated parameter of $q$ . $q$ and $\hat{q}$ are found by solving their associated saddle point equations (details in appendix B.1). Solving these equations one finds for the ReLU case $\alpha_{c}=0.9039(9)$ which is a smaller value than in the sign activation function case, where one gets $\alpha_{c}=0.9469(5)$ as shown in (barkai1992broken, ).

In the spherical case the situation is different since the capacity is not bounded from above. Previous works (engel1992storage, ; barkai1992broken, ) have shown, in the case of the sign activations, that the RS estimate of the critical capacity diverges with the number of neurons in the hidden layer as $\alpha_{c}\simeq\left(\frac{72K}{\pi}\right)^{1/2}$, violating the Mitchison-Durbin bound (mitchison1989bounds, ). This discrepancy arises because the Gardner volume disconnects before $\alpha_{c}$, and therefore replica-symmetry breaking (RSB) takes place. Indeed, the instability of the RS solution occurs at a finite value $\alpha_{\text{AT}}\simeq 2.988$ at large $K$. A subsequent work (monasson1995weight, ) based on multifractal techniques derived the correct scaling of the capacity with $K$ as $\alpha_{c}\simeq\frac{16}{\pi}\sqrt{\ln K}$, which saturates the Mitchison-Durbin bound.
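To get a feeling for how different the two scalings quoted above are, one can simply tabulate them. The snippet below is a small arithmetic sketch; the function names are hypothetical, and both expressions are large-$K$ asymptotic forms, so the numbers at small $K$ are only indicative.

```python
import math


def alpha_rs_sign(K):
    """RS estimate for sign activations: (72 K / pi)^(1/2)."""
    return math.sqrt(72.0 * K / math.pi)


def alpha_multifractal_sign(K):
    """Scaling from the multifractal analysis: (16 / pi) sqrt(ln K)."""
    return (16.0 / math.pi) * math.sqrt(math.log(K))


for K in (10, 100, 1000, 10000):
    print(K, round(alpha_rs_sign(K), 2), round(alpha_multifractal_sign(K), 2))
```

The RS estimate grows like $\sqrt{K}$ while the corrected one grows only like $\sqrt{\ln K}$, so the gap between them widens quickly with the size of the hidden layer.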

In the case of the ReLU functions the RS estimate of the critical capacity is obtained simply by performing the $q\to 1$ limit, as for the perceptron. Quite surprisingly, if the activation function is such that $\Delta_{2}-\Delta\simeq\delta\Delta\left(1-q\right)$ for $q\to 1$, with $\delta\Delta$ a finite proportionality constant, the RS estimate of the critical capacity is finite (for the same reason that the capacity of the perceptron is finite, where one has exactly $\Delta_{2}-\Delta=1-q$). Contrary to the sign activation (where the effective parameters are such that $\Delta_{2}-\Delta\simeq\delta\Delta\sqrt{1-q}$), the ReLU activation function happens to belong to this class (with $\delta\Delta=\frac{1}{2}$). The RS estimate of the critical capacity is therefore given by

$$\alpha_{c}^{\text{RS}}=\frac{2\,\delta\Delta}{\Delta_{2}-\Delta_{-1}}.\tag{6}$$

One correctly recovers $\alpha_{c}=2$ in the case of the perceptron (gardner1988The, ) whereas for the committee machine with ReLU activation one has $\alpha_{c}=2\left(1-\frac{1}{\pi}\right)^{-1}\simeq 2.934$ . As for the sign activation, one expects also for the ReLU activation that RS is unstable before $\alpha_{c}$ . Indeed we have computed the stability of the RS solution in the large $K$ limit and we found $\alpha_{\text{AT}}\simeq 0.615$ which is far smaller than the corresponding value of the sign activation. This suggests that strong RSB effects are at play.
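The two ingredients of this argument, the $q\to 1$ scaling of $\Delta_{2}-\Delta$ and the resulting RS capacity, can be checked numerically. The sketch below assumes the effective order parameters listed in the paper's equations, $\Delta^{\text{sgn}}=1-\frac{2}{\pi}\arccos q$ and $\Delta^{\text{ReLU}}=\frac{\sqrt{1-q^{2}}}{2\pi}+\frac{q}{\pi}\arctan\sqrt{\frac{1+q}{1-q}}$, with $\Delta_{2}^{\text{sgn}}=1$, $\Delta_{2}^{\text{ReLU}}=\frac{1}{2}$ and $\Delta_{-1}^{\text{ReLU}}=\frac{1}{2\pi}$; the function names and the finite-difference step `eps` are choices made for this example.

```python
import math


def delta_sign(q):
    return 1.0 - (2.0 / math.pi) * math.acos(q)


def delta_relu(q):
    # Valid for q < 1 (the arctan argument diverges at q = 1).
    return (math.sqrt(1.0 - q * q) / (2.0 * math.pi)
            + (q / math.pi) * math.atan(math.sqrt((1.0 + q) / (1.0 - q))))


def alpha_c_rs(delta_delta, d_2, d_m1):
    """RS capacity 2 * delta_Delta / (Delta_2 - Delta_{-1})."""
    return 2.0 * delta_delta / (d_2 - d_m1)


eps = 1e-6
# ReLU: (Delta_2 - Delta) / (1 - q) should approach delta_Delta = 1/2.
relu_slope = (0.5 - delta_relu(1.0 - eps)) / eps
# Sign: (Delta_2 - Delta) / sqrt(1 - q) approaches a constant instead.
sign_coeff = (1.0 - delta_sign(1.0 - eps)) / math.sqrt(eps)

alpha_perceptron = alpha_c_rs(1.0, 1.0, 0.0)
alpha_relu = alpha_c_rs(0.5, 0.5, 1.0 / (2.0 * math.pi))
print(relu_slope, sign_coeff, alpha_perceptron, alpha_relu)
```

The linear behaviour for ReLU and the square-root behaviour for the sign activation are what separate a finite RS capacity from a divergent one in the discussion above.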

We have therefore used a 1RSB ansatz to better estimate the critical capacity in the ReLU case. This can be obtained by taking the limits $q_{1}\to 1$ for the intra-block overlap and $m\to 0$ for the Parisi parameter. We found $\alpha_{c}^{\text{1RSB}}\simeq 2.6643$, which is not too far from the RS result. (Footnote: In a previous version of the paper a value of $\alpha_{c}^{\text{1RSB}}\simeq 2.92$ was reported. After the work of (zavatone2020activation, ) we found that it was incorrect, due to a bad initialization of the solver of the saddle point equations, which let the order parameters converge to the RS saddle point.)

For the sake of brevity, we only mention that the result on the non-divergence of the capacity with $K$ generalizes to other monotone smooth functions such as the sigmoid.

Typical distances. In order to get a quantitative understanding of the geometrical structure of the weight space, we have also derived the so called Franz-Parisi entropy for the committee machine with a generic transfer function. This framework was originally introduced in (franz1995recipes, ) to study the metastable states of mean-field spin glasses and only recently it was used to study the landscape of the solutions of the perceptron (huang2014origin, ). The basic idea is to sample a solution $\tilde{W}$ from the equilibrium Boltzmann measure and to study the entropy landscape around it. In the binary setting, it turns out that the equilibrium solutions of the learning problem are isolated; this means that, for any positive value of $\alpha$ , one must flip an extensive number of weights to go from an equilibrium solution to another one. The Franz-Parisi entropy is defined as

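$$\mathcal{F}_{\text{FP}}\left(S\right)=\frac{1}{N}\left\langle\frac{\int\mathrm{d}\mu\left(\tilde{W}\right)\mathbb{X}_{\xi,\sigma}\left(\tilde{W}\right)\ln\mathcal{N}_{\xi,\sigma}\left(\tilde{W},S\right)}{\int\mathrm{d}\mu\left(\tilde{W}\right)\mathbb{X}_{\xi,\sigma}\left(\tilde{W}\right)}\right\rangle_{\xi,\sigma}\tag{7}$$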

where $\mathcal{N}_{\xi,\sigma}\!\left(\tilde{W},S\right)\!=\!\int\mathrm{d}\mu\!\left(W\right)\mathbb{X}_{\xi,\sigma}\!\left(W\right)\prod_{l=1}^{K}\delta\!\left(W_{l}\cdot\tilde{W}_{l}-\frac{N}{K}S\right)$ counts the number of solutions at a distance $d=\frac{1-S}{2}$ from a reference $\tilde{W}$ . The distance constraint is imposed by fixing the overlap between every sub-perceptron of $W$ and $\tilde{W}$ to $\frac{S}{K}$ . The quantity defined in eq. (7) can again be computed by the replica method. However, the expression for $K$ finite is quite difficult to analyze numerically since the energetic term contains $4K$ integrals. The large $K$ limit is instead easier and, again, the only difference with the perceptron expression is that the order parameters are replaced with effective ones in the energetic term (see appendix C for details).
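For binary weights, the distance $d=\frac{1-S}{2}$ used above is simply the fraction of weights that differ between the two configurations. The following toy example (with hypothetical helper names and a hand-picked pair of vectors) makes the relation concrete.

```python
def overlap(w, w_ref):
    """Normalized overlap S = (1/n) sum_i w_i * w_ref_i."""
    return sum(a * b for a, b in zip(w, w_ref)) / len(w)


def hamming_fraction(w, w_ref):
    """Fraction of entries in which the two configurations differ."""
    return sum(1 for a, b in zip(w, w_ref) if a != b) / len(w)


w_ref = [1, -1, 1, 1, -1, 1, 1, -1]
w = [1, 1, 1, -1, -1, 1, 1, -1]  # differs from w_ref in two entries

S = overlap(w, w_ref)
d = hamming_fraction(w, w_ref)
print(S, d, (1 - S) / 2)
```

Each flipped weight lowers the unnormalized overlap by 2, which is where the factor $\frac{1}{2}$ in $d=\frac{1-S}{2}$ comes from.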

In Fig. 1 we plot the Franz-Parisi entropy $\mathcal{F}_{\text{FP}}$ as a function of the distance $d=\frac{1-S}{2}$ from a typical reference solution $\tilde{W}$ for the committee machine with both sign and ReLU activations. The numerical analysis shows that, as in the case of the binary perceptron, also in the 2-layer case solutions are isolated since there is a minimal distance $d^{*}$ below which the entropy becomes negative. This minimal distance increases with the constraint density $\alpha$ . However at a given value of $\alpha$ , typical solutions of the committee machine with ReLU activations are less isolated than the ones of the sign counterpart. The same framework applies to the case of spherical weights where we find that the minimum distance between typical solutions is smaller for the ReLU case.

Large deviation analysis. The results of the previous section show that the Franz-Parisi framework does not capture the features of high local entropy regions. These regions indeed exist, since algorithms can be observed to find solutions belonging to large connected clusters of solutions. In order to study the properties of wide flat minima or high local entropy regions one needs to introduce a large deviation measure (baldassi2015subdominant, ), which favors configurations surrounded by an exponential number of solutions at small distance. This amounts to studying a system with $y$ real replicas constrained to be at a distance $d$ from a reference configuration $\tilde{W}$. The high local entropy region is found around $d\simeq 0$ in the limit of large $y$. As shown in (baldassi2019shaping, ), an alternative approach can be obtained by directly constraining the set of $y$ replicas to be at a given mutual distance $d$, that is:

[TABLE]

This last approach has the advantage of simplifying the calculations, since it is related to the standard 1RSB approach on the typical Gardner volume (krauth-mezard, ) given in equation (2): the only difference is that the Parisi parameter $m$ and the intra-block overlap $q_{1}$ are fixed as external parameters, and play the same role as $y$ and $1-2d$ respectively. Therefore $m$ is no longer limited to the standard range $\left[0,1\right]$; indeed, the large $m$ regime is the significant one for capturing high local entropy regions. In the large $m$ and $K$ limit the large deviation free entropy $\mathcal{F}_{\text{LD}}\equiv\frac{1}{N}\left\langle\ln Z_{\text{LD}}\right\rangle_{\xi,\sigma}$ reads

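$$\mathcal{F}_{\text{LD}}\left(q_{1}\right)=\mathcal{G}_{S}\left(q_{1}\right)+\alpha\,\mathcal{G}_{E}\left(q_{1}\right)\tag{9}$$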

where, again, the entropic term has a different expression depending on the constraint over the weights $W$ . Its expression, together with the corresponding energetic term, is reported in appendix D.

We report in Fig. 2 the numerical results for both binary and spherical weights of the large deviation entropy (normalized with respect to the unconstrained $\alpha=0$ case) as a function of $q_{1}$. For both sign and ReLU activations, the region $q_{1}\simeq 1$ is flat around zero. This means that there exist references $\tilde{W}$ around which the landscape of solutions is basically indistinguishable from the $\alpha=0$ case, where all configurations are solutions. We also find that the ReLU curve, in the vicinity of $q_{1}\simeq 1$, is always more entropic than the corresponding one for the sign. This picture is valid for sufficiently low values of $\alpha$; for $\alpha$ greater than a certain value $\alpha^{*}$ the two curves switch. This is due to the fact that the two models have completely different critical capacities (divergent in the sign case, finite in the ReLU case), so that one expects that clusters of solutions disappear at a smaller constraint density when ReLU activations are used.

Stability distribution and robustness. To corroborate our previous results, we have also computed (see appendix E), for various models and types of solutions $W$, the distribution of the stabilities $\Xi=\left\langle\frac{\sigma}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,\tau_{l}^{\mu}\right\rangle_{\xi,\sigma}$, which measure the distance from the threshold at the output unit in the direction of the correct label $\sigma$, cf. eq. (1). Previous calculations (kepler1988domains, ) have shown that in the simple case of the spherical perceptron at the critical capacity the stability distribution around a typical solution $W$ develops a delta peak at the origin $\Xi\simeq 0$; we confirmed that even in the two-layer case the stability distribution of a typical solution, being isolated, also has its mode at $\Xi\simeq 0$ even at lower $\alpha$, see the dashed lines in Fig. 3 (left). A solution surrounded by an exponential number of other solutions, instead, should be more robust and be centered away from $0$. Our calculations show that this is indeed the case both for the sign and for the ReLU activations, and we have confirmed the results by numerical simulations. In Fig. 3 (left) we show the analytical and numerical results for the case of binary weights at $\alpha=0.4$ with $y=20$ replicas at $q_{1}=0.85$. For the numerical results, we have used simulated annealing on a system with $K=32$ for the ReLU activations and $K=33$ for the sign activations, and $N=K^{2}\simeq 10^{3}$. We have simulated a system of $y$ interacting replicas that is able to sample from the local-entropic measure (baldassi_unreasonable_2016, ) with the RRR Monte Carlo method (baldassi2017method, ), ensuring that the annealing process was sufficiently slow that at the end of the simulation all replicas were solutions, and controlling the interaction such that the average overlap between replicas was equal to $q_{1}$ within a tolerance of $0.01$. The results were averaged over $20$ samples. As seen in Fig. 3 (left), the agreement with the analytical computations is remarkable, despite the small values of $N/K$ and $K$ and the approximations introduced by sampling with simulated annealing.
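Empirically, the stability distribution can be approximated by a histogram of the per-pattern stabilities $\frac{\sigma^{\mu}}{\sqrt{K}}\sum_{l}c_{l}\tau_{l}^{\mu}$ of a given weight configuration. The sketch below only shows how these numbers are computed for one configuration; it uses random weights and labels as placeholder data (so the configuration is generally not a solution), and the function names are hypothetical rather than taken from the authors' simulation code.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def stability(W, xi, sigma, c, g):
    """Signed distance from the output threshold for one labelled pattern."""
    K, n = len(W), len(W[0])
    total = 0.0
    for l in range(K):
        lam = sum(w * x for w, x in zip(W[l], xi[l])) / math.sqrt(n)
        total += c[l] * g(lam)
    return sigma * total / math.sqrt(K)


def stabilities(W, patterns, labels, c, g):
    return [stability(W, xi, s, c, g) for xi, s in zip(patterns, labels)]


rng = random.Random(0)
K, n, P = 4, 8, 6  # toy sizes, far smaller than those used in the paper
c = [1.0] * (K // 2) + [-1.0] * (K // 2)
W = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(K)]
patterns = [[[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(K)]
            for _ in range(P)]
labels = [rng.choice((-1.0, 1.0)) for _ in range(P)]  # placeholder labels

xs = stabilities(W, patterns, labels, c, relu)
# W is a solution exactly when every stability is strictly positive.
print(xs, all(x > 0 for x in xs))
```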

The stabilities for the sign and ReLU activations are qualitatively similar, but quantitatively we observe that in all cases the curves for the ReLU case have a peak closer to $0$ and a smaller variance. These are not, however, directly comparable, and it is difficult to tell from the stability curves alone which choice is more robust. We have thus directly measured, on the results of the simulations, the effect of introducing noise in the input patterns. For each trained group of $y$ replicas, we used the configuration of the reference $\tilde{W}$ (which lies in the middle of the cluster of solutions) and we measured the probability that a pattern of the training set would be misclassified if perturbed by flipping a fraction $\eta$ of randomly chosen entries. We explored a wide range of values of the noise $\eta$ and sampled $50$ perturbations per pattern. The results are shown in Fig. 3 (right), and they confirm that the networks with ReLU activations are more robust than those with sign activations for this $\alpha$, in agreement with the results of Fig. 2. We also verified that the reference configuration is indeed more robust than the individual replicas. The results for other choices of the parameters are qualitatively identical. Our preliminary tests show that the same phenomenology is maintained when the network architecture is changed to a fully-connected scheme, in which each hidden unit is connected to all of the input units.
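The input-noise measurement described above can be sketched as follows. This is a toy illustration under explicit assumptions: the weights are random rather than the trained reference $\tilde{W}$, the labels are defined as the network's own outputs so that the configuration classifies the unperturbed patterns correctly by construction, and the names `classify`, `flip_fraction` and `misclassification_rate` are hypothetical.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def classify(W, xi, c, g):
    """Network label in {-1, +1}; a zero pre-activation is mapped to -1 here."""
    K, n = len(W), len(W[0])
    total = 0.0
    for l in range(K):
        lam = sum(w * x for w, x in zip(W[l], xi[l])) / math.sqrt(n)
        total += c[l] * g(lam)
    return 1 if total > 0 else -1


def flip_fraction(xi, eta, rng):
    """Return a copy of xi with a fraction eta of randomly chosen entries flipped."""
    K, n = len(xi), len(xi[0])
    noisy = [row[:] for row in xi]
    for idx in rng.sample(range(K * n), int(round(eta * K * n))):
        noisy[idx // n][idx % n] *= -1
    return noisy


def misclassification_rate(W, patterns, labels, c, g, eta, n_perturb=50, seed=0):
    """Fraction of perturbed training patterns whose label changes."""
    rng = random.Random(seed)
    errors = 0
    for xi, sigma in zip(patterns, labels):
        for _ in range(n_perturb):
            if classify(W, flip_fraction(xi, eta, rng), c, g) != sigma:
                errors += 1
    return errors / (n_perturb * len(patterns))


rng = random.Random(1)
K, n, P = 4, 25, 10  # toy sizes
c = [1] * (K // 2) + [-1] * (K // 2)
W = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(K)]
patterns = [[[rng.choice((-1, 1)) for _ in range(n)] for _ in range(K)]
            for _ in range(P)]
labels = [classify(W, xi, c, relu) for xi in patterns]  # toy labels by construction

for eta in (0.0, 0.05, 0.2):
    print(eta, misclassification_rate(W, patterns, labels, c, relu, eta))
```

The default of 50 perturbations per pattern mirrors the number quoted in the text; everything else about the setup is a simplification.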

The architecture of the model that we have analyzed here is certainly very simplified compared to state of the art deep neural networks used in applications. Investigating deeper models would certainly be of great interest, but extremely challenging with current analytical techniques, and is thus an open problem. Extending our analysis to a one-hidden-layer fully-connected model, on the other hand, would in principle be feasible (the additional complication comes from the permutation symmetry of the hidden layer). However, based on the existing literature (e.g. (barkai1992broken, ; urbanczik1997storage, )), and our preliminary numerical experiments mentioned above, we do not expect that such extension would result in qualitatively different outcomes compared to our tree-like model.

Appendix A Model

Our model is a tree-like committee machine with $N$ weights $W_{li}$ divided into $K$ groups of $N/K$ entries. We use the index $l=1,\dots,K$ for the group and $i=1,\dots,\frac{N}{K}$ for the entry. We consider two cases, the binary case $W_{li}=\pm 1$ for all $l,i$ and the continuous case with spherical constraints on each group, $\sum_{i=1}^{N/K}W_{li}^{2}=N/K$ for all $l$.

The training set consists of $p=\alpha N$ random binary i.i.d. patterns. The inputs are denoted by $\xi_{li}^{\mu}$ and the outputs by $\sigma^{\mu}$ , where $\mu=1,\dots,p$ is the pattern index.

In our analysis, we write the model using a generic activation function $g\left(x\right)$ for the first layer (hidden) units. We will consider two cases: the sign case $g\left(x\right)=\mathrm{sgn}\left(x\right)$ and the ReLU case $g\left(x\right)=\max\left(0,x\right)$. The output of any given unit $l$ in response to a pattern $\mu$ is thus written as

$$\tau_{l}^{\mu}=g\left(\frac{1}{\sqrt{N/K}}\sum_{i}W_{li}\,\xi_{li}^{\mu}\right)$$

The connection weights between the first layer and the output are denoted by $c_{l}$ and considered binary and fixed; for the case of ReLU activations we set the first half $l=1,\dots,K/2$ to the value $+1$ and the rest to $-1$; for the case of the sign activations we can set them all to $+1$ without loss of generality.

A configuration of the weights $W$ solves the training problem if it classifies correctly all the patterns; we denote this with the indicator function

[TABLE]

where $\Theta\left(x\right)$ is the Heaviside step function, equal to $1$ if $x>0$ and $0$ otherwise.
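The model and the indicator function described above can be sketched in a few lines of Python. This is only an illustrative sketch: the displayed equations are not reproduced here, so the normalization of the pre-activations, $\lambda_{l}^{\mu}=\sqrt{K/N}\sum_{i}W_{li}\xi_{li}^{\mu}$, and the $1/\sqrt{K}$ scaling of the output field are assumptions, and the function names (`network_field`, `is_solution`, `output_weights`) are hypothetical.

```python
import math


def relu(x):
    return max(0.0, x)


def sign(x):
    return 1.0 if x > 0 else -1.0


def output_weights(K, activation):
    """Fixed second-layer weights c_l, following the convention in the text."""
    if activation is relu:
        # ReLU: first half +1, the rest -1.
        return [1.0] * (K // 2) + [-1.0] * (K - K // 2)
    # Sign: all +1, without loss of generality.
    return [1.0] * K


def network_field(W, xi, activation):
    """Pre-sign output for one pattern; W and xi are K lists of N/K entries."""
    K = len(W)
    c = output_weights(K, activation)
    total = 0.0
    for c_l, W_l, xi_l in zip(c, W, xi):
        # ASSUMPTION: lambda_l = sqrt(K/N) * sum_i W_li * xi_li.
        lam = sum(w * x for w, x in zip(W_l, xi_l)) / math.sqrt(len(W_l))
        total += c_l * activation(lam)
    # ASSUMPTION: 1/sqrt(K) scaling of the output field.
    return total / math.sqrt(K)


def is_solution(W, patterns, labels, activation):
    """Indicator X(W): 1 if sigma * field > 0 for every pattern, else 0."""
    return int(all(s * network_field(W, xi, activation) > 0
                   for xi, s in zip(patterns, labels)))


# Tiny ReLU example with K = 2 groups of N/K = 2 binary weights.
W = [[1, 1], [1, 1]]
patterns = [[[1, 1], [-1, -1]]]
print(network_field(W, patterns[0], relu))
print(is_solution(W, patterns, [+1], relu), is_solution(W, patterns, [-1], relu))
```

For sign activations the two scalings are irrelevant, since only the sign of the field matters; for ReLU they fix the scale of the field but not which patterns are classified correctly.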

The volume of the space of configurations that correctly classify the whole training set is then

[TABLE]

where $\mathrm{d}\mu\left(W\right)$ is the flat measure over the admissible values of the $W$ , depending on whether we’re analyzing the binary or the spherical case. The average of the log-volume over the distribution of the patterns, $\left\langle\log Z\right\rangle_{\xi,\sigma}$ , is the free entropy of the model $\mathcal{F}$ . We can evaluate it in the large $N$ limit with the “replica trick”, using the formula $\log Z=\lim_{n\to 0}\partial_{n}Z^{n}$ , computing the average for integer $n$ and then taking the limit $n\to 0$ .
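For very small systems with binary weights, the volume $Z$ is simply the number of weight configurations that classify all patterns correctly, and it can be obtained by exhaustive enumeration. The sketch below does this for sign activations with $c_{l}=+1$ and then checks the identity $\log Z=\lim_{n\to 0}\partial_{n}Z^{n}$ numerically on a single sample. It is a toy illustration under these assumptions, not the analytical computation, which averages $Z^{n}$ over the patterns; the sizes $N=9$, $K=3$ and the function names are arbitrary choices made here.

```python
import itertools
import math
import random


def sign(x):
    return 1 if x > 0 else -1


def field(W_flat, xi_flat, K):
    """Sum of sign units (c_l = +1); group l holds entries [l*n, (l+1)*n)."""
    n = len(W_flat) // K
    total = 0
    for l in range(K):
        lam = sum(W_flat[l * n + i] * xi_flat[l * n + i] for i in range(n))
        total += sign(lam)
    return total


def count_solutions(patterns, labels, N, K):
    """Brute-force Z: number of binary W with sigma * field > 0 on all patterns."""
    count = 0
    for W in itertools.product((-1, 1), repeat=N):
        if all(s * field(W, xi, K) > 0 for xi, s in zip(patterns, labels)):
            count += 1
    return count


random.seed(0)
N, K, p = 9, 3, 3  # odd N/K and odd K avoid ties in the sign functions
patterns = [tuple(random.choice((-1, 1)) for _ in range(N)) for _ in range(p)]
labels = [random.choice((-1, 1)) for _ in range(p)]

Z = count_solutions(patterns, labels, N, K)
print("Z =", Z, "out of", 2 ** N)

if Z > 0:
    n = 1e-6
    # Finite-difference version of d/dn Z^n at n = 0.
    print("log Z =", math.log(Z), " (Z^n - 1)/n =", (Z ** n - 1.0) / n)
```

The cost grows as $2^{N}$, so this is only useful as a sanity check at tiny sizes, far from the large $N$ limit studied analytically.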

As explained in the main text, in all cases the resulting expression takes the form

[TABLE]

where the $\mathcal{G}_{S}$ part is only affected by the spherical or binary nature of the weights, whereas the $\mathcal{G}_{E}$ part is only affected by $K$ and by the activation function $g$ . Determining their values requires computing a saddle point over the overlap parameters $q_{l}^{ab}$ , with $a,b=1,\dots,n$ , which represent the overlaps between replicas, and over their conjugates $\hat{q}_{l}^{ab}$ ; in turn, this requires an ansatz about the structure of the saddle point in order to perform the $n\to 0$ limit.

For the spherical weights, the $\mathcal{G}_{S}$ part (before the $n\to 0$ limit) reads

[TABLE]

while for the binary case we have a very similar expression, except that the summations don’t have the $a=b$ case and the integral over $W^{a}$ becomes a summation:

[TABLE]

The $\mathcal{G}_{E}$ part reads:

[TABLE]

In all cases, we study the problem in the large $K$ limit, which allows us to invoke the central limit theorem and leads to a crucial simplification of the expressions.

Appendix B Critical capacity

B.1 Replica symmetric ansatz

In the replica-symmetric (RS) case we seek solutions of the form $q_{l}^{ab}=\delta_{ab}+\left(1-\delta_{ab}\right)q$ for all $l,a,b$ , where $\delta_{ab}$ is the Kronecker delta symbol, and similarly for the conjugated parameters, $\hat{q}_{l}^{ab}=\delta_{ab}\hat{Q}+\left(1-\delta_{ab}\right)\hat{q}$ for all $l,a,b$ . The resulting expressions, as reported in the main text, are:

[TABLE]

where $Dz\equiv\mathrm{dz}\frac{e^{-\frac{1}{2}z^{2}}}{\sqrt{2\pi}}$ is a Gaussian measure, $H\left(x\right)=\int_{x}^{\infty}Dz=\frac{1}{2}\mathrm{erfc}\left(\frac{x}{\sqrt{2}}\right)$ and the expressions of $\Delta$ , $\Delta_{2}$ and $\Delta_{-1}$ depend on the activation function $g$ :

[TABLE]

The values of the overlaps and conjugated parameters are found by setting to $0$ the derivatives of the free entropy.

The critical capacity $\alpha_{c}$ is found in the binary case by numerically seeking the value of $\alpha$ at which the saddle-point solution yields zero free entropy. For the spherical case, instead, $\alpha_{c}$ is determined by finding the value of $\alpha$ such that $q\to 1$ , which can be obtained analytically by reparametrizing $q=1-\delta q$ and expanding for $\delta q\ll 1$ . In this limit, we must also reparametrize $\Delta$ using $\Delta_{2}-\Delta=\delta\Delta\,\delta q^{x}$ , where $x$ is an exponent that depends on the activation function: $x=\nicefrac{{1}}{{2}}$ for the sign and $x=1$ for the ReLU. Because of this difference in the exponent, $\alpha_{c}$ diverges in the sign activation case (as was shown in (barkai1992broken; engel1992storage)), while for the ReLU activations it converges to $2\left(1-\frac{1}{\pi}\right)^{-1}\approx 2.934$ . However, the RS result for the spherical case is only an upper bound, and a more accurate result requires replica-symmetry breaking.
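The two exponents quoted above can be checked numerically with a short sketch. It rests on an assumption made here, since the definitions of $\Delta$ and $\Delta_{2}$ are not reproduced in this text: that $\Delta$ behaves like the correlation $\mathbb{E}\left[g\left(u\right)g\left(v\right)\right]$ of two standard Gaussian variables with correlation $q$, and $\Delta_{2}$ like its value at $q=1$. For the sign this correlation is $\frac{2}{\pi}\arcsin q$, and for the ReLU it is the first-order arc-cosine kernel. The function names are hypothetical.

```python
import math


def corr_sign(q):
    """E[sgn(u) sgn(v)] for standard Gaussians u, v with correlation q."""
    return (2.0 / math.pi) * math.asin(q)


def corr_relu(q):
    """E[relu(u) relu(v)] for standard Gaussians u, v with correlation q."""
    theta = math.acos(q)
    return (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / (2.0 * math.pi)


def gap_exponent(corr, d1=1e-4, d2=1e-6):
    """Log-log slope of corr(1) - corr(1 - dq) between dq = d1 and dq = d2."""
    g1 = corr(1.0) - corr(1.0 - d1)
    g2 = corr(1.0) - corr(1.0 - d2)
    return math.log(g1 / g2) / math.log(d1 / d2)


x_sign = gap_exponent(corr_sign)
x_relu = gap_exponent(corr_relu)
print("exponent x, sign:", x_sign)
print("exponent x, ReLU:", x_relu)

# RS value of the critical capacity for ReLU quoted in the text.
alpha_c_rs_relu = 2.0 / (1.0 - 1.0 / math.pi)
print("RS alpha_c, ReLU:", alpha_c_rs_relu)
```

The square-root singularity of the sign correlation near $q=1$ comes from the discontinuity of the activation, whereas the ReLU is continuous and its correlation approaches its limit linearly.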

B.2 1RSB ansatz

In the one-step replica-symmetry-breaking ($1$RSB) ansatz we seek solutions with 3 possible values of the overlaps $q_{l}^{ab}$ and their conjugates. We group the $n$ replicas in $\frac{n}{m}$ groups of $m$ replicas each, and denote with $q_{0}$ the overlaps among different groups and with $q_{1}$ the overlaps within the same group. As before, the self overlap is $q^{aa}=1$ and its conjugate $\hat{q}^{aa}=\hat{Q}$ (these are only relevant in the spherical case).
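The block structure of the two ansätze can be made concrete by building the overlap matrices explicitly for integer $n$ and $m$. This is only an illustration of the structure described above, with hypothetical function names; in the actual calculation the matrices are never built, and the limit $n\to 0$ (with $m$ continued to non-integer values) is taken on the resulting formulas.

```python
def rs_overlap_matrix(n, q):
    """RS ansatz: 1 on the diagonal, q everywhere else."""
    return [[1.0 if a == b else q for b in range(n)] for a in range(n)]


def one_rsb_overlap_matrix(n, m, q0, q1):
    """1RSB ansatz: n/m groups of m replicas.

    Diagonal entries are 1, entries within the same group are q1,
    entries between different groups are q0.
    """
    if n % m != 0:
        raise ValueError("n must be a multiple of m")
    Q = []
    for a in range(n):
        row = []
        for b in range(n):
            if a == b:
                row.append(1.0)
            elif a // m == b // m:
                row.append(q1)
            else:
                row.append(q0)
        Q.append(row)
    return Q


Q = one_rsb_overlap_matrix(n=6, m=3, q0=0.2, q1=0.7)
for row in Q:
    print(row)
```

Every row sums to $1+\left(m-1\right)q_{1}+\left(n-m\right)q_{0}$, which is one of the combinations that typically appear when the $n\to 0$ limit is worked out.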

The resulting expressions are:

[TABLE]

The expressions of $\Delta_{2}$ and $\Delta_{-1}$ are the same as in the RS case. The expressions of $\Delta_{0}$ and $\Delta_{1}$ take the same form as the RS expressions for $\Delta$ , except that $q_{0}$ and $q_{1}$ must be used instead of $q$ .

Similarly to the RS case, in order to determine the critical capacity $\alpha_{c}$ in the spherical case, we need to find the value of $\alpha$ such that $q_{1}\to 1$ . In this case however we must also have $m\to 0$ . The scaling is such that $\tilde{m}=m/\left(1-q_{1}\right)$ is finite. The final expression reads

[TABLE]

where

[TABLE]

The expression of $\delta\Delta$ is the same as in the RS case using $\Delta_{1}$ instead of $\Delta$ ; the values of $q_{0}$ and $\tilde{m}$ are determined by saddle point equations as usual.

Appendix C Franz-Parisi potential

The Franz-Parisi entropy (franz1995recipes; huang2014origin) is defined as (cf. eq. (7) of the main text):

[TABLE]

where $\mathcal{N}_{\xi,\sigma}\!\left(\tilde{W},S\right)\!=\!\int\mathrm{d}\mu\!\left(W\right)\mathbb{X}_{\xi,\sigma}\!\left(W\right)\prod_{l=1}^{K}\delta\!\left(W_{l}\cdot\tilde{W}_{l}-\frac{N}{K}S\right)$ counts the number of solutions at a distance $\frac{1-S}{2}$ from the reference $\tilde{W}$ . In this expression, $\tilde{W}$ represents a typical solution. The evaluation of this quantity requires the use of two sets of replicas: $n$ replicas of the reference configuration $\tilde{W}$ , for which we use the indices $a$ and $b$ , and $r$ replicas of the surrounding configurations $W$ , for which we use the indices $c$ and $d$ . The computation proceeds following standard steps, leading to these expressions, written in terms of the overlaps $q_{l}^{ab}=\frac{K}{N}\sum_{i=1}^{N/K}\tilde{W}_{li}^{a}\tilde{W}_{li}^{b}$ , $p_{l}^{cd}=\frac{K}{N}\sum_{i=1}^{N/K}W_{li}^{c}W_{li}^{d}$ , $t_{l}^{ac}=\frac{K}{N}\sum_{i=1}^{N/K}\tilde{W}_{li}^{a}W_{li}^{c}$ and their conjugate quantities:

[TABLE]
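The relation between the overlap constraint $S$ and the distance $\frac{1-S}{2}$ used above is easy to verify for binary weights, where the distance is the fraction of differing entries within a group. The following is a minimal sketch with hypothetical helper names, using the overlap normalization $\frac{K}{N}\sum_{i}$ given in the text.

```python
def group_overlap(W_l, V_l):
    """Overlap (K/N) * sum_i W_li * V_li for one group of N/K weights."""
    return sum(w * v for w, v in zip(W_l, V_l)) / len(W_l)


def hamming_fraction(W_l, V_l):
    """Fraction of entries in which two binary weight groups differ."""
    return sum(1 for w, v in zip(W_l, V_l) if w != v) / len(W_l)


reference = [1, 1, 1, 1]   # one group of the reference configuration
candidate = [1, -1, 1, 1]  # one entry flipped

S = group_overlap(candidate, reference)
d = hamming_fraction(candidate, reference)
print("S =", S, " distance =", d, " (1 - S) / 2 =", (1 - S) / 2)
```

The delta functions in $\mathcal{N}_{\xi,\sigma}$ impose the same value of $S$ on each of the $K$ groups separately, so the check applies group by group.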

The calculation proceeds by taking the RS ansatz with the same structure as that of sec. B.1 and the limit $n\to 0$ , $r\to 0$ . We obtain, in the large $K$ limit:

[TABLE]

where we introduced the auxiliary quantities $\Sigma_{0}$ , $\Sigma_{1}$ , $\Gamma$ , $D_{0}$ , $D_{1}$ and $\Delta_{3}$ which depend on the choice of the activation function (like the $\Delta_{-1/0/1/2}$ of the previous section). For the sign activations we get:

[TABLE]

For the ReLU activations we get:

[TABLE]

In order to find the order parameters for any given $\alpha$ and $S$ , we need to set to $0$ the derivatives of the free entropy with respect to the order parameters $q$ , $p$ , $t$ and the conjugates $\hat{Q}$ , $\hat{q}$ , $\hat{P}$ , $\hat{p}$ , $\hat{t}$ , $\hat{S}$ , thus obtaining a system of 9 equations (7 for the binary case) to be solved numerically. The equations actually reduce to 6 (5 in the binary case), since $q$ , $\hat{Q}$ and $\hat{q}$ are the same as those derived in the typical case (sec. B.1).

Appendix D Large deviation analysis

Following (baldassi2019shaping), the large deviation analysis for the description of the high-local-entropy landscape uses the same equations as the standard $1$RSB expressions, eqs. (20), (21) and (22). In this case, however, the overlap $q_{1}$ is not determined by a saddle point equation; rather, it is treated as an external parameter that controls the mutual overlap between the $y$ replicas of the system. Also, the parameter $m$ is not optimized and is not restricted to the range $\left[0,1\right]$ ; instead, it plays the role of the number of replicas $y$ and is generally taken to be large (we normally use either a large integer, to compare the results with numerical simulations, or the limit $m\to\infty$ ). For these reasons, there are two fewer saddle point equations than in the standard $1$RSB calculation.

The resulting expression for the free entropy $\mathcal{F}_{\text{LD}}\left(q_{1}\right)$ represents, in the spherical case, the log-volume of valid configurations (solutions at the correct overlap) of the system of $y$ replicas. These configurations are thus embedded in $\mathcal{S}^{Ky}$ where $\mathcal{S}$ is the $\nicefrac{{N}}{{K}}$ -dimensional sphere of radius $\sqrt{\nicefrac{{N}}{{K}}}$ . In order to quantify the solution density, we must normalize $\mathcal{F}_{\text{LD}}\left(q_{1}\right)$ , subtracting the log-volume of all the admissible configurations at a given $q_{1}$ without the solution constraint (which is obtained by the analogous computation with $\alpha=0$ ). Since the solutions are a subset of the admissible configurations, the resulting quantity is upper-bounded by $0$ (cf. Fig. (2) of the main text). For the binary case, $\mathcal{F}_{\text{LD}}\left(q_{1}\right)$ is the log of the number of admissible solutions, and the same normalization procedure can be applied.

Large $m$ limit

[TABLE]

In the $m\to\infty$ case the order parameters $q_{0}$ and $\hat{q}_{0}$ need to be rescaled with $m$ and reparametrized with two new quantities $\delta q_{0}$ and $\delta\hat{q}_{0}$ , as follows:

[TABLE]

As a consequence, we also reparametrize $\Delta_{0}$ with a new parameter $\delta\Delta_{0}$ defined as:

[TABLE]

The expressions eqs. (20), (21) and (22) become:

Appendix E Distribution of stabilities

The stability for a given pattern/label pair $\xi^{*},\sigma^{*}$ is defined as:

[TABLE]
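As an illustration, the stabilities of a given weight configuration over a set of patterns can be computed directly. The displayed definition is not reproduced in this text, so the sketch below assumes the usual form, i.e. the label times the pre-sign output field, with the pre-activation normalization $\sqrt{K/N}$ and the $1/\sqrt{K}$ output scaling also being assumptions. It covers the ReLU case only, and the function names are hypothetical.

```python
import math


def relu(x):
    return max(0.0, x)


def stability(W, xi, sigma):
    """ASSUMED form: sigma * (1/sqrt(K)) * sum_l c_l * relu(lambda_l)."""
    K = len(W)
    total = 0.0
    for l, (W_l, xi_l) in enumerate(zip(W, xi)):
        # ReLU convention from the text: c_l = +1 for the first half, -1 after.
        c_l = 1.0 if l < K // 2 else -1.0
        # ASSUMPTION: lambda_l = sqrt(K/N) * sum_i W_li * xi_li.
        lam = sum(w * x for w, x in zip(W_l, xi_l)) / math.sqrt(len(W_l))
        total += c_l * relu(lam)
    return sigma * total / math.sqrt(K)


def fraction_below(stabilities, threshold):
    """Empirical fraction of patterns with stability below a threshold."""
    return sum(1 for s in stabilities if s < threshold) / len(stabilities)


W = [[1, 1], [1, 1]]
xi = [[1, 1], [-1, -1]]
print(stability(W, xi, +1), stability(W, xi, -1))
print(fraction_below([-1.0, 0.5, 2.0], 1.0))
```

With this convention a pattern is classified correctly exactly when its stability is positive, and larger values indicate a larger margin against perturbations.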

The distribution over the training set for a typical solution can thus be computed as

[TABLE]

where we arbitrarily chose the first pattern/label pair $\xi^{1}$ , $\sigma^{1}$ without loss of generality. The expression can be computed by the replica method as usual, and the order parameters are simply obtained from the solutions of the saddle point equations for the free entropy. The resulting expression at the RS level is:

[TABLE]

where $G\left(x\right)=\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}$ is a standard Gaussian. The difference between the models (spherical/binary and sign/ReLU) is encoded in the different values for the overlaps and in the different expressions for the parameters $\Delta$ , $\Delta_{-1}$ , $\Delta_{2}$ .

In the large deviation case, we simply compute the expression with a $1$ RSB ansatz and fix $q_{1}$ and $m$ as described in the previous section. The resulting expression is

[TABLE]

where the effective parameters $\Delta_{-1}$ , $\Delta_{0}$ , $\Delta_{1}$ and $\Delta_{2}$ are the same defined in section B.2. In the $m\to\infty$ limit the previous expression reduces to

[TABLE]

where $z_{1}^{*}$ satisfies

[TABLE]

where $\delta\Delta_{0}$ is defined in equation (50).

Bibliography

(1) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. doi:10.1038/nature14539.
(2) Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. doi:10.1007/978-3-7908-2604-3_16.
(3) Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000. doi:10.1038/35016072.
(4) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. URL: http://proceedings.mlr.press/v15/glorot11a.html.
(5) Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München, 91(1), 1991.
(6) Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, November 2016. doi:10.1073/pnas.1608103113.
(7) Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett., 115:128101, Sep 2015. doi:10.1103/PhysRevLett.115.128101.
(8) Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural networks around wide flat minima. arXiv preprint arXiv:1905.07833, 2019. URL: https://arxiv.org/abs/1905.07833.
