Shaping the learning landscape in neural networks around wide flat minima
Carlo Baldassi, Fabrizio Pittorino, Riccardo Zecchina

TL;DR
This paper investigates the geometry of minima in neural network loss landscapes, revealing the prevalence of wide flat minima associated with better generalization, and proposes algorithms to find such minima.
Contribution
The study uncovers the existence of wide flat minima in neural networks and introduces entropy-driven algorithms to efficiently locate these regions.
Findings
Wide flat minima are common in neural network loss landscapes.
Minimizers of cross-entropy loss overlap with wide flat minima.
Algorithms focusing on flat regions improve generalization.
Abstract
Learning in Deep Neural Networks (DNN) takes place by minimizing a non-convex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to be able to find good minimizers without getting stuck in local critical points, and that such minimizers are often satisfactory at avoiding overfitting. How these two features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far reaching open question. In this paper we study basic non-convex one- and two-layer neural network models which learn random patterns, and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsStochastic Gradient Descent
Shaping the learning landscape in neural networks around wide flat
minima
Carlo Baldassi
Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy
Istituto Nazionale di Fisica Nucleare, Sezione di Torino, Italy
Fabrizio Pittorino
Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy
Politecnico di Torino, Italy
Riccardo Zecchina
Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, Milano, Italy
International Centre for Theoretical Physics, Trieste, Italy
Abstract
Learning in Deep Neural Networks (DNN) takes place by minimizing a non-convex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to be able to find good minimizers without getting stuck in local critical points, and that such minimizers are often satisfactory at avoiding overfitting. How these two features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far reaching open question. In this paper we study basic non-convex one- and two-layer neural network models which learn random patterns, and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective we derive entropy driven greedy and message passing algorithms which focus their search on wide flat regions of minimizers. In the case of SGD and cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate the results by a numerical study of the correlations between the volumes of the minimizers, their Hessian and their generalization performance on real data.
Contents
-
II HLE/WFM regions exist in non convex neural devices storing random patterns
-
II.5 Not all devices are appropriate: the Parity Machine does not display HLE/WFM regions
-
B Cross-Entropy minima, errors and high local entropy configurations
-
D.2 Exploring the space of solutions around a given configuration
I Introduction
Artificial Neural Networks (ANN), currently also known as Deep Neural Networks (DNN) when they have more than two layers, are powerful nonlinear devices used to perform different types of learning tasks (MacKay, 2003). From the algorithmic perspective, learning in ANN is in principle a hard computational problem in which a huge number of parameters, the connection weights, need to be optimally tuned. Yet, at least for supervised pattern recognition tasks, learning has become a relatively feasible process in many applications across domains and the performances reached by DNNs have had a huge impact on the field of Machine Learning (ML).
DNN models have evolved very rapidly in the last decade, mainly by an empirical trial and selection process guided by heuristic intuitions. As a result, current DNN are in a sense akin to complex physical or biological systems, which are known to work but for which a detailed understanding of the principles underlying their functioning remains unclear. The tendency to learn efficiently and to generalize with limited overfitting are two properties that often coexist in DNN and yet a unifying theoretical framework is still missing.
Here we provide analytical results on the geometrical structure of the loss landscape of ANN which shed light on the success of Deep Learning (DL) (LeCun et al., 2015) algorithms and allow us to design novel efficient algorithmic schemes.
We focus on non-convex one- and two-layer ANN models that exhibit sufficiently complex behavior and yet are amenable to detailed analytical and numerical studies. Building on methods of statistical physics of disordered systems, we analyze the complete geometrical structure of the minimizers of the loss function of ANN learning random patterns and discuss how the current DNN models are able to exploit such structure, e.g. starting from the choice of the loss function, avoiding algorithmic traps and reaching rare solutions which belong to wide flat regions of the weight space. In our study the notion of flatness is given in terms of the volume of the weights around a minimizer which do not lead to an increase of the loss value. This generalizes the so called Local Entropy of a minimizer (Baldassi et al., 2015), defined for discrete weights as the log of the number of optimal weights assignments within a given Hamming distance from the reference minimizer. We call these regions High Local Entropy (HLE) regions for discrete weights or Wide Flat Minima (WFM) for continuous weights. Our results are derived analytically for the case of random data and corroborated by numerics on real data. In order to eliminate ambiguities which may arise from changes of scale of the weights, we control the norm of the weights in each of the units that compose the network. The outcomes of our study can be summarized as follows.
(i) We show analytically that ANN learning random patterns possess the structural property of having extremely robust regions of optimal weights*, *namely wide flat minima of the loss, whose existence is important to achieve convergence in the learning process. Though these wide minima are rare compared to the dominant critical points (absolute narrow minima, local minima or saddle points in the loss surface), they can be accessed by a large family of simple learning algorithms. We also show analytically that other learning machines, such as the Parity Machine, do not possess wide flat minima.
(ii) We show analytically that the choice of the cross-entropy loss function has the effect of biasing learning algorithms toward HLE or WFM regions.
(iii) We derive a greedy algorithm – Entropic Least Action Learning (eLAL) – and a message passing algorithm – focusing Belief Propagation (fBP) – which zoom in their search on wide flat regions of minimizers.
(iv) We compute the volumes associated to the minimizers found by different algorithms using Belief Propagation.
(v) We show numerically that the volumes correlate well with the spectra of the Hessian on computationally tractable networks and with the generalization performance on real data. The algorithms which search for WFM display a spectrum which is much more concentrated around zero eigenvalues compared to plain SGD.
Our results on random patterns support the conclusion that the minimizers which are relevant for learning are not the most frequent isolated and narrow ones (which also are computationally hard to sample) but the rare ones which are extremely wide. While this phenomenon was recently disclosed for the case of discrete weights (Baldassi et al., 2015, 2016a), here we demonstrate that it is present also in non convex ANN with continuous weights. Building on these results we derive novel algorithmic schemes and shed light on the performance of SGD with the cross-entropy loss function. Numerical experiments suggest that the scenario generalizes to real data and is consistent with other numerical results on deeper ANN (Keskar et al., 2016).
II HLE/WFM regions exist in non convex
neural devices storing random patterns
In what follows we analyze the geometrical structure of the weights space by considering the simplest non convex neural devices storing random patterns: the single layer network with discrete weights and the two layer networks with both continuous and discrete weights. The choice of random patterns, for which no generalization is possible, is motivated by the possibility of using analytical techniques from statistical physics of disordered systems and by the fact that we want to identify structural features which do not depend on specific correlation patterns of the data.
II.1 The simple example of discrete weights
In the case of binary weights it is well known that even for the single layer network the learning problem is computationally challenging. Therefore we begin our analysis by studying the so called binary perceptron, which maps vectors of inputs to binary outputs as , where is the synaptic weights vector .
Given a training set composed of input patterns with and their corresponding desired outputs , the learning problem consists in finding a solution such that for all . The entries and the outputs are random unbiased i.i.d. variables. As discussed in Krauth and Mézard (1989) (but see also the rigorous bounds in Ding and Sun (2019)), perfect classification is possible with probability in the limit of large up to a critical value of , usually denoted as ; above this value, the probability of finding a solution drops to zero. is called the capacity of the device.
The standard analysis of this model is based on the study of the zero temperature limit of the Gibbs measure with a loss (or energy) function which counts the number of errors (NE) over the training set:
[TABLE]
where is the Heaviside step function, if and [math] otherwise. The Gibbs measure is given by
[TABLE]
where is the inverse temperature parameter. For large values of , concentrates on the minima of . The key analytical obstacle for the computation of is the evaluation of the normalization factor, the partition function :
[TABLE]
In the the zero temperature limit ( and below the partition function simply counts all solutions to the learning problem,
[TABLE]
where is a characteristic function which evaluates to one if all patterns are correctly classified, and to zero otherwise.
is an exponentially fluctuating quantity (in ), and its most probable value is obtained by exponentiating the average of , denoted by , over the realizations of the patterns
[TABLE]
The calculation of has been done in the 80s and 90s by the replica and the cavity methods of statistical physics and, as mentioned above, the results predict that the learning task undergoes a threshold phenomenon at , where the probability of existence of a solution jumps from one to zero in the large limit (Krauth and Mézard, 1989). This result has been put recently on rigorous grounds by Ding and Sun (2019). Similar calculations predict that for any , the vast majority of the exponentially numerous solutions on the hypercube are isolated, separated by a Hamming mutual distance (Huang and Kabashima, 2014). In the same range of , there also exist an even larger number of local minima at non-zero loss, a result that has been corroborated by analytical and numerical findings on stochastic learning algorithms which satisfy detailed balance (Horner, 1992). Recently it became clear that by relaxing the detailed balance condition it was possible to design simple algorithms which can solve the problem efficiently (Braunstein and Zecchina, 2006; Baldassi et al., 2007; Baldassi, 2009).
II.2 Local entropy theory
The existence of effective learning algorithms indicates that the traditional statistical physics calculations, which focus on set of solutions that dominate the zero temperature Gibbs measure (i.e. the most numerous ones), are effectively blind to the solutions actually found by such algorithms. Numerical evidence suggests that in fact the solutions found by heuristics are not at all isolated: on the contrary, they appear to belong to regions with a high density of nearby other solutions. This puzzle has been solved very recently by an appropriate large deviations study (Baldassi et al., 2015, 2016c, 2016b, 2016a) in which the tools of statistical physics have been used to study the most probable value of the local entropy of the loss function, i.e. a function that is able to detect the existence of regions with an radius containing a high density of solutions even when the number of these regions is small compared to the number of isolated solutions. For binary weights the local entropy function is the (normalized) logarithm of the number of solutions at Hamming distance from a reference solution
[TABLE]
with
[TABLE]
and where is the Kronecker delta symbol. In order to derive the typical values that the local entropy can take, one needs to compute the Gibbs measure of the local entropy
[TABLE]
where has the role of an inverse temperature. For large values of this probability measure focuses on the surrounded by an exponential number of solutions within a distance . The regions of high local entropy (HLE) are then described in the regime of large and small . In particular, the calculation of the expected value of the optimal local entropy
[TABLE]
shows the existence of extremely dense clusters of solutions up to values of close to (Baldassi et al., 2015, 2016c, 2016b, 2016a).
The probability measure eq. (8) can be written in an equivalent form that generalizes to the non zero errors regime, is analytically simpler to handle, and leads to novel algorithmic schemes (Baldassi et al., 2016a):
[TABLE]
where is a “local free entropy” potential in which the distance constraint is forced through a Lagrange multiplier
[TABLE]
where is some monotonically increasing function of the distance between configurations, defined according to the type of weights under consideration. In the limit and by choosing so that a given distance is selected, this expression reduces to eq. (8).
The crucial property of eq. (10) comes from the observation that by choosing to be a non-negative integer, the partition function can be rewritten as:
[TABLE]
where
[TABLE]
These are the partition function and the effective loss of interacting real replicas of the system, one of which acts as reference system ( while the remaining () are identical, each being subject to the energy constraint and to the interaction term with the reference system. As discussed in Baldassi et al. (2016a), several algorithmic schemes can be derived from this framework by minimizing . Here we shall also use the above approach to study the existence of WFMs in continuous models and to design message passing and greedy learning algorithms driven by the local entropy of the solutions.
II.3 Two layer networks with continuous weights
As for the discrete case, we are able to show that in non convex networks with continuous weights the WFMs exist, are rare and yet accessible to simple algorithms. In order to perform an analytic study, we consider the simplest non-trivial two-layer neural network, the committee machine with non-overlapping receptive fields. It consists of input units, one hidden layer with units and one output unit. The input units are divided into disjoint sets of units. Each set is connected to a different hidden unit. The input to the -th hidden unit is given by where is the connection weight between the input unit and the hidden unit , and is the -th input to the -th hidden unit. As before, is a pattern index. We study analytically the pure classifier case in which each unit implements a threshold transfer function and the loss function is the error loss. Other types of (smooth) functions, more amenable to numerical simulation, will be also discussed in a subsequent section. The output of the -th hidden unit is given by
[TABLE]
In the second layer all the weights are fixed and equal to one, and the overall output of the network is simply given by a majority vote
As for the binary perceptron, the learning problem consists in mapping each of the random input patterns with (, , ) onto a randomly chosen output Both and are independent random variables which take the values with equal probability. For a given set of patterns, the volume of the subspace of the network weights which correctly classifies the patterns, the so called version space, is given by
[TABLE]
where we have imposed a spherical constraint on the weights via a Dirac in order to keep the volume finite (though exponential in . In the case of binary weights the integral would become a sum over all the configurations and the volume would be the overall number of zero error assignments of the weights.
The committee machine has been studied extensively in the 90s (Barkai et al., 1992; Schwarze and Hertz, 1992; Engel et al., 1992). The capacity of the network can be derived by computing the typical weight space volume as a function of the number of correctly classified patterns , in the large limit. As for the binary case, the most probable value of is obtained by exponentiating the average of , , a difficult task which is achieved by the replica method (Mézard et al., 1987; Barkai et al., 1990).
For the smallest non-trivial value of , it has been found that above the space of solutions changes abruptly, becoming clustered into multiple components111Strictly speaking each cluster is composed of a multitude of exponentially small domains (Monasson and Zecchina, 1995).. Below the geometrical structure is not clustered and can be described by the simplest version of the replica method, known as replica symmetric (RS) solution. Above the analytical computation of the typical volume requires a more sophisticated analysis that properly describes a clustered geometrical structure. This analysis can be performed by a variational technique which is known in statistical physics as the replica-symmetry-breaking (RSB) scheme, and the clustered geometrical structure of the solution space is known as RSB phase.
The capacity of the network, above which perfect classification becomes impossible, is found to be 3.02. In the limit of large (but still with ), the clustering transition occurs at a finite number of patterns per weight, (Barkai et al., 1992) whereas the critical capacity grows with as (Monasson and Zecchina, 1995).
II.4 The existence of wide flat minima
In order to detect the existence of WFM we use a large deviation measure which is the continuous version of the measure used in the discrete case: each configuration of the weights is re-weighted by a local volume term, analogously to the analysis of section II.2. For the continuous case, however, we adopt a slightly different formalism which simplifies the analysis. Instead of constraining the set of real replicas222Not to be confused with the virtual replicas of the replica method. to be at distance from a reference weight vector, we can identify the same WFM regions by constraining them directly to be at a given mutual overlap: for a given value , we impose that for all pairs of distinct replicas . The overlap is bijectively related to the mutual distance among replicas (which tends to [math] as ). That, in turn, determines the distance between each replica and the barycenter of the group , which takes the role that the reference vector had in the previous treatment. Thus, the regime of small corresponds to the regime of close to , and apart from this reparametrization the interpretation of the WFM volumes is the same. As explained in Appendix C, the advantage of this novel technique is that it allows to use directly the first-step formalism of the RSB scheme (1-RSB). Similarly to the discrete case, the computation of the maximal WFM volumes leads to the following results: for and in the large limit, we find333All the details are reported in Appendix C.
[TABLE]
with
[TABLE]
where , , , and satisfies a saddle point equation that needs to be solved numerically. is the logarithm of the volume of the solutions, normalized by , under the spherical constraints on the weights, and with the real replicas forced to be at a mutual overlap . In analogy with the discrete case, we still refer to as to the local entropy. It is composed by the sum of two terms: the first one, , corresponds to the log-volume at , where all configurations are solutions and only the geometric constraints are present. This is an upper bound for the local entropy. The second term, , is in general negative and it represents the log of the fraction of solutions at overlap (among all configurations at that overlap), and we call it normalized local entropy. Following the interpretation given above, we expect (in analogy to the discrete case at small , cf. fig. 3) that in a WFM this fraction is close to (i.e. is close to [math]) in an extended region where the distance between the replicas is small, i.e. where is close to ; otherwise, WFMs do not exist in the model. In fig.** 1** (top panel) we report the values of vs the overlap , for different values of Indeed, one may observe that the behavior is qualitatively similar to that of the binary perceptron: besides the solutions and all the related local minima and saddles predicted by the standard statistical physics analysis (Barkai et al., 1992; Schwarze and Hertz, 1992; Engel et al., 1992; Monasson and Zecchina, 1995), there exist absolute minima which are flat at relatively large distances. Indeed, reaching such wide minima efficiently is non trivial, and different algorithms can have drastically different behaviors, as we will discuss in detail in sec. V.
The case is still relatively close to the simple perceptron, though the geometrical structure of its minima is already dominated by non convex features for . A case which is closer to more realistic ANNs is (but still ), which, luckily enough, is easier to study analytically. We find:
[TABLE]
where with , and is fixed by a saddle point equation. In fig. 1 (bottom panel) we observe that indeed WFM still exist for all finite values of .
The results of the above WFM computation may require small corrections due to RSB effects, which however are expected to be very tiny due to the compact nature of the space of solutions at small distances.
A more informative aspect is to study the volumes around the solutions found by different algorithms. This can be done by the Belief Propagation method, similarly to the computation of the weight enumerator function in error correcting codes (Di et al., 2006).
II.5 Not all devices are appropriate: the Parity Machine does not display
HLE/WFM regions
The extent by which a given model exhibits the presence of WFM can vary (see e.g. fig. 1). A direct comparison of the local entropy curves on different models in general does not yet have a well-defined interpretation, although at least for similar architectures it can still be informative Baldassi et al. (2019). On the other hand, the existence of WFM itself is a structural property. For neural networks, its origin relies in the threshold sum form of the nonlinearity characterizing the formal neurons. As a check of this claim, we can analyze a model which is in some sense complementary, namely the so called parity machine. We take its network structure to be identical to the committee machine, except for the output unit which performs the product of the hidden units instead of taking a majority vote. While the outputs of the hidden units are still given by sign activations, eq. (14), the overall output of the network reads The volume of the weights that correctly classifies a set of patterns is still given by eq. (15).
Parity machines are closely related to error correcting codes based on parity checks. The geometrical structure of the absolute minima of the error loss function is known (Monasson and Zecchina, 1995) to be composed by multiple regions, each in one to one correspondence with the internal representations of the patterns. For random patterns such regions are typically tiny and we expect the WFM to be absent. Indeed, the computation of the volume proceeds analogously to the previous case444It’s actually even simpler, see Appendix C.1., and it shows that in this case for any distance the volumes of the minima are always bounded away from the maximal possible volume, i.e. the volume one would find for the same distance when no patterns are stored: the log-ratio of the two volumes is constant and equal to . In other words, the minima never become flat, at any distance scale.
II.6 The connection between Local Entropy and Cross-Entropy
Given that dense regions of optimal solutions exist in non convex ANN, at least in the case of independent random patterns, it remains to be seen which role they play in current models. Starting with the case of binary weights, and then generalizing the result to more complex architectures and to continuous weights, we can show that the most widely used loss function, the so called Cross-Entropy (CE) loss, focuses precisely on such rare regions (see Baldassi et al. (2018) for the case of stochastic weights).
For the sake of simplicity, we consider a binary classification task with one output unit. The CE cost function for each input pattern reads
[TABLE]
where . The parameter allows to control the degree of “robustness” of the training, see the inset in figure 2. In standard machine learning practice is simply set to , but a global rescaling of the weights can lead to a basically equivalent effect. That setting can thus be interpreted as leaving as implicit, letting its effective value, and hence the norm of the weights, to be determined by the initial conditions and the training algorithm. As we shall see, controlling explicitly along the learning process plays a crucial role in finding HLE/WFM regions.
For the binary case, however, the norm is fixed and thus we must keep as an explicit parameter. Note that since the minima of below at large are the solutions to the training problem, i.e. they coincide with those of .
We proceed by first showing that the minimizers of this loss correspond to near-zero errors for a wide range of values of , and then by showing that these minimizers are surrounded by an exponential number of zero error solutions.
In order to study the probability distribution of the minima of in the large limit, we need to compute its Gibbs distribution (in particular, the average of the of the partition function, see eq. (5)) as it has been done for the error loss . The procedure follows standard steps and it is detailed in Appendix B. The method requires to solve two coupled integral equations as functions of the control parameters , and . In figure 2 we show the behavior of the fraction of errors vs the loading , for various values of . Up to relatively large values of the optimum of corresponds to extremely small values of , virtually equal to zero for any accessible size
Having established that by minimizing the the CE one ends up in regions of perfect classification where the error loss function is essentially zero, it remains to be understood which type of configurations of weights are found. Does the CE converge to an isolated point-like solution in the weight space (such as the typical zero energy configurations of the error function)555A quite unlikely fact given that finding isolated solutions is a well known intractable problem. or does it converge to the rare regions of high local entropy?
In order to establish the geometrical nature of the typical minima of the CE loss, we need to compute the average value of (which tells us how many zero energy configurations of the error loss function can be found within a given radius from a given , see eq. (6)) when is sampled from the minima of . This can be accomplished by a well known analytical technique (Franz and Parisi, 1995) which was developed for the study of the energy landscape in disordered physical systems. The computation is relatively involved, and here we report only the final outcome. For the dedicated reader, all the details of the calculation, which relies on the replica method and includes a double analytic continuation, can be found in Appendix B. As reported in figure 3, we find that the typical minima of the CE loss for small finite are indeed surrounded by an exponential number of zero error solutions. In other words, the CE focuses on HLE regions.
It is clear from the figure that needs to be sufficiently large for this phenomenon to occur. On the other hand, in the limit the space of solutions is again dominated by the isolated ones; we thus expect the existence of an optimal value of , depending on the parameters and . We set and used two values of , and , and measured the normalized local entropy where is the upper bound corresponding to the case (gray curve in fig. 3). The results are shown in fig. 4. Numerical issues prevented us from reaching the optimal at (the main plot, in log scale, shows that the curve is still growing), and the left inset (same data as the main plot, but not in log scale) shows that there is a whole region where the local entropy is extremely close to optimal (what we would call a dense region). At a larger distance, , we could find the optimum (denoted with a dot) at a lower local entropy (consistently with fig. 3), but it is also clear that the “good” region of is rather large since the curve is very flat. In the limit of the curves would tend to and for the and cases, respectively.
The physical significance of these “optimal regions” and their relation with the local geometry of the landscape is not obvious. The main reason for this is that a fine description of the local geometry of the HLE regions is still an open problem – circumstantial evidence from theoretical considerations and numerical simulations suggests that they are rather complex structures (Baldassi et al., 2015), possibly with a multifractal nature. At the present time, this is pure speculation. On the practical side, however, a local search algorithm that explores the landscape at low and gradually increases it should eventually reach this “good ” plateau that leads it toward a HLE region; further increasing would then be unlikely to drive it out, unless very strong drift terms are present (Baldassi et al., 2016a).
As an algorithmic check we have verified that while a Simulated Annealing approach gets stuck at very high energies when trying to minimize the error loss function, the very same algorithm with the CE loss is indeed successful up to relatively high values of , with just slightly worse performance compared to an analogous procedure based on local entropy (Baldassi et al., 2016c). In other words, the CE loss on single-layer networks is a computationally cheap and reasonably good proxy for the LE loss. These analytical results extend straightforwardly to two layer networks with binary weights. The study of continuous weight models can be performed resorting to the Belief Propagation method.
III Belief Propagation and focusing Belief Propagation
Belief Propagation (BP), also known as sum-product, is an iterative message-passing algorithm for statistical inference. When applied to the problem of training a committee machine with a given set of input-output patterns, it can be used to obtain, at convergence, useful information on the probability distribution, over the weights of the network, induced by the Gibbs measure. In particular, it allows to compute the marginals of the weights as well as their entropy, which in the zero-temperature regime is simply the logarithm of the volume of the solutions, eq. (15), rescaled by the number of variables . The results are approximate, but (with high probability) they approach the correct value in the limit of large in the case of random uncorrelated inputs, at least in the replica-symmetric phase of the space of the parameters. Due to the concentration property, in this limit the macroscopic properties of any given problem (such as the entropy) tend to converge to a common limiting case, and therefore a limited amount of experiments with a few samples is sufficient to describe very well the entire statistical ensemble.
We have used BP to study the case of the zero-temperature tree-like committee machine with continuous weights and . We have mostly used , which turns out to be large enough to produce results in quite good agreement with the replica theory analysis. The implementation can be made efficient by encoding each message with only two quantities (see Appendix D). As mentioned above, this algorithm works well in the replica-symmetric phase, which for our case means when . Above this value, the (vanilla) algorithm doesn’t converge at all.
However, BP can be employed to perform additional analyses as well. In particular, it can be modified rather straightforwardly to explore and describe the region surrounding any given configuration, as it allows to compute the local entropy (i.e. the log-volume of the solutions) for any given distance and any reference configuration (this is a general technique, the details for our case are reported in Appendix D.2). The convergence issues are generally much less severe in this case. Even in the RSB phase, if the reference configuration is a solution in a wide minimum, the structure is locally replica-symmetric, and therefore the algorithm converges and provides accurate results, at least up to a value of the distance where other unconnected regions of the solutions space come into consideration. In our tests, the only other issue arose occasionally at very small distances, where convergence is instead prevented by the loss of accuracy stemming from finite size effects and limited numerical precision.
Additionally, the standard BP algorithm can be modified and transformed into a (very effective) solver. There are several ways to do this, most of which are purely heuristic. However, it was shown in Baldassi et al. (2016a) that adding a particular set of self-interactions to the weight variables could approximately but effectively describe the replicated system of eq. (12): in other words, this technique can be used to analyze the local-entropy landscape instead of the Gibbs one. By using a sufficiently large number of replicas (we generally used and following an annealing protocol in the coupling parameter (starting from a low value and making it diverge) this algorithm focuses on the maximally dense regions of solutions, thus ending up in wide flat minima. For these reasons, the algorithm was called “focusing-BP” (fBP). The implementation closely follows that of Baldassi et al. (2016a) (complete details are provided in Appendix D.3). Our tests – detailed below – show that this algorithm is the best solver (by a wide margin) among the several alternatives that we tried in terms of robustness of the minima found (and thus of generalization properties, as also discussed below). Moreover, it also achieves the highest capacity, nearly reaching the critical capacity where all solutions disappear.
IV Entropic Least Action Learning
Least Action Learning (LAL) (Mitchison and Durbin, 1989) is a heuristic greedy algorithm that was designed to extend the well-known Perceptron algorithm to the case of committee-machines with a single binary output and sign activation functions. It takes one parameter, the learning rate . In its original version, patterns are presented randomly one at a time, and at most one hidden unit is affected at a time. In case of correct output, nothing is done, while in case of error the hidden unit, among those with a wrong output, whose pre-activation was closest to the threshold (and is thus the easiest to fix) is selected, and the standard perceptron learning rule (with rate ) is applied to it. In our tests we simply extended it to work in mini-batches, to make it more directly comparable with stochastic-gradient-based algorithms: for a given mini-batch, we first compute all the pre-activations and the outputs for all patterns at the current value of the weights, then we apply the LAL learning rule for each pattern in turn.
This algorithm proves to be surprisingly effective at finding minima of the NE loss very quickly: in the random patterns case, its algorithmic capacity is higher than gradient-based variants and almost as high as fBP, and it requires comparatively few epochs. It is also computationally very fast, owing to its simplicity. However, as we show in the sec. V, it finds solutions that are much narrower compared to those of other algorithms.
In order to drive LAL toward WFM regions, we add a local-entropy component to it, by applying the technique described in Baldassi et al. (2016a) (see eq. (13)): we run replicas of the system in parallel and we couple them with an elastic interaction. The resulting algorithm, that we call entropic-LAL (eLAL) can be described as follows. We initialize replicas randomly with weights and compute their average . We present mini-batches independently to each replica, using different permutations of the dataset for each of them. At each mini-batch, we apply the LAL learning rule. Then, each replica is pushed toward the group average with some strength proportional to a parameter : more precisely, we add a term to each of the weight vectors . After this update, we recompute the average . At each epoch, we increase the interaction strength . The algorithm stops when the replicas have collapsed to a single configuration.
This simple scheme proves rather effective at enhancing the wideness of the minima found while still being computationally efficient and converging quickly, as we show in the sec. V. We show tests performed with , but the entropic enhancement is gradual, already providing improvements with very few replicas and progressing at least up to , which is the maximum value that we have tested. Its main limitation is, of course, that it is tailored to committee machines and it is unclear how to extend it to general architectures. On the other hand, the effectiveness of the replication scheme seems very promising from the point of view of improving the quality of the minima found by greedy and efficient heuristics.
V Numerical studies
We conclude our study by comparing numerically the curvature, the wideness of the minima and the generalization error found by different approaches. We consider two main scenarios: one, directly comparable with the theoretical calculations, where a tree committee machine with is trained over random binary patterns, and a second one, which allows us to estimate the generalization capabilities, where a fully-connected committee machine with is trained on a subset of the Fashion-MNIST dataset (Xiao et al., 2017). The choice of using instead of is intended to enhance the potential robustness effect that the CE loss can have over NE on such architectures (see the inset of fig. 2): for , a correctly classified pattern already requires out for units to give the correct answer, and there is not much room for improvement at the level of the pre-activation of the output unit. On the other hand, since we study networks with a number of inputs of the order of , an even larger value of would either make too small in the tree-like case (exacerbating numerical issues for the BP algorithms and straying too far from the theoretical analysis) or make the computation of the Hessians too onerous for the fully-connected case (each Hessian requiring the computation of terms).
We compare several training algorithms with different settings (see Appendix E): stochastic GD with the CE loss (ceSGD); least-action learning (LAL) and its entropic version (eLAL); focusing BP (fBP). Of these, the non-gradient based ones (LAL, eLAL and fBP) can be directly used with the sign activation functions (14) and the NE loss. On the other hand, ceSGD requires a smooth loss landscape, therefore we used activations, adding a gradually-diverging parameter in their argument, since . The parameter of the CE loss (16) was also increased gradually. As in the theoretical computation, we also constrained the weights of each hidden unit of the network to be normalized. The NE loss with sign activations is invariant under renormalization of each unit’s weights, whereas the CE loss with activations is not. In the latter case, the parameters and can be directly interpreted as the norm of the weights, since they just multiply the pre-activations of the units. In a more standard approach, the norm would be controlled by the initial choice of the weights and be driven by the SGD algorithm automatically. In our tests instead we have controlled these parameters explicitly, which allows us to demonstrate the effect of different schedules. In particular, we show (for both the random and the Fashion-MNIST scenarios) that slowing down the growth of the norm with ceSGD makes a significant difference in the quality of the minima that are reached. We do this by using two separate settings for ceSGD, a “fast” and a “slow” one. In ceSGD-fast both and are large from the onset and grow quickly, whereas in ceSGD-slow they start from small values and grow more slowly (requiring much more epochs for convergence).
In all cases – for uniformity of comparison, simplicity, and consistency with the theoretical analysis – we consider scenarios in which the training error (i.e. the NE loss) gets to zero. This is, by definition, the stopping condition for the LAL algorithm. We also used this as a stopping criterion for ceSGD in the “fast” setting. For the other algorithms, the stopping criterion was based on reaching a sufficiently small loss (ceSGD in the “slow” setting), or the collapse of the replicas (eLAL and fBP).
The analysis of the quality of the results was mainly based on the study of the local loss landscape at the solutions. On one hand, we computed the normalized local entropy using BP as described in a sec. D.2, which provides a description of the NE landscape. On the other hand, we also computed the spectrum of the eigenvalues of a smoothed-out version of the NE loss, namely the mean square error loss computed on networks with activations. This loss depends on the parameters of the activations: we set to be as small as possible (maximizing the smoothing and thereby measuring features of the landscape at a large scale) under the constraint that all the solutions under consideration were still corresponding to zero error (to prevent degrading the performance). For the Fashion-MNIST case, we also measured the generalization error of each solution, and the robustness to input noise.
In the random patterns scenario we set and , and tested samples (the same for all the algorithms). The results are presented in figs. 5 and 6. The two analyses allow to rank the algorithms (for the Hessians we can use the maximum eigenvalue as a reasonable metric) and their results are in agreement. As expected, fBP systematically finds very dense regions of solutions, qualitatively compatible with the theoretical analysis (cf. fig. 5 with fig. 1) and corresponding to the narrowest spectra of the Hessian at all ; the other algorithms follow roughly in the order eLAL, ceSGD-slow, ceSGD-fast, LAL. The latter is a very efficient solver for this model, but it finds solutions in very narrow regions. On the other hand, the same algorithm performed in parallel on a set of interacting replicas is still efficient but much better at discovering WFMs. These results are for replicas in eLAL, but our tests show that would be sufficient to match ceSGD-slow and that would further improve the results and get closer to fBP. Overall, the results of the random pattern case confirm the existence of WFMs in continuous networks, and suggest that a (properly normalized) Hessian spectrum can be used as a proxy for detecting whether an algorithm has found a WFM region.
We then studied the performance of ceSGD (fast and slow settings), LAL and eLAL on a small fully-connected network which learns to discriminate between two classes of the Fashion-MNIST dataset (we chose the classes Dress and Coat, which are rather challenging to tell apart but also sufficiently different to offer the opportunity to generalize even with a small simple network trained on very few examples). We trained our networks on a small subset of the available examples ( patterns; binarized to by using the median of each image as a threshold on the original grayscale inputs; we filtered both the training and test sets to only use images in which the median was between and as to avoid too-bright or too-dark images and make the data more uniform and more challenging). This setting is rather close to the one which we could study analytically, except for the patterns statistics and the use of fully-connected rather than tree-like layers, and it is small enough to permit computing the full spectrum of the Hessian. On the other hand, it poses a difficult task in terms of inference (even though finding solutions with zero training error is not hard), which allowed us to compare the results of the analysis of the loss landscape with the generalization capabilities on the test set. Each algorithm was ran times. The results are shown in fig. 7, and they are analogous to those for the random patterns case, but in this setting we can also observe that indeed WFMs tend to generalize better. Also, while we could not run fBP on this data due to the correlations present in the inputs and to numerical problems related to the fully-connected architecture, which hamper convergence, it is still the case that ceSGD can find WFMs if the norms are controlled and increased slowly enough, and that we can significantly improve the (very quick and greedy) LAL algorithm by replicating it, i.e. by effectively adding a local-entropic component.
We also performed an additional batch of tests on a randomized version of the Fashion-MNIST dataset, in which the inputs were reshuffled across samples on a pixel-by-pixel basis (such that each sample only retained each individual pixel bias while the correlations across pixels were lost). This allowed us to bridge the two scenarios and directly compare the local volumes in the presence or absence of features of the data that can lead to proper generalization. We kept the settings of each algorithm as close as possible to those for the Fashion-MNIST tests. Qualitatively, the results were quite similar to the ones on the original data except for a slight degradation of the performance of eLAL compared to ceSGD. Quantitatively, we observed that the randomized version was more challenging and generally resulted in slightly smaller volumes. Additional measures comparing the robustness to the presence of noise in the input (which measures overfitting and thus can be conceived as being a precursor of generalization) confirm the general picture. The detailed procedures and results are reported in Appendix F.
Conclusions and future directions
In this paper, we have generalized the local entropy theory to continuous weights and we have shown that WFM exists in non convex neural systems. We have also shown that the CE loss spontaneously focuses on WFM. On the algorithmic side we have derived and designed novel algorithmic schemes, either greedy (very fast) or message passing, which are driven by the local entropy measure. Moreover, we have shown numerically that ceSGD can be made to converge in WFM by an appropriate cooling procedure of the parameter which controls the norm of the weights. Our findings are in agreement with recent results showing that ReLU transfer functions also help the learning dynamics to focus on WFM (Baldassi et al., 2019). Future work will be aimed at extending our methods to multiple layers, trying to reach a unified framework for current DNN models. This is a quite challenging task which has the potential to reveal the role that WFM play for generalization in different data regimes and how that can be connected to the many layer architectures of DNN.
Acknowledgements.
CB and RZ acknowledge ONR Grant N00014-17-1-2569. We thank Leon Bottou for discussions.
Appendix A High Local Entropy states from the 1-RSB formalism
Given a system described by a vector of discrete variables with an associated energy function , the Boltzmann equilibrium distribution at inverse temperature reads
[TABLE]
where the normalization factor is given by the partition function
[TABLE]
In the limit , the distribution is just a flat measure over the ground states of the energy; we can denote the ground state energy as and the characteristic function over the ground states as
[TABLE]
such that and gives the entropy of the ground states.
In Baldassi et al. (2015), we introduced a large-deviation measure with a modified energy function in which each configuration is reweighted by a “local entropy” term. There, we only considered the limit and defined the local entropy as the number of ground states at a certain normalized distance from a reference configuration :
[TABLE]
where is a suitably defined distance function and is the Kronecker delta. With this definition, we can define a modified partition function as follows:
[TABLE]
which (up to irrelevant constant factors) coincides with:
[TABLE]
This approach can be shown to be strictly related to the 1-step replica-symmetry-broken (1RSB) formalism of the bare energetic problem. First, let us define a local free entropy at any given inverse temperature (i.e. a generalization of eq. (20)), and use a soft-constraint on the distance through a Lagrange multiplier :
[TABLE]
Note that the use of a Lagrange multiplier is mostly convenient in order to make the relation with the 1RSB description evident. It is only equivalent to using a hard constraint for the distance in the thermodynamic limit, and depending on the convexity properties of the function , but we will generally ignore these issues for the time being and come back to them later.
We can then rewrite the large-deviation partition function eq. (21) in this more general case as:
[TABLE]
Let us now consider the case in which : this allows us, by simple algebraic manipulations, to rewrite the partition function introducing a sum over all the configurations of replicas of the system:
[TABLE]
This partition function describes a system of interacting real replicas with an interaction which is mediated by the reference configuration . However we can isolate the sum over the configurations of to obtain a system of interacting real replicas. In the special case we obtain:
[TABLE]
We have stressed the fact that the replicas are real to avoid the confusion with the virtual replicas used for the “replica trick”: here, we are not replicating the system virtually in order to compute a free entropy in the limit of zero replicas: instead, we are describing a system of identical interacting objects. The general case of can be obtained by analytic continuation once an expression for all integer is found.
This description is highly reminiscent of – in fact, almost identical to – the derivation of the ergodicity-breaking scheme used in Monasson (1995): there, an auxiliary symmetry breaking field is introduced (having the same role of in our notation); then, a free energy expression is introduced in which the role of the energy is taken by a “local free entropy” (the analogous of eq. (20) for general ), after which the system is replicated times and the auxiliary field is traced out, leading to a system of real replicas with an effective pairwise interaction. Finally, the limit of vanishing interaction () is taken in order to derive the equilibrium description. When this system is studied in the replica-symmetric (RS) Ansatz, it results in the 1RSB description of the original system, with having the role of the Parisi parameter (usually denoted by ). Indeed, in this limit of vanishing interaction and for , our equations reduce to the 1RSB case as well.
Therefore, apart from minor differences, the main point of discrepancy between that analysis and our approach is that we don’t restrict ourselves to the equilibrium distribution. Instead, we explore the whole range of values of . In this context, we also have no reason to restrict ourselves to the range , as it is usually done in order to give a physical interpretation to the 1RSB solution; to the contrary, we are (mostly) interested in the limit of large , in which only the configurations of maximal local entropy are described.
The relationship between our analysis and the usual 1RSB case can be made even more direct, leading to an alternative – although with very similar results – large deviations analysis: consider, instead of eq. (26), a partition function in which the interaction among the replicas is pairwise (without the reference configuration ) and the constraint on the distance is hard (introduced via a Dirac delta function):
[TABLE]
Suppose then that we study the average free entropy (where represents the average over the quenched parameters, if any) in the context of replica theory. Then, we will have virtual replicas of the whole system, and since each system has real replicas we end up with total replicas. Let’s use indices , for the virtual replicas and , for the real ones, such that a configuration will now have two indices, e.g. . Suppose that we manage to manipulate the expression such that it becomes a function, among other order parameters, of the overlaps , where represents some inner product, and that the distance function can be expressed in terms of those. Then, as usual, we would introduce auxiliary integrals
[TABLE]
Using this, we can rewrite the interaction term. Say that , then:
[TABLE]
By assuming replica symmetry, we seek a saddle point with this structure:
[TABLE]
with . The interaction term eq. (29) becomes:
[TABLE]
Therefore, the external parameter eliminates a degree of freedom in the solution to the saddle point equations for the overlaps. The final step in the replica calculation would have the form
[TABLE]
where is the expression which would have been derived in an equilibrium computation without the interaction term, the dots in the argument represent extra order parameters, and the order parameters are fixed by the saddle point equations
[TABLE]
Thus, the difference with respect to the usual 1RSB computation is that the equation for finding the extremum over is removed, and the one for finding the extremum over is modified. Maximizing over , by solving for , is then equivalent to the usual 1RSB description (equivalent to the case in the soft-constraint case):
[TABLE]
In the common case where is fixed (e.g. if the variables are discrete, or constraints on the norm are introduced) then this representation fixes ; it is clear then that our large deviations analysis (the alternative one of eq. (27)) is simply derived by fixing as an external parameter, and thus omitting the saddle point equation . Note that this wouldn’t make physical sense in the standard derivation of the 1RSB equations, since in that context is only introduced as an overlap between virtual replicas when choosing an Ansatz for the solutions of the saddle point equations; our derivation is only physically meaningful when describing a system of real interacting replicas or, in the case of the original derivation from eq. (21), a system with a modified energy function.
Appendix B Cross-Entropy minima, errors and high local entropy
configurations
B.1 Cross-Entropy loss ground states
In order to study analytically the properties of the minima of the CE loss function in the case of i.i.d. random patterns, the key obstacle is to compute the normalization factor of the Gibbs measure, the partition function . Once this is done one has access to the the Gibbs measure which concentrates on the minima of the loss in the limit.
is an exponentially fluctuating random variable and in order to find its most probable values we need to average its logarithm, a complicated task which we perform by the replica method. Once this is done, the typical value of can be recovered by , where stands for the average over the random patterns.
We refer to Engel and Van den Broeck (2001) for a thorough review of the replica method. Here we just remind the reader that the replica method is an analytic continuation technique which allows in some cases (mean-field models) to compute the expectation of the logarithm of the partition function from the knowledge of its integer moments. The starting point is the following small expansion
[TABLE]
This identity may be averaged over the random patterns and gives the average of the from the averaged -th power of the partition function
[TABLE]
The idea of the replica method is to restrict to integer and to take the analytic continuation
[TABLE]
We have replicas of the initial model. The random patterns in the expression of the energy disappear once the average has been carried out. Eventually one computes the partition function of an effective system of variables with a non random energy function resulting from the average. The result may be written formally as
[TABLE]
where is the expression resulting from the sum over all configurations. Once the small limit is taken, the final expression can be estimated analytically by means of the saddle-point method given that is assumed to be large.
In the case of our problem we have
[TABLE]
Following the replica approach, we need to compute
[TABLE]
where the integration measure is just over the binary values of the weights. By enforcing through a delta function, we can linearize the dependence on the randomness of the patterns and perform the average as follows:
[TABLE]
where we have used the delta functions to introduce the order parameters and . In order to write the multiple integrals in a form which can be evaluated by saddle point, we restrict to the replica symmetric assumption and , and perform few simplifications.
First we sum over the weights:
[TABLE]
Second, we simplify the terms which are raised to the power :
[TABLE]
Finally we can write the saddle point expression for the replicated partition function:
[TABLE]
where it is useful to write the action as the sum of three terms
[TABLE]
The entropic contribution reads
[TABLE]
and the energetic one
[TABLE]
The replicated partition function can then be computed in the limit by solving the saddle point equations and . The derivatives of and read
[TABLE]
and
[TABLE]
Setting to zero these derivatives we get the saddle point equations for and
[TABLE]
In the limit of large we need to rescale the order parameters to obtain finite quantities. By setting , we find for the last equation
[TABLE]
Once the saddle point equations are solved numerically, we can compute the minimum energy (minimum loss) and the entropy at low temperature. We have:
[TABLE]
In the limit of large , with , we find
[TABLE]
where
[TABLE]
We can compute the entropy using the relation .
In figure 2 we show the behavior of the energy vs the loading . As one may observe, up to relatively large values of the energy is extremely small, virtually equal zero for any accessible size
Having established that by minimizing the the cross-entropy one ends up in regions of perfect classification where the error loss function is zero, it remains to be understood which type of configurations of weights are found. Does the CE converge to a typical zero energy configuration of the error function (i.e. an isolated point-like solution in the weight space) or does it converge to the rare regions of high local entropy?
The answer to this question is that the CE does in fact focus on the HLE subspaces.
B.2 The local entropy around CE ground states
In order to show this analytically we need to be able to count how many zero error configurations exist in a region close to a typical minima of the CE loss. Following Franz and Parisi (1995), this computation can be done by averaging with the CE Gibbs measure the entropy of the Number of Errors loss function.
Let’s call and the cross-entropy and error loss functions per pattern, respectively. We need to evaluate the most probable value of
[TABLE]
which can be computed by replica approach as we have done for , yielding the local entropy . In the above expression is the constrained overlap between the different minima, which is trivially related to the Hamming distance by . We will need to perform twice the replica trick, one to extract the most probable value of (index and one to linearize the inside the integral (index ),
[TABLE]
This quantity is computed for integer ( and ) and eventually the analytic continuation is taken. We will do this under the replica symmetric (RS) assumption which for small distances ** **is expected to by exact. For the sake of completeness, we report hereafter all the main steps of the calculation.
We need to compute
[TABLE]
The average over the patterns is factorized and can be easily performed. Upon expanding the results for large , the term in the brackets reads
[TABLE]
Introducing the order parameters corresponding to the different overlaps we find for the total expression
[TABLE]
In order to proceed, we search the solutions of the saddle point equations in the RS subspace, , , , etc. The various factors can be simplified as follows:
[TABLE]
and
[TABLE]
A series of further simplifications are needed in order to write the in the appropriate saddle point form. The terms containing the integrals over and become factorized
[TABLE]
where we have kept the notation just for the sake of clarity. Being an integration variable we now drop it. By summing over and v and with some straightforward change of variables we get
[TABLE]
For the integral containing the dependence on and we find similar simplifications.
[TABLE]
where we have used the notation
[TABLE]
Continuing the computation, we can linearize the terms in with an auxiliary integral, factor the terms with the index , take the limit and perform explicitly two integrals:
[TABLE]
where
[TABLE]
The local entropy is nothing but the total exponent for the saddle point equations, which after some additional changes of variables can eventually be written as
[TABLE]
where
[TABLE]
and
[TABLE]
where we have defined .
If we now take the limit and plug in the expression for the error loss function , we can eliminate one integral
[TABLE]
and the expression of simplifies to
[TABLE]
In order to compute the entropy for a given distance , we need to solve the saddle point equations with respect to with the values of and obtained by solving the equations for the CE loss function. The saddle point equations can be written as
[TABLE]
with
[TABLE]
where we have defined .
The results are reported in fig.** 3**. We may observe that the minima of the CE are indeed surrounded by an exponential number of zero error solutions. In other words, the CE focuses on HLE regions.
B.3 RS stability and zero entropy condition
In order to corroborate the validity of the RS solution, we need to check two necessary conditions: the entropy of the CE model is positive and the replica symmetric solution is stable. In other words we can focus on values of the parameters such that both conditions are met.
Following Engel and Van den Broeck (2001), the stability is verified if the following condition is met
[TABLE]
where and are the two eigenvalues of the Hessian matrix computed at the RS saddle point. We find
[TABLE]
and
[TABLE]
where the averages can be expressed through the quantity
[TABLE]
One finds:
[TABLE]
and similarly
[TABLE]
For each we can thus identify the values of and for which both the entropy is positive and the solution is stable. In particular, can be chosen to be quite large, corresponding to energies which are extremely small (RSB is expected to have relatively minor effects at zero temperature).
Appendix C Wide Flat Minima for the continuous case
Following Barkai et al. (1992) and the technique described in sec. A we may analyze the existence of WFM by studying the 1-RSB saddle point equations, with and (usually called in the 1-RSB context) used as control parameters.
The computation of the average of over the patterns by the replica methods leads to the following saddle point expression in the large limit
[TABLE]
where
[TABLE]
Given that the distribution of the input patterns is the same for each hidden unit, averages are expected to be independent of and the dependency on of the order parameters can be dropped . In 1-RSB scheme, once the conjugate order parameters are integrated out, the expressions for and read
[TABLE]
where and is a complicated function of the order parameters which for reads
[TABLE]
with
[TABLE]
For our WFM computation, is the constrained overlap between the weight vectors of the real replicas. is the only parameter for which we have to solve the saddle point equation. In order to look for the WFM of maximum volume we are interested in the large limit. In this case the expressions simplify substantially
[TABLE]
where and . The pre-factor of has been eliminated by a change of variables (with then renamed ).
Notice that corresponds to the volume at , i.e. to the volume of the weight space just under the spherical constraints, with the real replicas forced to be at an overlap . If WFM exist for positive , we expect to observe the normalized local entropy to approach [math] for sufficiently close to one.
In fig.** 1** (top panel) we report the values of the WFM volumes vs the overlap for different values of Indeed one may observe that the behavior is qualitatively similar to that of the binary perceptron, WFM exist deep into the RSB region, i.e. besides the RSB states and all the related local minima and saddles, there exist absolute minima which are flat at large distances. We mention that evaluating the inside the integral is a quite challenging task as the function to be maximized may present multiple maxima. We have tackled this problem by an appropriate sampling technique.
The case is still relatively close to the perceptron, though the geometrical structure of its minima is already dominated by non convex RSB features for . A case which is closer to more realistic NN is , which, luckily enough, is easier to study analytically (Barkai et al., 1992).
The 1-RSB expression for , simplifies to
[TABLE]
where , and . While the critical capacity diverges with as (Monasson and Zecchina, 1995), the value of at which RSB sets in and the landscape of the minima becomes non trivial remains finite, .
As we have done for the we study the large limit. We find
[TABLE]
In fig.** 1** (bottom panel) we observe that WFM are indeed still present.
In order to check the validity of the WFM computation one should check for the stability of the solutions of the saddle point equations by a stability analysis or a 2-RSB computation. For numerical reasons this is a quite difficult task. We thus decided to follow a different path which has also algorithmic interest, namely to study the problem by Belief Propagation.
C.1 Large deviation analysis of the Parity Machine
Here we report the results of the large deviation analysis on the so-called parity machine. Its network structure is identical to the committee machine, except for the output unit which performs the product of the hidden units instead of taking a majority vote. The outputs of the hidden units are still given by sign activations. Thus, the overall output of the network reads:
[TABLE]
For a given set of patterns, the volume of the weights which correctly classifies the patterns is then given by
[TABLE]
The computation proceeds as for the committee machine case, until we find the following expressions for the 1-RSB volume:
[TABLE]
where and where and is fixed by a saddle point equation. The sum over the internal states be computed for general leading to the following final expression for :
[TABLE]
where . In the large limit, converges rapidly to zero and the expression for simplifies. We can thus compute the volume by optimizing over for arbitrary and , and compare it to the volume that one would find for the same distance when no patterns are stored: the log-ratio of the two volumes is constant and equal to . This shows that the minima never become flat, at any distance scale.
Appendix D Belief Propagation on a Tree-like committee machine
with continuous weights: equations, local volume around solutions and algorithms for the replicated network
D.1 BP equations for the committee machine
We can use Belief Propagation (see e.g. Krzakala et al. (2016); Braunstein and Zecchina (2006); Baldassi et al. (2016a)) to study the space of the solutions of a tree-like committee machine with continuous weights and random inputs (the outputs can either be random or generated from a rule). The messages in this case are probability density distributions over , and we will need to ensure normalization by using an additional constraints over the norm of the weights vectors.
The basic factor graph is thus composed of continuous variable nodes , divided into groups of variables each; for each pattern, we will have factor nodes, each one involving one group of variables and an auxiliary binary output variable (with two indices, one for the hidden unit and one for the pattern), and another factor node connected to the variables and controlling the final output. We will enforce the normalization of the weights by adding an extra field for each variable , as explained below.
As a general notation scheme, we will use the letter to denote messages from variable nodes to factor nodes, the letter for messages from factor nodes to variable nodes, and the letter for non-cavity marginals over variables.
Let’s then call the cavity message from variable (representing a weight with hidden unit index k$$\in\left\{1,\dots,K\right\}, weight index , whose value is ) to the factor node (representing the part of an input pattern which involves the hidden unit ). The BP equation reads:
[TABLE]
where represents a message from another pattern node to the variable , while is an external field enforcing the normalization constraint, which is the same for all variables:
[TABLE]
This method of enforcing normalization is equivalent to using a Dirac delta in the limit of large , but it’s easier to implement in the BP algorithm. In order to set the parameter we will need to evaluate the norm of the weights at convergence (detailed below) and adjust until the norm matches the desired value, .
The other messages read:
[TABLE]
where the message goes from the auxiliary output variable to node . Notice that the messages are assumed to be normalized, but the messages aren’t, because the integral of expression (37) might diverge.
In the limit of large , eq. (37) can be approximated using the central limit theorem, since we’re assuming that the input pattern entries are random i.i.d. variables. Therefore, we don’t need the full distributions to perform the integral, only their first and second moments. For simplicity of implementation, though, instead of the moments it is actually more convenient to use the inverse of the variance and the mean rescaled by the variance, which we will denote with and , respectively. We will use the following notation:
[TABLE]
With these, we can compute the two auxiliary quantities
[TABLE]
which can be simplified by noting that . We can now rewrite eq. (37) using the central limit theorem:
[TABLE]
where in the last line we have introduced the shorthand notation
[TABLE]
because we can represent any message over a binary variable with a single quantity (in this case, the “magnetization” often employed in spin glass literature) and therefore we abuse the notation and identify that message with a single parameter. The context and the presence or absence of the argument will suffice to disambiguate the notation.
In the limit of large , we observe that and , and therefore , but since we are assuming due to the normalization constraint. We can therefore expand to the second order, obtaining (up to an irrelevant factor):
[TABLE]
where
[TABLE]
Now we can substitute this into eq. (35):
[TABLE]
where is a normalization constant. Expanding this to the second order we finally obtain:
[TABLE]
and thus we obtain the following simple expression for the parameters of the distribution eqs. (38)-(39):
[TABLE]
Also, the normalization constant is:
[TABLE]
One can immediately write the corresponding formulas for the non-cavity parameters of the marginal :
[TABLE]
This form of the equations shows that we can, in fact, parametrize all the distributions as Gaussian, and also that the updates in the equations can be performed efficiently for each node by keeping in memory the non-cavity versions of the parameters and and the parameters and , such that updating is only a matter of subtracting the previous value of and adding the new one.
We can also note that the computation of in eqs. (44)-(45) can be made computationally more efficient by writing and in terms of their non-cavity counterparts
[TABLE]
and expanding. However, in practice this further approximation does not work well at large values of when using the Focusing-BP protocol (detailed below, sec. D.3) and at moderate values of : by avoiding it, we can find solutions at slightly larger values of (e.g. we can reach instead of at , ). Relatedly, it is also slightly problematic at large values of the polarization field when exploring the space of configurations (see below, sec. D.2). In general terms, the problems arise when the fields become very polarized and the normalization constants become very small. Therefore, in our implementation this approximation is optional.
In order to set the normalization parameter , we simply compute the following quantity at the end of each iteration
[TABLE]
and adjust so that this becomes . Of course, we also add the criterion that this adjustment needs to be sufficiently small in evaluating whether the BP equations have converged.
The remaining BP equations involve the nodes connecting the auxiliary variables and enforcing the desired outputs from the committee:
[TABLE]
where is the desired output for pattern . For small , we compute exactly (i.e. without using the central limit theorem), which can be done in time.
Equations (40), (41), (44), (45), (47), (48), (53), (54), (56) and (57), together with the normalization requirement obtained through eq. (55), form the full set of BP equations.
The free entropy (also sometimes known as action) of the system can be computed from the usual BP formulas. We get:
[TABLE]
where:
[TABLE]
From this, we can compute the entropy by simply accounting for the energetic contribution introduced by the normalization constraint. We also shift it by subtracting the log-volume of the normalized sphere, so that its value is upper bounded by [math]
[TABLE]
D.2 Exploring the space of solutions around a given configuration
Given a particular configuration , which we assume normalized as , we are interested in exploring the space of solutions at a given distance from it (for a suitable definition of the distance). Analogously to the norm, in the limit of large we can control the distance by just adding an extra field to each node (putting it as an extra factor in eq. (35)), of the form:
[TABLE]
By varying the auxiliary parameter between [math] and we can restrict ourselves to smaller and smaller regions around . Adding this extra field in practice just amounts at adding two terms and to the expressions of and , respectively (eqs. (47)-(48)).
For convenience, we actually abuse the terminology and use a squared distance in our definition:
[TABLE]
which can be computed from the BP messages at convergence, as follows. First define the auxiliary quantities (representing the cavity variance and mean of each variable without the field):
[TABLE]
Then the expression of the average distance reads:
[TABLE]
The free entropy of the system can be written as before, eq. (58), but now in the computation of the entropy we also need to account for the energy of the extra field:
[TABLE]
By varying and using eqs. (66) and (67), we can obtain a plot of the local entropy as a function of the distance around any given configuration , as long as the BP equations converge. As a general rule of thumb, the equations don’t converge when is too low in the 1-RSB phase, and when is too large and is not a solution. The equations do converge however even in the 1-RSB phase for large enough if is a solution, which can be understood as the external field breaking the symmetry. When is not a solution, going to the limit eventually restricts the BP equations to a region of the configuration space without solutions, leading to non-normalizable messages (this could of course be amended e.g. by just working at non-zero temperature). Besides these situations, other numerical problems may arise under certain circumstances when is not large enough, due to the approximations in the messages.
As a consistency check, it can be verified numerically that formula (67) yields the expected result for the case, .
D.3 Focusing-BP
In order to implement the Focusing-BP protocol we need to add a new type of node to each variable. These nodes do not directly represent energy terms in the usual sense, and therefore they are not factor nodes; rather, their role is that of effectively representing an interaction of identical replicas of the original system with an extra auxiliary configuration . The derivation (in a discrete setting) can be found in Baldassi et al. (2016a).
We denote as the new extra field to be multiplied in eq. (35). This field, like all others over the variables, is parametrized by two quantities and which get added to and (eqs. (47) and (48)).The update equation is rather involved, but it can be simplified by breaking it down in steps and adopting a new notation for messages composition, leading to this expression:
[TABLE]
where the notation is as follows:
- •
represents (intuitively speaking) the effect of passing a message through a Gaussian interaction with strength . In formulas:
[TABLE]
- •
represents the “composition” of two messages, in formulas:
[TABLE]
which, using the internal representation in terms of rescaled mean and inverse variance for the , and the corresponding quantities for the fields, simply translates to:
[TABLE]
- •
with integer represents the composition of with itself (i.e. ) times. This can be trivially extended to non-integer and yields:
[TABLE]
- •
is a normalization field, required for normalization of the variables analogously to the field for the variables; analogously to that field, it is parametrized with a single parameter and can be represented as , (cf. equations (35), (36) and (47)).
- •
is the cavity field from a variable to the corresponding variable , which can be defined from the relation .
Therefore formula (68) can be intuitively understood as follows: the cavity messages coming from the replicated variable nodes ( are passed through their interaction (with strength with the corresponding variable ; the resulting messages are composed together (there are identical messages) and with the normalization field ; the resulting cavity message is again passed through the interaction to yield the field .
The parameter must be set to a value that normalizes the variables; the norm of the variables can be obtained from their marginals with the same formula used for the , eq. (55); the marginals can be obtained with this formula:
[TABLE]
Appendix E Numerical experiments details
Here we provide the details and the settings used for the numerical experiments reported in sec. V.
In all the experiments and for all algorithms except fBP we have used a mini-batch size of . The mini-batches were generated by randomly shuffling the datasets and splitting them at each epoch. For eLAL, the permutations were performed independently for each replica. Also, for all algorithms except fBP the the weights were initialized from a uniform distribution and then normalized for each unit. The learning rate was kept fixed throughout the training. The parameters and for the ceSGD algorithm were initialized at some values , and multiplied by , after each epoch. Analogously, the parameter for the eLAL algorithm was initialized to and multiplied by after each epoch. The parameter for the fBP algorithm ranged in all cases between and with an exponential schedule divided into steps; at each step, the algorithm was run until convergence or at most iterations. We used for eLAL and for fBP. The stopping criterion for ceSGD-fast was that a solution (0 errors with ) was found; for ceSGD-slow, that the CE loss reached ; for eLAL, that the sum of the squared distances between each replica and the average replica reached . We also report here the average and standard deviation of the number of epochs for each algorithm.
Parameters for the case of random patterns. ceSGD-fast: , , , , (). ceSGD-slow: , , , , (). LAL: (). eLAL: , , ().
Parameters for the Fashion-MNIST experiments. ceSGD-fast: , , , , (). ceSGD-slow: , , , , (). LAL: (). eLAL: , , ().
Appendix F Experiments with randomized Fashion-MNIST
Our protocol to produce a randomized versions of the Fashion-MNIST dataset was as follows. Starting with the binarized, two-class dataset described in the main text, new input patterns were derived from the original ones as where is a random permutation, different for each (i.e. we shuffled each pixel across samples). In this way, each pixel has the same bias as in the original dataset, , but the (connected) correlations are destroyed, , and furthermore the patterns no longer carry information about the target label , so that no generalization is possible. As a result, the randomized patterns carry more information per pixel that needs to be stored by the device, which in turn can be expected to make the learning problem harder.
We produced such shuffled datasets and performed tests on each with the same algorithms as for the Fashion-MNIST tests, using the same parameters when possible. The exception was eLAL, which we had to tweak slightly to avoid divergencies: we used , , .
We directly compared the results with those obtained on the original dataset. Fig. A1 shows that the maximum eigenvalues (cf. fig. 7, bottom panel) hardly change between the two tests, with only a slight degradation for the eLAL algorithm.
Fig. A2 shows the volumes around the solutions, computed with the BP algorithm; this is the analogous of fig. 7, top panel. We also kept the same scale for easiness of comparison. Again, we observe a slight degradation of the eLAL algorithm compared to ceSGD, albeit only at short distances. It is still the case that LAL is by far the worst algorithm, and that ceSGD slow is better than ceSGD fast. The volumes are overall smaller than for the original dataset.
We also performed an additional experiment on both datasets, measuring the robustness towards the presence of noise in the input. For each input image, the noise was added by replacing a randomly selected fraction of pixels with random binary values. Each new value of a pixel was extracted with the same bias observed in the original dataset for that pixel, : in this way, the new corrupted images were still rather close to the original input distribution (although the correlation with the desired output label, and the internal correlations between pixels, were degraded) and the networks would be able to operate in the same regime in which they were trained. This model of noise is intended to provide a rough proxy for the generalization capabilities of the networks without the need for a validation set, since it measures the amount of overfitting within a manifold which should approximate the distribution of the data to be classified. The results are shown in fig. A3, and they are fully consistent with the picture emerging from the study of the local volumes.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Baldassi (2009) Carlo Baldassi. Generalization learning in a perceptron with binary synapses. Journal of Statistical Physics , 136(5):902–916, 2009. doi:10.1007/s 10955-009-9822-1 . · doi ↗
- 2Baldassi et al. (2007) Carlo Baldassi, Alfredo Braunstein, Nicolas Brunel, and Riccardo Zecchina. Efficient supervised learning in networks with binary synapses. Proceedings of the National Academy of Sciences of the United States of America , 104(26):11079–1084, 2007. doi:10.1073/pnas.0700324104 . · doi ↗
- 3Baldassi et al. (2015) Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Subdominant Dense Clusters Allow for Simple Learning and High Computational Performance in Neural Networks with Discrete Synapses. Physical Review Letters , 115(12):128101, September 2015. ISSN 0031-9007, 1079-7114. doi:10.1103/Phys Rev Lett.115.128101 . · doi ↗
- 4Baldassi et al. (2016 a) Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences , 113(48):E 7655–E 7662, November 2016 a. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1608103113 . · doi ↗
- 5Baldassi et al. (2016 b) Carlo Baldassi, Federica Gerace, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Learning may need only a few bits of synaptic precision. Physical Review E , 93(5):052313, May 2016 b. ISSN 2470-0045, 2470-0053. doi:10.1103/Phys Rev E.93.052313 . · doi ↗
- 6Baldassi et al. (2016 c) Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Local entropy as a measure for sampling solutions in constraint satisfaction problems. Journal of Statistical Mechanics: Theory and Experiment , 2016(2):P 023301, February 2016 c. ISSN 1742-5468. doi:10.1088/1742-5468/2016/02/023301 . · doi ↗
- 7Baldassi et al. (2018) Carlo Baldassi, Federica Gerace, Hilbert J Kappen, Carlo Lucibello, Luca Saglietti, Enzo Tartaglione, and Riccardo Zecchina. Role of synaptic stochasticity in training low-precision neural networks. Physical review letters , 120(26):268103, 2018. doi:10.1103/Phys Rev Lett.120.268103 . · doi ↗
- 8Baldassi et al. (2019) Carlo Baldassi, Enrico M Malatesta, and Riccardo Zecchina. Properties of the geometry of solutions and capacity of multilayer neural networks with rectified linear unit activations. Physical Review Letters , 123(17):170602, 2019. doi:10.1103/Phys Rev Lett.123.170602 . · doi ↗
