The peculiar statistical mechanics of Optimal Learning Machines

Matteo Marsili

arXiv:1904.09144·physics.data-an·January 29, 2020


TL;DR

This paper studies the statistical mechanics of Optimal Learning Machines (OLM) within a family of systems with stretched exponential energy distributions. OLM, which have an exponential energy distribution, sit exactly at the boundary between two regimes with different freezing phase transitions, with implications for learnability and predictability.

Contribution

It introduces a theoretical framework linking OLM properties to a specific energy distribution exponent, explaining their independence from environment size and phase transition behavior.

Findings

01

OLM correspond to a critical case with exponential energy distribution.

02

Systems with different energy distribution exponents exhibit distinct phase transition types.

03

OLM's behavior remains stable regardless of environment size, indicating efficient representation.



Equations

$\mathcal{A}=\left\{\vec{x}:\ \left|-\frac{1}{n}\log p(\vec{x})-h_{0}\right|<\epsilon\right\},$

$s:\ \vec{x}\to s(\vec{x})\in\mathcal{S}$

$-\log p(\vec{x}|s)\simeq u_{s}\equiv-\sum_{\vec{x}}p(\vec{x}|s)\log p(\vec{x}|s),$

$H[s]=-\sum_{s}p(s)\log p(s),$

$H[s|u]=\sum_{s}p(s)\log W(u_{s})$

$p(s)\equiv\sum_{\vec{x}:\,s(\vec{x})=s}p(\vec{x})=\frac{1}{Z}e^{u_{s}}$

$H[u]=H[s]-H[s|u]$

$W(u)=W_{0}e^{-\nu u},$

$(s^{*},t^{*})=\arg\max_{(s,t)}U(s,t)$

$U(s,t)=u_{s}+v_{t|s}.$

$|\{s:\ u_{s}>u\}|=2^{n}e^{-(u/\Delta)^{\gamma}},\qquad u>0.$

$u_{s}=u_{0}\left[1+\frac{d}{n}\log_{2}\frac{d}{n}+\left(1-\frac{d}{n}\right)\log_{2}\left(1-\frac{d}{n}\right)\right]^{1/\gamma},\qquad d=|s-s_{0}|.$

$P\{v_{t|s}\geq x\}=e^{-x^{\gamma}},\qquad\gamma>0,$

$p(s|\mathbf{u})=P\{s^{*}=s|\mathbf{u}\}$

$H[s]=-\sum_{s}p(s|\mathbf{u})\log p(s|\mathbf{u})$

$\max_{(s,t)}U(s,t)$

$\cong$

$\beta_{m}=\gamma(m\log 2)^{1-1/\gamma}.$

$p(s|\mathbf{u})=\frac{1}{Z}e^{\beta_{m}u_{s}},\qquad Z=\sum_{s}e^{\beta_{m}u_{s}},$

$p_{0}(q)=\sum_{s}p_{0}(s)\delta(q-q_{s})$

$p_{\rm new}(s)=\frac{1}{Z(g)}p_{0}(s)e^{gq_{s}},\qquad Z(g)=\int dq\,p_{0}(q)e^{gq}$

$\beta_{m}u_{s}=gq_{s}+g^{\prime}q_{s}^{\prime}+g^{\prime\prime}q_{s}^{\prime\prime}+\ldots$

$\beta_{m}u_{s}=n(\log 2)\gamma\mu^{1-1/\gamma}\nu_{s}$

$\frac{m}{n}=\mu<\mu_{c}=\Delta^{-\gamma/(\gamma-1)},\qquad(\gamma>1)$

$H[s]\equiv\Sigma(u^{*})=n\left(1-\frac{\mu}{\mu_{c}}\right)\log 2$

$\frac{m}{n}=\mu>\mu_{c}=(\gamma\Delta)^{\gamma/(1-\gamma)}\qquad(\gamma<1).$

$C_{v}=\langle(u_{s}-\langle u_{s}\rangle)^{2}\rangle$

$U(s,t,z,\ldots)=u_{s}+v_{t|s}+y_{z|t,s}+\ldots$

$u_{s}$

$v_{t|s}$

$y_{z|t,s}$

$s^{*}=\arg\max_{s}\left\{u_{s}+\max_{t}\left[v_{t|s}+\max_{z}\left(y_{z|t,s}+\ldots\right)\right]\right\}$


Full text


The Abdus Salam International Center for Theoretical Physics, Strada Costiera 11, 34151 Trieste, Italy

Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Trieste, Italy

Abstract

Optimal Learning Machines (OLM) are systems that extract maximally informative representation of the environment they are in contact with, or of the data they are presented. It has recently been suggested that these systems are characterised by an exponential distribution of energy levels. In order to understand the peculiar properties of OLM within a broader framework, I consider an ensemble of optimisation problems over functions of many variables, part of which describe a sub-system and the rest account for its interaction with a random environment. The number of states of the sub-system with a given value of the objective function obeys a stretched exponential distribution, with exponent $\gamma$ , and the interaction part is drawn at random from the same distribution, independently for each configuration of the whole system. Systems with $\gamma=1$ then correspond to OLM, and we find that they sit at the boundary between two regions with markedly different properties. For all $\gamma>0$ the system exhibits a freezing phase transition. The transition is discontinuous for $\gamma<1$ and it is continuous for $\gamma>1$ . The region $\gamma>1$ corresponds to learnable energy landscapes and the behaviour of the sub-system becomes predictable as the size of the environment exceeds a critical threshold. For $\gamma<1$ , instead, the energy landscape is unlearnable and the behaviour of the system becomes more and more unpredictable as the size of the environment increases. Sub-systems with $\gamma=1$ (OLM) feature a behaviour which is independent of the relative size of the environment. This is consistent with the expectation that efficient representations should be largely independent of the level of detail of the description of the environment.

1 Introduction

Living systems rely in many ways on the efficiency of the internal representation they form of their environment [1, 2]. For example, in order for a bacterium to respond to challenges, it has to encode a representation of the environment in its internal state. This suggests that the metabolism or the gene regulatory network can be regarded as learning machines that have evolved to perform tasks not so dissimilar from pattern recognition in artificial intelligence (e.g. deep neural networks).

Here we focus on a particular ideal limit, which we call optimal learning machines (OLM). These are machines that extract representations that are maximally informative on the generative process of the states of the environment or of the data. It has been shown [3] that OLM, so defined, are characterised by an exponential distribution of energy levels, independently of architectural details or of the nature of what is represented. This implies a linear dependence of the entropy on the energy, $S(E)=\nu E+S_{0}$ (footnote 1: the entropy here is defined as the logarithm of the number of energy levels at energy $E$). This prediction can be tested empirically since it implies statistical criticality in a finite sample, as shown in Refs. [4, 5]. This phenomenon amounts to the observation of broad frequency distributions, i.e. that the number of states observed $k$ times in the sample behaves as $m_{k}\sim k^{-\nu-1}$. Statistical criticality is ubiquitous in empirical data of natural systems that supposedly express efficient representations (see e.g. [5, 6, 7, 8]) as well as in efficient representations in statistical learning [9, 10, 11]. The parameter $\nu$ gauges the trade-off between signal and noise, and Ref. [3] shows that the point $\nu=1$ corresponds to the most compressed lossless representation. In a finite sample, the case $\nu=1$ corresponds to Zipf’s law [12, 4] (footnote 2: Zipf’s law is the observation that the frequency of the $r^{\rm th}$ most frequent outcome in a dataset scales as $1/r$, or that the number of outcomes observed $k$ times behaves as $m_{k}\sim k^{-2}$), which is observed e.g. in language [6], neural coding [7] and the immune system [8]. In deep neural networks, Ref. [9] shows that layers with $\nu\approx 1$ are those that best reproduce the statistics of the training sample. This lends some support to the idea that biological systems and machine learning operate close to the ideal limit of OLM.

This evidence suggests that understanding the properties of systems with exponential energy density may shed light on learning machines both in artificial intelligence and in Nature [1, 2]. The goal of the present paper is to reveal the peculiar properties of systems with exponential energy density within a wider class of systems. This is done by studying systems with a stretched exponential density of states, which interpolates between OLM and more familiar physical systems, such as the Random Energy Model (REM) [13].

We focus on a generic model, introduced in [14], of a system that optimises a complex function over a large number of variables. The system is composed of a sub-system and its environment. The components of the objective function of the sub-system and of its interaction with the environment obey a stretched exponential distribution with exponent $\gamma>0$. The case $\gamma=2$ coincides with the REM whereas the case $\gamma=1$ describes efficient representations. A well defined thermodynamic limit exists for all values of $\gamma$ when the size of the system diverges with a fixed ratio $\mu$ between the sizes of the environment and of the sub-system. For all values of $\gamma$ the model is described by a Gibbs distribution over the states of the sub-system, which corresponds to a generalised REM with stretched exponential distributions. As shown in Ref. [15], this model exhibits a freezing phase transition as the strength $\Delta$ of the interactions in the sub-system varies (see Fig. 1). Yet the nature of the phase transition differs substantially depending on whether $\gamma<1$ or $\gamma>1$ [15]. The regime $\gamma>1$ is characterised by a continuous transition and a disordered region that shrinks as the relative size of the environment increases. For $\gamma<1$, instead, the phase transition is sharp (first order), with a disordered region that gets larger for bigger environments. Systems with an exponential distribution (i.e. $\gamma=1$) therefore have a very peculiar behaviour, because they are located exactly at the boundary between these two regions. The freezing phase transition for $\gamma=1$ occurs at a critical point that is independent of the size of the environment. This is suggestive for OLM, whose internal state should not depend on the degree of detail in the description of the environment. Furthermore, OLM exhibit Zipf’s law exactly at the phase transition. This is the only point, in the whole phase diagram, where the (analogue of the) specific heat diverges, as a consequence of the appearance of a broad free energy minimum (i.e. a wide distribution of energies). Hence, Zipf’s law is a unique feature of systems at $\gamma=\Delta=1$ within the phase diagram of Fig. 1.

The next section reviews the derivation of the exponential density of states for OLM. The following one introduces the problem and discusses its properties whereas Section 4 derives the thermodynamic description. General remarks are drawn in the final section.

2 The Density of States of Optimal Learning Machines

For completeness, this section provides a self-contained derivation of the exponential density for optimal learning machines, in the context of the present paper. Imagine a data generating process $p(\vec{x})$ , where $\vec{x}\in\mathbb{R}^{n}$ is a very high-dimensional vector ( $n\gg 1$ ). Examples of possible systems are digital pictures, where $\vec{x}$ specifies the intensity of the different pixels; the time series of a stock, where each component of $\vec{x}$ is the return of the stock in a particular day; the neural activity of a population of neurons in a particular region of the brain in a particular time interval, where $\vec{x}$ specifies the activity of each neuron, etc.

We assume that the entropy $H[\vec{x}]=-\sum_{\vec{x}}p(\vec{x})\log p(\vec{x})$ is proportional to $n$ , so that $H[\vec{x}]/n\simeq h_{0}$ is finite. We also assume that $\vec{x}$ satisfies the Asymptotic Equipartition Property (AEP) [16]. This states that points $\vec{x}$ drawn from $p(\vec{x})$ almost surely belong to the typical set

$\mathcal{A}=\left\{\vec{x}:\ \left|-\frac{1}{n}\log p(\vec{x})-h_{0}\right|<\epsilon\right\},$

(1)

for any $\epsilon>0$, in the limit of large $n$. This implies that all typical points $\vec{x}$ have the same probability $p(\vec{x})\simeq e^{-nh_{0}}$, to leading order. Still, $p(\vec{x})$ contains information on the statistical dependencies that we aim at representing.

A representation is a function of the data

$s:\ \vec{x}\to s(\vec{x})\in\mathcal{S}$

(2)

with $|\mathcal{S}|<+\infty$ . The first requirement of an efficient representation is that, upon conditioning on $s$ , $\vec{x}$ should contain only irrelevant details. If this is true, the AEP should apply to data generated from $p(\vec{x}|s)$ , i.e.

$-\log p(\vec{x}|s)\simeq u_{s}\equiv-\sum_{\vec{x}}p(\vec{x}|s)\log p(\vec{x}|s),$

(3)

for all $\vec{x}\in\mathcal{A}$ such that $s(\vec{x})=s$. Notice that Eq. (3) holds exactly when $s(\vec{x})$ provides a complete description of the distribution, as in the case where $p(\vec{x})=F[s(\vec{x})]$. The AEP identifies the variable $u_{s}$ in Eq. (3) as the natural coordinate for distinguishing noise (i.e. irrelevant details) from relevant details. Two points $\vec{x}$ and $\vec{x}^{\prime}$ with $u_{s(\vec{x})}\neq u_{s(\vec{x}^{\prime})}$ cannot belong to the same typical set and hence should differ by relevant details. Instead, if $u_{s(\vec{x})}=u_{s(\vec{x}^{\prime})}$, the difference between two points $\vec{x}$ and $\vec{x}^{\prime}$ can be attributed to noise, even if $s(\vec{x})\neq s(\vec{x}^{\prime})$. If there are $W(u)$ configurations $s$ of the representation with $u_{s}=u$, then the entropy $\log W(u)$ measures the amount of information the representation $s$ is unable to untangle. More precisely, of the total information content

$H[s]=-\sum_{s}p(s)\log p(s),$

(4)

the part

$H[s|u]=\sum_{s}p(s)\log W(u_{s})$

(5)

measures the number of bits that cannot be distinguished from noise. Notice that, for all $\vec{x}$ such that $s(\vec{x})=s$ , we have $p(\vec{x})\simeq p(\vec{x}|s)p(s)$ . Taking the logarithm of this equation

$p(s)\equiv\sum_{\vec{x}:\,s(\vec{x})=s}p(\vec{x})=\frac{1}{Z}e^{u_{s}}$

(6)

with $Z\simeq e^{nh_{0}}$ .

The second requirement of a maximally informative representation is that, for any fixed value of $H[s]$, $H[s|u]$ should be as small as possible, so that the amount

$H[u]=H[s]-H[s|u]$

(7)

of informative bits is as large as possible. It is easy to see that the minimisation of $H[s|u]$ over $W(u)$ , at a fixed value of the entropy $H[s]$ , leads to an exponential distribution of $u$

$W(u)=W_{0}e^{-\nu u},$

(8)

where the parameter $\nu$ enters as a Lagrange multiplier in the minimisation of $H[s|u]-\nu H[s]$ , to enforce the constraint on $H[s]$ . Notice that, when $\nu=1$ , the problem reduces to that of the unconstrained maximisation of $H[u]$ , and we recover $\log W(u)=\log W_{0}-u$ . Such a linear behaviour between energy and entropy, as discussed in Refs. [4, 12], corresponds to Zipf’s law and to a uniform distribution $p(u)=W(u)e^{u}/Z$ of $u_{s}$ . Indeed, the second requirement is analogous to demanding that $u_{s}$ should have a distribution which is as broad as possible.
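The remark on the breadth of the distribution of $u_{s}$ can be checked numerically. The sketch below evaluates $p(u)\propto W(u)e^{u}\propto e^{(1-\nu)u}$ on a set of energy levels and compares the entropy $H[u]$ for a few values of $\nu$. The grid of levels (the integers from 0 to 20) and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def p_of_u(nu, levels):
    """Distribution of u implied by W(u) = W0 * exp(-nu * u) and p(s) = exp(u_s) / Z."""
    # W(u) * exp(u) is proportional to exp((1 - nu) * u); W0 and Z cancel on normalising.
    exponents = [(1.0 - nu) * u for u in levels]
    top = max(exponents)  # subtract the largest exponent for numerical safety
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

levels = list(range(21))  # hypothetical energy levels, for illustration only
for nu in (0.5, 1.0, 2.0):
    print(nu, round(entropy(p_of_u(nu, levels)), 4))
```

On such a finite grid the distribution is flat only for $\nu=1$, which is also where $H[u]$ is largest; for $\nu<1$ the weight piles up at large $u$ and for $\nu>1$ at small $u$.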

3 An ensemble of optimisation problems

Consider a system described by a configuration $s=(\sigma_{1},\ldots,\sigma_{n})$ of $n$ binary (or spin) variables $\sigma_{i}=\pm 1$. The system is in contact with an environment, whose configuration $t=(\tau_{1},\ldots,\tau_{m})$ is specified by $m$ binary (or spin) variables $\tau_{j}=\pm 1$.

As in Ref. [14], we consider the problem of finding the maximum

$(s^{*},t^{*})=\arg\max_{(s,t)}U(s,t)$

(9)

of an objective function that can be divided in two parts

$U(s,t)=u_{s}+v_{t|s}.$

(10)

Here $u_{s}$ depends on the interactions of the variables within the system and $v_{t|s}$ accounts for the interactions with the environment. (Footnote 3: Ref. [14] discusses several examples of systems where this generic description may apply. For example, a protein domain is a sequence $s$ of amino acids that has been optimised, in the course of evolution, for a specific function, e.g. to regulate the flux of ions across the cellular membrane. This function depends on the interaction ($v_{t|s}$) with other molecules in the cell, and on their specific composition $t$. Each sequence in a protein database can be thought of as a realisation of the optimisation process above, for a different choice of $v_{t|s}$. Likewise, a word $s$ in a sentence is chosen to best express a concept, depending on the other words $t$ of that sentence.) The number of states with $u_{s}>u$ is given by

$|\{s:\ u_{s}>u\}|=2^{n}e^{-(u/\Delta)^{\gamma}},\qquad u>0.$

(11)

This can be realised by drawing $u_{s}$ at random from a stretched exponential distribution, which results in a rough energy landscape, as in the REM [13]. Yet there is no need to assume such a rough energy landscape for the sub-system. (Footnote 4: One way to define a smooth landscape satisfying Eq. (11) is to assume that $u_{s}$ depends only on the (Hamming) distance $|s-s_{0}|$ from a state $s_{0}$. In order to do this, it is sufficient to equate the entropy $\Sigma(u)=n[1-(u/u_{0})^{\gamma}]\log 2$, where $u_{0}$ is the largest value of $u_{s}$, to the logarithm of the number ${n\choose d}$ of states $s$ at distance $d$ from $s_{0}$. This gives

$u_{s}=u_{0}\left[1+\frac{d}{n}\log_{2}\frac{d}{n}+\left(1-\frac{d}{n}\right)\log_{2}\left(1-\frac{d}{n}\right)\right]^{1/\gamma},\qquad d=|s-s_{0}|.$

(12)

The function $u_{s}$ defined in this way is smooth, apart from the point $s_{0}$, where $|u_{s}-u_{s_{0}}|\simeq-\frac{d}{\gamma n}\log_{2}\frac{d}{n}+\ldots$ has a singular behaviour.) For the environment, we assume that $v_{t|s}$ is drawn from a distribution

$P\{v_{t|s}\geq x\}=e^{-x^{\gamma}},\qquad\gamma>0,$

(13)

independently, for each $s$ and $t$ . Therefore $s^{*}$ depends on the realisation $v_{t|s}$ of the interaction with the environment. For $\Delta\gg 1$ we expect the optimisation to depend weakly on the environment, and to be dominated by the term $u_{s}$ . In this case, $s^{*}$ will likely be one of the few states $s$ with values of $u_{s}$ close to the maximum $u_{0}=\max_{s}u_{s}$ , i.e. the probability

$p(s|\mathbf{u})=P\{s^{*}=s|\mathbf{u}\}$

(14)

that $s^{*}=s$ will be dominated by few values of $s$ . Hence, the entropy

$H[s]=-\sum_{s}p(s|\mathbf{u})\log p(s|\mathbf{u})$

(15)

will be small, for $\Delta\gg 1$ . When $\Delta\ll 1$ , instead, we expect that the environment $v_{t|s}$ dominates the optimisation, and hence that $s^{*}$ will be broadly distributed on an exponential number of states. This corresponds to an extensive entropy $H[s]\propto n$ . Our main focus will be on the transition between these two regimes.

3.1 The Gibbs distribution

As shown in Ref. [14], Extreme Value Theory (EVT) can be invoked to integrate out the degrees of freedom in the environment, by observing that for $m\gg 1$

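$\max_{(s,t)}U(s,t)=\max_{s}\left[u_{s}+\max_{t}v_{t|s}\right]\cong\max_{s}\left[u_{s}+a_{m}+\eta_{s}/\beta_{m}\right]$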

where $\eta_{s}$ is a random variable which follows a Gumbel distribution $P\{\eta_{s}\leq x\}=e^{-e^{-x}}$ (footnote 5: this is an asymptotic result, but it is derived taking the maximum over $2^{m}$ random variables $v_{t|s}$, which is an astronomically large number for $m\gg 1$), $a_{m}=(m\log 2)^{1/\gamma}$ and

$\beta_{m}=\gamma(m\log 2)^{1-1/\gamma}.$

The knowledge of the distribution of $\eta_{s}$ allows us to compute the probability that $s^{*}=s$ , which is the probability that $u_{s}+a_{m}+\eta_{s}/\beta_{m}\geq u_{s^{\prime}}+a_{m}+\eta_{s^{\prime}}/\beta_{m}$ for all $s^{\prime}\neq s$ . The result reads [14]

$p(s|\mathbf{u})=\frac{1}{Z}e^{\beta_{m}u_{s}},\qquad Z=\sum_{s}e^{\beta_{m}u_{s}},$

(19)

which is a Gibbs distribution with inverse temperature $\beta_{m}$. Note that, for $\gamma>1$, $\beta_{m}\to\infty$ as $m\to\infty$, so the entropy $H[s]$ is expected to decrease as the size of the environment increases. On the contrary, $\beta_{m}\to 0$ for $\gamma<1$, which means that larger and larger environments make the sub-system’s behaviour less predictable. For $\gamma=1$, instead, $\beta_{m}=1$, i.e. the distribution of $s$ is independent of the size of the environment. In this case, Eq. (19) coincides with Eq. (6). Note also that the parameter $\nu$ discussed in Section 2 is given by $\nu=1/\Delta$.
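A small Monte Carlo experiment can illustrate how the Gibbs distribution emerges from the maximisation over the environment. The sketch below draws $\max_{t}v_{t|s}$ exactly, by inverting the distribution of the maximum of $2^{m}$ independent variables with $P\{v\geq x\}=e^{-x^{\gamma}}$, and compares the frequencies with which each state is selected with the Gibbs weights $e^{\beta_{m}u_{s}}/Z$. The four values of $u_{s}$, the choice $m=20$ and all function names are assumptions made for illustration, not taken from Ref. [14].

```python
import math
import random

def beta_m(m, gamma):
    """Inverse temperature beta_m = gamma * (m log 2)**(1 - 1/gamma)."""
    return gamma * (m * math.log(2.0)) ** (1.0 - 1.0 / gamma)

def sample_max_v(m, gamma, rng):
    """Exact draw of max_t v_{t|s} over 2**m i.i.d. values with P{v >= x} = exp(-x**gamma)."""
    r = rng.random()
    while r == 0.0:  # avoid log(0); this has negligible probability
        r = rng.random()
    # The maximum has CDF (1 - exp(-x**gamma))**(2**m); invert it at r.
    tail = -math.expm1(math.log(r) / 2.0 ** m)  # equals 1 - r**(1 / 2**m)
    return (-math.log(tail)) ** (1.0 / gamma)

def gibbs(u_levels, beta):
    """Gibbs weights exp(beta * u_s) / Z, computed with a stabilising shift."""
    top = max(u_levels)
    weights = [math.exp(beta * (u - top)) for u in u_levels]
    total = sum(weights)
    return [w / total for w in weights]

def argmax_frequencies(u_levels, m, gamma, trials, seed=0):
    """Empirical frequency with which each s maximises u_s + max_t v_{t|s}."""
    rng = random.Random(seed)
    counts = [0] * len(u_levels)
    for _ in range(trials):
        totals = [u + sample_max_v(m, gamma, rng) for u in u_levels]
        counts[totals.index(max(totals))] += 1
    return [c / trials for c in counts]

u_levels = [0.0, 0.5, 1.0, 1.5]  # hypothetical u_s for a 4-state sub-system
print(argmax_frequencies(u_levels, m=20, gamma=1.0, trials=20000))
print(gibbs(u_levels, beta_m(20, 1.0)))
```

For $\gamma=1$ the two printed lists should agree within sampling noise. For other values of $\gamma$ the Gumbel limit is approached more slowly in $m$, so the agreement at moderate $m$ is expected to be only approximate.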

3.2 System’s learnability

Can the function $u_{s}$ be learned from a series of experiments, when it is not known in advance? Let $p_{0}(s)$ be the distribution that encodes the current state of knowledge about the system. For an extensive quantity $q_{s}\propto n$ , it is possible to compute its distribution

$p_{0}(q)=\sum_{s}p_{0}(s)\delta(q-q_{s})$

If $q_{s}$ is a self-averaging quantity, we expect its distribution to be sharply peaked around a typical value $q_{\rm typ}=\langle q\rangle$ . Imagine running an experiment where the value $q_{\rm exp}$ is measured. If $q_{\rm exp}\approx q_{\rm typ}$ within experimental errors, then the current theory is confirmed, otherwise it has to be revised. In the latter case, the standard recipe to update the theory is given by Large Deviation Theory [17]. This maintains that the new distribution should be such that $\langle q\rangle_{\rm new}=q_{\rm exp}$ , without assuming anything else. More precisely, the amount of information that the measurement gives on the state $s$ is given by the mutual information $I(s,q)=D_{KL}(p_{\rm new}||p_{0})$ . Hence, $p_{\rm new}$ should be the distribution with $\langle q\rangle_{\rm new}=q_{\rm exp}$ for which $D_{KL}(p_{\rm new}||p_{0})$ is minimal. The distribution that satisfies this requirement is

$p_{\rm new}(s)=\frac{1}{Z(g)}p_{0}(s)e^{gq_{s}},\qquad Z(g)=\int dq\,p_{0}(q)e^{gq}$

(20)

where $g$ is adjusted so as to satisfy $\langle q\rangle_{\rm new}=q_{\rm exp}$. This process can be continued with additional measurements of different observables $q_{s}^{\prime},q_{s}^{\prime\prime},\ldots$, and, in principle, it allows one to infer

$\beta_{m}u_{s}=gq_{s}+g^{\prime}q_{s}^{\prime}+g^{\prime\prime}q_{s}^{\prime\prime}+\ldots$

to the desired accuracy from a series of experiments.

This recipe, however, only works for quantities whose distribution $p_{0}(q)$ falls off at least exponentially fast as $q\to\pm\infty$, which corresponds to $\gamma\geq 1$. If $-\log p_{0}(q)\simeq c|q|^{\gamma}$ for $|q|\to\infty$ with $\gamma<1$, then the integral defining $Z(g)$ in Eq. (20) is not defined. In this case, there is no well defined way to incorporate the observation $q_{\rm exp}\neq q_{\rm typ}$ in the distribution $p_{0}(s)$ and to update our state of knowledge. (Footnote 6: As observed in [16], a distribution that would reproduce $q_{\rm exp}=\langle q\rangle_{\rm new}$ is $p_{\rm new}(s)=(1-\epsilon)p_{0}(s)+\epsilon p_{1}(s)$ for any $p_{1}(s)$ such that $\sum_{s}p_{1}(s)q_{s}=q_{\rm typ}+(q_{\rm exp}-q_{\rm typ})/\epsilon$. A possible interpretation is that, a priori, if $q_{\rm exp}\neq q_{\rm typ}$ then, with probability $1-\epsilon$, we should discard the observation $q_{\rm exp}$ and keep the old theory $p_{0}$ and, with probability $\epsilon$, we should instead discard $p_{0}(s)$ altogether and take $p_{1}(s)$ as our new theory. In the first case we do not learn anything. In the second, the current state of knowledge is wiped out altogether. Notice that if it were possible to measure $\tilde{q}_{s}=q_{s}^{\alpha}$ instead of $q_{s}$, for a small enough $\alpha$ the distribution of $\tilde{q}$ may fall off sufficiently fast, thus leading us back to the case $\gamma>1$.) In this sense, $\gamma=1$ separates the region of learnable systems ($\gamma\geq 1$) from the one ($\gamma<1$) of systems for which $u_{s}$ cannot be learned through a series of experiments.
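The update rule described above can be sketched for a finite state space, where $Z(g)$ always exists and $g$ can be found by bisection because $\langle q\rangle_{\rm new}$ is a non-decreasing function of $g$. The prior, the observable values, the bracket $[-50,50]$ for $g$ and the function names below are assumptions chosen for illustration. The sketch does not reproduce the failure for $\gamma<1$, which only arises when $q$ is unbounded and $Z(g)$ diverges.

```python
import math

def tilt(p0, q, g):
    """Exponentially tilted distribution p_new(s) = p0(s) * exp(g * q_s) / Z(g)."""
    shift = max(g * x for x in q)  # stabilises the exponentials
    weights = [p * math.exp(g * x - shift) for p, x in zip(p0, q)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(p, q):
    """Expected value of q under the distribution p."""
    return sum(pi * qi for pi, qi in zip(p, q))

def solve_g(p0, q, q_exp, lo=-50.0, hi=50.0, iterations=200):
    """Bisection on g so that the tilted mean matches q_exp (assumes min(q) < q_exp < max(q))."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(tilt(p0, q, mid), q) < q_exp:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p0 = [0.25, 0.25, 0.25, 0.25]  # hypothetical prior over four states
q = [0.0, 1.0, 2.0, 3.0]       # hypothetical values q_s of the observable
g = solve_g(p0, q, q_exp=2.2)
print(g, mean(tilt(p0, q, g), q))
```

Here the prior mean is $1.5$, so matching a measured value of $2.2$ requires a positive $g$, which shifts weight towards states with larger $q_{s}$.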

4 The thermodynamic limit

The thermodynamic limit is defined as the limit $n,m\to\infty$ with $\mu=m/n$ finite. The largest value of $u_{s}$ is of the order $u_{0}=\Delta(n\log 2)^{1/\gamma}$ so

$\beta_{m}u_{s}=n(\log 2)\gamma\mu^{1-1/\gamma}\nu_{s}$

(22)

is extensive when the intensive variable $\nu_{s}=\Delta u_{s}/u_{0}$ varies in the interval $[0,\Delta]$. (Footnote 7: The existence of the thermodynamic limit relies on the choice of the same distribution for both $u_{s}$ and $v_{t|s}$. Under a different distribution, the thermodynamic limit would require a specific scaling of $v_{t|s}$ with $m$.) Likewise, the free energy $-\log Z$ is also extensive. Hence the model of Eq. (19) with $u_{s}$ drawn from Eq. (13) coincides with a generalised REM, which has been discussed in Ref. [15]. This Section re-derives and discusses its properties in the present setting. We refer to the appendix for detailed calculations and discuss the main results here.

Fig. 2(left) shows the entropy density $\Sigma(u)/n$ as a function of $u/u_{0}$ . This is the logarithm of the number of states at a given value of $u$ , divided by $n$ . For $\gamma>1$ this is a concave function, so the thermodynamics can be computed in the usual manner. For a certain value of $\beta_{m}$ , the partition function $Z$ is dominated by the point where $\Sigma(u)$ is tangent to the line of slope $-\beta_{m}$ (dashed lines). Notice that, by Eq. (22), $\beta_{m}$ is controlled by $\mu$ . As long as

$\frac{m}{n}=\mu<\mu_{c}=\Delta^{-\gamma/(\gamma-1)},\qquad(\gamma>1)$

$Z$ is dominated by an intermediate point $u^{*}\in[0,u_{0})$ for which an exponential (in $n$ ) number of states contribute to the sum in $Z$ . Accordingly, the entropy

$H[s]\equiv\Sigma(u^{*})=n\left(1-\frac{\mu}{\mu_{c}}\right)\log 2$

is extensive, and it vanishes linearly with $\mu/\mu_{c}$ as $\mu\to\mu_{c}^{-}$ (see Fig. 2 right), for all values of $\gamma>1$ . As a consequence, for $\mu<\mu_{c}$ , the probability $p(s|\mathbf{u})$ is exponentially small in $n$ , for all $s$ including $s_{0}$ .

For $\mu>\mu_{c}$ the slope $\beta_{m}$ is larger than that of the curve $\Sigma(u)$ at $u_{0}$ , hence $Z$ is dominated by states with $u_{s}\simeq u_{0}$ . Hence, the probability $p(s_{0}|\mathbf{u})$ is finite as well as the entropy $H[s]$ . The phase diagram in the $(\mu,\Delta)$ plane is shown in Fig. 3 (left). In summary, the typical behaviour of the REM holds in the whole region $\gamma>1$ .

For $\gamma<1$ , instead, $\Sigma(u)$ is a convex function of $u$ and the construction above fails to work. For all $\beta_{m}$ small enough, the partition function is dominated by the point $u=0$ whereas for large $\beta_{m}$ it is dominated by states with $u_{s}\approx u_{0}$ . As a result, the entropy is $H[s]=n\log 2$ for

$\frac{m}{n}=\mu>\mu_{c}=(\gamma\Delta)^{\gamma/(1-\gamma)}\qquad(\gamma<1).$

whereas $H[s]/n\to 0$ as $n\to\infty$ for $\mu<\mu_{c}$. The transition between the two regimes is discontinuous, as shown in Fig. 2 (right). Notice that, since $\beta_{m}$ is a decreasing function of $\mu$ for $\gamma<1$ (see Eq. 22), the transition is also reversed: the disordered phase is found for large environments rather than for small ones.
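The two regimes can be cross-checked with a short numerical sketch. Using the intensive variable $\nu_{s}$, the exponent of the partition function divided by $n\log 2$ is $1-(\nu/\Delta)^{\gamma}+\gamma\mu^{1-1/\gamma}\nu$, and the entropy density follows from the value $\nu^{*}\in[0,\Delta]$ that maximises it. The code below locates $\nu^{*}$ by a grid search and compares the result with the expressions for $\mu_{c}$ quoted above. The grid search, its resolution and the function names are assumptions of this sketch, not the method of the appendix.

```python
import math

def mu_c(gamma, delta):
    """Critical ratio mu_c for gamma > 1 and for gamma < 1, as quoted in the text."""
    if gamma > 1.0:
        return delta ** (-gamma / (gamma - 1.0))
    if gamma < 1.0:
        return (gamma * delta) ** (gamma / (1.0 - gamma))
    raise ValueError("gamma = 1 is a separate case (transition at delta = 1 for all mu)")

def entropy_density(mu, gamma, delta, grid=20001):
    """Leading-order H[s]/n: maximise the exponent of Z over nu in [0, delta]."""
    slope = gamma * mu ** (1.0 - 1.0 / gamma)  # beta_m * u_s = n log(2) * slope * nu_s
    best_nu, best_val = 0.0, -math.inf
    for k in range(grid):
        nu = delta * k / (grid - 1)
        val = 1.0 - (nu / delta) ** gamma + slope * nu  # exponent divided by n log 2
        if val > best_val:
            best_nu, best_val = nu, val
    # Entropy density of the states that dominate Z
    return (1.0 - (best_nu / delta) ** gamma) * math.log(2.0)

# gamma > 1: continuous transition, entropy (1 - mu/mu_c) log 2 below mu_c
print(mu_c(2.0, 0.5), entropy_density(1.0, 2.0, 0.5))
# gamma < 1: discontinuous transition, entropy log 2 above mu_c and zero below
print(mu_c(0.5, 1.0), entropy_density(1.0, 0.5, 1.0), entropy_density(0.25, 0.5, 1.0))
```

For $\gamma>1$ the maximiser moves continuously towards $\nu=\Delta$ as $\mu\to\mu_{c}$, whereas for $\gamma<1$ the exponent is convex in $\nu$ and the maximiser jumps between the two endpoints, which is the origin of the discontinuity.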

The case $\gamma=1$ is discussed in the appendix. The phase transition occurs at the point $\Delta_{c}=1$ for all values of $\mu$. As shown in Fig. 4 (left), the entropy decreases sharply from $H[s]\simeq n\log 2$ to a finite value. At the transition, the distribution of $u$ extends across the whole range $[0,n\log 2]$, which is signalled by the divergence of the (analogue of the) specific heat

$C_{v}=\langle(u_{s}-\langle u_{s}\rangle)^{2}\rangle$

as shown in Fig. 4 (center). This divergence is usually taken as a signature of a second order phase transition. In the ensemble of systems discussed here, it occurs only at $\gamma=1$. Finally, Fig. 4 (right) shows the behaviour of the entropy $H[u]$ of the random variable $u$. In an efficient representation, this is taken as a measure of the amount of useful information. In an infinite system $H[u]\simeq-\log|\Delta-1|$ diverges at $\Delta=1$, whereas for a finite system it reaches its maximum $H[u]\simeq\log(n\log 2)+1$ at $\Delta=1$.
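The growth of the fluctuations of $u$ at $\Delta=1$ can be illustrated with a rough continuum sketch. For $\gamma=1$ the density of states is proportional to $e^{-u/\Delta}$ on $[0,\Delta n\log 2]$, so that $p(u)\propto e^{(1-1/\Delta)u}$. The code below computes the variance of $u$ under this density by a midpoint rule. Treating $u$ as continuous, which ignores the discreteness of the few top levels in the frozen phase, is an assumption of this sketch, as are the function name and the grid size.

```python
import math

def u_variance(delta, n, grid=20000):
    """Variance of u under p(u) ~ exp((1 - 1/delta) * u) on [0, delta * n * log 2] (gamma = 1)."""
    a = 1.0 - 1.0 / delta
    length = delta * n * math.log(2.0)
    step = length / grid
    points = [(k + 0.5) * step for k in range(grid)]  # midpoint rule
    shift = max(a * points[0], a * points[-1])        # avoids overflow in exp
    weights = [math.exp(a * u - shift) for u in points]
    total = sum(weights)
    mean = sum(w * u for w, u in zip(weights, points)) / total
    return sum(w * (u - mean) ** 2 for w, u in zip(weights, points)) / total

for delta in (0.8, 1.0, 1.25):
    print(delta, round(u_variance(delta, n=50), 3))
```

Within this approximation the variance stays of order $(1-1/\Delta)^{-2}$ away from $\Delta=1$, while at $\Delta=1$ the distribution is flat and the variance is $(n\log 2)^{2}/12$, which grows without bound with $n$.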

We remark that the thermodynamic description discussed above holds for any system for which the number of energy levels at energy $E$ is given by $W(E)=e^{n[\log 2-(-E/n)^{\gamma}]}$ for $E\leq 0$ , irrespective of the relation between the energy $E_{s}$ and the configuration $s$ of the system. The case where the $2^{n}$ energy levels are drawn at random, independently, from the same distribution $p\{E_{s}\leq nx\}=e^{-n(-x)^{\gamma}}$ , for $x<0$ , provides a particular (ensemble of) realisation(s) of this system. Yet, this is not the only way in which a function $E_{s}$ with a given number $W(E)=|\{s:~{}E_{s}=E\}|$ of states at energy $E$ , can be realised. In particular, energy landscapes where $E_{s}$ is drawn independently for each $s$ are not ideal paradigms for learning machines. First because we expect some sort of continuity in the representation, so that similar objects $s$ and $s^{\prime}$ have similar energies $E_{s}\approx E_{s^{\prime}}$ . Second, random landscapes are characterised by an extremely slow dynamics [18]. Hence, a smooth energy landscape is a desirable property of OLM both because of continuity of the representation and because of the dynamical accessibility of the equilibrium state Eq. (19).

5 Discussion

Figure 1 puts on the same phase diagram systems with very different statistical properties. The right side ($\gamma>1$) describes REM-like behaviour typical of disordered systems in physics. The left side ($\gamma<1$) describes unlearnable systems with a first order phase transition. Optimal learning machines, which are characterised by $\gamma=1$, sit exactly at the boundary between these two regimes.

This lends itself to a number of interesting, though speculative, comments. First, among the systems studied in this paper, OLM have the widest variation of thermodynamically accessible energy levels $u$ . Indeed, the range of energies is given by $u_{0}=\Delta(n\log 2)^{1/\gamma}$ , which is a decreasing function of $\gamma$ . Yet, for $\gamma<1$ only $u_{s}=0$ and $u_{s}=u_{0}$ are thermodynamically accessible, so the range of thermodynamically accessible values of $u$ is maximal for $\gamma=1$ . This is consistent with the fact that the energy is the natural coordinate in learning because it corresponds to the coding cost $-\log p(s)$ . Maximally informative representations use the energy spectrum as efficiently as possible [3].

It is interesting to relate the phase transition for $\gamma=1$ with the trade-off between resolution $H[s]$ and noise $H[s|u]$ discussed in Ref. [3] (see also Section 2). As $\Delta$ varies, $H[s|u]$ traces a convex curve as a function of $H[s]$, where the slope $\nu=1/\Delta$ is related to the Lagrange multiplier that is used to enforce the constraint on $H[s]$ in the minimisation of $H[s|u]$ [3]. This means that when $H[s]$ is reduced by one bit, the noise is reduced by $1/\Delta$ bits. Therefore the region $\Delta<1$ describes noisy representations and corresponds to values of $H[s]$ larger than the value $H_{c}$ for which $\Delta=1$. The region $H[s]<H_{c}$ corresponds to $\Delta>1$. In this region, a reduction in the resolution comes at the expense of a loss of information on the generative model. In supervised learning, it is reasonable to surmise that compression for $\Delta>1$ occurs at the expense of details of the generative model that are irrelevant with respect to the specific input-output task that the machine is learning. Hence the representation depends significantly on the output. Conversely, for $\Delta<1$ we expect that the representation depends mostly on the input and only weakly on the output. This leads to the conjecture that maximally informative representations have a universal nature for $\Delta\leq 1$: they depend mostly on the input data, and are largely independent of the specific input-output relation that the machine is learning. In this picture, the phase transition at $\Delta=1$ marks the point where the ergodicity in the space of representations (and the symmetry with respect to different outputs) gets (spontaneously) broken. This conjecture can in principle be disproved or confirmed by further research on specific architectures. (Footnote 8: As an analogy, the critical temperature in a ferromagnetic Ising model marks the point where the response to a small external magnetic field changes dramatically. In the paramagnetic region, the response is continuous whereas in the ferromagnetic phase it is discontinuous. A possible way to confirm this conjecture might be to probe the response of maximally informative representations to changes in the output, at different values of $H[s]$. The change should be small and “continuous” in the $\Delta<1$ phase and sharp in the $\Delta>1$ phase.)

We have also seen that systems with $\gamma<1$ cannot be learned from a series of experiments, and that OLM sit exactly at the boundary between learnable and unlearnable systems. In order to appreciate the possible significance of this observation, let us consider a larger system

$U(s,t,z)=u_{s}+v_{t|s}+y_{z|t,s}\qquad(27)$

with $q$ additional variables $z=(\zeta_{1},\ldots,\zeta_{q})$ . As in Ref. [14], the different terms in Eq. (27) can be defined as

[TABLE]

with $E[U|x]$ representing the expected value over the distribution of $U$ at given $x$, i.e. the best estimate of the objective function when the variable $x$ is fixed. Let us also assume that $v_{t|s}$ and $y_{z|t,s}$ are drawn independently from a stretched exponential distribution with exponents $\gamma_{v}$ and $\gamma_{y}$, respectively. In the limit $q\propto m\propto n\gg 1$, the derivation in Section 3.1 shows that the statistics of the variable

[TABLE]

still follows the Gibbs distribution Eq. (19), but the value of $\beta$ is dominated by the variables $t$ if $\gamma_{v}<\gamma_{y}$ and by the variables $z$ otherwise (see footnote 9). Therefore, the most relevant variables are those with the smallest value of $\gamma$. In this sense, systems with $\gamma=1$ are characterised by the most relevant variables that can be implemented in a physically accessible system. This also offers a guideline for finding relevant variables in high-dimensional data, as those for which the sample exhibits statistical criticality (see Refs. [19, 20] for attempts in this direction). Furthermore, for $\gamma_{v}=\gamma_{y}=\gamma=1$ one recovers Eq. (19) with $\beta_{m}=1$. In words, the behaviour of OLM is invariant when further details are added to the problem, which is a desirable property of efficient representations. For example, the classification of a dataset of images should not depend on the resolution of the images, beyond a certain level.

Footnote 9: Note that the decomposition in Eq. (27) is not unique, since one could as well define $U(s,t,z,\ldots)=u_{s}+w_{z|s}+x_{z|t,s}+\ldots$. Hence, without loss of generality, one can focus on the decomposition for which $\gamma_{v}\leq\gamma_{y}\leq\ldots$.

A further unique property of systems with $\gamma=1$ is that the system can, in principle, be further decomposed into sub-systems with the same properties. More precisely, one can find variables $p=(\pi_{1},\ldots,\pi_{l})$ and $r=(\rho_{1},\ldots,\rho_{n-l})$ such that $s=(p,r)$ and $u_{s}=w_{p}+z_{r|p}$, with $w_{p}$ and $z_{r|p}$ again having distributions that asymptotically behave as exponentials. In particular, critical systems with $\Delta=1$ admit sub-systems that are also “poised” at the critical point $\Delta=1$. It is tempting to regard this remarkable self-similarity as a distinguishing feature of living systems. For example, both the abundance of metabolites [21] and gene expression levels [22] inside cells have been reported to obey Zipf’s law.

On the contrary, systems with $\gamma>1$ exhibit a behaviour that becomes more and more predictable as the number $n$ of variables decreases (i.e. for large $\mu$). Within the simple class of models discussed here, the possibility of describing a complex system in terms of few variables (see footnote 10) emerges as a typical property of physical systems with $\gamma>1$.

Footnote 10: This was regarded as a wonderful gift by Wigner [23].

Acknowledgements

Interesting discussions and useful comments with J. Barbier, J.-P. Bouchaud and S. Franz are gratefully acknowledged.

Appendix A The Statistical mechanics approach for $\gamma\neq 1$

The maximum value of $u_{s}$ , from EVT, is given by

[TABLE]

where $\xi$ is also a random variable drawn from a Gumbel distribution. Here and in what follows, we introduced the shorthand $\tilde{n}=n\log 2$ . Therefore, neglecting $1/\tilde{n}$ corrections, we introduce the intensive variable $\nu_{s}$ by

[TABLE]
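The extreme-value statement above can be checked numerically. The sketch below assumes that the $2^{n}$ levels are independent draws with survival function $P\{u>x\}=e^{-(x/\Delta)^{\gamma}}$ (the stretched exponential form described in the abstract) and samples the maximum directly by inverse transform, rather than drawing all $2^{n}$ levels. The function names and parameter values are hypothetical, and the comparison is only with the leading-order scale $\Delta\tilde{n}^{1/\gamma}$.

```python
import math
import random


def predicted_max(n: int, gamma: float, delta: float = 1.0) -> float:
    """Leading-order scale of the maximum: delta * (n log 2)**(1/gamma)."""
    return delta * (n * math.log(2.0)) ** (1.0 / gamma)


def sample_max_energy(n: int, gamma: float, delta: float, rng: random.Random) -> float:
    """Draw the maximum of 2**n i.i.d. levels with P(u > x) = exp(-(x/delta)**gamma).

    The maximum of N draws has CDF F(x)**N, so F(x_max) = v**(1/N) with v uniform.
    Assumes n <= 1000 so that 2.0**n does not overflow a float.
    """
    v = max(rng.random(), 1e-300)               # uniform in (0, 1), kept away from 0
    tail = -math.expm1(math.log(v) / 2.0 ** n)  # 1 - v**(1/N), computed stably
    return delta * (-math.log(tail)) ** (1.0 / gamma)


if __name__ == "__main__":
    rng = random.Random(1)
    n = 200
    for gamma in (0.5, 1.0, 2.0):
        ratios = [sample_max_energy(n, gamma, 1.0, rng) / predicted_max(n, gamma)
                  for _ in range(1000)]
        print(f"gamma = {gamma:3.1f}   mean(max / predicted) = {sum(ratios) / len(ratios):.4f}")
```

In this sketch $-\log(\text{tail})$ equals $\tilde{n}$ plus a Gumbel-distributed fluctuation of order one, which is why the ratio approaches one up to corrections of relative order $1/\tilde{n}$.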

We focus on the case where the size of the heat bath $m=\mu n$ is proportional to $n$ . Then

[TABLE]

is extensive and the number of configurations with $u_{s}\geq{\tilde{n}}^{1/\gamma}\nu$ is

[TABLE]

which has the conventional exponential behaviour with system size $\tilde{n}$. In the annealed approximation, we can compute the partition function as

[TABLE]

For $\gamma>1$ the free energy $f(\nu)$ is a concave function and $Z$ can be computed by saddle point. The saddle point value reads

[TABLE]

As long as $\nu^{*}<\Delta$ , the annealed approximation is valid. This holds as long as

[TABLE]

The saddle point calculation yields

[TABLE]

This allows us to compute the entropy for $\mu<\mu_{c}$ which is given by

[TABLE]

which vanishes as $\mu\to\mu_{c}^{-}$ .

The saddle point approximation cannot be used for $\gamma<1$ because the function $f(\nu)$ is convex. Indeed the integral in Eq. (37) is either dominated by the point $\nu=0$ or by the point $\nu=\Delta$ . As long as $f(0)>f(\Delta)$ the first dominates and we have

[TABLE]

The condition $f(0)>f(\Delta)$ is equivalent to

[TABLE]

As long as this condition is satisfied, the entropy $H[s]=\tilde{n}$ is asymptotically the same as the entropy of a flat distribution over the states $s$ . When $\mu<\mu_{c}$ the annealed approximation ceases to be valid because the partition function is dominated by few states.

When $Z$ is dominated by the point $\nu=\Delta$ , i.e. for $\mu<\mu_{c}$ , the annealed partition function can be estimated with the change of variables $\nu=\Delta-z/\tilde{n}$ so that the free energy becomes $\tilde{n}f\simeq\gamma\mu^{1-1/\gamma}\Delta\tilde{n}-[(\mu/\mu_{c})^{1/\gamma-1}-\gamma]z+\ldots$ and the integral yields

[TABLE]

which suggests that the probability of states with $\beta_{m}u_{s}=\tilde{n}\gamma\mu^{1-1/\gamma}\Delta$ is of order one:

[TABLE]

Notice that this does not vanish as $\mu\to\mu_{c}^{-}$, which is a further signature of a first order phase transition.

Appendix B The case $\gamma=1$

In the case $\gamma=1$ we can resort to a simple approximation, assuming that the spectrum of possible values of $u$ is limited in the range $[0,u_{0}]$ , with

[TABLE]

and that the number of energy levels in the interval $[u,u+du)$ is given by

[TABLE]

We can obtain most quantities of interest from the partition function

[TABLE]

Indeed, $Z(1)$ yields the normalisation of the distribution over $s$ and derivatives of $\log Z(\lambda)$ with respect to $\lambda$ , computed at $\lambda=1$ yield the moments of the distribution of $u_{s}$ . Also, the entropy

[TABLE]

Within the approximation above, we find

[TABLE]

The expected value of $u$ reads

[TABLE]

where we have introduced the scaling variable $\chi$ . For $\Delta<1$ the leading behaviour for $n\to\infty$ is obtained for $-\chi\gg 1$ , whereas for $\Delta>1$ it is obtained for $\chi\gg 1$ . Hence

[TABLE]

where $\theta(x)$ is the Heaviside function, with $\theta(x)=1$ for $x\geq 0$ and $\theta(x)=0$ otherwise. The specific heat is given by

[TABLE]

The entropy reads

[TABLE]

It is also possible to compute the entropy of the variable $u$

[TABLE]

For $\Delta\neq 1$ and $n\to\infty$ one finds $H[u]\to\log\frac{\Delta}{|\Delta-1|}+1$, whereas at $\Delta=1$ one finds $H[u]\simeq\log(n\log 2)+1$.
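The limit quoted for $\Delta\neq 1$ can be checked numerically under one reading of the approximation above. The sketch assumes that $H[u]$ is the differential entropy of the density $p(u)\propto e^{(1-1/\Delta)u}$ on $[0,u_{0}]$ with $u_{0}=\Delta n\log 2$, i.e. the Gibbs weight $e^{u}$ times a density of states proportional to $e^{-u/\Delta}$. This reading is an assumption, the function names are hypothetical, and only the case $\Delta\neq 1$ is tested.

```python
import math


def entropy_of_u(n: int, delta: float, grid: int = 20_000) -> float:
    """Differential entropy of p(u) ∝ exp((1 - 1/delta) * u) on [0, delta * n * log 2].

    Uses a midpoint rule; the density and the definition of H[u] are assumptions
    of this sketch.
    """
    u0 = delta * n * math.log(2.0)
    a = 1.0 - 1.0 / delta
    du = u0 / grid
    shift = max(0.0, a * u0)  # keeps every exponent <= 0 to avoid overflow
    log_w = [a * (k + 0.5) * du - shift for k in range(grid)]
    log_z = math.log(sum(math.exp(lw) for lw in log_w) * du)
    # H = -∫ p log p du, with log p = log_w - log_z
    return -sum(math.exp(lw - log_z) * (lw - log_z) for lw in log_w) * du


def predicted_limit(delta: float) -> float:
    """Limit quoted in the text for delta != 1: log(delta / |delta - 1|) + 1."""
    return math.log(delta / abs(delta - 1.0)) + 1.0


if __name__ == "__main__":
    for delta in (0.5, 0.8, 1.5, 2.0):
        print(f"Delta = {delta:3.1f}   numeric = {entropy_of_u(60, delta):.4f}"
              f"   predicted = {predicted_limit(delta):.4f}")
```

Under this assumption the density is a truncated exponential with rate $|1-1/\Delta|$, whose entropy tends to $1-\log|1-1/\Delta|$ as $u_{0}\to\infty$, in agreement with the expression quoted above.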

B.1 A refined approach

The approach discussed so far relies on the annealed approximation for the partition function. This approach is accurate in the disordered phase but it does not work in the frozen phase. Indeed, for $\Delta>\Delta_{c}$ the partition function is dominated by few states and it is not self-averaging. The probability $p(s|\mathbf{u})$ is a function of $\mathbf{u}=\{u_{s}\}$ and as such, it attains different values depending on the values of $\mathbf{u}$. As a result, the entropy $H[s]$ is also a random variable.

In order to appreciate these effects, we compute the function

[TABLE]

where $\langle\ldots\rangle_{s,\mathbf{u}}$ stands for the average over $s$ and $\mathbf{u}$ whereas $\langle\ldots\rangle_{\mathbf{u}_{-s}|u_{s}=u}$ for the average over all values of $u_{s^{\prime}}$ for $s^{\prime}\neq s$ , with $u_{s}=u$ . Now,

[TABLE]

with $N=2^{n}\gg 1$ . The term $\langle e^{-te^{u^{\prime}}}\rangle^{N-1}_{u^{\prime}}$ is vanishingly small unless $t\ll 1$ . Hence, for $\Delta>1$ , and $t\ll 1$ we can write

[TABLE]

Anticipating that $p(s|\mathbf{u})$ is non-negligible for values of $u_{s}=u_{0}+x$ , we compute

[TABLE]

where we set $z=N^{\Delta}e^{x}t$ in the last equation and took the limit $n,N\to\infty$ .

Inserting this in Eq. (58), with the change of variables $x=u-\Delta\log N$, we observe that $p(u)\to e^{-x/\Delta}/N$ with a factor $1/N$ that cancels the sum on $s$. Therefore, with $y=x/\Delta$, we find

[TABLE]

Setting $y=u+\log[\Gamma(1-1/\Delta)z^{1/\Delta}]$ the integrals separate and one finds

[TABLE]

Note that $\Omega(0)=1$ as necessary for normalisation. The knowledge of $\Omega(\lambda)$ allows us to compute observables in the $\Delta>1$ region. For example, the probability that two replicas end up in the same state, i.e. that $s^{*}_{1}=s^{*}_{2}$, is given by $\Omega(1)=1-1/\Delta$. Likewise, the probability that $q+1$ replicas coincide is

[TABLE]

which vanishes linearly as $\Delta\to 1^{+}$, for all $q\geq 1$, and decays as $q^{-1/\Delta}$ for $q\gg 1$.
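The two-replica result $\Omega(1)=1-1/\Delta$ can be probed with a small Monte Carlo sketch. It assumes that the levels $u_{s}$ are independent exponential variables with mean $\Delta$, that $p(s|\mathbf{u})\propto e^{u_{s}}$ (the Gibbs weight with $\beta_{m}=1$), and that the probability of two replicas coinciding is the average of $\sum_{s}p(s|\mathbf{u})^{2}$ over $\mathbf{u}$. The function names, the number of states and the number of samples are illustrative choices, not taken from the paper.

```python
import math
import random


def participation_ratio(n_states: int, delta: float, rng: random.Random) -> float:
    """Return sum_s p_s**2 for p_s ∝ exp(u_s), with u_s exponential of mean delta."""
    us = [rng.expovariate(1.0 / delta) for _ in range(n_states)]
    u_max = max(us)
    weights = [math.exp(u - u_max) for u in us]  # shift by the maximum to avoid overflow
    total = sum(weights)
    return sum(w * w for w in weights) / (total * total)


def mean_overlap(n_states: int, delta: float, n_samples: int, seed: int = 0) -> float:
    """Average the participation ratio over independent realisations of the levels."""
    rng = random.Random(seed)
    return sum(participation_ratio(n_states, delta, rng)
               for _ in range(n_samples)) / n_samples


if __name__ == "__main__":
    for delta in (1.5, 2.0, 4.0):
        estimate = mean_overlap(n_states=1024, delta=delta, n_samples=400)
        print(f"Delta = {delta:3.1f}   Monte Carlo = {estimate:.3f}"
              f"   1 - 1/Delta = {1.0 - 1.0 / delta:.3f}")
```

The estimate fluctuates strongly from sample to sample because the sum is dominated by a few states, which is the lack of self-averaging discussed above. Convergence in the number of states also becomes slow as $\Delta\to 1^{+}$.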

The expected value of the entropy is given by

[TABLE]

The leading divergence as $\Delta\to 1^{+}$ matches the one found within the annealed approximation. Its variance can also be computed

[TABLE]

Interestingly, $V[H[s]]\to 0$ as $\Delta\to 1^{+}$ .

Bibliography

[1] Gašper Tkačik and William Bialek. Information processing in living systems. Annual Review of Condensed Matter Physics, 7(1):89–117, 2016.
[2] Jorge Hidalgo, Jacopo Grilli, Samir Suweis, Miguel A. Muñoz, Jayanth R. Banavar, and Amos Maritan. Information-based fitness and the emergence of criticality in living systems. Proceedings of the National Academy of Sciences, 111(28):10095–10100, 2014.
[3] Ryan John Cubero, Junghyo Jo, Matteo Marsili, Yasser Roudi, and Juyong Song. Statistical criticality arises in most informative representations. Journal of Statistical Mechanics: Theory and Experiment, 2019(6):063402, 2019.
[4] Thierry Mora and William Bialek. Are biological systems poised at criticality? Journal of Statistical Physics, 144(2):268–302, 2011.
[5] Miguel A. Muñoz. Colloquium: Criticality and dynamical scaling in living systems. Rev. Mod. Phys., 90:031001, 2018.
[6] George Kingsley Zipf. Selected studies of the principle of relative frequency in language. Harvard University Press, 1932.
[7] Gašper Tkačik, Thierry Mora, Olivier Marre, Dario Amodei, Stephanie E. Palmer, Michael J. Berry, and William Bialek. Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences, 112(37):11508–11513, 2015.
[8] Javier D. Burgos and Pedro Moreno-Tovar. Zipf-scaling behavior in the immune system. Biosystems, 39(3):227–232, 1996.
