The peculiar statistical mechanics of Optimal Learning Machines
Matteo Marsili

TL;DR
This paper explores the statistical mechanics of Optimal Learning Machines, revealing their unique position at a phase transition boundary characterized by a stretched exponential energy distribution, with implications for learnability and predictability.
Contribution
It introduces a theoretical framework linking OLM properties to a specific energy distribution exponent, explaining their independence from environment size and phase transition behavior.
Findings
OLM correspond to a critical case with exponential energy distribution.
Systems with different energy distribution exponents exhibit distinct phase transition types.
OLM's behavior remains stable regardless of environment size, indicating efficient representation.
Abstract
Optimal Learning Machines (OLM) are systems that extract maximally informative representation of the environment they are in contact with, or of the data they are presented. It has recently been suggested that these systems are characterised by an exponential distribution of energy levels. In order to understand the peculiar properties of OLM within a broader framework, I consider an ensemble of optimisation problems over functions of many variables, part of which describe a sub-system and the rest account for its interaction with a random environment. The number of states of the sub-system with a given value of the objective function obeys a stretched exponential distribution, with exponent , and the interaction part is drawn at random from the same distribution, independently for each configuration of the whole system. Systems with then correspond to OLM, and we…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 2
Figure 3
Figure 3
Figure 4
Figure 4
Figure 4Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
The peculiar statistical mechanics of Optimal Learning Machines
Matteo Marsili
The Abdus Salam International Center for Theoretical Physics, Strada Costiera 11, 34151 Trieste, Italy
Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Trieste, Italy
Abstract
Optimal Learning Machines (OLM) are systems that extract maximally informative representation of the environment they are in contact with, or of the data they are presented. It has recently been suggested that these systems are characterised by an exponential distribution of energy levels. In order to understand the peculiar properties of OLM within a broader framework, I consider an ensemble of optimisation problems over functions of many variables, part of which describe a sub-system and the rest account for its interaction with a random environment. The number of states of the sub-system with a given value of the objective function obeys a stretched exponential distribution, with exponent , and the interaction part is drawn at random from the same distribution, independently for each configuration of the whole system. Systems with then correspond to OLM, and we find that they sit at the boundary between two regions with markedly different properties. For all the system exhibits a freezing phase transition. The transition is discontinuous for and it is continuous for . The region corresponds to learnable energy landscapes and the behaviour of the sub-system becomes predictable as the size of the environment exceeds a critical threshold. For , instead, the energy landscape is unlearnable and the behaviour of the system becomes more and more unpredictable as the size of the environment increases. Sub-systems with (OLM) feature a behaviour which is independent of the relative size of the environment. This is consistent with the expectation that efficient representations should be largely independent of the level of detail of the description of the environment.
1 Introduction
Living systems rely in many ways on the efficiency of the internal representation they form of their environment [1, 2]. For example, in order for a bacterium to responds to challenges, it has to encode a representation of the environment in its internal state. This suggests that the metabolism or gene regulatory network can be regarded as learning machines, that have evolved to perform tasks not so dissimilar from pattern recognition in artificial intelligence (e.g. deep neural networks).
Here we focus on a particular ideal limit of what we call optimal learning machines (OLM). These are machines that extract representations that are maximally informative on the generative process of the states of the environment or of the data. It has been shown [3] that OLM so defined, are characterised by an exponential distribution of energy levels, independently of architectural details or of the nature of what is represented. This implies a linear behaviour of the entropy111The entropy here is defined as the logarithm of the number of energy levels at energy . with the energy. This prediction can be tested empirically since it implies statistical criticality in a finite sample, as shown in Refs. [4, 5]. This phenomenon amounts to the observation of broad frequency distributions, i.e. that the number of states observed times in the sample behaves as . Statistical criticality is ubiquitous in empirical data of natural systems that supposedly express efficient representations (see e.g. [5, 6, 7, 8]) as well as in efficient representations in statistical learning [9, 10, 11]. The parameter gauges the trade-off between signal and noise, and Ref. [3] shows that the point corresponds to the most compressed lossless representation. In a finite sample, the case corresponds to Zipf’s law222Zipf’s law is the observation that the frequency of the most frequent outcome in a dataset scales as or that the number of outcomes observed times behaves as . [12, 4], which is observed e.g. in language [6], neural coding [7] and the immune system [8]. In deep neural networks, Ref. [9] shows that layers with are those that best reproduce the statistics of the training sample. This lends some support to the idea that biological systems and machine learning operates close to the ideal limit of OLM.
This evidence suggests that understanding the properties of systems with exponential energy density may shed light both on learning machines in artificial intelligence as well as in Nature [1, 2]. This is the goal of the present paper. Our goal is to reveal the peculiar properties of systems with exponential energy density within a wider class of systems. This is done studying systems with a stretched exponential density of states, that interpolates between OLM and more familiar physical systems, such as the Random Energy Model (REM) [13].
We focus on a generic model, introduced in [14], of a system that optimises a complex function over a large number of variables. The system is composed of a sub-system and its environment. The components of the objective function of the sub-system and of its interaction with the environment, obey a stretched exponential distribution with exponent . The case coincides with the REM whereas the case describes efficient representations. A well defined thermodynamic limit can be defined for all values of when the size of the system diverges, with a fixed ratio between the sizes of the environment and of the sub-system. For all values of the model is described by a Gibbs distribution over the states of the sub-system, that corresponds to a generalised REM with stretched exponential distributions. As shown in Ref. [15], this model exhibits a freezing phase transition, as the strength of the interactions in the sub-system varies (see Fig. 1). Yet the nature of the phase transition differs substantially depending on whether or [15]. The regime is characterised by a continuous transition and a disordered region that shrinks as the relative size of the environment increases. For , instead, the phase transition is sharp (first order), with a disordered region that gets larger for bigger environments. Systems with an exponential distribution (i.e. ) therefore, have a very peculiar behaviour, because they are located exactly at the transition between these two regions. The freezing phase transition for occurs at a critical point that is independent of the size of the environment. This is suggestive for OLM whose internal state should not depend on the degree of details in the description of the environment. Furthermore, OLM exhibit Zipf’s law exactly at the phase transition. This is the only point, in the whole phase diagram, where the (analogous of the) specific heat diverges, as a consequence of the appearance of a broad free energy minimum (i.e. a wide distribution of energies). Hence, Zipf’s law is a unique feature of systems at within the phase diagram of Fig. 1.
The next section reviews the derivation of the exponential density of states for OLM. The following one introduces the problem and discusses its properties whereas Section 4 derives the thermodynamic description. General remarks are drawn in the final section.
2 The Density of States of Optimal Learning Machines
For completeness, this section provides a self-contained derivation of the exponential density for optimal learning machines, in the context of the present paper. Imagine a data generating process , where is a very high-dimensional vector (). Examples of possible systems are digital pictures, where specifies the intensity of the different pixels; the time series of a stock, where each component of is the return of the stock in a particular day; the neural activity of a population of neurons in a particular region of the brain in a particular time interval, where specifies the activity of each neuron, etc.
We assume that the entropy is proportional to , so that is finite. We also assume that satisfies the Asymptotic Equipartition Property (AEP) [16]. This states that points drawn from almost surely belong to the typical set
[TABLE]
for any , in the limit of large . This implies that all points have the same probability , to leading order. Still contains information on the statistical dependencies that we aim at representing.
A representation is a function of the data
[TABLE]
with . The first requirement of an efficient representation is that, upon conditioning on , should contain only irrelevant details. If this is true, the AEP should apply to data generated from , i.e.
[TABLE]
for all such that . Notice that Eq. (3) holds exactly when provides a complete description of the distribution, as in the case where . The AEP identifies the variable in Eq. (3) as the natural coordinate for distinguishing noise (i.e. irrelevant details) from relevant details. Two points and with cannot belong to the same typical set and hence should differ by relevant details. Instead, if , the difference between two points and can be attributed to noise, even if if . If there are configurations of the representation with , then the entropy measures the amount of information the representation is unable to untangle. More precisely, of the total information content
[TABLE]
the part
[TABLE]
measures the number of bits that cannot be distinguished from noise. Notice that, for all such that , we have . Taking the logarithm of this equation
[TABLE]
with .
The second requirement of a maximally informative representation, is that for any fixed value of , should be as small as possible, so that the amount
[TABLE]
of informative bits is as large as possible. It is easy to see that the minimisation of over , at a fixed value of the entropy , leads to an exponential distribution of
[TABLE]
where the parameter enters as a Lagrange multiplier in the minimisation of , to enforce the constraint on . Notice that, when , the problem reduces to that of the unconstrained maximisation of , and we recover . Such a linear behaviour between energy and entropy, as discussed in Refs. [4, 12], corresponds to Zipf’s law and to a uniform distribution of . Indeed, the second requirement is analogous to demanding that should have a distribution which is as broad as possible.
3 An ensemble of optimisation problems
Consider a system described by a configuration of binary (or spin variables) . The system is in contact with an environment, whose configuration is specified by binary (or spin variables) .
As in Ref. [14], we consider the problem of finding the maximum
[TABLE]
of an objective function that can be divided in two parts
[TABLE]
Here depend on the interactions of the variables within the system and accounts for the interactions with the environment333Ref. [14] discusses several examples of systems where this generic description may apply. For example, a protein domain is a sequence of amino acids that has been optimised, in the course of evolution, for a specific function, e.g. regulate the flux of ions across the cellular membrane. This function depends on the interaction () with other molecules in the cell, and on their specific composition . Each sequence in a protein database can be thought of as a realisation of the optimisation process above, for a different choice of . Likewise, a word in a sentence is chosen to best express a concept, depending on the other words of that sentence.. The number of states with is given by
[TABLE]
This can be realised by drawing at random from a stretched exponential distribution, which results in a rough energy landscape, as in the REM [13]. Yet there is no need to assume such a rough energy landscape for the sub-system444One way to define a smooth landscape satisfying Eq. (11), is to assume that depends only on the (Hamming) distance from a state . In order to do this, it is sufficient to equate the entropy to the number of states at distance from . This gives
(12)
The function defined in this way is smooth, apart from the point , where has a singular behaviour.. For the environment, we assume that is drawn from a distribution
[TABLE]
independently, for each and . Therefore depends on the realisation of the interaction with the environment. For we expect the optimisation to depend weakly on the environment, and to be dominated by the term . In this case, will likely be one of the few states with values of close to the maximum , i.e. the probability
[TABLE]
that will be dominated by few values of . Hence, the entropy
[TABLE]
will be small, for . When , instead, we expect that the environment dominates the optimisation, and hence that will be broadly distributed on an exponential number of states. This corresponds to an extensive entropy . Our main focus will be on the transition between these two regimes.
3.1 The Gibbs distribution
As shown in Ref. [14], Extreme Value Theory (EVT) can be invoked to integrate out the degrees of freedom in the environment, by observing that for
[TABLE]
where555This is an asymptotic result, but it is derived taking the maximum over random variables , which is an astronomically large number for . is a random variable which follows a Gumbel distribution , and
[TABLE]
The knowledge of the distribution of allows us to compute the probability that , which is the probability that for all . The result reads [14]
[TABLE]
which is Gibbs distribution with an inverse temperature . Note that, for , as , so the entropy is expected to decrease as the size of the environment increases. On the contrary, for , which means that larger and larger environments make the sub-system’s behaviour less predictable. For instead , i.e. the distribution of is independent of the size of the environment. In this case, Eq. (19) coincides with Eq. (6). Note also that, the parameter discussed in Section 2 is given by .
3.2 System’s learnability
Can the function be learned from a series of experiments, when it is not known in advance? Let be the distribution that encodes the current state of knowledge about the system. For an extensive quantity , it is possible to compute its distribution
[TABLE]
If is a self-averaging quantity, we expect its distribution to be sharply peaked around a typical value . Imagine running an experiment where the value is measured. If within experimental errors, then the current theory is confirmed, otherwise it has to be revised. In the latter case, the standard recipe to update the theory is given by Large Deviation Theory [17]. This maintains that the new distribution should be such that , without assuming anything else. More precisely, the amount of information that the measurement gives on the state is given by the mutual information . Hence, should be the distribution with for which is minimal. The distribution that satisfies this requirement is
[TABLE]
where is adjusted in such a way to satisfy . This process can be continued with additional measures of different observables , and, in principle, it leads to infer
[TABLE]
to the desired accuracy from a series of experiments.
This recipe, however, only works for quantities for which has a distribution which falls off faster than exponential as , which corresponds to . If for with , then the integral defining in Eq. (20) is not defined. There is no well defined way to incorporate the observation in the distribution and to update our state of knowledge in this case666As observed in [16], a distribution that would reproduce is for any such that . A possible interpretation is that, a priori, if then, with probability we should discard the observation and keep the old theory and with probability , instead, we should discard altogether and take as our new theory. In the first case we don’t learn anything. In the second, the current state of knowledge is wiped out altogether. Notice that if it were possible to measure instead of , for a small enough the distribution of may fall off sufficiently fast, thus leading us back to the case .. In this sense, separates the region of learnable systems () from the one () of systems for which cannot be learned through a series of experiments.
4 The thermodynamic limit
The thermodynamic limit is defined as the limit with finite. The largest value of is of the order so
[TABLE]
is extensive777The existence of the thermodynamic limit relies on the choice of the same distribution for both and . Under a different distribution, the thermodynamic limit would require a specific scaling of with ., when the intensive variable varies in the interval . Likewise, the free energy is also extensive. Hence the model of Eq. (19) with drawn from Eq. (13) coincides with a generalised REM, that has been discussed in Ref. [15]. This Section re-derives and discusses its properties in the present setting. We refer to the appendix for detailed calculation and discuss the main results here.
Fig. 2(left) shows the entropy density as a function of . This is the logarithm of the number of states at a given value of , divided by . For this is a concave function, so the thermodynamics can be computed in the usual manner. For a certain value of , the partition function is dominated by the point where is tangent to the line of slope (dashed lines). Notice that, by Eq. (22), is controlled by . As long as
[TABLE]
is dominated by an intermediate point for which an exponential (in ) number of states contribute to the sum in . Accordingly, the entropy
[TABLE]
is extensive, and it vanishes linearly with as (see Fig. 2 right), for all values of . As a consequence, for , the probability is exponentially small in , for all including .
For the slope is larger than that of the curve at , hence is dominated by states with . Hence, the probability is finite as well as the entropy . The phase diagram in the plane is shown in Fig. 3 (left). In summary, the typical behaviour of the REM holds in the whole region .
For , instead, is a convex function of and the construction above fails to work. For all small enough, the partition function is dominated by the point whereas for large it is dominated by states with . As a result, the entropy is for
[TABLE]
whereas as for . The transition between the two regimes is discontinuous, as shown in Fig. 2 (right). Notice that, since is an increasing function of for (see Eq. 22), the transition is also reversed.
The case is discussed in the appendix The phase transition occurs at the point for all values of . As shown in Fig. 4 (left), the entropy decreases sharply from to a finite value. At the transition, the distribution of extends across the whole range , which is signalled by the divergence of the (analog of the) specific heat
[TABLE]
as shown in Fig. 4 (center). This divergence is usually taken as a signature of a second order phase transition. In the ensemble of systems discussed here, it occurs only at . Finally Fig. 4 (right) shows the behaviour of the entropy of the random variable . This, in an efficient representation is taken as a measure of the amount of useful information. In an infinite system diverges at whereas for a finite system it reaches its maximum at .
We remark that the thermodynamic description discussed above holds for any system for which the number of energy levels at energy is given by for , irrespective of the relation between the energy and the configuration of the system. The case where the energy levels are drawn at random, independently, from the same distribution , for , provides a particular (ensemble of) realisation(s) of this system. Yet, this is not the only way in which a function with a given number of states at energy , can be realised. In particular, energy landscapes where is drawn independently for each are not ideal paradigms for learning machines. First because we expect some sort of continuity in the representation, so that similar objects and have similar energies . Second, random landscapes are characterised by an extremely slow dynamics [18]. Hence, a smooth energy landscape is a desirable property of OLM both because of continuity of the representation and because of the dynamical accessibility of the equilibrium state Eq. (19).
5 Discussion
Figure 1 puts on the same phase diagram systems with very different statistical properties. The right side () describes REM like behaviour typical of disordered systems in physics. The left side () describes unlearnable systems with a first order phase transitions. Optimal learning machines, that are characterised by , sit exactly at the boundary between these two regimes.
This lends itself to a number of interesting, though speculative, comments. First, among the systems studied in this paper, OLM have the widest variation of thermodynamically accessible energy levels . Indeed, the range of energies is given by , which is a decreasing function of . Yet, for only and are thermodynamically accessible, so the range of thermodynamically accessible values of is maximal for . This is consistent with the fact that the energy is the natural coordinate in learning because it corresponds to the coding cost . Maximally informative representations use the energy spectrum as efficiently as possible [3].
It is interesting to relate the phase transition for with the trade-off between resolution and noise discussed in Ref. [3] (see also Section 2). As varies traces a convex curve as a function of , where the slope is related to the Lagrange multiplier that is used to enforce the constraint on in the minimisation of [3]. This means that when is reduced by one bit, the noise is reduced by bits. Therefore the region describes noisy representations and correspond to values of larger than the value for which . The region corresponds to . In this region, reduction in the resolution come at the expense of a loss of information on the generative model. In supervised learning, it is reasonable to surmise that compression for occurs at the expense of details of the generative models that are irrelevant with respect to the specific input-output task that the machine is learning. Hence the representation depends significantly on the output. Conversely, for we expect that the representation depends mostly on the input and only weakly on the output. This leads to the conjecture that maximally informative representations have an universal nature for , which depend mostly on the input data, and are largely independent of the specific input-output relation that the machine is learning. In this picture, the phase transition at marks the point where the ergodicity in the space of representations (and the symmetry with respect to different outputs) gets (spontaneously) broken. This conjecture can in principle be disproved or confirmed by further research on specific architectures888As an analogy, the critical temperature in a ferromagnetic Ising model, marks the point where the response to a small external magnetic field changes dramatically. In the paramagnetic region, the response is continuous whereas in the ferromagnetic phase it is discontinuous. A possible way to confirm this conjecture might be to probe the response of maximally informative representations to changes in the output, at different values of . The change should small and “continuous” in the phase and sharp in the phase..
We’ve also seen that systems with cannot be learned from a series of experiments and OLM sit exactly at the boundary between learnable and unlearnable systems. In order to appreciate the possible significance of this observation, let us consider a larger system
[TABLE]
with additional variables . As in Ref. [14], the different terms in Eq. (27) can be defined as
[TABLE]
with representing the expected value on the distribution of at given , i.e. is the best estimate of the objective function, when the variable is fixed. Let us also assume that and are drawn independently from a stretched exponential distribution with exponents and , respectively. In the limit when , the derivation in Section 3.1 shows that the statistics of the variable
[TABLE]
still follows the Gibbs distribution Eq. (19), but the value of is dominated by the variables if and by the variables otherwise999Note that, the decomposition in Eq. (27) is not unique, since one could as well define . Hence, one without loss of generality, one can focus on the decomposition for which .. Therefore, the most relevant set of variables are those with the smallest value of . In this sense, systems with are characterised by the most relevant variables that can be implemented in a physically accessible system. This also offers a guideline for finding relevant variables in high-dimensional data, as those for which the sample exhibits statistical criticality (see Refs. [19, 20] for attempts in this direction). Furthermore, fort one recovers the Eq. (19) with . In words, the behaviour of OLM is invariant if further details are added to the problem, which is a desirable property of efficient representations. For example, the classification of a dataset of images should be invariant, independently of the resolution of the images, beyond a certain level.
A further unique property of systems with is that the system can, in principle, be further decomposed in sub-systems with the same properties. More precisely, one can find variables and such that and , with and having again a distributions that asymptotically behaves as an exponential. In particular, critical systems with admit sub-systems that are also “poised” at the critical point . It is tempting to regard this remarkable self-similarity as a distinguishing feature of living systems. For example, both the abundance of metabolites [21] and gene expression levels [22] inside cells have been reported to obey Zipf’s law.
On the contrary, systems with exhibit a behaviour which is more and more predictable the smaller is the number of variables (i.e. for large ). Within the simple class of models discussed here, the possibility to describe a complex system in terms of few variables101010This was regarded as a wonderful gift by Wigner [23]. emerges as a typical property of physical systems with .
Acknowledgements
Interesting discussions and useful comments with J. Barbier, J.-P. Bouchaud and S. Franz are gratefully acknowledged.
Appendix A The Statistical mechanics approach for
The maximum value of , from EVT, is given by
[TABLE]
where is also a random variable drawn from a Gumbel distribution. Here and in what follows, we introduced the shorthand . Therefore, neglecting corrections, we introduce the intensive variable by
[TABLE]
We focus on the case where the size of the heat bath is proportional to . Then
[TABLE]
is extensive and the number of configurations with is
[TABLE]
which has the conventional exponential behaviour with systems size . In the annealed approximation, we can compute the partition function as
[TABLE]
For the free energy is a concave function and can be computed by saddle point. The saddle point value reads
[TABLE]
As long as , the annealed approximation is valid. This holds as long as
[TABLE]
The saddle point calculation yields
[TABLE]
This allows us to compute the entropy for which is given by
[TABLE]
which vanishes as .
The saddle point approximation cannot be used for because the function is convex. Indeed the integral in Eq. (37) is either dominated by the point or by the point . As long as the first dominates and we have
[TABLE]
The condition is equivalent to
[TABLE]
As long as this condition is satisfied, the entropy is asymptotically the same as the entropy of a flat distribution over the states . When the annealed approximation ceases to be valid because the partition function is dominated by few states.
When is dominated by the point , i.e. for , the annealed partition function can be estimated with the change of variables so that the free energy becomes and the integral yields
[TABLE]
which suggests that the probability of states with is of order one:
[TABLE]
Notice that this does not vanish as which is a further signature of a first order phase transition.
Appendix B The case
In the case we can resort to a simple approximation, assuming that the spectrum of possible values of is limited in the range , with
[TABLE]
and that the number of energy levels in the interval is given by
[TABLE]
We can obtain most quantities of interest from the partition function
[TABLE]
Indeed, yields the normalisation of the distribution over and derivatives of with respect to , computed at yield the moments of the distribution of . Also, the entropy
[TABLE]
Within the approximation above, we find
[TABLE]
The expected value of reads
[TABLE]
where we have introduced the scaling variable . For the leading behaviour for is obtained for , whereas for it is obtained for . Hence
[TABLE]
where for and is the Heaviside function. The specific heat is given by
[TABLE]
The entropy reads
[TABLE]
It is also possible to compute the entropy of the variable
[TABLE]
for and one finds whereas at one finds .
B.1 A refined approach
The approach discussed so far relies on the annealed approximation for the partition function. This approach is accurate in the disordered phase but it does not work in the frozen phase. Indeed, for the partition function is dominated by few states and it is not self averaging. The probability is a function of and as such, it attains different values depending on the values of . As a result, the entropy is also a random variable.
In order to appreciate this effects, we compute the function
[TABLE]
where stands for the average over and whereas for the average over all values of for , with . Now,
[TABLE]
with . The term is vanishingly small unless . Hence, for , and we can write
[TABLE]
Anticipating that is non-negligible for values of , we compute
[TABLE]
where we set in the last equation and took the limit .
Inserting this in Eq. (58), with the change of variables , we observe that observe that with a factor that cancels the sum on . Therefore, with , we find
[TABLE]
Setting the integrals separate and one finds
[TABLE]
Note that as necessary for normalisation. The knowledge of allows us to compute observables in the region. For example the probability that two replicas end up in the same state, i.e that , is given by . Likewise, the probability that replicas coincide is
[TABLE]
which vanishes linearly with , for all , and it decays as for .
The expected value of the entropy is given by
[TABLE]
The leading divergence as matches the one found within the annealed approximation. Its variance can also be computed
[TABLE]
Interestingly, as .
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Gašper Tkačik and William Bialek. Information processing in living systems. Annual Review of Condensed Matter Physics , 7(1):89–117, 2016.
- 2[2] Jorge Hidalgo, Jacopo Grilli, Samir Suweis, Miguel A. Muñoz, Jayanth R. Banavar, and Amos Maritan. Information-based fitness and the emergence of criticality in living systems. Proceedings of the National Academy of Sciences , 111(28):10095–10100, 2014.
- 3[3] Ryan John Cubero, Junghyo Jo, Matteo Marsili, Yasser Roudi, and Juyong Song. Statistical criticality arises in most informative representations. Journal of Statistical Mechanics: Theory and Experiment , 2019(6):063402, jun 2019.
- 4[4] Thierry Mora and William Bialek. Are biological systems poised at criticality? Journal of Statistical Physics , 144(2):268–302, 2011.
- 5[5] Miguel A. Muñoz. Colloquium: Criticality and dynamical scaling in living systems. Rev. Mod. Phys. , 90:031001, Jul 2018.
- 6[6] George Kingsley Zipf. Selected studies of the principle of relative frequency in language . Harvard university press, 1932.
- 7[7] Gašper Tkačik, Thierry Mora, Olivier Marre, Dario Amodei, Stephanie E Palmer, Michael J Berry, and William Bialek. Thermodynamics and signatures of criticality in a network of neurons. Proceedings of the National Academy of Sciences , 112(37):11508–11513, 2015.
- 8[8] Javier D. Burgos and Pedro Moreno-Tovar. Zipf-scaling behavior in the immune system. Biosystems , 39(3):227 – 232, 1996.
