The Stochastic complexity of spin models: Are pairwise models really simple?

Alberto Beretta; Claudia Battistin; Clélia de Mulatier; Iacopo Mastromatteo; Matteo Marsili

arXiv:1702.07549·cond-mat.dis-nn·October 17, 2018·Entropy

TL;DR

This paper investigates the stochastic complexity of spin models with various interaction orders, revealing that model simplicity depends on the arrangement of interactions rather than their order, with fully connected pairwise models being highly complex.

Contribution

It introduces a framework to analyze the stochastic complexity of spin models, highlighting invariances and classifying models into equivalence classes based on complexity.

Findings

1. Models with localized, non-overlapping interactions are simple.
2. Fully connected pairwise models are highly complex due to extensive interactions.
3. Complexity depends on interaction arrangement, not order.

Equations

$$P(\boldsymbol{s}\,|\,\boldsymbol{g},\mathcal{M})=\frac{1}{Z_{\mathcal{M}}(\boldsymbol{g})}\,e^{\sum_{\mu\in\mathcal{M}}g^{\mu}\phi^{\mu}(\boldsymbol{s})}$$

$$\log\sum_{\boldsymbol{\hat{s}}}P(\boldsymbol{\hat{s}}\,|\,\boldsymbol{\hat{g}},\mathcal{M})\simeq\frac{|\mathcal{M}|}{2}\log\left(\frac{N}{2\pi}\right)+c_{\mathcal{M}}$$

$$c_{\mathcal{M}}=\log\int d\boldsymbol{g}\,\sqrt{\det\mathbb{J}(\boldsymbol{g})}$$

$$J_{\mu\nu}(\boldsymbol{g})=\frac{\partial^{2}}{\partial g^{\mu}\,\partial g^{\nu}}\log Z_{\mathcal{M}}(\boldsymbol{g})$$

$$N_{GT}(n)=2^{n^{2}}\prod_{k=1}^{n}\left(1-2^{-k}\right)$$

$$Z_{\mathcal{M}}(\boldsymbol{g})=2^{n}\bigg(\prod_{\mu\in\mathcal{M}}\cosh(g^{\mu})\bigg)\sum_{\ell\in\mathcal{L}}\;\prod_{\mu\in\ell}\tanh(g^{\mu})$$

$$c_{\mathcal{M}}=\log\int_{\mathcal{F}}d\boldsymbol{\varphi}\,\sqrt{\det\mathbb{J}(\boldsymbol{\varphi})}$$

$c_{\overline{\mathcal{M}}}$

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Statistical Mechanics and Entropy · Neural Networks and Applications

Full text

The Stochastic complexity of spin models:

Are pairwise models really simple?

Alberto Beretta

The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, I-34014 Trieste, Italy

Claudia Battistin

Kavli Institute for Systems Neuroscience and Centre for Neural Computation, NTNU, Olav Kyrres gate 9, 7030 Trondheim, Norway

Clélia de Mulatier

The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, I-34014 Trieste, Italy

Iacopo Mastromatteo

Capital Fund Management, 23 rue de l’Université, 75007 Paris, France

Matteo Marsili

The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, I-34014 Trieste, Italy

Abstract

Models can be simple for different reasons: because they yield a simple and computationally efficient interpretation of a generic dataset (e.g. in terms of pairwise dependences) – as in statistical learning – or because they capture the essential ingredients of a specific phenomenon – as e.g. in physics – leading to non-trivial falsifiable predictions. In information theory and Bayesian inference, the simplicity of a model is precisely quantified by the stochastic complexity, which measures the number of bits needed to encode its parameters. In order to understand what simple models look like, we study the stochastic complexity of spin models with interactions of arbitrary order. We highlight the existence of invariances with respect to bijections within the space of operators, which allow us to partition the space of all models into equivalence classes, in which models share the same complexity. We thus find that the complexity (or simplicity) of a model is not determined by the order of the interactions, but rather by their mutual arrangements. Models where statistical dependencies are localized on non-overlapping groups of few variables (and that afford predictions on independencies that are easy to falsify) are simple. On the contrary, fully connected pairwise models, which are often used in statistical learning, appear to be highly complex, because of their extended set of interactions.

Information theory $|$ Statistical inference $|$ Model complexity $|$ Spin model

Science, as the endeavour of reducing complex phenomena to simple principles and models, has been instrumental in solving practical problems. Yet, problems such as image or speech recognition and language translation have shown that Big Data can solve problems without necessarily understanding them (Mayer-Schonberger and Cukier (2013); Anderson (2008); Cristianini (2010)). A statistical model trained on a sufficiently large number of instances can learn how to mimic the performance of the human brain on these tasks LeCun et al. (2010); Hannun et al. (2014). These models are simple in the sense that they are easy to evaluate, train and/or infer. They offer simple interpretations in terms of low order (typically pairwise) dependencies, which in turn afford an explicit graph theoretical representation Bishop (2006). Their aim is not to uncover fundamental laws but to “generalize well”, i.e. to describe as yet unseen data well. For this reason, machine learning relies on “universal” models that are apt to describe any possible data on which they can be trained Wu et al. (2008), using suitable “regularization” schemes in order to tame parameter fluctuations (overfitting) and achieve small generalization error Goodfellow et al. (2016).

Scientific models, instead, are the simplest possible descriptions of experimental results. A physical model is a representation of a real system and its structure reflects the laws and symmetries of Nature. It predicts well not because it generalizes well, but rather because it captures essential features of the specific phenomena that it describes. It should depend on few parameters and is designed to provide predictions that are easy to falsify Popper (2005). For example, Newton’s laws of motion are consistent with momentum conservation, a fact that can be checked in scattering experiments.

The intuitive notion of a “simple model” hints at a succinct description, one that requires few bits Chater and Vitányi (2003). The stochastic complexity Rissanen (1997), derived within Minimum Description Length (MDL) Rissanen (1978); Grünwald (2007), provides a quantitative measure for “counting” the complexity of models in bits. The question this paper addresses is: what are the features of simple models according to MDL and are they simple in the sense surmised in statistical learning or in physics? In particular, are models with up to pairwise interactions, which are frequently used in statistical learning, simple?

We address this issue in the context of spin models, describing the statistical dependence among $n$ binary variables. There has been a surge of recent interest in the inference of spin models Chau Nguyen et al. (2017) from high dimensional data, most of which was limited to pairwise models. This is partly because pairwise models allow for an intuitive graph representation of statistical dependencies. Most importantly, since the number of $k$ -variable interactions grows as $n^{k}$ , the number of samples is hardly sufficient to go beyond $k=2$ . For this reason, efforts to go beyond pairwise interactions have mostly focused on low order interactions (e.g. $k=3$ , see Margolin et al. (2010) and references therein). Ref. Merchan and Nemenman (2016) recently suggested that even for data generated by models with higher order interactions, pairwise models may provide a sufficiently accurate description of the data. Within the class of pairwise models, L1 regularization Ravikumar et al. (2010) has proven to be a remarkably efficient heuristic of model selection (but see also Bulso et al. (2016)).

Here we focus on the exponential family of spin models with interactions of arbitrary order. This class of models assumes a sharp separation between relevant observables and irrelevant ones, whose expected value is predicted by the model. In this setting, the stochastic complexity Rissanen (1997) computed within MDL coincides with the penalty that, in Bayesian model selection, accounts for the model’s complexity, under non-informative (Jeffreys’) priors Balasubramanian (1997).

.1 The exponential family of spin models (with interactions of arbitrary order)

Consider $n$ spin variables $\boldsymbol{s}=(s_{1},\ldots,s_{n})$ , taking values $s_{i}=\pm 1$ . The probability distribution of $\boldsymbol{s}$ under a model $\mathcal{M}$ belonging to the exponential family is given by:

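$$P(\boldsymbol{s}\,|\,\boldsymbol{g},\mathcal{M})=\frac{1}{Z_{\mathcal{M}}(\boldsymbol{g})}\,e^{\sum_{\mu\in\mathcal{M}}g^{\mu}\phi^{\mu}(\boldsymbol{s})},\tag{1}$$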

where the model $\mathcal{M}$ is identified by the set $\{\phi^{\mu}(\boldsymbol{s}),~\mu\in\mathcal{M}\}$ of product spin operators, $\phi^{\mu}(\boldsymbol{s})=\prod_{i\in\mu}s_{i}$. Each operator $\phi^{\mu}(\boldsymbol{s})$ models the interaction that involves all the spins of the subset $\mu$ of the $n$ spins. We thus consider interactions of any arbitrary order (see Appendix sec. SI-0). For instance, for pairwise interaction models, the operators $\phi^{\mu}(\boldsymbol{s})$ are single spins $s_{i}$ or products of two spins $s_{i}s_{j}$, for $i,j\in\{1,...,n\}$. The $g^{\mu}$ are the conjugate parameters that modulate the strength of the interaction associated with $\phi^{\mu}$. Finally, the partition function $Z_{\mathcal{M}}(\boldsymbol{g})$ ensures normalisation. [Footnote 1: There is a broader class of models, where subsets $\mathcal{V}\subseteq\mathcal{M}$ of operators have the same parameter, i.e. $g^{\mu}=g^{\mathcal{V}}$ for all $\mu\in\mathcal{V}$. These degenerate models are rarely considered in the inference literature. Here we confine our discussion to non-degenerate models and refer the reader to Appendix sec. SI-7 for more discussion.]
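As an illustration of the definition above, the following minimal sketch evaluates $P(\boldsymbol{s}\,|\,\boldsymbol{g},\mathcal{M})$ by brute-force enumeration of the $2^{n}$ states. The bitmask encoding of each subset $\mu$ (bit $i$ set means spin $s_{i+1}$ belongs to $\mu$) and the function names `operator_value` and `model_distribution` are choices made here for illustration, not notation from the paper.

```python
import itertools
import math

def operator_value(mask, spins):
    """Product of the spins selected by the set bits of `mask` (the operator phi^mu)."""
    value = 1
    for i, s in enumerate(spins):
        if (mask >> i) & 1:
            value *= s
    return value

def model_distribution(masks, couplings, n):
    """Return {state: P(s | g, M)} by enumerating all 2^n spin configurations."""
    weights = {}
    for spins in itertools.product((-1, 1), repeat=n):
        exponent = sum(g * operator_value(m, spins) for m, g in zip(masks, couplings))
        weights[spins] = math.exp(exponent)
    z = sum(weights.values())  # partition function Z_M(g)
    return {spins: w / z for spins, w in weights.items()}

# Illustrative pairwise model on n = 3 spins: fields on s1 and s2, coupling s1*s2.
masks = [0b001, 0b010, 0b011]
couplings = [0.3, -0.2, 0.5]
p = model_distribution(masks, couplings, n=3)
```

In this example no operator involves the third spin, so the model predicts that it is unbiased and independent of the other two.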

We remark that the models of (1) can be derived as the maximum entropy distributions that are consistent with the requirement that the model reproduces the empirical averages of the operators $\phi^{\mu}(\boldsymbol{s})$ for all $\mu\in\mathcal{M}$ on a given dataset Jaynes (1957); Tikochinsky et al. (1984). In other words, empirical averages of $\phi^{\mu}(\boldsymbol{s})$ are sufficient statistics, i.e. their values are enough to compute the maximum likelihood parameters $\boldsymbol{\hat{g}}$ . Therefore the choice of the operators $\phi^{\mu}$ in $\mathcal{M}$ inherently entails a sharp separation between relevant variables (the sufficient statistics) and irrelevant ones, which may have important consequences in the inference process. For example, if statistical inference assumes pairwise interactions, it might be blind to relevant patterns in the data resulting from higher order interactions. Without prior knowledge, all models $\mathcal{M}$ should be compared. According to MDL and Bayesian model selection (see Appendix sec. SI-0), models should be compared on the basis of their maximum (log)likelihood corrected by their complexity. In other words, simple models should be preferred a priori.

Stochastic complexity

The complexity of a model can be defined unambiguously within MDL as the number of bits needed to specify a priori the parameters $\boldsymbol{\hat{g}}$ that best describe a dataset $\boldsymbol{\hat{s}}=(\boldsymbol{s}^{(1)},\dots,\boldsymbol{s}^{(N)})$ consisting of $N$ samples independently drawn from the distribution $P(\boldsymbol{s}\,|\,\boldsymbol{g},\mathcal{M})$ for some unknown $\boldsymbol{g}$ (see Appendix sec. SI-0). Asymptotically for $N\to\infty$ , for systems of discrete variables, the MDL complexity is given by Rissanen (1996, 2001):

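$$\log\sum_{\boldsymbol{\hat{s}}}P(\boldsymbol{\hat{s}}\,|\,\boldsymbol{\hat{g}},\mathcal{M})\simeq\frac{|\mathcal{M}|}{2}\log\left(\frac{N}{2\pi}\right)+c_{\mathcal{M}}.\tag{3}$$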

The two terms on the r.h.s. make up the stochastic complexity Rissanen (1997); Myung et al. (2000). The first term, which is the basis of the Bayesian Information Criterion (BIC) Schwarz (1978); Myung et al. (2000), captures the increase of the complexity with the number $|\mathcal{M}|$ of model parameters and with the number $N$ of data points. This accounts for the fact that the uncertainty in each component of $\boldsymbol{\hat{g}}$ decreases with $N$ as $N^{-1/2}$, so its description requires $\sim\frac{1}{2}\log N$ bits. The second term $c_{\mathcal{M}}$ quantifies the statistical dependencies between the parameters, and it is given by

$$c_{\mathcal{M}}=\log\int d\boldsymbol{g}\,\sqrt{\det\mathbb{J}(\boldsymbol{g})},\tag{4}$$

where $\mathbb{J}(\boldsymbol{g})$ is the Fisher Information matrix with entries

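$$J_{\mu\nu}(\boldsymbol{g})=\frac{\partial^{2}}{\partial g^{\mu}\,\partial g^{\nu}}\log Z_{\mathcal{M}}(\boldsymbol{g}).\tag{5}$$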

The term $c_{\mathcal{M}}$ encodes the intrinsic notion of simplicity we are interested in. To distinguish these two terms, we will refer to the first as the BIC term and to the second as the stochastic complexity. For an exponential family, the MDL criterion (3) coincides with the Bayesian model selection approach, assuming Jeffreys’ prior over the parameters $\boldsymbol{g}$ Myung et al. (2000); Jeffreys (1946); Amari (2016) (see Appendix sec. SI-0). Within a fully Bayesian approach, the model that maximises its posterior given the data $\boldsymbol{\hat{s}}$, $P(\mathcal{M}|\boldsymbol{\hat{s}})$, is the one to be selected. Therefore, if two models have the same number of parameters (same BIC term), the simplest one, i.e. the one with the lowest stochastic complexity $c_{\mathcal{M}}$, has to be chosen a priori. However, the number of possible interactions $\phi^{\mu}$ among $n$ spins is $2^{n}-1$, and therefore the number of spin models is $2^{2^{n}-1}$. The super-exponential growth of the number of models with the number of spins $n$ makes selecting the simplest model unfeasible even for moderate $n$. Our aim is then to understand how the stochastic complexity depends on the structure of the model $\mathcal{M}$ and eventually provide guidelines for the search of simpler models in such a huge space.
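For small $n$ the Fisher Information matrix can be evaluated directly. The sketch below relies on the standard exponential-family identity that the Hessian of $\log Z_{\mathcal{M}}$ equals the covariance of the operators, $J_{\mu\nu}=\langle\phi^{\mu}\phi^{\nu}\rangle-\langle\phi^{\mu}\rangle\langle\phi^{\nu}\rangle$; the bitmask encoding of operators and the function names are illustrative assumptions, not taken from the paper.

```python
import itertools
import math

def operator_value(mask, spins):
    """Product of the spins selected by the set bits of `mask`."""
    value = 1
    for i, s in enumerate(spins):
        if (mask >> i) & 1:
            value *= s
    return value

def fisher_information(masks, couplings, n):
    """J[a][b] = <phi_a phi_b> - <phi_a><phi_b> under P(s | g, M), by enumeration."""
    states = list(itertools.product((-1, 1), repeat=n))
    phis = [[operator_value(m, s) for m in masks] for s in states]
    weights = [math.exp(sum(g * f for g, f in zip(couplings, row))) for row in phis]
    z = sum(weights)
    probs = [w / z for w in weights]
    k = len(masks)
    mean = [sum(p * row[a] for p, row in zip(probs, phis)) for a in range(k)]
    return [
        [sum(p * row[a] * row[b] for p, row in zip(probs, phis)) - mean[a] * mean[b]
         for b in range(k)]
        for a in range(k)
    ]

def number_of_models(n):
    """Each of the 2^n - 1 operators is either present or absent in a model."""
    return 2 ** (2 ** n - 1)

# Two single-spin operators on two spins: the matrix is diagonal.
J = fisher_information([0b01, 0b10], [0.4, -0.7], n=2)
```

For this loop-free example the diagonal entries reduce to $1-\tanh^{2}(g^{\mu})$, and `number_of_models(4)` already gives 32768 candidate models.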

Equivalence classes of models

.2 Gauge transformations

Let’s start by showing that low order interactions do not have a privileged status and are not necessarily related to low complexity $c_{\mathcal{M}}$ , with the following argument: Alice is interested in finding which model $\mathcal{M}$ best describes a dataset $\boldsymbol{\hat{s}}$ ; Bob is interested in the same problem, but his dataset $\mathbf{\hat{\boldsymbol{\sigma}}}$ is related to Alice’s dataset by a gauge transformation. The latter is defined as a bijective transformation between the $n$ spin variables $\boldsymbol{s}$ of Alice and those of Bob, $\boldsymbol{\sigma}=(\sigma_{1},\cdots,\sigma_{n})\in\{\pm 1\}^{n}$ , that corresponds to a bijection from the set of all operators to itself, $\phi^{\mu}(\boldsymbol{s})\to\phi^{\mu^{\prime}}(\boldsymbol{\sigma})$ (see the examples in Fig. 1 and Appendix sec. SI-1). This induces a bijective transformation between Alice’s models and those of Bob, as shown in Fig. 1, that preserves the number of interactions $|\mathcal{M}|$ . Whatever conclusion Bob draws on the relative likelihood of models can be translated into Alice’s world, where it has to coincide with Alice’s result. It follows that two models $\mathcal{M}$ and $\mathcal{M}^{\prime}$ related by a gauge transformation must also have the same complexity $c_{\mathcal{M}}=c_{\mathcal{M}^{\prime}}$ . In particular, pairwise interactions can be mapped to interactions of any order (see Fig. 1), and, consequently, low order interactions are not necessarily simpler than higher order ones.

Observe that models connected by gauge transformations have remarkably different structures. In Fig. 1, model a) has all the possible interactions concentrated on 3 spins, having the properties of a simplicial complex Courtney and Bianconi (2016); however, its gauge-transformed counterparts are not simplicial complexes. Model d) is invariant under any permutations of the four spins, whereas the other models have a lower degree of symmetry under permutations (see the different multiplicities in Fig. 1). [Footnote 2: A simplicial complex Courtney and Bianconi (2016), in our notation, is a model such that, for any interaction $\mu\in\mathcal{M}$, any interaction that involves any subset $\nu\subseteq\mu$ of spins is also contained in the model (i.e. $\nu\in\mathcal{M}$).]

Gauge transformations are discussed in more detail in Appendix sec. SI-1. One can also see them as a change of the basis $\boldsymbol{s}\to\boldsymbol{\sigma}$ in which the operators are expressed. Counting the number of possible bases then gives us the number of gauge transformations (see Appendix sec. SI-1):

$$N_{GT}(n)=2^{n^{2}}\prod_{k=1}^{n}\left(1-2^{-k}\right).\tag{6}$$

Notice that the number of gauge transformations, (6), is much smaller than the number $2^{n}!$ of possible bijections of the set of $2^{n}$ states into itself. Indeed a generic bijection between the state spaces of $\boldsymbol{s}$ and $\boldsymbol{\sigma}$ maps each product operator to one of the binary functions $f:\boldsymbol{\sigma}\to\{+1,-1\}$ , which does not necessarily correspond to a product operator $\phi^{\mu}(\boldsymbol{\sigma})$ .
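To get a feeling for these numbers, the sketch below evaluates the count of gauge transformations with exact integers and compares it with $2^{n}!$. It uses the equivalent product $\prod_{k=0}^{n-1}(2^{n}-2^{k})$, which equals $2^{n^{2}}\prod_{k=1}^{n}(1-2^{-k})$ after factoring $2^{n}$ out of each term; the function name is an illustrative choice.

```python
import math

def num_gauge_transformations(n):
    """N_GT(n) = 2^(n^2) * prod_{k=1..n} (1 - 2^-k), computed with exact integers."""
    count = 1
    for k in range(n):
        count *= 2 ** n - 2 ** k
    return count

for n in range(1, 5):
    n_gt = num_gauge_transformations(n)
    n_bijections = math.factorial(2 ** n)
    print(f"n={n}: N_GT={n_gt}, (2^n)!={n_bijections}")
```

For $n=4$ this gives 20160 gauge transformations against $16!\approx 2\times 10^{13}$ generic bijections of the state space.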

.3 Complexity classes

Gauge transformations allow us to divide the set of all models into equivalence classes, which we call complexity classes. Models belonging to the same class are related to each other by a gauge transformation (that is the equivalence relation), and thus have the same complexity $c_{\mathcal{M}}$ . This classification suggests the presence of “quantum numbers” (invariants), in terms of which models can be classified. These invariants emerge explicitly when writing the cluster expansion of the partition function Landau and Lifshitz (2013); Kramers and Wannier (1941); Pelizzola (2005) (see Appendix sec. SI-2):

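$$Z_{\mathcal{M}}(\boldsymbol{g})=2^{n}\bigg(\prod_{\mu\in\mathcal{M}}\cosh(g^{\mu})\bigg)\sum_{\ell\in\mathcal{L}}\;\prod_{\mu\in\ell}\tanh(g^{\mu}).\tag{7}$$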

The sum runs over the set $\mathcal{L}$ of all possible loops $\ell$ that can be formed with the operators $\mu\in\mathcal{M}$. A loop is any subset $\ell\subseteq\mathcal{M}$ such that $\prod_{\mu\in\ell}\phi^{\mu}(\boldsymbol{s})=1$ for any value of $\boldsymbol{s}$, i.e. such that each spin $s_{i}$ occurs an even number of times (possibly zero) in this product. The set $\mathcal{L}$ includes the empty loop $\ell=\emptyset$. The structure of $Z_{\mathcal{M}}(\boldsymbol{g})$ in (7) depends on few characteristics of the model $\mathcal{M}$: the number $|\mathcal{M}|$ of operators (or, equivalently, of parameters) and the structure of its set of loops $\mathcal{L}$ (which operator is involved in which loop). The invariance under gauge transformation of the complexity in (4) reveals itself in the fact that the partition functions of models related by a gauge transformation have the same functional dependence on their parameters up to relabeling.
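The loop expansion can be checked numerically on small models. The following sketch compares a brute-force sum over states with the expression built from $\cosh$, $\tanh$ and the set of loops, where a loop is detected as a subset of operators whose bitmasks XOR to zero. It assumes distinct, non-empty operators encoded as bitmasks; the encoding and function names are illustrative and not taken from the paper.

```python
import itertools
import math

def z_brute_force(masks, couplings, n):
    """Z = sum over the 2^n states of exp(sum_mu g_mu * phi_mu(s))."""
    total = 0.0
    for state in range(2 ** n):
        # Bit i of `state` set means s_i = -1; an operator equals -1 when it
        # contains an odd number of such spins.
        exponent = 0.0
        for m, g in zip(masks, couplings):
            sign = -1 if bin(m & state).count("1") % 2 else 1
            exponent += g * sign
        total += math.exp(exponent)
    return total

def z_loop_expansion(masks, couplings, n):
    """Z = 2^n * prod cosh(g_mu) * sum over loops of prod tanh(g_mu)."""
    prefactor = 2 ** n * math.prod(math.cosh(g) for g in couplings)
    loop_sum = 0.0
    indices = range(len(masks))
    for size in range(len(masks) + 1):
        for subset in itertools.combinations(indices, size):
            xor = 0
            for idx in subset:
                xor ^= masks[idx]
            if xor == 0:  # every spin appears an even number of times: a loop
                loop_sum += math.prod(math.tanh(couplings[idx]) for idx in subset)
    return prefactor * loop_sum

# Single loop of length three: s1, s2 and s1*s2.
masks, couplings = [0b01, 0b10, 0b11], [0.3, -0.2, 0.5]
z_direct = z_brute_force(masks, couplings, n=2)
z_loops = z_loop_expansion(masks, couplings, n=2)
```

The empty subset is counted as the empty loop and contributes 1 to the sum, so a loop-free model reduces to $2^{n}\prod_{\mu}\cosh(g^{\mu})$.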

Let us focus on the loop structure of models belonging to the same class. The set $\mathcal{L}$ of loops of any model $\mathcal{M}$ has the structure of a finite Abelian group: if $\ell_{1},\ell_{2}\in\mathcal{L}$, then $\ell_{1}\oplus\ell_{2}$ is also a loop of $\mathcal{M}$, where $\oplus$ is the symmetric difference of two sets (see Appendix sec. SI-3). [Footnote 3: The symmetric difference of two sets $\ell_{1}$ and $\ell_{2}$ is the set that contains the elements that occur in $\ell_{1}$ but not in $\ell_{2}$ and vice versa: $\ell_{1}\oplus\ell_{2}=(\ell_{1}\cup\ell_{2})\setminus(\ell_{1}\cap\ell_{2})$. It corresponds to the XOR operator between the spins of the two loops.] As a consequence, for each model $\mathcal{M}$ one can identify a minimal generating set of $\lambda$ loops, such that any loop in $\mathcal{L}$ can be uniquely expressed as a product of loops in the minimal generating set. Note that the choice of the generating set is not unique, though all choices have the same cardinality $\lambda$; Fig. 2 gives examples of this decomposition for the models of Fig. 1. Note also that $\ell\oplus\ell=\emptyset$ for each loop $\ell\in\mathcal{L}$. As a consequence, the cardinality of the loop group is $|\mathcal{L}|=2^{\lambda}$ (including the empty loop $\emptyset$). We found that $\lambda$ is related to the number $|\mathcal{M}|$ of operators of the model by $\lambda=|\mathcal{M}|-n_{\mathcal{M}}$ (see Appendix sec. SI-3), where $n_{\mathcal{M}}$ is the number of independent operators of a model $\mathcal{M}$, i.e. the maximal number of operators that can be taken in $\mathcal{M}$ without forming any loop. This implies that $\lambda$ attains its minimal value, $\lambda=0$, for models with only independent operators ($|\mathcal{M}|=n_{\mathcal{M}}$), and its maximal value, $\lambda=2^{n}-1-n$, for the complete model $\overline{\mathcal{M}}$, that contains all the $|\overline{\mathcal{M}}|=2^{n}-1$ possible operators.
The number of independent operators $n_{\mathcal{M}}$ is preserved by gauge transformation, and, as the total number of operators $|\mathcal{M}|$ is also an invariant of the class, so is the cardinality of the minimal generating set $\lambda$ . For example, all models in Fig. 1 have $n_{\mathcal{M}}=3$ independent operators and $\lambda=4$ (see Fig. 2). It can also be shown that gauge transformations imply a duality relation, that associates to each class of models with $|\mathcal{M}|$ operators a class of models with the $2^{n}-1-|\mathcal{M}|$ complementary operators (see Appendix sec. SI-3). Summarizing, the quantities $|\mathcal{M}|$ and $n_{\mathcal{M}}$ , and the structure of $\mathcal{L}$ (through its generators) fully characterize a complexity class.
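These invariants are straightforward to compute for a given model. In the sketch below, operators are again encoded as bitmasks (an assumption made for illustration), so that $n_{\mathcal{M}}$ becomes the rank of the set of masks over GF(2), $\lambda=|\mathcal{M}|-n_{\mathcal{M}}$, and the loops are the subsets whose masks XOR to zero. The function names are hypothetical.

```python
import itertools

def gf2_rank(masks):
    """Number n_M of independent operators: rank of the bitmasks over GF(2)."""
    basis = []
    for m in masks:
        for b in basis:
            m = min(m, m ^ b)  # clears the leading bit of b from m when present
        if m:
            basis.append(m)
    return len(basis)

def count_loops(masks):
    """Brute-force count of subsets (including the empty one) that XOR to zero."""
    total = 0
    for size in range(len(masks) + 1):
        for subset in itertools.combinations(masks, size):
            xor = 0
            for m in subset:
                xor ^= m
            if xor == 0:
                total += 1
    return total

# Complete model on three spins: all 7 operators s1, s2, s1 s2, ..., s1 s2 s3.
complete_3 = [1, 2, 3, 4, 5, 6, 7]
n_indep = gf2_rank(complete_3)
lam = len(complete_3) - n_indep
print(n_indep, lam, count_loops(complete_3))
```

With seven operators concentrated on three spins, this gives $n_{\mathcal{M}}=3$, $\lambda=4$ and $2^{4}=16$ loops, in line with the values quoted for the models of Fig. 1.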

What do simple models look like?

.4 Fewer independent operators, shorter loops

Coming to the quantitative estimate of the complexity, $c_{\mathcal{M}}$ generally depends on the extent to which ensemble averages of the operators $\phi^{\mu}(\boldsymbol{s})$ in the model $\mu\in\mathcal{M}$ constrain each other. This appears explicitly by rewriting (4) as an integral over the ensemble averages of the operators, $\boldsymbol{\varphi}=\{\langle\phi^{\mu}\rangle,\mu\in\mathcal{M}\}$ , exploiting the bijection between the parameters $\boldsymbol{g}$ and their dual parameters $\boldsymbol{\varphi}$ and re-parameterization invariance Amari and Nagaoka (2007); Amari (2016):

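$$c_{\mathcal{M}}=\log\int_{\mathcal{F}}d\boldsymbol{\varphi}\,\sqrt{\det\mathbb{J}(\boldsymbol{\varphi})},\tag{8}$$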

where $\mathbb{J}(\boldsymbol{\varphi})$ is the Fisher Information Matrix in the $\boldsymbol{\varphi}$-coordinates. The new domain $\mathcal{F}$ of integration is over the values of $\boldsymbol{\varphi}$ that can be realized in any empirical sample drawn from the model $\mathcal{M}$ (known in this context as marginal polytope Wainwright and Jordan (2008)) and is related to the mutual constraints between the ensemble averages $\varphi^{\mu}$ (see Appendix sec. SI-4 for more details). If the model contains no loop, i.e. $\mathcal{L}=\{\emptyset\}$, then $J_{\mu\nu}(\boldsymbol{\varphi})=[1-(\varphi^{\mu})^{2}]^{-1}\delta_{\mu\nu}$ is diagonal: the integral in (8) factorizes and gives $c_{\mathcal{M}}={|\mathcal{M}|}\log\pi$. In this case, the variables $\varphi^{\mu}$ are not constrained at all and the domain of integration is $\mathcal{F}=[-1,1]^{|\mathcal{M}|}$. If instead the model contains loops, the variables $\varphi^{\mu}$ become constrained and the marginal polytope $\mathcal{F}$ is reduced. For example, for a model with a single loop of length three (e.g. $\phi^{1}=s_{1}$, $\phi^{2}=s_{2}$ and $\phi^{3}=s_{1}s_{2}$), the values of $\boldsymbol{\varphi}$ in $[-1,1]^{3}$ are not all attainable: the domain $\mathcal{F}=\{\boldsymbol{\varphi}\in[-1,1]^{3}:~|\varphi^{1}+\varphi^{2}|-1\leq\varphi^{3}\leq 1-|\varphi^{1}-\varphi^{2}|\}$ is reduced, which decreases the complexity. The complexity $c_{\mathcal{M}}(k)$ of models with a fixed number $|\mathcal{M}|$ of parameters and a single (non-empty) loop of length $k$ is shown in Fig. 3 (see Appendix sec. SI-6): $c_{\mathcal{M}}(k)$ increases with $k$ and saturates at $|\mathcal{M}|\log\pi$, which is the value one would expect if all operators were unconstrained. This is consistent with the expectation that longer loops induce weaker constraints among the operators. Note that the number of independent operators is kept constant here, equal to $n_{\mathcal{M}}=|\mathcal{M}|-1$.
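The loop-free value $|\mathcal{M}|\log\pi$ can be recovered in one line. This is a short worked step, assuming that the integrand of the complexity is $\sqrt{\det\mathbb{J}(\boldsymbol{\varphi})}$, which is what the quoted result requires. With the diagonal matrix given above, $\det\mathbb{J}=\prod_{\mu}[1-(\varphi^{\mu})^{2}]^{-1}$, and the integral over $\mathcal{F}=[-1,1]^{|\mathcal{M}|}$ factorizes:

$$c_{\mathcal{M}}=\log\prod_{\mu\in\mathcal{M}}\int_{-1}^{1}\frac{d\varphi^{\mu}}{\sqrt{1-(\varphi^{\mu})^{2}}}=\log\prod_{\mu\in\mathcal{M}}\Big[\arcsin\varphi^{\mu}\Big]_{-1}^{1}=|\mathcal{M}|\log\pi .$$

The same value follows in the $\boldsymbol{g}$-coordinates, where each independent operator contributes $\int_{-\infty}^{\infty}dg\,\sqrt{1-\tanh^{2}g}=\int_{-\infty}^{\infty}\frac{dg}{\cosh g}=\pi$.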

The single loop calculation allows computing the complexity of models with non-overlapping loops ($\ell\cap\ell^{\prime}=\emptyset$ for all $\ell,\ell^{\prime}\in\mathcal{L}$), for which $c_{\mathcal{M}}=\sum_{\ell\in\mathcal{L}}c_{\ell}$ is the sum over the complexity $c_{\ell}$ associated to each loop. In the general case of models with more complex loop structures, the explicit calculation of $c_{\mathcal{M}}$ is non-trivial. Yet, the argument above suggests that, at fixed number of parameters $|{\mathcal{M}}|$, $c_{\mathcal{M}}$ should increase with the number $n_{\mathcal{M}}$ of independent operators. Fig. 4 summarises the results for all models with $n=4$ spins and supports this conclusion: for a given value of $|\mathcal{M}|$, classes with lower values of $n_{\mathcal{M}}$ (i.e. with fewer independent operators) are less complex.

A surprising result of Fig. 4 is that $c_{\mathcal{M}}$ is not monotonic with the number $|\mathcal{M}|$ of operators of the model, increasing first with $|\mathcal{M}|$ and then decreasing. Complete models $\overline{\mathcal{M}}$ turn out to be the simplest (see the dashed curve in Fig. 4). As a consequence, for a given $|\mathcal{M}|$ , models that contain a complete model on a subset of spins are generally simpler than models where operators have support on all the spins. For instance, the complexity class displayed in Fig. 1 is the class of models with $|\mathcal{M}|=7$ operators that has the lowest complexity (see green triangle on the dashed curve in Fig. 4).

Fig. 4 also confirms that pairwise models are not simpler than models with higher order interactions. For instance, for $|\mathcal{M}|=7$, $c_{\mathcal{M}}$ increases drastically when changing model a) of Fig. 1 into a pairwise model by turning the $3$-spin interaction into an external field acting on $s_{4}$. Likewise, the model with all 6 pairwise interactions for $|\mathcal{M}|=10$ is more complex than the one where one of them is turned into a $3$-spin interaction.

.5 Complete and sub-complete models

It is possible to compute explicitly the complexity of a complete model $\overline{\mathcal{M}}$ with $n$ spins. Indeed, there is a mapping $g^{\mu}=2^{-n}\sum_{\boldsymbol{s}}\phi^{\mu}(\boldsymbol{s})\log p(\boldsymbol{s})$ between the $2^{n}-1$ parameters $g^{\mu}$ of $\overline{\mathcal{M}}$ and the $2^{n}$ probabilities $p(\boldsymbol{s})$, which are also constrained by their normalization Mastromatteo (2013). The complexity in (4) is invariant under reparametrization Amari and Nagaoka (2007). Re-writing this integral in terms of the variables $p(\boldsymbol{s})$ and using that $\det\mathbb{J}(\mathbf{p})=\prod_{\boldsymbol{s}}1/p(\boldsymbol{s})$, we find (see Appendix sec. SI-5):

$$c_{\overline{\mathcal{M}}}=2^{n-1}\log\pi-\log\Gamma\left(2^{n-1}\right).\tag{9}$$

Note that, for $n>4$ , $c_{\overline{\mathcal{M}}}$ becomes negative (for $n=6$ , $c_{\overline{\mathcal{M}}}\simeq-41.5$ ). This suggests that the class of least complex models with $|\mathcal{M}|$ interactions is the one that contains the model where the maximal number of loops are concentrated on the smallest number of spins. This agrees with our previous observations on single loop models and sub-complete models. On the contrary, models where interactions are distributed uniformly across the variables (e.g. models with only single spin operators for $n\geq|\mathcal{M}|$ or with non-overlapping sets of loops) have higher complexity.
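The sign change can be reproduced numerically. The sketch below assumes that the integral of $\sqrt{\det\mathbb{J}(\mathbf{p})}=\prod_{\boldsymbol{s}}p(\boldsymbol{s})^{-1/2}$ over the probability simplex is the normalization of a Dirichlet distribution with all $2^{n}$ exponents equal to $1/2$, i.e. $\Gamma(1/2)^{2^{n}}/\Gamma(2^{n-1})=\pi^{2^{n-1}}/\Gamma(2^{n-1})$. The function name is an illustrative choice.

```python
import math

def complete_model_complexity(n):
    """c = 2^(n-1) * log(pi) - log Gamma(2^(n-1)), under the Dirichlet-integral assumption."""
    half_states = 2 ** (n - 1)
    return half_states * math.log(math.pi) - math.lgamma(half_states)

for n in range(1, 7):
    print(n, round(complete_model_complexity(n), 2))
```

Under this assumption the value for $n=1$ is $\log\pi$, the result for a single independent operator, and the value for $n=6$ rounds to the $-41.5$ quoted above.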

.6 Maximally overlapping loops

This finally leads us to conjecture that stochastic complexity is related to the localization properties of the set of loops $\mathcal{L}$ (i.e. its group structure) rather than to the order of the interactions: models where the loops $\ell,\ell^{\prime}\in\mathcal{L}$ have a “large” overlap $\ell\cap\ell^{\prime}$ are simple, whereas models with an extended homogeneous network of interactions (e.g. fully connected Ising models with up to pairwise interactions) have many non-overlapping loops $\ell\cap\ell^{\prime}=\emptyset$ and therefore are rather complex. It is interesting to note that the former (simple models) lend themselves to predictions on the independence of different groups of spins. These predictions suggest “fundamental” properties of the system under study (i.e. invariance properties, spin permutation symmetry breaking) and are easy to falsify (i.e. it is clear how to devise a statistical test for these hypotheses to any given confidence level). On the contrary, complex models (e.g. fully connected pairwise Ising models) are harder to falsify as their parameters can be adjusted to fit reasonably well any sample, irrespective of the system under study.

.7 Summary

We find that at fixed number $|\mathcal{M}|$ of operators, simpler models are those with fewer independent operators (i.e. smaller $n_{\mathcal{M}}$). For the same value of $n_{\mathcal{M}}$, models can still have different complexities. The simpler ones are then those with a loop structure that imposes the most constraints between the operators of the model. More generally, we show that the complexity of a model is not defined by the order of the interactions involved, but is, instead, intimately connected to its internal geometry, i.e. how interactions are arranged in the model. The geometry of this arrangement implies mutual dependencies between interactions, that constrain the states accessible to the system. More complex models are those that implement fewer constraints, and can thus account for broader types of data. This result is consistent with the information geometric approach of Ref. Myung et al. (2000), which studies model complexity in terms of the geometry of the space of probability distributions. The contribution of this paper clarifies the relation between the information geometric point of view and the specific structure of the model, i.e. the actual arrangement of its interactions. [Footnote 4: In information geometry (Amari, 2016; Amari and Nagaoka, 2007), a model $\mathcal{M}$ defines a manifold in the space of probability distributions. For exponential models (1), the natural metric, in the coordinates $g^{\mu}$, is given by the Fisher Information (5), and the stochastic complexity (4) is the volume of the manifold Myung et al. (2000).]

A rough estimate of the number $N$ of data samples beyond which the complexity term becomes negligible in Bayesian inference can be obtained with the following argument: An upper bound for the complexity of models with $n$ spins and $m$ parameters is given by $m\log\pi$ , i.e. when all operators are independent. As a lower bound, we take Eq. (9) with $m=2^{n}-1$ . This implies that an upper bound for the variation of the complexity is given by $\Delta c=\frac{m-1}{2}\log\pi+\log\Gamma\left(\frac{m+1}{2}\right)$ . When this is much smaller than the BIC term, the stochastic complexity can be neglected. For large $m$ this implies $N\gg m$ , which may be relevant for the applicability of fully connected pairwise models ( $m\simeq n^{2}/2$ ) in typical cases, for instance when samples cannot be considered as independent observations from a stationary distribution (see Bulso et al. (2016)).
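The comparison between the two terms can be tabulated with a few lines of code. This is only a sketch of the estimate above: `complexity_range` evaluates the quoted bound $\Delta c$ and `bic_term` evaluates $\frac{m}{2}\log\frac{N}{2\pi}$; both function names and the sample sizes are hypothetical choices made for illustration.

```python
import math

def complexity_range(m):
    """Upper bound on the variation of c at fixed m: (m-1)/2 * log(pi) + log Gamma((m+1)/2)."""
    return 0.5 * (m - 1) * math.log(math.pi) + math.lgamma(0.5 * (m + 1))

def bic_term(m, num_samples):
    """BIC term m/2 * log(N / (2*pi)) for m parameters and N samples."""
    return 0.5 * m * math.log(num_samples / (2 * math.pi))

m = 15  # e.g. the complete model on n = 4 spins
for num_samples in (10 ** 2, 10 ** 4, 10 ** 6):
    ratio = complexity_range(m) / bic_term(m, num_samples)
    print(num_samples, round(ratio, 3))
```

Since $\log\Gamma\left(\frac{m+1}{2}\right)$ grows like $\frac{m}{2}\log m$, the ratio becomes small only once $\log N$ clearly exceeds $\log m$, which is the $N\gg m$ condition stated above.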

Conclusion

As Wigner (1960) pointed out long ago, the unreasonable effectiveness of mathematical models relies on isolating phenomena that depend on few variables, whose mutual variation is described by simple models and is independent of the rest. Remarkably, we find that, for a fixed number of spin variables and parameters, simple models, according to MDL, are precisely of this form: statistical dependencies are concentrated on the smallest subset of variables and these are independent of all the rest.

Such simple models are not optimal for generalization, i.e. for describing generic statistical dependencies; rather, they are easy to falsify. They are designed for spotting independencies that may hint at deeper principles (e.g. symmetries or conservation laws) that may “take us beyond the data”. (Footnote 5: In his response to Anderson (2008) on edge.org, W.D. Willis observes that “Models are interesting precisely because they can take us beyond the data”.) On the contrary, fully connected pairwise models appear to be rather complex. This, we conjecture, is the origin of the pairwise sufficiency (Merchan and Nemenman, 2016) that makes them so successful in describing a wide variety of data, from neural tissues (Schneidman et al., 2006) to voting behaviour (Lee et al., 2015).

On the other hand, pairwise interactions play a special role in our understanding of phenomena, as they allow one to reduce statistical dependencies to direct interactions between variables. It would therefore be important to identify methods to quantitatively assess when a dataset is genuinely described by pairwise interactions. The results of this paper allow one to address this issue by comparing inference with pairwise models to inference with models obtained via their gauge transformations. Since the latter preserve the number of interactions and the stochastic complexity, transformed models have the same flexibility in terms of generalisation. For the same reason, the comparison between pairwise models and their gauge-transformed counterparts can be done on the basis of likelihood alone.

In conclusion, our results suggest that when data are scarce and high dimensional, Bayesian inference should privilege simple models, i.e. those with small stochastic complexity, over more complex ones, such as the fully connected pairwise models that are often used (Chau Nguyen et al., 2017; Schneidman et al., 2006; Lee et al., 2015). A full Bayesian model selection approach is hampered by the calculation of the stochastic complexity, which is a daunting task. Developing approximate heuristics for accomplishing this task is a challenging avenue for future research.

Acknowledgements.

C. B. acknowledges financial support from the Kavli Foundation and the Norwegian Research Council’s Centre of Excellence scheme (Centre for Neural Computation, grant number 223262). A. B. acknowledges financial support from the International School for Advanced Studies (SISSA).

Bibliography


1. Mayer-Schonberger and Cukier (2013): V. Mayer-Schonberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think (John Murray Publishers, UK, 2013).
2. Anderson (2008): C. Anderson, Wired Magazine (2008).
3. Cristianini (2010): N. Cristianini, Neural Networks 23, 466 (2010).
4. Le Cun et al. (2010): Y. Le Cun, K. Kavukcuoglu, and C. Farabet, in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on (IEEE, 2010) pp. 253–256.
5. Hannun et al. (2014): A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenge, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, arXiv e-prints (2014), arXiv:1412.5567 [cs.CL].
6. Bishop (2006): C. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics (Springer, 2006).
7. Wu et al. (2008): X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, Knowledge and Information Systems 14, 1 (2008).
8. Goodfellow et al. (2016): I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016), http://www.deeplearningbook.org.