Causality inference in stochastic systems from neurons to currencies: Profiting from small sample size

Danh-Tai Hoang; Juyong Song; Vipul Periwal; Junghyo Jo

arXiv:1705.06384·physics.data-an·February 20, 2019

TL;DR

This paper introduces a novel data-driven statistical physics method for causality inference in stochastic systems, demonstrating superior performance with small datasets in fields like neuroscience and finance.

Contribution

The authors develop a free energy minimization approach for model inference that outperforms traditional methods in small sample scenarios, applicable to complex systems like neural and currency networks.

Findings

01

Effective inference of neural connectivity networks from limited data.

02

Successful modeling of currency exchange networks with small samples.

03

Scalable approach applicable to large systems.

Equations

$$P(\sigma_{i}(t+1)=\pm 1\,|\,\sigma(t))=\frac{\exp(\pm H_{i}(\sigma(t)))}{\exp(H_{i}(\sigma(t)))+\exp(-H_{i}(\sigma(t)))}$$

$$\frac{\partial F}{\partial J_{i}}=\frac{\sum_{t}\sigma_{i}(t)\exp(J\cdot\sigma(t)-\beta E_{i}(t))}{\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t))}=\langle\sigma_{i}\rangle_{J}\equiv m_{i}(J).$$

$$\frac{\partial G}{\partial\beta}=-\frac{\partial F}{\partial\beta}=\frac{\sum_{t}E_{i}(t)\exp(J\cdot\sigma(t)-\beta E_{i}(t))}{\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t))}=\langle E_{i}\rangle_{m}.$$

$$E_{i}(t)\equiv\frac{\sigma_{i}(t+1)}{\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}}H_{i}(\sigma(t)),$$

$$D_{i}(W)\equiv\sum_{t}\big[\sigma_{i}(t+1)-\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}\big]^{2}.$$

$$Z(J,\beta)=\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t)).$$

$$\frac{\partial F}{\partial J}$$

$$-\frac{\partial F}{\partial\beta}$$

$$F(J,\beta)+G(m,\beta)=J\cdot m.$$

$$P(\sigma(t))\equiv\exp(J\cdot\sigma(t)-\beta E_{i}(\sigma(t))+F)$$

$$G(m,\beta)=\beta\langle E_{i}\rangle_{J,\beta}-S,$$

$$S=-\sum_{t}P(\sigma(t))\log P(\sigma(t)).$$

$$\frac{\partial G}{\partial m}=J,$$

$$\frac{\partial G}{\partial\beta}=-\frac{\partial F}{\partial\beta}=\langle E_{i}\rangle_{J,\beta}.$$

$$G(m,\beta)=\cdots+\frac{1}{6}\sum_{j,k,l}\bigg[\frac{\partial^{3}G}{\partial m_{j}\partial m_{k}\partial m_{l}}\bigg]^{*}(m_{j}-m_{j}^{*})(m_{k}-m_{k}^{*})(m_{l}-m_{l}^{*})+O(\delta^{4}m)$$

$$\frac{\partial G(m,\beta)}{\partial\beta}=\cdots+\frac{1}{2}\sum_{j,k}\frac{\partial}{\partial\beta}\bigg[\frac{\partial^{2}G}{\partial m_{j}\partial m_{k}}\bigg]^{*}(m_{j}-m_{j}^{*})(m_{k}-m_{k}^{*})-\frac{1}{2}\sum_{j,k,l}\frac{\partial m_{l}^{*}}{\partial\beta}\bigg[\frac{\partial^{3}G}{\partial m_{j}\partial m_{k}\partial m_{l}}\bigg]^{*}(m_{j}-m_{j}^{*})(m_{k}-m_{k}^{*})+O(\delta^{3}m).$$

$$-\frac{\partial m_{k}}{\partial\beta}=\frac{\partial}{\partial\beta}\bigg[\frac{\sum_{t}\sigma_{k}(t)\exp(J\cdot\sigma(t)-\beta E_{i}(t))}{\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t))}\bigg]=\langle\delta E_{i}\delta\sigma_{k}\rangle.$$

$$\frac{\partial^{2}G}{\partial m_{j}\partial m_{k}}=\frac{\partial J_{k}}{\partial m_{j}}=[C^{-1}]_{jk},$$

$$C_{jk}=\langle\delta\sigma_{j}\delta\sigma_{k}\rangle.$$

$$\frac{\partial}{\partial\beta}\bigg[\frac{\partial^{2}G}{\partial m_{j}\partial m_{k}}\bigg]=-\sum_{\mu,\nu}[C^{-1}]_{j\mu}\frac{\partial C_{\mu\nu}}{\partial\beta}[C^{-1}]_{\nu k}=\sum_{\mu,\nu}[C^{-1}]_{j\mu}[C^{-1}]_{k\nu}\langle\delta E_{i}\delta\sigma_{\mu}\sigma_{\nu}\rangle.$$

$$\frac{\partial^{3}G}{\partial m_{j}\partial m_{k}\partial m_{l}}=-\sum_{\lambda,\mu,\nu}[C^{-1}]_{j\lambda}[C^{-1}]_{k\mu}\frac{\partial C_{\mu\nu}}{\partial J_{\lambda}}[C^{-1}]_{\nu l}=-\sum_{\lambda,\mu,\nu}[C^{-1}]_{j\lambda}[C^{-1}]_{k\mu}[C^{-1}]_{l\nu}\langle\delta\sigma_{\lambda}\delta\sigma_{\mu}\sigma_{\nu}\rangle.$$

$$\langle\delta E_{i}\rangle^{'}=\cdots+\frac{1}{2}\sum_{j,k}\sum_{\mu,\nu}\langle\delta E_{i}\delta\sigma_{\mu}\sigma_{\nu}\rangle^{*}[C^{-1}]_{j\mu}^{*}[C^{-1}]_{k\nu}^{*}\langle\delta\sigma_{j}\rangle^{'}\langle\delta\sigma_{k}\rangle^{'}-\frac{1}{2}\sum_{j,k,l}\sum_{\lambda,\mu,\nu}\langle\delta E_{i}\delta\sigma_{l}\rangle^{*}\langle\delta\sigma_{\lambda}\delta\sigma_{\mu}\sigma_{\nu}\rangle^{*}[C^{-1}]_{j\lambda}^{*}[C^{-1}]_{k\mu}^{*}[C^{-1}]_{l\nu}^{*}\langle\delta\sigma_{j}\rangle^{'}\langle\delta\sigma_{k}\rangle^{'},$$

$$\langle\delta E_{i}\rangle^{'}=\sum_{j}W_{ij}^{*}\langle\delta\sigma_{j}\rangle^{'}+\frac{1}{2}\sum_{j,k}Q_{ijk}^{*}\langle\delta\sigma_{j}\rangle^{'}\langle\delta\sigma_{k}\rangle^{'},$$

$$W_{ij}^{*}\equiv\sum_{k}\langle\delta E_{i}\delta\sigma_{k}\rangle^{*}[C^{-1}]_{kj}^{*}$$

$$Q_{ijk}^{*}$$

Full text

Significance statement

Stochasticity is a dominant feature of natural dynamical phenomena. Maximum likelihood estimation (MLE) is standard for stochastic model inference, but MLE converges to the true parameter values only in the large sample limit. When the data is insufficient, as in neuronal dynamics, or the observed stochastic dynamics are modulated by time-varying deterministic trends, as in financial markets, alternatives to MLE are required. We use the mathematical formalism of statistical physics to define a free energy of data that measures the plausibility of observed configurations. Minimizing this free energy without resorting to gradient descent provides a computationally effective way to infer system couplings from small sets of observations, even including higher-order interactions. We demonstrate applications ranging from biological to financial networks.

To whom correspondence should be addressed. E-mail: [email protected] or [email protected]

Danh-Tai Hoang

Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA

Department of Natural Sciences, Quang Binh University, Dong Hoi, Quang Binh 510000, Vietnam

Juyong Song

Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk 37673, Korea

Department of Physics, Pohang University of Science and Technology, Pohang, Gyeongbuk 37673, Korea

The Abdus Salam International Centre for Theoretical Physics, Strada Costiera 11, 34014 Trieste, Italy

Vipul Periwal

Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA

Junghyo Jo

Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk 37673, Korea

Department of Physics, Pohang University of Science and Technology, Pohang, Gyeongbuk 37673, Korea

School of Computational Sciences, Korea Institute for Advanced Study, Seoul 02455, Korea

Abstract

Success in modeling complex phenomena such as human perception hinges critically on the availability of data and computational power. Significant progress has been made in modeling such phenomena using probabilistic methods, particularly in image analysis and speech recognition. Maximum Likelihood Estimation (MLE) combined with Bayesian model selection is the basis of much of this progress, as MLE converges to the true model with copious data. In the sciences, large enough datasets are rarae aves, so alternatives to MLE must be developed for small sample size. We introduce a data-driven statistical physics approach to model inference based on minimizing a free energy of data and show superior model recovery for small sample sizes. We demonstrate coupling strength inference in non-equilibrium kinetic Ising models, including in the difficult large coupling variability regime, and show scaling to systems of arbitrary size. As applications, we infer a functional connectivity network in the salamander retina and a currency exchange rate network from time-series data of neuronal spiking and currency exchange rates, respectively. Accurate small sample size inference is critical for devising a profitable currency hedging strategy.

An explosion in data availability in recent years has ushered in a new era of data-driven research for natural and social sciences. Identifying systems dynamics from observed data, e.g. biochemical reactions (1), gene expression measurements (2), neuronal or brain region activities (3, 4, 5, 6), and population dynamics (7), is of fundamental interest in science (8, 9, 10, 11, 12). For complex phenomena, such as human perception, modeling system dynamics in a probabilistic framework became possible with the advent of inexpensive computational resources, and has led to great progress in the last 25 years. Regardless of whether stochasticity is inherent in the system, or only apparent due to partial observability (13), many stochastic processes have been analyzed by autoregressive-moving-average models (14) or probabilistic directed acyclic graphical models, often termed Bayesian networks (15).

The structure of such dynamic processes is often unknown and, in the social sciences in particular, there may be no underlying fundamental theory to delineate possible models. Thus, a universal model-free data-driven approach has merit for the inference of models from time-series data (16). Machine learning using recurrent neuronal networks is such an approach (17), but it usually requires a large amount of training data and is computationally intensive. Given time series of $N$ variables, network inference rapidly becomes too complex with increasing $N.$ Even considering only pair-wise interactions requires determining $N^{2}$ parameters and demands $L\geq N^{2}$ samples. Including higher-order interactions leads to an exponential increase in the number of model parameters, and a concomitant increase in sample size. In scientific contexts, however, we often encounter the case that data generated from experiments are not big enough to reconstruct the interaction network for a given system. Theorists contend with the computational difficulties of inferring large systems by positing properties such as sparsity of interactions or specifying distributions of couplings, usually with scant experimental support.

Maximum Likelihood Estimation (MLE) is the gold standard for stochastic model parameter inference, as it converges to the true model parameters in the limit of large sample size. On the other hand, MLE is limited by the fact that the likelihood equations are specific to a given estimation problem, that the numerical estimation is usually non-trivial, and most importantly, MLE can be heavily biased for small samples where the optimality properties of MLE may not apply. MLE can also be sensitive to the choice of starting values (18).

According to the Rao-Blackwell theorem (19, 20), the conditional expected value of an estimator given a sufficient statistic is another estimator that is at least as good, and this result applies to MLE estimators as well. The Rao-Blackwell result usually applies for sufficient and complete statistics, and leads to an idempotent improvement, in other words, the improvement requires no iteration. However, for our small sample size purposes, more apropos is the recent result of Galili and Meilijson (21), which suggests that a Rao-Blackwell–type iterative improvement of a parameter estimator is worth investigating.

Statistical physics is often used for model inference (22, 23), but, in fact, for small sample sizes, the observed configurations of the system may bear no semblance to random sampling or a thermodynamic limit. We develop here an iterative parameter-free model estimator using only the mathematical formalism of statistical physics to define a free energy of data, and show that minimizing this free energy corresponds to linear and higher-order data regressions. Over-fitting is a major problem in the analysis of under-determined systems. By decoupling an iterative Rao-Blackwell estimator update step from an update–consistent stopping criterion, we demonstrate that our Free Energy Minimization (FEM) approach infers coupling strengths in non-equilibrium kinetic Ising models, outperforming previous approaches particularly in the large coupling variability and small sample size regimes. Real data is always a stringent test of model inference so we demonstrate applications of FEM to infer biological and financial networks from neuronal activities and currency fluctuations.

Iterative stochastic causality inference from Free Energy Minimization

The elegant mathematical formalism developed by Schwinger provides a natural connection between expectation values $m=\langle\sigma\rangle$ of microstates $\sigma$ and expectation values $\langle E_{i}\rangle_{m}$ of observables $E_{i}$ conditioned on $m$ (24, 25). We will use it to implement a Rao-Blackwell estimator update. As a concrete illustration, let us start with a kinetic Ising model in which a vector $\sigma$ of $N$ spins $\sigma_{i}(t)=\pm 1$ is stochastically updated based on the following conditional probability

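$$P(\sigma_{i}(t+1)=\pm 1\,|\,\sigma(t))=\frac{\exp(\pm H_{i}(\sigma(t)))}{\exp(H_{i}(\sigma(t)))+\exp(-H_{i}(\sigma(t)))}\tag{1}$$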

with a local field $H_{i}(\sigma(t))\equiv\sum_{j}W_{ij}\sigma_{j}(t)$ . Our goal is to infer the coupling strength $W_{ij}$ that minimizes the discrepancy between observed $\sigma_{i}(t+1)$ and model expectation ${\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}}\equiv\sum_{\rho=\pm 1}\rho P(\sigma_{i}(t+1)=\rho|\sigma(t)).$ For the kinetic Ising model, ${\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}}=\tanh H_{i}(\sigma(t)).$
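The update rule above can be sketched in a few lines of NumPy. This is only an illustrative simulator, not the authors' code: the function name `simulate_kinetic_ising`, the parallel (all spins at once) update, and the random initial configuration are assumptions made here. It uses the identity $e^{H}/(e^{H}+e^{-H})=(1+\tanh H)/2$ for the probability of $\sigma_{i}(t+1)=+1$.

```python
import numpy as np

def simulate_kinetic_ising(W, L, rng=None):
    """Generate L configurations of N = W.shape[0] spins (values +1/-1).

    Hypothetical helper: all spins are updated in parallel from sigma(t),
    and sigma(0) is drawn uniformly at random (both are assumptions).
    """
    rng = np.random.default_rng(rng)
    N = W.shape[0]
    s = np.empty((L, N))
    s[0] = rng.choice([-1.0, 1.0], size=N)
    for t in range(L - 1):
        H = W @ s[t]                      # local fields H_i(sigma(t)) = sum_j W_ij sigma_j(t)
        p_up = 0.5 * (1.0 + np.tanh(H))   # P(sigma_i(t+1) = +1 | sigma(t))
        s[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)
    return s
```

For example, `simulate_kinetic_ising(W, 1000, rng=0)` with any real $N\times N$ matrix `W` returns an array of shape `(1000, N)` whose rows are successive configurations $\sigma(t)$.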

To implement a Rao-Blackwell scheme of estimator improvement $H_{i}^{\textrm{new}}(m)\leftarrow\langle E_{i}\rangle_{m},$ we first define a moment generating function, $Z(J,\beta)=\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t))$ , which is a function of a vector parameter $J$ , a scalar parameter $\beta,$ and a ‘data energy’ $E_{i}(t)$ that we will define below. A convex free energy $F=\log Z$ can be used to obtain expectation values of spin activities by differentiation,

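$$\frac{\partial F}{\partial J_{i}}=\frac{\sum_{t}\sigma_{i}(t)\exp(J\cdot\sigma(t)-\beta E_{i}(t))}{\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t))}=\langle\sigma_{i}\rangle_{J}\equiv m_{i}(J).\tag{2}$$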

As usual, a convex dual free energy $G$ can be defined to make the expected activity vector $m$ the independent variable, and $J(m)$ the dependent vector, by using the convexity preserving Legendre transform $F(J)+G(m)=J\cdot m.$ The expectation value of $E_{i}$ is obtained by differentiation (identifying $\langle E_{i}\rangle_{J(m)}\equiv\langle E_{i}\rangle_{m}$ ),

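$$\frac{\partial G}{\partial\beta}=-\frac{\partial F}{\partial\beta}=\frac{\sum_{t}E_{i}(t)\exp(J\cdot\sigma(t)-\beta E_{i}(t))}{\sum_{t}\exp(J\cdot\sigma(t)-\beta E_{i}(t))}=\langle E_{i}\rangle_{m}.\tag{3}$$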

The free energy $G(m,\beta)=\beta\langle E_{i}\rangle_{m}-S$ where $S$ is the Shannon entropy of data. At $\beta=0,$ minimizing the free energy is exactly maximizing the entropy, making every sample equally valuable. At its minimum, $m^{*},$ we have $J(m^{*})=\partial_{m}G(m^{*})=0,$ and this is the value of $J$ about which we will expand, hence the term Free Energy Minimization (FEM).

We now turn to finding an appropriate $E_{i}.$ Consider

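$$E_{i}(t)\equiv\frac{\sigma_{i}(t+1)}{\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}}H_{i}(\sigma(t)),\tag{4}$$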

and define the Rao-Blackwell conditional expectation update: $H_{i}(m)^{\textrm{new}}\leftarrow\langle E_{i}\rangle_{m}.$ Intuitively, if the observation $\sigma_{i}(t+1)$ is larger/smaller than the corresponding model expectation $\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)},$ this update increases/decreases $H_{i}(\sigma(t))$ proportionally to the discrepancy ratio between the observation and the model expectation, including the sign. The differential geometry of $G(m,\beta)$ around its minimum $m^{*}$ then gives $W_{ij}^{\textrm{new}}=\sum_{k}\langle\delta E_{i}\delta\sigma_{k}\rangle_{m^{*}}[C^{-1}]_{kj}$ as a matrix multiplication, where $\delta f\equiv f-\langle f\rangle_{m^{*}}$ and $C_{jk}\equiv\langle\delta\sigma_{j}\delta\sigma_{k}\rangle_{m^{*}}$ (see SI Text 1 for the detailed derivation).

The second crucial aspect for small sample size inference is to find a suitable stopping criterion for the Rao-Blackwell update. We consider the overall discrepancy between ${\sigma_{i}(t+1)}$ and ${\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}}$ :

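$$D_{i}(W)\equiv\sum_{t}\big[\sigma_{i}(t+1)-\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}\big]^{2}.\tag{5}$$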

The minimum of $D_{i}(W)$ is the closest we can approach a fixed point of the update iteration, consistent with Eq. (4) and the Rao-Blackwell expectation. Therefore, we stop the iteration when $D_{i}(W)$ starts to increase.

To summarize inference with FEM: (i) Compute $H_{i}(\sigma(t))\equiv\sum_{j}W_{ij}\sigma_{j}(t)$ (initialize with a random $W_{ij}$ ); (ii) Compute $E_{i}(t)$ as defined in Eq. (4); (iii) Extract $W_{ij}^{\textrm{new}}=\sum_{k}\langle\delta E_{i}\delta\sigma_{k}\rangle_{m^{*}}[C^{-1}]_{kj};$ (iv) Repeat (i)-(iii) until $D_{i}(W)$ starts to increase; (v) Compute (i)-(iv) in parallel for every index $i\in\{1,2,\cdots,N\}$ .
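Steps (i) to (v) can be sketched as follows for the kinetic Ising model, where the model expectation is $\tanh H_{i}(\sigma(t))$. This is a minimal reading of the summary above, not the authors' implementation: the names `fem_infer_row` and `fem_infer`, the cap `max_iter`, the scale of the random start, and the use of plain sample averages over $t$ for $\langle\cdot\rangle_{m^{*}}$ are assumptions, and the covariance matrix $C$ is assumed to be invertible.

```python
import numpy as np

def fem_infer_row(s, i, max_iter=100, rng=None):
    """Infer row i of W from configurations s (shape (L, N), entries +1/-1).

    Hypothetical sketch; averages are taken as plain sample means (assumption).
    """
    rng = np.random.default_rng(rng)
    x, y = s[:-1], s[1:, i]                  # sigma(t) and sigma_i(t+1)
    n, N = x.shape
    dx = x - x.mean(axis=0)                  # delta sigma
    C_inv = np.linalg.inv(dx.T @ dx / n)     # inverse of C_jk = <dsigma_j dsigma_k>
    w = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)   # step (i): random start
    best_w, best_D = w, np.inf
    for _ in range(max_iter):
        H = x @ w                            # step (i): local field
        D = np.sum((y - np.tanh(H)) ** 2)    # discrepancy D_i(W)
        if D >= best_D:                      # step (iv): stop once D no longer decreases
            break
        best_w, best_D = w, D
        ratio = np.ones_like(H)              # H / tanh(H), with its limit 1 at H = 0
        nz = np.abs(H) > 1e-12
        ratio[nz] = H[nz] / np.tanh(H[nz])
        E = y * ratio                        # step (ii): data energy E_i(t)
        w = ((E - E.mean()) @ dx / n) @ C_inv   # step (iii): <dE dsigma_k> [C^-1]_kj
    return best_w, best_D

def fem_infer(s, **kwargs):
    """Step (v): rows are independent, so they could also be run in parallel."""
    return np.vstack([fem_infer_row(s, i, **kwargs)[0] for i in range(s.shape[1])])
```

The function returns the last weights before the discrepancy stopped decreasing, which is one way to read "until $D_{i}(W)$ starts to increase". Because each row only needs $\sigma(t)$ and its own $\sigma_{i}(t+1)$, the loop in `fem_infer` is trivially parallelizable.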

Results

Kinetic Ising model

We first tested FEM on the inference of connection weights $W_{ij}$ ( $\neq W_{ji}$ ) in the kinetic Ising model, which is often used as a benchmark for stochastic causality inference. The Sherrington-Kirkpatrick (SK) model assumes $W_{ij}$ are normally distributed with zero mean and variance equal to $g^{2}/N$ (26). In the limit of large sample size (large $L/N^{2}$ ), our iterative method decreases the mean square error, MSE = $N^{-2}\sum_{i,j=1}^{N}(W_{ij}-W_{ij}^{\textrm{true}})^{2}$ , as the number of iterations increases (Fig. 1A). We obtain good agreement between true and predicted weights (Fig. 1B). In real world problems, $W_{ij}^{\textrm{true}}$ is inaccessible so MSE cannot be defined. However, $D_{i}(W)$ in Eq. (5) is an alternative measure of the discrepancy between observation $\sigma_{i}(t+1)$ and model expectation. The discrepancy measures $D_{i}(W)$ are independent for each spin $i.$ We checked that MSE and $D=N^{-1}\sum_{i=1}^{N}D_{i}(W)$ change similarly during iterations. More importantly, for small sample sizes (small $L/N^{2}$ ), MSE and $D$ decrease with iterations initially, but start to increase after some number of iterations (Fig. 1C). For the kinetic Ising model, $D_{i}(W)=4\sum_{t}[1-P(\sigma_{i}(t+1)|\sigma(t))]^{2}$ with the transition probability, $P(\sigma_{i}(t+1)|\sigma(t))$ in Eq. (1). Therefore, decreasing $D_{i}(W)$ can only result from $P(\sigma_{i}(t+1)|\sigma(t))$ saturating the causal relation between observations, $\sigma(t)$ and $\sigma_{i}(t+1)$ , through $W.$ Distinct spins indexed by $i$ often require different numbers of iterations. Stopping the iteration for spin $i$ when $D_{i}(W)$ saturates leads to accurate inference with minimal computation. For limited data (e.g. $L/N^{2}=0.2$ ), these stopping criteria lead to accurate inference (Fig. 1D) without over-fitting.
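The closed form of $D_{i}(W)$ quoted above for the kinetic Ising model follows in two lines from $\sigma_{i}(t+1)=\pm 1$. The short derivation below is added for convenience and writes $H_{i}$ for $H_{i}(\sigma(t))$:

```latex
\begin{align}
P(\sigma_i(t+1)\,|\,\sigma(t))
  &= \frac{e^{\sigma_i(t+1) H_i}}{e^{H_i}+e^{-H_i}}
   = \frac{1+\sigma_i(t+1)\tanh H_i}{2}
  \;\;\Rightarrow\;\;
  1-P = \frac{1-\sigma_i(t+1)\tanh H_i}{2}, \\
\big[\sigma_i(t+1)-\tanh H_i\big]^2
  &= \sigma_i(t+1)^2\,\big[1-\sigma_i(t+1)\tanh H_i\big]^2
   = 4\,\big[1-P(\sigma_i(t+1)\,|\,\sigma(t))\big]^2 .
\end{align}
```

Summing the second line over $t$ gives $D_{i}(W)=4\sum_{t}[1-P(\sigma_{i}(t+1)|\sigma(t))]^{2}$, so $D_{i}(W)$ shrinks only when the model assigns higher probability to the transitions that were actually observed.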

Now we compare the inference performance of our method with other representative methods (27, 28, 29): naïve mean field (nMF), Thouless-Anderson-Palmer mean field (TAP), exact mean field (eMF), and maximum likelihood estimation (MLE). MLE requires maximizing the data likelihood, ${\cal P}=\prod_{t=1}^{L-1}\prod_{i=1}^{N}P(\sigma_{i}(t+1)|\sigma(t))$ , and uses gradient ascent to update $W_{ij}$ incrementally through $W^{\textrm{new}}_{ij}=W_{ij}+{\alpha}/({L-1}){\partial\log{\cal P}}/{\partial W_{ij}}$ (27, 29), where the learning rate $\alpha$ is an undetermined parameter controlling the updating speed. In contrast, the maximizing condition ( $\partial\log{\cal P}/{\partial W_{ij}}=0$ ) and mean-field approximations provide matrix equations, $W=A^{-1}BC^{-1}$ , where matrices $B_{ij}=\langle\delta\sigma_{i}(t+1)\delta\sigma_{j}(t)\rangle$ and $C_{ij}=\langle\delta\sigma_{i}(t)\delta\sigma_{j}(t)\rangle$ represent time-delayed and equal-time correlations in data, and $A$ are diagonal matrices, which are different for nMF, TAP, and eMF (SI Text 2 has brief reviews of these mean-field methods).
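For comparison, the MLE gradient-ascent update described above can be sketched as below. For the kinetic Ising model, $\log P(\sigma_{i}(t+1)|\sigma(t))=\sigma_{i}(t+1)H_{i}-\log(2\cosh H_{i})$, so $\partial\log{\cal P}/\partial W_{ij}=\sum_{t}[\sigma_{i}(t+1)-\tanh H_{i}(\sigma(t))]\sigma_{j}(t)$. The function name `mle_gradient_ascent`, the zero initialization, and the fixed iteration count `n_iter` are assumptions made for illustration.

```python
import numpy as np

def mle_gradient_ascent(s, alpha=1.0, n_iter=300):
    """Plain gradient ascent on the log-likelihood of a kinetic Ising model.

    Hypothetical sketch: s has shape (L, N) with entries +1/-1.
    """
    x, y = s[:-1], s[1:]                 # sigma(t) and sigma(t+1), each (L-1, N)
    n, N = x.shape                       # n = L - 1 observed transitions
    W = np.zeros((N, N))                 # assumption: start from zero couplings
    for _ in range(n_iter):
        H = x @ W.T                      # H[t, i] = sum_j W[i, j] * sigma_j(t)
        grad = (y - np.tanh(H)).T @ x    # grad[i, j] = d log P / d W[i, j]
        W += alpha / n * grad            # W_new = W + alpha / (L - 1) * gradient
    return W
```

Unlike the FEM update, every step here is scaled by the learning rate `alpha`, which is the tuning parameter the text refers to.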

For weak coupling ( $g=1$ ), TAP, eMF, MLE and FEM have similar inference accuracy that increases with sample size (Fig. 1E). nMF showed poor accuracy independent of data size, since the zeroth-order mean-field approximation works only for very weak coupling strengths (27). As we further increase coupling strength, the other two mean-field methods, TAP and eMF also start to give less accurate results than MLE and FEM (Fig. 1F-H). For large sample size ( $L/N^{2}>1$ ), our iterative method, FEM, works as well as standard MLE. For small sample size, however, FEM provides better accuracy than MLE. For example, the inference error (MSE) of FEM is approximately 4 times lower than that of MLE for $L/N^{2}=0.2$ and $g=4.$ In addition to inference accuracy, FEM has two advantages in computation. First, the FEM update is multiplicative and not incremental, while MLE updates (using conjugate gradient ascent or some other numerical maximization) have an undetermined parameter, the learning rate $\alpha,$ which must be tuned. A very large rate ( $\alpha=3$ ) leads to loss of convergence, whereas a very small rate ( $\alpha=0.5$ ) leads to many iterations with infinitesimal updates. We set $\alpha=1$ . Second, FEM requires 20 times fewer updates than MLE (Fig. S1A), which reduces computation time 100-fold (Fig. S1B).

To further demonstrate the effectiveness of FEM, we show two examples of inferred networks when $W_{ij}$ has more general coupling distributions than the SK model, as real systems often deviate strongly from normally-distributed coupling strengths. In the first example, the spins have alternating bands of positive and negative couplings modulated by distance as $|W_{ij}|=W_{0}/\log(R_{ij})$ , where $R_{ij}$ represents the radius of the circle (Fig. 2A). The couplings are non-normally distributed (Fig. 2B). The spin raster scan exhibits nontrivial structure (Fig. 2C), reminiscent of binocular rivalry (30). As the number of observed configurations increases, the predicted coupling strengths (Fig. 2D) approach their true values (Fig. 2A). In the second, the 2018 Gerber baby’s photograph was used as the heatmap of the coupling matrix (Fig. 2E). These couplings are also non-normally distributed (Fig. 2F) with periodic bursting in the simulated spin raster scan (Fig. 2G), but the couplings are still predicted well (Fig. 2H).

Our formulation, based on the differential geometry of the data free energy, automatically includes higher-order regression equations for the local field $H_{i}(\sigma)$ (SI Text 1). For example, we checked higher-order inference with FEM by using a generalized kinetic Ising model with linear and quadratic couplings, $H_{i}(\sigma(t))=\sum_{j}W_{ij}\sigma_{j}(t)+\sum_{j,k}Q_{ijk}\sigma_{j}(t)\sigma_{k}(t)/2$ , where $W_{ij}$ and $Q_{ijk}$ are normally distributed. The quadratic couplings are symmetric ( $Q_{ijk}=Q_{ikj}$ ) and have no self-interactions ( $Q_{ijj}=0$ ) since $\sigma_{j}^{2}=1.$ The number of $Q_{ijk}$ parameters is $N^{2}(N-1)/2.$ The recovery of both linear and quadratic couplings is evident (Fig. 3).
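The generalized local field and the constraints on $Q_{ijk}$ can be made concrete with a short sketch. The helper names `random_quadratic_couplings` and `local_field_quadratic` and the `scale` parameter are hypothetical; only the symmetry $Q_{ijk}=Q_{ikj}$, the condition $Q_{ijj}=0$, and the formula for $H_{i}$ come from the text.

```python
import numpy as np

def random_quadratic_couplings(N, scale, rng=None):
    """Draw Q[i, j, k] that is symmetric in (j, k) with Q[i, j, j] = 0."""
    rng = np.random.default_rng(rng)
    Q = rng.normal(0.0, scale, size=(N, N, N))
    Q = 0.5 * (Q + Q.transpose(0, 2, 1))   # enforce Q_ijk = Q_ikj
    idx = np.arange(N)
    Q[:, idx, idx] = 0.0                   # no self-interactions, since sigma_j^2 = 1
    return Q

def local_field_quadratic(W, Q, sigma):
    """H_i = sum_j W_ij sigma_j + (1/2) sum_{j,k} Q_ijk sigma_j sigma_k."""
    return W @ sigma + 0.5 * np.einsum("ijk,j,k->i", Q, sigma, sigma)
```

Each target spin $i$ has one independent coefficient per unordered pair $j<k$, that is $N(N-1)/2$ of them, which gives the total of $N^{2}(N-1)/2$ quoted above.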

Neuronal network

We applied our method to infer a neuronal network from temporal neuronal activities in the tiger salamander (Ambystoma tigrinum) retina (31). The multi-channel experiment recorded stochastic firing patterns of 160 neurons when the salamander retina was stimulated by a film clip of fish swimming. As in Ref. (32), we considered only the 100 most active neurons. After processing the data (SI Text 3; Fig. 4A), we inferred the neuronal network governing the local field, $H_{i}(\sigma(t))=H_{i}^{\textrm{ext}}+\sum_{j}W_{ij}\sigma_{j}(t)$ . Here we included a constant bias external field $H_{i}^{\textrm{ext}}$ for neuron $i$ to consider the persistent silence of neurons. We inferred the neuronal network weights $W_{ij}$ (Fig. 4B), and the external local fields for each neuron by using $H_{i}^{\textrm{ext}}=\langle H_{i}\rangle-\sum_{j}W_{ij}\langle\sigma_{j}\rangle$ . The external local fields are mostly negative, which implies that neuronal activities are biased to be silent (Fig. 4C).

The true couplings are unknown for this system. As a validation, with the $H_{i}^{{\textrm{ext}}}$ and $W_{ij}$ we determined, we simulated neuronal activities. We found agreement between the covariances of neuronal activities $C_{ij}=\langle\delta\sigma_{i}(t)\delta\sigma_{j}(t)\rangle$ of the observed and simulated data (Fig. 4D). For a more stringent validation, we reconstructed the full neuronal activities from specific ‘pinned’ neuron activities, representing inputs. Fixing the time sequences $\sigma_{j}(t)$ of specific chosen input neurons $j\in I$ , we reconstructed the activities $\sigma_{i}(t+1)$ of the remaining neurons $i\not\in I$ . As a control, we selected the input neurons at random and compared them with input neurons selected on the basis of the coupling strength $|W_{ij}|$ as the input set $I.$ As more input neurons are considered, the reconstruction predicts $\sigma_{i}(t+1)$ more accurately (Figs. 4E and S3). Pinning the activities of only $|I|=10$ strongly coupled neurons gave predicted activities of the remaining 90 neurons that were very close to the observed activities (Fig. 4F), in contrast to predicted activities obtained by pinning randomly selected sets of 10 input neurons (Fig. 4G).

Currency network

Finally, we apply our method to another difficult and representative stochastic problem, currency exchange rate fluctuations. We obtained time series of currency exchange rates from January 2000 to December 2017 (33), and examined exchange rates denominated in Euro (EUR) of $11$ actively traded currencies (Fig. 5A). First, we concentrate on the daily fluctuations of the exchange rates, since most financial analyses center on price increments rather than absolute prices (34). We binarize the real-valued rates to concentrate on the sign of their daily fluctuations (Fig. 5B). We defined the binarized rate $\sigma_{i}(t)=1$ for a day-to-day increase of exchange rate $i$ at time $t$ ( $r_{i}(t)>r_{i}(t-1)$ ), and $\sigma_{i}(t)=-1$ for a decrease. If there was no change ( $r_{i}(t)=r_{i}(t-1)$ ), we set $\sigma_{i}(t)=\sigma_{i}(t-1)$ . Second, we divide the data for different periods to investigate the time dependence of the couplings between exchange rates. Using the Fourier transform of the binarized time series, we identified a characteristic period, 550 business days ( $\sim$ 2 years), of the fluctuations (Fig. 5C). We inferred the currency network weights $W_{ij}$ separately in two-year periods, shown here (Figs. 5D-F, upper) for the three periods 2012-2013, 2014-2015, and 2016-2017. We found agreement between the covariance $C_{ij}=\langle\delta\sigma_{i}(t)\delta\sigma_{j}(t)\rangle$ of the observed currency data and that of the simulated currency data using $H_{i}(\sigma(t))=H_{i}^{\textrm{ext}}+\sum_{j}W_{ij}\sigma_{j}(t)$ (Figs. 5D-F, lower). In contrast, when we estimated the currency network using the data for the entire period 2000-2017, the network had weaker connections and smaller covariances $C_{ij}$ compared to the time-dependent analysis (Fig. 5G).
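The binarization rule described above can be sketched as follows. The function name `binarize_rates` is hypothetical, and the value assigned when the very first observed day shows no change (here $+1$) is an assumption, since the text does not specify it.

```python
import numpy as np

def binarize_rates(r):
    """Map daily rates r (shape (T, n)) to signs sigma (shape (T-1, n)).

    sigma = +1 for an increase, -1 for a decrease, and the previous
    value when the rate is unchanged.
    """
    r = np.asarray(r, dtype=float)
    diff = np.sign(np.diff(r, axis=0))   # +1, -1, or 0 for each day and currency
    sigma = np.empty_like(diff)
    prev = np.ones(r.shape[1])           # assumption: +1 if the first day is unchanged
    for t in range(diff.shape[0]):
        prev = np.where(diff[t] != 0, diff[t], prev)   # carry the last sign over ties
        sigma[t] = prev
    return sigma
```

For a single currency with rates `[1, 2, 2, 1, 1, 3]`, the day-to-day changes are up, none, down, none, up, so the binarized series is `[1, 1, -1, -1, 1]`.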

The raw exchange rate data is continuous. Is our binarized inference of any practical value? To address this, we simulated a currency trade strategy, and checked if the strategy was profitable. Using only data within a time window of a period $T$ , $\{\sigma(t-T+1),\sigma(t-T+2),\cdots,\sigma(t)\},$ we predicted the currency fluctuations $\sigma(t+1)$ on the next day. For the trade simulation, we considered a hedging trader who buys one currency with 1 EUR and sells one currency with 1 EUR. To earn profits, the trader is supposed to sell/buy a currency that has the highest probability of increase/decrease in exchange rate: the currency $sell=\arg\max_{i}P(\sigma_{i}(t+1)=+1|\sigma(t))$ and the currency $buy=\arg\max_{i}P(\sigma_{i}(t+1)=-1|\sigma(t))$ . Then, a daily profit can be defined as ${\text{profit}(t)}=r_{sell}(t+1)/r_{sell}(t)-r_{buy}(t+1)/r_{buy}(t)$ . We calculated cumulative profits of the trade simulation from 2004 to 2017 with various time window sizes that we considered as past information (Fig. 5H for $T=500$ days). Hedging strategies profit from market volatility and, indeed, our trade simulation showed large profits when the exchange rates had large fluctuations (Fig. 5A). The window size $T$ had an optimal period of 500-750 business days (Fig. 5I). For a more refined strategy, we considered the quality or accuracy of our inference by probing the discrepancy $D_{i}(W)$ in Eq. (5). Instead of trading every day, we traded only on the days when the discrepancy at that day, $D(t)\equiv\sum_{i}\big{[}\sigma_{i}(t)-\langle\langle\sigma_{i}(t)\rangle\rangle_{\sigma(t-1)}\big{]}^{2}$ , was lower than the average $T^{-1}\sum_{t=1}^{T}D(t)$ for a fixed window size $T$ . This strategy doubled the profits per transaction (Figs. 5H and 5I), showing that the discrepancy $D_{i}(W)$ is a useful measure of model accuracy.
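A single day of the trade simulation can be sketched as below. Since $P(\sigma_{i}(t+1)=+1|\sigma(t))=(1+\tanh H_{i})/2$ is increasing in $H_{i}$, the currency most likely to rise is the one with the largest local field and the one most likely to fall is the one with the smallest. The function name `daily_profit` and its argument layout are hypothetical, and the couplings `W` and `H_ext` are assumed to have been inferred from the preceding window of $T$ days.

```python
import numpy as np

def daily_profit(W, H_ext, sigma_t, r_t, r_next):
    """Profit of one hedged trade, following the profit(t) formula in the text.

    W, H_ext : couplings and external fields inferred from the past window
    sigma_t  : binarized fluctuations on day t
    r_t, r_next : exchange rates on days t and t+1
    """
    H = H_ext + W @ sigma_t      # local fields H_i(sigma(t))
    sell = int(np.argmax(H))     # argmax_i P(sigma_i(t+1) = +1 | sigma(t))
    buy = int(np.argmin(H))      # argmax_i P(sigma_i(t+1) = -1 | sigma(t))
    return r_next[sell] / r_t[sell] - r_next[buy] / r_t[buy]
```

Cumulative profit is then the sum of `daily_profit` over the trading days; the refined strategy in the text would simply skip days on which $D(t)$ exceeds its window average.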

Discussion

We demonstrated that under-determined stochastic systems can be inferred in a conceptually simple and computationally efficient manner using the mathematical framework of statistical physics. Since network inference is an important subject, many different approaches have been developed. Equilibrium approaches assume symmetric interactions ( $W_{ij}=W_{ji}$ ) between node $i$ and node $j$ , and estimate the pair-wise interaction strengths that can maximally explain the observed static patterns of network activity in brains (32, 35, 36), proteins (37, 38), and stock markets (39). In contrast, non-equilibrium approaches do not assume symmetry, and infer asymmetric causal relations between nodes that can better explain dynamic patterns of network activity (29). Causality inference for non-equilibrium models (e.g., using recurrent neuronal networks) is computationally expensive. Although mean-field methods have been introduced to circumvent this practical problem (27), these approximation methods only work for weak-interaction regimes with large sample size. All small sample size inference must contend with over-fitting so the key feature of our approach was to consistently decouple the model update step and a discrepancy measure that is similar to Expectation Maximization. This decoupling allowed us to iterate with a multiplicative model update, and to stop when the discrepancy measure quantifies that the multiplicative update has saturated. We derived this within a standard statistical physics formulation (24, 25), so no ad hoc averaging or approximation steps were involved. We demonstrated that our method outperforms others in inferring the asymmetric interactions of the kinetic Ising model, especially in strong-interaction regimes, and particularly when available data was limited. Another aspect of small sample size inference is that longer time-scale modulation of couplings can be uncovered. This is of considerable practical import as we demonstrated with the currency exchange rate network.

FEM has several computational merits. Besides having no incremental learning rate that requires tuning, the method is parallelizable and scalable: We computed results for the kinetic Ising model with up to $N=5000$ interacting spins, determining $2.5\times 10^{7}$ parameters (Fig. S4). We also demonstrated that the method can infer not only linear interactions but also higher-order interactions. Moreover, FEM is generalizable to systems with any number of discrete states, although we focused on binary stochastic systems here. Uncovering hidden nodes for stochastic network inference (40) is an exciting avenue for future work.

Acknowledgments

Gašper Tkačik generously provided the neuronal activity data. We thank Changbong Hyeon and Arthur Sherman for comments on the manuscript. This work was supported by the Intramural Research Program of the National Institutes of Health, NIDDK (D.-T.H., V.P.), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2016R1D1A1B03932264) and the Max Planck Society, Gyeongsangbuk-Do and Pohang City (J.J.).

Supporting Information (SI)

SI Text 1: Schwinger’s source formalism

Here, we derive the differential geometry of $\langle E_{i}\rangle$ in terms of its dependence on $\langle\sigma\rangle$ by using Schwinger’s source formalism (24, 25). This is a model-free approach, because we do not assume a specific functional form of $\langle E_{i}\rangle$ at the outset. First, we define the moment generating function,

[TABLE]

The log partition function, $F=\log Z$ , allows the computation of expectation values of $\sigma$ and $E_{i}$ simply by differentiation

[TABLE]

Here, the activity expectation $m(J)$ depends on $J$ . We can make the observable expectation $m$ the independent variable, and the control parameter $J$ the dependent variable by using a Legendre transform:

[TABLE]

Defining a normalized probability,

[TABLE]

in Eq. (S1), it is straightforward to show that

[TABLE]

with the Shannon entropy appearing naturally,

[TABLE]

Then, the duality between the free energies $F$ and $G$ through their Legendre transform in Eq. (S4) leads to

[TABLE]

Therefore, once we know the free energy $G(m,\beta)$ , it is straightforward to obtain $\langle E\rangle_{J,\beta}$ . For our purposes, however, it is unnecessary to obtain $G(m,\beta)$ for all values of $m,$ as it suffices to know the function at minimum, because the free energy is minimized at the data expectation: $m^{*}=\langle\sigma\rangle_{J=0,\beta=0}$ . Note that $J=0$ imposes the minimum condition ( $\partial G/\partial m=0$ ) in Eq. (S8). Then, we have the Taylor expansion of $G(m,\beta)$ at $m=m^{*}:$

[TABLE]

where the derivatives $[\cdot]^{*}$ are taken at $m=m^{*}$ . Differentiating the expanded $G(m,\beta)$ with respect to $\beta$ leads to

[TABLE]

Now, we calculate each derivative appearing in the preceding expression:

(i)

[TABLE]

(ii)

[TABLE]

where

[TABLE]

(iii)

[TABLE]

(iv)

[TABLE]

Plugging these derivatives into the $\beta$-derivative of the expanded $G(m,\beta)$, we obtain the following equation up to second order in $\delta m$:

[TABLE]

where we used the shorter notation: $\langle f\rangle^{\prime}\equiv\langle f\rangle_{J,\beta=0}$ , $\langle f\rangle^{*}\equiv\langle f\rangle_{J=0,\beta=0}$ , and $\langle\delta f\rangle^{\prime}\equiv\langle f\rangle^{\prime}-\langle f\rangle^{*}$ . Finally, we obtain the following relation:

[TABLE]

where

[TABLE]

and

[TABLE]

The second term in Eq. (S18) can be approximated as

[TABLE]

where the second line assumes a negligible correlation between $\sigma_{j}$ and $\sigma_{k}$ : $\langle\sigma_{j}\sigma_{k}\rangle\approx\langle\sigma_{j}\rangle\langle\sigma_{k}\rangle$ . Then, with the Rao-Blackwell conditional expectation update $H_{i}(m)^{\textrm{new}}\leftarrow\langle E_{i}\rangle_{J(m^{*})}$ , Eq. (S18) implies

[TABLE]

where we used $Q_{ijk}=Q_{ikj}$ . This formalism allows one to infer the linear and quadratic relations between $H_{i}$ and $\sigma$ .

SI Text 2: Review of mean-field methods for the kinetic Ising model

Maximum likelihood estimation (MLE)

The kinetic Ising model updates spins with the conditional probability,

[TABLE]

where $H_{i}(\sigma(t))=\sum_{j}W_{ij}\sigma_{j}(t)$ . Then, the expectation value of $\sigma_{i}(t+1)$ given $\sigma(t)$ becomes

[TABLE]
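As a concrete illustration of this update rule, the following minimal sketch generates a spin trajectory by drawing every $\sigma_{i}(t+1)$ from $P(\sigma_{i}(t+1)=\pm 1|\sigma(t))=\exp(\pm H_{i})/[\exp(H_{i})+\exp(-H_{i})]$. The Gaussian couplings, the system size, and the function name are illustrative assumptions, not the settings used for the figures.

```python
import numpy as np

def simulate_kinetic_ising(W, L, rng):
    """Generate L time steps of +/-1 spins with parallel stochastic updates."""
    N = W.shape[0]
    sigma = np.empty((L, N))
    sigma[0] = rng.choice([-1.0, 1.0], size=N)        # random initial configuration
    for t in range(L - 1):
        H = W @ sigma[t]                              # H_i(sigma(t)) = sum_j W_ij sigma_j(t)
        p_up = np.exp(H) / (np.exp(H) + np.exp(-H))   # P(sigma_i(t+1) = +1 | sigma(t))
        sigma[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)
    return sigma

rng = np.random.default_rng(0)
N, L = 20, 2000
W = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))    # illustrative random couplings
sigma = simulate_kinetic_ising(W, L, rng)
```

With this form of `p_up`, the conditional mean $2P(+1)-1$ equals $\tanh(H_{i})$, which is the expectation value quoted above.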

Given $N$ -dimensional time-series data $\sigma(t)$ with length $L$ , the data likelihood is defined as

[TABLE]

Using MLE, one can optimize $W_{ij}$ to increase $\log{\cal{P}}$:

[TABLE]

with a learning rate $\alpha$ (29). Here, one can calculate the gradient with Eq. (S23),

[TABLE]
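The MLE iteration can be sketched as plain gradient ascent. The sketch below assumes the log-likelihood $\log{\cal{P}}=\sum_{t}\sum_{i}[\sigma_{i}(t+1)H_{i}(\sigma(t))-\log 2\cosh H_{i}(\sigma(t))]$, which follows from the conditional probability above, and its gradient $\sum_{t}[\sigma_{i}(t+1)-\tanh H_{i}(\sigma(t))]\sigma_{j}(t)$. Dividing the gradient by the number of transitions is a choice made here that only rescales $\alpha$; the data, the value of $\alpha$, and the number of steps are illustrative.

```python
import numpy as np

def log_likelihood(W, sigma):
    """log P = sum_t sum_i [sigma_i(t+1) H_i(t) - log(2 cosh H_i(t))]."""
    H = sigma[:-1] @ W.T                              # H[t, i] = sum_j W_ij sigma_j(t)
    return float(np.sum(sigma[1:] * H - np.logaddexp(H, -H)))

def mle_step(W, sigma, alpha):
    """One update W <- W + alpha * gradient, with the gradient averaged over t."""
    H = sigma[:-1] @ W.T
    residual = sigma[1:] - np.tanh(H)                 # sigma_i(t+1) - tanh(H_i(sigma(t)))
    grad = residual.T @ sigma[:-1] / (len(sigma) - 1)
    return W + alpha * grad

# Illustrative data: a short trajectory from hypothetical couplings W_true.
rng = np.random.default_rng(1)
N, L = 5, 500
W_true = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
sigma = np.empty((L, N))
sigma[0] = rng.choice([-1.0, 1.0], size=N)
for t in range(L - 1):
    p_up = 1.0 / (1.0 + np.exp(-2.0 * (W_true @ sigma[t])))
    sigma[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)

W = np.zeros((N, N))
ll_start = log_likelihood(W, sigma)
for _ in range(200):
    W = mle_step(W, sigma, alpha=0.1)
ll_end = log_likelihood(W, sigma)                     # higher than ll_start after ascent
```

The need to choose $\alpha$ and the number of iterations is the tuning burden that distinguishes this scheme from the closed-form mean-field estimates reviewed next.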

Naïve mean-field approximation (nMF)

The maximum condition of the log-likelihood ($\partial\log{\cal{P}}/\partial W_{ij}=0$) in Eq. (S27) gives

[TABLE]

with $H_{i}(\sigma(t))=\sum_{k}W_{ik}\sigma_{k}(t)$. For a mean-field approximation, spin activities are represented by the mean-field activity plus a residual: $\sigma_{i}(t)=m_{i}+\delta\sigma_{i}(t)$. Then, using a first-order Taylor expansion, one can approximate $\tanh\big(H_{i}(\sigma(t))\big)\approx\tanh(g_{i})+\big(1-\tanh^{2}(g_{i})\big)\sum_{k}W_{ik}\delta\sigma_{k}(t)$ with $g_{i}=\sum_{k}W_{ik}m_{k}$. The zeroth-order expectation, $\langle\langle\sigma_{i}(t+1)\rangle\rangle_{\sigma(t)}\approx\tanh(g_{i})$, gives the self-consistent equation

[TABLE]

Then, using the mean-field approximation, Eq. (S28) becomes

[TABLE]

Given the data with length $L$ ,

[TABLE]

One can also derive this equation from $\delta\sigma_{i}(t+1)=(\partial m_{i}/\partial m_{k})\delta\sigma_{k}(t)$ with Eq. (S29). The equality gives a matrix equation to infer

[TABLE]

where $[A_{\text{nMF}}]_{ij}=(1-m_{i}^{2})\delta_{ij}$ is a diagonal matrix; $B_{ij}=\langle\delta\sigma_{i}(t+1)\delta\sigma_{j}(t)\rangle$ is a time-delayed correlation; and the covariance matrix $C_{ij}=\langle\delta\sigma_{i}(t)\delta\sigma_{j}(t)\rangle$ is an equal-time correlation (27).
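The nMF estimate can be computed directly from data. The sketch below assumes the matrix relation takes the form $B=A_{\text{nMF}}WC$, so that $W_{\text{nMF}}=A_{\text{nMF}}^{-1}BC^{-1}$, with $A_{\text{nMF}}$, $B$, and $C$ as defined above. The simulated trajectory, the weak couplings, and the function name are illustrative assumptions.

```python
import numpy as np

def infer_nmf(sigma):
    """nMF estimate W = A^{-1} B C^{-1} from an (L, N) array of +/-1 spins."""
    m = sigma.mean(axis=0)                     # mean activities m_i
    d = sigma - m                              # residuals delta sigma_i(t)
    n = len(sigma) - 1                         # number of (t, t+1) pairs
    C = d[:-1].T @ d[:-1] / n                  # equal-time correlation C_ij
    B = d[1:].T @ d[:-1] / n                   # time-delayed correlation B_ij
    A_inv = np.diag(1.0 / (1.0 - m ** 2))      # inverse of diagonal A_nMF
    return A_inv @ B @ np.linalg.inv(C)

# Illustrative test data in a weak-coupling regime with a long trajectory.
rng = np.random.default_rng(2)
N, L = 10, 20000
W_true = rng.normal(0.0, 0.5 / np.sqrt(N), size=(N, N))
sigma = np.empty((L, N))
sigma[0] = rng.choice([-1.0, 1.0], size=N)
for t in range(L - 1):
    p_up = 1.0 / (1.0 + np.exp(-2.0 * (W_true @ sigma[t])))
    sigma[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)

W_nmf = infer_nmf(sigma)
corr = np.corrcoef(W_true.ravel(), W_nmf.ravel())[0, 1]
```

The example deliberately uses weak couplings and many samples, the regime in which mean-field estimates are expected to work; inverting $C$ becomes ill-conditioned when the trajectory is short relative to $N$.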

Thouless-Anderson-Palmer mean-field approximation (TAP)

Compared to nMF, TAP additionally includes a second-order correction, the Onsager reaction term:

[TABLE]

with $g_{i}\equiv\sum_{k}W_{ik}m_{k}$ , $\delta g_{i}\equiv\sum_{k}W_{ik}\delta\sigma_{k}(t)$ , and

[TABLE]

under the assumption of negligible correlation between $\sigma_{k}$ and $\sigma_{l}$: $\langle\delta\sigma_{k}\delta\sigma_{l}\rangle\approx 0$ for $k\neq l$. The correction gives a refined self-consistent equation

[TABLE]

Then, using $\delta\sigma_{i}(t+1)=(\partial m_{i}/\partial m_{k})\delta\sigma_{k}(t)$ , one can derive

[TABLE]

with $F_{i}\equiv(1-m_{i}^{2})\sum_{l}W_{il}^{2}(1-m_{l}^{2})$ . This leads to

[TABLE]

Therefore, one obtains the TAP estimates

[TABLE]

Here, one can obtain $F_{i}$ as a solution of the self-consistent equation (27):

[TABLE]
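A hedged sketch of the TAP correction follows. It assumes, following the formulation in (27), that $[A_{\text{TAP}}]_{ij}=(1-m_{i}^{2})(1-F_{i})\delta_{ij}$, so that $W_{\text{TAP},ij}=W_{\text{nMF},ij}/(1-F_{i})$, and that substituting this into the definition of $F_{i}$ gives the cubic $F_{i}(1-F_{i})^{2}=(1-m_{i}^{2})\sum_{l}[W_{\text{nMF}}]_{il}^{2}(1-m_{l}^{2})$. Taking the smallest root, which lies in $[0,1/3]$, is also an assumption, and the function names are hypothetical.

```python
import numpy as np

def solve_F(rhs, iters=60):
    """Smallest root of F (1 - F)^2 = rhs, found by bisection on [0, 1/3]."""
    rhs = np.asarray(rhs, dtype=float)
    if np.any(rhs < 0.0) or np.any(rhs > 4.0 / 27.0):
        raise ValueError("no root in [0, 1/3]: couplings too strong for this branch")
    lo = np.zeros_like(rhs)
    hi = np.full_like(rhs, 1.0 / 3.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        below = mid * (1.0 - mid) ** 2 < rhs   # F (1 - F)^2 is increasing on [0, 1/3]
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    return 0.5 * (lo + hi)

def tap_from_nmf(W_nmf, m):
    """Rescale each row of the nMF estimate by 1 / (1 - F_i)."""
    rhs = (1.0 - m ** 2) * ((W_nmf ** 2) @ (1.0 - m ** 2))
    F = solve_F(rhs)
    return W_nmf / (1.0 - F)[:, None], F

# Toy input: uniform nMF couplings and zero mean activities.
W_nmf = 0.1 * np.ones((3, 3))
m = np.zeros(3)
W_tap, F = tap_from_nmf(W_nmf, m)
```

The explicit error for a right-hand side above $4/27$ reflects that the cubic has no root on this branch when the inferred couplings are strong, consistent with the restriction of these approximations to weak-interaction regimes noted in the Discussion.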

Exact mean-field approximation (eMF)

For random $W_{ik}$ with a large number $N$ of spin components, it is reasonable to assume that $H_{i}=\sum_{k=1}^{N}W_{ik}\sigma_{k}$ follows a Gaussian distribution with mean $g_{i}=\sum_{k}W_{ik}m_{k}$ and variance $\Delta_{i}=\langle\delta g_{i}^{2}\rangle=\sum_{l}W_{il}^{2}(1-m_{l}^{2})$, as obtained in the TAP derivation above:

[TABLE]

Here, the zeroth-order and second-order Taylor expansions of $\tanh(g_{i}+x\sqrt{\Delta_{i}})$ with respect to $x$ give the nMF and TAP solutions in Eqs. (S29) and (S35). The pair of variables $x\equiv\delta g_{i}$ and $y\equiv\delta g_{j}$ may also follow a joint Gaussian distribution:

[TABLE]

where the covariance $\Delta_{ij}$ is defined as

[TABLE]

Then, the time-delayed correlation matrix $B$ can be approximated as

[TABLE]

Using $B$ , one can derive $BW^{\top}=AWCW^{\top}$ as follows:

[TABLE]

This equation gives

[TABLE]

where $[A_{\text{eMF}}]_{ij}=a_{i}\delta_{ij}$ is a diagonal matrix. In practice, one can obtain $W_{\text{eMF}}$ with the following iterations (28):

(i)

Calculate $\Delta_{i}$ (guess $\Delta_{i}$ for the first round):

[TABLE]

(ii)

Find $g_{i}$ as a solution for the following integral equation:

[TABLE]

(iii)

Calculate $a_{i}$ given $g_{i}$ and $\Delta_{i}$ :

[TABLE]
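Steps (ii) and (iii) can be sketched numerically for a single spin. The sketch assumes that the integral equation in step (ii) is $m_{i}=\int Dx\,\tanh(g_{i}+x\sqrt{\Delta_{i}})$ with $Dx$ the standard Gaussian measure, as implied by the Gaussian assumption above, and that $a_{i}=\int Dx\,[1-\tanh^{2}(g_{i}+x\sqrt{\Delta_{i}})]$ as in (28). The Gauss-Hermite quadrature order, the bisection bracket, and the function names are choices made here; step (i) is not reproduced.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes and weights for the weight function exp(-x^2 / 2).
_X, _WGT = hermegauss(60)
_WGT = _WGT / np.sqrt(2.0 * np.pi)             # normalise to a standard Gaussian measure

def gauss_mean(f, g, delta):
    """Approximate the Gaussian average of f(g + x sqrt(delta)), x ~ N(0, 1)."""
    return float(np.sum(_WGT * f(g + _X * np.sqrt(delta))))

def solve_g(m, delta, lo=-20.0, hi=20.0, iters=80):
    """Step (ii): find g with <tanh(g + x sqrt(delta))> = m by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gauss_mean(np.tanh, mid, delta) < m:   # the average is increasing in g
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def slope_a(g, delta):
    """Step (iii): a = <1 - tanh^2(g + x sqrt(delta))>."""
    return gauss_mean(lambda h: 1.0 - np.tanh(h) ** 2, g, delta)

g_nmf = solve_g(0.3, 0.0)                      # delta = 0 reduces to g = arctanh(m)
a_nmf = slope_a(g_nmf, 0.0)                    # and to a = 1 - m^2, the nMF value
g_emf = solve_g(0.3, 1.0)
a_emf = slope_a(g_emf, 1.0)                    # smaller than 1 - m^2 when delta > 0
```

In a full iteration these two functions would be applied to every spin $i$, and the resulting $a_{i}$ would be fed back into step (i) until $\Delta_{i}$ stops changing.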

SI Text 3: Neuronal data processing

In the original data, neuron $i$ is defined as “active” ($\sigma_{i}(t)=1$) if it fires at least once during the time window $[t,t+\delta t]$, and as “silent” ($\sigma_{i}(t)=-1$) otherwise (Fig. S2, upper). To reduce the dependence of the activity definition on the time interval $\delta t$, we used a moving average of activities. We examined the past five and future five activities of neuron $i$, and redefined $\sigma_{i}(t)=1$ if neuron $i$ emitted at least one spike in this extended window, and $\sigma_{i}(t)=-1$ otherwise (Fig. S2, lower). Since neurons may have a refractory period that prevents consecutive spikes after emitting a spike (41), the moving average can also help infer the genuine interaction between neurons by reducing the effect of the refractory period.
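The windowed redefinition of activity can be sketched as follows. This is an illustration under stated assumptions: the input is an array of $\pm 1$ activities of shape (time, neurons), the window spans five bins on each side of $t$ including $t$ itself, and windows are simply truncated at the start and end of the recording. The boundary handling and the function name are choices made here.

```python
import numpy as np

def smooth_activity(sigma, half_width=5):
    """Set sigma_i(t) = +1 if neuron i is active anywhere in [t - h, t + h]."""
    active = sigma > 0
    T = len(active)
    out = np.empty_like(sigma)
    for t in range(T):
        lo = max(0, t - half_width)            # truncate the window at the edges
        hi = min(T, t + half_width + 1)
        out[t] = np.where(active[lo:hi].any(axis=0), 1, -1)
    return out

# Toy example: one neuron with a single spike at t = 6, window half-width 2.
raw = -np.ones((12, 1), dtype=int)
raw[6, 0] = 1
smoothed = smooth_activity(raw, half_width=2)  # active for t = 4, ..., 8
```

A single spike is thus spread over the neighbouring bins, so the binarized activity depends less on exactly where the bin edges fall.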

For the estimation of $W_{ij}$ and $H_{i}^{\textrm{ext}}$, we first estimated $W_{ij}$ with $H_{i}^{\textrm{ext}}=0$, and then estimated $W_{ij}$ and $H_{i}^{\textrm{ext}}$ together, because $H_{i}^{\textrm{ext}}$ turned out to be quite large compared to $W_{ij}$. This training procedure was repeated $20$ times.

Bibliography

(1) Klimovskaia A, Ganscha S, Claassen M (2016) Sparse regression based structure learning of stochastic reaction networks from single cell snapshot time series. PLoS Computational Biology 12(12):e1005234.
(2) Bar-Joseph Z, Gitter A, Simon I (2012) Studying and modelling dynamic biological processes using time-series gene expression data. Nature Reviews Genetics 13(8):552–564.
(3) Dombeck DA, Khabbaz AN, Collman F, Adelman TL, Tank DW (2007) Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron 56(1):43–57.
(4) Schneidman E, Berry MJ 2nd, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440(7087):1007–12.
(5) Nguyen JP, et al. (2016) Whole-brain calcium imaging with cellular resolution in freely behaving Caenorhabditis elegans. Proc Natl Acad Sci U S A 113(8):E1074–81.
(6) Bernal-Casas D, Lee HJ, Weitz AJ, Lee JH (2017) Studying brain circuit function with dynamic causal modeling for optogenetic fMRI. Neuron 93(3):522–532.e5.
(7) Sugihara G, et al. (2012) Detecting causality in complex ecosystems. Science 338:496–500.
(8) Schmidt M, Lipson H (2009) Distilling free-form natural laws from experimental data. Science 324(5923):81–5.