Identifying Nonlinear 1-Step Causal Influences in Presence of Latent Variables
Saber Salehkaleybar, Jalal Etesami, Negar Kiyavash

TL;DR
This paper introduces an information-theoretic method to identify 1-step causal influences in stochastic dynamical systems with latent variables, including a linear regression-based approach for linear dynamics, validated through simulations.
Contribution
It presents a novel approach for causal discovery in systems with latent variables, extending existing methods to nonlinear and linear cases with validation.
Findings
Successfully recovers causal relations among observed variables.
Effective in systems where latent variables lack exogenous noise.
Validated through numerical simulations demonstrating practical applicability.
Abstract
We propose an approach for learning the causal structure in stochastic dynamical systems with a 1-step functional dependency in the presence of latent variables. We propose an information-theoretic approach that allows us to recover the causal relations among the observed variables as long as the latent variables evolve without exogenous noise. We further propose an efficient learning method based on linear regression for the special sub-case when the dynamics are restricted to be linear. We validate the performance of our approach via numerical simulations.
1 Introduction
Identifying causal influences in a network of time series is one of the fundamental problems in many different fields, including social sciences, economics, computer science, and biology. In macroeconomics, for instance, researchers seek to understand which factors contribute to economic fluctuations and how these factors interact with each other [12]. In neuroscience, an extensive body of research focuses on learning the interactions between different regions of the brain by analyzing neural spike trains [16].
In the 1960s, Granger proposed a definition of causality between random processes [8]. The key idea of his definition is that if a process $X$ causes another process $Y$, then knowing the past of $X$ up to time $t$ must aid in predicting $Y$ after time $t$. In particular, let $\sigma^2_Y(h \mid \Omega_t)$ be the mean square error (MSE) of the optimal $h$-step predictor of a random process $Y$ at time $t$ given information $\Omega_t$. Process $X$ is said to Granger cause process $Y$ if:
$$\exists\, h > 0 \quad \text{s.t.} \quad \sigma^2_Y(h \mid \Omega_t) < \sigma^2_Y\big(h \mid \Omega_t \setminus \{X_s : s \le t\}\big),$$
where the set $\Omega_t$ contains all information in the universe related to the past and the present of $Y$. We also say that the process $X$ has a 1-step cause on $Y$ if the above inequality holds for $h = 1$. In other words, considering the past of $X$ in the set $\Omega_t$ improves prediction of $Y_{t+1}$.
Granger’s definition of causality is consistent with the belief that a cause cannot come after the effect, but it is not practical in some settings because it requires knowledge of the entire set $\Omega_t$. To put it differently, it is hard to identify and account for all parts of the universe that are related to a specific process $Y$. Hence, only the available information related to $Y$ is considered in practice [11]. To see what may go wrong in such a situation, consider the following linear model with three state variables:
[TABLE]
where and are observable but is latent. Let and be i.i.d. random variables with the same variance. If we fit a linear model only on and without considering , our estimate of the upper-left submatrix would be: . This result implies that is a 1-step cause of with the strength which is wrong. The concept of Granger causality can be generalized to the nonlinear setting using an information-theoretic quantity called “directed information” [13]. Still, the problem caused by latent processes persists in that setting as well.
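The effect described above can be reproduced numerically. The sketch below is only an illustration under assumed values: the coefficient matrix `A_true`, the noise scales, and the helper names `simulate_linear_system` and `fit_var1` are hypothetical and are not the numbers of the example in the text. The third process is treated as latent and carries no exogenous noise; the first process affects the second only through the third.

```python
import numpy as np

def simulate_linear_system(A, noise_std, n_steps, rng):
    """Simulate x(t+1) = A x(t) + w(t+1) with per-coordinate noise scales."""
    dim = A.shape[0]
    x = np.zeros((n_steps, dim))
    for t in range(n_steps - 1):
        x[t + 1] = A @ x[t] + noise_std * rng.standard_normal(dim)
    return x

def fit_var1(x):
    """Least-squares estimate of B in x(t+1) ~ B x(t)."""
    coef, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
    return coef.T

# Hypothetical coefficients: X1 -> X3 -> X2, with no direct X1 -> X2 term.
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.6],
                   [0.6, 0.0, 0.5]])
noise_std = np.array([1.0, 1.0, 0.0])  # the latent X3 has no exogenous noise

rng = np.random.default_rng(0)
x = simulate_linear_system(A_true, noise_std, 20000, rng)

B_full = fit_var1(x)             # regression on all three processes
B_observed = fit_var1(x[:, :2])  # regression that ignores the latent X3

print("true X1 -> X2 coefficient:", A_true[1, 0])
print("estimate with X3 included:", round(B_full[1, 0], 3))
print("estimate with X3 left out:", round(B_observed[1, 0], 3))
```

With the latent process included, the estimated X1 -> X2 coefficient stays close to zero; once it is dropped, the regression attributes part of the indirect path to a direct 1-step effect.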
Identifying causal relations between random variables has been studied in the presence of latent variables to some extent. For instance, Elidan et al. proposed an algorithm based on expectation maximization (EM) to estimate the parameters of their model by fixing the number of latent variables and also the structural relationships between latent and observed variables [3]. Chandrasekaran et al. [1] presented a tractable convex program based on regularized maximum likelihood for recovering causal relations for a model where the latent and observed variables are jointly Gaussian, and the conditional statistics of the observed variables given the latent variables are described by a sparse graph. A well-known approach for learning latent Markov models uses quartet-based distances to discover the structure [10, 4]. In most quartet-based solutions, a set of quartets is constructed for all subsets of four observable variables and then the quartets are merged to form a tree structure.
In recent years, there has been an increasing interest in inferring causal relations in random processes. Jalali and Sanghavi showed that 1-step causal relations between observed variables can be identified in a Vector Auto-Regressive (VAR) model assuming that connections between observed variables are sparse and each latent variable interacts with many observed variables [9]. In [7], Geiger et al. showed that identifying 1-step causes between observed variables is possible under some algebraic conditions on the transition matrix of the VAR model. Recently, Etesami et al. studied a network of processes with polytree structure and introduced an algorithm that can learn latent polytrees using a discrepancy measure [6].
In this paper, we propose an information-theoretic criterion for identifying the causal relations in a general model of stochastic dynamical systems, without restricting the mapping functions (say, to linear mappings) or the underlying structure (e.g., being a tree) among the observed nodes, provided that there is no exogenous noise in the latent part. We propose an efficient method to identify functional dependencies for the special case of linear mappings. We further demonstrate the applicability of this criterion through simulation results for both linear and nonlinear cases.
The paper is organized as follows. In Section 2, we provide the preliminary definitions and describe the system model. In Section 3, we present the main result and study its restriction to linear models. We provide our simulation results in Section 4. Finally, we conclude in Section 5.
2 Problem Definition
In this section, after introducing some notational conventions, we present the model of the stochastic dynamical system. Afterwards, we present our definition of 1-step functional dependency between the processes for this model.
2.1 Notations
Any vector with entries is denoted by . We denote the -th random variable in the -th process by . We use underlined characters to represent a collection of processes, for example is used to denote a set of random processes with index set from time [math] up to time . For , we denote by . We also define: . The identity matrix of size is shown by . We denote entry of a matrix by .
In a directed graph that is characterized by a set of vertices (or nodes) and a set of ordered pairs of vertices, called arrows (or edges) , we denote the set of parents of a node by and define it as .
2.2 System model
Consider a dynamical system described by states in which the first processes, denoted by , are observable and the rest, denoted by , are latent. More precisely, the joint dynamics of the system are given by:
[TABLE]
where the exogenous noises are i.i.d. with mean zero. , are mapping functions that belong to an appropriately constrained class of functions. Furthermore, we assume that is a vector of unknown but fixed values. The goal of this work is to identify the causal structure among the observed processes given their realizations. Next, we formally introduce what we mean by a causal structure of a dynamical system.
2.3 Causal Structural Graph
In dynamical systems with functional dependencies, there is a natural notion of influence among the processes, in the sense that process causes process , if is a function of . Such dependencies have been studied in the literature [5]. Adopting the definition of functional dependency in [5], we define the causal structure of the system in (5) as follows.
Random process 1-step functionally depends on process over the time horizon , if changing the value of while keeping all the other variables fixed results in a change in for some time . Next, we present our formal definition of functional dependencies in systems whose joint dynamics is described by (5).
Definition 1.
We say 1-step functionally influences if and only if , where
[TABLE]
and are two realizations of .
In order to visualize the causal structure in (5), we introduce a directed graph whose nodes represent random processes and in which there is an arrow from node to node , if 1-step functionally influences .
Example 1.
Consider a causal system with 3 processes such that their joint dynamics are given by:
[TABLE]
where s are independent exogenous noises. Figure 1 depicts the functional dependency graph of this system.
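To make the notion of a functional dependency graph concrete, the following sketch probes a noise-free 1-step map numerically: it perturbs one input at a time, keeps the others fixed, and records which outputs change. The map `step` is a hypothetical three-process system chosen for illustration (it is not the system of Example 1), and random probing is only a numerical check, not a proof of dependence or independence.

```python
import numpy as np

def step(x):
    """Hypothetical noise-free 1-step map for three processes."""
    x1, x2, x3 = x
    return np.array([
        0.5 * x1,               # next X1 depends on X1 only
        np.sin(x1) + 0.3 * x2,  # next X2 depends on X1 and X2
        x2 * x3,                # next X3 depends on X2 and X3
    ])

def functional_dependency_graph(f, dim, n_probes=200, seed=0):
    """adj[j, i] is True if perturbing input j changed output i in some probe."""
    rng = np.random.default_rng(seed)
    adj = np.zeros((dim, dim), dtype=bool)
    for _ in range(n_probes):
        x = rng.standard_normal(dim)
        base = f(x)
        for j in range(dim):
            x_alt = x.copy()
            x_alt[j] += rng.standard_normal()  # change only coordinate j
            adj[j] |= ~np.isclose(f(x_alt), base)
    return adj

graph = functional_dependency_graph(step, dim=3)
for j, i in zip(*np.nonzero(graph)):
    print(f"X{j + 1} -> X{i + 1}")
```

Self-loops appear here because each process in the hypothetical map depends on its own previous value.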
Directed Information Graphs (DIGs) are another type of graphical models that encode statistical dependencies in dynamical systems [2]. These graphs are defined using an information-theoretic measure, the “conditional directed information” [14, 18]. The relationship between the functional dependencies in a stochastic dynamical system and their corresponding DIG has been studied in [5].
For the sake of completeness, we present the definition of DIG. Consider two random processes and and a set of indices such that , then the conditional directed information from to , given is defined as:
[TABLE]
where is the Radon-Nikodym derivative [19] and denotes the causal conditioning defined as
Definition 2.
[17] A directed information graph (DIG) is a directed graph, , over a set of random processes . Node represents the random process ; there is an arrow from to for if and only if:
[TABLE]
Note that in the definition of DIG, it is assumed that there are no latent processes. Thus as demonstrated in the example below, when a subset of processes is not observable (as in our model), the corresponding DIG may not encode the 1-step causal relationships accurately.
Example 2.
Consider the following joint dynamics:
[TABLE]
where are independent exogenous noises. The corresponding DIG of this system when all processes are observed is , and when is latent, it is . But we know that there is no 1-step functional dependency between and .
Definition 3.
A joint distribution is called positive if there exists a reference measure such that and .
Remark 1.
In addition to requiring no latent processes, DIGs recover the structure correctly when the underlying distribution is positive. This is to avoid degenerate cases that arise with deterministic relationships. For instance, suppose and are two random processes such that for some deterministic function . Then is not positive since the distribution of given is a point mass.
Note that our model in (5) does not satisfy the non-degeneracy assumption. This is because in this model the hidden variables are deterministic functions of the other processes. Yet, as we will show next, the 1-step causal structure between the observed processes is unique and recoverable as long as the marginal distribution of the observed processes is positive.
3 Main Result
Herein, we introduce our approach for learning the 1-step functional dependencies among the observed variables given their realizations. This approach does not require any prior knowledge of the number of latent processes or of the functions and .
Theorem 1.
Consider the dynamical system in (5) and assume that the marginal distribution of the observed variables is positive. Then if and only if:
[TABLE]
Proof.
First, we prove that if then (9) holds. Suppose that does not 1-step functionally depend on . According to (5), the latent vector can be determined recursively as a function of and . We denote this by . Therefore, the entropy of given will be:
[TABLE]
The last equation holds because is a deterministic function of , and is independent of . Furthermore, we have:
[TABLE]
where is a uni-variate function obtained from by determining the values of . But does not change by varying since we assumed . Hence, the above equation is equal to and by comparing with (3), we can deduce that (9) holds.
For the converse, note that and are independent given according to (9). Consequently, we have:
[TABLE]
For any realization of like , the left hand side of the above equation is equal to:
[TABLE]
where . Since the joint distribution of the observed processes is positive, we know that cannot be written as a deterministic function of . Thus, the right-hand side of (12) does not depend on . From this fact and (13), we can conclude that is not a function of for any realization of and thus . ∎
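The proof relies on the fact that, without exogenous noise in the latent part, the latent trajectory is a deterministic function of its initial value and the observed history. The sketch below illustrates this recursion. It assumes the latent update has the form z(t+1) = G(x(t), z(t)); the maps `f_observed` and `g_latent` and all constants are hypothetical choices made only for this illustration.

```python
import numpy as np

def f_observed(x, z):
    """Hypothetical observed-part map; exogenous noise is added separately."""
    return np.array([0.5 * x[0] + np.sin(z[0]), 0.4 * x[1] + 0.3 * x[0]])

def g_latent(x, z):
    """Hypothetical latent map with no exogenous noise."""
    return np.tanh(0.8 * z + 0.5 * x[0])

def reconstruct_latent(x_path, z0, g):
    """Rebuild the latent path from z(0) and the observed path by iterating g."""
    z_path = [np.asarray(z0, dtype=float)]
    for x_t in x_path[:-1]:
        z_path.append(g(x_t, z_path[-1]))
    return np.array(z_path)

rng = np.random.default_rng(0)
n_steps = 200
z0 = np.array([0.3])
x_path = np.zeros((n_steps, 2))
z_true = np.zeros((n_steps, 1))
z_true[0] = z0
for t in range(n_steps - 1):
    # Only the observed part receives exogenous noise.
    x_path[t + 1] = f_observed(x_path[t], z_true[t]) + rng.standard_normal(2)
    z_true[t + 1] = g_latent(x_path[t], z_true[t])

z_rebuilt = reconstruct_latent(x_path, z0, g_latent)
print("latent path recovered from z(0) and observations:",
      np.allclose(z_rebuilt, z_true))
```

If noise were injected into the latent update, the same recursion would no longer reproduce the latent path, which is why the assumption matters for the argument.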
This result can be used to recover the 1-step causal structure of the observed processes in (5) given . To do so, one can estimate the conditional mutual information in (9) for all . If (9) holds, then we declare that there is no 1-step dependency from to . Next, we propose an efficient method to learn the 1-step causal structure of the observed processes in (5) when s and are linear functions.
3.1 The Linear Model
Suppose s and are linear functions. Then the equations in (5) can be rewritten as follows:
[TABLE]
where , , , and denote the coefficient matrices. We also define . The functional dependency of state vector on its history and also , and for can be written as follows:
[TABLE]
where and . Now, suppose that the information-theoretic criterion in (9) is zero. By the same arguments as in the proof of Theorem 1, we can show that the following term is zero:
[TABLE]
for any realization of . Since is positive, we can deduce that . Consequently, learning the 1-step causal structure among the observed processes reduces to determining the support of .
Assume that the support of corresponds to an acyclic directed graph, i.e., there exists an such that . Under this condition, equation (15) can be simplified as:
[TABLE]
The above equation can be interpreted as a VAR model of order . Hence, all matrices can be obtained by multivariate least squares estimation [11]. Moreover, coefficients in the VAR model can be checked for zero constraints by a Wald test [11]. Thus, we can check the information-theoretic criterion merely by performing a Wald test.
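The regression-plus-Wald-test procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `fit_var_with_wald`, the per-coefficient chi-square(1) test at the 95% level, and the demo system (two observed processes and one noise-free latent process with z(t+1) = 0.7 X1(t)) are all assumptions made here. In the demo, the observed processes follow a VAR of order 2 and X1 influences X2 only through the latent process, so the lag-1 coefficient from X1 to X2 should be close to zero while the lag-2 coefficient should be close to 0.6 * 0.7 = 0.42.

```python
import numpy as np

def fit_var_with_wald(x, lag, critical_value=3.84):
    """Least-squares VAR(lag) fit with a Wald test on each coefficient.

    x has shape (T, m) and is assumed zero-mean. Returns (coef, support), both
    of shape (lag, m, m); entry [k, i, j] refers to the effect of x_j(t-1-k)
    on x_i(t). critical_value=3.84 is the 95% quantile of chi-square(1).
    """
    T, m = x.shape
    # Each row of Z stacks the regressors [x(t-1), ..., x(t-lag)].
    Z = np.hstack([x[lag - 1 - k : T - 1 - k] for k in range(lag)])
    Y = x[lag:]
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # shape (lag * m, m)
    resid = Y - Z @ beta
    sigma2 = (resid ** 2).sum(axis=0) / (Z.shape[0] - Z.shape[1])
    var_beta = np.outer(np.diag(np.linalg.inv(Z.T @ Z)), sigma2)
    wald = beta ** 2 / var_beta  # approx. chi-square(1) if the coefficient is 0
    to_lag_form = lambda a: a.T.reshape(m, lag, m).transpose(1, 0, 2)
    return to_lag_form(beta), to_lag_form(wald > critical_value)

# Hypothetical demo system with one latent, noise-free process z.
rng = np.random.default_rng(1)
n_steps = 20000
x_obs = np.zeros((n_steps, 2))
z = 0.0
for t in range(n_steps - 1):
    w = rng.standard_normal(2)
    x_obs[t + 1, 0] = 0.5 * x_obs[t, 0] + w[0]
    x_obs[t + 1, 1] = 0.5 * x_obs[t, 1] + 0.6 * z + w[1]
    z = 0.7 * x_obs[t, 0]  # z(t+1) = 0.7 * X1(t), no exogenous noise

coef, support = fit_var_with_wald(x_obs, lag=2)
print("lag-1 coefficients:\n", coef[0].round(3))
print("lag-2 coefficients:\n", coef[1].round(3))
print("lag-1 support (Wald test):\n", support[0])
```

Because each zero coefficient is tested at the 5% level, occasional false positives are expected; a stricter `critical_value` or a multiple-testing correction may be preferable when many coefficients are tested.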
4 Experimental Results
In this section, we apply the method described in the previous section to the network identification problem in consensus protocols [15]. In control systems, a well-known approach for network identification is based on running a series of “node-knockout” experiments in which variables are sequentially forced to be zero without being removed from the network [15, 20]. The main drawback of this approach is that it requires intervening in the system. Here, we will show that the direct edges between observed nodes can be detected just by analyzing the time series of the observed processes.
Consider the weighted consensus protocol within a system with nodes:
[TABLE]
where represents the state of node at time such as its speed, heading, or position, and the weight denotes the weight on the edge . The first state variables correspond to states of observed nodes and the rest belong to hidden nodes. We are trying to find all directed edges (with nonzero weight) between observed nodes by injecting the white noise into the observed node , i.e. if is an observed node and otherwise. In fact, this problem can be reformulated to the form in (14) such that:
[TABLE]
where , and . Hence, identifying all directed edges with nonzero weight between observed nodes is equivalent to obtaining the support of matrix .
We generated 1000 instances of the linear system with observed nodes and latent nodes. The weight was selected randomly from the set with probability where and if were hidden. Otherwise, the weight was chosen randomly from the set with probability where . Moreover, we set to .
In our simulations, we excluded the generated networks that had cycles in the latent part. Furthermore, the noise process was chosen as i.i.d. with . It can be easily seen that the conditional mutual information in (9) is equal to:
[TABLE]
where and are the variances of and , respectively. Thus, learning 1-step functional dependencies corresponds to finding the support of matrix , denoted by .
To obtain the nonzero entries of matrix , we performed a linear regression between and data where is the lag length. Let be the output of the linear regression for time series of length . According to the Wald test [11], for a large number of samples, we can obtain by setting entry to one if . In Fig. 2, the error is averaged over generated random matrices, where is the Frobenius norm of a matrix. As can be seen, the support of matrix can be recovered perfectly as the lag length increases. This trend is expected since the lag length should be at least equal to the order of the linear model , in order to have perfect recovery. Moreover, as shown in Fig. 2, for a fixed lag length, the average error is higher for larger . This is because the matrices become denser for larger , which leads to a higher average error when the right lag length is not selected.
We also examined our proposed criterion in a nonlinear system with three state variables with the following joint dynamics:
[TABLE]
where and are i.i.d. and normally distributed with zero mean and unit variance. The quantity in (9) can be written as a linear combination of joint entropies. Hence, we can use the k-nearest neighbor method of [21] to obtain an estimate of the desired quantity. To do so, we generated samples of for . For , the numerical results were: , and . From these results, we can infer that is 1-step functionally dependent on , which is consistent with the system model in (24).
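The entropy-based estimation step can be sketched with a generic k-nearest neighbor estimator. Reference [21] is not included in the surviving reference list, so the code below uses a Kozachenko-Leonenko-type entropy estimate with the maximum norm as an assumed stand-in; it may differ from the estimator the authors used. The conditional mutual information is written as I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z). The sample data are hypothetical static draws, not the nonlinear system studied above.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy(samples, k=5):
    """Kozachenko-Leonenko entropy estimate in nats, using the max-norm."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]
    n, d = samples.shape
    # Query k + 1 neighbours because each point is its own nearest neighbour.
    dist, _ = cKDTree(samples).query(samples, k=k + 1, p=np.inf)
    radius = dist[:, -1]
    return digamma(n) - digamma(k) + d * np.mean(np.log(2.0 * radius))

def knn_cmi(x, y, z, k=5):
    """Estimate I(X; Y | Z) as a linear combination of joint entropies."""
    x, y, z = (np.asarray(a, dtype=float).reshape(len(a), -1) for a in (x, y, z))
    return (knn_entropy(np.hstack([x, z]), k)
            + knn_entropy(np.hstack([y, z]), k)
            - knn_entropy(np.hstack([x, y, z]), k)
            - knn_entropy(z, k))

# Hypothetical data: y_dep depends on x given z, y_ind does not.
rng = np.random.default_rng(0)
n = 4000
z = rng.standard_normal(n)
x = z + rng.standard_normal(n)
y_dep = np.tanh(x) + 0.5 * z + 0.2 * rng.standard_normal(n)
y_ind = np.sin(z) + rng.standard_normal(n)

cmi_dep = knn_cmi(x, y_dep, z)
cmi_ind = knn_cmi(x, y_ind, z)
print("estimated I(x; y_dep | z):", round(cmi_dep, 3))
print("estimated I(x; y_ind | z):", round(cmi_ind, 3))
```

In a time-series setting, the conditioning set would hold the remaining observed history, truncated to a finite window; the estimate is never exactly zero on finite samples, so in practice it is compared against a threshold or a permutation baseline.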
5 Conclusion
We proposed an information-theoretic quantity for identifying causal relations among observed variables for a general 1-step stochastic dynamical system in the presence of latent variables when there exists no exogenous noise in the latent part. It would be interesting to see whether, by imposing some additional constraints on the structure of functional dependencies, it would be possible to recover the inter-connections in the latent sub-graph.
- [1] Venkat Chandrasekaran, Pablo A Parrilo, and Alan S Willsky. Latent variable graphical model selection via convex optimization. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1610–1613. IEEE, 2010.
- [2] Alexander G Dimitrov, Aurel A Lazar, and Jonathan D Victor. Information theory in neuroscience. Journal of Computational Neuroscience, 30(1):1–5, 2011.
- [3] Gal Elidan, Iftach Nachman, and Nir Friedman. “Ideal parent” structure learning for continuous variable Bayesian networks. Journal of Machine Learning Research, 8(8), 2007.
- [4] Péter L Erdős, Michael A Steel, László A Székely, and Tandy J Warnow. A few logs suffice to build (almost) all trees: Part II. Theoretical Computer Science, 221(1):77–118, 1999.
- [5] Jalal Etesami and Negar Kiyavash. Measuring causal relationships in dynamical systems through recovery of functional dependencies. IEEE Transactions on Signal and Information Processing over Networks, 2016.
- [6] Jalal Etesami, Negar Kiyavash, and Todd Coleman. Learning minimal latent directed information polytrees. Neural Computation, 2016.
- [7] Philipp Geiger, Kun Zhang, Mingming Gong, Dominik Janzing, and Bernhard Schölkopf. Causal inference by identification of vector autoregressive processes with hidden components. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.
- [8] Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
