Entropic Causality and Greedy Minimum Entropy Coupling
Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath, Babak Hassibi

TL;DR
This paper analyzes a greedy algorithm for approximate minimum entropy coupling, a key step in entropic causality, providing guarantees on local optimality and approximation error despite the problem's NP-hardness.
Contribution
It offers a theoretical analysis of a greedy algorithm for minimum entropy coupling, establishing local optimality and approximation bounds.
Findings
The algorithm always finds a local minimum.
It is within an additive approximation error of the global minimum.
Abstract
We study the problem of identifying the causal relationship between two discrete random variables from observational data. We recently proposed a novel framework called entropic causality that works in a very general functional model but makes the assumption that the unobserved exogenous variable has small entropy in the true causal direction. This framework requires the solution of a minimum entropy coupling problem: Given marginal distributions of m discrete random variables, each on n states, find the joint distribution with minimum entropy, that respects the given marginals. This corresponds to minimizing a concave function of nm variables over a convex polytope defined by nm linear constraints, called a transportation polytope. Unfortunately, it was recently shown that this minimum entropy coupling problem is NP-hard, even for 2 variables with n states. Even representing points…
Murat Kocaoglu
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
Alexandros G. Dimakis
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
Sriram Vishwanath
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
Babak Hassibi
Department of Electrical Engineering, California Institute of Technology, USA
Abstract
We study the problem of identifying the causal relationship between two discrete random variables from observational data. We recently proposed a novel framework called entropic causality that works in a very general functional model but makes the assumption that the unobserved exogenous variable has small entropy in the true causal direction.
This framework requires the solution of a minimum entropy coupling problem: Given marginal distributions of $m$ discrete random variables, each on $n$ states, find the joint distribution with minimum entropy that respects the given marginals. This corresponds to minimizing a concave function of $n^m$ variables over a convex polytope defined by $nm$ linear constraints, called a transportation polytope. Unfortunately, it was recently shown that this minimum entropy coupling problem is NP-hard, even for 2 variables with $n$ states. Even representing points (joint distributions) over this space can require exponential complexity (in $m$) if done naively.
In our recent work we introduced an efficient greedy algorithm to find an approximate solution for this problem. In this paper we analyze this algorithm and establish two results: that our algorithm always finds a local minimum and also is within an additive approximation error from the unknown global optimum.
1 Introduction
Causality is of interest to statisticians, philosophers, engineers and medical scientists **[1, 7, 18]**. Understanding the causal relations between observable parameters is important in analyzing the workings of a system, as well as predicting how it will behave after a policy change. Causality has been studied under several frameworks including potential outcomes **[19]** and structural equation modeling **[15]**. In this paper we rely on structural equation models and data-driven causality using information theory.
The use of information theoretic tools for causal discovery has recently gained increasing attention through various approaches. For example, Janzing et al. **[9]** propose an information geometry approach that relies on a cause and mechanism independence assumption. Another line of work focuses on time-series data and uses Granger causality and directed information **[6, 5, 17, 12]**. In this paper we also use information measures but rely on a different framework that we recently proposed **[11]**.
Our framework, called entropic causality **[11]**, is data-driven, i.e., it can estimate causal directions between two discrete random variables without interventions. Our approach uses Rényi entropy as a complexity measure and considers the simpler model more likely to be the true causal direction. In **[11]** we showed that finding the simplest causal model that explains an observed joint distribution requires solving a minimum entropy coupling problem: Given marginal distributions of $m$ discrete random variables, each on $n$ states, find the joint distribution with minimum entropy that respects the given marginals. This corresponds to minimizing a concave function of $n^m$ variables over a convex polytope defined by $nm$ linear constraints, called a transportation polytope **[3]**.
The minimum entropy coupling problem between two variables was shown to be NP-hard in **[13]**. In **[11]**, we proposed a greedy algorithm for the minimum entropy coupling problem and showed that for two variables, it always finds a local optimum. The proof used a characterization of the KKT conditions of the corresponding optimization problem and a characterization of the algorithm output when there are two variables. However, this characterization cannot be used when there are more variables.
In this work, we extend the result in **[11]**: We develop a new characterization of the algorithm output for any number of variables. This characterization allows us to conclude that the algorithm output satisfies the KKT conditions irrespective of the number of variables, which implies that the algorithm returns a local optimum. Moreover, we show an additive approximation guarantee with respect to the global optimum.
In Section 2, we provide a very short overview of the causal inference literature. In Section 3, we summarize the results of **[11]** and explain how minimum entropy coupling arises in the entropic causal inference framework. In Section 4, we identify the conditions necessary for a solution to be a local optimum and show that our algorithm’s output always satisfies these conditions by deriving a new characterization. In Section 5, we develop our approximation guarantee for a variant of this algorithm, which is easier to analyze.
2 Related Work
Causal relationships between random variables can be represented by causal directed graphical models **[15, 22]**. Pearl’s framework led to a complete graph theoretic characterization of which parts of a causal graph are learnable using statistical tests. Efficient algorithms were developed for this learning task by Spirtes et al. **[22]**. Unfortunately, a general causal graph cannot be uniquely identified from data samples.
A complete solution to the causal graph identification problem requires experiments, also called interventions. An intervention forces the value of a variable without affecting the other system variables. This removes the effect of its causes, effectively creating a new causal graph. These changes in the causal graph create a post-interventional distribution among variables, which can be used to learn additional causal relations in the original graph. The procedure can be applied repeatedly to fully identify any causal graph **[20]**. There has been significant recent progress on how to perform experiments efficiently **[4, 20]**, even under constraints **[10]**. Unfortunately, in many cases it is very difficult (or even impossible) to perform experiments and we are only given a static dataset.
When performing experiments is not an option, we need additional assumptions on the data generating process to identify the causal relations between the variables. The most widely employed assumption is the additive noise assumption, which asserts that the unobserved variables affect the observable variables additively. Under this assumption, the authors of **[8]** showed that, except for a measure zero parameter set, one can identify the true causal direction between two variables, as long as the relation is non-linear. A similar result is known when the noise is non-Gaussian, irrespective of the relation between the variables **[21]**. These approaches inherently assume continuous variables and additive noise. Other works consider discrete variables with additive noise **[16]**, or continuous variables without the additive noise assumption **[14]**.
Another approach is to exploit the postulate that the cause and mechanism are in general independently assigned by nature. The notion of independence here is captured by assigning maps, or conditional distributions, to random variables in order to argue about the independence of cause and mechanism. In this direction, an information-geometry-based approach was suggested in **[9]**. Independence of cause and mechanism is captured by treating the log-slope of the function as a random variable, and assuming that it is independent from the cause. In the case of a deterministic relation, there are theoretical guarantees on identifiability. However, this assumption is restrictive for real data.
In **[11]**, we introduced the entropic causality framework. Our framework does not assume additive noise and uses probability distributions as opposed to variable values. Thus, it can naturally handle both categorical and ordinal variables. The central postulate is that in the true direction, the Rényi entropy of the exogenous variable is small. The central theoretical result of **[11]** is identifiability for zero order Rényi entropy (i.e., support of distribution): If the cardinality of the exogenous variable is small in the true direction, then, under mild assumptions, there does not exist any causal model where the cardinality of the exogenous variable in the reverse direction is also small. We conjecture that a similar identifiability result is true for Rényi entropy of order 1, i.e., Shannon entropy, and numerical simulations seem to support it. Furthermore, we showed that the corresponding causality test can match or outperform the previous state of the art in causal identification benchmarks on real and synthetic datasets **[11]**.
In very recent parallel work, Cicalese et al. **[2]** proposed a more involved greedy algorithm for the minimum entropy coupling problem and showed a very strong 1-bit approximation guarantee for it. The proposed algorithm only applies to two variables. Two-variable algorithms for minimum entropy coupling can only be used for entropic causality if one of the two variables takes only two values. Therefore, it would be very interesting if it could be extended to multiple variables, especially if similarly strong approximation guarantees hold.
3 Background
3.1 Notation
We use uppercase letters () for random variables, lowercase letters for their realizations and constants (), lowercase bold letters for column vectors (), uppercase bold letters for matrices and tensors . We represent the set by , whereas indicates the continuous interval from to as usual. Vectors and sets with indices are simply represented through subscripts as follows: represents the column vector and represents the set . means the random variable is distributed with the probability mass function , i.e., . stands for the statistical independence between random variables. The Shannon entropy naturally extends to matrices (and tensors) as , where stands for the logarithm base 2.
3.2 Causal Model
In this section, we introduce Pearl’s causal model for two variables and no unobserved common causes. Causal models are powerful because they can answer hypothetical questions involving experiments. An experiment, called an intervention in this context, means forcing a set of random variables to take certain values. This operation is captured by the do(.) operator of Pearl **[15]**. Thus, by definition, the causal model captures the knowledge of what will happen after performing any intervention on the observed variables. Consider two variables . Suppose causes . The following are what this causal model entails: (i) There exists an exogenous (unobserved) random variable and a map such that . Let . (ii) An intervention changes the data generating model and yields . Thus, an intervention on does not change the distribution of , but fixes the value of . Hence the distribution of is affected through these changes. However, an intervention on has a different effect. (iii) changes the model as follows: . The important thing to notice here is that intervening on makes it independent from , whereas intervening on does not make it independent from .¹
¹ Technically, to talk about statistical independence, we need stochastic interventions: Consider which forces to take the same values as an independent random variable .
The fact that a causal model can answer interventional queries is what makes it so powerful, but also hard to learn from data. In general, given a joint distribution over one can find functions where and . This makes the problem of learning the causal relation between and unidentifiable in general. The objective of data driven causal inference is to identify the assumptions on either the function or the variable , under which the causal model can be learned.
3.3 The Entropic Causal Inference Framework
Entropic causal inference **[11]** uses the number of random bits as a complexity measure and chooses the simpler model as the true causal model. Suppose we observe the joint distribution of two variables each with states. Consider the problem of identifying the exogenous variable with minimum Shannon entropy such that there is a causal model where causes , that yields this joint distribution. In **[11]**, we established that this problem is equivalent to the minimum entropy coupling problem between variables each with states.
Consider the variables with . Suppose causes . Then , where is an exogenous variable of cardinality for some independent from , and is some map . Let be a random variable that has the same distribution as the distribution of conditioned on : . We have the following lemma:
**Lemma 1 [11].**
Let be two variables with . Consider any causal model . Then , where is the minimum joint entropy of variables subject to the constraint that each has the same marginal distribution as the conditional distribution of given .
Moreover, there is an with .
Proof.
See the proof of Theorem 3 in the appendix of [11]. ∎
Lemma 1 puts the minimum entropy coupling problem at the center of the entropic causal inference framework. If we could solve the minimum entropy coupling problem, we could identify the exogenous variable with minimum entropy. If the identifiability result holds (Conjecture 1 in **[11]**), will be greater than if entropy of is sufficiently small. Hence, closely approximating the minimum entropy coupling is essential for an effective causal inference algorithm using the entropic causal inference framework.
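To make the input of the coupling problem in Lemma 1 concrete, the following is a minimal sketch of how the conditional distributions that must be coupled could be read off an observed joint distribution. The function name `conditionals`, the example matrix, and the assumption that every value of the assumed cause has positive probability are illustrative choices, not part of the paper.

```python
import numpy as np

def conditionals(joint, cause_axis):
    """Return one row per cause value: the distribution of the effect given that value.

    Assumes every cause value has strictly positive probability.
    """
    joint = np.asarray(joint, dtype=float)
    # Put the assumed cause on axis 0 so each row conditions on one cause value.
    table = joint if cause_axis == 0 else joint.T
    cause_marginal = table.sum(axis=1, keepdims=True)
    return table / cause_marginal

# Hypothetical 2x2 joint distribution, rows indexed by the first variable.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

forward = conditionals(joint, cause_axis=0)   # marginals to couple if the first variable is the cause
backward = conditionals(joint, cause_axis=1)  # marginals to couple if the second variable is the cause
print(forward)
print(backward)
```

Each row of `forward` (or `backward`) plays the role of one marginal in the minimum entropy coupling problem for the corresponding candidate direction.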
3.4 Greedy Minimum Entropy Coupling Algorithm
Different from **[11]**, we provide the version of the greedy minimum entropy coupling algorithm that constructs the joint distribution tensor, rather than only the non-zero probability values, which is more instructive for this paper. The greedy algorithm is given in Algorithm 1. The marginal distribution of variable is shown by the column vector . Note that in practice, one would only store the non-zero probability values output by the algorithm, rather than creating the extremely sparse tensor with $n^m$ entries.
At each iteration, the algorithm finds the largest probability mass in each marginal and assigns the minimum of these to the corresponding coordinate in the joint probability tensor. The motivation is that large chunks of probability mass are not split into smaller chunks, so each assignment contributes as little as possible to the total entropy. The algorithm satisfies at least one marginal constraint at each step, and of them in the last step. Thus it terminates in at most steps.
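The iteration described above can be sketched in a few lines. This is an illustrative reading of the prose description, not a transcription of Algorithm 1: the function names, the tolerance `tol`, and the sparse dictionary output are assumptions made for this example.

```python
import math

def greedy_coupling(marginals, tol=1e-12):
    """Greedy coupling sketch: returns {state tuple: mass} holding only non-zero entries."""
    # Work on copies so the caller's marginals are left untouched.
    rem = [list(p) for p in marginals]
    joint = {}
    while True:
        # Index of the largest remaining mass in each marginal.
        idx = tuple(max(range(len(p)), key=p.__getitem__) for p in rem)
        # The minimum of these maxima is assigned to the corresponding joint cell.
        r = min(p[i] for p, i in zip(rem, idx))
        if r <= tol:  # nothing (beyond rounding noise) is left to assign
            break
        joint[idx] = joint.get(idx, 0.0) + r
        # Subtract the assigned mass from the selected coordinate of every marginal.
        for p, i in zip(rem, idx):
            p[i] -= r
    return joint

def entropy_bits(masses):
    """Shannon entropy in bits of a collection of probability masses."""
    return -sum(v * math.log2(v) for v in masses if v > 0)

p1 = [0.5, 0.3, 0.2]
p2 = [0.6, 0.4]
joint = greedy_coupling([p1, p2])
print(joint)
print(entropy_bits(joint.values()), entropy_bits(p1) + entropy_bits(p2))
```

On this small input the greedy joint has strictly lower entropy than the independent coupling, whose entropy is the sum of the two marginal entropies.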
4 Greedy Algorithm Gives Local Optimum
In this section, we present our main theorem and show that the greedy algorithm always finds a local optimum. We consider variables each with states. The extension of the analysis to variables each with states is trivial. Let us first formalize the entropy minimization problem:
**Definition 1 (Minimum Entropy Coupling).**
Let be discrete random variables with states, with marginal distributions . The minimum entropy coupling problem is to find the joint distribution with minimum entropy that is consistent with the given marginals:
[TABLE]
We can equivalently write down this optimization problem by representing the joint probability value for each configuration as a different variable. This representation has variables and constraints ( marginals and points for each marginal). Let be a variable for every -tuple . Notice that the index for the th dimension, i.e., , captures the realization of variable . Then the optimization problem can be written as follows:
[TABLE]
In (4), we dropped the constraint . Total sum is equivalent to first marginalizing out dimensions to , and then marginalizing out dimension . If marginalizing out the first dimensions gives , which is already captured as a separate equality constraint, summing across this dimension gives 1 since sums to 1.
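As a small numerical illustration of the feasible set and the objective, the sketch below checks two couplings of the same pair of marginals, confirms that the total-sum constraint follows from the marginal constraints, and shows the concavity of entropy along the segment between them. The helper names and the example marginals are assumptions for this illustration only.

```python
import numpy as np

def entropy_bits(t):
    """Shannon entropy in bits of a non-negative array (zero entries contribute nothing)."""
    t = np.asarray(t, dtype=float)
    nz = t[t > 0]
    return float(-(nz * np.log2(nz)).sum())

def marginal(joint, axis):
    """Marginalize out every dimension except `axis`."""
    other = tuple(a for a in range(joint.ndim) if a != axis)
    return joint.sum(axis=other)

p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])

independent = np.outer(p, q)  # feasible coupling with entropy 2 bits
diagonal = np.diag(p)         # feasible coupling with entropy 1 bit
midpoint = 0.5 * (independent + diagonal)  # still feasible, since the polytope is convex

for name, joint in [("independent", independent), ("diagonal", diagonal), ("midpoint", midpoint)]:
    ok = np.allclose(marginal(joint, 0), p) and np.allclose(marginal(joint, 1), q)
    print(name, ok, joint.sum(), entropy_bits(joint))
```

The midpoint has entropy above the average of the two endpoint entropies, which is what concavity of the objective predicts and why minima are attained at the boundary of the polytope.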
In this section, we show the following theorem:
**Theorem 1.**
Algorithm 1 finds a local optimum point of the optimization problem in (4).
4.1 KKT Conditions
First, we characterize the points that satisfy the KKT conditions. We have the following lemma:
**Lemma 2.**
Consider the optimization problem in (4). Let be a point that satisfies the KKT conditions. Then there are vectors each of length such that either , or
[TABLE]
Proof.
Consider the following general optimization problem:
[TABLE]
Lagrangian becomes
[TABLE]
which gives the KKT conditions
[TABLE]
This implies, for fixed , either or . Matching the constraints in (4) to the functions in (4), we identify and as follows:
[TABLE]
The Lagrangian of (4) can be written as follows:
[TABLE]
for the dual parameters and . The gradient being zero gives us the following:
[TABLE]
The conditions above imply the following for the optimal point : Either or if it satisfies
[TABLE]
Thus, for vectors of length , we have ∎
By Lemma 2, the optimal point satisfies the following: Each nonzero joint probability can be written as a product of the corresponding entries of vectors of length . Inspired by the definition of independence, we will term such joint distributions as quasi-independent:
**Definition 2.**
A joint distribution for is called quasi-independent, if there are vectors such that either or .
4.2 Characterization of Greedy Algorithm Output
Consider Algorithm 1. It selects the minimum of maximum probability values across each marginal at each step, subtracts this probability mass from the corresponding coordinates in each marginal and iterates. Next, we show that one can always construct vectors that satisfy , where is the probability mass assigned to point by the algorithm.
Let the algorithm select a probability mass for the point at iteration . . Let after this assignment. Define the column vector . are length-$n$ vectors to be decided. We will show that, given the assignments made by the algorithm, one can always construct a such that (3) holds.
Observe that each iteration of the algorithm corresponds to a linear equation in . Note that has length and at iteration , should satisfy the constraint , where is the indicator vector that is 1 in the columns from and zero otherwise: If , then , where . We know that the algorithm terminates in at most steps. Thus, we have linear equations and variables. This corresponds to a system of linear equations , where and is a column vector.
We have the following key observation: At each iteration step, the algorithm satisfies at least one of the marginal constraints, since it chooses the minimum of maximum probabilities. Thus, if at iteration the algorithm selects the set of the coordinates , then for some the algorithm never selects the coordinate again, since the corresponding marginal constraint is already satisfied. In terms of the matrix , this translates to the following statement: Every row of contains a column where . Thus, every row of has a column where that row contains the last 1 in that column. We have the following lemma:
**Lemma 3.**
Let be a matrix where no row is identically zero. If for every row , of all the columns with value 1, there exists a column such that , then the rows of are linearly independent.
Proof.
Assume otherwise. Then there exists a set of rows and coefficients such that . Let . By definition, row of has a column with . Thus, this column cannot be made 0 using a linear combination of rows with a larger index, which contradicts . ∎
By Lemma 3, the rows of are linearly independent. This is also true for the augmented matrix of the system . Hence, the assignments are consistent and there is at least one solution to the linear system .
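The argument above can be checked numerically on a small instance. The sketch below records the coordinates and masses chosen by a greedy run, builds one indicator row per iteration, verifies that the rows are linearly independent, and solves the linear system in the log domain to recover vectors whose entry-wise products reproduce the assigned masses. The helper `greedy_steps`, the example marginals, the tolerance, and the use of least squares to pick one solution are assumptions for this illustration.

```python
import numpy as np

def greedy_steps(marginals, tol=1e-12):
    """Run the greedy selection and return a list of (index tuple, assigned mass)."""
    rem = [np.array(p, dtype=float) for p in marginals]
    steps = []
    while True:
        idx = tuple(int(np.argmax(p)) for p in rem)
        r = min(p[i] for p, i in zip(rem, idx))
        if r <= tol:
            break
        steps.append((idx, r))
        for p, i in zip(rem, idx):
            p[i] -= r
    return steps

marginals = [[0.5, 0.3, 0.2], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1]]
steps = greedy_steps(marginals)
m, n = len(marginals), len(marginals[0])

# One row per iteration: a 1 in the column of each selected (variable, state) pair.
A = np.zeros((len(steps), m * n))
b = np.zeros(len(steps))
for t, (idx, r) in enumerate(steps):
    for var, state in enumerate(idx):
        A[t, var * n + state] = 1.0
    b[t] = np.log(r)  # log of the assigned mass = sum of the logs of the vector entries

rank = np.linalg.matrix_rank(A)
z, *_ = np.linalg.lstsq(A, b, rcond=None)  # any exact solution would do; this picks one
a = np.exp(z).reshape(m, n)                # row `var` is the vector for variable `var`

recon = [np.prod([a[var, s] for var, s in enumerate(idx)]) for idx, _ in steps]
print(rank, len(steps))
print(recon, [r for _, r in steps])
```

Because the rows are independent, the system is consistent and the reconstructed products match the greedy masses up to rounding, which is the quasi-independence property used in the proof of Theorem 1.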
Proof of Theorem 1.
Consider the joint distribution output by the greedy algorithm. From the above discussion, the assignments to the joint distribution by the greedy entropy minimization algorithm can always be used to create vectors, such that the points where the joint is non-zero can be written as the product of the corresponding coordinates of these vectors. Thus, the greedy algorithm outputs a point which is quasi-independent, and satisfies the KKT conditions of the minimum entropy coupling problem. Hence, this is a stationary point. Since entropy is a concave function, there are no saddle points. Thus, the greedy algorithm outputs a local optimum. ∎
5 Approximation Guarantee
In this section, we analyze a variant of the greedy algorithm, Algorithm 2, for which an approximation guarantee is easier to develop.
Different from Algorithm 1, Algorithm 2 looks at each value of every given marginal exactly once during Phase I. This allows us to relate the entropy contribution of Phase I to a lower bound to the optimum entropy.
Consider two random variables . We use to represent the marginal distributions of and after sorting their probabilities in decreasing order. We can extend the entropy function to operate on vectors which do not necessarily sum to 1. To make the distinction from entropy, we use $h$ for this operator.²
² $h$ is often used for the differential entropy operator. Since we do not use differential entropy in this paper, we believe this is not a source of confusion.
**Theorem 2.**
Let be two discrete random variables with states and , be their marginal distribution vectors sorted in decreasing order. Let . Let be the joint distribution output by the greedy algorithm, and the minimum joint entropy of all joints that respect the marginals. Then
[TABLE]
where for , and is the total variation distance between the sorted marginals of and .
Proof.
Define . In Phase I, algorithm chooses for . Consider
[TABLE]
is the entropy of the distribution which is obtained by splitting into and . Since each probability value is divided into at most 2 probability values,
[TABLE]
Similarly, we can write
[TABLE]
Then in Phase I, algorithm creates an entropy contribution . Based on the definitions of
[TABLE]
Let . Combining with (9) and (5), we get
[TABLE]
To bound the contribution of the second phase, we use an "independence" bound. The following lemma is useful:
**Lemma 4.**
Consider the vectors where and . Let . Let for be a matrix with row sum equal to and column sum equal to , i.e., and . Then .
Moreover, when is the outer product of and , the equality holds.
Proof.
Define the random variables and as the variables with marginal distributions and , respectively. Let be the joint distribution matrix for that respects the marginals and . Since , we have
[TABLE]
Define where . Notice that row sum of is and column sum of is . Then we have,
[TABLE]
Suppose . Then we have,
[TABLE]
∎
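The "independence" bound can be illustrated numerically. The sketch below takes two non-negative vectors with equal total mass, forms the scaled outer product (which has the required row and column sums), and compares its extended entropy with that of another matrix having the same row and column sums. The function `h`, the example vectors, and the comparison matrix `other` are assumptions chosen for this illustration.

```python
import numpy as np

def h(v):
    """Entropy-like operator in bits for non-negative arrays that need not sum to 1."""
    v = np.asarray(v, dtype=float).ravel()
    nz = v[v > 0]
    return float(-(nz * np.log2(nz)).sum())

u = np.array([0.10, 0.05, 0.05])  # row sums, total mass 0.2
v = np.array([0.12, 0.08])        # column sums, total mass 0.2
s = u.sum()

outer = np.outer(u, v) / s        # scaled outer product: row sums u, column sums v
other = np.array([[0.10, 0.00],   # another matrix with the same row and column sums
                  [0.02, 0.03],
                  [0.00, 0.05]])

print(outer.sum(axis=1), outer.sum(axis=0))
print(h(outer), h(other))
```

The scaled outer product attains the larger value, in line with the equality case stated in the lemma.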
Following Lemma 4, the maximum contribution of the second phase to the entropy is obtained when we place the scaled outer product of the remaining probability values on the joint probability matrix. The remaining probabilities after Phase I are and for and . The remaining probability mass is the total variation distance, i.e., . Thus, in Phase II, and contributes the entropy of . Finally, we can write
[TABLE]
(12) is obtained by selecting if and if , and through the bound . ∎
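The two quantities used in this proof can be checked on a toy example. The sketch below assumes that Phase I assigns the element-wise minimum of the two sorted marginals (the listing of Algorithm 2 is not reproduced here, so this form is an assumption), and it fills Phase II with the scaled outer product of the leftovers, which is the worst case used in the bound and not necessarily what Algorithm 2 itself does. All variable names are illustrative.

```python
import numpy as np

# Two hypothetical marginals, sorted in decreasing order.
p = np.sort(np.array([0.5, 0.3, 0.2]))[::-1]
q = np.sort(np.array([0.4, 0.4, 0.2]))[::-1]

r = np.minimum(p, q)   # assumed Phase I masses, placed on the diagonal
u, v = p - r, q - r    # leftover masses of each marginal after Phase I
tv = 0.5 * np.abs(p - q).sum()  # total variation distance between the sorted marginals

joint = np.diag(r)
if u.sum() > 0:
    # Worst-case Phase II filling: scaled outer product of the leftovers.
    joint = joint + np.outer(u, v) / u.sum()

print(u.sum(), v.sum(), tv)
print(joint.sum(axis=1), joint.sum(axis=0))
```

The leftover mass of each marginal equals the total variation distance between the sorted marginals, and the resulting matrix respects both marginals.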
Consider the bound given in Theorem 2. is a constant less than 1. However, the term can scale with depending on the difference between the sorted marginals. In Section 5.1, we give an example where . Interestingly, for the same example we can show that the greedy algorithm output is at most 1 bit away from the global optimum. Thus, it may be possible to identify a tighter bound.
We can extend the analysis to the case of variables instead of only 2. We then have the following theorem:
**Theorem 3.**
Let be random variables each with states and be their marginal distribution vectors sorted in decreasing order. Let . Let be the joint distribution output by Algorithm 2 and the global optimum. Then
[TABLE]
where for , and .
Proof.
Define . In Phase I, the algorithm chooses for . Consider for all
[TABLE]
is the entropy of the distribution which is obtained by splitting into and . Since each probability value is divided into at most 2 probability values,
[TABLE]
In Phase I, algorithm creates an entropy contribution . Define for all . Then we have
[TABLE]
Let and . Combining with (15), we get
[TABLE]
To bound the contribution of the second phase, we use an "independence" bound similar to the one in the proof of Theorem 2. We need the following lemma:
**Lemma 5.**
Consider the vectors where and . Let . Let be a tensor that satisfies the following: . Then .
Moreover, when is the outer product of ,for all , the equality holds.
Proof.
Define the random variables as the variables with marginal distributions for all . Let be the joint distribution tensor for that respects the marginals and . Since , we have
[TABLE]
Define where . Notice that with this scaling, marginalizing out every dimension in except for dimension gives vector. Then we have,
[TABLE]
Suppose . Then we have,
[TABLE]
∎
Following Lemma 5, the maximum contribution of the second phase to the entropy is obtained when we place the scaled outer product of the remaining probability values on the joint probability tensor. The remaining probabilities after Phase I are for for all . The remaining probability mass is . Thus, in Phase II, contributes the entropy of . Finally, we can write
[TABLE]
(18) is obtained by selecting for , and through the bound
[TABLE]
∎
5.1 A special family of distributions
Let be a uniformly distributed random variable over states, i.e., . Let have the distribution with the following: and , where . One can check that sums to 1 with this parameterization. We can calculate the entropies of and which yields . Running Algorithm 2 on and , we have the following:
[TABLE]
where in (20) we used the reparameterization for . Since , the algorithm outputs a joint distribution with entropy at most 1 bit away from the optimum. However, we have . This yields a gap of at least . In light of this example, we believe that a tighter guarantee should be provable for the given algorithm.
- [1] Krzysztof Chalupka, Tobias Bischoff, Pietro Perona, and Frederick Eberhardt. Unsupervised discovery of El Niño using causal feature learning on microlevel climate data. In Proc. of UAI’16, 2016.
- [2] Ferdinando Cicalese, Luisa Gargano, and Ugo Vaccaro. How to find a joint probability distribution of minimum entropy (almost), given the marginals. arXiv pre-print, 2017.
- [3] Jesús De Loera and Edward D. Kim. Combinatorics and geometry of transportation polytopes: An update. arXiv pre-print, 2013.
- [4] Frederick Eberhardt. Causation and Intervention (Ph.D. Thesis), 2007.
- [5] Jalal Etesami and Negar Kiyavash. Discovering influence structure. In IEEE ISIT, 2016.
- [6] Clive W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
- [7] Moritz Grosse-Wentrup, Dominik Janzing, Markus Siegel, and Bernhard Schölkopf. Identification of causal relations in neuroimaging data with latent confounders: An instrumental variable approach. NeuroImage (Elsevier), 125:825–833, 2016.
- [8] Patrik O. Hoyer, Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Proc. of NIPS 2008, 2008.
