Entropic Causality and Greedy Minimum Entropy Coupling

Murat Kocaoglu; Alexandros G. Dimakis; Sriram Vishwanath; Babak Hassibi

arXiv:1701.08254·cs.IT·January 31, 2017


Open Access

TL;DR

This paper analyzes a greedy algorithm for approximate minimum entropy coupling, a key step in entropic causality, providing guarantees on local optimality and approximation error despite the problem's NP-hardness.

Contribution

It offers a theoretical analysis of a greedy algorithm for minimum entropy coupling, establishing local optimality and approximation bounds.

Findings

01

The algorithm always finds a local minimum.

02

It is within an additive approximation error of the global minimum.

Abstract

We study the problem of identifying the causal relationship between two discrete random variables from observational data. We recently proposed a novel framework called entropic causality that works in a very general functional model but makes the assumption that the unobserved exogenous variable has small entropy in the true causal direction. This framework requires the solution of a minimum entropy coupling problem: Given marginal distributions of m discrete random variables, each on n states, find the joint distribution with minimum entropy that respects the given marginals. This corresponds to minimizing a concave function of n^m variables over a convex polytope defined by nm linear constraints, called a transportation polytope. Unfortunately, it was recently shown that this minimum entropy coupling problem is NP-hard, even for 2 variables with n states. Even representing points…

Equations

$$\min_{p(U_{1},U_{2},\ldots,U_{n})}\ H(U_{1},U_{2},\ldots,U_{n})\quad\text{s.t.}\quad\sum_{j\neq i}\sum_{u_{j}\in[n]}p(u_{1},u_{2},\ldots,u_{n})=p_{i}(u_{i}),\quad\forall i,u_{i}.$$

$$\begin{aligned}\min_{x}\quad&\sum_{i_{j}\in[n],\,\forall j\in[n]}-x(i_{1},i_{2},\ldots,i_{n})\log x(i_{1},i_{2},\ldots,i_{n})\\ \text{s.t.}\quad&\sum_{i_{k}\in[n],\,\forall k\neq l,\;i_{l}=j}x(i_{1},i_{2},\ldots,i_{n})=p_{l}(j),\quad\forall j,l\in[n]\\ &x(i_{1},i_{2},\ldots,i_{n})\geq 0,\quad\forall(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}\end{aligned}$$

$$\log x^{*}(i_{1},i_{2},\ldots,i_{n})+1=\sum_{k\in[n]}u_{k}(i_{k}).$$

$$\begin{aligned}\min_{x}\quad&f_{0}(x)\\ \text{s.t.}\quad&h_{i}(x)=0,\quad i\in[p]\\ &f_{i}(x)\leq 0,\quad i\in[m]\end{aligned}$$

$$L(x,\lambda,\nu)=f_{0}(x)+\sum_{i=1}^{m}\lambda_{i}f_{i}(x)+\sum_{i=1}^{p}\nu_{i}h_{i}(x),$$

$$\begin{aligned}f_{i}(x^{*})&\leq 0,\quad i\in[m]\\ h_{i}(x^{*})&=0,\quad i\in[p]\\ \lambda_{i}^{*}&\geq 0,\quad i\in[m]\\ \lambda_{i}^{*}f_{i}(x^{*})&=0,\quad i\in[m]\\ \nabla L(x^{*},\lambda^{*},\nu^{*})&=0\end{aligned}$$

$$\begin{aligned}f(i_{1},i_{2},\ldots,i_{n})&=-x(i_{1},i_{2},\ldots,i_{n}),\quad\forall(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}\\ h_{l,j}&=\sum_{i_{k}\in[n],\,\forall k\neq l,\;i_{l}=j}x(i_{1},i_{2},\ldots,i_{n})-p_{l}(j),\quad\forall l\in[n],j\in[n]\end{aligned}$$

$$\begin{aligned}L(x,\lambda,\nu)=&-\sum_{(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}}x(i_{1},i_{2},\ldots,i_{n})\log x(i_{1},i_{2},\ldots,i_{n})\\ &-\sum_{(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}}\lambda(i_{1},i_{2},\ldots,i_{n})\,x(i_{1},i_{2},\ldots,i_{n})\\ &+\sum_{j\in[n],\,l\in[n]}\nu_{l,j}\Big(\sum_{i_{k}\in[n],\,\forall k\neq l,\;i_{l}=j}x(i_{1},i_{2},\ldots,i_{n})-p_{l}(j)\Big),\end{aligned}$$

$$\frac{\partial L}{\partial x(i_{1},i_{2},\ldots,i_{n})}=-\log x(i_{1},i_{2},\ldots,i_{n})-1-\lambda(i_{1},i_{2},\ldots,i_{n})+\sum_{l\in[n]}\nu_{l,i_{l}}=0$$

$$x^{*}(i_{1},i_{2},\ldots,i_{n})=2^{-1+\sum_{k\in[n]}\nu_{k,i_{k}}}.$$

$$H(U)\leq H^{*}(X_{1},X_{2})+1-T\log(1/T)+\min\{h(l_{1}),h(l_{2})\},$$

$$H_{a}=H\big(p_{m}(1),\,p_{1}(1)-p_{m}(1),\,\ldots,\,p_{m}(n),\,p_{1}(n)-p_{m}(n)\big).$$

$$H_{a}\leq H(X_{1})+1.$$

$$H_{b}=H\big(p_{m}(1),\,p_{2}(1)-p_{m}(1),\,\ldots,\,p_{m}(n),\,p_{2}(n)-p_{m}(n)\big)\leq H(X_{2})+1.$$

$$H_{a}=H_{Ph1}+h(l_{1}),\quad H_{b}=H_{Ph1}+h(l_{2}).$$

$$H_{Ph1}+\alpha h(l_{1})+(1-\alpha)h(l_{2})\leq\alpha H(X_{1})+(1-\alpha)H(X_{2})+1.$$

$$\begin{aligned}-\sum_{i,j\in[n]}s_{i,j}\log(s_{i,j})&\leq-\sum_{i}\frac{p_{i}}{T}\log\Big(\frac{p_{i}}{T}\Big)-\sum_{j}\frac{q_{j}}{T}\log\Big(\frac{q_{j}}{T}\Big)\\ &=\frac{1}{T}\Big(-\sum_{i}p_{i}\log(p_{i})+\sum_{i}p_{i}\log(T)-\sum_{j}q_{j}\log(q_{j})+\sum_{j}q_{j}\log(T)\Big)\\ &=\frac{1}{T}\Big(h(\mathbf{p})+h(\mathbf{q})+2T\log(T)\Big)\end{aligned}$$

$$\begin{aligned}h(\mathbf{R})&=T\Big(-\sum_{i,j}s_{i,j}\log(s_{i,j})\Big)-T\log(T)\\ &\leq T\left(\frac{1}{T}\Big(h(\mathbf{p})+h(\mathbf{q})+2T\log(T)\Big)\right)-T\log(T)\\ &=h(\mathbf{p})+h(\mathbf{q})+T\log(T)\end{aligned}$$

$$h(\mathbf{R})=\frac{1}{T}\Big(-\sum_{i,j}p_{i}q_{j}\log(p_{i})-\sum_{i,j}p_{i}q_{j}\log(q_{j})+T^{2}\log(T)\Big)$$


Taxonomy

Topics: Bayesian Modeling and Causal Inference · Statistical Methods and Inference · Markov Chains and Monte Carlo Methods

Full text

Entropic Causality and Greedy Minimum Entropy Coupling

Murat Kocaoglu

Department of Electrical and Computer Engineering, The University of Texas at Austin, USA

Alexandros G. Dimakis

Department of Electrical and Computer Engineering, The University of Texas at Austin, USA

Sriram Vishwanath

Department of Electrical and Computer Engineering, The University of Texas at Austin, USA

Babak Hassibi

Department of Electrical Engineering, California Institute of Technology, USA

Abstract

We study the problem of identifying the causal relationship between two discrete random variables from observational data. We recently proposed a novel framework called entropic causality that works in a very general functional model but makes the assumption that the unobserved exogenous variable has small entropy in the true causal direction.

This framework requires the solution of a minimum entropy coupling problem: Given marginal distributions of $m$ discrete random variables, each on $n$ states, find the joint distribution with minimum entropy, that respects the given marginals. This corresponds to minimizing a concave function of $n^{m}$ variables over a convex polytope defined by $n\,m$ linear constraints, called a transportation polytope. Unfortunately, it was recently shown that this minimum entropy coupling problem is NP-hard, even for 2 variables with $n$ states. Even representing points (joint distributions) over this space can require exponential complexity (in $n,m$ ) if done naively.

In our recent work we introduced an efficient greedy algorithm to find an approximate solution for this problem. In this paper we analyze this algorithm and establish two results: that our algorithm always finds a local minimum and also is within an additive approximation error from the unknown global optimum.

1 Introduction

Causality is of interest to statisticians, philosophers, engineers and medical scientists **[1, 7, 18]**. Understanding the causal relations between observable parameters is important in analyzing the workings of a system, as well as predicting how it will behave after a policy change. Causality has been studied under several frameworks including potential outcomes **[19]** and structural equation modeling **[15]**. In this paper we rely on structural equation models and data-driven causality using information theory.

The use of information-theoretic tools for causal discovery has recently been gaining attention through various approaches. For example, Janzing et al. **[9]** propose an information geometry approach that relies on a cause and mechanism independence assumption. Another line of work focuses on time-series data and uses Granger causality and directed information **[6, 5, 17, 12]**. In this paper we also use information measures but rely on a different framework that we recently proposed **[11]**.

Our framework, called entropic causality **[11]**, is data-driven, i.e., it can estimate causal directions between two discrete random variables without interventions. Our approach uses Rényi entropy as a complexity measure and considers the simpler model more likely to be the true causal direction. In **[11]** we showed that finding the simplest causal model that explains an observed joint distribution requires solving a minimum entropy coupling problem: Given marginal distributions of $m$ discrete random variables, each on $n$ states, find the joint distribution with minimum entropy that respects the given marginals. This corresponds to minimizing a concave function of $n^{m}$ variables over a convex polytope defined by $n\,m$ linear constraints, called a transportation polytope **[3]**.

The minimum entropy coupling problem between two variables was shown to be NP-hard in **[13]**. In **[11]**, we proposed a greedy algorithm for the minimum entropy coupling problem and showed that for two variables, it always finds a local optimum. The proof used a characterization of the KKT conditions of the corresponding optimization problem and a characterization of the algorithm output when there are two variables. However, this characterization cannot be used when there are more variables.

In this work, we extend the result in **[11]**: We develop a new characterization of the algorithm output for any number of variables. This characterization allows us to conclude that the algorithm output satisfies the KKT conditions irrespective of the number of variables, which implies that the algorithm returns a local optimum. Moreover, we show an additive approximation guarantee with respect to the global optimum.

In Section 2, we provide a very short overview of the causal inference literature. In Section 3, we summarize the results of **[11]** and explain how minimum entropy coupling arises in the entropic causal inference framework. In Section 4, we identify the conditions necessary for a solution to be a local optimum and show that our algorithm’s output always satisfies these conditions by deriving a new characterization. In Section 5, we develop our approximation guarantee for a variant of this algorithm, which is easier to analyze.

2 Related Work

Causal relationships between random variables can be represented by causal directed graphical models **[15, 22]**. Pearl’s framework led to a complete graph theoretic characterization of which parts of a causal graph are learnable using statistical tests. Efficient algorithms were developed for this learning task by Spirtes et al. **[22]**. Unfortunately, a general causal graph cannot be uniquely identified from data samples.

A complete solution to the causal graph identification problem requires experiments, also called interventions. An intervention forces the value of a variable without affecting the other system variables. This removes the effect of its causes, effectively creating a new causal graph. These changes in the causal graph create a post-interventional distribution among variables, which can be used to learn additional causal relations in the original graph. The procedure can be applied repeatedly to fully identify any causal graph **[20]**. There has been significant recent progress on how to perform experiments efficiently **[4, 20]**, even under constraints **[10]**. Unfortunately, in many cases it is very difficult (or even impossible) to perform experiments and we are only given a static dataset.

When performing experiments is not an option, we need additional assumptions on the data generating process to identify the causal relations between the variables. The most widely employed assumption is the additive noise assumption, which asserts that the unobserved variables affect the observable variables additively. Under this assumption, the authors of **[8]** showed that, except for a measure zero parameter set, one can identify the true causal direction between two variables, as long as the relation is non-linear. A similar result is known when the noise is non-Gaussian, irrespective of the relation between the variables **[21]**. These approaches inherently assume continuous variables and additive noise. Other works consider discrete variables with additive noise **[16]**, or continuous variables without the additive noise assumption **[14]**.

Another approach is to exploit the postulate that the cause and mechanism are in general independently assigned by nature. The notion of independence here is captured by assigning maps, or conditional distributions to random variables to argue about independence of cause and mechanism. In this direction an information-geometry based approach is suggested **[9]**. Independence of cause and mechanism is captured by treating the log-slope of the function as a random variable, and assuming that it is independent from the cause. In the case of a deterministic relation $Y=f(X)$ , there are theoretical guarantees on identifiability. However, this assumption is restrictive for real data.

In **[11]**, we introduced the entropic causality framework. Our framework does not assume additive noise and uses probability distributions as opposed to variable values. Thus, it can naturally handle both categorical as well as ordinal variables. The central postulate is that in the true direction, the Rényi entropy of the exogenous variable is small. The central theoretical result of **[11]** is identifiability for zero order Rényi entropy (i.e., support of distribution): If the cardinality of the exogenous variable is small in the true direction, then there does not exist any causal model where the cardinality of the exogenous variable in the reverse direction is also small, under mild assumptions. We conjecture that a similar identifiability result is true for Rényi entropy of order 1, i.e., Shannon entropy, and numerical simulations seem to verify it. Furthermore, we showed that the corresponding causality test can match or outperform the previous state of the art in causal identification benchmarks in real and synthetic datasets **[11]**.

In very recent parallel work, Cicalese et al. **[2]** proposed a more involved greedy algorithm for the minimum entropy coupling problem and showed a very strong 1-bit approximation guarantee for it. The proposed algorithm applies only to two variables. Two-variable algorithms for minimum entropy coupling can be used for entropic causality only if one of the two variables takes only two values. Therefore, it would be very interesting if it could be extended to multiple variables, especially if similarly strong approximation guarantees hold.

3 Background

3.1 Notation

We use uppercase letters ( $X$ ) for random variables, lowercase letters for their realizations and constants ( $x,i,\alpha$ ), lowercase bold letters for column vectors ( $\mathbf{p}$ ), uppercase bold letters for matrices and tensors $(\mathbf{G})$ . We represent the set $\{1,2,\ldots,n\}$ by $[n]$ , whereas $[a,b]$ indicates the continuous interval from $a$ to $b$ as usual. Vectors and sets with indices are simply represented through subscripts as follows: $[x_{i}]_{i\in[n]}$ represents the column vector $[x_{1},x_{2},\ldots,x_{n}]^{T}$ and $\{u_{i}\}_{i\in[m]}$ represents the set $\{u_{1},u_{2},\ldots,u_{m}\}$ . $X\sim p_{X}$ means the random variable $X$ is distributed with the probability mass function $p_{X}$ , i.e., $\Pr(X=i)=p_{X}(i)$ . $\perp\!\!\!\perp$ stands for the statistical independence between random variables. The Shannon entropy $H([p_{i}]_{i})=-\sum_{i}p_{i}\log(p_{i})$ naturally extends to matrices (and tensors) as $H([r_{i,j}]_{i,j})=-\sum_{i,j}r_{i,j}\log(r_{i,j})$ , where $\log(.)$ stands for the logarithm base 2.
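The entropy notation above can be illustrated with a minimal Python sketch. The function names `entropy` and `tensor_entropy` are hypothetical, and the handling of zero entries (treating $0\log 0$ as 0) is the usual convention rather than something stated in the text.

```python
import math

def entropy(masses):
    """Shannon entropy in bits of an iterable of non-negative masses.

    Assumption: zero entries contribute nothing (0 * log 0 is taken as 0).
    """
    return -sum(p * math.log2(p) for p in masses if p > 0)

def tensor_entropy(rows):
    """Entropy of a matrix given as a list of rows, summed over all entries."""
    return entropy(p for row in rows for p in row)

# A uniform distribution on 4 states, and the same mass laid out as a 2x2 joint.
uniform = [0.25, 0.25, 0.25, 0.25]
joint = [[0.25, 0.25], [0.25, 0.25]]
h_vector = entropy(uniform)
h_matrix = tensor_entropy(joint)
```

Both values agree because the matrix form simply sums over all entries, as in the definition of $H([r_{i,j}]_{i,j})$.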

3.2 Causal Model

In this section, we introduce Pearl’s causal model for two variables and no unobserved common causes. Causal models are powerful because they can answer hypothetical questions involving experiments. An experiment, called an intervention in this context, means forcing a set of random variables to take certain values. This operation is captured by the do(.) operator of Pearl **[15]**. Thus, by definition, the causal model captures the knowledge of what will happen after performing any intervention on the observed variables. Consider two variables $X,Y$ . Suppose $X$ causes $Y$ . The following are what this causal model entails: (i) There exists an exogenous (unobserved) random variable $E\perp\!\!\!\perp X$ and a map $f$ such that $Y=f(X,E)$ . Let $E\sim p_{E},X\sim p_{X}$ . (ii) An intervention $do(X=x)$ changes the data generating model and yields $X=x,E\sim p_{E},Y=f(x,E)$ . Thus, an intervention on $X$ does not change the distribution of $E$ , but fixes the value of $X$ . Hence the distribution of $Y$ is affected through these changes. However, an intervention on $Y$ has a different effect. (iii) $do(Y=y)$ changes the model as follows: $X\sim p_{X},E\sim p_{E},Y=y$ . The important thing to notice here is that intervening on $Y$ makes it independent from $X$ , whereas intervening on $X$ does not make it independent from $Y$ . (Footnote 1: Technically, to talk about statistical independence, we need stochastic interventions: consider $do(X=U)$ , which forces $X$ to take the same values as an independent random variable $U$ .)

The fact that a causal model can answer interventional queries is what makes it so powerful, but also hard to learn from data. In general, given a joint distribution over $X,Y$ one can find functions $f,g$ where $Y=f(X,E),E\perp\!\!\!\perp X$ and $X=g(Y,\tilde{E}),\tilde{E}\perp\!\!\!\perp Y$ . This makes the problem of learning the causal relation between $X$ and $Y$ unidentifiable in general. The objective of data-driven causal inference is to identify assumptions on either the function $f$ or the variable $E$ under which the causal model can be learned.

3.3 The Entropic Causal Inference Framework

Entropic causal inference **[11]** uses the number of random bits as a complexity measure and chooses the simpler model as the true causal model. Suppose we observe the joint distribution of two variables $X,Y$ each with $n$ states. Consider the problem of identifying the exogenous variable with minimum Shannon entropy such that there is a causal model where $X$ causes $Y$ , that yields this joint distribution. In **[11]**, we established that this problem is equivalent to the minimum entropy coupling problem between $n$ variables each with $n$ states.

Consider the variables $X,Y$ with $X,Y\in[n]$ . Suppose $X$ causes $Y$ . Then $Y=f(X,E)$ , where $E$ is an exogenous variable of cardinality $m$ for some $m$ independent from $X$ , and $f$ is some map $f:[n]\times[m]\rightarrow[n]$ . Let $U_{i}$ be a random variable that has the same distribution as the distribution of $X$ conditioned on $Y=i$ : $\Pr(U_{i}=j)=\Pr(X=j|Y=i)$ . We have the following lemma:

Lemma 1.

**[11]**

Let $X,Y$ be two variables with $X,Y\in[n]$ . Consider any causal model $X=g(Y,\tilde{E}),\tilde{E}\perp\!\!\!\perp Y$ . Then $H(\tilde{E})\geq H^{*}(U_{1},\ldots,U_{n})$ , where $H^{*}(U_{1},\ldots,U_{n})$ is the minimum joint entropy of variables $\{U_{1},\ldots,U_{n}\}$ subject to the constraint that each $U_{i}$ has the same marginal distribution as the conditional distribution of $X$ given $Y=i$ .

Moreover, there is an $\tilde{E}\perp\!\!\!\perp Y$ with $H(\tilde{E})=H^{*}(U_{1},\ldots,U_{n})$ .

Proof.

See the proof of Theorem 3 in the appendix of **[11]**. ∎

Lemma 1 puts the minimum entropy coupling problem at the center of the entropic causal inference framework. If we could solve the minimum entropy coupling problem, we could identify the exogenous variable with minimum entropy. If the identifiability result holds (Conjecture 1 in **[11]**), $H(Y)+H(\tilde{E})$ will be greater than $H(X)+H(E)$ if the entropy of $E$ is sufficiently small. Hence, closely approximating the minimum entropy coupling is essential for an effective causal inference algorithm using the entropic causal inference framework.
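To make the input of the coupling problem concrete, the sketch below builds the conditional distributions $\Pr(X=j|Y=i)$ that serve as the marginals of $U_{1},\ldots,U_{n}$ in Lemma 1. It is only an illustration: the function name `conditionals_of_x_given_y`, the `joint[x][y]` layout, and the zero-based state indices are assumptions made here, not part of the paper.

```python
from fractions import Fraction

def conditionals_of_x_given_y(joint):
    """Return the distributions of U_i, i.e. Pr(X = . | Y = i), one per value of Y.

    Assumed layout: joint[x][y] = Pr(X = x, Y = y), with zero-based states.
    """
    n_x, n_y = len(joint), len(joint[0])
    result = []
    for y in range(n_y):
        p_y = sum(joint[x][y] for x in range(n_x))
        if p_y == 0:
            raise ValueError("conditioning on a value of Y with zero probability")
        result.append([joint[x][y] / p_y for x in range(n_x)])
    return result

# A small joint distribution over X, Y in {0, 1}, using exact fractions.
joint = [[Fraction(1, 4), Fraction(1, 8)],
         [Fraction(1, 4), Fraction(3, 8)]]
marginals = conditionals_of_x_given_y(joint)
```

Each returned list sums to 1 and can be passed as one marginal to a minimum entropy coupling routine.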

3.4 Greedy Minimum Entropy Coupling Algorithm

Different from **[11]**, we provide the version of the greedy minimum entropy coupling algorithm that constructs the joint distribution tensor, rather than only the non-zero probability values, which is more instructional for this paper. The greedy algorithm is given in Algorithm 1. The marginal distribution of variable $i$ is shown by the column vector $\mathbf{p_{i}}$ . Note that in practice, one would only store the non-zero probability values output by the algorithm, rather than creating the extremely sparse tensor $\mathbf{P}$ with $n^{m}$ entries.

At each iteration, the algorithm finds the largest probability mass in each marginal and assigns the minimum of these to the corresponding coordinate in the joint probability tensor. The motivation is that large chunks of probability mass are not split into smaller chunks, so each assignment contributes as little as possible to the total entropy. The algorithm satisfies at least one marginal constraint at each step, and $m$ of them in the last step. Thus it terminates in at most $nm-m+1$ steps.
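The iteration described above can be sketched in Python as follows. This is a minimal illustration written from the prose description, not a transcription of Algorithm 1: the name `greedy_coupling`, the zero-based coordinates, the tie-breaking rule (first maximum), and the use of exact `Fraction` arithmetic are all assumptions made here. Only the non-zero entries are stored, as the text suggests for practical use.

```python
from fractions import Fraction

def greedy_coupling(marginals):
    """Greedy coupling of m marginals, each a list of n masses summing to 1.

    Returns the non-zero joint entries as (coordinate tuple, mass) pairs,
    in the order in which they are assigned.
    """
    remaining = [list(p) for p in marginals]
    steps = []
    while True:
        # Index of the largest remaining mass in every marginal (first one on ties).
        coords = tuple(max(range(len(r)), key=r.__getitem__) for r in remaining)
        # The smallest of these maxima is assigned to the joint entry `coords`.
        mass = min(r[c] for r, c in zip(remaining, coords))
        if mass <= 0:
            return steps  # some marginal is exhausted, so all mass is placed
        steps.append((coords, mass))
        for r, c in zip(remaining, coords):
            r[c] -= mass  # at least one coordinate drops to exactly zero

p1 = [Fraction(1, 2), Fraction(1, 2)]
p2 = [Fraction(3, 4), Fraction(1, 4)]
example = greedy_coupling([p1, p2])
```

For these two marginals ($n=2$, $m=2$) the sketch makes three assignments, which matches the stated limit $nm-m+1=3$.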

4 Greedy Algorithm Gives Local Optimum

In this section, we present our main theorem and show that the greedy algorithm always finds a local optimum. We consider $n$ variables each with $n$ states. The extension of the analysis to $m$ variables each with $n$ states is trivial. Let us first formalize the entropy minimization problem:

Definition 1 (Minimum Entropy Coupling).

Let $U_{i},i\in[n]$ be discrete random variables with $n$ states, with marginal distributions $\mathbf{p_{i}}\in[0,1]^{n}$ . The minimum entropy coupling problem is to find the joint distribution with minimum entropy that is consistent with the given marginals:

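$$\min_{p(U_{1},U_{2},\ldots,U_{n})}\ H(U_{1},U_{2},\ldots,U_{n})\quad\text{s.t.}\quad\sum_{j\neq i}\sum_{u_{j}\in[n]}p(u_{1},u_{2},\ldots,u_{n})=p_{i}(u_{i}),\quad\forall i,u_{i}.$$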

We can equivalently write down this optimization problem by representing the joint probability value for each configuration as a different variable. This representation has $n^{n}$ variables and $n^{2}$ constraints ( $n$ marginals and $n$ points for each marginal). Let $x(i_{1},i_{2},\ldots,i_{n})$ be a variable for every $n$ -tuple $(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}$ . Notice that the index for the $j$ th dimension, i.e., $i_{j}$ , captures the realization of variable $U_{j}$ . Then the optimization problem can be written as follows:

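$$\begin{aligned}\min_{x}\quad&\sum_{i_{j}\in[n],\,\forall j\in[n]}-x(i_{1},i_{2},\ldots,i_{n})\log x(i_{1},i_{2},\ldots,i_{n})\\ \text{s.t.}\quad&\sum_{i_{k}\in[n],\,\forall k\neq l,\;i_{l}=j}x(i_{1},i_{2},\ldots,i_{n})=p_{l}(j),\quad\forall j,l\in[n]\\ &x(i_{1},i_{2},\ldots,i_{n})\geq 0,\quad\forall(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}\end{aligned}\tag{4}$$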

In (4), we dropped the constraint $\sum_{j,i_{j}}x(i_{1},i_{2},\ldots,i_{n})=1$ because it is redundant. Computing the total sum is equivalent to first marginalizing out dimensions $1$ to $n-1$ , and then summing over dimension $n$ . Marginalizing out the first $n-1$ dimensions gives $\mathbf{p_{n}}$ , which is already enforced by separate equality constraints, and summing across the remaining dimension gives 1 since $\mathbf{p_{n}}$ sums to 1.

In this section, we show the following theorem:

Theorem 1.

Algorithm 1 finds a local optimum point of the optimization problem in (4).

4.1 KKT Conditions

First, we characterize the points that satisfy the KKT conditions. We have the following lemma:

Lemma 2.

Consider the optimization problem in (4). Let $x^{*}(i_{1},i_{2},\ldots,i_{n}),i_{j}\in[n],j\in[n]$ be a point that satisfies the KKT conditions. Then there are $n$ vectors $\mathbf{u_{k}},k\in[n]$ each of length $n$ such that either $x^{*}(i_{1},i_{2},\ldots,i_{n})=0$ , or

$$\log x^{*}(i_{1},i_{2},\ldots,i_{n})+1=\sum_{k\in[n]}u_{k}(i_{k}).$$

Proof.

Consider the following general optimization problem:

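$$\begin{aligned}\min_{x}\quad&f_{0}(x)\\ \text{s.t.}\quad&h_{i}(x)=0,\quad i\in[p]\\ &f_{i}(x)\leq 0,\quad i\in[m]\end{aligned}$$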

The Lagrangian becomes

$$L(x,\lambda,\nu)=f_{0}(x)+\sum_{i=1}^{m}\lambda_{i}f_{i}(x)+\sum_{i=1}^{p}\nu_{i}h_{i}(x),$$

which gives the KKT conditions

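$$\begin{aligned}f_{i}(x^{*})&\leq 0,\quad i\in[m]\\ h_{i}(x^{*})&=0,\quad i\in[p]\\ \lambda_{i}^{*}&\geq 0,\quad i\in[m]\\ \lambda_{i}^{*}f_{i}(x^{*})&=0,\quad i\in[m]\\ \nabla L(x^{*},\lambda^{*},\nu^{*})&=0\end{aligned}$$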

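This implies, for fixed $i$ , either $f_{i}(x^{*})=0$ or $\lambda_{i}^{*}=0$ . Matching the constraints in (4) to the functions in the general problem above, we identify $f(i_{1},i_{2},\ldots,i_{n})$ and $h_{l,j}$ as follows:

$$\begin{aligned}f(i_{1},i_{2},\ldots,i_{n})&=-x(i_{1},i_{2},\ldots,i_{n}),\quad\forall(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}\\ h_{l,j}&=\sum_{i_{k}\in[n],\,\forall k\neq l,\;i_{l}=j}x(i_{1},i_{2},\ldots,i_{n})-p_{l}(j),\quad\forall l\in[n],j\in[n]\end{aligned}$$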

The Lagrangian of (4) can be written as follows:

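$$\begin{aligned}L(x,\lambda,\nu)=&-\sum_{(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}}x(i_{1},i_{2},\ldots,i_{n})\log x(i_{1},i_{2},\ldots,i_{n})\\ &-\sum_{(i_{1},i_{2},\ldots,i_{n})\in[n]^{n}}\lambda(i_{1},i_{2},\ldots,i_{n})\,x(i_{1},i_{2},\ldots,i_{n})\\ &+\sum_{j\in[n],\,l\in[n]}\nu_{l,j}\Big(\sum_{i_{k}\in[n],\,\forall k\neq l,\;i_{l}=j}x(i_{1},i_{2},\ldots,i_{n})-p_{l}(j)\Big),\end{aligned}$$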

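for the dual parameters $\lambda(i_{1},i_{2},\ldots,i_{n})$ and $\nu_{l,j}$ . Setting the gradient with respect to $x$ to zero gives us the following:

$$\frac{\partial L}{\partial x(i_{1},i_{2},\ldots,i_{n})}=-\log x(i_{1},i_{2},\ldots,i_{n})-1-\lambda(i_{1},i_{2},\ldots,i_{n})+\sum_{l\in[n]}\nu_{l,i_{l}}=0$$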

The conditions above imply the following for the optimal point $x^{*}$ : Either $x^{*}(i_{1},i_{2},\ldots,i_{n})=0$ or if $x^{*}(i_{1},i_{2},\ldots,i_{n})\neq 0$ it satisfies

$$x^{*}(i_{1},i_{2},\ldots,i_{n})=2^{-1+\sum_{k\in[n]}\nu_{k,i_{k}}}.$$

Thus, for $n$ vectors $u_{k}\coloneqq\nu_{k,.},k\in[n]$ of length $n$ , we have $\log{x^{*}(i_{1},i_{2},\ldots,i_{n})}+1=\sum_{k\in[n]}u_{k}(i_{k}).$ ∎

By Lemma 2, the optimal point satisfies the following: Each nonzero joint probability can be written as a product of the corresponding entries of $n$ vectors $\{\mathbf{v_{k}}\}_{k\in[n]}$ of length $n$ . Inspired by the definition of independence, we will term such joint distributions as quasi-independent:

Definition 2.

A joint distribution $p(X_{1},X_{2},\ldots,X_{m})$ for $X_{i}\in[n]$ is called quasi-independent, if there are $m$ vectors $\mathbf{u_{j}},j\in[m]$ such that either $p(i_{1},i_{2},\ldots,i_{m})=0$ or $p(i_{1},i_{2},\ldots,i_{m})=\prod_{j\in[m]}\mathbf{u_{j}}(i_{j}),\forall i_{j}\in[n],j\in[m]$ .
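The step from the additive form in Lemma 2 to the product form in Definition 2 can be spelled out as a short worked equation. The particular choice of vectors $\mathbf{v_{k}}$ below (absorbing the constant factor $2^{-1}$ into the first vector) is one possible assignment chosen here for illustration; the text does not fix it.

```latex
% Exponentiating \log x^{*} + 1 = \sum_{k} u_{k}(i_{k}) in base 2:
x^{*}(i_{1},i_{2},\ldots,i_{n})
  = 2^{-1+\sum_{k\in[n]}u_{k}(i_{k})}
  = \prod_{k\in[n]} v_{k}(i_{k}),
\qquad
v_{1}(i) \coloneqq 2^{u_{1}(i)-1},
\quad
v_{k}(i) \coloneqq 2^{u_{k}(i)} \ \text{for } k \geq 2 .
```

Every non-zero entry of a KKT point is therefore a product of one entry from each of $n$ vectors, which is the quasi-independence property.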

4.2 Characterization of Greedy Algorithm Output

Consider Algorithm 1. It selects the minimum of maximum probability values across each marginal at each step, subtracts this probability mass from the corresponding coordinates in each marginal and iterates. Next, we show that one can always construct $\mathbf{u_{k}}$ vectors that satisfy $\log{x(i_{1},i_{2},\ldots,i_{n})}+1=\sum_{k\in[n]}\mathbf{u_{k}}(i_{k})$ , where $x(i_{1},i_{2},\ldots,i_{n})$ is the probability mass assigned to point $(i_{1},i_{2},\ldots,i_{n})$ by the algorithm.

Let the algorithm select a probability mass for the point $S_{j}=(i_{1}^{j},i_{2}^{j},\ldots,i_{n}^{j})$ at iteration $j$ , so that $x(S_{j})>0$ . Let $a_{j}\coloneqq\log{x(S_{j})}+1$ after this assignment. Define the column vector $\mathbf{u}\coloneqq[\mathbf{u_{1}}^{T},\ldots,\mathbf{u_{n}}^{T}]^{T}$ , where $\{\mathbf{u_{i}}\}_{i\in[n]}$ are length- $n$ vectors to be decided. We will show that, given the assignments made by the algorithm, one can always construct a $\mathbf{u}$ such that the condition of Lemma 2 holds.

Observe that each iteration of the algorithm corresponds to a linear equation in $\mathbf{u}$ . Note that $\mathbf{u}$ has length $n^{2}$ and at iteration $j$ , $\mathbf{u}$ should satisfy the constraint $\mathds{1}_{S_{j}}^{T}\mathbf{u}=a_{j}$ , where $\mathds{1}_{S_{j}}^{T}$ is the indicator vector that is 1 in the columns from $S_{j}$ and zero otherwise: If $S_{j}=(i_{1}^{j},i_{2}^{j},\ldots,i_{n}^{j})$ , then $\mathds{1}_{S_{j}}(k)=1,\forall k\in\xi_{j}$ , where $\xi_{j}=\{i_{t}^{j}+(t-1)n\}_{t\in[n]}$ . We know that the algorithm terminates in at most $n(n-1)+1$ steps. Thus, writing $m$ for the number of iterations, we have $m\leq n(n-1)+1$ linear equations and $n^{2}$ variables. This corresponds to a system of linear equations $\mathbf{Gu}=\mathbf{a}$ , where $\mathbf{G}(j,:)=\mathds{1}_{S_{j}}^{T}$ and $\mathbf{a}=[a_{j}]_{j\in[m]}$ is a column vector.

We have the following key observation: At each iteration step, the algorithm satisfies at least one of the marginal constraints, since it chooses the minimum of maximum probabilities. Thus, if at iteration $j$ the algorithm selects the set of coordinates $(i_{1}^{j},i_{2}^{j},\ldots,i_{n}^{j})$ , then for some $k\in[n]$ the algorithm never selects the coordinate $i_{k}^{j}$ again, since the corresponding marginal constraint is already satisfied. In terms of the matrix $\mathbf{G}$ , this translates to the following statement: Every row $j$ of $\mathbf{G}$ contains a column $k\in\xi_{j}$ where $\mathbf{G}(l,k)=0,\forall l>j$ . Thus, every row of $\mathbf{G}$ has a column where that row contains the last 1 in that column. We have the following lemma:

Lemma 3.

Let $\mathbf{G}$ be a $\{0,1\}$ matrix where no row is identically zero. If for every row $j$ , of all the columns with value 1, there exists a column $k$ such that $\mathbf{G}(l,k)=0,\forall l>j$ , then the rows of $\mathbf{G}$ are linearly independent.

Proof.

Assume otherwise. Then there exists a nonempty set of rows $S$ and coefficients $\alpha_{j}\neq 0$ such that $\sum_{j\in S}\alpha_{j}\mathbf{G}(j,:)=0$ . Let $l=\min\{i:i\in S\}$ . By assumption, the $l^{th}$ row of $\mathbf{G}$ has a column $k$ with $\mathbf{G}(l,k)=1$ and $\mathbf{G}(t,k)=0,\forall t>l$ . Thus, the $k$ th entry of the combination equals $\alpha_{l}\neq 0$ and cannot be cancelled by rows with a larger index, which contradicts $\sum_{j\in S}\alpha_{j}\mathbf{G}(j,:)=0$ . ∎

By Lemma 3, the rows of $\mathbf{G}$ are linearly independent. This is also true for the augmented matrix of the system $\mathbf{Gu}=\mathbf{a}$ , so both matrices have the same rank. Hence, the assignments are consistent and there is at least one solution to the linear system $\mathbf{Gu}=\mathbf{a}$ .
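The existence argument can also be made constructive. The sketch below solves $\mathbf{Gu}=\mathbf{a}$ by back-substitution, processing the rows from last to first and using the column that each row touches last as its pivot. This is an illustrative construction written for this section, not the paper's procedure: the helper names `greedy_steps` and `solve_for_u`, the dictionary representation of $\mathbf{u}$ keyed by (variable index, state), the zero-based indices, and the choice of 0 for the free entries are all assumptions.

```python
import math
from fractions import Fraction

def greedy_steps(marginals):
    """Greedy coupling (first maximum on ties); returns (coords, mass) pairs in order."""
    remaining = [list(p) for p in marginals]
    steps = []
    while True:
        coords = tuple(max(range(len(r)), key=r.__getitem__) for r in remaining)
        mass = min(r[c] for r, c in zip(remaining, coords))
        if mass <= 0:
            return steps
        steps.append((coords, mass))
        for r, c in zip(remaining, coords):
            r[c] -= mass

def solve_for_u(steps):
    """Solve G u = a by back-substitution, last row first.

    u[(k, i)] plays the role of the entry u_k(i) in the text.
    """
    u = {}
    for coords, mass in reversed(steps):
        a_j = math.log2(mass) + 1
        keys = list(enumerate(coords))
        # Columns not used by any later row are exactly the ones still unassigned.
        free = [key for key in keys if key not in u]
        # The premise of Lemma 3 guarantees at least one such column.
        pivot, rest = free[0], free[1:]
        for key in rest:
            u[key] = 0.0  # arbitrary choice; no later row constrains these
        u[pivot] = a_j - sum(u[key] for key in keys if key != pivot)
    return u

marginals = [
    [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)],
    [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)],
    [Fraction(7, 10), Fraction(1, 5), Fraction(1, 10)],
]
steps = greedy_steps(marginals)
u = solve_for_u(steps)
# Largest violation of log2 x(S_j) + 1 = sum_k u_k(i_k) over all assigned points.
residual = max(
    abs(sum(u[(k, c)] for k, c in enumerate(coords)) - (math.log2(mass) + 1))
    for coords, mass in steps
)
```

Exact `Fraction` arithmetic is used for the marginals so that exhausted coordinates are exactly zero. With floating-point inputs, tiny leftover masses could produce extra spurious steps and the pivot argument would no longer be guaranteed.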

Proof of Theorem 1.

Consider the joint distribution output by the greedy algorithm. From the above discussion, the assignments to the joint distribution by the greedy entropy minimization algorithm can always be used to create $n$ vectors, such that the points where the joint is non-zero can be written as the product of the corresponding coordinates of these $n$ vectors. Thus, the greedy algorithm outputs a point which is quasi-independent, and satisfies the KKT conditions of the minimum entropy coupling problem. Hence, this is a stationary point. Since entropy is a concave function, there are no saddle points. Thus, the greedy algorithm outputs a local optimum. ∎

5 Approximation Guarantee

In this section, we analyze a variant of the greedy algorithm, Algorithm 2, for which an approximation guarantee is easier to develop.

Different from Algorithm 1, Algorithm 2 looks at each value of every given marginal exactly once during Phase I. This allows us to relate the entropy contribution of Phase I to a lower bound to the optimum entropy.

Consider two random variables $X_{1},X_{2}$ . We use $\mu_{1},\mu_{2}$ to represent the marginal distributions of $X_{1}$ and $X_{2}$ after sorting their probabilities in decreasing order. We can extend the entropy function to operate on vectors which do not necessarily sum to 1. To make the distinction from entropy, we use $h(.)$ for this operator. (Footnote 2: $h(.)$ is often used for the differential entropy operator. Since we do not use differential entropy in this paper, we believe this is not a source of confusion.)

Theorem 2.

Let $X_{1},X_{2}$ be two discrete random variables with $n$ states and $\mu_{1}=[p_{1}(i)]_{i\in[n]}$ , $\mu_{2}=[p_{2}(i)]_{i\in[n]}$ be their marginal distribution vectors sorted in decreasing order. Let $p_{m}(i)=\min\{p_{1}(i),p_{2}(i)\}$ . Let $U$ be the joint distribution output by the greedy algorithm, and $H^{*}(X_{1},X_{2})$ the minimum joint entropy of all joints that respect the marginals. Then

$$H(U)\leq H^{*}(X_{1},X_{2})+1-T\log(1/T)+\min\{h(l_{1}),h(l_{2})\},$$

where $l_{j}=[p_{j}(i)-p_{m}(i)]_{i\in[n]}$ for $j\in\{1,2\}$ , and $T=0.5\sum_{i\in[n]}\lvert p_{1}(i)-p_{2}(i)\rvert$ is the total variation distance between the sorted marginals of $X_{1}$ and $X_{2}$ .

Proof.

Define $p_{m}(i)=\min\{p_{1}(i),p_{2}(i)\}$ . In Phase I, the algorithm chooses $p_{m}(i)$ for $i\in[n]$ . Consider

$$H_{a}=H\big(p_{m}(1),\,p_{1}(1)-p_{m}(1),\,\ldots,\,p_{m}(n),\,p_{1}(n)-p_{m}(n)\big).$$

$H_{a}$ is the entropy of the distribution which is obtained by splitting $p_{1}(i)$ into $p_{m}(i)$ and $p_{1}(i)-p_{m}(i)$ . Since each probability value is divided into at most 2 probability values,

$$H_{a}\leq H(X_{1})+1.$$

Similarly, we can write

$$H_{b}=H\big(p_{m}(1),\,p_{2}(1)-p_{m}(1),\,\ldots,\,p_{m}(n),\,p_{2}(n)-p_{m}(n)\big)\leq H(X_{2})+1.$$

Then in Phase I, the algorithm creates an entropy contribution $H_{Ph1}=h(p_{m}(1),p_{m}(2),\ldots,p_{m}(n))=-\sum_{i\in[n]}p_{m}(i)\log(p_{m}(i))$ . Based on the definitions of $l_{1},l_{2}$ ,

$$H_{a}=H_{Ph1}+h(l_{1}),\quad H_{b}=H_{Ph1}+h(l_{2}).$$

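Let $\alpha\in\{0,1\}$ . Weighting the bounds $H_{a}\leq H(X_{1})+1$ and $H_{b}\leq H(X_{2})+1$ by $\alpha$ and $1-\alpha$ and substituting the decompositions of $H_{a}$ and $H_{b}$ , we get

$$H_{Ph1}+\alpha h(l_{1})+(1-\alpha)h(l_{2})\leq\alpha H(X_{1})+(1-\alpha)H(X_{2})+1.$$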

To bound the contribution of the second phase, we use an "independence" bound. The following lemma is useful:

Lemma 4.

Consider the vectors $\mathbf{p}=[p_{i}]_{i\in[n]},\mathbf{q}=[q_{i}]_{i\in[n]}$ where $p_{i},q_{i}\geq 0$ and $\sum_{i}p_{i}=\sum_{i}q_{i}=T$ . Let $h(\mathbf{p})=-\sum_{i}p_{i}\log(p_{i})$ . Let $\mathbf{R}(i,j)=r_{i,j}$ for $i\in[n],j\in[n]$ be a matrix with row sum equal to $\mathbf{p}$ and column sum equal to $\mathbf{q}$ , i.e., $\sum_{j\in[n]}r_{i,j}=p_{i}$ and $\sum_{i\in[n]}r_{i,j}=q_{j},\forall i,j\in[n]$ . Then $h(\mathbf{R})\leq T\log(T)+h(\mathbf{p})+h(\mathbf{q})$ .

Moreover, when $\mathbf{R}$ is the outer product of $\mathbf{p}/\sqrt{T}$ and $\mathbf{q}/\sqrt{T}$ , the equality holds.

Proof.

Define the random variables $U$ and $V$ as the variables with marginal distributions $\mathbf{p}/T$ and $\mathbf{q}/T$ , respectively. Let $\mathbf{S}(i,j)=[s_{i,j}]_{i,j\in[n]}$ be the joint distribution matrix for $U,V$ that respects the marginals $\mathbf{p}/T$ and $\mathbf{q}/T$ . Since $H(U,V)\leq H(U)+H(V)$ , we have

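$$-\sum_{i,j\in[n]}s_{i,j}\log(s_{i,j})\leq-\sum_{i\in[n]}\frac{p_{i}}{T}\log\Big(\frac{p_{i}}{T}\Big)-\sum_{j\in[n]}\frac{q_{j}}{T}\log\Big(\frac{q_{j}}{T}\Big).$$

Define $\mathbf{R}(i,j)=r_{i,j}$ where $r_{i,j}=Ts_{i,j}$ . Notice that the row sums of $\mathbf{R}$ give $\mathbf{p}$ and the column sums of $\mathbf{R}$ give $\mathbf{q}$ . Then we have

$$\begin{aligned}h(\mathbf{R})&=-\sum_{i,j\in[n]}Ts_{i,j}\log(Ts_{i,j})=-T\log(T)-T\sum_{i,j\in[n]}s_{i,j}\log(s_{i,j})\\&\leq-T\log(T)-\sum_{i\in[n]}p_{i}\log\Big(\frac{p_{i}}{T}\Big)-\sum_{j\in[n]}q_{j}\log\Big(\frac{q_{j}}{T}\Big)\\&=T\log(T)+h(\mathbf{p})+h(\mathbf{q}).\end{aligned}$$

Suppose $\mathbf{R}(i,j)=\frac{p_{i}q_{j}}{T}$ . Then we have

$$h(\mathbf{R})=-\sum_{i,j\in[n]}\frac{p_{i}q_{j}}{T}\log\Big(\frac{p_{i}q_{j}}{T}\Big)=-\sum_{i\in[n]}p_{i}\log(p_{i})-\sum_{j\in[n]}q_{j}\log(q_{j})+T\log(T)=h(\mathbf{p})+h(\mathbf{q})+T\log(T).$$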

∎
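Lemma 4 can be checked numerically on a small instance. The sketch below is illustrative only: the helper names and the example vectors are assumptions, logarithms are base 2, and the north-west corner rule is used merely as a convenient way to build a second matrix with the same row and column sums. It is not a step of the algorithm analyzed in the paper.

```python
import math

def h(values):
    """Extended entropy -sum v*log2(v) for a nonnegative vector (0*log 0 := 0)."""
    return -sum(v * math.log2(v) for v in values if v > 0)

def flatten(matrix):
    return [x for row in matrix for x in row]

def scaled_outer_product(p, q):
    """R(i, j) = p_i * q_j / T, the matrix for which Lemma 4 claims equality."""
    T = sum(p)
    return [[pi * qj / T for qj in q] for pi in p]

def northwest_corner(p, q):
    """A sparse matrix with row sums p and column sums q (illustrative choice)."""
    p, q = list(p), list(q)
    R = [[0.0] * len(q) for _ in p]
    i = j = 0
    while i < len(p) and j < len(q):
        mass = min(p[i], q[j])
        R[i][j] = mass
        p[i] -= mass
        q[j] -= mass
        if p[i] == 0:   # row i is exhausted, move down
            i += 1
        else:           # column j is exhausted, move right
            j += 1
    return R

# Hypothetical vectors with common sum T = 0.4.
p = [0.2, 0.1, 0.1]
q = [0.25, 0.05, 0.1]
T = sum(p)

bound = T * math.log2(T) + h(p) + h(q)
h_outer = h(flatten(scaled_outer_product(p, q)))
h_sparse = h(flatten(northwest_corner(p, q)))

print("bound    =", bound)
print("h(outer) =", h_outer)   # matches the bound up to rounding
print("h(sparse)=", h_sparse)  # strictly below the bound here
```

The scaled outer product attains the bound, while the sparser matrix stays below it, in line with the "independence" intuition behind the lemma.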

Following Lemma 4, the maximum contribution of the second phase to the entropy is obtained when we place the scaled outer product of the remaining probability values on the joint probability matrix. The remaining probabilities after Phase I are $l_{1}$ and $l_{2}$ for $X_{1}$ and $X_{2}$ . The remaining probability mass is the total variation distance, i.e., $\sum_{i}l_{1}(i)=\sum_{i}l_{2}(i)=T$ . Thus, in Phase II, $l_{1}$ and $l_{2}$ contribute an entropy of $H_{Ph2}\leq T\log{T}+h(l_{1})+h(l_{2})$ . Finally, we can write

$$\begin{aligned}H(U)&\leq H_{Ph1}+H_{Ph2}\\&\leq\alpha H(X_{1})+(1-\alpha)H(X_{2})+1+T\log(T)+(1-\alpha)h(l_{1})+\alpha h(l_{2})\\&\leq H^{*}(X_{1},X_{2})+1-T\log(1/T)+\min\{h(l_{1}),h(l_{2})\}.\end{aligned}\tag{12}$$

(12) is obtained by selecting $\alpha=1$ if $h(l_{1})>h(l_{2})$ and $\alpha=0$ otherwise, and through the bound $H^{*}(X_{1},X_{2})\geq\max(H(X_{1}),H(X_{2}))\geq\alpha H(X_{1})+(1-\alpha)H(X_{2})$ . ∎
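The Phase I step of the proof rests on the fact that splitting every probability value into at most two parts raises the entropy by at most 1 bit. The sketch below checks this empirically on random sorted marginals. It is a sanity check under stated assumptions (base-2 logarithms, hypothetical helper names, a fixed random seed), not a proof.

```python
import math
import random

def h(values):
    """Extended entropy -sum v*log2(v) for a nonnegative vector (0*log 0 := 0)."""
    return -sum(v * math.log2(v) for v in values if v > 0)

def entropy_after_split(mu, p_m):
    """Entropy of the distribution that splits each mu(i) into p_m(i) and mu(i) - p_m(i)."""
    return h(p_m) + h([a - m for a, m in zip(mu, p_m)])

def random_sorted_distribution(n, rng):
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return sorted((w / total for w in weights), reverse=True)

rng = random.Random(0)
worst_increase = -math.inf
for _ in range(1000):
    mu1 = random_sorted_distribution(6, rng)
    mu2 = random_sorted_distribution(6, rng)
    p_m = [min(a, b) for a, b in zip(mu1, mu2)]
    for mu in (mu1, mu2):
        # Increase of the split entropy over the original entropy H(X_j).
        worst_increase = max(worst_increase, entropy_after_split(mu, p_m) - h(mu))

print("largest observed increase (bits):", worst_increase)
```

The observed increase never exceeds 1 bit, consistent with the inequality used for both marginals in the proof.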

Consider the bound given in Theorem 2. Since $0\leq T\leq 1$ , the term $1-T\log(1/T)$ is at most 1. However, the term $\min\{h(l_{1}),h(l_{2})\}$ can scale with $\log(n)$ depending on the difference between the sorted marginals. In Section 5.1, we give an example where $\min\{h(l_{1}),h(l_{2})\}=\Theta(\log(n))$ . Interestingly, for the same example we can show that the output of the greedy algorithm is at most 1 bit away from the global optimum. Thus, it may be possible to identify a tighter bound.

We can extend the analysis to the case of $m$ variables instead of only 2. We then have the following theorem:

Theorem 3.

Let $\{X_{i}\}_{i\in[m]}$ be $m$ random variables each with $n$ states and $\mu_{i}=[p_{i}(j)]_{j\in[n]},\forall i\in[m]$ be their marginal distribution vectors sorted in decreasing order. Let $p_{min}(j)=\min\{p_{i}(j),i\in[m]\}$ . Let $U$ be the joint distribution output by Algorithm 2 and $H^{*}(X_{1},\ldots,X_{m})$ the global optimum. Then

$$H(U)\leq H^{*}(X_{1},\ldots,X_{m})+1+(m-1)T\log(T)+\sum_{i\in[m]}h(l_{i})-\max_{i\in[m]}h(l_{i}),$$

where $l_{i}=[p_{i}(j)-p_{min}(j)]_{j\in[n]}$ for $i\in[m]$ , and $T=\sum_{i\in[n]}(p_{1}(i)-p_{min}(i))$ .

Proof.

Define $p_{min}(i)=\min_{j}\{p_{j}(i),j\in[m]\}$ . In Phase I, the algorithm chooses $p_{min}(i)$ for $i\in[n]$ . Consider for all $j\in[m]$

$$H_{a_{j}}=-\sum_{i\in[n]}p_{min}(i)\log(p_{min}(i))-\sum_{i\in[n]}(p_{j}(i)-p_{min}(i))\log(p_{j}(i)-p_{min}(i)).$$

$H_{a_{j}}$ is the entropy of the distribution which is obtained by splitting each $p_{j}(i)$ into $p_{min}(i)$ and $p_{j}(i)-p_{min}(i)$ . Since each probability value is divided into at most 2 probability values,

$$H_{a_{j}}\leq H(X_{j})+1,\quad\forall j\in[m].$$

In Phase I, the algorithm creates an entropy contribution $H_{Ph1}=h(p_{min}(1),p_{min}(2),\ldots,p_{min}(n))=-\sum_{i\in[n]}p_{min}(i)\log(p_{min}(i))$ . Define $l_{j}=[p_{j}(1)-p_{min}(1),p_{j}(2)-p_{min}(2),\ldots,p_{j}(n)-p_{min}(n)]$ for all $j\in[m]$ . Then we have

$$H_{Ph1}=H_{a_{j}}-h(l_{j}),\quad\forall j\in[m].$$

Let $\alpha_{j}\in[0,1]$ and $\sum_{j}\alpha_{j}=1$ . Combining with (15), we get

$$H_{Ph1}\leq\sum_{j\in[m]}\alpha_{j}H(X_{j})+1-\sum_{j\in[m]}\alpha_{j}h(l_{j}).$$

To bound the contribution of the second phase, we use an "independence" bound similar to the one in the proof of Theorem 2. We need the following lemma:

Lemma 5.

Consider the vectors $\mathbf{p_{i}}=[p_{i}(j)]_{j\in[n]},{i\in[m]}$ where $p_{i}(j)\geq 0,\forall i\in[m],j\in[n]$ and $\sum_{j}p_{i}(j)=T,\forall i\in[m]$ . Let $h(\mathbf{p_{i}})=-\sum_{j}p_{i}(j)\log(p_{i}(j))$ . Let $\mathbf{R}=[r_{i_{1},i_{2},\ldots,i_{m}}]_{i_{j}\in[n]}$ be a tensor that satisfies the following: $\sum_{i_{k}\in[n],\forall k\neq l,i_{l}=t}{r_{i_{1},i_{2},\ldots,i_{m}}}=p_{l}(t),\forall l\in[m],t\in[n]$ . Then $h(\mathbf{R})\leq\sum_{i\in[m]}h(\mathbf{p_{i}})+(m-1)T\log(T)$ .

Moreover, when $\mathbf{R}$ is the outer product of the vectors $\frac{\mathbf{p_{i}}}{T^{\frac{m-1}{m}}}$ for all $i\in[m]$ , equality holds.

Proof.

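Define the random variables $U_{i}$ as the variables with marginal distributions $\mathbf{p_{i}}/T$ for all $i\in[m]$ . Let $\mathbf{S}(i_{1},i_{2},\ldots,i_{m})=[s_{i_{1},i_{2},\ldots,i_{m}}]_{i_{j}\in[n],\forall j\in[m]}$ be the joint distribution tensor for $U_{1},U_{2},\ldots,U_{m}$ that respects the marginals $\mathbf{p_{i}}/T$ . Since $H(U_{1},U_{2},\ldots,U_{m})\leq\sum_{i}H(U_{i})$ , we have

$$-\sum_{i_{1},\ldots,i_{m}\in[n]}s_{i_{1},\ldots,i_{m}}\log(s_{i_{1},\ldots,i_{m}})\leq-\sum_{i\in[m]}\sum_{j\in[n]}\frac{p_{i}(j)}{T}\log\Big(\frac{p_{i}(j)}{T}\Big).$$

Define $\mathbf{R}(i_{1},i_{2},\ldots,i_{m})=r_{i_{1},i_{2},\ldots,i_{m}}$ where $r_{i_{1},i_{2},\ldots,i_{m}}=Ts_{i_{1},i_{2},\ldots,i_{m}}$ . Notice that with this scaling, marginalizing out every dimension of $\mathbf{R}$ except for dimension $i$ gives the vector $\mathbf{p_{i}}$ . Then we have

$$\begin{aligned}h(\mathbf{R})&=-T\log(T)-T\sum_{i_{1},\ldots,i_{m}\in[n]}s_{i_{1},\ldots,i_{m}}\log(s_{i_{1},\ldots,i_{m}})\\&\leq-T\log(T)-\sum_{i\in[m]}\sum_{j\in[n]}p_{i}(j)\log\Big(\frac{p_{i}(j)}{T}\Big)\\&=\sum_{i\in[m]}h(\mathbf{p_{i}})+(m-1)T\log(T).\end{aligned}$$

Suppose $\mathbf{R}(i_{1},i_{2},\ldots,i_{m})=\frac{\prod_{j}p_{j}(i_{j})}{T^{m-1}}$ . Then we have

$$h(\mathbf{R})=-\sum_{i_{1},\ldots,i_{m}\in[n]}\frac{\prod_{j}p_{j}(i_{j})}{T^{m-1}}\log\Big(\frac{\prod_{j}p_{j}(i_{j})}{T^{m-1}}\Big)=\sum_{j\in[m]}h(\mathbf{p_{j}})+(m-1)T\log(T).$$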

∎
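The equality case of Lemma 5 can be verified directly for a small number of variables. The sketch below builds the scaled outer-product tensor for $m=3$ hypothetical vectors with common sum $T$, checks its marginals, and compares $h(\mathbf{R})$ with $\sum_{i}h(\mathbf{p_{i}})+(m-1)T\log(T)$ . Helper names and example values are assumptions made for illustration, and logarithms are base 2.

```python
import itertools
import math

def h(values):
    """Extended entropy -sum v*log2(v) for a nonnegative vector (0*log 0 := 0)."""
    return -sum(v * math.log2(v) for v in values if v > 0)

def scaled_outer_product_tensor(vectors):
    """Entries prod_j p_j(i_j) / T^(m-1), stored in a dict keyed by index tuples."""
    m, n = len(vectors), len(vectors[0])
    T = sum(vectors[0])
    return {
        idx: math.prod(vectors[j][idx[j]] for j in range(m)) / T ** (m - 1)
        for idx in itertools.product(range(n), repeat=m)
    }

def marginal(tensor, axis, n):
    """Sum out every dimension except `axis`."""
    out = [0.0] * n
    for idx, value in tensor.items():
        out[idx[axis]] += value
    return out

# Hypothetical vectors, each summing to T = 0.4.
vectors = [[0.2, 0.1, 0.1], [0.25, 0.05, 0.1], [0.3, 0.1, 0.0]]
m, n = len(vectors), len(vectors[0])
T = sum(vectors[0])

R = scaled_outer_product_tensor(vectors)
lemma5_bound = sum(h(p) for p in vectors) + (m - 1) * T * math.log2(T)
h_R = h(R.values())

print("h(R)  =", h_R)
print("bound =", lemma5_bound)
```

Storing the tensor as a dictionary keeps the example short; it has $n^{m}$ entries, so this check is only practical for small $n$ and $m$.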

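Following Lemma 5, the maximum contribution of the second phase to the entropy is obtained when we place the scaled outer product of the remaining probability values on the joint probability tensor. The remaining probabilities after Phase I are $l_{i}$ for $X_{i}$ for all $i\in[m]$ . The remaining probability mass is $\sum_{i}l_{j}(i)=T,\forall j\in[m]$ . Thus, in Phase II, the vectors $l_{j},j\in[m]$ contribute an entropy of $H_{Ph2}\leq\sum_{j}h(l_{j})+(m-1)T\log{T}$ . Finally, we can write

$$\begin{aligned}H(U)&\leq H_{Ph1}+H_{Ph2}\\&\leq\sum_{j\in[m]}\alpha_{j}H(X_{j})+1+(m-1)T\log(T)+\sum_{j\in[m]}(1-\alpha_{j})h(l_{j})\\&\leq H^{*}(X_{1},\ldots,X_{m})+1+(m-1)T\log(T)+\sum_{j\in[m]}h(l_{j})-\max_{k\in[m]}h(l_{k}).\end{aligned}\tag{18}$$

(18) is obtained by selecting $\alpha_{j}=1$ for $j=\arg\max_{k}\{h(l_{k})\}$ , and through the bound

$$H^{*}(X_{1},\ldots,X_{m})\geq\max_{j\in[m]}H(X_{j})\geq\sum_{j\in[m]}\alpha_{j}H(X_{j}).$$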

∎

5.1 A special family of distributions

Let $X_{1}$ be a random variable uniformly distributed over $n$ states, i.e., $\mu_{1}(i)=1/n,\forall i\in[n]$ . Let $X_{2}$ have the distribution $\mu_{2}$ defined as follows: $\mu_{2}(i)=\frac{\alpha}{n},\forall i\in[n/2]$ and $\mu_{2}(i)=\frac{2-\alpha}{n},\forall i\in\{n/2+1,n/2+2,\ldots,n\}$ , where $1<\alpha<2$ . One can check that $\mu_{2}$ sums to 1 with this parameterization. Calculating the entropies of $X_{1}$ and $X_{2}$ yields $H(X_{1})=\log(n),H(X_{2})=\log(n)-\frac{\alpha}{2}\log(\alpha)-\frac{2-\alpha}{2}\log(2-\alpha)$ . Running Algorithm 2 on $X_{1}$ and $X_{2}$ , we have the following:

[TABLE]

where in (20) we used the reparameterization $\alpha=\epsilon+1$ for $0<\epsilon<1$ . Since $H(U)\geq H^{*}(X_{1},X_{2})$ , the algorithm outputs a joint distribution with entropy at most 1 bit away from the optimum. However, we have $h(l_{1})=h(l_{2})=\frac{\alpha-1}{2}\log\frac{n}{\alpha-1}$ . Thus, $\min\{h(l_{1}),h(l_{2})\}=\frac{\alpha-1}{2}\log\frac{n}{\alpha-1}$ , yielding a gap of at least $\frac{\alpha-1}{2}\log(n)$ . In light of this example, we believe that a tighter guarantee should be provable for the given algorithm.
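The closed form for $h(l_{1})$ and $h(l_{2})$ in this family can be confirmed numerically. The sketch below constructs $\mu_{1}$ and $\mu_{2}$ for even $n$ , forms the remainders left after taking the entrywise minimum of the sorted marginals, and compares them with $\frac{\alpha-1}{2}\log\frac{n}{\alpha-1}$ . The function names, the choice $\alpha=1.5$ and the values of $n$ are assumptions made for illustration, with logarithms taken base 2.

```python
import math

def h(values):
    """Extended entropy -sum v*log2(v) for a nonnegative vector (0*log 0 := 0)."""
    return -sum(v * math.log2(v) for v in values if v > 0)

def special_family(n, alpha):
    """mu1 is uniform; mu2 has n/2 entries alpha/n and n/2 entries (2-alpha)/n."""
    assert n % 2 == 0 and 1 < alpha < 2
    mu1 = [1.0 / n] * n
    mu2 = [alpha / n] * (n // 2) + [(2 - alpha) / n] * (n // 2)  # already sorted
    return mu1, mu2

def remainders(mu1, mu2):
    """Remainder vectors l1, l2 after removing the entrywise minimum."""
    p_m = [min(a, b) for a, b in zip(mu1, mu2)]
    l1 = [a - m for a, m in zip(mu1, p_m)]
    l2 = [b - m for b, m in zip(mu2, p_m)]
    return l1, l2

alpha = 1.5
results = {}
for n in (8, 64, 1024):
    mu1, mu2 = special_family(n, alpha)
    l1, l2 = remainders(mu1, mu2)
    closed_form = (alpha - 1) / 2 * math.log2(n / (alpha - 1))
    results[n] = (h(l1), h(l2), closed_form)
    print(n, results[n])
```

The computed values grow linearly in $\log(n)$ , which is the behavior that makes the additive term in the bound loose for this family.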

