Capturing and Interpreting Unique Information

Praveen Venkatesh; Keerthana Gurushankar; Gabriel Schamberg

arXiv:2302.11873·cs.IT·February 24, 2023


TL;DR

This paper explores the operational meaning of unique information in partial information decompositions, proposing a new PID definition with clear interpretation and analyzing its properties and connections to existing frameworks.

Contribution

It introduces a new PID definition that captures unique information with an intuitive interpretation and links it to existing PID frameworks through a Lagrangian formulation.

Findings

1. Unique information bounds decision risk.
2. New PID captures information uniquely held by variables.
3. Connections between different PID definitions are established.


Equations

$$(P_{A|B}\circ P_{B|C})(a\,|\,c) \coloneqq \int_{\mathsf{B}} P_{A|B}(a\,|\,b)\,P_{B|C}(b\,|\,c)\,db.$$

$$I\bigl(M;(X,Y)\bigr) = UI(M:X\setminus Y) + UI(M:Y\setminus X) + RI(M:X;Y) + SI(M:X;Y),$$

$$I(M;X) = UI(M:X\setminus Y) + RI(M:X;Y),$$

$$I(M;Y) = UI(M:Y\setminus X) + RI(M:X;Y).$$

$$\delta(M:X\setminus Y) \coloneqq \inf_{P_{X'|Y}\,\in\,\mathcal{C}(\mathsf{X}|\mathsf{Y})} \mathbb{E}_{P_{M}}\bigl[D(P_{X|M}\,\|\,P_{X'|Y}\circ P_{Y|M})\bigr].$$

$$RI^{\delta}(M:X;Y) \coloneqq \min\bigl\{I(M;X)-\delta(M:X\setminus Y),\; I(M;Y)-\delta(M:Y\setminus X)\bigr\}.$$

$$\widetilde{UI}(M:X\setminus Y) \coloneqq \min_{Q\in\Delta_{P}} I_{Q}(M;X\,|\,Y),$$

$$P_{Y'|X}\circ P_{X|M} = P_{Y|M}.$$

$$UI_{X}=0\;\Leftrightarrow\;Y\succcurlyeq_{M}X \quad\text{and}\quad UI_{Y}=0\;\Leftrightarrow\;X\succcurlyeq_{M}Y.$$

$$\delta^{LeCam}(M:X\setminus Y) \coloneqq \inf_{P_{X'|Y}\,\in\,\mathcal{C}(\mathsf{X}|\mathsf{Y})}\;\sup_{m\in\mathsf{M}}\;\bigl\lVert P_{X'|Y}\circ P_{Y|M=m}-P_{X|M=m}\bigr\rVert_{TV}$$

$$\mathcal{R}_{m}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \coloneqq \mathbb{E}_{X\sim P_{X|M=m}}\bigl[\mathcal{L}(\widehat{M}_{X}(X),m)\bigr]$$

$$\mathcal{R}_{m}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \leq \mathcal{R}_{m}(P_{Y|M},\widehat{M}_{Y},\mathcal{L}) + \delta^{LeCam}(M:Y\setminus X).$$

$$\bar{\mathcal{R}}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \coloneqq \mathbb{E}_{M,X}\bigl[\mathcal{L}(\widehat{M}_{X}(X),M)\bigr]$$

$$\bar{\mathcal{R}}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \leq \bar{\mathcal{R}}(P_{Y|M},\widehat{M}_{Y},\mathcal{L}) + g\bigl(\delta(M:Y\setminus X)\bigr),$$

$$\begin{aligned} UI^{\delta}(M:X\setminus Y) &= I(M;X) - RI^{\delta}(M:X;Y)\\ &= \max\bigl\{\delta(M:X\setminus Y),\; \delta(M:Y\setminus X) + I(M;X) - I(M;Y)\bigr\}\\ &\geq \delta(M:X\setminus Y), \end{aligned}$$

$$\begin{aligned} UI^{\delta}(M:Y\setminus X) &= \delta(M:Y\setminus X),\\ UI^{\delta}(M:X\setminus Y) &= I(M;X) - I(M;Y) + \delta(M:Y\setminus X). \end{aligned}$$

$$\text{Cyan}(M:X\setminus Y) \coloneqq I(M;X) - \delta(M:X\setminus Y) - I(M;Y) + \delta(M:Y\setminus X).$$

$$\begin{aligned} \widetilde{RI}(M:X;Y) &\overset{(a)}{=} I(M;X) - I_{Q^{*}}(M;X\,|\,Y)\\ &\overset{(b)}{=} I_{Q^{*}}(M;X) - I_{Q^{*}}(M;X\,|\,Y)\\ &\eqqcolon I_{Q^{*}}(M;X;Y), \end{aligned}$$

$$I_{Q^{*}}(M;X;Y) = \mathbb{E}_{m,x,y\,\sim\,Q^{*}_{MXY}}\biggl[\log\frac{Q^{*}(m,x)\,Q^{*}(m,y)\,Q^{*}(x,y)}{Q^{*}(m)\,Q^{*}(x)\,Q^{*}(y)\,Q^{*}(m,x,y)}\biggr]$$

$$\delta(M:X\setminus Y) = \inf_{P_{X'|MY}} \mathbb{E}_{M}\bigl[D_{KL}(P_{X|M}\,\|\,P_{X'|M})\bigr] \quad\text{s.t.}\quad I(M;X'\,|\,Y)=0.$$

$$\widetilde{UI}(M:X\setminus Y) = \inf_{P_{X'|MY}} I(M;X'\,|\,Y) \quad\text{s.t.}\quad \mathbb{E}_{M}\bigl[D_{KL}(P_{X|M}\,\|\,P_{X'|M})\bigr]=0,$$

$$\delta^{\lambda}(M:X\setminus Y) \coloneqq \inf_{P_{X'|MY}} \mathbb{E}_{M}\bigl[D_{KL}(P_{X|M}\,\|\,P_{X'|M})\bigr] + \lambda\, I(M;X'\,|\,Y).$$


Capturing and Interpreting Unique Information

Praveen Venkatesh
Allen Institute and University of Washington, Seattle, WA, USA

Keerthana Gurushankar
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Gabriel Schamberg
Department of Surgery, University of Auckland, New Zealand

Abstract

Partial information decompositions (PIDs), which quantify information interactions between three or more variables in terms of uniqueness, redundancy and synergy, are gaining traction in many application domains. However, our understanding of the operational interpretations of PIDs is still incomplete for many popular PID definitions. In this paper, we discuss the operational interpretations of unique information through the lens of two well-known PID definitions. We reexamine an interpretation from statistical decision theory showing how unique information upper bounds the risk in a decision problem. We then explore a new connection between the two PIDs, which allows us to develop an informal but appealing interpretation, and generalize the PID definitions using a common Lagrangian formulation. Finally, we provide a new PID definition that is able to capture the information that is unique. We also show that it has a straightforward interpretation and examine its properties.

I Introduction

Partial information decompositions (PIDs) have become a popular method for understanding the information interactions between multiple random variables. A bivariate PID seeks to decompose the information that two variables $X$ and $Y$ convey about a message $M$ , into parts that are unique to $X$ , unique to $Y$ , redundant to $X$ and $Y$ , and synergistic [1, 2, 3].

As a simple example, consider a message $M=[M_{1},M_{2},M_{3},M_{4}]$ , and two variables $X=[M_{1},M_{3},M_{4}\oplus Z]$ and $Y=[M_{2},M_{3},Z]$ , where $M_{i},Z\sim$ i.i.d. Ber $(1/2)$ and $\oplus$ represents an XOR operation between bits. Here, $X$ has one bit of unique information about $M$ , i.e., $M_{1}$ , which is not present in $Y$ . Similarly, $Y$ has one bit of unique information about $M$ , i.e., $M_{2}$ , which is not present in $X$ . There is one bit of redundant information, i.e., $M_{3}$ , which can be extracted from either $X$ or $Y$ taken alone. Finally, there is one bit of synergistic information, i.e., $M_{4}$ : this information cannot be extracted from either $X$ or $Y$ individually, but can be recovered when both are taken together.
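The bit counts in this example can be checked numerically. The sketch below enumerates the joint distribution of $(M, X, Y)$ and computes the three mutual informations that a bivariate PID must account for. The helper names (`marginal`, `entropy`, `mutual_info`) are illustrative and not from the paper.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    """Marginalize a joint pmf (dict: outcome tuple -> prob) onto positions idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def mutual_info(joint, a, b):
    """I(A;B) in bits; a and b are tuples of positions in the outcome tuple."""
    h = lambda idx: entropy(marginal(joint, idx))
    return h(a) + h(b) - h(a + b)

# Joint pmf over (M, X, Y): the five fair bits M1..M4 and Z generate everything.
joint = defaultdict(float)
for m1, m2, m3, m4, z in product((0, 1), repeat=5):
    m = (m1, m2, m3, m4)
    x = (m1, m3, m4 ^ z)
    y = (m2, m3, z)
    joint[(m, x, y)] += 1 / 32

i_mx = mutual_info(joint, (0,), (1,))     # I(M;X)
i_my = mutual_info(joint, (0,), (2,))     # I(M;Y)
i_mxy = mutual_info(joint, (0,), (1, 2))  # I(M;(X,Y))
print(i_mx, i_my, i_mxy)
```

The values come out as 2, 2 and 4 bits, which matches one unique bit plus one redundant bit in each of $X$ and $Y$, and one additional synergistic bit when both are observed.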

PIDs have found applications in various fields, from neuroscience (where one may want to examine the interaction between stimuli, neural activity and behavioral response) [4, 5] to financial markets [6]. Recent works have also used PIDs to explain how information complexity decreases through the layers of a deep neural network [7], as well as to develop new measures of fairness in machine learning [8].

Despite increasingly widespread adoption, there is still no consensus on how PIDs should be defined, or on how to operationally interpret partial information quantities (e.g., see [9, 10]). One popular approach for operational interpretations has relied on the concept of Blackwell sufficiency from statistical decision theory. Blackwell sufficiency is a formal way to determine whether $X$ contains all of the information that $Y$ has about $M$ . Thus, it becomes a natural basis for discussing how two variables carry information about a message. For example, Kolchinsky [10] uses it to operationalize measures of redundancy and “union” information.

Here, we restrict our attention to interpretations of unique information. Bertschinger et al. [3] used Blackwell sufficiency to motivate a definition of unique information. But their interpretation only addressed whether or not the unique information was zero or non-zero, and did not provide an interpretation for the quantity of unique information. More recently, Banerjee et al. [2] and Rauh et al. [11] interpreted the quantification of unique information in terms of a secret key rate using a context from information-theoretic security. However, such an interpretation is difficult to translate to other contexts like neuroscience, where there may not be an analog for an eavesdropper.

This paper discusses two PID definitions based on Blackwell sufficiency [2, 3], and provides an operational interpretation of the quantity of unique information in each case. Extending classical results on so-called “deficiency” measures [12, 13], and clarifying results in [2], we show that the unique information about $M$ present in $X$ w.r.t. $Y$ upper bounds the difference in risk attained in a decision problem, when one uses $X$ rather than $Y$ to make decisions pertaining to $M$ (Sections III-A, III-B, and III-D).

We then identify a previously unrecognized connection between the aforementioned PIDs, which shows that the two definitions swap the objective and constraint in their respective optimizations (Section III-E). This discovery allows us to clarify how these definitions are related to Blackwell sufficiency, and provide an informal but appealing interpretation for them (Section III-F). Finally, we develop a novel generalization of the two PIDs, through a common Lagrangian (Section III-G). In the process, we also explicitly raise an issue pertaining to symmetrization of redundancy, and show how it complicates the interpretation of unique information (Sections III-B, III-C).

Lastly, in Section IV, we propose a new PID definition that captures the part of $M$ that is unique in the form of a random variable. We hinted at this PID in our previous work [14], without defining it or discussing its properties. Here, we define the PID formally through redundancy symmetrization, show that it forms a valid non-negative decomposition and that it obeys intuitive bounds. We also show that this PID definition is Blackwellian [15] when $M$ , $X$ and $Y$ are jointly Gaussian.

II Background

II-A Notation

• Let $M$, $X$ and $Y$ be three random variables with sample spaces $\mathsf{M}$, $\mathsf{X}$ and $\mathsf{Y}$ respectively, and joint density $P_{MXY}$.

• Let $\mathcal{C}(\mathsf{A}\,|\,\mathsf{B})$ denote the set of all channels from $\mathsf{B}$ to $\mathsf{A}$, so for example, $P_{X|M}\in\mathcal{C}(\mathsf{X}\,|\,\mathsf{M})$.

• Let $\circ$ denote composition of channels, i.e. $\forall\;a\in\mathsf{A},c\in\mathsf{C}$,

$$(P_{A|B}\circ P_{B|C})(a\,|\,c) \coloneqq \int_{\mathsf{B}} P_{A|B}(a\,|\,b)\,P_{B|C}(b\,|\,c)\,db.$$

• To keep the exposition simple, we ignore any measure-theoretic nuances. All conditional distributions and information measures are assumed to be well-defined.
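For finite alphabets, the integral in the channel composition becomes a sum, and composition reduces to a product of row-stochastic matrices. A minimal sketch, assuming channels are stored as nested lists with `channel[input][output]`; the function name `compose` and the erasure-channel example are illustrative and not from the paper:

```python
def compose(p_a_given_b, p_b_given_c):
    """Return P_{A|B} o P_{B|C}: rows indexed by c, columns indexed by a."""
    n_a = len(p_a_given_b[0])
    return [
        [sum(row_c[b] * p_a_given_b[b][a] for b in range(len(row_c)))
         for a in range(n_a)]
        for row_c in p_b_given_c
    ]

# P_{B|C}: binary erasure channel with erasure probability 0.5.
# Outputs are indexed as 0, 1, erasure.
p_b_given_c = [[0.5, 0.0, 0.5],
               [0.0, 0.5, 0.5]]

# P_{A|B}: pass 0 and 1 through, replace an erasure by a fair coin flip.
p_a_given_b = [[1.0, 0.0],
               [0.0, 1.0],
               [0.5, 0.5]]

p_a_given_c = compose(p_a_given_b, p_b_given_c)
print(p_a_given_c)  # a binary symmetric channel with crossover probability 0.25
```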

II-B Defining PIDs

There are many notions of partial information decompositions: we focus here on the bivariate case, which decomposes the information that two variables $X$ and $Y$ have about a message $M$. Such a PID is typically defined by a set of four functions of the joint distribution $P_{MXY}$, denoted $UI(M:X\setminus Y)$, $UI(M:Y\setminus X)$, $RI(M:X;Y)$ and $SI(M:X;Y)$ (or $UI_{X}$, $UI_{Y}$, $RI$ and $SI$ respectively for brevity), which satisfy the following basic equations:

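$$I\bigl(M;(X,Y)\bigr) = UI(M:X\setminus Y) + UI(M:Y\setminus X) + RI(M:X;Y) + SI(M:X;Y), \tag{1}$$

$$I(M;X) = UI(M:X\setminus Y) + RI(M:X;Y), \tag{2}$$

$$I(M;Y) = UI(M:Y\setminus X) + RI(M:X;Y). \tag{3}$$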

Equation (1) implies that the total mutual information about $M$ conveyed by $X$ and $Y$ is the sum of four partial information components: one unique to $X$, one unique to $Y$, another redundant to both $X$ and $Y$, and the last which is synergistic, respectively. Equations (2) and (3) enforce that the individual mutual information of $X$ or $Y$ with $M$ is the sum of the redundant information and the corresponding unique information. (Typically, it is also assumed that the redundant and synergistic components are symmetric in $X$ and $Y$.) These equations impose three constraints on the four partial information components, such that defining any one component suffices to specify the other three.
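The claim that one component determines the other three can be made concrete. The sketch below solves the three constraints for the remaining atoms once a redundancy value is supplied. The function name `pid_from_redundancy` is illustrative, and the input values are those of the introductory XOR example with one redundant bit.

```python
def pid_from_redundancy(i_mx, i_my, i_mxy, ri):
    """Solve the three PID constraints for UI_X, UI_Y and SI, given RI."""
    ui_x = i_mx - ri                  # I(M;X) = UI_X + RI
    ui_y = i_my - ri                  # I(M;Y) = UI_Y + RI
    si = i_mxy - ui_x - ui_y - ri     # I(M;(X,Y)) = UI_X + UI_Y + RI + SI
    return {"UI_X": ui_x, "UI_Y": ui_y, "RI": ri, "SI": si}

# I(M;X) = 2, I(M;Y) = 2, I(M;(X,Y)) = 4 bits, with RI = 1 bit.
atoms = pid_from_redundancy(2.0, 2.0, 4.0, 1.0)
print(atoms)
```

Nothing in these constraints forces the resulting atoms to be non-negative; that depends on how the supplied component is defined.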

In this paper we discuss the operational interpretations of two existing PID definitions due to [2] and [3] in Section III, and then introduce a new PID definition in Section IV. We state here the first two definitions as defined originally, and later we present modified forms which are more interpretable.

Definition 1 ( $\delta$ -PID [2]).

Let the (weighted output) deficiency of $Y$ with respect to $X$ about $M$ (deficiency was introduced by Le Cam to quantify a departure from Blackwell sufficiency) be defined as

$$\delta(M:X\setminus Y) \coloneqq \inf_{P_{X'|Y}\,\in\,\mathcal{C}(\mathsf{X}|\mathsf{Y})} \mathbb{E}_{P_{M}}\bigl[D(P_{X|M}\,\|\,P_{X'|Y}\circ P_{Y|M})\bigr].$$

The reason for this notation is that the deficiency of $Y$ w.r.t. $X$ translates to the unique information present in $X$ and not in $Y$.

Then, the deficiency-based redundant information about $M$ present in $X$ and $Y$ is given by

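$$RI^{\delta}(M:X;Y) \coloneqq \min\bigl\{I(M;X)-\delta(M:X\setminus Y),\; I(M;Y)-\delta(M:Y\setminus X)\bigr\}. \tag{5}$$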

Using equations (1)–(3), $RI^{\delta}$ fully determines the $\delta$-PID, i.e. $UI^{\delta}_{X}$, $UI^{\delta}_{Y}$, and $SI^{\delta}$.

Definition 2 ( $\sim$ -PID [3, 16]; also called the BROJA-PID in the literature, after the authors of [3]).

The unique information about $M$ present in $X$ and not in $Y$ is given by

$$\widetilde{UI}(M:X\setminus Y) \coloneqq \min_{Q\in\Delta_{P}} I_{Q}(M;X\,|\,Y), \tag{6}$$

where $\Delta_{P}\coloneqq\{Q_{MXY}:Q_{MX}=P_{MX},\;Q_{MY}=P_{MY}\}$ and $I_{Q}(\cdot\,|\,\cdot)$ is the conditional mutual information over the joint distribution $Q_{MXY}$ .

As with the $\delta$ -PID, equations (1)–(3) fully determine the remaining components of the $\sim$ -PID.
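Computing the $\sim$-PID exactly requires solving the optimization over $\Delta_{P}$, but any single member of $\Delta_{P}$ already gives an upper bound on the unique information. The sketch below evaluates $I_{Q}(M;X\,|\,Y)$ at two members of $\Delta_{P}$ for the introductory XOR example: $Q=P$ itself, and the coupling that makes $X$ and $Y$ conditionally independent given $M$. That choice of coupling is an illustrative assumption and is not claimed to be the optimizer.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def cond_mi(joint, a, b, c):
    """I(A;B|C) in bits; a, b, c are tuples of positions in the outcome."""
    h = lambda idx: entropy(marginal(joint, idx))
    return h(a + c) + h(b + c) - h(a + b + c) - h(c)

# P over (M, X, Y) for the XOR example from the introduction.
p = defaultdict(float)
for m1, m2, m3, m4, z in product((0, 1), repeat=5):
    p[((m1, m2, m3, m4), (m1, m3, m4 ^ z), (m2, m3, z))] += 1 / 32

# Candidate Q(m,x,y) = P(m,x) P(m,y) / P(m): keeps the (M,X) and (M,Y) marginals.
p_m, p_mx, p_my = marginal(p, (0,)), marginal(p, (0, 1)), marginal(p, (0, 2))
q = defaultdict(float)
for (m, x), pmx in p_mx.items():
    for (m_other, y), pmy in p_my.items():
        if m_other == m:
            q[(m, x, y)] += pmx * pmy / p_m[(m,)]

# Check membership in Delta_P: both pairwise marginals with M must match P.
in_delta_p = all(
    abs(marginal(q, idx)[key] - value) < 1e-12
    for idx in ((0, 1), (0, 2))
    for key, value in marginal(p, idx).items()
)

bound_from_p = cond_mi(p, (0,), (1,), (2,))  # I_P(M;X|Y)
bound_from_q = cond_mi(q, (0,), (1,), (2,))  # I_Q(M;X|Y)
print(in_delta_p, bound_from_p, bound_from_q)
```

Evaluating at $P$ gives 2 bits, while the conditionally independent coupling gives 1 bit, so $\widetilde{UI}(M:X\setminus Y)\leq 1$ bit in this example. The drop reflects that a coupling in $\Delta_{P}$ can break the dependence between $X$ and $Y$ that carries the synergistic bit.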

II-C Blackwell sufficiency and Blackwellian PIDs

Blackwell sufficiency provides a partial order between random variables based on how informative they are about a message $M$ . This notion was used by [3] to provide an operational motivation for the $\sim$ -PID, and also underlies the basis of the $\delta$ -PID [2].

Definition 3 (Blackwell sufficiency: $\succcurlyeq_{M}$ ).

We say that a channel $P_{X|M}$ is Blackwell sufficient w.r.t. another channel $P_{Y|M}$ (denoted $X\succcurlyeq_{M}Y$ ) if $\exists\;P_{Y^{\prime}|X}\in\mathcal{C}(\mathsf{Y}\,|\,\mathsf{X})$ such that

$$P_{Y'|X}\circ P_{X|M} = P_{Y|M}.$$

Intuitively, $X\succcurlyeq_{M}Y$ means that we can generate a new random variable $Y^{\prime}$ from $X$ (using the stochastic transformation $P_{Y^{\prime}|X}$) so that the effective channel from $M$ to $Y^{\prime}$ is equivalent to the original channel from $M$ to $Y$. (Blackwell sufficiency is identical to the concept of stochastic degradedness of broadcast channels [15].) It was shown by Blackwell [17] that if $X$ is Blackwell sufficient for $M$ w.r.t. $Y$, then it is always preferable to observe $X$ rather than $Y$ for making decisions about $M$. This operational interpretation of Blackwell sufficiency was extended to PIDs by [3]:
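A given garbling can be checked against Definition 3 directly in the finite case. In the sketch below, $X$ and $Y$ are outputs of binary symmetric channels with crossover probabilities 0.10 and 0.26, chosen for illustration so that a further binary symmetric channel with crossover 0.20 applied to $X$ reproduces $P_{Y|M}$. The helper names are hypothetical. The check only tests one candidate $P_{Y'|X}$; a `False` result for a candidate does not by itself show that no valid garbling exists.

```python
def compose(p_a_given_b, p_b_given_c):
    """Return P_{A|B} o P_{B|C}: rows indexed by c, columns indexed by a."""
    n_a = len(p_a_given_b[0])
    return [
        [sum(row_c[b] * p_a_given_b[b][a] for b in range(len(row_c)))
         for a in range(n_a)]
        for row_c in p_b_given_c
    ]

def bsc(eps):
    """Binary symmetric channel with crossover probability eps."""
    return [[1 - eps, eps], [eps, 1 - eps]]

def simulates(garbling, p_src_given_m, p_tgt_given_m, tol=1e-12):
    """True if garbling o P_{src|M} equals P_{tgt|M} entrywise, up to tol."""
    composed = compose(garbling, p_src_given_m)
    return all(
        abs(c - t) <= tol
        for row_c, row_t in zip(composed, p_tgt_given_m)
        for c, t in zip(row_c, row_t)
    )

p_x_given_m = bsc(0.10)
p_y_given_m = bsc(0.26)

# Candidate P_{Y'|X}: adding BSC(0.20) noise to X gives crossover 0.26.
x_simulates_y = simulates(bsc(0.20), p_x_given_m, p_y_given_m)

# The identity map applied to Y does not reproduce the cleaner channel to X.
identity = [[1.0, 0.0], [0.0, 1.0]]
y_identity_simulates_x = simulates(identity, p_y_given_m, p_x_given_m)
print(x_simulates_y, y_identity_simulates_x)
```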

Definition 4 (Blackwellian PID).

A bivariate PID on $P_{MXY}$ is said to be Blackwellian if

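$$UI_{X}=0\;\Leftrightarrow\;Y\succcurlyeq_{M}X \quad\text{and}\quad UI_{Y}=0\;\Leftrightarrow\;X\succcurlyeq_{M}Y.$$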

This means that (for a Blackwellian PID definition) the unique information in one variable is zero only if it is always beneficial to observe the other variable to make decisions about $M$ . Conversely, if $X$ is not Blackwell sufficient for $M$ w.r.t. $Y$ , then $Y$ must have some unique information about $M$ that $X$ cannot access.

However, it is important to note that a Blackwellian PID is only operationally motivated to the extent of whether or not the unique information is zero. It does not lend an operational interpretation as to the volume of unique information when it is non-zero.

III Interpreting the $\delta$ - and $\sim$ -PIDs

III-A Deficiency upper bounds the difference in risk

The $\delta$ -PID derives its operational interpretation directly from that of deficiency [18, 12], upon which it is based. The deficiency of $Y$ w.r.t. $X$ , originally defined by Le Cam [18], measures how far from Blackwell sufficient $Y$ is, w.r.t. $X$ .

Le Cam’s original notion of deficiency was defined using the total variation distance, and as a worst case over realizations of $M$ . That was a frequentist context, where $M$ was a statistical parameter and not a random variable. Following Raginsky [19], the Le Cam deficiency of $Y$ w.r.t. $X$ about $M$ is:

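$$\delta^{LeCam}(M:X\setminus Y) \coloneqq \inf_{P_{X'|Y}\,\in\,\mathcal{C}(\mathsf{X}|\mathsf{Y})}\;\sup_{m\in\mathsf{M}}\;\bigl\lVert P_{X'|Y}\circ P_{Y|M=m}-P_{X|M=m}\bigr\rVert_{TV}$$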

The Le Cam deficiency can be interpreted as upper bounding the difference in risk (for any bounded loss function) when using $X$ rather than $Y$ to make decisions based on $M$ . We can state this formally, using the setup of a decision problem:

Definition 5 (Decision problem).

Suppose we need to perform actions based on the value of $M$ , which we cannot observe directly (e.g., we may want to estimate the value of $M$ ). We have access to either $X\sim P_{X|M}$ or $Y\sim P_{Y|M}$ , which can give us information about $M$ . The actions we take after observing either $X$ or $Y$ —call these $\widehat{M}_{X}(x)$ and $\widehat{M}_{Y}(y)$ respectively—incur a bounded loss that depends on the chosen action and the value of $M$ . Let $\mathcal{L}(\widehat{M}(\cdot),M)$ ( $\lVert\mathcal{L}\rVert_{\infty}\leq 1$ ) be the loss function, where $\widehat{M}(\cdot)$ may be either $\widehat{M}_{X}(x)$ or $\widehat{M}_{Y}(y)$ , depending on whether we choose to observe $X$ or $Y$ . How do we decide whether to choose $X$ or $Y$ when we do not know $\mathcal{L}$ ?

Blackwell [17] showed that if $X\succcurlyeq_{M}Y$ , we can always attain a lower loss (on average) by choosing $X$ . What happens when Blackwell sufficiency does not hold? Define the risk as the expected loss over either $X$ or $Y$ :

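$$\mathcal{R}_{m}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \coloneqq \mathbb{E}_{X\sim P_{X|M=m}}\bigl[\mathcal{L}(\widehat{M}_{X}(X),m)\bigr]$$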

If Blackwell sufficiency does not hold, then the worst-case risk (over $M$) when you choose $X$ is at most that when you choose $Y$, plus the Le Cam deficiency of $X$ [12, 13] (recall that the deficiency in $X$ is denoted $\delta(M:Y\setminus X)$, because it corresponds to the unique information in $Y$). In other words, for any $m$ and for any $\widehat{M}_{Y}$, there exists an $\widehat{M}_{X}$ such that

$$\mathcal{R}_{m}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \leq \mathcal{R}_{m}(P_{Y|M},\widehat{M}_{Y},\mathcal{L}) + \delta^{LeCam}(M:Y\setminus X).$$

Raginsky [19] showed how alternative measures like the KL-divergence may be used in place of the total variation distance, while preserving the aforementioned risk-based operational interpretation. In that work, Raginsky preserved the frequentist setting, taking the worst case divergence between $P_{X^{\prime}|M=m}$ and $P_{X|M=m}$ , over all realizations of $M$ . However, for partial information decompositions, $M$ is a random variable and thus it makes more sense to consider the expected divergence over different values of $M$ . This is what Banerjee et al. [2] did, in proposing the PID stated in Definition 1. They show that the risk-based operational interpretation extends to the new deficiency definition $\delta(M:X\setminus Y)$ [2, Prop. 8], but do not extend it to the corresponding unique information. We first state the theorem for deficiency, and show the extension in the following subsection.

Theorem 1.

Let the average risk be given by

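$$\bar{\mathcal{R}}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \coloneqq \mathbb{E}_{M,X}\bigl[\mathcal{L}(\widehat{M}_{X}(X),M)\bigr]$$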

Then, for any $\widehat{M}_{Y}$ , there exists an $\widehat{M}_{X}$ such that

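$$\bar{\mathcal{R}}(P_{X|M},\widehat{M}_{X},\mathcal{L}) \leq \bar{\mathcal{R}}(P_{Y|M},\widehat{M}_{Y},\mathcal{L}) + g\bigl(\delta(M:Y\setminus X)\bigr),$$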

where $g(\cdot)$ is a monotonically increasing function.

A proof of the above theorem is presented in Appendix A.
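The inequality in Theorem 1 can be illustrated numerically for a single candidate garbling. The sketch below uses the construction from Appendix A: the rule for $X$ first simulates $\tilde{y}\sim P_{Y'|X}$ and then applies $\widehat{M}_{Y}$. It makes several assumptions that are not from the paper: binary symmetric channels with crossover 0.10 and 0.05, a 0-1 loss, the identity as the candidate $P_{Y'|X}$, and KL-divergence measured in nats with $g(z)=\sqrt{z/2}$. Because the expected divergence of any one candidate is at least the infimum $\delta(M:Y\setminus X)$, this checks a looser bound than the theorem itself.

```python
from math import log, sqrt

def compose(p_a_given_b, p_b_given_c):
    """Return P_{A|B} o P_{B|C}: rows indexed by c, columns indexed by a."""
    n_a = len(p_a_given_b[0])
    return [[sum(row[b] * p_a_given_b[b][a] for b in range(len(row)))
             for a in range(n_a)] for row in p_b_given_c]

def bsc(eps):
    return [[1 - eps, eps], [eps, 1 - eps]]

def kl_nats(p, q):
    """KL-divergence D(p || q) in nats for two pmfs given as lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def average_risk(p_m, channel, rule, loss):
    """Expected loss over M and the observation; rule[obs] is the action."""
    return sum(p_m[m] * channel[m][obs] * loss(rule[obs], m)
               for m in range(len(p_m)) for obs in range(len(rule)))

zero_one = lambda action, m: 0.0 if action == m else 1.0

p_m = [0.5, 0.5]
p_x_given_m = bsc(0.10)           # X is the noisier observation
p_y_given_m = bsc(0.05)
rule_y = [0, 1]                   # \hat{M}_Y(y) = y

garbling = [[1.0, 0.0], [0.0, 1.0]]               # candidate P_{Y'|X}
p_ysim_given_m = compose(garbling, p_x_given_m)   # channel from M to simulated Y'

risk_y = average_risk(p_m, p_y_given_m, rule_y, zero_one)
# Applying rule_y to the simulated Y' is the stochastic rule \hat{M}_X.
risk_x = average_risk(p_m, p_ysim_given_m, rule_y, zero_one)

expected_kl = sum(p_m[m] * kl_nats(p_y_given_m[m], p_ysim_given_m[m])
                  for m in range(len(p_m)))
bound = risk_y + sqrt(expected_kl / 2)
print(risk_x, risk_y, bound)
```

Here the risk with $X$ (0.10) exceeds the risk with $Y$ (0.05), but stays below the risk with $Y$ plus $g$ of the expected divergence.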

III-B Interpreting $UI^{\delta}$ after redundancy-symmetrization

Despite the existence of a clear operational interpretation for the deficiency as defined in Definition 1, the PID that arises out of this deficiency still needs an interpretation. In particular, we need to address what happens when we symmetrize the redundancy in Equation (5). This symmetrization step is required because $I(M;X)-\delta(M:X\setminus Y)$ is not always symmetric in $X$ and $Y$ . Interestingly, this issue does not arise in the case of the $\sim$ -PID, as we discuss in Section III-C.

First, we note that the operational interpretation for unique information described by Theorem 1 is still valid, although the bound may be somewhat loose:

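$$\begin{aligned} UI^{\delta}(M:X\setminus Y) &= I(M;X) - RI^{\delta}(M:X;Y)\\ &= \max\bigl\{\delta(M:X\setminus Y),\; \delta(M:Y\setminus X) + I(M;X) - I(M;Y)\bigr\}\\ &\geq \delta(M:X\setminus Y), \end{aligned}$$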

Thus, the unique information can act as an upper bound for the difference in risk, in place of deficiency.

However, one of the two unique informations, $UI^{\delta}_{X}$ or $UI^{\delta}_{Y}$, is guaranteed to be loose in this way. We can quantify the extent of looseness as follows: suppose that $I(M;X)-\delta(M:X\setminus Y)>I(M;Y)-\delta(M:Y\setminus X)$. Then, $RI^{\delta}(M:X;Y)=I(M;Y)-\delta(M:Y\setminus X)$, and thus

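$$\begin{aligned} UI^{\delta}(M:Y\setminus X) &= \delta(M:Y\setminus X),\\ UI^{\delta}(M:X\setminus Y) &= I(M;X) - I(M;Y) + \delta(M:Y\setminus X). \end{aligned}$$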

In other words, the excess quantity added to $UI^{\delta}(M:X\setminus Y)$ , over and above the deficiency is

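$$\text{Cyan}(M:X\setminus Y) \coloneqq I(M;X) - \delta(M:X\setminus Y) - I(M;Y) + \delta(M:Y\setminus X).$$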

For lack of a better name, we call this the “cyan region”, due to how it is depicted in Figure 1. It is completely unclear what the interpretation of $\text{Cyan}(M:X\setminus Y)$ ought to be, and why this information should be considered unique to $X$ (see Figure 1).

Essentially, we pay the cost of a loose bound in $UI(M:{X}\setminus{Y})$ , and the extent of loosening does not have a clear justification of itself, except that it helps symmetrize the redundancy. This gives rise to the desire for a definition that does not require the explicit symmetrization performed in Equation (5).
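The effect of symmetrization can be traced with a small calculation. The sketch below applies the minimum from Definition 1 to given values of the two mutual informations and the two deficiencies, and reports how much each unique information exceeds its deficiency. The input numbers are hypothetical, chosen only for round arithmetic, and are not derived from any particular distribution. The function name `delta_pid_atoms` is illustrative.

```python
def delta_pid_atoms(i_mx, i_my, delta_x, delta_y):
    """delta-PID redundancy and unique informations from symmetrization.

    delta_x stands for delta(M : X \\ Y) and delta_y for delta(M : Y \\ X).
    """
    ri = min(i_mx - delta_x, i_my - delta_y)
    ui_x = i_mx - ri
    ui_y = i_my - ri
    return {
        "RI": ri,
        "UI_X": ui_x,
        "UI_Y": ui_y,
        "cyan_X": ui_x - delta_x,   # excess of UI_X over its deficiency
        "cyan_Y": ui_y - delta_y,   # excess of UI_Y over its deficiency
    }

# Hypothetical values (bits): I(M;X)=1, I(M;Y)=0.5, deficiencies 0.25 and 0.125.
atoms = delta_pid_atoms(1.0, 0.5, 0.25, 0.125)
print(atoms)
```

With these inputs the minimum is attained by the $Y$ term, so $UI^{\delta}_{Y}$ equals its deficiency while $UI^{\delta}_{X}$ exceeds its deficiency by $I(M;X)-\delta(M:X\setminus Y)-I(M;Y)+\delta(M:Y\setminus X)=0.375$ bits.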

III-C The $\sim$ -PID redundancy is intrinsically symmetric

In a stroke of serendipity, the redundancy under the $\sim$ -PID of Definition 2 is naturally symmetric in $X$ and $Y$ [3]. Let $Q^{*}$ be the joint distribution that achieves the optimum in Equation (6). Then,

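$$\begin{aligned} \widetilde{RI}(M:X;Y) &\overset{(a)}{=} I(M;X) - I_{Q^{*}}(M;X\,|\,Y)\\ &\overset{(b)}{=} I_{Q^{*}}(M;X) - I_{Q^{*}}(M;X\,|\,Y)\\ &\eqqcolon I_{Q^{*}}(M;X;Y), \end{aligned}$$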

where (a) invokes Definition 2, (b) uses the constraint that $Q^{*}\in\Delta_{P}$ so that $Q^{*}(m,x)=P(m,x)$ , and $I_{Q^{*}}(M;X;Y)$ is the multivariate mutual information (the negative of which is also sometimes called the interaction information) on the distribution $Q^{*}$ , which can be expressed as shown below (e.g., using the standard formulae from [20, Ch. 2]):

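$$I_{Q^{*}}(M;X;Y) = \mathbb{E}_{m,x,y\,\sim\,Q^{*}_{MXY}}\biggl[\log\frac{Q^{*}(m,x)\,Q^{*}(m,y)\,Q^{*}(x,y)}{Q^{*}(m)\,Q^{*}(x)\,Q^{*}(y)\,Q^{*}(m,x,y)}\biggr]$$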

Thus, $\widetilde{RI}$ becomes equal to the multivariate mutual information on $Q^{*}$ , which is symmetric in $x$ and $y$ by definition.
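The symmetry of the multivariate mutual information holds for any joint distribution, not only for $Q^{*}$, and is easy to confirm numerically. The sketch below uses an arbitrary joint pmf over three bits (the weights are illustrative) and compares $I(M;X)-I(M;X\,|\,Y)$, the same expression with $X$ and $Y$ exchanged, and the expected log-ratio form.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Arbitrary strictly positive joint pmf over (m, x, y) in {0,1}^3.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
joint = {o: w / 36 for o, w in zip(product((0, 1), repeat=3), weights)}

h = lambda idx: entropy(marginal(joint, idx))

def mi(a, b):
    return h(a) + h(b) - h(a + b)

def cmi(a, b, c):
    return h(a + c) + h(b + c) - h(a + b + c) - h(c)

M, X, Y = (0,), (1,), (2,)
via_x = mi(M, X) - cmi(M, X, Y)   # I(M;X) - I(M;X|Y)
via_y = mi(M, Y) - cmi(M, Y, X)   # I(M;Y) - I(M;Y|X)

# Expected log-ratio form of the multivariate mutual information.
q_m, q_x, q_y = marginal(joint, M), marginal(joint, X), marginal(joint, Y)
q_mx, q_my, q_xy = marginal(joint, (0, 1)), marginal(joint, (0, 2)), marginal(joint, (1, 2))
via_ratio = sum(
    p * log2(q_mx[(m, x)] * q_my[(m, y)] * q_xy[(x, y)]
             / (q_m[(m,)] * q_x[(x,)] * q_y[(y,)] * p))
    for (m, x, y), p in joint.items()
)
print(via_x, via_y, via_ratio)
```

All three values agree up to floating-point error, since each equals $H(M)+H(X)+H(Y)-H(M,X)-H(M,Y)-H(X,Y)+H(M,X,Y)$.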

Since the $\sim$ -PID has a naturally symmetric redundancy, we might want to examine whether it shares the risk-based operational interpretation of the $\delta$ -PID. We examine this, as well as alternative interpretations, in the following sections.

III-D $\widetilde{UI}$ upper bounds the difference in risk

The unique information of the $\sim$ -PID, $\widetilde{UI}_{X}$ , also acts as an upper bound for the difference in risk when choosing $Y$ rather than $X$ in the decision problem from Definition 5. This follows directly from a result of Bertschinger et al. [3], which states that $\widetilde{UI}_{X}$ upper bounds the unique information of any other PID definition that satisfies what they call “Assumption $(*)$ ”. According to this assumption, a definition of unique information should depend only on $P_{M}$ , $P_{X|M}$ and $P_{Y|M}$ , and not on the whole joint distribution $P_{MXY}$ . Since the $\delta$ -PID satisfies Assumption $(*)$ , we have that $\widetilde{UI}_{X}\geq UI^{\delta}_{X}$ , which implies that Theorem 1 extends to the $\sim$ -PID as well, although the upper bound may be loose.

III-E A connection between the $\sim$ -PID and the $\delta$ -PID

We now present a previously unidentified connection between these two PIDs, and use this connection to develop an intuitive interpretation for both PIDs.

First, observe that the $\delta$ -PID can be thought of as optimizing $P_{X^{\prime}|MY}$ instead of $P_{X^{\prime}|Y}$ , so long as we include the constraint that $M$ — $Y$ — $X^{\prime}$ forms a Markov chain. This constraint can also be written as $I(M;X^{\prime}\,|\,Y)=0$ . Thus, abbreviating $P_{X^{\prime}|Y}\circ P_{Y|M}$ as $P_{X^{\prime}|M}$ , we can write the deficiency as:

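$$\delta(M:X\setminus Y) = \inf_{P_{X'|MY}} \mathbb{E}_{M}\bigl[D_{KL}(P_{X|M}\,\|\,P_{X'|M})\bigr] \quad\text{s.t.}\quad I(M;X'\,|\,Y)=0. \tag{24}$$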

Next, we note that the definition of $\sim$ -PID can also be rewritten into a similar form. The optimization variable $Q$ in Definition 2 obeys the constraints that $Q_{MX}=P_{MX}$ and $Q_{MY}=P_{MY}$ . Suppose we change notation by introducing a new random variable $X^{\prime}$ using the stochastic transformation $P_{X^{\prime}|MY}$ , but which also obeys $P_{X^{\prime}M}=P_{XM}$ —or equivalently, $P_{X^{\prime}|M}=P_{X|M}$ . Then, the distribution $P_{MX^{\prime}Y}$ plays exactly the same role as $Q_{MXY}$ , and obeys precisely the same constraints. Thus, the $\sim$ -PID definition can also be written as:

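$$\widetilde{UI}(M:X\setminus Y) = \inf_{P_{X'|MY}} I(M;X'\,|\,Y) \quad\text{s.t.}\quad \mathbb{E}_{M}\bigl[D_{KL}(P_{X|M}\,\|\,P_{X'|M})\bigr]=0, \tag{25}$$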

where the constraint $P_{X^{\prime}|M}=P_{X|M}$ has been expressed in terms of zero expected KL-divergence between the channels.

This reveals the remarkable similarity between the $\delta$ - and $\sim$ -PIDs as written in Equations (24) and (25). The two PIDs are essentially optimizing over the same quantities, but in effect, interchange objective and constraint.

III-F Clarifying the connection to Blackwell sufficiency, and a new informal interpretation

Using the newfound connection between the $\delta$ - and $\sim$ -PIDs, we can clarify their connection to Blackwell sufficiency, and provide an informal interpretation.

First, Blackwell sufficiency can be re-understood as follows. $Y\succcurlyeq_{M}X$ if two requirements are met:

(i) there must exist a random variable $X^{\prime}$ that is derived from $Y$ through the stochastic transformation $P_{X^{\prime}|Y}$ , i.e., $M$ — $Y$ — $X^{\prime}$ must be a Markov chain; and

(ii) $X^{\prime}$ must act as a “copy” of $X$ w.r.t. $M$, in the sense that $P_{X^{\prime}|M}=P_{X|M}$ (this is similar to the “simulatable” notion presented in [2, Defn. 38]).

When $Y\not\succcurlyeq_{M}X$, the $\delta$-PID and the $\sim$-PID quantify departures from Blackwell sufficiency in two different ways (also see Figure 2):

(i) the $\delta$ -PID enforces the Markov chain and measures how far we are from a copy (refer Eq. 24);

(ii) the $\sim$ -PID enforces the copy and measures how far we are from having a Markov chain (refer Eq. 25).

This unified explanation of the $\delta$ - and $\sim$ -PIDs has not been identified in the literature previously, to our knowledge.

We can also use this picture to offer a new informal interpretation. If Alice and Bob opt for $X$ and $Y$ respectively in the decision problem of Definition 5, the deficiency $\delta_{X}$ measures the closest that Bob can come to emulating Alice (on average, for the worst loss $\mathcal{L}$ for Bob). On the other hand, $\widetilde{UI}_{X}$ measures the minimum number of bits Bob needs to borrow from $M$ in order to emulate Alice perfectly.

III-G A novel generalization of the $\delta$ - and $\sim$ -PIDs

The connection identified above also allows us to generalize both definitions using a single Lagrangian form:

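$$\delta^{\lambda}(M:X\setminus Y) \coloneqq \inf_{P_{X'|MY}} \mathbb{E}_{M}\bigl[D_{KL}(P_{X|M}\,\|\,P_{X'|M})\bigr] + \lambda\, I(M;X'\,|\,Y). \tag{26}$$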

As $\lambda\to\infty$ in Equation (26), we get the $\delta$-PID, and as $\lambda\to 0$, we get the $\sim$-PID. This new $\delta^{\lambda}$-PID has to be written in terms of a deficiency and then symmetrized as in Definition 1, since its redundancy will not be symmetric in general.
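The trade-off governed by $\lambda$ can be seen by evaluating the Lagrangian objective, taken here to be the expected KL-divergence between $P_{X|M}$ and $P_{X'|M}$ plus $\lambda\,I(M;X'\,|\,Y)$, at two extreme candidates for $P_{X'|MY}$. The sketch below does not compute the infimum. It assumes a toy setup that is not from the paper: $M\sim\text{Ber}(1/2)$, $X=M$, and $Y$ the output of a binary symmetric channel with crossover 0.2, with all quantities in nats. The candidate `copy_y` satisfies the Markov chain $M$ — $Y$ — $X'$, while `copy_m` satisfies the copy constraint $P_{X'|M}=P_{X|M}$.

```python
from collections import defaultdict
from math import inf, log

def marginal(joint, idx):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(dist):
    return -sum(p * log(p) for p in dist.values() if p > 0)

def lagrangian(p_my, p_x_given_m, kernel, lam):
    """E_M[KL(P_{X|M} || P_{X'|M})] + lam * I(M;X'|Y), in nats.

    kernel(m, y) returns the pmf of X' given (m, y) as a dict.
    """
    joint = defaultdict(float)           # pmf over (m, x', y)
    for (m, y), p in p_my.items():
        for x_sim, k in kernel(m, y).items():
            joint[(m, x_sim, y)] += p * k
    h = lambda idx: entropy(marginal(joint, idx))
    cond_mi = h((0, 2)) + h((1, 2)) - h((0, 1, 2)) - h((2,))
    p_m, p_mx_sim = marginal(joint, (0,)), marginal(joint, (0, 1))
    expected_kl = 0.0
    for (m,), pm in p_m.items():
        for x, px in p_x_given_m[m].items():
            if px > 0:
                qx = p_mx_sim.get((m, x), 0.0) / pm
                if qx == 0:
                    return inf           # KL is infinite off the support
                expected_kl += pm * px * log(px / qx)
    return expected_kl + lam * cond_mi

p_my = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x_given_m = {0: {0: 1.0}, 1: {1: 1.0}}

copy_y = lambda m, y: {y: 1.0}   # X' = Y: Markov chain holds, copy is imperfect
copy_m = lambda m, y: {m: 1.0}   # X' = M: perfect copy, Markov chain violated

j_y_small, j_m_small = (lagrangian(p_my, p_x_given_m, k, 0.1) for k in (copy_y, copy_m))
j_y_large, j_m_large = (lagrangian(p_my, p_x_given_m, k, 10.0) for k in (copy_y, copy_m))
print(j_y_small, j_m_small, j_y_large, j_m_large)
```

For the small multiplier the perfect copy has the lower objective, and for the large multiplier the Markov-chain candidate wins, which mirrors how the two PIDs sit at opposite ends of the same objective.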

IV Capturing the Unique Information

In this section, we propose a new PID definition that is able to capture the unique information in the form of a random variable. The quantity of unique information also has a simple operational interpretation in terms of mutual information.

Definition 6 ( $I$ -PID).

Let the information deficiency of $Y$ with respect to $X$ about $M$ be given by

[TABLE]

Here, $T$ is a random variable produced through the stochastic transformation $P_{T|M}$ , and satisfies the Markov chain $T$ — $M$ — $(X,Y)$ . Then, the redundant information may be defined as

[TABLE]

This definition is appealing, since it captures the basic intuition that if $X$ has unique information about $M$ with respect to $Y$ , that means that $X$ has information about some “part” of $M$ which $Y$ does not have access to. In practice, this could mean either that $X$ is able to access entire “dimensions” of $M$ that $Y$ cannot, or it could mean that $X$ has access to some of the same dimensions of $M$ as $Y$ , but with lower noise, or it could be a combination of these factors. In this definition, the stochastic transformation $P_{T|M}$ plays the role of extracting these “parts” of $M$ , which $X$ has access to but $Y$ does not. The random variable $T$ corresponding to the optimal $P_{T|M}$ tells us the “parts” (or subspaces) of $M$ in which $X$ has unique information w.r.t. $Y$ .

The operational interpretation for this definition is simply this: the unique information that $X$ has about $M$ with respect to $Y$ is the maximum information about $M$ which you can extract from $X$ , which you cannot simultaneously get from $Y$ . That is, for any (possibly stochastic) function $f$ that depends only on $M$ , we will always have

[TABLE]

However, due to the need for symmetrization, this definition does suffer from the cyan region problem described in Section III-B. This is one area where we still need to work on understanding its interpretation.
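The role of $T$ can be illustrated on the introductory XOR example by evaluating the gap $I(T;X)-I(T;Y)$ for a few deterministic choices $T=f(M)$. This gap is the quantity used in the proof of Theorem 3. The sketch assumes, as that proof suggests, that the information deficiency optimizes this gap over $P_{T|M}$, so each candidate only supplies a lower bound. The candidate functions and helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def marginal(joint, idx):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def mutual_info(joint, a, b):
    h = lambda idx: entropy(marginal(joint, idx))
    return h(a) + h(b) - h(a + b)

# Joint pmf over (M, X, Y) for the XOR example from the introduction.
joint = defaultdict(float)
for m1, m2, m3, m4, z in product((0, 1), repeat=5):
    joint[((m1, m2, m3, m4), (m1, m3, m4 ^ z), (m2, m3, z))] += 1 / 32

def gap(f):
    """I(T;X) - I(T;Y) in bits for the deterministic candidate T = f(M)."""
    txy = defaultdict(float)
    for (m, x, y), p in joint.items():
        txy[(f(m), x, y)] += p
    return mutual_info(txy, (0,), (1,)) - mutual_info(txy, (0,), (2,))

candidates = {
    "M1": lambda m: m[0],
    "M2": lambda m: m[1],
    "M3": lambda m: m[2],
    "M4": lambda m: m[3],
    "M": lambda m: m,
}
gaps = {name: gap(f) for name, f in candidates.items()}
print(gaps)
```

Among these candidates, only $T=M_{1}$ yields a positive gap (1 bit), which matches the bit that the introduction identifies as unique to $X$. Taking $T=M$ yields zero, because the information $Y$ has about $M_{2}$ cancels the information $X$ has about $M_{1}$.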

In what follows, we prove some basic properties about the $I$ -PID, and show that it is Blackwellian for Gaussian $P_{MXY}$ .

Theorem 2 (Non-negativity and bounds on the $I$ -PID).

The $I$ -PID atoms can be shown to be non-negative:

[TABLE]

The $I$ -PID also satisfies the natural bounds:

[TABLE]

Theorem 3 (The $I$ -PID is Blackwellian for Gaussian $P_{MXY}$ ).

If $P_{MXY}$ is jointly Gaussian, then the $I$ -PID unique information satisfies:

[TABLE]

Proofs of these theorems are presented in Appendix B. In particular, Theorem 3 implies that prior results for Gaussians [15] are also applicable to the $I$ -PID. We conjecture that Theorem 3 can be generalized, i.e., the $I$ -PID is Blackwellian in general, but leave an investigation of this to future work.

Appendix A Proof of Theorem 1

Proof.

Consider the difference in average risks:

[TABLE]

Now, the last two terms of this expression can be bounded using the bound on $\mathcal{L}$ and the total variation distance:

[TABLE]

where in (a) we have used the bound on $\mathcal{L}$ and the definition of the total variation norm, in (b) we have used Pinsker’s inequality [21, Lemma 2.5], in (c) we have used Jensen’s inequality [20, Thm. 2.6.2], and in (d), we have set $g(z)\coloneqq\sqrt{z/2}$ .

It only remains to be shown that the first two terms of the expression in Equation (32) can be upper bounded by zero. Examining the first two terms, for any $\hat{M}_{Y}(y)$ , we can derive a stochastic action rule, $\hat{M}_{X}(x)$ that will attain the same risk: we can first draw $\tilde{y}\sim P_{Y^{\prime}|X}$ and then select the action $\hat{M}_{Y}(\tilde{y})$ . Thus,

[TABLE]

which completes the proof. ∎

Appendix B Proofs of Theorems 2 and 3

Proof of Theorem 2.

First, observe that

[TABLE]

Furthermore,

[TABLE]

where the last inequality follows by the data processing inequality and the Markov chain $T$ — $M$ — $(X,Y)$ . Thus,

[TABLE]

This implies

[TABLE]

Furthermore,

[TABLE]

where the very last inequality follows from the fact that $T\perp\!\!\!\perp(X,Y)\,|\,M$ and the data processing inequality [20, Ch. 2]. This may not be obvious, but it follows the same proof as the data processing inequality:

[TABLE]

From this it follows that

[TABLE]

where (a) follows from the fact that $I\bigl(T;X\,|\,Y,M\bigr)=0$ since $T \perp\!\!\!\perp (X,Y)\,|\,M$, while (b) uses $I\bigl(M;X\,|\,Y,T\bigr)\geq 0$. This justifies Equation (50), which implies

[TABLE]

If $UI^{I}(M:X\setminus Y)=\delta^{I}(M:X\setminus Y)$, then $SI^{I}(M:X;Y)=I(M;X\,|\,Y)-\delta^{I}(M:X\setminus Y)\geq 0$, and $SI^{I}(M:X;Y)\leq I(M;X\,|\,Y)$. This shows that all terms in the $I$-PID are non-negative and bounded. ∎
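The conditional data processing step can be checked numerically on discrete distributions. The sketch below is an illustration under stated assumptions, not the authors' code: it reads steps (a) and (b) as establishing $I(T;X\,|\,Y)\leq I(M;X\,|\,Y)$ whenever $T$ is generated from $M$ alone, uses small finite alphabets, and introduces hypothetical helpers (`cond_mutual_info`, `random_markov_joint`).

```python
import itertools
import math
import random


def cond_mutual_info(joint, a, b, c):
    """I(A;B|C) in nats; joint maps outcome tuples to probabilities,
    and a, b, c are tuples of coordinate indices into each outcome."""
    def pick(outcome, idx):
        return tuple(outcome[i] for i in idx)

    def marginal(idx):
        out = {}
        for outcome, prob in joint.items():
            key = pick(outcome, idx)
            out[key] = out.get(key, 0.0) + prob
        return out

    p_abc, p_ac = marginal(a + b + c), marginal(a + c)
    p_bc, p_c = marginal(b + c), marginal(c)
    total = 0.0
    for outcome, prob in joint.items():
        if prob > 0:
            num = p_abc[pick(outcome, a + b + c)] * p_c[pick(outcome, c)]
            den = p_ac[pick(outcome, a + c)] * p_bc[pick(outcome, b + c)]
            total += prob * math.log(num / den)
    return total


def random_markov_joint(rng, n_t=2, n_m=3, n_x=2, n_y=2):
    """Joint pmf over (t, m, x, y) where T depends on M only, so T - M - (X, Y)."""
    def normalize(weights):
        total = sum(weights)
        return [w / total for w in weights]

    p_mxy = normalize([rng.random() + 1e-3 for _ in range(n_m * n_x * n_y)])
    p_t_given_m = [normalize([rng.random() + 1e-3 for _ in range(n_t)]) for _ in range(n_m)]
    joint = {}
    triples = itertools.product(range(n_m), range(n_x), range(n_y))
    for i, (m, x, y) in enumerate(triples):
        for t in range(n_t):
            joint[(t, m, x, y)] = p_mxy[i] * p_t_given_m[m][t]
    return joint


T, M, X, Y = (0,), (1,), (2,), (3,)  # coordinate positions in each outcome
rng = random.Random(1)
checks = []
for _ in range(50):
    joint = random_markov_joint(rng)
    i_tx_y = cond_mutual_info(joint, T, X, Y)        # I(T;X|Y)
    i_mx_y = cond_mutual_info(joint, M, X, Y)        # I(M;X|Y)
    i_tx_ym = cond_mutual_info(joint, T, X, Y + M)   # I(T;X|Y,M), zero in step (a)
    checks.append((i_tx_y, i_mx_y, i_tx_ym))

inequality_holds = all(lhs <= rhs + 1e-9 for lhs, rhs, _ in checks)
print("I(T;X|Y) <= I(M;X|Y) on all trials:", inequality_holds)
```

The third quantity is computed only to confirm that the Markov structure is built correctly: it should vanish up to floating-point error.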

Proof of Theorem 3.

We need to show that when $P_{MXY}$ is jointly Gaussian,

[TABLE]

$(\Leftarrow)$ Observe that the $I$ -PID satisfies Assumption $(*)$ from Bertschinger et al. [3], i.e., $UI_{X}$ is a function only of $P_{M}$ , $P_{X|M}$ and $P_{Y|M}$ . Thus, by [3, Lemma 3], $UI^{I}_{X}\leq\widetilde{UI}_{X}$ . Since the $\sim$ -PID is Blackwellian, $Y\succcurlyeq_{M}X\;\Leftrightarrow\;\widetilde{UI}_{X}=0\Rightarrow UI^{I}_{X}=0$ .

This part of the proof holds irrespective of the distribution of $P_{MXY}$ .

$(\Rightarrow)$ Now, suppose $P_{MXY}$ is Gaussian. Then it suffices to show that whenever $Y \not\succcurlyeq_{M} X$, there exists a $P_{T|M}$ such that $I(T;X)-I(T;Y)>0$, which ensures that $UI^{I}_{X}>0$.

Following the notation of [15], let $\Sigma_{MXY}$ represent the joint covariance matrix (which fully specifies information measures on the joint distribution), let $\Sigma_{X|M}$ represent the conditional covariance matrix of $X$ given $M$, and let $\Sigma_{X,Y}$ represent the cross-covariance of $X$ and $Y$. Let $\Lambda_{X}\coloneqq\Sigma_{X,M}^{\mathsf{T}}\Sigma_{X|M}^{-1}\Sigma_{X,M}$ and $\Lambda_{Y}\coloneqq\Sigma_{Y,M}^{\mathsf{T}}\Sigma_{Y|M}^{-1}\Sigma_{Y,M}$. Then [15, Theorem 2] states

[TABLE]

where for positive semidefinite matrices $A$ and $B$ , $A\succcurlyeq B$ denotes that $A-B$ is positive semidefinite.

Consider $P_{T|M}$ to be a normal distribution, given by $\mathcal{N}(H_{T}M,\Sigma_{T|M})$ . Further, we can assume without loss of generality that $\Sigma_{M}=I$ . Then, $\Sigma_{T,X}=H_{T}\Sigma_{M}\Sigma_{M,X}=H_{T}\Sigma_{X,M}^{\mathsf{T}}$ . The mutual information between $T$ and $X$ is given by:

[TABLE]

Then,

[TABLE]

If $Y \not\succcurlyeq_{M} X$, then $\Lambda_{Y} \not\succcurlyeq \Lambda_{X}$, i.e., $\exists\;c\in\mathbb{R}$ s.t.

[TABLE]

Letting $\Sigma_{T}=I$ and $H_{T}=c$ , we have that

[TABLE]

This implies

[TABLE]

Recognizing that $UI^{\delta}(M:X\setminus Y)\geq\delta(M:X\setminus Y)$ (see Equation (15)), this completes the proof. ∎
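The strategy of the $(\Rightarrow)$ direction can be illustrated in the scalar special case, where $\Lambda_{X}$ and $\Lambda_{Y}$ reduce to signal-to-noise ratios. The sketch below is an assumption-laden illustration, not the general matrix argument: it takes $M\sim\mathcal{N}(0,1)$, $X=h_{X}M+N_{X}$, $Y=h_{Y}M+N_{Y}$ and $T=h_{T}M+N_{T}$ with independent Gaussian noises and $\mathrm{Var}(T)=1$, and all parameter names are hypothetical. It uses the standard formula $I(A;B)=-\tfrac{1}{2}\log(1-\rho^{2})$ for jointly Gaussian scalars.

```python
import math


def gaussian_mi(cov_ab, var_a, var_b):
    """Mutual information (nats) between two jointly Gaussian scalars."""
    rho_sq = cov_ab ** 2 / (var_a * var_b)
    return -0.5 * math.log(1.0 - rho_sq)


def scalar_example(h_x, noise_var_x, h_y, noise_var_y, h_t):
    """Return (Lambda_X, Lambda_Y, I(T;X), I(T;Y)) for the scalar model.

    Assumes M ~ N(0, 1), X = h_x M + N_x, Y = h_y M + N_y, and
    T = h_t M + N_t with Var(N_t) = 1 - h_t**2, so that Var(T) = 1.
    """
    if not 0.0 < abs(h_t) < 1.0:
        raise ValueError("h_t must satisfy 0 < |h_t| < 1 so that Var(T) = 1")
    lam_x = h_x ** 2 / noise_var_x  # scalar version of Lambda_X
    lam_y = h_y ** 2 / noise_var_y  # scalar version of Lambda_Y
    # T and X are conditionally independent given M, so Cov(T, X) = h_t * h_x.
    i_tx = gaussian_mi(h_t * h_x, 1.0, h_x ** 2 + noise_var_x)
    i_ty = gaussian_mi(h_t * h_y, 1.0, h_y ** 2 + noise_var_y)
    return lam_x, lam_y, i_tx, i_ty


# X is the less noisy observation here, so Lambda_Y does not dominate Lambda_X.
lam_x, lam_y, i_tx, i_ty = scalar_example(
    h_x=1.0, noise_var_x=0.5, h_y=1.0, noise_var_y=2.0, h_t=0.8
)
print("Lambda_X > Lambda_Y:", lam_x > lam_y)
print("I(T;X) - I(T;Y) > 0:", i_tx - i_ty > 0)
```

In the scalar case any nonzero $h_{T}$ works as the witness; in the matrix case the proof instead chooses the direction $c$ along which $\Lambda_{X}$ exceeds $\Lambda_{Y}$.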

Bibliography

[1] P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv preprint arXiv:1004.2515, 2010.
[2] P. K. Banerjee, E. Olbrich, J. Jost, and J. Rauh, “Unique informations and deficiencies,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2018, pp. 32–38.
[3] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, “Quantifying unique information,” Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
[4] G. Pica, E. Piasini, H. Safaai, C. Runyan, C. Harvey, M. Diamond, C. Kayser, T. Fellin, and S. Panzeri, “Quantifying how much sensory information in a neural code is relevant for behavior,” in Advances in Neural Information Processing Systems, 2017, pp. 3686–3696.
[5] N. M. Timme and C. Lapish, “A tutorial for information theory in neuroscience,” eNeuro, vol. 5, no. 3, 2018.
[6] T. Scagliarini, L. Faes, D. Marinazzo, S. Stramaglia, and R. N. Mantegna, “Synergistic information transfer in the global system of financial markets,” Entropy, vol. 22, no. 9, p. 1000, 2020.
[7] D. A. Ehrlich, A. C. Schneider, M. Wibral, V. Priesemann, and A. Makkeh, “Partial information decomposition reveals the structure of neural representations,” arXiv preprint arXiv:2209.10438, 2022.
[8] S. Dutta, P. Venkatesh, P. Mardziel, A. Datta, and P. Grover, “An information-theoretic quantification of discrimination with exempt features,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3825–3833.
