Capturing and Interpreting Unique Information
Praveen Venkatesh, Keerthana Gurushankar, Gabriel Schamberg

TL;DR
This paper explores the operational meaning of unique information in partial information decompositions, proposing a new PID definition with clear interpretation and analyzing its properties and connections to existing frameworks.
Contribution
It introduces a new PID definition that captures unique information with an intuitive interpretation and links it to existing PID frameworks through a Lagrangian formulation.
Findings
Unique information bounds decision risk.
New PID captures information uniquely held by variables.
Connections between different PID definitions are established.
Praveen Venkatesh
Allen Institute
and University of Washington
Seattle, WA, USA
Keerthana Gurushankar
Department of Computer Science,
Carnegie Mellon University
Pittsburgh, PA, USA
Gabriel Schamberg
Department of Surgery,
University of Auckland
New Zealand
Abstract
Partial information decompositions (PIDs), which quantify information interactions between three or more variables in terms of uniqueness, redundancy and synergy, are gaining traction in many application domains. However, our understanding of the operational interpretations of PIDs is still incomplete for many popular PID definitions. In this paper, we discuss the operational interpretations of unique information through the lens of two well-known PID definitions. We reexamine an interpretation from statistical decision theory showing how unique information upper bounds the risk in a decision problem. We then explore a new connection between the two PIDs, which allows us to develop an informal but appealing interpretation, and generalize the PID definitions using a common Lagrangian formulation. Finally, we provide a new PID definition that is able to capture the information that is unique. We also show that it has a straightforward interpretation and examine its properties.
I Introduction
Partial information decompositions (PIDs) have become a popular method for understanding the information interactions between multiple random variables. A bivariate PID seeks to decompose the information that two variables $X$ and $Y$ convey about a message $M$ into parts that are unique to $X$, unique to $Y$, redundant to $X$ and $Y$, and synergistic [1, 2, 3].
As a simple example, consider a message , and two variables and , where i.i.d. Ber and represents an XOR operation between bits. Here, has one bit of unique information about , i.e., , which is not present in . Similarly, has one bit of unique information about , i.e., , which is not present in . There is one bit of redundant information, i.e., , which can be extracted from either or taken alone. Finally, there is one bit of synergistic information, i.e., : this information cannot be extracted from either or individually, but can be recovered when both are taken together.
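The bookkeeping behind an example of this kind can be checked numerically. The sketch below uses a hypothetical construction (an assumption for illustration, not necessarily the exact variables used above): five i.i.d. Ber(1/2) bits `a, b, c, d, e`, with message $M = (a, b, c, d \oplus e)$, $X = (a, c, d)$ and $Y = (b, c, e)$. It computes only the ordinary mutual informations, which are consistent with one bit each of unique (to $X$), unique (to $Y$), redundant and synergistic information.

```python
import itertools
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy in bits of a list of equally likely outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mutual_information(pairs):
    """I(U;V) in bits from a list of equally likely (u, v) outcomes."""
    us = [u for u, _ in pairs]
    vs = [v for _, v in pairs]
    return entropy(us) + entropy(vs) - entropy(pairs)


# Hypothetical construction (assumed for illustration only).
outcomes = []
for a, b, c, d, e in itertools.product([0, 1], repeat=5):
    m = (a, b, c, d ^ e)  # message
    x = (a, c, d)         # a: only in X, c: shared, d: one half of the XOR
    y = (b, c, e)         # b: only in Y, c: shared, e: other half of the XOR
    outcomes.append((m, x, y))

i_mx = mutual_information([(m, x) for m, x, y in outcomes])
i_my = mutual_information([(m, y) for m, x, y in outcomes])
i_mxy = mutual_information([(m, (x, y)) for m, x, y in outcomes])

print(i_mx, i_my, i_mxy)
```

Here $I(M;X) = I(M;Y) = 2$ bits and $I(M;(X,Y)) = 4$ bits, which matches a decomposition with one bit in each of the four components. Mutual information alone does not determine that decomposition; a PID definition is still needed to assign the atoms.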
PIDs have found applications in various fields, from neuroscience (where one may want to examine the interaction between stimuli, neural activity and behavioral response) [4, 5] to financial markets [6]. Recent works have also used PIDs to explain how information complexity decreases through the layers of a deep neural network [7], as well as to develop new measures of fairness in machine learning [8].
Despite increasingly widespread adoption, there is still no consensus on how PIDs should be defined, or on how to operationally interpret partial information quantities (e.g., see [9, 10]). One popular approach for operational interpretations has relied on the concept of Blackwell sufficiency from statistical decision theory. Blackwell sufficiency is a formal way to determine whether $Y$ contains all of the information that $X$ has about $M$. Thus, it becomes a natural basis for discussing how two variables carry information about a message. For example, Kolchinsky [10] uses it to operationalize measures of redundancy and “union” information.
Here, we restrict our attention to interpretations of unique information. Bertschinger et al. [3] used Blackwell sufficiency to motivate a definition of unique information. But their interpretation only addressed whether or not the unique information was zero or non-zero, and did not provide an interpretation for the quantity of unique information. More recently, Banerjee et al. [2] and Rauh et al. [11] interpreted the quantification of unique information in terms of a secret key rate using a context from information-theoretic security. However, such an interpretation is difficult to translate to other contexts like neuroscience, where there may not be an analog for an eavesdropper.
This paper discusses two PID definitions based on Blackwell sufficiency [2, 3], and provides an operational interpretation of the quantity of unique information in each case. Extending classical results on so-called “deficiency” measures [12, 13], and clarifying results in [2], we show that the unique information about $M$ present in $X$ w.r.t. $Y$ upper bounds the difference in risk attained in a decision problem, when one uses $Y$ rather than $X$ to make decisions pertaining to $M$ (Sections III-A, III-B, and III-D).
We then identify a previously unrecognized connection between the aforementioned PIDs, which shows that the two definitions swap the objective and constraint in their respective optimizations (Section III-E). This discovery allows us to clarify how these definitions are related to Blackwell sufficiency, and provide an informal but appealing interpretation for them (Section III-F). Finally, we develop a novel generalization of the two PIDs, through a common Lagrangian (Section III-G). In the process, we also explicitly raise an issue pertaining to symmetrization of redundancy, and show how it complicates the interpretation of unique information (Sections III-B, III-C).
Lastly, in Section IV, we propose a new PID definition that captures the part of $M$ that is unique in the form of a random variable. We hinted at this PID in our previous work [14], without defining it or discussing its properties. Here, we define the PID formally through redundancy symmetrization, and show that it forms a valid non-negative decomposition and that it obeys intuitive bounds. We also show that this PID definition is Blackwellian [15] when $M$, $X$ and $Y$ are jointly Gaussian.
II Background
II-A Notation
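- Let $M$, $X$ and $Y$ be three random variables with sample spaces $\mathsf{M}$, $\mathsf{X}$ and $\mathsf{Y}$ respectively, and joint density $P_{MXY}$.
- Let $\mathcal{C}(\mathsf{Y} \mid \mathsf{X})$ denote the set of all channels from $\mathsf{X}$ to $\mathsf{Y}$, so for example, $P_{Y|X} \in \mathcal{C}(\mathsf{Y} \mid \mathsf{X})$.
- Let $\circ$ denote composition of channels, i.e., $P_{Y'|M} = P_{Y'|X} \circ P_{X|M}$ means
$$P_{Y'|M}(y \mid m) = \int P_{Y'|X}(y \mid x)\, P_{X|M}(x \mid m)\, dx.$$
- To keep the exposition simple, we ignore any measure-theoretic nuances. All conditional distributions and information measures are assumed to be well-defined.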
II-B Defining PIDs
There are many notions of partial information decompositions: we focus here on the bivariate case, which decomposes the information that two variables $X$ and $Y$ have about a message $M$. Such a PID is typically defined by a set of four functions of the joint distribution $P_{MXY}$, denoted $UI(M : X \setminus Y)$, $UI(M : Y \setminus X)$, $RI(M : X ; Y)$ and $SI(M : X ; Y)$ (or $UI_X$, $UI_Y$, $RI$ and $SI$ respectively for brevity), which satisfy the following basic equations:
$$I(M ; (X, Y)) = UI(M : X \setminus Y) + UI(M : Y \setminus X) + RI(M : X ; Y) + SI(M : X ; Y), \tag{1}$$
$$I(M ; X) = UI(M : X \setminus Y) + RI(M : X ; Y), \tag{2}$$
$$I(M ; Y) = UI(M : Y \setminus X) + RI(M : X ; Y). \tag{3}$$
Equation (1) implies that the total mutual information about $M$ conveyed by $X$ and $Y$ is the sum of four partial information components: one unique to $X$, one unique to $Y$, another redundant to both $X$ and $Y$, and the last which is synergistic, respectively. Equations (2) and (3) enforce that the individual mutual information of $X$ or $Y$ with $M$ is the sum of the redundant information and the corresponding unique information. (Typically, it is also assumed that the redundant and synergistic components are symmetric in $X$ and $Y$.) These equations impose three constraints on the four partial information components, such that defining any one component suffices to specify the other three.
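Since the three consistency equations leave a single degree of freedom, the remaining atoms follow by simple arithmetic once one component is fixed. The sketch below assumes the redundancy is the component that has been supplied; the function name and the dictionary keys (`UI_X`, `UI_Y`, `RI`, `SI`) are illustrative labels, not notation from the paper.

```python
def pid_from_redundancy(i_mx, i_my, i_mxy, ri):
    """Solve the three PID consistency equations for the other atoms.

    i_mx  : I(M; X)
    i_my  : I(M; Y)
    i_mxy : I(M; (X, Y))
    ri    : redundant information (supplied by some PID definition)
    """
    ui_x = i_mx - ri                  # individual information of X minus redundancy
    ui_y = i_my - ri                  # individual information of Y minus redundancy
    si = i_mxy - ui_x - ui_y - ri     # whatever remains of the total is synergy
    return {"UI_X": ui_x, "UI_Y": ui_y, "RI": ri, "SI": si}


# Hypothetical numbers: I(M;X) = I(M;Y) = 2 bits, I(M;(X,Y)) = 4 bits, RI = 1 bit.
atoms = pid_from_redundancy(2.0, 2.0, 4.0, 1.0)
print(atoms)
```

Nothing in this arithmetic guarantees non-negative atoms; whether they are non-negative depends on how the supplied component was defined.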
In this paper we discuss the operational interpretations of two existing PID definitions due to [2] and [3] in Section III, and then introduce a new PID definition in Section IV. We state here the first two definitions as defined originally, and later we present modified forms which are more interpretable.
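Definition 1 ($\delta$-PID [2]).
Let the (weighted output) deficiency of $Y$ with respect to $X$ about $M$ be defined as
$$\delta(M : X \setminus Y) := \inf_{P_{X'|Y}} \mathbb{E}_{P_M}\!\left[ D\big(P_{X|M} \,\big\|\, P_{X'|Y} \circ P_{Y|M}\big) \right]. \tag{4}$$
(Deficiency was introduced by Le Cam to quantify a departure from Blackwell sufficiency. The reason for this notation is that the deficiency of $Y$ w.r.t. $X$ translates to the unique information present in $X$ and not in $Y$.)
Then, the deficiency-based redundant information about $M$ present in $X$ and $Y$ is given by
$$RI_\delta(M : X ; Y) := \min\big\{ I(M;X) - \delta(M : X \setminus Y),\; I(M;Y) - \delta(M : Y \setminus X) \big\}. \tag{5}$$
Using equations (1)–(3), $RI_\delta$ fully determines the $\delta$-PID, i.e. $UI_\delta(M : X \setminus Y)$, $UI_\delta(M : Y \setminus X)$ and $SI_\delta(M : X ; Y)$.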
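Definition 2 ($\sim$-PID [3, 16]; also called the BROJA-PID in the literature after the authors of [3]).
The unique information about $M$ present in $X$ and not in $Y$ is given by
$$\widetilde{UI}(M : X \setminus Y) := \min_{Q \in \Delta_P} I_Q(M ; X \mid Y), \tag{6}$$
where $\Delta_P := \{Q_{MXY} : Q_{MX} = P_{MX},\ Q_{MY} = P_{MY}\}$ and $I_Q(M ; X \mid Y)$ is the conditional mutual information over the joint distribution $Q_{MXY}$.
As with the $\delta$-PID, equations (1)–(3) fully determine the remaining components of the $\sim$-PID.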
II-C Blackwell sufficiency and Blackwellian PIDs
Blackwell sufficiency provides a partial order between random variables based on how informative they are about a message $M$. This notion was used by [3] to provide an operational motivation for the $\sim$-PID, and also underlies the basis of the $\delta$-PID [2].
Definition 3 (Blackwell sufficiency: $\succcurlyeq_M$).
We say that a channel $P_{Y|M}$ is Blackwell sufficient w.r.t. another channel $P_{X|M}$ (denoted $Y \succcurlyeq_M X$) if there exists a channel $P_{X'|Y}$ such that
$$P_{X'|Y} \circ P_{Y|M} = P_{X|M}. \tag{7}$$
Intuitively, $Y \succcurlyeq_M X$ means that we can generate a new random variable $X'$ from $Y$ (using the stochastic transformation $P_{X'|Y}$) so that the effective channel from $M$ to $X'$ is equivalent to the original channel from $M$ to $X$. (Blackwell sufficiency is identical to the concept of stochastic degradedness of broadcast channels [15].) It was shown by Blackwell [17] that if $Y$ is Blackwell sufficient for $X$ w.r.t. $M$, then it is always preferable to observe $Y$ rather than $X$, for making decisions about $M$. This operational interpretation of Blackwell sufficiency was extended to PIDs by [3]:
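For finite alphabets, the composition in this definition is a product of row-stochastic matrices, so a proposed garbling channel can be checked directly. The sketch below is a toy illustration under assumed numbers: the observed variable $Y$ is $M$ passed through a binary symmetric channel with crossover 0.1, the candidate garbling from $Y$ to $X'$ is a binary symmetric channel with crossover 0.2, and $X$ is $M$ passed through a binary symmetric channel with crossover 0.26. The helper names are hypothetical.

```python
def compose(second, first):
    """Compose two channels given as row-stochastic matrices.

    first[m][y]  = P(y | m)
    second[y][x] = P(x' = x | y)
    Returns the effective channel P(x' | m).
    """
    n_out = len(second[0])
    return [
        [sum(row[y] * second[y][x] for y in range(len(row))) for x in range(n_out)]
        for row in first
    ]


def bsc(eps):
    """Binary symmetric channel with crossover probability eps."""
    return [[1.0 - eps, eps], [eps, 1.0 - eps]]


def max_abs_gap(a, b):
    """Largest entrywise difference between two channel matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


p_y_given_m = bsc(0.1)    # channel from M to Y
p_x_given_m = bsc(0.26)   # channel from M to X
garbling = bsc(0.2)       # candidate channel from Y to X'

effective = compose(garbling, p_y_given_m)   # channel from M to X'
gap = max_abs_gap(effective, p_x_given_m)
print(gap)
```

The gap is zero up to floating-point error, because cascading crossovers 0.1 and 0.2 gives 0.1·0.8 + 0.9·0.2 = 0.26. This only verifies one supplied garbling; deciding sufficiency in general requires searching over all garbling channels, which is a linear feasibility problem for finite alphabets.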
Definition 4 (Blackwellian PID).
A bivariate PID on $P_{MXY}$ is said to be Blackwellian if
$$UI(M : X \setminus Y) = 0 \iff Y \succcurlyeq_M X. \tag{8}$$
This means that (for a Blackwellian PID definition) the unique information in one variable is zero only if it is always beneficial to observe the other variable to make decisions about $M$. Conversely, if $Y$ is not Blackwell sufficient for $X$ w.r.t. $M$, then $X$ must have some unique information about $M$ that $Y$ cannot access.
However, it is important to note that a Blackwellian PID is only operationally motivated to the extent of whether or not the unique information is zero. It does not lend an operational interpretation as to the volume of unique information when it is non-zero.
III Interpreting the $\delta$- and $\sim$-PIDs
III-A Deficiency upper bounds the difference in risk
The $\delta$-PID derives its operational interpretation directly from that of deficiency [18, 12], upon which it is based. The deficiency of $Y$ w.r.t. $X$, originally defined by Le Cam [18], measures how far from Blackwell sufficient $Y$ is, w.r.t. $X$.
Le Cam’s original notion of deficiency was defined using the total variation distance, and as a worst case over realizations of $M$. That was a frequentist context, where $M$ was a statistical parameter and not a random variable. Following Raginsky [19], the Le Cam deficiency of $Y$ w.r.t. $X$ about $M$ is:
[TABLE]
The Le Cam deficiency can be interpreted as upper bounding the difference in risk (for any bounded loss function) when using $Y$ rather than $X$ to make decisions based on $M$. We can state this formally, using the setup of a decision problem:
Definition 5 (Decision problem).
Suppose we need to perform actions based on the value of , which we cannot observe directly (e.g., we may want to estimate the value of ). We have access to either or , which can give us information about . The actions we take after observing either or —call these and respectively—incur a bounded loss that depends on the chosen action and the value of . Let () be the loss function, where may be either or , depending on whether we choose to observe or . How do we decide whether to choose or when we do not know ?
Blackwell [17] showed that if $Y \succcurlyeq_M X$, we can always attain a lower loss (on average) by choosing $Y$. What happens when Blackwell sufficiency does not hold? Define the risk as the expected loss over either $X$ or $Y$:
[TABLE]
If Blackwell sufficiency does not hold, then the worst-case risk (over $M$) when you choose $Y$ is at most that when you choose $X$, plus the Le Cam deficiency of $Y$ [12, 13]. (Recall that the deficiency in $Y$ is denoted $\delta(M : X \setminus Y)$, because it corresponds to the unique information in $X$.) In other words, for any bounded loss function and for any action rule based on $X$, there exists an action rule based on $Y$ such that
[TABLE]
Raginsky [19] showed how alternative measures like the KL-divergence may be used in place of the total variation distance, while preserving the aforementioned risk-based operational interpretation. In that work, Raginsky preserved the frequentist setting, taking the worst case divergence between $P_{X|M}$ and $P_{X'|Y} \circ P_{Y|M}$, over all realizations of $M$. However, for partial information decompositions, $M$ is a random variable and thus it makes more sense to consider the expected divergence over different values of $M$. This is what Banerjee et al. [2] did, in proposing the PID stated in Definition 1. They show that the risk-based operational interpretation extends to the new deficiency definition [2, Prop. 8], but do not extend it to the corresponding unique information. We first state the theorem for deficiency, and show the extension in the following subsection.
Theorem 1.
Let the average risk be given by
[TABLE]
Then, for any , there exists an such that
[TABLE]
where is a monotonically increasing function.
A proof of the above theorem is presented in Appendix A.
III-B Interpreting after redundancy-symmetrization
Despite the existence of a clear operational interpretation for the deficiency as defined in Definition 1, the PID that arises out of this deficiency still needs an interpretation. In particular, we need to address what happens when we symmetrize the redundancy in Equation (5). This symmetrization step is required because $I(M;X) - \delta(M : X \setminus Y)$ is not always symmetric in $X$ and $Y$. Interestingly, this issue does not arise in the case of the $\sim$-PID, as we discuss in Section III-C.
First, we note that the operational interpretation for unique information described by Theorem 1 is still valid, although the bound may be somewhat loose:
[TABLE]
Thus, the unique information can act as an upper bound for the difference in risk, in place of deficiency.
However, one of the two unique informations, $UI_\delta(M : X \setminus Y)$ or $UI_\delta(M : Y \setminus X)$, is guaranteed to be loose in this way. We can quantify the extent of looseness as follows: suppose that . Then, , and thus
[TABLE]
In other words, the excess quantity added to , over and above the deficiency is
[TABLE]
For lack of a better name, we call this the “cyan region”, due to how it is depicted in Figure 1. It is completely unclear what the interpretation of ought to be, and why this information should be considered unique to (see Figure 1).
Essentially, we pay the cost of a loose bound in , and the extent of loosening does not have a clear justification of itself, except that it helps symmetrize the redundancy. This gives rise to the desire for a definition that does not require the explicit symmetrization performed in Equation (5).
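III-C The $\sim$-PID redundancy is intrinsically symmetric
In a stroke of serendipity, the redundancy under the $\sim$-PID of Definition 2 is naturally symmetric in $X$ and $Y$ [3]. Let $Q^*$ be the joint distribution that achieves the optimum in Equation (6). Then,
$$\widetilde{RI}(M : X ; Y) = I(M;X) - \widetilde{UI}(M : X \setminus Y) \overset{(a)}{=} I(M;X) - I_{Q^*}(M;X \mid Y) \overset{(b)}{=} I_{Q^*}(M;X) - I_{Q^*}(M;X \mid Y) = I_{Q^*}(M;X;Y),$$
where (a) invokes Definition 2, (b) uses the constraint that $Q^*_{MX} = P_{MX}$ so that $I_{Q^*}(M;X) = I(M;X)$, and $I_{Q^*}(M;X;Y)$ is the multivariate mutual information (the negative of which is also sometimes called the interaction information) on the distribution $Q^*$, which can be expressed as shown below (e.g., using the standard formulae from [20, Ch. 2]):
$$I(M;X;Y) = I(M;X) - I(M;X \mid Y) = I(M;Y) - I(M;Y \mid X) = I(X;Y) - I(X;Y \mid M).$$
Thus, $\widetilde{RI}$ becomes equal to the multivariate mutual information on $Q^*$, which is symmetric in $X$ and $Y$ by definition.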
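The symmetry of the multivariate mutual information holds for any joint distribution, so it can be sanity-checked numerically. The sketch below evaluates $I(M;X) - I(M;X \mid Y)$ and $I(M;Y) - I(M;Y \mid X)$ on a small, hypothetical joint pmf over binary triples; it is not the optimizing distribution from Equation (6), only an arbitrary example. Outcomes are indexed as `(m, x, y)`.

```python
import math
from collections import defaultdict


def marginal(joint, keep):
    """Marginalize a joint pmf {(m, x, y): p} onto the coordinates in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out


def H(joint, keep):
    """Entropy in bits of the marginal on the coordinates in `keep`."""
    return -sum(p * math.log2(p) for p in marginal(joint, keep).values() if p > 0)


def mi(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return H(joint, a) + H(joint, b) - H(joint, a + b)


def cmi(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return H(joint, a + c) + H(joint, b + c) - H(joint, a + b + c) - H(joint, c)


M, X, Y = (0,), (1,), (2,)

# Hypothetical joint pmf: unnormalized weights 1..8 over the eight binary triples.
triples = [(m, x, y) for m in (0, 1) for x in (0, 1) for y in (0, 1)]
joint = {t: (i + 1) / 36.0 for i, t in enumerate(triples)}

via_x = mi(joint, M, X) - cmi(joint, M, X, Y)
via_y = mi(joint, M, Y) - cmi(joint, M, Y, X)

# XOR case: M = X xor Y with uniform, independent X and Y.
xor_joint = {(x ^ y, x, y): 0.25 for x in (0, 1) for y in (0, 1)}
xor_value = mi(xor_joint, M, X) - cmi(xor_joint, M, X, Y)
print(via_x, via_y, xor_value)
```

The two expressions agree up to rounding, and the XOR case gives minus one bit, a reminder that the multivariate mutual information of an arbitrary distribution can be negative.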
Since the $\sim$-PID has a naturally symmetric redundancy, we might want to examine whether it shares the risk-based operational interpretation of the $\delta$-PID. We examine this, as well as alternative interpretations, in the following sections.
III-D $\widetilde{UI}$ upper bounds the difference in risk
The unique information of the $\sim$-PID, $\widetilde{UI}(M : X \setminus Y)$, also acts as an upper bound for the difference in risk when choosing $Y$ rather than $X$ in the decision problem from Definition 5. This follows directly from a result of Bertschinger et al. [3], which states that $\widetilde{UI}$ upper bounds the unique information of any other PID definition that satisfies what they call “Assumption ($*$)”. According to this assumption, a definition of unique information should depend only on $P_M$, $P_{X|M}$ and $P_{Y|M}$, and not on the whole joint distribution $P_{MXY}$. Since the $\delta$-PID satisfies Assumption ($*$), we have that $UI_\delta(M : X \setminus Y) \le \widetilde{UI}(M : X \setminus Y)$, which implies that Theorem 1 extends to the $\sim$-PID as well, although the upper bound may be loose.
III-E A connection between the $\delta$-PID and the $\sim$-PID
We now present a previously unidentified connection between these two PIDs, and use this connection to develop an intuitive interpretation for both PIDs.
First, observe that the -PID can be thought of as optimizing instead of , so long as we include the constraint that —— forms a Markov chain. This constraint can also be written as . Thus, abbreviating as , we can write the deficiency as:
[TABLE]
Next, we note that the definition of -PID can also be rewritten into a similar form. The optimization variable in Definition 2 obeys the constraints that and . Suppose we change notation by introducing a new random variable using the stochastic transformation , but which also obeys —or equivalently, . Then, the distribution plays exactly the same role as , and obeys precisely the same constraints. Thus, the -PID definition can also be written as:
[TABLE]
where the constraint has been expressed in terms of zero expected KL-divergence between the channels.
This reveals the remarkable similarity between the $\delta$- and $\sim$-PIDs as written in Equations (24) and (25). The two PIDs are essentially optimizing over the same quantities, but in effect, interchange objective and constraint.
III-F Clarifying the connection to Blackwell sufficiency, and a new informal interpretation
Using the newfound connection between the $\delta$- and $\sim$-PIDs, we can clarify their connection to Blackwell sufficiency, and provide an informal interpretation.
First, Blackwell sufficiency can be re-understood as follows. $Y \succcurlyeq_M X$ if two requirements are met:
(i) there must exist a random variable $X'$ that is derived from $Y$ through the stochastic transformation $P_{X'|Y}$, i.e., $M - Y - X'$ must be a Markov chain; and
(ii) $X'$ must act as a “copy” of $X$ w.r.t. $M$, in the sense that $P_{X'|M} = P_{X|M}$. (This is similar to the “simulatable” notion presented in [2, Defn. 38].)
When $Y \not\succcurlyeq_M X$, the $\delta$-PID and the $\sim$-PID quantify departures from Blackwell sufficiency in two different ways (also see Figure 2):
(i) the $\delta$-PID enforces the Markov chain and measures how far we are from a copy (refer to Eq. 24);
(ii) the $\sim$-PID enforces the copy and measures how far we are from having a Markov chain (refer to Eq. 25).
To our knowledge, this unified explanation of the $\delta$- and $\sim$-PIDs has not been identified in the literature previously.
We can also use this picture to offer a new informal interpretation. If Alice and Bob opt for $X$ and $Y$ respectively in the decision problem of Definition 5, the deficiency $\delta(M : X \setminus Y)$ measures the closest that Bob can come to emulating Alice (on average, for the worst loss for Bob). On the other hand, $\widetilde{UI}(M : X \setminus Y)$ measures the minimum number of bits Bob needs to borrow from $M$ in order to emulate Alice perfectly.
III-G A novel generalization of the $\delta$- and $\sim$-PIDs
The connection identified above also allows us to generalize both definitions using a single Lagrangian form:
[TABLE]
As in the Equation (26), we get the -PID, and as , we get the -PID. This new -PID has to be written in terms of a deficiency and then symmetrized as in Definition 1, since its redundancy will not be symmetric in general.
IV Capturing the Unique Information
In this section, we propose a new PID definition that is able to capture the unique information in the form of a random variable. The quantity of unique information also has a simple operational interpretation in terms of mutual information.
Definition 6 (-PID).
Let the information deficiency of $Y$ with respect to $X$ about $M$ be given by
[TABLE]
Here, $T$ is a random variable produced through the stochastic transformation $P_{T|M}$, and satisfies the Markov chain $T - M - (X, Y)$. Then, the redundant information may be defined as
[TABLE]
This definition is appealing, since it captures the basic intuition that if $X$ has unique information about $M$ with respect to $Y$, then $X$ has information about some “part” of $M$ which $Y$ does not have access to. In practice, this could mean either that $X$ is able to access entire “dimensions” of $M$ that $Y$ cannot, or it could mean that $X$ has access to some of the same dimensions of $M$ as $Y$, but with lower noise, or it could be a combination of these factors. In this definition, the stochastic transformation $P_{T|M}$ plays the role of extracting these “parts” of $M$, which $X$ has access to but $Y$ does not. The random variable $T$ corresponding to the optimal $P_{T|M}$ tells us the “parts” (or subspaces) of $M$ in which $X$ has unique information w.r.t. $Y$.
The operational interpretation for this definition is simply this: the unique information that $X$ has about $M$ with respect to $Y$ is the maximum information about $M$ which you can extract from $X$, which you cannot simultaneously get from $Y$. That is, for any (possibly stochastic) function that depends only on $M$, we will always have
[TABLE]
However, due to the need for symmetrization, this definition does suffer from the cyan region problem described in Section III-B. This is one area where we still need to work on understanding its interpretation.
In what follows, we prove some basic properties about the -PID, and show that it is Blackwellian for Gaussian .
Theorem 2 (Non-negativity and bounds on the -PID).
The -PID atoms can be shown to be non-negative:
[TABLE]
The -PID also satisfies the natural bounds:
[TABLE]
Theorem 3 (The -PID is Blackwellian for Gaussian $P_{MXY}$).
If $P_{MXY}$ is jointly Gaussian, then the -PID unique information satisfies:
[TABLE]
Proofs of these theorems are presented in Appendix B. In particular, Theorem 3 implies that prior results for Gaussians [15] are also applicable to the -PID. We conjecture that Theorem 3 can be generalized, i.e., the -PID is Blackwellian in general, but leave an investigation of this to future work.
Appendix A Proof of Theorem 1
Proof.
Consider the difference in average risks:
[TABLE]
Now, the last two terms of this expression can be bounded using the bound on and the total variation distance:
[TABLE]
where in (a) we have used the bound on and the definition of the total variation norm, in (b) we have used Pinsker’s inequality [21, Lemma 2.5], in (c) we have used Jensen’s inequality [20, Thm. 2.6.2], and in (d), we have set .
It only remains to be shown that the first two terms of the expression in Equation (32) can be upper bounded by zero. Examining the first two terms, for any , we can derive a stochastic action rule, that will attain the same risk: we can first draw and then select the action . Thus,
[TABLE]
which completes the proof. ∎
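Step (b) of the proof relies on Pinsker's inequality, which can be spot-checked numerically. The sketch below assumes the convention $\mathrm{TV}(P, Q) = \tfrac{1}{2}\sum_i |p_i - q_i|$ with the KL-divergence in nats, so that the inequality reads $\mathrm{TV}(P, Q) \le \sqrt{D(P \,\|\, Q)/2}$; if the total variation norm is taken without the factor of one half, the constants change accordingly. The random pmfs are hypothetical test data.

```python
import math
import random


def total_variation(p, q):
    """Total variation distance with the 1/2 normalization."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence_nats(p, q):
    """KL divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def random_pmf(rng, k):
    """Random pmf on k outcomes with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
pairs = [(random_pmf(rng, 4), random_pmf(rng, 4)) for _ in range(1000)]

# Pinsker: TV(P, Q) <= sqrt(D(P || Q) / 2), with a small tolerance for rounding.
pinsker_holds = all(
    total_variation(p, q) <= math.sqrt(0.5 * kl_divergence_nats(p, q)) + 1e-12
    for p, q in pairs
)
print(pinsker_holds)
```

A check over random pmfs is only a sanity test, not a proof; it is useful mainly for catching a mismatched normalization or logarithm base when implementing the bound.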
Appendix B Proofs of Theorems 2 and 3
Proof of Theorem 2.
First, observe that
[TABLE]
Furthermore,
[TABLE]
where the last inequality follows by the data processing inequality and the Markov chain ——. Thus,
[TABLE]
This implies
[TABLE]
Furthermore,
[TABLE]
where the very last inequality follows from the fact that $T \perp\!\!\!\perp (X, Y) \mid M$ and the data processing inequality [20, Ch. 2]. This may not be obvious, but it follows the same proof as the data processing inequality:
[TABLE]
From this it follows that
[TABLE]
where (a) follows from the fact that $I(T; X \mid Y, M) = 0$ since $T \perp\!\!\!\perp (X, Y) \mid M$, while (b) uses $I(M; X \mid Y, T) \geq 0$. This justifies Equation (50), which implies
[TABLE]
If , then , and . This shows that all terms in the -PID are non-negative and bounded. ∎
Proof of Theorem 3.
We need to show that when is jointly Gaussian,
[TABLE]
Observe that the -PID satisfies Assumption from Bertschinger et al. [3], i.e., is a function only of , and . Thus, by [3, Lemma 3], . Since the -PID is Blackwellian, .
This part of the proof holds irrespective of the distribution of .
Now, suppose is Gaussian. Then it suffices to show that whenever , such that , to ensure that .
Following the notation of [15], let represent the joint covariance matrix (which fully specifies information measures on the joint distribution), let represent the conditional covariance matrix of given and let represent the cross-covariance of and . Let and . Then, [15, Theorem 2] states
[TABLE]
where for positive semidefinite matrices and , denotes that is positive semidefinite.
Consider to be a normal distribution, given by . Further, we can assume without loss of generality that . Then, . The mutual information between and is given by:
[TABLE]
Then,
[TABLE]
If , then , i.e., s.t.
[TABLE]
Letting and , we have that
[TABLE]
This implies
[TABLE]
Recognizing that (see Equation (15)), this completes the proof. ∎
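The Gaussian argument above uses the fact that mutual information between jointly Gaussian variables is a function of the covariance matrix alone. As a minimal sketch, the standard closed form $I(A;B) = \tfrac{1}{2}\log\big(\det\Sigma_A \det\Sigma_B / \det\Sigma_{AB}\big)$ can be evaluated as follows. The helper names and the scalar example are assumptions for illustration and do not reproduce the specific covariance matrices of the proof.

```python
import math


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def submatrix(cov, idx):
    """Covariance block for the coordinates listed in idx."""
    return [[cov[i][j] for j in idx] for i in idx]


def gaussian_mi_bits(cov, idx_a, idx_b):
    """I(A;B) in bits for jointly Gaussian (A, B) with joint covariance cov."""
    det_a = det(submatrix(cov, idx_a))
    det_b = det(submatrix(cov, idx_b))
    det_ab = det(submatrix(cov, list(idx_a) + list(idx_b)))
    return 0.5 * math.log2(det_a * det_b / det_ab)


# Hypothetical scalar example: M ~ N(0, 1) and X = M + N with independent N ~ N(0, 1),
# so Var(M) = 1, Var(X) = 2 and Cov(M, X) = 1.
cov = [[1.0, 1.0],
       [1.0, 2.0]]
i_mx = gaussian_mi_bits(cov, [0], [1])
print(i_mx)
```

For this example the squared correlation is 1/2, so the result equals $-\tfrac{1}{2}\log_2(1 - \tfrac{1}{2}) = 0.5$ bits. The formula assumes a non-singular joint covariance.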
References
- [1] P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv preprint arXiv:1004.2515, 2010.
- [2] P. K. Banerjee, E. Olbrich, J. Jost, and J. Rauh, “Unique informations and deficiencies,” in 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2018, pp. 32–38.
- [3] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, “Quantifying unique information,” Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
- [4] G. Pica, E. Piasini, H. Safaai, C. Runyan, C. Harvey, M. Diamond, C. Kayser, T. Fellin, and S. Panzeri, “Quantifying how much sensory information in a neural code is relevant for behavior,” in Advances in Neural Information Processing Systems, 2017, pp. 3686–3696.
- [5] N. M. Timme and C. Lapish, “A tutorial for information theory in neuroscience,” eNeuro, vol. 5, no. 3, 2018.
- [6] T. Scagliarini, L. Faes, D. Marinazzo, S. Stramaglia, and R. N. Mantegna, “Synergistic information transfer in the global system of financial markets,” Entropy, vol. 22, no. 9, p. 1000, 2020.
- [7] D. A. Ehrlich, A. C. Schneider, M. Wibral, V. Priesemann, and A. Makkeh, “Partial information decomposition reveals the structure of neural representations,” arXiv preprint arXiv:2209.10438, 2022.
- [8] S. Dutta, P. Venkatesh, P. Mardziel, A. Datta, and P. Grover, “An information-theoretic quantification of discrimination with exempt features,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3825–3833.
