Decomposing information into copying versus transformation

Artemy Kolchinsky; Bernat Corominas-Murtra

arXiv:1903.10693 · cs.IT · November 22, 2022

TL;DR

This paper introduces a novel information-theoretic decomposition distinguishing between copying and transformation in information transfer, with implications for understanding biological replication and other systems.

Contribution

It derives a formal decomposition of mutual information into copying and transformation components, generalizes it to various channels, and links copy information to physical work in copying processes.

Findings

1. The basic decomposition applies to channels whose source and destination share the same set of messages.

2. Copy information can be interpreted as the minimal work needed by a physical copying process.

3. An analysis of a model of amino acid substitution rates demonstrates practical relevance.

Equations

$$D_{\mathsf{KL}}(s\|q) := \sum_{x} s(x)\log\frac{s(x)}{q(x)}.$$

$$\mathsf{d}(a,b) := a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}.$$

$$I_{p}(Y\!:\!X) := \sum_{x} s(x)\sum_{y} p(y|x)\log\frac{p(y|x)}{p(y)},$$

$$p(y) := \sum_{x} s(x)\,p(y|x).$$

$$I(Y\!:\!X) = \sum_{x} s(x)\,I(Y\!:\!X\!\!=\!x),$$

$$I(Y\!:\!X\!\!=\!x)$$

$$F^{\mathsf{trans}}(p_{Y|x},p_{Y},x) := D_{\mathsf{KL}}(p_{Y|x}\|p_{Y}) - F(p_{Y|x},p_{Y},x).$$

$$D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}) = \begin{cases}\mathsf{d}(p_{Y|x}(x),p_{Y}(x)) & \text{if } p_{Y|x}(x)>p_{Y}(x)\\ 0 & \text{otherwise,}\end{cases}$$

$$D^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y}) = D_{\mathsf{KL}}(p_{Y|x}\|p_{Y}) - D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}).$$

$$I_{p}(Y\!:\!X\!\!=\!x) = D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}) + D^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y}).$$

$$I_{p}(Y\!:\!X) = I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y) + I^{\mathsf{trans}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y),$$

$$I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)$$

$$I^{\mathsf{trans}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)$$

$$H(Y) = I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y) + I^{\mathsf{trans}}(X\mkern-5.0mu\shortrightarrow\!Y) + H(Y|X).$$

$$\eta_{p}(x) := \frac{D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})}{D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})} \in [0,1],$$

$$\eta_{p} := \frac{I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y)}{I(Y\!:\!X)} \in [0,1].$$

$$G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}) := \min_{r_{Y}} D_{\mathsf{KL}}(r_{Y}\|p_{Y}) \quad \text{s.t.} \quad \mathbb{E}_{r_{Y}}[\ell(x,Y)] \leq \mathbb{E}_{p_{Y|x}}[\ell(x,Y)].$$

$$w(y) = \frac{1}{Z(\lambda)}\,p_{Y}(y)\,e^{-\lambda\ell(x,y)}$$

$$G^{\mathsf{copy}}_{x} = -\lambda\,\mathbb{E}_{p_{Y|x}}[\ell(x,Y)] - \log Z(\lambda).$$

$$G^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y}) = D_{\mathsf{KL}}(p_{Y|x}\|p_{Y}) - G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}).$$

$$\min_{r_{Y}} D_{\mathsf{KL}}(r_{Y}\|p_{Y}) \quad \text{s.t.} \quad \mathbb{E}_{r_{Y}}[(Y-x)^{2}] \leq \mathbb{E}_{p_{Y|x}}[(Y-x)^{2}].$$

$$\min_{r_{Y|X}} D_{\mathsf{KL}}(r_{Y|X}\|r_{Y}) \quad \text{s.t.} \quad \mathbb{E}_{r}[\ell(X,Y)] \leq \alpha,$$

$$W \geq kT\,D_{\mathsf{KL}}(p\|\pi).$$

$$W(x) \geq kT\,D_{\mathsf{KL}}(p_{Y|x}\|\pi_{Y}).$$

$$W_{\min}^{\mathsf{exact}}(x)$$

$$= kT\,D^{\mathsf{copy}}_{x}(p_{Y|x}\|\pi_{Y}),$$

$$W(x) - W_{\min}^{\mathsf{exact}}(x) \geq kT\,D^{\mathsf{trans}}_{x}(p_{Y|x}\|\pi_{Y}).$$

$$\langle W\rangle$$

$$= kT\,[I_{p}(Y\!:\!X) + D_{\mathsf{KL}}(p_{Y}\|\pi_{Y})].$$

$$\langle W_{\min}^{\mathsf{exact}}\rangle$$

$$= kT\,[I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y) + D_{\mathsf{KL}}(p_{Y}\|\pi_{Y})].$$

$$\langle W\rangle - \langle W_{\min}^{\mathsf{exact}}\rangle \geq kT\,I^{\mathsf{trans}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y).$$

Full text

Artemy Kolchinsky$^{1}$ and Bernat Corominas-Murtra$^{2}$

$^{1}$ Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

$^{2}$ Institute of Science and Technology Austria, Am Campus 1, A-3400, Klosterneuburg, Austria

Abstract

In many real-world systems, information can be transmitted in two qualitatively different ways: by copying or by transformation. Copying occurs when messages are transmitted without modification, e.g., when an offspring receives an unaltered copy of a gene from its parent. Transformation occurs when messages are modified systematically during transmission, e.g., when mutational biases occur during genetic replication. Standard information-theoretic measures do not distinguish these two modes of information transfer, although they may reflect different mechanisms and have different functional consequences. Starting from a few simple axioms, we derive a decomposition of mutual information into the information transmitted by copying versus the information transmitted by transformation. We begin with a decomposition that applies when the source and destination of the channel have the same set of messages and a notion of message identity exists. We then generalize our decomposition to other kinds of channels, which can involve different source and destination sets and broader notions of similarity. In addition, we show that copy information can be interpreted as the minimal work needed by a physical copying process, which is relevant for understanding the physics of replication. We use the proposed decomposition to explore a model of amino acid substitution rates. Our results apply to any system in which the fidelity of copying, rather than simple predictability, is of critical relevance.

I Introduction

Shannon’s information theory provides a powerful set of tools for quantifying and analyzing information transmission. A particular measure of interest is mutual information, which is the most common way of quantifying the amount of information transmitted from a source to a destination. Mutual information has fundamental interpretations and operationalizations in a variety of domains, including telecommunications Shannon (1948); Cover and Thomas (2006), gambling and investment Kelly (1956); Barron and Cover (1988); Cover and Ordentlich (1996), biological evolution Donaldson-Matasci et al. (2010), statistical physics Sagawa and Ueda (2008); Parrondo et al. (2015), and many others. Nonetheless, it has long been observed Pierce (1980); Corominas-Murtra et al. (2014) that mutual information does not distinguish between a situation in which the destination receives a copy of the source message versus one in which the destination receives some systematically transformed version of the source message (where “systematic” refers to transformations that do not arise purely from noise).

As an example of where this distinction matters, consider the transmission of genetic information during biological reproduction. When this process is modeled as a communication channel from parent to offspring, the amount of transmitted genetic information is often quantified by mutual information Bergstrom and Rosvall (2011); Penner et al. (2011); Simonetti et al. (2013); Butte and Kohane (1999); Ramani and Marcotte (2003). During replication, however, genetic information is not only copied but can also undergo systematic transformations in the form of nonrandom mutational biases. For instance, in the DNA of most organisms, $\texttt{A}\leftrightarrow\texttt{G}$ and $\texttt{C}\leftrightarrow\texttt{T}$ mutations occur more frequently than $\texttt{A}\leftrightarrow\texttt{C}$ , $\texttt{A}\leftrightarrow\texttt{T}$ , $\texttt{G}\leftrightarrow\texttt{C}$ , and $\texttt{G}\leftrightarrow\texttt{T}$ mutations Li et al. (1984); Yang (1994); Graur and Li (2000). That means that some information about parent nucleotides is preserved even when those nucleotides undergo mutations. Mutual information does not distinguish which part of genetic information is transmitted by exact copying and which part is transmitted by mutational biases. However, these two modes of information transmission are driven by different mechanisms and have dramatically different evolutionary and functional implications, given that mutations are more likely to lead to deleterious consequences.

The goal of this paper is to find a general decomposition of the information transmitted by a channel into contributions from copying versus from transformation. In Fig. 1, we provide a schematic that visually illustrates the problem. Essentially, we seek a decomposition of transmitted information into copy and transformation that distinguishes the example in Fig. 1a, where the copy is perfect, from the one in Fig. 1b, where the message has been systematically scrambled, and from the one in Fig. 1c, where the channel is completely noisy. Of course, we also want such a decomposition to apply in less extreme situations, where part of the information is copied and part is transformed.

The distinction between copying and transformation is important in many other domains beyond the case of biological reproduction outlined above. For example, in many models of animal communication and language evolution, agents exchange signals across noisy channels and then use these signals to try to agree on common referents in the external world Seyfarth et al. (1980); Hurford (1989); Nowak and Krakauer (1999); Cangelosi and Parisi (2002); Komarova and Niyogi (2004); Niyogi (2006); Steels (2003); Corominas-Murtra et al. (2014). In such models, successful communication occurs when information is transmitted by copying; if signals are systematically transformed — e.g., by scrambling — the agents will not be mutually intelligible, even though mutual information between them may be high. As another example, the distinction between copying and transformation may be relevant in the study of information flow during biological development, where recent work has investigated the ability of regulatory networks to decode development signals, such as positional information, from gene expression patterns Petkova et al. (2019). In this scenario, information is copied when developmental signals are decoded correctly, and transformed when they are systematically decoded in an incorrect manner. Yet other examples are provided by Markov chain models, which are commonly used to study computation and other dynamical processes in physics Van Kampen (1992), biology De Jong (2002) or sociology Sorensen (1978), among other fields. In fact, a Markov chain can be seen as a communication channel in which the system state transmits information from the past into the future. In this context, copying occurs when the system maintains its state constant over time (remains in fixed points) and transformation occurs when the state undergoes systematic changes (e.g., performs some kind of non-trivial computations).

Interestingly, while the distinction between copy and transformation information seems natural, it has not been previously considered in the information-theoretic literature. This may be partly due to the different roles that information theory has historically played: on one hand, a field of applied mathematics concerned with the engineering problem of optimizing information transmission (its original purpose); on the other, a set of quantitative tools for describing and analyzing intrinsic properties of real-world systems. Because of its origins in engineering, much of information theory — including Shannon’s channel-coding theorem, which established mutual information as a fundamental measure of transmitted information Shannon (1959); Ash (1990); Cover and Thomas (2006) — is formulated under the assumption of an external agent who can appropriately encode and decode information for transmission across a given communication channel, in this way accounting for any transformations performed by the channel. However, in many real-world systems, there is no additional external agent who codes for the channel Hopfield (1994); Corominas-Murtra et al. (2014), and one is interested in quantifying the ability of a channel to copy information without any additional encoding or decoding. This latter problem is the main subject of this paper.

A final word is required to motivate our information-theoretic approach. It is standard to characterize the ability of a channel to copy messages via the “probability of error” Cover and Thomas (2006), which we indicate as $\epsilon$ . In particular, $\epsilon$ is the probability that the destination receives a different message than the one that was sent by the source, while $1-\epsilon$ is the probability that the destination receives the same message as was sent by the source. However, for our purposes, this approach is insufficient. First of all, while $1-\epsilon$ quantifies the propensity of a channel to copy information, $\epsilon$ does not quantify the propensity to transmit information by transformation, since $\epsilon$ increases both in the presence of transformation and in the presence of noise (in other words, $\epsilon$ is high both in a channel like Fig. 1b and a channel like Fig. 1c). Among other things, this means that $1-\epsilon$ and $\epsilon$ cannot be used to compute a channel’s “copying efficiency” (i.e., which portion of the total information transmitted across a channel is copied). Second, and more fundamentally, $\epsilon$ and $1-\epsilon$ are not information-theoretic quantities, in the sense that they do not measure an amount of information. For instance, $1-\epsilon$ is bounded between 0 and 1 for all channels, whether considering a simple binary channel or a high-speed fiber-optic line. In the language of physics, one might say that $\epsilon$ is an intensive property, rather than an extensive one that scales with the size of the channel. We instead seek measures which quantify the amount of copied and transformed information, and which can grow as the capacity of the channel under consideration increases.

In this paper, we present a decomposition of information that distinguishes copied from transformed information. We derive our decomposition by proposing four natural axioms that copy and transformation information should satisfy, and then identifying the unique measure that satisfies these axioms. Our resulting measure is easy to compute and can be used to decompose either the total mutual information flowing across a channel, or the specific mutual information corresponding to a given source message, or an even more general measure of acquired information called Bayesian surprise.

The paper is laid out as follows. We present our approach in the next section. In Section III, we show that while our basic decomposition is defined for discrete-state channels where the source and destination share the same set of possible messages (so that the notion of “exact copy” is simple to define), our measures can be generalized to channels with different source and destination messages, to continuous-valued channels, and to other definitions of copying. We also discuss how our approach relates to rate-distortion theory Cover and Thomas (2006). In Section IV, we show that our measure can be used to quantify the thermodynamic efficiency of physical copying processes, a central topic in biological physics. In Section V, we demonstrate our measures on a real-world dataset of amino acid substitution rates.

II Copy and Transformation Information

II.1 Preliminaries

We briefly present some basic concepts from information theory that will be useful for our further developments.

We use the random variables $X$ and $Y$ to indicate the source and destination, respectively, of a communication channel (as defined in detail below). We assume that the source $X$ and destination $Y$ both take outcomes from the same countable set $\mathcal{A}$ . We use $\SS$ to indicate the set of all probability distributions whose support is equal to or a subset of $\mathcal{A}$ . We use notation like $p_{Y},q_{Y},\dots\in\SS$ to indicate marginal distributions over $Y$ , and $p_{Y|x},q_{Y|x},\dots\in\SS$ to indicate conditional distributions over $Y$ , given the event $X=x$ . Where clear from context, we will simply write $p(y),q(y),\dots$ and $p(y|x),q(y|x),\dots$ , and drop the subscripts.

For some distribution $p$ over random variable $X$ , we write the Shannon entropy as $H(p(X)):=-\sum_{x}p(x)\log p(x)$ , or simply $H(X)$ . For any two distributions $s$ and $q$ over the same set of outcomes, the Kullback-Leibler (KL) divergence is defined as

$$D_{\mathsf{KL}}(s\|q) := \sum_{x} s(x)\log\frac{s(x)}{q(x)}. \tag{1}$$

KL is non-negative and equal to 0 if and only if $s(x)=q(x)$ for all $x$ . It is infinite when the support of $s$ is not a subset of the support of $q$ . In this paper we will also make use of the KL between Bernoulli distributions — that is, distributions over two states of the type $(a,1-a)$ — which is sometimes called “binary KL”. We will use the notation $\mathsf{d}(a,b)$ to indicate the binary KL,

$$\mathsf{d}(a,b) := a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}. \tag{2}$$

We will in general assume that logarithms are in base 2 (so information is measured in bits), unless otherwise noted.
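As a concrete illustration of the two divergences just defined, the following is a minimal sketch in plain Python. The function names `kl` and `binary_kl` are our own choices for illustration, distributions are assumed to be given as lists of probabilities over the same ordered set of outcomes, and logarithms are taken in base 2.

```python
import math

def kl(s, q):
    """KL divergence D_KL(s || q) in bits between two discrete distributions."""
    total = 0.0
    for sx, qx in zip(s, q):
        if sx == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf  # support of s is not a subset of the support of q
        total += sx * math.log2(sx / qx)
    return total

def binary_kl(a, b):
    """Binary KL d(a, b): KL between the Bernoulli distributions (a, 1-a) and (b, 1-b)."""
    return kl([a, 1.0 - a], [b, 1.0 - b])

print(kl([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(kl([0.5, 0.5], [1.0, 0.0]))  # support mismatch
print(binary_kl(0.9, 0.5))
```

The early return of infinity mirrors the statement that KL is infinite when the support of $s$ is not contained in the support of $q$.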

In information theory, a communication channel specifies the conditional probability distribution of receiving different messages at a destination given messages transmitted by a source. Let $p_{Y|X}(y|x)$ indicate such a conditional probability distribution. The amount of intrinsic noise in the channel, given some probability distribution of source messages $s_{X}(x)$ , is the conditional Shannon entropy $H(Y|X):=-\sum_{x}s(x)\sum_{y}p(y|x)\log p(y|x)$ . The amount of information transferred across a communication channel is quantified using the mutual information (MI) between the source and the destination Cover and Thomas (2006),

$$I_{p}(Y\!:\!X) := \sum_{x}s(x)\sum_{y}p(y|x)\log\frac{p(y|x)}{p(y)}, \tag{3}$$

where $p(y)$ is the marginal probability of receiving message $y$ at the destination, defined as

$$p(y) := \sum_{x}s(x)\,p(y|x). \tag{4}$$

When writing $I_{p}(Y\!:\!X)$, we will omit the subscript $p$ indicating the channel where it is clear from context. MI is a fundamental measure of information transmission, and can be operationalized in numerous ways Cover and Thomas (2006). It is non-negative, and large when (on average) the uncertainty about the message at the destination decreases by a large amount, given the source message. MI can also be written as a weighted sum of so-called specific MI terms DeWeese and Meister (1999); Butts (2003); Wibral et al. (2015), one for each outcome of $X$. (The reader should be aware that the term “specific MI” has been used to refer to two different measures in the literature DeWeese and Meister (1999). The version of specific MI used here, as specified by Eq. 6, is also sometimes called “specific surprise”.) The weighted sum reads

$$I(Y\!:\!X) = \sum_{x}s(x)\,I(Y\!:\!X\!\!=\!x), \tag{5}$$

where the specific MI for outcome $x$ is given by

$$I(Y\!:\!X\!\!=\!x) := \sum_{y}p(y|x)\log\frac{p(y|x)}{p(y)} = D_{\mathsf{KL}}(p_{Y|x}\|p_{Y}). \tag{6}$$

Each $I(Y\!:\!X\!\!=\!x)$ indicates the contribution to MI arising from the particular source message $x$ . We will sometimes use the term total mutual information (total MI) to refer to Eq. 3, so as to distinguish it from specific MI.
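The relationship between total MI, the destination marginal, and specific MI can be checked numerically. The sketch below assumes a channel is stored as a row-stochastic nested list with `channel[x][y]` equal to $p(y|x)$; this representation and all function names are illustrative choices, not part of the paper.

```python
import math

def kl(s, q):
    """KL divergence D_KL(s || q) in bits."""
    total = 0.0
    for sx, qx in zip(s, q):
        if sx == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += sx * math.log2(sx / qx)
    return total

def marginal(s, channel):
    """Destination marginal p(y) = sum_x s(x) p(y|x)."""
    n_out = len(channel[0])
    return [sum(s[x] * channel[x][y] for x in range(len(s))) for y in range(n_out)]

def specific_mi(s, channel, x):
    """Specific MI for source message x: KL between p(.|x) and the marginal p(.)."""
    return kl(channel[x], marginal(s, channel))

def mutual_information(s, channel):
    """Total MI as the s-weighted sum of specific MI terms."""
    return sum(s[x] * specific_mi(s, channel, x) for x in range(len(s)) if s[x] > 0.0)

source = [0.5, 0.5]
noisy = [[0.9, 0.1], [0.1, 0.9]]     # binary symmetric channel, flip probability 0.1
identity = [[1.0, 0.0], [0.0, 1.0]]  # perfect copy
print(mutual_information(source, noisy))
print(mutual_information(source, identity))
```

For the identity channel with a uniform binary source, each specific MI term equals one bit, so the total MI equals the source entropy.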

Specific MI also has an important Bayesian interpretation. Consider an agent who begins with a set of prior beliefs about $Y$ , as specified by the prior distribution $p_{Y}(y)$ . The agent then updates their beliefs conditioned on the event $X=x$ , resulting in the posterior distribution $p_{Y|x}$ . The KL divergence between the posterior and the prior, $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ (Eq. 6), is called Bayesian surprise Itti and Baldi (2006), and quantifies the amount of information acquired by the agent. It reaches its minimum value of zero, indicating that no information is acquired, if and only if the prior and posterior distributions match exactly. Bayesian surprise plays a fundamental role in Bayesian theory, including in the design of optimal experiments Lindley (1956); Stone (1959); Bernardo (1979a); Chaloner and Verdinelli (1995) and the selection of “non-informative priors” Bernardo (1979b); Berger et al. (2009). Specific MI is a special case of Bayesian surprise, when the prior $p_{Y}$ is the marginal distribution at the destination, as determined by a choice of source distribution $s_{X}$ and channel $p_{Y|X}$ according to Eq. 4. In general, however, Bayesian surprise may be defined for any desired prior $p_{Y}$ and posterior distribution $p_{Y|x}$ , without necessarily making reference to a source distribution $s_{X}$ and communication channel $p_{Y|X}$ .

Because Bayesian surprise is a general measure that includes specific MI as a special case, we will formulate our analysis of copy and transformation information in terms of Bayesian surprise, $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ . Note that while the notation $p_{Y|x}$ implies conditioning on the event $X=x$ , formally $p_{Y|x}$ can be any distribution whatsoever. Thus, we do not technically require that there exist some full joint or conditional probability distribution over $X$ and $Y$ . Throughout the paper we will refer to the distributions $p_{Y|x}$ and $p_{Y}$ as the “posterior” and “prior”.

Proofs and derivations are contained in the appendices.

II.2 Axioms for copy information

We propose that any measure of copy information should satisfy a set of four axioms. Our setup is motivated in the following way. First, our decomposition should apply at the level of individual source messages, i.e., we wish to be able to decompose each specific mutual information term (or more generally, Bayesian surprise) into a non-negative (specific) copy information term and a non-negative (specific) transformation information term. Second, we postulate that if there are two channels with the same marginal distribution at the destination, then the channel with the larger $p_{Y|X}(x|x)$ (the probability of the destination getting message $x$ when the source transmits message $x$) should have larger copy information for source message $x$ (this is, so to speak, our “central axiom”). This postulate can also be interpreted in a Bayesian way. Imagine two Bayesian agents with the same prior distribution over beliefs, $p_{Y}$, who update their beliefs conditioned on the event $X=x$. We postulate that the agent with the larger posterior probability on $Y=x$ should have greater copy information.

Formally, we assume that each copy information term is a real-valued function of the posterior distribution, the prior distribution, and the source message $x$ , written generically as $F(p_{Y|x},p_{Y},x)$ . Given any measure of copy information $F$ , the transformation information associated with message $x$ is then the remainder of $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ beyond $F$ ,

$$F^{\mathsf{trans}}(p_{Y|x},p_{Y},x) := D_{\mathsf{KL}}(p_{Y|x}\|p_{Y}) - F(p_{Y|x},p_{Y},x). \tag{7}$$

We now propose a set of axioms that any measure of copy information $F$ should satisfy.

First, we postulate that copy information should be bounded between 0 and the Bayesian surprise, $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ . Given Eq. 7, this guarantees that both $F$ and $F^{\mathsf{trans}}$ are non-negative.

Axiom 1.

$F(p_{Y|x},p_{Y},x)\geq 0$ .

Axiom 2.

$F(p_{Y|x},p_{Y},x)\leq D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ .

Then, we postulate that copy information for source message $x$ should increase monotonically as the posterior probability of $x$ increases, assuming the prior distribution is held fixed (this is the “central axiom” mentioned above).

Axiom 3.

If $p_{Y|x}(x)\leq q_{Y|x}(x)$ , then $F(p_{Y|x},p_{Y},x)\leq F(q_{Y|x},p_{Y},x)$ .

In Appendix B, we show that any measure of copy information that satisfies the above three axioms must obey $F(p_{Y|x},p_{Y},x)=0$ whenever $p_{Y|x}(x)\leq p_{Y}(x)$. We also show that one particular measure of copy information, which is called $D^{\mathsf{copy}}_{x}$ and is discussed in the next section, is the largest measure that satisfies the above three axioms. However, the three axioms do not uniquely determine what happens when $p_{Y|x}(x)>p_{Y}(x)$. This means that $D^{\mathsf{copy}}_{x}$ is not the only measure that satisfies them; in fact, there are some trivial measures (such as $F(p_{Y|x},p_{Y},x)=0$ for all $p_{Y|x}$, $p_{Y}$, and $x$) that also satisfy the above axioms. Such trivial cases are excluded by our final axiom, which states that for all prior distributions and all posterior probabilities $p_{Y|x}(x)>p_{Y}(x)$, there are posterior distributions that contain only copy information. As we will see below, $D^{\mathsf{copy}}_{x}$ is the unique satisfying measure once this axiom is added.

Axiom 4.

For any $p_{Y}$ and $c\in[p_{Y}(x),1]$ , there exists a posterior distribution $p_{Y|x}$ such that $p_{Y|x}(x)=c$ and $F(p_{Y|x},p_{Y},x)=D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ .

II.3 The measure $D^{\mathsf{copy}}_{x}$

We now present $D^{\mathsf{copy}}_{x}$ , the unique measure that satisfies the four copy information axioms proposed in the last section. Given a prior distribution $p_{Y}$ , posterior distribution $p_{Y|x}$ , and source message $x$ , this measure is defined as

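$$D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}) = \begin{cases}\mathsf{d}(p_{Y|x}(x),p_{Y}(x)) & \text{if } p_{Y|x}(x)>p_{Y}(x)\\ 0 & \text{otherwise,}\end{cases}$$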

where we have used the notation of Eq. 2. We now state the main result of our paper:

Theorem 1.

*$D^{\mathsf{copy}}_{x}$ is the unique measure which satisfies Axioms 1, 2, 3, and 4.*

In Appendix A we demonstrate that $D^{\mathsf{copy}}_{x}$ satisfies all the axioms, and in Appendix B we prove that it is the only measure that satisfies them. We further show that if one drops Axiom 4, then $D^{\mathsf{copy}}_{x}$ is the largest possible measure that can satisfy the remaining axioms.

Given the definition of $F^{\mathsf{trans}}$ in Eq. 7, $D^{\mathsf{copy}}_{x}$ also defines a non-negative measure of transformation information, which we call $D^{\mathsf{trans}}_{x}$ ,

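$$D^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y}) = D_{\mathsf{KL}}(p_{Y|x}\|p_{Y}) - D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}).$$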
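A short sketch may help make the two measures concrete. It assumes the piecewise definition of $D^{\mathsf{copy}}_{x}$ (the binary KL between the posterior and prior probabilities of $x$ when the posterior probability is larger, and zero otherwise) and takes $D^{\mathsf{trans}}_{x}$ as the remainder of the Bayesian surprise. The function names, and the use of list indices as messages, are illustrative assumptions; the sketch also assumes the Bayesian surprise is finite.

```python
import math

def kl(s, q):
    """KL divergence D_KL(s || q) in bits."""
    total = 0.0
    for sx, qx in zip(s, q):
        if sx == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += sx * math.log2(sx / qx)
    return total

def binary_kl(a, b):
    """Binary KL d(a, b) between Bernoulli distributions (a, 1-a) and (b, 1-b)."""
    return kl([a, 1.0 - a], [b, 1.0 - b])

def d_copy(posterior, prior, x):
    """Copy information for message x (x is an index into both distributions)."""
    if posterior[x] > prior[x]:
        return binary_kl(posterior[x], prior[x])
    return 0.0

def d_trans(posterior, prior, x):
    """Transformation information: Bayesian surprise minus copy information.
    Assumes the Bayesian surprise is finite."""
    return kl(posterior, prior) - d_copy(posterior, prior, x)

prior = [1 / 3, 1 / 3, 1 / 3]
posterior = [0.8, 0.2, 0.0]
print(d_copy(posterior, prior, 0), d_trans(posterior, prior, 0))
print(d_copy(posterior, prior, 2), d_trans(posterior, prior, 2))
```

In this example the posterior concentrates on message 0, so for $x=0$ most of the surprise is copy information; the remainder reflects the fact that the leftover mass is placed unevenly on the other two outcomes. For $x=2$ the posterior probability is below the prior, so copy information is zero and the whole surprise counts as transformation.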

II.4 Decomposing mutual information

We now show that $D^{\mathsf{copy}}_{x}$ and $D^{\mathsf{trans}}_{x}$ allow for a decomposition of mutual information (MI) into MI due to copying and MI due to transformation. Recall that MI can be written as an expectation over specific MI terms, as shown in Eq. 6. Each specific MI term can be seen as a Bayesian surprise, where the prior distribution is the marginal distribution at the destination (see Eq. 4), and the posterior distribution is the conditional distribution of destination messages given a particular source message. Thus, our definitions of $D^{\mathsf{copy}}_{x}$ and $D^{\mathsf{trans}}_{x}$ provide a non-negative decomposition of each specific MI term,

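$$I_{p}(Y\!:\!X\!\!=\!x) = D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}) + D^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y}).$$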

In consequence, they also provide a decomposition of the total MI into two non-negative terms: the total copy information and the total transformation information,

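$$I_{p}(Y\!:\!X) = I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y) + I^{\mathsf{trans}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y),$$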

where $I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)$ and $I^{\mathsf{trans}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)$ are given by

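$$I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y) := \sum_{x}s(x)\,D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y}),$$

$$I^{\mathsf{trans}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y) := \sum_{x}s(x)\,D^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y}).$$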

(When writing $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$ , we will often omit the subscript $p$ where the channel is clear from context.) By a simple manipulation, we can also decompose the marginal entropy of the destination $H(Y)$ into three non-negative components:

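$$H(Y) = I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y) + I^{\mathsf{trans}}(X\mkern-5.0mu\shortrightarrow\!Y) + H(Y|X).$$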

Thus, given a channel from $X$ to $Y$ , the uncertainty in $Y$ can be written as the sum of the copy information from $X$ , the transformed information from $X$ , and the intrinsic noise in that channel from $X$ to $Y$ .

For illustration purposes, we plot the behavior of $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$ in the classical binary symmetric channel (BSC) in Fig. 2 (see caption for details). More detailed analysis of copy and transformation information in the BSC is discussed in Appendix E.
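The decomposition can be reproduced numerically for a binary symmetric channel. The sketch below is not the code behind Fig. 2; it assumes a uniform source, a channel stored as `channel[x][y]` $= p(y|x)$, and totals computed as source-weighted sums of the per-message copy and transformation terms. All function names are illustrative.

```python
import math

def kl(s, q):
    """KL divergence D_KL(s || q) in bits."""
    total = 0.0
    for sx, qx in zip(s, q):
        if sx == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += sx * math.log2(sx / qx)
    return total

def binary_kl(a, b):
    return kl([a, 1.0 - a], [b, 1.0 - b])

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0.0)

def marginal(s, channel):
    n_out = len(channel[0])
    return [sum(s[x] * channel[x][y] for x in range(len(s))) for y in range(n_out)]

def decompose(s, channel):
    """Return (I_copy, I_trans) in bits for source s and channel[x][y] = p(y|x)."""
    prior = marginal(s, channel)
    i_copy = i_trans = 0.0
    for x, sx in enumerate(s):
        if sx == 0.0:
            continue
        post = channel[x]
        copy_x = binary_kl(post[x], prior[x]) if post[x] > prior[x] else 0.0
        i_copy += sx * copy_x
        i_trans += sx * (kl(post, prior) - copy_x)
    return i_copy, i_trans

def bsc(eps):
    """Binary symmetric channel with flip probability eps."""
    return [[1.0 - eps, eps], [eps, 1.0 - eps]]

source = [0.5, 0.5]
for eps in (0.1, 0.5, 0.9):
    channel = bsc(eps)
    i_copy, i_trans = decompose(source, channel)
    noise = sum(sx * entropy(row) for sx, row in zip(source, channel))  # H(Y|X)
    h_y = entropy(marginal(source, channel))
    print(eps, round(i_copy, 4), round(i_trans, 4), round(noise, 4), round(h_y, 4))
```

Under these assumptions, all transmitted information is copy information when the flip probability is below one half and all of it is transformation information when it is above one half. In each case the three printed components sum to $H(Y)$.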

It is worthwhile to point out several important differences between our proposed measures and MI.

First, in the definitions of $I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y)$ and $I^{\mathsf{trans}}(X\mkern-5.0mu\shortrightarrow\!Y)$, the notation $X\mkern-5.0mu\shortrightarrow\!Y$ indicates that $X$ is the source and $Y$ is the destination. This is necessary because, unlike MI, $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$ are in general non-symmetric, so it is possible that $I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y)\neq I^{\mathsf{copy}}(Y\mkern-5.0mu\shortrightarrow\!X)$, and similarly for $I^{\mathsf{trans}}$. We also note that the above form of $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$, where they are written as sums over individual source messages, is sometimes referred to as trace-like form in the literature, and is a commonly desired characteristic of information-theoretic functionals Hanel and Thurner (2011); Thurner et al. (2017).

Second, $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$ do not obey the data processing inequality Cover and Thomas (2006), and can either decrease or increase as the destination undergoes further operations. In this respect, they are different from MI (the sum of $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$ ). As an example, consider the case where channel $p_{Y|X}$ first transforms source message $X$ into an encrypted message $Y$ , and then another channel $p_{X^{\prime}|Y}$ decrypts $Y$ back into a copy of $X$ (so $X^{\prime}=X$ ). In this example, $I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!X^{\prime})>I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y)$ even though the Markov condition $X-Y-X^{\prime}$ holds.
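The encryption example can be made concrete with the smallest possible case: a one-bit message that is deterministically flipped ("encrypted") and then flipped back ("decrypted"). This toy channel, the uniform source, and the helper names are assumptions made for illustration.

```python
import math

def kl(s, q):
    """KL divergence D_KL(s || q) in bits."""
    total = 0.0
    for sx, qx in zip(s, q):
        if sx == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += sx * math.log2(sx / qx)
    return total

def marginal(s, channel):
    n_out = len(channel[0])
    return [sum(s[x] * channel[x][y] for x in range(len(s))) for y in range(n_out)]

def mutual_information(s, channel):
    prior = marginal(s, channel)
    return sum(sx * kl(channel[x], prior) for x, sx in enumerate(s) if sx > 0.0)

def copy_information(s, channel):
    """Total copy information: s-weighted sum of per-message copy terms."""
    prior = marginal(s, channel)
    total = 0.0
    for x, sx in enumerate(s):
        a, b = channel[x][x], prior[x]
        if sx > 0.0 and a > b:
            total += sx * kl([a, 1.0 - a], [b, 1.0 - b])
    return total

def compose(first, second):
    """Channel X -> X' obtained by feeding first[x][y] = p(y|x) into second[y][z] = p(z|y)."""
    n_mid, n_out = len(second), len(second[0])
    return [[sum(row[y] * second[y][z] for y in range(n_mid)) for z in range(n_out)]
            for row in first]

source = [0.5, 0.5]
encrypt = [[0.0, 1.0], [1.0, 0.0]]     # deterministic bit flip
decrypt = [[0.0, 1.0], [1.0, 0.0]]     # flipping again undoes the encryption
round_trip = compose(encrypt, decrypt)  # identity channel

print(mutual_information(source, encrypt), copy_information(source, encrypt))
print(mutual_information(source, round_trip), copy_information(source, round_trip))
```

Both channels transmit one bit of mutual information, but copy information rises from zero to one bit after the second processing step, which could not happen for a quantity obeying the data processing inequality.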

Finally, unlike MI, $I^{\mathsf{copy}}$ and $I^{\mathsf{trans}}$ are generally non-additive when multiple independent channels are concatenated. As an example, imagine that the source messages are bit strings of length $n$, which are transmitted through a product of $n$ independent channels, $p(y|x)=\prod_{i}p_{i}(y_{i}|x_{i})$. If the source bits are independent, $s(x)=\prod_{i}s_{i}(x_{i})$, it is straightforward to show that the MI between $X$ and $Y$ has the additive form $I(Y\!:\!X)=\sum_{i}I(Y_{i}:X_{i})$. However, $I^{\mathsf{copy}}$ will generally not have this additive form, because copy information is defined in terms of the probability of exactly copying the entire source message (e.g., the entire $n$-bit long string). Imagine that in the above example, one of the bit-wise channels carries out a bit flip, $p_{i}(y_{i}|x_{i})=1-\delta(x_{i},y_{i})$. In that case, the probability of receiving an exact copy of the source message at the destination is zero, and therefore $I^{\mathsf{copy}}$ is also zero regardless of the nature of the other bit-wise channels $p_{j}$ for $j\neq i$. If desired, it is possible to derive an additive version of $I^{\mathsf{copy}}$ by generalizing our measure with an appropriate “loss function”, as discussed in more detail in Section III and Appendix C.3.
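The bit-flip argument can be verified on a two-bit example. The sketch below assumes uniform, independent source bits, one perfect-copy bit channel, and one bit-flip channel; `product_channel` and the other names are hypothetical helpers introduced only for this illustration.

```python
import math
from itertools import product

def kl(s, q):
    """KL divergence D_KL(s || q) in bits."""
    total = 0.0
    for sx, qx in zip(s, q):
        if sx == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += sx * math.log2(sx / qx)
    return total

def marginal(s, channel):
    n_out = len(channel[0])
    return [sum(s[x] * channel[x][y] for x in range(len(s))) for y in range(n_out)]

def mutual_information(s, channel):
    prior = marginal(s, channel)
    return sum(sx * kl(channel[x], prior) for x, sx in enumerate(s) if sx > 0.0)

def copy_information(s, channel):
    prior = marginal(s, channel)
    total = 0.0
    for x, sx in enumerate(s):
        a, b = channel[x][x], prior[x]
        if sx > 0.0 and a > b:
            total += sx * kl([a, 1.0 - a], [b, 1.0 - b])
    return total

def product_channel(bit_channels):
    """Joint channel over n-bit strings built from independent per-bit channels.
    Messages are enumerated in the same order at source and destination."""
    n = len(bit_channels)
    messages = list(product([0, 1], repeat=n))
    return [[math.prod(bit_channels[i][x[i]][y[i]] for i in range(n)) for y in messages]
            for x in messages]

copy_bit = [[1.0, 0.0], [0.0, 1.0]]
flip_bit = [[0.0, 1.0], [1.0, 0.0]]
bit_source = [0.5, 0.5]
joint_source = [0.25] * 4  # two independent uniform bits
joint = product_channel([copy_bit, flip_bit])

per_bit_mi = mutual_information(bit_source, copy_bit) + mutual_information(bit_source, flip_bit)
per_bit_copy = copy_information(bit_source, copy_bit) + copy_information(bit_source, flip_bit)
print(mutual_information(joint_source, joint), per_bit_mi)  # MI is additive
print(copy_information(joint_source, joint), per_bit_copy)  # copy information is not
```

The joint channel never reproduces the full two-bit string, so its copy information is zero even though the first bit is copied perfectly on its own.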

II.5 Copying efficiency

Our approach provides a way to quantify which portion of the information transmitted across a channel is due to copying rather than transformation, which we refer to as “copying efficiency”. Copying efficiency is defined at the level of individual source messages as

[TABLE]

where the bounds come directly from Axioms 1 and 2. It can also be defined at the level of a channel as a whole as

[TABLE]

The bounds follow simply given the above results.

For Eq. 13 and Eq. 14 to be useful efficiency measures, there should exist channels which are either “completely inefficient” (have efficiency 0) or “maximally efficient” (achieve efficiency 1). For the case of Eq. 13, the bounds can be saturated because of Axiom 4, which guarantees that for any source message $x$ , prior $p_{Y}$ , and desired posterior probability $p_{Y|x}(x)\geq p_{Y}(x)$ , there exists a posterior $p_{Y|x}$ such that the Bayesian surprise $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ is composed entirely of copy information (for example, see Eq. 30).

One can show that the bounds in Eq. 14 can also be saturated. First, it can be verified that completely inefficient channels exist, since any channel which has $p_{Y|x}(x)\leq p_{Y}(x)$ for all $x\in\mathcal{A}$ will have $I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y)=0$ (note that such channels exist at all levels of mutual information). We also show that maximally efficient channels exist, using the following result which is proved in Appendix D.

Proposition 1.

*For any source distribution $s_{X}$ with $H(X)<\infty$ , there exist channels $p_{Y|x}$ for all levels of mutual information $I_{p}(Y\!:\!X)\in[0,H(X)]$ such that $I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)=I_{p}(Y\!:\!X)$ .*

Proposition 1 shows that it is possible to achieve all values of total copy information, which is defined at the level of a channel. Note that this proposition does not follow immediately from Axiom 4, which is a statement about copy information at the level of a prior $p_{Y}$ and posterior $p_{Y|x}$ , where no particular relationship between $p_{Y}$ and $p_{Y|x}$ is assumed.
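To make the notion of copying efficiency concrete, the sketch below computes the channel-level ratio of total copy information to mutual information, which is our reading of Eq. 14. It assumes a full-support source, the binary-KL form of $D^{\mathsf{copy}}_{x}$ used in Appendix A, and three hypothetical three-symbol channels: an identity channel, a cyclic shift, and a noisy copying channel.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in bits, skipping zero-probability terms of p."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def binary_kl(a, b):
    return kl((a, 1.0 - a), (b, 1.0 - b))

def copy_efficiency(source, channel):
    """Ratio I^copy / I for a channel given as channel[x][y] = p(y|x)."""
    n = len(source)
    p_y = [sum(source[x] * channel[x][y] for x in range(n)) for y in range(n)]
    mi = sum(source[x] * kl(channel[x], p_y) for x in range(n))
    if mi <= 0.0:
        raise ValueError("efficiency is undefined when no information is transmitted")
    copy = sum(source[x] * binary_kl(channel[x][x], p_y[x])
               for x in range(n) if channel[x][x] > p_y[x])
    return copy / mi

source = [1.0 / 3.0] * 3
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
noisy = [[0.80, 0.15, 0.05], [0.05, 0.80, 0.15], [0.15, 0.05, 0.80]]

eff_identity = copy_efficiency(source, identity)  # maximally efficient
eff_shift = copy_efficiency(source, shift)        # completely inefficient
eff_noisy = copy_efficiency(source, noisy)        # strictly in between
print(eff_identity, eff_shift, eff_noisy)
```

The identity and shift channels transmit the same amount of mutual information, yet sit at opposite ends of the efficiency range, which is the distinction the efficiency measure is meant to capture.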

III Generalization and relation to rate-distortion

We now show that $D^{\mathsf{copy}}_{x}$ can be written as a particular element among a broad family of copy information measures, which generalize the formal definition of what is meant by “copying”.

As we showed above, $D^{\mathsf{copy}}_{x}$ is the unique measure that satisfies the four axioms proposed in Section II.2. In particular, it satisfies Axiom 3, which states that given the same prior $p_{Y}$ , copy information should be larger for $q_{Y|x}$ than $p_{Y|x}$ whenever $q_{Y|x}(x)\geq p_{Y|x}(x)$ . It also satisfies Axiom 4, which states that there exist posterior distributions that have only copy information for all possible $p_{Y|x}(x)\in[p_{Y}(x),1]$ .

These axioms are based on one particular definition of copying, which states that copying occurs when the source and destination messages match perfectly. In fact, this can be generalized to other definitions of copying and transformation by using a loss function $\ell(x,y)$ , which quantifies the dissimilarity between source message $x$ and destination message $y$ . For a given loss function, $\ell(x,y)=0$ indicates that $x$ and $y$ should be considered a perfect copy of each other, while $\ell(x,y)>0$ indicates that $x$ and $y$ should be considered as somewhat different. Importantly, $\ell(x,y)$ can quantify similarity in a graded manner, so that $\ell(x,y^{\prime})>\ell(x,y)$ indicates that $y$ is closer to being a copy of $x$ than $y^{\prime}$ (even though neither $y$ nor $y^{\prime}$ may be a perfect copy of $x$ ).

Given an externally-specified loss function $\ell(x,y)$ , one can define Axiom 3 and Axiom 4 in a generalized manner. The generalized version of Axiom 3 states that posterior distribution $q_{Y|x}$ should have higher copy information than $p_{Y|x}$ whenever its expected loss is lower:

Axiom 3∗.

If $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]\geq\mathbb{E}_{q_{Y|x}}[\ell(x,Y)]$ , then $F(p_{Y|x},p_{Y},x)\leq F(q_{Y|x},p_{Y},x)$ .

The generalized version of Axiom 4 states that at all values of the expected loss which are lower than the expected loss achieved by $p_{Y}$ , there are channels which transmit information only by copying.

Axiom 4∗.

For any $p_{Y}$ and $c\in[\min_{y}\ell(x,y),\mathbb{E}_{p_{Y}}[\ell(x,Y)]]$ , there exists a posterior distribution $p_{Y|x}$ such that $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]=c$ and $F(p_{Y|x},p_{Y},x)=D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ .

Note that in defining Axiom 4∗, we used that $\min_{y}\ell(x,y)$ is the lowest expected loss that can be achieved by any posterior distribution.

Each particular loss function induces its own measure of copy information. In fact, as we show in Appendix C.1, there is a unique measure of copy information which satisfies Axiom 1 and Axiom 2, as defined in Section II.2, plus the generalized axioms Axiom 3∗ and Axiom 4∗, as defined here in terms of the loss function $\ell(x,y)$ . This generalized measure of copy information has the following form:

[TABLE]

Recall that the KL divergence $D_{\mathsf{KL}}(r_{Y}\|p_{Y})$ reflects the amount of information acquired by an agent in going from prior distribution $p_{Y}$ to posterior distribution $r_{Y}$ . Thus, $G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ quantifies the minimum information that must be acquired by an agent in order to match the copying performance of the actual posterior $p_{Y|x}$ , as measured by the expected loss.

Eq. 15 is an instance of a “minimum cross-entropy” problem, which is closely related to the “maximum entropy” principle Kullback (1959); Kapur and Kesavan (1992); Shore and Johnson (1981). The distribution that optimizes Eq. 15 can be written in a simple form (Rubinstein and Kroese, 2016, pp.299-300),

[TABLE]

where $\lambda\geq 0$ is a Lagrange multiplier chosen so that the constraint in Eq. 15 is satisfied, and $Z(\lambda)=\sum_{y}p_{Y}(y)e^{-\lambda\ell(x,y)}$ is a normalization constant. Note that whenever $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]\geq\mathbb{E}_{p_{Y}}[\ell(x,Y)]$ , the solution is $\lambda=0$ and $w_{Y}=p_{Y}$ Rubinstein and Kroese (2016). Otherwise, $\lambda>0$ and the constraint in Eq. 15 holds with equality. In practice, Eq. 15 can be solved by sweeping across the 1-dimensional space of possible $\lambda\geq 0$ values (it can also be solved by standard convex optimization techniques). Once $\lambda$ is determined, the value of copy information is given by

[TABLE]

It can be verified that $D^{\mathsf{copy}}_{x}$ , the measure derived above, corresponds to the special case $\ell(x,y):=1-\delta(x,y)$ , which is called “0-1 loss” in statistics Friedman et al. (2001) and “Hamming distortion” in information theory Cover and Thomas (2006) (see Appendix C.2).
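The one-dimensional search over $\lambda$ can be sketched as follows. This is an illustrative implementation, not the authors' code: it assumes the optimizer has the exponentially tilted form $w_{Y}(y)\propto p_{Y}(y)e^{-\lambda\ell(x,y)}$ implied by the normalization constant $Z(\lambda)$, finds $\lambda$ by bisection rather than a sweep, and uses hypothetical three-outcome distributions. With 0-1 loss, the result is compared against the binary KL divergence $\mathsf{d}(p_{Y|x}(x),p_{Y}(x))$.

```python
import math

def tilted(prior, losses, lam):
    """w(y) proportional to prior(y) * exp(-lam * loss(y))."""
    shift = min(losses)  # subtracting the minimum avoids underflow of every weight
    weights = [p * math.exp(-lam * (l - shift)) for p, l in zip(prior, losses)]
    z = sum(weights)
    return [w / z for w in weights]

def expected_loss(dist, losses):
    return sum(p * l for p, l in zip(dist, losses))

def kl(p, q):
    """KL divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def generalized_copy_info(posterior, prior, losses, iterations=200):
    """Return (G, lam): the minimum KL to the prior under the loss constraint."""
    target = expected_loss(posterior, losses)
    if target >= expected_loss(prior, losses):
        return 0.0, 0.0  # the prior is already feasible, so lam = 0 and w = prior
    lo, hi = 0.0, 1.0
    # The expected loss of the tilted distribution decreases as lam grows.
    while expected_loss(tilted(prior, losses, hi), losses) > target and hi < 1e6:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_loss(tilted(prior, losses, mid), losses) > target:
            lo = mid
        else:
            hi = mid
    return kl(tilted(prior, losses, hi), prior), hi

# Hypothetical prior and posterior over three outcomes; the source message is x = 0.
prior = [0.2, 0.5, 0.3]
posterior = [0.7, 0.1, 0.2]

zero_one = [0.0, 1.0, 1.0]  # 0-1 loss relative to x = 0
g_zero_one, lam_zero_one = generalized_copy_info(posterior, prior, zero_one)
d_binary = kl([0.7, 0.3], [0.2, 0.8])

graded = [0.0, 1.0, 4.0]    # a graded loss, e.g. squared distance from x = 0
g_graded, lam_graded = generalized_copy_info(posterior, prior, graded)
print(g_zero_one, d_binary, g_graded)
```

Bisection is used here only because the expected loss is monotone in $\lambda$; any one-dimensional root finder or a general convex solver would serve equally well.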

The generalized measure $G^{\mathsf{copy}}_{x}$ has many similarities with $D^{\mathsf{copy}}_{x}$ . Like $D^{\mathsf{copy}}_{x}$ , it naturally leads to a non-negative measure of generalized transformation information,

[TABLE]

$G^{\mathsf{copy}}_{x}$ can also be used to decompose total mutual information into (generalized) total copy and transformation information, akin to Eq. 10 and Eq. 11. Finally, one can use $G^{\mathsf{copy}}_{x}$ to define a generalized measure of copying efficiency, following the approach described in Section II.5.

While we believe $D^{\mathsf{copy}}_{x}$ , as defined via the 0-1 loss function, is a simple and reasonable choice in a variety of applications, in some cases it may also be useful to consider other loss functions. One important example is when the source and destination have different sets of outcomes. Recall that $D^{\mathsf{copy}}_{x}$ assumes that the source and destination share the same set of possible outcomes, $\mathcal{A}$ . When this assumption does not hold, generalized measures of copy and transformation information can still be defined, as long as an appropriate loss function $\ell:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is provided (where $\mathcal{X}$ and $\mathcal{Y}$ indicate the outcomes of the source and destination, respectively).

Another important use case occurs when the loss function specifies continuously-varying degrees of functional similarity between source and destination messages. For example, imagine that $p_{Y|X}$ is an image compression algorithm which maps raw images $X$ to compressed outputs $Y$ . Research in computer vision has developed sophisticated loss functions for image compression which correlate strongly with human perceptual judgments Wang and Bovik (2006). By defining copy information in terms of such a loss function, one could measure how much perceptual information is copied by a particular image compression algorithm.

Our generalized approach can also be used to define copy and transformation information for random variables with continuous-valued outcomes. The 0-1 loss function, as used in $D^{\mathsf{copy}}_{x}$ , is not very meaningful for continuous-valued outcomes, since it depends on a measure-0 property of $p_{Y|x}$ . A more natural measure of copy information is produced by the squared-error loss function $\ell(x,y):=(x-y)^{2}$ , giving

[TABLE]

This particular optimization problem has been investigated in the maximum entropy literature, and has been shown to be particularly tractable when $p_{Y}$ belongs to an exponential family Altun and Smola (2006); Dudík and Schapire (2006); Koyejo and Ghosh (2013).
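As an illustration of this tractability, consider a Gaussian prior, for which exponential tilting by the squared-error loss yields another Gaussian. The sketch below is a worked example under that assumption only: the prior $\mathcal{N}(0,1)$, the source value $x=2$, and the target expected loss $0.5$ are hypothetical, and the posterior enters solely through its expected squared error.

```python
import math

def tilted_gaussian(prior_mean, prior_var, x, lam):
    """Mean and variance of w(y) proportional to N(y; mean, var) * exp(-lam * (x - y)**2)."""
    precision = 1.0 / prior_var + 2.0 * lam
    var = 1.0 / precision
    mean = var * (prior_mean / prior_var + 2.0 * lam * x)
    return mean, var

def expected_sq_loss(mean, var, x):
    """E[(x - Y)^2] for Y ~ N(mean, var)."""
    return var + (mean - x) ** 2

def gaussian_kl(mean1, var1, mean0, var0):
    """KL( N(mean1, var1) || N(mean0, var0) ) in nats."""
    return 0.5 * (math.log(var0 / var1) + (var1 + (mean1 - mean0) ** 2) / var0 - 1.0)

def copy_info_sq_loss(target_loss, prior_mean, prior_var, x, iterations=200):
    """Return (G, lam) for squared-error loss and a Gaussian prior."""
    if target_loss <= 0.0:
        raise ValueError("a zero expected loss needs infinite information")
    if target_loss >= expected_sq_loss(prior_mean, prior_var, x):
        return 0.0, 0.0  # the prior already meets the constraint
    lo, hi = 0.0, 1.0
    while expected_sq_loss(*tilted_gaussian(prior_mean, prior_var, x, hi), x) > target_loss:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_sq_loss(*tilted_gaussian(prior_mean, prior_var, x, mid), x) > target_loss:
            lo = mid
        else:
            hi = mid
    mean, var = tilted_gaussian(prior_mean, prior_var, x, hi)
    return gaussian_kl(mean, var, prior_mean, prior_var), hi

g_value, lam = copy_info_sq_loss(target_loss=0.5, prior_mean=0.0, prior_var=1.0, x=2.0)
print(g_value, lam)
```

For these particular numbers the constraint can also be solved by hand, giving $\lambda=1.5$, which provides a check on the numerical search.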

Finally, it is also possible to generalize this approach to vector-valued loss functions $\ell:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}^{n}$ , which allow one to specify dissimilarity in a multi-dimensional way. We discuss the relevant axioms and resulting copy information measure for vector-valued loss functions in Appendix C.3. We also demonstrate that vector-valued loss functions can be used to define measures of copy and transformation information that are additive for independent channels, in the sense discussed in Section II.4.

Given the above, it is natural to briefly review the similarities between our generalized approach and rate-distortion theory Cover and Thomas (2006). In rate-distortion theory, one is given a distribution over source messages $s_{X}$ and a “distortion function” $\ell(x,y)$ which specifies the loss incurred when source message $x$ is encoded with destination message $y$ . The problem is to find the channel $r_{Y|X}$ which minimizes mutual information without exceeding some constraint on the expected distortion,

[TABLE]

where $\alpha$ is an externally-determined parameter. The prototypical application of rate-distortion is compression, i.e., to find a compression channel $r_{Y|X}$ that has both low mutual information and low expected distortion. As can be seen by comparing Eq. 15 and Eq. 18, the optimization problem considered in our definition of generalized copy information and the optimization found in rate-distortion are quite similar: they both involve minimizing a KL divergence subject to an expected loss constraint. Nonetheless, there are some important differences. First and foremost, the goals of the two approaches are different. In our approach, the aim is to decompose the information transmitted by a fixed externally-specified channel into copy and transformation. In rate-distortion, there is no externally-specified channel and the aim is instead to find an optimal channel de novo. Second, our approach is motivated by a set of axioms which postulate how a measure of copy information should behave, rather than from channel-coding considerations which are used to derive the optimization problem in rate-distortion Cover and Thomas (2006). Lastly, copy information is defined in a point-wise manner for each source message $x$ , rather than for an entire set of source messages at once, as in rate-distortion.

We finish by noting that one can also define Eq. 15 in a channel-wise manner (by minimizing $D_{\mathsf{KL}}(r_{Y|X}\|r_{Y})$ , as in Eq. 18) rather than a pointwise manner (by minimizing $D_{\mathsf{KL}}(r_{Y|X=x}\|p_{Y})$ , as in Eq. 15). Under that formulation, one could no longer decompose specific MI into non-negative copy and transformation terms, though total MI could still be decomposed in that way. Interestingly, this alternative formulation would become equivalent to the so-called minimum information principle, a previous proposal for quantifying how much information about source messages is carried by different properties of destination messages Globerson et al. (2009).

IV Thermodynamic costs of copying

Given the close connection between information theory and statistical physics, many information-theoretic quantities can be interpreted in thermodynamic terms Parrondo et al. (2015). As we show here, this includes our proposed measure of copy information, $D^{\mathsf{copy}}_{x}$ . Specifically, we will show that $D^{\mathsf{copy}}_{x}$ reflects the minimal amount of thermodynamic work necessary to copy a physical entity such as a polymer molecule. This latter example emphasizes the difference between information transfer by copying versus by transformation in a fundamental, biologically-inspired physical setup.

Consider a physical system coupled to a heat bath at temperature $T$ , and which is initially in an equilibrium distribution $\pi(i)\propto e^{-E(i)/(kT)}$ with respect to some Hamiltonian $E$ ( $k$ is Boltzmann’s constant). Now imagine that the system is driven to some non-equilibrium distribution $p$ by a physical process, and that by the end of the process the Hamiltonian is again equal to $E$ . The minimal amount of work required by any such process is related to the KL divergence between $p$ and $\pi$ Esposito and Van den Broeck (2010),

[TABLE]

The limit is achieved by thermodynamically-reversible processes. (In this subsection, in accordance with the convention in physics, we assume that all logarithms are in base $e$ , so information is measured in nats.)

Recent work has analyzed the fundamental thermodynamic constraints on copying in a physical system, for example for an information-carrying polymer like DNA Ouldridge and Rein ten Wolde (2017); Poulton et al. (2018). Here we will generally follow the model described in Ouldridge and Rein ten Wolde (2017), while using our notation and omitting some details that are irrelevant for our purposes (such as the microstate/macrostate distinction). In this model, the source $X$ represents the state of the original system (e.g., the polymer to be copied), and the destination $Y$ represents the state of the replicate (e.g., the polymer produced by the copying mechanism). We make several assumptions. First, the source $X$ is not modified during the copying process. Second, $X$ and $Y$ have the same Hamiltonian before and after the copying process. Finally, we follow Ouldridge and Rein ten Wolde (2017) in assuming that $Y$ is a persistent copy of $X$ , meaning that before and after the copying process, $Y$ is physically separated from $X$ and there is no interaction energy between them. This does not preclude $X$ and $Y$ from coming into contact and interacting energetically during intermediate stages of the copying process (for instance by template binding). The assumption of persistent copying means that there are no unaccounted energetic costs involved in preparing the copying system and transporting the produced replicate (e.g., moving the replicate $Y$ to a daughter cell).

Assume that $Y$ starts in the equilibrium distribution, indicated as $\pi_{Y}$ (note that by our persistent copy assumption, the equilibrium distribution cannot depend on the state of $X$ ). Let $p_{Y|x}$ indicate the conditional distribution of replicates after the end of the copying process, where $x$ is the state of the original system $X$ . Following Eq. 19, the minimal work required to bring $Y$ out of equilibrium and produce replicates according to $p_{Y|x}$ is given by

[TABLE]

Note that Eq. 20 specifies the minimal work required to create the overall distribution $p_{Y|x}$ . However, in many real-world scenarios, likely including DNA copying, the primary goal is to create exact copies of the original state, not transformed versions of it (such as nonrandom mutations). That means that for a given source state $x$ , the quality of the replication process can be quantified by the probability of making an exact copy, $p_{Y|x}(x)$ . We can now ask: what is the minimal work required by a physical replication process whose probability of making exact copies is at least as large as $p_{Y|x}(x)$ ? To make the comparison fair, we require that the process begin and end with the same equilibrium distribution, $\pi_{Y}$ . The answer is given by the minimum of the RHS of Eq. 20 under a constraint on the exact-copy yield, which is exactly proportional to $D^{\mathsf{copy}}_{x}$ :

[TABLE]

where Eq. 22 follows from Appendix C.2. The additional work that is expended by the replication process is then lower bounded by a quantity proportional to $D^{\mathsf{trans}}_{x}$ ,

[TABLE]

This shows formally the intuitive idea that transformation information contributes to thermodynamic costs but not to the accuracy of correct copying.
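The work bounds for a single source state can be illustrated numerically. The sketch below assumes the bounds take the form $kT\,D_{\mathsf{KL}}(p_{Y|x}\|\pi_{Y})$ for the full replicate distribution and $kT\,D^{\mathsf{copy}}_{x}(p_{Y|x}\|\pi_{Y})$ for exact copying, which is our reading of the text around Eq. 20 and Eq. 22. The four-state monomer alphabet, the replicate distribution, and the temperature of 300 K are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def kl(p, q):
    """KL divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def binary_kl(a, b):
    return kl((a, 1.0 - a), (b, 1.0 - b))

def work_bounds(posterior, equilibrium, x, temperature):
    """Lower bounds on work (in joules) for producing replicates of state x."""
    kt = BOLTZMANN * temperature
    total = kl(posterior, equilibrium)
    a, b = posterior[x], equilibrium[x]
    copy = binary_kl(a, b) if a > b else 0.0
    return {
        "min_work": kt * total,                     # create the whole distribution
        "min_work_exact": kt * copy,                # reach the same exact-copy yield
        "excess_lower_bound": kt * (total - copy),  # work attributable to transformation
    }

equilibrium = [0.25, 0.25, 0.25, 0.25]  # hypothetical equilibrium over four states
posterior = [0.90, 0.08, 0.01, 0.01]    # hypothetical replicate distribution for x = 0
bounds = work_bounds(posterior, equilibrium, x=0, temperature=300.0)
kt = BOLTZMANN * 300.0
for name, value in bounds.items():
    print(name, value / kt, "kT")
```

In this toy case nearly all of the minimal work goes into raising the exact-copy probability, and only a small remainder reflects the uneven spread of errors over the other three states.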

In most cases, a replication system is designed for copying not just one source state $x$ , but an entire ensemble of source states (for example, the DNA replication system can copy a huge ensemble of source DNA sequences, not just one). Assume that $X$ is distributed according to some $s_{X}(x)$ . Across this ensemble of source states, the minimal amount of expected thermodynamic work required to produce replicates according to conditional distribution $p_{Y|X}$ is given by

[TABLE]

Since KL is non-negative, the minimum expected work is lowest when the equilibrium distribution $\pi_{Y}$ matches the marginal distribution of replicates, $p_{Y}(y)=\sum_{x}s(x)p(y|x)$ . Using similar arguments as above, we can ask about the minimum expected work required to produce replicates, assuming each source state $x$ achieves an exact-copy yield of at least $p_{Y|x}(x)$ . This turns out to be the expectation of Eq. 22,

[TABLE]

The additional expected work that is needed by the replication process, above and beyond an optimal process that achieves the same exact-copy yield, is lower bounded by the transformation information,

[TABLE]

When the equilibrium distribution $\pi_{Y}$ matches the marginal distribution $p_{Y}$ , $\langle W^{\text{exact}}_{\text{min}}\rangle$ is exactly equal to $kTI^{\mathsf{copy}}$ . Furthermore, in this special case the thermodynamic efficiency of exact copying, defined as the ratio of minimal work to actual work, becomes equal to the information-theoretic copying efficiency of $p$ , as defined in Eq. 14:

[TABLE]
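The role of the equilibrium distribution in the ensemble case can be made explicit with a short computation. It relies on the standard identity $\sum_{x}s(x)D_{\mathsf{KL}}(p_{Y|x}\|\pi_{Y})=I(Y\!:\!X)+D_{\mathsf{KL}}(p_{Y}\|\pi_{Y})$, which is why the expected minimal work is smallest at $\pi_{Y}=p_{Y}$. The sketch works in units of $kT$ and uses a hypothetical three-state source and channel; the function names are ours.

```python
import math

def kl(p, q):
    """KL divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def copy_term(posterior, reference, x):
    """Binary KL between posterior[x] and reference[x], or zero if not exceeded."""
    a, b = posterior[x], reference[x]
    return kl((a, 1.0 - a), (b, 1.0 - b)) if a > b else 0.0

def expected_min_work(source, channel, equilibrium):
    """Expected minimal work in units of kT; channel[x][y] = p(y|x)."""
    return sum(s * kl(row, equilibrium) for s, row in zip(source, channel))

def expected_min_exact_work(source, channel, equilibrium):
    """Expected minimal work (units of kT) at the same exact-copy yields."""
    return sum(s * copy_term(row, equilibrium, x)
               for x, (s, row) in enumerate(zip(source, channel)))

source = [0.5, 0.3, 0.2]
channel = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.20, 0.20, 0.60]]
p_y = [sum(s * row[y] for s, row in zip(source, channel)) for y in range(3)]
uniform = [1.0 / 3.0] * 3

work_at_marginal = expected_min_work(source, channel, p_y)     # equals I(Y:X)
work_at_uniform = expected_min_work(source, channel, uniform)  # I(Y:X) + KL(p_Y || uniform)
exact_at_marginal = expected_min_exact_work(source, channel, p_y)  # equals I^copy
efficiency = exact_at_marginal / work_at_marginal
print(work_at_marginal, work_at_uniform, efficiency)
```

With the equilibrium distribution set to the marginal, the printed ratio is the copying efficiency of the channel, here treated as the thermodynamic efficiency of a process that attains the minimal work.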

As can be seen, standard information-theoretic measures, such as Eq. 20, bound the minimal thermodynamic costs of transferring information from one physical system to another, whether that transfer happens by copying or by transformation. However, as we have argued above, the difference between copying and transformation is essential in many biological scenarios, as well as other domains. In such cases, $D^{\mathsf{copy}}_{x}$ arises naturally as the minimal thermodynamic work required to replicate information by copying.

Concerning the example of DNA copying that we discussed throughout this section, our results should be interpreted with some care. We have generally imagined that the source system represents the state of an entire polymer, e.g., the state of an entire DNA molecule, and that the probability of exact copying refers to the probability that the entire sequence is reproduced without any errors. Alternatively, one can use the same framework to consider the probability of copying a single monomer in a long polymer (assuming that the thermodynamics of polymerization can be disregarded), as might be represented for instance by a single-nucleotide DNA substitution matrix Yang (1994), as analyzed in the last section. Generally speaking, $D^{\mathsf{copy}}_{x}$ computed at the level of single monomers will be different from $D^{\mathsf{copy}}_{x}$ computed at the level of entire polymers, since the probability of exact copying means different things in these two formulations.

V Copy and transformation in amino acid substitution matrices

In the previous section, we saw how $D^{\mathsf{copy}}_{x}$ and $I^{\mathsf{copy}}$ arise naturally when studying the fundamental limits on the thermodynamics of copying, which includes the special case of replicating information-bearing polymers. Here we demonstrate how these measures can be used to characterize the information-transmission properties of a real-world biological replication system, as formalized by a communication channel $p_{Y|X}$ from parent to offspring Yang (1994); Le and Gascuel (2008). In this context, we show how $I^{\mathsf{copy}}$ can be used to quantify precisely how much information is transmitted by copying, without mutations. At the same time, we will use $I^{\mathsf{trans}}$ to quantify how much information is transmitted by transformation, that is by systematic nonrandom mutations that carry information but do not preserve the identity of the original message Li et al. (1984); Yang (1994); Graur and Li (2000). We also quantify the effect of purely-random mutations, which correspond to the conditional entropy of the channel, $H({Y|X})$ .

We demonstrate these measures on empirical data of point accepted mutations (PAM) of amino acids. PAM data represents the rates of substitutions between different amino acids during the course of biological evolution, and has various applications, including evolutionary modeling, phylogenetic reconstructions, and protein alignment Le and Gascuel (2008). We emphasize that amino acid PAM matrices do not reflect the direct physical transfer of information from protein to protein, but rather the effects of underlying processes of DNA-based replication and selection, followed by translation.

Formally, an amino acid PAM matrix $Q$ is a continuous-time rate matrix. $Q_{yx}$ represents the instantaneous rate of substitution from amino acid $x$ to amino acid $y$ , where both $x$ and $y$ belong to $\mathcal{A}=\{1,\dots,20\}$ , representing the 20 standard amino acids. We performed our analysis on a particular PAM matrix $Q$ which was published by Le and Gascuel (2008) (this matrix was provided by the pyvolve Python package Spielman and Wilke (2015)). We calculated a discrete-time conditional probability distribution $p_{Y|X}$ from this matrix by computing the matrix exponential $p_{Y|X}=\exp(\tau Q)$ . Thus, $p(y|x)$ represents the probability that amino acid $x$ is replaced by amino acid $y$ over time scale $\tau$ . For simplicity we used timescale $\tau=1$ . We used the stationary distribution of $Q$ as the source distribution $s_{X}$ , which correlates closely with empirically-observed amino acid frequencies (Le and Gascuel, 2008, Fig. 1). Using the decomposition presented in Eq. 11, we arrived at the following values for the communication channel described by the conditional probabilities $p_{Y|X}$ :

[TABLE]

where

[TABLE]

We also computed the intrinsic noise for this channel (see Eq. 12),

[TABLE]

Finally, we computed the specific copy and transformation information, $D^{\mathsf{copy}}_{x}$ and $D^{\mathsf{trans}}_{x}$ , for different amino acids. The results are shown in Fig. 3. We remind the reader that the sum of $D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ and $D^{\mathsf{trans}}_{x}(p_{Y|x}\|p_{Y})$ for each amino acid $x$ — that is, the total height of the stacked bar plots in the figure — is equal to the specific MI $I(Y\!:\!X\!\!=\!x)$ for that $x$ , as explained in the decomposition of Eq. 9.
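The analysis pipeline described above (rate matrix, matrix exponential, stationary source, decomposition) can be sketched on a toy example. The three-state rate matrix below is hypothetical and is not the Le and Gascuel matrix; the matrix exponential is computed with a simple scaling-and-squaring Taylor series to keep the sketch dependency-free, whereas in practice a library routine would normally be used. The binary-KL form of $D^{\mathsf{copy}}_{x}$ from Appendix A is assumed.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(m, squarings=10, terms=20):
    """Matrix exponential via a Taylor series on m / 2**squarings, then repeated squaring."""
    n = len(m)
    small = [[v / 2.0 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def kl(p, q):
    """KL divergence in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

# Toy rate matrix with rate[y][x] = rate of substitution from x to y (columns sum to 0).
rate = [[-0.35, 0.20, 0.02],
        [0.30, -0.25, 0.03],
        [0.05, 0.05, -0.05]]
tau = 1.0
n = len(rate)
trans = mat_exp([[tau * v for v in row] for row in rate])  # trans[y][x] = p(y|x)

# Stationary distribution by power iteration, used as the source distribution.
source = [1.0 / n] * n
for _ in range(5000):
    source = [sum(trans[y][x] * source[x] for x in range(n)) for y in range(n)]

p_y = [sum(trans[y][x] * source[x] for x in range(n)) for y in range(n)]
mi = sum(source[x] * kl([trans[y][x] for y in range(n)], p_y) for x in range(n))
copy = sum(source[x] * kl((trans[x][x], 1 - trans[x][x]), (p_y[x], 1 - p_y[x]))
           for x in range(n) if trans[x][x] > p_y[x])
transformation = mi - copy
noise = -sum(source[x] * trans[y][x] * math.log2(trans[y][x])
             for x in range(n) for y in range(n) if trans[y][x] > 0)
print(mi, copy, transformation, noise)
```

Because the source is stationary, the marginal over destinations equals the source distribution, so the copy, transformation, and noise terms together account for the full entropy of the destination.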

While we do not dive deeply into the biological significance of these results, we highlight several interesting findings. First, for this PAM matrix and timescale ( $\tau=1$ ), a considerable fraction of the information ( $\approx 1/4$ ) is transmitted not by copying but by non-random mutations. Generally, such non-random mutations represent underlying physical, genetic, and biological constraints that allow some pairs of amino acids to substitute each other more readily than other pairs.

Second, we observe considerable variation in the amount of specific MI, copy information, and transformation information across different amino acids, as well as different ratios of copy information to transformation information. In general, amino acids with more copy information are conserved unchanged over evolutionary timescales. At the same time, it is known that conserved amino acids tend to be “outliers” in terms of their physicochemical properties (such as hydrophobicity, volume, polarity, etc.), since mutations to such outliers are likely to alter protein function in deleterious ways Graur (1985); Yang et al. (1998). To analyze this quantitatively, we used Miyata’s measure of distance between amino acids, which is based on differences in volume and polarity Miyata et al. (1979). For each amino acid, we quantified its degree of “outlierness” in terms of its mean Miyata distance to all 19 other amino acids. The Spearman rank correlation between this outlierness measure and copy information (as shown in Fig. 3) was 0.57 ( $p=0.009$ ). On the other hand, the rank correlation between outlierness and transformation information was 0.22 ( $p=0.352$ ). Similar results were observed for other chemically-motivated measures of amino acid distance, such as Grantham’s distance Grantham (1974) and Sneath’s index Sneath (1966). This demonstrates that amino acids with unique chemical characteristics tend to have more copy information, but not more transformation information.
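The outlierness analysis can be sketched as follows. The code computes each item's mean distance to all others and the Spearman rank correlation with a second quantity; the four-item distance matrix and the copy-information values are made-up toy numbers, not Miyata distances or results from the paper, and no p-value is computed.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def outlierness(distance):
    """Mean distance from each item to all other items."""
    n = len(distance)
    return [sum(distance[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

# Toy symmetric distance matrix and toy copy-information values for four items.
distance = [[0.0, 1.0, 2.0, 5.0],
            [1.0, 0.0, 1.5, 4.0],
            [2.0, 1.5, 0.0, 3.0],
            [5.0, 4.0, 3.0, 0.0]]
copy_values = [0.9, 0.4, 0.6, 1.5]

rho = spearman(outlierness(distance), copy_values)
print(rho)
```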

VI Discussion

Although mutual information is a very common and successful measure of transmitted information, it is insensitive to the distinction between information that is transmitted by copying versus information that is transmitted by transformation. Nonetheless, as we have argued, this distinction is of fundamental importance in many real-world systems.

In this paper we propose a rigorous and practical way to decompose specific mutual information, and more generally Bayesian surprise, into two non-negative terms corresponding to copy and transformation, $I=I^{\mathsf{copy}}+I^{\mathsf{trans}}$ . We derive our decomposition using an axiomatic framework: we propose a set of four axioms that any measure of copy information should obey, and then identify the unique measure that satisfies those axioms. At the same time, we show that our measure of copy information is one of a family of functionals, each of which corresponds to a different way of quantifying error in transmission. We also demonstrate that our measures have a natural interpretation in thermodynamic terms, which suggests novel approaches for understanding the thermodynamic efficiency of biological replication processes, in particular DNA and RNA duplication. Finally, we demonstrate our results on real-world biological data, exploring copy and transformation information of amino acid substitution rates. We find significant variation in the amount of information transmitted by copying vs. transformation across different amino acids.

Several directions for future work present themselves.

First, there is a large range of practical and theoretical applications of our measures, from analysis of biological and neural information transmission to the study of the thermodynamics of self-replication, a fundamental and challenging problem in biophysics Corominas-Murtra (2019).

Second, we suspect our measures of copy and transformation information have further connections to existing formal treatments in information theory, in particular rate-distortion theory Cover and Thomas (2006), whose connections we started to explore here. We also believe that our decomposition may be generalizable beyond Bayesian surprise and mutual information to include other information-theoretic measures, including conditional mutual information and multi-information. Decomposing conditional mutual information is of particular interest, since it will permit a decomposition of the commonly-used transfer entropy Schreiber (2000) measure into copy and transformation components, thus separating two different modes of dynamical information flow between systems.

Finally, we point out that our proposed decomposition has some high-level similarities to other recent proposals for information-theoretic decomposition, such as the “partial information decomposition” of multivariate information into redundant and synergistic components Williams and Beer (2010), integrated information decompositions Kahle et al. (2009); Oizumi et al. (2016), and decompositions of mutual information into “semantic” (valuable) and “non-semantic” (non-valuable) information Kolchinsky and Wolpert (2018). We also mention another recent proposal for an alternative information-theoretic notion of “copying” Mediano et al. (2019), in which copying is said to occur in a multivariate system when information that is present in one variable spreads to other variables (regardless of any transformations that information may undergo). Further research should explore if and how the decomposition proposed in this paper relates to these other approaches.

Acknowledgments

AK was supported by Grant No. FQXi-RFP-1622 from the FQXi foundation, and Grant No. CHE-1648973 from the U.S. National Science Foundation. AK would like to thank the Santa Fe Institute for supporting this research. The authors thank Jordi Fortuny, Rudolf Hanel, Joshua Garland, and Blai Vidiella for helpful discussions, as well as the anonymous reviewers for their insightful suggestions.

Appendix A $D^{\mathsf{copy}}_{x}$ satisfies the four axioms

$D^{\mathsf{copy}}_{x}$ satisfies Axiom 1 by non-negativity of KL.

It satisfies Axiom 2 when $p_{Y|x}(x)>p_{Y}(x)$ because $\mathsf{d}(p_{Y|x}(x),p_{Y}(x))\leq D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$ by the data processing inequality for KL divergence (Csiszar and Körner, 2011, Lemma 3.11). Otherwise, when $p_{Y|x}(x)\leq p_{Y}(x)$ , $D^{\mathsf{copy}}_{x}$ vanishes and thus satisfies Axiom 2 trivially.

It satisfies Axiom 3 when $p_{Y|x}(x)\leq p_{Y}(x)$ because in that case $D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})=0\leq D^{\mathsf{copy}}_{x}(q_{Y|x}\|p_{Y})$ . If instead $p_{Y|x}(x)>p_{Y}(x)$ , then note that the derivative of $\mathsf{d}(a,b)$ with respect to $a$ is $\frac{d}{da}\mathsf{d}(a,b)=\log\frac{a}{b}-\log\frac{1-a}{1-b}$ , which is strictly positive when $a>b$ . Thus, since $q_{Y|x}(x)\geq p_{Y|x}(x)>p_{Y}(x)$ , we have $D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})\leq D^{\mathsf{copy}}_{x}(q_{Y|x}\|p_{Y})$ .

Finally, we show that $D^{\mathsf{copy}}_{x}$ satisfies Axiom 4. For any prior distribution $p_{Y}$ , define the following posterior distribution $p_{Y|x}^{\alpha}(y)$ :

[TABLE]

where $\alpha$ is a parameter that can vary from $p_{Y}(x)$ to 1. It is easy to verify that for all $\alpha$ ,

[TABLE]

and that $D^{\mathsf{copy}}_{x}(p_{Y|x}^{\alpha}\|p_{Y})$ ranges in a continuous manner from 0 (for $\alpha=p_{Y}(x)$ ) to $-\log p_{Y}(x)$ (for $\alpha=1$ ).
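One construction with the stated properties can be checked numerically. Since the defining equation is not reproduced in this text, the specific form below is an assumption: it places probability $\alpha$ on $y=x$ and rescales the prior over the remaining outcomes, which requires $p_{Y}(x)<1$. The sketch verifies that the KL divergence from the prior then coincides with the binary KL divergence $\mathsf{d}(\alpha,p_{Y}(x))$, so that all of the Bayesian surprise is copy information. The prior and the grid of $\alpha$ values are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def binary_kl(a, b):
    return kl((a, 1.0 - a), (b, 1.0 - b))

def copy_only_posterior(prior, x, alpha):
    """Posterior with mass alpha on x and the prior rescaled on all other outcomes."""
    rest = 1.0 - prior[x]
    return [alpha if y == x else (1.0 - alpha) * p / rest for y, p in enumerate(prior)]

prior = [0.2, 0.5, 0.3]
x = 0
alphas = [0.2, 0.4, 0.6, 0.8, 1.0]  # from prior[x] up to 1

surprises = [kl(copy_only_posterior(prior, x, a), prior) for a in alphas]
copy_values = [binary_kl(a, prior[x]) for a in alphas]
for a, s, c in zip(alphas, surprises, copy_values):
    print(a, s, c)
```

The two columns agree for every $\alpha$, starting at zero when $\alpha=p_{Y}(x)$ and ending at $-\log p_{Y}(x)$ when $\alpha=1$.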

Appendix B Proof of Theorem 1

Before proceeding, we first prove two useful lemmas.

Lemma B.1.

Given Axiom 3, $F(p_{Y|x},p_{Y},x)=F(q_{Y|x},p_{Y},x)$ if $p_{Y|x}(x)=q_{Y|x}(x)$ .

Proof.

Follows from applying Axiom 3 in both directions. ∎

Lemma B.2.

Given Axioms 1, 2 and 3, if $p_{Y|x}(x)\leq p_{Y}(x)$ , then $F(p_{Y|x},p_{Y},x)=0$ .

Proof.

If $p_{Y|x}(x)\leq p_{Y}(x)$ , then $F(p_{Y|x},p_{Y},x)\leq F(p_{Y},p_{Y},x)$ by Axiom 3. By Axiom 2, $F(p_{Y},p_{Y},x)\leq D_{\mathsf{KL}}(p_{Y}\|p_{Y})=0$ . Combining gives $F(p_{Y|x},p_{Y},x)\leq 0$ , while $F(p_{Y|x},p_{Y},x)\geq 0$ by Axiom 1. ∎

We then show that $D^{\mathsf{copy}}_{x}$ is the largest possible measure that satisfies Axioms 1, 2 and 3.

Proposition B.1.

Any $F$ which satisfies Axioms 1, 2 and 3 must obey $F(p_{Y|x},p_{Y},x)\leq D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ .

Proof.

Given Lemma B.2, without loss of generality we restrict our attention to the case where $p_{Y|x}(x)>p_{Y}(x)$ . Define the posterior $p_{Y|x}^{\alpha}$ as in Eq. 30, while taking $\alpha=p_{Y|x}(x)$ . Then, by Lemma B.1,

[TABLE]

At the same time,

[TABLE]

where the first inequality follows from Axiom 2, and the second equality from Eq. 31. ∎

We are now ready to prove the main result from Section II.2.

Proof of Theorem 1.

Consider some $p_{Y|x},p_{Y},x$ , and assume $p_{Y|x}(x)>p_{Y}(x)$ (without loss of generality by Lemma B.2). By Axiom 4, there must exist a posterior $q_{Y|x}$ such that $q_{Y|x}(x)=p_{Y|x}(x)$ and

[TABLE]

Note that by the data processing inequality for KL divergence, $D_{\mathsf{KL}}(q_{Y|x}\|p_{Y})\geq D^{\mathsf{copy}}_{x}(q_{Y|x}\|p_{Y})$ .

Then, by Lemma B.1, $F(p_{Y|x},p_{Y},x)=F(q_{Y|x},p_{Y},x)$ since $p_{Y|x}(x)=q_{Y|x}(x)$. Similarly, it can be verified that $D^{\mathsf{copy}}_{x}(q_{Y|x}\|p_{Y})=D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$. Combining the above results shows that $F(p_{Y|x},p_{Y},x)\geq D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$. The theorem follows by combining this with Proposition B.1.

∎

Appendix C Axiomatic derivation and solution of Eq. 15

C.1 Axiomatic derivation

We first demonstrate that the generalized copy information defined in Eq. 15, $G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$, is the unique measure that satisfies Axioms 1 and 2 and our modified Axioms 3∗ and 4∗. Our derivation has the same structure as the one in Appendix B, so we proceed more quickly.

First, we verify that $G^{\mathsf{copy}}_{x}$ satisfies the four axioms. It satisfies Axiom 1 by non-negativity of KL. It satisfies Axiom 2 because $p_{Y|x}$ falls within the feasibility set of Eq. 15, so the minimum $G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ has to be less than or equal to $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})$. It satisfies Axiom 3∗ because $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]\geq\mathbb{E}_{q_{Y|x}}[\ell(x,Y)]$ means that the feasibility set of Eq. 15 for $q_{Y|x}$ is a subset of the feasibility set for $p_{Y|x}$, so the minimum $G^{\mathsf{copy}}_{x}(q_{Y|x}\|p_{Y})$ has to be greater than or equal to the minimum $G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$. To show that it satisfies Axiom 4∗, note that the distribution $w_{Y}$ which optimizes Eq. 15 will achieve $\mathbb{E}_{w_{Y}}[\ell(x,Y)]=\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]$ whenever $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]\leq\mathbb{E}_{p_{Y}}[\ell(x,Y)]$ (Rubinstein and Kroese, 2016, pp. 299-300). Note also that $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]$ can vary from $\min_{y}\ell(x,y)$ (for $p_{Y|x}(y|x)=\delta(y,\operatorname*{arg\,min}_{y^{\prime}}\ell(x,y^{\prime}))$) to $\mathbb{E}_{p_{Y}}[\ell(x,Y)]$ (for $p_{Y|x}=p_{Y}$).

We now demonstrate that $G^{\mathsf{copy}}_{x}$ is the unique measure that satisfies the four axioms. We begin by showing that $F(p_{Y|x},p_{Y},x)\leq G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ for any $F$ . Given a choice of $p_{Y|x}$ , $p_{Y}$ , and $x$ , let $w_{Y}$ be the solution to Eq. 15, so

[TABLE]

Given the definition of $G^{\mathsf{copy}}_{x}$, $\mathbb{E}_{w_{Y}}[\ell(x,Y)]\leq\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]$. Then, by Axiom 3∗, Axiom 2, and Eq. 33,

[TABLE]

We finish by showing that $F(p_{Y|x},p_{Y},x)\geq G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ for any $F$ . First consider the case $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]\geq\mathbb{E}_{p_{Y}}[\ell(x,Y)]$ . Then, $G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})=0$ by construction, and therefore $F(p_{Y|x},p_{Y},x)\geq G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ by Axiom 1.

When $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]<\mathbb{E}_{p_{Y}}[\ell(x,Y)]$, by Axiom 4∗ there must exist a posterior $q_{Y|x}$ such that $\mathbb{E}_{q_{Y|x}}[\ell(x,Y)]=\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]$ and

[TABLE]

Then, by definition of $G^{\mathsf{copy}}_{x}$ ,

[TABLE]

Finally, by Axiom 3∗,

[TABLE]

Combining Eq. 36, Eq. 34, and then Eq. 35 shows that $F(p_{Y|x},p_{Y},x)\geq G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ .

Thus, $G^{\mathsf{copy}}_{x}$ is the unique measure that satisfies Axioms 1 and 2 and our generalized Axioms 3∗ and 4∗.
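For a scalar loss, the minimization that defines $G^{\mathsf{copy}}_{x}$ can be explored numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation. It assumes that Eq. 15 minimizes $D_{\mathsf{KL}}(r_{Y}\|p_{Y})$ subject to $\mathbb{E}_{r_{Y}}[\ell(x,Y)]\leq\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]$ (as the feasibility-set argument above suggests), and that the minimizer is an exponential tilt of the prior, $w_{Y}(y)\propto p_{Y}(y)e^{-\lambda\ell(x,y)}$ with $\lambda\geq 0$. The multiplier $\lambda$ is located by bisection; all function names are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) in nats; terms with p(y) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected(dist, loss):
    """Expected loss under dist, where loss[y] stands for l(x, y) at a fixed x."""
    return sum(d * l for d, l in zip(dist, loss))


def tilt(prior, loss, lam):
    """Assumed minimizer form: w(y) proportional to prior(y) * exp(-lam * loss(y))."""
    shift = min(loss)  # constant shift for stability; it cancels after normalizing
    weights = [p * math.exp(-lam * (l - shift)) for p, l in zip(prior, loss)]
    z = sum(weights)
    return [w / z for w in weights]


def g_copy(post, prior, loss):
    """Generalized copy information for one source message and a scalar loss."""
    target = expected(post, loss)
    if target >= expected(prior, loss):
        return 0.0  # the prior is feasible, so the minimum KL divergence is 0
    lo, hi = 0.0, 1.0
    # Expected loss under the tilt decreases in lam; grow hi until it is feasible.
    while expected(tilt(prior, loss, hi), loss) > target and hi < 1e6:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected(tilt(prior, loss, mid), loss) > target:
            lo = mid
        else:
            hi = mid
    return kl(tilt(prior, loss, hi), prior)


if __name__ == "__main__":
    prior = [0.2, 0.3, 0.5]
    post = [0.6, 0.3, 0.1]
    zero_one_loss = [0.0, 1.0, 1.0]  # l(x, y) = 1 - delta(x, y) for x = 0
    print(g_copy(post, prior, zero_one_loss))
```

With the 0-1 loss, the value returned by this sketch can be compared against $\mathsf{d}(p_{Y|x}(x),p_{Y}(x))$, which is the reduction derived in Section C.2.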

C.2 $D^{\mathsf{copy}}_{x}$ as the solution to Eq. 15 for the 0-1 loss function

Consider the optimization problem:

$$\min_{r_{Y}:\,r_{Y}(x)\geq p_{Y|x}(x)}D_{\mathsf{KL}}(r_{Y}\|p_{Y}).\tag{37}$$

When $p_{Y}(x)\geq p_{Y|x}(x)$, the solution $r_{Y}=p_{Y}$ satisfies the constraint and achieves $D_{\mathsf{KL}}(p_{Y}\|p_{Y})=0$, the minimum possible. When $p_{Y}(x)<p_{Y|x}(x)$, we use the chain rule for KL divergence (Cover and Thomas, 2006) to write

[TABLE]

The second term is minimized by setting $r_{Y}(y)\propto p_{Y}(y)$ for $y\neq x$ , so that $r_{Y}(y|Y\neq x)=p_{Y}(y|Y\neq x)$ and $D_{\mathsf{KL}}(r_{Y}(Y|Y\neq x)\|p_{Y}(Y|Y\neq x))=0$ . Thus, in the case that $p_{Y}(x)<p_{Y|x}(x)$ , we have reduced the optimization problem of Eq. 37 to the equivalent problem

[TABLE]

Note that the derivative of $\mathsf{d}(a,b)$ with respect to $a$ is $\frac{d}{da}\mathsf{d}(a,b)=\log\frac{a}{b}-\log\frac{1-a}{1-b}$, which is strictly positive when $a>b$. Given the assumption that $p_{Y|x}(x)>p_{Y}(x)$, Eq. 38 is minimized by $a=p_{Y|x}(x)$. Thus, $\mathsf{d}(p_{Y|x}(x),p_{Y}(x))$ is the solution to Eq. 37 when $p_{Y}(x)<p_{Y|x}(x)$.

Combining these two results shows that $D^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})$ , as defined in Eq. 8, is the solution to Eq. 37.
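The two-case argument above can be spot-checked numerically. The following Python sketch builds the candidate minimizer described in the text (equal to the prior when the constraint is inactive, otherwise with $r_{Y}(x)=p_{Y|x}(x)$ and the remaining mass proportional to $p_{Y}$) and compares it against randomly drawn feasible distributions. It assumes the constraint takes the form $r_{Y}(x)\geq p_{Y|x}(x)$; the helper names, the example prior, and the sampling scheme are illustrative assumptions.

```python
import math
import random


def kl(p, q):
    """KL divergence D_KL(p || q) in nats; terms with p(y) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binary_kl(a, b):
    """d(a, b): KL divergence between Bernoulli(a) and Bernoulli(b), 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total


def constrained_minimizer(post_x, prior, x):
    """Candidate minimizer of KL(r || prior) subject to r[x] >= post_x."""
    if prior[x] >= post_x:
        return list(prior)  # constraint inactive: the prior itself is optimal
    scale = (1 - post_x) / (1 - prior[x])
    return [post_x if y == x else scale * py for y, py in enumerate(prior)]


def random_feasible(post_x, n, x, rng):
    """Draw a random distribution over n outcomes with r[x] >= post_x."""
    rx = post_x + (1 - post_x) * rng.random()
    rest = [rng.random() + 1e-9 for _ in range(n - 1)]
    total = sum(rest)
    rest = [(1 - rx) * v / total for v in rest]
    return rest[:x] + [rx] + rest[x:]


if __name__ == "__main__":
    prior = [0.2, 0.3, 0.5]
    r_star = constrained_minimizer(0.6, prior, 0)
    rng = random.Random(0)
    worst = min(kl(random_feasible(0.6, 3, 0, rng), prior) for _ in range(1000))
    print(kl(r_star, prior), binary_kl(0.6, 0.2), worst)
```

A random search of this kind cannot prove optimality; it only confirms that no sampled feasible distribution beats the closed-form candidate.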

C.3 Vector-valued loss functions

One can also generalize the approach described in Section III to vector-valued loss functions, $\ell:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}^{n}$ , where we use $\mathcal{X}$ and $\mathcal{Y}$ to indicate the sets of outcomes of $X$ and $Y$ respectively (recall that these can be different, in the context of our generalized copy and transformation information measures). As we’ll see below, one application of vector-valued loss functions is to define measures of copy and transformation information that are additive when independent channels are concatenated.

We first discuss which axioms might be expected to hold for generalized copy information measures with vector-valued loss functions. Axiom 1 and Axiom 2 do not make reference to the loss function, and remain unmodified. Axiom 3∗ is still meaningful, as long as the inequality $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]\geq\mathbb{E}_{q_{Y|x}}[\ell(x,Y)]$ is taken in an element-wise fashion. Axiom 4∗ should be dropped for vector-valued functions, for reasons explained below.

Using the derivation found in Section C.1, it can be shown that the largest measure which satisfies Axiom 1, Axiom 2, and Axiom 3∗ for a vector-valued loss function is given by

[TABLE]

where $\ell_{i}$ indicates the $i^{\mathrm{th}}$ component of the loss function $\ell$. Eq. 39 is a minimum cross-entropy problem with $n$ different constraints. The general solution to this problem will have the following form (Rubinstein and Kroese, 2016):

[TABLE]

where $\lambda_{i}\geq 0$ is the Lagrange multiplier for constraint $i$ and $Z(\lambda_{1},\dots,\lambda_{n})$ is a normalization constant. The Lagrange multipliers can be found by using standard convex optimization techniques. Note that all $\lambda_{i}=0$ if $\mathbb{E}_{p_{Y|x}}[\ell_{i}(x,Y)]\geq\mathbb{E}_{p_{Y}}[\ell_{i}(x,Y)]$ for all $i$, in which case $w_{Y}=p_{Y}$. Even if $\mathbb{E}_{p_{Y|x}}[\ell(x,Y)]<\mathbb{E}_{p_{Y}}[\ell(x,Y)]$ holds element-wise, however, it may be impossible to make all of the constraints simultaneously tight up to equality. In other words, it will not always be the case that $\mathbb{E}_{w_{Y}}[\ell_{i}(x,Y)]=\mathbb{E}_{p_{Y|x}}[\ell_{i}(x,Y)]$ for all $i=1,\dots,n$, and some (but not necessarily all) of the multipliers $\lambda_{i}$ may be equal to 0. For this reason, Axiom 4∗ is not generally achievable for copy information defined with vector-valued loss functions, and we drop it from our requirements. This means $G^{\mathsf{copy}}_{x}$, as defined in Eq. 39, is not the unique measure which satisfies the remaining three axioms (Axiom 1, Axiom 2, and Axiom 3∗). For example, they are also satisfied by the trivial measure $F(p_{Y|x},p_{Y},x)=0$ for all $p_{Y|x}$, $p_{Y}$, and $x$.

Vector-valued loss functions can be used to derive an additive measure of copy information. Imagine that source and destination messages consist of sequences of $n$ symbols. If the source symbols are chosen independently, $s(x)=\prod_{i=1}^{n}s_{i}(x_{i})$, and transmitted across $n$ independent channels, $p(y|x)=\prod_{i=1}^{n}p_{i}(y_{i}|x_{i})$, then one can verify that the destination marginal distribution will also have a product form,

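$$p_{Y}(y)=\prod_{i=1}^{n}p_{Y_{i}}(y_{i}),\qquad p_{Y_{i}}(y_{i})=\sum_{x_{i}}s_{i}(x_{i})\,p_{i}(y_{i}|x_{i}).\tag{41}$$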

In that case, one may desire a measure of copy information that is additive across the $n$ transmissions (see also discussion in Section II.4). This can be achieved by choosing an $n$ -dimensional loss function, $\ell(x,y)=\langle\ell_{1}(x_{1},y_{1}),\ell_{2}(x_{2},y_{2}),\dots,\ell_{n}(x_{n},y_{n})\rangle$ . It can be seen from Eq. 41 and Eq. 40 that the optimal distribution will have a product form, $w(y)=\prod_{i=1}^{n}w_{i}(y_{i})$ . By Eq. 39, it can also be checked that the resulting copy information will have an additive form,

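$$G^{\mathsf{copy}}_{x}(p_{Y|x}\|p_{Y})=\sum_{i=1}^{n}G^{\mathsf{copy}}_{x_{i}}(p_{Y_{i}|x_{i}}\|p_{Y_{i}}),\tag{42}$$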

where $G^{\mathsf{copy}}_{x_{i}}(p_{Y_{i}|x_{i}}\|p_{Y_{i}})$ is the generalized copy information defined for loss function $\ell_{i}(x_{i},y_{i})$. Note that in this case $D_{\mathsf{KL}}(p_{Y|x}\|p_{Y})=\sum_{i}D_{\mathsf{KL}}(p_{Y_{i}|x_{i}}\|p_{Y_{i}})$. Therefore, by Eqs. 17 and 42, the generalized transformation information $G^{\mathsf{trans}}_{x}$ will also be additive.

Appendix D Proof of Prop. 1

Before proving Proposition 1, we prove several intermediate results. We start by deriving some useful properties of the roots of the quadratic polynomial $ax^{2}-(a+s)x+sc$. In particular, we consider the two roots

$$f_{\pm}(a,s,c):=\frac{(a+s)\pm\sqrt{(a+s)^{2}-4asc}}{2a},$$

where $a\in\mathbb{R}\setminus\{0\}$ , $s\in(0,1]$ , $c\in(0,1]$ .

Lemma D.1.

$f_{+}(a,s,c)<0$ when $a<0$ and $f_{+}(a,s,c)\geq 1$ when $a>0$.

Proof.

When $a<0$ , $f_{+}(a,s,c)\leq f_{-}(a,s,c)$ . Vieta’s formula states that

[TABLE]

This implies $f_{+}(a,s,c)<0$. When $a>0$, we lower bound the discriminant,

[TABLE]

This implies

[TABLE]

∎

Lemma D.2.

$\lim_{a\rightarrow 0}f_{-}(a,s,c)=c.$

Proof.

By L’Hôpital’s rule,

[TABLE]

∎

Lemma D.3.

$f_{-}(a,s,c)$ is continuous and monotonically decreasing in $a$. It is strictly monotonically decreasing in $a$ when $f_{-}(a,s,c)<1$.

Proof.

First consider the case when $c=1$,

[TABLE]

which is continuous and monotonically decreasing in $a$ , and strictly so when $f_{-}(a,s,c)<1$ (so $a>s$ ).

When $c<1$, define the square root of the discriminant

[TABLE]

Inequality $(a)$ is strict because Eq. 45 is strict when $c<1$ . Then, consider the derivative,

[TABLE]

where in Eq. 46 we multiplied by the (positive) term ${2a^{2}{\eta}}$ , in Eq. 47 we plugged in the definition of $\eta$ and simplified, and in Eq. 48 we divided by the (strictly positive) term ${\eta}s$ . The inequality in the last line uses the fact that $4a^{2}c\frac{1-c}{{\eta}^{2}}>0$ given that $a\neq 0$ and $0<c<1$ , and that $\sqrt{1-x}<1$ for $x>0$ . ∎
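Lemmas D.1 to D.3 can be sanity-checked on a grid. The sketch below assumes the two roots are given by the usual quadratic formula applied to $ax^{2}-(a+s)x+sc$, with $f_{+}$ taking the plus sign in the numerator; the function name `roots` and the parameter values are illustrative assumptions, and a finite grid check is of course not a substitute for the proofs.

```python
import math


def roots(a, s, c):
    """Return (f_plus, f_minus), the roots of a*x**2 - (a + s)*x + s*c, for a != 0."""
    disc = max((a + s) ** 2 - 4 * a * s * c, 0.0)  # clamp tiny negative rounding
    root = math.sqrt(disc)
    return ((a + s) + root) / (2 * a), ((a + s) - root) / (2 * a)


if __name__ == "__main__":
    s, c = 0.3, 0.4
    grid = [k / 20 for k in range(-20, 21) if k != 0]
    for a in grid:
        f_plus, f_minus = roots(a, s, c)
        # Lemma D.1: f_plus is negative for a < 0 and at least 1 for a > 0.
        print(f"a={a:+.2f}  f_plus={f_plus:+.4f}  f_minus={f_minus:.4f}")
    # Lemma D.2: f_minus approaches c as a approaches 0 from either side.
    print(roots(1e-6, s, c)[1], roots(-1e-6, s, c)[1], c)
```

Evaluating $f_{-}$ with this formula very close to $a=0$ suffers from cancellation in the numerator, which is one reason a rationalized form can be preferable in numerical work.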

We now prove the following.

Theorem D.1.

Let $c(x)\in[0,1]$ indicate a set of values for all $x\in\mathcal{A}$ . Then, for any source distribution $s_{X}$ with full support, there is a channel $p_{Y|X}$ that satisfies

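$$p(y|x)=\begin{cases}c(x)&\text{if }y=x\\ \frac{1-c(x)}{1-p_{Y}(x)}\,p_{Y}(y)&\text{otherwise,}\end{cases}\tag{49}$$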

where $p_{Y}$ is the marginal $p_{Y}(y)=\sum_{x}s(x)p(y|x)$ . The channel $p_{Y|X}$ is unique if $c(x)>0$ for all $x$ . Moreover, $I_{p}(Y\!:\!X)=I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)$ if and only if $\sum_{x}c(x)\geq 1$ .

Proof.

We will show that there exists a marginal $p_{Y}$ that satisfies the consistency conditions of Eq. 49.

We first eliminate a few edge cases. The solution is trivial for $|\mathcal{A}|=1$, so we assume that $|\mathcal{A}|>1$. If $c(x)=0$ for all $x$, then for any two states $x,x^{\prime}\in\mathcal{A}$, the following is a solution: $p_{Y}(x)=s(x^{\prime})/(s(x)+s(x^{\prime}))$, $p_{Y}(x^{\prime})=s(x)/(s(x)+s(x^{\prime}))$, $p_{Y}(x^{\prime\prime})=0$ for all $x^{\prime\prime}\in\mathcal{A}\setminus\{x,x^{\prime}\}$. If $c(x)=0$ for some but not all $x$, then the problem can be solved for the reduced outcome space $\mathcal{S}=\{x\in\mathcal{A}:c(x)>0\}$, using the procedure below. It can then be extended to all outcomes by keeping $p_{Y}(x)$ fixed for $x\in\mathcal{S}$ and setting $p_{Y}(x)=0$ for all $x\in\mathcal{A}\setminus\mathcal{S}$. Therefore, without loss of generality, below we assume $c(x)>0$ for all $x$.

We now plug Eq. 49 into $p_{Y}(y)=\sum_{x}s(x)p(y|x)$ ,

[TABLE]

Define $a:=1-\sum_{x^{\prime}}s(x^{\prime})\frac{1-c(x^{\prime})}{1-p_{Y}(x^{\prime})}$ and rearrange Eq. 50 to give

[TABLE]

Multiplying both sides by $1-p_{Y}(x)$ and simplifying gives

[TABLE]

Dividing by $s(x)$ , then summing over $x$ and rearranging gives

[TABLE]

Note that the sum inside the brackets on the left hand side is strictly positive. Thus, we have

[TABLE]

Note also that $a=0$ if $\sum_{x}c(x)=1$ , in which case $p_{Y}(x)=c(x)$ is the unique solution to Eq. 51 for all $x$ . Below, we disregard this special case, and assume that $\sum_{x}c(x)\neq 1$ and $a\neq 0$ .

We now solve Eq. 51 for $p_{Y}(x)$ . First, note that $p_{Y}(x)=\sum_{x^{\prime}}s(x^{\prime})p(x|x^{\prime})\geq s(x)c(x)>0$ for all $x$ , since we assume that $s(x)>0$ and $c(x)>0$ for all $x$ . Given that $|\mathcal{A}|>1$ , this also means that $p_{Y}(x)<1$ for all $x$ (if this were not the case, then it would have to be that $p_{Y}(x)=0$ for all except one $x$ ). We then solve the quadratic equation,

$$p_{Y}^{a}(x)=\frac{(a+s(x))-\sqrt{(a+s(x))^{2}-4a\,s(x)c(x)}}{2a}=f_{-}(a,s(x),c(x)),\tag{54}$$

where we include the superscript $a$ in $p_{Y}^{a}$ to make the dependence on $a$ explicit. We chose the negative solution of the quadratic equation because, by Lemma D.1, it is the only one compatible with the requirement that $0<p_{Y}^{a}(x)<1$ .

We wish to find the value of $a$ that satisfies $\sum_{x}p_{Y}^{a}(x)=1$, which is defined implicitly via

$$1=\sum_{x}f_{-}(a,s(x),c(x)).\tag{55}$$

Note that each $p_{Y}^{a}(x)$ is continuous and strictly monotonically decreasing in $a$ (Lemma D.3), and therefore so is the right hand side of Eq. 55. Moreover, $a$ must lie between $-1$ and $1$ . To see why, evaluate the right hand side of Eq. 55 for $a=-1$ ,

[TABLE]

Then, evaluate it for $a=1$ ,

[TABLE]

Thus, there is a unique $a\in[-1,1]$ that satisfies Eq. 55, resulting in a unique $p_{Y}^{a}$ and corresponding $p_{Y|X}$ in Eq. 49.

Now, by definition of $I^{\mathsf{copy}}$ and the channel $p_{Y|X}$ in Eq. 49, $I_{p}(Y\!:\!X)=I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)$ if $c(x)\geq p_{Y}(x)$ for all $x$. By Lemma D.2 and Lemma D.3, the right hand side of Eq. 54 is less than or equal to $c(x)$ if and only if $a\geq 0$. By Eq. 53, $a\geq 0$ if and only if $\sum_{x}c(x)\geq 1$.

∎

In practice, the value of $a$ which satisfies Eq. 55 in the proof of Theorem D.1 can be found by a numerical root-finding algorithm, or by trying values from $-1$ to $1$ in small increments and selecting the first value that makes the right hand side of Eq. 55 less than or equal to $1$. The marginal $p_{Y}$ and channel $p_{Y|X}$ can then be computed in closed form using Eqs. 49 and 54.
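The numerical procedure can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: it assumes that $p_{Y}^{a}(x)=f_{-}(a,s(x),c(x))$ is the minus root of $ax^{2}-(a+s)x+sc$, that the channel places $c(x)$ on the diagonal and spreads the remaining mass in proportion to $p_{Y}$, and that $c(x)\in(0,1]$ with $|\mathcal{A}|>1$. The root $f_{-}$ is evaluated in the algebraically equivalent rationalized form $2sc/\big((a+s)+\sqrt{(a+s)^{2}-4asc}\big)$, a choice made here so that $a=0$ needs no special case. Function names are hypothetical.

```python
import math


def f_minus(a, s, c):
    """Minus root of a*x**2 - (a + s)*x + s*c in rationalized form (finite at a = 0)."""
    disc = max((a + s) ** 2 - 4 * a * s * c, 0.0)
    return 2 * s * c / ((a + s) + math.sqrt(disc))


def solve_channel(source, c, iters=200):
    """Bisection for a in [-1, 1] such that sum_x f_minus(a, s(x), c(x)) = 1."""

    def total(a):
        return sum(f_minus(a, s, cx) for s, cx in zip(source, c))

    lo, hi = -1.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:  # the sum decreases in a, so the root lies to the right
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    p_y = [f_minus(a, s, cx) for s, cx in zip(source, c)]
    norm = sum(p_y)  # absorb the residual bisection error
    p_y = [v / norm for v in p_y]
    n = len(source)
    # Assumed channel form: c(x) on the diagonal, the rest proportional to p_Y.
    channel = [
        [c[x] if y == x else (1 - c[x]) * p_y[y] / (1 - p_y[x]) for y in range(n)]
        for x in range(n)
    ]
    return a, p_y, channel


if __name__ == "__main__":
    source = [0.5, 0.3, 0.2]
    a, p_y, channel = solve_channel(source, [0.5, 0.6, 0.7])
    print("a =", a)
    print("p_Y =", p_y)
    for row in channel:
        print(row)
```

Bisection is used because the sum is monotone in $a$ on $[-1,1]$; any bracketing root finder would serve equally well.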

We are now ready to prove Proposition 1.

Proposition 1.

For any source distribution $s_{X}$ with $H(X)<\infty$ , there exist channels $p$ for all levels of mutual information $I_{p}(Y\!:\!X)\in[0,H(X)]$ such that $I^{\mathsf{copy}}_{p}(X\mkern-5.0mu\shortrightarrow\!Y)=I_{p}(Y\!:\!X)$ .

Proof.

Consider the proof of Theorem D.1. Note that for each $x\in\mathcal{A}$ and any $\gamma\in[0,1]$, Eq. 51 is satisfied by taking $p_{Y}(x)=s(x)$ and $c(x)=\gamma+s(x)-\gamma s(x)$.

Let $p_{Y|X}^{\gamma}$ represent the channel corresponding to each $\gamma$, as defined in Eq. 49. It is easy to check that $I^{\mathsf{copy}}_{p^{\gamma}}(X\mkern-5.0mu\shortrightarrow\!Y)=I_{p^{\gamma}}(Y\!:\!X)$, with $I^{\mathsf{copy}}_{p^{\gamma}}(X\mkern-5.0mu\shortrightarrow\!Y)=0$ for $\gamma=0$ and $I^{\mathsf{copy}}_{p^{\gamma}}(X\mkern-5.0mu\shortrightarrow\!Y)=H(s_{X})$ for $\gamma=1$. Note that $c(x)$ increases monotonically in $\gamma$ for all $x$, from $c(x)=s(x)$ for $\gamma=0$ to $c(x)=1$ for $\gamma=1$. This means that for all $\gamma$,

[TABLE]

Thus, the sums that define $I^{\mathsf{copy}}_{p^{\gamma}}(X\mkern-5.0mu\shortrightarrow\!Y)$ for each $\gamma$ converge uniformly, so $I^{\mathsf{copy}}_{p^{\gamma}}(X\mkern-5.0mu\shortrightarrow\!Y)$ is continuous in $\gamma$ . The proposition follows from the intermediate value theorem. ∎
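The one-parameter family used in this proof is easy to examine numerically. The Python sketch below assumes that, with $p_{Y}=s_{X}$ and $c(x)=\gamma+s(x)-\gamma s(x)$, the channel reduces to $p^{\gamma}(y|x)=\gamma\,\delta(x,y)+(1-\gamma)\,s(y)$, and that total copy information is the source-weighted average of $D^{\mathsf{copy}}_{x}$. Function names and the example source are illustrative assumptions, and natural logarithms are used.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) in nats; terms with p(y) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binary_kl(a, b):
    """d(a, b): KL divergence between Bernoulli(a) and Bernoulli(b), 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total


def channel_gamma(source, gamma):
    """Assumed family: p(y|x) = gamma * [y == x] + (1 - gamma) * s(y)."""
    n = len(source)
    return [
        [gamma * (1.0 if y == x else 0.0) + (1 - gamma) * source[y] for y in range(n)]
        for x in range(n)
    ]


def marginal(source, channel):
    n = len(source)
    return [sum(source[x] * channel[x][y] for x in range(n)) for y in range(n)]


def mutual_information(source, channel):
    p_y = marginal(source, channel)
    return sum(s * kl(row, p_y) for s, row in zip(source, channel))


def copy_information(source, channel):
    """Source-weighted average of D^copy_x over source messages x."""
    p_y = marginal(source, channel)
    total = 0.0
    for x, s in enumerate(source):
        a, b = channel[x][x], p_y[x]
        if a > b:
            total += s * binary_kl(a, b)
    return total


if __name__ == "__main__":
    source = [0.5, 0.3, 0.2]
    for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
        ch = channel_gamma(source, gamma)
        print(gamma, mutual_information(source, ch), copy_information(source, ch))
```

In this sketch the two quantities coincide for every $\gamma$ and move from 0 to the source entropy as $\gamma$ goes from 0 to 1, in line with the statement of the proposition.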

Appendix E The binary symmetric channel

The binary symmetric channel (BSC) is a channel over a two-state space ($\mathcal{A}=\{0,1\}$) parameterized by a “probability of error” $\epsilon\in[0,1]$. The BSC can be represented in matrix form as

$$p_{Y|X}^{\epsilon}=\begin{pmatrix}1-\epsilon&\epsilon\\ \epsilon&1-\epsilon\end{pmatrix}.$$

When $\epsilon=0$ , the BSC is a noiseless channel which copies the source without error. In this extreme case, MI is large, and we expect it to consist entirely of copy information. On the other hand, when $\epsilon=1$ , the BSC is a noiseless “inverted” channel, where messages are perfectly switched between the source and the destination. In this case, MI is again large, but we now expect it to consist entirely of transformation information. Finally, $\epsilon=1/2$ defines a completely noisy channel, for which mutual information (and thus copy and transformation information) must be 0.

For simplicity, we assume a uniform source distribution, $s_{X}(0)=s_{X}(1)=1/2$ , which by symmetry implies a uniform marginal probability $p_{Y}(0)=p_{Y}(1)=1/2$ at the destination for any $\epsilon$ . For the BSC with this source distribution, Eq. 8 states that for both $x=0$ and $x=1$ , $D^{\mathsf{copy}}_{x}(p_{Y|x}^{\epsilon}\|p_{Y})=I_{p^{\epsilon}}(Y\!:\!X\!\!=\!x)$ and $D^{\mathsf{trans}}_{x}(p_{Y|x}^{\epsilon}\|p_{Y})=0$ when $\epsilon\leq 1/2$ , and $D^{\mathsf{copy}}_{x}(p_{Y|x}^{\epsilon}\|p_{Y})=0$ and $D^{\mathsf{trans}}_{x}(p_{Y|x}^{\epsilon}\|p_{Y})=I_{p^{\epsilon}}(Y\!:\!X\!\!=\!x)$ otherwise. Using the definition of the (total) copy and transformation components of total MI, Eqs. 10 and 11, it then follows that

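$$I^{\mathsf{copy}}_{p^{\epsilon}}(X\mkern-5.0mu\shortrightarrow\!Y)=\begin{cases}I_{p^{\epsilon}}(Y\!:\!X)&\text{if }\epsilon\leq 1/2\\ 0&\text{otherwise,}\end{cases}\qquad I^{\mathsf{trans}}_{p^{\epsilon}}(X\mkern-5.0mu\shortrightarrow\!Y)=\begin{cases}0&\text{if }\epsilon\leq 1/2\\ I_{p^{\epsilon}}(Y\!:\!X)&\text{otherwise.}\end{cases}$$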

This confirms intuitions about the BSC discussed in the beginning of this section. The behavior of MI, $I^{\mathsf{copy}}(X\mkern-5.0mu\shortrightarrow\!Y)$ and $I^{\mathsf{trans}}(X\mkern-5.0mu\shortrightarrow\!Y)$ for the BSC with a uniform source distribution is shown visually in Fig. 2 of the main text.
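The BSC case is small enough to compute directly. The following Python sketch builds the BSC for a uniform source and evaluates mutual information together with its copy and transformation parts, using $D^{\mathsf{copy}}_{x}=\mathsf{d}(p_{Y|x}(x),p_{Y}(x))$ when $p_{Y|x}(x)>p_{Y}(x)$ and 0 otherwise. Measuring information in bits and the function names are assumptions made for this illustration.

```python
import math


def kl_bits(p, q):
    """KL divergence D_KL(p || q) in bits; terms with p(y) = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bsc_decomposition(eps):
    """Return (MI, copy, transformation) in bits for a BSC with a uniform source."""
    source = [0.5, 0.5]
    channel = [[1 - eps, eps], [eps, 1 - eps]]
    p_y = [sum(source[x] * channel[x][y] for x in range(2)) for y in range(2)]
    mi = 0.0
    copy = 0.0
    for x in range(2):
        mi += source[x] * kl_bits(channel[x], p_y)
        # Copy information is nonzero only if x is received more often than by chance.
        a, b = channel[x][x], p_y[x]
        if a > b:
            copy += source[x] * kl_bits([a, 1 - a], [b, 1 - b])
    return mi, copy, mi - copy


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
        mi, copy, trans = bsc_decomposition(eps)
        print(f"eps={eps:.2f}  MI={mi:.4f}  copy={copy:.4f}  trans={trans:.4f}")
```

For $\epsilon<1/2$ all of the mutual information is attributed to copying, for $\epsilon>1/2$ all of it is attributed to transformation, and at $\epsilon=1/2$ every quantity is 0.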

Bibliography

1. Shannon (1948): Claude Elwood Shannon, “A mathematical theory of communication,” The Bell System Technical Journal 27, 379–423 (1948).
2. Cover and Thomas (2006): Thomas M. Cover and Joy A. Thomas, Elements of information theory (John Wiley & Sons, 2006).
3. Kelly (1956): J. L. Kelly, “A New Interpretation of Information Rate,” The Bell System Technical Journal (1956).
4. Barron and Cover (1988): Andrew R. Barron and T. M. Cover, “A Bound on the Financial Value of Information,” IEEE Transactions on Information Theory 34, 1096–1101 (1988).
5. Cover and Ordentlich (1996): T. M. Cover and E. Ordentlich, “Universal portfolios with side information,” IEEE Transactions on Information Theory 42, 348–363 (1996).
6. Donaldson-Matasci et al. (2010): Matina C. Donaldson-Matasci, Carl T. Bergstrom, and Michael Lachmann, “The fitness value of information,” Oikos 119, 219–230 (2010).
7. Sagawa and Ueda (2008): Takahiro Sagawa and Masahito Ueda, “Second law of thermodynamics with discrete quantum feedback control,” Physical Review Letters 100, 080403 (2008).
8. Parrondo et al. (2015): Juan M. R. Parrondo, Jordan M. Horowitz, and Takahiro Sagawa, “Thermodynamics of information,” Nature Physics 11, 131–139 (2015).
