This paper develops a rigorous mathematical framework for defining and analyzing transfer entropy in continuous time stochastic processes, extending the discrete-time concept and illustrating it with Poisson processes.
Contribution
It introduces a novel continuous-time transfer entropy definition using advanced measure theory and provides conditions linking it to discrete-time TE, with applications to Poisson processes.
Findings
01
Continuous-time TE is defined via Radon-Nikodym derivatives.
02
Necessary and sufficient conditions relate discrete and continuous TE.
03
Stationarity implies a constant transfer entropy rate.
A Development of Continuous-Time Transfer Entropy
Joshua N. Cooper
Christopher D. Edgar
Abstract
Transfer entropy (TE) was introduced by Schreiber in 2000 as a measurement of the predictive capacity of one stochastic process with respect to another. Originally stated for discrete time processes, we expand the theory in line with recent work of Spinney, Prokopenko, and Lizier to define TE for stochastic processes indexed over a compact interval taking values in a Polish state space. We provide a definition for continuous time TE using the Radon-Nikodym Theorem, random measures, and projective limits of probability spaces. As our main result, we provide necessary and sufficient conditions to obtain this definition as a limit of discrete time TE, as well as illustrate its application via an example involving Poisson point processes. As a derivative of continuous time TE, we also define the transfer entropy rate between two processes and show that (under mild assumptions) their stationarity implies a constant rate. We also investigate TE between homogeneous Markov jump processes and discuss some open problems and possible future directions.
1 Introduction
The quantification of causal relationships between time series is a fundamental problem in fields including, for example, neuroscience ([4, 9, 29, 31]), social networking ([11, 28]), finance ([8, 19, 24, 25]), and machine learning ([14, 21]). Among the various means of measuring such relationships, information theoretical approaches are a rapidly developing area in concert with other paradigms such as Pearl semantics and Granger causality. One such approach is to make use of the notion of transfer entropy, which we abbreviate throughout as “TE”. Broadly speaking, transfer entropy is a functional which measures the information transfer between two stochastic processes. Schreiber’s definition of transfer entropy [26] characterizes information transfer as an informational divergence between conditional probability mass functions. The original definition is native to discrete space processes indexed over a countable set, often the natural numbers. One can generalize Schreiber’s definition to handle the case when the random variables comprising the process have state space R via the Radon-Nikodym Theorem as demonstrated in [16]. While this formalism is applicable to some practical scenarios, it suffers from a serious deficiency: it is only applicable to processes defined over discrete time.
A treatment of TE for processes that are either indexed over an uncountable set or do not have R as the state space of their constituent variables has been lacking in the literature. A common workaround to this shortcoming is the approach of time-binning which has been widely used as a means to capture intuitively the notion of information transfer between processes ([5, 10, 20]). These approaches, while sometimes effective and practicable, do not provide a native definition of TE in continuous time; that is, TE between processes indexed over an uncountable set. Recently, Spinney, Prokopenko, and Lizier ([27]) set out a framework to remedy this gap. We formalize this approach and explore the consequences by providing a definition of TE for discrete time processes comprised of random variables with a Polish state space and extend this definition to continuous time processes via projective limits, random measures, and the Radon-Nikodym Theorem. In Section 5, we provide our main result, Theorem 2, which characterizes when our continuous time definition of TE can be obtained as a limit of discrete time TE and apply it to a time-lagged Poisson point process in Section 6.
In some applications, the instantaneous transfer entropy is of particular interest. Using our methodology, we define the transfer entropy rate (TE rate) as the right derivative with respect to time of the expected pathwise transfer entropy (EPT) functional defined in Section 4 and demonstrate some of its basic properties, including a precise version of a result stated without proof in [27] regarding a particularly well-behaved class of stationary processes. In Section 9, we consider time-homogeneous Markov jump processes and provide an analytic form of the EPT via a Girsanov formula. We finish with several open questions and directions for future work, as well as an Appendix which provides some relevant calculations regarding TE in the context of Wiener processes.
2 Discrete Time Transfer Entropy over a Polish Space
Suppose $X:=\{X_n\}_{n\ge 1}$ and $Y:=\{Y_n\}_{n\ge 1}$ are stochastic processes adapted to the filtered probability space $(\Omega,\mathcal{F},\{\mathcal{F}_n\}_{n\ge 1},P)$. Suppose further that for each n≥1, $X_n$ and $Y_n$ are random variables taking values in a Polish state space Σ, i.e., a completely metrizable, separable space; and let $\mathcal{X}$ be a σ-algebra of subsets of Σ. Denote by $P_n$ the probability distribution of the random variable $X_n$ (by which sometimes we will mean a conditional probability distribution). For integers k,l,n≥1, we denote the “history vectors” of X and Y by
[TABLE]
and
[TABLE]
Since Σ is Polish, for each k,l,n≥1, there exist functions (in particular, regular conditional probability measures, whose existence is guaranteed on Polish spaces; see Theorem 6.16 of [22]) $\mathbb{P}_n^{(k,l)}[X_n\mid(X_{n-k-1}^{n-1}),(Y_{n-l-1}^{n-1})]$ and $\mathbb{P}_n^{(k)}[X_n\mid(X_{n-k-1}^{n-1})]$ mapping $\mathcal{F}_n\times\Omega$ to [0,1] with the following properties:
1.
For each ω∈Ω, both
[TABLE]
and
[TABLE]
are measures on (Σ,X).
2.
∀A∈Fn the mappings
[TABLE]
and
[TABLE]
are $\mathcal{F}_n$-measurable random variables.
3.
For all ω∈Ω and A∈Fn we have both
[TABLE]
and
[TABLE]
If ω∈Ω, the conditional probabilities $\mathbb{P}_n^{(k,l)}[X_n\mid(X_{n-k-1}^{n-1}),(Y_{n-l-1}^{n-1})](\cdot,\omega)$ and $\mathbb{P}_n^{(k)}[X_n\mid(X_{n-k-1}^{n-1})](\cdot,\omega)$ are only defined in the case that each event $\{B\in\sigma((X_{n-k-1}^{n-1})):\omega\in B\}$ and $\{B\in\sigma((X_{n-k-1}^{n-1}),(Y_{n-l-1}^{n-1})):\omega\in B\}$ is not a P-null set. We will assume that neither of these sets is P-null throughout this work whenever dealing with conditional probabilities.
Notation 1.
For sake of convenience let
[TABLE]
and
[TABLE]
whenever n,k,l≥1, ω∈Ω, and A∈Fn.
The following definition generalizes Schreiber’s definition of TE for discrete time processes whose random variables have a Polish state space.
Definition 1.
Suppose n,k,l≥1 are integers. Suppose further that Σ is a Polish space and that
[TABLE]
for each ω∈Ω. Define the transfer entropy from Y to X at n with history window lengths k and l, denoted $T_{Y\to X}^{(k,l)}(n)$, by
[TABLE]
and call X the “destination process” and Y the “source process”.
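$\frac{d\mathbb{P}_n^{(k,l)}[X_n\mid(X_{n-k-1}^{n-1}),(Y_{n-l-1}^{n-1})](\cdot)}{d\mathbb{P}_n^{(k)}[X_n\mid(X_{n-k-1}^{n-1})](\cdot)}(\cdot)$ is $\mathcal{F}\times\mathcal{X}$-measurable as X is adapted to $\mathcal{F}$.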
4.
For all ω∈Ω,
[TABLE]
is $\mathcal{F}$-measurable.
Example 1.
Suppose X and Y are discrete processes; that is, for each integer n≥1, both $X_n(\Omega)$ and $Y_n(\Omega)$ are countable. Then
[TABLE]
where the RN-derivatives have become quotients of probability mass functions since the processes are composed of discrete random variables. The above demonstrates that Schreiber’s initial definition of transfer entropy is indeed a special case of our more general definition of TE. Furthermore, if $(\Sigma,\mathcal{X})=(\mathbb{R},\mathcal{B}(\mathbb{R}))$ and the joint probability measure $P_{X_n,(X_{n-k-1}^{n-1}),(Y_{n-l-1}^{n-1})}$ is absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^{1+k+l}$, then there exist RN-derivatives (probability densities)
[TABLE]
which can replace the probability mass functions in Schreiber’s definition. With regard to our definition in this setting, $\mathbb{R}$ is indeed Polish; thus, assuming (2.3), our definition yields
[TABLE]
where $\mu^{(1+k+l)}$ denotes Lebesgue measure on $\mathbb{R}^{1+k+l}$. This expression is exactly that for TE in this special case (see [16]); thus, our definition recovers the correct expression for TE in the case that $(\Sigma,\mathcal{X})=(\mathbb{R},\mathcal{B}(\mathbb{R}))$ as well.
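To make the discrete special case of Example 1 concrete, the following is a minimal sketch of Schreiber's sum, assuming history lengths k = l = 1, a finite alphabet, and a known joint probability mass function. The function name `transfer_entropy_pmf` and the array layout are illustrative choices and do not come from the paper.

```python
import numpy as np

def transfer_entropy_pmf(joint):
    """Schreiber's TE from Y to X (in nats) for history lengths k = l = 1.

    Assumed layout (illustrative): joint[a, b, c] = P(X_n=a, X_{n-1}=b, Y_{n-1}=c).
    """
    joint = np.asarray(joint, dtype=float)
    p_hist_xy = joint.sum(axis=0)        # P(X_{n-1}=b, Y_{n-1}=c)
    p_pair_x = joint.sum(axis=2)         # P(X_n=a, X_{n-1}=b)
    p_hist_x = joint.sum(axis=(0, 2))    # P(X_{n-1}=b)
    te = 0.0
    for a, b, c in np.ndindex(joint.shape):
        p = joint[a, b, c]
        if p == 0.0:
            continue  # zero-probability cells contribute nothing
        cond_full = p / p_hist_xy[b, c]          # P(X_n | X_{n-1}, Y_{n-1})
        cond_x = p_pair_x[a, b] / p_hist_x[b]    # P(X_n | X_{n-1})
        te += p * np.log(cond_full / cond_x)     # quotient of pmfs = RN-derivative
    return te
```

For instance, if X copies the previous value of an i.i.d. fair-bit source Y (so `joint[a, b, a] = 0.25` and all other entries are zero), the sketch returns log 2, while an independent uniform joint pmf gives zero.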
Note that Definition 1 differs somewhat from the definition of TE in [27], in that we employ two expectations. The idea of using two expectations to represent some of the more common conditional versions of information-theoretical functionals has appeared in other works (see Section 3 of [2] and (14) in [3]).
3 Construction of path measures
We now turn our attention to the main purpose of this work, namely, developing TE in continuous time. We restrict our attention to the case when the uncountable indexing set is an interval. Let T⊂R≥0 be a closed and bounded interval whose elements we refer to as times. Analogous to the setup for discrete time TE, we suppose X:={Xt}t∈T and Y:={Yt}t∈T are stochastic processes adapted to the filtered probability space (Ω,F,{Ft}t∈T,P) such that for each t∈T, Xt and Yt are random variables taking values in the measurable state space (Σ,X) where Σ is a Polish space and X is a σ−algebra of subsets of Σ.
In this section we begin our construction of continuous time TE by introducing conditional measures on the space of sample paths of X. These measures will act as the continuous time analogues of the random conditional probabilities
[TABLE]
and
[TABLE]
in Definition 1. The following seminal result in [23] will be crucial to the formulation of these measures.
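Theorem 1.
Let A be any index set and D the set of all its finite subsets directed by inclusion. Let $(\Sigma_t,\mathcal{X}_t)_{t\in A}$ be a family of measurable spaces where $\Sigma_t$ is a topological space and $\mathcal{X}_t$ is a σ-field containing all the compact subsets of $\Sigma_t$. Suppose, for α∈D, $\Sigma_\alpha=\times_{t\in\alpha}\Sigma_t$, $\mathcal{X}_\alpha=\bigotimes_{t\in\alpha}\mathcal{X}_t$, and $P_\alpha:\mathcal{X}_\alpha\to[0,1]$ so that $(\Sigma_\alpha,\mathcal{X}_\alpha,P_\alpha)$ is a probability space. If for each α∈D, $P_\alpha$ is inner regular relative to the compact subsets of $\mathcal{X}_\alpha$, i.e., for any $A\in\mathcal{X}_\alpha$, $P_\alpha(A)=\sup\{P_\alpha(C): C \text{ is a compact subset of } A\}$, and $\pi_{\alpha\beta}:\Sigma_\beta\to\Sigma_\alpha$ (β≥α), $\pi_\alpha=\pi_{\alpha A}:\times_{t\in A}\Sigma_t\to\Sigma_\alpha$ for α,β∈D are coordinate projections, then there exists a unique probability measure $P_A$ on the space $(\times_{t\in A}\Sigma_t,\bigotimes_{t\in A}\mathcal{X}_t)$ such that ∀α∈D,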
[TABLE]
if and only if {(Σα,Xα,Pα,παβ)β≥α:α,β∈D} is a projective system with respect to mappings {παβ}; that is,
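(1)
$\pi_{\alpha\beta}^{-1}(\mathcal{X}_\alpha)\subset\mathcal{X}_\beta$, so that $\pi_{\alpha\beta}$ is $(\mathcal{X}_\beta,\mathcal{X}_\alpha)$-measurable;
(2)
for any α≤β≤λ, $\pi_{\alpha\beta}\circ\pi_{\beta\lambda}=\pi_{\alpha\lambda}$ and $\pi_{\alpha\alpha}=\mathrm{id}_\alpha$; and
(3)
$P_\alpha=P_\beta\circ\pi_{\alpha\beta}^{-1}$ whenever α≤β.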
Due to Corollary 15.27 of [1], the same result holds without the inner regularity of $P_{\{\cdot\}}$ whenever $\Sigma_t$ is a Polish space for each t∈A. Furthermore, the same result holds if D is the set of countably finite subsets of A (see Corollary 4.9.16 of [12]).
Let A=[t0,T)⊂T.
As shown in the proof of Theorem 1 (see [23]), the projective limit σ-algebra, $\bigotimes_{t\in A}\mathcal{X}_t$, is generated by $\bigcup_{\alpha\in D}\pi_\alpha^{-1}(\mathcal{X}_\alpha)$; that is,
[TABLE]
If α,β∈D with α<β, then due to (1) of Theorem 1 we have
[TABLE]
Consequently, $(\pi_\alpha^{-1}(\mathcal{X}_\alpha))_{\alpha\in D}$ is a filtration ordered by set inclusion which generates $\bigotimes_{t\in A}\mathcal{X}_t$, and from (3.1) we have
[TABLE]
In our case, we assume that Σt=Σ and Xt=X for all t∈T.
Now let s,r>0 be such that (t0−max(s,r),T)⊂T. The numbers s and r are in place to act as the analogues of the positive integers k and l in Definition 1. For each Δt>0 define the comb set $D_{\Delta t}\subset T$ by
[TABLE]
where W=max(s,r).
Notation 2.
Henceforth, we will let $\tau=\lfloor T/\Delta t\rfloor-\lfloor t_0/\Delta t\rfloor$ and $\langle T,i,\Delta t\rangle=\lfloor T/\Delta t\rfloor\Delta t-i\Delta t$ for Δt>0, i=0,1,…,τ−1.
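The grid quantities of Notation 2 are simple to compute. The sketch below assumes the reading τ = ⌊T/Δt⌋ − ⌊t0/Δt⌋ and ⟨T,i,Δt⟩ = ⌊T/Δt⌋Δt − iΔt; the helper name `comb_grid` is hypothetical and is not part of the paper.

```python
import math

def comb_grid(t0, T, dt):
    """Return tau and the grid times <T, i, dt> for i = 0, ..., tau - 1.

    Assumes tau = floor(T/dt) - floor(t0/dt) and
    <T, i, dt> = floor(T/dt) * dt - i * dt (illustrative reading of Notation 2).
    """
    tau = math.floor(T / dt) - math.floor(t0 / dt)
    base = math.floor(T / dt) * dt   # largest multiple of dt not exceeding T
    points = [base - i * dt for i in range(tau)]
    return tau, points
```

For example, `comb_grid(1.0, 3.0, 0.5)` returns `(4, [3.0, 2.5, 2.0, 1.5])`: the grid is indexed backward in time from the right endpoint.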
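Given Δt>0 we can use the comb set $D_{\Delta t}$ to construct two probability measures on the measurable space $(\Sigma^\tau,\bigotimes_{i=0}^{\tau-1}\mathcal{X})$. Specifically, for Δt>0 let $A_m^{\Delta t,X}=\{X_m\in B_m\}$, $A_m^{\Delta t,Y}=\{Y_m\in B_m\}$, $\mathcal{X}_{m,k}^{\Delta t}=\sigma((X_{\langle T,m+k+1,\Delta t\rangle}^{\langle T,m+1,\Delta t\rangle}))$, and $\mathcal{Y}_{m,k,l}^{\Delta t}=\sigma((Y_{\langle T,m+l+1,\Delta t\rangle}^{\langle T,m+1,\Delta t\rangle}))$ for each m=0,1,⋯,τ−1.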
Then
[TABLE]
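for some ω∈Ω where $k=\lfloor s/\Delta t\rfloor$ and $\alpha_X^{i,\Delta t}=\bigcap_{j=\lfloor T/\Delta t\rfloor-(i+\lfloor s/\Delta t\rfloor+1)}^{\lfloor T/\Delta t\rfloor-(i+1)}A_{j\Delta t}^{\Delta t,X}$. Similarly,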
[TABLE]
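for some ω∈Ω where $l=\lfloor r/\Delta t\rfloor$ and $\alpha_Y^{i,\Delta t}=\bigcap_{j=\lfloor T/\Delta t\rfloor-(i+\lfloor r/\Delta t\rfloor+1)}^{\lfloor T/\Delta t\rfloor-(i+1)}A_{j\Delta t}^{\Delta t,Y}$.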
Given ω∈Ω and Δt>0 define the measures $\mathbb{P}_{X\mid X,i,\Delta t}^{(\omega),(k)}$ and $\mathbb{P}_{X\mid X,Y,i,\Delta t}^{(\omega),(k,l)}$ on the space $(\Sigma,\mathcal{X})$ for each i=0,1,⋯,τ−1 by
[TABLE]
and
[TABLE]
Notation 3.
For Δt′,Δt>0 we write Δt′∣Δt whenever there exists a positive integer m such that Δt=mΔt′.
Suppose $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$. If for each ω∈Ω the systems
[TABLE]
and
[TABLE]
are projective systems with respect to coordinate projections $\{\pi_{D_{\Delta t}D_{\Delta t'}}\}$, then as a consequence of Theorem 1, there exist unique probability measures
[TABLE]
and
[TABLE]
on the measurable space $(\times_{t\in[t_0,T)}\Sigma,\bigotimes_{t\in[t_0,T)}\mathcal{X})$ such that
[TABLE]
and
[TABLE]
where $\mathcal{F}_{\Delta t}^{[t_0,T)}=\pi_{D_{\Delta t}}^{-1}(\mathcal{X}_{D_{\Delta t}})$.
Notation 4.
Let $\Omega_X^{[t_0,T)}$ denote the set of sample paths of X.
4 Pathwise transfer entropy and expected pathwise transfer entropy
The purpose of this section is to use the measures $\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\cdot)$ and $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\cdot)$ to define transfer entropy over an interval of the form [t0,T)⊂T with history window lengths r,s>0.
Definition 2.
Suppose T⊂R≥0 is a closed and bounded interval, [t0,T)⊂T; r,s>0; and for each ω∈Ω the measures $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\omega)$ and $\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$ exist. If (t0−max(s,r),T)⊂T, then for any sample path $x_{t_0}^T\in\Omega_X^{[t_0,T)}$, define the pathwise transfer entropy from Y to X on [t0,T) at $x_{t_0}^T$ with history window lengths r and s, denoted $PT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}(\omega,x_{t_0}^T)$, by
[TABLE]
if $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\omega)\ll\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$ for all ω∈Ω and ∞ otherwise.
Observation 2.
For each ω∈Ω, $PT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}(\omega,\cdot)$ maps $\Omega_X^{[t_0,T)}$ into the extended real line $\mathbb{R}\cup\{\infty\}$ and $PT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}(\omega,\cdot)$ is unique $\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$-a.s. due to the Radon-Nikodym Theorem.
The following is our definition of transfer entropy over an interval of the form [t0,T). (One could, in principle, construct a similar definition in the case that the interval is of the form [t0,T] by following the procedure outlined in Section 3 with comb sets of the form $\{T,T-\Delta t,T-2\Delta t,\dots,T-\lfloor\max(s,r)/\Delta t\rfloor\Delta t\}$ in place of $D_{\Delta t}$.)
Definition 3.
Suppose T⊂R≥0 is a closed and bounded interval, [t0,T)⊂T; r,s>0; and for each ω∈Ω the measures $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\omega)$ and $\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$ exist. If (t0−max(s,r),T)⊂T, the expected pathwise transfer entropy (EPT) from Y to X on [t0,T) with history window lengths r and s, denoted $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}$, is defined by
[TABLE]
if $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\omega)\ll\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$ for all ω∈Ω and ∞ otherwise.
For the sake of clarity we emphasize that the expectation in (4.2) is understood as the integral
[TABLE]
where
[TABLE]
and note that this is similar to the expression in (2.4) for discrete time TE in that it is an expectation of a KL-divergence among conditional measures induced by the dynamics of X and Y.
5 Obtaining continuous time TE as a limit of discrete time TE
We now pursue conditions under which the EPT can be represented as a limit of discrete time TE. We first prove two lemmas that will be used in the proof of our main theorem; then we define a type of consistency between processes that makes the expressions in the main result meaningful; then we provide our main result, Theorem 2, and conclude with some of its consequences.
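Lemma 1.
Suppose N≥1 and $\{\mu_i\}_{i\ge 1}$ and $\{\nu_i\}_{i\ge 1}$ are finite measures on the measurable space (X,Σ) with $\mu_i\ll\nu_i$ for i=1,...,N. Let $\mu=\prod_{i=1}^{N}\mu_i$ and $\nu=\prod_{i=1}^{N}\nu_i$ be product measures on the space $(X^N,\otimes^N\Sigma)$. Then μ≪ν and
$$\frac{d\mu}{d\nu}(x_1,\dots,x_N)=\prod_{i=1}^{N}\frac{d\mu_i}{d\nu_i}(x_i)$$
where $x_i\in X$ for i∈[N].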
Proof.
Clearly μ≪ν since ∀A∈⊗NΣ we have
[TABLE]
Fix E∈⊗NΣ and for i=1,2,…,N let
[TABLE]
where xi∈X,∀i∈[N]. Then from the Radon-Nikodym chain rule we obtain
[TABLE]
By the uniqueness of the RN-derivative we have
[TABLE]
which completes the proof.
∎
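As a sanity check on Lemma 1, the product formula for RN-derivatives can be verified numerically on finite state spaces, where a measure is a vector of point masses and the RN-derivative is a pointwise quotient. This is only a toy illustration under that finite assumption; the function names are hypothetical.

```python
from functools import reduce
import numpy as np

def product_measure(factors):
    """Point masses of the product measure, as an N-dimensional array."""
    return reduce(np.multiply.outer, factors)

def product_rn_derivative(mus, nus):
    """Product of the coordinatewise RN-derivatives d(mu_i)/d(nu_i)."""
    ratios = [mu / nu for mu, nu in zip(mus, nus)]  # requires nu_i > 0 everywhere
    return reduce(np.multiply.outer, ratios)

# Two finite (not necessarily probability) measures per coordinate.
mus = [np.array([0.2, 0.8]), np.array([1.0, 2.0, 3.0])]
nus = [np.array([0.5, 0.5]), np.array([2.0, 2.0, 2.0])]

# RN-derivative of the product measure, computed directly as a quotient.
direct = product_measure(mus) / product_measure(nus)
agrees = np.allclose(direct, product_rn_derivative(mus, nus))
```

Here `agrees` is true because, on a finite space with strictly positive reference masses, both sides reduce to the same entrywise quotient.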
The following lemma establishes convergence of KL-divergences in a manner which will be useful in the proof of our main result.
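Lemma 2.
Suppose (Ω,F) is a measurable space. Furthermore, suppose that $(\mathcal{F}_{\Delta t})_{\Delta t>0}$ is a sequence of decreasing sub-σ-algebras of $\mathcal{F}$ such that $\mathcal{F}=\bigcap_{\Delta t>0}\mathcal{F}_{\Delta t}$ and that P and M are probability measures on (Ω,F) with P≪M. Let $P_{\Delta t}=P|_{\mathcal{F}_{\Delta t}}$ and $M_{\Delta t}=M|_{\mathcal{F}_{\Delta t}}$ for each Δt>0. If $E_P\left[\log\frac{dP}{dM}\right]<\infty$, then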
[TABLE]
as Δt↓0.
Proof.
Since probability measures are σ−finite, all RN-derivatives in (5.1) exist. Suppose Δt>0. Observe that for all A∈FΔt we have that
[TABLE]
implying that
[TABLE]
from the definition of conditional expectation.
Define $\zeta_{\Delta t}=\frac{dP_{\Delta t}}{dM_{\Delta t}}$ for each Δt>0. From (5.2), we get that $\{\zeta_{\Delta t}\}_{\Delta t>0}$ is a uniformly integrable backward martingale since $\zeta_{\Delta t}$ is clearly M-integrable for any Δt>0 by the Radon-Nikodym Theorem and
[TABLE]
whenever Δt′>Δt due to the tower property of conditional expectation.
We claim that
[TABLE]
To see this, note first that the limit exists a.s. and in $L^1$ due to Theorem 6.1 of [13], i.e., there exists some nonnegative $\zeta\in L^1(\Omega,\mathcal{F},M)$ such that
[TABLE]
as Δt↓0.
Fix Δt>0 and suppose $A\in\mathcal{F}_{\Delta t}$. Then for all 0<Δt′<Δt we have that $A\in\mathcal{F}_{\Delta t'}$ since $(\mathcal{F}_{\Delta t})_{\Delta t>0}$ is a decreasing collection of σ-algebras. As a consequence of the Radon-Nikodym Theorem, $P(A)=E_M[\chi_A\zeta_{\Delta t'}]$, implying that $E_M[\chi_A\zeta_{\Delta t'}]$ is constant for 0<Δt′<Δt. Consequently,
[TABLE]
Furthermore, since $\mathcal{F}=\bigcap_{\Delta t>0}\mathcal{F}_{\Delta t}$ we must have that $P(A)=E_M[\chi_A\zeta]$ for all $A\in\mathcal{F}$, proving (5.3).
Since (0,∞)∋x↦xlogx is convex and ∀Δt>0,
[TABLE]
Conditional Jensen’s inequality and (5.2) imply that
[TABLE]
Taking expectations with respect to M of both sides of (5.5) we get that ∀Δt>0,
[TABLE]
thus
[TABLE]
The Radon-Nikodym Theorem guarantees that $\frac{dP}{dM}$ is nonnegative and that $\frac{dP}{dM}\log\frac{dP}{dM}$ is $\mathcal{F}$-measurable, thus
[TABLE]
as a consequence of the continuous time version of Fatou’s Lemma and (5.4).
Now clearly
[TABLE]
∎
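The convergence in Lemma 2 has a simple finite analogue: restricting two probability measures to coarser partitions can only lower their KL divergence, and refining the partition recovers the full value. The sketch below illustrates this on an eight-point space with dyadic partitions. It is a toy under that finite assumption, not the lemma itself, and `restricted_kl` is a hypothetical helper.

```python
import numpy as np

def restricted_kl(p, m, block_size):
    """KL(P || M) after restricting both measures to the sigma-algebra
    generated by consecutive blocks of `block_size` points."""
    pb = p.reshape(-1, block_size).sum(axis=1)   # P of each block
    mb = m.reshape(-1, block_size).sum(axis=1)   # M of each block
    mask = pb > 0                                # 0 * log 0 is taken as 0
    return float(np.sum(pb[mask] * np.log(pb[mask] / mb[mask])))

p = np.array([0.30, 0.05, 0.10, 0.05, 0.20, 0.10, 0.15, 0.05])
m = np.full(8, 0.125)

# Block sizes 8, 4, 2, 1 correspond to successively finer partitions.
values = [restricted_kl(p, m, b) for b in (8, 4, 2, 1)]
full_kl = float(np.sum(p * np.log(p / m)))
```

The entries of `values` are nondecreasing and the last one equals `full_kl`, which mirrors how the divergences of the restricted measures approach the divergence of the limiting measures as the grid is refined.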
Let FX[t0,T) be the sub-σ−algebra of ⨂t∈[t0,T)X defined by
[TABLE]
and observe that $(\mathcal{F}_{\Delta t}^{[t_0,T)})_{\Delta t>0}$ is a decreasing collection of σ-algebras due to (3.2). Henceforth, when we write $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\cdot)$ or $\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\cdot)$, we are referring to the restriction of these measures to the σ-algebra $\mathcal{F}_X^{[t_0,T)}$. Furthermore, recall from (3.8) and (3.9) that for all $A\in\mathcal{F}_{\Delta t}^{[t_0,T)}$ and ω∈Ω we have that
[TABLE]
and
[TABLE]
where $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$. From now on, we will omit writing the projections in (5.8) and (5.9) to avoid cumbersome notation.
Notation 5.
For each ω∈Ω, Δt>0, we denote by $P_{\Delta t}^{(\omega)}$ and $M_{\Delta t}^{(\omega)}$ the measures
$\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^{T}\mid X_{t_0-s}^{t_0},Y_{t_0-r}^{T}](\omega)\big|_{\mathcal{F}_{\Delta t}^{[t_0,T)}}$ and $\mathbb{P}_{X}^{(s)}[X_{t_0}^{T}\mid X_{t_0-s}^{t_0}](\omega)\big|_{\mathcal{F}_{\Delta t}^{[t_0,T)}}$, respectively. It should be noted that these are measures on the measurable space $(\Sigma^\tau,\bigotimes^\tau\mathcal{X})$.
For Δt>0, let
[TABLE]
for any i=0,1,…,τ−1
where
[TABLE]
[TABLE]
[TABLE]
and
[TABLE]
As a means of succinctly capturing all of the conditions which need hold to use Definitions 1 and 3, we define a type of consistency between two processes dependent on the window lengths r and s and the interval [t0,T). This notion of consistency captures the conditions under which our main result, Theorem 2, is of utility.
Definition 4.
Suppose T⊂R≥0 is a closed and bounded interval, [t0,T)⊂T, and s,r>0 are such that (t0−max(s,r),T)⊂T. Suppose further that X:={Xt}t∈T and Y:={Yt}t∈T are stochastic processes adapted to the filtered probability space (Ω,F,{Ft}t∈T,P) such that for each t∈T,Xt and Yt are random variables taking values in the measurable space (Σ,X), where Σ is assumed to be a Polish space and X is a σ−algebra of subsets of Σ.
Y is (s,r)-consistent upon X on [t0,T) iff
1.
∀ω∈Ω there exist measures $\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$ and $\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\omega)$ on the space $(\Omega_X^{[t_0,T)},\mathcal{F}_X^{[t_0,T)})$ for which (3.8) and (3.9) hold.
2.
∃δ1>0 such that for all Δt∈(0,δ1) and i=0,1,…,τ−1
$KL\left(P_{\Delta t}^{(\cdot)}\,\big\|\,M_{\Delta t}^{(\cdot)}\right)$ is P-integrable.
3.
$\mathbb{P}_{X\mid X,Y}^{(s,r)}[X_{t_0}^T\mid X_{t_0-s}^{t_0},Y_{t_0-r}^T](\omega)\ll\mathbb{P}_X^{(s)}[X_{t_0}^T\mid X_{t_0-s}^{t_0}](\omega)$ for each ω∈Ω.
We call 1.-3. “consistency conditions”.
We now present our main result.
Theorem 2.
Suppose T⊂R≥0 is a closed and bounded interval with [t0,T)⊂T, Σ is a Polish space and s,r>0 satisfy (t0−max(s,r),T)⊂T. Suppose further that X:={Xt}t∈T and Y:={Yt}t∈T are stochastic processes adapted to the filtered probability space (Ω,F,{Ft}t∈T,P) such that for each t∈T, Xt and Yt are random variables taking values in the measurable state space (Σ,X) and that Y is (s,r)-consistent upon X on [t0,T).
If ∃M,δ2>0 such that ∀Δt∈(0,δ2),
[TABLE]
then
[TABLE]
iff
[TABLE]
where $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$.
Proof.
(⇒) Suppose $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}<\infty$, let δ=min{δ1,δ2} and for each ω∈Ω let
[TABLE]
and
[TABLE]
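If Δt∈(0,δ), then consistency condition 2(c) implies that $KL(P_{\Delta t}^{(\omega)}\,\|\,M_{\Delta t}^{(\omega)})$ is P-integrable. Since Σ is σ-finite under both $\mathbb{P}_{X\mid X,Y,i,\Delta t}^{(\omega),(\lfloor s/\Delta t\rfloor,\lfloor r/\Delta t\rfloor)}$ and $\mathbb{P}_{X\mid X,i,\Delta t}^{(\omega),(\lfloor s/\Delta t\rfloor)}$ for each ω∈Ω and any i=0,1,…,τ−1, we have that the measurable space $(\Sigma^\tau,\bigotimes_{i=0}^{\tau-1}\mathcal{X})$ is σ-finite under both $P_{\Delta t}^{(\omega)}$ and $M_{\Delta t}^{(\omega)}$ for each ω∈Ω, thus the RN-derivatives in (5.11) exist. Furthermore, we get from Lemma 1 that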
[TABLE]
where $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$.
Now for each Δt>0,i=0,1,⋯,τ−1 and ω∈Ω let
[TABLE]
for each τ-tuple $(x_0,x_1,\dots,x_{\tau-1})\in\Sigma^\tau$. Clearly, $F_{i,\Delta t}^{\omega}$ is $\Sigma^\tau$-measurable and furthermore $P_{\Delta t}^{(\omega)}$-integrable due to Jensen’s inequality since consistency condition 2(b) implies
[TABLE]
Now we apply Fubini’s Theorem and obtain
[TABLE]
where $S_{i,\Delta t}=\prod_{j=0,\,j\neq i}^{\tau-1}\int_\Sigma 1\,d\mathbb{P}_{X\mid X,Y,j,\Delta t}^{(\cdot),(k,l)}$ for i=0,1,…,τ−1.
Moreover,
[TABLE]
Since $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}<\infty$, we have $KL\left(P^{(\omega)}\,\big\|\,M^{(\omega)}\right)<\infty$ for all ω∈Ω∖B for some P-null set B, which from Lemma 2 implies that
[TABLE]
Let
[TABLE]
and observe that g∈L1(Ω,F,P) and
[TABLE]
Moreover, since P(Ω)=1 we have
[TABLE]
Now for each ϵ,Δt>0 and ω∈Ω, define hΔtϵ(ω) by
[TABLE]
and note that $h_{\Delta t}^{\epsilon}$ is nonnegative ∀ϵ,Δt>0 due to Gibbs’ inequality and converges in probability to g since ∀η>0
[TABLE]
as Δt↓0. Let ϵ>0 be arbitrary and observe that
[TABLE]
Since $g\in L^1(\Omega,\mathcal{F},P)$ and $KL(P_{\Delta t}^{(\cdot)}\,\|\,M_{\Delta t}^{(\cdot)})\xrightarrow{P}g$ as Δt↓0, we have
[TABLE]
Now since $\mathbb{P}\left(\left\{\left|KL\left(P_{\Delta t}^{(\cdot)}\,\|\,M_{\Delta t}^{(\cdot)}\right)-g\right|<\epsilon\right\}\right)\rightarrow 1$ as Δt↓0, we obtain
[TABLE]
from (5.17) and thus $\lim_{\epsilon\downarrow 0}\lim_{\Delta t\downarrow 0}\|h_{\Delta t}^{\epsilon}-g\|_{L^1}=0$ since ϵ>0 was arbitrary.
In particular,
[TABLE]
We now show that
[TABLE]
Note that
[TABLE]
where
[TABLE]
for ϵ,Δt>0, and ω∈Ω. Fix ϵ>0 and note that (5.10) implies
[TABLE]
∀Δt∈(0,δ). Due to (5.16), the RHS of
(5.20) converges to [math] as Δt↓0, thus
[TABLE]
so
[TABLE]
Now from (5.18) and (5.22) we have that limΔt↓0EP[KL(PΔt(⋅)∣∣MΔt(⋅))] exists since
thus ∃δ3>0 such that Δt∈(0,δ3)⟹EP[KL(PΔt(⋅)∣∣MΔt(⋅))]>M.
From (5.10),
[TABLE]
∀Δt∈(0,δ2), hence
[TABLE]
∀Δt∈(0,min{δ3,δ2}). This is a contradiction and the proof is complete.
∎
Due to the following corollary, one can conclude the “only if” part of Theorem 2 under a weaker version of (5.10).
Corollary 1.
Let T⊂R≥0 be an interval and [t0,T)⊂T and s,r>0 be such that (t0−max(s,r),T)⊂T. Suppose X:={Xt}t∈T and Y:={Yt}t∈T are stochastic processes adapted to the filtered probability space (Ω,F,{Ft}t∈T,P) such that for each t∈T, Xt and Yt are random variables taking values in the measurable state space (Σ,X) and Y is (s,r)-consistent upon X on [t0,T).
If there exist $\eta\in L^1(\Omega,\mathcal{F},P)$ and δ2>0 such that $KL(P_{\Delta t}^{(\cdot)}\,\|\,M_{\Delta t}^{(\cdot)})\le\eta(\cdot)$ P-a.s. ∀Δt∈(0,δ2) and $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}<\infty$,
then
[TABLE]
where $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$.
Proof.
We need only show that (5.21) in the proof of the forward direction of Theorem 2 is still true. Since η∈L1(Ω,F,P), for ϵ>0 we have that
The following corollary of Theorem 2 is a key result because it will be used in an application to be explored later in Section 6. The conditions in Theorem 2 may be too strong to apply to some common situations. The following weakens these conditions at the cost of the equivalence between the hypotheses and conclusion in Theorem 2.
Corollary 2.
Let T⊂R≥0 be a closed and bounded interval, [t0,T)⊂T, and s,r>0 be such that (t0−max(s,r),T)⊂T. Suppose X:={Xt}t∈T and Y:={Yt}t∈T are stochastic processes adapted to the filtered probability space (Ω,F,{Ft}t∈T,P) such that for each t∈T, Xt and Yt are random variables taking values in the measurable state space (Σ,X) and Y is (s,r)-consistent upon X on [t0,T).
If there exists γ>0 such that
[TABLE]
where
[TABLE]
for Δt,λ>0
and
EPTY→X(s,r)∣t0T<∞,
then
[TABLE]
where $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$.
Proof.
As in Corollary 1, it suffices to show that (5.21) holds whenever both (5.24) and EPTY→X(s,r)∣t0T<∞ hold. Observe that
[TABLE]
as Δt↓0, since clearly $\gamma\in L^1(\Omega,\mathcal{F},P)$. Let $\tau'=\lfloor T/\Delta t'\rfloor-\lfloor t_0/\Delta t'\rfloor$ for Δt′>0 and observe that since $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}<\infty$, Lemma 2 implies that
[TABLE]
for all Δt′>0 in a small enough neighborhood of 0; moreover,
[TABLE]
as Δt↓0 since P(BΔt)→0. Now for any ϵ>0,
[TABLE]
as Δt↓0, proving the corollary.
∎
6 Application: lagged Poisson point process
Below, we provide an example of two processes which satisfy (5.10) of Theorem 2 under a certain assumption on r.
In the following example, we consider TE from a time-lagged version of the counting process of a Time-Homogeneous Poisson Point Process (THPPP) to itself, a case through which we demonstrate the applicability of our results.
Suppose [t0,T)⊂T⊂R and $X=(X_t)_{t\in T}$ is the counting process of a THPPP with intensity λ. Suppose further that ϵ>0 and $Y=(Y_t)_{t\in T}$ is given by $Y_t=X_{t+\epsilon}$ for all t≥−ϵ. If X is the counting process with intensity λ>0 of a THPPP $\psi:=(T_n)_{n\ge 1}$, then Y is also a counting process of a THPPP with intensity λ>0, specifically that of the point process $\psi':=(T_n-\epsilon)_{n\ge 1}$. Note that the state space of $X_t$ is the natural numbers for any t∈[t0,T), which is a Polish space under the discrete metric. For any ω∈Ω, Δt>0 and i=0,1,…,τ−1 we have
[TABLE]
where $\mathrm{Pois}(x,n)=\frac{e^{-x}x^{n}}{n!}$ for x>0 and integers n≥0.
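For readers who want to evaluate the increment probabilities numerically, here is a minimal sketch of the Poisson mass function Pois(x, n) = e^{-x} x^n / n! used above. The function name `pois` and the sample values of λ and Δt are illustrative assumptions.

```python
import math

def pois(x, n):
    """Poisson probability mass e^{-x} x^n / n! for x > 0 and integer n >= 0."""
    return math.exp(-x) * x**n / math.factorial(n)

# Probability that a counting process with intensity lam gains exactly
# `jump` events over one grid step of length dt (illustrative values).
lam, dt, jump = 2.0, 0.1, 1
step_probability = pois(lam * dt, jump)
```

Since the increments of the counting process over a step of length Δt are Poisson with mean λΔt, `step_probability` here equals 0.2·e^{-0.2}.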
Suppose that [t0−max(ϵ,s),T)⊂T and 0<r<ϵ. Then ∃Δt⋆>0 such that $0<j\Delta t^\star<\epsilon$ for all $j=1,2,\dots,\lfloor r/\Delta t^\star\rfloor$. Letting $L=\lfloor r/\Delta t^\star\rfloor$ we get that
[TABLE]
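where we define $p_{\Delta t^\star,i,\omega}=\mathrm{Pois}(\lambda\Delta t^\star;\,b_{\langle T,i,\Delta t^\star\rangle}-X_{\langle T,i+1,\Delta t^\star\rangle}(\omega))$. Let $a_{\omega,i}=X_{\langle T,i+1,\Delta t^\star\rangle}(\omega)$ and $c_{\omega,i}=X_{\langle T,i+L,\Delta t^\star\rangle+\epsilon}(\omega)$ and observe that for any $i=0,1,\dots,\lfloor T/\Delta t^\star\rfloor-\lfloor t_0/\Delta t^\star\rfloor-1$ we have that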
[TABLE]
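where $\zeta_{\Delta t^\star}(b)=\binom{c_\omega-a_\omega}{b}\left(\frac{\Delta t^\star}{\epsilon-L\Delta t^\star}\right)^{b}$ for $0\le b\le c_\omega-a_\omega$, $\eta(x)=x\log(x)$ for x>0, and $x^{\underline{b}}:=b!\binom{x}{b}$ denotes the b-th falling factorial of x.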
We suppose now that ∀ω∈Ω,∃Δtω>0 such that Xt+Δtω(ω)−Xt(ω)≤1 for all t∈[t0,T); that is, there is no more than one event in any interval of length Δtω.
Under this assumption, if ω∈Ω and 0<Δt<min{Δtω,Δt⋆}, then
[TABLE]
where eω,i∈{aω,i,aω,i+1} and dω,i∈{0,1}.
For any i=0,1,…,τ−1, if dω,i=0,
then
[TABLE]
and if dω,i=1, then
[TABLE]
Recall that
[TABLE]
from the proof of Theorem 2 and let $Q_{\omega,\Delta t}=\sum_{i=0}^{\tau-1}d_{\omega,i}$.
Then ∀ω∈Ω we have that
[TABLE]
Since whenever 0<r<ϵ,
[TABLE]
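the quantity $KL\left(\prod_{i=0}^{\tau-1}\mathbb{P}_{X\mid X,Y,i,\Delta t}^{(\omega),(k,l)}\,\Big\|\,\prod_{i=0}^{\tau-1}\mathbb{P}_{X\mid X,i,\Delta t}^{(\omega),(k)}\right)$ is bounded in a sufficiently small neighborhood of [math]. Note that this limit is independent of the sample path.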
For each Δt>0 let $A_{\Delta t}=\{\omega\in\Omega: X_{t+\Delta t}(\omega)-X_t(\omega)\le 1,\ \forall t\in[t_0,T)\}$ and $B_{\Delta t,\gamma}$ be as in Corollary 2; that is,
[TABLE]
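Fix $\gamma>(T-t_0)\left(\lambda-\frac{\log(\lambda(\epsilon-r))}{\epsilon-r}\right)$. We have now shown that for all Δt>0, there exists 0<Δt′<Δt such that $A_{\Delta t'}\subset B_{\Delta t',\gamma}$. Furthermore, since $(B_{\Delta t,\gamma})_{\Delta t>0}$ is a decreasing collection of sets,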
[TABLE]
Due to standard properties of the Poisson point process we have that P(AΔt)=1−o(Δt); thus P(AΔt)→1 as Δt↓0. Now (6.3) yields that P(BΔt,γ)→1 as Δt↓0, which establishes the existence of processes that satisfy (5.24) for some γ>0.
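The claim that P(A_Δt) → 1 as Δt ↓ 0 can be explored by simulation. The sketch below is a Monte Carlo illustration only: it assumes that A_Δt holds exactly when all consecutive event times falling in (t0, T+Δt) are more than Δt apart (which agrees with the definition up to a null set), and every function name and parameter value is a hypothetical choice, not the paper's.

```python
import numpy as np

def simulate_event_times(lam, t_start, t_end, rng):
    """Event times of a homogeneous Poisson point process on [t_start, t_end]."""
    n = rng.poisson(lam * (t_end - t_start))
    return np.sort(rng.uniform(t_start, t_end, size=n))

def in_A(events, t0, T, dt):
    """True if no window (t, t + dt] with t in [t0, T) holds two or more events."""
    window = events[(events > t0) & (events < T + dt)]
    return window.size < 2 or np.min(np.diff(window)) > dt

def estimate_prob_A(lam, t0, T, dts, n_paths=2000, seed=0):
    """Monte Carlo estimate of P(A_dt) for each dt, sharing the same paths."""
    rng = np.random.default_rng(seed)
    horizon = T + max(dts)
    hits = np.zeros(len(dts))
    for _ in range(n_paths):
        events = simulate_event_times(lam, t0, horizon, rng)
        hits += [in_A(events, t0, T, dt) for dt in dts]
    return hits / n_paths
```

Calling, say, `estimate_prob_A(2.0, 0.0, 1.0, [0.2, 0.1, 0.05, 0.01])` yields estimates that do not decrease as Δt shrinks, because on any fixed path membership in A_Δt implies membership in A_Δt′ for every Δt′ < Δt.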
7 Transfer Entropy Rate
The generalization of information theoretic measures to the framework of information rates is a common paradigm in information theory. In this section we address the topic of instantaneous information transfer between processes using our methodology. We begin by defining transfer entropy rate using the EPT as follows (a similar definition appears in [27]):
Definition 5.
For t∈[t0,T), define the transfer entropy rate from Y to X at t, denoted $T_{Y\to X}^{(s,r)}(t)$, by
Suppose the hypotheses of Theorem 2 hold for processes X and Y. If t∈[t0,T) and ∃δ>0 such that EPTY→X(s,r)∣tt+dt<∞, for all dt∈(t,t+δ), then
[TABLE]
Assuming some smoothness of the EPT, we can recover it at any time given the rate by using the following straightforward result.
Lemma 3.
If the map $[t_0,T]\ni t\mapsto EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{t}$ belongs to $C^1([t_0,T])$, then
[TABLE]
Proof.
From the fundamental theorem of calculus, we have that
[TABLE]
∎
Note that we have imposed differentiability in Lemma 3, not just right-hand differentiability.
Lemma 4.
Suppose t0 and T are distinct elements of T and r,s>0 satisfy (t0−max(s,r),T)⊂T. If Y is (s,r)-consistent upon X on [t0,T) and $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{\cdot}$ is linear on [t0,T], then for any t∈[t0,T)
[TABLE]
Proof.
It is immediate that $T_{Y\to X}^{(s,r)}$ is constant since $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{\cdot}$ is linear, hence $EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{\cdot}\in C^1([t_0,T])$. Furthermore, from Lemma 3 we have
[TABLE]
for any t∈[t0,T) and the proof is complete.
∎
8 Application to stationary processes
Definition 6.
Stochastic processes X and Y indexed over T are conditionally stationary if ∀ω∈Ω and k≥1, all collections of times {ti}0≤i≤k in T such that ti<ti+1 for each i, and all A∈X,
[TABLE]
for all i∈[k−1] and τ>0.
Definition 7.
Suppose k and l are positive integers. Stochastic processes X and Y on T are (k,l)-order conditionally stationary processes if ∀ω∈Ω, all collections of times {ti}0≤i≤max(k,l) of T such that ti<ti+1 for each i, and all A∈X,
[TABLE]
for all i∈[max(k,l)−1] and τ>0.
Observe that if X and Y are conditionally stationary processes, then they are by definition (k,l)-order conditionally stationary for all k,l≥1. Moreover, if X and Y are stationary, then ∀Δt>0 and s,r>0 such that [t0−max(s,r),T)⊂T, we have that X and Y are also $(\lfloor s/\Delta t\rfloor,\lfloor r/\Delta t\rfloor)$-order conditionally stationary. We exploit this stationarity in the following observation.
Observation 3.
If X and Y are stationary processes, then for any Δt>0 and j=0,⋯,τ−1 we have that
[TABLE]
where in the second to last equality we used that
[TABLE]
for any c≠0 due to the a.s. uniqueness of the RN-derivative.
We can use Observation 3 to provide an expression for the transfer entropy rate for stationary processes that have (s,r)-consistency on subintervals of [t0,T) of the form [t0,t). It should be noted that a result similar to the statement in part 2 of the following corollary appears as a remark in [27] without proof.
Corollary 3.
Suppose T is a closed and bounded interval, [t0,T)⊂T, and r,s>0 satisfy (t0−max(s,r),T)⊂T. Suppose further that X and Y are stationary processes such that
a.
Y is (s,r)-consistent upon X on [t0,t), ∀t∈(t0,T].
b.
For all t∈(t0,T], ∃M,δ2>0 such that ∀Δt∈(0,δ2),
[TABLE]
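where $k=\lfloor s/\Delta t\rfloor$ and $l=\lfloor r/\Delta t\rfloor$.
If ∀t∈(t0,T], $\lim_{\Delta t\downarrow 0}\frac{1}{\Delta t}T_{Y\to X}^{(k,l),\Delta t}(\Delta t\lfloor t_1/\Delta t\rfloor)$ exists ∀t1∈[t0,t), then
1.
[TABLE]
for all t1∈(t0,t).
2.
$T_{Y\to X}^{(s,r)}(t)=\frac{1}{T-t_0}EPT_{Y\to X}^{(s,r)}\big|_{t_0}^{T}$.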
Proof.
(Proof of 1.)
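Suppose t∈(t0,T] and t1∈(t0,t). Per assumption $\lim_{\Delta t\downarrow 0}T_{Y\to X}^{(k,l),\Delta t}(\Delta t\lfloor t_1/\Delta t\rfloor)/\Delta t$ exists, thus we have that
Since $C_{\Delta t}$ is bounded, $\lim_{\Delta t\downarrow 0}C_{\Delta t}T_{Y\to X}^{(k,l),\Delta t}(\Delta t\lfloor t_1/\Delta t\rfloor)=0$. Now using (8.5) we get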
[TABLE]
and the result follows from division by t1−t0.
(Proof of 2.)
Suppose t1,t2 are distinct elements of [t0,T]. Without loss of generality, suppose t1>t2≠t0. Per assumption X and Y are stationary processes such that Y is (s,r)-consistent upon X on [t0,t1) and [t0,t2). If j′=⌊t1/Δt⌋−⌊t2/Δt⌋, then from (8.3) we have that
[TABLE]
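Per assumption, $\lim_{\Delta t\downarrow 0}\left(\lfloor t_2/\Delta t\rfloor-\lfloor t_0/\Delta t\rfloor\right)T^{(k,l),\Delta t}_{Y\to X}\left(\Delta t\lfloor t_2/\Delta t\rfloor\right)$ exists, and since both $C_{\Delta t}$ and $K_{\Delta t}$ are bounded we have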
[TABLE]
and
[TABLE]
Moreover,
[TABLE]
and since $\frac{t_1-t_0}{t_2-t_0}\lim_{\Delta t\downarrow 0}K_{\Delta t}\,T^{(k,l),\Delta t}_{Y\to X}\left(\Delta t\lfloor t_2/\Delta t\rfloor\right)=0$, we have
[TABLE]
Thus, $\mathrm{EPT}^{(s,r)}_{Y\to X}\big|_{t_0}^{t}$ is linear in t−t0 and the result follows immediately from Lemma 4.
∎
Simply put, Corollary 3 states that under stationarity in a rather strict sense, the TE rate is the average value of the expected pathwise transfer entropy.
9 Jump Processes
In this section we consider EPT between jump processes, i.e., processes whose sample paths are, with probability one, step functions. Such processes are ubiquitous in the literature on applications of TE to neural spike trains, social media sentiment analysis, and similar fields. Examples of such processes are Lévy processes and Poisson processes. We define conditional escape and transition rates, similar to those in [27], as follows.
Definition 8.
For jump processes X=(Xt)t∈[t0,T) and Y=(Yt)t∈[t0,T) with Σ countable, define for each ω∈Ω, t∈[t0,T), r,s>0, and x′∈Σ the conditional transition rate of X given X and Y of x′ at t, denoted $\psi[x'\mid X,Y](t,\omega)$, by
[TABLE]
the conditional transition rate of X given X of x′ at t, denoted $\psi[x'\mid X](t,\omega)$, by
[TABLE]
and the conditional escape rates $\lambda^{(s)}_{X\mid X}(t,\omega)$ and $\lambda^{(s,r)}_{X\mid X,Y}(t,\omega)$ by
[TABLE]
and
[TABLE]
Remark 2.
In what follows, we will sometimes regard the conditional transition rates defined above as measures on the space (Σ,X) for fixed ω∈Ω, t∈T, in accordance with standard definitions of transition kernels (see Section 1.2 of [15]).
Notation 6.
For t∈[t0,T), ω∈Ω, and s,r>0, let
[TABLE]
We now consider TE between time-homogeneous Markov processes.
Definition 9.
Suppose (Ω,F,P) is a probability space, T⊂R≥0 is a bounded and closed interval, Σ is a countable set, and X is a σ-algebra of subsets of Σ containing all singletons of Σ. A stochastic process X=(Xt)t∈T is a time-homogeneous Markov jump process if all of its sample paths are piecewise constant and right-continuous and ∀n≥1, times t0<t1<⋯<tn−1, and sets Ai∈X for all 0≤i≤n,
[TABLE]
for each ω∈Ω and all τ≥0 such that ti−1+τ∈T for all 0≤i≤n.
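As an informal illustration of Definition 9, the following Python sketch simulates a single sample path of a time-homogeneous Markov jump process on a finite state space by drawing exponential holding times and then choosing the next state in proportion to the transition rates. It is a sketch under stated assumptions rather than part of the formal development: the function name `simulate_markov_jump`, the dictionary `rates` of constant transition rates, and the three-state example are hypothetical and introduced only for illustration.

```python
import random

def simulate_markov_jump(rates, x0, t0, T, rng):
    """Simulate one right-continuous, piecewise-constant path on [t0, T).

    rates[x][y] is the constant (time-homogeneous) rate of jumping from x to y != x.
    Returns the jump times (starting with t0) and the state held from each time on.
    All names here are illustrative assumptions, not notation from the paper.
    """
    times, states = [t0], [x0]
    t, x = t0, x0
    while True:
        escape = sum(rates[x].values())   # total rate of leaving state x
        if escape == 0.0:                 # absorbing state: no further jumps
            break
        t += rng.expovariate(escape)      # exponential holding time in x
        if t >= T:
            break
        targets = list(rates[x])
        weights = [rates[x][y] for y in targets]
        x = rng.choices(targets, weights=weights)[0]  # next state, proportional to rates
        times.append(t)
        states.append(x)
    return times, states

# Hypothetical three-state example.
rates = {0: {1: 2.0}, 1: {0: 1.0, 2: 0.5}, 2: {0: 3.0}}
times, states = simulate_markov_jump(rates, x0=0, t0=0.0, T=10.0, rng=random.Random(0))
```

The path is stored as the pair (jump times, states), so that the state on [times[i], times[i+1]) is states[i]; this matches the right-continuity requirement in the definition.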
We now present a Girsanov formula for the pathwise transfer entropy when the destination process is a time-homogeneous Markov jump process and the source process is any jump process.
Theorem 3.
Suppose Σ is countable. Suppose further that X and Y are jump stochastic processes on T with [t0,T)⊂T and X is a time-homogeneous Markov process with conditional transition rates given by (9.1) and (9.2) and conditional escape rates given by (9.4) and (9.3). If
1.
∀ω∈Ω, $\psi[x_{t_0}\mid X,Y](t_0,\omega)=\psi[x_{t_0}\mid X](t_0,\omega)$.
2.
The conditional escape rates are bounded and positive.
3.
$\psi[\,\cdot\mid X,Y](t,\omega)\ll\psi[\,\cdot\mid X](t,\omega)$ for each ω∈Ω and t∈[t0,T).
Then
[TABLE]
for every ω∈Ω and every sample path $x_{t_0}^{T}$ of X.
Proof.
Since X is Markov, there exists an increasing sequence of finite random jump times {τn}n≥0 such that τ0=t0, X is constant on [τn,τn+1), and $X_{\tau_n-}\neq X_{\tau_n}$. Furthermore, from the Markov assumption, conditionally on {Xτn}n≥0, the variables {τn+1−τn}n≥0 are independent and exponentially distributed.
We first need to show that, for arbitrary measures P≪Q on the path space of piecewise constant sample paths of X with transition probabilities $p_P(\cdot,\cdot)$, $p_Q(\cdot,\cdot)$ and escape rates $\gamma_P$, $\gamma_Q$, and for every realization $x_{t_0}^{T}$ of the process $X_{t_0}^{T}$,
[TABLE]
where $\{\tau_i\}_{i=0}^{N_X[t_0,T)}$ is the sequence of jump times of the realization $x_{t_0}^{T}$.
A proof of (9.6) is given in Appendix 1, Proposition 2.6 of [17].
Now letting P and Q be the measures in (9.1) and (9.2), respectively, using assumption 1., and noting that
[TABLE]
and
[TABLE]
where $p_{X\mid X,Y}$ and $p_{X\mid X}$ denote conditional transition probabilities, we get that
[TABLE]
∎
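To make the structure of such a Girsanov-type formula concrete, the following Python sketch evaluates the log-likelihood ratio of one piecewise-constant path under two continuous-time Markov chains with constant rates. It uses the standard form for this setting, a sum of log rate ratios over the jumps minus the time integral of the difference of escape rates; whether this agrees term by term with (9.6) is an assumption, and the names `log_likelihood_ratio`, `rates_p`, and `rates_q` are hypothetical.

```python
import math

def log_likelihood_ratio(times, states, T, rates_p, rates_q):
    """log dP/dQ of a piecewise-constant path observed on [times[0], T).

    times[i] is the i-th jump time (times[0] is the start) and states[i] is the
    state held on [times[i], times[i+1]). rates_p[x][y] and rates_q[x][y] are
    constant jump rates from x to y under P and Q. Both measures are assumed to
    share the same initial distribution (an assumption of this sketch).
    """
    total = 0.0
    for i, (t, x) in enumerate(zip(times, states)):
        t_next = times[i + 1] if i + 1 < len(times) else T
        gamma_p = sum(rates_p[x].values())   # escape rate of x under P
        gamma_q = sum(rates_q[x].values())   # escape rate of x under Q
        total -= (gamma_p - gamma_q) * (t_next - t)   # integral term
        if i + 1 < len(times):
            y = states[i + 1]
            total += math.log(rates_p[x][y] / rates_q[x][y])  # jump term
    return total

# Hypothetical example: a counting path with two jumps on [0, 1),
# compared under jump rate 2 (P) and jump rate 1 (Q).
rates_p = {0: {1: 2.0}, 1: {2: 2.0}, 2: {3: 2.0}}
rates_q = {0: {1: 1.0}, 1: {2: 1.0}, 2: {3: 1.0}}
llr = log_likelihood_ratio([0.0, 0.3, 0.7], [0, 1, 2], 1.0, rates_p, rates_q)
```

In this example the two jumps each contribute log 2 and the integral term contributes (2 − 1)·1, so `llr` equals 2 log 2 − 1.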
We now present the following explicit formula for the TE rate when the source process is a time-homogeneous Markov jump process and the destination process is a time-homogeneous Poisson process.
Corollary 4.
Suppose X is a time-homogeneous Poisson process and Y is a time-homogeneous Markov jump process on [t0,T) such that the hypotheses of Theorem 3 hold. If $t\mapsto\log\frac{\psi[x_t\mid X,Y](t,\omega)}{\psi[x_t\mid X](t,\omega)}\in L^1([t_0,T),\mu)$ for each ω∈Ω, then ∀t∈[t0,T) the transfer entropy rate, $T^{(s,r)}_{Y\to X}(t)$, is given by
[TABLE]
Proof.
Observe that for each ω∈Ω and sample path $x_{t_0}^{T}$ we have
[TABLE]
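Since the process $\left(N_X[t_0,t)(\cdot)-\int_{t_0}^{t}\lambda_{X\mid X,Y}(t',\cdot)\,dt'\right)_{t\in[t_0,T)}$ is a martingale, the stochastic process $\left(\int_{t_0}^{t}\log\frac{\psi[x_{t'}\mid X,Y](t',\cdot)}{\psi[x_{t'}\mid X](t',\cdot)}\,dN_X[t_0,t')(\cdot)\right)_{t\in[t_0,T)}$ is a martingale such that for each ω∈Ω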
[TABLE]
as a consequence of Theorem 9.2.1 of [6].
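Now let $\psi_{t,\omega}=\psi[x_t\mid X,Y](t,\omega)$, $\bar{\psi}_{t,\omega}=\psi[x_t\mid X](t,\omega)$, and $f(t,\omega)=\lambda^{(s,r)}_{X\mid X,Y}(t,\omega)\left(\log\left[\frac{\psi_{t,\omega}}{\bar{\psi}_{t,\omega}}\right]-1\right)$ for each t∈[t0,T) and ω∈Ω. From Theorem 3 and (9.8) we have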
[TABLE]
where the last equality follows from Theorem A16.1 in [30].
∎
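The proof above relies on the compensated counting process being a martingale. As a purely numerical sanity check of the simplest case, the following Python sketch simulates a homogeneous Poisson process with a constant rate and confirms that the sample mean of the count minus its compensator is close to zero. The constant rate, the time window, the sample size, and the helper name `poisson_count` are assumptions chosen for illustration; the sketch does not cover the conditional rates used in the corollary.

```python
import random

def poisson_count(rate, t0, t, rng):
    """Number of points of a homogeneous Poisson process with the given rate in [t0, t)."""
    count, clock = 0, t0
    while True:
        clock += rng.expovariate(rate)   # exponential inter-arrival time
        if clock >= t:
            return count
        count += 1

rng = random.Random(1)
rate, t0, t, n_paths = 3.0, 0.0, 2.0, 20000

# Compensated count N[t0, t) - rate * (t - t0); a mean-zero quantity when the
# compensated process is a martingale started at zero.
compensated = [poisson_count(rate, t0, t, rng) - rate * (t - t0) for _ in range(n_paths)]
mean_compensated = sum(compensated) / n_paths
```

With these values the standard error of `mean_compensated` is about sqrt(6/20000) ≈ 0.017, so a sample mean far from zero would indicate an error in the simulation rather than in the theory.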
10 Conclusion
We end with some open problems regarding the present work. First, motivated by [32], we present an alternative definition of EPT as a limit superior of conditional mutual information over sub-partitions of the interval [t0,T).
We begin by defining sub-partitions of an interval of the form [t0,T).
Definition 10.
A sub-partition P of an interval [t0,T)⊂R is a set of real numbers t0,t1,…,tn such that
[TABLE]
Definition 11.
Suppose T is a closed and bounded interval and let P[t0,T) denote the set of sub-partitions of the interval [t0,T)⊂T and ‖P‖ denote the mesh of a sub-partition P∈P[t0,T), defined by
[TABLE]
For all P∈P[t0,T) and r,s>0 such that (t0−max(r,s),T]⊂T, define the sub-partitioned expected pathwise transfer entropy of the sub-partition P, denoted $\mathrm{EPT}^{(s,r),P}_{Y\to X}\big|_{t_0}^{T}$, by
[TABLE]
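For readers who wish to experiment with sub-partitions numerically, the following Python sketch computes the mesh of a finite set of time points and builds a uniform grid whose mesh shrinks as the number of points grows. It assumes the usual convention that the mesh is the largest gap between consecutive points and that the points are strictly increasing; the helper names `mesh` and `uniform_partition` are hypothetical and the sketch does not evaluate the transfer entropy itself.

```python
def mesh(points):
    """Largest gap between consecutive points of a finite, strictly increasing list.

    Assumes the usual definition of the mesh of a partition (an assumption of
    this sketch, since it only illustrates the notion).
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    gaps = [b - a for a, b in zip(points, points[1:])]
    if any(g <= 0 for g in gaps):
        raise ValueError("points must be strictly increasing")
    return max(gaps)

def uniform_partition(t0, T, n):
    """n + 1 equally spaced points from t0 to T (illustrative helper)."""
    return [t0 + (T - t0) * i / n for i in range(n + 1)]

coarse = uniform_partition(0.0, 1.0, 4)
fine = uniform_partition(0.0, 1.0, 8)
```

Refining the grid from `coarse` to `fine` halves the mesh, which is the sense in which a limit over sub-partitions with vanishing mesh can be explored on a computer.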
Definition 12.
Suppose T is a closed and bounded interval such that [t0,T)⊂T. For all r,s>0 such that (t0−max(r,s),T]⊂T, define
[TABLE]
Question 1.
Is this definition advantageous over, or even equivalent to, Definition 4.2?
In Section 6 we presented an explicit form of KL(PΔt(ω)∥MΔt(ω)) and demonstrated that it satisfies the sufficient conditions of Corollary 2. We propose the following natural question.
Question 2.
What other processes satisfy (5.10) or (5.24) other than the deterministically lagged counting process of a time homogeneous Poisson point process?
In the Appendix, we provide an explicit form for the divergence KL(PX∣X,Y,i,Δt(ω),(k,l)∥PX∣X,i,Δt(ω),(k)) where Y is a time-lagged version of a Wiener process X. However, there is no explicit form for either KL(PΔt(ω)∥MΔt(ω)) or KL(PX∣X,Y,i,Δt(ω),(k,l)∥PX∣X,i,Δt(ω),(k)) other than those presented in the present work. There are myriad transformations one could perform on a process to yield another, for example, thinning, superimposition, deterministic and random lagging, and convolution. Each of these transformations yields a new process that is not independent of the original process; thus, in general, there ought to be a nonzero TE between the two. Compound Poisson processes (CPPs) are of particular relevance to the continuous-time framework presented in this work and are widely used to model neural spike trains, social media sentiment, geological activity, etc.; therefore, a demonstration that either (5.10) or (5.24) holds for pairs of processes derived from variously transformed CPPs may be useful for applications.
One of the main contributions of this work is a definition of the TE rate native to continuous-time processes. However, our methodology does not present any practical means of measuring it.
Question 3.
Do there exist practical estimators of the EPT and the TE rate, at least for common process types?
The transfer entropy estimator presented in [18] is of practical utility for discrete-time processes. Can it be generalized to appropriately measure TE using the measure-theoretic approach taken in this work? If so, what are its properties? There is a wealth of questions one could pose about such an estimator: Is it biased, or asymptotically unbiased? Is it efficient, and how fast is it to compute? Does there exist an appropriate model class under which an MLE for TE exists? How does this estimator compare with binning- and partitioning-based estimators?
If there is no such estimator that can be used in a general setting, does there exist an estimator when the destination and source processes are a particular type of continuous-time stochastic process? Providing estimators for the TE rate and EPT between a pair of time-inhomogeneous PPPs, compound Poisson processes, or Brownian motions with various effects on each other would likely be helpful in understanding a wide variety of linked, real-world time series.
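As a point of comparison for the questions above, the following Python sketch implements a naive plug-in (frequency-count) estimate of Schreiber's discrete-time TE with history length one for finite-alphabet series. It is not the estimator of [18] and it is not an estimator of the continuous-time quantities defined in this work; the function name `plugin_transfer_entropy` and the synthetic test series are assumptions introduced only for illustration.

```python
import math
import random
from collections import Counter

def plugin_transfer_entropy(x, y):
    """Plug-in estimate (in nats) of discrete-time TE from y to x, history length 1.

    Estimates sum p(x1, x0, y0) * log[ p(x1 | x0, y0) / p(x1 | x0) ] by replacing
    probabilities with empirical frequencies. Illustrative sketch only.
    """
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{n+1}, x_n, y_n)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))          # (x_n, y_n)
    pairs_xx = Counter(zip(x[1:], x[:-1]))           # (x_{n+1}, x_n)
    singles = Counter(x[:-1])                        # x_n
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_with = c / pairs_xy[(x0, y0)]               # p(x_{n+1} | x_n, y_n)
        p_without = pairs_xx[(x1, x0)] / singles[x0]  # p(x_{n+1} | x_n)
        te += (c / n) * math.log(p_with / p_without)
    return te

# Synthetic check: x copies y with a one-step lag, versus an independent x.
rng = random.Random(0)
n = 5000
y = [rng.randint(0, 1) for _ in range(n)]
x_copy = [0] + y[:-1]                                 # x_{n+1} = y_n
x_indep = [rng.randint(0, 1) for _ in range(n)]
te_copy = plugin_transfer_entropy(x_copy, y)
te_indep = plugin_transfer_entropy(x_indep, y)
```

For fair binary y, the lagged copy should give an estimate near log 2 nats and the independent series an estimate near zero (slightly positive because plug-in estimates of this kind are biased upward in finite samples), which illustrates the bias question raised above.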
Bibliography
[1] Aliprantis, C. D. Infinite dimensional analysis: a hitchhiker's guide. Springer, London, 2006.
[2] Ankirchner, S. and Imkeller, P. Financial markets with asymmetric information: information drift, additional utility and entropy. Stochastic processes and applications to mathematical finance, 1-21. World Scientific, 2007.
[3] Atar, R. and Weissman, T. Mutual information, relative entropy, and estimation in the Poisson channel. IEEE Transactions on Information Theory, 58(3):1308-1318. IEEE, 2012.
[4] Battaglia, D., Witt, A., Wolf, F., and Geisel, T. Dynamic effective connectivity of inter-areal brain circuits. PLoS Computational Biology, 8(3). Public Library of Science, 2012.
[5] Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., and Bhattacharya, J. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441(1):1-46. Elsevier, 2007.
[6] Brémaud, P. Markov Chains. Texts in Applied Mathematics, volume 31. Springer, 1999.
[7] Çınlar, E. Probability and stochastics, volume 261. Springer Science & Business Media, 2011.
[8] Debowski, L. A general definition of conditional information and its application to ergodic decomposition. Statistics & Probability Letters, 79(9):1260-1268. Elsevier, 2009.