Symbolic transfer entropy rate is equal to transfer entropy rate for bivariate ﬁnite-alphabet stationary ergodic Markov processes

Taichi Haruna¹ and Kohei Nakajima²,³

1 Department of Earth & Planetary Sciences, Graduate School of Science, Kobe University, 1-1 Rokkodaicho, Nada, Kobe 657-8501, Japan, e-mail:[email protected]

2 Department of Informatics, University of Zurich, Andreasstrasse 15, 8050 Zurich, Switzerland, e-mail:[email protected]

3 Department of Mechanical and Process Engineering, ETH Zurich, Leonhardstrasse 27, 8092 Zurich, Switzerland

Received: date / Revised version: date

Abstract. Transfer entropy is a measure of the magnitude and the direction of information flow between jointly distributed stochastic processes. In recent years, its permutation analogues have been considered in the literature to estimate the transfer entropy by counting the number of occurrences of orderings of values, not the values themselves. It has been suggested that the permutation method is easy to implement, computationally inexpensive and robust to noise when applied to real-world time series data. In this paper, we initiate a theoretical treatment of the corresponding rates. In particular, we consider the transfer entropy rate and its permutation analogue, the symbolic transfer entropy rate, and show that they are equal for any bivariate finite-alphabet stationary ergodic Markov process. This result is an illustration of the duality method introduced in [T. Haruna and K. Nakajima, Physica D 240, 1370 (2011)]. We also discuss the relationship among the transfer entropy rate, the time-delayed mutual information rate and their permutation analogues.

1 Introduction

Quantifying networks of information flows is critical to understanding the functions of complex systems such as living, social and technological systems. Schreiber [1] introduced the notion of transfer entropy to measure the magnitude and the direction of information flow from one element to another element emitting stationary signals in a given system. It has been used to analyze information flows in real time series data from neuroscience [1–11], and many other fields [12–19].

The notion of permutation entropy introduced by Bandt and Pompe [20] has shown that much of the information contained in stationary time series can be captured by counting occurrences of orderings of values, not those of values themselves [21–26]. The permutation method has been applied across many disciplines [27], and it has been suggested that it is easy to implement, computationally inexpensive and robust to noise when applied to real-world time series data. Among the previous works, one theoretical result relevant to this paper is that the entropy rate [28], which is one of the most fundamental quantities of stationary stochastic processes, is equal to the permutation entropy rate for any finite-alphabet stationary stochastic process [29, 30].

The symbolic transfer entropy [31] is a permutation analogue of the transfer entropy and has been used as an efficient and conceptually simple way of quantifying information flows in real time series data [31–34]. Another permutation analogue of the transfer entropy, called transfer entropy on rank vectors, has been introduced to improve the performance of the symbolic transfer entropy [35]. So far, most work on permutation analogues of the transfer entropy has been on the application side. This paper concerns the theoretical relationship among the respective rates. In particular, we consider the rate of the transfer entropy on rank vectors, which we call the symbolic transfer entropy rate, and show that it is equal to the transfer entropy rate [36] for any bivariate finite-alphabet stationary ergodic Markov process. We also discuss the relationship among the transfer entropy rate, the time-delayed mutual information rate and their permutation analogues.

Our approach is based on the duality between values and orderings introduced by the authors [37]. In [37], the excess entropy [38–45], which is an effective measure of the complexity of stationary stochastic processes, and its permutation analogue are shown to be equal for any finite-alphabet stationary ergodic Markov process. In this paper, we extend this approach to the bivariate case and address the relationship between the transfer entropy rate and the symbolic transfer entropy rate.

This paper is organized as follows. In Section 2, we introduce the transfer entropy rate and the symbolic transfer entropy rate. We also discuss some combinatorial facts used in later sections. In Section 3, we give a proof of the equality between the transfer entropy rate and the symbolic transfer entropy rate which holds for bivariate finite-alphabet stationary ergodic Markov processes. In Section 4, we discuss the relationship among the transfer entropy rate, the time-delayed mutual information rate and their permutation analogues. Finally, in Section 5, we give concluding remarks.

2 Definitions and Preliminaries

Let $A_n = \{1, 2, \cdots, n\}$ be a finite alphabet consisting of natural numbers from 1 to $n$. In the following discussion, $\mathbf{X} \equiv \{X_1, X_2, \cdots\}$ and $\mathbf{Y} \equiv \{Y_1, Y_2, \cdots\}$ are jointly distributed finite-alphabet stationary stochastic processes, or equivalently, $(\mathbf{X}, \mathbf{Y})$ is a bivariate finite-alphabet stationary stochastic process $\{(X_1, Y_1), (X_2, Y_2), \cdots\}$, where the stochastic variables $X_i$ and $Y_j$ take their values in the alphabets $A_n$ and $A_m$, respectively. We use the notation $X_1^L \equiv (X_1, X_2, \cdots, X_L)$ for simplicity. We write $p(x_1^{L_1}, y_1^{L_2})$ for the joint probability of the occurrence of words $x_1^{L_1} \equiv x_1 x_2 \cdots x_{L_1} \in A_n^{L_1}$ and $y_1^{L_2} \equiv y_1 y_2 \cdots y_{L_2} \in A_m^{L_2}$ for $L_1, L_2 \geq 1$.

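Originally, the notion of transfer entropy was introduced as a generalization of the entropy rate to bivariate processes [1]. Following this original motivation, here we consider not the transfer entropy but the transfer entropy rate [36] from $\mathbf{Y}$ to $\mathbf{X}$, which is defined by

$$ t(\mathbf{X}|\mathbf{Y}) \equiv h(\mathbf{X}) - h(\mathbf{X}|\mathbf{Y}), \tag{1} $$

where $h(\mathbf{X}) \equiv \lim_{L\to\infty} H(X_1^L)/L$ is the entropy rate of $\mathbf{X}$, $H(X_1^L) \equiv -\sum_{x_1^L \in A_n^L} p(x_1^L) \log_2 p(x_1^L)$ is the Shannon entropy of the occurrences of words of length $L$ in $\mathbf{X}$ and $h(\mathbf{X}|\mathbf{Y})$ is the conditional entropy rate of $\mathbf{X}$ given $\mathbf{Y}$ defined by

$$ h(\mathbf{X}|\mathbf{Y}) \equiv \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L), \tag{2} $$

which always converges. $t(\mathbf{X}|\mathbf{Y})$ has the properties that (i) $0 \leq t(\mathbf{X}|\mathbf{Y}) \leq h(\mathbf{X})$ and (ii) $t(\mathbf{X}|\mathbf{Y}) = 0$ if $X_1^L$ is independent of $Y_1^L$ for all $L \geq 1$.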
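As an illustration of definition (1), the following Python sketch evaluates the finite-$L$ quantity $H(X_{L+1}|X_1^L) - H(X_{L+1}|X_1^L, Y_1^L)$ by exact enumeration. Everything specific here is an assumption made for illustration and does not come from the paper: the toy process (binary alphabets, $X_{t+1}$ a noisy copy of $Y_t$ with flip probability 0.1, $\mathbf{Y}$ i.i.d. uniform), the function names, and the use of power iteration for the stationary distribution.

```python
from collections import defaultdict
from itertools import product
from math import log2

def toy_transition(flip=0.1):
    """Hypothetical example: X_{t+1} copies Y_t (flipped w.p. `flip`), Y is i.i.d. uniform."""
    states = list(product((1, 2), repeat=2))  # joint states are pairs (x, y)
    return {s: {t: (1 - flip if t[0] == s[1] else flip) * 0.5 for t in states}
            for s in states}

def stationary(P, iters=500):
    """Approximate the stationary distribution by power iteration (assumes ergodicity)."""
    p = {s: 1.0 / len(P) for s in P}
    for _ in range(iters):
        p = {t: sum(p[s] * P[s][t] for s in P) for t in P}
    return p

def word_probs(P, length):
    """Probability of every word (s_1, ..., s_length) of joint states."""
    p0 = stationary(P)
    probs = {}
    for w in product(list(P), repeat=length):
        pr = p0[w[0]]
        for a, b in zip(w, w[1:]):
            pr *= P[a][b]
        probs[w] = pr
    return probs

def entropy_of(probs, key):
    """Shannon entropy (bits) of the marginal obtained by mapping each word through `key`."""
    marg = defaultdict(float)
    for w, pr in probs.items():
        marg[key(w)] += pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

def transfer_entropy_estimate(P, L):
    """H(X_{L+1}|X_1^L) - H(X_{L+1}|X_1^L, Y_1^L): a finite-L stand-in for t(X|Y)."""
    probs = word_probs(P, L + 1)
    xs = lambda w, k: tuple(s[0] for s in w[:k])
    ys = lambda w, k: tuple(s[1] for s in w[:k])
    h_x = entropy_of(probs, lambda w: xs(w, L + 1)) - entropy_of(probs, lambda w: xs(w, L))
    h_xy = (entropy_of(probs, lambda w: (xs(w, L + 1), ys(w, L)))
            - entropy_of(probs, lambda w: (xs(w, L), ys(w, L))))
    return h_x - h_xy
```

In this toy process $\mathbf{X}$ is itself i.i.d. uniform, so the estimate does not depend on $L$ and equals one bit minus the binary entropy of the flip probability (about 0.531 bits for a flip probability of 0.1). For a general process the finite-$L$ value only approximates the limit in (1).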

In order to introduce the notion of symbolic transfer entropy rate, we define a total order on the alphabet $A_n$ by the usual "less-than-or-equal-to" relationship. Let $\mathcal{S}_L$ be the set of all permutations of length $L \geq 1$. We consider each permutation $\pi$ of length $L$ as a bijection on the set $\{1, 2, \cdots, L\}$. Thus, each permutation $\pi \in \mathcal{S}_L$ can be identified with the sequence $\pi(1) \cdots \pi(L)$. The permutation type $\pi \in \mathcal{S}_L$ of a given word $x_1^L \in A_n^L$ is defined by re-ordering $x_1, \cdots, x_L$ in ascending order, namely, $x_1^L$ is of type $\pi$ if we have $x_{\pi(i)} \leq x_{\pi(i+1)}$ and $\pi(i) < \pi(i+1)$ when $x_{\pi(i)} = x_{\pi(i+1)}$ for $i = 1, 2, \cdots, L-1$. For example, $\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) = 41352$ for $x_1^5 = 24213 \in A_4^5$ because $x_4 x_1 x_3 x_5 x_2 = 12234$. The map $\phi_n : A_n^L \to \mathcal{S}_L$ sends each word $x_1^L$ to its unique permutation type $\pi = \phi_n(x_1^L)$.
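The permutation type is a stable argsort of the word, which the following minimal Python sketch makes explicit. The function name `permutation_type` is our own choice for illustration; words and permutations are represented as tuples of 1-based integers.

```python
def permutation_type(word):
    """Return pi(1)...pi(L) as a tuple of 1-based indices (the map phi_n of the text)."""
    # sorted() is stable, so equal symbols keep their original index order,
    # which is exactly the tie-breaking rule pi(i) < pi(i+1) in the definition.
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)

# The example from the text: the word 24213 has permutation type 41352.
example = permutation_type((2, 4, 2, 1, 3))
```

Because ties are broken by position, a constant word such as 111 is mapped to the identity permutation 123.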

We will use the notions of rank sequences and rank variables [29]. The rank sequences of length $L$ are words $r_1^L \in A_L^L$ satisfying $1 \leq r_i \leq i$ for $i = 1, \cdots, L$. The set of all rank sequences of length $L$ is denoted by $\mathcal{R}_L$. It is clear that $|\mathcal{R}_L| = L! = |\mathcal{S}_L|$. Each word $x_1^L \in A_n^L$ can be mapped to a rank sequence $r_1^L$ by defining $r_i \equiv \sum_{j=1}^{i} \delta(x_j \leq x_i)$ for $i = 1, \cdots, L$, where $\delta(P) = 1$ if the proposition $P$ is true, otherwise $\delta(P) = 0$. We denote this map from $A_n^L$ to $\mathcal{R}_L$ by $\varphi_n$. It can be shown that the map $\varphi_n : A_n^L \to \mathcal{R}_L$ is compatible with the map $\phi_n : A_n^L \to \mathcal{S}_L$ in the following sense: there exists a bijection $\iota : \mathcal{R}_L \to \mathcal{S}_L$ such that $\iota \circ \varphi_n = \phi_n$ [37]. The rank variables associated with $\mathbf{X}$ are defined by $R_i \equiv \sum_{j=1}^{i} \delta(X_j \leq X_i)$ for $i = 1, \cdots, L$. In general, $\mathbf{R} \equiv \{R_1, R_2, \cdots\}$ is a non-stationary stochastic process.
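The rank map and its compatibility with the permutation type can be checked by brute force on small alphabets. The sketch below is only an illustration under our own naming (`rank_sequence`, `ranks_and_types_correspond`); it does not construct the bijection $\iota$ explicitly but verifies that rank sequences and permutation types determine each other on all of $A_n^L$.

```python
from itertools import product

def permutation_type(word):
    """Stable argsort of the word, as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)

def rank_sequence(word):
    """r_i = number of j <= i with x_j <= x_i (the map from words to rank sequences)."""
    return tuple(sum(1 for j in range(i + 1) if word[j] <= word[i])
                 for i in range(len(word)))

def ranks_and_types_correspond(n, L):
    """Check on all words over {1..n} of length L that rank sequence and type determine each other."""
    rank_to_type, type_to_rank = {}, {}
    for word in product(range(1, n + 1), repeat=L):
        r, pi = rank_sequence(word), permutation_type(word)
        # setdefault records the first partner seen; any later mismatch breaks the bijection.
        if rank_to_type.setdefault(r, pi) != pi or type_to_rank.setdefault(pi, r) != r:
            return False
    return True
```

For the word 24213 used earlier, `rank_sequence` returns 1, 2, 2, 1, 4. When $n \geq L$ every permutation type occurs, so the number of distinct rank sequences is $L!$.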

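The symbolic transfer entropy rate from $\mathbf{Y}$ to $\mathbf{X}$ is defined by

$$ t^*(\mathbf{X}|\mathbf{Y}) \equiv h^*(\mathbf{X}) - h^*(\mathbf{X}|\mathbf{Y}), \tag{3} $$

where $h^*(\mathbf{X}) \equiv \lim_{L\to\infty} H^*(X_1^L)/L$ is the permutation entropy rate, which is known to exist and to be equal to $h(\mathbf{X})$ [29],

$$ H^*(X_1^L) \equiv -\sum_{\pi \in \mathcal{S}_L} p(\pi) \log_2 p(\pi) $$

is the Shannon entropy of the occurrences of permutations of length $L$ in $\mathbf{X}$, $p(\pi) = \sum_{\phi_n(x_1^L) = \pi} p(x_1^L)$, and $h^*(\mathbf{X}|\mathbf{Y})$ is given by

$$ h^*(\mathbf{X}|\mathbf{Y}) \equiv \lim_{L\to\infty} \left( H^*(X_1^{L+1}, Y_1^L) - H^*(X_1^L, Y_1^L) \right) \tag{4} $$

if the limit on the right-hand side exists. Here, $H^*(X_1^{L_1}, Y_1^{L_2})$ is defined by

$$ H^*(X_1^{L_1}, Y_1^{L_2}) \equiv -\sum_{\pi \in \mathcal{S}_{L_1},\, \pi' \in \mathcal{S}_{L_2}} p(\pi, \pi') \log_2 p(\pi, \pi'), $$

where $p(\pi, \pi') = \sum_{\phi_n(x_1^{L_1}) = \pi,\, \phi_m(y_1^{L_2}) = \pi'} p(x_1^{L_1}, y_1^{L_2})$.

Let $\mathbf{R}$ and $\mathbf{S}$ be rank variables associated with $\mathbf{X}$ and $\mathbf{Y}$, respectively. By the compatibility between $\phi_k$ and $\varphi_k$ for $k = m, n$, we have $H(R_1^{L_1}, S_1^{L_2}) = H^*(X_1^{L_1}, Y_1^{L_2})$. Thus, $h^*(\mathbf{X}|\mathbf{Y})$ can be written as

$$ h^*(\mathbf{X}|\mathbf{Y}) = \lim_{L\to\infty} H(R_{L+1}|R_1^L, S_1^L) $$

if $h^*(\mathbf{X}|\mathbf{Y})$ exists.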

Note that the above definition of the symbolic transfer entropy rate (3) is not the rate of the original symbolic transfer entropy introduced by Staniek and Lehnertz [31] but that of the transfer entropy on rank vectors [35], which is an improved version of it.
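The rank-variable form of $h^*(\mathbf{X}|\mathbf{Y})$ suggests a direct finite-$L$ comparison of $H(X_{L+1}|X_1^L, Y_1^L)$ with $H(R_{L+1}|R_1^L, S_1^L)$. The Python sketch below does this by exact enumeration for a hypothetical toy process that is not taken from the paper: binary alphabets, $X_{t+1}$ a copy of $Y_t$ flipped with probability `flip`, and $\mathbf{Y}$ i.i.d. uniform, whose stationary distribution is uniform on the four joint states. All function names are our own.

```python
from collections import defaultdict
from itertools import product
from math import log2

def rank_sequence(word):
    """r_i = number of j <= i with x_j <= x_i."""
    return tuple(sum(1 for j in range(i + 1) if word[j] <= word[i])
                 for i in range(len(word)))

def toy_word_probs(length, flip=0.1):
    """Word probabilities of the hypothetical toy process (uniform stationary distribution)."""
    states = list(product((1, 2), repeat=2))  # joint states (x, y)
    probs = {}
    for w in product(states, repeat=length):
        pr = 0.25
        for a, b in zip(w, w[1:]):
            pr *= (1 - flip if b[0] == a[1] else flip) * 0.5
        probs[w] = pr
    return probs

def entropy_of(probs, key):
    """Shannon entropy (bits) of the marginal obtained by mapping each word through `key`."""
    marg = defaultdict(float)
    for w, pr in probs.items():
        marg[key(w)] += pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

def conditional_entropies(L, flip=0.1):
    """Return (H(X_{L+1}|X_1^L,Y_1^L), H(R_{L+1}|R_1^L,S_1^L)) in bits."""
    probs = toy_word_probs(L + 1, flip)
    xs = lambda w, k: tuple(s[0] for s in w[:k])
    ys = lambda w, k: tuple(s[1] for s in w[:k])
    # The rank sequence of a prefix is the prefix of the rank sequence,
    # so ranks of X_1^k can be computed from the first k symbols alone.
    rx = lambda w, k: rank_sequence(xs(w, k))
    ry = lambda w, k: rank_sequence(ys(w, k))
    h_val = (entropy_of(probs, lambda w: (xs(w, L + 1), ys(w, L)))
             - entropy_of(probs, lambda w: (xs(w, L), ys(w, L))))
    h_sym = (entropy_of(probs, lambda w: (rx(w, L + 1), ry(w, L)))
             - entropy_of(probs, lambda w: (rx(w, L), ry(w, L))))
    return h_val, h_sym
```

For small $L$ the two values can differ noticeably, since short rank sequences carry little information about the underlying symbols. The enumeration cost grows as $4^{L+1}$, so this is only practical for small $L$.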

3 Main Result

In this section, we give a proof of the following theorem:

Theorem 1 For any bivariate finite-alphabet stationary ergodic Markov process $(\mathbf{X}, \mathbf{Y})$, we have the equality

$$ t(\mathbf{X}|\mathbf{Y}) = t^*(\mathbf{X}|\mathbf{Y}). \tag{5} $$

Before proceeding to the proof of Theorem 1, we first present some intermediate results used in the proof.

We introduce the map $\mu : \mathcal{S}_L \to \mathbb{N}^L$, where $\mathbb{N} = \{1, 2, \cdots\}$ is the set of all natural numbers ordered by the usual "less-than-or-equal-to" relationship, by the following procedure: first, given a permutation $\pi \in \mathcal{S}_L$, we decompose the sequence $\pi(1) \cdots \pi(L)$ into maximal ascending subsequences. A subsequence $i_j \cdots i_{j+k}$ of a sequence $i_1 \cdots i_L$ is called a maximal ascending subsequence if it is ascending, namely, $i_j \leq i_{j+1} \leq \cdots \leq i_{j+k}$, and neither $i_{j-1} i_j \cdots i_{j+k}$ nor $i_j i_{j+1} \cdots i_{j+k+1}$ is ascending. Second, if $\pi(1) \cdots \pi(i_1),\ \pi(i_1+1) \cdots \pi(i_2),\ \cdots,\ \pi(i_{k-1}+1) \cdots \pi(L)$ is the decomposition of $\pi(1) \cdots \pi(L)$ into maximal ascending subsequences, then we define the word $x_1^L \in \mathbb{N}^L$ by $x_{\pi(1)} = \cdots = x_{\pi(i_1)} = 1$, $x_{\pi(i_1+1)} = \cdots = x_{\pi(i_2)} = 2$, $\cdots$, $x_{\pi(i_{k-1}+1)} = \cdots = x_{\pi(L)} = k$. Finally, we define $\mu(\pi) = x_1^L$. For example, the decomposition of $25341 \in \mathcal{S}_5$ into maximal ascending subsequences is $25, 34, 1$. We obtain $\mu(\pi) = x_1 x_2 x_3 x_4 x_5 = 31221$ by putting $x_2 x_5 x_3 x_4 x_1 = 11223$. By construction, we have $\phi_n \circ \mu(\pi) = \pi$ when $\mu(\pi) \in A_n^L$ for any $\pi \in \mathcal{S}_L$.
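The construction of $\mu$ amounts to labeling each maximal ascending run of $\pi(1) \cdots \pi(L)$ with its run number. A minimal Python sketch, with function names of our own choosing and permutations given as tuples of 1-based indices, is:

```python
from itertools import permutations

def permutation_type(word):
    """Stable argsort of the word, as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)

def mu(pi):
    """Map a permutation pi(1)...pi(L) to the word mu(pi)."""
    word = [0] * len(pi)
    level = 1
    for pos, idx in enumerate(pi):
        # A descent pi(pos) > pi(pos+1) closes one maximal ascending run.
        if pos > 0 and pi[pos - 1] > idx:
            level += 1
        word[idx - 1] = level
    return tuple(word)

# The example from the text: 25341 splits into runs 25 | 34 | 1 and maps to 31221.
example = mu((2, 5, 3, 4, 1))

def mu_is_right_inverse(L):
    """Check that the permutation type of mu(pi) is pi for every permutation of length L."""
    return all(permutation_type(mu(pi)) == pi
               for pi in permutations(range(1, L + 1)))
```

The check `mu_is_right_inverse` uses an alphabet large enough to contain every $\mu(\pi)$, which holds automatically because a permutation of length $L$ has at most $L$ runs.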

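The map $\mu$ can be seen as the dual to the map $\phi_n$ (or $\varphi_n$) in the following sense:

Theorem 2 (Theorem 9 in [37]) Let us put

$$ B_{n,L} \equiv \{x_1^L \in A_n^L \mid \exists \pi \in \mathcal{S}_L \text{ such that } \phi_n^{-1}(\pi) = \{x_1^L\}\}, \qquad C_{n,L} \equiv \{\pi \in \mathcal{S}_L \mid |\phi_n^{-1}(\pi)| = 1\}, $$

where $\phi_n^{-1}(\pi) \equiv \{x_1^L \in A_n^L \mid \phi_n(x_1^L) = \pi\}$ is the inverse image of $\pi \in \mathcal{S}_L$ by the map $\phi_n$. Then,

(i) $\phi_n$ restricted to $B_{n,L}$ is a map into $C_{n,L}$, $\mu$ restricted to $C_{n,L}$ is a map into $B_{n,L}$, and they form a pair of mutually inverse maps.

(ii) $x_1^L \in B_{n,L}$ if and only if

$$ \text{for all } 1 \leq i \leq n-1 \text{ there exist } 1 \leq j < k \leq L \text{ such that } x_j = i+1 \text{ and } x_k = i. \tag{6} $$

The proof of Theorem 2 can be found in [37].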
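Part (ii) of Theorem 2 can be sanity-checked by brute force for small $n$ and $L$. The following Python sketch, whose function names are hypothetical, computes $B_{n,L}$ directly from its definition (words whose permutation type has a single preimage) and compares it with the set of words satisfying condition (6). It is a numerical check, not a substitute for the proof in [37].

```python
from collections import Counter
from itertools import product

def permutation_type(word):
    """Stable argsort of the word, as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)

def words_with_unique_type(n, L):
    """B_{n,L} from the definition: words that are the only preimage of their type."""
    words = list(product(range(1, n + 1), repeat=L))
    counts = Counter(permutation_type(w) for w in words)
    return {w for w in words if counts[permutation_type(w)] == 1}

def satisfies_condition_6(word, n):
    """For every i < n, some occurrence of i+1 precedes some occurrence of i."""
    L = len(word)
    return all(
        any(word[j] == i + 1 and word[k] == i
            for j in range(L) for k in range(j + 1, L))
        for i in range(1, n)
    )

def words_satisfying_condition_6(n, L):
    return {w for w in product(range(1, n + 1), repeat=L)
            if satisfies_condition_6(w, n)}
```

For $n = 2$ and $L = 3$, both constructions give the four words 121, 211, 212 and 221, while the remaining four words all share the identity permutation type.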

Since $h(\mathbf{X}) = h^*(\mathbf{X})$ holds for any finite-alphabet stationary process, proving (5) is equivalent to showing that the equality

$$ \lim_{L\to\infty} H(R_{L+1}|R_1^L, S_1^L) = \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L) \tag{7} $$

holds for any bivariate finite-alphabet stationary ergodic Markov process $(\mathbf{X}, \mathbf{Y})$. For simplicity, we assume that each $(x, y) \in A_n \times A_m$ appears with a positive probability $p(x, y) > 0$. Essentially the same proof can be applied to the general case.

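Lemma 1 For any $\epsilon > 0$, if we take $L$ sufficiently large, then

$$ \sum_{\substack{x_1^L \text{ satisfies } (*), \\ y_1^L \text{ satisfies } (**)}} p(x_1^L, y_1^L) > 1 - \epsilon, \tag{8} $$

where $(*)$ is the condition that for any $x \in A_n$ there exist $1 \leq i \leq \lfloor L/2 \rfloor < j \leq L$ such that $x = x_i = x_j$, and $(**)$ is the condition that for any $y \in A_m$ there exist $1 \leq i' \leq \lfloor L/2 \rfloor < j' \leq L$ such that $y = y_{i'} = y_{j'}$.

Proof. The ergodicity of $(\mathbf{X}, \mathbf{Y})$ implies that the relative frequency of any word $(x_1^k, y_1^k)$ converges in probability to $p(x_1^k, y_1^k)$. In particular, if $F^N_{(x,y)}$ is the stochastic variable defined by the number of indices $1 \leq i \leq N$ such that $(X_i, Y_i) = (x, y)$ for $(x, y) \in A_n \times A_m$, then for any $\epsilon > 0$ and $\delta > 0$ there exists $N_{(x,y),\epsilon,\delta}$ such that if $N > N_{(x,y),\epsilon,\delta}$ then

$$ \Pr\{|F^N_{(x,y)}/N - p(x, y)| < \delta\} > 1 - \epsilon. $$

Now, fix any $\epsilon > 0$. Choose $\delta$ so that

$$ 0 < \delta < \min_{(x,y) \in A_n \times A_m} \{p(x, y)\} $$

and put $N_0 \equiv \max_{(x,y) \in A_n \times A_m} \{N_{(x,y),\epsilon/(2nm),\delta}\}$. Let $S^N_{(x,y)}$ be the set of words $(x_1^N, y_1^N)$ such that there exists $1 \leq i \leq N$ that satisfies $x_i = x$ and $y_i = y$, and $S_N$ the set of words $(x_1^N, y_1^N)$ such that for any $(x, y) \in A_n \times A_m$ there exists $1 \leq i \leq N$ that satisfies $x_i = x$ and $y_i = y$.

If $N > N_0$, then we have for any $(x, y) \in A_n \times A_m$

$$ \begin{aligned} \Pr(S^N_{(x,y)}) &= \sum_{(x_1^N, y_1^N) \in S^N_{(x,y)}} p(x_1^N, y_1^N) \\ &= \Pr\{F^N_{(x,y)} > 0\} \\ &\geq \Pr\{|F^N_{(x,y)}/N - p(x, y)| < \delta\} \\ &> 1 - \epsilon/(2nm), \end{aligned} $$

where the inequality in the third line holds because we have $p(x, y) > \delta$ by the choice of $\delta$.

Then, since

$$ S_N = \bigcap_{(x,y) \in A_n \times A_m} S^N_{(x,y)}, $$

it follows that

$$ \Pr(S_N) > 1 - nm \times \epsilon/(2nm) = 1 - \epsilon/2. $$

Now, take $L$ so that $\lfloor L/2 \rfloor > N_0$. Let $U$ be the set of words $(x_1^L, y_1^L)$ such that $(x_1^{\lfloor L/2 \rfloor}, y_1^{\lfloor L/2 \rfloor}) \in S_{\lfloor L/2 \rfloor}$ and $V$ the set of words $(x_1^L, y_1^L)$ such that $(x_{\lfloor L/2 \rfloor + 1}^L, y_{\lfloor L/2 \rfloor + 1}^L) \in S_{L - \lfloor L/2 \rfloor}$. Then, we have

$$ \Pr(U) \geq \Pr(S_{\lfloor L/2 \rfloor}) > 1 - \epsilon/2 $$

and

$$ \Pr(V) \geq \Pr(S_{L - \lfloor L/2 \rfloor}) > 1 - \epsilon/2. $$

Every word in $U \cap V$ satisfies both $(*)$ and $(**)$. Consequently, we obtain

$$ \sum_{\substack{x_1^L \text{ satisfies } (*), \\ y_1^L \text{ satisfies } (**)}} p(x_1^L, y_1^L) \geq \Pr(U \cap V) > 1 - \epsilon. $$

□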

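We put

$$ D_{n,m,L} \equiv \{(x_1^L, y_1^L) \mid x_1^L \text{ satisfies } (*) \text{ and } y_1^L \text{ satisfies } (**)\} $$

and

$$ E_{n,m,L} \equiv \{(r_1^L, s_1^L) \mid \exists (x_1^L, y_1^L) \in D_{n,m,L} \text{ such that } \varphi_n(x_1^L) = r_1^L,\ \varphi_m(y_1^L) = s_1^L\}. $$

Then, we have $x_1^L \in B_{n,L}$ and $y_1^L \in B_{m,L}$ for any $(x_1^L, y_1^L) \in D_{n,m,L}$. Indeed, if $(x_1^L, y_1^L) \in D_{n,m,L}$, then $x_1^L$ and $y_1^L$ satisfy $(*)$ and $(**)$, respectively. For any $1 \leq i \leq n-1$, there exists $1 \leq j \leq \lfloor L/2 \rfloor$ such that $x_j = i+1$ and there exists $\lfloor L/2 \rfloor < k \leq L$ such that $x_k = i$ by $(*)$. Hence, $x_1^L$ satisfies (6). By Theorem 2 (ii), we have $x_1^L \in B_{n,L}$. In the same way, we have $y_1^L \in B_{m,L}$.

Thus, the map

$$ (x_1^L, y_1^L) \mapsto (\varphi_n(x_1^L), \varphi_m(y_1^L)) $$

is a bijection from $D_{n,m,L}$ to $E_{n,m,L}$ due to the duality between $\phi_k$ and $\mu$ for $k = m, n$. Indeed, it is onto because $E_{n,m,L}$ is the image of the map $\varphi_n \times \varphi_m : A_n^L \times A_m^L \to \mathcal{R}_L \times \mathcal{R}_L$ restricted to $D_{n,m,L}$. It is also injective. For if $(\varphi_n(x_1^L), \varphi_m(y_1^L)) = (\varphi_n(\tilde{x}_1^L), \varphi_m(\tilde{y}_1^L))$, then $\phi_n(x_1^L) = \iota \circ \varphi_n(x_1^L) = \iota \circ \varphi_n(\tilde{x}_1^L) = \phi_n(\tilde{x}_1^L)$ and similarly $\phi_m(y_1^L) = \phi_m(\tilde{y}_1^L)$. By Theorem 2 (i), $\phi_n$ and $\phi_m$ are bijections from $B_{n,L}$ to $C_{n,L}$ and from $B_{m,L}$ to $C_{m,L}$, respectively. Since $x_1^L, \tilde{x}_1^L \in B_{n,L}$ and $y_1^L, \tilde{y}_1^L \in B_{m,L}$, it holds that $x_1^L = \tilde{x}_1^L$ and $y_1^L = \tilde{y}_1^L$.

In particular, we have

$$ p(x_1^L, y_1^L) = p(r_1^L, s_1^L) $$

and

$$ p(r_{L+1}|r_1^L, s_1^L) = p(r_{L+1}|x_1^L, y_1^L) $$

for any $(x_1^L, y_1^L) \in D_{n,m,L}$, where $r_1^L = \varphi_n(x_1^L)$ and $s_1^L = \varphi_m(y_1^L)$.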

Proof of Theorem 1. Given any $\epsilon > 0$, let us take $L$ large enough so that the inequality (8) holds. We decompose the difference between the two conditional entropies as

$$ \begin{aligned} & H(X_{L+1}|X_1^L, Y_1^L) - H(R_{L+1}|R_1^L, S_1^L) \\ &= -\sum_{(x_1^L, y_1^L) \in D_{n,m,L}} p(x_1^L, y_1^L) \Bigg[ \sum_{x_{L+1}} p(x_{L+1}|x_1^L, y_1^L) \log_2 p(x_{L+1}|x_1^L, y_1^L) - \sum_{r_{L+1}} p(r_{L+1}|x_1^L, y_1^L) \log_2 p(r_{L+1}|x_1^L, y_1^L) \Bigg] \\ &\quad - \sum_{(x_1^L, y_1^L) \notin D_{n,m,L}} p(x_1^L, y_1^L) \sum_{x_{L+1}} p(x_{L+1}|x_1^L, y_1^L) \log_2 p(x_{L+1}|x_1^L, y_1^L) \\ &\quad + \sum_{(r_1^L, s_1^L) \notin E_{n,m,L}} p(r_1^L, s_1^L) \sum_{r_{L+1}} p(r_{L+1}|r_1^L, s_1^L) \log_2 p(r_{L+1}|r_1^L, s_1^L) \end{aligned} \tag{9} $$

and evaluate each term on the right-hand side of (9). First, the second term in (9) is bounded by $\epsilon \log_2 n$, which can be made arbitrarily small. This is because

$$ \sum_{(x_1^L, y_1^L) \notin D_{n,m,L}} p(x_1^L, y_1^L) \leq \epsilon $$

by Lemma 1 and the absolute value of the sum over $x_{L+1}$ is at most $\log_2 n$.

Second, to show that the third term also converges to 0 as $L \to \infty$, we use the Markov property: if $(\mathbf{X}, \mathbf{Y})$ is ergodic Markov, then we can show that

$$ \sum_{(r_1^L, s_1^L) \notin E_{n,m,L}} p(r_1^L, s_1^L) = \sum_{(x_1^L, y_1^L) \notin D_{n,m,L}} p(x_1^L, y_1^L) < C \lambda^L $$

for some $C > 0$ and $0 \leq \lambda < 1$. Indeed, we have

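$$ \sum_{(x_1^L, y_1^L) \notin D_{n,m,L}} p(x_1^L, y_1^L) \leq \sum_{x_1^L \text{ does not satisfy } (*)} p(x_1^L) + \sum_{y_1^L \text{ does not satisfy } (**)} p(y_1^L). $$

Since

$$ \sum_{x_1^L \text{ does not satisfy } (*)} p(x_1^L) \leq \sum_{x \in A_n} \left( \sum_{x_i \neq x,\ 1 \leq i \leq N} p(x_1^N) + \sum_{x_i \neq x,\ N < i \leq L} p(x_{N+1}^L) \right) \leq 2 \sum_{x \in A_n} \sum_{x_i \neq x,\ 1 \leq i \leq N} p(x_1^N) $$

and similarly

$$ \sum_{y_1^L \text{ does not satisfy } (**)} p(y_1^L) \leq 2 \sum_{y \in A_m} \sum_{y_i \neq y,\ 1 \leq i \leq N} p(y_1^N), $$

where $N = \lfloor L/2 \rfloor$, it is sufficient to show that the probabilities

$$ \beta_{x,\mathbf{X},L} \equiv \sum_{x_i \neq x,\ 1 \leq i \leq N} p(x_1^N) $$

for all $x \in A_n$ and

$$ \beta_{y,\mathbf{Y},L} \equiv \sum_{y_i \neq y,\ 1 \leq i \leq N} p(y_1^N) $$

for all $y \in A_m$ converge to 0 exponentially fast.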

Let $P$ be the transition matrix for the Markov process $(\mathbf{X}, \mathbf{Y})$. We denote its $(x, y)(x', y')$-th element by $p_{(x,y)(x',y')}$, which indicates the transition probability from state $(x, y)$ to $(x', y')$. We denote the stationary distribution associated with $(\mathbf{X}, \mathbf{Y})$ by $\mathbf{p} = (p_{(x,y)})_{(x,y) \in A_n \times A_m}$, which uniquely exists because of the ergodicity of the process. The probability of the occurrence of a word $(x_1^L, y_1^L)$ is given by $p(x_1^L, y_1^L) = p_{(x_1,y_1)} p_{(x_1,y_1)(x_2,y_2)} \cdots p_{(x_{L-1},y_{L-1})(x_L,y_L)}$. For any $x \in A_n$, we define the matrix $P_x$ whose $(x', y')(x'', y'')$-th element is defined by

$$ (P_x)_{(x',y')(x'',y'')} = \begin{cases} 0 & \text{if } x' = x \\ p_{(x',y')(x'',y'')} & \text{otherwise.} \end{cases} $$

Then, we can write

$$ \beta_{x,\mathbf{X},L} = \langle (P_x)^{N-1} \mathbf{u}_x, \mathbf{p} \rangle, $$

where the vector $\mathbf{u}_x = (u_{(x',y')})$ is defined by $u_{(x',y')} = 0$ if $x' = x$ and otherwise $u_{(x',y')} = 1$, and $\langle \cdots \rangle$ is the usual inner product in the $n \times m$-dimensional Euclidean space. Since $P_x$ is a non-negative matrix, we can apply the Perron-Frobenius theorem to it. We can show that its Perron-Frobenius eigenvalue (the non-negative eigenvalue whose absolute value is the largest among the eigenvalues) $\lambda_x$ is strictly less than 1 in the same manner as in the proof of Lemma 13 in [37]. We can also show that for any $\delta > 0$ there exists $C_{\delta,x} > 0$ such that for any $k \geq 1$

$$ \|(P_x)^k \mathbf{u}_x\| \leq C_{\delta,x} (\lambda_x + \delta)^k \|\mathbf{u}_x\|, $$

where $\| \cdots \|$ is the Euclidean norm. The proof of this fact can be found in, for example, Section 1.2 of [46]. Hence, if we choose $\delta > 0$ sufficiently small so that $\lambda_x + \delta < 1$ and put $\gamma_x \equiv (\lambda_x + \delta)^{1/2}$ and $C_x = C_{\delta,x} (\lambda_x + \delta)^{-2} \|\mathbf{u}_x\| \|\mathbf{p}\|$, then we have $\beta_{x,\mathbf{X},L} \leq C_x \gamma_x^L$. In the same manner, we can obtain a similar bound for $\beta_{y,\mathbf{Y},L}$ for all $y \in A_m$.
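The matrix expression for $\beta_{x,\mathbf{X},L}$ can be compared with a direct enumeration on a small example. The Python sketch below uses a hypothetical toy chain that is not from the paper (binary alphabets, $X_{t+1}$ a noisy copy of $Y_t$, $\mathbf{Y}$ i.i.d. uniform, uniform stationary distribution), represents matrices as nested dictionaries, and uses function names of our own. It illustrates the geometric decay only and makes no attempt to compute the Perron-Frobenius eigenvalue itself.

```python
from itertools import product

def toy_transition(flip=0.1):
    """Hypothetical example: X_{t+1} copies Y_t (flipped w.p. `flip`), Y is i.i.d. uniform."""
    states = list(product((1, 2), repeat=2))  # joint states (x, y)
    return {s: {t: (1 - flip if t[0] == s[1] else flip) * 0.5 for t in states}
            for s in states}

def avoid_probability(P, p, x, N):
    """<(P_x)^(N-1) u_x, p>: probability that X_i != x for all 1 <= i <= N."""
    u = {s: (0.0 if s[0] == x else 1.0) for s in P}
    for _ in range(N - 1):
        # Apply P_x once: rows of states whose x-component equals x are zero.
        u = {s: (0.0 if s[0] == x else sum(P[s][t] * u[t] for t in P)) for s in P}
    return sum(p[s] * u[s] for s in P)

def avoid_probability_bruteforce(P, p, x, N):
    """Same probability, summed over all words of length N that avoid the symbol x."""
    total = 0.0
    for w in product(list(P), repeat=N):
        if all(s[0] != x for s in w):
            pr = p[w[0]]
            for a, b in zip(w, w[1:]):
                pr *= P[a][b]
            total += pr
    return total

P = toy_transition()
p = {s: 0.25 for s in P}  # stationary distribution of the toy chain (uniform)
betas = [avoid_probability(P, p, 1, N) for N in range(1, 7)]
```

In this toy chain $\mathbf{X}$ is i.i.d. uniform, so the avoidance probability halves with every additional symbol. For a general ergodic chain the decay rate is governed by $\lambda_x$, as stated above.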

Since the absolute value of the sum over $r_{L+1}$ is at most $\log_2(L+1)$, the absolute value of the third term is bounded by the quantity $C \lambda^L \log_2(L+1)$, which goes to 0 as $L \to \infty$. Note that there is an $O(\log L)$ diverging term coming from the sum over $r_{L+1}$. The assumed ergodic Markov property is used to overcome this divergence by showing that the quantity $\sum_{(r_1^L, s_1^L) \notin E_{n,m,L}} p(r_1^L, s_1^L)$ converges to 0 exponentially fast.

Finally, the first term is shown to be 0 by the same discussion as in the proof of Lemma 1 in [29]: if $(x_1^L, y_1^L) \in D_{n,m,L}$, then each symbol $x \in A_n$ appears at least once in the word $x_1^L$ (indeed, it appears at least twice). If $a_x$ is the number of occurrences of the symbol $1 \leq x \leq n$ in the word $x_1^L$, then $a_x > 0$ for all $1 \leq x \leq n$. Hence, given $(x_1^L, y_1^L) \in D_{n,m,L}$, $x_{L+1} = x$ if and only if $r_{L+1} = 1 + \sum_{x'=1}^{x} a_{x'}$. Indeed, we have

$$ r_{L+1} = \sum_{i=1}^{L+1} \delta(x_i \leq x_{L+1}) = 1 + \sum_{x'=1}^{x_{L+1}} a_{x'}. $$

Hence, if $x_{L+1} = x$, then we have $r_{L+1} = 1 + \sum_{x'=1}^{x} a_{x'}$. For the converse, if $r_{L+1} = 1 + \sum_{x'=1}^{x} a_{x'}$, then we have $\sum_{x'=1}^{x_{L+1}} a_{x'} = \sum_{x'=1}^{x} a_{x'}$. Since $a_{x'} > 0$ for all $1 \leq x' \leq n$, this happens only when $x_{L+1} = x$.

Thus, given $(x_1^L, y_1^L) \in D_{n,m,L}$, the probability distribution

$$ p(r_{L+1}|x_1^L, y_1^L) $$

is just a re-indexing of $p(x_{L+1}|x_1^L, y_1^L)$, which implies that the first term is exactly equal to 0. This completes the proof of the theorem.

□
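The re-indexing step in the last part of the proof can be illustrated with a few lines of Python. The helper names below are our own, and the example words are arbitrary. The sketch computes $r_{L+1}$ for each candidate next symbol and checks whether distinct symbols always lead to distinct ranks, which is the case precisely when every symbol of $A_n$ already occurs in the word.

```python
def next_rank(word, x_next):
    """r_{L+1} for the extended word: 1 plus the number of earlier symbols <= x_next."""
    return 1 + sum(1 for s in word if s <= x_next)

def next_rank_determines_symbol(word, n):
    """True if distinct next symbols in {1..n} always give distinct ranks r_{L+1}."""
    ranks = [next_rank(word, x) for x in range(1, n + 1)]
    return len(set(ranks)) == n

# Every symbol of {1, 2, 3, 4} occurs in 24213, so the next rank identifies the next symbol.
ranks_full = [next_rank((2, 4, 2, 1, 3), x) for x in range(1, 5)]
# The symbol 2 is missing from 113, so next symbols 1 and 2 produce the same rank.
ranks_missing = [next_rank((1, 1, 3), x) for x in range(1, 4)]
```

When a symbol is missing, two different continuations collapse onto the same rank. Words outside $D_{n,m,L}$ can therefore contribute to the second and third terms of (9), but their total probability vanishes as $L \to \infty$.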

From the proof, we can also see that $t^*(\mathbf{X}|\mathbf{Y}) \leq t(\mathbf{X}|\mathbf{Y})$ holds for any bivariate finite-alphabet stationary ergodic process $(\mathbf{X}, \mathbf{Y})$ if $h^*(\mathbf{X}|\mathbf{Y})$ exists for the process.

4 On the relationship with the time-delayed mutual information rate

Apart from permutation, it is natural to ask whether the equality for the conditional entropy rate

$$ \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L) = \lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1}|Y_1^L) \tag{10} $$

holds or not, which is parallel to the equality for the entropy rate $\lim_{L\to\infty} H(X_{L+1}|X_1^L) = \lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1})$, which holds for any finite-alphabet stationary stochastic process $\mathbf{X}$ [28]. In this section, we will see that this question is closely tied to the relationship between the transfer entropy rate and the time-delayed mutual information rate.

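In general, (10) does not hold. For example, if $\mathbf{X} = \mathbf{Y}$, then we have $\lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L) = h(\mathbf{X})$, while $\lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1}|Y_1^L) = 0$. However, note that the inequality

$$ \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L) \geq \lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1}|Y_1^L) \tag{11} $$

holds for any bivariate finite-alphabet stationary stochastic process $(\mathbf{X}, \mathbf{Y})$. Indeed, we have

$$ \begin{aligned} \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L) &= \lim_{L\to\infty} \frac{1}{L} \sum_{i=1}^{L+1} H(X_i|X_1^{i-1}, Y_1^{i-1}) \\ &\geq \lim_{L\to\infty} \frac{1}{L} \sum_{i=1}^{L+1} H(X_i|X_1^{i-1}, Y_1^L) \\ &= \lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1}|Y_1^L), \end{aligned} $$

where the first equality is due to the Cesàro mean theorem (if $\lim_{L\to\infty} b_L = b$ then $\lim_{L\to\infty} \frac{1}{L} \sum_{i=1}^{L} b_i = b$) and the last equality follows from the chain rule for the Shannon entropy. In the following, we give a sufficient condition for (10).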

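Proposition 1 If there exists $N > 0$ such that if $i > N$ then $X_i$ is independent of $Y_i^{i+j}$ given $X_1^{i-1}$ and $Y_1^{i-1}$ for any $j \geq 0$, that is,

$$ \begin{aligned} & \Pr(X_i = x_i,\ Y_i^{i+j} = y_i^{i+j} \mid X_1^{i-1} = x_1^{i-1},\ Y_1^{i-1} = y_1^{i-1}) \\ &= \Pr(X_i = x_i \mid X_1^{i-1} = x_1^{i-1},\ Y_1^{i-1} = y_1^{i-1}) \times \Pr(Y_i^{i+j} = y_i^{i+j} \mid X_1^{i-1} = x_1^{i-1},\ Y_1^{i-1} = y_1^{i-1}) \end{aligned} $$

for any $j \geq 0$, $x_k \in A_n$ $(1 \leq k \leq i)$ and $y_l \in A_m$ $(1 \leq l \leq i+j)$, then (10) holds, namely, we have the equality

$$ \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L) = \lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1}|Y_1^L). $$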

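Proof. Let us put $a_{i,L} \equiv H(X_{i+1}|X_1^i, Y_1^L)$. If we fix the index $i$, then $a_{i,L}$ is a decreasing sequence in $L$. By the chain rule for the Shannon entropy, we have

$$ H(X_1^{L+1}|Y_1^L) = \sum_{i=0}^{L} H(X_{i+1}|X_1^i, Y_1^L) = \sum_{i=0}^{L} a_{i,L}. $$

However, by the assumption, we have $a_{i,i} = a_{i,i+1} = a_{i,i+2} = \cdots$ for $i > N$. Hence, we have

$$ H(X_1^{L+1}|Y_1^L) = \sum_{i=0}^{N} a_{i,L} + \sum_{i=N+1}^{L} a_{i,i}. $$

Since the former sum is bounded (it has $N+1$ terms, each at most $\log_2 n$), by the Cesàro mean theorem, we obtain

$$ \begin{aligned} \lim_{L\to\infty} \frac{1}{L} H(X_1^{L+1}|Y_1^L) &= \lim_{L\to\infty} \frac{1}{L} \sum_{i=N+1}^{L} a_{i,i} \\ &= \lim_{L\to\infty} a_{L,L} \\ &= \lim_{L\to\infty} H(X_{L+1}|X_1^L, Y_1^L). \end{aligned} $$

□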

Note that if the assumption holds, then it holds for $N = 1$ by stationarity. If $(\mathbf{X}, \mathbf{Y})$ is a stationary Markov process, then we can show by direct calculation that the assumption of Proposition 1 is equivalent to the following simpler condition by using the Markov property:

$$ p(x_2, y_2|x_1, y_1) = p(x_2|x_1, y_1)\, p(y_2|x_1, y_1) \tag{12} $$

for any $x_1, x_2 \in A_n$ and $y_1, y_2 \in A_m$.
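Condition (12) says that every row of the transition matrix factorizes into its two marginals, which is easy to test numerically. The Python sketch below uses a nested-dictionary representation and two hypothetical example chains of our own. Neither the representation nor the examples come from the paper.

```python
from itertools import product

def satisfies_condition_12(P, tol=1e-12):
    """Check p(x2, y2 | x1, y1) = p(x2 | x1, y1) p(y2 | x1, y1) for every row of P."""
    for row in P.values():
        px, py = {}, {}
        for (x2, y2), q in row.items():
            px[x2] = px.get(x2, 0.0) + q  # marginal of the next x given the current state
            py[y2] = py.get(y2, 0.0) + q  # marginal of the next y given the current state
        if any(abs(q - px[x2] * py[y2]) > tol for (x2, y2), q in row.items()):
            return False
    return True

states = list(product((1, 2), repeat=2))  # joint states (x, y)

# Hypothetical example A: X_{t+1} is a noisy copy of Y_t and Y_{t+1} is an independent fair coin.
factorized = {s: {t: (0.9 if t[0] == s[1] else 0.1) * 0.5 for t in states}
              for s in states}

# Hypothetical example B: X_{t+1} = Y_{t+1} is a shared fair coin, so the two
# components are dependent given the past (this is the case X = Y).
coupled = {s: {t: (0.5 if t[0] == t[1] else 0.0) for t in states}
           for s in states}
```

Example A satisfies (12), whereas example B, which corresponds to the case $\mathbf{X} = \mathbf{Y}$ for which (10) fails, does not.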

If (10) holds, then we obtain

$$ t(\mathbf{X}|\mathbf{Y}) = \lim_{L\to\infty} \frac{1}{L} I(X_1^{L+1}; Y_1^L), \tag{13} $$

where $I(A; B)$ is the mutual information between stochastic variables $A$ and $B$. We call the quantity on the right-hand side of (13) the time-delayed mutual information rate and denote it by $i_{+1}(\mathbf{X}; \mathbf{Y})$. Note that we have

$$ t(\mathbf{X}|\mathbf{Y}) \leq i_{+1}(\mathbf{X}; \mathbf{Y}) $$