RパッケージBNSL:連続と離散を区別しない無向森とDAGの構造学習

(1)

R

パッケージ

BNSL:

連続と離散を区別しない無向森と

DAG

の構造学習

Mutual Information Estimation:

Independence Detection and Consistency

鈴木讓

1∗

1

_{大阪大学基礎工学研究科}

1

_{Osaka University}

Abstract: This paper considers to estimate (conditional) mutual information for general random

variable not limited to discrete variables, and proves under a regular its condition consistency and asymptotic correctness of (conditional) independence testing.

1 Introduction

In any field of statistics, the problem of whether two variables are independent or not is an important issue. We often measure how dependent they are by mutual information, and estimate it from data.

Suppose that variables X, Y take finite numbers of values. Then, the mutual information I(X, Y ) is defined by [6] ∑ x ∑ y P (X = x, Y = y) log P (X = x, Y = y) P (X = x)P (Y = y) ,

where P (X = x), P (Y = y), and P (X = x, Y = y) are the probabilities of the events X = x, Y = y, and (X, Y ) = (x, y), respectively. The quanity is often estimated by In= ∑ x ∑ y cXY(x, y) n log cXY(x, y) n cX(x) n · cY(y) n (1)

given the frequencies cX(x), cY(y), and cXY(x, y) in n independent trials of the events.

Suppose that density functions fX, fY, fXY exist

for X, Y, (X, Y ), respectively. Then, the mutual in-formation I(X, Y ) is defined by

∫ −∞ ∫ _∞ −∞ fXY(x, y) log fXY(x, y) fX(x)fY(y) dxdy .

In particular, if variables X, Y are Gaussian, then it becomes

I(X, Y ) =−1

2log{1 − ρ(X, Y )

2_{} ,}

where ρ(X, Y ) is the correlation coeﬃcient of X, Y , which means that the estimation of I(X, Y ) reduces to that of ρ(X, Y ).

∗_{連絡先：大阪大学基礎工学研究科}

〒 560-8531 豊中市待兼山 1-3

E-mail:[email protected]

For non-Gaussian variables with density functions, several authors proposed the estimates of mutual in-formation and proved consistency.

One naive approach is to estimate diﬀerential en-tropies hX, hY, hXY of X, Y, (X, Y ), and to plug-in

them into the formular I(X, Y ) = hX+ hY − hXY [2].

One family of entropy estimators are based on kernel density estimators [13] followed by re-substitution es-timation. An alternate family of entropy estimators is based on k-Nearest Neighbor (k-NN) estimates, be-ginning with the pioneering work of Kozachenko and Leonenko [11]. A. Kraskov [12] proposed a method for estimating mutual information when the variables have a joint density. The estimator starts with es-timating hX, hY, and hXY based on the k-NN

esti-mates, and employs a heuristic to couple the estimates in order to improve the estimator.

On the other hand, Darbellay and Vajda [7] and Silva et.al [16] considered frameworks for histogram-based mutual information estimation of probability distributions equipped with density function. In par-ticular, the reference [16] succeeded to proved strong consistency.

However, we notice that those estimates miss the important property of mutual information: I(X, Y ) = 0 if and only if X, Y are independent, which means that the estimate should be zero when independence of X, Y is detected (independence detection). Al-though the aforementioned estimates of I(X, Y ) are small enough for large n when X, Y are independent, we are not convinced whether they are independent or not because the estimate is positive.

For example, it is not hard to see that In in (1)

converges to I(X, Y ) with probability one as n→ ∞. However, we observe In> 0 with a positive

probabil-ity even when X, Y are independent [17]. In fact, In>

0 unless cX(x)cY(y) = ncXY(x, y) for all x, y, which

is unlikely for any n. The phenomen was pointed out by several authors [1, 10] after the author claimed it first [17].

In particular, we face fatal situations when we

ap-人工知能学会研究会資料 SIG-FPAI-B802-03

(2)

2i 1i 3i 4i - - -2i 1i 3i 4i 2i 1i 3i 4i 2i 1i 3i 4i @_@ 2i 1i 3i 4i - -2i 1i 3i 4i 2i 1i 3i 4i

Figure 1: The Chow-Liu Algorithm executions for

I(1, 2) > I(1, 3) > I(2, 3) > I(1, 4) > I(3, 4) > I(2, 4) > 0 (above) and for I(1, 2) > I(1, 3) > I(2, 3) = I(1, 4) = I(3, 4) = I(2, 4) = 0 (below).

ply those estimates to construction of graphical mod-els. Suppose that we construct a forest by the Chow-Liu algorithm (Figure 1) [5] in which a pair of vari-ables are connected in the ascending order of the mu-tual information I(X, Y ) when no loop is generated and I(X, Y ) > 0. For the procedure, in general, in-dependent variable sets are separated, and a forest rather than a tree is constructed [21]. However, those estimates always choose trees because they do not de-tect any independence among variables.

The inconvenience is due to overfitting of the es-timates.

Another important issue is that those estimates assumed that both of X, Y either take values in finite sets (discrete variables) or have probability density functions (continuous variables). In any dataframe, very often, there exist variables that do not follow the assumption. For example, variable age takes an inte-ger value that ranges from zero to infinity. However, is it appropriate to regard the data as taking a value in

{0, 1, · · · , 149, 150}?, say for n = 20 people ? For

an-other example, variable height takes a real value that ranges from zero to infty but height can be usually measured up to 0.1 cm or 0.01 inch, which means that the probability is not continuous. Is there any density function of the variable height? Essentially, we cannot classify all the variables into discrete and continuous ones but they are only two special cases. In this sense, it is not enough to prepare three estimates between discrete variables, between continous ones, and be-tween discrete and continous ones [8]. A recent work [9] faces the above problem, and only proved a weak property that the estimate is asymptotically unbiased under restrictive conditions.

A few decades ago, the author developed a mutual information estimation procedure for discrete vari-ables that satisfies independence detection and con-sistency [17, 21]. The estimate Jn is a modification

of In in (1), and avoids overfitting in the sense that Jn = 0 if and only if X, Y are independent with

prob-ability one as n→ ∞.

Recently, the author constructed developed a mu-tual information estimation procedure (JIC) for gen-eral variables that satisfies independence detection [19, 20]. The procedure does not distinguish discrete

and continuous data, dealing with any data in the same manner. It is examined in several applications such as independence testing [19] and genome analy-sis [20] in which using the Chow-Liu algorithm, a gene regular network with 1000 genes was constructed [20]. Also, it was published as CRAN package BNSL [22] and downloaded by many R users.

The procedure JIC seeks the maximum among the mutual information estimates of quantized variables such that the frequencies are expressed by a contin-gency table. The algorithm is similar to MIC (max-imal information coeﬃcient, [14]), but MIC and JIC seek the maximization of In and Jn, respectively, so

that MIC gives no theoretical guarantee.

However, for JIC, no consistency has been proved. In this paper, by deriving the error probability P (Jn >

0) when X, Y are independent, we prove under a mild condition, strong consistency of the procedure for gen-eral variables, which means that JIC [19, 20] sat-isfies consistency as well as independence detection, and works for general variables that contain discrete and continuous ones. Although the proof of indepen-dence detection was complicated in the previous re-port [19, 20], however, we simplify it by utilizing the error probability derived in this paper.

This paper is organized as follows: Section 2 ex-plains the estimates for discrete and general variables in 2.1 and 2.2, respectively. Section 3 proves the error probability P (Jn> 0) of JIC when X, Y are

indepen-dent. Section 4 proves the independence detection and consistency in 4.1 and 4.2, respectively. Section 5 extends the procedures and proofs in the previous sections. Section 6 summarizes the paper and states future works.

2 Estimations

2.1 Estimation for Discrete Variables

For the estimate (1), the author proposed an alterna-tive as [17].

Jn:= In− l

2nlog n (2) with l = (α−1)(β−1), where α and β are the numbers of values that X and Y take, respectively. The values of description length of X, Y , and (X, Y ) are, up to constants, −∑ x cX(x) log cX(x) n + α− 1 2 log n , (3) −∑ y cY(y) log cY(y) n + β− 1 2 log n , (4) and −∑ x ∑ y cXY(x, y) log cXY(x, y) n + αβ− 1 2 log n , (5)

(3)

In Jn 0.000 0.005 0.010 0.015 I_n Jn 0.25 0.30 0.35 0.40 0.45 0.50 0.55

Figure 2: The values of In and Jn (sample size n =

200) when I(X, Y ) = 0 (Top) and I(X, Y ) = 0.368 (Bottom). Each box plots contains 500 mutual infor-mation extimates for In and J each, and generates X ∈ {0, 1} and Y ∈ {0, 1} equiprobabily. The

proba-bilities are P (X ̸= Y ) = 0.5 and P (X ̸= Y ) = 0.1 for the I(X, Y ) = 0 and I(X, Y ) = 0.368 cases, respec-tively. 1 2 3 4 5 6 7 8 1 100 73 37 23 10 7 0 0 2 20 20 56 49 41 30 12 2 3 5 12 26 42 49 49 45 22 4 0 0 6 11 25 39 68 101

Figure 3: The space is divided into 4× 8 for n = 1000 in which each column of X and each row of Y contain 125 and 250 samples, respectively. However, the frequencies of (X, Y ) diﬀer.

respectively, so that (2) is obtained from (1) and{(3)+ (4)− (5)}/n. From consistency of MDL [15], we have

Jn ≤ 0 with probability one as n → ∞ when X ⊥⊥ Y .

If we redefine Jn by Jn := max{In− l 2nlog n , 0} (6) then Jn= 0⇐⇒ X ⊥⊥ Y (7)

with probability one as well as

Jn → I(X, Y ) (8)

as n→ ∞. Figure 2 illustrates the diﬀerence between

In and Jn.

2.2 Estimation for General Variables

We consider estimation of mutual information I(X, Y ) without distinguishing whether X and Y are discrete

· · ·· · ·· · · Y X Y X Y X Y X Y X - - -

-Figure 4: The two-dimensional Euclid space of X and

Y is sequentially divided and a nested quantization

sequence is obtained. z1 z2 z3 z4 z5 1 2 2 2 3 6 z1 z2 z3 z4 z5 1 2 2 2 3 6 -Figure 5: Suppose (z1, z2, z3, z4, z5) = (1, 2, 2, 2, 3). Then, we cannot set any border between z2 and z3

and between z3 and z4 because z2 = z3 = z4. If the

procedure tries to divide z2 and z3 (Left), because z2 = z3 = 2, no border between them can be set.

(z1, z2) = (1, 2) is closer than (z4, z5) = (2, 3) from

(z2, z3), so that the border will be set between z1and

z2. On the other hand, if the procedure tries to divide

z3 and z4 (Right), because z3 = z4 = 2, no border

between them can be set. (z4, z5) = (2, 3) is closer than (z1, z2) = (1, 2) from (z3, z4), so that the border will be set between z4 and z5.

· · ·· · ·· · · Jn(1) Jn(2) Jn(3) Jn(mn) Y X Y X Y X Y X

Figure 6: Generate mn quantizations and choose the

maximum value Jn = maxmk=1n J (k)

n of mn mutual

(4)

or continuous. To this end, we prepare a sequence of contingency tables each of which expresses the fre-quencies of quantized values of X, Y , and (X, Y ) such that each of rows X and each of columns Y contain almost equal sample sizes (the diﬀerence is at most one, see Figure 3), respectively. As the sequence pro-ceeds, the contingency table is monotonically refined so that each cell contains fewer samples (Figure 4).

If two samples with the same value of X is di-vided into diﬀerent quantized value of X, the border crossing them will be adjusted as in Figure 5. If two borders are coincide, we regard them as one border. In this sense, if X takes α values, at most α− 1 bor-ders will generated for X. The same rule is applied to Y as well as to X.

We assume that the number mnof the contingency

tables is at most n, and note that from each contin-gency table k = 1, 2, . . . , mn, the estimation J

(k) n of

mutual information is obtained by regarding that X and Y take α(k) and β(k) value, respectively (Figure 6). The final estimate will be

Jn := max{Jn(1), J (2) n , . . . , J

(mn)

n }

Even under the extended conditions, (7) has been proved in [20].

3 Error Probabilities

In this section, we bound the error probability of the mutual information estimate given n examples when

X, Y are independent. The following theorem will be utilized in proving the main theorem in the next section.

Theorem 1 Suppose X and Y are independent, then

P (Jn > 0)≤ ( e log n n )l 証明: Full Paper をご参照ください。

4 Independence Detection and

Consistency

In this section, we prove independence detection and consistency in 4.1 and 4.2, respectively.

For k = 1, 2,· · · , mn, let X(k) and Y(k) be the

variables X and Y in the kth contingency table, α(k) and β(k) _{the numbers of values that X}(k) _{and Y}(k)

take, respectively. Also, for k = 1, 2,· · · , mn, let l(k)_{:= (α}(k)_{− 1)(β}(k)_{− 1).}

4.1 Independence Detection

Although the proof of independence detection appeared in [19, 20], however, we simplify it by utilizing the er-ror probability bound derived in Section 3.

Theorem 2 Let X and Y be any random variables. X ⊥⊥ Y if and only if the event Jn> 0 finitely occurs

almost surely.

Proof. For (⇐), if X ̸⊥⊥ Y , then we have I(X(k)_{, Y}(k)_{) >}

0 for at least one k. From (8), Jn(k) converges to I(X(k), Y(k)) > 0 with probability one as n → ∞. Thus, Jn(k) > 0 occurs infinitely many times with

probability one, which establishes (⇐).

Let K be the maximum k such that l(k) _{≤ 4.}

Then, from (7), for each k, the event (Jn(k) > 0)

finitely occurs almost surely. Since K does not de-pend on n, the event ∪K

k=1(J (k)

n > 0) finitely occurs

almost surely. From Proposition 1, we have

P (∪mn k=K+1(J (k) n > 0)) ≤ mnP (Jn(K+1)> 0)≤ n( e log n n ) l(K+1)/2 ≤ (e log n)5/2 n3/2 , which means ∞ ∑ n=1 P (∪mn k=K+1(J (k) n > 0)) <∞ .

From Borel-Cantelli’s Lemma [3], the event∪mn

k=K+1(J (k) n >

0) finitely occurs almost surely as well, which means that ∪mn

k=1(J (k)

n > 0) finitely occurs almost surely.

This establishes (⇒).

4.2 Consistency

In this subsection, we state the main theorem and prove it.

Theorem 3 If there exists K≥ 1 such that for each r = 1, 2, . . . ,

I(X(K+r), Y(K+r)) = I(X(K), Y(K))

Then, the estimation Jn converges to the mutual

in-formation I(X, Y ).

Since I(k)_{(X, Y ) is upperbounded by I(X, Y ) and}

mono-tonely increasing with k, we have lim

k→∞I

(k)_{(X, Y ) = I(X, Y ) ,}

so that the assumption may be reasonable. Proof. Note for each r = 1, 2, . . .

I(K)(X, Y ) = IK+r(X, Y ) ⇐⇒ P (X(K), Y(K)) P (X(K)_{P (Y}(K)₎= P (X(K+r)_{, Y}(K+r)₎ P (X(K+r)_{P (Y}(K+r)₎ ⇐⇒    X(K+r)⊥⊥ Y(K+r)|{X(K), Y(K)} X(K+r)⊥⊥ Y(K)|X(K) X(K)⊥⊥ Y(K+r)|Y(K) ,

(5)

"! # "! # "! # "! # X(K+r) _X(K) _Y(K) _Y(K+r)

Figure 7: Given X(K)_{and Y}(K)_{, the pair X}(K+r)_and

Y(K+r)_{are independent; given X}(K)_{, the pair X}(K+r)

and Y(K) _{are independent; and given Y}(K)_{, the pair} X(K)and Y(K+r) are independent.

and its graphical model is expressed by Figure 7. Let JA

n, JnB, and JnC be the conditional mutual

information estimations in the form of (9) for

I(X(K+r), Y(K+r)|X(K), Y(K))

I(X(K+r), Y(K)|X(K)) , and

I(X(K), Y(K+r)|Y(K)) ,

and IA

n, InB, and InCbe the associated maximum

likeli-hood estimations. More precisely, JnA= InA−l

A

2nlog n, J_nB= I_nB−l_2nB log n, and J_nC = I_nC−_2nlC log n, where

lA= (α(K+r)− α(K))(βK+r− β(K))

lB = (α(K+r)− α(K))(βK− 1) , and

lC = (α(K)− 1)(βK+r− β(K)) . From inspection, one checks

I_n(K+r)− I_n(K) = I_nA+ I_nB+ I_nC , and lA+ lBn + lCn = (α(K+r)− 1)(β(K+r)− 1) − (α(K)− 1)(β(K)− 1) , so that we have J_n(K+r)− J_n(K)= J_nA+ J_nB+ J_nC . which means P (Jn(K)< Jn(K+r)) ≤ P (JA n > 0) + P (JnB> 0) + P (JnC> 0) ≤ (e log n n ) lA_{+ (}e log n n ) lB_{+ (}e log n n ) lC _.

Similar to the discussion in the proof of Theorem 1, for r = 1, 2, . . . , the event Jn(K) < J

(K+r)

n occurs at

most finite time with probability one as n→ ∞. On the other hand, Jn(k) converges to I(k)(X, Y )

with probability one as n→ ∞, for each k = 1, 2, · · · . This completes the proof.

5 Extensions

In this section, we consider estimating conditional mu-tual information from data rather than mumu-tual infor-mation. We prove Theorems 4-6 that extend Theo-rems 1-3 althought the extensions are almost straight-forward.

Suppose that X, Y , and Z take α, β, and γ values. The maximum likelihood estimator of conditional mu-tual information is given by

I_n′ =∑ x ∑ y cXY Z(x, y, z) n log cXY Z(x, y, z) n · cZ(z) n cXZ(x, z) n · cY Z(y, z) n ,

and the author proposed the estimator [18]

J_n′ := I_n′ − l

′

2nlog n (9) with l′:= (α− 1)(β − 1)γ, where we write X ⊥⊥ Y |Z when X and Y are conditionally independent given

Z, and proved the two properties.

J_n′ = 0⇐⇒ X ⊥⊥ Y |Z (10) with probability one as well as

J_n′ → I(X, Y |Z) (11) as n→ ∞.

Theorem 4 Suppose X and Y are conditionally

in-dependent given Z, then

P (J_n′ > 0)≤

(

e log n n

)l′

Proof. From the proof of Theorem 1, it is suﬃcient to show that P (χ2

l′ ≤ l′log n) is bounded by ( e log n

n ) l′_.

However, the statement has been already proved in Theorem 1, which completes the proof of Theorem 4.

Theorem 5 Let X, Y , and Z be any random

vari-ables. X ⊥⊥ Y |Z if and only if the event Jn′ > 0

finitely occurs almost surely.

Proof. Following the proof of Theorem 2, we check that l(k+1)′≥ 5 =⇒ P (∪mn k=K+1(J (k) n ′ > 0))≤ n(e log n n ) l(K+1)′/2

as well, which completes the proof of Theorem 2.

Theorem 6 If there exists K≥ 1 such that for each r = 1, 2, . . . ,

I(X(K+r), Y(K+r)|Z(K+r)) = I(X(K), Y(K)|Z(K))

Then, the estimation J_n′ converges to the mutual in-formation I(X, Y|Z).

(6)

Proof. The derivation is similar to Theorem 3. Simply replace the three conditional independence statements by    X(K+r)_{⊥⊥ Y}(K+r)_|{X(K)_{, Y}(K)_{, Z}(K+r)_} X(K+r)_{⊥⊥ Y}(K)_|{X(K)_{, Z}(K+r)_} X(K)_{⊥⊥ Y}(K+r)_|{Y(K)_{, Z}(K+r)_} .

This completes the proof.

References

[1] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified max-imum likelihood approach for optimal distribu-tion property estimadistribu-tion. Electronic Colloquium

on Computational Complexity, 186, 2016.

[2] Jan Beirlant, Edward J Dudewicz, Lszl Gyrfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International

Journal of Mathematical and Statistical Sciences,

6(1):17–39, 1997.

[3] P. Billingsley. Probability & Measure. Wiley, New York, 3rd edition, 1995.

[4] Jonathan M. Borwein and O-yeat Chan. Uniform bounds for the complementary incomplete gamma function. Preprint at http://locutus.cs.dal.ca:8088/archive/00000335,

2009.

[5] C. K. Chow and C. N. Liu. Approximating dis-crete probability distributions with dependence trees. IEEE Trans. on Information Theory, IT-14(3):462–467, 6 1968.

[6] Thomas M. Cover and Joy A. Thomas. Elements

of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing).

Wiley-Interscience, July 2006.

[7] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partition of the obser-vation space. IEEE Transactions on Information

Theory, 45(4):13151321, 1999.

[8] D. Edwards, G. C. G. de Abreu, and R. Labouriau. Selecting high-dimensional mixed graphical models using minimal AIC or BIC forests. BMC Bioinformatics, 11(18), 1 2010. [9] Weihao Gao, Sreeram Kannan, Sewoong Oh, and

Pramod Viswanath. Estimating mutual informa-tion for discrete-continuous mixtures. In

Confer-ence on Neural Information Processing Systems (NIPS), 2017.

[10] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of func-tionals of discrete distributions. IEEE

Transac-tions on Information Theory, 61(5):2835–2885,

2015.

[11] L. F. Kozachenko and Nikolai N. Leonenko. Sam-ple estimate of the entropy of a random vec-tor. Problemy Peredachi Informatsii, 23(2):9–16, 1987.

[12] A. Kraskov, H. Stgbauer, and P. Grassberger. Estimating mutual information. Physical review

E, 69(6):066138, 2004.

[13] Liam Paninski and Masanao Yajima. Un-dersmoothed kernel entropy estimators.

IEEE Transactions on Information Theory,

54(9):43844388, 2008.

[14] Y. A. Reshef, D. N.and Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets.

Science, 334(6062):15181524, 2011.

[15] J. Rissanen. Modeling by shortest data descrip-tion. Automatica, 14:465–471, 1978.

[16] Jorge Silva and Shrikanth S. Narayanan. Non-product data-dependent partitions for mutual in-formation estimation: Strong consistency and applications. IEEE Transactions on Signal

Pro-cessing, 58(7):3497–3511, 2010.

[17] J. Suzuki. A construction of Bayesian networks from databases based on an MDL principle. In

Uncertainty in Artificial Intelligence, pages 266–

273, Washington DC, 1993. Morgan Kaufmann. [18] J. Suzuki. The Bayesian Chow-Liu algorithm.

In The Sixth European Workshop on

Probabilis-tic Graphical Models, pages 315–322, Granada,

Spain, 2012.

[19] J. Suzuki. An estimator of mutual information and its application to independence testing.

En-tropy, 18(109), March 2016.

[20] J. Suzuki. A novel Chow-Liu algorithm and its application to gene diﬀerential analysis.

Interna-tional Journal of Approximate Reasoning, 80:1–

18, January 2017.

[21] J. Suzuki. Forest learning from data and its uni-versal coding. IEEE Trans. Information Theory, 64:7349–7358, November 2018.

[22] J. Suzuki and J. Kawahara.

Package BNSL. https://cran.r-project.org/web//packages/BNSL/BNSL.pdf.

RパッケージBNSL:連続と離散を区別しない無向森とDAGの構造学習

R

パッケージ

BNSL:

連続と離散を区別しない無向森と

DAG

の構造学習

Mutual Information Estimation:

Independence Detection and Consistency

鈴木 讓

大阪大学 基礎工学研究科

Osaka University

1

Introduction

2

Estimations

2.1

Estimation for Discrete Variables

2.2

Estimation for General Variables

3

Error Probabilities

4

Independence Detection and

Consistency

4.1

Independence Detection

4.2

Consistency

5

Extensions

References

鈴木讓

_{大阪大学基礎工学研究科}

_{Osaka University}