応答変数が連続値の際の組み合わせ仮説に対する多重検定補正法

(1)

応答変数が連続値の際の組み合わせ仮説に対する多重検定補正法

Multiple testing correction for combinatorial hypotheses with

continuous response variables

井ノ口敬章

1

_永野竜輝

1

_西郷浩人

2∗

1

_{九州工業大学}

1

_{Kyushu Institute of Technology}

2

_九州大学

2

_{Kyushu University}

Abstract: As the technologies to collect and store data advances, enumeration of statistically

signiﬁcant patterns is becoming a major challenge, but rigorous statistical evaluation of such patterns has recently been initiated. A limitation of the currently available methods is that they only deal with binary response labels, and thresholding is necessary when dealing with continuous response labels. We present an algorithm that can identify statistically signiﬁcant itemsets in terms of Wald test based on continuous response labels. In computational experiments using synthetic data, we demonstrate that the proposed algorithm is equipped with superior statistically power compared to Bonferroni’s method, while correctly controlling Family-Wise Error Rate.

1 Introduction

Beyond a simple model that a result is caused by a single reason, it is often necessary to explore more complex models. For example, in order to gener-ate pluripotent stem cells, four genes were transduced into adult cells [6]. A major challenge is how to select multiple genes without relying on experts’ knowledge. In this sense, information theoretic foundation is yet to be established. There are two reasons that make this problem hard from the perspective of computer science; i) NP-hardness of enumerating patterns in combinatorial space, and ii) multiple hypothesis test-ing problem in evaluattest-ing large number of patterns.

The former problem is not yet solved fundamen-tally, but practically efficient algorithms such as Apri-ori, Eclat and FP-growth have been developed since 1990s. This paper is built on such enumeration algo-rithms, but focuses on the multiple hypothesis test-ing problem. Suppose that entire combinatorial space is traversed by using one of the above algorithm, it would end up with a large number of patterns. A crucial problem is how to evaluate statistical signifi-cance of those patterns. Since the number of patterns grows exponentially, it suffers from serious false pos-itive problem. Bonferroni’s correction is a standard method for correcting such multiple testing problem, which works by multiplying raw p-value by the num-ber of total tests [1]. As a result, family-wise er-ror rate (FWER), which is a probability of observ-ing at least one false positive in a series of tests, is strictly controlled under a pre-specified value. How-ever, Bonferroni’s correction is often criticized to be

∗_{連絡先：九州大学システム情報科学研究院}

〒 819-0395 福岡県福岡市西区元岡 744 E-mail: [email protected]

too conservative, which results in a low statistical power [4]. This problem is essentially raised by count-ing all the patterns into Bonferroni’s correction fac-tor. However, as is shown by Tarone, there are cases in which we can exclude many patterns from calcula-tion of Bonferroni’s factor, and relax the signiﬁcance threshold [7]. In his pioneering work, an example of how to improve Bonferroni factor is illustrated for chi-squared test and Fisher’s exact test. Recently Terada et al. showed that the same idea can be ap-plied to Mann-Whitney’s U test [8], and Papaxanthos et al. showed that the same idea could be applied to Cochran-Mantel-Haenszel Test [3]. Application of the Tarone’s idea to non-item patterns are also sur-veyed in subgraph mining [5] and subinterval min-ing [2]. However, none of the above consider the case in which response variable is continuous. Handling of continuous response label is important in dealing with data from natural phenomenon, since discretization or thresholding could lead to loss of information.

In this paper, we present a multiple testing correc-tion method for frequent patterns with continuous la-bels. Our method ignores patterns by the same mech-anism as Tarone’s method, thus can provide tight bound of FWER. In computer simulation using syn-thetic data, the proposed method is shown to control FWER while retaining high statistical power com-pared to a Bonferroni’s counterpart.

The rest of the paper is organized as follows. In Section 2, we review Wald test, and studies the min-imum p-value computable at each node in a pattern search tree. This is a key technique in pruning the search tree, and in improving Bonferroni’s factor. In Section 3, we describe how to generate simulation data and to set hyper-parameters. Section 4 compares the experimental results of the proposed method with

人工知能学会研究会資料 SIG-FPAI-B506-11

(2)

the Bonferroni’s method. Section 5 closes the paper with discussion and future work.

2 Method

We consider a dataset of n transactionsD = {ti, yi}ni=1, where ti ⊆ {0, 1}d, and continuous class response

yi ⊆ R. The binary vector ti can be equivalently

represented as a set Ti that contain all the indices j such that tij= 1. Existence or absence of an itemset

X in a transaction T can be represented as I(X⊆ T ),

where I(.) is an indicator function that returns 1 if X exists in T and 0 otherwise. For example in the toy dataset in Table 1, an itemset {1, 2, 6} is contained in transactions 1, so I({1, 2, 6} ⊆ T1) = 1. We

as-sume in this paper that y is asymptotically normal and non-redundant.

2.1 Wald test

Table 1: example dataset transaction ID itemsets response

1 1,2,3,6 120

2 1,2,6 110

3 2,6 0

4 1,3,4,5,6 -80

5 2,3,4,5 -150

Let θ be a scalar parameter, and let bθ be an

esti-mate of θ and letse be the estimated standard errorb

of bθ.

We let [θ(X) be a function that returns the

arith-metic mean of the response values of the transactions that contain itemset X;

[ θ(X) = 1 |{Ti|X ⊆ T }| n ∑ i=1 yiI(X⊆ Ti),

where|.| denotes a function that returns the number of transactions that have itemset X, and the denom-inator is a support of an itemset X in the dataset. For example in Table 1, θ({1, 2, 6}) is calculated as\

120+110

2 = 115. In the framework of Wald test to

judge if [θ(X) is equal to θ0, we test if the following

null hypothesis is rejected [9].

H0: [θ(X) = θ0

More precisely, we inspect if the test statistic

W = θ(X)[ − θ0

b

se

falls into rejection region. We can integrate the region outside the critical value W > zα/2 to convert it into p-value pθ0(W > zα/2), where α is a pre-speciﬁed

sig-niﬁcance level. Our goal is to enumerate all itemsets

X that signiﬁcantly deviates [θ(X) from θ0. However,

there is potentially a large number of itemsets, and when one tries to evaluate them simultaneously,

mul-tiple testing problem occurs.

2.2 Multiple testing correction

Suppose that we have generated all the frequent item-set {X1, X2, . . . XD} and calculated corresponding p-values by Wald test. When one compares more than one hypothesis simultaneously, then the probability of observing false positives easily exceeds pre-speciﬁed threshold α. This phenomenon is pronounced as we have more tests. Therefore, in such a situation, Family-Wise Error Rate (FWER) is often employed. It is es-sentially a probability of observing at least one false positive in a series of tests. Suppose we have M tests, and desire to control FWER under α, then Bonfer-roni’s method multiplies raw p-value by the number of tests M [1]. However, in frequent itemset mining,

M grows exponentially fast, and it may end up with

overlook of true itemset that changes the response sig-niﬁcantly.

Tarone’s method is an extension of the Bonfer-roni’s method, but can make M smaller [7]. This method requires one to calculate minimum attain-able p-value for each hypothesis. Suppose that we have computed minimum p-values for all the frequent itemset, and they are sorted in an ascending order;

p∗₁ ≤ p∗₂, . . . , p∗_D. Let k be the number of itemsets that exceed the corrected signiﬁcance level δk, that is, δk = α/k. Then the upperbound of FWER by Tarone’s method is given as

F W ER ≤ M ∑ i=1 P r(pi≤ δk) = ∑ i∈{m|p∗_m≤δk} P r(pi≤ δk) + ∑ i_∈{m|p∗_m>δk} P r(pi≤ δk)

For itemsets with p∗_i < δk, we can insure that the cor-responding p-value pi never achieves signiﬁcant level because of the relationship pi≥ p∗i > δk. In that case,

P r(pi ≤ δk) = 0. By plugging this fact into the above equation, the upperbound of FWER is obtained as

F W ER≤ ∑

i∈{m|p∗_m≤δk}

P r(pi≤ δk) =|{m|p∗m≤ δk}|δk

Note that the size of the correction factor|m| becomes smaller than M while still guaranteeing the upper-bound of the FWER≤ α.

Now the problem is how to compute such mini-mum attainable p-values. In recent studies such as LAMP or FAIS [?], response variable is given as bi-nary. In such a case, one can draw a 2×2 contingency table, and the minimum p-values are computed from the marginal distribution. The next section explains how to compute such minimum p-values for Wald test.

(3)

2.3 Minimum p-value for Wald test

Figure 1: A search tree illustrating the monotonically decreasing property of bθ. Each node consists of an

itemset pattern, and corresponding [θmax. In order to compute sample standard deviation, we do not deﬁne leaf nodes with support = 1.

In our case, p-value of Wald test is computed by integrating normal distribution from a test statistic

W to the end of the tail, so the minimum p-value is

obtained when W takes its maximum. More precisely, such Wmaxis given by

Wmax(X) = \ θmax(X)− θ0 \ semin(X) .

Suppose that the responses of the transactions that contain itemset X are sorted in a descending order, such that y1 ≥ y2. . ., then we deﬁneθmax\(X) as the mean of the top two responses. As a result, θmax(X) has a monotonically decreasing property, since for X⊆

X0, maxiyiI(X0 ⊆ Ti) may become zero. An exam-ple of computing θmax({1}) for toy data in Table 1 is shown in Figure (1).

\

semin(X), however, does not have either increas-ing or decreasincreas-ing property. So we use a ﬁxed surrogate value se∗_min such that se∗_min ≤ se\min(X) holds for any itemset X. Below, we give a sketch how to com-pute such se∗_min. Suppose that response values are sorted in a descending order, that is, y1_{≥ y}2_{. . .}_{≥ y}n_, then compute the standard deviation of the neighbor-ing responses with the closest absolute distance, and set it to se∗_min. Since se∗_min gives a lowerbound of

\

semin(X), we obtain upperbound of Wmax(X) as;

W_max∗ (X) = θmax\(X)− θ0

se∗_min > Wmax(X).

By taking integral over normal distribution, we obtain minimum attainable p-value as

p∗(X)X = pθ0(W

∗

max(X) > zα/2)

< pθ0(Wmax(X) > zα/2) < pθ0(W (X) > zα/2)

In ﬁnding the corrected threshold δ, we employ suport decrease algorithm in the same as LAMP. The overall procedure is summarized below.

1. Set m = 0, λ = n

2. Run frequent pattern mining with minimum sup-port λ, and let mλbe the number of found pat-terns.

3. m← m + mλ

4. Set δ0= p∗_m

5. If δ0mλ≤ α, then set λ = λ − 1, and go to 2. 6. Set δ = α/mλ as the signiﬁcance threshold 7. For obtained m itemsets, recompute p-values,

then reject those below δ.

3 Experiments

In this section, we compare the proposed method with Bonferroni’s correction. We generated itemset datasets based on three parameters; 1) transaction length, 2) maximum type of items, and 3) maximum number of items per transaction. Continuous response vari-ables are generated such that they follow standard normal distributionN (0, 1). For “case” transactions that includes true causal itemsets we added an extra quantity , and for “control” transactions we left them as they are. Signiﬁcance level α is set to 0.05 unless otherwise stated. θ0is set to 0.

In order to make sure that FWER is controlled, we have randomly generated 100 transactions, and changed from 0.5 to 2, and measured FWER for both the proposed method and the Bonferroni’s method. We also measured statistical power for both methods in the same experimental settings.

For comparing the eﬃciency, we have measured the time consumption for both the methods while changing the number of transactions.

In all experiments, we used a workstation with Intel Xeon E5 processor. Under a given parameter set, 10 diﬀerent datasets are generated, and their means and standard deviations are reported.

4 Results

Figure 2 shows FWER for both the proposed method and Bonferroni’s method with respect to the increase in . We can observe that Bonferroni correction strictly control FWER much under pre-specified significance level; 0.05. However, the resulting threshold level could be too strict, and overlook of the true itemset can occur. On the other hand, the proposed method have higher FWER thanks to Tarone’s correction. We can see the effect of improved correction in Figure 3. The statistical power of the proposed method was al-ways higher than Bonferroni’s method over the given range of . Figure 4 and ?? illustrates the efficiency of the proposed method gained by pruning the search tree. It is observed that the difference in efficiency be-comes larger with respect to the increase in the num-ber of transactions (Figure 4).

(4)

0 5e-05 0.0001 0.00015 0.0002 0.00025 0.0003 0.00035 0.0004 0 0.5 1 1.5 2 2.5 F W E R epsilon Proposed Bonferroni

Figure 2: Comparison of the proposed method with Bonferroni’s method in terms of FWER.

0 0.0005 0.001 0.0015 0.002 0.0025 0.003 0.0035 0.004 0 0.5 1 1.5 2 2.5 p o w e r epsilon Proposed Bonferroni

Figure 3: Comparison of the proposed method with Bonferroni’s method in terms of test power

5 Conclusion

This paper presented the ﬁrst approach to signiﬁcant itemset mining that (i) can handle continuous labels, (ii) corrects multiple testing problem, and (iii) retain high statistical power.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 25700004.

References

[1] C. E. Bonferroni. Teoria statistica delle classi e calcolo delle probabilit`a. Pubblicazioni del R

Isti-tuto Superiore di Scienze Economiche e Commer-ciali di Firenze, 8:3–62, 1936. 0 5 10 15 20 0 100 200 300 400 500 600 ti m e (s e c ) number of transactions Proposed Bonferroni

Figure 4: Comparison of the proposed method with Bonferroni’s method in terms of running time with respect to the increase in the number of transactions

[2] Felipe Llinares-Lopez, Dominik G. Grimm, Dean A. Bodenham, Udo Gieraths, Mahito Sugiyama, Beth Rowan, and Karsten Borgwardt. Genome-wide detection of intervals of genetic het-erogeneity associated with complex traits.

Bioin-formatics, 31(12):i240, 2015.

[3] Laetitia Papaxanthos, Felipe Llinares-Lopez, Dean Bodenham, and Karsten Borgwardt. Find-ing signiﬁcant combinations of features in the presence of categorical covariates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural

Infor-mation Processing Systems 29, pages 2279–2287.

Curran Associates, Inc., 2016.

[4] T. V. Perneger. What is wrong with bon-ferroni adjustments. British Medical Journal,

316(7139):1236–1238, 1998.

[5] M. Sugiyama, C.-A. Azencott, D. G. Grimm, Y. Kawahara, and K. M. Borgwardt. Multi-task feature selection on multiple networks via max-imum ﬂows. In Proceedings of the 2014 SIAM

International Conference on Data Mining, pages

199–207, Philadelphia, Pennsylvania, USA, April 2014.

[6] K. Takahashi and S. Yamanaka. Induction of pluripotent stem cells from mouse embryonic and adult ﬁbroblast cultures by deﬁned factors. Cell, 126(4):663–76, 2006.

[7] R. E. Tarone. A modiﬁed bonferroni method for discrete data. Biometrics, 46(2):515–522, 1990. [8] A. Terada, M. Okada-Hatakeyama, K. Tsuda, and

J. Sese. Statistical signiﬁcance of combinatorial regulations. PNAS, 110(32):12996–13001, 2013. [9] L. Wasserman. All of tatistics: a concise course

in statistical inference. Springer, New York, U.S.,

2010.