Application of an Information Loss Index to Non-numerical Data

IPSJ SIG Technical Report, Vol.2017-MPS-113 No.10, Vol.2017-BIO-50 No.10, 2017/6/23

Application of an Information Loss Index to Non-numerical Data

Hiroko Akiyama 1,a)  Masaaki Wada 2

Abstract: We define a new information loss index, ILD, as the ratio of the amount of information lost by microaggregation to the amount of information in the original data. Because ILD is based on the distance between data, it applies not only to numerical data but also to data such as character strings, provided a suitable distance is chosen. In this paper, we apply ILD to non-numerical datasets and discuss the choice of distance. For microaggregation of numerical datasets that uses group averages as representatives, ILD coincides with the information loss index based on the sum of squared differences between the data and the average. In this sense, ILD is a natural extension of that information loss index.

Keywords: Information Loss, Microaggregation, Anonymization

1 Nagano College of Technology
2 Osaka University
a) h [email protected]

ⓒ 2017 Information Processing Society of Japan

1.
This report introduces ILD (Information Loss based on Distance) for microaggregation [1] [2] and relates it to ILSSDM (Information Loss based on Sum of Square Difference from the Mean).

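[3] [4] [5] [6] [7] [8], [9], [10]

For numerical datasets microaggregated with group averages as representatives, ILD = ILSSDM (Section 4).

2. ILD

2.1
Let $X$ be a set with a distance $d(x, y)$ defined for $x, y \in X$. Two examples of such a distance:

• $X = \mathbb{R}^n$ with the Euclidean distance, for $x, y \in \mathbb{R}^n$:
\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
• Any set $X$ with the discrete distance, for $x, y \in X$:
\[ d(x, y) = \begin{cases} 1 & (x \neq y) \\ 0 & (x = y) \end{cases} \]

For a dataset $A = (x_1, \dots, x_N)$, $x_i \in X$ $(i = 1, \dots, N)$, the amount of information $I(A)$ is
\[ I(A) = \sum_{i=1}^{N} \sum_{j=1}^{N} d(x_i, x_j)^2 \]

2.2 ILD
Let the dataset
\[ A = (x_{11}, \dots, x_{1n_1}, x_{21}, \dots, x_{2n_2}, \dots, x_{m1}, \dots, x_{mn_m}) \]
be divided into groups $A_1, \dots, A_m$ with $A_k = (x_{k1}, \dots, x_{kn_k})$ $(k = 1, \dots, m)$, and let $\hat{x}_k$ be the representative of $A_k$. Replacing every datum by the representative of its group gives
\[ \hat{A} = (\hat{x}_1, \dots, \hat{x}_1, \hat{x}_2, \dots, \hat{x}_2, \dots, \hat{x}_m, \dots, \hat{x}_m) \]
ILD is defined by
\[ \mathrm{ILD} = \frac{I(A) - I(\hat{A})}{I(A)} \]
If $I(A) > I(\hat{A})$, then $0 < \mathrm{ILD} \le 1$.

Example 2.1
Let $D = (1, 2, 3, 4)$ with groups $D_1 = (1, 2)$ and $D_2 = (3, 4)$. Using the group averages as representatives gives $\hat{D} = (1.5, 1.5, 3.5, 3.5)$. Then $I(D) = 40$ and $I(\hat{D}) = 32$, so
\[ \mathrm{ILD} = \frac{40 - 32}{40} = 0.2 \]

Example 2.2
Let $S = (a_{11}, a_{12}, a_{21}, a_{22})$ with groups $S_1 = (a_{11}, a_{12})$ and $S_2 = (a_{21}, a_{22})$, represented by $a_1$ and $a_2$ respectively, so that $\hat{S} = (a_1, a_1, a_2, a_2)$. Then $I(S) = 144$ and $I(\hat{S}) = 32$, so
\[ \mathrm{ILD} = \frac{144 - 32}{144} = 0.778 \]
If all of $S$ is represented by a single element $a$, then $\hat{\hat{S}} = (a, a, a, a)$ and $I(\hat{\hat{S}}) = 0$, so
\[ \mathrm{ILD} = \frac{144 - 0}{144} = 1 \]

3. ILD

3.1
The dataset $D$ used in this section has 8 elements.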
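The definitions of $I(A)$ and ILD in Section 2 can be sketched in Python as follows. The names `information`, `ild`, `abs_diff`, and `tree_dist` are illustrative and do not come from the paper. The distance used in Example 2.2 is not stated in the surviving text; this sketch assumes it is the path length (number of edges) in a hypothetical two-level hierarchy with root a, children a1 and a2, and leaves a11, a12, a21, a22. Under that assumption the sketch reproduces the values 144, 32, and 0 given in the example.

```python
from itertools import product


def information(data, dist):
    """I(A): sum of squared distances over all ordered pairs of data."""
    return sum(dist(x, y) ** 2 for x, y in product(data, repeat=2))


def ild(original, aggregated, dist):
    """ILD = (I(A) - I(A_hat)) / I(A)."""
    total = information(original, dist)
    return (total - information(aggregated, dist)) / total


def abs_diff(x, y):
    """Euclidean distance on the real line."""
    return abs(x - y)


# Hypothetical hierarchy (an assumption): child -> parent, "a" is the root.
PARENT = {"a11": "a1", "a12": "a1", "a21": "a2", "a22": "a2",
          "a1": "a", "a2": "a"}


def ancestors(node):
    """Return the chain node, parent, ..., root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def tree_dist(x, y):
    """Number of edges on the path between x and y in the hierarchy."""
    up_x, up_y = ancestors(x), ancestors(y)
    common = next(n for n in up_x if n in up_y)  # lowest common ancestor
    return up_x.index(common) + up_y.index(common)


# Example 2.1: numerical data, group averages as representatives.
D = [1, 2, 3, 4]
D_hat = [1.5, 1.5, 3.5, 3.5]
print(information(D, abs_diff), information(D_hat, abs_diff))

# Example 2.2 under the assumed hierarchy distance.
S = ["a11", "a12", "a21", "a22"]
S_hat = ["a1", "a1", "a2", "a2"]
S_hat_hat = ["a", "a", "a", "a"]
print(round(ild(S, S_hat, tree_dist), 3), ild(S, S_hat_hat, tree_dist))
```

Because `ild` takes the distance as a parameter, the same function covers numerical and non-numerical data; only `dist` changes.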

Let $\hat{D}$ denote the microaggregated version of $D$. ILD is computed for three choices of distance.

(1) Discrete distance:
\[ I_{\mathrm{discr}}(D) = 56, \quad I_{\mathrm{discr}}(\hat{D}) = 48, \quad \mathrm{ILD}_{\mathrm{discr}} = 0.14285 \]

(2)
\[ I_t(D) = 1440, \quad I_t(\hat{D}) = 576, \quad \mathrm{ILD}_t = 0.6 \]

(3)
\[ I_g(D) = 648, \quad I_g(\hat{D}) = 640, \quad \mathrm{ILD}_g = 0.01234 \]

Case    I(D)    I(D^)   ILD
(1)     56      48      0.14285
(2)     1440    576     0.60000
(3)     648     640     0.01234

3.2 ILD
On $\mathbb{R}^n$, for $1 \le p \le \infty$, the distance $d_p$ is
\[ d_p(x, y) = \begin{cases} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} & (1 \le p < \infty) \\ \max_{i=1,\dots,n} |x_i - y_i| & (p = \infty) \end{cases} \]
For $p = 2$, $d_2(x, y)$ is the Euclidean distance of Section 2.1.

The values of $I(D)$, $I(\hat{D})$, and ILD reported in this subsection are:

I(D)    I(D^)   ILD
56      48      0.14285
272     160     0.41176
168     160     0.04762
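The discrete distance and the $d_p$ family can be sketched as below. The elements of $D$ are lost from the surviving text, so the labels `v1` to `v8` and the representatives `g1` to `g4` are hypothetical placeholders: eight distinct values grouped into four pairs is one configuration that is consistent with the reported $I_{\mathrm{discr}}(D) = 56$ and $I_{\mathrm{discr}}(\hat{D}) = 48$, not necessarily the authors' dataset. The function names are likewise illustrative.

```python
import math
from itertools import product


def information(data, dist):
    """I(A): sum of squared distances over all ordered pairs of data."""
    return sum(dist(x, y) ** 2 for x, y in product(data, repeat=2))


def discrete(x, y):
    """Discrete distance: 0 for equal elements, 1 otherwise."""
    return 0 if x == y else 1


def lp_dist(x, y, p):
    """d_p on R^n for 1 <= p <= infinity (pass math.inf for the max norm)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Hypothetical placeholder data: 8 distinct values in 4 groups of 2.
D = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"]
D_hat = ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"]

i_d = information(D, discrete)
i_d_hat = information(D_hat, discrete)
print(i_d, i_d_hat, (i_d - i_d_hat) / i_d)

# The same pair of points under three values of p.
x, y = (0.0, 0.0), (3.0, 4.0)
print(lp_dist(x, y, 1), lp_dist(x, y, 2), lp_dist(x, y, math.inf))
```

Under the discrete distance only the number of distinct values and the group sizes matter, which is why placeholder labels suffice to check the arithmetic.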

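4. ILD and ILSSDM

4.1 ILSSDM
Let a dataset $X \subset \mathbb{R}^n$ of $N$ data be divided into groups $X_1, \dots, X_m$, where $X_i$ consists of $n_i$ data $x_{ij}$ $(1 \le j \le n_i)$. Let $\bar{x}$ be the mean of $X$ and $\bar{x}_i$ the mean of $X_i$. Define
\[ \mathrm{SSE} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} |x_{ij} - \bar{x}_i|^2 \]
\[ \mathrm{SSA} = \sum_{i=1}^{m} n_i |\bar{x}_i - \bar{x}|^2 \]
\[ \mathrm{SST} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} |x_{ij} - \bar{x}|^2 = \mathrm{SSE} + \mathrm{SSA} \]
where SSE is the Sum of Squared Errors. ILSSDM is
\[ \mathrm{ILSSDM} = \frac{\mathrm{SSE}}{\mathrm{SST}} = \frac{\mathrm{SST} - \mathrm{SSA}}{\mathrm{SST}} \]
Let $\mathrm{var}(X_i)$ and $\mathrm{var}(X)$ be the variances of $X_i$ and $X$ (dividing by $n_i$ and $N$ respectively), so that $\mathrm{SSE} = \sum_i n_i \mathrm{var}(X_i)$ and $\mathrm{SST} = N \mathrm{var}(X)$. Then
\[ \mathrm{ILSSDM} = \frac{\sum_{i=1}^{m} n_i \mathrm{var}(X_i)}{N \mathrm{var}(X)} \qquad (1) \]

4.2
For $X \subset \mathbb{R}^n$ with the Euclidean distance, the amount of information of Section 2.1 is
\[ I(X) = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{x \in X_i} \sum_{y \in X_j} |x - y|^2 \]
\[ = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{x \in X_i} \sum_{y \in X_j} |(x - \bar{x}_i) + (\bar{x}_i - \bar{x}_j) + (\bar{x}_j - y)|^2 \]
\[ = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{x \in X_i} \sum_{y \in X_j} \left( |x - \bar{x}_i|^2 + |\bar{x}_i - \bar{x}_j|^2 + |\bar{x}_j - y|^2 \right) \]
\[ = \sum_{i=1}^{m} \sum_{j=1}^{m} n_i n_j \left( \mathrm{var}(X_i) + |\bar{x}_i - \bar{x}_j|^2 + \mathrm{var}(X_j) \right) \qquad (2) \]
The cross terms vanish because $\sum_{x \in X_i} (x - \bar{x}_i) = 0$ and $\sum_{y \in X_j} (\bar{x}_j - y) = 0$. Applying the same expansion to $X$ as a single group gives
\[ I(X) = \sum_{x \in X} \sum_{y \in X} |x - y|^2 = \sum_{x \in X} \sum_{y \in X} |(x - \bar{x}) + (\bar{x} - y)|^2 = \sum_{x \in X} \sum_{y \in X} \left( |x - \bar{x}|^2 + |\bar{x} - y|^2 \right) = 2N^2 \mathrm{var}(X) \qquad (3) \]

4.3 ILD
Let $X'$ be the dataset obtained from $X$ by replacing each datum of $X_i$ with the mean $\bar{x}_i$. Then
\[ I(X') = \sum_{i=1}^{m} \sum_{j=1}^{m} n_i n_j |\bar{x}_i - \bar{x}_j|^2 \qquad (4) \]
From (2) and (4),
\[ I(X) - I(X') = \sum_{i=1}^{m} \sum_{j=1}^{m} n_i n_j \left( \mathrm{var}(X_i) + \mathrm{var}(X_j) \right) = 2 \sum_{i=1}^{m} \sum_{j=1}^{m} n_i n_j \mathrm{var}(X_i) = 2N \sum_{i=1}^{m} n_i \mathrm{var}(X_i) \qquad (5) \]
From (3) and (5),
\[ \mathrm{ILD} = \frac{I(X) - I(X')}{I(X)} = \frac{\sum_{i=1}^{m} n_i \mathrm{var}(X_i)}{N \mathrm{var}(X)} \qquad (6) \]
By (1), the right-hand side of (6) is ILSSDM, so ILD = ILSSDM.

4.4 ILD
Computing ILD directly from its definition for $N$ data takes $O(N^2)$ distance evaluations, whereas ILSSDM can be computed in $O(N)$.

5.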
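The identity ILD = ILSSDM in (6) can be checked numerically with a small sketch. It assumes one-dimensional data and group means as representatives; the function names `ilssdm` and `ild_by_pairs` are illustrative, not from the paper. The first function makes a single pass over the data, the second sums over all ordered pairs, which mirrors the $O(N)$ versus $O(N^2)$ contrast of Section 4.4.

```python
def mean(values):
    return sum(values) / len(values)


def ilssdm(groups):
    """SSE / SST, computed in time linear in the number of data."""
    flat = [x for g in groups for x in g]
    grand_mean = mean(flat)
    sst = sum((x - grand_mean) ** 2 for x in flat)
    sse = 0.0
    for g in groups:
        group_mean = mean(g)
        sse += sum((x - group_mean) ** 2 for x in g)
    return sse / sst


def ild_by_pairs(groups):
    """(I(X) - I(X')) / I(X), summing squared differences over all pairs."""
    original = [x for g in groups for x in g]
    # X': every datum replaced by the mean of its group.
    aggregated = [mean(g) for g in groups for _ in g]

    def info(data):
        return sum((x - y) ** 2 for x in data for y in data)

    total = info(original)
    return (total - info(aggregated)) / total


# Example 2.1 of the text, plus a case with unequal group sizes.
for groups in ([[1, 2], [3, 4]], [[1.0, 2.0, 6.0], [10.0, 14.0]]):
    print(ilssdm(groups), ild_by_pairs(groups))
```

For the grouping of Example 2.1 both functions give 0.2 up to floating-point rounding, matching the value computed from $I(D) = 40$ and $I(\hat{D}) = 32$.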

References

[1] Josep Domingo-Ferrer and Vicenç Torra. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, Vol. 11, No. 2, pp. 195–212, 2005.
[2] Josep Domingo-Ferrer, Antoni Martínez-Ballesté, Josep Maria Mateo-Sanz, and Francesc Sebé. Efficient multivariate data-oriented microaggregation. The VLDB Journal: The International Journal on Very Large Data Bases, Vol. 15, No. 4, pp. 355–369, 2006.
[3] Anthony W. F. Edwards and L. L. Cavalli-Sforza. A method for cluster analysis. Biometrics, pp. 362–375, 1965.
[4] A. D. Gordon and J. T. Henderson. An algorithm for Euclidean sum of squares classification. Biometrics, pp. 355–362, 1977.
[5] Pierre Hansen, Brigitte Jaumard, and Nenad Mladenovic. Minimum sum of squares clustering in a low dimensional space. Journal of Classification, Vol. 15, No. 1, pp. 37–55, 1998.
[6] James MacQueen, et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297. Oakland, CA, USA, 1967.
[7] Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, Vol. 58, No. 301, pp. 236–244, 1963.
[8] Agusti Solanas, Antoni Martinez-Balleste, and J. Domingo-Ferrer. V-MDAV: a multivariate microaggregation with variable group size. In 17th COMPSTAT Symposium of the IASC, Rome, pp. 917–925, 2006.
[9] Anna Oganian and Josep Domingo-Ferrer. On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, Vol. 18, No. 4, pp. 345–353, 2001.
[10] Josep Domingo-Ferrer and Josep Maria Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, pp. 189–201, 2001.
