Categorical Data with Missing Values

5.3 Preliminaries and problem statement

Table 5.2: Table of notation

Symbol    Description
k         number of clusters
x_i       mixed object with index i
x^r_i     numeric object with index i
x^c_i     categorical object with index i
X_j       random variable
D_mix     mixed data set
A_j       j^th attribute of D_mix
A^r_j     j^th numeric attribute of D_mix
A^c_j     j^th categorical attribute of D_mix
C_l       l^th cluster
Z_l       center of cluster C_l
x_ij      value appearing at the i^th element and j^th attribute of D_mix
o^l_ij    numeric value appearing at the i^th element and j^th attribute of cluster C_l
o^l_ij    categorical value appearing at the i^th element and j^th attribute of cluster C_l
O_j       set of categorical values appearing at the j^th attribute of D_mix
O_j^l     set of categorical values appearing at the j^th attribute of cluster C_l
z_j^rl    value of the j^th numeric attribute in the center Z_l
z_j^cl    value of the j^th categorical attribute in the center Z_l

Table 5.3: A mixed numeric and categorical data set with missing values

Object   A₁   A₂   A₃   A₄   A₅   A₆
x₁       d    b    f    e    7    14
x₂       b    b    c    e    6    13
x₃       b    b    c    e    2    13
x₄       a    b    a    b    5    13
x₅       c    a    f    d    2    14
x₆       d    b    f    ?    ?    14
x₇       a    a    c    e    1    14
x₈       b    a    ?    ?    ?    12
x₉       c    b    a    c    5    14
x₁₀      b    ?    d    e    5    7
x₁₁      d    a    d    c    10   13
x₁₂      d    b    d    d    3    12
x₁₃      ?    b    a    e    2    ?
x₁₄      a    a    d    d    2    18
x₁₅      b    a    f    e    2    14

D_mix: C₁ = {x₁, x₂, x₃, x₅, x₇, x₁₅}, C₂ = {x₄, x₆, x₈, x₉, x₁₀, x₁₁, x₁₂, x₁₃, x₁₄}, C₃ = {x₁, x₂, x₃}. Then {C₁, C₂} are clusters of D_mix, while {C₁, C₃}, {C₂, C₃} and {C₁, C₂, C₃} are not.
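The example above can be checked mechanically. The following is a minimal sketch, assuming that a valid set of clusters must form a hard partition of D_mix (non-empty, pairwise disjoint, jointly covering every object); the function name `is_clustering` and the use of integer object indices are illustrative choices, not notation from the text.

```python
def is_clustering(clusters, data):
    """True when `clusters` are non-empty, pairwise disjoint, and cover `data`."""
    members = [x for c in clusters for x in c]
    return (all(len(c) > 0 for c in clusters)
            and len(members) == len(set(members))   # no object appears twice
            and set(members) == set(data))          # every object is covered

# Objects x1..x15 of Table 5.3, identified by their indices.
data = set(range(1, 16))
C1 = {1, 2, 3, 5, 7, 15}
C2 = {4, 6, 8, 9, 10, 11, 12, 13, 14}
C3 = {1, 2, 3}

for name, family in [("{C1, C2}", [C1, C2]), ("{C1, C3}", [C1, C3]),
                     ("{C2, C3}", [C2, C3]), ("{C1, C2, C3}", [C1, C2, C3])]:
    print(name, is_clustering(family, data))
```

Under this assumption only {C₁, C₂} passes: {C₁, C₃} and {C₁, C₂, C₃} overlap, and {C₂, C₃} leaves x₅, x₇ and x₁₅ unassigned.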

Definition 14 (Relative frequency of a categorical value) Let there be a cluster C_l and a categorical value o^l_ij appearing in C_l at the j^th categorical attribute. The relative frequency of o^l_ij in C_l is denoted and defined as:

$$f_{C_l}(o^l_{ij}) = \frac{\#_l(o^l_{ij})}{n_l} \tag{5.2}$$

where #_l(o^l_ij) is the number of occurrences of the categorical value o^l_ij in the cluster C_l at the j^th attribute, and n_l is the number of objects in C_l.

The relative frequency of o^l_ij in the data set D_mix at the j^th categorical attribute is denoted and defined as:

$$f(o^l_{ij}) = \frac{\#(o^l_{ij})}{|D_{mix}|} \tag{5.3}$$

where #(o^l_ij) is the number of occurrences of the categorical value o^l_ij in the data set D_mix at the j^th attribute.

Example 7 In the data set shown in Table 5.3, assume that cluster C₁ = {x₁, x₂, x₃, x₅, x₇, x₁₅}. Then the relative frequency of the categorical value “b” in the attribute A₁ is f_C₁(b) = 3/6 = 0.5.
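The two relative frequencies of Definition 14 can be reproduced with a few lines of code. This is a minimal sketch: the helper `relative_frequency` is a hypothetical name, and the data-set-level frequency follows a literal reading of Eq. (5.3), dividing by |D_mix| = 15 including the object whose A₁ value is missing.

```python
from collections import Counter

# Attribute A1 of Table 5.3 in object order x1..x15 ("?" marks a missing value).
A1 = ["d", "b", "b", "a", "c", "d", "a", "b", "c", "b", "d", "d", "?", "a", "b"]

def relative_frequency(values, target):
    """Share of entries in `values` equal to `target` (Eqs. 5.2 and 5.3)."""
    return Counter(values)[target] / len(values)

# Cluster C1 = {x1, x2, x3, x5, x7, x15}, given as 1-based object indices.
c1 = [1, 2, 3, 5, 7, 15]
a1_in_c1 = [A1[i - 1] for i in c1]

f_c1_b = relative_frequency(a1_in_c1, "b")   # cluster-level frequency, Eq. (5.2)
f_all_b = relative_frequency(A1, "b")        # data-set-level frequency, Eq. (5.3)
print(f_c1_b, round(f_all_b, 4))
```

The cluster-level value matches the 0.5 of Example 7; the data-set-level value is 5/15, since “b” occurs in x₂, x₃, x₈, x₁₀ and x₁₅.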

In this chapter, to represent the center of a cluster, we used the mean for numeric attributes and a variation of Aitchison & Aitken’s kernel function [8] to estimate the probability density function of each categorical attribute in the center.

Definition 15 (The mean of numeric attributes in a cluster) Let there be a cluster C_l that contains p numeric attributes A^r = {A^r₁, A^r₂, . . ., A^r_p}, and let Z_l be the center of C_l. The mean of each attribute A^r_j (1 ≤ j ≤ p, p < m) in the cluster C_l is denoted and defined as:

$$z^{rl}_j = \frac{1}{n_l}\sum_{i=1}^{n_l} o^l_{ij} \tag{5.4}$$

Example 8 In the data set shown in Table 5.3, assume that cluster C₁ = {x₁, x₂, x₃, x₅, x₇, x₁₅}. Then the mean value of the attribute A₆ is z₆^r1 = (14 + 13 + 13 + 14 + 14 + 14)/6 = 13.67.

Recall that a density estimator is an algorithm which takes a d-dimensional data set and produces an estimate of the d-dimensional probability distribution which that data is drawn from [114]. Kernel density estimation (KDE) is a method of estimating the probability distribution of a random variable based on a random sample.

Definition 16 (Center of cluster) Let there be a cluster C_l = {x₁, x₂, . . ., x_n_l} where x_i = (x_i1, x_i2, . . ., x_im), m = |A|. The center of C_l is then defined as:

$$Z_l = \{z^l_1, z^l_2, \ldots, z^l_m\} \tag{5.5}$$

where the j^th attribute z^l_j (1 ≤ j ≤ m) is either z_j^rl or z_j^cl. Specifically, if the j^th attribute of Z_l is a numeric attribute, then its representative is calculated using Definition 15 (Eq. 5.4). Otherwise, the representative of a categorical attribute of Z_l is the probability distribution on O_j^l estimated by the kernel density estimation method using Eq. (2.4), which is defined as:

$$z^{cl}_j = \left[P^l_j(o^l_{1j}),\, P^l_j(o^l_{2j}),\, \ldots,\, P^l_j(o^l_{|O^l_j|\,j})\right] \tag{5.6}$$

where the probabilistic value of a categorical value o^l_ij (1 ≤ i ≤ n_l) can be estimated as:

$$P^l_j(o^l_{ij}) = \begin{cases} \lambda_j \dfrac{1}{|O^l_j|} + (1-\lambda_j)\, f_{C_l}(o^l_{ij}) & \text{if } o^l_{ij} \in O^l_j \\ 0 & \text{otherwise} \end{cases} \tag{5.7}$$

Example 9 In the data set shown in Table 5.3, assume that cluster C₁ = {x₁, x₂, x₃, x₅, x₇, x₁₅}. Then the center Z₁ of C₁ at the categorical attributes A₁, A₂, A₃, A₄ and numeric attributes A₅, A₆ is {“d”: 0.2069, “b”: 0.3793, “c”: 0.2069, “a”: 0.2069}, {“b”: 0.5, “a”: 0.5}, {“f”: 0.5, “c”: 0.5}, {“e”: 0.6724, “d”: 0.3275}, 3.5556, 14.1111, respectively.
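The categorical part of a center (Eq. 5.7) can be sketched as follows. The function name `categorical_center` is hypothetical, and the smoothing parameter is an assumption: the value λ = 14/29 is not stated in the text and is used here only because it happens to reproduce the rounded A₁ probabilities quoted in Example 9.

```python
from collections import Counter

def categorical_center(values, lam):
    """Eq. (5.7): smoothed distribution over the values observed in a cluster.

    `values` are the entries of one categorical attribute inside the cluster;
    `lam` is the smoothing parameter lambda_j (assumed to lie in [0, 1]).
    Values that do not occur in the cluster implicitly get probability 0.
    """
    counts = Counter(values)
    n, distinct = len(values), len(counts)
    return {v: lam / distinct + (1 - lam) * c / n for v, c in counts.items()}

# Attribute A1 restricted to C1 = {x1, x2, x3, x5, x7, x15} from Table 5.3.
a1_in_c1 = ["d", "b", "b", "c", "a", "b"]

# ASSUMPTION: lambda = 14/29 is back-solved from Example 9, not given in the text.
center_a1 = categorical_center(a1_in_c1, lam=14 / 29)
print({v: round(p, 4) for v, p in center_a1.items()})
```

Because the uniform term and the frequency term each sum to their weights, the resulting probabilities always sum to one regardless of the chosen λ.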

Previous studies have used several methods to quantify the dissimilarity between a mixed object and its center [60, 66, 80]. In particular, distance measures such as the Euclidean, Manhattan, Minkowski, and Mahalanobis distances [43] can be applied to numeric attributes, while the simple matching dissimilarity measure [58–60,99], the Euclidean norm [20] and the information-theoretic based dissimilarity measure [92] can be applied to categorical attributes. In this chapter, we used the squared Euclidean distance and the information-theoretic based dissimilarity measure to quantify the dissimilarity for numeric and categorical attributes in mixed objects, respectively. The information-theoretic definition of similarity [81] is applicable to domains that have probabilistic models.

Definition 17 (Dissimilarity measure for categorical attributes) Let there be two categorical values o^l_ij and o^l_i′j at the j^th attribute. The similarity between them is defined as:

$$\mathrm{sim}_j(o^l_{ij}, o^l_{i'j}) = \frac{2\log f(o^l_{ij}, o^l_{i'j})}{\log f(o^l_{ij}) + \log f(o^l_{i'j})} \tag{5.8}$$

where f(o^l_ij, o^l_i′j) = #(o^l_ij, o^l_i′j) / |D_mix|, with #(o^l_ij, o^l_i′j) being the number of mixed objects in the data set D_mix that take a categorical value belonging to {o^l_ij, o^l_i′j} at the j^th attribute, while f(o^l_ij) is given by Eq. (5.3).

The dissimilarity measure between two categorical values o^l_ij and o^l_i′j at the j^th attribute can be defined as:

$$\mathrm{dsim}_j(o^l_{ij}, o^l_{i'j}) = 1 - \mathrm{sim}_j(o^l_{ij}, o^l_{i'j}) = 1 - \frac{2\log f(o^l_{ij}, o^l_{i'j})}{\log f(o^l_{ij}) + \log f(o^l_{i'j})} \tag{5.9}$$

Example 10 In the data set shown in Table 5.3, we omit the four incomplete objects from the data set. The dissimilarity of the categorical values “d” and “b” in the attribute A₁ is

$$\mathrm{dsim}_1(\text{“d”}, \text{“b”}) = 1 - \frac{2 \times \log(6/11)}{\log(3/11) + \log(3/11)} = 0.5335.$$
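Example 10 can be verified numerically. The sketch below assumes natural logarithms (the ratio in Eq. 5.9 does not depend on the base) and returns 0 for identical values, which is what the formula gives whenever it is defined; the function name `dsim` and the list `A1_complete` are illustrative.

```python
import math
from collections import Counter

def dsim(column, a, b):
    """Eq. (5.9): 1 - 2*log f(a, b) / (log f(a) + log f(b)).

    f(a, b) is the share of objects whose value is a or b (Definition 17).
    Both values are assumed to occur in `column`.
    """
    if a == b:
        return 0.0  # sim(a, a) = 1 whenever the ratio is defined
    n, counts = len(column), Counter(column)
    f_ab = (counts[a] + counts[b]) / n
    return 1 - 2 * math.log(f_ab) / (math.log(counts[a] / n) + math.log(counts[b] / n))

# A1 over the eleven complete objects x1, x2, x3, x4, x5, x7, x9, x11, x12, x14, x15.
A1_complete = ["d", "b", "b", "a", "c", "a", "c", "d", "d", "a", "b"]

d_db = dsim(A1_complete, "d", "b")
print(round(d_db, 4))
```

Here “d” and “b” each occur three times among the eleven complete objects, so f(d) = f(b) = 3/11 and f(d, b) = 6/11, as in the worked example.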

Definition 18 (Dissimilarity between an object and a cluster) Let there be a cluster C_l and a mixed object x_i = (x_i1, x_i2, . . ., x_im). The dissimilarity between x_i and the center Z_l = {z₁^l, z₂^l, . . ., z_m^l} at the j^th attribute is defined as:

$$\mathrm{dis}_j(x_i, Z_l) = \begin{cases} (x_{ij} - z^{rl}_j)^2 & \text{if } A_j \text{ is a numeric attribute} \\ \sum_{o^l_{ij} \in O^l_j} P^l_j(o^l_{ij})\, \mathrm{dsim}_j(x_{ij}, o^l_{ij}) & \text{if } A_j \text{ is a categorical attribute} \end{cases} \tag{5.10}$$

Specifically, the dissimilarity between x_i and Z_l at the j^th attribute is measured based on the type of this attribute. For numeric attributes, the squared Euclidean distance is used to quantify the distance between the cluster mean and the numeric value of the mixed object. For categorical attributes, the proximity is measured by accumulating, over the probability distribution on O_j^l, the dissimilarity between the j^th component x_ij of the mixed object and each categorical value in O_j^l. The dissimilarity between the mixed object x_i and the cluster center Z_l is defined as follows:

$$\mathrm{dis}(x_i, Z_l) = \sum_{j=1}^{m} \mathrm{dis}_j(x_i, Z_l) \tag{5.11}$$

Example 11 In the data set shown in Table 5.3, assume that cluster C₁ = {x₁, x₂, x₃, x₅, x₇, x₁₅} and its center Z₁ is {{“d”: 0.2069, “b”: 0.3793, “c”: 0.2069, “a”: 0.2069}, {“b”: 0.5, “a”: 0.5}, {“f”: 0.5, “c”: 0.5}, {“e”: 0.6724, “d”: 0.3275}, 3.5556, 14.1111}. The dissimilarity between x₁ = {d, b, f, e, 7, 14} and Z₁ is dis(x₁, Z₁) = 0.3818 + 0.5 + 0.3155 + 0.2489 + 11.8642 + 0.0123 = 13.3227.
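The per-attribute rule of Eq. (5.10) and the sum of Eq. (5.11) can be sketched as below. For brevity the sketch evaluates only three of the six terms of Example 11 (A₂, A₅ and A₆). Taking the A₂ frequencies from the eleven complete objects of Table 5.3 is an assumption of this sketch, and the names `dis_j`, `dis` and `dsim` are illustrative.

```python
import math
from collections import Counter

def dsim(column, a, b):
    """Eq. (5.9) with frequencies taken from `column`; a and b must occur in it."""
    if a == b:
        return 0.0
    n, counts = len(column), Counter(column)
    f_ab = (counts[a] + counts[b]) / n
    return 1 - 2 * math.log(f_ab) / (math.log(counts[a] / n) + math.log(counts[b] / n))

def dis_j(x, z, column=None):
    """Eq. (5.10): z is a float (numeric mean) or a dict value -> probability."""
    if isinstance(z, dict):
        # Categorical branch: expected dsim under the center's distribution.
        return sum(p * dsim(column, x, o) for o, p in z.items())
    return (x - z) ** 2  # numeric branch: squared Euclidean distance

def dis(obj, center, columns):
    """Eq. (5.11): sum of the per-attribute dissimilarities."""
    return sum(dis_j(x, z, col) for x, z, col in zip(obj, center, columns))

# ASSUMPTION: A2 frequencies come from the complete objects
# x1, x2, x3, x4, x5, x7, x9, x11, x12, x14, x15 of Table 5.3.
A2 = ["b", "b", "b", "b", "a", "a", "b", "a", "b", "a", "a"]

x1_part = ("b", 7, 14)                              # A2, A5, A6 of x1
z1_part = ({"b": 0.5, "a": 0.5}, 3.5556, 14.1111)   # matching entries of Z1
terms = [dis_j(x, z, A2) for x, z in zip(x1_part, z1_part)]
total = dis(x1_part, z1_part, (A2, None, None))
print([round(t, 4) for t in terms], round(total, 4))
```

The A₂ term equals 0.5 because every complete object takes either “a” or “b” at A₂, so f(a, b) = 1 and dsim(“b”, “a”) = 1. The numeric terms agree with the values in Example 11 up to the rounding of the center entries.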

In the following, two measures for the imputation step are presented. The first is the IS measure (ISM), which evaluates the degree of association between two sets of categorical values in a data object. The ISM was first introduced by Tan et al. [107]. It is the product of two quantities, the interest factor and the support count, and it captures the correlations between values of different attributes in an object, as defined in Eq. (4.2).

Definition 19 (IS measure [24]) Let there be a set T that contains both complete and incomplete categorical objects. Let A′ = {A′₁, A′₂, . . ., A′_m′} and A″ = {A″₁, A″₂, . . ., A″_m″} (A′, A″ ⊂ A; m′, m″ < m) be two sets of categorical attributes that contain missing values and non-missing values in T, respectively. For all a′ = {a′₁, a′₂, . . ., a′_n} ∈ A′₁ × A′₂ × · · · × A′_m′ and a″ = {a″₁, a″₂, . . ., a″_n} ∈ A″₁ × A″₂ × · · · × A″_m″, the IS measure between a′ and a″ is defined as:

$$\mathrm{IS}(a', a'') = \frac{\mathrm{Support}(a', a'')}{\sqrt{\mathrm{Support}(a') \times \mathrm{Support}(a'')}} \tag{5.12}$$

where Support(a′, a″) = #(a′, a″) / |T| and #(a′, a″) is the number of mixed objects that contain both a′ and a″.
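Eq. (5.12) can be illustrated with a short sketch. The set T below is a hypothetical choice made for illustration: the categorical parts of x₂, x₃, x₅ and x₁₅ from Table 5.3. Representing a value combination as a dict from attribute position to value, and the names `support` and `is_measure`, are likewise assumptions of this sketch.

```python
import math

def support(objects, pattern):
    """Share of objects matching `pattern`, a dict {attribute index: value}."""
    hits = sum(all(o[j] == v for j, v in pattern.items()) for o in objects)
    return hits / len(objects)

def is_measure(objects, a_missing, a_present):
    """Eq. (5.12): Support(a', a'') / sqrt(Support(a') * Support(a''))."""
    joint = support(objects, {**a_missing, **a_present})
    return joint / math.sqrt(support(objects, a_missing) * support(objects, a_present))

# Illustrative T: categorical attributes (A1..A4) of x2, x3, x5, x15 in Table 5.3.
T = [("b", "b", "c", "e"), ("b", "b", "c", "e"),
     ("c", "a", "f", "d"), ("b", "a", "f", "e")]

# a'' = values on attributes A1, A2 (positions 0, 1); a' = values on A3, A4.
is_bb_ce = is_measure(T, {2: "c", 3: "e"}, {0: "b", 1: "b"})
is_ba_fe = is_measure(T, {2: "f", 3: "e"}, {0: "b", 1: "a"})
print(is_bb_ce, is_ba_fe)
```

The measure reaches 1 when the two value combinations always occur together in T, which is the case for both pairs evaluated here.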

The second measure is the missing-complete similarity measure (MCS). We extend the information-theoretic based similarity measure of Eq. (5.8) to make it applicable for measuring the proximity of complete and incomplete objects.

Definition 20 (Missing-complete similarity measure (MCS measure)) Let there be a set T that contains both complete and incomplete objects. Let there be two categorical values o^l_ij and o^l_i′j appearing in T at the j^th attribute. The similarity between them is defined as:

$$\mathrm{sim}^{mis}_j(o^l_{ij}, o^l_{i'j}) = \begin{cases} \dfrac{2\log f_T(o^l_{ij}, o^l_{i'j})}{\log f_T(o^l_{ij}) + \log f_T(o^l_{i'j})} & \text{if } o^l_{ij} \neq\, ? \text{ and } o^l_{i'j} \neq\, ? \\ 0 & \text{otherwise} \end{cases} \tag{5.13}$$

where f_T(o^l_ij) = #_T(o^l_ij) / n_T, #_T(o^l_ij) denotes the number of occurrences of o^l_ij in T, and n_T denotes the number of objects in T.

Let x_i = (x_i1, x_i2, . . ., x_im) and x_i′ = (x_i′1, x_i′2, . . ., x_i′m) be a complete mixed object and an incomplete mixed object, respectively. The MCS between x_i and x_i′ is then defined as follows:

$$\mathrm{MCS}(x_i, x_{i'}) = \sum_{j=1}^{m} \mathrm{sim}^{mis}_j(x_{ij}, x_{i'j}) \tag{5.14}$$

where the j^th attributes of x_i and x_i′ are categorical attributes.
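Eqs. (5.13) and (5.14) can be sketched as follows. The set T used here is an illustrative assumption: the categorical parts of x₂, x₃, x₅, x₁₅ and the incomplete object x₈ from Table 5.3. Returning 1 for two identical non-missing values is a convention of this sketch that coincides with the formula whenever the ratio is defined; `sim_mis` and `mcs` are hypothetical names.

```python
import math
from collections import Counter

MISSING = "?"

def sim_mis(column, a, b):
    """Eq. (5.13) for one attribute; `column` holds that attribute's values in T."""
    if a == MISSING or b == MISSING:
        return 0.0
    if a == b:
        return 1.0  # 2*log f / (log f + log f) reduces to 1
    n, counts = len(column), Counter(column)
    f_ab = (counts[a] + counts[b]) / n
    return 2 * math.log(f_ab) / (math.log(counts[a] / n) + math.log(counts[b] / n))

def mcs(x, y, T):
    """Eq. (5.14): sum of per-attribute similarities between objects x and y."""
    columns = list(zip(*T))
    return sum(sim_mis(col, a, b) for col, a, b in zip(columns, x, y))

# Categorical attributes (A1..A4) of selected objects in Table 5.3.
x2 = ("b", "b", "c", "e")
x3 = ("b", "b", "c", "e")
x5 = ("c", "a", "f", "d")
x15 = ("b", "a", "f", "e")
x8 = ("b", "a", "?", "?")   # incomplete object
T = [x2, x3, x5, x15, x8]

scores = {name: mcs(obj, x8, T) for name, obj in [("x2", x2), ("x5", x5), ("x15", x15)]}
print(scores)
```

With this T, x₁₅ agrees with x₈ on both observed attributes and scores 2, while x₂ and x₅ each agree on one attribute and score 1.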

Example 12 This example illustrates how to impute missing values in categorical attributes using the IS and MCS measures. The subset of complete categorical objects extracted from Table 5.3 is shown in Table 5.4.

Table 5.4: The complete categorical data set extracted from Table 5.3

Object   A^c₁  A^c₂  A^c₃  A^c₄
x^c₁     d     b     f     e
x^c₂     b     b     c     e
x^c₃     b     b     c     e
x^c₄     a     b     a     b
x^c₅     c     a     f     d
x^c₇     a     a     c     e
x^c₉     c     b     a     c
x^c₁₁    d     a     d     c
x^c₁₂    d     b     d     d
x^c₁₄    a     a     d     d
x^c₁₅    b     a     f     e

Assume that the incomplete categorical object x^c₆ = ⟨d, b, f, ?⟩ in Table 5.3 is chosen for imputation. A DT that uses the attribute A^c₄ as the class attribute is built for x^c₆ based on the data set in Table 5.4 (Fig. 5.3). The object x^c₆ is then assigned to leaf 7. The set of complete categorical objects that are correlated with x^c₆ is {x^c₁, x^c₂, x^c₃, x^c₇, x^c₁₅}. The set of categorical values in the complete attributes A^c₁, A^c₂, A^c₃ contains {d, b, f}, {b, b, c}, {a, a, c} and {b, a, f}. The set of categorical values in the incomplete attribute A^c₄ contains {e}. The only possible imputed value is e. Thus it is chosen for imputing the missing value in x^c₆, i.e., x^c₆ = ⟨d, b, f, e⟩.

[Figure 5.3: Tree for the missing attribute A^c₄ in x^c₆]

Next, assume that the incomplete categorical object x^c₈ = ⟨b, a, ?, ?⟩ in Table 5.3 is chosen for imputation. There are two missing values in x^c₈, at the attributes A^c₃ and A^c₄. Therefore, two DTs that use the attributes A^c₃ and A^c₄ as the class attributes, respectively, are built for x^c₈ based on the data set in Table 5.4 (Fig. 5.4). For the missing value in the attribute A^c₃, the object x^c₈ is assigned to leaf 8 of the tree in Fig. 5.4a. The set of complete categorical objects that are correlated with x^c₈ contains {x^c₅, x^c₁₅}. For the missing value in the attribute A^c₄, the object x^c₈ is assigned to leaf 1 of the tree in Fig. 5.4b. The set of complete categorical objects that are correlated with x^c₈ contains {x^c₂, x^c₃, x^c₁₅}. Because x^c₈ falls into multiple leaves, the objects from all these leaves are grouped in one collection.

[Figure 5.4: Trees for the incomplete categorical object x^c₈. (a) Tree for the missing attribute A^c₃ in x^c₈. (b) Tree for the missing attribute A^c₄ in x^c₈.]

Thus, the set of correlated objects is {x^c₂, x^c₃, x^c₅, x^c₁₅}. The set of categorical values in the complete attributes A^c₁, A^c₂ contains {b, b}, {c, a}, {b, a}, while the set of categorical values in the incomplete attributes A^c₃ and A^c₄ contains {c, e}, {f, d} and {f, e}. The IS and MCS measures for each pair of categorical values in the complete attributes and incomplete attributes are:

$$\mathrm{IS}(\{b,b\},\{c,e\}) = \frac{0.5}{\sqrt{0.5 \times 0.5}} = 1, \quad \mathrm{IS}(\{c,a\},\{f,d\}) = \frac{0.25}{\sqrt{0.25 \times 0.25}} = 1, \quad \mathrm{IS}(\{b,a\},\{f,e\}) = \frac{0.25}{\sqrt{0.25 \times 0.25}} = 1,$$

$$\mathrm{MCS}(x^c_2, x^c_8) = \mathrm{MCS}(x^c_3, x^c_8) = \frac{2\log 0.8}{\log 0.8 + \log 0.8} + \frac{2\log 1}{\log 0.4 + \log 0.6} + 0 + 0 = 1,$$

$$\mathrm{MCS}(x^c_5, x^c_8) = \frac{2\log 1}{\log 0.2 + \log 0.8} + \frac{2\log 0.6}{\log 0.6 + \log 0.6} + 0 + 0 = 1,$$

$$\mathrm{MCS}(x^c_{15}, x^c_8) = \frac{2\log 0.8}{\log 0.8 + \log 0.8} + \frac{2\log 0.6}{\log 0.6 + \log 0.6} + 0 + 0 = 2.$$

The affinity degree of each possible imputed value is calculated as the average of the IS and MCS measures for the corresponding pair of categorical values in the complete attributes and incomplete attributes. Thus, δ({c, e}) = (1 + 1)/2 = 1, δ({f, d}) = (1 + 1)/2 = 1, and δ({f, e}) = (1 + 2)/2 = 1.5. The actual imputed values are chosen by random sampling according to the affinity degrees. Specifically, the sampling probabilities of {c, e}, {f, d} and {f, e} are 0.2857, 0.2857 and 0.4286, respectively. Thus, {f, e} has the highest probability of being chosen as the actual imputed values for x^c₈.
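The last step of Example 12, turning IS and MCS values into sampling probabilities, can be sketched as follows. The IS and MCS values are copied from the example; normalising the affinity degrees so that they sum to one is the reading that reproduces the quoted probabilities. The variable names and the seeded random generator are assumptions made for illustration.

```python
import random

# IS and MCS values for each candidate pair (A3, A4), copied from Example 12.
is_values = {("c", "e"): 1.0, ("f", "d"): 1.0, ("f", "e"): 1.0}
mcs_values = {("c", "e"): 1.0, ("f", "d"): 1.0, ("f", "e"): 2.0}

# Affinity degree: average of the IS and MCS measures.
affinity = {c: (is_values[c] + mcs_values[c]) / 2 for c in is_values}

# Sampling probabilities: affinity degrees normalised to sum to one.
total = sum(affinity.values())
probabilities = {c: a / total for c, a in affinity.items()}

# Draw the imputed pair; the fixed seed is only for repeatability of the sketch.
rng = random.Random(0)
candidates = list(probabilities)
imputed = rng.choices(candidates, weights=[probabilities[c] for c in candidates], k=1)[0]

print({c: round(p, 4) for c, p in probabilities.items()}, imputed)
```

Sampling instead of always taking the highest-affinity candidate keeps some variability in the imputed values, at the cost of occasionally selecting a less strongly supported pair.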

Based on the above definitions, the problem of clustering mixed numeric and categorical data sets with missing values aims to minimize the following objective function:

$$F(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{i,l} \times \mathrm{dis}(x_i, Z_l) \tag{5.15}$$

subject to

$$\begin{cases} \sum_{l=1}^{k} u_{i,l} = 1, & 1 \le i \le n \\ u_{i,l} \in \{0, 1\}, & 1 \le l \le k,\ 1 \le i \le n \end{cases} \tag{5.16}$$

where U = [u_i,l]_{n×k} is the partition matrix, and u_i,l takes the value 1 if object x_i is in cluster C_l and 0 otherwise.
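The objective of Eq. (5.15) and the constraints of Eq. (5.16) can be sketched as below. The toy data (three one-dimensional numeric objects and two centers) and the names `objective` and `squared_distance` are hypothetical; any dissimilarity consistent with Eq. (5.11) could be passed in place of the squared distance.

```python
def objective(U, objects, centers, dis):
    """Eq. (5.15): sum of u[i][l] * dis(x_i, Z_l) over all objects and clusters."""
    for row in U:
        # Constraints (5.16): binary memberships, exactly one cluster per object.
        if any(u not in (0, 1) for u in row) or sum(row) != 1:
            raise ValueError("U is not a valid hard partition matrix")
    return sum(u * dis(x, z)
               for x, row in zip(objects, U)
               for u, z in zip(row, centers))

def squared_distance(x, z):
    """Stand-in dissimilarity for purely numeric, one-dimensional toy objects."""
    return (x - z) ** 2

# Hypothetical toy data, not taken from Table 5.3.
objects = [1.0, 2.0, 10.0]
centers = [1.5, 10.0]
U = [[1, 0], [1, 0], [0, 1]]   # rows are objects, columns are clusters

value = objective(U, objects, centers, squared_distance)
print(value)
```

Each row of U selects exactly one center, so the double sum reduces to the total dissimilarity between every object and the center of the cluster it is assigned to.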

Source document: Effective Clustering Algorithms for Categorical and Mixed Data, by DINH DUY TAI (Supervisor: Prof. HUYNH VAN NAM; Second Supervisor: Prof. TAKASHI HASHIMOTO), pages 88-96.