
the Document Classification

In document: Automated Occupation and Industry Coding System (pages 84-89)

KAZUKO TAKAHASHI (1), HIROYA TAKAMURA (2), and MANABU OKUMURA (2)

(1) KEIAI UNIVERSITY, FACULTY OF INTERNATIONAL STUDIES
(2) TOKYO INSTITUTE OF TECHNOLOGY, PRECISION AND INTELLIGENCE LABORATORY

PACIFIC-ASIA KNOWLEDGE DISCOVERY AND DATA MINING (PAKDD-07)


Table of Contents

1. Motivation
2. Proposed Method
   a. A Method Using an Accuracy Table
   b. A Method Applying a Logistic Regression
3. Experiments
4. Conclusions


Motivation

Estimating the probability that a sample belongs to its predicted class (the class membership probability) is useful in many applications, such as document classification.

Human decision making
e.g. the NANACO system displays outputs from the automatic system, together with class membership probabilities, as candidate occupational codes to help human annotators (coders) with occupation coding in social surveys.


The Occupation Coding

Occupation Data
- job task (open-ended)
- industry (open-ended)
- employment categories
- job title
- firm size

Occupational Code
- one of nearly 200

A Picture by the NANACO System

[Figure: a NANACO screen showing the occupation data of one respondent and the candidate occupational codes, each with its class membership probability.]

Occupation Data
- Job task: to arrange the delivery vehicles load and unload of luggage
- Industry
- Firm size: 8 (From 500 to 999)
- Employment & Job title: 2 & 1 (2: Regular employee, 1: No managerial post)
- Attribute (Education): 9 (Junior high school)

Candidates of Occupational Codes
- 563 a transportation clerk
- 685 transportation laborers
- 556 shipping/sorting clerks
- 604 automobile drivers
- 560 postal/communication clerks

The NANACO system is used for the JGSS (Japanese General Social Surveys) and the SSM (Social Stratification and Social Mobility) survey.


Existing Methods (for Binary Classifiers)

- Platt’s Method (sigmoid function): $P(f) = 1 / (1 + \exp(Af + B))$
- Zadrozny’s Binning Method
- Isotonic Regression Method (PAV algorithm)

Expansion to multiclass by dividing a multiclass classifier into binary classifiers
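As a minimal sketch of Platt’s sigmoid, the function below evaluates the formula for given parameters. The values of `A` and `B` are made up for illustration (they are not from the slides); in practice they would be fitted on held-out data.

```python
import math

def platt_probability(f, A, B):
    """Platt's sigmoid: P(f) = 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# Hypothetical parameters (assumption, not from the slides).
# With A < 0 the probability increases with the classifier score f.
A, B = -2.0, 0.0
for f in (-1.0, 0.0, 1.0):
    print(f, round(platt_probability(f, A, B), 3))
```

With this sign convention, a negative `A` is what makes a larger score map to a larger probability.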

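[Figure: example of the binning method on ten samples sorted by score]

score                  -2   -1.5  -1.3  -0.5  0    0.2  0.5  0.6  0.8  0.9
status (accuracy)       0    0     0     0    1    0    0    1    0    1
accuracy for each bin   0    0     0     0    0.3  0.3  0.3  0.5  0.5  1

Accuracy (a further row of the figure, seven values): 0.1 0.3 0.4 0.5 0.7 0.9 0.9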
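The binning idea in the figure can be sketched as follows, using the ten statuses shown above. The bin sizes (4, 3, 2, 1) are an assumption inferred from the per-bin accuracies in the figure, and `bin_accuracies` is a hypothetical helper name.

```python
def bin_accuracies(statuses, bin_sizes):
    """Replace each sample's status by the accuracy of its bin.

    statuses  : 0/1 correctness flags, already sorted by classifier score
    bin_sizes : number of consecutive samples per bin (assumed here)
    """
    assert sum(bin_sizes) == len(statuses)
    result, start = [], 0
    for size in bin_sizes:
        chunk = statuses[start:start + size]
        accuracy = sum(chunk) / size
        result.extend([accuracy] * size)
        start += size
    return result

# Scores are listed only to show the sort order of the samples.
scores = [-2, -1.5, -1.3, -0.5, 0, 0.2, 0.5, 0.6, 0.8, 0.9]
statuses = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
per_sample = bin_accuracies(statuses, bin_sizes=[4, 3, 2, 1])
print([round(a, 1) for a in per_sample])
```

A new sample would then receive the accuracy of the bin its score falls into.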


What is the Problem in Multiclass Classification?

The relationship among the scores:
the 1st class’s score > the 2nd class’s score > the 3rd class’s score > … > the nth class’s score

The 1st class is determined not by the absolute value of its score, but by its relative position among the scores.

                              Example 1      Example 2
the 1st class’s score         1.5 (large)    0.1 (small)
the 2nd class’s score         1.4 (large)    -1.5 (small)
the status of the 1st class   incorrect      correct
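To make the two examples concrete, the sketch below extracts the rank-1 and rank-2 scores from a per-class score list and prints their gap. The third score in each list is a hypothetical filler, and looking at the gap is only one illustrative way to see why the second score carries information.

```python
def top_two(scores):
    """Return (rank-1 score, rank-2 score) from a list of per-class scores."""
    ordered = sorted(scores, reverse=True)
    return ordered[0], ordered[1]

# First two values are from the table; the third is a made-up filler
# chosen to be lower than the 2nd class's score.
example_1 = [1.5, 1.4, -0.8]   # 1st class was incorrect
example_2 = [0.1, -1.5, -2.0]  # 1st class was correct
for scores in (example_1, example_2):
    first, second = top_two(scores)
    print(first, second, round(first - second, 2))
```

Example 1 has the larger top score but a much smaller gap to the runner-up than Example 2.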


For Effective Estimation

Does the class membership probability for the 1st class depend not only on the 1st class’s score but also on other classes’ scores?

It would be better to use not only the 1st class’s score, but also other classes’ scores.


Proposed Method

Using multiple classification scores

As a method for estimating class membership probabilities

a. (indirectly) Using “an accuracy table”

b. (directly) Applying a logistic regression


A Method Using an Accuracy Table

An Accuracy Table (e.g. 2 dimensions)

[Figure: a two-dimensional accuracy table; one axis is the 1st class’s score, the other is the 2nd class’s score. Cell values shown: 0.15 0.36 0.29 0.53 0.39 0.28 0.67 0.53 0.48 0.46]

Both the binning method and the isotonic regression method are difficult to extend to multiple dimensions, because it is not easy to sort all samples according to a single criterion when several scores are used.


Process

Using multiple scores

STEP 1 Create cells for an accuracy table.

STEP 2 Smooth accuracies.

STEP 3 Estimate class membership probability for an evaluation sample.


STEP 1 Create Cells for an Accuracy Table

Training data → n-fold cross validation (training data / test data) → scores and status of each test sample ({ f_i }, correct/incorrect) → an accuracy table

Accuracy = # correctly-classified samples / # samples

Example cell values: 0.15 0.36 0.29 0.53 0.39 0.28 0.67 0.53 0.48 0.46
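A minimal sketch of STEP 1 for a two-dimensional table, assuming equal-width cells indexed by `floor(score / interval)` on the rank-1 and rank-2 scores. The function names and the toy samples are hypothetical; the samples stand in for held-out predictions obtained by cross validation.

```python
import math
from collections import defaultdict

def cell_of(scores, interval):
    """Map the rank-1 and rank-2 scores to a 2-D cell index."""
    ordered = sorted(scores, reverse=True)
    return (math.floor(ordered[0] / interval),
            math.floor(ordered[1] / interval))

def build_accuracy_table(samples, interval):
    """samples: iterable of (scores, is_correct) pairs from held-out folds.

    Returns {cell: (n_correct, n_total, accuracy)}.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for scores, is_correct in samples:
        cell = cell_of(scores, interval)
        counts[cell][0] += int(is_correct)
        counts[cell][1] += 1
    return {cell: (c, n, c / n) for cell, (c, n) in counts.items()}

# Toy held-out predictions (made up for illustration).
samples = [
    ([0.72, 0.15, -0.3], True),
    ([0.78, 0.11, -0.9], False),
    ([0.71, 0.19, -1.0], True),
    ([0.25, 0.22, 0.0], False),
]
table = build_accuracy_table(samples, interval=0.1)
for cell, (n_correct, n_total, accuracy) in sorted(table.items()):
    print(cell, n_correct, n_total, round(accuracy, 2))
```

Keeping the raw counts next to each accuracy is a design choice that makes count-based smoothing possible later.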


STEP 2 Smooth Accuracies

[Figure: the accuracy table, with axes for the 1st class’s scores and the 2nd class’s scores; a target cell is marked.]


Smoothing Methods

Using only a target cell

- Laplace’s law (Lap): $P_{Lap}(f) = \frac{N_p(c(f)) + 1}{N(c(f)) + 2}$
- Lidstone method (Lid): $P_{Lid}(f) = \frac{N_p(c(f)) + \delta}{N(c(f)) + 2\delta}$

Using not only a target cell but also surrounding cells

- Moving average method (MA): $P_{MA}(f) = \frac{1}{n}\left(\frac{N_p(c(f))}{N(c(f))} + \sum_{s \in Nb(c(f))} \frac{N_p(s)}{N(s)}\right)$
- Median method (Median): $P_{Median}(f) = \mathrm{median}_{s \in Nb(c(f))}\left\{\frac{N_p(c(f))}{N(c(f))}, \frac{N_p(s)}{N(s)}\right\}$
- Moving average with coverage method (MA_cov): $P_{MA\_cov}(f) = \frac{\frac{N_p(c(f))}{N(c(f))}\, C(c(f)) + \sum_{s \in Nb(c(f))} \frac{N_p(s)}{N(s)}\, C(s)}{C(c(f)) + \sum_{s \in Nb(c(f))} C(s)}$
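The smoothing formulas can be sketched on a one-dimensional table. The slide does not define its symbols, so this sketch assumes the following reading: N_p(c) is the number of correctly classified samples in cell c, N(c) is the number of samples in c, Nb(c) is the set of adjacent cells, n is the number of cells averaged, and the coverage C(s) is the share of all samples that fall in s. Every cell is assumed non-empty.

```python
from statistics import median

def laplace(n_correct, n_total):
    return (n_correct + 1) / (n_total + 2)

def lidstone(n_correct, n_total, delta):
    return (n_correct + delta) / (n_total + 2 * delta)

def neighbours(index, size):
    """Adjacent cells in a 1-D table (assumed neighbourhood)."""
    return [i for i in (index - 1, index + 1) if 0 <= i < size]

def moving_average(acc, index):
    cells = [index] + neighbours(index, len(acc))
    return sum(acc[i] for i in cells) / len(cells)

def median_smooth(acc, index):
    cells = [index] + neighbours(index, len(acc))
    return median(acc[i] for i in cells)

def moving_average_cov(acc, cov, index):
    cells = [index] + neighbours(index, len(acc))
    weighted = sum(acc[i] * cov[i] for i in cells)
    return weighted / sum(cov[i] for i in cells)

# Toy counts for three cells (made up for illustration).
n_correct = [1, 3, 8]
n_total = [4, 6, 10]
acc = [c / n for c, n in zip(n_correct, n_total)]
cov = [n / sum(n_total) for n in n_total]  # assumed definition of coverage

print(laplace(3, 6), lidstone(3, 6, 0.5))
print(round(moving_average(acc, 1), 3), median_smooth(acc, 1))
print(round(moving_average_cov(acc, cov, 1), 3))
```

Under this reading, MA_cov gives well-populated neighbouring cells more influence than sparsely populated ones.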


STEP 3 Estimate class membership probability for an evaluation sample

Scores of an evaluation sample { f_i } → the matching cell of the accuracy table → class membership probability P = 0.29

[Figure: the accuracy table, with axes for the 1st class’s scores and the 2nd class’s scores. Cell values shown: 0.15 0.36 0.29 0.53 0.39 0.28 0.67 0.53 0.48 0.46]
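The lookup in STEP 3 can be sketched as a dictionary access. The table below is hypothetical: it reuses three values from the figure but the cell indices, the interval, and the `default` fallback for unseen cells are assumptions made for illustration.

```python
import math

def estimate_probability(table, scores, interval, default=None):
    """Look up the smoothed accuracy of the cell the sample falls into."""
    ordered = sorted(scores, reverse=True)
    cell = (math.floor(ordered[0] / interval),
            math.floor(ordered[1] / interval))
    return table.get(cell, default)

# Hypothetical smoothed table keyed by (rank-1 cell, rank-2 cell).
table = {(7, 1): 0.29, (7, 2): 0.53, (8, 1): 0.36}

p = estimate_probability(table, [0.74, 0.13, -0.6], interval=0.1)
print(p)
```

How to treat a sample that lands in a cell with no training samples is left open here; the sketch simply returns `default`.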


A Method Applying a Logistic Regression

Formula of a Logistic Regression

$P(f_1, \ldots, f_n) = 1 / (1 + \exp(\sum_i A_i f_i + B))$

$\{A_i\}, B$: parameters
$f_i$: the ith class’s score
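A minimal sketch of evaluating this formula for two scores (rank 1 and rank 2). The parameter values are invented for illustration and are not from the slides.

```python
import math

def class_membership_probability(scores, A, B):
    """P(f_1..f_n) = 1 / (1 + exp(sum_i A_i * f_i + B))."""
    z = sum(a * f for a, f in zip(A, scores)) + B
    return 1.0 / (1.0 + math.exp(z))

# Hypothetical parameters (assumption): a negative weight on the rank-1
# score and a positive weight on the rank-2 score.
A, B = [-2.0, 1.0], 0.5

p_close = class_membership_probability([1.5, 1.4], A, B)   # small gap
p_clear = class_membership_probability([0.1, -1.5], A, B)  # large gap
print(round(p_close, 3), round(p_clear, 3))
```

With these made-up weights, the sample whose runner-up score is far below the top score receives the higher probability, even though its top score is smaller.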


Process

Using multiple scores

STEP 1 Estimate parameters with the maximum likelihood method.

STEP 2 Estimate class membership probability for an evaluation sample.


STEP 1 Estimate parameters with the maximum likelihood method

Training data → n-fold cross validation (training data / test data) → scores and status of each test sample ({ f_i }, correct/incorrect)

$\{A_i\}, B$: estimated with maximum likelihood
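Maximum likelihood estimation of {A_i} and B can be sketched with plain gradient descent on the negative log-likelihood. The optimiser, learning rate, epoch count, and toy data are all assumptions; the slides do not say which fitting procedure was used.

```python
import math

def predict(scores, A, B):
    """P(f_1..f_n) = 1 / (1 + exp(sum_i A_i * f_i + B))."""
    z = sum(a * f for a, f in zip(A, scores)) + B
    return 1.0 / (1.0 + math.exp(z))

def nll(samples, A, B):
    """Negative log-likelihood of the correct/incorrect labels."""
    total = 0.0
    for scores, y in samples:
        p = predict(scores, A, B)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(samples, n_scores, lr=0.1, epochs=2000):
    """Gradient descent; a simple stand-in for any ML optimiser."""
    A, B = [0.0] * n_scores, 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_A, grad_B = [0.0] * n_scores, 0.0
        for scores, y in samples:
            # With this sign convention, d(NLL)/dz = y - p for one sample.
            diff = y - predict(scores, A, B)
            for i, f in enumerate(scores):
                grad_A[i] += diff * f
            grad_B += diff
        A = [a - lr * g / m for a, g in zip(A, grad_A)]
        B -= lr * grad_B / m
    return A, B

# Toy (rank-1 score, rank-2 score) pairs with status 1 = correct (made up).
samples = [
    ([1.5, 1.4], 0), ([0.1, -1.5], 1), ([0.9, -0.2], 1), ([0.6, 0.5], 0),
    ([1.2, 0.1], 1), ([0.4, 0.3], 0), ([0.8, 0.6], 1), ([1.0, -0.5], 0),
]
A, B = fit(samples, n_scores=2)
print([round(a, 2) for a in A], round(B, 2))
print(round(nll(samples, [0.0, 0.0], 0.0), 3), round(nll(samples, A, B), 3))
```

The second print compares the loss before and after fitting, which is a quick sanity check that the estimated parameters explain the labels better than the all-zero start.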


STEP 2 Estimate class membership probability for an evaluation sample

Scores of an evaluation sample { f_i } → class membership probability

$P(f_1, \ldots, f_n) = 1 / (1 + \exp(\sum_i A_i f_i + B))$


The Purpose of Experiments

Experiment 1

- Evaluation of various methods including the proposed methods

Experiment 2

- Evaluation of the best method


Experimental Setting

Classifier

- SVMs
  - one-versus-rest method to extend SVMs to a multiclass classifier
  - a linear kernel
  - soft margin parameter C = 0.6
- Naïve Bayes classifier

Features

e.g. the JGSS dataset (on the next slide)
- words in responses to “job task”
- words in responses to “industry”
- responses to “employment status” and “job title”


DataSet

The JGSS dataset (23,838 samples)
- Japanese survey data (open-ended)
- The number of classes is nearly 200
- Training data: old data (JGSS-2000, -2001, -2002)
- Test data: new data (JGSS-2003)

The 20 Newsgroups dataset (18,828 samples)
- English newsgroup articles
- The number of classes is 20
- 5-fold cross validation


Cell Intervals for an Accuracy Table

Cell intervals: 0.05, 0.1, 0.2, 0.3, 0.5, etc.

The relationship between cell intervals and the number of cells

Cell intervals                          0.05  0.1  0.2  0.3  0.5
# cells (the 1st class’s score used)    60    30   16   12   7


Evaluation Metrics

In experiment 1

Negative log-likelihood (a loss function)

$L = -\sum_i \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$

$y_i$: status of an evaluation sample (correct: 1, incorrect: 0)
$p_i$: predicted class membership probability of an evaluation sample

The lower L is, the better the method.

In experiment 2

- Reliability diagram: the predicted values vs. the true values
- ROC (receiver operating characteristic) curve: FPF (false positive fraction) vs. TPF (true positive fraction)
  - Ability to detect misclassified samples
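The negative log-likelihood can be sketched as below, written as minus the log-likelihood so that lower values are better. The clipping constant `eps` is an added safeguard against log(0) and is not part of the slides; the toy statuses and probabilities are made up.

```python
import math

def negative_log_likelihood(statuses, probabilities, eps=1e-12):
    """L = -sum_i ( y_i*log(p_i) + (1 - y_i)*log(1 - p_i) )."""
    total = 0.0
    for y, p in zip(statuses, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clipping is an assumption
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

statuses = [1, 0, 1, 1]                 # 1 = correct, 0 = incorrect
confident = [0.9, 0.1, 0.8, 0.7]        # probabilities that track the status
uninformative = [0.5, 0.5, 0.5, 0.5]    # always 0.5

print(round(negative_log_likelihood(statuses, confident), 3))
print(round(negative_log_likelihood(statuses, uninformative), 3))
```

Estimates that track the true status give a lower loss than the constant 0.5 baseline.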

The Proposed Method for Creating Cells

Negative log-likelihood in the best case for each method

Classifier         SVMs                  SVMs                   Naïve Bayes classifier
DataSet            JGSS dataset          20 Newsgroups dataset  20 Newsgroups dataset
Equal intervals    2369.3 (# cells=30)   1472.3 (# cells=30)    1679.8 (# cells=16)
Equal samples      2678.3 (# cells=12)   1572.9 (# cells=12)    1671.0 (# cells=12)


Experiment 1 (1/2)

Negative Log-likelihood (SVMs, the JGSS dataset)

Cell intervals  Used scores    No smoothing  Lap     Lid     MA      Median  MA_cov  Logistic regression
0.1             rank1          2309.3        2368.9  2368.9  2367.5  2372.6  2364.7  2367.6
0.1             rank1 & rank2  -             2356.8  2355.8  2245.8  -       2232.7  2246.9
0.2             rank1          2371.3        2371.0  2370.3  2369.3  2370.0  2369.3  2367.6
0.2             rank1 & rank2  -             2252.7  2254.7  2240.6  2341.8  2235.0  2246.9
0.5             rank1          2381.9        2381.8  2381.6  2395.9  2396.4  2409.9  2367.6
0.5             rank1 & rank2  2265.8        2265.6  2265.7  2327.5  2298.8  2320.6  2246.9


Experiment 1 (2/2)

Negative Log-likelihood (SVMs, the 20 Newsgroups dataset)

Cell intervals  Used scores    No smoothing  Lap     Lid     MA      Median  MA_cov  Logistic regression
0.1             rank1          1472.3        1472.4  1472.2  1468.1  1469.6  1467.4  1482.3
0.1             rank1 & rank2  -             1390.2  1388.3  1362.3  -       1360.3  1386.6
0.2             rank1          1472.5        1472.7  1472.5  1474.4  1473.3  1482.7  1482.3
0.2             rank1 & rank2  -             1365.4  1366.9  1374.9  -       1377.7  1386.6
0.5             rank1          1487.4        1487.5  1487.4  1503.9  1497.0  1537.9  1482.3
0.5             rank1 & rank2  1388.1        1387.7  1387.8  1447.2  1408.7  1479.4  1386.6


Negative Log-likelihood with SVMs on Both Datasets

A method using an accuracy table:
rank1 & rank2 < rank1 & rank2 & rank3 << rank1

A method applying a logistic regression:
rank1 & rank2 & rank3 < rank1 & rank2 << rank1

- Using multiple scores was very effective with SVMs.
- The method using an accuracy table (cell intervals = 0.1, smoothing method = MA_cov) was the best of all cases.
- The method applying a logistic regression was stable.

Negative Log-likelihood with Naïve Bayes classifier on the 20 Newsgroups dataset

In the case of the method using an accuracy table

# Cells (the 1st class’s score used)  Used scores    No smoothing  Lap     MA      Median  MA_cov
30                                    rank1          [missing]     1680.6  1670.1  1668.4  1675.0
30                                    rank1 & rank2  -             1439.7  1409.8  -       1415.3
16                                    rank1          1680.2        1679.8  1679.6  1675.8  1696.2
16                                    rank1 & rank2  -             1428.1  1515.5  -       1536.2
7                                     rank1          1697.2        1697.2  1712.0  1713.6  1732.8
7                                     rank1 & rank2  -             1474.8  1626.3  1644.8  1664.1
