
RDF Representation of Field Notes and Place Name Extraction


Academic year: 2021


(1)

Extraction and Management of Spatiotemporal Term from Field Notes and Data Structuring for its Sharing in Area Studies

Taizo Yamada

Historiographical Institute, The University of Tokyo, JAPAN

(2)

Contribution

Extraction of place names from field notes using SVM (Support Vector Machine)

Precision: 0.76

Characterization of the text in field notes

Term extraction and categorization using a topic model

(3)

Outline

Background, purpose

Methodology

Place name extraction

Text categorization

Data structure

(4)

Background

Field note

A field note consists of observation notes, drawings and images of a field.

It is one of the important resources for understanding the field.

There is a huge number of field notes, but only a small subset of them can be used.

Various databases concerning Area Studies, such as catalogues, images, movies and audio, have been constructed and published.

There are scarcely any databases concerning the text of field notes.

Reason: there has been no discussion or investigation of efficient usage or sharing of field note data.

(5)

Purpose

Establishing a method, or constructing a search system, for promoting the usage of field notes and for knowledge discovery from field notes.

For efficient searching and mining of the text data

(6)

Data Structure of Field note

Terms illustrated in the figure: マングローブ (en: mangrove), ココヤシ (en: coconut)

Field note (input):

Scene A (
text:
① マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。
(en: Mangrove. In the sea in front there are many bagans, which are scaffolds for catching fish.)
② ココヤシ多い。この下に少し家ある。
(en: There are many coconut palms. Below them there are a few houses.)
③ チョウジの多い斜面。
(en: A slope with many clove trees.)
Place: Bakauhumi;
Date: Oct. 19. ‘84;)

Scene B (<text>; <place>; <date>;)

Processing steps:

Determination of unit
Term extraction: morphological analysis (MeCab + IPAdic)
Latent topic detection: using topic model

Structured result:

Scene A (
topic: <マングローブ, 海, バガン, …>, <チョウジ, 斜面, …>, …;
place: Bakauhumi;
date: Oct. 19. ‘84;)

Scene B

(7)

Target

Example: Koichi Takaya, “The Field note collection2 Sumatra” (in Japanese)

Period: 1984. 10. 19 to 1985. 1. 18

Area: the whole of Sumatra Island

Characters: 165,757

(8)
(9)
(10)

Text Structure

Scene (analysis unit)

date

(11)

Term extraction

Morphological analysis with MeCab + IPAdic (morphological analyzer; dictionary)

Text (a scene):

マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。

Result of morphological analysis:

マングローブ    名詞,一般
。    記号,句点
前面    名詞,一般
の    助詞,連体化
海    名詞,一般
に    助詞,格助詞,一般
は    助詞,係助詞
バガン    名詞,一般
(    記号,句点
魚    名詞,一般
取り    名詞,接尾
用    名詞,接尾
の    助詞,連体化
櫓    名詞,一般
)    記号,句点
いくつ    名詞,代名詞,一般
も    助詞,係助詞
ある    動詞,自立
。    記号,句点
EOS

“名詞”: noun, “助詞”: postpositional particle, “記号”: symbol, “動詞”: verb

Extraction target: nouns only

Extracting terms and counting term frequency

Bag-of-Words:

Bakauhumi:1, マングローブ:1, 前面:1, 海:1, バガン:1, 魚取り用:1, 櫓:1, ココヤシ:1, 下:1, 家:1, チョウジ:1, 斜面:1
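The noun-only counting step can be sketched as follows. This is a minimal illustration, not the author's code: the (surface, part-of-speech) pairs are hard-coded in the shape a MeCab + IPAdic run would produce, so that the sketch runs without MeCab installed, and the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

# (surface, part-of-speech) pairs shaped like MeCab + IPAdic output.
# Hard-coded here (an assumption) so the sketch runs without MeCab.
tokens = [
    ("マングローブ", "名詞,一般"),
    ("。", "記号,句点"),
    ("前面", "名詞,一般"),
    ("の", "助詞,連体化"),
    ("海", "名詞,一般"),
    ("に", "助詞,格助詞,一般"),
    ("は", "助詞,係助詞"),
    ("バガン", "名詞,一般"),
    ("ある", "動詞,自立"),
]


def bag_of_words(tokens):
    """Count surface forms whose part of speech starts with 名詞 (noun)."""
    return Counter(surface for surface, pos in tokens if pos.startswith("名詞"))


bow = bag_of_words(tokens)
for term, freq in bow.items():
    print(f"{term}:{freq}")
```

Particles, symbols and verbs are dropped, so only the four nouns remain, each with frequency 1.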

(12)

Using topic model

LDA (Latent Dirichlet Allocation): D.M. Blei, et al. “Latent Dirichlet Allocation”, 2003.

• A scene (text) has one or more latent topic(s).
• Latent topics are calculated from term co-occurrence in a scene, and
• A latent topic has one or more terms.
• Output:
  • Relation between a scene and a topic,
  • Relation between a
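To make the idea concrete, here is a small collapsed Gibbs sampler for LDA in plain Python. It is a sketch under stated assumptions: the slides do not say which LDA implementation or inference method was used, and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy scenes are all illustrative. It returns scene-topic counts and topic-term counts, which correspond to the two kinds of relation a topic model provides.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized scenes (illustrative)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # scene-topic counts
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    # Resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy scenes (noun lists); invented for illustration only.
scenes = [
    ["マングローブ", "海", "バガン"],
    ["ココヤシ", "家"],
    ["チョウジ", "斜面"],
    ["マングローブ", "海", "魚"],
]
doc_topic, topic_word = lda_gibbs(scenes, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts.items(), key=lambda kv: -kv[1])[:3]
    print(f"V{k + 1}", top)
```

With only four toy scenes the resulting topics are not meaningful; the point is the shape of the output, one ranked term list per topic.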

(13)

Result: detection of Latent topics(1)

Top-ranked terms of each latent topic V1 to V30, listed in rank order as term:number (the listing for V16 to V30 breaks off at “Tungku”):

V1: 集落:8, 乳液:6, 女:4, 新しい:4, 灌木:4, 箱:4, 盆地:3, Kampar川:2, Transmigrasi:2, balai adat:2
V2: 島:10, 多い:8, 周辺:6, ton:5, 松林:5, 実:4, Talawi:3, シラス:3, 丘地帯:3, 川沿い:3
V3: Loc:8, 北進:4, 島:3, 苗木:3, Pekanbaru:2, Rokan:2, Tembirahan側:2, baris:2, karet:2, ホテル:2
V4: Bengkalis:8, 土手:6, 墓:5, オランブニヤ:4, 乾季:4, 井戸:4, 作期:3, 分かれ:3, 川:3, 満潮時:3
V5: チガヤ:4, Tembilahan:3, 急:3, 松林:3, 煉瓦:3, 草地:3, Batak:2, Makanan Padang:2, Medan:2, ドラム缶:2
V6: 多い:94, オカボ:92, トウモロコシ:86, 広い:80, コーヒー:52, 焼畑:37, 斜面:33, 周り:31, シナモン:30, クミリ:29
V7: 松:12, 悪い:7, 湖:7, 平坦:6, 所々:6, 崖:5, 高い:5, 急:4, 村:4, 村長:4
V8: きれい:6, 広大:5, ゴム園:4, 広い:4, ヨシ原:3, Sungai Lala:2, coir:2, cungkilan:2, gulungan:2, ばら:2
V9: 池:28, 水田:25, 魚池:16, 魚:13, 小池:6, 稚魚:6, Loc:5, helong:5, 上流:5, 囲い:5
V10: 町:30, 店:17, 市場:9, 北:7, Arsad:5, Kapsa:5, 丘:5, 植:5, 金:4, Dumai:3
V11: Sultan:6, pres:5, 森:5, 簡単:4, Ungku Tugut:3, 村:3, 班:3, Blast ing:2, Tanjung enim:2, Tengku Syarid:2
V12: ゴム:59, 水田:44, ゴム園:21, Minangkabau:19, suku:16, 焼畑:16, 人口:14, オカボ:13, 集落:12, トウモロコシ:10
V13: 地区:11, 丸太:7, 会社:6, レジン:5, 分かれ:5, 山:5, umo:4, Transmigrasi:3, ft:3, 昼夜水:3
V14: 中国人:103, 人:103, 自分:82, 無い:79, 家:78, ココヤシ:76, Rp:74, Melayu:68, 土地:51, 多い:49
V15: 木:16, 炭:10, 窯:10, 炭焼き小屋:8, 直径:7, 幅:6, 壁:5, 灌漑水路:4, 長い:4, クーポン券:3
V16: オカボ畑:5, Buatan:4, 松:4, Amuntai:3, Loc:3, マラヤ:3, 尋:3, 小村:3
V17: サゴ:91, 工場:59, tual:44, サゴヤシ:35, 水:34, Rp:33, 濡れサゴ:23, ton:20
V18: オランダ:33, 下:16, ムラユ:14, Raja Kecil:12, 間:9, kota:8, 人達:8, 王:8
V19: ゴム:142, 広い:90, ゴム園:41, タッピング:27, 丘:24, 植:12, 悪い:8, サゴヤシ:6
V20: バガン:19, Tebing Tinggi:6, Tanjung Datuk:5, Tanjung Pinang:5, 核:5, 内皮:4, 原:4, Huk Teicu:3
V21: 水田:108, 広い:84, 稲:65, 多い:61, 幅:61, 棚田:60, 川:43, 谷地田:35
V22: 魚:43, 網:28, 長い:26, inch:23, エビ:18, depa:14, 目:14, 深い:13
V23: 家:77, 多い:59, 右:40, マングローブ:29, 集落:24, 左:23, ニッパヤシ:15, 大変:14
V24: 多い:227, 家:118, コーヒー:95, ココヤシ:91, 村:61, ゴム:59, 周り:54, ランブータン:39
V25: 左:4, 下り:3, 川口:3, 昼食:3, 湿地:3, 長大:3, 雑魚:3, Tungku
V26: Banjar:49, Bugis:37, Sapat:37, Tembirahan:29, 稲:27, Java:21, Parit:19
V27: 草:108, 鍬:98, 田:95, 多い:86, 水田:72, 無い:72, 苗代:72
V28: 牛:16, 長い:16, クビキ:8, 土:8, 犂先:7, 草原:6, 犂柄:5
V29: コショウ:19, ドリアン:12, 根元:7, 中心:5, 成木:5, コーヒー:4, 持主:4
V30: ココヤシ:103, 木:75, 自分:58, 良い:42, 泥炭:41, サゴヤシ:36, 水路:34

(14)

Result: detection of Latent topics(2)

Selected topics with English glosses, top 8 terms in rank order:

V9: 池:28 (pond), 水田:25 (paddy field), 魚池:16 (fish pond), 魚:13 (fish), 小池:6 (small pond), 稚魚:6 (young fish), Loc:5, helong:5
V12: ゴム:59 (rubber), 水田:44 (paddy field), ゴム園:21 (rubber plantation), Minangkabau:19, suku:16, 焼畑:16 (swidden), 人口:14 (population), オカボ:13 (upland rice)
V21: 水田:108 (paddy field), 広い:84 (large), 稲:65 (rice plant), 多い:61 (many), 幅:61 (width), 棚田:60 (terraced rice fields), 川:43 (river), 谷地田:35 (paddy field at valley bottom)
V27: 草:108 (grass), 鍬:98 (hoe), 田:95 (rice field), 多い:86 (many), 水田:72 (paddy field), 無い:72 (none), 苗代:72 (rice nursery), 月:68 (moon)

V14: 中国人:103 (Chinese), 人:103 (person), 自分:82 (self), 無い:79 (none), 家:78 (house), ココヤシ:76 (coconut), Rp:74, Melayu:68
V24: 多い:227 (many), 家:118 (house), コーヒー:95 (coffee), ココヤシ:91 (coconut), 村:61 (village), ゴム:59 (rubber), 周り:54 (surrounding), ランブータン:39 (rambutan)
V30: ココヤシ:103 (coconut), 木:75 (tree), 自分:58 (self), 良い:42 (good), 泥炭:41 (peat), サゴヤシ:36 (sago palm), 水路:34 (water route), 下:32 (bottom)

(15)

Data structure (using RDF)

RDF graph of one scene (figure):

Resource: "http://xxx/id/00000003"

Node types: Catalogue, Scene, Text, Date, Place

Properties: dc:title, dc:subject, dc:description, dcterms:temporal, dcterms:spatial, fn:descId, fn:topic, fn:topicClass, fn:term

Literal values:

“Oct. 19. '84(3)”
“Jakarta よりKotabumi へ行く。”
“Oct. 19. '84” (Date)
“830km: Bakauhumi (*1) ① マングローブ。前面の海にはバガン (魚取り用の櫓 )いくつもある。② ココヤシ多い。この下に少し家ある。③ チョウジの多い斜面。”
“Bakauhumi” (Place)
3 (fn:descId)

Topic classes (fn:topicClass): “V24”, “V5”, “V3”

Terms (fn:term): “Bakauhumi”, “マングローブ”, “前面”, “海”, “バガン”
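A scene description of this kind can be written out as RDF triples. The sketch below serializes a few triples as N-Triples using only string formatting. Several points are assumptions: the slides do not give the namespace behind the `fn:` prefix, so `http://example.org/fn#` is a hypothetical stand-in; which literal belongs to which property is inferred from the diagram; and attaching the term “マングローブ” to topic class “V24” is purely illustrative. The `dc:` and `dcterms:` namespaces are the standard Dublin Core ones.

```python
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
FN = "http://example.org/fn#"  # hypothetical namespace for the fn: prefix

scene = "http://xxx/id/00000003"  # placeholder URI as shown on the slide


def uri(u):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{u}>"


def lit(text):
    """Format a plain literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


triples = [
    (uri(scene), uri(DC + "title"), lit("Oct. 19. '84(3)")),
    (uri(scene), uri(DCTERMS + "temporal"), lit("Oct. 19. '84")),
    (uri(scene), uri(DCTERMS + "spatial"), lit("Bakauhumi")),
    (uri(scene), uri(FN + "descId"), lit("3")),
    # One latent topic as a blank node; the term pairing is illustrative.
    (uri(scene), uri(FN + "topic"), "_:t1"),
    ("_:t1", uri(FN + "topicClass"), lit("V24")),
    ("_:t1", uri(FN + "term"), lit("マングローブ")),
]

ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
print(ntriples)
```

Modelling each topic as its own node keeps the topic class and its terms together, which matches the fn:topic, fn:topicClass and fn:term grouping in the diagram.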

(16)

Example of RDF (RDF/XML) (1)

Annotated RDF/XML listing (figure). Callouts: URI of the scene; place name at the scene; latitude and longitude; place name

(17)

Example of RDF (RDF/XML) (2)

Annotated RDF/XML listing (figure). Callouts: images; latent topic

(18)

Chunking and Place name extraction

Example: Banda acehに行く。

Result of morphological analysis:

Banda    名詞,一般 (Noun)
aceh    名詞,一般 (Noun)
に    助詞,格助詞 (Postpositional Particle; "to")
行く    動詞,自立 (Verb; "go")

After chunking:

Banda aceh    名詞,一般 (Noun)
に    助詞,格助詞 (Postpositional Particle; "to")
行く    動詞,自立 (Verb; "go")
EOS (end of sentence)
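Chunking as shown here amounts to merging runs of adjacent noun tokens. The following is a minimal sketch of that idea, not the author's implementation; the function name `chunk_nouns` and the rule of joining ASCII tokens with a space (and Japanese tokens without one) are assumptions.

```python
def chunk_nouns(tokens):
    """Merge runs of consecutive noun tokens into a single chunk."""
    chunks = []
    for surface, pos in tokens:
        is_noun = pos.startswith("名詞")
        if is_noun and chunks and chunks[-1][1].startswith("名詞"):
            prev_surface, prev_pos = chunks[-1]
            # Assumption: put a space between Latin-script words only.
            sep = " " if prev_surface.isascii() and surface.isascii() else ""
            chunks[-1] = (prev_surface + sep + surface, prev_pos)
        else:
            chunks.append((surface, pos))
    return chunks


tokens = [
    ("Banda", "名詞,一般"),
    ("aceh", "名詞,一般"),
    ("に", "助詞,格助詞"),
    ("行く", "動詞,自立"),
]
chunked = chunk_nouns(tokens)
print(chunked)

# The same rule joins 魚 + 取り + 用 into one term.
compound = chunk_nouns([("魚", "名詞,一般"), ("取り", "名詞,接尾"), ("用", "名詞,接尾")])
print(compound)
```

The merged chunk keeps the part of speech of its first token, so "Banda aceh" stays a general noun and can be passed on as one place name candidate.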

(19)

Judging whether a term is a place name or not

Using SVM (Support Vector Machine)

SVM is a supervised learning method for pattern recognition.

It is currently one of the best-performing learning models.

Training data is required in order to use it.

SVM requires a vector that represents a sequential pattern.
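As a rough illustration of how an SVM learns from such vectors, the sketch below trains a linear SVM (hinge loss, no bias term) with a Pegasos-style sub-gradient update. Everything here is an assumption for illustration: the slides do not say which SVM implementation or kernel was used, and the three binary features and the toy training data are invented.

```python
def train_linear_svm(X, y, lam=0.01, epochs=100):
    """Linear SVM without bias, trained by Pegasos-style sub-gradient steps."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                       # decaying step size
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1 - eta * lam) * wi for wi in w]      # regularization shrink
            if margin < 1:                              # hinge loss is active
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 (place name) or -1 (not a place name)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


# Hypothetical binary features per candidate term:
# [followed by a directional particle, preceded by ここ, listed as common noun]
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 1, -1]

w = train_linear_svm(X, y)
print([predict(w, x) for x in X])
```

In practice the training vectors would come from the known place names, and a library implementation would normally be used instead of a hand-written trainer.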

(20)

Method 1

Example sentence with parts of speech:

Jakarta    名詞 (Noun)
より (from)    助詞 (postpositional particle)
Kotabumi    名詞 (Noun)
へ (to)    助詞 (postpositional particle)
行く (go)    動詞 (Verb)
。    記号 (Symbol)

Noun subcategories shown for the place names: 固有名詞 (Proper noun), 地域 (Area)

Sequence after replacement: <LOCATION> より <TARGET> へ 行く

Extracting the previous and next N-grams.

Making a vector from the N-grams.
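The N-gram step can be sketched as follows, using the example sentence “Jakarta よりKotabumi へ行く。”. This is an illustration under assumptions: the window size `n=2`, the binary encoding, and the small `vocabulary` are hypothetical choices, since the slides do not specify how the vector is built.

```python
def context_ngrams(tokens, target_index, n=2):
    """Return the n tokens before and the n tokens after the target token."""
    prev_tokens = tokens[max(0, target_index - n):target_index]
    next_tokens = tokens[target_index + 1:target_index + 1 + n]
    return prev_tokens, next_tokens


def to_vector(prev_tokens, next_tokens, vocabulary):
    """Binary vector with one slot per (side, token) pair in the vocabulary."""
    present = {("prev", t) for t in prev_tokens}
    present |= {("next", t) for t in next_tokens}
    return [1 if key in present else 0 for key in vocabulary]


# The known place name Jakarta is already replaced by <LOCATION>.
tokens = ["<LOCATION>", "より", "Kotabumi", "へ", "行く", "。"]
prev_tokens, next_tokens = context_ngrams(tokens, tokens.index("Kotabumi"))

# Hypothetical feature vocabulary.
vocabulary = [("prev", "<LOCATION>"), ("prev", "より"), ("next", "へ"), ("next", "から")]
vector = to_vector(prev_tokens, next_tokens, vocabulary)
print(prev_tokens, next_tokens, vector)
```

Replacing known place names with a single <LOCATION> symbol lets contexts such as "<LOCATION> より" generalize across different place names.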

(21)

Method 2

Pattern1: ここ (より|で) <target>
(en: here (from|at) <target>)

Pattern2: ここ <target>
(en: here <target>)

Pattern3: <target> (出発|着|泊|川|湖|…|中心)
(en: <target> (start|reach|stay|river|lake|…|center))

Pattern4: <target> (より|から|まで|へ)
(en: <target> (from|until|to))

Pattern5: <target> <symbol, (period|comma)>

Pattern6: <target> (の|に|を) (町|帰る|東|西|南|…)
(en: <target> (of) (town|return|east|west|south|…))

Pattern7: <begin> <target> (<終点>|<記号,(句点|読点)>)
(en: <begin> <target> (<end>|<symbol, (period|comma)>))

小村
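Patterns like these can be turned into binary features by masking the candidate term and testing each pattern. The sketch below does this with regular expressions for simplified versions of Pattern1, Pattern4 and Pattern5 only. It is an assumption that the patterns were applied this way: the slides define them over morphological tokens, whereas this example works on raw strings for brevity.

```python
import re

# Simplified stand-ins for three of the patterns, written as regular
# expressions over text in which the candidate is replaced by <target>.
PATTERNS = {
    "pattern1": re.compile(r"ここ(より|で)<target>"),
    "pattern4": re.compile(r"<target>(より|から|まで|へ)"),
    "pattern5": re.compile(r"<target>[。、]"),
}


def pattern_vector(sentence, candidate):
    """Binary features: which patterns match once the candidate is masked."""
    masked = sentence.replace(candidate, "<target>")
    return [1 if rx.search(masked) else 0 for rx in PATTERNS.values()]


v1 = pattern_vector("JakartaよりKotabumiへ行く。", "Kotabumi")
v2 = pattern_vector("ここよりMedan。", "Medan")
print(v1, v2)
```

Each candidate thus yields a fixed-length vector, one slot per pattern, which can be used as the SVM input.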

(22)

Evaluation

Experimental setup

Number of target terms: 412

Training data: 725 (number of known place names)

Evaluation:

Manual evaluation of the predictions made by the SVM

Results: precision

Extraction of place names

Method 1: 0.59

Method 2: 0.50

When locations or organizations that cannot be located or identified are also counted

Example: airport, orange plant, sago palm factory, …

Method 1: 0.76
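Precision here is the share of terms predicted to be place names that were judged correct. A minimal sketch of the computation follows; the term lists are invented for illustration and are not the experimental data, and the function name `precision` is hypothetical.

```python
def precision(predicted, correct):
    """Share of predicted place names that were judged correct."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(correct)) / len(predicted)


# Invented example: four terms predicted to be place names.
predicted = ["Kotabumi", "Medan", "ココヤシ", "空港"]

strict = precision(predicted, ["Kotabumi", "Medan"])
# Relaxed judgment: locations that cannot be pinpointed (空港, airport) count too.
relaxed = precision(predicted, ["Kotabumi", "Medan", "空港"])
print(strict, relaxed)
```

Counting unlocatable locations as correct can only raise the score, which is the direction of the difference between 0.59 and 0.76 for Method 1.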

(23)

Flow of place name extraction

Steps: Term extraction with morphological analysis; Judgment by SVM; Evaluation; Updating of sequential pattern; Dictionary update

Labels in the diagram: input, output, feedback, dictionary

(24)

Tracking of investigation

Force-Directed Graph

Jan. ‘85

(25)

Use of web service

Gazetteer: Geonames.org

Kinds of services: see the table on the right

search: place name search, JSON and RDF available

hierarchy: place name hierarchy

Alternate names

Support

(26)

Use of web service

Obtaining Lat/Long.

The search service ranks its results. We use the top result, with the following exceptions:

1. If the administrative division of a result is Sumatra, that result takes priority over the others.

2. A place name in Indonesia takes priority over the others.

3. If the feature class is “P” (indicating city, village, …), that result takes priority over the others.
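One way to apply these priorities is to sort the returned candidates by a tuple of rule outcomes and keep the service's own ranking as the final tie-breaker. The sketch below assumes result records shaped like GeoNames JSON search hits, with `adminName1`, `countryCode` and `fcl` fields. It is also an assumption that the three rules are applied in this order of precedence, and the sample records are invented rather than real responses.

```python
def pick_candidate(results):
    """Pick one gazetteer hit: Sumatra first, then Indonesia, then class P."""
    def priority(item):
        index, hit = item
        admin = hit.get("adminName1", "")
        in_sumatra = "Sumatra" in admin or "Sumatera" in admin
        in_indonesia = hit.get("countryCode") == "ID"
        populated = hit.get("fcl") == "P"
        # False sorts before True, so each rule is negated;
        # index preserves the service's ranking among equal candidates.
        return (not in_sumatra, not in_indonesia, not populated, index)

    if not results:
        return None
    return min(enumerate(results), key=priority)[1]


# Invented records for illustration, in the order the service ranked them.
hits = [
    {"name": "Padang", "countryCode": "MY", "adminName1": "Sabah", "fcl": "P"},
    {"name": "Padang", "countryCode": "ID", "adminName1": "East Java", "fcl": "P"},
    {"name": "Padang", "countryCode": "ID", "adminName1": "West Sumatra", "fcl": "A"},
]
best = pick_candidate(hits)
print(best)
```

If no candidate satisfies any rule, the function falls back to the top-ranked result, as described above.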

(27)

Mapping onto Google Maps

(28)
(29)
(30)
(31)

Conclusion, Future work

We introduced methods of place name extraction and topic detection for Area Studies.

Target: field notes

Topic detection using LDA

Place name extraction using SVM

Future work

Improvement of text analysis for Area Studies.

What kind of system do Area Studies researchers want?

We will consider the answer and develop a system accordingly.

(32)
