
Paper / research presentation, Tokyo Metropolitan University, Natural Language Processing Lab (Komachi Lab)


Academic year: 2018


(1)

Building a Monolingual Parallel Corpus for Text Simplification Without a Simple Corpus

(2)

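English Wikipedia: Alfonso Perez
Alfonso Perez Munoz, usually referred to as Alfonso, is a former Spanish footballer, in the striker position.

Simple English Wikipedia: Alfonso Perez
Alfonso Perez is a former Spanish football player.

The task of rewriting sentences so that they are easier to read
Application 1: reduce the complexity of input sentences for natural language processing
Application 2: help people such as language learners read and understand text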

(3)

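Text simplification in the statistical machine translation framework

Treat text simplification as a translation problem within a single language
Prepare a parallel corpus of complex text and simple text, and train on it

Complex corpus + simple corpus → statistical machine translation model
Input:  Some computers can boot up from flash drives.
Output: Some computers can start from flash drives.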

(4)

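Large-scale language resources for text simplification are available only in English

English
  Simple corpus: Simple English Wikipedia (1 million sentences)
  Simple parallel corpora: Parallel Wikipedia (500,000 sentence pairs)
                           Newsela (56,000 sentences, 5 difficulty levels)
  Paraphrase dictionary: PPDB (100 million phrase pairs)
  Simple paraphrase dictionary: Simple PPDB (4.5 million phrase pairs)

Japanese
  Paraphrase dictionary: PPDB: Japanese (15 million phrase pairs)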

(5)

Building a monolingual parallel corpus for text simplification without a simple corpus

  Readability: 80.3  Japan is an island country in East Asia.
  Readability: 38.0  Japan is a stratovolcanic archipelago of 6,852 islands.
  Readability: 69.8  The country is divided into 47 prefectures in eight regions.
  Readability: 15.0  Archaeological research indicates that Japan was inhabited as early as the Upper Paleolithic period.
  Readability: 86.7  The population of 126 million is the world's tenth largest.

(11)

Building a monolingual parallel corpus for text simplification without a simple corpus

①  Compute the readability of each sentence and split the sentences into complex and simple ones
②  Compute the sentence similarity between complex and simple sentences
③  Extract the sentence pairs at or above a threshold to build a parallel corpus
④  Train a statistical machine translation model on the parallel corpus
⑤  Use the model to generate a simple paraphrase of an input sentence

Complex corpus
1.  Japan is a stratovolcanic archipelago of 6,852 islands.
2.  Archaeological research indicates that Japan was inhabited as early as the Upper Paleolithic period.

Simple corpus
1.  Japan is an island country in East Asia.
2.  The country is divided into 47 prefectures in eight regions.
3.  The population of 126 million is the world's tenth largest.

[Figure: sentence similarity matrix between complex sentences (1, 2, …) and simple sentences (1, 2, 3, …); cell values shown: 0.25, 0.11, 0.22, 0.10, 0.00, 0.09]

Parallel corpus
Complex: In 599, an earthquake destroyed buildings throughout Yamato Province in what is now Nara Prefecture.
Simple:  In 599, an earthquake destroyed buildings in Yamato Province which is now known as Nara Prefecture.

Statistical machine translation model
Input:  Some computers can boot up from flash drives.
Output: Some computers can start from flash drives.
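Steps ② and ③ can be sketched as follows. This is a minimal illustration, not the authors' implementation: `jaccard` is a stand-in word-overlap similarity chosen only to keep the example self-contained, and the names `extract_pairs` and `threshold` are hypothetical.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (assumption): overlap of lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def extract_pairs(
    complex_sents: List[str],
    simple_sents: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[float, str, str]]:
    """Score every complex/simple pair and keep those at or above the threshold."""
    pairs = []
    for c in complex_sents:
        for s in simple_sents:
            score = similarity(c, s)
            if score >= threshold:
                pairs.append((score, c, s))
    # Highest-scoring pairs first.
    return sorted(pairs, reverse=True)


complex_corpus = [
    "In 599, an earthquake destroyed buildings throughout Yamato Province "
    "in what is now Nara Prefecture.",
]
simple_corpus = [
    "In 599, an earthquake destroyed buildings in Yamato Province "
    "which is now known as Nara Prefecture.",
    "Japan is an island country in East Asia.",
]

parallel = extract_pairs(complex_corpus, simple_corpus, jaccard, threshold=0.5)
for score, c, s in parallel:
    print(f"{score:.2f}\t{c}\t{s}")
```

Any function with the same signature can be passed as `similarity`, so the stand-in can be swapped for an embedding-based score. The all-pairs loop is quadratic in corpus size, which is fine for a toy example but would need pruning at Wikipedia scale.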

(12)

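Prior work: English text simplification corpora
(built from English Wikipedia and Simple English Wikipedia)

1. Zhu et al. (2010): 108,000 sentences
   Represent each sentence as a TF-IDF vector
   Extract sentence pairs whose cosine similarity exceeds a threshold

2. Coster and Kauchak (2011): 137,000 sentences
   Extend the method of Zhu et al. to take sentence order into account

3. Hwang et al. (2015): 285,000 sentences
   Account for similarity between different words, using co-occurrence between Wiktionary headwords and the words in their definitions

4. Kajiwara and Komachi (2016): 493,000 sentences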

(13)

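Prior work: improves the sentence similarity computation
This work: improves the extraction of simple sentences

Prior work
  Comparable corpus (complex corpus, simple corpus)
  → sentence similarity computation
      1. Zhu et al. (2010)
      2. Coster and Kauchak (2011)
      3. Hwang et al. (2015)
      4. Kajiwara and Komachi (2016)
  → monolingual parallel corpus for text simplification

This work
  Raw corpus
  → [compute readability] → complex corpus, simple corpus
  → sentence similarity computation
      4. Kajiwara and Komachi (2016)
  → monolingual parallel corpus for text simplification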

(14)

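Classifying sentences as complex or simple based on readability

Flesch Reading Ease (FRE) score ranges
  90-100  Very Easy
  80-89   Easy
  70-79   Fairly Easy
  60-69   Standard
  50-59   Fairly Difficult
  30-49   Difficult
   0-29   Very Difficult

$$\mathrm{FRE} = 206.835 - 1.015\,\alpha - 84.6\,\beta$$

α: number of words
β: average number of syllables per word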
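The formula above can be tried on a single sentence with the sketch below. The syllable counter is a rough heuristic introduced here as an assumption (one syllable per run of vowel letters), and the tokenizer keeps alphabetic words only, so the scores will generally not match the values shown on the slides. The function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per run of vowel letters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(num_words: int, avg_syllables: float) -> float:
    """FRE = 206.835 - 1.015 * alpha - 84.6 * beta for a single sentence."""
    return 206.835 - 1.015 * num_words - 84.6 * avg_syllables


def sentence_fre(sentence: str) -> float:
    """Score one sentence; alpha is its word count, beta its mean syllables per word."""
    # Simplification: only alphabetic tokens are counted, so numbers are ignored.
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        raise ValueError("sentence contains no words")
    avg_syllables = sum(count_syllables(w) for w in words) / len(words)
    return flesch_reading_ease(len(words), avg_syllables)


score = sentence_fre("Japan is an island country in East Asia.")
print(round(score, 3))
```

With this heuristic the sentence has 8 words and 12 syllables, so α = 8 and β = 1.5. A dictionary-based syllable counter would give different and usually more accurate values.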

(15)

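Sentence alignment using sentence similarity based on word embedding alignment

$$S_{\mathrm{asym}}(x, y) = \frac{1}{|x|} \sum_{i=1}^{|x|} \max_{j} \phi(x_i, y_j)$$

$$S_{\mathrm{sym}}(x, y) = \frac{1}{2}\left( S_{\mathrm{asym}}(x, y) + S_{\mathrm{asym}}(y, x) \right)$$

The similarity between sentences x and y is defined as the average of the aligned word similarities
The word similarity φ(x_i, y_j) is the cosine similarity between CBOW vectors
Each word x_i is aligned with the word y_j that has the highest similarity to it
S_asym is an asymmetric score, so the two directions are averaged
To reduce noise, word pairs with φ < (threshold) are not aligned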
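The two scores can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' code: each sentence is a matrix with one non-zero embedding row per word, the toy vectors stand in for CBOW embeddings, and a word whose best match falls below the word threshold is assumed to contribute 0 to the average (the slide only says such pairs are not aligned).

```python
import numpy as np


def cosine_matrix(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity phi(x_i, y_j) between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T


def s_asym(X: np.ndarray, Y: np.ndarray, word_threshold: float = 0.5) -> float:
    """Mean over words x_i of the best similarity max_j phi(x_i, y_j)."""
    best = cosine_matrix(X, Y).max(axis=1)
    # Assumption: matches below the threshold are unaligned and count as 0.
    best = np.where(best >= word_threshold, best, 0.0)
    return float(best.mean())


def s_sym(X: np.ndarray, Y: np.ndarray, word_threshold: float = 0.5) -> float:
    """Average of the two asymmetric directions."""
    return 0.5 * (s_asym(X, Y, word_threshold) + s_asym(Y, X, word_threshold))


# Toy 2-dimensional "embeddings": sentence x has two words, sentence y has one.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0]])

print(s_asym(x, y), s_asym(y, x), s_sym(x, y))
```

In the toy example only the first word of x has a counterpart in y, so the x-to-y direction scores lower than the y-to-x direction, which is the asymmetry that the averaging in S_sym removes.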

(16)

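A text simplification corpus of 2 million sentence pairs

English Wikipedia: 6,283,703 sentences
  Complex corpus (0 <= FRE < 60): 3,689,227 sentences
  Simple corpus (60 <= FRE <= 100): 2,358,921 sentences
  Other (FRE < 0 or 100 < FRE), excluded: 235,555 sentences
  * For example, sentences several hundred words long and bulleted lists

Sentence alignment
  Threshold (word): φ > 0.5
  Threshold (sentence): S_sym > 0.5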
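The three FRE ranges above amount to a small bucketing rule, sketched here. The function name `fre_bucket` and the two out-of-range scores at the end of the list are hypothetical; the first five scores are the readability values from the earlier example sentences.

```python
from collections import Counter


def fre_bucket(fre: float) -> str:
    """Assign a sentence to a corpus according to its FRE score."""
    if 0 <= fre < 60:
        return "complex"
    if 60 <= fre <= 100:
        return "simple"
    # Scores outside [0, 100] are excluded from both corpora.
    return "excluded"


# Five scores from the example sentences, plus two hypothetical out-of-range scores.
scores = [80.3, 38.0, 69.8, 15.0, 86.7, 120.4, -12.0]
counts = Counter(fre_bucket(s) for s in scores)
print(counts)

# Sanity check: the three reported corpus sizes add up to the reported total.
assert 3_689_227 + 2_358_921 + 235_555 == 6_283_703
```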

(17)

Examples from the text simplification corpus

Similarity 0.99
  Complex: Climate in this area has mild differences between highs and lows, and there is adequate precipitation year round.
  Simple:  Climate in this area has mild differences between highs and lows, and there is adequate rainfall year round.

Similarity 0.88
  Complex: The new German Empire included 25 states (three of them, Hanseatic cities) and the imperial territory of Alsace-Lorraine.
  Simple:  The new German Empire included 25 states, three of them Hanseatic cities.

Similarity 0.77
  Complex: In 1996, she received the Primetime Emmy Award for Outstanding Supporting Actress in a Comedy Series, an award she was nominated for on seven occasions.
  Simple:  In 2006 and 2008, she received Emmy nominations for Outstanding Supporting Actress in a Drama Series.

Similarity 0.66
  Complex: The album reached number two in the UK Albums Chart and was certified double platinum by the British Phonographic Industry (BPI).
  Simple:  The single reached number one in the UK and has been certified platinum by the BPI, selling 600,000 copies.

Similarity 0.55
  Complex: Bombed as a target of the Oil Campaign of World War 2, Erfurt suffered only limited damage and was captured on 12 April 1945, by the US 80th Infantry Division.

(18)

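Text simplification using statistical machine translation

Moses (PBSMT toolkit)
  GIZA++ (word alignment)
  KenLM (language model)
    Baseline: 5-gram model built from Simple English Wikipedia
    Proposed method: 5-gram model built from sentences with readability >= 60

Multi-reference test data
  350 sentences from English Wikipedia, 8 references each
  References: sentences manually rewritten to be simpler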

(19)

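Automatic evaluation metrics

FRE: an automatic metric that computes readability
  Evaluates using the output alone

BLEU: the standard automatic metric for machine translation
  Correlates highly with human evaluation of meaning and grammaticality
  Evaluates by comparing the output with the references

SARI: an automatic metric for text simplification
  Correlates with human evaluation in a balanced way, including simplicity
  Evaluates by comparing the input, the output, and the references

$$\mathrm{SARI} = \frac{1}{3} F_{\mathrm{add}} + \frac{1}{3} F_{\mathrm{keep}} + \frac{1}{3} P_{\mathrm{del}}$$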

(20)

Achieves performance comparable to prior work trained on a simple corpus

                                                  Corpus size   FRE    BLEU   SARI
Baseline (outputs the input sentence unchanged)   0             54.5   99.4   25.9

(21)

Achieves performance comparable to prior work trained on a simple corpus

Edits are marked as [original → replacement]; [+ word] marks an insertion.

Input
Offenbach's numerous operettas, such as Orpheus in the Underworld, and La belle Hélène, were extremely popular in both France and the English-speaking world during the 1850s and 1860s.

Ref 1
Offenbach's numerous operettas, such as Orpheus in the Underworld, and La belle Hélène, were [extremely → very] popular in both France and the English-speaking world during the 1850's and 1860's.

Ref 2
Offenbach's [numerous → many] operettas, such as Orpheus in the Underworld, and La belle Hélène, were [extremely → very] popular in both France and [+ in] the English-speaking world during the 1850s and 1860s.

Kajiwara+ 2016
Offenbach's [numerous → many] operettas, such as Orpheus in the Underworld, and La belle Hélène, were [extremely → very] popular in both France and the English-speaking world during the 1850s and 1860s.

Xu+ 2016
Offenbach's [numerous → many] operettas, such as Orpheus in the Underworld, and [La → The] belle Hélène, were [extremely → very] popular in both France and the English-speaking world [during → in] the 1850s and 1860s.

This work
Offenbach's [numerous → many] operettas, such as Orpheus in the Underworld, and La belle Hélène, were extremely popular in both France

(22)

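Why does it work?

The complex corpus and the simple corpus obtained by splitting a raw corpus do not form a comparable corpus, but the experimental results show that this is not a problem.

1. Because PBSMT learns phrase-level translation pairs
   - Partial correspondences between words and phrases can be obtained not only from sentence pairs related by synonymy or entailment, but also from sentence pairs that are merely similar in meaning

2. Because reranking is performed with a language model
   - Even if the acquired phrase pairs contain a lot of noise, as long as appropriate phrase pairs are included among them ...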

(23)

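Summary and future work

We built a monolingual parallel corpus for text simplification using only a raw corpus.

Experiments on text simplification with PBSMT showed that this achieves results comparable to training on a large-scale simple corpus.

Raw corpora are available at large scale in languages other than English.
Going forward, this should make text simplification feasible in any language.
(The source code is planned for release on GitHub.)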
