1008 [NLP KE] pdf, recent update history, Ryo Masumura: Web

Academic year: 2018

(1)

Document Expansion using Relevant Web Documents for Spoken Document Retrieval

Ryo Masumura (1), Akinori Ito (1), Yu Uno (1), Masashi Ito (2), and Shozo Makino (3)

1. Graduate School of Engineering, Tohoku University, Japan

2. Faculty of Engineering, Tohoku Institute of Technology, Japan

3. Faculty of Science and Technology, Tohoku Bunka Gakuen University, Japan
(2)

Background

Retrieving recorded speech files using text queries => Spoken document retrieval

Problems of spoken document retrieval:

Metadata is insufficient on the Internet

Manual indexing is very costly

We use a speech recognizer to generate the index of the spoken document automatically

[Figure: a user submits a text query (e.g., "soccer") to search for spoken documents on the Internet]

(3)

Automatic indexing of a spoken document

Automatic indexing using LVCSR (large-vocabulary continuous speech recognition)

[Figure: Spoken document -> Speech recognizer -> Recognized document -> Index (document vector)]

Example: "We like soccer ball" is recognized as "We like soccer a bowl"

Word     Value
Soccer   8
Bowl     6
Like     2
We       1

Problems of this framework:

Recognition errors: The index contains many recognition errors

Out-Of-Vocabulary (OOV) words: Words that are not included in the recognizer dictionary. OOV words are not included in the index
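The indexing step can be sketched as a bag-of-words document vector. This is a minimal illustration only: the slide does not say how the values (8, 6, 2, 1) are computed, so raw word counts are used here as an assumption, and `build_index` is a hypothetical helper name.

```python
from collections import Counter


def build_index(transcript: str) -> Counter:
    """Return a bag-of-words document vector (word -> raw count).

    Raw counts are an assumption; the slides do not give the weighting.
    """
    return Counter(transcript.lower().split())


spoken = "We like soccer ball"        # what was actually said
recognized = "We like soccer a bowl"  # recognizer output from the slide

index = build_index(recognized)

# Words that were spoken but never reach the index because of the
# recognition error ("ball" became "a bowl").
missing = set(build_index(spoken)) - set(index)
```

Here `missing` contains the spoken word that the recognized index cannot match, which is the gap that document expansion tries to close.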

(4)

Document expansion approach

Use of multiple recognition hypotheses (Mangu et al. 2000)

Good: Solves the problem of misrecognition

Bad: OOV words are not included in any hypothesis

[Figure: Spoken document -> Speech recognizer -> Recognition hypotheses (1-best, 2-best, multiple hypotheses) -> Index. Example sentences: "We like soccer ball", "We like soccer a bowl", "We like soccer ball"]

Document expansion approach:

Predict keywords that may appear in a target document

Add the predicted keywords to the index

We have been focusing on the document expansion approach

(5)

Use of World Wide Web

We focus on the World Wide Web

More than 100 billion Web pages on the Internet

Powerful search engines such as Google, Bing or Yahoo

Web-based document expansion (Ito et al. 2009)

• How to collect Web documents relevant to the original speech?

• How to compose the expanded index using retrieved Web documents?

~ Object of this work ~

Develop a document expansion method using relevant Web documents

[Figure: Spoken document -> Speech recognizer -> Recognized document -> Index; World Wide Web -> Web documents -> Index]

(6)

Overview of the system

Automatic transcription is generated using a speech recognizer

Keywords are extracted, and each keyword is used as a query for retrieving Web documents

[Figure: Spoken document -> Speech recognizer -> Recognized document -> keyword queries (e.g., "Japan", "Sendai") -> Search engine -> World Wide Web -> Web documents -> Web index; the recognized index and the Web index are combined into the expanded index]

1. Avoid choosing keywords derived from misrecognition

2. Select documents really relevant to the spoken document

(7)

Point 1: Keyword selection

We measure the importance of each word, and the words with the highest values are chosen as keywords for composing queries

Conventional keyword selection

Word frequency (TF-IDF) based selection (A. Ito et al. 2009)

Recognized document:

Rank   Word     Value
1      Writer   56
2      Ball     52
3      Goal     33

[Figure: the selected keywords ("Writer", "Ball", "Goal") are each sent to the search engine as a query]

Problem:

SR tends to misrecognize rare words and OOV words

A high frequency of a word in an automatic transcription does not ensure the importance of that word

(8)

Relevance measurement

We propose a novel method of relevance measurement

Conventional method

Word frequency = Importance of "soccer" in the recognized document

Proposed method

Similarity between the two word distributions (the recognized document and the documents that contain "soccer") = Importance of "soccer" in the recognized document

If a keyword is related to the topic of the recognized document, the documents that contain that keyword have a word distribution similar to that of the recognized document

[Figure: the conventional method counts occurrences of "soccer" in the recognized document; the proposed method compares the word distribution of the recognized document ("soccer", "bowl", "goal") with that of documents that contain "soccer" ("soccer", "bowl", "ball", "goal", "shoot")]

(9)

Proposed keyword selection

We carry out a Web search using a word as a query

Cosine similarity between the two document vectors

This score considers external information sources for measuring relevance between a keyword and the transcription

[Figure: each candidate word in the recognized document ("Soccer", "Ball", "Goal") is used as a query; the document vector of the retrieved Web documents (documents that contain "soccer") is compared with the document vector of the recognized index]
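The keyword score described on this slide can be sketched as follows. This is a rough illustration, not the authors' implementation: the Web search is replaced by a pre-fetched dictionary of texts, raw word counts stand in for the unspecified term weights, and the names `cosine`, `vectorize`, and `keyword_scores` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vectorize(text: str) -> Counter:
    """Bag-of-words vector with raw counts (an assumed weighting)."""
    return Counter(text.lower().split())


def keyword_scores(recognized: str, web_texts_by_keyword: dict) -> dict:
    """Score each candidate keyword by the similarity between the
    recognized document and the Web documents retrieved for it."""
    doc_vec = vectorize(recognized)
    scores = {}
    for keyword, texts in web_texts_by_keyword.items():
        web_vec = Counter()
        for text in texts:
            web_vec.update(vectorize(text))
        scores[keyword] = cosine(doc_vec, web_vec)
    return scores


# Hypothetical data: "bowl" is a misrecognition of "ball".
recognized = "we like soccer a bowl the goal was great soccer"
retrieved = {
    "soccer": ["soccer goal ball shoot", "we like soccer"],
    "bowl": ["rice bowl soup kitchen"],
}
scores = keyword_scores(recognized, retrieved)
```

With this toy data the on-topic word "soccer" scores higher than the misrecognized word "bowl", because the pages retrieved for "bowl" share little vocabulary with the transcription.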

(10)

Point 2: Composition of Web index

Some of the retrieved Web documents might be irrelevant to the topic of the spoken document even if they contain the specified keyword

We need to consider the relevance of each Web document to the spoken document when combining the retrieved Web documents

We should give higher weight to words in the Web index if the words are relevant to the topic of the input speech

[Figure: retrieved Web documents D1, D2, D3, ..., Dn are combined into the Web index with weights λ1, λ2, λ3, ..., λn]

(11)

Relevance-based weighting

We employ a weight for the combination of Web documents

The weight λn is the cosine similarity between the automatic transcription and the Web document Dn

Furthermore, we introduce a threshold λth into the weighting so that irrelevant documents have a small effect on the Web index

$$
\mathrm{Weight}_n =
\begin{cases}
\lambda_n & (\lambda_n > \lambda_{th}) \\
\varepsilon & (\lambda_n < \lambda_{th})
\end{cases}
$$

[Figure: retrieved Web documents D1, D2, D3, ..., Dn are each compared with the recognized index and combined into the Web index with weights λ1, ε, λ3, ..., λn]
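A minimal sketch of the thresholded weighting, reading the slide as: a document whose similarity λn exceeds λth keeps weight λn, and any other document gets the small weight ε. The weighted sum used to build the Web index, the default value of ε, and the function names are assumptions for illustration; the slides only state that the documents are combined with these weights.

```python
from collections import Counter


def doc_weight(lam: float, lam_th: float, eps: float) -> float:
    """Relevance-based weight: lam above the threshold, eps otherwise."""
    return lam if lam > lam_th else eps


def compose_web_index(doc_vectors, lams, lam_th=0.1, eps=0.0) -> Counter:
    """Combine Web document vectors into one Web index.

    A weighted sum is assumed here; eps=0.0 is a placeholder value.
    """
    web_index = Counter()
    for vec, lam in zip(doc_vectors, lams):
        weight = doc_weight(lam, lam_th, eps)
        for word, value in vec.items():
            web_index[word] += weight * value
    return web_index


# Hypothetical retrieved documents and their similarities to the transcription.
docs = [Counter(soccer=2, goal=1), Counter(rice=3)]
lams = [0.5, 0.05]  # the second document falls below the threshold
web_index = compose_web_index(docs, lams, lam_th=0.1, eps=0.0)
```

In this toy case the off-topic document contributes nothing, so "rice" ends up with zero weight while "soccer" and "goal" are scaled by 0.5.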

(12)

Evaluation of the proposed method

Experimental conditions:

Test set: 40 automatic transcriptions of Japanese lecture speech (Word accuracy: 62.45%, OOV rate: 1.85%)

Web search engine: Yahoo! API

Experiment:

1. Evaluation of keyword selection: The rate of misrecognized words in selected keywords

2. Evaluation of composition of Web index: Cosine similarity between the composed Web index and the ideal index

3. Evaluation of the effect of the document expansion: Cosine similarity between the expanded index and the ideal index

(13)

Evaluation of the keyword selection

[Figure: Rate of misrecognized keywords in the selected keywords. x-axis: number of keywords (0 to 30); y-axis: rate of misrecognized words (%), 0 to 45; curves: proposed, conventional]

The proposed method selected fewer misrecognized words as keywords than the conventional method

The proposed relevance measurement of a word is more effective than the conventional measurement

(14)

Evaluation of composition of Web index

[Figure: Cosine similarity between the Web index and the ideal index. x-axis: T (number of keywords), 1 to 30; y-axis: similarity, 0.3 to 0.6; curves: weight = const, threshold = 0, threshold = 0.1]

The similarity decreases when many keywords are used, which seems to be caused by irrelevant Web documents

By introducing relevance-based weighting with a threshold, we could improve the similarity and avoid the degradation

The proposed relevance-based weighting is effective for composing the Web index

(15)

Effect of document expansion

[Figure: Cosine similarity between the expanded index and the ideal index. x-axis: r (combination weight), 0 to 1; y-axis: similarity, 0.3 to 0.7; curves: weight = const, threshold = 0, threshold = 0.1, no expansion]

No expansion denotes the conventional method, which uses only the recognized index

We obtained an improvement of 0.051 points compared with the conventional recognized index

When relevance-based weighting with a threshold was used, we obtained the highest similarity
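The slides name r as the combination weight but do not give the combination formula. One plausible reading, shown here purely as an assumption, is a linear interpolation between the recognized index and the Web index, with r = 0 reducing to no expansion. The function name `expand_index` and the toy vectors are hypothetical.

```python
from collections import Counter


def expand_index(recognized: Counter, web: Counter, r: float) -> Counter:
    """Assumed form: expanded = (1 - r) * recognized + r * web."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be in [0, 1]")
    expanded = Counter()
    for word in recognized.keys() | web.keys():
        expanded[word] = (1.0 - r) * recognized[word] + r * web[word]
    return expanded


# Hypothetical indices: the Web index supplies "ball", which the
# recognizer turned into "bowl".
recognized_index = Counter(soccer=8, bowl=6)
web_index = Counter(soccer=4, ball=10)

expanded = expand_index(recognized_index, web_index, r=0.5)
no_expansion = expand_index(recognized_index, web_index, r=0.0)
```

In practice the two indices would likely need to be normalized to comparable scales before mixing; that step is omitted here because the slides do not describe it.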

(16)

Retrieval experiment

We conducted a spoken document retrieval experiment to evaluate the effect of document expansion

Evaluation set: CSJ test collection (Akiba et al. 2008)

Target spoken documents: 2702 lecture speeches

Word accuracy was 75.12%, and OOV rate was 0.23%

Conventional method: No expansion (recognized index)

Proposed method: Document expansion (expanded index)

The number of keywords for finding relevant Web documents was set to 30

The number of retrieved Web documents for one query was set to 50

[Figure: 39 queries are run against the 2702 lecture speeches, giving 39 retrieval results]

(17)

Evaluation metric

We calculate precision rates and recall rates from the retrieval result

Retrieval result:

Rank   Is this relevant?
1      (Relevant)
2      × (Irrelevant)
3
4
5      ×

Because precision and recall have a trade-off relationship, we calculate the 11-point average precision (11pt AP)

Interpolated precision:

$$
\mathrm{IP}(R) = \max_{R' \ge R} \mathrm{P}(R')
$$

Average precision:

$$
\mathrm{11ptAP} = \frac{1}{11} \sum_{k=0}^{10} \mathrm{IP}\!\left(\frac{k}{10}\right)
$$

R is the recall rate

P(R) is the precision rate when the recall rate is R

[Figure: interpolated precision plotted against recall (0 to 1.0)]
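The 11-point average precision can be computed as in the sketch below. The ranking used in the example is hypothetical and is not taken from the slide; the function names are likewise illustrative. Interpolated precision at a recall level is taken as the maximum precision at any recall greater than or equal to that level, and is set to 0 when no such point exists.

```python
def precision_recall_points(ranked_relevance, num_relevant):
    """Return (recall, precision) at each rank where a relevant item appears."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    return points


def interpolated_precision(points, recall_level):
    """IP(R): maximum precision over all points with recall >= R."""
    # Small tolerance guards against floating-point error in k / 10.
    candidates = [p for r, p in points if r >= recall_level - 1e-12]
    return max(candidates, default=0.0)


def eleven_point_ap(ranked_relevance, num_relevant):
    """Average of IP(R) over recall levels 0.0, 0.1, ..., 1.0."""
    points = precision_recall_points(ranked_relevance, num_relevant)
    total = sum(interpolated_precision(points, k / 10) for k in range(11))
    return total / 11


# Hypothetical ranked list: True means the item at that rank is relevant.
ranking = [True, False, True, True, False]
ap = eleven_point_ap(ranking, num_relevant=3)
```

For this ranking the interpolated precision is 1.0 at recall levels 0.0 to 0.3 and 0.75 at levels 0.4 to 1.0, so `ap` equals (4 × 1.0 + 7 × 0.75) / 11.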

(18)

Experimental result

[Figure: 11pt AP (average of the results for the 39 queries). x-axis: r (combination weight), 0 to 1; y-axis: 11pt AP, 0.22 to 0.32; curves: weight = const, threshold = 0, threshold = 0.1, no expansion, manual transcription]

The ideal result is calculated using the indices of the manual transcriptions

The expanded index gave higher precision than using only the recognized index

When a constant weight was used, we obtained an improvement of 0.038 points compared with the no expansion method

Contrary to our expectation, the constant-weight combination method gave the highest precision

(19)

Conclusions

We proposed a Web-based document expansion method for spoken document retrieval

We proposed a method for selecting keywords for retrieving Web documents relevant to the spoken document

We proposed a method for generating a Web index that is similar to the ideal index generated from a manual transcription

The proposed Web-based document expansion method gave an index more similar to the correct index than the no expansion method

In a spoken document retrieval experiment, the retrieval precision improved when using the Web-based document expansion scheme
