1008【NLP KE】pdf Recent update history Ryo Masumura: Web

(1)

Document Expansion using Relevant Web Documents for Spoken Document Retrieval

Ryo Masumura¹, Akinori Ito¹, Yu Uno¹, Masashi Ito², and Shozo Makino³

1. Graduate School of Engineering, Tohoku University, JAPAN

2. Faculty of Engineering, Tohoku Institute of Technology, JAPAN

3. Faculty of Science and Technology, Tohoku Bunka Gakuen University, JAPAN

(2)

Background

Spoken document retrieval: retrieving recorded speech files using text queries

Problems of spoken document retrieval:

Metadata available on the Internet is insufficient

Manual indexing is costly

We use a speech recognizer to generate the index of a spoken document automatically

[Figure: a user submits a text query (e.g. "Soccer") to search spoken documents on the Internet]

(3)

Automatic indexing of a spoken document

Automatic indexing using LVCSR

[Figure: spoken document -> speech recognizer -> recognized document -> index (document vector)]

Example: the utterance "We like soccer ball" is recognized as "We like soccer a bowl"

Document vector:

Word    Value
Soccer  8
Bowl    6
Like    2
We      1

Problems of this framework:

Recognition errors: the index contains many recognition errors

Out-Of-Vocabulary (OOV) words: words that are not included in the recognizer dictionary. OOV words are not included in the index
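The indexing problem above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw word counts as the document vector; the function name `build_index` is hypothetical, and real Japanese lecture speech would need a morphological analyzer instead of `split()`.

```python
from collections import Counter

def build_index(transcript: str) -> Counter:
    """Bag-of-words document vector: word -> raw count (whitespace tokens)."""
    return Counter(transcript.lower().split())

spoken = build_index("We like soccer ball")        # what was actually said
recognized = build_index("We like soccer a bowl")  # recognizer output

# Spoken words that never reach the index, and words added by misrecognition.
missing = set(spoken) - set(recognized)
spurious = set(recognized) - set(spoken)
print(sorted(missing), sorted(spurious))
```

A text query for "ball" cannot match this recognized index, which is the gap document expansion tries to close.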

(4)

Document expansion approach

We have been focusing on the document expansion approach

Predict keywords that may appear in a target document

Add the predicted keywords to the index

Use of multiple recognition hypotheses (Mangu et al. 2000)

[Figure: spoken document -> speech recognizer -> recognition hypotheses (1-best, 2-best, ...) -> index; example hypotheses: "We like soccer ball", "We like soccer a bowl", "We like soccer ball"]

Good: Solves the problem of misrecognition

Bad: OOV words are not included in any hypothesis

(5)

Use of World Wide Web

We focus on the World Wide Web

More than 100 billion Web pages on the Internet

Powerful search engines such as Google, Bing or Yahoo

Web-based document expansion (Ito et al. 2009)

[Figure: spoken document -> speech recognizer -> recognized document; Web documents from the World Wide Web; both feed the index]

~ Object of this work ~

Develop a document expansion method using relevant Web documents

• How to collect Web documents relevant to the original speech?

• How to compose the expanded index using retrieved Web documents?

(6)

Overview of the system

[Figure: spoken document -> speech recognizer -> recognized index; keywords (e.g. "Japan", "Sendai") -> search engine -> World Wide Web -> Web documents -> Web index; recognized index + Web index -> expanded index]

An automatic transcription is generated using a speech recognizer

Keywords are extracted, and each keyword is used as a query for retrieving Web documents

1. Avoid choosing keywords derived from misrecognition

2. Select documents really relevant to the spoken document

(7)

Point 1: Keyword selection

We measure the importance of each word, and the words with the highest values are chosen as keywords for composing queries

Conventional keyword selection

Word frequency (TF-IDF) based selection (A. Ito et al. 2009)

Rank  Word    Value
1     Writer  56
2     Ball    52
3     Goal    33
…     …       …

[Figure: the selected keywords ("Writer", "Ball", "Goal") are each sent to the search engine as a query]

Problem:

The speech recognizer tends to misrecognize rare words and OOV words

A high frequency of a word in an automatic transcription does not ensure the importance of that word
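The conventional frequency-based selection can be sketched as follows. The slides do not state which TF-IDF variant was used, so the smoothed IDF, the function name `tfidf_keywords`, and the toy background corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, top_n):
    """Rank the words of one transcript by TF * smoothed IDF (assumed variant)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for other in corpus if word in other)  # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        scores[word] = freq * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy transcript in which "writer" is a frequent misrecognition.
transcript = ["writer", "writer", "ball", "the"]
background = [{"the", "ball"}, {"the"}, {"the", "goal"}]
top = tfidf_keywords(transcript, background, top_n=2)
print(top)
```

Because the score depends only on counts, a word that the recognizer produces often by mistake can still rank first, which is the problem stated on this slide.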

(8)

Relevance measurement

We propose a novel method of relevance measurement

If a keyword is related to the topic of the recognized document, the documents that contain that keyword have a word distribution similar to that of the recognized document

Conventional method: word frequency of "soccer" in the recognized document = importance of "soccer" in the recognized document

Proposed method: similarity between the word distribution of the recognized document and that of the documents that contain "soccer" = importance of "soccer" in the recognized document

[Figure: word distributions (e.g. "soccer", "ball", "bowl", "goal", "shoot") of the recognized document and of the documents that contain "soccer"]

(9)

Proposed keyword selection

We carry out a Web search using a word as a query

The score of the word is the cosine similarity between the two document vectors: the recognized index and the Web documents that contain the word

This score considers external information sources for measuring relevance between a keyword and the transcription

[Figure: candidate keywords ("Soccer", "Ball", "Goal") -> Web documents that contain the keyword -> document vector, compared with the recognized index]
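The proposed keyword score can be sketched as below. The `search` argument is a hypothetical stand-in for a Web search API that returns tokenized documents, and pooling all retrieved documents into one vector is an assumption; the slides only state that cosine similarity between the two document vectors is used.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_relevance(word, recognized, search):
    """Score a candidate keyword by comparing the recognized index with
    the pooled vector of documents retrieved for that keyword."""
    web_vector = Counter()
    for doc_tokens in search(word):
        web_vector.update(doc_tokens)
    return cosine(recognized, web_vector)

# Toy stand-in for a search engine (hypothetical data).
fake_web = {
    "soccer": [["soccer", "goal", "ball", "shoot"], ["soccer", "goal", "team"]],
    "bowl": [["bowl", "soup", "kitchen"], ["bowl", "rice"]],
}
recognized = Counter(["we", "like", "soccer", "bowl", "goal", "soccer", "shoot"])
scores = {w: keyword_relevance(w, recognized, lambda q: fake_web[q]) for w in fake_web}
print(scores)
```

In this toy case the misrecognized word "bowl" pulls in documents about a different topic, so it scores lower than "soccer" even though both appear in the transcript.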

(10)

Point 2: Composition of Web index

We should give higher weight to words in the Web index if the words are relevant to the topic of the input speech

Some of the retrieved Web documents might be irrelevant to the topic of the spoken document even if they contain the specified keyword

We need to consider the relevance of each Web document to the spoken document when combining the retrieved Web documents

[Figure: retrieved Web documents D₁, D₂, D₃, ..., D_n are combined with weights λ₁, λ₂, λ₃, ..., λ_n into the Web index]

(11)

Relevance-based weighting

We employ a weight for the combination of Web documents

The weight λ_n is the cosine similarity between the automatic transcription and the Web document D_n

Furthermore, we introduce a threshold λ_th into the weighting so that irrelevant documents have a small effect on the Web index

$$\text{Weight} = \begin{cases} \lambda_n & (\lambda_n > \lambda_{th}) \\ \varepsilon & (\lambda_n < \lambda_{th}) \end{cases}$$

[Figure: retrieved Web documents D₁, D₂, D₃, ..., D_n are compared with the recognized index and combined into the Web index with weights λ₁, ε, λ₃, ..., λ_n]
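A minimal sketch of relevance-based weighting with a threshold follows. It assumes the Web index is a weighted sum of the raw count vectors of the retrieved documents; the slides do not give the value of ε or the exact combination rule, so `epsilon`, `compose_web_index`, and the toy vectors are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def compose_web_index(recognized, web_docs, threshold, epsilon):
    """Weighted sum of Web document vectors (assumed combination rule).
    Documents whose similarity does not exceed the threshold get the
    small weight epsilon instead of their similarity."""
    web_index = defaultdict(float)
    for doc in web_docs:
        lam = cosine(recognized, doc)
        weight = lam if lam > threshold else epsilon
        for word, count in doc.items():
            web_index[word] += weight * count
    return web_index

recognized = Counter({"soccer": 2, "goal": 1})
relevant = Counter({"soccer": 1, "goal": 1, "shoot": 1})
irrelevant = Counter({"bowl": 3, "soup": 1})
index = compose_web_index(recognized, [relevant, irrelevant],
                          threshold=0.1, epsilon=0.0)
print(dict(index))
```

With a threshold of 0 every document keeps its similarity as its weight, which matches the "threshold = 0" condition in the evaluation plots.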

(12)

Evaluation of the proposed method

Experimental conditions:

Test set: 40 automatic transcriptions of Japanese lecture speech (Word accuracy: 62.45%, OOV rate: 1.85%)

Web search engine: Yahoo! API

Experiment:

1. Evaluation of keyword selection: the rate of misrecognized words in the selected keywords

2. Evaluation of composition of Web index: cosine similarity between the composed Web index and the ideal index

3. Evaluation of the effect of the document expansion: cosine similarity between the expanded index and the ideal index

(13)

Evaluation of the keyword selection

Rate of misrecognized keywords in the selected keywords

[Figure: rate of misrecognized words (%) (y-axis, 0 to 45) against number of keywords (x-axis, 0 to 30); curves: proposed, conventional; the gap is marked "Improvement"]

The proposed method selected fewer misrecognized words as keywords than the conventional method

The proposed relevance measurement of a word is more effective than the conventional measurement

(14)

Evaluation of composition of Web index

Cosine similarity between the Web index and the ideal index

[Figure: similarity (y-axis, 0.3 to 0.6) against T, the number of keywords (x-axis, 1 to 30); curves: weight = const, threshold = 0, threshold = 0.1]

The similarity decreases when many keywords are used, which seems to be caused by irrelevant Web documents

By introducing relevance-based weighting with a threshold, we could improve the similarity and avoid the degradation

The proposed relevance-based weighting is effective for composing the Web index

(15)

Effect of document expansion

Cosine similarity between the expanded index and the ideal index

[Figure: similarity (y-axis, 0.3 to 0.7) against r, the combination weight (x-axis, 0 to 1); curves: weight = const, threshold = 0, threshold = 0.1, no expansion]

"No expansion" denotes the conventional method, which uses only the recognized index

When relevance-based weighting with a threshold was used, we obtained the highest similarity

We obtained an improvement of 0.051 points compared with the conventional recognized index
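The slides do not give the formula that combines the recognized index and the Web index with the weight r. The sketch below assumes a linear interpolation of L2-normalized vectors, so that r = 0 reproduces the recognized index; `expand_index` and `l2_normalize` are hypothetical names, and the actual method may differ.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else dict(vec)

def expand_index(recognized, web, r):
    """Assumed rule: (1 - r) * recognized + r * web, on normalized vectors."""
    rec = l2_normalize(recognized)
    web_n = l2_normalize(web)
    return {w: (1 - r) * rec.get(w, 0.0) + r * web_n.get(w, 0.0)
            for w in rec.keys() | web_n.keys()}

recognized = {"soccer": 2.0, "bowl": 1.0}
web = {"soccer": 1.0, "ball": 1.0}
no_expansion = expand_index(recognized, web, r=0.0)
expanded = expand_index(recognized, web, r=0.5)
print(expanded)
```

Under this assumption, a word such as "ball" that is absent from the recognized index receives a non-zero value once r is greater than 0.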

(16)

Retrieval experiment

We conducted a spoken document retrieval experiment to evaluate the effect of document expansion

Evaluation set: CSJ test collection (Akiba et al. 2008)

Target spoken documents: 2702 lecture speeches

Word accuracy was 75.12%, and OOV rate was 0.23%

Conventional method: No expansion (recognized index)

Proposed method: Document expansion (expanded index)

Number of keywords for finding relevant Web documents was set to 30

Number of retrieved Web documents per query was set to 50

[Figure: 39 queries are run against the 2702 lecture speeches, giving 39 retrieval results]

(17)

Evaluation metric

We calculate precision rates and recall rates from the retrieval result

Retrieval result:

Rank  Is this relevant?
1     ○ (Relevant)
2     × (Irrelevant)
3     ○
4     ○
5     ×
…     …

Because precision and recall have a trade-off relationship, we calculate the 11-point average precision (11pt AP)

$$\mathrm{11ptAP} = \frac{1}{11} \sum_{k=0}^{10} \mathrm{IP}\left(\frac{k}{10}\right)$$

$$\mathrm{IP}(R) = \max_{R' \ge R} P(R')$$

R is the recall rate, P(R) is the precision rate when the recall rate is R, and IP(R) is the interpolated precision

[Figure: interpolated precision (y-axis) against recall (x-axis, 0 to 1.0), with the average precision indicated]
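The 11-point average precision can be computed as in the sketch below, which uses the standard definition of interpolated precision (the maximum precision at any recall level at or above R). The function name and the assumption that the example list has exactly three relevant documents in total are illustrative.

```python
def eleven_point_ap(relevance, num_relevant):
    """11-point interpolated average precision.
    relevance: booleans in rank order (True = relevant).
    num_relevant: total number of relevant documents for the query."""
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / num_relevant, hits / rank))

    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at recall >= level.
        reachable = [p for r, p in points if r >= level]
        total += max(reachable, default=0.0)
    return total / 11

# Ranks 1 to 5 from the slide: relevant, irrelevant, relevant, relevant, irrelevant.
ap = eleven_point_ap([True, False, True, True, False], num_relevant=3)
print(round(ap, 4))
```

In the retrieval experiment this value would be computed per query and then averaged over the 39 queries.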

(18)

Experimental result

[Figure: 11pt AP (y-axis, 0.22 to 0.32) against r (x-axis, 0 to 1); curves: weight = const, threshold = 0, threshold = 0.1, no expansion, manual transcription; average of the results for the 39 queries]

The ideal result (manual transcription) is calculated using the indices of the manual transcriptions

The expanded index gave higher precision than the recognized index alone

When constant weight was used, we obtained an improvement of 0.038 points compared with the no expansion method

The result differed from our expectation, because the constant weight combination method gave the highest precision

(19)

Conclusions

We proposed a Web-based document expansion method for spoken document retrieval

We proposed a method for selecting keywords for retrieving Web documents relevant to the spoken document

We proposed a method for generating a Web index that is similar to the ideal index generated from a manual transcription

The proposed Web-based document expansion method gave an index more similar to the correct index than the no expansion method

In a spoken document retrieval experiment, the retrieval precision improved when using the Web-based document expansion scheme