Document Expansion
using Relevant Web Documents
for Spoken Document Retrieval
Ryo Masumura (1), Akinori Ito (1), Yu Uno (1), Masashi Ito (2), and Shozo Makino (3)
1. Graduate School of Engineering, Tohoku University, JAPAN
2. Faculty of Engineering, Tohoku Institute of Technology, JAPAN
3. Faculty of Science and Technology, Tohoku Bunka Gakuen University, JAPAN
Background
Retrieving recorded speech files using text queries => Spoken document retrieval
Problem of spoken document retrieval:
Metadata on the Internet is insufficient
Manual indexing is too costly
We use a speech recognizer to generate the index of a spoken document automatically
[Figure: a user issues a text query ("soccer") to search spoken documents on the Internet. Automatic indexing using LVCSR: spoken document -> speech recognizer -> recognized document -> index]
Automatic indexing of a spoken document
[Figure: the utterance "We like soccer ball" is recognized as "We like soccer a bowl"; the index (document vector) is built from the recognized text]
Word    Value
Soccer  8
Bowl    6
Like    2
We      1
Problems of this framework:
Recognition errors: the index contains many recognition errors
Out-Of-Vocabulary (OOV) words: words that are not included in the recognizer dictionary; OOV words are not included in the index
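The indexing problem above can be sketched in a few lines. This is a minimal illustration only, assuming whitespace tokenization and raw word counts as index values; the slides do not state how the values in the table are computed, and Japanese text would need a morphological analyzer instead of `split()`. The helper name `build_index` is hypothetical.

```python
from collections import Counter

def build_index(transcript: str) -> Counter:
    """Build a word-count index from a whitespace-tokenized transcript.

    Assumption: raw counts stand in for the index values on the slide.
    """
    return Counter(transcript.lower().split())

# What was said vs. what the recognizer produced (example from the slide).
reference = build_index("We like soccer ball")
recognized = build_index("We like soccer a bowl")

# Words that entered the index only because of misrecognition,
# and words that were spoken but never reached the index.
spurious = set(recognized) - set(reference)
missing = set(reference) - set(recognized)
```

In this toy case the recognized index gains "a" and "bowl" and loses "ball", which is the kind of error that document expansion is meant to compensate for.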
Use of multiple recognition hypotheses (Mangu et al. 2000)
[Figure: spoken document -> speech recognizer -> recognition hypotheses (1-best, 2-best, ...) -> index]
Good: Solves the problem of misrecognition
Bad: OOV words are not included in any hypothesis
Document expansion approach
Predict keywords that may appear in a target document
Add the predicted keywords to the index
[Figure: example with the utterance "We like soccer ball" and the recognition result "We like soccer a bowl"]
We have been focusing on the document expansion approach
Use of World Wide Web
We focus on the World Wide Web
More than 100 billion Web pages on the Internet
Powerful search engines such as Google, Bing, or Yahoo
~ Objective of this work ~
Develop a document expansion method using relevant Web documents
Web-based document expansion (Ito et al. 2009)
[Figure: spoken document -> speech recognizer -> recognized document; Web documents from the World Wide Web are added to the index]
•How can we collect Web documents relevant to the original speech?
•How can we compose the expanded index from the retrieved Web documents?
Overview of the system
[Figure: spoken document -> speech recognizer -> recognized document -> keywords (e.g. "Japan", "Sendai") -> search engine -> World Wide Web -> Web documents; recognized index + Web index -> expanded index]
An automatic transcription is generated using a speech recognizer
Keywords are extracted, and each keyword is used as a query for retrieving Web documents
1. Avoid choosing keywords derived from misrecognition
2. Select documents that are really relevant to the spoken document
Point 1: Keyword selection
We measure the importance of each word, and the words with the highest values are chosen as keywords for composing queries
Conventional keyword selection
Word frequency (TF-IDF) based selection (A. Ito et al. 2009)
Rank  Word    Value
1     Writer  56
2     Ball    52
3     Goal    33
…     …       …
[Figure: each selected keyword ("Writer", "Ball", "Goal") from the recognized document is sent to the search engine as a query]
Problem:
The speech recognizer tends to misrecognize rare words and OOV words
A high frequency of a word in an automatic transcription does not ensure the importance of that word
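To make the conventional selection concrete, here is a rough sketch of frequency-based ranking. It assumes a standard TF-IDF weighting with a smoothed logarithmic IDF; the exact formula of Ito et al. (2009) is not given on the slides, and the document-frequency table `doc_freq`, the collection size, and the function name are made up for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(words, doc_freq, n_docs, top_k):
    """Rank words by TF * log(N / (1 + DF)) and return the top_k words.

    Assumption: this particular TF-IDF variant is illustrative only.
    """
    tf = Counter(words)
    scores = {
        w: count * math.log(n_docs / (1 + doc_freq.get(w, 0)))
        for w, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy transcript: "writer" is a frequent misrecognition, "the" is a stop word.
words = ["writer"] * 3 + ["ball"] * 2 + ["goal"] + ["the"] * 5
doc_freq = {"writer": 10, "ball": 20, "goal": 30, "the": 999}  # hypothetical
keywords = tfidf_keywords(words, doc_freq, n_docs=1000, top_k=3)
```

Because the score depends only on counts inside the transcript and on collection statistics, a word that the recognizer produces often by mistake ("writer" here) still ranks first, which is the problem stated above.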
Relevance measurement
We propose a novel method of relevance measurement
Conventional method: word frequency of "soccer" in the recognized document = importance of "soccer" in the recognized document
Proposed method: similarity between the word distribution of the recognized document and that of the Web documents that contain "soccer" = importance of "soccer" in the recognized document
[Figure: word distribution of the recognized document ("soccer", "bowl", "goal") compared with that of the documents that contain "soccer" ("soccer", "bowl", "ball", "goal", "shoot")]
If a keyword is related to the topic of the recognized document, the documents that contain that keyword have a word distribution similar to that of the recognized document
Proposed keyword selection
We carry out a Web search using each word as a query
The score of a word is the cosine similarity between two document vectors: that of the recognized document and that of the Web documents that contain the word
This score uses an external information source to measure the relevance between a keyword and the transcription
[Figure: candidate words ("Soccer", "Ball", "Goal") from the recognized document are used as queries; the document vector of the retrieved Web documents is compared with the document vector of the recognized index]
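The proposed score can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the retrieved Web documents for one keyword are pooled into a single word-count vector (the slides do not say how they are merged), tokenization is by whitespace, and `search` is a hypothetical callable standing in for a real search engine API.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_relevance(keyword, recognized, search):
    """Score a keyword by comparing the transcript with Web pages found for it.

    `search` is a hypothetical function: keyword -> list of page texts.
    """
    web = Counter()
    for text in search(keyword):
        web.update(text.lower().split())
    return cosine(recognized, web)

# Stand-in for a search engine, with made-up page contents.
FAKE_WEB = {
    "soccer": ["soccer goal shoot ball", "soccer ball goal"],
    "bowl": ["salad bowl recipe", "bowl kitchen"],
}

def fake_search(keyword):
    return FAKE_WEB.get(keyword, [])

recognized = Counter("we like soccer a bowl goal shoot soccer".split())
scores = {w: keyword_relevance(w, recognized, fake_search) for w in ("soccer", "bowl")}
```

With this toy data the on-topic word "soccer" scores higher than the misrecognized word "bowl", because pages about bowls share almost no vocabulary with the transcript.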
Point 2: Composition of Web index
Some of the retrieved Web documents might be irrelevant to the topic of the spoken document even if they contain the specified keyword
We need to consider the relevance of each Web document to the spoken document when combining the retrieved Web documents
We should give higher weight to words in the Web index if the words are relevant to the topic of the input speech
[Figure: retrieved Web documents D1, D2, D3, ..., Dn are combined into the Web index with weights λ1, λ2, λ3, ..., λn]
Relevance-based weighting
We weight each Web document when combining the retrieved Web documents
The weight λn is the cosine similarity between the automatic transcription and the Web document Dn
Furthermore, we introduce a threshold λth into the weighting so that irrelevant documents have only a small effect on the Web index
Weight = λn if λn > λth
Weight = ε if λn < λth
[Figure: retrieved Web documents D1, D2, D3, ..., Dn are compared with the recognized index and combined into the Web index with weights λ1, ε, λ3, ..., λn]
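A minimal sketch of the thresholded weighting is shown below. Several details are assumptions: the Web index is taken to be a weighted sum of word counts, a document whose similarity equals the threshold is treated as relevant, and the default values of `threshold` and `epsilon` are placeholders (the slides do not give the value of ε). The function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def compose_web_index(recognized, web_docs, threshold=0.1, epsilon=0.0):
    """Combine Web documents, weighting each by its similarity to the transcript.

    Documents below the threshold get the small weight epsilon instead.
    """
    web_index = Counter()
    for doc in web_docs:
        lam = cosine(recognized, doc)
        weight = lam if lam >= threshold else epsilon
        for word, count in doc.items():
            web_index[word] += weight * count
    return web_index

recognized = Counter({"soccer": 2, "goal": 1, "bowl": 1})
web_docs = [
    Counter("soccer goal ball".split()),      # on topic
    Counter("recipe kitchen salad".split()),  # off topic
]
web_index = compose_web_index(recognized, web_docs)
```

Here the on-topic page contributes the new word "ball" to the Web index, while the off-topic page is suppressed by the threshold.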
Evaluation of the proposed method
Experimental conditions:
Test set: 40 automatic transcriptions of Japanese lecture speech (Word accuracy: 62.45%, OOV rate: 1.85%)
Web search engine: Yahoo! API
Experiment:
1. Evaluation of keyword selection
The rate of misrecognized words in the selected keywords
2. Evaluation of composition of Web index
Cosine similarity between the composed Web index and the ideal index
3. Evaluation of the effect of the document expansion
Cosine similarity between the expanded index and the ideal index
Evaluation of the keyword selection
[Figure: rate of misrecognized keywords in the selected keywords. x-axis: number of keywords; y-axis: rate of misrecognized words (%); curves: proposed, conventional; the gap between the curves is labeled "Improvement"]
The proposed method selected fewer misrecognized words as keywords than the conventional method
The proposed relevance measurement of a word is more effective than the conventional measurement
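The measure plotted here could be computed along the following lines. This is only a sketch: it assumes that a keyword counts as misrecognized when it does not occur in the manual transcription of the same speech, which the slides do not state explicitly, and the function name is hypothetical.

```python
def misrecognized_rate(keywords, reference_words):
    """Percentage of selected keywords absent from the manual transcript.

    Assumption: absence from the reference means the keyword is a
    recognition error.
    """
    if not keywords:
        return 0.0
    reference = set(reference_words)
    wrong = [k for k in keywords if k not in reference]
    return 100.0 * len(wrong) / len(keywords)

rate = misrecognized_rate(
    ["writer", "ball", "goal"],
    "we like soccer ball and goal".split(),
)
```

In this toy case one of the three keywords ("writer") is not in the reference, so a third of the selected keywords are misrecognitions.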
Evaluation of composition of Web index
[Figure: cosine similarity between the Web index and the ideal index. x-axis: T (number of keywords); y-axis: similarity; curves: weight = const, threshold = 0, threshold = 0.1]
The similarity decreases when many keywords are used, which seems to be caused by irrelevant Web documents
By introducing relevance-based weighting with a threshold, we could improve the similarity and avoid the degradation
The proposed relevance-based weighting is effective for composing the Web index
Effect of document expansion
[Figure: cosine similarity between the expanded index and the ideal index. x-axis: r (combination weight); y-axis: similarity; curves: weight = const, threshold = 0, threshold = 0.1, no expansion]
"No expansion" denotes the conventional method, which uses only the recognized index
Relevance-based weighting with the threshold gave the highest similarity
We obtained an improvement of 0.051 points compared with the conventional recognized index
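The slides name a combination weight r but do not give the formula that merges the recognized index and the Web index. One plausible reading, shown purely as an assumption, is a linear interpolation of the two length-normalized vectors, with r = 0 reducing to the recognized index alone. The helper names `normalize` and `expand_index` are hypothetical.

```python
import math

def normalize(index):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in index.values()))
    return {w: v / norm for w, v in index.items()} if norm else dict(index)

def expand_index(recognized, web, r):
    """Assumed combination: (1 - r) * recognized + r * web, after normalization."""
    rec, web_n = normalize(recognized), normalize(web)
    vocabulary = set(rec) | set(web_n)
    return {
        w: (1 - r) * rec.get(w, 0.0) + r * web_n.get(w, 0.0)
        for w in vocabulary
    }

recognized = {"soccer": 2, "bowl": 1}
web = {"soccer": 1, "ball": 1}
expanded = expand_index(recognized, web, r=0.5)
unexpanded = expand_index(recognized, web, r=0.0)
```

Under this reading, the expanded index keeps the recognized words and also gives nonzero weight to "ball", a word the recognizer missed.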
Retrieval experiment
We conducted an actual spoken document retrieval experiment to evaluate the effect of document expansion
Evaluation set: CSJ test collection (Akiba et al. 2008)
Target spoken documents: 2702 lecture speeches
Word accuracy was 75.12%, and OOV rate was 0.23%
Conventional method: No expansion (recognized index)
Proposed method: Document expansion (expanded index)
The number of keywords for finding relevant Web documents was set to 30
The number of retrieved Web documents per query was set to 50
[Figure: 39 queries are run against the 2702 lecture speeches, giving 39 retrieval results]
Evaluation metric
We calculate precision rates and recall rates from each retrieval result
Retrieval result:
Rank  Is this relevant?
1     ○ (Relevant)
2     × (Irrelevant)
3     ○
4     ○
5     ×
…     …
Because precision and recall have a trade-off relationship, we calculate the 11-point average precision (11pt AP)
Interpolated precision:
$$\mathrm{IP}(R) = \max_{R' \ge R} \mathrm{P}(R')$$
Average precision:
$$\mathrm{11ptAP} = \frac{1}{11} \sum_{k=0}^{10} \mathrm{IP}\!\left(\frac{k}{10}\right)$$
R is the recall rate
P(R) is the precision rate when the recall rate is R
[Figure: interpolated precision plotted against recall from 0 to 1.0]
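The 11-point average precision can be computed directly from a ranked list of relevance judgments. The sketch below follows the usual definition (interpolated precision at recall levels 0, 0.1, ..., 1.0, then averaged); the function name is hypothetical, and the example assumes that the collection holds exactly three relevant documents, which the slide's truncated ranking does not state.

```python
def eleven_point_ap(ranked_relevance, n_relevant):
    """11-point interpolated average precision for one query.

    ranked_relevance: booleans in rank order (True = relevant).
    n_relevant: total number of relevant documents for the query.
    """
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        points.append((hits / n_relevant, hits / rank))

    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

# Ranking from the slide: relevant, irrelevant, relevant, relevant, irrelevant.
ap = eleven_point_ap([True, False, True, True, False], n_relevant=3)
```

For this ranking the interpolated precision is 1.0 at recall levels up to 0.3 and 0.75 at the remaining seven levels, so the score is 9.25 / 11.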
Experimental result
[Figure: 11pt AP, averaged over the 39 queries. x-axis: r; y-axis: 11pt AP; curves: weight = const, threshold = 0, threshold = 0.1, no expansion, manual transcription]
The ideal result (manual transcription) is calculated using the indices of the manual transcriptions
The expanded index gave higher precision than the recognized index alone
When the constant weight was used, we obtained an improvement of 0.038 points compared with the no-expansion method
The result differed from our expectation: the constant-weight combination method gave the highest precision
Conclusions
We proposed a Web-based document expansion method for spoken document retrieval
We proposed a method for selecting keywords for retrieving Web documents relevant to the spoken document
We proposed a method for generating a Web index that is similar to the ideal index generated from a manual transcription
The proposed Web-based document expansion method gave an index more similar to the correct index than the no-expansion method
In a spoken document retrieval experiment, the retrieval precision improved when the Web-based document expansion scheme was used