• 検索結果がありません。

Publication Web Intelligence and Data Mining Laboratory

N/A
N/A
Protected

Academic year: 2018

シェア "Publication Web Intelligence and Data Mining Laboratory"

Copied!
4
0
0

読み込み中.... (全文を見る)

全文

(1)

Improving the Effectiveness of POI Search by

Associated Information Summarization

Hsiu-Min Chuang National Central University

Taoyuan, Taiwan

Email: showmin1205@gmail.com

Chia-Hui Chang∗ National Central University

Taoyuan, Taiwan Email: chia@csie.ncu.edu.tw

Chung-Ting Cheng National Central University

Taoyuan, Taiwan Email: glant1021@hotmail.com

Abstract—The demand for map services has risen significantly in recent years due to the popularity of mobile devices and wireless networks. Since there are always emerging point-of-interest (POI) in the real world, mining POIs shared by users from the Web has been a challenging problem to enrich ex-isting POI database. However, crawling address-bearing pages and extracting POI relations are only the fundamentals for constructing POI database, the description of POIs, i.e. the services and products that POIs provide are especially essential for POI search. In this paper, we propose the summarization of POI associated information to improve the search effectiveness to avoid mismatches between users’ queries and the relevant POIs by integrating our constructed POI database with Google place API to collect POIs shared by users from the Web. Experimental results showed that the integrated POI search service outperforms other similar servicess, such as Wikimapia and a commercial app, What’s the number.

I. INTRODUCTION

With the rapid growth of mobile applications and services, digital maps and online maps have replaced conventional paper maps and directories. OpenStreetMap was the first collaborative project that created a free, editable digital map of the world in 2004. Google released its map search service in 2005, and is currently the most extensive POI search service. With the ever-increasing popularity of mobile devices, POI search has become a common need in our daily lives.

Many similar map search services have been developed, such as Bing Maps, Google Maps and Wikimapia. However, the number of POIs and its descriptions are still insufficient because these POIs depend on manual editing or crowdsourc-ing. Moreover, the categories of POIs are often limited in range to food, landmarks, hotels, etc. One observation is that many stores can be found on the Web by search engines, but it could not be found on the maps.

In addition to the incomplete search results, another problem for POI search is mismatch between the query and desirable POIs. For example, search for query “日本料理” (Japanese

cuisine) cannot find the POI “阿婆壽司” (Grandma sushi), because the POI name does not match “日本料理.” Thus,

enriching the descriptions of POIs or query expansion is required to solve the mismatch problem. In this paper, the description of POI is denoted as the associated information, i.e., the products/services of the POI, the business hours of the POI, etc. Thus, we improve the recall of POI search by summarizing POI associated information.

With respect to the precision of POI search, the POI ranking is a critical problem. The search result should consist of intel-ligently ranked POIs according to certain criteria (relevance, distance, and rating) and be tuned in scope (e.g., local or global search). Sometimes, the results are based on competing considerations. For example, when users’ queries belong to common categories, products or services, such as restaurant, coffee shops or free WiFi sources, there may be too many matching POIs, such that ranking nearby POIs according to user’s location is appropriate. For a scenario where specific POIs are being searched, no matched POIs exist around user’s scope, such that the search scope needs to be expanded until matching POIs are found or a global scope is reached.

In this paper, we address two critical problems: the insuf-ficient number of POIs and POI ranking based on different scenarios. To enhance the number of POIs, we complement the associated information of POIs and integrate three sources: an offline collection database, an online search module, and Google Place API. Moreover, the ranking framework combines distance with POI’s relevance in measuring the effectiveness of given search queries.

The experimental result showed that our service can achieve a normalized discounted cumulative gain (NDCG) of 0.932, and outperforms other map search services: Wikimapia (0.476), “What’s the Number” (0.474) and Google Maps (0.844). Because of enriching the associated information of POIs in our database, the average precision for top 10 results is improved from 0.76 to 0.8. The result showed that the associated information of POIs is helpful for indexing more relevant POIs.

Based on our proposal, this paper contributes in multiple ways to solve the challenges facing POI search services.

• We integrate multiple sources and rank search results based on certain criteria (relevance or distance) and search scope (local or global) to improve the number of POIs.

• We train a binary classifier model, which learns to rank the results of POI searches. Experimental results showed that our system outperforms other services.

• In terms of enriching the associated information of POIs, we extracted search snippets from search engines, and summa-rized the top three most representative sentences by TF-IDF (term frequency – inverse document frequency) weighting and LDA (latent Dirichlet allocation [4]) smoothing.

(2)

II. RELATEDWORK

Geographic information retrieval (GIR) can be regarded as an extension of IR [2]. The major difference between information retrieval (IR) and GIR is that the goal of IR is to rank relevant documents for a given query, whereas the aim of GIR is to discover the descriptions of geographic entities [9]. By contrast, POI searches focus on locating places on maps. To do so, an index must be generated for POIs after crawling and extracting them from unstructured Web documents. This is the major difference between GIR and state-of-the-art map search services like Google and Yahoo, which acquire POI databases from governments and yellow pages. In addition to geoparse Web documents, some map services also acquire POIs by crowdsourcing.

To enrich map search databases, geoparsing Web documents to recognize POIs has become an important research topic. For example, [12] proposed curating POIs by bootstrapping training data from Web snippets seeded by POIs gathered from social media. By querying Bing search engine with seed titles to retrieve the top ten snippets, they trained a POI recognition model via a CRF (conditional random field) tagger. [1] proposed a location-based search engine that automatically derives spatial context from unstructured Web resources by heuristically crawling Webpages with spatial relations. With this search engine, they further proposed a directory-driven geoparser to extract addresses from across Germany.

In previous studies, we proposed the query-based crawling strategy to collect address-bearing pages for potential POI extraction [7] by submitting queries that contain address keywords such as “street,” “road,” city names and various categories. Then, we extracted postal addresses, POI names by Web named entity recognition (NER) model construction [8]. Once the NER models are built, the coupling of POI name and address is considered as a binary classification problem to predict the correct POI relation.

However, owing to short queries and short POI names, many POIs may be not retrieved if no description is associated with the POI. In order to improve the effectiveness of POI search, enriching and summarizing the associated description of POIs are necessary. In previous studies [5], they proposed an unsupervised learning method to segment the associated information of POIs from semi-structural pages. Here, we address the problem of POI associated information summariza-tion from collected address-bearing pages and Google snippets to identify the most relevant sentences.

In general, there are two approaches to automatic sum-marization: extraction and abstraction. Extractive methods [11] are used to identify important sentences from texts. By contrast, abstractive methods [10] aim to create a summary by semantic representation and natural language generation.

In addition, one of main targets of any search service is to satisfy users’ intents based on certain criteria. Bauer et al. [3] proposed the evaluation method for balancing POI’s relevance and the distance because offline-purchasing needs require the mapping from a product query to a set of locations.

III. SYSTEMARCHITECTURE

To improve the effectiveness of POI search, we propose the system architecture as shown in Figure 1. The front end is an app called PowerPOI1 (疾疾), which provides

a search interface to submit user information and present search results on maps. The backend is a GIR server that integrates and ranks POI search results from our POI database and Google places API. Four modules are involved in the POI search: query-based crawler, POI extraction, associated information summarization, and POI ranking.

First, the query-based crawler is responsible for collecting address-bearing pages for potential POI extraction. We can make use of search engines to find address-bearing pages as proposed in [7]. Then, the POI extraction module is applied to train postal-address and POI-name models for POI extraction. Next, the coupling of POI name and address is considered as a binary classification problem to predict the correct POI relation. To enrich the associated information of POI pairs and rank search results, we focus on the summarization and ranking modules in this paper.

Fig. 1. System architecture

A. POI Associated Information Summarization

Associated information summarization tries to augment the extracted POI with description for enriching the POI pairs. The idea is to support POI search with product and service queries which are common for daily life. For an IR system, more rele-vant information will be helpful to retrieve the desirable POIs. We adopt a strategy that uses the address and its corresponding POI name as landmarks to segment the region of associated information. First, the address-bearing page is converted into the structure of the DOM tree. The HTML elements of the address and the POI name are then denoted as two leaf nodes. Then, we define the lowest common ancestor node of both nodes as a new root, and extract the text block of the sub-tree. In many cases, the extracted associated information in the above manner are insufficient due to limitations in address-bearing pages. Thus, in addition to address-address-bearing pages, Google snippets are also sources of associated information. In order to accurately collect the descriptions of POIs, we utilize the Google search engine by submitting “address” and “POI name” as a compound query to crawl top 10 search snippets. After extracting the associated information of POIs, the POIs can be retrieved easily because they now have more

(3)

relevant keywords associated with them. However, information overload and irrelevant information can also reduce the quality of POI searches. Thus, a critical task is to summarize the associated information. We calculate the probability by the query likelihood model [2]. Given a POIp, the probability of sentenceP(s|p)is proportional to P(p|s). To avoid the zero-probability estimation, whereby if the word representing a POI is missing from the sentence, we linearly combine the TF-IDF model with the topic model by multiplying by parameter λ:

P(p|s) =λ(fp,s |s| ×log

|C|

cp

) + (1−λ)PLDA(p|s) (1)

wherefp,s is the number of times wordpoccurs in sentence

set s, and |s| is the number of words in s. Moreover, cp is

the number of sentences containing POIp, and|C|is the total number of sentences. TF-IDF is a common numerical statistic that reflects how important words are with respect to sentences in the corpus. Because we do not have large amounts of text for language model probability estimation and the diversity of words, vocabulary mismatch and the bias of rarely occurring words are important issues in general information retrieval. Thus, smoothing is used to avoid the zero probability and overcoming data sparsity.

We adopt the LDA model for smoothing with a mixture of topics. We assume that each sentence is associated with a single topic. The joint probability PLDA(p|s) is calculated

from the Dirichlet distribution for the given corpus. The input of LDA is a term×document matrix, and it generates the document-topic distribution θ and topic-word distribution φ. The probability PLDA(p|s) is estimated as the product of

P(p|z, φ)andP(z|θ)with respect to hidden topic z:

PLDA(p|s) =P(P OI|θd, φk) = z

P(P OI|z, φk)P(z|θd) (2)

φk is the topic-word distribution for k topics, and θd is the

sentence-topic distribution fordsentences. Thus, the sentences are computed and ranked according top(s|p), and the top three sentences are selected as the POI associated information.

B. POI Ranking

A POI search on a map is specified by a user’s query and the geographic position where the query is submitted. Therefore, a POI search must consider both content and spatial relevance. The general expectation is that when several POIs satisfy a query, a POI that is closer to user’s position should be ranked higher than one that is farther away. On the contrary, if we search for some unique POI that cannot be found in the user’s vicinity, we expect the system to search within an expanded area until some POI satisfies the information needed. In such cases, we should ignore the spatial factor but focus on the content. To efficiently retrieve the POI database, we used the Solr search platform to index POIs. In order to avoid null search result, we combine three search results from our constructed POI database, Google Maps API and online extraction module which is proposed in [6]. To provide related POIs that satisfy the information needed for search regions

and ranking criteria, we propose a search algorithm to switch between the two modes based on the relevance and the distance between POIs and user location as shown in Algorithm 1.

Algorithm 1Search (q, r, GP S, i)

1: inputs

1: user queryq, user’sGP S, search scoper, count i,δ 2: outputs

2: POI list

3: if i= 0 then

4: EXIT

5: end if

6: MS = Solr∪ Google Place API∪online search

7: C = Ranking (MS,δ)

8: if C = null then

9: Search(q,r×3,GP S,i-1)

10: else

11: C order by the relevance and the distance

12: end if

Initially, the given parameters consist of query q, search region r, user location based on GPS, relevance threshold δ= 0.5, and counteri >0. The GIR system then simultane-ously accesses multiple search (MS) engines (i.e., our Solr IR system, the online extraction module, and Google Place API) for the union of the results. The ranking module then sorts the POIs by a binary classifier to predictive relevance which if the relevance is more than 0.5, indicates relevance, and irrelevance otherwise. If there is no relevant result in the given search region, the procedure will continue to invoke itself by expanding search region r by three times until counter i is zero. Otherwise, the search results are ranked by relevance first. If they are the same, they are ranked by distance. For the counter i, ifi >1, it is local search with a given search region. Ifi= 1, it means that the search region is global when i= 0, the algorithm will terminate.

IV. EXPERIMENTS

We evaluated the performance with/without associated in-formation for two datasets: POI withoutAI contained only 1.2 millions POIs without corresponding associated informa-tion, and POI withAI contained POIs as well as associated information. To evaluate performance, we adopted the number of returned POIs and average precision of top 10 results (AP@10) as measures. Moreover, nine locations were selected containing downtowns, areas near downtowns, and outskirts of cities. Each location contained three search regions (200 m, 500 m, and 1000 m). For query testing, 18 common queries were used, such as “beef noodle,” “clinics,” “gas station,” etc. In the second experiment to test the performance of POI searches, we compared our method with other three well-known POI search services: Wikimapia, “What’s the Num-ber?” and Google Maps. We used general 20 keywords (e.g., including restaurants, clinics, hotels, parking lots, and su-permarkets) and 20 specific POI names (e.g., McDonald’s, Levi’s, and Carrefour) as queries for testing. Moreover, POI

(4)

search was conducted in both urban (e.g., train stations) and rural areas (e.g., universities) within a range of 10 kilometers. To confirm that the top-ranked documents provided the most relevant results, we used NDCG@10 for evaluation.

Fig. 2. Comparison with/without POI associated information.

The experimental results of enriching the associated in-formation of POIs are shown in Figure 2. When the search region was 1 km, the AP@10 of POI withoutAI was approx-imately 1. By contrast, the AP@10 of POI withAI reached approximately 0.8. One reason for this is that the associ-ated information of POIs sometimes contains words with irrelevant or multiple topics that affect the ranking results. Although the AP@10 of POI withAI was slightly lower than POI withoutAI, the return number of POI withAI was twice greater than the number of POI withoutAI. The result showed that the associated information of POIs is beneficial in retrieving more POIs. Moreover, some irrelevant POIs were filtered by classifying the relevance of POIs in the experiment.

Fig. 3. Performance comparison for four POI search services.

As shown in Figure 3, our proposed PowerPOI service (0.932) outperformed Wikimapia (0.476) and “What’s the Number?” (0.474) for both common and specific POIs in urban and rural areas. PowerPOI outperformed Google Maps (0.844) for common POIs but was inferior for specific POIs. For search regions, PowerPOI outperformed Google Maps for rural areas but was inferior for urban areas. That some POIs cannot be found in Google Maps is the motivation underlying our study to complementing the LBS, through online extraction from search engines that might have a higher probability of finding

emerging POIs. Thus, PowerPOI will reach a level comparable to Google Maps by integrating multiple sources.

V. CONCLUSION

In recent years, location-based services have received in-creased attention. However, existing location databases are based on crowdsourcing. The limitations on prevalent location databases can be mitigated by integrating multiple search results, including extracted POIs from the Web. In this study, we focused on POI searches via associated information sum-marization and POI ranking to improve the recall and precision of POI search. In experiments, we evaluated the performance of with/without the associated information and compared with three POI search services: Wikimapia, “What’s the Number?” and Google Maps. Out proposed system (NDCG=0.932) out-performed others in terms of both common keywords and specific POIs, and regardless of regions (urban or rural area). In sum, we expect to reach a comparable level of performance to Google Maps for POI search services. In future work, we will seek to combine search results with social media information to obtain more relevant POIs and user preferences for personalized POIs recommendation. In addition, due to manual annotation and aliases, deduplicating place database can be used to improve the effectiveness of POI search.

REFERENCES

[1] Ahlers, D. and Boll, S. (2007). Location-based Web search.The Geospatial Web. Springer, 55-66.

[2] Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern information retrieval. ACM Press.Addison Wesley. [3] Bauer S., Radlinski F. and White R. W. (2016). Where

can I buy a boulder? Searching for offline retail locations.

WWW.

[4] Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation.JMLR. 3(4-5): 993-1022.

[5] Chang, C.-H., Huang, C.-Y. and Su, Y.-S. (2012). On Chi-nese postal address and associated information extraction.

JSAI.

[6] Chang, C.-H., Chuang, H.-M., Huang, C.-Y., Su, Y.-S. and Li., S.-Y. (2016). Enhancing POI search on maps via online address extraction and associated information extraction.Applied Intelligence. 44(3). 539-556.

[7] Chuang, H.-M., Chang, C.-H. and Kao, T.-Y. (2014). Ef-fective Web crawling for Chinese addresses and associated information.EC-Web.

[8] Huang, Y.-Y., Chang, C.-H. and Chou, C.-L. (2015). A tool for Web NER model generation using search snippets of known entities.Rocling.

[9] Jones, C. B. and Purves, R. S., (2008). Geographical information retrieval.IJGIS, 22(3), 219–228.

[10] Marcu, D. (1999). The automatic construction of large-scale corpora for summarization research.SIGIR. [11] Nenkova, A. and Bagga, A. (2003). Facilitating email

thread access by extractive summary generation.RANLP. [12] Rae, A., Murdock, V., Popescu, A. and Bouchard, H.

(2012). Mining the Web for points of interest.SIGIR.

Fig. 1. System architecture
Fig. 2. Comparison with/without POI associated information.

参照

関連したドキュメント

q-series, which are also called basic hypergeometric series, plays a very important role in many fields, such as affine root systems, Lie algebras and groups, number theory,

Furuta, Log majorization via an order preserving operator inequality, Linear Algebra Appl.. Furuta, Operator functions on chaotic order involving order preserving operator

The approach based on the strangeness index includes un- determined solution components but requires a number of constant rank conditions, whereas the approach based on

I give a proof of the theorem over any separably closed field F using ℓ-adic perverse sheaves.. My proof is different from the one of Mirkovi´c

Keywords: continuous time random walk, Brownian motion, collision time, skew Young tableaux, tandem queue.. AMS 2000 Subject Classification: Primary:

Thus, we use the results both to prove existence and uniqueness of exponentially asymptotically stable periodic orbits and to determine a part of their basin of attraction.. Let

Section 3 is first devoted to the study of a-priori bounds for positive solutions to problem (D) and then to prove our main theorem by using Leray Schauder degree arguments.. To show

This paper presents an investigation into the mechanics of this specific problem and develops an analytical approach that accounts for the effects of geometrical and material data on