
Schedule for 2017 Web Information Extraction and Retrieval: Nutch presentation


(1)


(2)

» A full-fledged web search engine

» Functionalities of Nutch

˃ Internet and Intranet crawling

˃ Parsing different document formats (PDF, HTML, XML, JS, DOC, PPT, etc.)

˃ Web interface for querying the index


(3)

» 4 main components

˃ Crawler

˃ Web Database (WebDB, LinkDB, segments)

˃ Indexer

˃ Searcher

» Crawler and Searcher are highly decoupled, enabling independent scaling

(4)

Doug Cutting, "Nutch: Open Source Web Search", 22 May 2004, WWW2004, New York


(5)

1. Create a new WebDB (admin db -create).

2. Inject root URLs into the WebDB (inject).

3. Generate a fetchlist from the WebDB in a new segment (generate).

4. Fetch content from URLs in the fetchlist (fetch).

5. Update the WebDB with links from fetched pages (updatedb).

6. Repeat steps 3-5 until the required depth is reached.

7. Update segments with scores and links from the WebDB (updatesegs).

8. Index the fetched pages (index).

9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).

10. Merge the indexes into a single index for searching (merge).
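The ten steps above can be sketched as a single script. The tool names (admin, inject, generate, fetch, updatedb, updatesegs, index, dedup, merge) are the ones given in parentheses in the steps; the exact arguments (db, segments, urls) are assumptions and vary by Nutch version. RUN defaults to echo so the script dry-runs by printing each command; set RUN= (empty) to execute against a real Nutch installation.

```shell
#!/bin/sh
# Sketch of the crawl cycle above. RUN=echo prints the commands instead of
# running them; NUTCH, DEPTH, and the argument names are assumptions.
RUN=${RUN:-echo}
NUTCH=bin/nutch
DEPTH=3

$RUN $NUTCH admin db -create         # 1. create a new WebDB
$RUN $NUTCH inject db urls           # 2. inject root URLs into the WebDB
i=1
while [ "$i" -le "$DEPTH" ]; do      # 6. repeat steps 3-5 to the required depth
  $RUN $NUTCH generate db segments   # 3. generate a fetchlist in a new segment
  $RUN $NUTCH fetch segments         # 4. fetch content from the listed URLs
  $RUN $NUTCH updatedb db segments   # 5. update the WebDB with fetched links
  i=$((i + 1))
done
$RUN $NUTCH updatesegs db segments   # 7. update segments with scores and links
$RUN $NUTCH index segments           # 8. index the fetched pages
$RUN $NUTCH dedup segments           # 9. eliminate duplicate content and URLs
$RUN $NUTCH merge index segments     # 10. merge into a single index
```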

(6)

» Use an ssh tool (e.g. Xshell) to log in to the Hadoop server.

˃ address (140.115.51.18), username, password

» Include the Hadoop commands in your account's environment.

˃ Type the command:

+ source /opt/hadoop/conf/hadoop-env.sh

˃ Then you can access the Hadoop Distributed File System (HDFS).


(7)

Command                                              Description
hadoop fs -mkdir folder_name                         Create a folder on HDFS
hadoop fs -put source_folder_name dest_folder_name   Upload a folder and its files to HDFS
hadoop fs -get source_folder_name dest_folder_name   Download a folder from HDFS
hadoop fs -rmr folder_name                           Delete a folder
hadoop fs -ls (folder_name)                          List the files

(8)

» Single crawler

˃ cd apache-nutch-1.7/runtime/local/urls/

˃ vim seed.txt //add urls into seed.txt

˃ cd .. //go back to local folder

˃ bin/nutch crawl urls/seed.txt -dir crawl -depth 3 -topN 5

˃ Your results will be stored in the crawl folder under the local folder.

» Distributed crawler

˃ cd apache-nutch-1.7/runtime/deploy/urls/

˃ vim seed.txt //add urls into seed.txt

˃ cd .. //go back to deploy folder

˃ hadoop fs -put urls urls

˃ bin/nutch crawl urls -dir crawl -depth 3 -topN 5

˃ Your results will be stored in the crawl folder on HDFS.
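In both modes the crawler reads seed URLs from seed.txt, one URL per line. A minimal way to create the file without an editor (the example URL is an assumption; use the sites you actually want to crawl):

```shell
# Create the seed list: one URL per line in urls/seed.txt.
# The nutch.apache.org URL is only an example seed.
mkdir -p urls
printf 'http://nutch.apache.org/\n' > urls/seed.txt
cat urls/seed.txt
```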


(9)

» cd apache-nutch-1.7/conf

» vim regex-urlfilter.txt

˃ Replace the default accept-everything rule

+ # accept anything else
+ +.

˃ with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

+ +^http://([a-z0-9]*\.)*nutch.apache.org/

» cd apache-nutch-1.7
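The filter pattern can be sanity-checked locally before starting a crawl. A quick check with grep -E, using the regex from the slide (the leading + in regex-urlfilter.txt means "accept" and is not part of the regex itself; the test URLs are examples):

```shell
# Sanity-check the urlfilter regex against sample URLs with grep -E.
PATTERN='^http://([a-z0-9]*\.)*nutch.apache.org/'
echo 'http://nutch.apache.org/about.html' | grep -E "$PATTERN"   # accepted
echo 'http://wiki.nutch.apache.org/'      | grep -E "$PATTERN"   # accepted
echo 'http://example.com/' | grep -qE "$PATTERN" || echo 'rejected'
```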

(10)

» Job tracker: http://140.115.51.18:50030

» DFS health: http://140.115.51.18:50070

» Join our FB group: https://www.facebook.com/groups/476063349167678/


(11)

» Hadoop tutorial

˃ http://trac.nchc.org.tw/cloud/wiki/waue/2009/0617

» Nutch tutorial

˃ http://wiki.apache.org/nutch/NutchTutorial

˃ http://www.cnblogs.com/sirhuoshan/archive/2013/04/24/3040158.html

˃ http://trac.nchc.org.tw/cloud/wiki/0428Hadoop_Lab7

» Nutch Docs: http://lucene.apache.org/nutch/

» Nutch Wiki: http://wiki.apache.org/nutch/

» Prasad Pingali, CLIA consortium, Nutch Workshop,
