» A full-fledged web search engine
» Functionalities of Nutch
˃ Internet and Intranet crawling
˃ Parsing different document formats (PDF, HTML, XML, JS, DOC, PPT, etc.)
˃ Web interface for querying the index
» 4 main components
˃ Crawler
˃ Web Database (WebDB, LinkDB, segments)
˃ Indexer
˃ Searcher
» Crawler and Searcher are highly decoupled, enabling independent scaling
Doug Cutting, "Nutch: Open Source Web Search", 22 May 2004, WWW2004, New York
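As a rough sketch of how the four components fit together (the arrows are a summary, not from the original slide): the crawler writes fetched pages into segments and link structure into the WebDB/LinkDB, the indexer turns segments into a searchable index, and the searcher serves queries from that index.
seed URLs -> Crawler -> segments (fetched, parsed pages)
Crawler <-> WebDB / LinkDB (known URLs and their links)
segments -> Indexer -> index -> Searcher (web query interface)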
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
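For reference, the cycle above can be expanded into individual commands. This is only a sketch using the 0.x-era tool names quoted in the steps; the exact flags and the fixed depth-3 loop are illustrative assumptions, and Nutch 1.7 (used later in this tutorial) wraps the whole loop in the single bin/nutch crawl command.
bin/nutch admin db -create            # 1. create a new WebDB
bin/nutch inject db urls              # 2. inject root URLs (flags vary by version)
for i in 1 2 3; do                    # 6. repeat steps 3-5 to the required depth
  bin/nutch generate db segments      # 3. generate a fetchlist in a new segment
  s=`ls -d segments/* | tail -1`      #    pick the newest segment
  bin/nutch fetch $s                  # 4. fetch content from the fetchlist
  bin/nutch updatedb db $s            # 5. update the WebDB with new links
done
bin/nutch updatesegs db segments      # 7. update segments with scores and links
bin/nutch index segments/*            # 8. index the fetched pages
bin/nutch dedup segments dedup.tmp    # 9. eliminate duplicate content and URLs
bin/nutch merge index segments/*      # 10. merge into a single searchable index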
» Use an SSH client (e.g., Xshell) to log in to the Hadoop server.
˃ address (140.115.51.18), username, password
» Make the Hadoop commands available in your account.
˃ Type the command:
+ source /opt/hadoop/conf/hadoop-env.sh
˃ Then you can access the Hadoop Distributed File System (HDFS).
Command                                              Description
hadoop fs -mkdir folder_name                         Create a folder on HDFS
hadoop fs -put source_folder_name dest_folder_name   Upload a folder, including its files, to HDFS
hadoop fs -get source_folder_name dest_folder_name   Download a folder from HDFS
hadoop fs -rmr folder_name                           Delete a folder
hadoop fs -ls (folder_name)                          List the files (folder_name is optional)
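A typical round trip with these commands looks like the following (the folder names urls and crawl are just the ones used later in this tutorial):
hadoop fs -mkdir urls        # create a folder on HDFS
hadoop fs -put urls urls     # upload the local urls folder and its files
hadoop fs -ls                # list your HDFS home directory
hadoop fs -get crawl crawl   # download the crawl folder from HDFS
hadoop fs -rmr crawl         # delete the folder from HDFS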
» Single crawler
˃ cd apache-nutch-1.7/runtime/local/urls/
˃ vim seed.txt //add URLs into seed.txt
˃ cd .. //go back to local folder
˃ bin/nutch crawl urls/seed.txt -dir crawl -depth 3 -topN 5
˃ Your results will be stored in the crawl folder under the local folder.
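As a concrete sketch (the seed URL below is a placeholder): seed.txt holds one URL per line, -depth 3 follows links for three rounds starting from the seeds, and -topN 5 fetches at most the five top-ranked pages in each round.
# urls/seed.txt
http://nutch.apache.org/

# 3 crawl rounds, at most 5 pages fetched per round
bin/nutch crawl urls/seed.txt -dir crawl -depth 3 -topN 5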
» Distributed crawler
˃ cd apache-nutch-1.7/runtime/deploy/urls/
˃ vim seed.txt //add URLs into seed.txt
˃ cd .. //go back to deploy folder
˃ hadoop fs -put urls urls
˃ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
˃ Your results will be stored in the crawl folder on HDFS.
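Because the output lives on HDFS rather than on the local disk, use the HDFS commands above to inspect or retrieve it, for example:
hadoop fs -ls crawl          # inspect the crawl output on HDFS
hadoop fs -get crawl crawl   # copy the results to the local filesystem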
» cd apache-nutch-1.7/conf
» vim regex-urlfilter.txt
˃ Find the default catch-all rule:
+ # accept anything else
+ +.
˃ Replace "+." with a regular expression matching the domain you wish to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should read:
+ +^http://([a-z0-9]*\.)*nutch.apache.org/
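After the edit, the tail of conf/regex-urlfilter.txt would look roughly like this (the skip rule shown is paraphrased from the stock file; only the final line is the change described above):
# skip file:, ftp:, and mailto: URLs (stock rule)
-^(file|ftp|mailto):
# accept only pages in the nutch.apache.org domain
+^http://([a-z0-9]*\.)*nutch.apache.org/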
» cd apache-nutch-1.7
» Job tracker: http://140.115.51.18:50030
» DFS health: http://140.115.51.18:50070
» Join our FB: https://www.facebook.com/groups/476063349167678/
» Hadoop tutorial
˃ http://trac.nchc.org.tw/cloud/wiki/waue/2009/0617
» Nutch tutorial
˃ http://wiki.apache.org/nutch/NutchTutorial
˃ http://www.cnblogs.com/sirhuoshan/archive/2013/04/24/3040158.html
˃ http://trac.nchc.org.tw/cloud/wiki/0428Hadoop_Lab7