• 検索結果がありません。

Applying Pattern Mining to Web Information Extraction

N/A
N/A
Protected

Academic year: 2018

シェア "Applying Pattern Mining to Web Information Extraction"

Copied!
12
0
0

読み込み中.... (全文を見る)

全文

(1)

Information Extra tion

Chia-Hui Chang,Shao-ChenLui,andYen-ChinWu

Dept.ofComputerS ien eandInformationEngineering

NationalCentralUniversity,Chung-Li,320,Taiwan

hia sie.n u.edu.tw,fanyway,trisdangdb. sie.n u.edu.tw

Abstra t. Informationextra tion(IE)fromsemi-stru turedWebdo -

umentsisa riti alissue forinformationintegrationsystemsontheIn-

ternet. Previous workin wrapper indu tion aim to solve this problem

byapplyingma hinelearning toautomati allygenerateextra tors.For

example,WIEN,Stalker,Softmealy,et .However,thisapproa hstillre-

quireshumaninterventiontoprovidetrainingexamples.Inthispaper,we

proposeanovelideatoIE,byrepeatedpatternminingandmultiplepat-

ternalignment.Thedis overyofrepeatedpatternsarerealizedthrougha

datastru ture allPATtree.Inaddition,in ompletepatternsarefurther

revisedbypatternalignmentto omprehendall patterninstan es.This

newtra ktoIEinvolvesnohumane ortand ontent-dependentheuris-

ti s.Experimentalresultsshowthatthe onstru tedextra tionrules an

a hieves97per entextra tionoverfourteenpopularsear hengines.

Keywords:informationextra tion,semi-stru tureddo uments,wrapper

generation,pattern dis overy,multiple alignment

1 Introdu tion

Information extra tion (IE) is on erned with extra ting from a olle tion of

do umentstheinformationrelevanttoaparti ularextra tiontask.Forinstan e,

the meta-sear h engineMetaCrawlerextra ts the sear h results from multiple

sear hengines;andtheshoppingagentJungleeextra tstheprodu tinformation

from several online stores for omparison. With the growth of the amount of

onlineinformation,theavailabilityofrobust, exibleIEhasbe omeastringent

ne essity.

Contrastto\traditional"informationextra tionwhi hrootsin naturallan-

guagepro essing(NLP)te hniquessu haslinguisti analysis,Internetinforma-

tionextra tionrelyonsynta ti stru turesidenti ationmarkedbyHTMLtags.

Thedi eren eisdue tothenatureof Websu hthat thepage ontentshaveto

be lear at glan e. Thus, \itemized list" and \tabular format" havebeen the

mainpresentationstyleforWebpagesontheInternet.Su hpresentationstyles

together with the multiple re ords ontained in one do uments ontribute the

(2)

rulesmustbetailoredforea hparti ularpage olle tion,Toautomatethe on-

stru tion of extra tors(or wrappers), re ent resear h has identi ed important

wrapper lassesandindu tionalgorithms.Forexample,Kushmeri ket.al.iden-

ti ed afamily of wrapper lassesand the orresponding indu tion algorithms

whi hgeneralizefrom labeledexamplestoextra tionrules[9℄. Moreexpressive

wrapperstru ture areintrodu ed lately.SoftmealybyHsu andDung [6℄usesa

wrapperindu tion algorithmtogenerateextra torsthatareexpressedas nite-

state transdu ers. Meanwhile, Muslea et al. [10℄ proposed \STALKER" that

performshierar hi alinformation extra tiontoredeemSoftmealy'sinability to

use delimiters that do not immediately pre ede and follow the relevant items

withextras ansoverthedo uments(see[11℄ fora ompletesurvey).

In all this work, wrappers are indu ed from training examples su h that

landmarks or delimiters anbe generalized from ommon pre xes or suÆxes.

However,labelingthesetrainingexamplesissometimestime- onsuming.Hen e,

anothertra kofresear hisexploringnewapproa hestofullyautomatewrapper

onstru tion.Forexample,Embley et.al.des ribe aheuristi approa h todis-

overre ord boundariesin Web do uments by identifying andidate separator

tags using ve independent heuristi s and sele ting a onsensus separator tag

basedonaheuristi ombination[3℄.However,oneserious probleminthis one-

tagseparatorapproa hariseswhentheseparatortagisusedelsewhereamonga

re ordotherthantheboundary.

Ontheotherhand,ourworkhereattemptstoeliminatehumanintervention

bypatternmining.Themotivationisfrom theobservationthatusefulinforma-

tionin aWebpageis oftenpla edin astru turehavingaparti ularalignment

and order.Forexample,Web pagesprodu ed byWeb sear henginesgenerally

have regular and repetitive patterns, whi h usually represent meaningful and

useful data re ords. In thenext se tion, we rst give an exampleshowing the

repeatedpatternformed bymultiple alignedre ords.

2 Motivation

OneobservationfromWebpagesisthattheinformationtobeextra tedisoften

pla edinastru turehavingaparti ularalignmentandformsrepetitivepatterns.

Forexample,query-ableorsear h-ableInternetsitessu hasWebsear hengines

oftenprodu eWebpageswithlargeitemizedmat hresultswhi haredisplayedin

aparti ulartemplateformat.Thetemplate anbere ognizedwhenthe ontent

of ea h mat h is ignored or repla ed by some xed-length string. Therefore,

repetitive patterns are formed. For instan e, in the example of Figure 1, the

sequen e\<LI>Text()<I>Text( )</I>"is repeatedfour times,when alltext

stringsbetweentwotagssu has\Congo",\Egypt",\Belize" et .are repla ed

bytoken lassText( ).

This is a simple examplethat demonstratesa repeated pattern formed by

tag tokens in a Web page following a simple translation onvention. In pra -

(3)

<LI>Congo<I>242</I>

<LI>Egypt<I>20</I>

<LI>Belize<I>501</I>

<LI>Spain<I>34</I>

</UL>

Fig.1.Sample HTMLpage

usuallyextra t datafrom relationaldatabaseandprodu edynami Webpages

withaprede nedformatstyle.Therefore,whatweoughttodoiskindofreverse

engineering to dis over the original format style and the ontent we need to

extra t.Meanwhile,wealso ndthatextra tionpatternsofthedesiredinforma-

tion( alled maininformation blo k asde ned in [3℄)ofteno urregularlyand

loselyin aWebpage. Theseobservationsmotivateus tolookforan approa h

todis overrepeatedpatternsandvalidation riteriato lterdesiredrepeatsthat

arespa edregularlyand losely.

Sin eHTMLtagsarethebasi omponentsfordatapresentationandthetext

stringbetweentagsareexa tlywhatweseeinthebrowsers.Hen e,itisintuitive

toregardthetextstringbetweentwotagsasoneunitaswellasea hindividual

tag.ThissimpleversionofHTMLtranslationwillbeusedinthefollowingpaper

where anytext stringbetweentwotagsistranslatedto oneunit alled Text( )

andeveryHTMLtagistranslatedtoatokenHtml(<tag>)a ordingtoitstag

name.

Su htranslation onventionenablestheshow-upofmanyrepeatedpatterns.

Byrepeated patterns,wemeananysubstring that o urs twi e in theen oded

token string. Thus, not only the sequen e \Html(<LI>) Text( ) Html(<I>)

Text()Html(<I>)" onformstothede nitionofrepeatedpatternbutalsothe

subsequen e\Html(<LI>)Text( )Html(<I>),"\Text()Html(<I>)Text()",

\HMLT(<I>)Text( )Html(</I>)," et . To distinguish from these repeats, we

de nemaximal repeatstouniquelyidentifythelongestpattern asfollows.

De nition GivenaninputstringS,wede nemaximalrepeat asasubstring

ofS thato urs in kdistin tpositionsp

1

;p

2

;:::;p

k

in S,su hthat the(p

i -

1)thtokenin Sisdi erentfromthe(p

j

-1)thtokenforatleastonei;j pair,

1i<j k ( alled leftmaximal), andthe(p

x

+j j)thtokenis di erent

fromthe(p

y

+j j)thtokenforatleastonex;y pair,1x<yk ( alled

rightmaximal).

Thede nition ofmaximal repeatsis ne essaryforidentifying thewell-used

and popular term, repeats. Besides, it also aptures all interesting repetitive

stru tures in a lear way and avoids generating overwhelming outputs. In the

nextse tion,wewilldes ribehowtheproblemofIE anbeaddressedbypattern

(4)

Todis overpatternsfromaninputWebpage, rstanen odings hemeisused

to translate the Web page into a string of abstra t representations, referred

to hereas tokens. Ea h token is represented by a binary ode of length l. To

enable pattern dis overy, we utilizes adata stru ture alled aPAT tree [4℄ in

whi hrepeated patternsin agivensequen e anbeeÆ ientlyidenti ed. Using

thisdatastru turetoindexaninputstring,allpossiblerepeats,in ludingtheir

o urren e ounts andtheirpositions in theoriginal inputstring anbe easily

retrieved.Finally,thedis overedmaximalrepeatsareforwardedtothevalidator,

whi h ltersoutundesired patternsandtoprodu esa andidatepattern.

3.1 Translator

Sin eHTMLtagsarethebasi omponentsfordo umentpresentationandthe

tagsthemselves arrya ertainstru tureinformation,itisintuitivetoexamine

thetagtokenstringformedbyHTMLtagsandregardothernon-tagtext ontent

betweentwotagsasonesingletoken alledText( ).Tokensseeninthetranslated

tokenstringin ludetagtokensandtexttokens,denotedasHtml(<tag name>)

andText( ),respe tively.Forexample,Html(</a>)isatagtoken,where</a>

isthetag.Text( )isatexttoken,whi hin ludesa ontiguoustextstringlo ated

betweentwoHTMLtags.

Tagstokens anbe lassi edinmanyways.Theuser an hoosea lassi a-

tiondependingonthedesiredlevelofinformationtobeextra ted.Forexample,

tagsintheBODYse tionofado ument anbedividedintotwodistin tgroups:

blo k-leveltags and text-leveltags. Theformer de nes thestru ture of ado -

ument, and the latter de nes the hara teristi s, su h as format and style, of

the ontentsof the text. Blo k level tags in lude ategories su h as headings,

text ontainers, lists, andother lassi ations,su hastables and forms.Text-

leveltagsarefurtherdividedinto ategoriesin ludinglogi almarkups,physi al

markups,andspe ialmarkupsformarkinguptextsin atextblo k.

Themanydi erenttag lassi ationsallowdi erentHTML translationsto

be generated. With these di erent abstra tion me hanisms, di erent patterns

an beprodu ed. Forexample, skippingalltext-leveltagswill resultin higher

abstra tionfromtheinputWebpagethanalltagsarein luded.Inaddition,dif-

ferentpatterns anbedis overedandextra tedwhendi erenten odings hemes

aretranslated.

For example, when only blo k-level tags are onsidered, the orrespond-

ing translation of Fig. 1 is a token string: \Html(<H1>)Text()Html(</H1>)

Html(<UL>)Html(<LI>)Text()Html(<LI>)Text()Html(<LI>)Text()Html(<LI>)

Text()Html(</UL>)",where ea htokenis en oded asabinary stringsof "0"s

and"1"swithlengthl.Forexample,supposethreebitsen odethetokensinthe

Congo odeasshownin Fig.2.Theen oded binarystringforthetokenstring

oftheCongo odewillbe"100110101000010110010110010110010110001$"

(5)

,QGH[LQJ SRVLWLRQ

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[  

VXIIL[ 

VXIIL[ 

VXIIL[ 

VXIIL[ 

+WPO +! 7H[W B +WPO +! +WPO 8/! +WPO /,! 7H[W B

+WPO /,! 7H[W B +WPO /,! 7H[W B +WPO /,! 7H[W B +WPO 8/!

+WPO 8/!

+WPO /,!

+WPO +!

7H[W B









100110101000010110010110010110010110001$

+WPO 8/!

+WPO /,!

+WPO +!







11

4



Ä

1

1 Ä

Ã

Ä

Ä

Ã

1

Ã

3

1

Ä

Ã

8

12

Ã

13

9

Ã

Ä

5

7

20

Ã

Ã

11

0

10

L

Ä

I

6

M

Ã

4

Ä

N

5

Ä

Ã

2

1

D

2

2 J

E

3

F

14

H

K 3

17 O

8

G

Fig.2.ThePATtreefor theCongoCode

3.2 The PAT Tree

Ourapproa hforpatterndis overyusesaPATtreetodis overrepeatedpatterns

in theen odedtokenstring.A PATtreeisaPatri iatree(Pra ti alAlgorithm

to Retrieve Information Coded in Alphanumeri [11℄) onstru tedoverall the

possiblesuÆxstrings.A Patri iatree isaparti ularimplementationof a om-

pressed binary (0,1) digital tree su h that ea h internal node in the tree has

twobran hes:zerogoesto leftand onegoesto right.LikeasuÆxtree [4℄,the

Patri iatreestoresallitsdataat theexternalnodesandkeepsoneinteger,the

bit-index, in ea h internal node asanindi ation of whi h bit is to beused for

bran hing. For a hara terstringwith nindexingpoint(orn suÆx),there will

benexternal nodesin thePATtree andn 1internalnodes.This makesthe

treeO(n)in size.

WhenaPATtree isto index asequen e of hara ters(or tokenshere) not

just 0 or 1, the binary odes for the hara ters an be used. For simpli ity,

ea h hara terisen odedas xed-lengthbinary ode.Spe i ally,givena nite

alphabetofa xedsize,ea h hara terx2isrepresentedbyabinary ode

oflengthl=dlog

2

jje.Forasequen eSofn hara ters,thebinaryinputBwill

havenlbits,butonlythe[il+1℄thbithastobeindexedfori=0;:::;n 1.

ReferringtoFig.2,aPATtreeis onstru tedfromtheen odedbinarystring

of theCongoexample.Thetreeis onstru tedfrom thirteen sequen esof bits,

withea hsequen eofbitsstartingfromea hoftheen odedtokensandextending

totheendofthetokenstring.Ea hsequen eis alleda"semi-in nitestring"or

"sistring"inshort.Ea hleaf,orexternalnode,isrepresentedbyasquarelabeled

byanumberthatindi atesthestartingposition ofasistring.Forexample,leaf

(6)

intheen odedbitstring.Thebitpositionisusedwhenlo atingagivensistring

in PATtree.

Virtually,ea hedgeinthePATtreehasaedgelabel.Forexample,theedge

labels betweennode dand eare \101100",the8thbit to 13th bit forsuÆx9,

7,and5.Edgesthat arevisitedwhentraversingdownwardfrom rootto aleave

formapaththatleadstoasistring orrespondingtotheleave.The on atenated

edgelabelsalongthepathformavirtualpathlabel.Forexample,theedgelabels

"1","10", and"1..." onthepath that leadsfromroot to leave2form apre x

"1101...",whi hisauniquepre xforsistring2.

Asshownin Fig.2,allsuÆxstringswiththesamepre xwill belo atedin

thesamesubtree.Hen e,itallowssurprisinglyeÆ ient,linear-timesolutionsto

omplexstringsear hproblems.Forexample,stringpre xsear hing,proximity

sear hing,rangesear hing,longestrepetitionsear hing,mostfrequentsear hing,

et . [4,5℄Sin e everyinternal nodein aPATtreeindi ates abran h,itimplies

a di erent bit following the ommon pre x between two suÆxes. Hen e, the

on atenationoftheedge-labelsonthepathfrom therootto aninternal node

representsonerepeatedstringintheinputstring.However,noteverypath-label

orrepeatedstringrepresentsamaximalrepeat.Let's allthe(p

k

1)th hara ter

ofthebinarystringp

k

theleft hara ter.Forapath-labelofaninternalnodevto

beamaximalrepeat,atleasttwoleaves(suÆxes)inthev'ssubtreeshouldhave

di erent left hara ters.By re ordingthe o urren e ountsand the referen e

positionsintheleafnodesofaPATtree,we aneasilyknowhowmanytimesa

pattern isrepeated. Hen e,giventhe patternlength,o urren e ount,we an

applypostordertraversalto thePATtreetoenumerateallrepeats.

Theessen eofaPATtreeisabinarysuÆxtree,whi hhasalsobeenapplied

in severalresear h eldforpattern dis overy.Forexample,Kurtzand S hleier-

ma herhaveusedsuÆxtreesinbioinformati sfor ndingrepeatedsubstringin

genomes[8℄.AsforPATtrees,theyhavebeenappliedforindexinginthe eldof

informationretrievalsin ealongtimeago[4℄.Ithasalsobeenused inChinese

keywordextra tion [1℄ forits simplerimplementation than suÆxtrees and its

great power for pattern dis overy. However, in the appli ation of information

extra tion, weare notonly interested in repeats but also repeats that appear

regularly in vi inity. Dis overedmaximal repeats haveto be further validated

or ompared to nd the best one that orresponds to the information to be

extra ted.

3.3 Pattern ValidationCriteria

Intheabovese tion,wedis ussedhowto ndmaximal repeatsin aPATtree.

However, there may be over 60 maximal repeats dis overedin an Web page.

To lassifythese maximal repeats, we introdu e two measures regularity, and

ompa tness as des ribed below. Let the suÆxes of a maximal repeat are

orderedbyitsposition su h thatsuÆxp

1

<p

2

<p

3 :::<p

k

, wherep

i

denotes

(7)

intervalbetweentwoadja ento urren es(p

i+1 p

i

),thatis,thesequen eof

spa ingbetweentwoadja ento urren es(p

2 p

1 ),(p

3 p

2 ),...,(p

k p

k 1 ).

Regularity of the maximal repeat is equalto the standard derivation of

thesequen edividedbythemeanofthesequen e.

Compa tness is a measure of the density of a maximal repeat. It is used to

eliminatemaximalrepeatsthatares atteredfarapartbeyondagivenbound.

Compa tnessisde ned askj j/ P

k

i=2 p

i p

i 1

,where j jis thelengthof

in numberoftokens.

Thevalueofregularityislo atedbetween0and1whilethevalueofdensity

isgreaterthan0.Ideally,theextra tionpatternshouldhaveregularityequalto

zeroand ompa tnessequaltoone.To lterpotentiallygoodpatterns,asimple

approa hwillbetouseathresholdforea hofthesemeasuresabove.Impli itly,

good patterns havesmall regularity and density lose to one. Therefore, only

patternswith regularitylessthantheregularitythresholdanddensitybetween

thedensitythresholdsare onsideredvalidated patterns.

4 Performan e Evaluation

We rst showthe number of validated maximal repeats validated by oursys-

tem using fourteen state-of-the-art sear h engines, ea h with ten Web pages.

There are several ontrol parameters whi h an a e t thenumber of maximal

repeats validated,in ludingen oding s heme, minimumpattern length, o ur-

ren e ount, and threshold values for regularity and ompa tness. Given the

minimumlength3and ount5,thee e tofdi erenten odings hemeisshown

in Table 1. Conform to general expe tation, higher-level en oding s heme of-

ten resultsin lesspatterns. From this table,we an also see how ea h ontrol

parameter lters patterns where the thresholds are de ided by the following

experiments.Thevalueofdensity anbegreaterthanonebe ausemaximalre-

peatsmaybeoverlapped.Forexample,supposeamaximalrepeat o ursten

timesin arow.Insu h ase, will hasregularity0anddensity1.Inaddition,

, ,et . arealso quali edfor regularmaximalrepeats, onlywithdensity

greaterthan1.

Table 1.No.ofPatternsvalidatedwithdi erenten odings heme

En oding MaximalRegularityCompa tness

S heme Repeat <0.5 >1.5 <0.25

All-tag 117 39 22 7.6

NoPhysi al 88 41 25 6.5

NoSpe ial 82 29 18 5.7

Blo k-level 66 32 17 3.9

(8)

















    

Density

# of patterns

5 

5 

5 

Fig.3.#ofpatternssu essfullyvalidated

Fig. 3 shows the e e t of various regularity and density thresholds using

all-tag en oding s heme. Basi ally, low regularity threshold and high density

thresholdredu ethenumberofpatterns,but ouldhavemissedgood patterns.

Therefore, thethresholdsare hosenempiri allyto in lude asmany good pat-

ternsaspossible.

Table 2 shows the performan e of di erent en oding s heme measured in

retrievalrate,a ura y rateandmat hing per entage.Retrievalrateisde ned

as the ratio of the number of desired data re ords enumerated by a maximal

repeatto thenumberofdesireddata re ords ontainedin theinputtext.Like-

wise,a ura yrateisde nedastheratioofthenumberofdesireddatare ords

enumeratedbya maximalrepeatto the numberof o urren e ofthe maximal

repeat.Adatare ordissaidtobeenumeratedbyamaximalrepeatifthemat h-

ing per entageis greaterthan abound determined bythe user.The mat hing

per entageis usedbe ausethepattern may ontainonlyaportionof thedata

re ord.

With the simple en oding s heme of using blo k-level tags, our approa h

ould dis over patterns whi h extra t 86% re ords with mat hing per entage

78%.Nearly halfthetestWebsites are orre tlyextra ted(withmat hingper-

entage greater than90%). Among them, nine of the fourteen Web sites have

retrievalrateanda ura yratebothgreaterthan0.9.However,examiningother

dis overedpatterns,manyarein ompleteduetoex eptions.Inthenextse tion,

wewill further improve theperforman e by o urren epartition and multiple

stringalignment.

5 Constru ting Extra tion Pattern

Generallyspeaking,sear henginesutilizea\whileloop"tooutputtheirresults

in some template. However, they may use \if lauses" inside the while loop

(9)

En odingS hemeRetrievalRateA ura yRateMat hingPer entage

All-tag 0.73 0.82 0.60

NoPhysi al 0.82 0.89 0.68

NoSpe ial 0.84 0.88 0.70

Blo k-level 0.86 0.86 0.78

to sear h engines are shown in bold fa e for Infoseek and MetaCrawler, thus,

breakingtheir\whileloop" patterns.

From thestatisti s above,wesummarize that \maximal repeat"and \reg-

ularity"are thetwoprimary riteriawe lter andidatepatterns. However,we

also found that the extra tionpattern maynot be maximal repeats and regu-

lar.Forexample,theregularityofthepatternforEx iteisgreaterthandefault

regularitythreshold0.5 be auseabanner isinsertedamong thesear hresults,

dividingthetenmat hesintotwoparts.Besides,the\if-e e t"oftenhindersus

from dis overing ompletepatterns.These issuesare whatwewouldaddressin

thefollowing.

5.1 O urren e Partition

Tohandlepatternswithregularitygreaterthanthespe i edthreshold0.5,these

patterns are arefullysegmentedto see if any partition ofthe pattern's o ur-

ren es satis es the requirement for regularity. By de nition, the regularity of

a pattern is omputed through all o urren esof the pattern. Forexample, if

thereareko urren es,thek 1intervals(betweentwoadja entinstan es)are

thestatisti sweuseto omputethestandarddeviationandthemean.However,

in examplessu h asLy os,the sear h resultis divided intothree blo ks. Su h

o urren esin reasetheregularityoverallinstan es.Nonetheless,theregularity

of the o urren esin ea h information blo k is still small. Therefore, theidea

hereis to segmentthe o urren esinto partitionsso that we ananalyzeea h

partitionindividually.

Wedon't reallyhaveto apply lustering algorithmon thismatter, instead,

a simple loop an a omplish the job if the o urren es are ordered by their

position aforehand. Let C

i;j

denotes the set of o urren es p

i

;p

i+1

;:::;p

j and

initialize s =1;j = 1. For instan e p

j+1

, if theregularity of C

s;j+1

is greater

thanthenoutputC

s;j

asapartition andassignj+1tos.

On e the partitions are separated, we anthen ompute the regularity for

ea hindividualpartition.Ifapartitionin ludeso urren esmoretheminimum

ountandhasregularitylessthanthreshold, thepatternaswellastheo ur-

ren esinthispartitionareoutputted.Notethatthethresholdissettoasmall

value mu h less than 0.5 to ontrol the number of outputted patterns. With

thismodi ation,theperforman eisimprovedgreatly.AsshowninTable3,the

retrievalrateisin reasedto93%anda ura yrateto90%.Theonlytradeo is

(10)

Advan edTe hnique RetrievalRateA ura yRateMat hingPer entage

O urren ePartition 0.93 0.93 0.84

MultipleAlignment 0.97 0.94 0.90

5.2 MultipleString Alignment

Forthe tough work regardingin ompletepatterndis overed,thete hniquefor

multiplestringalignmentisborrowedto ndagoodpresentationofthe riti al

ommonfeaturesofmultiple strings.Forexample,suppose \ad "isthedis ov-

eredpatternfortokenstring\ad wbdad xbad xb ad ".Ifwehavethefollowing

multiple alignmentforstrings\ad wbd",\ad xb"and\ad xbd":

ad wb d

ad x b

ad x b d

The extra tion pattern an be generalized as \ad [wjx℄b[dj ℄" to over these

three instan es. Spe i ally, suppose avalidated maximalrepeathask+1o -

urren e, p

1 ,p

2 ,...,p

k +1

in theen oded token string.LetstringP

i

denotethe

stringstartingatp

i

andendingatp

i+1

1.Theproblemisto ndthemultiple

alignmentof the k stringsS =fP

1

;P

2

;:::;P

k

g sothat the generalizedpattern

anbeusedtoextra tallre ordsweneed.

Multiplestring omparisonis anaturalgeneralization ofalignment for two

strings whi h an be solved in O(nm) by dynami programming to obtain

optimaleditdistan e,where nand marestringlengths.As anexampleoftwo

string alignment, onsiderthealignmentoftwostringsa wbdandad xb shown

below:

a wb d

a d x b

In this alignment, hara ter w is mismat hed with x,two dsare opposite hy-

phens(or alledspa e),andallother hara tersmat htheir ounterpartsinthe

opposite string.If wegive ea h mat h avalue of , ea h mismat h avalue of

,andea hspa e avalueof Æ,thetwostring alignmentproblemis tooptimize

theweighteddistan eD(P

1

;P

2

)(nmat h +nmis +nspa eÆ),where

nmat h,nmis,and nspa edenotethenumberofmismat h, mat h, andspa e,

respe tively(nmat h=3,nmis=1,andnspa e=2here).

Extendingdynami programmingtomultiplestringalignmentyiedsaO(n k

)

algorithm. Instead,anapproximationalgorithmisavailable su hthat thes ore

of the multiple alignment is no greater than twi e the s ore of optimalmulti-

plealignment[5℄.Theapproximationalgorithmstartsby omputingthe enter

string S

in kstringsS that minimizes onsensuserror P

Pi2S D(S

;P

i ). On e

the enter string is found, ea h string is then iteratively alignedto the enter

string to onstru t multiple alignment, whi h is in turn used to onstru t the

(11)

algorithm for multiple string alignment is applied to generalize the extra tion

pattern.Supposethegeneralizedextra tionpatternisexpressedas\

1

2

3 :::

n

",

whereea h

i

iseitherasymbolorasubsetof[f g ontainingsymbolsthat

an appear at position i. An additional step is taken to generate pattern of

this form `

j

j+1

j+2 :::

n

1

2 :::

j 1

" for position j with single symbol of the

following spe ial tags su h as <DL>, <DT>, <TR> or <P>, <BR>, <HR>,

be auseextra tion patternsoftenbeginorendupwiththem 1

.

Weadoptthisadditionalstepbe ausethegeneratedextra tionpatternmay

notbethebeginningof are ord.Theexperimental resultsshowthat withthe

help of multiple string alignment and the additional step, the performan e is

improvedto97%retrievalrate,94%a ura yrateand0.90mat hingper entage.

Thehighper entageofretrievalrateisprettyen ouraging.Theninetyper entof

mat hingper entageisa tuallyhigherintermsofthetext ontentretrieved.For

thoseWebsiteswithmat hingper entagegreaterthan85%,thetext ontentsare

allsu essfullyextra ted.Whatbothersisthea ura yrate,sin etheextra tion

pattern generalizedfrom multiple alignmentmay omprehends morethan the

information we need. For example, the generalized rule for Ly os will extra t

information in allthreeblo kswhile onlytheinformation inoneblo kis what

wedesired, ausinglowera ura yrate.

6 Summary and Future Work

Information extra tion from Web pages is a ore te hnology for omparison-

shopping agents[2℄, whi h Doorenboset.al.regardasimprovementin theaxe

oftoleratingunstru turedinformation.The hara teristi sofregularity,unifor-

mity, and verti al separation enable the possibility of learning. In this paper,

wehavepresentedanunsupervisedapproa htosemi-stru turedinformationex-

tra tion.We propose theappli ation of PATtreesfor pattern dis overyin the

en oded tokenstringof Web pages.On e the PAT tree is onstru ted, we an

easily traversethetree to nd allmaximal repeats giventhe expe ted pattern

frequen y and length. The dis overed maximal repeats are further ltered by

three measures: regularity and ompa tness.The ltering riteriaaim to keep

the number of patterns as small as possible while at the same time have all

interestingpatterns.Furthermore,o urren epartitionisappliedtohandlepat-

ternswithregularitygreaterthanthedefaultthreshold.Finally,multiplestring

alignmentisappliedtopatternswithdensitylessthanonetogeneralizeextra -

tionpattern.Thereby,theextra tionmodule ansimplyadaptpatternmat hing

algorithmto extra tallre ords.

Theextra tionrulegeneralizedfrom multiplestringalignmenthasa hieved

97%retrievalrateand91%a ura yrate.Thewholepro essrequiresnohuman

interventionandtrainingexample.Comparingouralgorithmtoothers,ourap-

proa his qui kand expressive.Ittakesonlythreeminutesto extra t140Web

1

(12)

tolerateex eptionsandvarian ein theinput.

Weare urrentlyapplyingthisapproa hagainstmoretestdataformattedin

tabularform,whi hperformatthelevelof80%retrievalrate.Asmorevarian es

o ur in input pages, it be omes even diÆ ult to have good multiple string

alignment.Insu h ases,thes oringofeditdistan ebetweentwostringsandthe

algorithmto onstru tmultiplealignmentbe omemoreimportant.Inaddition,

ltering of the onstru ted patterns an also provide a reasonable number of

patternsforuserto hoose.

A knowledgements

ThisworkissponsoredbyNationalS ien eCoun il,TaiwanundergrantNSC89-

2213-E-008-056.Also,wewouldliketothankLee-FengChien,Ming-JerLeeand

Jung-LiangChenforprovidingtheirPATtree odeforus.

Referen es

1. Chien,L.F.1997. PAT-tree-basedkeywordextra tionforChineseinformationre-

trieval.InPro eedingsofthe20thannualinternationalACMSIGIR onferen eon

Resear h anddevelopmentininformationretrieval.pp.50{58.1997.

2. Doorenbos, R.B., Etzioni, O. and Weld, D. S. A s alable omparison-shopping

agentfortheWorld-WideWeb.InPro eedingsofthe rstinternational onferen e

onAutonomousAgents.pp.39{48,NewYork, NY,1997,ACMPress.

3. Embley, D.; Jiang, Y.; and Ng. Y.-K. 1999. Re ord-boundarydis overyin Web

do uments.InPro eedings of the 1999 ACM SIGMOD InternationalConferen e

onManagementofData(SIGMOD'99).pp.467{478,Philadelphia,Pennsylvania.

4. Gonnet, G.H.;Baeza-yates,R.A.;andSnider,T.1992.NewIndi esforText:Pat

Trees and Pat Arrays. Information Retrieval: Data Stru tures and Algorithms,

Prenti eHall.

5. Gus eld,D.1997.Algorithms onstrings, trees,andsequen es,Cambridge.1997.

6. Hsu, C.-N. and Dung, M.-T. 1998. Generating nite-state transdu ers for semi-

stru tureddataextra tionfromtheWeb.InformationSystems. 23(8):521{538.

7. Knoblo k,A.etal.,ed.,1998.Pro .1998WorkshoponAIandInformationInte-

gration,MenloPark,California.:AAAIPress.

8. Kurtz, S. and S hleierma her, C. 1999. REPuter: fast omputation of maximal

repeatsin ompletegenomes.Bioinformati s15(5):426{427.

9. Kushmeri k,N.;Weld, D.;and Doorenbos, R.1997 Wrapperindu tion forinfor-

mation extra tion. InPro eedings of the 15th International Joint Conferen e on

Arti ialIntelligen e(IJCAI).

10. Muslea, I.;Minton,S.; andKnoblo k,C.1999. Ahierar hi al approa htowrap-

perindu tion.InPro eedings ofthe3rdInternationalConferen eonAutonomous

Agents(Agents'99), Seattle,WA.

11. Muslea,I.1999.Extra tionpatternsforinformationextra tiontasks:asurvey.In

Pro eedingsofAAAI'99:WorkshoponMa hineLearningforInformationExtra -

tion

Fig. 2. The PAT tree for the Congo Code
Fig. 3. # of patterns su
essfully validated

参照

関連したドキュメント

Zheng and Yan 7 put efforts into using forward search in planning graph algorithm to solve WSC problem, and it shows a good result which can find a solution in polynomial time

Once bulk deformation b is chosen (so that there is a torus fiber L whose Floer cohomology is non-vanishing), then we consider the Floer chain complex of L with a generic torus fiber

東京都は他の道府県とは値が離れているように見える。相関係数はこう

Two numerical examples are described to demonstrate the application of the variational finite element analysis to simulate the hydraulic heads and free surface in a porous medium..

Two numerical examples are described to demonstrate the application of the variational finite element analysis to simulate the hydraulic heads and free surface in a porous medium..

We present sufficient conditions for the existence of solutions to Neu- mann and periodic boundary-value problems for some class of quasilinear ordinary differential equations.. We

Antigravity moves Given a configuration of beads on a bead and runner diagram, considered in antigravity for some fixed bead, the following moves alter the antigrav- ity

(3) We present a JavaScript library 2 , that contains all the al- gorithms described in this paper, and a Web platform, AGORA 3 (Automatic Graph Overlap Removal Algorithms), in