Information Extra tion
Chia-Hui Chang,Shao-ChenLui,andYen-ChinWu
Dept.ofComputerS ien eandInformationEngineering
NationalCentralUniversity,Chung-Li,320,Taiwan
hia sie.n u.edu.tw,fanyway,trisdangdb. sie.n u.edu.tw
Abstra t. Informationextra tion(IE)fromsemi-stru turedWebdo -
umentsisa riti alissue forinformationintegrationsystemsontheIn-
ternet. Previous workin wrapper indu tion aim to solve this problem
byapplyingma hinelearning toautomati allygenerateextra tors.For
example,WIEN,Stalker,Softmealy,et .However,thisapproa hstillre-
quireshumaninterventiontoprovidetrainingexamples.Inthispaper,we
proposeanovelideatoIE,byrepeatedpatternminingandmultiplepat-
ternalignment.Thedis overyofrepeatedpatternsarerealizedthrougha
datastru ture allPATtree.Inaddition,in ompletepatternsarefurther
revisedbypatternalignmentto omprehendall patterninstan es.This
newtra ktoIEinvolvesnohumaneortand ontent-dependentheuris-
ti s.Experimentalresultsshowthatthe onstru tedextra tionrules an
a hieves97per entextra tionoverfourteenpopularsear hengines.
Keywords:informationextra tion,semi-stru tureddo uments,wrapper
generation,pattern dis overy,multiple alignment
1 Introdu tion
Information extra tion (IE) is on erned with extra ting from a olle tion of
do umentstheinformationrelevanttoaparti ularextra tiontask.Forinstan e,
the meta-sear h engineMetaCrawlerextra ts the sear h results from multiple
sear hengines;andtheshoppingagentJungleeextra tstheprodu tinformation
from several online stores for omparison. With the growth of the amount of
onlineinformation,theavailabilityofrobust, exibleIEhasbe omeastringent
ne essity.
Contrastto\traditional"informationextra tionwhi hrootsin naturallan-
guagepro essing(NLP)te hniquessu haslinguisti analysis,Internetinforma-
tionextra tionrelyonsynta ti stru turesidenti ationmarkedbyHTMLtags.
Thedieren eisdue tothenatureof Websu hthat thepage ontentshaveto
be lear at glan e. Thus, \itemized list" and \tabular format" havebeen the
mainpresentationstyleforWebpagesontheInternet.Su hpresentationstyles
together with the multiple re ords ontained in one do uments ontribute the
rulesmustbetailoredforea hparti ularpage olle tion,Toautomatethe on-
stru tion of extra tors(or wrappers), re ent resear h has identied important
wrapper lassesandindu tionalgorithms.Forexample,Kushmeri ket.al.iden-
tied afamily of wrapper lassesand the orresponding indu tion algorithms
whi hgeneralizefrom labeledexamplestoextra tionrules[9℄. Moreexpressive
wrapperstru ture areintrodu ed lately.SoftmealybyHsu andDung [6℄usesa
wrapperindu tion algorithmtogenerateextra torsthatareexpressedasnite-
state transdu ers. Meanwhile, Muslea et al. [10℄ proposed \STALKER" that
performshierar hi alinformation extra tiontoredeemSoftmealy'sinability to
use delimiters that do not immediately pre ede and follow the relevant items
withextras ansoverthedo uments(see[11℄ fora ompletesurvey).
In all this work, wrappers are indu ed from training examples su h that
landmarks or delimiters anbe generalized from ommon prexes or suÆxes.
However,labelingthesetrainingexamplesissometimestime- onsuming.Hen e,
anothertra kofresear hisexploringnewapproa hestofullyautomatewrapper
onstru tion.Forexample,Embley et.al.des ribe aheuristi approa h todis-
overre ord boundariesin Web do uments by identifying andidate separator
tags using ve independent heuristi s and sele ting a onsensus separator tag
basedonaheuristi ombination[3℄.However,oneserious probleminthis one-
tagseparatorapproa hariseswhentheseparatortagisusedelsewhereamonga
re ordotherthantheboundary.
Ontheotherhand,ourworkhereattemptstoeliminatehumanintervention
bypatternmining.Themotivationisfrom theobservationthatusefulinforma-
tionin aWebpageis oftenpla edin astru turehavingaparti ularalignment
and order.Forexample,Web pagesprodu ed byWeb sear henginesgenerally
have regular and repetitive patterns, whi h usually represent meaningful and
useful data re ords. In thenext se tion, werst give an exampleshowing the
repeatedpatternformed bymultiple alignedre ords.
2 Motivation
OneobservationfromWebpagesisthattheinformationtobeextra tedisoften
pla edinastru turehavingaparti ularalignmentandformsrepetitivepatterns.
Forexample,query-ableorsear h-ableInternetsitessu hasWebsear hengines
oftenprodu eWebpageswithlargeitemizedmat hresultswhi haredisplayedin
aparti ulartemplateformat.Thetemplate anbere ognizedwhenthe ontent
of ea h mat h is ignored or repla ed by some xed-length string. Therefore,
repetitive patterns are formed. For instan e, in the example of Figure 1, the
sequen e\<LI>Text()<I>Text( )</I>"is repeatedfour times,when alltext
stringsbetweentwotagssu has\Congo",\Egypt",\Belize" et .are repla ed
bytoken lassText( ).
This is a simple examplethat demonstratesa repeated pattern formed by
tag tokens in a Web page following a simple translation onvention. In pra -
<LI>Congo<I>242</I>
<LI>Egypt<I>20</I>
<LI>Belize<I>501</I>
<LI>Spain<I>34</I>
</UL>
Fig.1.Sample HTMLpage
usuallyextra t datafrom relationaldatabaseandprodu edynami Webpages
withapredenedformatstyle.Therefore,whatweoughttodoiskindofreverse
engineering to dis over the original format style and the ontent we need to
extra t.Meanwhile,wealsondthatextra tionpatternsofthedesiredinforma-
tion( alled maininformation blo k asdened in [3℄)ofteno urregularlyand
loselyin aWebpage. Theseobservationsmotivateus tolookforan approa h
todis overrepeatedpatternsandvalidation riteriatolterdesiredrepeatsthat
arespa edregularlyand losely.
Sin eHTMLtagsarethebasi omponentsfordatapresentationandthetext
stringbetweentagsareexa tlywhatweseeinthebrowsers.Hen e,itisintuitive
toregardthetextstringbetweentwotagsasoneunitaswellasea hindividual
tag.ThissimpleversionofHTMLtranslationwillbeusedinthefollowingpaper
where anytext stringbetweentwotagsistranslatedto oneunit alled Text( )
andeveryHTMLtagistranslatedtoatokenHtml(<tag>)a ordingtoitstag
name.
Su htranslation onventionenablestheshow-upofmanyrepeatedpatterns.
Byrepeated patterns,wemeananysubstring that o urs twi e in theen oded
token string. Thus, not only the sequen e \Html(<LI>) Text( ) Html(<I>)
Text()Html(<I>)" onformstothedenitionofrepeatedpatternbutalsothe
subsequen e\Html(<LI>)Text( )Html(<I>),"\Text()Html(<I>)Text()",
\HMLT(<I>)Text( )Html(</I>)," et . To distinguish from these repeats, we
denemaximal repeatstouniquelyidentifythelongestpattern asfollows.
Denition GivenaninputstringS,wedenemaximalrepeatasasubstring
ofS thato urs in kdistin tpositionsp
1
;p
2
;:::;p
k
in S,su hthat the(p
i -
1)thtokenin Sisdierentfromthe(p
j
-1)thtokenforatleastonei;j pair,
1i<j k ( alled leftmaximal), andthe(p
x
+jj)thtokenis dierent
fromthe(p
y
+jj)thtokenforatleastonex;y pair,1x<yk ( alled
rightmaximal).
Thedenition ofmaximal repeatsis ne essaryforidentifying thewell-used
and popular term, repeats. Besides, it also aptures all interesting repetitive
stru tures in a lear way and avoids generating overwhelming outputs. In the
nextse tion,wewilldes ribehowtheproblemofIE anbeaddressedbypattern
Todis overpatternsfromaninputWebpage,rstanen odings hemeisused
to translate the Web page into a string of abstra t representations, referred
to hereas tokens. Ea h token is represented by a binary ode of length l. To
enable pattern dis overy, we utilizes adata stru ture alled aPAT tree [4℄ in
whi hrepeated patternsin agivensequen e anbeeÆ ientlyidentied. Using
thisdatastru turetoindexaninputstring,allpossiblerepeats,in ludingtheir
o urren e ounts andtheirpositions in theoriginal inputstring anbe easily
retrieved.Finally,thedis overedmaximalrepeatsareforwardedtothevalidator,
whi hltersoutundesired patternsandtoprodu esa andidatepattern.
3.1 Translator
Sin eHTMLtagsarethebasi omponentsfordo umentpresentationandthe
tagsthemselves arrya ertainstru tureinformation,itisintuitivetoexamine
thetagtokenstringformedbyHTMLtagsandregardothernon-tagtext ontent
betweentwotagsasonesingletoken alledText( ).Tokensseeninthetranslated
tokenstringin ludetagtokensandtexttokens,denotedasHtml(<tag name>)
andText( ),respe tively.Forexample,Html(</a>)isatagtoken,where</a>
isthetag.Text( )isatexttoken,whi hin ludesa ontiguoustextstringlo ated
betweentwoHTMLtags.
Tagstokens anbe lassiedinmanyways.Theuser an hoosea lassi a-
tiondependingonthedesiredlevelofinformationtobeextra ted.Forexample,
tagsintheBODYse tionofado ument anbedividedintotwodistin tgroups:
blo k-leveltags and text-leveltags. Theformer denes thestru ture of ado -
ument, and the latter denes the hara teristi s, su h as format and style, of
the ontentsof the text. Blo k level tags in lude ategories su h as headings,
text ontainers, lists, andother lassi ations,su hastables and forms.Text-
leveltagsarefurtherdividedinto ategoriesin ludinglogi almarkups,physi al
markups,andspe ialmarkupsformarkinguptextsin atextblo k.
Themanydierenttag lassi ationsallowdierentHTML translationsto
be generated. With these dierent abstra tion me hanisms, dierent patterns
an beprodu ed. Forexample, skippingalltext-leveltagswill resultin higher
abstra tionfromtheinputWebpagethanalltagsarein luded.Inaddition,dif-
ferentpatterns anbedis overedandextra tedwhendierenten odings hemes
aretranslated.
For example, when only blo k-level tags are onsidered, the orrespond-
ing translation of Fig. 1 is a token string: \Html(<H1>)Text()Html(</H1>)
Html(<UL>)Html(<LI>)Text()Html(<LI>)Text()Html(<LI>)Text()Html(<LI>)
Text()Html(</UL>)",where ea htokenis en oded asabinary stringsof "0"s
and"1"swithlengthl.Forexample,supposethreebitsen odethetokensinthe
Congo odeasshownin Fig.2.Theen oded binarystringforthetokenstring
oftheCongo odewillbe"100110101000010110010110010110010110001$"
,QGH[LQJ SRVLWLRQ
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
VXIIL[
+WPO+!7H[WB+WPO+!+WPO8/!+WPO/,!7H[WB
+WPO/,!7H[WB+WPO/,!7H[WB+WPO/,!7H[WB+WPO8/!
+WPO8/!
+WPO/,!
+WPO+!
7H[WB
100110101000010110010110010110010110001$
+WPO8/!
+WPO/,!
+WPO+!
11
4
Ä
1
1 Ä
Ã
Ä
Ä
Ã
1
Ã
3
1
Ä
Ã
8
12
Ã
13
9
Ã
Ä
5
7
20
Ã
Ã
11
0
10
L
Ä
I
6
M
Ã
4
Ä
N
5
Ä
Ã
2
1
D
2
2 J
E
3
F
14
H
K 3
17 O
8
G
Fig.2.ThePATtreefor theCongoCode
3.2 The PAT Tree
Ourapproa hforpatterndis overyusesaPATtreetodis overrepeatedpatterns
in theen odedtokenstring.A PATtreeisaPatri iatree(Pra ti alAlgorithm
to Retrieve Information Coded in Alphanumeri [11℄) onstru tedoverall the
possiblesuÆxstrings.A Patri iatree isaparti ularimplementationof a om-
pressed binary (0,1) digital tree su h that ea h internal node in the tree has
twobran hes:zerogoesto leftand onegoesto right.LikeasuÆxtree [4℄,the
Patri iatreestoresallitsdataat theexternalnodesandkeepsoneinteger,the
bit-index, in ea h internal node asanindi ation of whi h bit is to beused for
bran hing. For a hara terstringwith nindexingpoint(orn suÆx),there will
benexternal nodesin thePATtree andn 1internalnodes.This makesthe
treeO(n)in size.
WhenaPATtree isto index asequen e of hara ters(or tokenshere) not
just 0 or 1, the binary odes for the hara ters an be used. For simpli ity,
ea h hara terisen odedasxed-lengthbinary ode.Spe i ally,givenanite
alphabetofaxedsize,ea h hara terx2isrepresentedbyabinary ode
oflengthl=dlog
2
jje.Forasequen eSofn hara ters,thebinaryinputBwill
havenlbits,butonlythe[il+1℄thbithastobeindexedfori=0;:::;n 1.
ReferringtoFig.2,aPATtreeis onstru tedfromtheen odedbinarystring
of theCongoexample.Thetreeis onstru tedfrom thirteen sequen esof bits,
withea hsequen eofbitsstartingfromea hoftheen odedtokensandextending
totheendofthetokenstring.Ea hsequen eis alleda"semi-innitestring"or
"sistring"inshort.Ea hleaf,orexternalnode,isrepresentedbyasquarelabeled
byanumberthatindi atesthestartingposition ofasistring.Forexample,leaf
intheen odedbitstring.Thebitpositionisusedwhenlo atingagivensistring
in PATtree.
Virtually,ea hedgeinthePATtreehasaedgelabel.Forexample,theedge
labels betweennode dand eare \101100",the8thbit to 13th bit forsuÆx9,
7,and5.Edgesthat arevisitedwhentraversingdownwardfrom rootto aleave
formapaththatleadstoasistring orrespondingtotheleave.The on atenated
edgelabelsalongthepathformavirtualpathlabel.Forexample,theedgelabels
"1","10", and"1..." onthepath that leadsfromroot to leave2form aprex
"1101...",whi hisauniqueprexforsistring2.
Asshownin Fig.2,allsuÆxstringswiththesameprexwill belo atedin
thesamesubtree.Hen e,itallowssurprisinglyeÆ ient,linear-timesolutionsto
omplexstringsear hproblems.Forexample,stringprexsear hing,proximity
sear hing,rangesear hing,longestrepetitionsear hing,mostfrequentsear hing,
et . [4,5℄Sin e everyinternal nodein aPATtreeindi ates abran h,itimplies
a dierent bit following the ommon prex between two suÆxes. Hen e, the
on atenationoftheedge-labelsonthepathfrom therootto aninternal node
representsonerepeatedstringintheinputstring.However,noteverypath-label
orrepeatedstringrepresentsamaximalrepeat.Let's allthe(p
k
1)th hara ter
ofthebinarystringp
k
theleft hara ter.Forapath-labelofaninternalnodevto
beamaximalrepeat,atleasttwoleaves(suÆxes)inthev'ssubtreeshouldhave
dierent left hara ters.By re ordingthe o urren e ountsand the referen e
positionsintheleafnodesofaPATtree,we aneasilyknowhowmanytimesa
pattern isrepeated. Hen e,giventhe patternlength,o urren e ount,we an
applypostordertraversalto thePATtreetoenumerateallrepeats.
Theessen eofaPATtreeisabinarysuÆxtree,whi hhasalsobeenapplied
in severalresear heldforpattern dis overy.Forexample,Kurtzand S hleier-
ma herhaveusedsuÆxtreesinbioinformati sforndingrepeatedsubstringin
genomes[8℄.AsforPATtrees,theyhavebeenappliedforindexingintheeldof
informationretrievalsin ealongtimeago[4℄.Ithasalsobeenused inChinese
keywordextra tion [1℄ forits simplerimplementation than suÆxtrees and its
great power for pattern dis overy. However, in the appli ation of information
extra tion, weare notonly interested in repeats but also repeats that appear
regularly in vi inity. Dis overedmaximal repeats haveto be further validated
or ompared to nd the best one that orresponds to the information to be
extra ted.
3.3 Pattern ValidationCriteria
Intheabovese tion,wedis ussedhowto ndmaximal repeatsin aPATtree.
However, there may be over 60 maximal repeats dis overedin an Web page.
To lassifythese maximal repeats, we introdu e two measures regularity, and
ompa tness as des ribed below. Let the suÆxes of a maximal repeat are
orderedbyitsposition su h thatsuÆxp
1
<p
2
<p
3 :::<p
k
, wherep
i
denotes
intervalbetweentwoadja ento urren es(p
i+1 p
i
),thatis,thesequen eof
spa ingbetweentwoadja ento urren es(p
2 p
1 ),(p
3 p
2 ),...,(p
k p
k 1 ).
Regularity of the maximal repeat is equalto the standard derivation of
thesequen edividedbythemeanofthesequen e.
Compa tness is a measure of the density of a maximal repeat. It is used to
eliminatemaximalrepeatsthatares atteredfarapartbeyondagivenbound.
Compa tnessisdened askjj/ P
k
i=2 p
i p
i 1
,where jjis thelengthof
in numberoftokens.
Thevalueofregularityislo atedbetween0and1whilethevalueofdensity
isgreaterthan0.Ideally,theextra tionpatternshouldhaveregularityequalto
zeroand ompa tnessequaltoone.Tolterpotentiallygoodpatterns,asimple
approa hwillbetouseathresholdforea hofthesemeasuresabove.Impli itly,
good patterns havesmall regularity and density lose to one. Therefore, only
patternswith regularitylessthantheregularitythresholdanddensitybetween
thedensitythresholdsare onsideredvalidated patterns.
4 Performan e Evaluation
We rst showthe number of validated maximal repeats validated by oursys-
tem using fourteen state-of-the-art sear h engines, ea h with ten Web pages.
There are several ontrol parameters whi h an ae t thenumber of maximal
repeats validated,in ludingen oding s heme, minimumpattern length, o ur-
ren e ount, and threshold values for regularity and ompa tness. Given the
minimumlength3and ount5,theee tofdierenten odings hemeisshown
in Table 1. Conform to general expe tation, higher-level en oding s heme of-
ten resultsin lesspatterns. From this table,we an also see how ea h ontrol
parameter lters patterns where the thresholds are de ided by the following
experiments.Thevalueofdensity anbegreaterthanonebe ausemaximalre-
peatsmaybeoverlapped.Forexample,supposeamaximalrepeato ursten
timesin arow.Insu h ase,will hasregularity0anddensity1.Inaddition,
, ,et . arealso qualiedfor regularmaximalrepeats, onlywithdensity
greaterthan1.
Table 1.No.ofPatternsvalidatedwithdierenten odings heme
En oding MaximalRegularityCompa tness
S heme Repeat <0.5 >1.5 <0.25
All-tag 117 39 22 7.6
NoPhysi al 88 41 25 6.5
NoSpe ial 82 29 18 5.7
Blo k-level 66 32 17 3.9
Density
# of patterns
5
5
5
Fig.3.#ofpatternssu essfullyvalidated
Fig. 3 shows the ee t of various regularity and density thresholds using
all-tag en oding s heme. Basi ally, low regularity threshold and high density
thresholdredu ethenumberofpatterns,but ouldhavemissedgood patterns.
Therefore, thethresholdsare hosenempiri allyto in lude asmany good pat-
ternsaspossible.
Table 2 shows the performan e of dierent en oding s heme measured in
retrievalrate,a ura y rateandmat hing per entage.Retrievalrateisdened
as the ratio of the number of desired data re ords enumerated by a maximal
repeatto thenumberofdesireddata re ords ontainedin theinputtext.Like-
wise,a ura yrateisdenedastheratioofthenumberofdesireddatare ords
enumeratedbya maximalrepeatto the numberof o urren e ofthe maximal
repeat.Adatare ordissaidtobeenumeratedbyamaximalrepeatifthemat h-
ing per entageis greaterthan abound determined bythe user.The mat hing
per entageis usedbe ausethepattern may ontainonlyaportionof thedata
re ord.
With the simple en oding s heme of using blo k-level tags, our approa h
ould dis over patterns whi h extra t 86% re ords with mat hing per entage
78%.Nearly halfthetestWebsites are orre tlyextra ted(withmat hingper-
entage greater than90%). Among them, nine of the fourteen Web sites have
retrievalrateanda ura yratebothgreaterthan0.9.However,examiningother
dis overedpatterns,manyarein ompleteduetoex eptions.Inthenextse tion,
wewill further improve theperforman e by o urren epartition and multiple
stringalignment.
5 Constru ting Extra tion Pattern
Generallyspeaking,sear henginesutilizea\whileloop"tooutputtheirresults
in some template. However, they may use \if lauses" inside the while loop
En odingS hemeRetrievalRateA ura yRateMat hingPer entage
All-tag 0.73 0.82 0.60
NoPhysi al 0.82 0.89 0.68
NoSpe ial 0.84 0.88 0.70
Blo k-level 0.86 0.86 0.78
to sear h engines are shown in bold fa e for Infoseek and MetaCrawler, thus,
breakingtheir\whileloop" patterns.
From thestatisti s above,wesummarize that \maximal repeat"and \reg-
ularity"are thetwoprimary riteriawe lter andidatepatterns. However,we
also found that the extra tionpattern maynot be maximal repeats and regu-
lar.Forexample,theregularityofthepatternforEx iteisgreaterthandefault
regularitythreshold0.5 be auseabanner isinsertedamong thesear hresults,
dividingthetenmat hesintotwoparts.Besides,the\if-ee t"oftenhindersus
from dis overing ompletepatterns.These issuesare whatwewouldaddressin
thefollowing.
5.1 O urren e Partition
Tohandlepatternswithregularitygreaterthanthespe iedthreshold0.5,these
patterns are arefullysegmentedto see if any partition ofthe pattern's o ur-
ren es satises the requirement for regularity. By denition, the regularity of
a pattern is omputed through all o urren esof the pattern. Forexample, if
thereareko urren es,thek 1intervals(betweentwoadja entinstan es)are
thestatisti sweuseto omputethestandarddeviationandthemean.However,
in examplessu h asLy os,the sear h resultis divided intothree blo ks. Su h
o urren esin reasetheregularityoverallinstan es.Nonetheless,theregularity
of the o urren esin ea h information blo k is still small. Therefore, theidea
hereis to segmentthe o urren esinto partitionsso that we ananalyzeea h
partitionindividually.
Wedon't reallyhaveto apply lustering algorithmon thismatter, instead,
a simple loop an a omplish the job if the o urren es are ordered by their
position aforehand. Let C
i;j
denotes the set of o urren es p
i
;p
i+1
;:::;p
j and
initialize s =1;j = 1. For instan e p
j+1
, if theregularity of C
s;j+1
is greater
thanthenoutputC
s;j
asapartition andassignj+1tos.
On e the partitions are separated, we anthen ompute the regularity for
ea hindividualpartition.Ifapartitionin ludeso urren esmoretheminimum
ountandhasregularitylessthanthreshold, thepatternaswellastheo ur-
ren esinthispartitionareoutputted.Notethatthethresholdissettoasmall
value mu h less than 0.5 to ontrol the number of outputted patterns. With
thismodi ation,theperforman eisimprovedgreatly.AsshowninTable3,the
retrievalrateisin reasedto93%anda ura yrateto90%.Theonlytradeois
Advan edTe hnique RetrievalRateA ura yRateMat hingPer entage
O urren ePartition 0.93 0.93 0.84
MultipleAlignment 0.97 0.94 0.90
5.2 MultipleString Alignment
Forthe tough work regardingin ompletepatterndis overed,thete hniquefor
multiplestringalignmentisborrowedtondagoodpresentationofthe riti al
ommonfeaturesofmultiple strings.Forexample,suppose \ad "isthedis ov-
eredpatternfortokenstring\ad wbdad xbad xb ad ".Ifwehavethefollowing
multiple alignmentforstrings\ad wbd",\ad xb"and\ad xbd":
ad wb d
ad x b
ad x b d
The extra tion pattern an be generalized as \ad [wjx℄b[dj ℄" to over these
three instan es. Spe i ally, suppose avalidated maximalrepeathask+1o -
urren e, p
1 ,p
2 ,...,p
k +1
in theen oded token string.LetstringP
i
denotethe
stringstartingatp
i
andendingatp
i+1
1.Theproblemistondthemultiple
alignmentof the k stringsS =fP
1
;P
2
;:::;P
k
g sothat the generalizedpattern
anbeusedtoextra tallre ordsweneed.
Multiplestring omparisonis anaturalgeneralization ofalignment for two
strings whi h an be solved in O(nm) by dynami programming to obtain
optimaleditdistan e,where nand marestringlengths.As anexampleoftwo
string alignment, onsiderthealignmentoftwostringsa wbdandad xb shown
below:
a wb d
a d x b
In this alignment, hara ter w is mismat hed with x,two dsare opposite hy-
phens(or alledspa e),andallother hara tersmat htheir ounterpartsinthe
opposite string.If wegive ea h mat h avalue of , ea h mismat h avalue of
,andea hspa e avalueof Æ,thetwostring alignmentproblemis tooptimize
theweighteddistan eD(P
1
;P
2
)(nmat h+nmis +nspa eÆ),where
nmat h,nmis,and nspa edenotethenumberofmismat h, mat h, andspa e,
respe tively(nmat h=3,nmis=1,andnspa e=2here).
Extendingdynami programmingtomultiplestringalignmentyiedsaO(n k
)
algorithm. Instead,anapproximationalgorithmisavailable su hthat thes ore
of the multiple alignment is no greater than twi e the s ore of optimalmulti-
plealignment[5℄.Theapproximationalgorithmstartsby omputingthe enter
string S
in kstringsS that minimizes onsensuserror P
Pi2S D(S
;P
i ). On e
the enter string is found, ea h string is then iteratively alignedto the enter
string to onstru t multiple alignment, whi h is in turn used to onstru t the
algorithm for multiple string alignment is applied to generalize the extra tion
pattern.Supposethegeneralizedextra tionpatternisexpressedas\
1
2
3 :::
n
",
whereea h
i
iseitherasymbolorasubsetof[f g ontainingsymbolsthat
an appear at position i. An additional step is taken to generate pattern of
this form `
j
j+1
j+2 :::
n
1
2 :::
j 1
" for position j with single symbol of the
following spe ial tags su h as <DL>, <DT>, <TR> or <P>, <BR>, <HR>,
be auseextra tion patternsoftenbeginorendupwiththem 1
.
Weadoptthisadditionalstepbe ausethegeneratedextra tionpatternmay
notbethebeginningof are ord.Theexperimental resultsshowthat withthe
help of multiple string alignment and the additional step, the performan e is
improvedto97%retrievalrate,94%a ura yrateand0.90mat hingper entage.
Thehighper entageofretrievalrateisprettyen ouraging.Theninetyper entof
mat hingper entageisa tuallyhigherintermsofthetext ontentretrieved.For
thoseWebsiteswithmat hingper entagegreaterthan85%,thetext ontentsare
allsu essfullyextra ted.Whatbothersisthea ura yrate,sin etheextra tion
pattern generalizedfrom multiple alignmentmay omprehends morethan the
information we need. For example, the generalized rule for Ly os will extra t
information in allthreeblo kswhile onlytheinformation inoneblo kis what
wedesired, ausinglowera ura yrate.
6 Summary and Future Work
Information extra tion from Web pages is a ore te hnology for omparison-
shopping agents[2℄, whi h Doorenboset.al.regardasimprovementin theaxe
oftoleratingunstru turedinformation.The hara teristi sofregularity,unifor-
mity, and verti al separation enable the possibility of learning. In this paper,
wehavepresentedanunsupervisedapproa htosemi-stru turedinformationex-
tra tion.We propose theappli ation of PATtreesfor pattern dis overyin the
en oded tokenstringof Web pages.On e the PAT tree is onstru ted, we an
easily traversethetree to nd allmaximal repeats giventhe expe ted pattern
frequen y and length. The dis overed maximal repeats are further ltered by
three measures: regularity and ompa tness.The ltering riteriaaim to keep
the number of patterns as small as possible while at the same time have all
interestingpatterns.Furthermore,o urren epartitionisappliedtohandlepat-
ternswithregularitygreaterthanthedefaultthreshold.Finally,multiplestring
alignmentisappliedtopatternswithdensitylessthanonetogeneralizeextra -
tionpattern.Thereby,theextra tionmodule ansimplyadaptpatternmat hing
algorithmto extra tallre ords.
Theextra tionrulegeneralizedfrom multiplestringalignmenthasa hieved
97%retrievalrateand91%a ura yrate.Thewholepro essrequiresnohuman
interventionandtrainingexample.Comparingouralgorithmtoothers,ourap-
proa his qui kand expressive.Ittakesonlythreeminutesto extra t140Web
1
tolerateex eptionsandvarian ein theinput.
Weare urrentlyapplyingthisapproa hagainstmoretestdataformattedin
tabularform,whi hperformatthelevelof80%retrievalrate.Asmorevarian es
o ur in input pages, it be omes even diÆ ult to have good multiple string
alignment.Insu h ases,thes oringofeditdistan ebetweentwostringsandthe
algorithmto onstru tmultiplealignmentbe omemoreimportant.Inaddition,
ltering of the onstru ted patterns an also provide a reasonable number of
patternsforuserto hoose.
A knowledgements
ThisworkissponsoredbyNationalS ien eCoun il,TaiwanundergrantNSC89-
2213-E-008-056.Also,wewouldliketothankLee-FengChien,Ming-JerLeeand
Jung-LiangChenforprovidingtheirPATtree odeforus.
Referen es
1. Chien,L.F.1997. PAT-tree-basedkeywordextra tionforChineseinformationre-
trieval.InPro eedingsofthe20thannualinternationalACMSIGIR onferen eon
Resear h anddevelopmentininformationretrieval.pp.50{58.1997.
2. Doorenbos, R.B., Etzioni, O. and Weld, D. S. A s alable omparison-shopping
agentfortheWorld-WideWeb.InPro eedingsoftherstinternational onferen e
onAutonomousAgents.pp.39{48,NewYork, NY,1997,ACMPress.
3. Embley, D.; Jiang, Y.; and Ng. Y.-K. 1999. Re ord-boundarydis overyin Web
do uments.InPro eedings of the 1999 ACM SIGMOD InternationalConferen e
onManagementofData(SIGMOD'99).pp.467{478,Philadelphia,Pennsylvania.
4. Gonnet, G.H.;Baeza-yates,R.A.;andSnider,T.1992.NewIndi esforText:Pat
Trees and Pat Arrays. Information Retrieval: Data Stru tures and Algorithms,
Prenti eHall.
5. Guseld,D.1997.Algorithms onstrings, trees,andsequen es,Cambridge.1997.
6. Hsu, C.-N. and Dung, M.-T. 1998. Generating nite-state transdu ers for semi-
stru tureddataextra tionfromtheWeb.InformationSystems. 23(8):521{538.
7. Knoblo k,A.etal.,ed.,1998.Pro .1998WorkshoponAIandInformationInte-
gration,MenloPark,California.:AAAIPress.
8. Kurtz, S. and S hleierma her, C. 1999. REPuter: fast omputation of maximal
repeatsin ompletegenomes.Bioinformati s15(5):426{427.
9. Kushmeri k,N.;Weld, D.;and Doorenbos, R.1997 Wrapperindu tion forinfor-
mation extra tion. InPro eedings of the 15th International Joint Conferen e on
Arti ialIntelligen e(IJCAI).
10. Muslea, I.;Minton,S.; andKnoblo k,C.1999. Ahierar hi al approa htowrap-
perindu tion.InPro eedings ofthe3rdInternationalConferen eonAutonomous
Agents(Agents'99), Seattle,WA.
11. Muslea,I.1999.Extra tionpatternsforinformationextra tiontasks:asurvey.In
Pro eedingsofAAAI'99:WorkshoponMa hineLearningforInformationExtra -
tion