IE Survey by Sarawagi Schedule for 2017 Web Information Extraction and Retrieval

(1)

Information Extraction

Survey

Sunita Sarawagi

Foundations and Trends in Databases Vol. 1, No. 3 (2007) 261–377

(2)

Part I Introduction

(3)

Outline

 Applications

 Organization of the Survey

 Types of Structure Extracted

 Types of Unstructured Sources

 Input Resources for Extraction

 Methods of Extraction

 Output of Extraction Systems

(4)

Applications

 Enterprise Applications

 Personal Information Management

 Scientific Applications

 Web Oriented Applications

(5)

Enterprise Applications (1/2)

 Data Cleaning

 convert addresses into structured forms (road name, city, and state)

 provide better querying and more robust deduplication

 News Tracking

 structured entities-relations: people names

“is-CEO-of” company names

 specific event: disease outbreaks, terrorist

 multimedia: integrate video and pictures of entities and events

(6)

Enterprise Applications

(2/2)

 Classified Ads

 extraction research on record-oriented data

 Customer Care

 identify product names and product attributes

 link customer emails to a specific transaction

 extract customer moods from phone transcripts

 Extract product attribute and value from ⁶

(7)

Personal Information

Management (PIM)

 organize personal data like

documents, emails, projects and people in a structured inter-linked format

 automatically extract structure from existing unstructured sources

(8)

Scientific Applications

 Bio-informatics has broadened the scope of earlier extractions from

named entities, to biological objects such as proteins and genes.

 A central problem is extracting from paper repositories such as Pubmed, protein names, and their interaction.

(9)

Web Oriented

Applications (1/3)

 Citation Databases

 extract sources, e.g. conference web sites, individual home pages, PDF paper.

 Segment: authors, title, venue, and year field

 value added in terms: forward references and

aggregate statistics (author-level citation counts).

 Opinion Databases

 free text form hidden behind Blogs, newsgroup posts, review sites, etc.

 products are useful to find out each product feature and the prevalent polarity of opinion.

(10)

Web Oriented Applications

(2/3)

 Community Websites

 DBLife and Rexa track researchers,

conferences, talks, projects, titles and events relevant community.

 Comparison Shopping

 crawl products and prices on merchant web sites  comparison shopping

 merchant web is hidden behind forms and scripting languages  crawl and extract

information from form-based web sites

(11)

Web Oriented Applications

(3/3)

 Ad Placement on Webpages

 Structured Web Searches

 keyword (nouns or noun phrases)  entities

 fail on queries that look for relationships between entities

Product description

Positive opinion product (product

advertisements)

(12)

Organization of the Survey

 The type of structure extracted

 The type of unstructured source

 The type of input available for extraction

 The method used for extraction

 The output of extraction

(13)

Types of Structure Extracted

 _Entities

 Relationships

 Adjectives Describing Entities

 Structures such as Lists, Tables, and Ontologies

(14)

Entities

 noun phrases and a few tokens in the unstructured text

 Example named entities: persons, locations, and companies  disease names, protein

names, paper titles, and journal names

 three subtasks: proper names and acronyms of persons, locations, and organizations

(ENAMEX), absolute temporal terms (TIMEX) and monetary and other numeric

expressions (NUMEX).

(15)

Entity Extractions

Examples

Fig. 1.1 Traditionally named entity and relationship extraction from plain text

(16)

Relationships (1/2)

 Relationships are defined over two or more entities related in a pre-defined way. Ex: “is employee of”  person and organization, “is acquired by”  pair companies, “location of outbreak”  disease and location

 subtype of record extraction is event extraction, ex: disease outbreak  “disease name”,

“location of the outbreak”, “number of people affected”, “number of people killed”, and “date of outbreak” (multi-way relationship)

(17)

Relationships (2/2)

 Semantic Role Labeling

 the goal is to identify various semantic argu-ments of the predicate.

 Ex: given a predicate accept in the

sentence “He accepted the manuscript from his dying father with trembling

hands”. The extraction task is to find the role-sets predicate: “acceptor”, “thing

accepted”, and “accepted-from”.

(18)

Adjectives Describing

Entities

 A adjective describes an entity.

 A adjective derives by combining soft clues spread over different

words around the entity. Ex: entity restaurant or music bands  extract parts of a Blog or web-page that

presents a critique of the entities  critique positive or negative

(19)

Structures such as Lists,

Tables, and Ontologies

 Expanded extraction systems to richer structures (tables, lists, and trees) from various types of

documents.

(20)

Types of Unstructured

Sources

 Granularity of Extraction

 Heterogeneity of Unstructured Sources

(21)

Granularity of Extraction

 Record or Sentences

 extract a small text snippets either unstructured records (addresses, citations) or sentences

extracted (NL paragraph)

 each word is a part of field, during extraction need to segment the text at the entity

boundaries.

 Paragraphs and Documents

 consider multiple sentences or a document for meaningful extractions.

 extract from longer information is a main challenge to filtering efficiently.

(22)

Heterogeneity of

Unstructured Sources

 Machine Generated Pages

HTML documents dynamically generated via database backed sites  wrappers

 Partially Structured Domain Specific Sources

well-defined scope (news articles, citations)  possible to develop a decent extraction model given enough labeled data

 Open Ended Sources

one important factor is to exploit the redundancy of the extracted information across many different

sources.

(23)

Input Resources for

Extraction

 Structured Databases

 Labeled Unstructured Text

 Preprocessing Libraries for Unstructured Text

(24)

Structured Databases

 Existing structured databases

(entities and relationships) are a valuable resource to improve

extraction accuracy.

 unstructured data needs to be

integrated with structured databases so the time of extraction a large

database is available.

(25)

Labeled Unstructured

Text

 labeled unstructured text requires tedious labeling effort.

 A labeled unstructured source is significantly

 more valuable than a structured database because it provides

contextual information about an entity.

 an entity appears in the unstructured data is often a very noisy form.

(26)

Preprocessing Libraries for

Unstructured Text

 Natural Language Text

Natural language documents are analyzed by a deep pipeline of preprocessing libraries

 Sentence analyzer and tokenizer

▪ identify the boundaries of sentences in a document

▪ decomposes each sentence into tokens (delimiters like spaces, commas, and dots)

 Part of speech tagger

▪ assigns to each word a grammatical category (noun, verb, adjective, adverb, article, conjunct, and pronoun)

(27)

Preprocessing Libraries for

Unstructured Text (2)

▪ Ex: Brown tag has 179 tags and the Penn treebank tag has 45 tags.

▪ The/DT University/NNP of/IN Helsinki/NNP hosts/VBZ ICML/NNP this/DT year/NN

 _Parser

▪ group words into prominent phrase types, such as noun phrases, prepositional phrases, and verb phrases.

▪ The output parsing is a parse tree that groups words into syntactic phrases.

▪ Parse trees are used for entity extraction because

typically named entities are noun phrases. In relationship extraction they provide valuable linkages between verbs and their arguments.

(28)

Preprocessing Libraries for

Unstructured Text (3)

 Dependency analyzer

▪ identify the words that form arguments of other words in the sentence.

▪ Ex: “Apple is located in Cupertino”, the word “Apple” (S) and “Cupertino” (O) are dependent on the word “located” (argument).

 Formatted Text

 need for understanding the overall structure and layout of the source (pdf, web-page)

before entity extraction.

 extract items in a list-like environment and create hierarchies of rectangular regions comprising logical units of content.

(29)

Methods of Extraction

 _Hand-coded

require human experts (to define rules or regular expressions) or program snippets (for performing the extraction).

 Learning-based

require manually labeled unstructured examples to train machine learning

models of extraction.

(30)

Methods of Extraction

(2)

 _Rule-based

drive by hard predicates, easier to interpret and develop, more useful in closed domains where human involvement is both essential and available.

 Statistical methods

decision based on a weighted sum of

predicate firings, more robust to noise in the unstructured data, more useful in closed

domains where human involvement is both essential and available.

(31)

Output of Extraction Systems

 Identify all mentions of the structured information in the unstructured text.

 Populate a database of structured entities.

(32)

Part II Entity Extraction

– Rule Based Method

(33)

Outline

 Introduction

 Feature of token

 Single entity

 Multiple entities

 Mark entity boundaries

 Organizing collection of rules

 Rule learning algorithm

3 3

(34)

Introduction ( 1/4 )

• Many real-life extraction tasks can be handled through a collection

rules.

• Task is controlled and well-behaved

– Phone number

– Zip codes from e-mails

• It can be faster and more easier amenable to optimizations

3 4

(35)

Introduction ( 2/4 )

• A typical ruled-based system consist of

– A collection of rules

– A set of policies to control the firings of multiple rules

3 5

(36)

Introduction ( 3/4 )

• Basic rule : Contextual pattern -> action

• Contextual pattern consists pf one or more labeled pattern to capture

the different entities.

– NP -> Det Adj Noun { The cute dog }

3 6

(37)

Introduction ( 4/4 )

• Label patterns are usually regular

expression which consists of tokens.

• The action part

– Assigning an entity label to a sequence of tokens

– Inserting the start or the end of an entity tag at a position

– Assigning multiple entity tags

3 7

(38)

Feature of tokens

 The string representing the token.

 Punctuations

 The part of speech of the token

 The list of dictionaries in while the token appear

 Predefined text features

 _isNumber

 IsCapitalize

3 8

(39)

Identify a single entity

• A single full entity consists of three types of patterns

– An option pattern capturing the context before the start of an entity

– A pattern matching the tokens in the entity

– An optional pattern for capturing the context after the end of the entity

3 9

(40)

Identify a single entity

• A pattern for identifying person name of form “ Dr. Yair Weiss ”

– Consisting of a little token as listed in a dictionary of titles ( “Prof”, “Dr”, “Mr”)

– _{A dot}

– Two capitalized words

• ({ DictionaryLookup = Titles } { String = ‘.’ }

{ Orthography type = capitalized

word } { 2 })

₄

0

(41)

Identify a single entity

• A pattern for marking all number

following words “by” and “in” as the Year entity is

– ({ String = “by” | String = “in” })

– ({ Orthography type = Number }) :y -> Year=:y

4 1

(42)

Rules for multiple

entities

• Some rules take the form of regular expressions with multiple slots

• _Example

– the number of bedrooms and rent from an apartment rental ad.

• ({Orthography type = Digit}) : Bedroom

({String = “BR”})({}*)({String

=“$”})

(Orthography type = Number) :

Price

⁴

2

(43)

Mark entity boundaries

• In particular long entities , it is more efficient to define separate

rules to mark the start and end of an entity boundary.

• SGML (Standard Generalized Markup Language)

4 3

(44)

Organizing collection of

rules

• Rule-based system consists of a very large collection of rules.

• A rule identifies a span of text to be called a particular entity.

• The spans demarcated by different rules overlap and lead to

conflicting actions.

• None standardized and custom- tuned.

(45)

Unordered rules

• Treat rules as an unordered collection of disjuncts.

• Each rule fires independently of the other.

• Prefer rules that mark larger spans of text as an entity type.

• Merge the spans of text that overlap.

4 5

(46)

Rules Arranged as an

Ordered Set

• Define a complete priority order on all the rules.

• Prefer to choose high priority rule.

• A complete order over rules is that a later rule can be defined on the actions of earlier rules.

4 6

(47)

Rule learning algorithm

• Rule can be learnt automatically from labeled example of entities in unstructured text.

– Rset = set of rules, initially empty

– While there exists an entity D not covered by any rule in Rset

• Form a new rule

• Add new rule to Rset

– Post process rules to prune away redundant rules

4 7



(48)

Rule learning algorithm

 The main challenge is how to a new rule that has high coverage and has high precision

 They broadly fall under two classes

 _{Bottom up}

 _{Top down}

4 8

(49)

Bottom-up rule

formation

 This rule has minimal coverage, but 100% precision

 It grown from an instance that is not covered by the existing rule set

1. Create of a seed rule from an uncovered instance

2. Generalization of the seed rule

3. Removal of instances that are covered by the new rules

4 9

(50)

Bottom-up rule

formation

Rule Result

The cute dog swam the river

Det <- the Det cute dog swam the river Adj <- cute Det Adj dog swam the river Noun <- dog Det Adj Noun swam the river Verb <- swam Det Adj Noun Verb the river Det <- the Det Adj Noun Verb Det river Noun <- river Det Adj Noun Verb Det Noun NP <- Det Adj Noun NP Verb Det Noun

NP <- Det Noun NP Verb NP

VP <- Verb NP NP VP

S <- NP VP S

5 0

(51)

Top-down rule formation

 The starting rule covers all possible instances, which means it has 100% coverage and poor precision

5 1

(52)

Top-down rule formation

Rule Result

S

S -> NP + VP NP VP

NP -> Det Adj Noun Det Adj Noun VP

Det -> the The Adj Noun VP

Adj -> cute The cute Noun VP

Noun -> dog The cute dog VP

VP -> Verb NP The cute cat Verb NP Verb -> swam The cute cat swam NP NP -> Det Noun The cute cat swam Det

Noun

Det -> the The cute cat swam the Noun

Noun -> river The cute cat swam the

river

₅

2

(53)

Summary

• Rule-based systems provide a convenient

method of defining extraction pattern.

 Rules are typically hand-coded by a domain

expert.

• It is easy for a human being to

interpret, develop and argument the

set of rules.

₅

3

(54)

Part III Entity

Extraction: Statistical

Methods

(55)

Outline

 3.1 token-level models

 3.2 segment-level models

 3.3 grammar-based models

 3.4 Training algorithms

 3.5 Inference algorithms

(56)

Notation

 _X

 unstructured input



▪ _tokens

 _E

 The set of entity types

 _Y

 tag sequence

xn

x ,...,₁

(57)

3.1 Token-level Models

 Splitting an unstructured text into a sequence of tokens

 along a pre-defined set of delimiters

 Assign an entity label to each token

(58)

3.1 Token-level Models

(cont.)

 The set of labels Y

 the set of entity types E

 _other

 e.g. Y ={HouseNo, Street, City, State, Zip, Country, Other}

 Decompose each entity

 Entities typically comprise of multiple tokens

 _Encoding

▪ _BCEO

▪ B=Begin, C=Continue, E=End, O=Other

▪ _BIO

▪ B=Begin, I=Inside, O=Other

(59)

Features

 The basis of classification process



 x := sequence token

 _{y := label}

 i := token position



^x ^y ⁱ



^R

f : , , _

(60)

Token properties

 Word Features

 Orthographic Features

 Dictionary Lookup Features

(61)

Models for Labeling

Tokens

 Using features derived from the token x_i and its neighbors in X

 Support Vector Machine (SVM)

 Capturing the dependency between the labels of adjacent words

 Hidden Markov Models (HMMs)

 Maximum Entropy Markov Models (MEMM)

 Conditional Markov models (CMM)

 Conditional Random Fields (CRFs)

(62)

Conditional Random

Fields

 Models a single joint distribution Pr(y|x)

 the predicted labels y of the tokens of x

 A chain is adequate for capturing label dependencies

 The label y_i is influenced only by the labels that are adjacent to it

(63)

Conditional Random Fields

(cont.)

 scoring function: ψ(yi−1,yi,x, i)

 dependency between the labels of adjacent tokens

 be defined in terms of weighted functions



 The conditional distribution



 _{Z(x) =}



(64)

3.2 Segment-level

Models

 The output is a sequence of segments

 Each segment defining an entity

 Features can now capture joint properties over the tokens

(65)

3.2.1 Feature



 Entity-level Features

 Similarity to an Entity in the Database

 maximum TF-IDF similarity

 Length of the Entity Segment



^y_i ^y_i ^x ^l_j ^u _j



f , _₁, , ,

(66)

3.2.2 Global Segmentation

Models



 W := weight vector for feature vector f

 _{f(x,s) :=}



 Goal of inference

 w‧f (x,s) is maximized

(67)

3.3 Grammar-based

Models

 Require a better interpretation of the structure of the source

 Like in context-free grammars of programming language

 The productions are defined over terminals and non-terminal

 Output of the extraction process is a parse tree

 Each output tree is assigned a score

(68)

Example

 R: S → AuthorsLF | AuthorsFL

 R0: AuthorsLF → NameLF Separator AuthorsLF

 R1: AuthorsFL → NameFL Separator AuthorsFL

 R2: AuthorsFL → NameFL

 R3: AuthorsLF → NameLF

 R4: NameLF Separator → NameLF Punctuation

 R5: NameFL Separator → NameFL Punctuation

 R6: NameLF → LastName First Middle

 R7: NameFL → First Middle LastName

(69)

Example (score)

 Each production of the form: R

→R1R2



 _(l₁_,r₁) := the text spans in x that R1 cover

 _(r₁_{+ 1,r}₂) := the text spans in x that R2 cover

 _w ・ f (R, x, l, r) := terminal nodes

(70)

Example(cont.)

 a string “Peter Haas, George John”

R0→ R4 R3 R4→ R6 x3 R3 → R6 R6→ x1x2 R6→ x4x5

 total score of this tree

w ・ f (R0,R4,R3,x,1,3,5)

+ w ・ f (R4,R6,Punctuation,x,1,2,3) + w ・ f (R3,R6,φ,x,1,2,2)

+ w ・ f (R6,x,1,2) + w ・ f (R6,x,3, 4)

(71)

3.4 Training Algorithms

 Apply to the different feature based models

 score s(y) = w _{・ f (x,y)}

 outputs a y for the s(y) is maximum

 _{f (x,y)}

 feature vector



 the labeled training set

 



x_l y_l



_l^N

D ^ , _₁

(72)

3.4.1 Likelihood Trainer





 _Goal

 Choose a weight vector w that the probability of the correct output is maximized

 _Log



(73)

Likelihood Trainer(cont.)

 Training objective



 be maximized by gradient ascent type of methods

 Gradient ∇L(w)



(74)

Training algorithm

(75)

3.4.2 Max-margin

Training

 An extension of SVM for training structured models

 _Goal

 find a w score w ・ f (x, y) of the correct labeling yl has a margin of at least err(y, yl) more than the score of a labeling y

 err(y, yl)

▪ user-provided error function

▪ Indicates the penalty of outputting y when the correct label is yl

(76)

Max-margin Training

(cont.)

 Training objective



 ξ = gap between the desired

margin and the margin achieved by the current values of w

 The program is convex in w,ξ

 local minima is the global minima

(77)

Training algorithm

(78)

3.5 Inference Algorithms

 Highest scoring (MAP) labeling

 _Find

 Expected feature values:

 _Find



 Feature vector f (x,y) decomposes

 Finer substructures in y

 solving efficiently

 enables to design a dynamic programming algorithm

) , ( max

arg

* w f x y

y  _y 

(79)

General principle

 S := entire output



 _S₃ smallest part of S₁

 all features spanning S₁ and S₂ involve only the subset in S₃

2

1 ^S

S

S  



2 

1 ^S

S 

(80)

Find the MAP score V(S) over

S recursively



 _{When S}₃ is small, the equation will be efficient

(81)

MAP for Sequential Labeling

 feature vector decomposes over labels of adjacent positions



 donate the best score of the sequence

 Running time

 _O(nm²₎

(82)

MAP for Segmentations

 _{y = s}₁_…s_p

 _s_j_{= (t}_j_{, u}_j_{, y}_j₎

▪ _t_j = segment start position

▪ _U_j = segment end position

▪ _y_j = segment label



 Running time

 _O(nLm²₎

(83)

MAP for Parse Trees



 Running time

 _O(n³_m’)

 m’ := the total number of non-terminals and terminals

(84)

Expected Features Values for

Sequential Labeling

 _Computing



 _y_i:y := all label sequences from 1 to i



 base cases: α(0,y) = 1



(85)

Expected Features Values for

Sequential Labeling(cont.)

 Computing the expectation



(86)

Expected Features Values for

Sequential Labeling(cont.)

 Reduce space requirements

 Km + nm => K + nm

 K is the number of features

 Backward recursion



 _Β_i:y := all label sequences from i + 1 to n



(87)

Summary

 Models for entity extraction using statistical models

 The most prominent

 CMM, HMM and CRF

 Segment-level models and the grammar-based models

 Provide all the flexibility of CRFs and more

 Increased overheads of inference and training