Information Extraction
Survey
Sunita Sarawagi
Foundations and Trends in Databases Vol. 1, No. 3 (2007) 261–377
Part I Introduction
Outline
Applications
Organization of the Survey
Types of Structure Extracted
Types of Unstructured Sources
Input Resources for Extraction
Methods of Extraction
Output of Extraction Systems
Applications
Enterprise Applications
Personal Information Management
Scientific Applications
Web Oriented Applications
Enterprise Applications (1/2)
Data Cleaning
convert addresses into structured forms (road name, city, and state)
provide better querying and more robust deduplication
News Tracking
structured entities-relations: people names
“is-CEO-of” company names
specific event: disease outbreaks, terrorist
multimedia: integrate video and pictures of entities and events
Enterprise Applications
(2/2)
Classified Ads
extraction research on record-oriented data
Customer Care
identify product names and product attributes
link customer emails to a specific transaction
extract customer moods from phone transcripts
Extract product attribute and value from 6
Personal Information
Management (PIM)
organize personal data like
documents, emails, projects and people in a structured inter-linked format
automatically extract structure from existing unstructured sources
Scientific Applications
Bio-informatics has broadened the scope of earlier extractions from
named entities, to biological objects such as proteins and genes.
A central problem is extracting from paper repositories such as Pubmed, protein names, and their interaction.
Web Oriented
Applications (1/3)
Citation Databases
extract sources, e.g. conference web sites, individual home pages, PDF paper.
Segment: authors, title, venue, and year field
value added in terms: forward references and
aggregate statistics (author-level citation counts).
Opinion Databases
free text form hidden behind Blogs, newsgroup posts, review sites, etc.
products are useful to find out each product feature and the prevalent polarity of opinion.
Web Oriented Applications
(2/3)
Community Websites
DBLife and Rexa track researchers,
conferences, talks, projects, titles and events relevant community.
Comparison Shopping
crawl products and prices on merchant web sites comparison shopping
merchant web is hidden behind forms and scripting languages crawl and extract
information from form-based web sites
Web Oriented Applications
(3/3)
Ad Placement on Webpages
Structured Web Searches
keyword (nouns or noun phrases) entities
fail on queries that look for relationships between entities
Product description
Positive opinion product (product
advertisements)
Organization of the Survey
The type of structure extracted
The type of unstructured source
The type of input available for extraction
The method used for extraction
The output of extraction
Types of Structure Extracted
Entities
Relationships
Adjectives Describing Entities
Structures such as Lists, Tables, and Ontologies
Entities
noun phrases and a few tokens in the unstructured text
Example named entities: persons, locations, and companies disease names, protein
names, paper titles, and journal names
three subtasks: proper names and acronyms of persons, locations, and organizations
(ENAMEX), absolute temporal terms (TIMEX) and monetary and other numeric
expressions (NUMEX).
Entity Extractions
Examples
Fig. 1.1 Traditionally named entity and relationship extraction from plain text
Relationships (1/2)
Relationships are defined over two or more entities related in a pre-defined way. Ex: “is employee of” person and organization, “is acquired by” pair companies, “location of outbreak” disease and location
subtype of record extraction is event extraction, ex: disease outbreak “disease name”,
“location of the outbreak”, “number of people affected”, “number of people killed”, and “date of outbreak” (multi-way relationship)
Relationships (2/2)
Semantic Role Labeling
the goal is to identify various semantic argu-ments of the predicate.
Ex: given a predicate accept in the
sentence “He accepted the manuscript from his dying father with trembling
hands”. The extraction task is to find the role-sets predicate: “acceptor”, “thing
accepted”, and “accepted-from”.
Adjectives Describing
Entities
A adjective describes an entity.
A adjective derives by combining soft clues spread over different
words around the entity. Ex: entity restaurant or music bands extract parts of a Blog or web-page that
presents a critique of the entities critique positive or negative
Structures such as Lists,
Tables, and Ontologies
Expanded extraction systems to richer structures (tables, lists, and trees) from various types of
documents.
Types of Unstructured
Sources
Granularity of Extraction
Heterogeneity of Unstructured Sources
Granularity of Extraction
Record or Sentences
extract a small text snippets either unstructured records (addresses, citations) or sentences
extracted (NL paragraph)
each word is a part of field, during extraction need to segment the text at the entity
boundaries.
Paragraphs and Documents
consider multiple sentences or a document for meaningful extractions.
extract from longer information is a main challenge to filtering efficiently.
Heterogeneity of
Unstructured Sources
Machine Generated Pages
HTML documents dynamically generated via database backed sites wrappers
Partially Structured Domain Specific Sources
well-defined scope (news articles, citations) possible to develop a decent extraction model given enough labeled data
Open Ended Sources
one important factor is to exploit the redundancy of the extracted information across many different
sources.
Input Resources for
Extraction
Structured Databases
Labeled Unstructured Text
Preprocessing Libraries for Unstructured Text
Structured Databases
Existing structured databases
(entities and relationships) are a valuable resource to improve
extraction accuracy.
unstructured data needs to be
integrated with structured databases so the time of extraction a large
database is available.
Labeled Unstructured
Text
labeled unstructured text requires tedious labeling effort.
A labeled unstructured source is significantly
more valuable than a structured database because it provides
contextual information about an entity.
an entity appears in the unstructured data is often a very noisy form.
Preprocessing Libraries for
Unstructured Text
Natural Language Text
Natural language documents are analyzed by a deep pipeline of preprocessing libraries
Sentence analyzer and tokenizer
▪ identify the boundaries of sentences in a document
▪ decomposes each sentence into tokens (delimiters like spaces, commas, and dots)
Part of speech tagger
▪ assigns to each word a grammatical category (noun, verb, adjective, adverb, article, conjunct, and pronoun)
Preprocessing Libraries for
Unstructured Text (2)
▪ Ex: Brown tag has 179 tags and the Penn treebank tag has 45 tags.
▪ The/DT University/NNP of/IN Helsinki/NNP hosts/VBZ ICML/NNP this/DT year/NN
Parser
▪ group words into prominent phrase types, such as noun phrases, prepositional phrases, and verb phrases.
▪ The output parsing is a parse tree that groups words into syntactic phrases.
▪ Parse trees are used for entity extraction because
typically named entities are noun phrases. In relationship extraction they provide valuable linkages between verbs and their arguments.
Preprocessing Libraries for
Unstructured Text (3)
Dependency analyzer
▪ identify the words that form arguments of other words in the sentence.
▪ Ex: “Apple is located in Cupertino”, the word “Apple” (S) and “Cupertino” (O) are dependent on the word “located” (argument).
Formatted Text
need for understanding the overall structure and layout of the source (pdf, web-page)
before entity extraction.
extract items in a list-like environment and create hierarchies of rectangular regions comprising logical units of content.
Methods of Extraction
Hand-coded
require human experts (to define rules or regular expressions) or program snippets (for performing the extraction).
Learning-based
require manually labeled unstructured examples to train machine learning
models of extraction.
Methods of Extraction
(2)
Rule-based
drive by hard predicates, easier to interpret and develop, more useful in closed domains where human involvement is both essential and available.
Statistical methods
decision based on a weighted sum of
predicate firings, more robust to noise in the unstructured data, more useful in closed
domains where human involvement is both essential and available.
Output of Extraction Systems
Identify all mentions of the structured information in the unstructured text.
Populate a database of structured entities.
Part II Entity Extraction
– Rule Based Method
Outline
Introduction
Feature of token
Single entity
Multiple entities
Mark entity boundaries
Organizing collection of rules
Rule learning algorithm
3 3
Introduction ( 1/4 )
• Many real-life extraction tasks can be handled through a collection
rules.
• Task is controlled and well-behaved
– Phone number
– Zip codes from e-mails
• It can be faster and more easier amenable to optimizations
3 4
Introduction ( 2/4 )
• A typical ruled-based system consist of
– A collection of rules
– A set of policies to control the firings of multiple rules
3 5
Introduction ( 3/4 )
• Basic rule : Contextual pattern -> action
• Contextual pattern consists pf one or more labeled pattern to capture
the different entities.
– NP -> Det Adj Noun { The cute dog }
3 6
Introduction ( 4/4 )
• Label patterns are usually regular
expression which consists of tokens.
• The action part
– Assigning an entity label to a sequence of tokens
– Inserting the start or the end of an entity tag at a position
– Assigning multiple entity tags
3 7
Feature of tokens
The string representing the token.
Punctuations
The part of speech of the token
The list of dictionaries in while the token appear
Predefined text features
isNumber
IsCapitalize
3 8
Identify a single entity
• A single full entity consists of three types of patterns
– An option pattern capturing the context before the start of an entity
– A pattern matching the tokens in the entity
– An optional pattern for capturing the context after the end of the entity
3 9
Identify a single entity
• A pattern for identifying person name of form “ Dr. Yair Weiss ”
– Consisting of a little token as listed in a dictionary of titles ( “Prof”, “Dr”, “Mr”)
– A dot
– Two capitalized words
• ({ DictionaryLookup = Titles } { String = ‘.’ }
{ Orthography type = capitalized
word } { 2 })
4
0
Identify a single entity
• A pattern for marking all number
following words “by” and “in” as the Year entity is
– ({ String = “by” | String = “in” })
– ({ Orthography type = Number }) :y -> Year=:y
4 1
Rules for multiple
entities
• Some rules take the form of regular expressions with multiple slots
• Example
– the number of bedrooms and rent from an apartment rental ad.
• ({Orthography type = Digit}) : Bedroom
({String = “BR”})({}*)({String
=“$”})
(Orthography type = Number) :
Price
4
2
Mark entity boundaries
• In particular long entities , it is more efficient to define separate
rules to mark the start and end of an entity boundary.
• SGML (Standard Generalized Markup Language)
4 3
Organizing collection of
rules
• Rule-based system consists of a very large collection of rules.
• A rule identifies a span of text to be called a particular entity.
• The spans demarcated by different rules overlap and lead to
conflicting actions.
• None standardized and custom- tuned.
Unordered rules
• Treat rules as an unordered collection of disjuncts.
• Each rule fires independently of the other.
• Prefer rules that mark larger spans of text as an entity type.
• Merge the spans of text that overlap.
4 5
Rules Arranged as an
Ordered Set
• Define a complete priority order on all the rules.
• Prefer to choose high priority rule.
• A complete order over rules is that a later rule can be defined on the actions of earlier rules.
4 6
Rule learning algorithm
• Rule can be learnt automatically from labeled example of entities in unstructured text.
– Rset = set of rules, initially empty
– While there exists an entity D not covered by any rule in Rset
• Form a new rule
• Add new rule to Rset
– Post process rules to prune away redundant rules
4 7
Rule learning algorithm
The main challenge is how to a new rule that has high coverage and has high precision
They broadly fall under two classes
Bottom up
Top down
4 8
Bottom-up rule
formation
This rule has minimal coverage, but 100% precision
It grown from an instance that is not covered by the existing rule set
1. Create of a seed rule from an uncovered instance
2. Generalization of the seed rule
3. Removal of instances that are covered by the new rules
4 9
Bottom-up rule
formation
Rule Result
The cute dog swam the river
Det <- the Det cute dog swam the river Adj <- cute Det Adj dog swam the river Noun <- dog Det Adj Noun swam the river Verb <- swam Det Adj Noun Verb the river Det <- the Det Adj Noun Verb Det river Noun <- river Det Adj Noun Verb Det Noun NP <- Det Adj Noun NP Verb Det Noun
NP <- Det Noun NP Verb NP
VP <- Verb NP NP VP
S <- NP VP S
5 0
Top-down rule formation
The starting rule covers all possible instances, which means it has 100% coverage and poor precision
5 1
Top-down rule formation
Rule Result
S
S -> NP + VP NP VP
NP -> Det Adj Noun Det Adj Noun VP
Det -> the The Adj Noun VP
Adj -> cute The cute Noun VP
Noun -> dog The cute dog VP
VP -> Verb NP The cute cat Verb NP Verb -> swam The cute cat swam NP NP -> Det Noun The cute cat swam Det
Noun
Det -> the The cute cat swam the Noun
Noun -> river The cute cat swam the
river
5
2
Summary
• Rule-based systems provide a convenient
method of defining extraction pattern.
Rules are typically hand-coded by a domain
expert.
• It is easy for a human being to
interpret, develop and argument the
set of rules.
5
3
Part III Entity
Extraction: Statistical
Methods
Outline
3.1 token-level models
3.2 segment-level models
3.3 grammar-based models
3.4 Training algorithms
3.5 Inference algorithms
Notation
X
unstructured input
▪ tokens
E
The set of entity types
Y
tag sequence
xn
x ,...,1
3.1 Token-level Models
Splitting an unstructured text into a sequence of tokens
along a pre-defined set of delimiters
Assign an entity label to each token
3.1 Token-level Models
(cont.)
The set of labels Y
the set of entity types E
other
e.g. Y ={HouseNo, Street, City, State, Zip, Country, Other}
Decompose each entity
Entities typically comprise of multiple tokens
Encoding
▪ BCEO
▪ B=Begin, C=Continue, E=End, O=Other
▪ BIO
▪ B=Begin, I=Inside, O=Other
Features
The basis of classification process
x := sequence token
y := label
i := token position
x y i
Rf : , ,
Token properties
Word Features
Orthographic Features
Dictionary Lookup Features
Models for Labeling
Tokens
Using features derived from the token xi and its neighbors in X
Support Vector Machine (SVM)
Capturing the dependency between the labels of adjacent words
Hidden Markov Models (HMMs)
Maximum Entropy Markov Models (MEMM)
Conditional Markov models (CMM)
Conditional Random Fields (CRFs)
Conditional Random
Fields
Models a single joint distribution Pr(y|x)
the predicted labels y of the tokens of x
A chain is adequate for capturing label dependencies
The label yi is influenced only by the labels that are adjacent to it
Conditional Random Fields
(cont.)
scoring function: ψ(yi−1,yi,x, i)
dependency between the labels of adjacent tokens
be defined in terms of weighted functions
The conditional distribution
Z(x) =
3.2 Segment-level
Models
The output is a sequence of segments
Each segment defining an entity
Features can now capture joint properties over the tokens
3.2.1 Feature
Entity-level Features
Similarity to an Entity in the Database
maximum TF-IDF similarity
Length of the Entity Segment
yi yi x lj u j
f , 1, , ,
3.2.2 Global Segmentation
Models
W := weight vector for feature vector f
f(x,s) :=
Goal of inference
w‧f (x,s) is maximized
3.3 Grammar-based
Models
Require a better interpretation of the structure of the source
Like in context-free grammars of programming language
The productions are defined over terminals and non-terminal
Output of the extraction process is a parse tree
Each output tree is assigned a score
Example
R: S → AuthorsLF | AuthorsFL
R0: AuthorsLF → NameLF Separator AuthorsLF
R1: AuthorsFL → NameFL Separator AuthorsFL
R2: AuthorsFL → NameFL
R3: AuthorsLF → NameLF
R4: NameLF Separator → NameLF Punctuation
R5: NameFL Separator → NameFL Punctuation
R6: NameLF → LastName First Middle
R7: NameFL → First Middle LastName
Example (score)
Each production of the form: R
→R1R2
(l1,r1) := the text spans in x that R1 cover
(r1 + 1,r2) := the text spans in x that R2 cover
w ・ f (R, x, l, r) := terminal nodes
Example(cont.)
a string “Peter Haas, George John”
R0→ R4 R3 R4→ R6 x3 R3 → R6 R6→ x1x2 R6→ x4x5
total score of this tree
w ・ f (R0,R4,R3,x,1,3,5)
+ w ・ f (R4,R6,Punctuation,x,1,2,3) + w ・ f (R3,R6,φ,x,1,2,2)
+ w ・ f (R6,x,1,2) + w ・ f (R6,x,3, 4)
3.4 Training Algorithms
Apply to the different feature based models
score s(y) = w ・ f (x,y)
outputs a y for the s(y) is maximum
f (x,y)
feature vector
the labeled training set
xl yl
lND , 1
3.4.1 Likelihood Trainer
Goal
Choose a weight vector w that the probability of the correct output is maximized
Log
Likelihood Trainer(cont.)
Training objective
be maximized by gradient ascent type of methods
Gradient ∇L(w)
Training algorithm
3.4.2 Max-margin
Training
An extension of SVM for training structured models
Goal
find a w score w ・ f (x, y) of the correct labeling yl has a margin of at least err(y, yl) more than the score of a labeling y
err(y, yl)
▪ user-provided error function
▪ Indicates the penalty of outputting y when the correct label is yl
Max-margin Training
(cont.)
Training objective
ξ = gap between the desired
margin and the margin achieved by the current values of w
The program is convex in w,ξ
local minima is the global minima
Training algorithm
3.5 Inference Algorithms
Highest scoring (MAP) labeling
Find
Expected feature values:
Find
Feature vector f (x,y) decomposes
Finer substructures in y
solving efficiently
enables to design a dynamic programming algorithm
) , ( max
arg
* w f x y
y y
General principle
S := entire output
S3 smallest part of S1
all features spanning S1 and S2 involve only the subset in S3
2
1 S
S
S
2
1 S
S
Find the MAP score V(S) over
S recursively
When S3 is small, the equation will be efficient
MAP for Sequential Labeling
feature vector decomposes over labels of adjacent positions
donate the best score of the sequence
Running time
O(nm2)
MAP for Segmentations
y = s1…sp
sj = (tj, uj, yj)
▪ tj = segment start position
▪ Uj = segment end position
▪ yj = segment label
Running time
O(nLm2)
MAP for Parse Trees
Running time
O(n3m’)
m’ := the total number of non-terminals and terminals
Expected Features Values for
Sequential Labeling
Computing
yi:y := all label sequences from 1 to i
base cases: α(0,y) = 1
Expected Features Values for
Sequential Labeling(cont.)
Computing the expectation
Expected Features Values for
Sequential Labeling(cont.)
Reduce space requirements
Km + nm => K + nm
K is the number of features
Backward recursion
Βi:y := all label sequences from i + 1 to n
Summary
Models for entity extraction using statistical models
The most prominent
CMM, HMM and CRF
Segment-level models and the grammar-based models
Provide all the flexibility of CRFs and more
Increased overheads of inference and training