IE Survey by Sarawagi – Schedule for 2017 Web Information Extraction and Retrieval

(1)

Information Extraction

Survey

Sunita Sarawagi

Foundations and Trends in Databases Vol. 1, No. 3 (2007) 261–377

(2)

Part I Introduction

(3)

Outline

Applications

Organization of the Survey

Types of Structure Extracted

Types of Unstructured Sources

Input Resources for Extraction

Methods of Extraction

Output of Extraction Systems

(4)

Applications

Enterprise Applications

Personal Information Management

Scientific Applications

Web Oriented Applications

(5)

Enterprise Applications (1/2)

Data Cleaning

convert addresses into structured forms (road name, city, and state)

provide better querying and more robust deduplication

News Tracking

structured entity-relation facts: person names linked to company names by “is-CEO-of”

specific events: disease outbreaks, terrorist attacks

multimedia: integrate video and pictures of entities and events

(6)

Enterprise Applications

(2/2)

Classified Ads

extraction research on record-oriented data

Customer Care

identify product names and product attributes

link customer emails to a specific transaction

extract customer moods from phone transcripts

extract product attributes and their values

(7)

Personal Information Management (PIM)

organize personal data like

documents, emails, projects and people in a structured inter-linked format

automatically extract structure from existing unstructured sources

(8)

Scientific Applications

Bio-informatics has broadened the scope of extraction from classical named entities to biological objects such as proteins and genes.

A central problem is extracting protein names and their interactions from paper repositories such as PubMed.

(9)

Web Oriented Applications (1/3)

Citation Databases

extraction sources: conference web sites, individual home pages, PDF papers

segment citations into author, title, venue, and year fields

value added in terms of forward references and aggregate statistics (author-level citation counts)

Opinion Databases

opinions in free-text form are hidden behind blogs, newsgroup posts, review sites, etc.

structured opinion databases make it possible to find, for each product feature, the prevalent polarity of opinion

(10)

Web Oriented Applications (2/3)

Community Websites

DBLife and Rexa track researchers, conferences, talks, projects, titles, and events relevant to the community.

Comparison Shopping

crawl products and prices on merchant web sites → comparison shopping

merchant data is hidden behind forms and scripting languages → need to crawl and extract information from form-based web sites

(11)

Web Oriented Applications (3/3)

Ad Placement on Webpages

Structured Web Searches

keyword queries (nouns or noun phrases) → entities

keyword search fails on queries that look for relationships between entities

product descriptions and positive opinions about a product → product advertisements

(12)

Organization of the Survey

The type of structure extracted

The type of unstructured source

The type of input available for extraction

The method used for extraction

The output of extraction

(13)

Types of Structure Extracted

Entities

Relationships

Adjectives Describing Entities

Structures such as Lists, Tables, and Ontologies

(14)

Entities

entities are typically noun phrases comprising one to a few tokens in the unstructured text

example named entities: persons, locations, and companies → extended to disease names, protein names, paper titles, and journal names

three subtasks: proper names and acronyms of persons, locations, and organizations (ENAMEX), absolute temporal terms (TIMEX), and monetary and other numeric expressions (NUMEX)

(15)

Entity Extraction Examples

Fig. 1.1: Traditional named entity and relationship extraction from plain text

(16)

Relationships (1/2)

Relationships are defined over two or more entities related in a pre-defined way. Ex: “is employee of” → person and organization, “is acquired by” → pair of companies, “location of outbreak” → disease and location.

A subtype of relationship extraction is event extraction. Ex: disease outbreak → “disease name”, “location of the outbreak”, “number of people affected”, “number of people killed”, and “date of outbreak” (a multi-way relationship).

(17)

Relationships (2/2)

Semantic Role Labeling

the goal is to identify the various semantic arguments of a predicate.

Ex: given the predicate accept in the sentence “He accepted the manuscript from his dying father with trembling hands”, the extraction task is to find the role-set of the predicate: “acceptor”, “thing accepted”, and “accepted-from”.

(18)

Adjectives Describing Entities

An adjective describes an entity.

An adjective is derived by combining soft clues spread over different words around the entity. Ex: for entities such as restaurants or music bands → extract the parts of a blog or web page that present a critique of the entity → classify the critique as positive or negative.

(19)

Structures such as Lists, Tables, and Ontologies

Extraction systems have been expanded to richer structures (tables, lists, and trees) from various types of documents.

(20)

Types of Unstructured Sources

Granularity of Extraction

Heterogeneity of Unstructured Sources

(21)

Granularity of Extraction

Record or Sentences

extract from small text snippets: either unstructured records (addresses, citations) or extracted sentences (natural-language paragraphs)

each word is part of some field, so extraction needs to segment the text at the entity boundaries.

Paragraphs and Documents

consider multiple sentences or a whole document for meaningful extractions.

extracting from longer inputs makes efficient filtering the main challenge.

(22)

Heterogeneity of Unstructured Sources

Machine Generated Pages

HTML documents dynamically generated by database-backed sites → wrappers

Partially Structured Domain Specific Sources

well-defined scope (news articles, citations) → possible to develop a decent extraction model given enough labeled data

Open Ended Sources

one important factor is to exploit the redundancy of the extracted information across many different

sources.

(23)

Input Resources for Extraction

Structured Databases

Labeled Unstructured Text

Preprocessing Libraries for Unstructured Text

(24)

Structured Databases

Existing structured databases

(entities and relationships) are a valuable resource to improve

extraction accuracy.

unstructured data often needs to be integrated with existing structured databases, so a large database is available at the time of extraction.

(25)

Labeled Unstructured Text

Labeled unstructured text requires tedious labeling effort.

A labeled unstructured source is significantly more valuable than a structured database because it provides contextual information about an entity.

an entity often appears in the unstructured data in a very noisy form.

(26)

Preprocessing Libraries for Unstructured Text

Natural Language Text

Natural language documents are analyzed by a deep pipeline of preprocessing libraries

Sentence analyzer and tokenizer

identify the boundaries of sentences in a document

decomposes each sentence into tokens (delimiters like spaces, commas, and dots)

Part of speech tagger

assigns to each word a grammatical category (noun, verb, adjective, adverb, article, conjunction, and pronoun)
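The sentence-splitting and tokenization steps above can be sketched with plain regular expressions. This is a minimal illustration; real sentence analyzers handle abbreviations, quotes, and other corner cases.

```python
import re

# Split text into sentences at terminal punctuation followed by whitespace.
def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Decompose a sentence into word and punctuation tokens.
def tokens(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens("The University of Helsinki hosts ICML this year.")
# -> ['The', 'University', 'of', 'Helsinki', 'hosts', 'ICML', 'this', 'year', '.']
```

A POS tagger would then run over these tokens to produce labels like the Penn Treebank tags shown on the next slide.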

(27)

Preprocessing Libraries for Unstructured Text (2)

Ex: the Brown tag set has 179 tags and the Penn Treebank tag set has 45 tags.

The/DT University/NNP of/IN Helsinki/NNP hosts/VBZ ICML/NNP this/DT year/NN

Parser

group words into prominent phrase types, such as noun phrases, prepositional phrases, and verb phrases.

The output of parsing is a parse tree that groups words into syntactic phrases.

Parse trees are useful for entity extraction because named entities are typically noun phrases. In relationship extraction they provide valuable linkages between verbs and their arguments.

(28)

Preprocessing Libraries for Unstructured Text (3)

Dependency analyzer

identify the words that form arguments of other words in the sentence.

Ex: in “Apple is located in Cupertino”, the words “Apple” (subject) and “Cupertino” (object) are arguments of the word “located”.

Formatted Text

need for understanding the overall structure and layout of the source (pdf, web-page)

before entity extraction.

extract items in a list-like environment and create hierarchies of rectangular regions comprising logical units of content.

(29)

Methods of Extraction

Hand-coded

require human experts to define rules, regular expressions, or program snippets for performing the extraction.

Learning-based

require manually labeled unstructured examples to train machine learning

models of extraction.

(30)

Methods of Extraction (2)

Rule-based

driven by hard predicates; easier to interpret and develop; more useful in closed domains where human involvement is both essential and available.

Statistical methods

decision based on a weighted sum of predicate firings; more robust to noise in the unstructured data; more useful in open-ended domains where enough labeled data is available.

(31)

Output of Extraction Systems

Identify all mentions of the structured information in the unstructured text.

Populate a database of structured entities.

(32)

Part II Entity Extraction – Rule-Based Methods

(33)

Outline

Introduction

Features of tokens

Single entity

Multiple entities

Mark entity boundaries

Organizing collection of rules

Rule learning algorithm


(34)

Introduction (1/4)

Many real-life extraction tasks can be handled through a collection of rules.

The task is controlled and well-behaved, e.g.

Phone number

Zip codes from e-mails

It can be faster and more easily amenable to optimizations

(35)

Introduction (2/4)

A typical rule-based system consists of

A collection of rules

A set of policies to control the firings of multiple rules

(36)

Introduction (3/4)

Basic rule: Contextual pattern -> action

A contextual pattern consists of one or more labeled patterns that capture the different entities.

NP -> Det Adj Noun { The cute dog }

(37)

Introduction (4/4)

Label patterns are usually regular expressions defined over tokens.

The action part:

Assigning an entity label to a sequence of tokens

Inserting the start or the end of an entity tag at a position

Assigning multiple entity tags

(38)

Features of tokens

The string representing the token.

Punctuations

The part of speech of the token

The list of dictionaries in which the token appears

Predefined text features

isNumber

IsCapitalize


(39)

Identify a single entity

A rule for a single full entity consists of three types of patterns:

An optional pattern capturing the context before the start of an entity

A pattern matching the tokens in the entity

An optional pattern capturing the context after the end of the entity

(40)

Identify a single entity

A pattern for identifying person names of the form “Dr. Yair Weiss”, consisting of:

A title token listed in a dictionary of titles (“Prof”, “Dr”, “Mr”)

A dot

Two capitalized words

({DictionaryLookup = Titles} {String = “.”} {Orthography type = Capitalized word}{2})

(41)

Identify a single entity

A pattern marking any number following the words “by” or “in” as a Year entity:

({String = “by” | String = “in”}) ({Orthography type = Number}):y -> Year = :y
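The two rule patterns above can be approximated as Python regular expressions. This is a sketch: the rules on the slides are written in a generic rule language, and the regexes here are illustrative renderings, not the survey's own notation.

```python
import re

# Title from a dictionary of titles, a dot, then two capitalized words,
# approximating ({DictionaryLookup = Titles} {String = "."} {Capitalized word}{2}).
TITLES = ("Prof", "Dr", "Mr")
person_rule = re.compile(
    r"\b(?:%s)\.\s+[A-Z][a-z]+\s+[A-Z][a-z]+" % "|".join(TITLES)
)

# A number following the word "by" or "in", captured as a Year entity.
year_rule = re.compile(r"\b(?:by|in)\s+(?P<Year>\d+)\b")

person_rule.search("Talk by Dr. Yair Weiss in 2004").group(0)
# -> 'Dr. Yair Weiss'
year_rule.search("Talk by Dr. Yair Weiss in 2004").group("Year")
# -> '2004'
```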

(42)

Rules for multiple entities

Some rules take the form of regular expressions with multiple slots

Example

extract the number of bedrooms and the rent from an apartment rental ad:

({Orthography type = Digit}):Bedroom ({String = “BR”}) ({}*) ({String = “$”}) ({Orthography type = Number}):Price
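A multi-slot regex rendering of the bedroom/rent rule above. The slot names Bedroom and Price follow the slide; the ad text and the regex itself are illustrative, not the survey's notation.

```python
import re

# Digits before "BR", anything (non-greedy), then "$" followed by the price,
# approximating ({Digit}):Bedroom ({String="BR"}) ({}*) ({String="$"}) ({Number}):Price
ad_rule = re.compile(r"(?P<Bedroom>\d+)\s*BR\b.*?\$\s*(?P<Price>\d+)")

m = ad_rule.search("Spacious 2 BR apartment, hardwood floors, $1400/month")
# m.group("Bedroom") -> '2', m.group("Price") -> '1400'
```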

(43)

Mark entity boundaries

For long entities in particular, it is more efficient to define separate rules that mark the start and the end of an entity boundary.

SGML (Standard Generalized Markup Language)

(44)

Organizing collections of rules

A rule-based system consists of a very large collection of rules.

A rule identifies a span of text to be called a particular entity.

The spans demarcated by different rules may overlap and lead to conflicting actions.

Conflict-resolution policies are not standardized and are custom-tuned.

(45)

Unordered rules

Treat rules as an unordered collection of disjuncts.

Each rule fires independently of the other.

Prefer rules that mark larger spans of text as an entity type.

Merge the spans of text that overlap.


(46)

Rules Arranged as an Ordered Set

Define a complete priority order on all the rules.

Prefer the highest-priority rule that fires.

An advantage of a complete order over rules is that a later rule can be defined in terms of the actions of earlier rules.

(47)

Rule learning algorithm

Rules can be learned automatically from labeled examples of entities in unstructured text.

Rset = set of rules, initially empty

While there exists an entity D not covered by any rule in Rset

Form a new rule

Add new rule to Rset

Post process rules to prune away redundant rules
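The covering loop above can be sketched in Python. The rule representation (a frozenset of required token features, with subset testing as "covers") and the tiny data are illustrative placeholders; `form_rule` stands in for the bottom-up or top-down rule formation described next.

```python
# Rules are modeled as frozensets of required token features; a rule covers
# an instance if all its required features are present (illustrative choice).
def covers(rule, instance):
    return rule <= instance["features"]

def learn_rules(instances, form_rule):
    """Covering loop: add rules until every instance is covered, then prune."""
    rset = []
    uncovered = list(instances)
    while uncovered:
        # form_rule must cover its seed instance, or the loop will not terminate
        rule = form_rule(uncovered[0])
        rset.append(rule)
        uncovered = [d for d in uncovered if not covers(rule, d)]
    # post-process: drop rules made redundant by a strictly more general rule
    return [r for i, r in enumerate(rset)
            if not any(o < r for o in rset[:i] + rset[i+1:])]

insts = [{"features": frozenset({"capitalized", "in_title_dict"})},
         {"features": frozenset({"number"})}]
rules = learn_rules(insts, lambda d: frozenset(d["features"]))
# -> two rules, one per instance
```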


(48)

Rule learning algorithm

The main challenge is how to form a new rule that has high coverage and high precision.

Approaches broadly fall under two classes:

Bottom up

Top down


(49)

Bottom-up rule formation

A seed rule has minimal coverage but 100% precision.

It is grown from an instance that is not covered by the existing rule set:

1. Create a seed rule from an uncovered instance

2. Generalize the seed rule

3. Remove instances that are covered by the new rules

(50)

Bottom-up rule formation

Rule                Result
(start)             The cute dog swam the river
Det <- the          Det cute dog swam the river
Adj <- cute         Det Adj dog swam the river
Noun <- dog         Det Adj Noun swam the river
Verb <- swam        Det Adj Noun Verb the river
Det <- the          Det Adj Noun Verb Det river
Noun <- river       Det Adj Noun Verb Det Noun
NP <- Det Adj Noun  NP Verb Det Noun
NP <- Det Noun      NP Verb NP
VP <- Verb NP       NP VP
S <- NP VP          S

(51)

Top-down rule formation

The starting rule covers all possible instances, which means it has 100% coverage and poor precision


(52)

Top-down rule formation

Rule                Result
(start)             S
S -> NP VP          NP VP
NP -> Det Adj Noun  Det Adj Noun VP
Det -> the          The Adj Noun VP
Adj -> cute         The cute Noun VP
Noun -> dog         The cute dog VP
VP -> Verb NP       The cute dog Verb NP
Verb -> swam        The cute dog swam NP
NP -> Det Noun      The cute dog swam Det Noun
Det -> the          The cute dog swam the Noun
Noun -> river       The cute dog swam the river

(53)

Summary

Rule-based systems provide a convenient method of defining extraction patterns.

Rules are typically hand-coded by a domain expert.

It is easy for a human being to interpret, develop, and augment the set of rules.

(54)

Part III Entity Extraction: Statistical Methods

(55)

Outline

3.1 token-level models

3.2 segment-level models

3.3 grammar-based models

3.4 Training algorithms

3.5 Inference algorithms

(56)

Notation

X

unstructured input

tokens

E

The set of entity types

Y

tag sequence

xn

x ,...,1

(57)

3.1 Token-level Models

Splitting an unstructured text into a sequence of tokens

along a pre-defined set of delimiters

Assign an entity label to each token

(58)

3.1 Token-level Models (cont.)

The set of labels Y

the set of entity types E

other

e.g. Y ={HouseNo, Street, City, State, Zip, Country, Other}

Decompose each entity: entities typically comprise multiple tokens

Encoding

BCEO

B=Begin, C=Continue, E=End, O=Other

BIO

B=Begin, I=Inside, O=Other
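The BIO encoding above can be made concrete with a small helper. This is a sketch; the address tokens and the (start, end, type) span format are illustrative.

```python
def bio_encode(tokens, entities):
    """entities: list of (start_token, end_token_exclusive, entity_type)."""
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = "B-" + etype            # Begin
        for i in range(start + 1, end):
            labels[i] = "I-" + etype            # Inside
    return labels                               # everything else stays Other

bio_encode(["52", "Goregaon", "West", "Mumbai", "400062"],
           [(0, 1, "HouseNo"), (1, 3, "Street"), (3, 4, "City"), (4, 5, "Zip")])
# -> ['B-HouseNo', 'B-Street', 'I-Street', 'B-City', 'B-Zip']
```

The BCEO encoding differs only in that the last token of a multi-token entity gets an E label.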

(59)

Features

The basis of the classification process:

f : (x, y, i) → R

x := token sequence, y := label, i := token position

(60)

Token properties

Word Features

Orthographic Features

Dictionary Lookup Features

(61)

Models for Labeling Tokens

Using features derived from the token xi and its neighbors in X

Support Vector Machine (SVM)

Capturing the dependency between the labels of adjacent words

Hidden Markov Models (HMMs)

Maximum Entropy Markov Models (MEMM)

Conditional Markov models (CMM)

Conditional Random Fields (CRFs)

(62)

Conditional Random Fields

Models a single joint distribution Pr(y|x) over the predicted labels y of the tokens of x

A chain is adequate for capturing label dependencies

The label yi is influenced only by the labels that are adjacent to it

(63)

Conditional Random Fields (cont.)

scoring function ψ(yi−1, yi, x, i)

captures the dependency between the labels of adjacent tokens

defined in terms of weighted feature functions: ψ(yi−1, yi, x, i) = w · f(x, yi, yi−1, i)

The conditional distribution:

Pr(y|x) = (1/Z(x)) exp(Σi w · f(x, yi, yi−1, i))

Z(x) = Σy′ exp(Σi w · f(x, y′i, y′i−1, i))

(64)

3.2 Segment-level Models

The output is a sequence of segments

Each segment defining an entity

Features can now capture joint properties of all the tokens in a segment

(65)

3.2.1 Features

Entity-level Features

Similarity to an Entity in the Database

maximum TF-IDF similarity

Length of the Entity Segment

feature form: f(yj, yj−1, x, lj, uj), where lj and uj are the start and end positions of segment j

(66)

3.2.2 Global Segmentation

Models

w := weight vector for the feature vector f

f(x, s) := sum of the feature vectors of the segments of s

Goal of inference: find the segmentation s for which w · f(x, s) is maximized

(67)

3.3 Grammar-based Models

Require a better interpretation of the structure of the source

Like in context-free grammars of programming language

The productions are defined over terminals and non-terminals

Output of the extraction process is a parse tree

Each output tree is assigned a score

(68)

Example

R: S → AuthorsLF | AuthorsFL

R0: AuthorsLF → NameLF Separator AuthorsLF

R1: AuthorsFL → NameFL Separator AuthorsFL

R2: AuthorsFL → NameFL

R3: AuthorsLF → NameLF

R4: NameLF Separator → NameLF Punctuation

R5: NameFL Separator → NameFL Punctuation

R6: NameLF → LastName First Middle

R7: NameFL → First Middle LastName

(69)

Example (score)

Each production has the form R → R1 R2:

(l1, r1) := the text span in x covered by R1

(r1 + 1, r2) := the text span in x covered by R2

a terminal node R covering span (l, r) is scored as w · f(R, x, l, r)

(70)

Example (cont.)

a string “Peter Haas, George John”

R0 → R4 R3
R4 → R6 x3
R3 → R6
R6 → x1 x2
R6 → x4 x5

total score of this tree

w ・ f (R0,R4,R3,x,1,3,5)

+ w ・ f (R4,R6,Punctuation,x,1,2,3) + w ・ f (R3,R6,φ,x,1,2,2)

+ w ・ f (R6,x,1,2) + w ・ f (R6,x,3, 4)

(71)

3.4 Training Algorithms

Applies to the different feature-based models

score s(y) = w · f(x, y); the model outputs the y for which s(y) is maximum

f(x, y) := the feature vector

the labeled training set D = {(xl, yl)}, l = 1, …, N

(72)

3.4.1 Likelihood Trainer

Goal

Choose a weight vector w such that the log-probability of the correct output is maximized

(73)

Likelihood Trainer(cont.)

Training objective

maximized by gradient-ascent-type methods using the gradient ∇L(w)
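A toy instance of the likelihood trainer, assuming a single binary feature and two labels "A"/"B". Everything here (the feature, the data, the label set) is an illustrative sketch; the gradient is the usual Σl f(xl, yl) − E[f] over the training set.

```python
import math

# One binary feature that fires only for label "A" (illustrative).
def feat(x, y):
    return 1.0 if y == "A" else 0.0

# E[f] under Pr(y|x) = exp(w * f(x, y)) / Z(x), y ranging over {"A", "B"}.
def expected_feat(x, w):
    zA, zB = math.exp(w * feat(x, "A")), math.exp(w * feat(x, "B"))
    return (zA * feat(x, "A") + zB * feat(x, "B")) / (zA + zB)

def likelihood_train(D, epochs=100, lr=0.5):
    w = 0.0
    for _ in range(epochs):
        # gradient of the log-likelihood: observed minus expected feature counts
        grad = sum(feat(x, y) - expected_feat(x, w) for x, y in D)
        w += lr * grad                 # gradient ascent step
    return w

w = likelihood_train([("doc1", "A"), ("doc2", "A")])
# w grows positive: label "A", which fires the feature, becomes more likely
```

Each step pushes up the weight of features that fire on correct labels and pushes down their model expectation; at the optimum the two match.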

(74)

Training algorithm

(75)

3.4.2 Max-margin Training

An extension of SVM for training structured models

Goal

find a w such that the score w · f(x, yl) of the correct labeling yl has a margin of at least err(y, yl) more than the score of any other labeling y

err(y, yl)

user-provided error function

Indicates the penalty of outputting y when the correct label is yl

(76)

Max-margin Training (cont.)

Training objective

ξ := the gap between the desired margin and the margin achieved by the current values of w

The program is convex in w and ξ

a local minimum is the global minimum

(77)

Training algorithm

(78)

3.5 Inference Algorithms

Highest scoring (MAP) labeling

Find y* = arg max over y of w · f(x, y)

Expected feature values

Find the expectation of f(x, y) under Pr(y|x)

The feature vector f(x, y) decomposes over finer substructures in y

this enables the design of efficient dynamic programming algorithms

(79)

General principle

S := entire output

S3 smallest part of S1

all features spanning S1 and S2 involve only the subset in S3

2

1 S

S

S

2

1 S

S 

(80)

Find the MAP score V(S) over S recursively

When S3 is small, the recursion is efficient

(81)

MAP for Sequential Labeling

the feature vector decomposes over labels of adjacent positions

V(i, y) := the best score of a label sequence ending at position i with label y

Running time: O(nm²), where n is the sequence length and m the number of labels
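The O(nm²) dynamic program is the standard Viterbi recursion, sketched below. The edge score function stands in for w · f(x, y, y′, i); the toy score used in the example is purely illustrative.

```python
def viterbi(n, labels, score):
    """Return the highest-scoring label sequence of length n."""
    # V[i][y]: best score of a prefix ending at position i with label y
    V = [{y: score(None, y, 0) for y in labels}]
    back = []
    for i in range(1, n):
        V.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: V[i - 1][yp] + score(yp, y, i))
            back[i - 1][y] = prev
            V[i][y] = V[i - 1][prev] + score(prev, y, i)
    # trace back from the best final label
    y = max(labels, key=lambda yy: V[n - 1][yy])
    seq = [y]
    for i in range(n - 2, -1, -1):
        y = back[i][y]
        seq.append(y)
    return list(reversed(seq))

viterbi(3, ["A", "B"], lambda yp, y, i: 1.0 if y == "A" else 0.0)
# -> ['A', 'A', 'A']
```

Each position considers m candidate labels against m predecessor labels, giving the O(nm²) bound stated above.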

(82)

MAP for Segmentations

y = s1 … sp, where sj = (tj, uj, yj)

tj := segment start position

uj := segment end position

yj := segment label

Running time: O(nLm²), where L is the maximum segment length
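The segment-level analogue can be sketched as a dynamic program over end position, segment start, and label pair, giving the O(nLm²) bound above. `seg_score` stands in for w · f(x, yj, yj−1, tj, uj); the toy score in the example is illustrative.

```python
def best_segmentation(n, labels, L, seg_score):
    """Best score of segmenting n tokens with segments of length at most L."""
    NEG = float("-inf")
    # V[u][y]: best score covering the first u tokens, last segment labeled y
    V = [{y: NEG for y in labels} for _ in range(n + 1)]
    for y in labels:
        V[0][y] = 0.0                              # empty prefix
    for u in range(1, n + 1):
        for y in labels:
            for t in range(max(0, u - L), u):      # last segment covers t..u-1
                for yp in labels:
                    s = V[t][yp] + seg_score(yp, y, t, u)
                    if s > V[u][y]:
                        V[u][y] = s
    return max(V[n].values())

# Toy score that rewards long segments: the best split of 4 tokens with L = 2
# is two segments of length 2, scoring 2*2 + 2*2 = 8.
best_segmentation(4, ["A", "B"], 2, lambda yp, y, t, u: float((u - t) ** 2))
# -> 8.0
```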

(83)

MAP for Parse Trees

Running time: O(n³m′)

m′ := the total number of non-terminals and terminals

(84)

Expected Feature Values for Sequential Labeling

Computing the forward recursion:

α(i, y) := the sum of the exponentiated scores of all label sequences from position 1 to i ending with label y

base case: α(0, y) = 1
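The forward recursion can be sketched directly: α(i, y) accumulates the exponentiated scores of all prefixes ending in label y, and the partition function Z(x) is the sum over the final column. The edge score is a stand-in for w · f(x, y, y′, i); this is an illustrative sketch.

```python
import math

def forward(n, labels, score):
    """Forward algorithm: returns the alpha table and Z(x)."""
    alpha = [{y: math.exp(score(None, y, 0)) for y in labels}]
    for i in range(1, n):
        alpha.append({
            y: sum(alpha[i - 1][yp] * math.exp(score(yp, y, i))
                   for yp in labels)
            for y in labels
        })
    Z = sum(alpha[n - 1].values())
    return alpha, Z

# With all scores zero, every one of the m^n label sequences contributes
# exp(0) = 1, so Z = 2^3 = 8 for two labels and three positions.
_, Z = forward(3, ["A", "B"], lambda yp, y, i: 0.0)
# Z -> 8.0
```

Dividing α-based quantities by Z yields the marginals needed for the expected feature values; the backward recursion β on the next slide works symmetrically from the end of the sequence.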

(85)

Expected Feature Values for Sequential Labeling (cont.)

Computing the expectation

(86)

Expected Feature Values for Sequential Labeling (cont.)

Reduce space requirements from Km + nm to K + nm, where K is the number of features

Backward recursion:

β(i, y) := the sum over all label sequences from position i + 1 to n, given label y at position i

(87)

Summary

Statistical models for entity extraction

The most prominent

CMM, HMM and CRF

Segment-level models and the grammar-based models

Provide all the flexibility of CRFs and more

Increased overheads of inference and training
