lecture MEA 最近の更新履歴浜田道昭＠早稲田大学産総研日本医科大学

(1)

.

...

Maximizing expected accuracy estimations in

bioinformatics

バイオインフォマティクスにおける期待精度最大化推定

Michiaki Hamada

The University of Tokyo CBRC/AIST

October 7, 2011

(2)

1... Introduction

2... Framework of MEG/MEA estimators MEG/MEA estimators

Gain functions for Y ⊂ Lⁿ Gain functions for Y ⊂ {0, 1}ⁿ

Gain function for Y ⊂ {0, 1}ⁿ× {0, 1}^m

3... A classification of existing algorithms from MEG/MEA-viewpoint Sequence feature prediction

Alignment (excluding RNA alignment) RNA sequence analysis

RNA secondary structure prediction

RNA common secondary structure prediction RNA alignment

Others

4... _Summary

5... Bibliography

(3)

Point estimation problem in (high-D) discrete space

.Problem ..

...

Given data D and a discrete space Y correlated to D, find a point y in Y . Y is called predictive space, which contains all the possible solutions^． .Example (Predictive spaces in bioinformatics)

..

...

1... S(x) : the space of secondary structures of RNA sequence x

2... _{A(x, x}^′) : the space of pairwise alignments between two biological sequences x and x^′

3... T (S) : the space of phylogenetic topologies for a set of leafs S ...

4 G(x) : the space of gene annotations for a genome x In general, the dimension of those spaces are very high.

(4)

Widely-used approach (1): maximizing a score

Define a score model S(y|D) (which gives a score of y ∈ Y ), and then maximize the score.

.Example ..

... ...

1 In pairwise alignment, define a score model S(y|x, x^′) which gives a score of a pairwise alignment y between x and x^′ (specified by a score matrix and gap costs), and then maximize the score.

2... In RNA secondary structure prediction, define the free energy E(y|x) of a secondary structure y of an RNA sequence x (given by Turner’s energy model [Freier et al., 1986]), and then minimize the energy.

(5)

Widely-used approach (2): maximizing a probability

Define a probabilistic model p(y|D) (which gives a probability of y ∈ Y ), and then maximize the probability.

.Example ..

...

1... In pairwise alignment, define the probabilistic model

p(y|x, x^′) = exp (S(y|x, x^′)/T ) /Z (Miyazawa model [Miyazawa, 1995]) which gives a probability of a pairwise alignment y between x and x^′, and then maximize the probability.

...

2 In RNA secondary structure prediction, define the probabilistic model p(y|x) = exp(−E(y|x)/(kT ))/Z (McCaskill model [McCaskill, 1990]) of a secondary structure y of an RNA sequence x, and then maximize the probability.

In many cases, score models naturally lead to probabilistic models.

(6)

Point estimation problem in (high-D) discrete space with

probability distribution

.Problem ..

...

Given data D and a discrete space Y correlated to D, find a point y in Y . .Assumption

..

...

In above problem a (posterior) probability distribution p(y|D) on a predictive space Y is given.

.Definition ((Baysien) maximum likelihood (ML) estimator) ..

...

ˆ

y^{(M L)}= argmax

y∈Y

p(y|D),

Is ML-estimator good?

(7)

Drawback of ML estimators (1)

仮定：ハミング距離の近い予測は”似ている”

p1 = 2 / 6 p₂= 1 / 6

最尤推定 otherwise = 0 ハミング距離１の

ノード間にエッジ

[Carvalho and Lawrence, 2008]

(8)

Drawback of ML estimators (2)

p₂ (1,1,1,...,1)

(0,0,...,0) (1,0,...,0)

(0,1,...,0)

n-dim

p₁

最尤推定遠い

p₂ p₂

（n+1)p₂

p₁= 2 / (n+3) p₂= 1 / (n+3) otherwise = 0

良い予測？？？ 0.002 0.001

0.001 _0.001

0.998

n = 997

Imagine the case of n → ∞.

(9)

Drawback of ML estimators (3)

...

1 The number of possible solutions (i.e. |Y |) is immense. ...

2 There are many mutually similar solutions in Y .

3... The probability of ML-estimation is often extremely low.

e.g. typically less than 10⁻⁴⁰in RNA secondary structure prediction

4... ML estimator on discretespace does not satisfy the following properties (although ML estimator oncontinuousspace holds the properties)

consistency

asymptotic normality asymptotic efficiency

What estimators should we use ?

(10)

Drawback of ML estimators (3)

...

1 The number of possible solutions (i.e. |Y |) is immense. ...

2 There are many mutually similar solutions in Y .

3... The probability of ML-estimation is often extremely low.

e.g. typically less than 10⁻⁴⁰in RNA secondary structure prediction

4... ML estimator on discretespace does not satisfy the following properties (although ML estimator oncontinuousspace holds the properties)

consistency

asymptotic normality asymptotic efficiency

What estimators should we use ?

(11)

Maximum expected gain (MEG) estimators

.Definition (gain function) ..

...

For θ ∈ Y and y ∈ Y , G : Y × Y → R⁺, G(θ, y) is called gain function. .Definition (MEG estimator)

..

...

ˆ

y = argmax

y∈Y ^E^θ|D

[G(θ, y)] = argmax

y∈Y

∑

θ∈Y

G(θ, y)p(θ|D)

.Remark .. ...

When G(θ, y) = δ(θ, y)^，MEG estimator is equivalent to ML-estimator.

(12)

Maximum expected accuracy (MEA) estimators

When the gain function G is designed according to the accuracy measures of the target problem, the MEG estimator is called “maximum expected accuracy (MEA) estimator”

This does not mean the gain function is exactly equal to the accuracy measure.

ML-estimator from a viewpoint of MEA estimator:

The delta function δ(θ, y) is rarely used as accuracy measure because it is too strict for accuracy measure.

In RNA secondary structure prediction, predicted secondary structure is considered as correct only when it is exactly same as the reference structure.

The gain function should be designed more carefully.

(13)

Maximum expected accuracy (MEA) estimators

When the gain function G is designed according to the accuracy measures of the target problem, the MEG estimator is called “maximum expected accuracy (MEA) estimator”

This does not mean the gain function is exactly equal to the accuracy measure.

ML-estimator from a viewpoint of MEA estimator:

The delta function δ(θ, y) is rarely used as accuracy measure because it is too strict for accuracy measure.

In RNA secondary structure prediction, predicted secondary structure is considered as correct only when it is exactly same as the reference structure.

The gain function should be designed more carefully.

(14)

Commonly-used gain functions

Predictive spaces:

A subset of Lⁿ (|L| < ∞) A subset of {0, 1}ⁿ Gain functions:

Gain functions for Y ⊂ Lⁿ label gain function: G^(label)

boundary gain function: G^(boundary)γ

Gain functions for Y ⊂ {0, 1}ⁿ γ-centroid gain function: G^(centroid)γ

MCC / F-score: G^(Acc)

Gain function for Y ⊂ {0, 1}ⁿ× {0, 1}^m: G^(2dim)γ

(15)

Y is a subset of L

ⁿ

(|L| < ∞)

Typically, the data D is a biological sequence with length n (e.g., a DNA, RNA or protein sequence), and L is a set of labels on the sequence. .Example (Gene annotations of a sequence x: G(x))

..

...

For a genome sequence x and L = {exon, intron, intergenic} (a set of labels for components of gene structures), a gene structure y can be represented as y = {yi}^|x|_i=1 ∈ L^|x| where yi∈ L indicates the label of the i-th position in x. G(x) denotes the space of gene structures of a genome sequence x.

XXXEEEIIIIIIEEIIEEEXX 123456789012345678901

X:Intergenic region, E:Exon, I:Intron

(16)

Gain function for Y ⊂ L

ⁿ

: label gain function

For θ, y ∈ Y ⊂ Lⁿ, the following gain function is introduced. G^(label)(θ, y) = ^∑

1≤i≤n

I(θi = yi)

Here, I(condition) is the indicator function that returns 1 only when condition is true.

1... This gain function was originally proposed in Kall et al. [2005]

2... When θ is a correct (reference) sequence and y is a prediction, this is equal to the number of correctly predicted labels.

3... The MEG estimator of this gain function, therefore, maximizes the expected number of correctly predicted labels.

(17)

G

^(label)

for gene prediction

Y = G(x)

L = {exon, intron, intergenic} G^(label)(θ, y) = 19

XXEEEEIIIIIEEEIIEEEXX 123456789012345678901

XXXEEEIIIIIIEEIIEEEXX 123456789012345678901 Reference θ

Prediction y

X:Intergenic region, E:Exon, I:Intron

MEG estimator with this gain function is suitable to nucleotide level accuracy

(18)

Gain function for Y ⊂ L

ⁿ

: boundary gain function

For θ, y ∈ Y ⊂ Lⁿ, the following gain function is introduced. G^(boundary)_γ (θ, y) = ^∑

2≤i≤n

[I ((θi−1, θi) ̸∈ B) I ((yi−1, yi) ̸∈ B) +γ · I ((θ_i−1, θi) ∈ B) I ((y_i−1, yi) ∈ B)^] B is the list of all pairs of labels corresponding to a boundary (e.g., an exon-intron boundary for gene prediction).

...

1 This gain function was originally proposed by Gross et al. [2007] in the context of gene prediction.

...

2 When θ is a correct prediction and y is a prediction, this is equal to a weighted sum of the number of correctly predicted boundariesand non-boundaries.

...

3 The MEG estimator of this gain function is suitable for accurate prediction of boundary of annotation (boundary accuracy).

(19)

G

^(boundary)γ

for gene prediction

Y = G(x)

B is the list of all pairs of labels corresponding to a boundary (e.g., an exon-intron boundary for gene prediction).

G^(boundary)γ (θ, y) = 4γ + 12

XXEEEEIIIIIEEEIIEEEXX 123456789012345678901

XXXEEEIIIIIIEEIIEEEXX 123456789012345678901 Reference _θ

Prediction y

In the MEG estimator with this gain function, the γ is a parameter that adjusts between the sensitivity and PPV of a prediction. Using larger γ leads to more boundaries (that is, more genes) in the prediction.

(20)

Y is a subset of {0, 1}

ⁿ

1... This is a special case of the previous (where L = {0, 1})^．

2... In this case, 1 and 0 in a binary vector y ∈ Y typically mean positive andnegative predictions, respectively.

3... Accuracy measures, such as sensitivity (SEN), PPV, MCC and F-score, are naturally introduced

Each is defined by using the number of true positive, true negative, false positive and false negative predictions (denoted as TP, TN, FP and FN, respectively) [Baldi et al., 2000].

(21)

Secondary structures of RNA sequence x: S(x)

I⁽⁰⁾= {(i, k)|1 ≤ i ≤ |x|, i ≤ j ≤ |x|} (upper triangular matrix) we define a binary variable θik ((i, k) ∈ I⁽⁰⁾) as

θik =

{ 1 x_i and x_j form a base-pair 0 xi and xj do not form a base-pair a secondary structure θ of x can be represented as θ = {θik}_(i,k)∈I(0) ∈ S(x)

1 2 3 4 5 6

7 8

9 10

13 12 11

14







0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00













0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00





 1 2 3 4 5 6 7 8 9 10 11 12 13 14

1 2 3 4 5 6 7 8 9 10 11 12 13 14

＝

(22)

Pairwise alignment between x and x

^′

: A(x, x

^′

)

I⁽⁰⁾= {(i, k)|1 ≤ i ≤ |x|, 1 ≤ k ≤ |x^′|}

we define a binary variable θik ((i, k) ∈ I⁽⁰⁾) as θ_ik =

{ 1 xi aligns with x^′_k 0 xi does not align x^′_k

A pairwise alignment θ between x and x^′ can be represented as θ = {θik}_(i,k)∈I(0) ∈ A(x, x^′).

123456 CCAC-A CAACGA 1234-5

1 2 3 4 5 6 1 2 3 4 5

|||| | _＝







1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0













1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0







(23)

Phylogenetic topologies of leaf sets S = {1, · · · , n}: T (S)

I⁽⁰⁾={X|X ∈ S², |X| < n/2 ∨ (|X| = n/2 ∧ 1 ∈ X)}, S²= {X|X ⊂ S, |X| > 1 ∧ |X| < n − 1}

a set of partition of leafs

Define a binary variable θX for X ∈ I⁽⁰⁾ by θX =

{ 1 if S can be split into X and S \ X by cutting a edge 0 otherwise

A phylogenetic tree θ can be represented as θ = {θX}_X∈I(0) ∈ T (S)

X

θx=1

(24)

A gain function for Y ⊂ {0, 1}

ⁿ

: γ-centroid gain function

For θ, y ∈ Y (⊂ {0, 1}ⁿ), we introduce the gain function G^(centroid)_γ (θ, y) = ^∑

1≤i≤n

[I(θi= 0)I(yi = 0) + γ · I(θi = 1)I(yi = 1)]

where γ ≥ 0 is a weight parameter. ...

1 This gain function was originally proposed in the context of RNA secondary structure prediction [Hamada et al., 2009a].

2... When y is a prediction and θ is a reference sequence, this gain function is equal to a weighted sum of the number of true-positives (TP) and the number of true-negatives (TN).

3... MEG estimator with this gain function is calledγ-centroid estimator

(25)

G

^(centroid)γ

for RNA secondary structure prediction

For two secondary structures y and θ in S(x) y is a prediction and θ is a reference structure,

G^(centroid)γ (θ, y) is equal to the weighted sum of the number of true-positive base pairs and true-negative base pairs.

For example, G^(centroid)γ (θ, y) = 147 + 3γ.

(((...)))...((..)) Reference θ

(((...))).(....).. 123456789012345678

123456789012345678 TP=3 TN=147 FP=1 FN=2 Prediction y

(26)

G

^(centroid)γ

for pairwise alignment

For two pairwise alignments y and θ in A(x, x^′) y is a prediction and θ is a reference structure,

G^(centroid)γ (θ, y) is equal to the weighted sum of the number of true-positive aligned bases and true-negative aligned bases. For example, G^(centroid)γ (θ, y) = 63 + 4γ.

12-3456789 1234-5-678 1-23456789 123--45678 TP=4 TN=63 FP=2 FN=3

xx' xx' Reference θ Prediction y

(27)

A gain function for Y ⊂ {0, 1}

ⁿ

: MCC/F-score

Question: Can we use actual accuracy measures as gain function? Answer: Yes

For θ, y ∈ Y ⊂ {0, 1}ⁿ, we introduce the gain function

G^(Acc)(θ, y) = Acc(θ, y), (1)

where Acc is either MCC or F-score [Baldi et al., 2000], both of which are commonly-used accuracy measures providing a balance between Sensitivity and PPV.

1... It is generally difficult to compute MEG estimator with this gain function.

2... Hamada et al. [2010] have proposed an approximate method to maximize expected MCC/F-score.

(28)

G

^(Acc)

for RNA secondary structure prediction

For two secondary structures y and θ in S(x) y is a prediction and θ is a reference structure, G^(centroid)γ (θ, y) is equal to the value of MCC. For example, M CC(θ, y) = 0.661.

(((...)))...((..)) Reference θ

(((...))).(....).. 123456789012345678

123456789012345678 TP=3 TN=147 FP=1 FN=2 Prediction y

1... (Unlike the γ-centroid estimators,) the MEG estimator of this gain function does not contain any parameter.

(29)

A gain function for Y ⊂ {0, 1}

ⁿ

× {0, 1}

^m

Suppose that each binary vector has two indices, that is, Y ⊂ {0, 1}ⁿ× {0, 1}^m

like S(x) and A(x, x^′)

For θ = {θij} and y = {yij} (θij, yij ∈ {0, 1}) the gain function G^(2dim)_γ (θ, y) = 2γ ·^∑

i,j

I(θ_ij = 1)I(y_ij = 1)+

∑

i

∏

j

I(θij = 0)I(yij = 0) +^∑

j

∏

i

I(θij = 0)I(yij = 0)

Larger γ produces more positive predictions

Interestingly, this gain function wasindependently proposed in Do et al. [2006] (in the context of RNA secondary structure prediction) and Schwartz et al. [2005] (in the context of pairwise alignment).

(30)

G

^(2dim)γ

for RNA secondary structure prediction

Y = S(x)

When θ is a reference secondary structure and y is a predicted secondary structure, G^(2dim)γ (θ, y) is equal to a (weighted) sum of the numbers of correctly predicted positions in the RNA sequence x. In the example (below), G^(2dim)γ (θ, y) = 6γ + 6

p:correctly predicted base-pair position l:correctly predicted loop position x:wrongly predicted position

(((...)))...((..)) Reference θ

(((...))).(....).. 123456789012345678 123456789012345678 Prediction y

ppplllppplxlxxlxxx

(31)

G

^(2dim)γ

for pairwise alignment

Y = A(x, x^′)

θ is a reference alignment and y is a predicted alignment,

G^(2dim)γ (θ, y) is equal to a (weighted) sum of the numbers of correctly and incorrectly predicted columns in the alignment.

G^(2dim)γ (θ, y) = 10γ + 1.

12-3456789 1234-5-678 1-23456789 123--45678 pxp--xxppp p-pxlxxppp xx'

xx' xx' Reference θ Prediction y

MEG estimator with this gain function is suitable to accuracy measure calledalignment metric accuracy (AMA) [Schwartz, 2007].

(32)

Relation between G

^(centroid)γ

and G

^(2dim)γ For θ, y ∈ {0, 1}ⁿ× {0, 1}^m,

G^(2dim)_γ (θ, y) = 2G^(centroid)_γ (θ, y) + A(θ, y) + C(θ) where

A(θ, y) =^∑

i

∑

(j1,j2):j1̸=j2

I(θij1 = 1)I(yij2 = 1)

+^∑

j

∑

(i1,i2):i1̸=i2

I(θi1j= 1)I(yi2j = 1)

When θ is a reference and y is a prediction^，A(θ, y) gives positive gains for false positive and falsenegative predictions.

MEG estimator with G^(2dim)γ has biased to e.g. SEN, PPV ... [Hamada et al., 2009a]

(33)

How to compute MEG/MEA estimators

The following methods are often employed when computing the prediction of MEG estimators.

1... Dynamic Programming (DP)

2... Integer Programming (IP)

3... Stochastic Sampling (SS)

Sampling solutions directly from the posterior spaces e.g. Simulated Annealing and Gibbs Sampling

Generally, (fast) DP < IP < SS (slow) with respect to computational speed.

(34)

Classification of existing bioinformatics algorithms from

MEA viewpoints

1... Sequence feature prediction

2... RNA sequence analysis ...

3 _Alignment

1... RNA secondary structure prediction

2... RNA common secondary structure prediction

3... RNA alignment

4... _Others

(35)

Sequence feature prediction

Reference Software Target problem Gain function Apr Rep Comp Suitable accuracy

measures

Kall et al. [2005] – Sequence feature predictions

G^(label) ✓ DP # of correctly

predictedlabel

Gross et al. [2007] CONTRAST Gene prediction ^G^(boundary)_γ DP # of correctly

predicted boundary

N´an´asi et al. [2010] HERD HIV recombination prediction

G^(boundary)_γ ^a _DP _# _of _correctly predicted boundary

aMore precisely, a generalization of^G^(boundary)_γ was utilized.

(36)

Alignment (excluding RNA alignment)

Reference Software Type Gain function Apr Rep Comp Suitable accu-

racy measures

Miyazawa [1995] – Pairwise ^G^(centroid)1 DP Hamming

distance of (un)aligned- bases Holmes and Durbin [1998] – Pairwise ^G^(centroid)∞

a _DP _SEN/SPS _of

aligned-bases

Schwartz et al. [2005] – Pairwise ^G^(2dim)γ DP Alignment

metric accuracy (AMA)

Do et al. [2005] ProbCons Multiple ^G^(centroid)∞ ✓ ✓ DP SEN/SPS of

aligned-bases

Roshan and Livesay [2006] ProbAlign Multiple ^G^(centroid)∞ ✓ ✓ _DP _SEN/SPS _of

aligned-bases

Shinsuke et al. [2008] PRIME Multiple ^G^(centroid)∞ DP SEN/SPS of

aligned-bases

Schwartz and Pachter [2007] AMAP Multiple ^G^(2dim)γ ^✓ ^✓ SA AMA

Sahraeian and Yoon [2010] PicXAA Multiple ^G^(centroid)∞ ✓ ✓ _DP _SEN/SPS _of

aligned-bases

Frith et al. [2010] LAST Local ^G^(centroid)_γ DP SEN/PPV of

(un)aligned- bases

(37)

γ-centroid vs. maximum score in local alignment

[Frith et al., 2010]

(38)

RNA secondary structure prediction

Reference Software Gain function Apr Rep Comp Suitable accuracy measures

Ding et al. [2005] Sfold ^G^(centroid)1 SS Hamming distance of base-pairs

Do et al. [2006] CONTRAfold ^G^(2dim)_γ DP # of correctly predicted (loop or

base-pairs) positions in RNA sequence

Lu et al. [2009] MaxExpect ^G^(2dim)γ DP # of correctly predicted (loop or

base-pairs) positions in RNA sequence

Hamada et al. [2009a] CentroidFold ^G^(centroid)_γ DP SEN/PPV of base-pairs

Hamada et al. [2010] CentroidFold ^G^(Acc)c DP/SS MCC/F-score of base-pairs

Lorenz and Clote [2011] RNAlocopt ^G^(2dim)γ DP # of correctly predicted (loop or base-pairs) positions in RNA sequence

Sato et al. [2011] IPKnot ^G^(centroid)_γ ✓ IP SEN/PPV of base-pairs

Hamada et al. [2009b] CentroidHomfold ^G^(centroid)_γ ✓ ✓ DP SEN/PPV of base-pairs aIPKnot can predict secondary structures with pseudoknot;^bCentroidHomfold predicts secondary structure of a target RNA sequence by using its homologous sequence information;^cAn approximate method (i.e. maximizingpseudo-expected accuracy) was used.

(39)

Comparison between various estimators

CONTRAfold model SimFold model McCaskill model

γ：：：：Small γ：：：：Large

arrow：Centroid (1-centroid)

○：γ-centroid +：2dim

a point：ML

RNAfold

[Hamada et al., 2009a]

(40)

RNA secondary structure prediction by maximizing

pseudo -expected MCC/Fscore

F-score MCC F-score

MCC

(41)

RNA common secondary structure prediction

Reference Software Gain function Apr Rep Comp Suitable accuracy measures

Knudsen and Hein [2003] Pfold ^G^(2dim)1 DP # of correctly predicted (loop or

base-pairs) positions

Bernhart et al. [2008] RNAalifold^a ^G^(centroid)1 DP Hamming distance

Kiryu et al. [2007] McCaskill-MEA ^G^(2dim)γ ^✓ DP # of correctly predicted positions Seemann et al. [2008] PETfold ^G^(2dim)γ ^✓ DP # of correctly predicted positions Hamada et al. [2011a] CentroidAlifold ^G^(centroid)γ ^✓ DP SEN/PPV of base-pairs

Wei et al. [2011] RNAG ^G^(centroid)γ SS SEN/PPV of base-pairs

aML estimator is employed as the default setting.

(42)

Comparison

CONTRAfold model McCaskill model CONTRAfold model

McCaskill model

arrow：Centroid

△：γ-centroid

○：2-dim

point：ML Secondary structure prediction

Common secondary structure prediction

(43)

RNA alignment

Reference Software Target prob-

lem

Gain function Apr Rep Comp Suitable accuracy measures

Wei et al. [2011] RNAG ^G^(centroid)_γ GS SEN/PPV of

base-pairs Sahraeian and Yoon [2011] PicXAA-R RNA multiple

alignment

G^(centroid)_∞ ✓ ✓ DP SPS of

aligned-bases Hamada et al. [2009c] CentroidAlign RNA multiple

alignment

G^(centroid)_γ ✓ ✓ _DP _{SEN/PPV of} aligned-bases Tabei and Asai [2009] SCARNA-LM RNA local

alignment

G^(centroid)_γ _DP _{SEN/PPV of}

aligned bases

(44)

Others

Reference Software Target problem Gain function Apr Rep Comp Suitable accuracy

measures

Kato et al. [2010] RactIP RNA-RNA interaction prediction

G^(centroid)_γ _IP _SEN/PPV _of

base-pairs / interaction base-pairs Seemann et al. [2011] PETcofold RNA-RNA inter-

action prediction between two multiple alignments

G^(2dim)

γ ^✓ ^DP ^–

Hamada et al. [2011b] – Phylogenetic tree estimation

G^(centroid)_γ ✓ _– Robinson-Foulds (RF) measure

(45)

. . . . . . A classification of existing algorithms from MEG/MEA-viewpoint Others

Kall et al. [2005] – Sequence feature predictions L ^G(label) ^✓ DP

Gross et al. [2007] CONTRAST Gene prediction L G(boundary)

γ ^DP

N´an´asi et al. [2010] HERD HIV recombination prediction L G(boundary)

γ ^DP

Miyazawa [1995] – Pairwise alignment B G(centroid)₁ _DP

Holmes and Durbin [1998] – Pairwise alignment B G(centroid)

∞ DP

Schwartz et al. [2005] – Pairwise alignment B G(2dim)

γ ^DP

Do et al. [2005] ProbCons Multiple alignment B G(centroid)

∞ ^✓ ^✓ ^DP

Roshan and Livesay [2006] ProbAlign Multiple alignment B G(centroid)

∞ ^✓ ^✓ DP

Shinsuke et al. [2008] PRIME Multiple alignment B G(centroid)

∞ DP

Schwartz and Pachter [2007] AMAP Multiple alignment B G(2dim)

γ ^✓ ^✓ ^SA

Sahraeian and Yoon [2010] PicXAA Multiple alignment B G(centroid)

∞ ^✓ ^✓ ^DP

Frith et al. [2010] LAST Genome (local) alignment B G(centroid)

γ ^DP

Ding et al. [2005] Sfold RNA sec. str. pred. B G(centroid)₁ SS

Do et al. [2006] CONTRAfold RNA sec. str. pred. B ^G(2dim)_γ DP

Lu et al. [2009] MaxExpect RNA sec. str. pred. B ^G(2dim)_γ DP

Hamada et al. [2009a] CentroidFold RNA sec. str. pred. B G(centroid)

γ ^DP

Hamada et al. [2010] CentroidFold RNA sec. str. pred. B G(^Acc) DP/SS

Lorenz and Clote [2011] RNAlocopt RNA sec. str. pred. B ^G(2dim)_γ DP

Sato et al. [2011] IPKnot RNA sec. str.

pred. with pseudoknot ^B

G(centroid)

γ ^✓ ^IP

Hamada et al. [2009b] CentroidHomfold RNA sec. str. pred. with ho- mol. seq.

B G(centroid)

γ ^✓ ^✓ ^DP

Knudsen and Hein [2003] Pfold RNA com. sec. str. pred. B G(2dim)₁ DP

Bernhart et al. [2008] RNAalifold RNA com. sec. str. pred. B G(centroid)₁ DP

Kiryu et al. [2007] McCaskill-MEA RNA com. sec. str. pred. B ^G(2dim)_γ ✓ DP

Seemann et al. [2008] PETfold RNA com. sec. str. pred. B ^G(2dim)_γ ✓ DP

Hamada et al. [2011a] CentroidAlifold RNA com. sec. str. pred. B G(centroid)

γ ^✓ ^DP

Wei et al. [2011] RNAG RNA com. sec. str. pred. B G(centroid)

γ ^GS

Sahraeian and Yoon [2011] PicXAA-R RNA multiple alignment B G(centroid)

∞ ^✓ ^✓ ^DP

Hamada et al. [2009c] CentroidAlign RNA multiple alignment B G(centroid)

γ ^✓ ^✓ ^DP

Tabei and Asai [2009] SCARNA-LM RNA local alignment B G(centroid)

γ ^DP

(46)

Summary

...

1 Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space.

...

2 Maximum score and maximum likelihood estimators do not work well in this situation although they are widely employed in a number of applications.

3... Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach.

...

4 A number of algorithms used in previous studies can be classified from the viewpoint of MEG/MEA.

(47)

Bibliography I

S. M. Freier, R. Kierzek, J. A. Jaeger, N. Sugimoto, M. H. Caruthers, T. Neilson, and D. H. Turner. Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. U.S.A., 83:9373–9377, Dec 1986. S. Miyazawa. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng., 8:

999–1009, Oct 1995.

J S McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, May 1990.

L.E. Carvalho and C.E. Lawrence. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. U.S.A., 105:3209–3214, 2008.

L. Kall, A. Krogh, and E. L. Sonnhammer. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21 Suppl 1:i251–257, 2005.

S.S. Gross, C.B. Do, M. Sirota, and S. Batzoglou. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol., 8:R269, 2007.

P. Baldi, S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16:412–424, May 2000.

M. Hamada, H. Kiryu, K. Sato, T. Mituyama, and K. Asai. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics, 25:465–473, Feb 2009a.

M. Hamada, K. Sato, and K. Asai. Prediction of RNA secondary structure by maximizing pseudo-expected accuracy. BMC Bioinformatics, 11:586, 2010.

C.B. Do, D.A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22:e90–98, Jul 2006.

Ariel S. Schwartz, Eugene W. Myers, and Lior Pachter. Alignment metric accuracy, 2005.

Ariel S. Schwartz. Posterior Decoding Methods for Optimization and Accuracy Control of Multiple Alignments. Ph.D thesis, University of California, Berkeley, 2007.

(48)

Bibliography II

Michal Nánási, Tomáˇs Vinaˇr, and Broˇna Brejová. The highest expected reward decoding for hmms with application to recombination detection. In Proc. of CPM’10, pages 164–176, 2010.

I. Holmes and R. Durbin. Dynamic programming alignment accuracy. J. Comput. Biol., 5:493–504, 1998.

C.B. Do, M.S. Mahabhashyam, M. Brudno, and S. Batzoglou. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res., 15:330–340, Feb 2005.

U. Roshan and D.R. Livesay. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics, 22:2715–2721, Nov 2006.

Yamada Shinsuke, Gotoh Osamu, and Yamana Hayato. Improvement in speed and accuracy of multiple sequence alignment program prime. IPSJ Transactions on Bioinformatics (TBIO), 1:2–12, nov 2008.

A.S. Schwartz and L. Pachter. Multiple alignment by sequence annealing. Bioinformatics, 23:e24–29, Jan 2007. S. M. Sahraeian and B. J. Yoon. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of

multiple sequences. Nucleic Acids Res., 38:4917–4928, Aug 2010.

M. C. Frith, M. Hamada, and P. Horton. Parameters for accurate genome alignment. BMC Bioinformatics, 11:80, 2010. Y. Ding, C.Y. Chan, and C.E. Lawrence. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble.

RNA, 11:1157–1166, Aug 2005.

Z. J. Lu, J. W. Gloor, and D. H. Mathews. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA, 15:1805–1813, Oct 2009.

W. A. Lorenz and P. Clote. Computing the partition function for kinetically trapped RNA secondary structures. PLoS ONE, 6: e16178, 2011.

K. Sato, Y. Kato, M. Hamada, T. Akutsu, and K. Asai. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 2011. (to appear).

M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai. Predictions of RNA secondary structure by combining homologous

(49)

Bibliography III

Bjarne Knudsen and Jotun Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, 31(13):3423–3428, Jul 2003.

S.H. Bernhart, I.L. Hofacker, S. Will, A.R. Gruber, and P.F. Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9:474, Nov 2008.

H. Kiryu, T. Kin, and K. Asai. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics, 23:434–441, 2007.

S.E. Seemann, J. Gorodkin, and R. Backofen. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res., 36:6355–6362, 2008.

M. Hamada, K. Sato, and K. Asai. Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res., 39:393–402, Jan 2011a.

Donglai Wei, Lauren V. Alpert, and Charles E. Lawrence. Rnag: A new gibbs sampler for predicting rna secondary structure for unaligned sequences. Bioinformatics, 2011. doi: 10.1093/bioinformatics/btr421. URL

http://bioinformatics.oxfordjournals.org/content/early/2011/07/24/bioinformatics.btr421.abstract. S. M. Sahraeian and B. J. Yoon. PicXAA-R: Efficient structural alignment of multiple RNA sequences using a greedy approach.

BMC Bioinformatics, 12 Suppl 1:S38, 2011.

M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics, 25:3236–3243, Dec 2009c.

Y. Tabei and K. Asai. A local multiple alignment method for detection of non-coding RNA sequences. Bioinformatics, 25: 1498–1505, Jun 2009.

Y. Kato, K. Sato, M. Hamada, Y. Watanabe, K. Asai, and T. Akutsu. RactIP: fast and accurate prediction of RNA-RNA interaction using integer programming. Bioinformatics, 26:i460–466, Sep 2010.

S. E. Seemann, A. S. Richter, T. Gesell, R. Backofen, and J. Gorodkin. PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics, 27:211–219, Jan 2011.

M. Hamada, H. Kiryu, W. Iwasaki, and K. Asai. Generalized centroid estimators in bioinformatics. PLoS ONE, 6:e16450, 2011b.

lecture MEA 最近の更新履歴 浜田 道昭＠早稲田大学 産総研 日本医科大学

Maximizing expected accuracy estimations in

bioinformatics

Point estimation problem in (high-D) discrete space

Widely-used approach (1): maximizing a score

Widely-used approach (2): maximizing a probability

Point estimation problem in (high-D) discrete space with

probability distribution

Is ML-estimator good?

Drawback of ML estimators (1)

Drawback of ML estimators (2)

n = 997

Drawback of ML estimators (3)

What estimators should we use ?

Drawback of ML estimators (3)

What estimators should we use ?

Maximum expected gain (MEG) estimators

Maximum expected accuracy (MEA) estimators

The gain function should be designed more carefully.

Maximum expected accuracy (MEA) estimators

The gain function should be designed more carefully.

Commonly-used gain functions

Y is a subset of L

(|L| < ∞)

Gain function for Y ⊂ L

: label gain function

G

for gene prediction

Gain function for Y ⊂ L

: boundary gain function

G

for gene prediction

Y is a subset of {0, 1}

Secondary structures of RNA sequence x: S(x)

Pairwise alignment between x and x

: A(x, x

)

Phylogenetic topologies of leaf sets S = {1, · · · , n}: T (S)

A gain function for Y ⊂ {0, 1}

: γ-centroid gain function

G

for RNA secondary structure prediction

G

for pairwise alignment

A gain function for Y ⊂ {0, 1}

: MCC/F-score

G

for RNA secondary structure prediction

A gain function for Y ⊂ {0, 1}

× {0, 1}

G

for RNA secondary structure prediction

G

for pairwise alignment

Relation between G

and G

How to compute MEG/MEA estimators

Classification of existing bioinformatics algorithms from

MEA viewpoints

Sequence feature prediction

Alignment (excluding RNA alignment)

γ-centroid vs. maximum score in local alignment

RNA secondary structure prediction

Comparison between various estimators

RNA secondary structure prediction by maximizing

pseudo -expected MCC/Fscore

RNA common secondary structure prediction

Comparison

RNA alignment

Others

Summary

Bibliography I

Bibliography II

Bibliography III

lecture MEA 最近の更新履歴浜田道昭＠早稲田大学産総研日本医科大学