.
...
Maximizing expected accuracy estimations in
bioinformatics
バイオインフォマティクスにおける期待精度最大化推定
Michiaki Hamada
The University of Tokyo CBRC/AIST
October 7, 2011
1... Introduction
2... Framework of MEG/MEA estimators MEG/MEA estimators
Gain functions for Y ⊂ Ln Gain functions for Y ⊂ {0, 1}n
Gain function for Y ⊂ {0, 1}n× {0, 1}m
3... A classification of existing algorithms from MEG/MEA-viewpoint Sequence feature prediction
Alignment (excluding RNA alignment) RNA sequence analysis
RNA secondary structure prediction
RNA common secondary structure prediction RNA alignment
Others
4... Summary
5... Bibliography
Point estimation problem in (high-D) discrete space
.Problem ..
...
Given data D and a discrete space Y correlated to D, find a point y in Y . Y is called predictive space, which contains all the possible solutions. .Example (Predictive spaces in bioinformatics)
..
...
1... S(x) : the space of secondary structures of RNA sequence x
2... A(x, x′) : the space of pairwise alignments between two biological sequences x and x′
3... T (S) : the space of phylogenetic topologies for a set of leafs S ...
4 G(x) : the space of gene annotations for a genome x In general, the dimension of those spaces are very high.
Widely-used approach (1): maximizing a score
Define a score model S(y|D) (which gives a score of y ∈ Y ), and then maximize the score.
.Example ..
... ...
1 In pairwise alignment, define a score model S(y|x, x′) which gives a score of a pairwise alignment y between x and x′ (specified by a score matrix and gap costs), and then maximize the score.
2... In RNA secondary structure prediction, define the free energy E(y|x) of a secondary structure y of an RNA sequence x (given by Turner’s energy model [Freier et al., 1986]), and then minimize the energy.
Widely-used approach (2): maximizing a probability
Define a probabilistic model p(y|D) (which gives a probability of y ∈ Y ), and then maximize the probability.
.Example ..
...
1... In pairwise alignment, define the probabilistic model
p(y|x, x′) = exp (S(y|x, x′)/T ) /Z (Miyazawa model [Miyazawa, 1995]) which gives a probability of a pairwise alignment y between x and x′, and then maximize the probability.
...
2 In RNA secondary structure prediction, define the probabilistic model p(y|x) = exp(−E(y|x)/(kT ))/Z (McCaskill model [McCaskill, 1990]) of a secondary structure y of an RNA sequence x, and then maximize the probability.
In many cases, score models naturally lead to probabilistic models.
Point estimation problem in (high-D) discrete space with
probability distribution
.Problem ..
...
Given data D and a discrete space Y correlated to D, find a point y in Y . .Assumption
..
...
In above problem a (posterior) probability distribution p(y|D) on a predictive space Y is given.
.Definition ((Baysien) maximum likelihood (ML) estimator) ..
...
ˆ
y(M L)= argmax
y∈Y
p(y|D),
Is ML-estimator good?
Drawback of ML estimators (1)
仮定:ハミング距離 の近い予測は”似 ている”
p1 = 2 / 6 p2 = 1 / 6
最尤推定 otherwise = 0 ハミング距離1の
ノード間にエッジ
[Carvalho and Lawrence, 2008]
Drawback of ML estimators (2)
p2 (1,1,1,...,1)
(0,0,...,0) (1,0,...,0)
(0,1,...,0)
n-dim
p1
最尤推定 遠い
p2 p2
(n+1)p2
p1 = 2 / (n+3) p2 = 1 / (n+3) otherwise = 0
良い予測??? 0.002 0.001
0.001 0.001
0.998
n = 997
Imagine the case of n → ∞.
Drawback of ML estimators (3)
...
1 The number of possible solutions (i.e. |Y |) is immense. ...
2 There are many mutually similar solutions in Y .
3... The probability of ML-estimation is often extremely low.
e.g. typically less than 10−40in RNA secondary structure prediction
4... ML estimator on discretespace does not satisfy the following properties (although ML estimator oncontinuousspace holds the properties)
consistency
asymptotic normality asymptotic efficiency
What estimators should we use ?
Drawback of ML estimators (3)
...
1 The number of possible solutions (i.e. |Y |) is immense. ...
2 There are many mutually similar solutions in Y .
3... The probability of ML-estimation is often extremely low.
e.g. typically less than 10−40in RNA secondary structure prediction
4... ML estimator on discretespace does not satisfy the following properties (although ML estimator oncontinuousspace holds the properties)
consistency
asymptotic normality asymptotic efficiency
What estimators should we use ?
Maximum expected gain (MEG) estimators
.Definition (gain function) ..
...
For θ ∈ Y and y ∈ Y , G : Y × Y → R+, G(θ, y) is called gain function. .Definition (MEG estimator)
..
...
ˆ
y = argmax
y∈Y Eθ|D
[G(θ, y)] = argmax
y∈Y
∑
θ∈Y
G(θ, y)p(θ|D)
.Remark .. ...
When G(θ, y) = δ(θ, y),MEG estimator is equivalent to ML-estimator.
Maximum expected accuracy (MEA) estimators
When the gain function G is designed according to the accuracy measures of the target problem, the MEG estimator is called “maximum expected accuracy (MEA) estimator”
This does not mean the gain function is exactly equal to the accuracy measure.
ML-estimator from a viewpoint of MEA estimator:
The delta function δ(θ, y) is rarely used as accuracy measure because it is too strict for accuracy measure.
In RNA secondary structure prediction, predicted secondary structure is considered as correct only when it is exactly same as the reference structure.
The gain function should be designed more carefully.
Maximum expected accuracy (MEA) estimators
When the gain function G is designed according to the accuracy measures of the target problem, the MEG estimator is called “maximum expected accuracy (MEA) estimator”
This does not mean the gain function is exactly equal to the accuracy measure.
ML-estimator from a viewpoint of MEA estimator:
The delta function δ(θ, y) is rarely used as accuracy measure because it is too strict for accuracy measure.
In RNA secondary structure prediction, predicted secondary structure is considered as correct only when it is exactly same as the reference structure.
The gain function should be designed more carefully.
Commonly-used gain functions
Predictive spaces:
A subset of Ln (|L| < ∞) A subset of {0, 1}n Gain functions:
Gain functions for Y ⊂ Ln label gain function: G(label)
boundary gain function: G(boundary)γ
Gain functions for Y ⊂ {0, 1}n γ-centroid gain function: G(centroid)γ
MCC / F-score: G(Acc)
Gain function for Y ⊂ {0, 1}n× {0, 1}m: G(2dim)γ
Y is a subset of L
n(|L| < ∞)
Typically, the data D is a biological sequence with length n (e.g., a DNA, RNA or protein sequence), and L is a set of labels on the sequence. .Example (Gene annotations of a sequence x: G(x))
..
...
For a genome sequence x and L = {exon, intron, intergenic} (a set of labels for components of gene structures), a gene structure y can be represented as y = {yi}|x|i=1 ∈ L|x| where yi∈ L indicates the label of the i-th position in x. G(x) denotes the space of gene structures of a genome sequence x.
XXXEEEIIIIIIEEIIEEEXX 123456789012345678901
X:Intergenic region, E:Exon, I:Intron
Gain function for Y ⊂ L
n: label gain function
For θ, y ∈ Y ⊂ Ln, the following gain function is introduced. G(label)(θ, y) = ∑
1≤i≤n
I(θi = yi)
Here, I(condition) is the indicator function that returns 1 only when condition is true.
1... This gain function was originally proposed in Kall et al. [2005]
2... When θ is a correct (reference) sequence and y is a prediction, this is equal to the number of correctly predicted labels.
3... The MEG estimator of this gain function, therefore, maximizes the expected number of correctly predicted labels.
G
(label)for gene prediction
Y = G(x)
L = {exon, intron, intergenic} G(label)(θ, y) = 19
XXEEEEIIIIIEEEIIEEEXX 123456789012345678901
XXXEEEIIIIIIEEIIEEEXX 123456789012345678901 Reference θ
Prediction y
X:Intergenic region, E:Exon, I:Intron
MEG estimator with this gain function is suitable to nucleotide level accuracy
Gain function for Y ⊂ L
n: boundary gain function
For θ, y ∈ Y ⊂ Ln, the following gain function is introduced. G(boundary)γ (θ, y) = ∑
2≤i≤n
[I ((θi−1, θi) ̸∈ B) I ((yi−1, yi) ̸∈ B) +γ · I ((θi−1, θi) ∈ B) I ((yi−1, yi) ∈ B)] B is the list of all pairs of labels corresponding to a boundary (e.g., an exon-intron boundary for gene prediction).
...
1 This gain function was originally proposed by Gross et al. [2007] in the context of gene prediction.
...
2 When θ is a correct prediction and y is a prediction, this is equal to a weighted sum of the number of correctly predicted boundariesand non-boundaries.
...
3 The MEG estimator of this gain function is suitable for accurate prediction of boundary of annotation (boundary accuracy).
G
(boundary)γfor gene prediction
Y = G(x)
B is the list of all pairs of labels corresponding to a boundary (e.g., an exon-intron boundary for gene prediction).
G(boundary)γ (θ, y) = 4γ + 12
XXEEEEIIIIIEEEIIEEEXX 123456789012345678901
XXXEEEIIIIIIEEIIEEEXX 123456789012345678901 Reference θ
Prediction y
In the MEG estimator with this gain function, the γ is a parameter that adjusts between the sensitivity and PPV of a prediction. Using larger γ leads to more boundaries (that is, more genes) in the prediction.
Y is a subset of {0, 1}
n1... This is a special case of the previous (where L = {0, 1}).
2... In this case, 1 and 0 in a binary vector y ∈ Y typically mean positive andnegative predictions, respectively.
3... Accuracy measures, such as sensitivity (SEN), PPV, MCC and F-score, are naturally introduced
Each is defined by using the number of true positive, true negative, false positive and false negative predictions (denoted as TP, TN, FP and FN, respectively) [Baldi et al., 2000].
Secondary structures of RNA sequence x: S(x)
I(0)= {(i, k)|1 ≤ i ≤ |x|, i ≤ j ≤ |x|} (upper triangular matrix) we define a binary variable θik ((i, k) ∈ I(0)) as
θik =
{ 1 xi and xj form a base-pair 0 xi and xj do not form a base-pair a secondary structure θ of x can be represented as θ = {θik}(i,k)∈I(0) ∈ S(x)
1 2 3 4 5 6
7 8
9 10
13 12 11
14
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
=
Pairwise alignment between x and x
′: A(x, x
′)
I(0)= {(i, k)|1 ≤ i ≤ |x|, 1 ≤ k ≤ |x′|}
we define a binary variable θik ((i, k) ∈ I(0)) as θik =
{ 1 xi aligns with x′k 0 xi does not align x′k
A pairwise alignment θ between x and x′ can be represented as θ = {θik}(i,k)∈I(0) ∈ A(x, x′).
123456 CCAC-A CAACGA 1234-5
1 2 3 4 5 6 1 2 3 4 5
|||| | =
1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
Phylogenetic topologies of leaf sets S = {1, · · · , n}: T (S)
I(0)={X|X ∈ S2, |X| < n/2 ∨ (|X| = n/2 ∧ 1 ∈ X)}, S2= {X|X ⊂ S, |X| > 1 ∧ |X| < n − 1}
a set of partition of leafs
Define a binary variable θX for X ∈ I(0) by θX =
{ 1 if S can be split into X and S \ X by cutting a edge 0 otherwise
A phylogenetic tree θ can be represented as θ = {θX}X∈I(0) ∈ T (S)
X
θx=1
A gain function for Y ⊂ {0, 1}
n: γ-centroid gain function
For θ, y ∈ Y (⊂ {0, 1}n), we introduce the gain function G(centroid)γ (θ, y) = ∑
1≤i≤n
[I(θi= 0)I(yi = 0) + γ · I(θi = 1)I(yi = 1)]
where γ ≥ 0 is a weight parameter. ...
1 This gain function was originally proposed in the context of RNA secondary structure prediction [Hamada et al., 2009a].
2... When y is a prediction and θ is a reference sequence, this gain function is equal to a weighted sum of the number of true-positives (TP) and the number of true-negatives (TN).
3... MEG estimator with this gain function is calledγ-centroid estimator
G
(centroid)γfor RNA secondary structure prediction
For two secondary structures y and θ in S(x) y is a prediction and θ is a reference structure,
G(centroid)γ (θ, y) is equal to the weighted sum of the number of true-positive base pairs and true-negative base pairs.
For example, G(centroid)γ (θ, y) = 147 + 3γ.
(((...)))...((..)) Reference θ
(((...))).(....).. 123456789012345678
123456789012345678 TP=3 TN=147 FP=1 FN=2 Prediction y
G
(centroid)γfor pairwise alignment
For two pairwise alignments y and θ in A(x, x′) y is a prediction and θ is a reference structure,
G(centroid)γ (θ, y) is equal to the weighted sum of the number of true-positive aligned bases and true-negative aligned bases. For example, G(centroid)γ (θ, y) = 63 + 4γ.
12-3456789 1234-5-678 1-23456789 123--45678 TP=4 TN=63 FP=2 FN=3
xx' xx' Reference θ Prediction y
A gain function for Y ⊂ {0, 1}
n: MCC/F-score
Question: Can we use actual accuracy measures as gain function? Answer: Yes
For θ, y ∈ Y ⊂ {0, 1}n, we introduce the gain function
G(Acc)(θ, y) = Acc(θ, y), (1)
where Acc is either MCC or F-score [Baldi et al., 2000], both of which are commonly-used accuracy measures providing a balance between Sensitivity and PPV.
1... It is generally difficult to compute MEG estimator with this gain function.
2... Hamada et al. [2010] have proposed an approximate method to maximize expected MCC/F-score.
G
(Acc)for RNA secondary structure prediction
For two secondary structures y and θ in S(x) y is a prediction and θ is a reference structure, G(centroid)γ (θ, y) is equal to the value of MCC. For example, M CC(θ, y) = 0.661.
(((...)))...((..)) Reference θ
(((...))).(....).. 123456789012345678
123456789012345678 TP=3 TN=147 FP=1 FN=2 Prediction y
1... (Unlike the γ-centroid estimators,) the MEG estimator of this gain function does not contain any parameter.
A gain function for Y ⊂ {0, 1}
n× {0, 1}
mSuppose that each binary vector has two indices, that is, Y ⊂ {0, 1}n× {0, 1}m
like S(x) and A(x, x′)
For θ = {θij} and y = {yij} (θij, yij ∈ {0, 1}) the gain function G(2dim)γ (θ, y) = 2γ ·∑
i,j
I(θij = 1)I(yij = 1)+
∑
i
∏
j
I(θij = 0)I(yij = 0) +∑
j
∏
i
I(θij = 0)I(yij = 0)
Larger γ produces more positive predictions
Interestingly, this gain function wasindependently proposed in Do et al. [2006] (in the context of RNA secondary structure prediction) and Schwartz et al. [2005] (in the context of pairwise alignment).
G
(2dim)γfor RNA secondary structure prediction
Y = S(x)
When θ is a reference secondary structure and y is a predicted secondary structure, G(2dim)γ (θ, y) is equal to a (weighted) sum of the numbers of correctly predicted positions in the RNA sequence x. In the example (below), G(2dim)γ (θ, y) = 6γ + 6
p:correctly predicted base-pair position l:correctly predicted loop position x:wrongly predicted position
(((...)))...((..)) Reference θ
(((...))).(....).. 123456789012345678 123456789012345678 Prediction y
ppplllppplxlxxlxxx
G
(2dim)γfor pairwise alignment
Y = A(x, x′)
θ is a reference alignment and y is a predicted alignment,
G(2dim)γ (θ, y) is equal to a (weighted) sum of the numbers of correctly and incorrectly predicted columns in the alignment.
G(2dim)γ (θ, y) = 10γ + 1.
12-3456789 1234-5-678 1-23456789 123--45678 pxp--xxppp p-pxlxxppp xx'
xx' xx' Reference θ Prediction y
MEG estimator with this gain function is suitable to accuracy measure calledalignment metric accuracy (AMA) [Schwartz, 2007].
Relation between G
(centroid)γand G
(2dim)γ For θ, y ∈ {0, 1}n× {0, 1}m,G(2dim)γ (θ, y) = 2G(centroid)γ (θ, y) + A(θ, y) + C(θ) where
A(θ, y) =∑
i
∑
(j1,j2):j1̸=j2
I(θij1 = 1)I(yij2 = 1)
+∑
j
∑
(i1,i2):i1̸=i2
I(θi1j= 1)I(yi2j = 1)
When θ is a reference and y is a prediction,A(θ, y) gives positive gains for false positive and falsenegative predictions.
MEG estimator with G(2dim)γ has biased to e.g. SEN, PPV ... [Hamada et al., 2009a]
How to compute MEG/MEA estimators
The following methods are often employed when computing the prediction of MEG estimators.
1... Dynamic Programming (DP)
2... Integer Programming (IP)
3... Stochastic Sampling (SS)
Sampling solutions directly from the posterior spaces e.g. Simulated Annealing and Gibbs Sampling
Generally, (fast) DP < IP < SS (slow) with respect to computational speed.
Classification of existing bioinformatics algorithms from
MEA viewpoints
1... Sequence feature prediction
2... RNA sequence analysis ...
3 Alignment
1... RNA secondary structure prediction
2... RNA common secondary structure prediction
3... RNA alignment
4... Others
Sequence feature prediction
Reference Software Target problem Gain function Apr Rep Comp Suitable accuracy
measures
Kall et al. [2005] – Sequence feature predictions
G(label) ✓ DP # of correctly
predictedlabel
Gross et al. [2007] CONTRAST Gene prediction G(boundary)γ DP # of correctly
predicted bound- ary
N´an´asi et al. [2010] HERD HIV recombina- tion prediction
G(boundary)γ a DP # of correctly predicted bound- ary
aMore precisely, a generalization ofG(boundary)γ was utilized.
Alignment (excluding RNA alignment)
Reference Software Type Gain function Apr Rep Comp Suitable accu-
racy measures
Miyazawa [1995] – Pairwise G(centroid)1 DP Hamming
distance of (un)aligned- bases Holmes and Durbin [1998] – Pairwise G(centroid)∞
a DP SEN/SPS of
aligned-bases
Schwartz et al. [2005] – Pairwise G(2dim)γ DP Alignment
metric accu- racy (AMA)
Do et al. [2005] ProbCons Multiple G(centroid)∞ ✓ ✓ DP SEN/SPS of
aligned-bases
Roshan and Livesay [2006] ProbAlign Multiple G(centroid)∞ ✓ ✓ DP SEN/SPS of
aligned-bases
Shinsuke et al. [2008] PRIME Multiple G(centroid)∞ DP SEN/SPS of
aligned-bases
Schwartz and Pachter [2007] AMAP Multiple G(2dim)γ ✓ ✓ SA AMA
Sahraeian and Yoon [2010] PicXAA Multiple G(centroid)∞ ✓ ✓ DP SEN/SPS of
aligned-bases
Frith et al. [2010] LAST Local G(centroid)γ DP SEN/PPV of
(un)aligned- bases
γ-centroid vs. maximum score in local alignment
[Frith et al., 2010]
RNA secondary structure prediction
Reference Software Gain function Apr Rep Comp Suitable accuracy measures
Ding et al. [2005] Sfold G(centroid)1 SS Hamming distance of base-pairs
Do et al. [2006] CONTRAfold G(2dim)γ DP # of correctly predicted (loop or
base-pairs) positions in RNA se- quence
Lu et al. [2009] MaxExpect G(2dim)γ DP # of correctly predicted (loop or
base-pairs) positions in RNA se- quence
Hamada et al. [2009a] CentroidFold G(centroid)γ DP SEN/PPV of base-pairs
Hamada et al. [2010] CentroidFold G(Acc)c DP/SS MCC/F-score of base-pairs
Lorenz and Clote [2011] RNAlocopt G(2dim)γ DP # of correctly predicted (loop or base-pairs) positions in RNA se- quence
Sato et al. [2011] IPKnot G(centroid)γ ✓ IP SEN/PPV of base-pairs
Hamada et al. [2009b] CentroidHomfold G(centroid)γ ✓ ✓ DP SEN/PPV of base-pairs aIPKnot can predict secondary structures with pseudoknot;bCentroidHomfold predicts secondary structure of a target RNA sequence by using its homologous sequence information;cAn approximate method (i.e. maximizingpseudo-expected accuracy) was used.
Comparison between various estimators
CONTRAfold model SimFold model McCaskill model
γ::::Small γ::::Large
arrow:Centroid (1-centroid)
○:γ-centroid +:2dim
a point:ML
RNAfold
[Hamada et al., 2009a]
RNA secondary structure prediction by maximizing
pseudo -expected MCC/Fscore
F-score MCC F-score
MCC
RNA common secondary structure prediction
Reference Software Gain function Apr Rep Comp Suitable accuracy measures
Knudsen and Hein [2003] Pfold G(2dim)1 DP # of correctly predicted (loop or
base-pairs) positions
Bernhart et al. [2008] RNAalifolda G(centroid)1 DP Hamming distance
Kiryu et al. [2007] McCaskill-MEA G(2dim)γ ✓ DP # of correctly predicted positions Seemann et al. [2008] PETfold G(2dim)γ ✓ DP # of correctly predicted positions Hamada et al. [2011a] CentroidAlifold G(centroid)γ ✓ DP SEN/PPV of base-pairs
Wei et al. [2011] RNAG G(centroid)γ SS SEN/PPV of base-pairs
aML estimator is employed as the default setting.
Comparison
CONTRAfold model McCaskill model CONTRAfold model
McCaskill model
arrow:Centroid
△:γ-centroid
○:2-dim
point:ML Secondary structure prediction
Common secondary structure prediction
RNA alignment
Reference Software Target prob-
lem
Gain function Apr Rep Comp Suitable accu- racy measures
Wei et al. [2011] RNAG G(centroid)γ GS SEN/PPV of
base-pairs Sahraeian and Yoon [2011] PicXAA-R RNA multiple
alignment
G(centroid)∞ ✓ ✓ DP SPS of
aligned-bases Hamada et al. [2009c] CentroidAlign RNA multiple
alignment
G(centroid)γ ✓ ✓ DP SEN/PPV of aligned-bases Tabei and Asai [2009] SCARNA-LM RNA local
alignment
G(centroid)γ DP SEN/PPV of
aligned bases
Others
Reference Software Target problem Gain function Apr Rep Comp Suitable accuracy
measures
Kato et al. [2010] RactIP RNA-RNA inter- action prediction
G(centroid)γ IP SEN/PPV of
base-pairs / inter- action base-pairs Seemann et al. [2011] PETcofold RNA-RNA inter-
action prediction between two mul- tiple alignments
G(2dim)
γ ✓ DP –
Hamada et al. [2011b] – Phylogenetic tree estimation
G(centroid)γ ✓ – Robinson-Foulds (RF) measure
. . . . . . A classification of existing algorithms from MEG/MEA-viewpoint Others
Kall et al. [2005] – Sequence feature predictions L G(label) ✓ DP
Gross et al. [2007] CONTRAST Gene prediction L G(boundary)
γ DP
N´an´asi et al. [2010] HERD HIV recombination prediction L G(boundary)
γ DP
Miyazawa [1995] – Pairwise alignment B G(centroid)1 DP
Holmes and Durbin [1998] – Pairwise alignment B G(centroid)
∞ DP
Schwartz et al. [2005] – Pairwise alignment B G(2dim)
γ DP
Do et al. [2005] ProbCons Multiple alignment B G(centroid)
∞ ✓ ✓ DP
Roshan and Livesay [2006] ProbAlign Multiple alignment B G(centroid)
∞ ✓ ✓ DP
Shinsuke et al. [2008] PRIME Multiple alignment B G(centroid)
∞ DP
Schwartz and Pachter [2007] AMAP Multiple alignment B G(2dim)
γ ✓ ✓ SA
Sahraeian and Yoon [2010] PicXAA Multiple alignment B G(centroid)
∞ ✓ ✓ DP
Frith et al. [2010] LAST Genome (local) alignment B G(centroid)
γ DP
Ding et al. [2005] Sfold RNA sec. str. pred. B G(centroid)1 SS
Do et al. [2006] CONTRAfold RNA sec. str. pred. B G(2dim)γ DP
Lu et al. [2009] MaxExpect RNA sec. str. pred. B G(2dim)γ DP
Hamada et al. [2009a] CentroidFold RNA sec. str. pred. B G(centroid)
γ DP
Hamada et al. [2010] CentroidFold RNA sec. str. pred. B G(Acc) DP/SS
Lorenz and Clote [2011] RNAlocopt RNA sec. str. pred. B G(2dim)γ DP
Sato et al. [2011] IPKnot RNA sec. str.
pred. with pseudoknot B
G(centroid)
γ ✓ IP
Hamada et al. [2009b] CentroidHomfold RNA sec. str. pred. with ho- mol. seq.
B G(centroid)
γ ✓ ✓ DP
Knudsen and Hein [2003] Pfold RNA com. sec. str. pred. B G(2dim)1 DP
Bernhart et al. [2008] RNAalifold RNA com. sec. str. pred. B G(centroid)1 DP
Kiryu et al. [2007] McCaskill-MEA RNA com. sec. str. pred. B G(2dim)γ ✓ DP
Seemann et al. [2008] PETfold RNA com. sec. str. pred. B G(2dim)γ ✓ DP
Hamada et al. [2011a] CentroidAlifold RNA com. sec. str. pred. B G(centroid)
γ ✓ DP
Wei et al. [2011] RNAG RNA com. sec. str. pred. B G(centroid)
γ GS
Sahraeian and Yoon [2011] PicXAA-R RNA multiple alignment B G(centroid)
∞ ✓ ✓ DP
Hamada et al. [2009c] CentroidAlign RNA multiple alignment B G(centroid)
γ ✓ ✓ DP
Tabei and Asai [2009] SCARNA-LM RNA local alignment B G(centroid)
γ DP
Summary
...
1 Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space.
...
2 Maximum score and maximum likelihood estimators do not work well in this situation although they are widely employed in a number of applications.
3... Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach.
...
4 A number of algorithms used in previous studies can be classified from the viewpoint of MEG/MEA.
Bibliography I
S. M. Freier, R. Kierzek, J. A. Jaeger, N. Sugimoto, M. H. Caruthers, T. Neilson, and D. H. Turner. Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl. Acad. Sci. U.S.A., 83:9373–9377, Dec 1986. S. Miyazawa. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng., 8:
999–1009, Oct 1995.
J S McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, May 1990.
L.E. Carvalho and C.E. Lawrence. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. U.S.A., 105:3209–3214, 2008.
L. Kall, A. Krogh, and E. L. Sonnhammer. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21 Suppl 1:i251–257, 2005.
S.S. Gross, C.B. Do, M. Sirota, and S. Batzoglou. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol., 8:R269, 2007.
P. Baldi, S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16:412–424, May 2000.
M. Hamada, H. Kiryu, K. Sato, T. Mituyama, and K. Asai. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics, 25:465–473, Feb 2009a.
M. Hamada, K. Sato, and K. Asai. Prediction of RNA secondary structure by maximizing pseudo-expected accuracy. BMC Bioinformatics, 11:586, 2010.
C.B. Do, D.A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22:e90–98, Jul 2006.
Ariel S. Schwartz, Eugene W. Myers, and Lior Pachter. Alignment metric accuracy, 2005.
Ariel S. Schwartz. Posterior Decoding Methods for Optimization and Accuracy Control of Multiple Alignments. Ph.D thesis, University of California, Berkeley, 2007.
Bibliography II
Michal N´an´asi, Tom´aˇs Vinaˇr, and Broˇna Brejov´a. The highest expected reward decoding for hmms with application to recombination detection. In Proc. of CPM’10, pages 164–176, 2010.
I. Holmes and R. Durbin. Dynamic programming alignment accuracy. J. Comput. Biol., 5:493–504, 1998.
C.B. Do, M.S. Mahabhashyam, M. Brudno, and S. Batzoglou. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res., 15:330–340, Feb 2005.
U. Roshan and D.R. Livesay. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics, 22:2715–2721, Nov 2006.
Yamada Shinsuke, Gotoh Osamu, and Yamana Hayato. Improvement in speed and accuracy of multiple sequence alignment program prime. IPSJ Transactions on Bioinformatics (TBIO), 1:2–12, nov 2008.
A.S. Schwartz and L. Pachter. Multiple alignment by sequence annealing. Bioinformatics, 23:e24–29, Jan 2007. S. M. Sahraeian and B. J. Yoon. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of
multiple sequences. Nucleic Acids Res., 38:4917–4928, Aug 2010.
M. C. Frith, M. Hamada, and P. Horton. Parameters for accurate genome alignment. BMC Bioinformatics, 11:80, 2010. Y. Ding, C.Y. Chan, and C.E. Lawrence. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble.
RNA, 11:1157–1166, Aug 2005.
Z. J. Lu, J. W. Gloor, and D. H. Mathews. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA, 15:1805–1813, Oct 2009.
W. A. Lorenz and P. Clote. Computing the partition function for kinetically trapped RNA secondary structures. PLoS ONE, 6: e16178, 2011.
K. Sato, Y. Kato, M. Hamada, T. Akutsu, and K. Asai. IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 2011. (to appear).
M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai. Predictions of RNA secondary structure by combining homologous
Bibliography III
Bjarne Knudsen and Jotun Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, 31(13):3423–3428, Jul 2003.
S.H. Bernhart, I.L. Hofacker, S. Will, A.R. Gruber, and P.F. Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9:474, Nov 2008.
H. Kiryu, T. Kin, and K. Asai. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics, 23:434–441, 2007.
S.E. Seemann, J. Gorodkin, and R. Backofen. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res., 36:6355–6362, 2008.
M. Hamada, K. Sato, and K. Asai. Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res., 39:393–402, Jan 2011a.
Donglai Wei, Lauren V. Alpert, and Charles E. Lawrence. Rnag: A new gibbs sampler for predicting rna secondary structure for unaligned sequences. Bioinformatics, 2011. doi: 10.1093/bioinformatics/btr421. URL
http://bioinformatics.oxfordjournals.org/content/early/2011/07/24/bioinformatics.btr421.abstract. S. M. Sahraeian and B. J. Yoon. PicXAA-R: Efficient structural alignment of multiple RNA sequences using a greedy approach.
BMC Bioinformatics, 12 Suppl 1:S38, 2011.
M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics, 25:3236–3243, Dec 2009c.
Y. Tabei and K. Asai. A local multiple alignment method for detection of non-coding RNA sequences. Bioinformatics, 25: 1498–1505, Jun 2009.
Y. Kato, K. Sato, M. Hamada, Y. Watanabe, K. Asai, and T. Akutsu. RactIP: fast and accurate prediction of RNA-RNA interaction using integer programming. Bioinformatics, 26:i460–466, Sep 2010.
S. E. Seemann, A. S. Richter, T. Gesell, R. Backofen, and J. Gorodkin. PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics, 27:211–219, Jan 2011.
M. Hamada, H. Kiryu, W. Iwasaki, and K. Asai. Generalized centroid estimators in bioinformatics. PLoS ONE, 6:e16450, 2011b.