060310391 0560565
13
2017/1/5 13:00-14:45
@1 - 4
1
1
MEDLINE
Jensen et al. (2006) Nat Rev Genet 7: 119
2
2
• 5
2008
3
•
•
• PubMed/MEDLINE
• MeSH term
•
•
• (GO)
• MEDIE
• Swanson
•
4
•
•
–
–
(Jensen et al. (2006) Nat Rev Genet 7: 119)
5
• Information retrieval :
–
• Entity recognition :
–
• Information extraction :
–
• Text mining (Knowledge discovery) :
– A-B, B-C → A-C
• Integration :
– QTL
Jensen et al. (2006) Nat Rev Genet 7: 119
6
MEDLINE PubMed
• MEDLINE: bibliographic database of the National Library of Medicine (NLM)
• 1946 39 60 ;
91% 5,600
1900 2012 7
• PubMed MEDLINE
7
• PubMed
– binary search
Google MeSH term
– vector search
Related citations
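The binary (Boolean) style of search listed above can be sketched with a small inverted index. This is only an illustration under stated assumptions: the three documents are invented toy data, the function names `build_index` and `boolean_and` are hypothetical, and nothing here reflects how PubMed is actually implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Toy documents (hypothetical, for illustration only)
docs = {
    1: "salt tolerance in rice seeds",
    2: "rice endosperm development",
    3: "salt stress and seed germination",
}
index = build_index(docs)
print(boolean_and(index, ["salt", "rice"]))  # {1}
```

A document either matches or does not, so the result has no ranking. Ranking is what the vector-based approach adds.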
8
MeSH term
(Medical Subject Headings)
• NLM
• 2011 26,142 descriptors
177,000
•
Salt-tolerance, Saline-Tolerance, Salt-Adaptation
9
Seeds Cotyledon Endosperm
MeSH term
10
PubMed
MeSH term
MeSH term
11
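• tf-idf
–
– tf (term frequency): frequency of term t in document d
• $\mathrm{tf}(d, t) = n_t / W_d$
where $n_t$ is the number of occurrences of t in d and $W_d$ is the total number of words in d
– idf (inverse document frequency): weight of term t across the collection
• $\mathrm{idf}(t) = \log_2(N / w_i) + 1$
where $N$ is the total number of documents and $w_i$ is the number of documents containing t
– Element of the vector x for document d:
$x_{dt} = \mathrm{tf}(d, t) \times \mathrm{idf}(t)$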
[Figure: plot of log(1:20, 2) against Index; x-axis ticks 5 to 20, y-axis ticks 0 to 4]
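The tf and idf definitions above can be tried on a toy corpus. The sketch below assumes documents are already split into word lists. The corpus is invented, the helper names are hypothetical, and returning 0 for a term that appears in no document is an added assumption, not part of the slide.

```python
import math
from collections import Counter

def tf(doc_words, term):
    """tf(d, t) = n_t / W_d: occurrences of term divided by document length."""
    if not doc_words:
        return 0.0
    return Counter(doc_words)[term] / len(doc_words)

def idf(corpus, term):
    """idf(t) = log2(N / w) + 1, where w is the number of documents containing term."""
    n_containing = sum(1 for words in corpus if term in words)
    if n_containing == 0:
        return 0.0  # assumption: a term seen in no document gets zero weight
    return math.log2(len(corpus) / n_containing) + 1

def tfidf(corpus, doc_index, term):
    """Weight of term in document doc_index: tf multiplied by idf."""
    return tf(corpus[doc_index], term) * idf(corpus, term)

# Toy corpus of four tokenized documents (hypothetical)
corpus = [
    ["salt", "rice", "salt", "stress"],
    ["rice", "seed"],
    ["seed", "growth"],
    ["salt", "seed"],
]
print(tfidf(corpus, 0, "salt"))  # 1.0
```

Here "salt" occurs twice among four words (tf = 0.5) and appears in two of four documents (idf = log2(2) + 1 = 2), giving a weight of 1.0.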
12
•
• x y
• α
$$\mathrm{sim}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$
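A minimal sketch of this similarity, assuming it is the usual cosine of the angle between two document vectors of equal length. The function name and the zero-vector convention are assumptions added for the example.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: similarity with an all-zero vector is defined as 0
    return dot / (norm_x * norm_y)

# Orthogonal vectors share no terms; parallel vectors differ only in length.
print(cosine_similarity([1, 0], [0, 1]))
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
```

Because the norms divide out document length, a long and a short document with the same term proportions score close to 1.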
13
(2)
Textpresso
• Textpresso (http://www.textpresso.org/):
–
–
–
14
Natural Language Processing: NLP
•
morphological analysis
: ; cats = cat + s
syntactic analysis
:
semantic analysis, discourse analysis
1999 .
15
• semantic analysis
•
•
• ontology
1999 .
16
Gene Ontology (GO)
•
•
• http://www.geneontology.org/
GO term
17
GO term
18
GO term
biological process, cellular component, molecular function
3
GO term
(Directed Acyclic Graph: DAG)
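In a directed acyclic graph a term may have more than one parent, so collecting all ancestors of a term needs a graph traversal rather than a walk up a single chain. The sketch below uses a small toy hierarchy with GO-style names. The names and links are simplified for illustration and are not copied from the ontology, and `ancestors` is a hypothetical helper.

```python
def ancestors(parents, term):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy DAG: child -> list of parents (hypothetical, simplified)
parents = {
    "biological_process": [],
    "response to stimulus": ["biological_process"],
    "response to stress": ["response to stimulus"],
    "response to chemical": ["response to stimulus"],
    "response to salt": ["response to stress", "response to chemical"],
}
print(sorted(ancestors(parents, "response to salt")))
```

The `seen` set ensures that a shared ancestor, reached here through two different parents, is visited only once.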
19
(Textpresso)
Muller et al. (2004) PLoS Biology 2: e309
20
Textpresso
• Recall
Precision
Muller et al. PLoS Biol 2(11): e309
http://www.textpresso.org/
21
precision recall
tp fp
fn tn
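$$\mathrm{Precision} = \frac{tp}{tp + fp}$$
$$\mathrm{Recall} = \frac{tp}{tp + fn}$$
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
F: F-measure, the harmonic mean of precision and recall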
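As a worked example of these three measures, the sketch below computes them from counts of true positives (tp), false positives (fp) and false negatives (fn). The counts are invented, and returning 0 when a denominator is zero is an assumption made for the example.

```python
def precision(tp, fp):
    """Fraction of retrieved items that are correct: tp / (tp + fp)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of correct items that were retrieved: tp / (tp + fn)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 8 correct hits, 2 wrong hits, 8 missed items
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
print(p, r, round(f_measure(p, r), 3))  # 0.8 0.5 0.615
```

The F value (about 0.615) sits closer to the lower of the two inputs, which is why it penalizes a system that trades one measure heavily for the other.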
22
Textpresso
Muller et al. PLoS Biol 2(11): e309 2 ‘‘regulation’’
‘‘association’’
23
• 10%
‘‘Neither pdk-1(gf) nor akt-1(gf) suppressed the
Hyp phenotype of age-1(mg44).’’
• context 70%
‘‘lin-35 and lin-53, two genes that antagonize a C.
elegans pathway, encode proteins similar to Rb and
its binding protein RbAp48.’’
→
Muller et al. PLoS Biol 2(11): e309
24
MEDIE
• MEDIE (http://www.nactem.ac.uk/medie/)
• semantic query
What causes cancer?
25
Ananiadou et al. TRENDS in Biotechnology 24: 571
26
MEDIE
27
(4)
Swanson
Arrowsmith http://arrowsmith.psych.uic.edu/
A C
B1, B2, B3, B4
A-C
Undiscovered public knowledge
Swanson Fig.1
MEDLINE (1)
(A) (C) (2)
(A) (C) (3)
C(A)
Biomedical Digital Libraries 2006, 3:2
28
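The A, B, C idea can be sketched as a set operation: find the B terms that appear with A in some documents and with C in others, in a collection where A and C never appear together. The term sets below reuse the abstract labels A, B1 to B4 and C and are invented. The function names are hypothetical, and this is not the Arrowsmith implementation.

```python
def directly_linked(docs, a, c):
    """True if any single document mentions both a and c."""
    return any(a in terms and c in terms for terms in docs)

def linking_terms(docs, a, c):
    """Terms that co-occur with a in some document and with c in some document."""
    with_a = set().union(*(terms for terms in docs if a in terms))
    with_c = set().union(*(terms for terms in docs if c in terms))
    return (with_a & with_c) - {a, c}

# Each set is the list of terms found in one document (hypothetical)
docs = [
    {"A", "B1"},
    {"A", "B2"},
    {"A", "B4"},
    {"B1", "C"},
    {"B2", "B3", "C"},
]
print(directly_linked(docs, "A", "C"))        # False
print(sorted(linking_terms(docs, "A", "C")))  # ['B1', 'B2']
```

B4 is linked only to A and B3 only to C, so neither is reported. Only B1 and B2 bridge the two otherwise unconnected literatures.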
Arrowsmith
http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html
29
Smalheiser (2009) Computer Methods and Programs in Biomedicine 94: 190–197
A = microRNA, C = Xist
Xist (X-inactive specific transcript)
30
“Disorder”
31
A C
32
co-occurrence
Jensen et al. (2006) Nat Rev Genet 7: 119
von Mering (2005) Nucleic Acids Res 33: D433
NLP
b c
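Co-occurrence can be counted without any language processing: two entities are paired whenever they are mentioned in the same sentence or abstract. The sketch below applies this to the two C. elegans example sentences quoted from Muller et al. elsewhere in these slides. The entity list, the naive substring matching and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(texts, entities):
    """Count, for each pair of entities, the number of texts mentioning both."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        # Naive matching: an entity counts as present if its name is a substring
        present = sorted(e for e in entities if e.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts

texts = [
    "Neither pdk-1(gf) nor akt-1(gf) suppressed the Hyp phenotype of age-1(mg44).",
    "lin-35 and lin-53, two genes that antagonize a C. elegans pathway, "
    "encode proteins similar to Rb and its binding protein RbAp48.",
]
entities = ["pdk-1", "akt-1", "age-1", "lin-35", "lin-53"]
counts = cooccurrence_counts(texts, entities)
print(counts[("lin-35", "lin-53")], counts[("age-1", "lin-35")])  # 1 0
```

A count says only that two names appear together, not what relation the sentence states. The first sentence pairs pdk-1 with age-1 while reporting that no suppression was seen.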
33
(5)
Korbel et al. (2005) PLoS Biology 3: e134
92 (I’’)
172,967 -
I 224,754
11,026OG I’
34
[ - ], [
]
35
G2D (Genes to Disease)
http://g2d2.ogic.ca/
G2D
Gene Ontology term
Tremblay 2008 PLoS One 3: e2907 2
Teber et al. 2009 BMC Bioinformatics 10: S69 G2D
36
G2D PHENOTYPE
1. OMIM ID
2.
3. OMIM
MeSH term
C,
4. MEDLINE
MeSH D GO
term GO
5. RefSeq GO
6. RefSeq
(2) BLAST
7.
①
②
1, 2
③ ④
⑤
⑥
⑦
37
OMIM
MEDLINE , MEDLINE MeSH term
C,
MEDLINE
GO term MeSH C D
RefSeq
BLAST
38
2010
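[Figure: word cloud of Japanese terms. Terms shown (translated): principal component analysis, biology, statistics, understanding, hierarchy, explanation, example, application, data, necessary, experiment, research, computation, phylogeny, high, interesting, very, Fourier, shape, representation, fun, request, barcode, report, tough, algorithm, meaning, method, handout, technique, class, mathematics, talk, content, formula, difficult, many, commentary, good, analysis]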
RMeCab
39
•
•
•
40