Biometrics lecture materials: iwatawiki lec14 s


060310391

0560565

Lecture 14, 2017/1/16 13:00-14:45

@1 - 4

1

…

DNA

2

• …
  – k-means, SOM
• …
  – SVM

3

• mRNA

4


http://www.scq.ubc.ca/spot-your-genes-an-overview-of-the-microarray/ Art by Jiang Long

cDNA

2

5 cDNA

= /

http://www.promega.com/enotes/applications/ap0066.htm

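GeneChip

mRNA → cDNA → cRNA → GeneChip

• Probes of 20-25 bases; 10-20 probe pairs per gene
• Each pair consists of a Perfect Match (PM) probe and a Mismatch (MM) probe
• The MM probe differs from the PM probe by 1 base

http://www.scq.ubc.ca/spot-your-genes-an-overview-of-the-microarray/ Art by Jiang Long

GeneChip: PM and MM signals

Schadt et al. (2000) J Cell Biochem 80: 192

[Figure: Perfect Match (PM) and Mismatch (MM) probe intensities]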

RNA-seq

Wang et al. (2009) Nat Rev Genet 10: 57-63

RNA-seq

Conesa et al. (2016) Genome Biology 17: 13

‘align-then-assemble’ and ‘assemble-then-align’

Haas and Zody (2010) Nat. Biotech. 28: 421-423

– hierarchical clustering

– non-hierarchical clustering

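Expression levels (rows: genes 1-4, columns: samples 1-4)

        1      2      3      4
1     1.53   2.38   2.80   0.60
2     1.03   2.54   3.29   0.80
3     0.85   0.21   0.34   3.02
4     1.03   0.82   0.94   1.20

Correlation coefficient r between genes:

        1      2      3      4
1     1.00   0.95  -0.92  -0.88
2     0.95   1.00  -0.74  -0.77
3    -0.92  -0.74   1.00   0.93
4    -0.88  -0.77   0.93   1.00

Absolute value of the correlation coefficient |r|:

        1      2      3      4
1     1.00   0.95   0.92   0.88
2     0.95   1.00   0.74   0.77
3     0.92   0.74   1.00   0.93
4     0.88   0.77   0.93   1.00

Euclidean distance between genes:

        1      2      3      4
1     0.00   0.74   4.13   2.55
2     0.74   0.00   4.36   2.94
3     4.13   4.36   0.00   2.02
4     2.55   2.94   2.02   0.00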

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
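As a small worked check of the correlation coefficient, the sketch below computes r for genes 1, 2, and 3 of the expression table. The function name `pearson` is illustrative only and is not taken from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square roots of the two sums of squared deviations
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Rows 1-3 of the expression table (4 samples each)
gene1 = [1.53, 2.38, 2.80, 0.60]
gene2 = [1.03, 2.54, 3.29, 0.80]
gene3 = [0.85, 0.21, 0.34, 3.02]

r12 = pearson(gene1, gene2)
r13 = pearson(gene1, gene3)
print(round(r12, 2), round(r13, 2))
```

Genes 1 and 2 rise and fall together across the samples, while gene 3 moves in the opposite direction, which is what the signs of the two coefficients reflect.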

[Figure: expression level (y-axis) against sample ID (x-axis) for genes 1-4]

• dendrogram

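For $x_i = (x_{i1}, \ldots, x_{ip})$ and $x_j = (x_{j1}, \ldots, x_{jp})$:

Euclidean distance:
$$ d(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + \cdots + (x_{ip} - x_{jp})^2} $$

Manhattan distance:
$$ d(x_i, x_j) = |x_{i1} - x_{j1}| + \cdots + |x_{ip} - x_{jp}| $$

Maximum distance:
$$ d(x_i, x_j) = \max\left(|x_{i1} - x_{j1}|, \ldots, |x_{ip} - x_{jp}|\right) $$

Minkowski distance (the exponent is written $q$ here to keep it apart from the dimension $p$):
$$ d(x_i, x_j) = \left(|x_{i1} - x_{j1}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} $$

• $q = 1$: Manhattan distance
• $q = 2$: Euclidean distance
• $q \to \infty$: Maximum distance

Canberra distance:
$$ d(x_i, x_j) = \frac{|x_{i1} - x_{j1}|}{|x_{i1} + x_{j1}|} + \cdots + \frac{|x_{ip} - x_{jp}|}{|x_{ip} + x_{jp}|} $$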
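The distance measures can be compared on one pair of vectors. The following is a minimal sketch in plain Python; the function names are illustrative, the Minkowski exponent is called `q`, and skipping zero denominators in the Canberra distance is an assumption made here, not something stated on the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    # q = 1 gives the Manhattan distance, q = 2 the Euclidean distance
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def canberra(x, y):
    # Assumption: terms whose denominator is zero are skipped
    return sum(abs(a - b) / abs(a + b) for a, b in zip(x, y) if a + b != 0)

# Genes 1 and 2 of the expression table
gene1 = [1.53, 2.38, 2.80, 0.60]
gene2 = [1.03, 2.54, 3.29, 0.80]

d_euc = euclidean(gene1, gene2)
d_man = manhattan(gene1, gene2)
d_max = maximum(gene1, gene2)
d_can = canberra(gene1, gene2)
print(round(d_euc, 2), round(d_man, 2), round(d_max, 2), round(d_can, 2))
```

For large `q` the Minkowski value approaches the maximum distance, because the largest coordinate difference dominates the sum.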

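1. Compute the distances between all pairs of items.

2. Merge the closest pair into 1 cluster.

3. Compute the distances between the new cluster and the remaining items.

4. Repeat steps 2-3 until everything has been merged into 1 cluster.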
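A minimal sketch of agglomerative (bottom-up) hierarchical clustering follows: start from single items, repeatedly merge the two closest clusters, and record the distance at each merge. It assumes Euclidean distance and a choice of single, complete, or average linkage; the function `agglomerate` and its return format are illustrative and do not mirror R's `hclust`.

```python
import math

def agglomerate(points, linkage="single"):
    """Merge clusters bottom-up; return a list of (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(a, b):
        d = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # nearest neighbor
            return min(d)
        if linkage == "complete":    # furthest neighbor
            return max(d)
        return sum(d) / len(d)       # group average

    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for s in range(len(clusters)):
            for t in range(s + 1, len(clusters)):
                d = cluster_distance(clusters[s], clusters[t])
                if best is None or d < best[0]:
                    best = (d, s, t)
        d, s, t = best
        merges.append((sorted(clusters[s]), sorted(clusters[t]), d))
        merged = clusters[s] + clusters[t]
        clusters = [c for k, c in enumerate(clusters) if k not in (s, t)] + [merged]
    return merges

# Genes 1-4 of the expression table (indices 0-3)
genes = [
    [1.53, 2.38, 2.80, 0.60],
    [1.03, 2.54, 3.29, 0.80],
    [0.85, 0.21, 0.34, 3.02],
    [1.03, 0.82, 0.94, 1.20],
]
single = agglomerate(genes, "single")
complete = agglomerate(genes, "complete")
for a, b, d in single:
    print(a, b, round(d, 2))
```

The merge distances are the heights at which branches join in a dendrogram. The two linkage rules merge the same pairs first here and differ only in the height of the final merge.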

[Figure: genes 1-5 plotted by expression level in Exp.1 (x-axis) and Exp.2 (y-axis), with the resulting dendrogram from hclust (*, "average"); y-axis: Distance]

1. Nearest neighbor method (a.k.a. single linkage)

2. Furthest neighbor method (a.k.a. complete linkage)

3. Group average method

4. Centroid method

5. Median method

6. Ward’s method

$$ d(A, B) = E(A \cup B) - E(A) - E(B) $$

where $E(\cdot)$ is the within-cluster sum of squared deviations from the cluster centroid.
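The Ward criterion can be checked numerically. The sketch below takes E(C) to be the sum of squared Euclidean deviations of the members of C from their centroid, which is the usual definition for Ward's method and is stated here as an assumption. The two small clusters are hypothetical data.

```python
def sse(cluster):
    """E(C): sum of squared deviations of the points in C from their centroid."""
    n = len(cluster)
    dim = len(cluster[0])
    centroid = [sum(p[k] for p in cluster) / n for k in range(dim)]
    return sum((p[k] - centroid[k]) ** 2 for p in cluster for k in range(dim))

def ward_distance(a, b):
    """d(A, B) = E(A ∪ B) - E(A) - E(B)."""
    return sse(a + b) - sse(a) - sse(b)

# Hypothetical clusters, chosen only for illustration
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(5.0, 0.0)]
d_ab = ward_distance(A, B)
print(round(d_ab, 3))
```

The value measures how much the total within-cluster sum of squares grows when A and B are merged, so Ward's method merges the pair for which this increase is smallest.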

[Figure: genes 1-5 plotted by expression level in Exp.1 and Exp.2, with dendrograms for complete linkage, hclust (*, "complete"), and single linkage, hclust (*, "single"); y-axis: Distance]

[Figure: single-linkage dendrograms of genes 1-5 for four distance measures: Euclidean distance, dist(c, method = "euclidean"); Maximum distance, dist(c, method = "maximum"); Manhattan distance, dist(c, method = "manhattan"); Canberra distance, dist(c, method = "canberra"); y-axis: Height]

• Correlation-based distance: $1 - r_{ij}$ or $1 - |r_{ij}|$
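The difference between the two correlation-based distances shows up for a negatively correlated pair. This is a minimal sketch using genes 1 and 3 of the expression table; the helper names are illustrative.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distances(x, y):
    """Return (1 - r, 1 - |r|) for two expression profiles."""
    r = pearson(x, y)
    return 1 - r, 1 - abs(r)

gene1 = [1.53, 2.38, 2.80, 0.60]
gene3 = [0.85, 0.21, 0.34, 3.02]

d_signed, d_abs = correlation_distances(gene1, gene3)
print(round(d_signed, 2), round(d_abs, 2))
```

With 1 - r, strongly anti-correlated genes are far apart; with 1 - |r| they are close, which groups genes whose profiles mirror each other.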

• Time points: 0, 15, 30 min and 1, 2, 3, 4, 8, 12, 16, 20, 24 h

• Reference: time 0

• Eisen et al. (1998) PNAS 95: 14863

[Figure: clustered expression profiles with annotated gene clusters: Cholesterol biosynthesis; Cell cycle; Immediate-early response; Signaling & angiogenesis; Wound healing & Tissue remodeling]

…

Transposed table (rows: samples 1-4, columns: genes 1-4)

        1      2      3      4
1     1.53   1.03   0.85   1.03
2     2.38   2.54   0.21   0.82
3     2.80   3.29   0.34   0.94
4     0.60   0.80   3.02   1.20

• 60 cell lines

• Ross et al. (2000) Nat Genet 24: 227

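• k-means

• Self-organizing maps (SOM)

k-means

1. Choose the number of clusters k and place k initial cluster centers.

2. Assign each item to the nearest of the k centers.

3. Recompute each center as the mean of the items assigned to it.

4. Repeat steps 2-3 until the assignments no longer change.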
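A minimal sketch of the k-means iteration follows: assign each point to its nearest center, recompute the centers, and stop when the assignments no longer change. Passing the initial centers explicitly (rather than drawing them at random) and the toy data are assumptions made here to keep the example deterministic.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means; returns (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:   # converged: assignments unchanged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:            # keep the old center if a cluster is empty
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centers

# Hypothetical 2-D data with two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

In practice the result depends on the initial centers, so the procedure is usually run from several starting configurations.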

k-means

Bishop CM (2006) Pattern recognition and machine learning. Springer.

• 386 …, 1311 SNPs

• k-means, 5 clusters

[Figure: k-means clustering displayed on principal component scores (panels PC1 vs. PC2 and PC3 vs. PC4) at Rep 0, Rep 1, Rep 2, and Rep 10]


• •  ^L

Soukas et al. (2000) Genes & Development 14: 963

33 k-medoids

•  : 229

22

• 

• k-medoids in R: function pam of package cluster, 48 medoids

•  48

● 48

•  ²²⁹

48

[Figure: PCA score plots (PC1 vs. PC2, PC3 vs. PC4) and histograms of PC 1 to PC 4 for all items (pca.tr$x[, i]) and for the k-medoids selection (pca.tr$x[kmed$id.med, i]); y-axis: Frequency]

•  ^medoid

•  medoid

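1. Arrange the units on a grid (e.g. 5x5).

2. Give each unit an initial weight vector w(0).

3. Pick 1 gene expression vector g_i.

4. Find the unit whose weight vector is closest to g_i: the BMU (best matching unit).

5. Move the weight vectors of the BMU and of the units near the BMU toward g_i:

$$ w(t+1) = w(t) + h(d, t)\,(g_i - w(t)) $$

$$ h(d, t) = \theta(d, t)\,\alpha(t) $$

where d is the distance from the BMU on the grid, θ(d, t) is the neighborhood function, and α(t) is the learning rate.

6. Repeat steps 3-5 while decreasing θ(d, t) and α(t).

Example: g_i = (0.70, 0.23, 0.31), w_BMU = (0.60, 0.26, 0.30)

For the BMU, with h(0, t) = 0.8:

$$ w(t+1) = \begin{pmatrix} 0.60 \\ 0.26 \\ 0.30 \end{pmatrix} + 0.8 \times \left[ \begin{pmatrix} 0.70 \\ 0.23 \\ 0.31 \end{pmatrix} - \begin{pmatrix} 0.60 \\ 0.26 \\ 0.30 \end{pmatrix} \right] = \begin{pmatrix} 0.68 \\ 0.24 \\ 0.31 \end{pmatrix} $$

For the unit at (2,3), with h(1, t) = 0.4 and h(d>1, t) = 0:

$$ w(t+1) = \begin{pmatrix} 0.66 \\ 0.72 \\ 0.72 \end{pmatrix} + 0.4 \times \left[ \begin{pmatrix} 0.70 \\ 0.23 \\ 0.31 \end{pmatrix} - \begin{pmatrix} 0.66 \\ 0.72 \\ 0.72 \end{pmatrix} \right] = \begin{pmatrix} 0.68 \\ 0.52 \\ 0.56 \end{pmatrix} $$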
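The update rule w(t+1) = w(t) + h(d, t)(g_i − w(t)) can be traced in a few lines. This sketch finds the BMU among two units and applies the update with h = 0.8 for the BMU and h = 0.4 for a neighbor. The grid position (2, 2) assigned to the BMU and the two-unit dictionary are hypothetical; only the vectors and the h values come from the slide.

```python
import math

def find_bmu(weights, g):
    """Grid position of the unit whose weight vector is closest to g."""
    return min(weights, key=lambda pos: math.dist(weights[pos], g))

def som_update(w, g, h):
    """One SOM step for a single unit: w + h * (g - w)."""
    return [wk + h * (gk - wk) for wk, gk in zip(w, g)]

g_i = [0.70, 0.23, 0.31]
weights = {
    (2, 2): [0.60, 0.26, 0.30],   # hypothetical grid position of the BMU
    (2, 3): [0.66, 0.72, 0.72],   # neighboring unit
}

bmu = find_bmu(weights, g_i)
new_bmu = som_update(weights[bmu], g_i, h=0.8)
new_neighbor = som_update(weights[(2, 3)], g_i, h=0.4)
print([round(v, 3) for v in new_bmu])
print([round(v, 3) for v in new_neighbor])
```

Because 0 < h ≤ 1, each updated component lies between the old weight and the corresponding component of g_i, and the BMU moves further toward g_i than its neighbor does.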

• Data: http://genomics.stanford.edu

• 6 × 5 SOM, 828 genes

• Tamayo et al. (1999) PNAS 96: 2907

[Figure: SOM]

• 

• Supervised learning

  – classification

  – regression

  – e.g. SVM, Random Forest

• Unsupervised learning

  – e.g. k-means

support vector machine (SVM)

• basis function

• kernel function

• feature space

• input space

Linear kernel:
$$ k(x, z) = x^T z $$

Polynomial kernel:
$$ k(x, z) = (x^T z + c)^M $$

Gaussian kernel:
$$ k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right) $$
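The three kernels can be evaluated directly on a pair of vectors. This is a minimal sketch; the default parameter values (c = 1, M = 2, sigma = 1) and the example vectors are arbitrary choices for illustration.

```python
import math

def linear_kernel(x, z):
    """k(x, z) = x^T z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, c=1.0, M=2):
    """k(x, z) = (x^T z + c)^M"""
    return (linear_kernel(x, z) + c) ** M

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

x = [1.0, 2.0]
z = [3.0, 4.0]
k_lin = linear_kernel(x, z)
k_poly = polynomial_kernel(x, z)
k_gauss = gaussian_kernel(x, z)
print(k_lin, k_poly, round(k_gauss, 4))
```

The Gaussian kernel equals 1 when the two vectors coincide and decays toward 0 as they move apart, so it behaves like a similarity measure.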

• y = 5 sin(x) + e, e ~ N(0, 1)

[Figure: linear regression fitted to the simulated data; x-axis: data$x, y-axis: data$y]

[Figure: Shape of kernel (beta = 1); exp(-beta * x^2) plotted against x]

• Kernel between 2 points $x_i$, $x_j$:

$$ k(x_j, x_i) = \exp\left(-\beta \|x_j - x_i\|^2\right) $$

• Relationship between x and y:

$$ y = f(x) = \sum_{j=1}^{n} \alpha_j k(x_j, x) $$

Linear model in the input space:

$$ y = \sum_{m=1}^{M} w_m x_m + e = w^T x + e $$

Linear model in the feature space, with x mapped by φ:

$$ y = \sum_{k=1}^{K} w_k \phi_k(x) + e = w^T \phi(x) + e $$

With $w = \sum_{j=1}^{n} \alpha_j \phi(x_j)$:

$$ y = \sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x) + e = \sum_{j=1}^{n} \alpha_j k(x_j, x) + e $$

$$ K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_n, x_1) \\ \vdots & \ddots & \vdots \\ k(x_1, x_n) & \cdots & k(x_n, x_n) \end{pmatrix} $$

• Residual sum of squares:

$$ R(\alpha) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j k(x_j, x_i) \right)^2 = (y - K\alpha)^T (y - K\alpha) $$

• Setting the derivative of R with respect to α to zero gives:

$$ \alpha = (K^T K)^{-1} K^T y = K^{-1} y $$

[Figure: kernel regression without regularization; x-axis: data$x, y-axis: data$y]

• overfitting

$$ R(\alpha) = (y - K\alpha)^T (y - K\alpha) + \lambda \alpha^T K \alpha $$

• λ: regularization parameter

• Setting the derivative of R with respect to α to zero gives:

$$ \alpha = (K + \lambda I)^{-1} y $$

[Figures: kernel regression with lmbd = 0.4, lmbd = 0.04, and lmbd = 4; x-axis: data$x, y-axis: data$y]

• λ = 0.04: weaker regularization, the fitted curve follows the data more closely

• λ = 4: stronger regularization, the fitted curve is smoother
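The regularized solution α = (K + λI)⁻¹ y can be computed on simulated data of the form y = 5 sin(x) + e. This is a minimal sketch in plain Python: the sample size of 30, the evenly spaced x values on [0, 10], the random seed, and the small Gaussian-elimination solver are all assumptions made for illustration, and the code is not the R script behind the figures.

```python
import math
import random

def gaussian_kernel(xj, xi, beta=1.0):
    """k(x_j, x_i) = exp(-beta * (x_j - x_i)^2) for scalar inputs."""
    return math.exp(-beta * (xj - xi) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit(xs, ys, lam, beta=1.0):
    """alpha = (K + lam * I)^{-1} y, with K[i][j] = k(x_j, x_i)."""
    n = len(xs)
    A = [[gaussian_kernel(xs[j], xs[i], beta) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(A, ys)

def predict(x, xs, alpha, beta=1.0):
    """f(x) = sum_j alpha_j k(x_j, x)."""
    return sum(a * gaussian_kernel(xj, x, beta) for a, xj in zip(alpha, xs))

random.seed(1)                                   # arbitrary seed
xs = [10 * i / 29 for i in range(30)]            # 30 points on [0, 10]
ys = [5 * math.sin(x) + random.gauss(0, 1) for x in xs]

lam = 0.4
alpha = fit(xs, ys, lam)
fitted = [predict(x, xs, alpha) for x in xs]
print(round(predict(math.pi / 2, xs, alpha), 2))
```

Rerunning `fit` with a smaller or larger `lam` shows the trade-off: values near zero make the curve chase individual noisy points, and large values shrink it toward zero.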

SVM

[Figure: SVM results with a Linear kernel and with a Gaussian kernel; x-axis: x[,1], y-axis: x[,2]]

Brown et al. (2000) PNAS 97: 262

• SVM

• 2,467 genes, 79 experiments, 5 functional classes

• – 

false discovery rate: FDR

• 

51

• 

– eQTL, QTL

(14)

• (R 5)

• •  ^:

•  ISBN-10: 4320019253

•  ISBN-13: 978-4320019256

• 

• R

http://www.amazon.co.jp/
