バイオインフォマティクス特論3

(1)

バイオインフォマティクス特論

1 パラメータ推定

1-2. ガウス分布を使った推定

(2)

1-2. ガウス分布を使った推論

1-2-1. 平均と標準偏差を推測する

1-2-2. ７⼈の科学者

(3)

1-2-1. 平均と標準偏差を推測する

「ベイズ統計で実践モデリング」

第4章ガウス分布を使った推論

4.1 平均と標準偏差を推測する

(4)

データはn個の独⽴な観測値

x

₁

, x

₂

, … x

_n

例:ある細胞の遺伝⼦の発現量を、温度、時間などの条件を全く同じ

状態で独⽴にn回計測

ガウス分布に従うと仮定

このデータが従う

ガウス分布

の

平均

と

標準偏差

を

推測しよう

注意：Jagsでは、ガウス分布は、

平均

と

精度

(

精度は標準偏差の逆数

）の２つにパラメータで

表される。

(5)

問題をグラフィカルモデルで表現

μ

σ

. . .

(6)

プレート記法によるグラフィカルモデル

μ

σ

x

_i

(7)

連続値

離散値

⾮観測変数

(確率変数）

観測変数

(8)

尤度

μ

σ

x

_i

i data

仮定からデータx

_i

は同じ正規分布

に従うので、

n

尤度=

Π

dnorm(x

_i

, µ, 1/σ

2

₎

i=1

モデルとしては、

x

_i

~ Gaussian(µ, 1/σ

2

₎

と表現

(9)

事前分布

μ

σ

x

_i

i data

µは平均０で、精度が小さい

(=分散が大きい）正規分布に従う

とする。

µ ~ Gaussian(0, 0.001)

σは、上限を10に設定して、

一様分布で表現

σ ~ Uniform(0, 10)

※ 無情報事前分布、あるいはそれに近い形で表現

(10)

x <- seq(-10, 10, 0.01)

(11)

Jagsでのモデルの記述

Gaussian.txt

μ

σ

x

_i

i data

# Inferring the Mean and Standard Deviation

# of a Gaussian

model{

# Data Come From A Gaussian

for (i in 1:n){

x[i] ~ dnorm(mu,lambda)

}

# Priors

mu ~ dnorm(0,.001)

sigma ~ dunif(0,10)

lambda <- 1/pow(sigma,2)

}

(12)

R2jagsによるMCMCサンプリング

Gaussian_jags.R

# clears workspace:

rm(list=ls())

# sets working directories:

setwd("/Users/toh/Desktop/Code/ParameterEstimation/Gaussian")

library(R2jags)

x <- c(1.1, 1.9, 2.3, 1.8)

n <- length(x)

data <- list("x", "n") # to be passed on to JAGS

変数のクリア

Rate_1.txtとRate_1_jags.R

のあるディレクトリへの移動

パッケージの読み込み

(13)

myinits <- list( list(mu = 0, sigma = 1))

# parameters to be monitored:

parameters <- c("mu", "sigma")

# The following command calls JAGS with specific options.

# For a detailed description see the R2jags documentation.

samples <- jags(data, inits=myinits, parameters,

model.file="Gaussian.txt",

n.chains=1

, n.iter=1000,

n.burnin

=1,

n.thin

=1, DIC=T)

# Now the values for the monitored parameters are in the "samples" object,

# ready for inspection.

mu <- samples$BUGSoutput$sims.list$mu

sigma <- samples$BUGSoutput$sims.list$sigma

μとσの初期値の設定

モニターするパラメータの設定

Jagsを呼び出してμとσをサンプリング

(14)

traceplot(samples)

par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5, font.lab = 2, cex.axis = 1.3, bty = "n", las=1)

Nbreaks <- 80

y <- hist(mu, Nbreaks, plot=F)

plot(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="S", lwd=2, lty=1, xlim=c(0,15), ylim=c(0,3), xlab="Rate", ylab="Posterior Density")

Nbreaks <- 80

y <- hist(sigma, Nbreaks, plot=F)

(15)

(16)

(17)

μの要約

σの要約

summary(mu) V1 Min. :-7.035 1st Qu.: 1.531 Median : 1.773 Mean : 1.784 3rd Qu.: 2.021 Max. : 8.734 summary(sigma) V1 Min. :0.2040 1st Qu.:0.5188 Median :0.7184 Mean :0.9790 3rd Qu.:1.0996 Max. :9.5059 density(mu)$x[which(density(mu)$y==max(density(mu)$y))] [1] 1.777701 density(sigma)$x[which(density(sigma)$y==max(density(sigma)$y))] [1] 0.5296268

(18)

1-2-2. ７人の科学者

「ベイズ統計で実践モデリング」

第4章ガウス分布を使った推論

4.2 7⼈の科学者

(19)

実験スキルの⼤きく異なる７⼈の科学者が、全員同じ量について測定を⾏う。

−27.020 3.570 8.191 9.808 9.603 9.945 10.056 直感的には最初の２⼈は適性を⽋いた研究者この量の真の値は10をわずかに下回る程度と考えられる。この測定量についての事後分布を求めて、測定される量について推論する。副次的には、７⼈の科学者の測定スキルをついて推論する。

(20)

問題をグラフィカルモデルで表現

σ

_i

λ

_i

x

_i

μ

i番⽬のデータ

仮定

(1) 全ての科学者の測定はガウス分布に従う

(2) 全員同じ量を測定しているので、それぞれの

ガウス分布は同じ平均値を持つ

(3) 標準偏差の違いで実験スキルの違いを表現

プレート表記でのグラフィカルモデル

(21)

σ

₂

λ

₂

x

₂

₇

(22)

Jagsでのモデルの記述

SevenScientists.txt

# The Seven Scientists

model{

# Data Come From Gaussians With Common Mean But Different Precisions

for (i in 1:n){

x[i] ~ dnorm(mu,lambda[i])

}

# Priors

mu ~ dnorm(0,.001)

for (i in 1:n){

lambda[i] ~ dgamma(.001,.001)

sigma[i] <- 1/sqrt(lambda[i])

}

尤度の要素の設定事前分布の設定

(23)

事前分布の形

x <- seq(0, 1.0, 0.001)

plot(x, dgamma(x, 0.001, 0.001), ty='l')

先の例では標準偏差の事前分布として

⼀様分布を使ったが、今回は精度の

事前分布として

ガンマ分布

を使⽤

(24)

R2jagsによるMCMCサンプリング

SevenScientists_jags.R

# clears workspace:

rm(list=ls())

# sets working directories:

setwd("/Users/toh/Desttop/Code/ParameterEstimation/Gaussian")

library(R2jags)

x <- c(-27.020,3.570,8.191,9.898,9.603,9.945,10.056)

n <- length(x)

data <- list("x", "n") # to be passed on to JAGS

変数のクリア

Rate_1.txtとRate_1_jags.R

のあるディレクトリへの移動

パッケージの読み込み

(25)

myinits <- list(

list(mu = 0, lambda = rep(1,n)))

# parameters to be monitored:

parameters <- c("mu", "sigma")

# The following command calls JAGS with specific options.

# For a detailed description see the R2jags documentation.

samples <- jags(data, inits=myinits, parameters,

model.file ="SevenScientists.txt", n.chains=1, n.iter=1000,

n.burnin=1, n.thin=1, DIC=T)

# Now the values for the monitored parameters are in the "samples" object,

# ready for inspection.

μとλの初期値の設定

モニターするパラメータの設定

Jagsを呼び出してμとσをサンプリング

(26)

MCMC のサンプリングのトレース traceplot(samples)

(27)

Nbreaks <- 80

y <- hist(mu, Nbreaks, plot=F)

(28)

μの要約

summary(mu) V1 Min. : 3.446 1st Qu.: 9.851 Median : 9.922 Mean : 9.860 3rd Qu.: 9.979 Max. :10.658

(29)

Nbreaks <- 80

hist(sigma[,1], Nbreaks)

σ

の事後分布のヒストグラム

(30)

σ

の要約

summary(sigma)

V1 V2 V3 V4 V5 V6 V7

Min. : 11.38 Min. : 0.091 Min. : 0.0893 Min. : 0.01756 Min. : 0.01935 Min. : 0.01721 Min. : 0.01861 1st Qu.: 32.33 1st Qu.: 5.347 1st Qu.: 1.4573 1st Qu.: 0.07211 1st Qu.: 0.22762 1st Qu.: 0.06734 1st Qu.: 0.11319

Median : 54.58 Median : 8.719 Median : 2.5830 Median : 0.14569 Median : 0.40984 Median : 0.13662 Median : 0.22778 Mean : 336.26 Mean : 64.568 Mean : 11.3078 Mean : 0.90485 Mean : 1.42081 Mean : 0.55492 Mean : 1.49933

3rd Qu.: 115.78 3rd Qu.: 16.879 3rd Qu.: 5.1573 3rd Qu.: 0.35877 3rd Qu.: 0.89008 3rd Qu.: 0.32470 3rd Qu.: 0.60704 Max. :169014.80 Max. :14080.411 Max. :1861.6139 Max. :156.96397 Max. :101.48750 Max. :27.16168 Max. :270.28688

(31)

for (i in 1:n) { print(density(sigma[,i])$x[which(density(sigma[,i])$y==max(density(sigma[,i])$y))]) } [1] -27.84334 [1] 6.507084 [1] 2.14607 [1] 0.05037881 [1] 2.226777 [1] 0.1280672 [1] -0.152803 結果が、平均やメジアン（中央値）ほど安定していない MCMCのステップ数（サンプル数）が少ないせいか？

モード (最頻値）を出⼒

(32)

1-2-3. 7人の科学者のコードの改造

ステップ数（サンプル数）が少ない

バーンイン、やシンニングが考慮されていない

(33)

1-2-3. 7人の科学者のコードの改造

• burn in を100 ß--- n.burnin

• thinningを50 ß--- n.thin

• MCMCの全ステップ数を100000 ß--- n.iter

• chainを２にして、初期値を与える。ß-- n.chains

• SevenScientistsv2_jags.R

(=SevenScientists_jags.Rをコピーしたもの）を書き換え

る。2回目のRate_1_jags.Rを参考にできる。

• モデルは同じなので、SevenScientists.txtは変更しなくて

良い

(34)

samples <- jags(data, inits=myinits, parameters,

model.file ="SevenScientists.txt",

n.chains=1

,

n.iter=1000

,

n.burnin=1

,

n.thin=1

, DIC=T)

samples <- jags(data, inits=myinits, parameters,

model.file ="SevenScientists.txt",

n.chains=2

,

n.iter=100000

,

n.burnin=100

,

n.thin=50

, DIC=T)

(35)

(2) ２つの連鎖の初期値を与える。

myinits <- list(

list(mu = 0, lambda = rep(1,n)))

myinits <- list(

list(mu = 0, lambda = rep(1,n))

,

(36)

print(samples)

Inference for Bugs model at "SevenScientists.txt", fit using jags,

2 chains, each with 1e+05 iterations (first 100 discarded), n.thin = 50 n.sims = 3996 iterations saved

mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff mu 9.914 0.155 9.581 9.877 9.930 9.982 10.115 1.008 4000 sigma[1] 289.514 4335.256 16.386 31.753 55.113 115.877 1157.041 1.001 4000 sigma[2] 72.155 2657.668 2.765 5.471 9.256 19.637 169.932 1.001 4000 sigma[3] 13.493 203.082 0.747 1.465 2.488 5.293 39.622 1.002 4000 sigma[4] 2.879 140.652 0.027 0.067 0.131 0.310 3.534 1.001 2200 sigma[5] 2.354 22.821 0.079 0.256 0.454 0.964 11.013 1.001 4000 sigma[6] 0.748 5.373 0.026 0.063 0.130 0.311 4.453 1.001 4000 sigma[7] 1.386 15.690 0.037 0.106 0.210 0.476 5.245 1.001 4000 deviance 23.240 5.306 15.045 19.415 22.644 26.268 35.342 1.001 4000 For each parameter, n.eff is a crude measure of effective sample size,

and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2)

pD = 14.1 and DIC = 37.3

DIC is an estimate of expected predictive error (lower deviance is better).

(37)

mixingの確認

traceplot

traceplot(samples)

(38)

MCMCの２つの連鎖をそれぞれ取り出す

mu <- samples$BUGSoutput$sims.array[,,2]

max(mu)

# [1] 13.00318

min(mu)

# [1] 7.929883

plot(mu[,1],ylim=c(7,14),ty='l',col=2)

par(new=T)

plot(mu[,2],ylim=c(7,14),ty='l',col=4)

(39)

(40)

summary(mu[,1])

Min. 1st Qu. Median Mean 3rd Qu. Max.

7.930 9.876 9.929 9.912 9.983 11.701

summary(mu[,2])

Min. 1st Qu. Median Mean 3rd Qu. Max.

8.360 9.877 9.931 9.916 9.980 13.003

density(mu[,1])$x[which(density(mu[,1])$y==max(density(mu[,1])$y))]

[1] 9.932694

density(mu[,2])$x[which(density(mu[,2])$y==max(density(mu[,2])$y))]

[1] 9.935837

(41)

Nbreaks <- 80

x <- hist(mu[,1],Nbreaks,plot=F) y <- hist(mu[,2],Nbreaks,plot=F)

plot(c(x$breaks, max(x$breaks)), c(0,x$density,0), type="S", lwd=2, lty=1, xlim=c(7,12), ylim=c(0,7), xlab="mu", ylab="Posterior Density", col=2) par(new=T)

plot(c(y$breaks, max(y$breaks)), c(0,y$density,0), type="S", lwd=2, lty=1, xlim=c(7,12), ylim=c(0,7), xlab="mu", ylab="Posterior Density", col=4)

２つのMCMCで得られたμの事後分布がほぼ同じ形に収束していることがわかる

(42)

sigma <- samples$BUGSoutput$sims.array[,,3:9]

max(sigma[,,7])

# [1] 786.2002

min(sigma[,,7])

# [1] 0.01580951

plot(sigma[,1,7],ylim=c(0,790),ty='l',col=2)

par(new=T)

plot(sigma[,2,7],ylim=c(0,790),ty='l',col=4)

(43)

(44)

for (i in 1:7) { print(i) for (j in 1:2) { print(summary(sigma[,j,i])) print(density(sigma[,j,i])$x[which(density(sigma[,j,i])$y==max(density(sigma[,j,i])$y))]) } print("//") } [1] 1

Min. 1st Qu. Median Mean 3rd Qu. Max. 10.64 32.12 56.28 268.19 117.30 107734.56 [1] 184.0602

Min. 1st Qu. Median Mean 3rd Qu. Max. 9.31 31.60 54.29 310.83 115.19 240874.91 [1] -27.53077

[1] "//" [1] 2

(45)

[1] 3

[1] "//" [1] 4

[1] "//" [1] 5

(46)

[1] 6

[1] "//" [1] 7

バイオインフォマティクス特論3