2017/11/2 Python-statistics4 Python で統計学を学ぶ (4) この内容は杉澤村井 (2008) R によるやさしい統計学 (

(1)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

1/16

Python

_{で統計学を学ぶ(4)}

この内容は⼭⽥、杉澤、村井(2008)「R」によるやさしい統計学 (http://shop.ohmsha.co.jp/shop/shopdetail.html?brandcode=000000001781&search=978-4-274-06710-5&sort=)を参考にしています。この講義では、「統計的仮説検定」をとりあげます。これは、統計的仮説検定の⼿順の理解と⽤語の習熟がねらいです。また、代表的な統計的仮説検定、つまり標準正規分布を⽤いた検定、ｔ分布を⽤いた検定、無相関検定、カイ⼆乗検定について学びます。学習項⽬です: 統計的仮説検定の必要性統計的仮説検定の⼿順と⽤語標準正規分布を⽤いた検定(1つの平均値の検定: ⺟分散が既知) t分布を⽤いた検定(1つの平均値の検定: ⺟分散が未知) 相関係数の検定(無相関検定) 独⽴性の検定(カイ⼆乗検定) 関数のまとめ演習問題

統計的仮説検定の必要性

下の散布図を⾒てください(⻘⾊の円は、点の分布の状態を表すために描いたものです）: (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Figs/RS-4-1.png) (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Figs/RS-4-1.png) これをみると、この2つの変数 a と bの間には相関関係がないようにみえます。実際なので相関なしといえます。ところが、a, bそれぞれから30点ずつ無作為抽出したデータ xa, xb (下にその散布図を⽰す)は、ときに 0.378 という『弱い相関』を⽰すことがあります。参考: 無相関の⺟集団から相関するデータを作る次が無相関の⺟集団から相関するデータを作った⽅法です:

corrcoef(a, b) = −0.034

(2)

In [185]: In [34]: これは、実際には作為的に作られたデータです。しかし、あなたが発明した機器の有効性を⽰す論⽂や、あなたが作った薬品の効果を⽰す論⽂において、都合の良いデータだけを集めたのではないか、もしくは作為がないとしても、このデータは本当に偶然の結果であって多数のデータを取ればこのようなグラフにはならないのではないか、という疑いがかけられることがあります。そのような疑いや批判には、(前の章で学んだように)『標本抽出が無作為抽出であること』(都合の良いデータを集めたわけではない)、そして(本章で学ぶように)『⺟集団に全く相関がないとしたら、抽出した標本からこのような結果が得られる可能性が⾮常に⼩さいということ』(多数のデータを集めても同じような結果が得られる確率が⾼い)を⽰さなければなりません。そして、統計的仮説検定は確率に基づき、後者の主張を⾏うための⽅法です。(前者の「無作為抽出」は、統計による分析の⼤前提です)

統計的仮説検定の⼿順と⽤語

統計的仮説検定の⼀般的な⼿順を次の表に⽰します: ⼿順やること 1 ⺟集団に関する帰無仮説と対⽴仮説を設定 2 検定統計量を選択 3 有意⽔準 αの値を決定 4 データを収集した後、データから検定統計量の実現値を求める 5 結論: 検定統計量の実現値が棄却域に⼊れば帰無仮説を棄却し、対⽴仮説を採択する。そうでなければ、帰無仮説を採択する

1. _{帰無仮説と対⽴仮説}

帰無仮説：提案する⼿法が従来の⼿法と「差がない」、提案する⼿法は「効果がない」という仮説---本来主張したいこととは逆の仮説。この仮説が棄却されることを⽬標として仮説検定を⾏う。具体的には、⺟平均 (⺟平均は0である), ⺟相関係数 (相関がない), ⺟平均の差 (差がない)というような仮説対⽴仮説: 帰無仮説が棄却されたときに採択される仮説--- 帰無仮説とは逆の仮説であり、実験などで⽰したい・主張したいことを表したもの具体的には、⺟平均 (⺟平均は0でない), ⺟相関係数 (相関がある), ⺟平均の差 (差がある)というような仮説対⽴仮説の設定により、検定は次のどちらかで⾏う(両側検定の⽅がより厳しい条件であり、普通は両側検定で⾏う): ⽚側検定：対⽴仮説が、⺟平均 (もしくは )、、⺟相関係数 (もしくは )、、⺟平均の差 (もしくは )、の場合

μ = 0

ρ = 0

μ

1

−

μ

2

= 0

μ ≠ 0

ρ ≠ 0

μ

1

−

μ

2

≠ 0

μ > 0

μ < 0

ρ > 0

ρ < 0

μ

1

>

μ

2

μ

1

<

μ

2 0.306203680441 import matplotlib.pyplot as plt fig = plt.figure() a = random.normal(50,10,500) b = random.normal(50,10,500) plt.scatter(a, b,c='w') plt.xlabel('a') plt.ylabel('b') ax = fig.add_subplot(111) # 中心(50,50)で半径25の円を描画

circle = plt.Circle((50,50),25, fill=False, color='b') ax.add_patch(circle)

aspect = (ax.get_xlim()[1] - ax.get_xlim()[0]) / (ax.get_ylim()[1] - ax.get_ylim()[0]) ax.set_aspect(aspect) # 縦横の縮尺を調整

plt.show()

%matplotlib inline

import matplotlib.pyplot as plt import numpy as np

import numpy.random as random a = random.normal(50,10,500) b = random.normal(50,10,500) for _ in range(100): xa = random.choice(a,30) xb = random.choice(b,30) if (abs(np.corrcoef(xa,xb)[0,1]) > 0.30): break plt.scatter(xa, xb) plt.xlabel('xa') plt.ylabel('xb') x = np.linspace(10, 90, 10000) lm = np.polyfit(xa, xb, 1) plt.plot(x, lm[0]*x+lm[1],"g") print(np.corrcoef(xa,xb)[0,1])

(3)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

3/16

両側検定：対⽴仮説が、⺟平均μ≠0、⺟相関係数ρ≠0 、⺟平均の差 μ - μ ≠ 0の場合要するに、両側検定では、例えば⺟平均を調べるには、⺟平均との両⽅を調べなければならない帰無仮説が正しいものとして分析を⾏う。実際に得られたデータから計算された検定統計量の値によって採択を判断する。帰無仮説が正しいとしたとき、検定統計量が、ほぼ起こり得ない値(それほど極端な値)であれば、帰無仮説を棄却する(つまり、本来の主張を表す対⽴仮説が採択される)。そうでなければ(確率的に⼗分起こりうるような値であれば、帰無仮説を採択する(この場合は、本来主張したかった対⽴仮説が棄却されてしまう)。

2. 検定統計量

検定統計量：統計的仮説検定のために⽤いられる標本統計量のこと。代表的な検定統計量の例: 検定統計量の実現値：実際のデータ（⼿に⼊った標本）を基に計算してえられる具体的な値のこと検定統計量の実現値は、対⽴仮説に合うほど 0から離れた値を⽰す

3. _{有意⽔準と棄却域}

対⽴仮説を採択するか決定するときに基準になるのが有意⽔準(αで表されます) 有意⽔準は5％または1％(α=0.05またはα=0.01)に設定することが多い(つまり、標本が100回に5回(5%の場合)以下にしか現れないデータであった---こんなことは偶然では起こりえない---、だから帰無仮説が成り⽴たないと考えて良いのではないか、という判断基準) 帰無仮説が正しいものとして考えた時の標本分布を帰無分布という---帰無分布に基づいて確率計算される帰無仮説のもとで、⾮常に⽣じにくい検定統計量の値の範囲を棄却域という---帰無仮説が棄却される領域(だから、この範囲に⼊るのが『望ましい』) 採択域: 棄却域以外の部分---「帰無仮説が採択される領域」臨界値: 棄却域と採択域の境⽬の値棄却域に検定統計量の実現値が⼊ったら、帰無仮説を棄却する---本来主張したかったことが採択される！ (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Figs/RS-4-3.png) 正規分布を帰無分布とした時の棄却域

4 & 5.

_{統計的仮説検定の結果の報告}

検定統計量の実現値が棄却域に⼊った場合、「差がない」という帰無仮説を棄却し、「差がある」という対⽴仮説を採択する。検定結果は5% (または1%)⽔準で有意である または「p < .05 (または p < .01 )で有意差が⾒られた」と記述する。 帰無仮説が棄却できない場合は、「検定の結果、差が有意でなかった」または「有意差が認められなかった」と書く。課題4-1 あなたはランダムに配置された対象物(例えば地雷や⽯油や埋蔵⾦など)を衛星からのセンサーデータを元に限定された時間(例えば1時間)内に検出する機器を作成した。100個のデータに対し検出率は0.70であった。そして、その性能が従来の製品(検出率は0.60と宣伝されている)よりも優れていることを統計的仮説検定の⼿法により⽰したい。どのような帰無仮説と対⽴仮説をたてればよいか、また検定⽅法は⽚側か両側か、有意⽔準はどのくらいに設定したらよいか、考えを述べよ。 In [ ]:

ｐ値

p値：帰無仮説が正しいという仮定のもとで、標本から計算した検定統計量の実現値以上の値が得られる確率 p値が有意⽔準より⼩さい時に帰無仮説を棄却する [参考: p値が⼩さいことの意味] p値の⼤きさが対⽴仮説を採択する(帰無仮説を棄却する)決め⼿となります。p値が⼩さいということは、『帰無仮説が正しいとすると』確率的にほとんど起こりえないことが起きた(有意⽔準が5%なら100回中5回以下、 1%なら100回中1回以下)ということを意味します。逆にp値が⼤きいということは、確率的にはよくあることが起きた(だから、この結果では差があるとはいえない)、ということになります。

第1種の誤りと第2種の誤り

第1種の誤り α: 「帰無仮説が真のとき、これを棄却してしまう」誤りのことこの種の誤りを犯す確率が「有意⽔準」または「危険率」第2種の誤り β：「帰無仮説が偽のとき、これを採択する(棄却できない)」誤りのこと本当は差があるのに「差がない」と判断してしまう誤り

検定⼒

検定⼒：帰無仮説が偽の場合、全体の確率1から第2種の誤りの確率(1 - β)を引いた確率「第2種の誤りを犯さない確率」とも、つまり間違っている帰無仮説を正しく棄却できる確率のこと

標準正規分布を⽤いた検定(1つの平均値の検定: ⺟分散が既知)

1 2

μ ≠ 0

μ > 0 μ < 0

t,

χ

2、

F

(4)

正規⺟集団から無作為に標本を抽出する（サンプルサイズをとする)と標本平均の分布も正規分布標本平均の平均は [ア] 、分散は [イ ] （問題：ア、イに当てはまる記号を書け---課題4-2) これを標準化したものを検定統計量とする( は標本データの平均): 課題4-2 正規⺟集団から無作為に標本を抽出したとき、理論的に標本平均の平均と、分散がそれぞれどのように表されるか、書きなさい(つまり、上の[ア], [イ]の箇所を補うこと)。またこれを標準化して得られる検定統計量がZで表されている理由を答えなさい。 [ヒント] 標本分布を求める (Rstatistics-03.html#makingSample)の項を読みなおしてください。また、標準化については標準化 (Rstatistics-01.html#RS01:normalization)の項を⾒てください。

Type Markdown and LaTeX:

Pythonを使った実習例題:「⼼理学テスト」がの正規分布に従うものとする。次のデータ(「指導法データ」と呼ぶ)はこの⺟集団から無作為抽出した標本と考えてよいかどうかを判定せよ In [36]: 次のステップで⾏う: 1. 帰無仮説と対⽴仮説をたてる: 帰無仮説は「無作為抽出した標本と考えて良い」、つまり対⽴仮説は「無作為抽出した標本ではない」、つまり 2. 検定統計量の選択: 標本データを標準化した値(Zで表す) 3. 有意⽔準の決定: 両側検定で、有意⽔準 5%、つまり 4. 検定統計量の実現値の計算: In [38]: 帰無仮説の棄却か採択かの決定: 帰無仮説によればこの標本は正規分布に従う。そこでscipy.statsモジュールのppf関数で棄却の臨界値を求める、もしくはcdf関数でp値を求める下側確率：標準正規分布に従う確率変数を例にとると、がある値以下となる確率上側確率：標準正規分布に従う確率変数を例にとると、がある値より⼤きくなる確率 In [45]: In [46]: In [4]: この結果、棄却域はまたはとなるので、の値は棄却域に⼊る。よって、結論「有意⽔準5%において、指導法データは⼼理学テスト(という⺟集団)から無作為抽出した標本とはいえない」。なお、関数cdfを⽤いて、直接p値を求めることもできる: In [47]: In [48]:

N(μ, )

σ

2

_n

X¯

Z =

X¯

− μ

σ/ n⎯⎯

√

N(μ, )

σ

2

α

2

N(12, 10)

μ = 12

μ ≠ 12

α = 0.05

Z

α

Prob(Z ≤ α)

Z

α

Prob(Z > α)

Z < −1.959964

Z > 1.959964

Z

Out[38]: -2.8284271247461898 Out[45]: -1.9599639845400545 Out[46]: 1.959963984540054

Help on method ppf in module scipy.stats._distn_infrastructure:

ppf(self, q, *args, **kwds) method of scipy.stats._continuous_distns.norm_gen instance Percent point function (inverse of `cdf`) at q of the given RV.

Parameters --- q : array_like

lower tail probability arg1, arg2, arg3,... : array_like

The shape parameter(s) for the distribution (see docstring of the instance object for more information)

loc : array_like, optional location parameter (default=0) scale : array_like, optional scale parameter (default=1)

Returns --- x : array_like

quantile corresponding to the lower tail probability q.

Out[47]: 0.0023388684020295768

Out[48]: 0.99766113159797043

from __future__ import division import numpy as np SampleData = np.array([13,14,7,12,10,6,8,15,4,14,9,6,10,12,5,12,8,8,12,15]) z = (np.mean(SampleData) - 12) / (10.0/len(SampleData))**0.5 # 標準化 z import scipy.stats as st st.norm.ppf(0.025) # 下側確率0.05/2 = 0.025となるzの値を求める # 下側確率であるから、この値よりもZ値が小さければ棄却される # 上側確率 1 - 0.05/2 = 0.975となるzの値を求める st.norm.ppf(0.975) # 上側確率であるから、この値よりもZ値が大きければ棄却される help(st.norm.ppf) st.norm.cdf(-2.828427) # 下側確率 # 下側確率とすれば、p値は0.0023という小さな値(< 0.05) st.norm.cdf(2.828427) # 上側確率

(5)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

5/16

In [49]: In [3]: 課題4-3 標準正規分布のグラフを書き、有意⽔準5%の棄却域を⻩⾊で表し、例題のZ値がどこに位置するかを重ね書きした図を作成せよ。 [ヒント] 前節正規分布 (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Python-statistics3.html#RS03:normalDistribution)の「正規分布グラフに領域を表⽰する関数」で紹介した関数を拡張修正して⽤いる。 Z値以下の領域をオレンジ⾊で表すと次のような図が得られる: (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Figs/RS-4-E1.png)

t

分布を⽤いた検定(1つの平均値の検定: ⺟分散が未知)

正規⺟集団からの無作為標本であっても、⺟集団の分散σ がわからない場合、先の⽅法が使えません---先の検定で⽤いた検定統計量が計算できないからですそこで⺟分散の平⽅根σ の代わりに、標本から求められる不偏分散の平⽅根を使い、を検定統計量とする。これは⾃由度(df) のt分布に従う t分布：統計学でよく利⽤される、正規分布の形に似た左右対称・⼭形の確率分布⾃由度(df)：t分布の形状を決める (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Figs/RS-4-5.png) 2

σ̂

t =

X¯

− μ

/

σ̂ n⎯⎯

√

n − 1

Out[49]: 0.0046777368040591535

Help on method cdf in module scipy.stats._distn_infrastructure:

cdf(self, x, *args, **kwds) method of scipy.stats._continuous_distns.norm_gen instance Cumulative distribution function of the given RV.

Parameters --- x : array_like quantiles

arg1, arg2, arg3,... : array_like

Returns --- cdf : ndarray

Cumulative distribution function evaluated at `x` # 両側検定なので2倍する

2*st.norm.cdf(-2.828427)

# 両側検定であるから2倍したp値は0.0047という小さな値(< 0.05)

(6)

Pythonを使った実習例題:「⼼理学テスト」が平均12の正規分布に従うものとする(分散は未知!)。前項にあげた「指導法データ」 (SampleData) が、この⺟集団から無作為抽出した標本と考えてよいかどうかを判定せよ次のステップで⾏う: 1. 帰無仮説と対⽴仮説をたてる: 帰無仮説は「無作為抽出した標本と考えて良い」、つまり対⽴仮説は「無作為抽出した標本ではない」、つまり 2. 検定統計量の選択: 標本の不偏分散の平⽅根を⽤い、を検定統計量とする 3. 有意⽔準の決定: 両側検定で、有意⽔準 5%、つまり 4. 検定統計量の実現値の計算:

t = (np.mean(SampleData) - 12) / (np.var(SampleData, ddof=1)/len(SampleData))**0.5 # 検定量

5. 帰無仮説の棄却か採択かの決定: 帰無仮説によればこの検定統計量は⾃由度のt分布に従う st.t.ppf(0.025,19) # df=19、下側確率0.05/2 = 0.025となるtの値を求める(scipy.stats.tモジュールのppf関数 ⇒ -2.0930240544082634 # 下側確率であるから、この値よりもt値が小さければ棄却される) st.t.ppf(0.975,19) # df=19、上側確率1-0.05/2 = 0.975となるtの値を求める(scipy.stats.tモジュールのppf関数 # 上側確率であるから、この値よりもt値が大きければ棄却される ⇒ 2.093024054408263 この結果、棄却域はまたはとなるので、tの値は棄却域に⼊る。関数 cdf を⽤いて、直接p値を求めることもできる: st.t.cdf(-2.616648,19) # 下側確率 (scipy.stats.tモジュールのcdf関数) # 下側確率とすれば、p値は0.0085 という小さな値(< 0.05) ⇒ 0.0084854604500293768 print(st.t.cdf(2.616648,19)) # 上側確率 ⇒ 0.99151453955 print(2*st.t.cdf(-2.616648,19)) # 両側検定なので2倍する ⇒ 0.0169709209001 # 両側検定より２倍したp値は0.017という小さな値(< 0.05) 6. よって、結論「有意⽔準5%において、指導法データは⼼理学テスト(という⺟集団)から無作為抽出した標本とはいえない」。 In [186]: In [187]: In [188]: In [189]:

μ = 12

μ ≠ 12

σ̂

t =

X¯

− μ

/

σ̂ n⎯⎯

√

α = 0.05

df = n − 1 = 19

t < −2.093024

t > 2.093024

t = -2.616648 Out[187]: -2.0930240544082634 Out[188]: 2.093024054408263 Out[189]: 0.0084854604500293768 # 以上の実行

t = (np.mean(SampleData) - 12) / (np.var(SampleData, ddof=1)/len(SampleData))**0.5 # 検定量 print("t = %f" % t) import scipy.stats as st st.t.ppf(0.025,19) # df=19、下側確率0.05/2 = 0.025となるtの値を求める(scipy.stats.tモジュールのppf関数 # 下側確率であるから、この値よりもt値が小さければ棄却される) st.t.ppf(0.975,19) # df=19、上側確率1-0.05/2 = 0.975となるtの値を求める(scipy.stats.tモジュールのppf関数 # 上側確率であるから、この値よりもt値が大きければ棄却される st.t.cdf(-2.616648,19) # 下側確率 (scipy.stats.tモジュールのcdf関数) # 下側確率とすれば、p値は0.0085 という小さな値(< 0.05)

(7)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

7/16

In [190]: In [5]: In [6]: Pythonでt検定するための関数: 以上のことをすべてやってくれる関数がscipy.statsモジュールのttest_1samp関数である。 0.99151453955 0.0169709209001

ppf(self, q, *args, **kwds) method of scipy.stats._continuous_distns.t_gen instance Percent point function (inverse of `cdf`) at q of the given RV.

cdf(self, x, *args, **kwds) method of scipy.stats._continuous_distns.t_gen instance Cumulative distribution function of the given RV.

Cumulative distribution function evaluated at `x` print(st.t.cdf(2.616648,19)) # 上側確率

print(2*st.t.cdf(-2.616648,19)) # 両側検定なので2倍する # 両側検定より２倍したp値は0.017という小さな値(< 0.05)

help(st.t.ppf)

(8)

In [70]: 「指導法データ」 (SampleData) を⽤いてその使い⽅を⽰す: st.ttest_1samp(データ, ) In [72]: この表⽰から、t値が-2.617、p値が0.017 (両側検定)であることが得られる。

相関係数の検定(無相関検定)

無相関検定：「⺟集団において相関が0である」と設定して⾏う検定⺟集団相関係数(⺟相関)に関する検定を⾏うときは、標本相関係数ｒから次を求めて検定統計量とする: Pythonを使った実習例題:以下で与えられる「統計学テスト1」(StatTest1)と「統計学テスト2」(StatTest2)の得点の相関係数の検定を⾏え。有意⽔準は5%とする。 In [73]: 次のステップで⾏う: 1. 帰無仮説と対⽴仮説をたてる: 帰無仮説は、つまり⺟相関対⽴仮説は「」、つまり⺟相関 2. 検定統計量の選択: 3. 有意⽔準の決定: 両側検定で、有意⽔準 5%、つまり 4. 検定統計量の実現値の計算:

μ

t =

r n − 2

⎯

⎯⎯⎯⎯⎯⎯

⎯

√

1 − r

2

⎯

⎯⎯⎯⎯⎯⎯⎯⎯

⎯

√

ρ = 0

= 0

ρ ≠ 0

≠ 0

t =

r n−2_√√_1−r₂

α = 0.05

Help on function ttest_1samp in module scipy.stats.stats: ttest_1samp(a, popmean, axis=0, nan_policy='propagate') Calculates the T-test for the mean of ONE group of scores.

This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations `a` is equal to the given population mean, `popmean`.

Parameters --- a : array_like sample observation popmean : float or array_like

expected value in null hypothesis, if array_like than it must have the same shape as `a` excluding the axis dimension

axis : int or None, optional

Axis along which to compute test. If None, compute over the whole array `a`.

nan_policy : {'propagate', 'raise', 'omit'}, optional

Defines how to handle when input contains nan. 'propagate' returns nan, 'raise' throws an error, 'omit' performs the calculations ignoring nan values. Default is 'propagate'.

Returns ---

statistic : float or array t-statistic pvalue : float or array two-tailed p-value

Examples ---

>>> from scipy import stats

>>> np.random.seed(7654567) # fix seed to get the same result >>> rvs = stats.norm.rvs(loc=5, scale=10, size=(50,2))

Test if mean of random sample is equal to true mean, and different mean. We reject the null hypothesis in the second case and don't reject it in the first case.

>>> stats.ttest_1samp(rvs,5.0) (array([-0.68014479, -0.04323899]), array([ 0.49961383, 0.96568674])) >>> stats.ttest_1samp(rvs,0.0) (array([ 2.77025808, 4.11038784]), array([ 0.00789095, 0.00014999]))

Examples using axis and non-scalar dimension for population mean.

>>> stats.ttest_1samp(rvs,[5.0,0.0])

(array([-0.68014479, 4.11038784]), array([ 4.99613833e-01, 1.49986458e-04])) >>> stats.ttest_1samp(rvs.T,[5.0,0.0],axis=1)

(array([-0.68014479, 4.11038784]), array([ 4.99613833e-01, 1.49986458e-04])) >>> stats.ttest_1samp(rvs,[[5.0],[0.0]])

(array([[-0.68014479, -0.04323899],

[ 2.77025808, 4.11038784]]), array([[ 4.99613833e-01, 9.65686743e-01], [ 7.89094663e-03, 1.49986458e-04]]))

Out[72]: Ttest_1sampResult(statistic=-2.616648017377738, pvalue=0.016970920269563441) import scipy.stats as st help(st.ttest_1samp) st.ttest_1samp(SampleData,12.0) import numpy as np StatTest1 = np.array([6,10,6,10,5,3,5,9,3,3,11,6,11,9,7,5,8,7,7,9]) StatTest2 = np.array([10,13,8,15,8,6,9,10,7,3,18,14,18,11,12,5,7,12,7,7])

(9)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

9/16

SampleCorr = np.corrcoef(StatTest1, StatTest2)[0,1]

print("Sample Correlation = %f" % SampleCorr) # 標本相関 ⇒ Sample Correlation = 0.749659

SampleSize = len(StatTest1)

tDividend = SampleCorr * (SampleSize - 2.0)**0.5 tDivider = (1.0 - SampleCorr**2)**0.5 t = tDividend/tDivider ⇒ t statistics = 4.805707 5. 帰無仮説の棄却か採択かの決定: 帰無仮説によればこの検定統計量は⾃由度のt分布に従う import scipy.stats as st print(st.t.ppf(0.025,18)) # df=18、下側確率0.05/2 = 0.025となるtの値を求める ⇒ -2.10092204024 # 下側確率であるから、この値よりもt値が小さければ棄却される print(st.t.ppf(0.975,18)) # df=18、上側確率 1 - 0.05/2 = 0.975となるtの値を求める ⇒ 2.10092204024 # 上側確率であるから、この値よりもt値が大きければ棄却される 6. この結果、棄却域はまたはとなるので、tの値は棄却域に⼊る。よって結論「統計学テスト1(StatTest1)と統計学テスト2(StatTest2)は有意⽔準5％において強い相関(相関係数0.75)がある」。 In [83]: In [86]: なおscipy.stats.tモジュールのcdf関数を⽤いて、直接p値を求めることもできる: In [93]: Pythonで無相関検定するための関数: scipy.stat.pearsonr In [92]: In [95]: pearsonr関数の出⼒の第⼀要素は標本相関係数(0.75)、第⼆要素は両側検定によるp値である。

⾃

df = n − 2 = 18

t < −2.101

t > 2.101

Sample Correlation = 0.749659 t statistics = 4.805707 -2.10092204024 2.10092204024 0.999929188559 0.000141622881555

Help on function pearsonr in module scipy.stats.stats: pearsonr(x, y)

Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson's correlation requires that each dataset be normally distributed, and not necessarily zero-mean. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so. Parameters --- x : (N,) array_like Input y : (N,) array_like Input Returns --- r : float

Pearson's correlation coefficient p-value : float 2-tailed p-value References --- http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation (http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation) (0.74965896482424488, 0.00014162288155448246) from __future__ import division

import numpy as np

SampleCorr = np.corrcoef(StatTest1, StatTest2)[0,1] print("Sample Correlation = %f" % SampleCorr) # 標本相関 SampleSize = len(StatTest1)

tDividend = SampleCorr * (SampleSize - 2.0)**0.5 tDivider = (1.0 - SampleCorr**2)**0.5 t = tDividend/tDivider print("t statistics = %f" % t) import scipy.stats as st print(st.t.ppf(0.025,18)) # df=18、下側確率0.05/2 = 0.025となるtの値を求める # 下側確率であるから、この値よりもt値が小さければ棄却される print(st.t.ppf(0.975,18)) # df=18、上側確率 1 - 0.05/2 = 0.975となるtの値を求める # 上側確率であるから、この値よりもt値が大きければ棄却される print(st.t.cdf(t,18)) # 上側確率 print( (1.0 - st.t.cdf(t,18))*2.0 ) # 両側検定なので2倍する # 両側検定により２倍したp値は0.00014という小さな値(< 0.05) help(st.pearsonr) import scipy.stats as st

SampleCorr = st.pearsonr(StatTest1, StatTest2) print(SampleCorr)

(10)

独⽴性の検定(カイ⾃乗検定)

２つの質的変数が独⽴かどうかを確かめる---独⽴とは「2つの質的変数に連関がない」こと独⽴性の検定：２つの質的変数間の連関の有意性を調べる検定期待度数：2つの変数の間に連関がない(独⽴である)という帰無仮説のもとで、帰無仮説が正しければ（連関がなければ）これくらいの度数をとるだろうと期待される度数クロス集計表におけるセルの期待度数 = (セルが属する⾏の周辺度数 ×セルが属する列の周辺度数)÷総度数 (カイ2乗)という確率分布を利⽤するため、カイ⾃乗(2乗)検定ともいう。独⽴性の検定における検定統計量の式は観測度数、は期待度数カイ⼆乗分布: (http://lang.sist.chukyo-u.ac.jp/Classes/PythonProbStat/Figs/RS-4-8.png) Pythonを使った実習例題:20名の学⽣に対し数学(Math)と統計学(Stat)の好き嫌いをアンケート調査した結果が以下。このことから、⼀般に数学と統計学の好き嫌いの間に有意な連関があるといえるかどうか、有意⽔準5%で検定せよ。 Math = np.array(["嫌い","嫌い","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き", "嫌い","好き","嫌い","嫌い","好き","嫌い","嫌い","嫌い","嫌い"]) Stat = np.array(["好き","好き","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","嫌い","好き", "好き","好き","嫌い","好き","嫌い","嫌い","嫌い","嫌い","嫌い"]) このクロス集計表は以下: 統計学好き統計学嫌い合計数学好き Expectation_11 Expectation_12 6 数学嫌い Expectation_21 Expectation_22 14 計 8 12 20 マスのことをセル、セルに書かれた数値を観測度数、観測度数を各々、列・⾏で合計したものを周辺度数、周辺度数の合計を総度数と呼ぶ⾃由度 df = (⾏の数-1)×(列の数-1) In [144]: 次のステップで⾏う: 1. 帰無仮説と対⽴仮説をたてる: 帰無仮説H は、「数学と統計学の2つの変数は独⽴(連関なし)」対⽴仮説H は「「数学と統計学の2つの変数は独⽴(連関なし)」 2. 検定統計量の選択: 3. 有意⽔準の決定: 5%とする(⽚側検定---カイ⼆乗検定は棄却域が⼀つしかない) 4. 検定統計量の実現値の計算:

χ

2

=

+

+ ⋯ +

χ

2

(

O

1

−

E

1

)

2

E

1

(

O

2

−

E

2

)

2

E

2

(

O

k

−

E

k

)

2

E

k

∼

O

1

O

k

E

1

∼

E

k 0 1

=

+

+ ⋯ +

χ

2

(

O

1

−

E

1

)

2

E

1

(

O

2

−

E

2

)

2

E

2

(

O

k

−

E

k

)

2

E

k

Out[144]: _Stat _{好き嫌い All} Math 好き 4 2 6 嫌い 4 10 14 All 8 12 20 Math = np.array(["嫌い","嫌い","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","好き","好き", "嫌い","好き","嫌い","嫌い","好き","嫌い","嫌い","嫌い","嫌い"]) Stat = np.array(["好き","好き","好き","好き","嫌い","嫌い","嫌い","嫌い","嫌い","嫌い","好き", "好き","好き","嫌い","好き","嫌い","嫌い","嫌い","嫌い","嫌い"]) import pandas as pd

data = pd.DataFrame({ 'Stat':Stat, 'Math':Math})

table = pd.crosstab(data.Math,data.Stat,margins=True) # クロス集計表を作る table

(11)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

11/16

In [145]: 5. 帰無仮説の棄却か採択かの決定: 帰無仮説によればこの検定統計量は⾃由度の分布に従う In [104]: In [8]: この結果、棄却域はとなり、この例題におけるの値(=2.54)は棄却域に⼊っていない、つまり帰無仮説は棄却されず、採択される。よって結論「有意⽔準5%において、数学と統計学の2つの変数は独⽴ではない(連関がない)」。なお、cdf関数を⽤いて、直接p値を求めることもできる: In [107]: In [108]: Pythonでカイ⼆乗検定するための関数chisquare(scipy.statsモジュール):

df = 1 χ

2

> 3.84

χ

2

_χ

2 2.53968253968

ppf(self, q, *args, **kwds) method of scipy.stats._continuous_distns.chi2_gen instance Percent point function (inverse of `cdf`) at q of the given RV.

Out[8]: 3.8414588206941236

cdf(self, x, *args, **kwds) method of scipy.stats._continuous_distns.chi2_gen instance Cumulative distribution function of the given RV.

Cumulative distribution function evaluated at `x`

Out[108]: 0.11101710551218613 Expectaion_11 = 8*6/20.0 Expectaion_12 = 12*6/20.0 Expectaion_21 = 8*14/20.0 Expectaion_22 = 12*14/20.0

ExpectedFrequency = np.array([Expectaion_11, Expectaion_21,Expectaion_12,Expectaion_22]) ObservedFrequency = np.array([4,4,2,10])

ChiSqElements = (ObservedFrequency - ExpectedFrequency)**2 / ExpectedFrequency ChiSq = np.sum(ChiSqElements) # 検定統計量 print(ChiSq) import scipy.stats as st help(st.distributions.chi2.ppf) st.distributions.chi2.ppf(0.95,1) # 自由度1のカイ二乗分布で確率0.95となるχ2の値を求める # これが棄却域を定める---この値よりもχ2値が大きければ棄却される # カイ二乗分布は『上側』のみ help(st.distributions.chi2.cdf) 1.0 - st.distributions.chi2.cdf(ChiSq,1) # 上側確率 # p値が有意水準0.05よりも大きいので、帰無仮説は棄却されない

(12)

In [109]:

Help on function chisquare in module scipy.stats.stats: chisquare(f_obs, f_exp=None, ddof=0, axis=0) Calculates a one-way chi square test.

The chi square test tests the null hypothesis that the categorical data has the given frequencies.

Parameters --- f_obs : array_like

Observed frequencies in each category. f_exp : array_like, optional

Expected frequencies in each category. By default the categories are assumed to be equally likely.

ddof : int, optional

"Delta degrees of freedom": adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with ``k - 1 - ddof`` degrees of freedom, where `k` is the number of observed frequencies. The default value of `ddof` is 0.

axis : int or None, optional

The axis of the broadcast result of `f_obs` and `f_exp` along which to apply the test. If axis is None, all values in `f_obs` are treated as a single data set. Default is 0.

Returns ---

chisq : float or ndarray

The chi-squared test statistic. The value is a float if `axis` is None or `f_obs` and `f_exp` are 1-D.

p : float or ndarray

The p-value of the test. The value is a float if `ddof` and the return value `chisq` are scalars.

See Also --- power_divergence mstats.chisquare Notes ---

This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5.

The default degrees of freedom, k-1, are for the case when no parameters of the distribution are estimated. If p parameters are estimated by efficient maximum likelihood then the correct degrees of freedom are k-1-p. If the parameters are estimated in a different way, then the dof can be between k-1-p and k-1. However, it is also possible that the asymptotic distribution is not a chisquare, in which case this test is not appropriate.

References ---

.. [1] Lowry, Richard. "Concepts and Applications of Inferential

Statistics". Chapter 8. http://faculty.vassar.edu/lowry/ch8pt1.html (http://faculty.vassar.edu/lowry/ch8pt1.html) .. [2] "Chi-squared test", http://en.wikipedia.org/wiki/Chi-squared_test (http://en.wikipedia.org/wiki/Chi-squared_test)

Examples ---

When just `f_obs` is given, it is assumed that the expected frequencies are uniform and given by the mean of the observed frequencies.

>>> from scipy.stats import chisquare >>> chisquare([16, 18, 16, 14, 12, 12]) (2.0, 0.84914503608460956)

With `f_exp` the expected frequencies can be given.

>>> chisquare([16, 18, 16, 14, 12, 12], f_exp=[16, 16, 16, 16, 16, 8]) (3.5, 0.62338762774958223)

When `f_obs` is 2-D, by default the test is applied to each column. >>> obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T >>> obs.shape (6, 2) >>> chisquare(obs) (array([ 2. , 6.66666667]), array([ 0.84914504, 0.24663415]))

By setting ``axis=None``, the test is applied to all data in the array, which is equivalent to applying the test to the flattened array. >>> chisquare(obs, axis=None) (23.31034482758621, 0.015975692534127565) >>> chisquare(obs.ravel()) (23.31034482758621, 0.015975692534127565)

`ddof` is the change to make to the default degrees of freedom.

>>> chisquare([16, 18, 16, 14, 12, 12], ddof=1) (2.0, 0.73575888234288467)

The calculation of the p-values is done by broadcasting the chi-squared statistic with `ddof`.

>>> chisquare([16, 18, 16, 14, 12, 12], ddof=[0,1,2]) (2.0, array([ 0.84914504, 0.73575888, 0.5724067 ]))

`f_obs` and `f_exp` are also broadcast. In the following, `f_obs` has shape (6,) and `f_exp` has shape (2, 6), so the result of broadcasting `f_obs` and `f_exp` has shape (2, 6). To compute the desired chi-squared statistics, we use ``axis=1``:

>>> chisquare([16, 18, 16, 14, 12, 12], ... f_exp=[[16, 16, 16, 16, 16, 8], [8, 20, 20, 16, 12, 12]], ... axis=1) (array([ 3.5 , 9.25]), array([ 0.62338763, 0.09949846])) help(st.chisquare)

(13)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

13/16

In [146]:

サンプルサイズの影響

標本における連関の⼤きさが全く同じであっても、サンプルサイズが異なると検定の結果が変わることがあるサンプルサイズが⼤きくなると、有意になりやすい---統計的仮説検定⼀般にいえる性質 In [160]: In [165]: In [166]: In [167]: In [168]: In [169]:

関数のまとめ

注: numpyをnp, np.randomを random、 matplotlib.pyplotをplt、pandasをpd、scipy.statsをstと略記する

⽬的関数名とモジュール使い⽅指定された範囲からランダム抽出 random.choice(配列,個数) random.choice(range(10),5) 標準正規分布で下側確率に対応する確率分布関数の値 st.norm.ppf(p) st.norm.ppf(0.025) # となるqの値標準正規分布で下側確率(p値)を求める st.norm.cdf(z) st.norm.cdf(1.96) # の値(p値) t分布で下側確率に対応する確率分布関数の値 st.t.ppf(p,⾃由度) st.t.ppf(0.025,19) # ⾃由度19のt分布でとなるqの値 t分布で下側確率(p値)を求める st.t.cdf(z,df) st.t.cdf(1.96,19) # ⾃由度19のt分布での値(p値) t検定を⾏う ttest_1samp(データ, ) ttest_1samp(np.array([13,14,7,12,10,6,8,15,4,14,9,6,8,8,12,15]),12.0) # _の検定

無相関検定を⾏う st.pearsonr(データ1,データ2) st.pearsonr(StatTest1, StatTest2) # 出⼒の第⼀要素は標本相関係数、第⼆要素_{は両側検定によるp値}

カイ⼆乗分布の確率密度関数 st.distributions.chi2.pdf(x,⾃由度) plt.plot(x,st.distributions.chi2.pdf(x,3)) # ⾃由度3の分布関数の描画カイ⼆乗分布で上側確率に対応する値を求

める st.distributions.chi2.ppf(p,⾃由度) st.distributions.chi2.ppf(0.95,2) # ⾃由度2で求めるとなるq値を

カイ⼆乗分布で上側確率を求める 1-st.distributions.chi2.cdf(z,⾃由度) 1-st.distributions.chi2.cdf(3.5,1) # ⾃由度1のカイ2乗分布で_る確率とな

カイ⼆乗検定を⾏う(独⽴性の検定) st.chisquare(観測度数リスト, f_exp =期待度数リスト, ddof=n) #n=_{観測個数-1-⾃由度} st.chisquare(ObservedFrequency, f_exp =ExpectedFrequency, ddof=2)

演習問題4

演習問題4-1 次のデータ(単位はcm)は、平均170cmの正規分布に従う20歳男性の⺟集団からの無作為抽出と考えてよいかどうかを検定せよ。 In [171]:

Prob(Z < q) = 0.025

Prob(Z < 1.96)

Prob(Z < q) = 0.025

Prob(Z < 1.96)

μ

μ = 12

χ

2

Prob(Z < q) = 0.95

Prob(Z ≥ 3.5)

Out[146]: Power_divergenceResult(statistic=2.5396825396825395, pvalue=0.1110171055121861)

Out[160]: _{Mastered NotMastered} Humanities 16 4 Technicals 12 8

Out[166]: Power_divergenceResult(statistic=1.9047619047619047, pvalue=0.16754627748861739)

Out[167]: _{Mastered NotMastered} Humanities 160 40 Technicals 120 80

Out[169]: Power_divergenceResult(statistic=19.047619047619047, pvalue=1.2749674921097076e-05) ExpectedFrequency = np.array([Expectaion_11, Expectaion_21,Expectaion_12,Expectaion_22]) ObservedFrequency = np.array([4,4,2,10])

st.chisquare(ObservedFrequency, f_exp =ExpectedFrequency, ddof=2) # 自由度の計算に使うddofの値に注意

import pandas as pd

data = { 'Mastered': [16,12], 'NotMastered':[4,8]} # ある科目の履修 vs 未履修 df = pd.DataFrame(data)

df.index=['Humanities','Technicals'] # 文系 vs 理系 df # クロス集計表

Exp_11 = sum(df['Mastered'])*sum(df.loc['Humanities',:])/40.0 Exp_12 = sum(df['NotMastered'])*sum(df.loc['Humanities',:])/40.0 Exp_21 = sum(df['Mastered'])*sum(df.loc['Technicals',:])/40.0 Exp_22 = sum(df['NotMastered'])*sum(df.loc['Technicals',:])/40.0 ExpectedFrequency = np.array([Exp_11, Exp_12,Exp_21,Exp_22]) ObservedFrequency = np.array([16,4,12,8])

st.chisquare(ObservedFrequency, f_exp =ExpectedFrequency, ddof=2) # 自由度の計算に使うddofの値に注意 # p値=0.17なので帰無仮説は棄却されない → 連関なし

data10 = { 'Mastered': [160,120], 'NotMastered':[40,80]} # ある科目の履修 vs 未履修 --- 前の10倍 df = pd.DataFrame(data10)

df.index=['Humanities','Technicals'] # 文系 vs 理系 df # クロス集計表

Exp_11 = sum(df['Mastered'])*sum(df.loc['Humanities',:])/400.0 Exp_12 = sum(df['NotMastered'])*sum(df.loc['Humanities',:])/400.0 Exp_21 = sum(df['Mastered'])*sum(df.loc['Technicals',:])/400.0 Exp_22 = sum(df['NotMastered'])*sum(df.loc['Technicals',:])/400.0 ExpectedFrequency = np.array([Exp_11, Exp_12,Exp_21,Exp_22]) ObservedFrequency = np.array([160,40,120,80])

st.chisquare(ObservedFrequency, f_exp =ExpectedFrequency, ddof=2) # 自由度の計算に使うddofの値に注意 # p値=0.000013なので帰無仮説は棄却される→連関あり

import numpy as np

(14)

演習問題4-2 以下に⽰すデータにおいて、勉強時間(StudyHours)と定期試験の成績(ExamResult)の相関係数の無相関検定を⾏え。 In [170]: 演習問題4-3 先の演習問題4-2のデータに対し、ピアソンの相関係数とスピアマンの順位相関係数を求め、さらに無相関検定も⾏え。 import numpy as np StudyHours = np.array([1, 3, 10, 12, 6, 3, 8, 4, 1, 5]) ExamResult = np.array([20, 40, 100, 80, 50, 50, 70, 50, 10, 60])

(15)

http://localhost:8888/notebooks/Python-statistics4.ipynb#

15/16

In [174]:

In [ ]:

演習問題4-4

以下に⽰す演習問題2-2のデータに対し、カイ⼆乗検定を⾏え。 Help on function spearmanr in module scipy.stats.stats: spearmanr(a, b=None, axis=0, nan_policy='propagate')

Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.

The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

Parameters ---

a, b : 1D or 2D array_like, b is optional

One or two 1-D or 2-D arrays containing multiple variables and observations. When these are 1-D, each represents a vector of observations of a single variable. For the behavior in the 2-D case, see under ``axis``, below.

Both arrays need to have the same length in the ``axis`` dimension. axis : int or None, optional

If axis=0 (default), then each column represents a variable, with observations in the rows. If axis=1, the relationship is transposed: each row represents a variable, while the columns contain observations. If axis=None, then both arrays will be raveled.

nan_policy : {'propagate', 'raise', 'omit'}, optional

Defines how to handle when input contains nan. 'propagate' returns nan, 'raise' throws an error, 'omit' performs the calculations ignoring nan values. Default is 'propagate'.

Returns ---

correlation : float or ndarray (2-D square)

Spearman correlation matrix or correlation coefficient (if only 2 variables are given as parameters. Correlation matrix is square with length equal to total number of variables (columns or rows) in a and b combined.

pvalue : float

The two-sided p-value for a hypothesis test whose null hypothesis is that two sets of data are uncorrelated, has same dimension as rho.

Notes ---

Changes in scipy 0.8.0: rewrite to add tie-handling, and axis.

References ---

.. [1] Zwillinger, D. and Kokoska, S. (2000). CRC Standard

Probability and Statistics Tables and Formulae. Chapman & Hall: New York. 2000.

Section 14.7

Examples ---

>>> from scipy import stats

>>> stats.spearmanr([1,2,3,4,5], [5,6,7,8,7]) (0.82078268166812329, 0.088587005313543798) >>> np.random.seed(1234321) >>> x2n = np.random.randn(100, 2) >>> y2n = np.random.randn(100, 2) >>> stats.spearmanr(x2n) (0.059969996999699973, 0.55338590803773591) >>> stats.spearmanr(x2n[:,0], x2n[:,1]) (0.059969996999699973, 0.55338590803773591) >>> rho, pval = stats.spearmanr(x2n, y2n) >>> rho array([[ 1. , 0.05997 , 0.18569457, 0.06258626], [ 0.05997 , 1. , 0.110003 , 0.02534653], [ 0.18569457, 0.110003 , 1. , 0.03488749], [ 0.06258626, 0.02534653, 0.03488749, 1. ]]) >>> pval array([[ 0. , 0.55338591, 0.06435364, 0.53617935], [ 0.55338591, 0. , 0.27592895, 0.80234077], [ 0.06435364, 0.27592895, 0. , 0.73039992], [ 0.53617935, 0.80234077, 0.73039992, 0. ]]) >>> rho, pval = stats.spearmanr(x2n.T, y2n.T, axis=1) >>> rho

array([[ 1. , 0.05997 , 0.18569457, 0.06258626], [ 0.05997 , 1. , 0.110003 , 0.02534653], [ 0.18569457, 0.110003 , 1. , 0.03488749], [ 0.06258626, 0.02534653, 0.03488749, 1. ]]) >>> stats.spearmanr(x2n, y2n, axis=None)

(0.10816770419260482, 0.1273562188027364) >>> stats.spearmanr(x2n.ravel(), y2n.ravel()) (0.10816770419260482, 0.1273562188027364)

>>> xint = np.random.randint(10, size=(100, 2)) >>> stats.spearmanr(xint)

(0.052760927029710199, 0.60213045837062351) import scipy.stats as st

(16)

In [179]: 演習問題4-5 次のそれぞれのデータについて無相関検定を⾏え。 In [ ]: In [ ]: 演習問題4-6

badmington.csv (SampleData/badmington.csv)は区切り記号がコンマのCSVのファイルであり、バドミントンのラケットの重量xと硬度yの表(出典: 内⽥(2010)「すぐに使えるRによる統計解析とグラフの応⽤」東京図書)が収められている。このデータをデータフレームとして読み込み、硬度(y)と重量(x)の相関係数を算出し、無相関の検定を⾏え。 [参考] 区切り記号がコンマのcsvファイルを読み込み、その内容をデータフレームとして取り込むには、pandasモジュールのread_csv関数を⽤いる。 In [ ]: import numpy as np FoodTendency = np.array(["洋食","和食","和食","洋食","和食","洋食","洋食","和食","洋食","洋食","和食", "洋食","和食","洋食","和食","和食","洋食","洋食","和食","和食"]) TasteTendency = np.array(["甘党","辛党","甘党","甘党","辛党","辛党","辛党","辛党","甘党","甘党","甘党", "甘党","辛党","辛党","甘党","辛党","辛党","甘党","辛党","辛党"]) #5-1 import numpy as np Kokugo = np.array([60,40,30,70,55]) Shakai = np.array([80,25,35,70,50]) #5-2 --- 単純に(5-1)のデータを2回繰り返したもの Kokugo = np.array([60,40,30,70,55,60,40,30,70,55]) Shakai = np.array([80,25,35,70,50,80,25,35,70,50])

2017/11/2 Python-statistics4 Python で統計学を学ぶ (4) この内容は 杉澤 村井 (2008) R によるやさしい統計学 (

http://localhost:8888/notebooks/Python-statistics4.ipynb#

1/16

Python

で統計学を学ぶ(4)

統計的仮説検定の必要性

corrcoef(a, b) = −0.034

統計的仮説検定の⼿順と⽤語

1.

帰無仮説と対⽴仮説

μ = 0

ρ = 0

μ

−

μ

= 0

μ ≠ 0

ρ ≠ 0

μ

−

μ

≠ 0

μ > 0

μ < 0

ρ > 0

ρ < 0

μ

>

μ

μ

<

μ

http://localhost:8888/notebooks/Python-statistics4.ipynb#

3/16

2.

検定統計量

3.

有意⽔準と棄却域

4 & 5.

統計的仮説検定の結果の報告

ｐ値

第1種の誤りと第2種の誤り

検定⼒

標準正規分布を⽤いた検定(1つの平均値の検定: ⺟分散が既知)

μ ≠ 0

μ > 0 μ < 0

t,

χ

F

N(μ, )

σ

n

X¯

Z =

X¯

− μ

σ/ n⎯⎯

√

N(μ, )

σ

α

N(12, 10)

μ = 12

μ ≠ 12

α = 0.05

Z

Z

α

Prob(Z ≤ α)

Z

Z

α

Prob(Z > α)

Z < −1.959964

Z > 1.959964

Z

http://localhost:8888/notebooks/Python-statistics4.ipynb#

5/16

t

分布を⽤いた検定(1つの平均値の検定: ⺟分散が未知)

2017/11/2 Python-statistics4 Python で統計学を学ぶ (4) この内容は杉澤村井 (2008) R によるやさしい統計学 (

_{で統計学を学ぶ(4)}

_{帰無仮説と対⽴仮説}

_{有意⽔準と棄却域}

_{統計的仮説検定の結果の報告}

_n