TA session #1
Shouto Yonekura
April 17, 2016
Contents
1 Probability
2 Random variables, Probability density functions and Distribution functions
3 Independence
4 Moments
5 Moment generating functions
6 Texts
1 Probability
Let Ω be an abstract (nonempty) space and let F ⊆ 2^Ω, where 2^Ω is the power set of Ω.

Def 1.1
F is a σ-algebra if it satisfies
(1) Ω ∈ F,
(2) A ∈ F ⇒ A^c ∈ F,
(3) A_1, A_2, …, A_n, … ∈ F ⇒ ∪_{i=1}^∞ A_i ∈ F.
We call (Ω, F) a measurable space. Next, we define a probability measure.

Def 1.2
A probability measure is a function P : F → [0, 1] such that
(1) P(Ω) = 1,
(2) A_1, A_2, …, A_n, … ∈ F with A_i ∩ A_j = ∅ for all i ≠ j ⇒ P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Property (2) is called countable additivity, and P(A) is called the probability of the event A ∈ F. We call (Ω, F, P) a probability space.
Thm 1.3
Let (Ω, F, P) be a probability space and A, B, A_1, A_2, … ∈ F.
(1) P(A^c) = 1 − P(A)
(2) P(∅) = 0
(3) A ⊂ B ⇒ P(A) ≤ P(B)
(4) P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i)
(5) A_n ⊆ A_{n+1} for all n ⇒ lim_{n→∞} P(A_n) = P(∪_n A_n)
(6) A_n ⊇ A_{n+1} for all n and P(A_1) < ∞ ⇒ lim_{n→∞} P(A_n) = P(∩_n A_n)

Proof
(1) Since P(Ω) = 1 and Ω = A ∪ A^c with A ∩ A^c = ∅, we get 1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c).
(2) Since Ω^c = ∅ ∈ F, we get P(∅) = 1 − P(Ω) = 0.
(3) Let A ⊂ B. Then P(B) = P(A ∪ (B − A)) = P(A) + P(B − A). Thus P(A) ≤ P(B).
(4) Set B_1 := A_1 and B_n := A_n − ∪_{i=1}^{n−1} B_i for n ≥ 2. Then the {B_n} are disjoint, B_n ⊆ A_n, and ∪_i A_i = ∪_i B_i. Therefore
P(∪_i A_i) = P(∪_i B_i) = Σ_i P(B_i) ≤ Σ_i P(A_i).
(5) Set B_1 := A_1 and B_i := A_i − A_{i−1} for i > 1. Then ∪_i A_i = ∪_i B_i and the {B_n} are disjoint. Thus
P(∪_i A_i) = P(∪_i B_i)
= Σ_i P(B_i)
= lim_{n→∞} Σ_{i=1}^n P(B_i)
= lim_{n→∞} P(∪_{i=1}^n B_i)
= lim_{n→∞} P(A_n).
(6) Left as an exercise. Q.E.D
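The properties above are easy to sanity-check numerically on a finite space. The following Python snippet (a minimal illustration, not part of the original notes) uses a six-point uniform space; the events A and B are arbitrary illustrative choices.

```python
# Toy probability space: Omega = {1,...,6} with the uniform measure
# and F = the full power set of Omega.
omega = frozenset(range(1, 7))

def P(A):
    return len(A) / len(omega)

A = frozenset({1, 2, 3})
B = frozenset({1, 2, 3, 4})

assert P(omega - A) == 1 - P(A)        # Thm 1.3 (1), complement rule
assert P(frozenset()) == 0             # Thm 1.3 (2)
assert A <= B and P(A) <= P(B)         # Thm 1.3 (3), monotonicity
assert P(A | B) <= P(A) + P(B)         # Thm 1.3 (4), subadditivity
```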
2 Random variables, Probability density functions and Distribution functions
We call the smallest σ-algebra which contains all intervals in R the Borel algebra, denoted by B(R).

Def 1.4
Let (Ω, F, P) be a probability space. We say X is a random variable if it satisfies
∀A ∈ B(R), X^{-1}(A) := {ω ∈ Ω : X(ω) ∈ A} ∈ F.
In addition,
P_X(A) := P(X^{-1}(A)), A ∈ B(R)
is called the probability distribution of X.
Note that P_X is also a probability measure on (R, B(R)), since
P_X(∪_n A_n) = P(X^{-1}(∪_n A_n))
= P(∪_n X^{-1}(A_n))
= Σ_n P(X^{-1}(A_n))
= Σ_n P_X(A_n)
holds for disjoint {A_n} ⊂ B(R), and P_X obviously satisfies (1) and (2) of Def 1.2.

Def 1.5
Let (Ω, F, P) be a probability space and X a random variable on it. Then
F(x) := P(X ≤ x) = P_X((−∞, x])
is called the (cumulative) distribution function of the random variable X.

Prop 1.6
F(x) satisfies the following properties:
(1) x_1 ≤ x_2 ⇒ F(x_1) ≤ F(x_2)
(2) lim_{x→∞} F(x) = 1, lim_{x→−∞} F(x) = 0
(3) F is right continuous
(4) 0 ≤ F(x) ≤ 1

Proof
(1) If x_1 ≤ x_2, then (−∞, x_1] ⊂ (−∞, x_2], so the claim follows from Thm 1.3 (3).
(2) Since R = ∪_n (−∞, n], Thm 1.3 (5) implies
1 = P_X(R) = P_X(∪_n (−∞, n]) = lim_{n→∞} P_X((−∞, n]) = lim_{x→∞} F(x).
The second limit follows similarly from ∩_n (−∞, −n] = ∅ and Thm 1.3 (6).
(3) Since
F(x+) := lim_{y↓x} F(y) = lim_{n→∞} F(x + 1/n) = lim_{n→∞} P_X((−∞, x + 1/n])
and ∩_n (−∞, x + 1/n] = (−∞, x], it follows from Thm 1.3 (6) that F(x+) = P_X((−∞, x]) = F(x).
(4) Obvious by the definition. Q.E.D
If a random variable X takes only countably many values, then we say P_X is discrete, and
p(x_i) := P(X = x_i) = P_X({x_i}), i = 1, 2, …
is called a probability function. By the definition, p(x_i) satisfies
(1) p(x_i) ≥ 0, i = 1, 2, …,
(2) Σ_i p(x_i) = 1,
(3) F(x) = Σ_{k: x_k ≤ x} p(x_k), ∀x ∈ R.
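As a small illustration (not from the notes), the snippet below builds the distribution function of a discrete variable as the cumulative sum in property (3); the fair-die pmf is an arbitrary example.

```python
# Probability function of a fair die and its cdf F(x) = sum_{k: x_k <= x} p(x_k).
pmf = {x: 1 / 6 for x in range(1, 7)}

def F(x):
    """Distribution function built from the probability function."""
    return sum(p for xk, p in pmf.items() if xk <= x)

assert abs(sum(pmf.values()) - 1) < 1e-12                  # property (2)
assert F(0) == 0 and abs(F(3) - 0.5) < 1e-12 and abs(F(6) - 1) < 1e-12
```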
If the values X takes are continuous, then we say P_X is continuous, and a function f(x) such that
P(X ≤ x) = F(x) := ∫_{−∞}^x f(t) dt, ∀x ∈ R
is called a probability density function. It is easy to see that f(x) satisfies
(1) dF(x)/dx = f(x),
(2) f(x) ≥ 0, ∀x ∈ R,
(3) ∫_{−∞}^∞ f(x) dx = 1.
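Properties (1) and (3) can be checked symbolically. The sketch below (not part of the notes) uses the standard normal density as an illustrative choice.

```python
import sympy as sp

# Check dF/dx = f and total mass 1 for the standard normal density.
x, t = sp.symbols('x t', real=True)
f = sp.exp(-t**2 / 2) / sp.sqrt(2 * sp.pi)              # density in variable t

F = sp.integrate(f, (t, -sp.oo, x))                     # F(x) = int_{-oo}^x f(t) dt
assert sp.simplify(sp.diff(F, x) - f.subs(t, x)) == 0   # (1) dF/dx = f(x)
assert sp.integrate(f, (t, -sp.oo, sp.oo)) == 1         # (3) integrates to 1
```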
Next we extend the above to X ∈ R^n. Let X = (X_1, X_2, …, X_n) be a random vector on (Ω, F, P) with distribution P_X. For any x = (x_1, …, x_n) ∈ R^n, we define
F(x) := P(X_1 ≤ x_1, …, X_n ≤ x_n) = P_X(∏_i (−∞, x_i]),
and we call this the (n-dimensional) joint distribution of X. If X is discrete, then
p(x_i) := P(X = x_i) = P_X({x_i}), i = 1, 2, …,
and if X is continuous, then
F(x) = ∫_{−∞}^{x_1} … ∫_{−∞}^{x_n} f(t_1, …, t_n) dt_1 … dt_n, ∀x ∈ R^n.
In the continuous case, f(x) = ∂^n F(x) / (∂x_1 … ∂x_n) also holds.
3 Independence
Def 1.7
Let X and Y be random variables on (Ω, F, P). If
∀A, B ∈ B(R), P({X ∈ A} ∩ {Y ∈ B}) = P(X ∈ A) × P(Y ∈ B)
holds, then we say X and Y are independent.
Recall that the conditional probability of A given B is given by P(A | B) := P(A ∩ B)/P(B). If A and B are independent, then we get
P(A | B) = P(A ∩ B)/P(B)
= P(A)P(B)/P(B)
= P(A).
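A quick finite check of P(A | B) = P(A) for independent events (not from the notes): the sample space is two fair dice, and the events A, B are illustrative choices.

```python
from itertools import product

# Two independent fair dice; uniform measure on the 36 outcomes.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    return sum(1 for w in omega if event(w)) / len(omega)

A = lambda w: w[0] % 2 == 0          # first die is even
B = lambda w: w[1] <= 2              # second die is at most 2

p_cond = P(lambda w: A(w) and B(w)) / P(B)   # P(A | B) = P(A ∩ B) / P(B)
assert abs(p_cond - P(A)) < 1e-12            # independence gives P(A | B) = P(A)
```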
Thm 1.8
X and Y are independent ⇔ F_{X,Y}(x, y) = F_X(x) F_Y(y) for all x, y, where F_{X,Y} is the joint distribution of X and Y.

Proof
Without loss of generality, we only consider the one-dimensional case.
(⇒) By the definition of independence, we get
F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
= P(X ∈ (−∞, x]) P(Y ∈ (−∞, y])
= P(X ≤ x) P(Y ≤ y)
= F_X(x) F_Y(y),
for any x, y ∈ R.
(⇐) Fix I = (a_1, b_1] × (a_2, b_2] in R². Then we get
P(X ∈ (a_1, b_1], Y ∈ (a_2, b_2]) = P((X, Y) ∈ I)
= Δ²_I F_{X,Y}(x, y)
= F_{X,Y}(b_1, b_2) − F_{X,Y}(b_1, a_2) − F_{X,Y}(a_1, b_2) + F_{X,Y}(a_1, a_2)
= (F_X(b_1) − F_X(a_1))(F_Y(b_2) − F_Y(a_2))     [using F_{X,Y} = F_X F_Y]
= Δ_{(a_1,b_1]} F_X(x) × Δ_{(a_2,b_2]} F_Y(y)
= P(X ∈ (a_1, b_1]) P(Y ∈ (a_2, b_2]).
Since such rectangles generate B(R²), the claim follows. Q.E.D
Thm 1.9
Let X = (X_1, X_2, …, X_n) be a discrete or continuous random vector. Then X_1, X_2, …, X_n are independent ⇔
p_{(X_1,…,X_n)}(x¹_{k_1}, …, xⁿ_{k_n}) = ∏_i p_{X_i}(x^i_{k_i}), ∀k_i = 1, 2, …, i = 1, 2, …, n (discrete case),
f_{(X_1,…,X_n)}(x_1, …, x_n) = ∏_i f_{X_i}(x_i), ∀x ∈ R^n (continuous case).
Proof (n = 2)
Discrete case:
(⇒) Let {(x_i, y_j) : i, j = 1, 2, …} be the values of (X, Y). By the definition we get
p_{(X,Y)}(x_i, y_j) = P(X = x_i, Y = y_j)
= P(X = x_i) P(Y = y_j)
= p_X(x_i) p_Y(y_j).
(⇐) Fix A, B ∈ B(R). Then we get
P(X ∈ A, Y ∈ B) = P((X, Y) ∈ A × B)
= Σ_{(k,l): (x_k, y_l) ∈ A×B} P(X = x_k, Y = y_l)
= Σ_{(k,l): (x_k, y_l) ∈ A×B} p_{(X,Y)}(x_k, y_l)
= Σ_{(k,l): (x_k, y_l) ∈ A×B} p_X(x_k) p_Y(y_l)
= P(X ∈ A) P(Y ∈ B).
Continuous case:
(⇒) For any a_i < b_i, we can show that
∫_{a_1}^{b_1} ∫_{a_2}^{b_2} f_{(X,Y)}(x, y) dx dy = P(X ∈ (a_1, b_1], Y ∈ (a_2, b_2])
= P(X ∈ (a_1, b_1]) P(Y ∈ (a_2, b_2])
= (∫_{a_1}^{b_1} f_X(x) dx)(∫_{a_2}^{b_2} f_Y(y) dy)
= ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} f_X(x) f_Y(y) dx dy.
Since this holds for all a_i < b_i, it follows that f_{(X,Y)}(x, y) = f_X(x) f_Y(y) for (almost) all x, y ∈ R.
(⇐) For all a_1, a_2 ∈ R, we observe that
F_{(X,Y)}(a_1, a_2) = ∫_{−∞}^{a_1} ∫_{−∞}^{a_2} f_{(X,Y)}(x, y) dy dx
= ∫_{−∞}^{a_1} ∫_{−∞}^{a_2} f_X(x) f_Y(y) dy dx
= (∫_{−∞}^{a_1} f_X(x) dx)(∫_{−∞}^{a_2} f_Y(y) dy)
= F_X(a_1) F_Y(a_2). Q.E.D
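A Monte Carlo sketch of the factorization in Thm 1.8 (not part of the notes): for X and Y drawn independently, the empirical joint distribution function approximately factorizes. The distributions and the test point (x0, y0) are illustrative choices.

```python
import numpy as np

# F_{X,Y}(x0, y0) should match F_X(x0) F_Y(y0) up to sampling noise.
rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
Y = rng.exponential(size=n)          # generated independently of X

x0, y0 = 0.5, 1.0
F_joint = np.mean((X <= x0) & (Y <= y0))
F_prod = np.mean(X <= x0) * np.mean(Y <= y0)
assert abs(F_joint - F_prod) < 0.01  # agreement up to ~1/sqrt(n)
```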
4 Moments
The expectation of a random variable X, denoted µ, is given by
E[X] := Σ_x x p(x) (discrete case), E[X] := ∫ x f(x) dx (continuous case),
and its variance, denoted σ², is given by
V[X] := Σ_x (x − E[X])² p(x) (discrete case), V[X] := ∫ (x − E[X])² f(x) dx (continuous case).
Let g(·) be a (measurable) function. Then its expectation is given by
E[g(X)] := Σ_x g(x) p(x) (discrete case), E[g(X)] := ∫ g(x) f(x) dx (continuous case).
The expectation has linearity, that is,
E[a g(X) + b h(X)] = a E[g(X)] + b E[h(X)],
for any a, b ∈ R. We often use the following formula for calculating a variance:
σ² = E[(X − µ)²] = E[X² − 2µX + µ²]
= E[X²] − 2µE[X] + µ²
= E[X²] − E[X]².
We also easily observe that, for any a, b ∈ R,
V[a + bX] = E[(a + bX − E[a + bX])²]
= E[(a + bX − a − bE[X])²]
= E[b²X² − 2b²XE[X] + b²E[X]²]
= b²E[X²] − b²E[X]²
= b²(E[X²] − E[X]²)
= b²V[X].
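Both identities are easy to verify numerically. In the sketch below (not from the notes), the gamma distribution and the constants a, b are arbitrary illustrative choices.

```python
import numpy as np

# Check sigma^2 = E[X^2] - E[X]^2 and V[a + bX] = b^2 V[X] on samples.
rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)
a, b = 5.0, -2.0

assert np.isclose(np.var(X), np.mean(X**2) - np.mean(X)**2)
assert np.isclose(np.var(a + b * X), b**2 * np.var(X))
```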
We call
µ′_k := E[X^k], µ_k := E[(X − µ)^k],
the k-th moment and the k-th central moment (moment about the mean), respectively. We can show the following formula for calculating µ_k from the µ′_k. By the binomial theorem,
(X − µ)^k = Σ_{j=0}^{k} C(k, j) (−µ)^j X^{k−j}
= X^k − kµX^{k−1} + (k(k−1)/2) µ² X^{k−2} − … + (−1)^{k−1} k µ^{k−1} X + (−1)^k µ^k,
where C(k, j) denotes the binomial coefficient. Thus we get
µ_k = E[X^k] − kµ E[X^{k−1}] + (k(k−1)/2) µ² E[X^{k−2}] − … + (−1)^{k−1} k µ^{k−1} E[X] + (−1)^k µ^k
= µ′_k − k µ′_1 µ′_{k−1} + (k(k−1)/2) (µ′_1)² µ′_{k−2} − … + (−1)^{k−1} k (µ′_1)^{k−1} µ′_1 + (−1)^k (µ′_1)^k.
For example, in the cases of µ_2, µ_3 and µ_4, we get
µ_2 = E[(X − µ)²] = E[X²] − E[X]² = µ′_2 − (µ′_1)²,
µ_3 = E[(X − µ)³] = µ′_3 − 3µ′_1µ′_2 + 2(µ′_1)³ = E[X³] − 3E[X]E[X²] + 2(E[X])³,
µ_4 = E[(X − µ)⁴] = µ′_4 − 4µ′_1µ′_3 + 6(µ′_1)²µ′_2 − 3(µ′_1)⁴ = E[X⁴] − 4E[X]E[X³] + 6E[X]²E[X²] − 3E[X]⁴.
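The µ_3 and µ_4 formulas can be confirmed symbolically. The sketch below (not from the notes) uses an Exp(1) variable, an illustrative choice for which E[X^k] = k!.

```python
import sympy as sp

# Verify the mu_3 and mu_4 expansions for X ~ Exp(1).
x = sp.symbols('x', nonnegative=True)
f = sp.exp(-x)                                        # Exp(1) density

def m(k):                                             # mu'_k = E[X^k]
    return sp.integrate(x**k * f, (x, 0, sp.oo))

mu = m(1)
mu3 = sp.integrate((x - mu)**3 * f, (x, 0, sp.oo))    # E[(X - mu)^3]
mu4 = sp.integrate((x - mu)**4 * f, (x, 0, sp.oo))    # E[(X - mu)^4]

assert sp.simplify(mu3 - (m(3) - 3*m(1)*m(2) + 2*m(1)**3)) == 0
assert sp.simplify(mu4 - (m(4) - 4*m(1)*m(3) + 6*m(1)**2*m(2) - 3*m(1)**4)) == 0
```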
The covariance of X and Y is defined by
Cov[X, Y] := E[(X − E[X])(Y − E[Y])],
and the correlation is given by
ρ(X, Y) := Cov[X, Y] / (√V[X] √V[Y]).
Thm 1.10
(1) X and Y are independent ⇒ ρ(X, Y) = 0.
(2) −1 ≤ ρ(X, Y) ≤ 1.
Proof
(1) We can rewrite
Cov[X, Y] = E[(X − µ_X)(Y − µ_Y)]
= E[XY − µ_X Y − µ_Y X + µ_X µ_Y]
= E[XY] − E[X]E[Y].
Since E[XY] = E[X]E[Y] by the assumption of independence, we get Cov[X, Y] = 0.
(2) The Cauchy–Schwarz inequality implies
|Σ_i a_i b_i| ≤ √(Σ_i a_i²) √(Σ_i b_i²).
Thus if we set a_i = x_i − µ_X and b_i = y_i − µ_Y (in the discrete case; the continuous case uses the expectation form E[UV]² ≤ E[U²]E[V²]), then we get
|Cov(X, Y)|² ≤ V(X)V(Y). Q.E.D
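A Monte Carlo sketch of both claims (not part of the notes); the two sampling distributions are illustrative choices.

```python
import numpy as np

# Independently drawn samples give correlation near 0; |rho| <= 1 always.
rng = np.random.default_rng(2)
X = rng.normal(size=100_000)
Y = rng.uniform(size=100_000)               # drawn independently of X

rho = np.corrcoef(X, Y)[0, 1]
assert abs(rho) < 0.02                      # ~ 0 up to sampling noise
assert np.isclose(np.corrcoef(X, 2 * X + 1)[0, 1], 1.0)  # boundary case rho = 1
```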
5 Moment generating functions
Def 1.11
Let X be a discrete or continuous random variable. We call
M(t) := E[e^{tX}] = Σ_x e^{tx} p(x) (discrete case),
M(t) := E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f(x) dx (continuous case),
the moment generating function of X.
Calculating moments using the m.g.f.:
d^k M(t)/dt^k |_{t=0} = E[X^k], k = 1, 2, …
Indeed, in the continuous case,
dM(t)/dt = d/dt ∫_{−∞}^∞ e^{tx} f(x) dx
= ∫_{−∞}^∞ ∂/∂t e^{tx} f(x) dx
= ∫_{−∞}^∞ x e^{tx} f(x) dx,
so
dM(t)/dt |_{t=0} = ∫_{−∞}^∞ x f(x) dx = E[X],
d²M(t)/dt² |_{t=0} = ∫_{−∞}^∞ x² e^{tx} f(x) dx |_{t=0} = E[X²],
and therefore
V[X] = E[X²] − E[X]² = d²M(t)/dt² |_{t=0} − (dM(t)/dt |_{t=0})².
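The differentiate-at-zero recipe can be carried out symbolically. The following sketch (not from the notes) uses an Exp(lam) variable as an illustrative example; its m.g.f. is lam/(lam − t) for t < lam.

```python
import sympy as sp

# Moments of X ~ Exp(lam) from its m.g.f. by differentiation at t = 0.
t, x = sp.symbols('t x', real=True)
lam = sp.symbols('lam', positive=True)

f = lam * sp.exp(-lam * x)                                        # Exp(lam) density
M = sp.integrate(sp.exp(t * x) * f, (x, 0, sp.oo), conds='none')  # = lam/(lam - t)

EX = sp.diff(M, t).subs(t, 0)          # M'(0)  = E[X]   = 1/lam
EX2 = sp.diff(M, t, 2).subs(t, 0)      # M''(0) = E[X^2] = 2/lam^2
assert sp.simplify(EX - 1 / lam) == 0
assert sp.simplify(EX2 - EX**2 - 1 / lam**2) == 0                 # V[X] = 1/lam^2
```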
Example
X is normally distributed if it has the following p.d.f.:
f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.
Then its m.g.f. is given by
M(t) = ∫_{−∞}^∞ e^{tx} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx,
and can be calculated as follows. The exponent is
tx − (x − µ)²/(2σ²) = −(1/(2σ²))(x² − 2µx + µ² − 2σ²tx)
= −(1/(2σ²))(x² − 2x(µ + σ²t) + (µ + σ²t)²) + (µ + σ²t)²/(2σ²) − µ²/(2σ²)
= −(1/(2σ²))(x − µ′)² + µt + σ²t²/2,
where µ′ := µ + σ²t. Thus we get
M(t) = e^{µt + σ²t²/2} ∫_{−∞}^∞ (1/√(2πσ²)) e^{−(x−µ′)²/(2σ²)} dx = e^{µt + σ²t²/2},
since the remaining integrand is the N(µ′, σ²) density and integrates to 1.
Using this m.g.f., the mean and the variance of a normal distribution can be calculated as follows:
dM(t)/dt |_{t=0} = exp(µt + σ²t²/2)(µ + σ²t) |_{t=0} = µ = E[X],
d²M(t)/dt² |_{t=0} = (σ² exp(µt + σ²t²/2) + exp(µt + σ²t²/2)(µ + σ²t)²) |_{t=0} = σ² + µ² = E[X²],
V[X] = d²M(t)/dt² |_{t=0} − (dM(t)/dt |_{t=0})²
= σ² + µ² − µ²
= σ².
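The whole derivation can be double-checked symbolically, as in this sketch (not part of the notes):

```python
import sympy as sp

# Confirm M(t) = exp(mu t + sigma^2 t^2 / 2) and the moment calculations.
x, t, mu = sp.symbols('x t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / sp.sqrt(2 * sp.pi * sigma**2)
M = sp.simplify(sp.integrate(sp.exp(t * x) * f, (x, -sp.oo, sp.oo)))

assert sp.simplify(M - sp.exp(mu * t + sigma**2 * t**2 / 2)) == 0
assert sp.simplify(sp.diff(M, t).subs(t, 0) - mu) == 0                     # E[X]
assert sp.simplify(sp.diff(M, t, 2).subs(t, 0) - (sigma**2 + mu**2)) == 0  # E[X^2]
```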
Example
Next we calculate the moments of a binomial distribution. First we derive its m.g.f. as follows (with q := 1 − p):
M(t) = Σ_{x=0}^n e^{tx} C(n, x) p^x q^{n−x}
= Σ_{x=0}^n C(n, x) (pe^t)^x q^{n−x}
= (pe^t + q)^n.
Using this gives
dM(t)/dt |_{t=0} = n(pe^t + q)^{n−1} pe^t |_{t=0} = np = E[X],
d²M(t)/dt² |_{t=0} = (n(pe^t + q)^{n−1} pe^t + n(n−1)(pe^t + q)^{n−2}(pe^t)²) |_{t=0} = np + n(n−1)p²,
V[X] = d²M(t)/dt² |_{t=0} − (dM(t)/dt |_{t=0})²
= np + n(n−1)p² − n²p²
= np(1 − p) = npq.
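A symbolic sketch of the same computation (not from the notes):

```python
import sympy as sp

# The binomial m.g.f. (p e^t + q)^n gives E[X] = np and V[X] = npq.
t = sp.symbols('t', real=True)
n = sp.symbols('n', positive=True, integer=True)
p = sp.symbols('p', positive=True)
q = 1 - p

M = (p * sp.exp(t) + q)**n
EX = sp.diff(M, t).subs(t, 0)
EX2 = sp.diff(M, t, 2).subs(t, 0)

assert sp.simplify(EX - n * p) == 0
assert sp.simplify(EX2 - EX**2 - n * p * q) == 0   # V[X] = np(1 - p)
```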
Thm 1.12
Let X and Y be independent random variables and define Z := X + Y. Then the m.g.f. of Z is given by
M_Z(t) = M_X(t) M_Y(t).
More generally, if X_1, …, X_n are independent and Z_n := Σ_{i=1}^n X_i, then the m.g.f. of Z_n is given by
M_{Z_n}(t) = ∏_{i=1}^n M_{X_i}(t).
Proof
We only consider the case of n = 2. Let X and Y be independent random variables. Then e^{tX} and e^{tY} are also independent. Thus we get
M_Z(t) = E[e^{tZ}]
= E[e^{t(X+Y)}]
= E[e^{tX} e^{tY}]
= E[e^{tX}] E[e^{tY}]
= M_X(t) M_Y(t),
as required. Q.E.D
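For instance (an illustration, not from the notes), multiplying the m.g.f.s of two independent normals recovers the m.g.f. of a normal with added means and variances:

```python
import sympy as sp

# Product of N(mu1, s1^2) and N(mu2, s2^2) m.g.f.s = m.g.f. of
# N(mu1 + mu2, s1^2 + s2^2), i.e. the sum of the two variables.
t, mu1, mu2 = sp.symbols('t mu1 mu2', real=True)
s1, s2 = sp.symbols('s1 s2', positive=True)

MX = sp.exp(mu1 * t + s1**2 * t**2 / 2)
MY = sp.exp(mu2 * t + s2**2 * t**2 / 2)
MZ = sp.exp((mu1 + mu2) * t + (s1**2 + s2**2) * t**2 / 2)

assert sp.simplify(MX * MY - MZ) == 0
```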
Thm 1.13
Let F_X be the distribution function of X and M_X(t) its m.g.f. Then for any continuity points a < b of F_X,
F_X(b) − F_X(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} (exp(−ibt) − exp(−iat))/(−it) · M_X(it) dt
holds, where M_X(it) = E[e^{itX}] is the characteristic function of X.
I will skip the proof of this theorem. However, this theorem says that analysing a distribution is equivalent to analysing its m.g.f.
6 Texts
• Statistics
Young and Smith (2010), Essentials of Statistical Inference.
Keener (2010), Theoretical Statistics: Topics for a Core Course.
• Linear algebra
Axler, Sheldon (2006), Linear Algebra Done Right (Undergraduate Texts in Mathematics).
Harville, David (1997), Matrix Algebra From a Statistician's Perspective.
• Measure theory and probability theory
Capinski, Marek and Kopp, Peter (2008), Measure, Integral and Probability.
Rosenthal (2006), A First Look at Rigorous Probability Theory.