Speech and Language Processing
Lecture 4
Variational inference and sampling
Information and Communications Engineering Course
Takahiro Shinozaki
Lecture Plan (Shinozaki’s part)
1. 10/20 (remote): Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote): Maximum likelihood estimation and EM algorithm
3. 11/3 (remote): Bayesian network and Bayesian inference
4. 11/6 (@TAIST): Variational inference and sampling
5. 11/7 (@TAIST): Neural network based acoustic and language models
6. 11/7 (@TAIST)
Today’s Topic
• Answers for the previous exercises
• Variational inference
Exercise 3.1
• Is the directed graph a DAG?

[Figure: Graph A and Graph B]
Exercise 3.2
• Represent the following joint probability by a BN

  P(A, B, C, D, E) = P(A) P(B|A) P(C|B) P(D|B) P(E|D)

[Figure: partially drawn BN with nodes A, B, C]
Exercise 3.3
• Fill the blanks so that the following HMM and the BN become equivalent

[Figure: left-to-right HMM with states a and b, self-loops, and Gaussian
output distributions N(X|μa, σa) and N(X|μb, σb); equivalent BN with state
nodes S1, S2, S3, …, ST and observation nodes X1, X2, X3, …, XT]

Initial state probability:  P(a) = 0.2,  P(b) = 0.8

Transition probability P(St|St-1):
              St = a    St = b
  St-1 = a    ( 0.3 )   ( 0.7 )
  St-1 = b    ( 0.4 )     0.6
Exercise 3.4
• Assume a probabilistic model P(x|μ), a training sample x1, and a prior distribution of a parameter P(μ) are given as follows:

  P(μ) = (1/√(2π)) exp( −μ²/2 )        (Gaussian distribution with mean 0 and variance 1)
  P(x|μ) = (1/√(2π)) exp( −(x−μ)²/2 )  (Gaussian distribution with mean μ and variance 1)

1) Estimate the posterior distribution

  P(μ|x1) = P(x1|μ) P(μ) / P(x1)

2) Estimate the predictive distribution

  P(x) = ∫ P(x|μ) P(μ|x1) dμ

Note:  ∫−∞..∞ exp( −c x² ) dx = √(π/c)
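Both answers can be checked numerically. A minimal Python sketch (not part of the original slides): the sample value x1 = 1.0 and the integration grid are arbitrary choices; completing the square gives the posterior N(μ | x1/2, 1/2) and the predictive N(x | x1/2, 3/2), which the grid approximation is compared against.

```python
import math

def gauss(x, mean, var):
    # Gaussian density N(x | mean, var)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x1 = 1.0  # arbitrary training sample

# Unnormalized posterior P(x1|mu) * P(mu), normalized numerically on a grid
step = 0.001
grid = [i * step - 10.0 for i in range(20001)]
unnorm = [gauss(x1, mu, 1.0) * gauss(mu, 0.0, 1.0) for mu in grid]
z = sum(unnorm) * step  # approximates P(x1)
posterior = [u / z for u in unnorm]

# Completing the square gives P(mu|x1) = N(mu | x1/2, 1/2); compare pointwise
for mu, p in zip(grid, posterior):
    assert abs(p - gauss(mu, x1 / 2.0, 0.5)) < 1e-6

# Predictive distribution: integral of P(x|mu) P(mu|x1) dmu = N(x | x1/2, 3/2)
x = 0.3
pred = sum(gauss(x, mu, 1.0) * p for mu, p in zip(grid, posterior)) * step
assert abs(pred - gauss(x, x1 / 2.0, 1.5)) < 1e-6
```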
Variational Bayes
• Evaluating the posterior distribution is often not feasible:

  p(Λ|D) = p(D|Λ) p(Λ) / p(D),   p(D) = ∫ p(D|Λ) p(Λ) dΛ   ← Too complex

• Let’s approximate it with a simpler distribution:

  p(Λ|D) ≈ q(Λ),   where  q(Λ) ≡ q(Λ|D)   (for simplicity of notation)

• The minimum of the KL divergence is found using the variational method:

  q* = argmin_q KL[ q(Λ) || p(Λ|D) ]
KL and Lower Bound
• Minimizing the KL divergence is equal to maximizing the lower bound

  KL[ q(Λ) || p(Λ|D) ]
    = −∫ q(Λ) log ( p(Λ|D) / q(Λ) ) dΛ
    = −∫ q(Λ) log ( p(Λ,D) / (q(Λ) p(D)) ) dΛ
    = −∫ q(Λ) log ( p(Λ,D) / q(Λ) ) dΛ + log p(D)

Therefore

  log p(D) = L[q] + KL[ q(Λ) || p(Λ|D) ]

  where  L[q] ≡ ∫ q(Λ) log ( p(Λ,D) / q(Λ) ) dΛ

  log p(D): model evidence (constant for q),   L[q]: lower bound
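The decomposition log p(D) = L[q] + KL can be verified exactly on a toy discrete model. A minimal Python sketch (not from the slides); the joint table and the approximating q below are arbitrary numbers:

```python
import math

# Toy discrete model: hidden variable Lambda in {0, 1}, observation D fixed.
# p_joint[i] = p(Lambda = i, D); the numbers are arbitrary.
p_joint = [0.12, 0.28]
p_D = sum(p_joint)                    # model evidence p(D)
post = [p / p_D for p in p_joint]     # true posterior p(Lambda|D)

q = [0.7, 0.3]                        # an arbitrary approximating distribution

# Lower bound L[q] = sum_Lambda q(Lambda) log( p(Lambda,D) / q(Lambda) )
L = sum(qi * math.log(pj / qi) for qi, pj in zip(q, p_joint))
# KL[ q || p(Lambda|D) ]
KL = sum(qi * math.log(qi / pi) for qi, pi in zip(q, post))

# log p(D) = L[q] + KL holds for any q
assert abs(math.log(p_D) - (L + KL)) < 1e-12
assert KL >= 0.0  # so L[q] is indeed a lower bound on log p(D)
```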
Approximation by Factorization
• Assume (groups of) hidden variables are conditionally independent
• It is called a mean field approximation by analogy to physics
• No restriction on the functional forms

  p(Λ1, Λ2, …, ΛM | D) ≈ q(Λ1, Λ2, …, ΛM) = ∏_{i=1..M} q_i(Λi)
Posterior Inference with Mean Field Approximation
• Suppose a probability model consists of two (groups of) hidden variables H, Z and a (group of) observation variable D
• We approximate p(H,Z|D) as q(H,Z|D) = q(H)q(Z)

  argmax over q(Z), q(H) of  L[ q(Z)q(H) ]   subject to   ∫ q(Z) dZ = 1,   ∫ q(H) dH = 1

With Lagrange multipliers λZ and λH:

  F[ q(Z)q(H) ] = L[ q(Z)q(H) ] − λZ( ∫ q(Z) dZ − 1 ) − λH( ∫ q(H) dH − 1 )

Maximization of F[q]

  F[ q(Z)q(H) ] = ∫∫ q(Z) q(H) log ( p(Z,H,D) / (q(Z)q(H)) ) dZ dH
                  − λZ( ∫ q(Z) dZ − 1 ) − λH( ∫ q(H) dH − 1 )

Setting the functional derivatives with respect to q(Z) and q(H) to zero (variational method; see the appendix for the derivation) gives

  q(Z) = CZ exp( ∫ q(H) log p(Z,H,D) dH )
  q(H) = CH exp( ∫ q(Z) log p(Z,H,D) dZ )

  CZ, CH: normalization constants

The maximum is obtained by alternately updating q(Z) and q(H)
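The alternating updates above can be demonstrated on a toy discrete model, where the integrals become sums. A minimal Python sketch (not from the slides); the joint table and initialization are arbitrary, and the check is that each update can only increase the lower bound L[q]:

```python
import math

# Toy model: hidden variables Z, H in {0, 1}, observation D fixed.
# joint[z][h] = p(Z=z, H=h, D); the numbers are arbitrary.
joint = [[0.10, 0.25], [0.30, 0.05]]

def lower_bound(qZ, qH):
    # L = sum_{Z,H} q(Z)q(H) log( p(Z,H,D) / (q(Z)q(H)) )
    return sum(qZ[z] * qH[h] * math.log(joint[z][h] / (qZ[z] * qH[h]))
               for z in range(2) for h in range(2))

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

qZ, qH = [0.5, 0.5], [0.5, 0.5]  # arbitrary initialization
bounds = [lower_bound(qZ, qH)]
for _ in range(20):
    # q(Z) = C_Z exp( sum_H q(H) log p(Z,H,D) )
    qZ = normalize([math.exp(sum(qH[h] * math.log(joint[z][h]) for h in range(2)))
                    for z in range(2)])
    # q(H) = C_H exp( sum_Z q(Z) log p(Z,H,D) )
    qH = normalize([math.exp(sum(qZ[z] * math.log(joint[z][h]) for z in range(2)))
                    for h in range(2)])
    bounds.append(lower_bound(qZ, qH))

# Each alternating update can only increase the lower bound (coordinate ascent)
assert all(b2 >= b1 - 1e-12 for b1, b2 in zip(bounds, bounds[1:]))
```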
Variational GMM

  p(D, Z, π, θ) = P(π) P(θ) P(D|Z,θ) P(Z|π)
                = P(π) P(θ) ∏t P(Xt|Mt,θ) P(Mt|π)

[Figure: BN with parameter nodes π and Θ, mixture component nodes
M1, M2, …, MT (= Z), and observation nodes X1, X2, …, XT (= D)]

Mean field approximation:

  p(Z, π, θ | D) ≈ q(Z|D) q(π, θ|D)

  q(Z) ∝ exp( ∫∫ q(π,θ) log p(Z, π, θ, D) dπ dθ )
  q(π,θ) ∝ exp( Σ_Z q(Z) log p(Z, π, θ, D) )

Cont.

  q(π,θ) ∝ exp( Σ_Z q(Z) log p(Z, π, θ, D) )
         ∝ P(π) exp( Σt Σ_Mt q(Mt) log P(Mt|π) ) ⋅ P(θ) exp( Σt Σ_Mt q(Mt) log P(Xt|Mt,θ) )
         = q(π) q(θ)

  q(Z) = ∏t q(Mt),
  q(Mt) ∝ exp( ∫ q(π) log P(Mt|π) dπ ) ⋅ exp( ∫ q(θ) log P(Xt|Mt,θ) dθ )

Cf. (compare with the above results)

  P(π|Z) ∝ P(π) P(Z|π) = P(π) ∏t P(Mt|π) = P(π) exp( Σt log P(Mt|π) )
  P(θ|Z,D) ∝ P(θ) P(D|Z,θ) = P(θ) exp( Σt log P(Xt|Mt,θ) )
  P(Mt, Xt) = P(Mt) P(Xt|Mt) = exp( log P(Mt) ) ⋅ exp( log P(Xt|Mt) )
Pseudo Random Generator
• On a digital computer, everything is deterministically calculated and there is no randomness
• However, sometimes we want random numbers
• Most programming languages have a pseudo random generator function

Python 2.6
> import random
> random.random()
0.89388901900395423
> random.random()
0.98563591571989639
> random.random()
Sampling From a Uniform Distribution
• Random numbers distributed uniformly over some region

Example: histogram of samples obtained from a uniform distribution over (0, 1) (to make the graph, scilab was used)

histplot(10,rand(1:1000),

[Figure: histograms of 10 samples and 1000 samples; vertical axis: number of occurrences]
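A Python equivalent of the scilab histogram above; a minimal sketch (not from the slides), with an arbitrary seed for reproducibility:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 1000
samples = [random.random() for _ in range(n)]  # uniform over (0, 1)

# 10-bin histogram over (0, 1), as in the histplot call
bins = [0] * 10
for s in samples:
    bins[int(s * 10)] += 1

assert sum(bins) == n
# With 1000 samples, every bin should be reasonably close to n/10 = 100
assert all(50 < b < 150 for b in bins)
```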
Sampling From a Gaussian Distribution
• Standard normal (Gaussian) distribution has a mean 0.0 and a variance 1.0

Example:

[Figure: histograms of 100 samples and 10000 samples; vertical axis: number of occurrences]
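One common way to draw standard normal samples from uniform ones is the Box-Muller transform (not stated on the slide; it is itself an example of transforming random variables). A minimal Python sketch with an arbitrary seed:

```python
import math
import random

random.seed(0)

def box_muller():
    # Transform two uniform samples into one standard normal sample
    u1, u2 = random.random(), random.random()
    return math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)

n = 10000
samples = [box_muller() for _ in range(n)]

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
# Sample mean and variance should be near 0.0 and 1.0
assert abs(mean) < 0.05
assert abs(var - 1.0) < 0.1
```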
Transform of Random Variable
• Let x be a random variable and f be a function y = f(x). When x follows p(x), y follows the following distribution q(y):

  q(y) = p(x) |dx/dy|
Example
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y)

  p(x) = 1,  x ∈ (0, 1),   y = 3x + 2

  x = (y − 2)/3,   dx/dy = 1/3

  q(y) = p(x) |dx/dy| = 1/3,   y ∈ (2, 5)

[Figure: p(x) = 1 on (0, 1); q(y) = 1/3 on (2, 5)]
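The result can be checked by simulation; a minimal Python sketch (not from the slides), with an arbitrary seed. If q(y) = 1/3 on (2, 5), the mean is 3.5 and each third of the interval receives about 1/3 of the samples:

```python
import random

random.seed(0)
n = 10000
ys = [3.0 * random.random() + 2.0 for _ in range(n)]  # y = 3x + 2, x ~ U(0,1)

assert all(2.0 <= y < 5.0 for y in ys)
mean = sum(ys) / n
assert abs(mean - 3.5) < 0.05
frac = sum(1 for y in ys if y < 3.0) / n  # mass of the first third of (2, 5)
assert abs(frac - 1.0 / 3.0) < 0.03
```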
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y)

  p(x) = 1,  x ∈ (0, 1),   y = −log(1 − x)

  x = 1 − exp(−y),   dx/dy = exp(−y)
  x = 0 → y = 0,   x = 1 → y = ∞

  q(y) = p(x) |dx/dy| = exp(−y)

[Figure: histogram of x (uniform) and histogram of y (exponential)]
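This transform is the classic inverse-CDF way of drawing exponential samples from uniform ones. A minimal Python sketch (not from the slides), with an arbitrary seed; q(y) = exp(−y) has mean 1 and CDF 1 − exp(−t):

```python
import math
import random

random.seed(0)
n = 10000
xs = [random.random() for _ in range(n)]      # x ~ U(0, 1)
ys = [-math.log(1.0 - x) for x in xs]         # y = -log(1 - x)

mean = sum(ys) / n
assert abs(mean - 1.0) < 0.05                 # exponential mean is 1
frac = sum(1 for y in ys if y <= 1.0) / n
assert abs(frac - (1.0 - math.exp(-1.0))) < 0.03  # P(y <= 1) = 1 - e^-1
```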
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y)

  p(x) = (1/√(2π)) exp( −x²/2 ) = N(x|0, 1),   x ∈ (−∞, ∞),   y = 3x + 4

  x = (y − 4)/3,   dx/dy = 1/3

  q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp( −(y − 4)²/(2·3²) ) = N(y|4, 3²)

[Figure: histogram of x and histogram of y]
Sampling from Complex Distributions
• Distributions that can be obtained by a transformation from a simple distribution (such as the uniform distribution) are limited
• We need sampling methods that do not require the integral and inverse of a function, and that can be applied to more complex distributions
  • Ancestral sampling
  • Rejection sampling

Ancestral Sampling
• Assumptions:
  • We want samples from a joint distribution p(x1, x2, …, xM)
  • The joint distribution is given as a Bayesian network
  • Sampling from the conditional distributions is easy
• Algorithm:
  • Sample from the parent nodes to the child nodes in order
  • For the child nodes, use the already sampled parent values in the conditional part

[Figure: BN with nodes A, B, C, D, E and distributions P(A), P(B|A), P(C|B)]
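The algorithm above can be sketched for a small chain-structured BN. A minimal Python sketch (not from the slides); the network A → B → C and all CPT numbers are made up for illustration:

```python
import random

random.seed(0)

# Toy Bayesian network A -> B -> C with made-up binary CPTs
P_A = 0.3                          # P(A = 1)
P_B_given_A = {1: 0.9, 0: 0.2}     # P(B = 1 | A = a)
P_C_given_B = {1: 0.8, 0: 0.1}     # P(C = 1 | B = b)

def bernoulli(p):
    return 1 if random.random() < p else 0

def ancestral_sample():
    # Sample parents first, then children conditioned on the sampled values
    a = bernoulli(P_A)
    b = bernoulli(P_B_given_A[a])
    c = bernoulli(P_C_given_B[b])
    return a, b, c

n = 20000
samples = [ancestral_sample() for _ in range(n)]

# Check the marginal P(B=1) = 0.9*0.3 + 0.2*0.7 = 0.41 against the sample frequency
b_freq = sum(b for _, b, _ in samples) / n
assert abs(b_freq - 0.41) < 0.02
```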
Rejection Sampling
• Assumptions:
  • We want samples from a distribution p(x). The normalization constant Z may be unknown:

    p(x) = (1/Z) p̃(x)

  • We have a distribution q(x) from which we can easily derive samples. We refer to q as a proposal distribution

Procedure of Rejection Sampling
Algorithm:
1. Choose a constant k so that the following holds:  k q(x) ≥ p̃(x)
2. Derive a sample s from q(x)
3. Derive a sample u from a uniform distribution ranging [0, kq(s)]
4. If u > p̃(s), then reject the sample. Otherwise, adopt it

[Figure: kq(x) enveloping p̃(x); u is drawn uniformly over [0, kq(s)] at x = s]
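The four steps above can be sketched directly. A minimal Python example (not from the slides): the unnormalized target p̃(x) = exp(−x²/2) (a standard normal without its constant), a uniform proposal over (−5, 5) with density 1/10, and k = 10 so that kq(x) = 1 ≥ p̃(x) everywhere are all illustrative choices:

```python
import math
import random

random.seed(0)

def p_tilde(x):
    # Unnormalized target: standard normal without its constant Z
    return math.exp(-x * x / 2.0)

# Proposal q(x): uniform over (-5, 5), density 1/10. With k = 10,
# k*q(x) = 1 >= p_tilde(x) everywhere, since p_tilde's maximum is 1.
k, q_density, lo, hi = 10.0, 0.1, -5.0, 5.0

def rejection_sample():
    while True:
        s = random.uniform(lo, hi)              # step 2: sample from q
        u = random.uniform(0.0, k * q_density)  # step 3: uniform over [0, kq(s)]
        if u <= p_tilde(s):                     # step 4: adopt; otherwise retry
            return s

n = 5000
samples = [rejection_sample() for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
# The adopted samples follow the normalized target, here the standard normal
assert abs(mean) < 0.06
assert abs(var - 1.0) < 0.1
```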
Markov Chain Monte Carlo
• General and powerful framework for sampling
  • Scales well with the dimensionality of the sample space
• Maintains a state that forms a Markov chain. The set of the states follows the desired distribution
  • Metropolis algorithm
  • Metropolis-Hastings algorithm
Metropolis Algorithm
• Assumptions:
  • We want samples from a distribution p(X) = (1/Z) p̃(X)
  • The normalization constant Z may be unknown
• Initialization:
  1. Prepare a symmetric proposal distribution q(xA|xB) that satisfies q(xA|xB) = q(xB|xA)
  2. Prepare an initial state x0
Exercise 4.3
• Show that N(xA|xB, 1) = N(xB|xA, 1), where N(x|m, v) is the Gaussian distribution with mean m and variance v

  N(xA|xB, 1) = (1/√(2π)) exp( −(xA − xB)²/2 )
              = (1/√(2π)) exp( −(xB − xA)²/2 )
              = N(xB|xA, 1)
Procedure of Metropolis Algorithm
• Algorithm:
  1. Get a candidate sample x* from the proposal distribution q(x|xt) based on the current state xt
  2. Accept the candidate with probability A(x*, xt) or discard it
  3. If the candidate is accepted, save it as the next state xt+1. If it is discarded, then set xt+1 equal to xt
  4. Goto step 1

  A(x*, xt) = min( 1, p̃(x*) / p̃(xt) )

[Figure: a random walk of states x0, x1, x2, x3, x4, …, xt over the target distribution]
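The procedure can be sketched with the symmetric Gaussian proposal from Exercise 4.3. A minimal Python example (not from the slides): the unnormalized target p̃(x) = exp(−x²/2), the step size, the chain length, and the burn-in are all illustrative choices:

```python
import math
import random

random.seed(0)

def p_tilde(x):
    # Unnormalized target: standard normal without its constant Z
    return math.exp(-x * x / 2.0)

def metropolis(n_samples, step=1.0):
    x = 0.0  # initial state x0
    chain = []
    for _ in range(n_samples):
        # Symmetric proposal q(x*|xt): Gaussian random walk around xt
        cand = x + random.gauss(0.0, step)
        # Accept with probability A(x*, xt) = min(1, p_tilde(x*) / p_tilde(xt))
        if random.random() < min(1.0, p_tilde(cand) / p_tilde(x)):
            x = cand       # accepted: x_{t+1} = x*
        chain.append(x)    # rejected: x_{t+1} = x_t
    return chain

chain = metropolis(20000)
burn = chain[2000:]  # discard the initial transient
mean = sum(burn) / len(burn)
var = sum((s - mean) ** 2 for s in burn) / len(burn)
# The chain states should follow the (normalized) standard normal
assert abs(mean) < 0.1
assert abs(var - 1.0) < 0.15
```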
Metropolis-Hastings Algorithm
• An extension of the Metropolis algorithm
• No symmetry requirement for the proposal distribution
• The acceptance probability of the candidate state is defined as follows. The other parts are the same as the Metropolis algorithm:

  A(x*, xt) = min( 1, ( p̃(x*) q(xt|x*) ) / ( p̃(xt) q(x*|xt) ) )
Gibbs Sampling
• Problem:
  • We want samples from a joint distribution p(x1, x2, …, xM)
• Algorithm:
  1. Prepare an initial state X0 = <x1, x2, …, xM>0
  2. Select one of the variables xi in order or at random
  3. Get a sample from p(xi | X\i) and update xi with that value
  4. Goto step 2. After enough iterations, the distribution of Xt follows p(X)

Compared to the Metropolis algorithm:
• Sampling from the conditional distribution of xi given all other variables needs to be feasible
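The algorithm can be sketched on a bivariate Gaussian, whose conditionals are easy to sample. A minimal Python example (not from the slides): the target is a zero-mean bivariate Gaussian with unit variances and correlation rho = 0.8 (an arbitrary choice), for which p(x1|x2) = N(rho·x2, 1 − rho²) and symmetrically for x2:

```python
import math
import random

random.seed(0)

# Target: bivariate Gaussian, zero mean, unit variances, correlation rho.
rho = 0.8
cond_std = math.sqrt(1.0 - rho * rho)  # std of each conditional

x1, x2 = 0.0, 0.0  # initial state
samples = []
for _ in range(20000):
    x1 = random.gauss(rho * x2, cond_std)  # update x1 ~ p(x1 | x2)
    x2 = random.gauss(rho * x1, cond_std)  # update x2 ~ p(x2 | x1)
    samples.append((x1, x2))

burn = samples[2000:]  # discard the initial transient
n = len(burn)
m1 = sum(a for a, _ in burn) / n
m2 = sum(b for _, b in burn) / n
corr = sum((a - m1) * (b - m2) for a, b in burn) / n
corr /= math.sqrt(sum((a - m1) ** 2 for a, _ in burn) / n)
corr /= math.sqrt(sum((b - m2) ** 2 for _, b in burn) / n)
# The sampled correlation should be close to rho
assert abs(corr - rho) < 0.05
```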
Variational Method (Outline)
• Review of derivatives
  • Function: a mapping from a value to a value

  f(x):  set of values → set of values

If f takes a maximum/minimum at x0, and x is close to x0:

  ( f(x) − f(x0) ) / ( x − x0 ) ≈ 0
Functional and Functional Derivative
• Functional: a mapping from a function to a value

  F[f]:  set of functions → set of values

  Ex. Entropy H[p] takes a function p (a probability distribution) and returns a value

If a functional F takes a maximum/minimum at f0, and f is “close” to f0:

  ( F[f] − F[f0] ) / ( f − f0 ) ≈ 0
Euler-Lagrange Equation
Suppose F takes a minimum/maximum at f0. Let η be an arbitrary function of x, and ε a scalar constant:

  f(x) = f0(x) + ε η(x)

  g(ε) = F[f]

F is a functional of f; g(ε) is a function of ε (takes and returns a scalar).

When ε is close to 0, f is close to f0. Therefore,

  ( F[f] − F[f0] ) / ( ε − 0 ) = 0,   i.e.   ∂g/∂ε |ε=0 = ∂F/∂ε |ε=0 = 0

This must hold for an arbitrary η.
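As a worked example of the variational method (not on the slides): maximizing the entropy functional from the earlier slide under a normalization constraint yields the uniform distribution, using exactly the Lagrange-multiplier construction applied above for F[q].

```latex
% Maximize H[p] = -\int_a^b p(x)\log p(x)\,dx subject to \int_a^b p(x)\,dx = 1,
% using a Lagrange multiplier \lambda:
F[p] = -\int_a^b p(x)\log p(x)\,dx \;-\; \lambda\!\left(\int_a^b p(x)\,dx - 1\right)
% Setting the functional derivative to zero:
\frac{\delta F}{\delta p(x)} = -\log p(x) - 1 - \lambda = 0
\;\Rightarrow\; p(x) = e^{-1-\lambda} \quad \text{(a constant)}
% The normalization constraint fixes the constant:
p(x) = \frac{1}{b-a} \quad \text{(the uniform distribution over } (a,b)\text{)}
```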