Model-based Kernel Sum Rule with
Applications to State Space Models
Yu Nishiyama^1, Motonobu Kanagawa^2, Arthur Gretton^3, Kenji Fukumizu^4
1. The University of Electro-Communications, 2. The Graduate University for Advanced Studies, 3. University College
London, 4. The Institute of Statistical Mathematics
ABC in Montreal @ NIPS 2014 Workshop, 12/12/2014 (Fri.)
Likelihood-Free (LF)
▶ Approximate Bayesian Computation (ABC)
▶ Kernel Bayes' Rule (KBR) [This Talk]
▶ keywords: nonparametric, kernel methods
This Talk
▶ ABC vs KBR [Fukumizu et al., NIPS2011]
▶ KBR and others
▶ kernel embedding [Smola et al., ALT2007]
▶ kernel sum rule (KSR) [Song et al., ICML2009]
▶ kernel Bayes' rule (KBR) [Fukumizu et al., NIPS2011]
▶ My Contribution (Poster)
▶ parametric & nonparametric
ABC vs KBR
▶ ABC
▶ generates an approximate posterior sample
X_1, ..., X_T ∼ q(x|y) ∝ p(y|x)π(x)
in a likelihood-free manner.
▶ ABC rejection algorithm (sketched in code below):
(i) Generate X_t from π(·).
(ii) Generate Y_t from p(·|X_t).
(iii) If D(y, Y_t) < τ, accept X_t; otherwise reject.
(iv) Go to (i).
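A minimal Python sketch of this rejection loop, assuming a 1-D Gaussian toy model; the prior, likelihood, distance D, and tolerance τ below are illustrative choices, not the ones used in the talk.

import numpy as np

rng = np.random.default_rng(0)

def rejection_abc(y_obs, n_accept, tau):
    # Repeat (i)-(iv) until n_accept proposals are accepted.
    accepted = []
    while len(accepted) < n_accept:
        x = rng.normal(0.0, 1.0)        # (i)  X_t ~ pi(.)     (toy prior)
        y = rng.normal(x, 0.5)          # (ii) Y_t ~ p(.|X_t)  (toy likelihood)
        if abs(y - y_obs) < tau:        # (iii) D(y, Y_t) < tau
            accepted.append(x)          # accept X_t
    return np.array(accepted)           # approximate sample from q(x|y)

posterior_sample = rejection_abc(y_obs=1.0, n_accept=1000, tau=0.1)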
ABC vs KBR
▶ Assumption (ABC)
▶ π(x) and p(y|x) are probabilistic models.
▶ Sampling from π(x) and p(y|x) is possible.
▶ ABC weak points:
▶ If τ is finite, the posterior sample is biased.
▶ If τ is small, X_t is rarely accepted.
▶ If y is high-dimensional, acceptance becomes even rarer (huge computational cost).
▶ Summary statistics for y are not obvious.
ABC vs KBR
▶ Assumption (KBR)
▶ π(x) and p(y|x) are unknown.
▶ Instead, one is given samples
(X_i, Y_i) ∼ p(y|x) p^(train)(x),
X̃_i ∼ π(x) (not necessarily).
▶ KBR [Fukumizu et al., NIPS2011]
▶ nonparametric Bayes' rule computation.
▶ Bayes' rule in a feature space (reproducing kernel Hilbert space, RKHS).
ABC vs KBR
▶ KBR good points:
▶ Complex-shaped π(x) and p(y|x) are OK.
▶ no tolerance value τ.
▶ consistency (sample size n → ∞ with regularization ϵ_n → 0).
▶ applicable to any domains X, Y (e.g., strings, graphs, images).
ABC vs KBR
▶ Kernel ABC [Fukumizu et al., NIPS2011]
▶ apply KBR to a sample (X_t, Y_t)_{t=1}^T drawn from a known p(y|x)π(x).
▶ rejection ABC vs KBR
[Figures: CPU time (sec) vs. Mean Square Error, log-log, for KBR, COND, and ABC in 2 and 6 dimensions; points are labeled by sample size (200-8000).]
Probabilities in Feature Space
▶ Kernel Embedding [Smola et al., ALT2007]
[Diagram: a probability P is mapped to its kernel mean m_P in the feature space.]
▶ Map a probability P to an RKHS H.
▶ kernel mean (empirical estimate sketched below)
m_P := E_{X∼P}[k(·, X)] = ∫_X k(·, x) dP(x) ∈ H.
k(·, X): feature function.
▶ P ↦ m_P is one-to-one (characteristic kernel).
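A minimal sketch of the empirical kernel mean m̂_P = (1/n) Σ_i k(·, X_i), assuming a Gaussian RBF kernel; the bandwidth and toy data are illustrative assumptions.

import numpy as np

def rbf(X, x, sigma=1.0):
    # Gaussian RBF kernel k(X_i, x) = exp(-||X_i - x||^2 / (2 sigma^2)).
    return np.exp(-np.sum((X - x) ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # X_1, ..., X_n ~ P (toy 2-D sample)

def m_hat_P(x):
    # Empirical kernel mean evaluated at x: (1/n) sum_i k(x, X_i).
    return np.mean(rbf(X, x))

print(m_hat_P(np.zeros(2)))              # evaluate the embedding at the origin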
Probabilities in Feature Space
▶ Why kernel mean?
▶ Expectation of an RKHS function f:
E_{X∼P}[f(X)] = E_{X∼P}[⟨f, k(·, X)⟩_H] = ⟨f, m_P⟩_H.
▶ It suffices to estimate m_P (sketch below).
▶ no density estimation.
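For f in the RKHS, say f = Σ_j a_j k(·, c_j), the identity above reduces E_P[f(X)] to evaluations of m̂_P; a short self-contained sketch, where the kernel, centers c_j, and data are illustrative assumptions:

import numpy as np

def rbf(x, X, sigma=1.0):
    return np.exp(-np.sum((x - X) ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 1))             # X_i ~ P
c = np.array([[-1.0], [0.5]])              # f = a_1 k(., c_1) + a_2 k(., c_2)
a = np.array([0.7, -0.3])

# <f, m_hat_P>_H = sum_j a_j m_hat_P(c_j)
via_embedding = sum(a_j * np.mean(rbf(c_j, X)) for a_j, c_j in zip(a, c))
# Direct sample average of f(X_i), for comparison:
direct = np.mean(sum(a_j * rbf(c_j, X) for a_j, c_j in zip(a, c)))
print(via_embedding, direct)               # identical by construction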
▶ Kernel Bayesian Inference (KBI)
▶ Infer m_P instead of P.
Inference in Feature Space
▶ Kernel Sum Rule (KSR) [Song et al., ICML2009]
[Diagram: the sum rule maps Π (probabilities on X) to Q_Y (probabilities on Y); the kernel sum rule maps m_Π (feature space on X) to m_{Q_Y} (feature space on Y).]
▶ sum rule: q(y) = ∫_X p(y|x)π(x) dx.
▶ estimators (weight computation sketched below):
▶ m̂_Π = Σ_{i=1}^l γ_i k_X(·, X̃_i),  m̂_{Q_Y} = Σ_{i=1}^n w_i k_Y(·, Y_i).
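A minimal sketch of the Non-KSR weight computation via the conditional embedding operator, i.e., w = (G_X + n ϵ_n I_n)^{-1} K_{XX̃} γ with G_X[i, j] = k_X(X_i, X_j) and K_{XX̃}[i, j] = k_X(X_i, X̃_j); Gaussian kernels, the regularization value, and the toy data are illustrative assumptions.

import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian Gram matrix G[i, j] = k(A_i, B_j).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def ksr_weights(X, X_tilde, gamma, eps=0.1):
    # Input:  m_hat_Pi = sum_i gamma_i k_X(., X_tilde_i).
    # Output: w such that m_hat_QY = sum_i w_i k_Y(., Y_i).
    n = len(X)
    G_X = gram(X, X)
    K = gram(X, X_tilde)
    return np.linalg.solve(G_X + n * eps * np.eye(n), K @ gamma)

# Toy usage: (X_i, Y_i) from the joint; X_tilde_i with uniform weights gamma.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
Y = X**2 + 0.1 * rng.normal(size=(200, 1))
X_tilde = rng.normal(size=(50, 1))
w = ksr_weights(X, X_tilde, np.full(50, 1.0 / 50))  # weights over Y_1..Y_n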
Inference in Feature Space
▶ Kernel Bayes' Rule (KBR)
[Diagram: Bayes' rule maps Π and an observation y to the posterior Q_{X|y}; the kernel Bayes' rule maps m_Π and y to m_{Q_{X|y}} in feature space.]
▶ Bayes' rule: q(x|y) ∝ p(y|x)π(x).
▶ Posterior estimator: m̂_{Q_{X|y}} = Σ_{j=1}^n w_j k_X(·, X_j).
▶ matrix multiplications: w = M^(KBR)(m̂_Π) k_Y(y) (sketched below).
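A minimal sketch of the KBR weight computation, following the form in Fukumizu et al. (NIPS 2011): μ = (G_X + n ϵ_n I_n)^{-1} K_{XX̃} γ, Λ = diag(μ), w = Λ G_Y ((Λ G_Y)² + δ_n I_n)^{-1} Λ k_Y(y); the exact regularization scalings and the toy data here are illustrative assumptions.

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kbr_weights(X, Y, X_tilde, gamma, y, eps=0.1, delta=0.1):
    # Returns w such that m_hat_{Q_X|y} = sum_j w_j k_X(., X_j).
    n = len(X)
    G_X, G_Y = gram(X, X), gram(Y, Y)
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gram(X, X_tilde) @ gamma)
    L = np.diag(mu)                        # Lambda: prior weights on the sample
    LG = L @ G_Y
    k_y = gram(Y, np.atleast_2d(y))[:, 0]  # k_Y(y) = (k_Y(Y_j, y))_j
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), L @ k_y)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
Y = X + 0.3 * rng.normal(size=(200, 1))
X_tilde = rng.normal(size=(50, 1))
w = kbr_weights(X, Y, X_tilde, np.full(50, 1.0 / 50), y=np.array([0.5]))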
Intuitive
[Diagram: regression view; given training pairs (X1, Y1), (X2, Y2), predict the output y for a new input x.]
▶ KSR
▶ If x is close to X_1, y is close to Y_1.
▶ If x is far from X_2, y is far from Y_2.
▶ closeness? ⇒ similarity by the kernel k(x, X_1).
▶ Introduce an RKHS H_X on X.
▶ Vector-valued kernel ridge regression.
▶ Introduce a kernel k(y, Y_1) (RKHS H_Y) on Y.
Intuitive
[Diagram: inverse view; given training pairs (X1, Y1), (X2, Y2) and an output y, predict the input x.]
▶ KBR
▶ Predict x for y.
▶ Is KSR in the reverse direction enough?
⇒ No. The prior π(x) must be taken into account.
KSR & KBR can be building blocks
▶ KBR-based LF Algorithms
▶ State Space Models [Fukumizu et al., NIPS2011].
▶ Partially Observable Markov Decision Processes [Nishiyama et al., UAI2012].
▶ Predictive State Representations [Boots et al., UAI2013].
▶ State Space Models [Fukumizu et al., NIPS2011]
▶ filtering algorithm
▶ full-nonparametric
▶ transition process p(x′|x) (nonparametric).
▶ observation process p(z|x) (nonparametric).
Remaining Part & Poster
▶ Objective
▶ flexible inference (parametric & nonparametric).
▶ State Space Models (proposal)
▶ kernel LF filtering algorithm
▶ transition process (parametric):
additive Gaussian noise models
x_{t+1} = f(x_t) + ς_t,  ς_t ∼ N(0, Σ)
1. f may be nonlinear.
2. Other noise models are possible (α-stable, normal inverse Gaussian, variance gamma distributions).
Parametric & Nonparametric
▶ Example (Vision-based Robot Localization)
▶ State x ⇒ robot's position.
▶ Observation z ⇒ image (structured data).
▶ Transition process p(x′|x) ⇒ physical system (parametric).
▶ Observation process p(z|x) ⇒ complex (nonparametric).
▶ KBR is applicable to any domain.
Parametric & Nonparametric
▶ Sum Rule in Feature Space
▶ nonparametric KSR (Non-KSR) [Song et al., ICML2009]
▶ model-based KSR (Mb-KSR) [This NIPS 2014 Workshop]
▶ Model-based KSR (Mb-KSR)
Parametric & Nonparametric
▶ Model-based KSR (Mb-KSR)
▶ input: m̂_Π := Σ_{i=1}^l γ_i k_X(·, X̃_i).
▶ output: m̂_{Q_Y} = Σ_{i=1}^l γ_i m_{Y|X̃_i}.
m_{Y|X̃_i}: closed-form kernel embedding.
▶ Closed-form Kernel Embedding
▶ Gaussian model p(y|x) + Gaussian kernel k(·,·) (worked out below).
▶ Other pairs are possible.
▶ conjugate pairs.
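For concreteness, the Gaussian-Gaussian pair follows from a standard Gaussian convolution identity; with p(y|x) = N(y; f(x), Σ) and k_Y(y, y′) = exp(−‖y − y′‖²/(2σ²)), a LaTeX statement of the closed form (the notation f, Σ, σ matches the transition model above):

\[
m_{Y|x}(y') \;=\; \mathbb{E}_{Y \sim p(\cdot|x)}\!\left[k_Y(y', Y)\right]
\;=\; \sqrt{\frac{\det(\sigma^2 I)}{\det(\sigma^2 I + \Sigma)}}\;
\exp\!\left(-\tfrac{1}{2}\,(y' - f(x))^\top (\sigma^2 I + \Sigma)^{-1}\,(y' - f(x))\right).
\]

The noise broadens the kernel: the embedding is again a scaled Gaussian bump centered at f(x), with bandwidth inflated from σ²I to σ²I + Σ.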
Parametric & Nonparametric
▶ Bayesian Inference in Feature Space
▶ Apply Non-KSR & Mb-KSR.
[Diagrams: chains X → Y → Z whose two edges are (Nonparametric, Nonparametric), (Nonparametric, Parametric), and (Parametric, Nonparametric).]
▶ Mixed chains combine Non-KSR + Mb-KSR.
State Space Models in Feature Space
▶ Filtering in State Space Models
▶ Transition Process (parametric)
▶ Observation Process (nonparametric)
[Diagram: hidden Markov chain X_{t−1} → X_t → X_{t+1} with observations Z_{t−1}, Z_t, Z_{t+1}; the states X are unknown, the observations Z are known.]
▶ Prediction Step
▶ Apply Mb-KSR.
▶ Filtering Step
▶ Apply KBR.
State Space Models in Feature Space
▶ Kernel LF Filtering Algorithm
▶ computations are matrix multiplications (prediction step sketched below).
Algorithm 1: KBR-Filter (Transition: probabilistic model, Observation: nonparametric)
  Initial Belief: prior kernel mean m_{X_1}.
  Observation: z_1 ∈ Z.
  Initial Filtering: α_{X_1|z_1} ← KBR algorithm with prior m_{X_1}.
  for t = 2 : T do
    Weight Computation 1: β_{Z_t|z_{1:t−1}} = (G_X + ϵ_n I_n)^{−1} G_{X′|X} α_{X_{t−1}|z_{1:t−1}}.
    Observation: z_t ∈ Z.
    Weight Computation 2: α_{X_t|z_{1:t}} = R_{X|Z}(β_{Z_t|z_{1:t−1}}) k_Z(z_t).
  end for
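A minimal sketch of Weight Computation 1 (the Mb-KSR prediction step), assuming a 1-D linear transition f(x) = a·x with additive Gaussian noise and a Gaussian state kernel, so that G_{X′|X}[i, j] = E[k_X(X_i, X′) | X_j] has the closed form above; the parameters a, noise variance, bandwidth, and regularization are illustrative assumptions.

import numpy as np

def mb_prediction_weights(X, alpha, a=0.9, noise_var=0.05, sigma=1.0, eps=0.1):
    # Closed-form model-based Gram matrix for x' = a*x + N(0, noise_var):
    # G_pred[i, j] = sqrt(sigma^2 / s2) * exp(-(X_i - a*X_j)^2 / (2*s2)),
    # with s2 = sigma^2 + noise_var (the Gaussian-Gaussian formula).
    n = len(X)
    s2 = sigma**2 + noise_var
    diff = X[:, None] - a * X[None, :]
    G_pred = np.sqrt(sigma**2 / s2) * np.exp(-diff**2 / (2.0 * s2))
    G_X = np.exp(-(X[:, None] - X[None, :])**2 / (2.0 * sigma**2))
    # beta_{Z_t|z_{1:t-1}} = (G_X + eps_n I_n)^{-1} G_{X'|X} alpha
    return np.linalg.solve(G_X + eps * np.eye(n), G_pred @ alpha)

# Toy usage: propagate uniform filtering weights one step forward.
rng = np.random.default_rng(4)
X = rng.normal(size=100)                  # training states X_1..X_n
alpha = np.full(100, 1.0 / 100)           # alpha_{X_{t-1}|z_{1:t-1}}
beta = mb_prediction_weights(X, alpha)    # input to the KBR filtering step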
Experiments
▶ Synthetic Experiment
▶ proposed vs. full-nonparametric [Fukumizu et al., NIPS2011].
[Figures: synthetic data; three panels of Mean Squared Error vs. sample size (100-400) for fKBF and the proposed method.]
Experiments
▶ Vision-based Robot Localization Problem
▶ COsy Localization Database (COLD) [Pronobis et al., 2009].
[Images: (a) Corridor, (b) Two-persons office.]
Experiments
▶ Vision-based Robot Localization Problem
▶ RMSE vs. sample size
[Figures: robot trajectory in the x-y plane; RMSE vs. training sample size (200-800) for NAI, NN, fKBF, and Analy.]
NAI: naive (maximum similarity)
NN: k nearest neighbors
fKBF: full-nonparametric
Analy: proposed
Conclusion
▶ KBR is a likelihood-free approach.
▶ Probabilities are inferred in feature space.
▶ Various algorithms can be built in feature space.