Model-based Kernel Sum Rule with
Applications to State Space Models
Yu Nishiyama^1, Motonobu Kanagawa^2, Arthur Gretton^3, Kenji Fukumizu^4
1. The University of Electro-Communications, 2. The Graduate University for Advanced Studies, 3. University College
London, 4. The Institute of Statistical Mathematics
ABC in Montreal @ NIPS 2014 Workshop, 12/12/2014 (Fri.)
Likelihood-Free (LF)
▶ Approximate Bayesian Computation (ABC)
▶ Kernel Bayes' Rule (KBR) [This Talk]
▶ keywords: nonparametric, kernel methods
This Talk
▶ ABC vs KBR [Fukumizu et al., NIPS2011]
▶ KBR and others
▶ kernel embedding [Smola et al., ALT2007]
▶ kernel sum rule (KSR) [Song et al., ICML2009]
▶ kernel Bayes' rule (KBR) [Fukumizu et al., NIPS2011]
▶ My Contribution (Poster)
▶ parametric & nonparametric
ABC vs KBR
▶ ABC
▶ generates an approximate posterior sample
X_1, ..., X_T ∼ q(x|y) ∝ p(y|x)π(x)
in a likelihood-free manner.
▶ ABC rejection algorithm (sketched in code below):
(i) Generate X_t from π(·).
(ii) Generate Y_t from p(·|X_t).
(iii) If D(y, Y_t) < τ, accept X_t; otherwise reject.
(iv) Go to (i).
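A minimal Python sketch of this rejection loop, assuming a 1-D Gaussian toy model; the prior, likelihood, distance D, and tolerance τ below are illustrative choices, not the ones used in the talk.

import numpy as np

rng = np.random.default_rng(0)

def rejection_abc(y_obs, n_accept, tau):
    # Repeat (i)-(iv) until n_accept proposals are accepted.
    accepted = []
    while len(accepted) < n_accept:
        x = rng.normal(0.0, 1.0)        # (i)  X_t ~ pi(.)     (toy prior)
        y = rng.normal(x, 0.5)          # (ii) Y_t ~ p(.|X_t)  (toy likelihood)
        if abs(y - y_obs) < tau:        # (iii) D(y, Y_t) < tau
            accepted.append(x)          # accept X_t
    return np.array(accepted)           # approximate sample from q(x|y)

posterior_sample = rejection_abc(y_obs=1.0, n_accept=1000, tau=0.1)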
ABC vs KBR
▶ Assumption (ABC)
▶ π(x) and p(y|x) are probabilistic models.
▶ Sampling from π(x) and p(y|x) is possible.
▶ ABC weak points:
▶ If τ is finite, the posterior sample is biased.
▶ If τ is small, X_t is rarely accepted.
▶ If y is high-dimensional, acceptance becomes even rarer (huge computational cost).
▶ Summary statistics for y are not obvious.
ABC vs KBR
▶ Assumption (KBR)
▶ π(x) and p(y|x) are unknown.
▶ Instead, one is given samples
(X_i, Y_i) ∼ p(y|x) p^(train)(x),
X̃_i ∼ π(x) (not necessarily).
▶ KBR [Fukumizu et al., NIPS2011]
▶ nonparametric Bayes' rule computation.
▶ Bayes' rule in a feature space (reproducing kernel Hilbert space, RKHS).
ABC vs KBR
▶ KBR good points:
▶ Complex-shaped π(x) and p(y|x) are OK.
▶ no tolerance value τ.
▶ consistency (sample size n → ∞ with regularization ϵ_n → 0).
▶ applicable to any domains X, Y (e.g., strings, graphs, images).
ABC vs KBR
▶ Kernel ABC [Fukumizu et al., NIPS2011]
▶ apply KBR to a sample (X_t, Y_t)_{t=1}^T drawn from a known p(y|x)π(x).
▶ rejection ABC vs KBR
[Figures: CPU time (sec) vs. Mean Square Error, log-log, for KBR, COND, and ABC in 2 and 6 dimensions; points are labeled by sample size (200-8000).]
Probabilities in Feature Space
▶ Kernel Embedding [Smola et al., ALT2007]
[Diagram: a probability P is mapped to its kernel mean m_P in the feature space.]
▶ Map a probability P to an RKHS H.
▶ kernel mean (empirical estimate sketched below)
m_P := E_{X∼P}[k(·, X)] = ∫_X k(·, x) dP(x) ∈ H.
k(·, X): feature function.
▶ P ↦ m_P is one-to-one (characteristic kernel).
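A minimal sketch of the empirical kernel mean m̂_P = (1/n) Σ_i k(·, X_i), assuming a Gaussian RBF kernel; the bandwidth and toy data are illustrative assumptions.

import numpy as np

def rbf(X, x, sigma=1.0):
    # Gaussian RBF kernel k(X_i, x) = exp(-||X_i - x||^2 / (2 sigma^2)).
    return np.exp(-np.sum((X - x) ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # X_1, ..., X_n ~ P (toy 2-D sample)

def m_hat_P(x):
    # Empirical kernel mean evaluated at x: (1/n) sum_i k(x, X_i).
    return np.mean(rbf(X, x))

print(m_hat_P(np.zeros(2)))              # evaluate the embedding at the origin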
Probabilities in Feature Space
▶ Why kernel mean?
▶ Expectation of an RKHS function f:
E_{X∼P}[f(X)] = E_{X∼P}[⟨f, k(·, X)⟩_H] = ⟨f, m_P⟩_H.
▶ It suffices to estimate m_P (sketch below).
▶ no density estimation.
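For f in the RKHS, say f = Σ_j a_j k(·, c_j), the identity above reduces E_P[f(X)] to evaluations of m̂_P; a short self-contained sketch, where the kernel, centers c_j, and data are illustrative assumptions:

import numpy as np

def rbf(x, X, sigma=1.0):
    return np.exp(-np.sum((x - X) ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 1))             # X_i ~ P
c = np.array([[-1.0], [0.5]])              # f = a_1 k(., c_1) + a_2 k(., c_2)
a = np.array([0.7, -0.3])

# <f, m_hat_P>_H = sum_j a_j m_hat_P(c_j)
via_embedding = sum(a_j * np.mean(rbf(c_j, X)) for a_j, c_j in zip(a, c))
# Direct sample average of f(X_i), for comparison:
direct = np.mean(sum(a_j * rbf(c_j, X) for a_j, c_j in zip(a, c)))
print(via_embedding, direct)               # identical by construction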
▶ Kernel Bayesian Inference (KBI)
▶ Infer m_P instead of P.
Inference in Feature Space
▶ Kernel Sum Rule (KSR) [Song et al., ICML2009]
[Diagram: the sum rule maps Π (probabilities on X) to Q_Y (probabilities on Y); the kernel sum rule maps m_Π (feature space on X) to m_{Q_Y} (feature space on Y).]
▶ sum rule: q(y) = ∫_X p(y|x)π(x) dx.
▶ estimators (weight computation sketched below):
▶ m̂_Π = Σ_{i=1}^l γ_i k_X(·, X̃_i),  m̂_{Q_Y} = Σ_{i=1}^n w_i k_Y(·, Y_i).
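A minimal sketch of the Non-KSR weight computation via the conditional embedding operator, i.e., w = (G_X + n ϵ_n I_n)^{-1} K_{XX̃} γ with G_X[i, j] = k_X(X_i, X_j) and K_{XX̃}[i, j] = k_X(X_i, X̃_j); Gaussian kernels, the regularization value, and the toy data are illustrative assumptions.

import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian Gram matrix G[i, j] = k(A_i, B_j).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def ksr_weights(X, X_tilde, gamma, eps=0.1):
    # Input:  m_hat_Pi = sum_i gamma_i k_X(., X_tilde_i).
    # Output: w such that m_hat_QY = sum_i w_i k_Y(., Y_i).
    n = len(X)
    G_X = gram(X, X)
    K = gram(X, X_tilde)
    return np.linalg.solve(G_X + n * eps * np.eye(n), K @ gamma)

# Toy usage: (X_i, Y_i) from the joint; X_tilde_i with uniform weights gamma.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
Y = X**2 + 0.1 * rng.normal(size=(200, 1))
X_tilde = rng.normal(size=(50, 1))
w = ksr_weights(X, X_tilde, np.full(50, 1.0 / 50))  # weights over Y_1..Y_n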
Inference in Feature Space
▶ Kernel Bayes' Rule (KBR)
[Diagram: Bayes' rule maps Π and an observation y to the posterior Q_{X|y}; the kernel Bayes' rule maps m_Π and y to m_{Q_{X|y}} in feature space.]
▶ Bayes' rule: q(x|y) ∝ p(y|x)π(x).
▶ Posterior estimator: m̂_{Q_{X|y}} = Σ_{j=1}^n w_j k_X(·, X_j).
▶ matrix multiplications: w = M^(KBR)(m̂_Π) k_Y(y) (sketched below).
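A minimal sketch of the KBR weight computation, following the form in Fukumizu et al. (NIPS 2011): μ = (G_X + n ϵ_n I_n)^{-1} K_{XX̃} γ, Λ = diag(μ), w = Λ G_Y ((Λ G_Y)² + δ_n I_n)^{-1} Λ k_Y(y); the exact regularization scalings and the toy data here are illustrative assumptions.

import numpy as np

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kbr_weights(X, Y, X_tilde, gamma, y, eps=0.1, delta=0.1):
    # Returns w such that m_hat_{Q_X|y} = sum_j w_j k_X(., X_j).
    n = len(X)
    G_X, G_Y = gram(X, X), gram(Y, Y)
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gram(X, X_tilde) @ gamma)
    L = np.diag(mu)                        # Lambda: prior weights on the sample
    LG = L @ G_Y
    k_y = gram(Y, np.atleast_2d(y))[:, 0]  # k_Y(y) = (k_Y(Y_j, y))_j
    return LG @ np.linalg.solve(LG @ LG + delta * np.eye(n), L @ k_y)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
Y = X + 0.3 * rng.normal(size=(200, 1))
X_tilde = rng.normal(size=(50, 1))
w = kbr_weights(X, Y, X_tilde, np.full(50, 1.0 / 50), y=np.array([0.5]))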
Intuitive
[Diagram: regression view; given training pairs (X1, Y1), (X2, Y2), predict the output y for a new input x.]
▶ KSR
▶ If x is close to X_1, y is close to Y_1.
▶ If x is far from X_2, y is far from Y_2.
▶ closeness? ⇒ similarity by the kernel k(x, X_1).
▶ Introduce an RKHS H_X on X.
▶ Vector-valued kernel ridge regression.
▶ Introduce a kernel k(y, Y_1) (RKHS H_Y) on Y.
Intuitive
[Diagram: inverse view; given training pairs (X1, Y1), (X2, Y2) and an output y, predict the input x.]
▶ KBR
▶ Predict x for y.
▶ Is KSR in the reverse direction enough?
⇒ No. The prior π(x) must be taken into account.
KSR & KBR can be building blocks
▶ KBR-based LF Algorithms
▶ State Space Models [Fukumizu et al., NIPS2011].
▶ Partially Observable Markov Decision Processes [Nishiyama et al., UAI2012].
▶ Predictive State Representations [Boots et al., UAI2013].
▶ State Space Models [Fukumizu et al., NIPS2011]
▶ filtering algorithm
▶ full-nonparametric
▶ transition process p(x′|x) (nonparametric).
▶ observation process p(z|x) (nonparametric).
Remaining Part & Poster
▶ Objective
▶ flexible inference (parametric & nonparametric).
▶ State Space Models (proposal)
▶ kernel LF filtering algorithm
▶ transition process (parametric):
additive Gaussian noise models
x_{t+1} = f(x_t) + ς_t,  ς_t ∼ N(0, Σ)
1. f may be nonlinear.
2. Other noise models are possible (α-stable, normal inverse Gaussian, variance gamma distributions).
Parametric & Nonparametric
▶ Example (Vision-based Robot Localization)
▶ State x ⇒ robot's position.
▶ Observation z ⇒ image (structured data).
▶ Transition process p(x′|x) ⇒ physical system (parametric).
▶ Observation process p(z|x) ⇒ complex (nonparametric).
▶ KBR is applicable to any domain.
Parametric & Nonparametric
▶ Sum Rule in Feature Space
▶ nonparametric KSR (Non-KSR) [Song et al., ICML2009]
▶ model-based KSR (Mb-KSR) [This NIPS 2014 Workshop]
▶ Model-based KSR (Mb-KSR)
Parametric & Nonparametric
▶ Model-based KSR (Mb-KSR)
▶ input: m̂_Π := Σ_{i=1}^l γ_i k_X(·, X̃_i).
▶ output: m̂_{Q_Y} = Σ_{i=1}^l γ_i m_{Y|X̃_i}.
m_{Y|X̃_i}: closed-form kernel embedding.
▶ Closed-form Kernel Embedding
▶ Gaussian model p(y|x) + Gaussian kernel k(·,·) (worked out below).
▶ Other pairs are possible.
▶ conjugate pairs.
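For concreteness, the Gaussian-Gaussian pair follows from a standard Gaussian convolution identity; with p(y|x) = N(y; f(x), Σ) and k_Y(y, y′) = exp(−‖y − y′‖²/(2σ²)), a LaTeX statement of the closed form (the notation f, Σ, σ matches the transition model above):

\[
m_{Y|x}(y') \;=\; \mathbb{E}_{Y \sim p(\cdot|x)}\!\left[k_Y(y', Y)\right]
\;=\; \sqrt{\frac{\det(\sigma^2 I)}{\det(\sigma^2 I + \Sigma)}}\;
\exp\!\left(-\tfrac{1}{2}\,(y' - f(x))^\top (\sigma^2 I + \Sigma)^{-1}\,(y' - f(x))\right).
\]

The noise broadens the kernel: the embedding is again a scaled Gaussian bump centered at f(x), with bandwidth inflated from σ²I to σ²I + Σ.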
Parametric & Nonparametric
▶ Bayesian Inference in Feature Space
▶ Apply Non-KSR & Mb-KSR.
[Diagrams: chains X → Y → Z whose two edges are (Nonparametric, Nonparametric), (Nonparametric, Parametric), and (Parametric, Nonparametric).]
▶ Mixed chains combine Non-KSR + Mb-KSR.
State Space Models in Feature Space
▶ Filtering in State Space Models
▶ Transition Process (parametric)
▶ Observation Process (nonparametric)
[Diagram: hidden Markov chain X_{t−1} → X_t → X_{t+1} with observations Z_{t−1}, Z_t, Z_{t+1}; the states X are unknown, the observations Z are known.]
▶ Prediction Step
▶ Apply Mb-KSR.
▶ Filtering Step
▶ Apply KBR.
State Space Models in Feature Space
▶ Kernel LF Filtering Algorithm
▶ computations are matrix multiplications (prediction step sketched below).
Algorithm 1: KBR-Filter (Transition: probabilistic model, Observation: nonparametric)
  Initial Belief: prior kernel mean m_{X_1}.
  Observation: z_1 ∈ Z.
  Initial Filtering: α_{X_1|z_1} ← KBR algorithm with prior m_{X_1}.
  for t = 2 : T do
    Weight Computation 1: β_{Z_t|z_{1:t−1}} = (G_X + ϵ_n I_n)^{−1} G_{X′|X} α_{X_{t−1}|z_{1:t−1}}.
    Observation: z_t ∈ Z.
    Weight Computation 2: α_{X_t|z_{1:t}} = R_{X|Z}(β_{Z_t|z_{1:t−1}}) k_Z(z_t).
  end for
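A minimal sketch of Weight Computation 1 (the Mb-KSR prediction step), assuming a 1-D linear transition f(x) = a·x with additive Gaussian noise and a Gaussian state kernel, so that G_{X′|X}[i, j] = E[k_X(X_i, X′) | X_j] has the closed form above; the parameters a, noise variance, bandwidth, and regularization are illustrative assumptions.

import numpy as np

def mb_prediction_weights(X, alpha, a=0.9, noise_var=0.05, sigma=1.0, eps=0.1):
    # Closed-form model-based Gram matrix for x' = a*x + N(0, noise_var):
    # G_pred[i, j] = sqrt(sigma^2 / s2) * exp(-(X_i - a*X_j)^2 / (2*s2)),
    # with s2 = sigma^2 + noise_var (the Gaussian-Gaussian formula).
    n = len(X)
    s2 = sigma**2 + noise_var
    diff = X[:, None] - a * X[None, :]
    G_pred = np.sqrt(sigma**2 / s2) * np.exp(-diff**2 / (2.0 * s2))
    G_X = np.exp(-(X[:, None] - X[None, :])**2 / (2.0 * sigma**2))
    # beta_{Z_t|z_{1:t-1}} = (G_X + eps_n I_n)^{-1} G_{X'|X} alpha
    return np.linalg.solve(G_X + eps * np.eye(n), G_pred @ alpha)

# Toy usage: propagate uniform filtering weights one step forward.
rng = np.random.default_rng(4)
X = rng.normal(size=100)                  # training states X_1..X_n
alpha = np.full(100, 1.0 / 100)           # alpha_{X_{t-1}|z_{1:t-1}}
beta = mb_prediction_weights(X, alpha)    # input to the KBR filtering step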
Experiments
▶ Synthetic Experiment
▶ proposed vs. full-nonparametric [Fukumizu et al., NIPS2011].
[Figures: synthetic data; three panels of Mean Squared Error vs. sample size (100-400) for fKBF and the proposed method.]
Experiments
▶ Vision-based Robot Localization Problem
▶ COsy Localization Database (COLD) [Pronobis et al., 2009].
[Images: (a) Corridor, (b) Two-persons office.]
Experiments
▶ Vision-based Robot Localization Problem
▶ RMSE vs. sample size
[Figures: robot trajectory in the x-y plane; RMSE vs. training sample size (200-800) for NAI, NN, fKBF, and Analy.]
NAI: naive (maximum similarity)
NN: k nearest neighbors
fKBF: full-nonparametric
Analy: proposed
Conclusion
▶ KBR is a likelihood-free approach.
▶ Probabilities are inferred in feature space.
▶ Various algorithms can be built in feature space.