Probabilistic Inference by minimizing the TRW
Free Energy using CCCP
Yu Nishiyama
1, Xingyao Ye
2, Alan L. Yuille
21
Laboratory for Integrated Theoretical Neuroscience, RIKEN, Brain Science Institute, Wako, JAPAN
2Department of Statistics, UCLA, Los Angeles, USA
Introduction
◮ Probabilistic inference – computing marginals and/or the most probable
assignment (MAP estimate) given a large-scale graphical model.
◮ Many applications – e.g., computer vision, protein folding, genetic analysis,
neural science.
◮ No exact inference algorithms as belief propagation in tree graphs when
graphical models have cycles.
◮ Goal – developing approximate algorithms for graphs with cycles, favoring
algorithms which are as accurate as possible and are guaranteed to converge.
◮ Efficient TRW-BP algorithms [Wainwright et. al. 2005]
⊲ guaranteed to compute the global minimum of TRW free energy, though
not convergent.
◮ TRW-GP [Globerson and Jaakkola 2007]
⊲ guaranteed to converge by optimizing the convex dual of a modified TRW
free energy.
◮ sum-TRW-S [Meltzer et al 2009]
⊲ TRW-BP is convergent if messages are updated with right update order.
Purpose
◮ Developing a set of probabilistic inference algorithms by minimizing the
TRW free energy using CCCP.
Markov Random Field & Free Energy
◮ Markov Random Fields (MRFs) defined over graph G = (V, E)
p(x; θ) = exp
X
i∈V
θi(xi) + X ij∈E
θij(xi, xj) − Φ(θ)
.
◮ The true singleton and pairwise marginals p = {{pi(xi)}, {pij(xi, xj)}} are
searched within the local polytope.
L(G) =
b ≥ 0
P xi
bi(xi) = 1, ∀i ∈ V
P xj
bij(xi, xj) = bi(xi), P
xi
bij(xi, xj) = bj(xj), ∀xi, ∀xj, ∀ij ∈ E.
.
◮ The beliefs b which minimize the free energy yield approximations to the
marginals b∗ ≃ p. The free energies have the following form;
F (b) = −b · θ − S(b),
where −b · θ = P
i∈V
hθi(xi)ibi + P ij∈E
hθij(xi, xj)ib
ij and S(b) is the entropy term.
Tree-ReWeighted (TRW) Entropy
◮ The TRW entropy is given by the convex combination of tree entropies
ST (b):
STRW (b) = X
T∈T
ρ(T)ST(b) = X i∈V
Si(bi) − X
ij∈E
ρijIij(bij).
⊲ T is a spanning tree on graph G. T is the set of all spanning trees. ρ(.) is
a probability distribution over the spanning trees. ρij is the edge
appearance probability of ij ∈ E.
◮ Edge appearance vector ρe = {ρij} is chosen within the spanning tree
polytope:
I
T (G) =
(
ρe ∈ IR|E||ρe = X T∈T
ρ(T )v(T) )
.
⊲ v(T) ∈ {0, 1}|E| is an indicator vector such that [v(T)]ij = 1, ij ∈ T, and
otherwise 0.
◮ The TRW entropy can be rewritten over the local polytope L(G) as
¯
STRW(b) = X
i∈V
ciSi(bi) + X
ij∈E
cijSij(bij). (1)
⊲ cij = ρij and ci = 1 − P
j∈Ni ρij. Ni is the set of neighboring nodes of node i.
◮ The TRW entropy STRW(b) is concave over the local polytope L(G).
◮ The Bethe-approximation entropy corresponds to ρe = 1, though 1 ∈/TI (G).
Hence the Bethe-approximation entropy is not guaranteed to be concave if the graph is loopy.
Concave-Convex Procedure(CCCP)
◮ Let Fvex be the set of all convex functions and Dec(F ) be the set of all pairs
of two convex functions such that the difference of the two convex functions
is equal to the objective function F (b):
Dec(F ) = (fvex, gvex) ∈ F2
vex |fvex(b) − gvex(b) = F (b) .
◮ Theorem
Given a series of decompositions of an objective function F (b) into convex
and concave functions, {(fvext , gvext ) ∈ Dec(F )}, the iterative algorithm
∂fvext (bt+1) = ∂gvext (bt) (2 )
is guaranteed to monotonically decrease the objective function F (b).
◮ Let CCCP(F ; f 0
vex, gvex0 ) denote the specific CCCP algorithm under a
decomposition (fvex0 , gvex0 ). The family of CCCP algorithms is given by the
collection:
CCCP(F ) = {CCCP(F ; fvex, gvex)|(fvex, gvex) ∈ Dec(F )} . (3)
◮ A CCCP(F ; f 0
vex, gvex0 ) can be
expanded into a family of
CCCP algorithms CCCP(F ).
geometrical meaning
time t time t+1 time t+2
TRW-CCCP
◮ Let fvex be a convex free energy which has the form of
fvex(b; u) = −b · θ − u · S, (4)
where u · S = P
i∈V
uiSi(bi) + P
ij∈E
uijSij(bij), and u = ({ui}, {uij}) is chosen in a set U
to ensure that fvex(b; u) is convex.
◮ We decompose the TRW free energy, using two convex free energies
fvex(b; ˜u) and gvex(b; u), as
FTRW (b) = −b · θ + fvex(b; ˜u) − gvex(b; u), (5)
where u˜ = c + u. The set in which CCCP algorithms are guaranteed to
converge is
UCCCP =
n
u ∈ IR|V|+|E||u ∈ U, u˜ ∈ Uo. (6)
◮ Resulting TRW-CCCP algorithms are
Algorithm TRW-CCCP; A family of CCCP to minimize the TRW free energy.
1. Initialization: bα(1)(xα) = bα(0)(xα) ∝ exp h
θα(xα)
cα i
α ∈ V ∪ E
2. Choose a free vector u in the set UCCCP. 3. Inner loop:
bi(xi) ∝ h bij(xi)
bi(xi)
i
˜ uij ˜ ui+˜uij
bi(xi)
bij(xi, xj) ∝ h bij(xi)
bi(xi)
i−˜ ui˜
ui+˜uij
bij(xi, xj) ij ∈ E
4. Outer loop:
bα(t+2)(xα) ∝
h
bα(t+1)(xα)
bα(t)(xα)
iuu˜αα
bα(t+1)(xα) α ∈ V ∪ E
5. Output: A set of approximate marginals b∗(≃ p) when outer loop converges. Otherwise, set beliefs to b(t) = b(t+1), b(t+1) = b(t+2) and go to 2.
Numerical Results: Ising Spin Models
◮ In our experiments,
TRW-CCCP under any
setting of parameter u˜
computed the same
pseudo-marginals b∗ as that
from TRW-BP. Larger
settings of u made
TRW-CCCP local and
smaller settings of u made
TRW-CCCP non-local.
◮ Two dimensional grid (10x10 size)
5 10 15 20 25
‑1.26 ‑1.24 ‑1.22 ‑1.2 ‑1.18 ‑1.16 ‑1.14 ‑1.12 ‑1.1
Mixed(10x10grid)
step(outer loop)
TRW Free Energy
tilde‑u=0.1
tilde‑u=0.2
tilde‑u=0.5
tilde‑u=1
tilde‑u=2
tilde‑u=5
tilde‑u=10
0 200 400 600 800 1000
0
0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01
Mixed(5x5grid), Inner Loop=1
step
difference of beliefs
TRW‑BP
tilde‑u=1
tilde‑u=2
20 40 60 80 100 120 140 160
20
40 60 80
100 120
140
Mixed(10x10grid)
step(outer loop)
Number of inner loop
tilde‑u=0.1
tilde‑u=0.2
tilde‑u=0.5
tilde‑u=1
tilde‑u=2
tilde‑u=5
tilde‑u=10
100 200 300 400 500 600 700 800 900 1000
0
0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01
Mixed(5x5grid), Inner Loop=2
step
difference of beliefs
TRW‑BP
tilde‑u=1
tilde‑u=2
Conclusions
◮ We developed a new set of convergent CCCP algorithms which are
guaranteed to find the global minimum of the TRW free energy. The
algorithms include many free parameters (u ∈ IR|V|+|E|) which could be altered
to change the convergence rate. TRW-CCCP seems more stable for difficult problems such that energy functions have large random weights and