計算速度 - ⇓ ラグランジアン : - 2011/1/20: 機械学習における連続最適化の新しいトレンド

⇓ ラグランジアン :

テンソル結果 2: 計算速度

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 10 20 30 40 50

Fraction of observed elements

Computation time (s)

As a Matrix Constraint Mixture Tucker (large) Tucker (exact)

しかも凸最適化は速い！

(Tomioka et al. 10)

. . . . . .

手法3: Operator Splitting

Outline

.

. .

₁

イントロ

.

. .

2 準備

.

. .

3 手法

1: Prox

作用素

.

. .

4 手法

2: Legendre

_変換

. .

5 手法3: Operator Splitting

.

. .

6 まとめ

.

. .

Appendix

冨岡亮太(東大) 51 / 66

. . . . . .

手法3: Operator Splitting

凸最適化 ⇔ ^{方程式のゼロ点}

.

最小化問題

.

. . .

.

min

f (x ) + g(x ).

.

方程式のゼロ点

.

. . .

.

Find x such that (T

+ T

)(x ) ∋

where

T

= ∂f,

T

= ∂g (

劣微分作用素

).

例

(Prox

作用素

):

= prox

(z ) = argmin

x^′∈Rⁿ

(x

^′

) + 1

2

∥x^′−z∥²

⇔ T_f

(x ) + (x

−z

)

∋0

⇔

prox

(z) = (I +

T_f

)

⁻¹

(z )

. . . . . .

手法3: Operator Splitting

凸最適化 ⇔ ^{方程式のゼロ点}

.

最小化問題

.

. . .

.

min

f (x ) + g(x ).

.

方程式のゼロ点

.

. . .

.

Find x such that (T

+ T

)(x ) ∋

where

T

= ∂f,

T

= ∂g (

劣微分作用素

).

例

(Prox

作用素

):

x = prox

(z ) = argmin

x^′∈Rⁿ

µ

f (x

^′

) + 1

2 ∥ x

^′

− z ∥

¶

⇔ T

(x ) + (x − z ) ∋

⇔ prox

(z) = (I + T

)

⁻¹

(z )

冨岡亮太(東大) 52 / 66

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔

ﬁnd

x ηT_f

(x ) +

ηT_g

(x )

∋0

⇔

ﬁnd

x x

+

ηT_f

(x)

∋x−ηT_g

(x )

⇔

ﬁnd

(I +

ηT_f

)(x ) = (x

−ηT_g

(x ))

⇔

ﬁnd

x x

= prox

_ηf

(x

−ηTg

(x))

反復式

:

⇓

x^t+1

= prox

_η_t_f

(x

^t −η_t∇g(x^t

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔ ﬁnd x ηT

(x ) + ηT

(x ) ∋

⇔

ﬁnd

x x

+

ηT_f

(x)

∋x−ηT_g

(x )

⇔

ﬁnd

(I +

ηT_f

)(x ) = (x

−ηT_g

(x ))

⇔

ﬁnd

x x

= prox

_ηf

(x

−ηTg

(x))

反復式

:

⇓

x^t+1

= prox

_η_t_f

(x

^t −η_t∇g(x^t

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

冨岡亮太(東大) 53 / 66

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔ ﬁnd x ηT

(x ) + ηT

(x ) ∋

⇔ ﬁnd x x + ηT

(x) ∋ x − ηT

(x )

⇔

ﬁnd

(I +

ηT_f

)(x ) = (x

−ηT_g

(x ))

⇔

ﬁnd

x x

= prox

_ηf

(x

−ηTg

(x))

反復式

:

⇓

x^t+1

= prox

_η_t_f

(x

^t −η_t∇g(x^t

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔ ﬁnd x ηT

(x ) + ηT

(x ) ∋

⇔ ﬁnd x x + ηT

(x) ∋ x − ηT

(x )

⇔ ﬁnd x (I + ηT

)(x ) = (x − ηT

(x ))

⇔

ﬁnd

x x

= prox

_ηf

(x

−ηTg

(x))

反復式

:

⇓

x^t+1

= prox

_η_t_f

(x

^t −η_t∇g(x^t

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

冨岡亮太(東大) 53 / 66

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔ ﬁnd x ηT

(x ) + ηT

(x ) ∋

⇔ ﬁnd x x + ηT

(x) ∋ x − ηT

(x )

⇔ ﬁnd x (I + ηT

)(x ) = (x − ηT

(x ))

⇔ ﬁnd x x = prox

_ηf

(x − ηT

(x))

反復式

:

⇓

x^t+1

= prox

_η_t_f

(x

^t −η_t∇g(x^t

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔ ﬁnd x ηT

(x ) + ηT

(x ) ∋

⇔ ﬁnd x x + ηT

(x) ∋ x − ηT

(x )

⇔ ﬁnd x (I + ηT

)(x ) = (x − ηT

(x ))

⇔ ﬁnd x x = prox

_ηf

(x − ηT

(x))

反復式

: ⇓

x

^t+1

= prox

_η_t_f

(x

− η

∇ g(x

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

冨岡亮太(東大) 53 / 66

. . . . . .

手法3: Operator Splitting

Forward-Backward Splitting ( ⇔ IST)

(Lions & Mercier 76)

minimize

f (x) + g(x )

⇔ ﬁnd x T

(x ) + T

(x) ∋

⇔ ﬁnd x ηT

(x ) + ηT

(x ) ∋

⇔ ﬁnd x x + ηT

(x) ∋ x − ηT

(x )

⇔ ﬁnd x (I + ηT

)(x ) = (x − ηT

(x ))

⇔ ﬁnd x x = prox

_ηf

(x − ηT

(x))

反復式

: ⇓

x

^t+1

= prox

_η_t_f

(x

− η

∇ g(x

))

Douglas Rachford Splitting

も同様の方法で導出できる（複雑すぎるので省略）

. . . . . .

まとめ

Outline

.

. .

₁

イントロ

.

. .

2 準備

.

. .

3 手法

1: Prox

作用素

.

. .

4 手法

2: Legendre

_変換

.

. .

5 手法

3: Operator Splitting

. .

6 まとめ

.

. .

Appendix

冨岡亮太(東大) 54 / 66

. . . . . .

まとめ

メッセージ

ブラックボックス最適化から中身を考慮した最適化へ微分不可能でも怖くない

Prox

作用素や凸共役を計算しよう

. . . . . .

Appendix

Outline

.

. .

₁

イントロ

.

. .

2 準備

.

. .

3 手法

1: Prox

作用素

.

. .

4 手法

2: Legendre

_変換

.

. .

5 手法

3: Operator Splitting

.

. .

6 まとめ

. .

7 Appendix

冨岡亮太(東大) 56 / 66

. . . . . .

Appendix

凸共役の性質 (contd.)

凸共役（

Legendre

変換）と畳み込み

: (f + g)

^∗

(y) = inf

(f

^∗

(y −

α) +

g

^∗

(α)) .

最小化と凸共役

:

inf

x f

(x ) =

−

sup

(〈

0,x〉 −f

(x)) =

−f^∗

(0).

Fenchel

双対

:

inf

(f (x) +

g(x

)) =

−

(f +

g)^∗

(0) = sup

(

−f^∗

(

−α)−g^∗

(α))

. . . . . .

Appendix

凸共役の性質 (contd.)

凸共役（

Legendre

変換）と畳み込み

: (f + g)

^∗

(y) = inf

(f

^∗

(y −

α) +

g

^∗

(α)) .

最小化と凸共役

:

inf

f (x ) = − sup

(〈

x 〉 − f (x)) = − f

^∗

(0).

Fenchel

双対

:

inf

(f (x) +

g(x

)) =

−

(f +

g)^∗

(0) = sup

(

−f^∗

(

−α)−g^∗

(α))

冨岡亮太(東大) 57 / 66

. . . . . .

Appendix

凸共役の性質 (contd.)

凸共役（

Legendre

変換）と畳み込み

: (f + g)

^∗

(y) = inf

(f

^∗

(y −

α) +

g

^∗

(α)) .

最小化と凸共役

:

inf

f (x ) = − sup

(〈

x 〉 − f (x)) = − f

^∗

(0).

Fenchel

双対

:

inf

(f (x) + g(x )) = − (f + g)

^∗

(0) = sup

( − f

^∗

( −

α)

− g

^∗

(α)) .

. . . . . .

Appendix

Fenchel 双対定理

inf

(f (Ax ) + g(x )) = sup

³ − f

^∗

( −

α)

− g

^∗

(A

^⊤α)

´ .

等式制約付き最適化問題

minimize

x∈Rⁿ,z∈R^m

f (z ) + g(x), subject to Ax = z

の双対問題として導出できる．

(http://www.ibis.t.u-tokyo.ac.jp/RyotaTomioka/Notes/DerivingDual)

凸共役の一覧表があれば機械的に双対問題を作れる点が美味しい．

冨岡亮太(東大) 58 / 66

. . . . . .

Appendix

Fenchel 双対の例 (1): SVM

.

主問題

.

. . .

.

min

f (y ◦ X w) + φ

_λ

(x)

 

 

 



f (z) = X

m i=1

max(0, 1 − z

), φ

(x ) = λ

2 ∥ x ∥

.

双対問題

.

. . .

.

max

− f

^∗

( −

α)

− φ

^∗_λ

(X

^⊤

(α ◦ y ))

 

 

 



f

^∗

(−α) = (P

i=1

− α

(0 ≤

≤ 1),

+∞ (oterwise), φ

^∗_λ

(v ) = 1

2λ ∥ v ∥

.

冨岡亮太(東大) 59 / 66

. . . . . .

Appendix

Fenchel双対の例(2):正則化付きロジスティック回帰(Jaakkola & Haussler 99)

.

主問題

.

. . .

.

min

f (y ◦ X w) + φ

_λ

(x)

 

 

 



f (z) = X

m i=1

log(1 + exp( − z

)), φ

_λ

(x ) = λ

2 ∥ x ∥

.

双対問題

.

. . .

.

max

− f

^∗

( −

α)

− φ

^∗_λ

(X

^⊤

(α ◦ y))

 

 

 

 



f

^∗

( −

α) =

X

i=1

α

log(α

)

+(1 − α

) log(1 − α

), φ

^∗_λ

(v ) = 1

2λ ∥ v ∥

,

(a) primal losses

O 1

Hinge Loss Logistic Loss

(b) dual losses

O 1

Hinge Loss Logistic Loss

冨岡亮太(東大) 60 / 66

. . . . . .

Appendix

Fenchel 双対の例 (3): 線形計画

.

主問題

.

. . .

.

(P) min c

^⊤

x,

s.t. Ax = b, x ≥ 0.

⇕

.

主問題

’

.

. . .

.

(P

^′

) min

f (Ax ) + g(x)

 

 

 

 



f(z) =

( 0 (z = b), + ∞ (otherwise), g(x ) =

( c

^⊤

x (x ≥ 0), + ∞ (otehrwise).

.

双対問題

.

. . .

.

(D) max b

^⊤

y, s.t. A

^⊤

y ≤ c.

⇕

.

双対問題’

.

. . .

.

(D

^′

) max

− f ( − y) − g(A

^⊤

y)

 



 

f

^∗

(y ) = b

^⊤

y , g

^∗

(α) =

(

0 (α ≤ c), +∞ (oterwise).

冨岡亮太(東大) 61 / 66

. . . . . .

Appendix

凸じゃない最適化は？

一般的に対処するのは困難

基本的な方針は凸最適化を繰り返し解くこと EM

Difference of Convex (DC) programming CCCP

冨岡亮太(東大) 62 / 66

. . . . . .

Appendix

文献

全体的なまとめ

Tomioka, Suzuki, & Sugiyama (2011) Augmented Lagrangian Methods for Learning, Selecting, and Combining Features. In Sra, Nowozin, Wright., editors,Optimization for Machine Learning, MIT Press.

Combettes & Pesquet (2010) Proximal splitting methods in signal processing. InFixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag.

Boyd, Parikh, Peleato, & Eckstein (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers.

教科書

Rockafellar (1970) Convex Analysis. Princeton University Press.

Bertsekas (1999) Nonlinear Programming. Athena Scientiﬁc.

Nesterov (2003) Introductory Lectures on Convex Optimization: A Basic Course. Springer.

. . . . . .

Appendix

文献

Proximal Point Algorithm/Augmented Lagrangian

Arrow, Hurwicz, & Uzawa (1958) Studies in Linear and Non-Linear Programming. Stanford University Press.

Moreau (1965) Proximité et dualité dans un espace Hilbertien. Bul letin de la S. M. F.

Rockafellar (1976) Monotone operators and the proximal point algorithm.

SIAM J Control Optim 14, 877–898.

Bertsekas (1982) Constrained Optimization and Lagrange Multiplier Methods. Academic Press.

Tomioka, Suzuki, & Sugiyama (2011) Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning. Arxiv:0911.4046.

IST/FISTA

Nesterov (2007) Gradient Methods for Minimizing Composite Objective Function.

Beck & Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J Imag Sci 2, 183–202.

冨岡亮太(東大) 64 / 66

. . . . . .

Appendix

文献

Operator Splitting

Lions & Mercier (1976) Splitting Algorithms for the Sum of Two Nonlinear Operators. SIAM J Numer Anal 16, 964–979.

Gabay & Mercier (1976) A dual algorithm for the solution of nonlinear variational problems via ﬁnite element approximation. Comput Math Appl 2, 17–40.

Eckstein & Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators.

マルチタスク／マルチカーネル

Evgeniou, Micchelli, & Pontil (2005) Learning Multiple Tasks with Kernel Methods. JMLR 6, 615–637.

Bach, Thibaux, & Jordan (2005) Computing regularization paths for learning multiple kernels. Advances in NIPS, 73–80.

Suzuki & Tomioka (2009) SpicyMKL. Arxiv:0909.5026.

ドキュメント内 2011/1/20: 機械学習における連続最適化の新しいトレンド (ページ 52-77)

計算速度

⇓ ラグランジアン :

テンソル結果 2: 計算速度

(Tomioka et al. 10)

Outline

.

. .

.

. .

.

. .

1: Prox

.

. .

2: Legendre

.

. .

.

. .

Appendix

凸最適化 ⇔ 方程式のゼロ点

.

.

.

.

. . .

.

.

min

f (x ) + g(x ).

.

.

.

.

. . .

.

.

Find x such that (T

+ T

)(x ) ∋

where

T

= ∂f,

T

= ∂g (

).

(Prox

):

= prox

(z ) = argmin

(x

) + 1

2

(x ) + (x

)

prox

(z) = (I +

)

(z )

凸最適化 ⇔ 方程式のゼロ点

.

.

.

.

. . .

.

.

min

f (x ) + g(x ).

.

.

.

.

. . .

.

.

Find x such that (T

+ T

)(x ) ∋

where

凸最適化 ⇔ ^{方程式のゼロ点}

凸最適化 ⇔ ^{方程式のゼロ点}