• 検索結果がありません。

Stochastic Optimization: First order method

N/A
N/A
Protected

Academic year: 2021

シェア "Stochastic Optimization: First order method"

Copied!
21
0
0

読み込み中.... (全文を見る)

全文

(1)

Stochastic Optimization:

First order method

† ‡ Taiji Suzuki

Tokyo Institute of Technology

Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences

JST, PRESTO

Intensive course @ Nagoya University

(2)

Outline

1 First order method

Proximal gradient descent

Nesterov’s acceleration and optimal convergence

2 / 16

(3)

Outline

1 First order method

Proximal gradient descent

Nesterov’s acceleration and optimal convergence

(4)

Regularized learning problem

Lasso:

xmin∈Rp

1 n

n i=1

(yi −zix)2+ ∥x∥|{z}1 regularization

.

General regularized learning problem:

xmin∈Rp

1 n

n i=1

ℓ(zi,x) + ψ(x).

Difficulty: Sparsity inducing regularization is usuallynon-smooth.

4 / 16

(5)

Regularized learning problem

Lasso:

xmin∈Rp

1 n

n i=1

(yi −zix)2+ ∥x∥|{z}1 regularization

.

General regularized learning problem:

xmin∈Rp

1 n

n i=1

ℓ(zi,x) + ψ(x).

Difficulty: Sparsity inducing regularization is usuallynon-smooth.

(6)

First order optimization

x

t-1

x

t

x

t+1

Optimization methods that use onlythe function value f(x) and the first order gradientg ∈∂f(x).

Computation per iteration is light, and suited for high dimensional problems.

Newton method is a second order method.

5 / 16

(7)

Outline

1 First order method

Proximal gradient descent

Nesterov’s acceleration and optimal convergence

(8)

Gradient descent

Letf(x) =n

i=1ℓ(zi,x).

minx f(x).

Subgradient method

Differentiable f(x):

xt =xt1ηtf(xt1).

7 / 16

(9)

Gradient descent

Letf(x) =n

i=1ℓ(zi,x).

minx f(x).

Subgradient method

Subdifferentiable f(x):

gt∂f(xt1), xt =xt1ηtgt.

(10)

Gradient descent

Letf(x) =n

i=1ℓ(zi,x).

minx f(x).

Subgradient method (equivalent formula)

Subdifferentiable f(x):

xt=argmin

x

{

x,gt+ 1

txxt12 }

,

where gt ∂f(xt1).

Proximal point algorithm:

xt=argmin

x

{

f(x)+ 1

txxt12 }

.

f(xt)optimum for any convexf andηt =η >0 (?).

Iff(x) is strongly convex: f(xt)f(x) 1 (

1 1+ση

)t1

x0x2.

7 / 16

(11)

Proximal gradient descent

Let f(x) =∑n

i=1ℓ(zi,x).

minx f(x) +ψ(x).

Proximal gradient descent xt=argmin

x

{

⟨x,gt+ψ(x)+ 1

t∥x−xt12 }

=argmin

x

{

ηtψ(x) +1

2∥x−(xt1−ηtgt)2 }

where gt ∈∂f(xt1).

The update rule is given by proximal mapping:

(12)

Example

L1 regularization: ψ(x) =C∥x∥1.

xt,j =STCηt(xt1,j −ηtgt,j) (j-th component) where

STC(q) =sign(q) max{|q| −C,0}.

Unimportant elements are forced to be 0.

For many practically used regularizations, analytic form is obtained.

9 / 16

(13)

Example of proximal mapping (cont.)

Trace norm: ψ(X) =C∥X∥tr=C

jσj(X) (sum of singular values).

Let

Xt1−ηtGt =Udiag(σ1, . . . , σd)V, then

Xt =U



STt1) . ..

STCηd)

V.

(14)

Convergence of proximal gradient descent

Strong convexity andsmoothnessof f determines the convergence rate.

xt =prox(xt1−ηtgttψ(x)).

property of f µ-Strongly convex non-strongly conv

γ-Smooth exp

(

−tµ γ

) γ

t

Non-smooth 1

µt

1 t The step size ηt should be appropriately chosen.

Setting ofηt Strongly conv non-strongly conv

Smooth γ1 γ1

Non-smooth µt2 1

t

To achieve this convergence rate, we need to take an average of {xt}t

appropriately;Polyak-Ruppert averaging, polynomially decaying averaging.

11 / 16

(15)

Convergence of proximal gradient descent

Strong convexity andsmoothnessof f determines the convergence rate.

xt =prox(xt1−ηtgttψ(x)).

property of f µ-Strongly convex non-strongly conv

γ-Smooth exp

(

−t

µ γ

) γ

t2

Non-smooth 1

µt

1 t The step size ηt should be appropriately chosen.

Setting ofηt Strongly conv non-strongly conv

Smooth γ1 γ1

Non-smooth µt2 1

t {x }

(16)

Outline

1 First order method

Proximal gradient descent

Nesterov’s acceleration and optimal convergence

12 / 16

(17)

Nesterov’s acceleration (non-strongly convex)

minx{f(x) +ψ(x)}

Assumption: f(x) is γ-smooth.

Nesterov’s acceleration scheme

Let s1 = 1 andη= γ1, and iterate the following fort = 1,2, . . .

1 Let gt ∈∂f(yt), and update xt =prox(yt−ηgt|ηψ).

2 Set st+1 = 1+

1+4st2

2 .

3 Update yt+1 =xt+ (st−1

st+1

)

(xt−xt−1).

Iff is γ-smooth, then

f(xt)−f(x)∥xt−x t2 .

(18)

Nesterov’s acceleration (strongly convex)

minx{f(x) +ψ(x)}

Assumption: f(x) isγ-smooth andµ-strongly convex. (it must beγ > µ)

Nesterov’s acceleration scheme

Let A1= 1, α1 =γ/µ andη = 1γ, and iterate the following for t = 1,2, . . .

1 Let gt ∈∂f(yt), and update xt=prox(yt−ηgt|ηψ).

2 Set αt+1>1 so that (γ−µ)α2t+1(2γ+Att+1+γ = 0, and let At+1 =Att+1.

3 Update yt+1 =xt+

( µ+At

µ)(αt+11)(αt1)

)

(xt−xt1).

Iff is γ-smooth and µ-strongly convex, then f(xt)−f(x)≤γ

( 1

γ µ

)t

∥x0−x2.

14 / 16

(19)
(20)

Iteration

0 20 40 60 80 100 120

Relative objective (f(xt) - f*)

10-8 10-6 10-4 10-2 100 102 104

Normal Nesterov

Nesterov’s acceleration v.s. normal gradient descent Lasso: n = 8,000,p = 500.

16 / 16

(21)

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):

183–202, 2009.

O. G¨uler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):

403–419, 1991.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139–1147, 2013.

参照

関連したドキュメント

Department of Cardiovascular and Internal Medicine, Kanazawa University Graduate School of Medicine, Kanazawa (N.F., T.Y., M. Kawashiri, K.H., M.Y.); Department of Pediatrics,

*2 Kanazawa University, Institute of Science and Engineering, Faculty of Geosciences and civil Engineering, Associate Professor. *3 Kanazawa University, Graduate School of

[4] , Recent applications of fractional calculus to science and engineering, International Journal of Mathematics and Mathematical Sciences 2003 (2003), no.. Bhatta, Solutions to

* Department of Mathematical Science, School of Fundamental Science and Engineering, Waseda University, 3‐4‐1 Okubo, Shinjuku, Tokyo 169‐8555, Japan... \mathrm{e}

We use subfunctions and superfunctions to derive su ffi cient conditions for the existence of extremal solutions to initial value problems for ordinary differential equations

Arnold This paper deals with recent applications of fractional calculus to dynamical sys- tems in control theory, electrical circuits with fractance, generalized voltage di-

Arnold This paper deals with recent applications of fractional calculus to dynamical sys- tems in control theory, electrical circuits with fractance, generalized voltage di-

† Institute of Computer Science, Czech Academy of Sciences, Prague, and School of Business Administration, Anglo-American University, Prague, Czech