Stochastic Optimization: First order method

Taiji Suzuki †‡

†Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Department of Mathematical and Computing Sciences

‡JST, PRESTO

Intensive course @ Nagoya University

(2)

Outline

1 First order method

Proximal gradient descent

Nesterov’s acceleration and optimal convergence

(4)

Regularized learning problem

Lasso:

$$\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - z_i^\top x)^2 + \underbrace{\|x\|_1}_{\text{regularization}}.$$

General regularized learning problem:

$$\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(z_i, x) + \psi(x).$$

Difficulty: Sparsity-inducing regularization is usually non-smooth.
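A minimal sketch of how the two objectives can be evaluated numerically, assuming the samples $z_i$ are stored as the rows of a NumPy array. The function names `lasso_objective` and `regularized_objective` are hypothetical and chosen only for illustration.

```python
import numpy as np

def lasso_objective(x, Z, y):
    """(1/n) * sum_i (y_i - z_i^T x)^2 + ||x||_1, where the rows of Z are the z_i."""
    residual = y - Z @ x
    return np.mean(residual ** 2) + np.sum(np.abs(x))

def regularized_objective(x, Z, loss, psi):
    """(1/n) * sum_i loss(z_i, x) + psi(x) for user-supplied callables loss and psi."""
    return np.mean([loss(z, x) for z in Z]) + psi(x)

# Tiny illustrative data set (made up for this example).
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
x = np.array([1.0, 0.0])
value = lasso_objective(x, Z, y)  # residuals (0, 2): mean square 2, plus ||x||_1 = 1
```

The L1 term is what makes the objective non-smooth: `np.abs` has a kink at zero, so a plain gradient is not available at sparse points.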

(6)

First order optimization

[Figure: successive iterates $x_{t-1}$, $x_t$, $x_{t+1}$.]

Optimization methods that use only the function value $f(x)$ and a first-order (sub)gradient $g \in \partial f(x)$.

The computation per iteration is light, which makes these methods well suited to high-dimensional problems.

By contrast, Newton's method is a second-order method.

(8)

Gradient descent

Let $f(x) = \sum_{i=1}^n \ell(z_i, x)$ and consider

$$\min_x f(x).$$

Subgradient method

Differentiable $f(x)$:

$$x_t = x_{t-1} - \eta_t \nabla f(x_{t-1}).$$
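The gradient update can be sketched in a few lines. This example assumes a constant step size $\eta = 1/\gamma$, where $\gamma$ is the largest eigenvalue of the Hessian, and uses a small made-up least-squares problem; the name `gradient_descent` and the data are illustrative only.

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_iters):
    """Iterate x_t = x_{t-1} - eta * grad(x_{t-1}) with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad(x)
    return x

# f(x) = sum_i (y_i - z_i^T x)^2 with the z_i as rows of Z (illustrative data).
Z = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 2.0])
grad_f = lambda x: 2.0 * Z.T @ (Z @ x - y)

# The Hessian of f is 2 Z^T Z, so its largest eigenvalue is a smoothness constant.
gamma = 2.0 * np.linalg.eigvalsh(Z.T @ Z).max()
x_hat = gradient_descent(grad_f, np.zeros(2), 1.0 / gamma, 500)
```

The data are chosen so that $Zx = y$ has the exact solution $x = (1, 1)$, which makes it easy to check that the iterates converge.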

Subdifferentiable $f(x)$:

$$g_t \in \partial f(x_{t-1}), \quad x_t = x_{t-1} - \eta_t g_t.$$
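As an illustration of the subgradient update on a non-differentiable objective, the sketch below minimizes $f(x) = \sum_i |x - z_i|$ over a scalar $x$ (the minimizer is the median of the $z_i$). The step size $\eta_t = 1/\sqrt{t}$, the choice to track the best iterate, and all names are assumptions made for this example.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, num_iters):
    """Iterate x_t = x_{t-1} - step(t) * g_t with g_t a subgradient at x_{t-1}.

    The objective need not decrease monotonically, so the best iterate is tracked.
    """
    x = float(x0)
    best_x, best_f = x, f(x)
    for t in range(1, num_iters + 1):
        x = x - step(t) * subgrad(x)
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

z = np.array([0.0, 1.0, 5.0])                # illustrative data; the median is 1
f = lambda x: np.sum(np.abs(x - z))
subgrad = lambda x: np.sum(np.sign(x - z))   # sign(0) = 0 is a valid subgradient of |.| at 0
best_x, best_f = subgradient_method(f, subgrad, 4.0, lambda t: 1.0 / np.sqrt(t), 2000)
```

With a decaying step size the iterates oscillate around the minimizer with shrinking amplitude instead of settling exactly, which is typical behaviour for non-smooth objectives.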

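Subgradient method (equivalent formula)

Subdifferentiable $f(x)$:

$$x_t = \operatorname*{argmin}_x \left\{ \langle x, g_t \rangle + \frac{1}{2\eta_t} \|x - x_{t-1}\|^2 \right\},$$

where $g_t \in \partial f(x_{t-1})$.

Proximal point algorithm:

$$x_t = \operatorname*{argmin}_x \left\{ f(x) + \frac{1}{2\eta_t} \|x - x_{t-1}\|^2 \right\}.$$

$f(x_t) \to$ optimum for any convex $f$ and $\eta_t = \eta > 0$ (Güler, 1991).

If $f(x)$ is $\sigma$-strongly convex:

$$f(x_t) - f(x^*) \le \frac{1}{2\eta} \left( \frac{1}{1 + \sigma\eta} \right)^{t-1} \|x_0 - x^*\|^2.$$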
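To make the proximal point update concrete, here is a small sketch for the scalar quadratic $f(x) = \frac{\sigma}{2} x^2$, chosen only because its proximal step has the closed form $x_t = x_{t-1}/(1 + \sigma\eta)$ (set the derivative $\sigma x + (x - x_{t-1})/\eta$ to zero). The function name and the parameter values are illustrative assumptions.

```python
def proximal_point_quadratic(x0, sigma, eta, num_iters):
    """Proximal point iterates for f(x) = (sigma / 2) * x**2 with a constant eta.

    Each step solves argmin_x { f(x) + (x - x_prev)**2 / (2 * eta) } in closed form.
    """
    xs = [float(x0)]
    for _ in range(num_iters):
        xs.append(xs[-1] / (1.0 + sigma * eta))
    return xs

sigma, eta, x0 = 2.0, 0.5, 3.0
xs = proximal_point_quadratic(x0, sigma, eta, 10)

# Compare f(x_t) - f(x*) (with x* = 0, f(x*) = 0) against the strongly convex bound.
gaps = [0.5 * sigma * x ** 2 for x in xs[1:]]
bounds = [(1.0 / (2.0 * eta)) * (1.0 / (1.0 + sigma * eta)) ** (t - 1) * x0 ** 2
          for t in range(1, 11)]
```

For this toy problem the actual gap shrinks faster than the bound, which is consistent with the bound being a worst-case guarantee.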

(11)

Proximal gradient descent

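Let $f(x) = \sum_{i=1}^n \ell(z_i, x)$ and consider

$$\min_x f(x) + \psi(x).$$

Proximal gradient descent:

$$x_t = \operatorname*{argmin}_x \left\{ \langle x, g_t \rangle + \psi(x) + \frac{1}{2\eta_t} \|x - x_{t-1}\|^2 \right\} = \operatorname*{argmin}_x \left\{ \eta_t \psi(x) + \frac{1}{2} \|x - (x_{t-1} - \eta_t g_t)\|^2 \right\},$$

where $g_t \in \partial f(x_{t-1})$.

The update rule is given by the proximal mapping:

$$\mathrm{prox}(q \mid \psi) := \operatorname*{argmin}_x \left\{ \psi(x) + \frac{1}{2} \|x - q\|^2 \right\}, \qquad x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t \psi).$$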
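The equality of the two argmin expressions for proximal gradient descent follows from completing the square. A short derivation, writing $\eta = \eta_t$ and $g = g_t$ for brevity:

```latex
\begin{align*}
\frac{1}{2}\|x - (x_{t-1} - \eta g)\|^2
  &= \frac{1}{2}\|x - x_{t-1}\|^2 + \eta \langle x - x_{t-1}, g \rangle + \frac{\eta^2}{2}\|g\|^2, \\
\langle x, g \rangle + \psi(x) + \frac{1}{2\eta}\|x - x_{t-1}\|^2
  &= \frac{1}{\eta}\Big\{ \eta \psi(x) + \frac{1}{2}\|x - (x_{t-1} - \eta g)\|^2 \Big\}
     + \langle x_{t-1}, g \rangle - \frac{\eta}{2}\|g\|^2 .
\end{align*}
```

The last two terms do not depend on $x$ and the factor $1/\eta$ is positive, so both objectives have the same minimizer.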

(12)

Example

$L_1$ regularization: $\psi(x) = C\|x\|_1$.

$$x_{t,j} = \mathrm{ST}_{C\eta_t}(x_{t-1,j} - \eta_t g_{t,j}) \quad (j\text{-th component}),$$

where

$$\mathrm{ST}_C(q) = \mathrm{sign}(q) \max\{|q| - C, 0\}.$$

→ Unimportant elements are forced to be 0.

For many regularizers used in practice, the proximal mapping has an analytic form.
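The soft-thresholding update can be sketched as follows. The example assumes a constant step size and uses the toy smooth term $f(x) = \frac{1}{2}\|x - b\|^2$, for which a single step with $\eta = 1$ already returns the exact minimizer $\mathrm{ST}_C(b)$; the names `soft_threshold` and `proximal_gradient_l1` are illustrative.

```python
import numpy as np

def soft_threshold(q, c):
    """Componentwise ST_c(q) = sign(q) * max(|q| - c, 0)."""
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def proximal_gradient_l1(grad, x0, eta, C, num_iters):
    """Proximal gradient descent for f(x) + C * ||x||_1 with a constant step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = soft_threshold(x - eta * grad(x), C * eta)
    return x

# Toy problem: f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b and f is 1-smooth.
b = np.array([3.0, -0.5, 1.2])
x_hat = proximal_gradient_l1(lambda x: x - b, np.zeros(3), eta=1.0, C=1.0, num_iters=50)
```

The second coordinate of `b` has magnitude below the threshold and is set exactly to zero, which is the sparsity effect described above.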

(13)

Example of proximal mapping (cont.)

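Trace norm: $\psi(X) = C\|X\|_{\mathrm{tr}} = C \sum_j \sigma_j(X)$ (sum of singular values).

Let $X_{t-1} - \eta_t G_t = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_d)\, V$, then

$$X_t = U \begin{pmatrix} \mathrm{ST}_{C\eta_t}(\sigma_1) & & \\ & \ddots & \\ & & \mathrm{ST}_{C\eta_t}(\sigma_d) \end{pmatrix} V.$$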
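A minimal sketch of this singular value thresholding step using NumPy's SVD. The function name `trace_norm_prox` and the argument `tau` (standing for $C\eta_t$) are assumptions for illustration; the array `Vh` returned by `np.linalg.svd` plays the role of $V$ in the factorization above.

```python
import numpy as np

def trace_norm_prox(M, tau):
    """Soft-threshold the singular values of M by tau and rebuild the matrix."""
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    # Singular values are nonnegative, so ST_tau(s) reduces to max(s - tau, 0).
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vh

M = np.array([[3.0, 0.0], [0.0, 1.0]])
X = trace_norm_prox(M, 1.5)  # singular values (3, 1) become (1.5, 0)
```

Small singular values are set to zero, so the result tends to be low rank, in the same way that soft-thresholding makes vectors sparse.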

(14)

Convergence of proximal gradient descent

Strong convexity and smoothness of $f$ determine the convergence rate.

$$x_t = \mathrm{prox}(x_{t-1} - \eta_t g_t \mid \eta_t \psi).$$

| Property of $f$ | $\mu$-strongly convex | Non-strongly convex |
|---|---|---|
| $\gamma$-smooth | $\exp(-t\mu/\gamma)$ | $\gamma/t$ |
| Non-smooth | $1/(\mu t)$ | $1/\sqrt{t}$ |

The step size $\eta_t$ should be chosen appropriately.

| Setting of $\eta_t$ | Strongly convex | Non-strongly convex |
|---|---|---|
| Smooth | $1/\gamma$ | $1/\gamma$ |
| Non-smooth | $2/(\mu t)$ | $1/\sqrt{t}$ |

To achieve this convergence rate, we need to take an appropriate average of the iterates $\{x_t\}_t$, such as Polyak-Ruppert averaging or polynomially decaying averaging.
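The sketch below illustrates the non-smooth, non-strongly convex setting with $\eta_t = 1/\sqrt{t}$ and Polyak-Ruppert averaging, taken here to mean the uniform average of the iterates $x_1, \dots, x_T$. The toy objective $|x - 2| + 0.5|x|$ (minimized at $x = 2$) and all names are assumptions made for this example.

```python
import numpy as np

def averaged_proximal_subgradient(subgrad, prox, x0, num_iters):
    """Proximal subgradient steps with eta_t = 1/sqrt(t) and a running uniform average."""
    x = np.asarray(x0, dtype=float)
    x_bar = np.zeros_like(x)
    for t in range(1, num_iters + 1):
        eta = 1.0 / np.sqrt(t)
        x = prox(x - eta * subgrad(x), eta)
        x_bar = x_bar + (x - x_bar) / t   # incremental form of the mean of x_1..x_t
    return x, x_bar

# f(x) = |x - 2| is non-smooth; psi(x) = 0.5 * |x| is handled by soft-thresholding.
subgrad = lambda x: np.sign(x - 2.0)
prox = lambda q, eta: np.sign(q) * np.maximum(np.abs(q) - 0.5 * eta, 0.0)
x_last, x_bar = averaged_proximal_subgradient(subgrad, prox, 0.0, 5000)
```

The individual iterates keep oscillating around the minimizer at a scale of roughly $\eta_t$, while the average smooths this oscillation out.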

(15)

Convergence of proximal gradient descent (cont.)

For smooth $f$, the rates improve to the following, which Nesterov's acceleration attains. The non-smooth rates and the step-size settings are unchanged.

| Property of $f$ | $\mu$-strongly convex | Non-strongly convex |
|---|---|---|
| $\gamma$-smooth | $\exp\left(-t\sqrt{\mu/\gamma}\right)$ | $\gamma/t^2$ |
| Non-smooth | $1/(\mu t)$ | $1/\sqrt{t}$ |

(17)

Nesterov’s acceleration (non-strongly convex)

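$$\min_x \{ f(x) + \psi(x) \}$$

Assumption: $f(x)$ is $\gamma$-smooth.

Nesterov's acceleration scheme

Let $s_1 = 1$ and $\eta = \frac{1}{\gamma}$, and iterate the following for $t = 1, 2, \dots$

1 Let $g_t \in \partial f(y_t)$, and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.

2 Set $s_{t+1} = \frac{1 + \sqrt{1 + 4 s_t^2}}{2}$.

3 Update $y_{t+1} = x_t + \left( \frac{s_t - 1}{s_{t+1}} \right) (x_t - x_{t-1})$.

If $f$ is $\gamma$-smooth, then

$$f(x_t) - f(x^*) \le \frac{2\gamma \|x_0 - x^*\|^2}{t^2}.$$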
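The three steps of the scheme can be sketched as below. The initialization $y_1 = x_0$ is an assumption (the slide does not state it), and the L1-regularized least-squares test problem, the value $C = 0.1$, and all function names are illustrative only.

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def nesterov_proximal_gradient(grad, prox, x0, gamma, num_iters):
    """Accelerated proximal gradient with eta = 1/gamma and s_1 = 1."""
    eta = 1.0 / gamma
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()                      # assumption: y_1 = x_0
    s = 1.0
    for _ in range(num_iters):
        x = prox(y - eta * grad(y), eta)                        # step 1
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s ** 2)) / 2.0      # step 2
        y = x + ((s - 1.0) / s_next) * (x - x_prev)             # step 3
        x_prev, s = x, s_next
    return x_prev

# f(x) = 0.5 * ||Z x - y_obs||^2 and psi(x) = C * ||x||_1 (illustrative data).
Z = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
y_obs = np.array([1.0, 2.0, 2.0])
C = 0.1
grad_f = lambda x: Z.T @ (Z @ x - y_obs)
prox_l1 = lambda q, eta: soft_threshold(q, C * eta)
gamma = np.linalg.eigvalsh(Z.T @ Z).max()  # smoothness constant of f
x_hat = nesterov_proximal_gradient(grad_f, prox_l1, np.zeros(2), gamma, 2000)
```

A minimizer is a fixed point of the proximal gradient map, so one way to check the result is to confirm that `prox_l1(x_hat - grad_f(x_hat) / gamma, 1 / gamma)` is close to `x_hat`.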

(18)

Nesterov’s acceleration (strongly convex)

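$$\min_x \{ f(x) + \psi(x) \}$$

Assumption: $f(x)$ is $\gamma$-smooth and $\mu$-strongly convex (we must have $\gamma > \mu$).

Nesterov's acceleration scheme

Let $A_1 = 1$, $\alpha_1 = \gamma/\mu$ and $\eta = \frac{1}{\gamma}$, and iterate the following for $t = 1, 2, \dots$

1 Let $g_t \in \partial f(y_t)$, and update $x_t = \mathrm{prox}(y_t - \eta g_t \mid \eta\psi)$.

2 Set $\alpha_{t+1} > 1$ so that $(\gamma - \mu)\alpha_{t+1}^2 - (2\gamma + A_t)\alpha_{t+1} + \gamma = 0$, and let $A_{t+1} = A_t / \alpha_{t+1}$.

3 Update $y_{t+1} = x_t + \left( \frac{\mu + A_t}{(\gamma - \mu)(\alpha_{t+1} - 1)(\alpha_t - 1)} \right) (x_t - x_{t-1})$.

If $f$ is $\gamma$-smooth and $\mu$-strongly convex, then

$$f(x_t) - f(x^*) \le \gamma \left( 1 - \sqrt{\frac{\mu}{\gamma}} \right)^t \|x_0 - x^*\|^2.$$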
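Step 2 of the scheme can be explored numerically. The sketch below covers only the $(\alpha_t, A_t)$ recursion, not the full algorithm, and the function name and the values $\gamma = 10$, $\mu = 1$ are assumptions for illustration. Since the quadratic takes the value $-(\mu + A_t) < 0$ at $\alpha = 1$, exactly one root exceeds 1, and it is the larger root.

```python
import math

def alpha_recursion(gamma, mu, num_iters):
    """Run step 2 repeatedly; returns [alpha_2, ..., alpha_{T+1}] and the final A."""
    A = 1.0  # A_1 = 1
    alphas = []
    for _ in range(num_iters):
        a, b, c = gamma - mu, -(2.0 * gamma + A), gamma
        # Larger root of a * alpha^2 + b * alpha + c = 0, which is the root above 1.
        alpha = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
        A = A / alpha
        alphas.append(alpha)
    return alphas, A

alphas, A_final = alpha_recursion(gamma=10.0, mu=1.0, num_iters=100)

# As A_t -> 0 the root tends to 1 / (1 - sqrt(mu / gamma)), whose reciprocal
# is the contraction factor appearing in the convergence bound.
limit = 1.0 / (1.0 - math.sqrt(1.0 / 10.0))
```

In this run $A_t$ decays geometrically and $\alpha_t$ settles at the limit, so $1/\alpha_t$ approaches $1 - \sqrt{\mu/\gamma}$.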

(20)

[Figure: Nesterov's acceleration vs. normal gradient descent on Lasso with n = 8,000, p = 500. Horizontal axis: iteration (0 to 120). Vertical axis: relative objective f(x_t) - f* on a log scale (10^-8 to 10^4). Curves: Normal, Nesterov.]

(21)

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403–419, 1991.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.