Stochastic Optimization:
First order method
† ‡ Taiji Suzuki
†Tokyo Institute of Technology
Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences
‡JST, PRESTO
Intensive course @ Nagoya University
Outline
1 First order method
Proximal gradient descent
Nesterov’s acceleration and optimal convergence
2 / 16
Outline
1 First order method
Proximal gradient descent
Nesterov’s acceleration and optimal convergence
Regularized learning problem
Lasso:
xmin∈Rp
1 n
∑n i=1
(yi −zi⊤x)2+ ∥x∥|{z}1 regularization
.
General regularized learning problem:
xmin∈Rp
1 n
∑n i=1
ℓ(zi,x) + ψ(x).
Difficulty: Sparsity inducing regularization is usuallynon-smooth.
4 / 16
Regularized learning problem
Lasso:
xmin∈Rp
1 n
∑n i=1
(yi −zi⊤x)2+ ∥x∥|{z}1 regularization
.
General regularized learning problem:
xmin∈Rp
1 n
∑n i=1
ℓ(zi,x) + ψ(x).
Difficulty: Sparsity inducing regularization is usuallynon-smooth.
First order optimization
x
t-1x
tx
t+1Optimization methods that use onlythe function value f(x) and the first order gradientg ∈∂f(x).
Computation per iteration is light, and suited for high dimensional problems.
Newton method is a second order method.
5 / 16
Outline
1 First order method
Proximal gradient descent
Nesterov’s acceleration and optimal convergence
Gradient descent
Letf(x) =∑n
i=1ℓ(zi,x).
minx f(x).
Subgradient method
Differentiable f(x):
xt =xt−1−ηt∇f(xt−1).
7 / 16
Gradient descent
Letf(x) =∑n
i=1ℓ(zi,x).
minx f(x).
Subgradient method
Subdifferentiable f(x):
gt∈∂f(xt−1), xt =xt−1−ηtgt.
Gradient descent
Letf(x) =∑n
i=1ℓ(zi,x).
minx f(x).
Subgradient method (equivalent formula)
Subdifferentiable f(x):
xt=argmin
x
{
⟨x,gt⟩+ 1
2ηt∥x−xt−1∥2 }
,
where gt ∈∂f(xt−1).
Proximal point algorithm:
xt=argmin
x
{
f(x)+ 1
2ηt∥x−xt−1∥2 }
.
f(xt)→optimum for any convexf andηt =η >0 (?).
Iff(x) is strongly convex: f(xt)−f(x∗)≤ 2η1 (
1 1+ση
)t−1
∥x0−x∗∥2.
7 / 16
Proximal gradient descent
Let f(x) =∑n
i=1ℓ(zi,x).
minx f(x) +ψ(x).
Proximal gradient descent xt=argmin
x
{
⟨x,gt⟩+ψ(x)+ 1
2ηt∥x−xt−1∥2 }
=argmin
x
{
ηtψ(x) +1
2∥x−(xt−1−ηtgt)∥2 }
where gt ∈∂f(xt−1).
The update rule is given by proximal mapping:
Example
L1 regularization: ψ(x) =C∥x∥1.
xt,j =STCηt(xt−1,j −ηtgt,j) (j-th component) where
STC(q) =sign(q) max{|q| −C,0}.
→ Unimportant elements are forced to be 0.
For many practically used regularizations, analytic form is obtained.
9 / 16
Example of proximal mapping (cont.)
Trace norm: ψ(X) =C∥X∥tr=C∑
jσj(X) (sum of singular values).
Let
Xt−1−ηtGt =Udiag(σ1, . . . , σd)V, then
Xt =U
STCηt(σ1) . ..
STCη(σd)
V.
Convergence of proximal gradient descent
Strong convexity andsmoothnessof f determines the convergence rate.
xt =prox(xt−1−ηtgt|ηtψ(x)).
property of f µ-Strongly convex non-strongly conv
γ-Smooth exp
(
−tµ γ
) γ
t
Non-smooth 1
µt
√1 t The step size ηt should be appropriately chosen.
Setting ofηt Strongly conv non-strongly conv
Smooth γ1 γ1
Non-smooth µt2 √1
t
To achieve this convergence rate, we need to take an average of {xt}t
appropriately;Polyak-Ruppert averaging, polynomially decaying averaging.
11 / 16
Convergence of proximal gradient descent
Strong convexity andsmoothnessof f determines the convergence rate.
xt =prox(xt−1−ηtgt|ηtψ(x)).
property of f µ-Strongly convex non-strongly conv
γ-Smooth exp
(
−t
√µ γ
) γ
t2
Non-smooth 1
µt
√1 t The step size ηt should be appropriately chosen.
Setting ofηt Strongly conv non-strongly conv
Smooth γ1 γ1
Non-smooth µt2 √1
t {x }
Outline
1 First order method
Proximal gradient descent
Nesterov’s acceleration and optimal convergence
12 / 16
Nesterov’s acceleration (non-strongly convex)
minx{f(x) +ψ(x)}
Assumption: f(x) is γ-smooth.
Nesterov’s acceleration scheme
Let s1 = 1 andη= γ1, and iterate the following fort = 1,2, . . .
1 Let gt ∈∂f(yt), and update xt =prox(yt−ηgt|ηψ).
2 Set st+1 = 1+
√1+4st2
2 .
3 Update yt+1 =xt+ (st−1
st+1
)
(xt−xt−1).
Iff is γ-smooth, then
f(xt)−f(x∗)≤ 2γ∥xt−x∗∥ t2 .
Nesterov’s acceleration (strongly convex)
minx{f(x) +ψ(x)}
Assumption: f(x) isγ-smooth andµ-strongly convex. (it must beγ > µ)
Nesterov’s acceleration scheme
Let A1= 1, α1 =γ/µ andη = 1γ, and iterate the following for t = 1,2, . . .
1 Let gt ∈∂f(yt), and update xt=prox(yt−ηgt|ηψ).
2 Set αt+1>1 so that (γ−µ)α2t+1−(2γ+At)αt+1+γ = 0, and let At+1 =At/αt+1.
3 Update yt+1 =xt+
( µ+At
(γ−µ)(αt+1−1)(αt−1)
)
(xt−xt−1).
Iff is γ-smooth and µ-strongly convex, then f(xt)−f(x∗)≤γ
( 1−
√γ µ
)t
∥x0−x∗∥2.
14 / 16
Iteration
0 20 40 60 80 100 120
Relative objective (f(xt) - f*)
10-8 10-6 10-4 10-2 100 102 104
Normal Nesterov
Nesterov’s acceleration v.s. normal gradient descent Lasso: n = 8,000,p = 500.
16 / 16
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):
183–202, 2009.
O. G¨uler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):
403–419, 1991.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pages 1139–1147, 2013.