2011/8/26: Convex Optimization: Old Tricks for New Problems

(1)

. . . .

Convex Optimization: Old Tricks for New Problems

Ryota Tomioka1

1_{The University of Tokyo}

(2)

. . . .

(3)

. . . .

A typical machine learning problem (1/2)

minimize w∈Rn 1 2∥Xw − y∥ 2 2 | {z } data-ﬁt + φλ(w ) | {z } Regularization X Ridge penalty φλ= λ 2 n P j=1 w_j2. L1 penalty φλ = λ n P j=1 |wj|.

(4)

. . . .

A typical machine learning problem (2/2)

Logistic regression for binary (yi ∈ {−1, +1}) classiﬁcation:

minimize

w∈Rn

m P i=1

log(1 + exp(−yi〈xi,w〉))

| {z }

data-ﬁt

+ φλ(w ) | {z }

Regularization

The logistic loss function

log(1 + e−yz) =− log P(Y = y|z)

negative log-likelihood where P(Y = +1|z) = 1 1 + ez logistic function f(x)=log(1+exp(−x)) y<x,w> −50 0 5 0.5 1 z σ (z)

(5)

. . . .

Bayesian inference as a convex optimization

minimize q E| {z }q[f (w )] average energy +Eq[log q(w )] | {z } entropy s.t. q(w )≥ 0, Z q(w )dw = 1 where f (w ) =− log P(D|w) | {z }

neg. log likelihood

− log P(w)

| {z }

neg. log prior

⇒ q(w) = 1 Ze −f (w) _{(Bayesian posterior)} Inner approximations Variational Bayes Empirical Bayes Outer approximations Belief propagation See Wainwright & Jordan 08.

(6)

. . . .

Bayesian inference as a convex optimization

neg. log likelihood

− log P(w)

| {z }

neg. log prior

(7)

. . . .

Bayesian inference as a convex optimization

neg. log likelihood

− log P(w)

| {z }

neg. log prior

(8)

. . . .

Bayesian inference as a convex optimization

neg. log likelihood

− log P(w)

| {z }

neg. log prior

(9)

. . . .

Convex optimization = standard forms (boring?)

Example:Linear Programming (LP)

. Primal problem . . . .. . . . (P) min c⊤x, s.t. Ax = b, x ≥ 0. . Dual problem . . . .. . . . (D) max b⊤y , s.t. A⊤y ≤ c.

Quadratic Programming (QP), Second Order Cone Programming (SOCP), Semideﬁnite Programming (SDP), etc...

Pro: “Efﬁcient” (but complicated) solvers are already available.

(10)

. . . .

Convex optimization = standard forms (boring?)

Example:Linear Programming (LP)

. Primal problem . . . .. . . . (P) min c⊤x, s.t. Ax = b, x ≥ 0. . Dual problem . . . .. . . . (D) max b⊤y , s.t. A⊤y ≤ c.

Quadratic Programming (QP), Second Order Cone Programming (SOCP), Semideﬁnite Programming (SDP), etc...

Pro: “Efﬁcient” (but complicated) solvers are already available.

(11)

. . . .

Easy problems (that we don’t discuss)

Objective f is differentiable & no constraint

I L-BFGS quasi-Newton method

F requires only gradient. F scales well.

I Newton’s method

F requires also Hessian. F very accurate.

F for medium sized problems.

Differentiable f & simple box constraint

(12)

. . . .

Non-differentiability is everywhere

Support Vector Machine

minimize w C m X i=1 ℓH(yi〈xi,w〉)+ 1 2∥w∥ 2

Lasso (least absoluteshrinkageand

selectionoperator) minimize w L(w ) + λ n X j=1 |wj| max(0,1−yz)

(13)

. . . .

Why we need sparsity

Genome-wide association studies

I Hundreds of thousands of genetic variations (SNPs), small number of participants (samples).

I Number of genes responsible for the disease is small.

I Solve classiﬁcation problem (disease/healthy) with sparsity constraint.

EEG/MEG source localization

I Number of possible sources≫ number of sensors

I Needs sparsity at a group level

φλ(w ) = λ X g∈G ∥wg∥2 (w ∈ R3)

(14)

. . . .

L1-regularization and sparsity

Best convex approximation of∥w∥0.

Threshold occurs for ﬁnite λ.

Non-convex cases (p < 1) can be solved by

re-weighted L1 minimization −3 −2 −1 0 1 2 3 0 1 2 3 4 |x|0.01 |x|0.5 |x| x2

(15)

. . . .

L1-regularization and sparsity

−3 −2 −1 0 1 2 3 0 1 2 3 4 |x|0.01 |x|0.5 |x| x2

(16)

. . . .

L1-regularization and sparsity

(17)

. . . .

L1-regularization and sparsity

re-weighted L1 minimization −20 −1.5 −1 −0.5 0 0.5 1 1.5 2 0.5 1 1.5 2 −3 −2 −1 0 1 2 3 0 1 2 3 4 |x|0.01 |x|0.5 |x| x2

(18)

. . . .

Multiple kernels & multiple tasks

Multiple kernel learning [Lanckriet et al., 04; Bach et al., 04;...]

I Given: kernel functions k1(x , x′), . . . ,KM(x, x′)

I How do we optimally select and combine “good“ kernels?

minimize f1∈H1, f2∈H2, ...,fM∈HM C N X i=1 ℓ ³ yi PM m=1fm(xi) ´ + λ M X m=1 ∥fm∥_Hm

Multiple task learning [Evgeniou et al 05]

I Given: two learning tasks.

I Can we do better than solving them individually? minimize w1,w2,w12 L1(w1+w12) | {z } Task 1 loss +L_|2(w2_{z+w12_}) Task 2 loss +λ(∥w1∥ + ∥w2∥ + ∥w12∥)

w12: shared component, w1: Task 1 only component, w2: Task 2

(19)

. . . .

Multiple kernels & multiple tasks

Multiple kernel learning [Lanckriet et al., 04; Bach et al., 04;...]

I Given: kernel functions k1(x , x′), . . . ,KM(x, x′)

I How do we optimally select and combine “good“ kernels?

minimize f1∈H1, f2∈H2, ...,fM∈HM C N X i=1 ℓ ³ yi PM m=1fm(xi) ´ + λ M X m=1 ∥fm∥_Hm

Multiple task learning [Evgeniou et al 05]

I Given: two learning tasks.

I Can we do better than solving them individually? minimize w1,w2,w12 L1(w1+w12) | {z } Task 1 loss +L_|2(w2_{z+w12_}) Task 2 loss +λ(∥w1∥ + ∥w2∥ + ∥w12∥)

w12: shared component, w1: Task 1 only component, w2: Task 2

(20)

. . . .

Estimation of low-rank matrices (1/2)

Completion of partially observed low-rank matrix minimize X 1 2∥Ω(X − Y )∥ 2_{+ λ}_∥X∥ S1 where ∥X∥_S₁ := r X j=1 σj(X ) (Schatten 1-norm)

Linear sum of singular-values⇒ sparsity in the singular-values.

I Collaborative ﬁltering (netﬂix)

I Sensor network localization

1 4 2 3 2 2 1 1 1 1 1 3 2 4 2 3 4 1 2 1 1 Movies Users

(21)

. . . .

Estimation of low-rank matrices (2/2)

Classiﬁcation of matrix shaped data X .

f (X ) =〈W , X〉 + b

Multivariate Time Series

s Time e ns or s

Secondorderstatistics

S

e

Secondorderstatistics

Sensors 5 10 15 20 25 n s or s 51015202530354045 30 35 40 45 Se n

Classiﬁcation of binary relationship between two objects (e.g., protein and drug)

f (x, y ) = x⊤W y + b

(22)

. . . .

Agenda

Convex optimization basics

I Convex sets

I Convex function

I Conditions that guarantee convexity

I Convex optimization problem Looking into more details

I Proximity operatorsand IST methods

I Conjugate dualityand dual ascent

(23)

. . . .

Convexity

Learning objectives Convex sets Convex function

Conditions that guarantee convexity Convex optimization problem

(24)

. . . .

Convex set

A subset V ⊆ Rnis aconvex set

⇔ line segment between two arbitrary points x, y ∈ V is included in V ;

that is,

∀x, y ∈ V , ∀λ ∈ [0, 1], λx + (1 − λ)y ∈ V .

x

(25)

. . . .

Convex function

A function f :Rn_{→ R ∪ {+∞} is a}_{convex function}

⇔ the function f is below any line segment between two points on f ;

that is, ∀x, y ∈ Rn_,_{∀λ ∈ [0, 1], f ((1 − λ)x + λy) ≤ (1 − λ)f (x) + λf (y)} (Jensen’s inequality) x f(x) y f(y) f(x) f(y) Non−convex Convex Johan Jensen 1859 – 1925

(26)

. . . .

Convex function

A function f :Rn→ R ∪ {+∞} is aconvex function

⇔the epigraph of f is a convex set; that is

V_f :={(t, x) : (t, x) ∈ Rn+1,t ≥ f (x)} is convex.

(x,f(x))

(y,f(y)) Epigraph

(27)

. . . .

Exercise

Show that the indicator function δ_C(x) of a convex set C is a

convex function. Here

δC(x) =

(

0 if x ∈ C,

(28)

. . . .

Conditions that guarantee convexity (1/3)

Hessian∇2f (x) is positive semideﬁnite(if f is differentiable)

Examples

I (Negative) entropy is a convex function.

f (p) = n X i=1 pilog pi, ∇2_{f (p) = diag(1/p} 1, . . . ,1/pn)≽ 0. 0 1

I log determinant is a concave (−f is convex) function

f (X ) = log|X| (X ≽ 0), ∇2_{f (X ) =}_−X−⊤_{⊗ X}−1_{≼ 0}

(29)

. . . .

Conditions that guarantee convexity (1/3)

Hessian∇2f (x) is positive semideﬁnite(if f is differentiable)

Examples

I (Negative) entropy is a convex function.

f (p) = n X i=1 pilog pi, ∇2_{f (p) = diag(1/p} 1, . . . ,1/pn)≽ 0. 0 1

I log determinant is a concave (−f is convex) function

f (X ) = log|X| (X ≽ 0), ∇2_{f (X ) =}_−X−⊤_{⊗ X}−1_{≼ 0}

(30)

. . . .

Conditions that guarantee convexity (2/3)

Maximum over convex functions{fj(x)}∞j=1

f (x) := max

j fj(x) (fj(x) is convex for all j)

(31)

. . . .

Conditions that guarantee convexity (2/3)

Maximum over convex functions{f (x; α) : α ∈ Rn_}

f (x) := max

α∈Rnf (x; α)

Example

Quadratic over linear is a convex function

f (y , Σ) = max α " −1₂α⊤Σα + α⊤y # (Σ≻ 0) = 1 2y ⊤_Σ−1_y 0 0.05 0.1 −1 −0.5 0 0.5 1 0 10 20 30 y Σ y Σ −1 y

(32)

. . . .

Conditions that guarantee convexity (2/3)

Maximum over convex functions{f (x; α) : α ∈ Rn_}

f (x) := max

α∈Rnf (x; α)

Example

Quadratic over linear is a convex function

f (y , Σ) = max α " −1₂α⊤Σα + α⊤y # (Σ≻ 0) = 1 2y ⊤_Σ−1_y 0 0.05 0.1 −1 −0.5 0 0.5 1 0 10 20 30 y Σ y Σ −1 y

(33)

. . . .

Conditions that guarantee convexity (3/3)

Minimum ofjointly convexfunction f (x , y )

f (x ) := min

y∈Rnf (x , y ) is convex.

Examples

I Hierarchical prior minimization

f (x) = min d1,...,dn≥0 1 2 n X j=1 Ã x2 j dj +d p j p ! (p≥ 1) = 1 q n X j=1 |xj|q (q = 2p 1 + p)

I Schatten 1- norm (sum of singularvalues)

f (X ) = min Σ_≽0 1 2 ³ Tr ³ X Σ−1X⊤ ´ +Tr (Σ) ´ =Tr ³ (X⊤X )1/2 ´ = r X j=1 σj(X ). q=1 q=1.5 q=2

(34)

. . . .

Conditions that guarantee convexity (3/3)

f (x ) := min

Examples

f (X ) = min Σ_≽0 1 2 ³ Tr ³ X Σ−1X⊤ ´ +Tr (Σ) ´ =Tr ³ (X⊤X )1/2 ´ = r X j=1 σj(X ). q=1 q=1.5 q=2

(35)

. . . .

Conditions that guarantee convexity (3/3)

f (x ) := min

Examples

f (X ) = min Σ≽0 1 2 ³ Tr ³ X Σ−1X⊤ ´ +Tr (Σ) ´ =Tr ³ (X⊤X )1/2 ´ = r X j=1 σj(X ). q=1 q=1.5 q=2

(36)

. . . .

Conditions that guarantee convexity (3/3)

f (x ) := min

Examples

f (X ) = min Σ≽0 1 2 ³ Tr ³ X Σ−1X⊤ ´ +Tr (Σ) ´ =Tr ³ (X⊤X )1/2 ´ = r X j=1 σj(X ). q=1 q=1.5 q=2

(37)

. . . .

Convex optimization problem

f : convex function, g: concave function (−g is convex), C: convex set.

minimize

x f (x), maximizey g(y ),

s.t. x ∈ C. s.t. y ∈ C.

Why?

local optimum⇒ global optimum

duality (later) can be used to check convergence

(38)

. . . .

Proximity operators and IST methods

Learning objectives

(Projected) gradient method

Iterative shrinkage/thresholding (IST) method Acceleration

(39)

. . . .

Proximity view on gradient descent

“Linearize and Prox”

xt+1=argmin x µ ∇f (xt₎₍_x_{− x}t₎₊ 1 2ηt∥x − x t_∥2 ¶ =xt − ηt∇f (xt)

Step-size should satisfy

ηt ≤ 1/L(f ).

L(f ): the Lipschitz constant ∥∇f (y) − ∇f (x)∥ ≤ L(f )∥y − x∥. L(f )=upper bound on the

maximum eigenvalue of the Hessian

xt xt+1 x*

(40)

. . . .

Constraint minimization problem

What do we do, if we have a constraint? minimize

x∈Rn f (x),

s.t. x ∈ C.

can be equivalently written as minimize

x∈Rn f (x) +δC(x),

(41)

. . . .

Constraint minimization problem

What do we do, if we have a constraint? minimize

x∈Rn f (x),

s.t. x ∈ C.

can be equivalently written as minimize

x∈Rn f (x) +δC(x),

(42)

. . . .

Projected gradient method

(Bertsekas 99; Nesterov 03)

Linearize the objective f , δC is the indicator of the constraint C

xt+1=argmin x µ ∇f (xt )(x − xt)+δC(x)+ 1 2ηt∥x − x t_∥2 2 ¶ =argmin x µ δC(x)+ 1 2ηt∥x − (x t_{− η} t∇f (xt))∥22 ¶ =proj_C(xt− ηt∇f (xt)). Requires ηt ≤ 1/L(f ). Convergence rate f (xk)− f (x∗)≤ L(f )∥x0− x ∗_∥2 2 2k Need the projectionproj_C to be easy to compute −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2

(43)

. . . .

Ideas for regularized minimization

Constrained minimization problem minimize

x∈Rn f (x) +δC(x).

⇒ need to compute theprojection

xt+1 =argmin x µ δC(x) + 1 2ηt∥x − y∥ 2 2 ¶

Regularized minimization problem minimize

x∈Rn f (x) +φλ(x)

⇒ need to compute theproximity operator

xt+1=argmin x µ φλ(x) + 1 2ηt∥x − y∥ 2 2 ¶

(44)

. . . .

Proximal Operator: generalization of projection

prox_φ λ(z) = argmin x µ φλ(x)+ 1 2∥x − z∥ 2 2 ¶

φλ = δC: Projection onto a convex set

prox_δ C(z) = projC(z). φλ(x) = λ∥x∥1: Soft-Threshold prox_λ(z) = argmin x µ λ∥x∥1+ 1 2∥x − z∥ 2 2 ¶ =      zj+ λ (zj <−λ), 0 (−λ ≤ zj ≤ λ), z_j− λ (z_j > λ). λ −λ z ST(z)

Prox can be computed easily for aseparableφλ.

(45)

. . . .

Iterative Shrinkage Thresholding (IST)

xt+1=argmin x µ ∇f (xt )(x− xt)+φλ(x)+ 1 2ηt∥x − x t_∥2 2 ¶ =argmin x µ φλ(x)+ 1 2ηt∥x − (x t _{− η} t∇f (xt))∥22 ¶ =prox_λη_t(xt − ηt∇f (xt)).

The same condition for ηt, the

same O(1/k ) convergence (Beck & Teboulle 09)

f (xk)− f (x∗)≤ L(f )∥x0− x

∗_∥2

2k

If theProx operator prox_λ is easy, it is simple to implement.

(46)

. . . .

IST summary

Solve minimization problem minimize

w∈Rn f (w ) + φλ(w )

by iteratively computing

wt+1=prox_λη_t(wt − ηt∇f (wt)).

Exercise: Derive prox operator for Ridge regularization φλ(w ) = λ n X j=1 w_j2 Elastic-net regularization φλ(w ) = λ n X j=1 ³ (1− θ)|wj| + θwj2 ´ .

(47)

. . . .

Exercise 1: implement an L1 regularized logistic

regression via IST

minimize

w∈Rn

m P i=1

log(1 + exp(−y_i〈x_i,w〉))

| {z } data-ﬁt + λ n P j=1 |wj| | {z } Regularization Hint: deﬁne fℓ(z) = m X i=1 log(1 + exp(−zi)).

Then the problem is

minimize fℓ(Aw ) + λ n P j=1 |wj| where A =    y1x1⊤ y2x2⊤ .. . ymxm⊤   

(48)

. . . .

Some hints

.

1 Compute the gradient of the loss term

∇wfℓ(Aw ) =−A⊤ µ exp(−zi) 1 + exp(−z_i) ¶m i=1 .

.

2 The gradient step becomes

wt+12 ₌wt _{+ η}_tA⊤ µ exp(−zi) 1 + exp(−zi) ¶m i=1 .

.

3 _{Then compute the proximity operator}

wt+1=prox_λη t(w t+1₂₎ =          wt+ 1 2 j + ληt (w t+1 2 j <−ληt), 0 (−ληt ≤ w t+1₂ j ≤ ληt), wt+ 1 2 j − ληt (w t+1₂ j > ληt).

(49)

. . . .

Matrix completion via IST

(Mazumder et al. 10)

Loss function: L(X ) = 1 2∥Ω(X − Y )∥ 2_. gradient: ∇L(X) = Ω⊤_(Ω(_X _{− Y ))} Regularization: φλ(X ) = λ r X j=1 σj(X ) (S1-norm).

Prox operator (Singular Value Thresholding):

prox_λ(Z ) = U max(S− λI, 0)V⊤.

Iteration: Xt+1=prox_λη_t ³ (I− ηtΩ⊤Ω)(Xt) | {z } ﬁll in missing + η_|tΩ⊤_{zΩ(Yt_}) observed ´

When ηt =1, ﬁll missings with predicted values Xt, overwrite the

(50)

. . . .

Matrix completion via IST

(51)

. . . .

Matrix completion via IST

(52)

. . . .

Matrix completion via IST

(53)

. . . .

FISTA: accelerated version of IST

(Beck & Teboulle 09; Nesterov 07) .

.

. 1 _{Initialize x}0_{appropriately, y}1₌_x0_，_s 1=1. .

.

. 2 _{Update x}t_: xt =prox_λη_t(yt − ηt∇L(yt)). .

.

. 3 _{Update y}t_: yt+1=xt + µ st − 1 st+1 ¶ (xt − xt−1), where st+1 = ¡ 1 + q 1 + 4s2 t ¢± 2.

The same per iteration complexity. Converges as O(1/k2). Roughly speaking, yt _{predicts where the IST step should be}

(54)

. . . .

Effect of acceleration

0 2000 4000 6000 8000 10000 10−8 10−6 10−4 10−2 100 102 ISTA MTWIST FISTA Without acceleration With acceleration Number of iterations

From Beck & Teboulle 2009 SIAM J. IMAGING SCIENCES Vol. 2, No. 1, pp. 183-202

(55)

. . . .

Conjugate duality and dual ascent

Convex conjugate function

Lagrangian relaxation and dual problem Dual ascent

(56)

. . . .

Conjugate duality

Theconvex conjugate f∗of a function f :

f∗(y ) = sup

x∈Rn

(〈x, y〉 − f (x))

f*(y) f(x)

Since the maximum over linear functions is always convex, f need not be convex.

(57)

. . . .

Conjugate duality (dual view)

.

. . . .. . . .

−f∗₍_{y ) is the minimum y -intercept of the hyperplanes that has slope y}

and have intersection with the graph of f (x).

f∗(y ) = sup x (〈x, y〉 − f (x)) ⇔ −f∗(y ) = inf x (f (x)− 〈x, y〉) =inf x,bb, s.t. f (x) =〈x, y〉 +b. -f*(y) y

(58)

. . . .

Conjugate duality (dual view)

.

. . . .. . . .

f∗(y ) = sup x (〈x, y〉 − f (x)) ⇔ −f∗₍_{y ) = inf} x (f (x)− 〈x, y〉) =inf x,bb, s.t. f (x) =〈x, y〉 +b. -f*(y) y

(59)

. . . .

Conjugate duality (dual view)

.

. . . .. . . .

f∗(y ) = sup x (〈x, y〉 − f (x)) ⇔ −f∗₍_{y ) = inf} x (f (x)− 〈x, y〉) =inf x,bb, s.t. f (x) =〈x, y〉 +b. -f*(y) y

(60)

. . . .

Conjugate duality (dual view)

.

. . . .. . . .

f∗(y ) = sup x (〈x, y〉 − f (x)) ⇔ −f∗(y ) = inf x (f (x)− 〈x, y〉) =inf x,bb, s.t. f (x) =〈x, y〉 +b. -f*(y) y

(61)

. . . .

Demo

(62)

. . . .

Example of conjugate duality

f∗(y ) = sup_x_∈Rn(〈x, y〉 − f (x)) Quadratic function f (x ) = x 2 2σ2 f∗(y ) = σ2y2 2

f*(y)

f(x)

(63)

. . . .

Example of conjugate duality

f∗(y ) = sup_x_∈Rn(〈x, y〉 − f (x))

Logistic loss function

f (x) = log(1 + exp(−x))

f∗(−y) = y log(y) + (1 − y) log(1 − y)

0 −1

(64)

. . . .

Example of conjugate duality

f∗(y ) = sup_x_∈Rn(〈x, y〉 − f (x))

Logistic loss function

f (x) = log(1 + exp(−x)) f∗(−y) = y log(y) + (1 − y) log(1 − y)

0 −1

(65)

. . . .

Example of conjugate duality

f∗(y ) = sup_x_∈Rn(〈x, y〉 − f (x)) L1 regularizer f (x) =|x| f∗(y ) = ( 0 (−1 ≤ y ≤ 1) +∞ (otherwise) f(x) f*(y)

(66)

. . . .

Example of conjugate duality

f∗(y ) = sup_x_∈Rn(〈x, y〉 − f (x)) L1 regularizer f (x) =|x| f∗(y ) = ( 0 (−1 ≤ y ≤ 1) +∞ (otherwise) f(x) f*(y)

(67)

. . . .

Bi-conjugate f

∗∗

may be different from f

For nonconvex f ,

f(x)

(68)

. . . .

Lagrangian relaxation

Our optimization problem:

minimize

w∈Rn f (Aw ) + g(w )



 For examplef (z) = 1₂∥z − y∥2₂

(squared loss)   Equivalently written as minimize z∈Rm_,w_∈Rn f (z) + g(w ), s.t. z = Aw (equality constraint) Lagrangian relaxation minimize z,w L(z, w, α) = f (z) + g(w) + α ⊤₍_z_{− Aw)}

As long as z = Aw , the relaxation is exact.

(69)

. . . .

Lagrangian relaxation

minimize

w∈Rn f (Aw ) + g(w )



 For examplef (z) = 1₂∥z − y∥2₂

(70)

. . . .

Lagrangian relaxation

minimize

w∈Rn f (Aw ) + g(w )



 For examplef (z) = 1₂∥z − y∥2₂

(71)

. . . .

Lagrangian relaxation

minimize

w∈Rn f (Aw ) + g(w )



 For examplef (z) = 1₂∥z − y∥2₂

(72)

. . . .

Weak duality

inf z,wL(z, w, α) ≤ p ∗ _{(primal optimal)} proof inf z,wL(z, w, α) = inf µ inf

z=AwL(z, w, α), infz̸=AwL(z, w, α)

¶ =inf µ p∗, inf z_̸=AwL(z, w, α) ¶ ≤ p∗

(73)

. . . .

Weak duality

(74)

. . . .

Weak duality

(75)

. . . .

Dual problem

From the above argument

d (α) := inf

z,wL(z, w, α)

is a lower bound for p∗ for any α. Why don’t wemaximizeover w ?

Dual problem maximize α∈Rm d (α) Note sup α inf z,wL(z, w, α) = d ∗_{≤ p}∗ ₌ _inf z,wsup_α L(z, w, α)

If d∗=p∗,strong duality holds. This is the case if f and g both closed and convex.

(76)

. . . .

Dual problem

From the above argument

d (α) := inf

z,wL(z, w, α)

is a lower bound for p∗ for any α. Why don’t wemaximizeover w ? Dual problem maximize α∈Rm d (α) Note sup α inf z,wL(z, w, α) = d ∗_{≤ p}∗ ₌ _inf z,wsup_α L(z, w, α)

If d∗=p∗,strong duality holds. This is the case if f and g both closed and convex.

(77)

. . . .

Dual problem

d (α) = inf z,wL(z, w, α) (≤ p ∗₎ = inf z,w ³ f (z) + g(w ) + α⊤(z − Aw)´ =inf z (f (z) +〈α, z〉) + infw ³ g(w )− D A⊤α,w E´ =− sup z (〈−α, z〉 − f (z)) − sup w ³D A⊤α,wE− g(w)´ =−f∗(−α) − g∗(A⊤α)

(78)

. . . .

Dual problem

d (α) = inf z,wL(z, w, α) (≤ p ∗₎ = inf z,w ³ f (z) + g(w ) + α⊤(z − Aw) ´ =inf z (f (z) +〈α, z〉) + infw ³ g(w )− D A⊤α,w E´ =− sup z (〈−α, z〉 − f (z)) − sup w ³D A⊤α,wE− g(w)´ =−f∗(−α) − g∗(A⊤α)

(79)

. . . .

Dual problem

(80)

. . . .

Dual problem

(81)

. . . .

Dual problem

(82)

. . . .

Fenchel’s duality

inf w_∈Rn(f (Aw ) + g(w )) = sup α∈Rm ³ −f∗(−α) − g∗(A⊤α) ´ M. W. Fenchel Examples

Logistic regression with L1 regularization

f (z) =

m

X

i=1

log(1 + exp(−zi)), g(w ) = λ∥w∥1.

Support vector machine (SVM)

f (z) = C m X i=1 max(0, 1− zi), g(w ) = 1 2∥w∥ 2 2.

(83)

. . . .

Example 1: Logistic regression with L1 regularization

. Primal . . . .. . . . min w f (y ◦ Xw) + φλ(w )      f (z) = m X i=1 log(1 + exp(−zi)), φλ(w ) = λ∥w∥1. . Dual . . . .. . . . max α −f ∗₍_{−α) − φ}∗ λ(X⊤(α◦ y))                f∗(−α)= m X i=1 αilog(αi) +(1− αi)log(1− αi), φ∗_λ(v ) = ( 0 (∥w∥_∞≤ λ), +∞ (otherwise).

(a) primal losses

O 1 Hinge Loss Logistic Loss (b) dual losses O 1 Hinge Loss Logistic Loss

(84)

. . . .

Example 2: Support vector machine

. Primal . . . . . min w f (y ◦ Xw) + φλ(w )        f (z) = C m X i=1 max(0, 1− zi), φλ(w ) = 1 2∥w∥ 2_. . Dual . . . . . max α −f ∗₍_{−α) − φ}∗ λ(X⊤(α◦ y))        f∗(−α)= (Pm i=1−αi (0≤ α ≤ C), +∞ (oterwise), φ∗_λ(v ) = 1 2∥v∥ 2_.

(85)

. . . .

Dual ascent

Assume for a moment that the dual d (α) is differentiable. For a given αt

d (αt) = inf

z,w

¡

f (z) + g(w ) +αt,z− Aw®¢

and one can show that (Chapter 6, Bertsekas 99)

∇αd (αt) =zt+1− Awt+1 where zt+1=argmin z ¡ f (z) +αt,z®¢ wt+1=argmin w ³ g(w )− D A⊤αt,w E´

(86)

. . . .

Dual ascent (Uzawa’s method)

              

Minimize the Lagrangian wrt x and z:

zt+1=argmin_z¡f (z) +αt,z®¢,

wt+1 ₌_argmin w

¡

g(w )−A⊤αt,w®¢.

Update the Lagrangian multiplier αt:

αt+1 = αt + ηt(zt+1− Awt+1). Pro: Very simple.

Con: When f∗ or g∗is non-differentiable, it is a dual subgradient method (convergence more tricky) NB: f∗is differentiable⇔ f is strictly convex. H. Uzawa primal dual

(87)

. . . .

Exercise 2: Matrix completion via dual ascent

(Cai et al. 08)

minimize X 1 2λ∥z − y∥ 2 | {z } Strictly convex + ³ τ∥X∥tr+ 1 2∥X∥ 2 | {z } Strictly convex ´ , s.t. Ω(X ) = z. ⇓ Lagrangian: L(X, z, α) = 1 2λ∥z − y∥ 2 | {z } =f (z) + ³ τ∥X∥S1+ 1 2∥X∥ 2 | {z } =g(x) ´ + α⊤(z− Ω(X)). Dual ascent   

Xt+1=prox_τ¡Ω⊤(αt)¢ (Singular-Value Thresholding)

zt+1₌_y_{− λα}t

(88)

. . . .

Exercise 2: Matrix completion via dual ascent

(Cai et al. 08)

minimize X 1 2λ∥z − y∥ 2 | {z } Strictly convex + ³ τ∥X∥tr+ 1 2∥X∥ 2 | {z } Strictly convex ´ , s.t. Ω(X ) = z. ⇓ Lagrangian: L(X, z, α) = 1 2λ∥z − y∥ 2 | {z } =f (z) + ³ τ∥X∥S1+ 1 2∥X∥ 2 | {z } =g(x) ´ + α⊤(z− Ω(X)). Dual ascent   

Xt+1=prox_τ¡Ω⊤(αt)¢ (Singular-Value Thresholding)

zt+1 ₌_y_{− λα}t

(89)

. . . .

Augmented Lagrangian and ADMM

Learning objectives

Structured sparse estimation Augmented Lagrangian

(90)

. . . .

Total Variation based image denoising

[Rudin, Osher, Fatemi 92]

minimize X 1 2∥X − Y ∥ 2 2+ λ X i,j °° °³∂xXij ∂yXij ´°°_° 2 Original X0 Observed Y

(91)

. . . .

In one dimension

Fused lasso [Tibshirani et al. 05] minimize x 1 2∥x − y∥ 2 2+ λ n−1 X j=1 ¯¯xj+1− xj¯¯

True

Noisy

(92)

. . . .

Structured sparsity estimation

TV denoising minimize X 1 2∥X − Y ∥ 2 2+ λ X i,j °° °³∂xXij ∂yXij ´°°_° 2 Fused lasso minimize x 1 2∥x − y∥ 2 2+ λ n−1 X j=1 ¯¯xj+1− xj¯¯ .

Structured sparse estimation problem

. . . .. . . . minimize x_∈Rn f (x)|{z} data-ﬁt + φ_{| {z }}λ(Ax) regularization

(93)

. . . .

Structured sparsity estimation

TV denoising minimize X 1 2∥X − Y ∥ 2 2+ λ X i,j °° °³∂xXij ∂yXij ´°°_° 2 Fused lasso minimize x 1 2∥x − y∥ 2 2+ λ n−1 X j=1 ¯¯xj+1− xj¯¯ .

Structured sparse estimation problem

. . . . . minimize x_∈Rn f (x)|{z} data-ﬁt + φ_{| {z }}λ(Ax) regularization

(94)

. . . .

Structured sparse estimation problem

minimize

x∈Rn f (x)|{z}

data-ﬁt

+ φ_{| {z }}λ(Ax)

regularization

Not easy to compute prox operator (because it isnon-separable)

⇒ difﬁcult to applyIST-type methods.

Dual is not necessarily differentiable

(95)

. . . .

Forming the augmented Lagrangian

Structured sparsity problem minimize x_∈Rn f (x)|{z} data-ﬁt + φ_{| {z }}λ(Ax) regularization Equivalently written as minimize w_∈Rn f (x) + φ| {z }λ(z) separable! , s.t. z = Ax (equality constraint) .

Augmented Lagrangian function

. . . .. . . . Lη(x, z, α) = f (x) + φλ(z) + α⊤(z− Ax) + η 2∥z − Ax∥ 2 2

(96)

. . . .

Forming the augmented Lagrangian

Structured sparsity problem minimize x_∈Rn f (x)|{z} data-ﬁt + φ_{| {z }}λ(Ax) regularization Equivalently written as minimize w_∈Rn f (x) + φ| {z }λ(z) separable! , s.t. z = Ax (equality constraint) .

. . . . . Lη(x, z, α) = f (x) + φλ(z) + α⊤(z− Ax) + η 2∥z − Ax∥ 2 2

(97)

. . . .

Augmented Lagrangian Method

.

. . . .. . . . Lη(x, z, α) = f (x) + φλ(z) + α⊤(z− Ax) + η 2∥z − Ax∥ 2_. .

Augmented Lagrangian method (Hestenes 69, Powell 69)

. . . . .             

Minimize the AL function wrt x and z:

(xt+1,zt+1) = argmin

x∈Rn_,z_∈RmLη

(x, z, αt).

Update the Lagrangian multiplier:

αt+1= αt+ η(zt+1− Axt+1).

Pro: The dual isalwaysdifferentiable due to the penalty term.

Con: Cannot minimize over x and z independently

(98)

. . . .

Alternating Direction Method of Multipliers (ADMM;

Gabay & Mercier 76)

                    

Minimize the AL functionLη(x, zt, αt)wrt x:

xt+1=argmin x∈Rn ³ f (x)− αt⊤Ax + η 2∥z t _{− Ax∥}2 2 ´ .

Minimize the AL functionLη(xt+1,z, αt)wrt z:

zt+1₌_argmin z_∈Rm ³ φλ(z) + αt⊤z + η 2∥z − Ax t+1_∥2 2 ´ .

αt+1= αt+ η(zt+1_{− Ax}t+1_).

Looks ad-hoc but convergence can be shown rigorously. Stability does not rely on the choice of step-size η. The newly updated xt+1enters the computation of zt+1.

(99)

. . . .

Alternating Direction Method of Multipliers (ADMM;

Gabay & Mercier 76)

                    

αt+1= αt+ η(zt+1_{− Ax}t+1_).

(100)

. . . .

Alternating Direction Method of Multipliers (ADMM;

Gabay & Mercier 76)

                    

αt+1= αt+ η(zt+1_{− Ax}t+1_).

(101)

. . . .

Exercise: implement an ADMM for fused lasso

Fused lasso minimize x 1 2∥x − y∥ 2 2+ λ∥Ax∥1

What is the loss function f ? What is the regularizer g?

What is the matrix A for fused lasso?

(102)

. . . .

Conclusion

Three approaches for various sparse estimation problems

I Iterative shrinkage/thresholding –proximity operator

I Uzawa’s method –convex conjugate function

I ADMM – combination of the above two

Above methods go beyond black-box models (e.g., gradient descent or Newton’s method) – takes better care of the problem structures.

These methods are simple enough to be implemented rapidly, but should not be considered as a silver bullet.

⇒ Trade-off between:

I Quick implementation – test new ideas rapidly

I Efﬁcient optimization – more inspection/try-and-error/cross validation

(103)

. . . .

Topics we did not cover

Stopping criterion

I Care must be taken when making a comparison. Beyond polynomial convergence O(1/k2)

I Dual Augmented Lagrangian (DAL) converges super-linearly

o(exp(−k)). Software

http://mloss.org/software/view/183/ (This is limited to non-structured sparse estimation.) Beyond convexity

I Dual problem is always convex. It provides a lower-bound of the original problem. If p∗=d∗, you are done!

I Dual ascent(or dual decomposition) for sequence labeling in natural language processing; see [Wainwright, Jaakkola, Willsky 05; Koo et al. 10]

I Difference of convex (DC) programming.

I Eigenvalue problem. Stochastic optimization

(104)

. . . .

A new book “Optimization for Machine Learning” is coming out from the MIT press.

Contributed authors including: A. Nemirovksi, D. Bertsekas, L. Vandenberghe, and more.

(105)

. . . .

Possible projects

.

1 _{Compare the three approaches, namely IST, dual ascent, and}

ADMM, and discuss empirically (and theoretically) their pros and cons.

.

2 _{Apply one of the methods discussed in the lecture to model some}

(106)

. . . .

References

Recent surveys

Tomioka, Suzuki, & Sugiyama (2011) Augmented Lagrangian Methods for Learning, Selecting, and Combining Features. In Sra, Nowozin, Wright., editors, Optimization for

Machine Learning, MIT Press.

Combettes & Pesquet (2010) Proximal splitting methods in signal processing. In

Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag.

Boyd, Parikh, Peleato, & Eckstein (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers.

Textbooks

Rockafellar (1970) Convex Analysis. Princeton University Press. Bertsekas (1999) Nonlinear Programming. Athena Scientiﬁc.

Nesterov (2003) Introductory Lectures on Convex Optimization: A Basic Course. Springer. Boyd & Vandenberghe. (2004) Convex optimization, Cambridge University Press.

(107)

. . . .

References

IST/FISTA

Moreau (1965) Proximité et dualité dans un espace Hilbertien. Bul letin de la S. M. F. Nesterov (2007) Gradient Methods for Minimizing Composite Objective Function. Beck & Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J Imag Sci 2, 183–202.

Dual ascent

Arrow, Hurwicz, & Uzawa (1958) Studies in Linear and Non-Linear Programming. Stanford University Press.

Chapter 6 in Bertsekas (1999).

Wainwright, Jaakkola, & Willsky (2005) Map estimation via agreement on trees: message-passing and linear programming. IEEE Trans IT, 51(11).

Augmented Lagrangian

Rockafellar (1976) Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. of Oper. Res. 1.

Bertsekas (1982) Constrained Optimization and Lagrange Multiplier Methods. Academic Press.

Tomioka, Suzuki, & Sugiyama (2011) Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning. JMLR 12.

(108)

. . . .

References

ADMM

Gabay & Mercier (1976) A dual algorithm for the solution of nonlinear variational problems via ﬁnite element approximation. Comput Math Appl 2, 17–40.

Lions & Mercier (1979) Splitting Algorithms for the Sum of Two Nonlinear Operators. SIAM J Numer Anal 16, 964–979.

Eckstein & Bertsekas (1992) On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators.

Matrices

Srebro, Rennie, & Jaakkola (2005) Maximum-Margin Matrix Factorization. Advances in NIPS 17, 1329–1336.

Cai, Candès, & Shen (2008) A singular value thresholding algorithm for matrix completion. Tomioka, Suzuki, Sugiyama, & Kashima (2010) A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices. In ICML 2010.

Mazumder, Hastie, & Tibshirani (2010) Spectral Regularization Algorithms for Learning Large Incomplete Matrices. JMLR 11, 2287–2322.

(109)

. . . .

References

Multi-task/Mutliple kernel learning

Evgeniou, Micchelli, & Pontil (2005) Learning Multiple Tasks with Kernel Methods. JMLR 6, 615–637.

Lanckriet, Christiani, Bartlett, Ghaoui, & Jordan (2004) Learning the Kernel Matrix with Semideﬁnite Programming.

Bach, Thibaux, & Jordan (2005) Computing regularization paths for learning multiple kernels. Advances in NIPS, 73–80.

Structured sparsity

Tibshirani, Saunders, Rosset, Zhu and Knight. (2005) Sparsity and smoothness via the fused lasso. J. Roy. Stat. Soc. B, 67.

Rudin, Osher, Fetemi. (1992) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60.

Goldstein & Osher (2009) Split Bregman method for L1 regularization problems. SIAM J. Imag. Sci. 2.

Mairal, Jenatton, Obozinski, & Bach. (2011) Convex and network ﬂow optimization for structured sparsity.

Bayes & Probabilistic Inference

Wainwright & Jordan (2008) Graphical Models, Exponential Families, and Variational Inference.