2013.5.7.(minor edit: 2015.8.28)

Implications of KKT conditions in quantile regression

Kengo Kato

Let $y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$ and $X = (x_1, \dots, x_n)^T \in \mathbb{R}^{n \times p}$ be a vector of dependent variables and a design matrix, respectively. Consider the following minimization problem:

$$\text{(QR)} \qquad \min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \rho_\tau(y_i - x_i^T \beta),$$

where $\rho_\tau(u) = (\tau - 1(u \le 0))u$ is the check function used in quantile regression. This minimization problem reduces to the following linear programming problem:

$$\text{(QR-LP)} \qquad \min_{u, v \in \mathbb{R}^n,\ \beta \in \mathbb{R}^p} \ \tau 1_n^T u + (1 - \tau) 1_n^T v \quad \text{s.t.} \quad u - v = y - X\beta, \quad u \ge 0_n, \quad v \ge 0_n,$$

where $1_n = (1, \dots, 1)^T \in \mathbb{R}^n$ and $0_n = (0, \dots, 0)^T \in \mathbb{R}^n$. The inequalities $u \ge 0_n$ and $v \ge 0_n$ are interpreted coordinatewise. In the problem (QR-LP), the set of feasible solutions is non-empty and the objective function is non-negative on the set of feasible solutions. Hence there exists at least one optimal solution to the problem (QR-LP), and so to (QR).

The purpose of this note is to prove the following lemma, which roughly describes "first order conditions" for the problem (QR), by using the KKT theorem. The lemma is a small modification of Lemma 2.1 in [2], where only the LAD case (i.e. the case with $\tau = 1/2$) is handled.

Lemma 1. Let $\beta^*$ be an optimal solution to the problem (QR). Let $I^* = \{ i \in \{1, \dots, n\} : y_i = x_i^T \beta^* \}$. Then there exist $a_i \in [-1, 0]$, $i \in I^*$, such that

$$\sum_{i=1}^n \{ \tau - 1(y_i \le x_i^T \beta^*) \} x_i = \sum_{i \in I^*} a_i x_i.$$

Hence we have

$$\left| \sum_{i=1}^n \{ \tau - 1(y_i \le x_i^T \beta^*) \} x_i \right| \le \mathrm{Card}(I^*) \max_{1 \le i \le n} |x_i|.$$
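As a quick sanity check of Lemma 1, the following minimal sketch works in the intercept-only case ($p = 1$, $x_i = 1$ for all $i$). That setting, the sample values, and the helper names `check_loss`, `fit_intercept_only` and `lemma1_quantities` are assumptions made here for illustration and do not appear in the note. Since the objective is convex and piecewise linear in $\beta$ with kinks only at the data points, the sketch searches for a minimizer among the $y_i$. With $x_i = 1$ the lemma says that the sum $\sum_i \{\tau - 1(y_i \le \beta^*)\}$ lies in $[-\mathrm{Card}(I^*), 0]$.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1(u <= 0)) * u."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u


def fit_intercept_only(y, tau):
    """Return a minimizer of sum_i rho_tau(y_i - b) over b.

    The objective is convex and piecewise linear with kinks at the data
    points, so it is enough to compare its values at the y_i.
    """
    return min(y, key=lambda b: sum(check_loss(yi - b, tau) for yi in y))


def lemma1_quantities(y, tau):
    """Return (beta*, sum_i {tau - 1(y_i <= beta*)}, Card(I*)) for x_i = 1."""
    beta = fit_intercept_only(y, tau)
    score = sum(tau - (1.0 if yi <= beta else 0.0) for yi in y)
    card = sum(1 for yi in y if yi == beta)
    return beta, score, card


# Illustrative sample (not from the note).
y = [2.0, -1.0, 0.5, 3.0, 1.5, 0.0, 4.0]
tau = 0.3
beta, score, card = lemma1_quantities(y, tau)

# Lemma 1 with x_i = 1: score = sum of a_i over I*, each a_i in [-1, 0].
print(beta, score, card)
```

For this sample the minimizer is the third smallest observation, exactly one observation is interpolated, and the score lies between $-1$ and $0$, as the lemma requires.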

Proof. Let

$$u_i^* = \max\{ y_i - x_i^T \beta^*, 0 \}, \qquad v_i^* = \max\{ -y_i + x_i^T \beta^*, 0 \}.$$

Then $u^* - v^* = y - X\beta^*$ and $(u^*, v^*, \beta^*)$ is an optimal solution to the problem (QR-LP). Defining

$$f(u, v, \beta) = \tau 1_n^T u + (1 - \tau) 1_n^T v,$$

$$g(u, v, \beta) = (g_1(u, v, \beta), \dots, g_{2n}(u, v, \beta))^T = (-u^T, -v^T)^T,$$

$$h(u, v, \beta) = (h_1(u, v, \beta), \dots, h_n(u, v, \beta))^T = u - v - y + X\beta,$$

the problem (QR-LP) can be written as

$$\min_{u, v \in \mathbb{R}^n,\ \beta \in \mathbb{R}^p} f(u, v, \beta) \quad \text{s.t.} \quad g(u, v, \beta) \le 0_{2n}, \quad h(u, v, \beta) = 0_n.$$
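The step from (QR) to (QR-LP) rests on splitting each residual into its positive and negative parts. The sketch below, with hypothetical helper names and made-up residual values, checks numerically that the split is feasible for (QR-LP) and that the LP objective at the split equals the check-function objective.

```python
def check_loss(r, tau):
    """Check function rho_tau(r) = (tau - 1(r <= 0)) * r."""
    return (tau - (1.0 if r <= 0 else 0.0)) * r


def split_residual(r):
    """Positive and negative parts (u, v) of r, so that r = u - v and u, v >= 0."""
    return max(r, 0.0), max(-r, 0.0)


def lp_objective(u, v, tau):
    """LP objective tau * 1'u + (1 - tau) * 1'v."""
    return tau * sum(u) + (1.0 - tau) * sum(v)


# Illustrative residuals y_i - x_i' beta (values are made up).
residuals = [1.5, -2.0, 0.0, 0.25, -0.75]
tau = 0.25

u, v = zip(*(split_residual(r) for r in residuals))
qr_value = sum(check_loss(r, tau) for r in residuals)
lp_value = lp_objective(u, v, tau)

print(qr_value, lp_value)  # the two objective values agree
```

At most one of $u_i^*$ and $v_i^*$ is non-zero for each $i$, which is what makes the complementary slackness conditions in the KKT argument informative.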

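Let $e_i \in \mathbb{R}^n$ denote the vector of which only the $i$-th element is 1 and the other elements are all zero. Then the gradient vectors of $f(u, v, \beta)$, $g_i(u, v, \beta)$, $g_{n+i}(u, v, \beta)$ and $h_i(u, v, \beta)$ are given by

$$\nabla f(u, v, \beta) = \begin{pmatrix} \tau 1_n \\ (1 - \tau) 1_n \\ 0_p \end{pmatrix}, \qquad \nabla g_i(u, v, \beta) = \begin{pmatrix} -e_i \\ 0_n \\ 0_p \end{pmatrix},$$

$$\nabla g_{n+i}(u, v, \beta) = \begin{pmatrix} 0_n \\ -e_i \\ 0_p \end{pmatrix}, \qquad \nabla h_i(u, v, \beta) = \begin{pmatrix} e_i \\ -e_i \\ x_i \end{pmatrix}, \qquad i = 1, \dots, n.$$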

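Since all the constraints are linear, by the KKT theorem [1, Proposition 3.3.7], there exist $\mu_1, \dots, \mu_{2n} \ge 0$ and $\lambda_1, \dots, \lambda_n$ such that

$$\begin{pmatrix} \tau 1_n \\ (1 - \tau) 1_n \\ 0_p \end{pmatrix} + \sum_{i=1}^n \mu_i \begin{pmatrix} -e_i \\ 0_n \\ 0_p \end{pmatrix} + \sum_{i=1}^n \mu_{n+i} \begin{pmatrix} 0_n \\ -e_i \\ 0_p \end{pmatrix} + \sum_{i=1}^n \lambda_i \begin{pmatrix} e_i \\ -e_i \\ x_i \end{pmatrix} = 0_{2n+p},$$

and

$$\mu_i u_i^* = 0, \qquad \mu_{n+i} v_i^* = 0, \qquad i = 1, \dots, n.$$

Read block by block, the first equation says that $\tau - \mu_i + \lambda_i = 0$ and $1 - \tau - \mu_{n+i} - \lambda_i = 0$ for $i = 1, \dots, n$, and that $\sum_{i=1}^n \lambda_i x_i = 0_p$.

Recall $I^* = \{ i \in \{1, \dots, n\} : y_i = x_i^T \beta^* \}$. Let

$$I_+^* = \{ i \in \{1, \dots, n\} : y_i > x_i^T \beta^* \}, \qquad I_-^* = \{ i \in \{1, \dots, n\} : y_i < x_i^T \beta^* \}.$$

Observe that

$$i \in I_+^* \ \Rightarrow \ u_i^* > 0 \ \Rightarrow \ \mu_i = 0 \ \Rightarrow \ \lambda_i = -\tau,$$

$$i \in I_-^* \ \Rightarrow \ v_i^* > 0 \ \Rightarrow \ \mu_{n+i} = 0 \ \Rightarrow \ \lambda_i = 1 - \tau.$$

Therefore,

$$\tau \sum_{i \in I_+^*} x_i + (\tau - 1) \sum_{i \in I_-^*} x_i = \sum_{i \in I^*} \lambda_i x_i.$$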
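To make the multipliers concrete, the sketch below reconstructs $\lambda$ and $\mu$ in the intercept-only case ($x_i = 1$) for a small made-up sample in which exactly one observation equals $\beta^*$. The setting, the sample, and the function name `kkt_multipliers_intercept_only` are illustrative assumptions and are not taken from the note. The values of $\lambda_i$ off $I^*$ are forced by complementary slackness, the remaining one follows from the $\beta$-block $\sum_i \lambda_i x_i = 0$, and $\mu$ is then read off the $u$- and $v$-blocks.

```python
def kkt_multipliers_intercept_only(y, beta, tau):
    """Return (lam, mu_u, mu_v) for x_i = 1, assuming exactly one y_i equals beta."""
    tie = [i for i, yi in enumerate(y) if yi == beta]
    assert len(tie) == 1, "sketch assumes Card(I*) = 1"

    lam = [0.0] * len(y)
    for i, yi in enumerate(y):
        if yi > beta:        # i in I_+: u_i* > 0, so mu_i = 0 and lambda_i = -tau
            lam[i] = -tau
        elif yi < beta:      # i in I_-: v_i* > 0, so mu_{n+i} = 0 and lambda_i = 1 - tau
            lam[i] = 1.0 - tau
    # beta-block of the stationarity condition: sum_i lambda_i * x_i = 0 with x_i = 1.
    lam[tie[0]] = -sum(lam)

    mu_u = [tau + l for l in lam]          # u-block: tau - mu_i + lambda_i = 0
    mu_v = [1.0 - tau - l for l in lam]    # v-block: 1 - tau - mu_{n+i} - lambda_i = 0
    return lam, mu_u, mu_v


# Illustrative sample; beta = 0.5 minimizes the check loss for tau = 0.3.
y = [2.0, -1.0, 0.5, 3.0, 1.5, 0.0, 4.0]
tau, beta = 0.3, 0.5
lam, mu_u, mu_v = kkt_multipliers_intercept_only(y, beta, tau)

u_star = [max(yi - beta, 0.0) for yi in y]
v_star = [max(beta - yi, 0.0) for yi in y]
print(lam[2], mu_u[2], mu_v[2])  # multipliers at the interpolated observation
```

The reconstructed $\mu$ is non-negative and satisfies complementary slackness, which is consistent with $\beta^* = 0.5$ being optimal for this sample.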

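The left side is expressed as

$$\sum_{i \in I_+^* \cup I_-^*} \{ \tau - 1(y_i \le x_i^T \beta^*) \} x_i = \sum_{i=1}^n \{ \tau - 1(y_i \le x_i^T \beta^*) \} x_i + (1 - \tau) \sum_{i \in I^*} x_i,$$

so that

$$\sum_{i=1}^n \{ \tau - 1(y_i \le x_i^T \beta^*) \} x_i = \sum_{i \in I^*} (\lambda_i - 1 + \tau) x_i.$$

For $i \in I^*$, we have

$$\tau - \mu_i + \lambda_i = 0 \ \Rightarrow \ \lambda_i \ge -\tau,$$

$$1 - \tau - \mu_{n+i} - \lambda_i = 0 \ \Rightarrow \ \lambda_i \le 1 - \tau,$$

so that $\lambda_i \in [-\tau, 1 - \tau]$, and hence $a_i = \lambda_i - 1 + \tau \in [-1, 0]$, for $i \in I^*$. This completes the proof. □

The next question is how large $\mathrm{Card}(I^*)$ is. Suppose that $y$ is random but $X$ is fixed (if $X$ is random, consider the conditional distribution of $y$ given $X$).

Lemma 2. Suppose that $n \ge p$ and the distribution of $y$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^n$. Then

$$\mathrm{Card}(I^*) \le p, \quad \text{a.s.}$$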
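Lemma 2 can be illustrated with a small simulation for a straight-line fit ($p = 2$, $x_i = (1, t_i)^T$). The sketch below is only an illustration under stated assumptions: it relies on the standard linear-programming fact, not proved in this note, that when $X$ has full column rank some optimal solution interpolates $p$ observations, so it enumerates all lines through two data points and keeps the one with the smallest check loss. The names `fit_line_by_enumeration`, `card_interpolated` and the tolerance `TOL` are hypothetical, and residuals below `TOL` in absolute value are treated as exact zeros to absorb rounding error.

```python
import itertools
import math
import random

TOL = 1e-9  # residuals smaller than this are treated as exact zeros


def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1(u <= 0)) * u."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u


def residuals(beta, ts, y):
    """Residuals y_i - (b0 + b1 * t_i), with tiny values snapped to 0."""
    b0, b1 = beta
    raw = (yi - (b0 + b1 * t) for t, yi in zip(ts, y))
    return [0.0 if abs(r) < TOL else r for r in raw]


def fit_line_by_enumeration(ts, y, tau):
    """Best line through two data points (assumes some optimum has this form)."""
    best_beta, best_loss = None, math.inf
    for i, j in itertools.combinations(range(len(y)), 2):
        if ts[i] == ts[j]:
            continue  # no unique line through two points with equal t
        b1 = (y[j] - y[i]) / (ts[j] - ts[i])
        b0 = y[i] - b1 * ts[i]
        loss = sum(check_loss(r, tau) for r in residuals((b0, b1), ts, y))
        if loss < best_loss:
            best_beta, best_loss = (b0, b1), loss
    return best_beta


def card_interpolated(ts, y, tau):
    """Card(I*) at the fitted line."""
    beta = fit_line_by_enumeration(ts, y, tau)
    return sum(1 for r in residuals(beta, ts, y) if r == 0.0)


random.seed(0)
n, p, tau = 12, 2, 0.4
ts = [float(i) for i in range(n)]

# Absolutely continuous y: Gaussian noise around a line.
y = [1.0 + 0.5 * t + random.gauss(0.0, 1.0) for t in ts]
card = card_interpolated(ts, y, tau)

# Degenerate y lying exactly on a line: absolute continuity fails.
y_tied = [1.0 + 0.5 * t for t in ts]
card_tied = card_interpolated(ts, y_tied, tau)

print(card, card_tied)
```

With continuous noise the fitted line passes through exactly $p = 2$ observations, whereas for the degenerate sample every observation is interpolated, which shows that the absolute continuity assumption cannot simply be dropped.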

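Proof. For $I \subset \{1, \dots, n\}$, let

$$S_I = \{ y \in \mathbb{R}^n : \text{there exists } \beta \in \mathbb{R}^p \text{ such that } y_i = x_i^T \beta \text{ for all } i \in I \}.$$

Then as long as $\mathrm{Card}(I) \ge p + 1$, the set $S_I$ is a linear subspace of $\mathbb{R}^n$ of dimension at most $n - 1$: the coordinates of $y$ indexed by $I$ are confined to the column space of the corresponding rows of $X$, which has dimension at most $p < \mathrm{Card}(I)$. Hence the Lebesgue measure of $S_I$ in $\mathbb{R}^n$ is zero. The conclusion follows from the fact that

$$\{ \mathrm{Card}(I^*) \ge p + 1 \} \subset \bigcup_{\substack{I \subset \{1, \dots, n\} \\ \mathrm{Card}(I) \ge p + 1}} \{ y \in S_I \},$$

and the absolute continuity of the distribution of $y$. □

References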

[1] Bertsekas, D. (1999). Nonlinear Programming (2nd edition). Athena Scientific.

[2] El-Attar, R.A., Vidyasagar, M. and Dutta, S.P.K. (1979). An algorithm for l1-norm minimization with application to nonlinear l1-approximation. SIAM J. Numer. Anal. 16 70-86.