Supplementary Material:
Stochastic Dual Coordinate Descent with Alternating Direction Multiplier Method
Taiji Suzuki S-TAIJI@IS.TITECH.AC.JP
Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo 152-8552, JAPAN
A. Derivation of the proximal operation for the smoothed hinge loss
By the definition of the smoothed hinge loss, we have that, for−1≤yiv≤0, fi∗(v) = sup
u∈R
{uv−fi(u)}= sup
u∈R
uv−1
2(1−yiu)2
= sup
u∈R
1
2(1 +yiv)2−1 2−1
2(1 +yiv−yiu)2
=1
2(1 +yiv)2−1 2, andfi∗(v) =∞otherwise.
Since
fi∗(v) C +1
2(q−v)2= ( 1
2C(1 +yiv)2−2C1 +12(q−v)2 (−1≤yiv≤0),
∞ (otherwise),
=
1+C 2C
v+y1+Ci−qC2
+v
2(yi−qC)2
2C(1+C) +q22 (−1≤yiv≤0),
∞ (otherwise).
Thus by minimizing this with respect tov, we have that
prox(u|fi∗/C) =
Cu−yi
1+C (−1≤ Cuy1+Ci−1 ≤0),
−yi (−1> Cuy1+Ci−1), 0 (otherwise).
B. Proof of the main theorem
In the section, we give the proofs of the theorems in the main body. For notational simplicity, we rewrite the dual problem as follows:
x∈Xmin,y∈Y
Xn i=1
gi(xi) +φ(y), (S-1a)
s.t. Zx+By= 0, (S-1b)
whereZ ∈ Rp×n,B ∈ Rp×d. This is equivalent to the dual optimization problem in the main text whengi = fi∗ and φ=nψ∗(·/n)(or equivalentlyφ∗=nψ). We writeg(x) =Pn
i=1gi(xi).
Then we consider the following update rule:
y(t)←arg min
y
φ(y)− hw(t−1), Zx(t−1)+Byi+ρ
2kZx(t−1)+Byk2+1
2ky−y(t−1)kQ
x(t)i ←arg min
xI
X
i∈I
gi(xi)− hw(t−1), ZIxI +By(t)i+ρ
2kZIxI+Z\Ix(t−1)\I +By(t)k2+1
2kxI−x(t−1)I kGii
w(t)=w(t−1)−γρ{n(Zx(t)+By(t))−(n−n/K)(Zx(t−1)+By(t−1))}.
Assumption 1 can be interpreted as follows. There is an optimal solution(x∗, y∗)and corresponding Lagrange multiplier w∗such that
∂g(x∗)∋Z⊤w∗, ∂φ(y∗)∋B⊤w∗.
We denote by∇f(x)an arbitrary element of the subgradient∂f(x)of a convex functionf atx. Moreover, we suppose that each (dual) loss functiongiisv-strongly convex andφish-smooth:
gi(xi)−gi(x∗i)≥ hz⊤i w∗, xi−x∗ii+vkxi−x∗ik2
2 .
We also assume that there exithandvφsuch that, for ally, uand ally∗ ∈ Y∗, there exitsyb∗ ∈ Y∗which depends ony and we have
φ(y)−φ(y∗)≥ hB⊤w∗, y−y∗i+v′φ
2 kPKer(B)(y−y∗)k2, φ∗(u)−φ∗(B⊤w∗)≥ hy∗, u−B⊤w∗i+h′
2ku−B⊤w∗k2.
Note that the primal and dual are flipped compared with the main text. Once can check that there is a correspondence betweenvψ, hin the main text andvφ′ andh′such thatvφ′ =vnψ andh′=nh.
Define
F(x, y) :=
Xn i=1
gi(xi) +φ(y)− hw∗, Zx+Byi (=nFD(x, y)).
By the definition ofw∗, one can easily check that
F(x, y)−F(x∗, y∗)≥ nv
2 kx−x∗k2≥0.
We define
R′(x, y, w)
=F(x, y)−F(x∗, y∗) +2
ρkw(t)−w∗k2+ρ(1−γ)
2 kZx+Byk2+1
2kx−x∗k2vIp+H+ 1
2Kky− Y∗k2Q. Here again we have that R′ = nRD. Let nˆ = n/K, the expected cardinality of |I|, and let DiagI(S)be a block diagonal matrix whoseIk×Ik (k = 1, . . . , K)diagonal elements are non-zero and given by(Diag(S))Ik,Ik = SIk,Ik
(k= 1, . . . , K).
Theorem 2. Suppose thatγ= 4n1,DiagI(G)≻2γρ(n−n)Diagˆ I(Z⊤Z)andB⊤is injective. Then, under the assump- tions, the objective function converges R-linearly:
R′(x(t), y(t), w(t))≤ 1− µ
K T
R(x(0), y(0), w(0)), E[F(x(t), y(t))−F(x∗, y∗)]≤
1− µ K
T
R(x(0), y(0), w(0)), where
µ:= min (1
2
v v+σmax(H)
, h′ρσmin(BB⊤)
2 max{1,4h′ρ,4h′σmax(Q)}, Kvφ′
4σmax(Q), Kvσmin(BB⊤) σmax(Q)(ρσmax(Z⊤Z) + 4v)
) ,
In particular, we have that
E[kw(t)−w∗k2]≤ ρ 2
1− µ K
T
R(x(0), y(0), w(0)).
Theorem 1 in the main text can be obtained using the relationv′φ = vnφ,h′ = nh, F = nFD andR′ = nRD. The convergence of the primal objective is obtained by using the following fact: Sincegis strongly convex aroundx∗, we have that
g(x)−g(x∗)≥ hZ⊤w∗, x−x∗i+vkx−x∗k2
2 (∀x)
⇒g∗(u)≤g∗(Z⊤w∗) +hx∗, u−Z⊤w∗i+ku−Z⊤w∗k2 2v (∀u).
where we usedZ⊤λ∗∈∂g(x∗). Using this, we have that, 1
n Xn i=1
fi(zi⊤w(t))−1 n
Xn i=1
fi(zi⊤w∗)≤D
Zx∗/n, w(t)−w∗E
+kZ⊤(w(t)−w∗)k2 2nv
=D
−y∗/n, B⊤(w(t)−w∗)E
+kZ⊤(w(t)−w∗)k2
2nv ,
where we used the relationZx∗+By∗= 0. Moreover, using the relationψ(B⊤w)≤ψ(B⊤w∗) +hy∗/n, B⊤(w−w∗)i+ l1kw−w∗k+l2kw−w∗k2and the Jensen’s inequalityE[kw(T)−w∗k]2≤E[kw(T)−w∗k2], we obtain the assertion.
Proof of Theorem 2.
Step 1(Deriving a basic inequality):
g(x(t))−g(x(t−1)) +φ(y(t))−φ(y(t−1))
=X
i∈I
gi(x(t)i )−X
i∈I
gi(x(t−1)i ) +φ(y(t))−φ(y(t−1))
=X
i∈I
gi(x(t)i )− hw(t−1), Zx(t)+By(t)i+ρ
2kZx(t)+By(t)k2+1
2kx(t)I −x(t−1)I k2GI,I +hw(t−1), Zx(t)+By(t)i −ρ
2kZx(t)+By(t)k2−1
2kx(t)I −x(t−1)I k2GI,I
−X
i∈I
gi(x(t−1)i ) +φ(y(t))−φ(y(t−1)). (S-2)
Here we define thatZ˜I = [Z\IZI]andx˜:=
"
x(t−1)\I xI
#
for a givenxI, and
˜
gI(xI) :=X
i∈I
gi(xi)−D
w(t−1),Z˜Ix˜+By(t)E +ρ
2kZ˜Ix˜+By(t)k2+1
2kxI−x(t−1)I k2GI,I.
Then by the update rule ofx(t), we have that
˜
gI(x(t)I )≤g˜I(x∗I)−v
2kx(t)I −x∗Ik2−ρ
2kZI(x(t)I −x∗I)k2−1
2kx(t)I −x∗IkGI,I,
which implies X
i∈I
gi(x(t)i )−D
w(t−1), Zx(t)+By(t)E +ρ
2kZx(t)+By(t)k2+1
2kx(t)I −x(t−1)I k2GI,I
≤X
i∈I
gi(x∗i)−D
w(t−1), ZIx∗I+Z\Ix∗\I+By(t)E +ρ
2kZIx∗I +Z\Ix(t−1)\I +By(t)k2+1
2kx∗I−x(t−1)I k2GI,I
−v
2kx(t)I −x∗Ik2−ρ
2kZI(x(t)I −x∗I)k2−1
2kx(t)I −x∗IkGI,I
=X
i∈I
gi(x∗i)−D
w(t−1), ZI(x∗I−x(t)I )E
−D
w(t−1), Zx(t)+By(t)E +ρ
2kZIx∗I+Z\Ix(t−1)\I +By(t)k2−ρ
2kZx(t)+By(t)k2+ρ
2kZx(t)+By(t)k2+1
2kx∗I−x(t−1)I k2GI,I
−v
2kx(t)I −x∗Ik2−ρ
2kZI(x(t)I −x∗I)k2−1
2kx(t)I −x∗IkGI,I
=X
i∈I
gi(x∗i)−D
w(t−1), ZI(x∗I−x(t)I )E
−v
2kx(t)I −x∗Ik2−ρ
2kZI(x(t)I −x∗I)k2−1
2kx(t)I −x∗IkGI,I
−ρhZ\Ix(t)\I +By(t), ZI(x(t)I −x∗I)i+ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2+1
2kx∗I−x(t−1)I k2GI,I
−D
w(t−1), Zx(t)+By(t)E +ρ
2kZx(t−1)+By(t)k2. Using this, the RHS of Eq. (S-2) can be further bounded by
(RHS)≤X
i∈I
gi(x∗i)−X
i∈I
gi(x(t−1)i )− hw(t−1), ZI(x∗I −x(t)I )i
−v
2kx(t)I −x∗Ik2−ρ
2kZI(x(t)I −x∗I)k2−1
2kx(t)I −x∗IkGI,I
−ρhZ\Ix(t)\I +By(t), ZI(x(t)I −x∗I)i+ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2 +1
2kx∗I −x(t−1)I k2GI,I−1
2kx(t)I −x(t−1)I k2GI,I
+φ(y(t))−φ(y(t−1)). (S-3)
Here, we bound the term
−ρhZ\ix(t)\i +By(t), ZI(x(t)I −x∗I)i+ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2. By Lemma 3, the expectation of this term is equivalent to
Eh
−ρ
nhZx(t−1)+By(t), Z(nx(t)−(n−ˆn)x(t−1)−nxˆ ∗)ii + ρ
2Kkx(t−1)−x∗k2Diag
I(Z⊤Z)−ρ 2Eh
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i.
Note that, for any block diagonal matrixSwhich satisfiesSIk,Ik′ = (Si,j)(i,j)∈Ik×Ik′ =O(∀k6=k′), we have that E[kx(t)I −x∗Ik2SI,I] = E[kx(t)I −x(t−1)I +x(t−1)I −x∗Ik2SI,I]
= E[kx(t)I −x(t−1)I k2SI,I] + E[2hx(t)I −x(t−1)I , x(t−1)I −x∗IiSI,I] + E[kx(t−1)I −x∗Ik2SI,I]
= E[kx(t)−x(t−1)k2S] + E[2hx(t)−x(t−1), x(t−1)−x∗iS] + 1
Kkx(t−1)−x∗k2S
= E[kx(t)−x∗k2S]−E[kx(t−1)−x∗k2S] + 1
Kkx(t−1)−x∗k2S
= E[kx(t)−x∗k2S]−
1− 1 K
E[kx(t−1)−x∗k2S],
where the expectation is taken with respect to the choice ofI∈ {I1, . . . , IK}. Moreover, for a fixed vectorq, we have that E[hqI, x(t)I −x∗Ii]
=E[hqI, x(t)I −x(t−1)I +x(t−1)I −x∗Ii] = E[hq, x(t)−x(t−1)i] + E[hqI, x(t−1)I −x∗Ii]
=E[hq, x(t)−x(t−1)i] + E
" K X
k=1
1[I=Ik]hqIk, x(t−1)Ik −x∗Iki
#
=E[hq, x(t)−x(t−1)i] + 1 K
XK k=1
1[I=Ik]hqIk, x(t−1)Ik −x∗Iki= E[hq, x(t)−x(t−1)i] + 1
Khq, x(t−1)−x∗i
=E
q, x(t)−
1− 1 K
x(t−1)− 1 Kx∗
.
Then, by taking expectation with respect toIand multiplying both sides of the above inequality byn, we have that nE[g(x(t)) +φ(y(t))−g(x(t−1))−φ(y(t−1))]
≤g(x∗)−g(x(t−1)) + E[hw(t−1), Z(nx(t)−(n−n)xˆ (t−1)−nxˆ ∗)i]
−Ehnv
2 kx(t)−x∗k2+nρ
2 kx(t)−x∗k2Diag
I(Z⊤Z)+n
2kx(t)−x∗k2Diag
I(G)
i +(n−n)vˆ
2 kx(t−1)−x∗k2] + (n−ˆn)ρ
2 kx(t−1)−x∗k2Diag
I(Z⊤Z)+n−nˆ
2 kx(t−1)−x∗k2Diag
I(G)
+ Eh
−ρhZx(t−1)+By(t), Z(nx(t)−(n−n)xˆ (t−1)−ˆnx∗)ii +ρˆn
2 kx(t−1)−x∗k2Diag
I(Z⊤Z)−nρ 2 Eh
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i +nˆ
2kx(t−1)−x∗k2Diag
I(G)−n
2E[kx(t)−x(t−1)k2Diag
I(G)]
+nφ(y(t))−nφ(y(t−1)). (S-4)
Here, note that the last two termnφ(y(t))−nφ(y(t−1))is bounded as nφ(y(t))−nφ(y(t−1))
=ˆn(φ(y(t))−φ(y(t−1))) + (n−n)(φ(yˆ (t))−φ(y(t−1)))
≤ˆn(φ(y∗)−φ(y(t−1))) +D
∇φ(y(t)),(n−n)(yˆ (t)−y(t−1)) + ˆn(y(t)−y∗)E
−nhˆ ′
2 kB⊤w∗− ∇φ(y(t))k2.
for arbitraryy∗∈ Y∗where we used Lemma 4 in the last line. Define
˜
w(t):=w(t−1)−ρ(Zx(t−1)+By(t)).
Note thatB⊤w˜(t)−Q(y(t)−y(t−1))∈∂φ(y(t)).
Next, addingE[nhw∗, Z(x(t−1)−x(t)) +B(y(t−1)−y(t))i]to the both sides of Eq. (S-4), we have that nE[F(x(t), y(t))−F(x(t−1), y(t−1))]
≤ˆn(F(x∗, y∗)−F(x(t−1), y(t−1)))
+ E[hw(t−1)−w∗, Z(nx(t)−(n−ˆn)x(t−1)−nxˆ ∗)i]
+ E[hw˜(t)−w∗, B(ny(t)−(n−n)yˆ (t−1)−nyˆ ∗)i]
− hQ(y(t)−y(t−1)), ny(t)−(n−n)yˆ (t−1)−nyˆ ∗i
−Ehnv
2 kx(t)−x∗k2+n
2kx(t)−x∗k2Hi
+(n−n)vˆ
2 kx(t−1)−x∗k2+n
2kx(t−1)−x∗k2H + Eh
−ρhZx(t−1)+By(t), Z(nx(t)−(n−ˆn)x(t−1)−nxˆ ∗)ii
−n 2Eh
kx(t)−x(t−1)k2Hi
−ˆnh′
2 kB⊤w∗− ∇φ(y(t))k2. (S-5) Step 2(Rearranging cross terms between(x(t), y(t), w(t))and(x(t−1), y(t−1), w(t−1))):
Now, we definexˆ(t):=nx(t)−(n−n)xˆ (t−1)andyˆ(t):=ny(t)−(n−n)yˆ (t−1). Then by the update rule ofw(t), we have thatw(t)=w(t−1)−γρ(Zxˆ(t)+Byˆ(t)). We evaluate the termE[hw(t−1)−w∗, Z(ˆx(t)−ˆnx∗)i] + E[hw˜(t)−w∗, B(ˆy(t)− ˆ
ny∗)i]:
hw(t−1)−w∗, Z(ˆx(t)−ˆnx∗)i+hw˜(t)−w∗, B(ˆy(t)−nyˆ ∗)i
=hw(t−1)−w∗, Z(ˆx(t)−ˆnx∗)i+hw(t−1)−ρ(Zx(t−1)+By(t))−w∗, B(ˆy(t)−nyˆ ∗)i
=hw(t)+γρ(Zxˆ(t)+Byˆ(t))−w∗, Z(ˆx(t)−ˆnx∗)i
+hw(t)+γρ(Zxˆ(t)+Byˆ(t))−ρ(Zx(t−1)+By(t))−w∗, B(ˆy(t)−nyˆ ∗)i
=− 1
γρhw(t)−w∗, w(t)−w(t−1)i
+γρkZxˆ(t)+Byˆ(t)k2−ρhZx(t−1)+By(t), B(ˆy(t)−nyˆ ∗)i
=− 1 2γρ
kw(t)−w∗k2+kw(t)−w(t−1)k2− kw(t−1)−w∗k2 +γρkZxˆ(t)+Byˆ(t)k2−ρhZx(t−1)+By(t), B(ˆy(t)−nyˆ ∗)i
= 1 2γρ
−kw(t)−w∗k2+kw(t−1)−w∗k2 +γρ
2 kZxˆ(t)+Byˆ(t)k2
−ρhZx(t−1)+By(t), B(ˆy(t)−nyˆ ∗)i.
Therefore,
hw(t−1)−w∗, Z(ˆx(t)−nxˆ ∗) +hw˜(t)−w∗, B(ˆy(t)−nyˆ ∗)i
−ρhZx(t−1)+By(t), Z(nx(t)−(n−n)xˆ (t−1)−ˆnx∗)i
= 1 2γρ
−kw(t)−w∗k2+kw(t−1)−w∗k2 +γρ
2 kZˆx(t)+Byˆ(t)k2
−ρhZx(t−1)+By(t), Zxˆ(t)+Byˆ(t)i
= 1 2γρ
−kw(t)−w∗k2+kw(t−1)−w∗k2 +γρ
2 n2kZx(t)+By(t)k2+γρ
2 (n−n)ˆ 2kZx(t−1)+By(t−1)k2
−γρn(n−n)hZxˆ (t)+By(t), Zx(t−1)+By(t−1)i
−ρhZx(t−1)+By(t), Z(nx(t)−(n−n)xˆ (t−1)) +B(ny(t)−(n−n)yˆ (t−1))i.
Next, we expand the non-squared term:
−γρn(n−n)hZxˆ (t)+By(t), Zx(t−1)+By(t−1)i
−ρhZx(t−1)+By(t), Z(nx(t)−(n−n)xˆ (t−1)) +B(ny(t)−(n−n)yˆ (t−1))i
=−γρn(n−n)hZxˆ (t)−Zx∗, Zx(t−1)−Zx∗i
−γρn(n−n)hByˆ (t)−By∗, By(t−1)−By∗i
−γρn(n−n)hZxˆ (t)−Zx∗, By(t−1)−By∗i
−γρn(n−n)hByˆ (t)−By∗, Zx(t−1)−Zx∗i
−nρhZx(t−1)−Zx∗, Z(x(t)−x∗)i+ (n−ˆn)ρkZx(t−1)−Zx∗k2 + (n−n)ρhByˆ (t)−By∗, B(y(t−1)−y∗)i −nρkBy(t)−By∗k2
−ρhZx(t−1)−Zx∗, B(ny(t)−(n−n)yˆ (t−1)−nyˆ ∗)i
−ρhB(y(t)−y∗), Z(nx(t)−(n−n)xˆ (t−1)−nxˆ ∗)i
=−(γρn(n−n) +ˆ nρ)hZx(t)−Zx∗, Zx(t−1)−Zx∗i
−(γρn(n−n)ˆ −(n−n)ρ)hByˆ (t)−By∗, By(t−1)−By∗i
−γρn(n−n)hZxˆ (t)−Zx∗, By(t−1)−By∗i
−(γρn(n−n) +ˆ nρ−(n−n)ρ)hByˆ (t)−By∗, Zx(t−1)−Zx∗i + (n−n)ρkZxˆ (t−1)−Zx∗k2−nρkBy(t)−By∗k2
−ρ(n−ˆn)hZx(t−1)−Zx∗, B(y∗−y(t−1))i
−ρnhB(y(t)−y∗), Z(x(t)−x∗)i. (S-6)
Using the relation
hZx(t)−Zx∗, By(t−1)−By∗i=hZ(x(t)−x∗), B(y(t)−y∗)i+hZ(x(t)−x∗), B(y(t−1)−y(t))i, hBy(t)−By∗, Zx(t−1)−Zx∗i=hB(y(t)−y(t−1)), Z(x(t−1)−x∗)i+hB(y(t−1)−y∗), Z(x(t−1)−x∗)i, the RHS of Eq. (S-6) is equivalent to
−(γρn(n−n) +ˆ nρ)hZx(t)−Zx∗, Zx(t−1)−Zx∗i
−(γρn(n−n)ˆ −(n−n)ρ)hByˆ (t)−By∗, By(t−1)−By∗i + (n−n)ρkZxˆ (t−1)−Zx∗k2−nρkBy(t)−By∗k2
+{−(γρn(n−n) +ˆ ρˆn) +ρ(n−n)}hZxˆ (t−1)−Zx∗, B(y(t−1)−y∗)i
−(γρn(n−n) +ˆ ρn)hB(y(t)−y∗), Z(x(t)−x∗)i
−γρn(n−n)hZ(xˆ (t)−x∗), B(y(t−1)−y(t))i
−(γρn(n−n) +ˆ ρˆn)hBy(t)−By(t−1), Zx(t−1)−Zx∗i.
The last two terms are transformed to
−γρn(n−n)hZˆ (x(t)−x∗), B(y(t−1)−y(t))i
−(γρn(n−ˆn) +ρˆn)hBy(t)−By(t−1), Zx(t−1)−Zx∗i
=γρn(n−ˆn)hZ(x(t)−x(t−1)), B(y(t)−y(t−1))i
−ρˆnhBy(t)−By∗, Zx(t−1)−Zx∗i+ρˆnhBy(t−1)−By∗, Zx(t−1)−Zx∗i.
Thus, the RHS of Eq. (S-6) is further transformed to
−(γρn(n−n) +ˆ nρ)hZx(t)−Zx∗, Zx(t−1)−Zx∗i
−(γρn(n−n)ˆ −(n−ˆn)ρ)hBy(t)−By∗, By(t−1)−By∗i + (n−n)ρkZxˆ (t−1)−Zx∗k2−nρkBy(t)−By∗k2
+{−γρn(n−n) +ˆ ρ(n−ˆn)}hZx(t−1)−Zx∗, B(y(t−1)−y∗)i
−(γρn(n−n) +ˆ ρn)hB(y(t)−y∗), Z(x(t)−x∗)i +γρn(n−ˆn)hZ(x(t)−x(t−1)), B(y(t)−y(t−1))i
−ρˆnhBy(t)−By∗, Zx(t−1)−Zx∗i.
By Lemma 5 andZx∗=−By∗, this is equivalent to
−1
2(γρn(n−n) +ˆ nρ){kZx(t)−Zx∗k2+kZx(t−1)−Zx∗k2− kZx(t)−Zx(t−1)k2}
−1
2(γρn(n−n)ˆ −(n−ˆn)ρ){kBy(t)−By∗k2+kBy(t−1)−By∗k2− kBy(t)−By(t−1)k2} + (n−n)ρkZxˆ (t−1)−Zx∗k2−nρkBy(t)−By∗k2
−1
2{−γρn(n−n) +ˆ ρ(n−ˆn)}(kZx(t−1)−Zx∗k2+kB(y(t−1)−y∗)k2− kZx(t−1)+By(t−1)k2) +1
2(γρn(n−n) +ˆ ρn)(kZx(t)−Zx∗k2+kB(y(t)−y∗)k2− kZx(t)+By(t)k2) +γρn(n−ˆn)hZ(x(t)−x(t−1)), B(y(t)−y(t−1))i
−ρˆnhBy(t)−By∗, Zx(t−1)−Zx∗i
=−ρˆn
2 kZx(t−1)−Zx∗k2+1
2(γρn(n−n) +ˆ nρ)kZx(t)−Zx(t−1)k2
−ρˆn
2 kBy(t)−By∗k2+1
2(γρn(n−n)ˆ −(n−n)ρ)kByˆ (t)−By(t−1)k2
−1
2{γρn(n−n)ˆ −ρ(n−n)}kZxˆ (t−1)+By(t−1)k2
−1
2{γρn(n−n) +ˆ ρn}kZx(t)+By(t)k2
+γρn(n−ˆn)hZ(x(t)−x(t−1)), B(y(t)−y(t−1))i
−nρhByˆ (t)−By∗, Zx(t−1)−Zx∗i
=−nρˆ
2 kZx(t−1)+By(t)k2 +1
2(γρn(n−n) +ˆ nρ)kZx(t)−Zx(t−1)k2 +1
2(γρn(n−n)ˆ −(n−ˆn)ρ)kBy(t)−By(t−1)k2
−1
2{γρn(n−n)ˆ −ρ(n−n)}kZxˆ (t−1)+By(t−1)k2
−1
2{γρn(n−n) +ˆ ρn}kZx(t)+By(t)k2
+γρn(n−ˆn)hZ(x(t)−x(t−1)), B(y(t)−y(t−1))i. (S-7) Since
γρn(n−ˆn)hZ(x(t)−x(t−1)), B(y(t)−y(t−1))i
≤γρn(n−n)ˆ
2 {kZ(x(t)−x(t−1))k2+kB(y(t)−y(t−1))k2}, the RHS of Eq. (S-7) is bounded by
−ˆnρ
2 kZx(t−1)+By(t)k2 +1
2(2γρn(n−ˆn) +nρ)kZx(t)−Zx(t−1)k2 +1
2(2γρn(n−ˆn)−(n−ˆn)ρ)kBy(t)−By(t−1)k2
−1
2{γρn(n−n)ˆ −ρ(n−ˆn)}kZx(t−1)+By(t−1)k2−1
2{γρn(n−n) +ˆ ρn}kZx(t)+By(t)k2. Combining this and Eq. (S-5), and noticingkZx(t)−Zx(t−1)k=kZI(x(t)I −x(t−1)I )k=kx(t)−x(t−1)kDiag
I(Z⊤Z), we
obtain
nE[F(x(t), y(t))−F(x(t−1), y(t−1))]
≤ˆn(F(x∗, y∗)−F(x(t−1), y(t−1))) + 1
2γρ
−kw(t)−w∗k2+kw(t−1)−w∗k2
−nρˆ
2 kZx(t−1)+By(t)k2 +1
2{γρn2−γρn(n−ˆn)−ρn}kZx(t)+By(t)k2 +1
2{γρ(n−n)ˆ 2−γρn(n−n) +ˆ ρ(n−n)}kZxˆ (t−1)+By(t−1)k2
−Ehnv
2 kx(t)−x∗k2+n
2kx(t)−x∗k2Hi +(n−n)vˆ
2 kx(t−1)−x∗k2+n
2kx(t−1)−x∗k2H +γρn(n−n)Eˆ h
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i−n
2E[kx(t)−x(t−1)k2Diag
I(G)] + (γρn(n−ˆn)−(n−n)ρˆ
2 )kB(y(t)−y(t−1))k2
− hQ(y(t)−y(t−1)), ny(t)−(n−ˆn)y(t−1)−nyˆ ∗i
−nhˆ ′
2 kB⊤w∗− ∇φ(y(t))k2.
Since we have assumedDiagI(G)≻2γρ(n−n)Diagˆ I(Z⊤Z), it holds that γρn(n−n)Eˆ h
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i
−n
2E[kx(t)−x(t−1)k2Diag
I(G)]≤0.
Moreover, we have that
− hQ(y(t)−y(t−1)), ny(t)−(n−n)yˆ (t−1)−nyˆ ∗i
=−nky(t)−y(t−1)k2Q+1
2{ky(t)−y(t−1)k2Q+ky(t−1)−y∗k2Q− ky(t)−y∗k2Q}
=−
n−nˆ 2
ky(t)−y(t−1)k2Q+nˆ
2ky(t−1)−y∗k2Q−ˆn
2ky(t)−y∗k2Q. Finally, we achieve
nE[F(x(t), y(t))−F(x(t−1), y(t−1))]
≤ˆn(F(x∗, y∗)−F(x(t−1), y(t−1))) + 1
2γρ
−kw(t)−w∗k2+kw(t−1)−w∗k2
−ρn(1−γ)
2 kZx(t)+By(t)k2+ρ(n−n)(1 +ˆ γ)
2 kZx(t−1)+By(t−1)k2
−Ehnv
2 kx(t)−x∗k2+n
2kx(t)−x∗k2Hi +(n−n)vˆ
2 kx(t−1)−x∗k2+n
2kx(t−1)−x∗k2H +γρn(n−n)Eˆ h
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i−n
2E[kx(t)−x(t−1)k2Diag
I(G)]
−nρˆ
2 kZx(t−1)+By(t)k2 + (γρn(n−ˆn)−(n−n)ρˆ
2 )kB(y(t)−y(t−1))k2
−
n−ˆn 2
ky(t)−y(t−1)k2Q+nˆ
2ky(t−1)−y∗k2Q−nˆ
2ky(t)−y∗k2Q
−nhˆ ′
2 kB⊤w∗− ∇φ(y(t))k2. (S-8)
Note that Eq. (S-8) holds for arbitraryy∗∈ Y∗. Step 3:(Deriving the assertion)
(i) Now, since we can take∇φ(y(t)) =B⊤w(t−1)−ρ(Zx(t−1)+By(t))−Q(y(t)−y(t−1)), it holds that kB⊤w∗− ∇φ(y(t))k2=kB⊤(w∗−w(t−1))−ρ(Zx(t−1)+By(t))−Q(y(t)−y(t−1))k2. SinceB⊤is injective, this gives that
−h′
2kB⊤w∗− ∇φ(y(t))k2
≤ −h′σmin(BB⊤)kw∗−w(t−1)k2+ 2h′ρ2kZx(t−1)+By(t)k2+ 2h′kQ(y(t)−y(t−1))k2
≤ −h′σmin(BB⊤)kw∗−w(t−1)k2+ 2h′ρ2kZx(t−1)+By(t)k2+ 2h′σmax(Q)ky(t)−y(t−1)k2Q. Now, dividing both sides bymax{1,4h′ρ,4h′σmax(Q)}(≥1), we have
−nhˆ ′
2 kB⊤w∗− ∇φ(y(t))k2
≤ − nhˆ ′σmin(BB⊤)
max{1,4h′ρ,4h′σmax(Q)}kw∗−w(t−1)k2+ˆnρ
2 kZx(t−1)+By(t)k2+nˆ
2ky(t)−y(t−1)k2Q. (S-9) (ii) Next, it holds that, for someyb∗∈ Y∗,
1 2
F(x∗, y∗)−F(x(t−1), y(t−1))
≤ −v′φ
4 kPKer(B)(y(t−1)−yb∗)k2. (S-10) On the other hand, for arbitrarya >0, it follows that
−ρ
8kZx(t−1)+By(t−1)k2
≤ −1
8(1−a)kZ(x(t−1)−x∗)k2−1
8(1−a−1)kB(y(t−1)−by∗)k2. Thus, settinga= 1 +ρσmax2v(Z⊤Z), we have that
−ρ
8kZx(t−1)+By(t−1)k2
≤ρ 8
2v
ρσmax(Z⊤Z)σmax(Z⊤Z)kx(t−1)−x∗k2
−ρ 8
2vρ
ρσmax(Z⊤Z) + 4vσmin(BB⊤)kPKer(B)⊥ (y(t−1)−yb∗)k2
=v
4kx(t−1)−x∗k2− vρσmin(BB⊤)
4(ρσmax(Z⊤Z) + 4v)kPKer(B)⊥ (y(t−1)−yb∗)k2. (S-11) Combining Eqs. (S-10), (S-11), we have that
ˆ n 2
F(x∗, y∗)−F(x(t−1), y(t−1))
−nρˆ
8nkZx(t−1)+By(t−1)k2
≤nvˆ
4 kx(t−1)−x∗k2−ˆn nmin
nv′φ, nρvσmin(BB⊤) ρσmax(Z⊤Z) + 4v
ky(t−1)−by∗k2Q 4σmax(Q)
≤nvˆ
4 kx(t−1)−x∗k2−ˆn nmin
nv′φ, nρvσmin(BB⊤) ρσmax(Z⊤Z) + 4v
ky(t−1)− Y∗k2Q
4σmax(Q) . (S-12)
(iii) By the assumptionDiagI(G)≻2γρ(n−n)Diagˆ I(Z⊤Z), it holds that γρn(n−n)Eˆ h
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i−n
2E[kx(t)−x(t−1)k2Diag
I(G)]≤0. (S-13) (iv) Therefore, ifγ= 4n1 , applying Eq. (S-9), Eq. (S-12) and Eq. (S-13) to Eq. (S-8), for
ν =nˆ nmin
(1 4
v v+σmax(H)
, h′ρσmin(BB⊤)
2 max{1,4h′ρ,4h′σmax(Q)}, nvφ′/ˆn
4σmax(Q), nvσmin(BB⊤)/nˆ 4σmax(Q)(ρσmax(Z⊤Z) + 4v)
) ,
we have that
Eh
F(x(t), y(t))−F(x∗, y∗) + 1
2nγρkw(t)−w∗k2 +ρ(1−γ)
2 kZx(t)+By(t)k2+1
2kx(t)−x∗k2vIp+H+ nˆ
2nky(t)−y∗k2Qi
≤(1−ν) (
F(x(t−1), y(t−1))−F(x∗, y∗) + 1
2nγρkw(t−1)−w∗k2 +ρ(1−γ)
2 kZx(t−1)+By(t−1)k2+1
2kx(t−1)−x∗k2vIp+H+ nˆ
2nky(t−1)−y∗k2Q )
.
Settingµ:=nν/ˆn, this gives the assertion.
Lemma 3.
Eh
−ρhZ\ix(t)\i +By(t), ZI(x(t)I −x∗I)i+ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2i
≤Eh
−ρ
nhZx(t−1)+By(t), Z(nx(t)−(n−ˆn)x(t−1)−x∗)ii + ρ
2nkx(t−1)−x∗k2Diag
I(Z⊤Z)−ρ 2Eh
kx(t)−x(t−1)k2Diag
I(Z⊤Z)
i .
Proof.
ρhZ\Ix(t−1)\I , ZI(x∗I−x(t)I )i+ρhBy(t), ZI(x∗I−x(t)I )i+ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2
=ρhZx(t−1), ZI(x∗I−x(t)I )i+ρhBy(t), ZI(x∗I−x(t)I )i+ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2
−ρhZIx(t−1)i , ZI(x∗I−x(t)I )i
=ρhZx(t−1), ZI(x∗I−x(t−1)I +x(t−1)I −x(t)I )i+ρhBy(t), ZI(x∗I −x(t−1)I +x(t−1)I −x(t)I )i +ρ
2kZIx∗Ik2−ρ
2kZIx(t)I k2−ρhZIx(t−1)I , ZI(x∗I−x(t)I )i
=ρhZx(t−1)+By(t), ZI(x∗I −x(t−1)I )i +ρ
2kZI(x(t−1)I −x∗I)k2−ρ
2kZI(x(t)I −x(t−1)I )k2 +ρhZx(t−1)+By(t), Z(x(t−1)−x(t))i.