リーマン計量調整に基づく
Tucker
多様体の
幾何の提案と最適化問題への応用
電気通信大学大学院情報理工学研究科情報ネットワーク工学専攻 笠井裕之
Amazon
Development
CentreIndia,
Bamdev MishraHiroyuki
KasaiDepartment
ofComputer
and NetworkEngineering,
TheUniversity
of Electro‐CommunicationsBamdev Mishra
Amazon
Development
Centre India概要
本稿では,低ランクテンソルTucker分解のための新し
い幾何空間
Scaled Tucker.Manifold による テンソル補完
問題
の効率的な手法を提案した論文
[1]
の概要を記す.提案
手法は,.一般的なテンソル回帰問題に対して,Scaled
TuckerManifold
により効率的な解法を確立することが可能となる.
Scaled TuckerManifol
の導出にあたっては,Tucker
分解の対称構造と回帰問題の最小自乗構造に着目した新しいリーマ
ン計量を提案し,幾何空間を定義する数々の構成要素を導出
している.1
Introduction
This paper addresses the
problem
of low‐rank tensorcompletion
when the rank is apriori
knownorestimated. Without lossofgenerality,
wefocuson3‐ordertensors. Givenatensor\mathcal{X}^{n}1^{\mathrm{X}n}2\times n_{3},whose entries
\mathcal{X}_{i_{1},i_{2},i_{3}}^{\star}
areonly
known forsomeindices(i_{1}, i_{2}, i_{3})\in $\Omega$,
where $\Omega$ is a subset of the
complete
set of indices\{(i_{1}, i_{2}, i_{3})
:i_{d}\in\{1, . . . , n_{d}\},
d\in\{1
,2,
3 thefixed‐rank
tensorcompletion problem
isformulatedas\displaystyle \min_{\mathcal{X}\in \mathbb{R}^{n}1^{\mathrm{X}n}2^{\mathrm{X}n}3}\frac{1}{| $\Omega$|}\Vert \mathcal{P}_{ $\Omega$}(\mathcal{X})-P_{ $\Omega$}(\mathcal{X}^{\star})\Vert_{F}^{2}
subject
to rank(\mathcal{X})=\mathrm{r},
where the
operator
\mathcal{P}_{ $\Omega$}(\mathcal{X})_{i_{1}i_{2}i_{3}}=\mathcal{X}_{i_{1}i_{2}i_{3}}
if(i_{1}, i_{2}, i_{3})\in $\Omega$
and\mathcal{P}_{ $\Omega$}(X)_{i_{1}i_{2}i_{3}}=0
otherwise and(with
aslight
abuse ofnotation) \Vert \Vert_{F}
is the Frobenius norm.rank(X) (=\mathrm{r}=
(r_{1},
r_{2}, r_{3} called the multilinear rank of \mathcal{X}, istheset of the ranks of for each of mode‐d
unfolding
matrices. r_{d}\ll n_{d} enforces a low‐rank structure. The mode is a matrixobtained
by
concatenating
the mode‐d fibersalong
column and mode‐dunfolding
of \mathcal{X} isThe
optimization
problem
(1)
has manyvariants,
and one of those isextending
thenuclear norm
regularization approach
from the matrixcase[2]
tothe tensorcase. Whilethis
generalization
leadstogood
results[3−5],
itsscalabilityto large‐scale
instances isnottrivial, especially
due tothenecessity
ofhigh‐dimensional singular
valuedecomposition
computations.
A differentapproach exploits
Tuckerdecomposition
[6,
Section4]
of alow‐rank tensor \mathcal{X} to
develop large‐scale algorithms
for(1),
e.g., in[7, 8].
Thepresent
paper
exploits
both thesymmetry
present inTuckerdecomposition
and theleast‐squares
structureof thecostfunction of
(1)
by
using
theconcept
ofpreconditioning.
Whileprecon‐ditioning
inunconstrainedoptimization
iswell studied[9,
Chapter
5],
preconditioning
onconstraints with
symmetries,
owing
tonon‐uniqueness
of Tuckerdecomposition[6,
Sec‐tion
4.3],
is notstraightforward.
We build upon the recent work[10]
thatsuggests
touse Riemannian
preconditioning
witha tailored metric(inner product)
inthe Rieman‐nian
optimization
frameworkonquotient
manifolds[11−13].
Ourproposed preconditioned
nonlinear
conjugate
gradient algorithm
isimplemented
intheMatlab toolboxManopt
[14]
and it
outperforms
state‐of‐the‐art methods. \mathrm{I}\mathrm{n}\cdot \mathrm{t}\mathrm{h}\mathrm{e}supplementary
materialsection,
we show concretemathematical derivations and additional numericalcomparisons.
We alsoprovide
ageneric
Manopt factory
(a
manifolddescription
Matlabfile)
with additionalsupport
for second‐orderimplementations,
e.g., thetrust‐region
method.2
Exploiting
the
problem
structure
We focusonthetwofundamentalstructures
present
in(1):
symmetry
intheconstraints,
and the
least‐squares
structure of the cost function.Finally,
anovel metric isproposed.
The
quotient
andleast‐squares
structures. The Tuckerdecomposition
ofatensorX\in \mathbb{R}^{n}1\times 2 of rank \mathrm{r}
(=(r_{1}, r_{2}, r_{3}))
is[6,
Section4.1]
X=\mathcal{G}\times 1\mathrm{U}_{1}\times 2\mathrm{U}_{2^{\times}3}\mathrm{U}_{3}
,where\mathrm{U}_{d}\in \mathrm{S}\mathrm{t}(r_{d}, n_{d})
ford\in\{1
,2,
3\}
belongs
totheStiefel
manifold
of matrices of size n_{d}\times r_{d}with
orthogonal
columns and\mathcal{G}\in \mathbb{R}^{r_{1}\times r_{2}\times r_{3}}
.Here,
\mathcal{W}\times d\mathrm{V}\in \mathbb{R}^{n1\times}\ldots n_{d-1}\times m\times n_{d+N}\mathrm{u}\times\cdots n
computes
the d‐modeproduct
of a tensor \mathcal{W}\in \mathbb{R}^{n1\times\cdots \mathrm{x}n}\backslash .N anda matrix V \in \mathbb{R}^{m\times n}d.
Tucker
decomposition
is notunique
as X remainsunchanged
under the transforma‐tion
(\mathrm{U}_{1}, \mathrm{U}_{2}, \mathrm{U}_{3}, \mathcal{G})\mapsto(\mathrm{U}_{1}\mathrm{O}_{1}, \mathrm{U}_{2}\mathrm{O}_{2}, \mathrm{U}_{33,1123}\mathrm{O}\mathcal{G}\times \mathrm{O}^{T}\mathrm{x}_{2}\mathrm{O}^{T}\times \mathrm{O}_{3}^{T})
for all\mathrm{O}_{d}\in O(r_{d})
,which is the set of
orthogonal
matrices of size of r_{d}\times r_{d}. The classicalremedy
toremove this
indeterminacy
is to have additional structures on\mathcal{G}
likesparsity
or re‐stricted
orthogonal
rotations[6,
Section4.3].
Incontrast,
we encode the transforma‐tion in an abstract search space of
equivalence classes,
defined as,[(\mathrm{U}_{1}, \mathrm{U}_{2}, \mathrm{U}_{3}, \mathcal{G})]
:=\{(\mathrm{U}_{1}\mathrm{O}_{1}, \mathrm{U}_{2}\mathrm{O}_{2}, \mathrm{U}_{3}\mathrm{O}_{3}, \mathcal{G}\times \mathrm{O}^{T_{\mathrm{X}_{2}}}\mathrm{O}^{T}\times \mathrm{O}_{3}^{T}) : 0_{d}\in \mathcal{O}(r_{d})\}|
. Thesetofequivalence
classesisthe
quotient
manifold[15,
Theorem9.16]
\mathcal{M}/\sim :=\mathcal{M}/(\mathcal{O}(r_{1})\times \mathcal{O}(r_{2})\times \mathcal{O}(r_{3}))
,where\mathcal{M} iscalled the totalspace
(computational space)
that is theproduct
space\mathcal{M} :=composition,
the local minima of(1)
in\mathcal{M} are notisolated,
butthey
become isolated on\mathcal{M}/\sim
.Consequently,
theproblem
(1)
is anoptimization
problem
on aquotient
mani‐fold for which
systematic
procedures
areproposed
in[11−13]
by endowing
\mathcal{M}/\sim
with aRiemannianstructure. Wecall
\mathcal{M}/\sim
the Tuckermanifold.
Another structure that is
present
in(1)
is theleast‐squares
structure of thecostfunc‐tion. A way to
exploit
it is to endow the search space with a metric(inner product)
induced
by
the Hessian of the cost function[9].
This induced metric(or
itsapproxi‐
mation)
resolves convergence issues of first‐orderoptimization
algorithms. Specifically
for the case of
quadratic
optimization
with rank constraint(matrix case),
Mishra andSepulchre
[10,
Section5]
proposeafamily
of Riemannian metricsfrom the Hessian of thecostfunction. Since
applying
thisapproach directly
for(1)
iscomputationally costly,
weconsider a
simplified
cost functionby assuming
that $\Omega$ contains the full set ofindices,
i.e.,
we focus on\Vert \mathcal{X}-X^{\star}\Vert_{F}^{2}
to propose a metric candidate. Agood
candidate isby
considering only
the blockdiagonal
elements of the Hessian of\Vert X-X^{\star}\Vert_{F}^{2}
. It shouldemphasized
that the cost function||X-X^{\star}\Vert_{F}^{2}
is convex andquadratic
in X. Conse‐quently,
it is also convex andquadratic
in thearguments
(\mathrm{U}_{1}, \mathrm{U}_{2}, \mathrm{U}_{3}, \mathcal{G})
individually.
The blockdiagonal
approximation
of the Hessian of\Vert \mathcal{X}-\mathcal{X}^{\star}\Vert_{F}^{2}
in(\mathrm{U}_{1}, \mathrm{U}_{2}, \mathrm{U}_{3}, \mathcal{G})
is((\mathrm{G}_{1}\mathrm{G}_{1}^{T})\otimes \mathrm{I}_{n}1, (\mathrm{G}_{2}\mathrm{G}_{2}^{T})\otimes \mathrm{I}_{n}2, (\mathrm{G}_{3}\mathrm{G}_{3}^{T})\otimes \mathrm{I}_{n3}, \mathrm{I}_{r1r2r}3)
,where\mathrm{G}_{d}
isthe mode‐dunfolding
of\mathcal{G}
and is assumed to be full rank. The terms\mathrm{G}_{d}\mathrm{G}_{d}^{T}
ford\in\{1
,2,
3\}
arepositive
definite
when
r_{1}\leq r_{2}r_{3}, r_{2}\leq r_{1}r_{3}
, andr_{3}\leq r_{1}r_{2}.
A novel Riemannian metric and its motivation. An elementx inthe totalspace
\mathcal{M} has the matrix
representation
(\mathrm{U}_{1}, \mathrm{U}_{2}, \mathrm{U}_{3}, \mathcal{G})
.Consequently,
thetangent
spaceT_{x}\mathcal{M}
is the Cartesian
product
of thetangent
spaces of the individualmanifolds,
i.e., T_{x}\mathcal{M}
has the matrix characterization
[13] T_{x}\mathcal{M}=\{(\mathrm{Z}_{\mathrm{U}_{1}}, \mathrm{Z}_{\mathrm{U}_{2}}, \mathrm{Z}_{\mathrm{U}_{3}}, \mathrm{Z}_{\mathcal{G}})\in \mathbb{R}^{n\mathrm{x}r_{1}}1\times \mathbb{R}^{n2\times r2}\times
\mathbb{R}^{n3\times r}3\times \mathbb{R}^{r\mathrm{x}r2\times r}13 :\mathrm{U}_{d}^{T}\mathrm{Z}_{\mathrm{U}_{d}}+\mathrm{Z}_{\mathrm{U}_{d}}^{T}\mathrm{U}_{d}=0
, ford\in\{1
,2,
3 The earher discussion onsymmetry
andleast‐squares
structureleads tothe novel metricg_{x} :T_{x}\mathcal{M}\times T_{x}\mathcal{M}\rightarrow \mathbb{R}
g_{x}($\xi$_{x}, $\eta$_{x}) =\langle$\xi$_{\mathrm{U}_{1}}, $\eta$_{\mathrm{U}_{1}}(\mathrm{G}_{1}\mathrm{G}_{1}^{T})\rangle+\langle$\xi$_{\mathrm{U}_{2}}, $\eta$_{\mathrm{U}_{2}}(\mathrm{G}_{2}\mathrm{G}_{2}^{T})\rangle
+\{$\xi$_{\mathrm{U}_{3}}, $\eta$_{\mathrm{U}_{3}}(\mathrm{G}_{3}\mathrm{G}_{3}^{T})\rangle+\{$\xi$_{\mathcal{G}}, $\eta$_{\mathcal{G}}\rangle,
where
$\xi$_{x}, $\eta$_{x}\in T_{x}\mathcal{M}
aretangent
vectorswith matrixcharacterizations,
($\xi$_{\mathrm{U}_{1}}, $\xi$_{\mathrm{U}_{2}}, $\xi$_{\mathrm{U}_{3}}, $\xi$_{\mathcal{G}})
and($\eta$_{\mathrm{U}_{1}}, $\eta$_{\mathrm{U}_{2}}, $\eta$_{\mathrm{U}_{3}}, $\eta$_{\mathcal{G}})
,respectively
and\rangle
isthe Euclidean innerproduct.
Ascontraststo the classical Euclidean
metric,
the metric(2)
scales the levelsets of thecost functionon the search space that leads a
preconditioning
effect on thealgorithms developed
onthe Tucker manifold.
3
Notions of
optimization
onquotient
manifolds
Each
point
on aquotient
manifoldrepresents
an entireequivalence
class of matricesin the total space. Abstract
geometric
objects
on aquotient
manifold call for matrixbut under
appropriate
compatibility
between the Riemannian structure of\mathcal{M} and theRiemannian structure of the
quotient
manifold\mathcal{M}/\sim
,they
definealgorithms
on thequotient
manifold. Once we endow\mathcal{M}/\sim
with aRiemannianstructure,
the constraintoptimization
problem
(1)
isconceptually
transformedintoanunconstrainedoptimization
over the Riemannian
quotient
manifold(2).
When thepoints
x and y in \mathcal{M}belong
tothe same
equivalence class, they
represent
asingle
point
[x] :=\{y\in \mathcal{M} : y\sim x\}
onthequotient
manifold\mathcal{M}/\sim
. The abstracttangent
spaceT_{[x]}(\mathcal{M}/\sim)
at[x]\in \mathcal{M}/\sim
hasthe matrix
representation
inT_{x}\mathcal{M}
, but restricted to the directions that do not inducea
displacement along
theequivalence
class[x]
. This is realizedby decomposing
T_{x}\mathcal{M}
into two
complementary subspaces.
The vertical space,\mathcal{V}_{x}
is thetangent
space of theequivalence
class[x]
. On the otherhand,
the horizontal space\mathcal{H}_{x}
is theorthogonal
subspace
to\mathcal{V}_{x}
,i.e., T_{x}\mathcal{M}=\mathcal{V}_{x}\oplus \mathcal{H}_{x}
. The horizontalsubspace provides
a valid matrixrepresentation
to the abstracttangent
spaceT_{[x]}(\mathcal{M}/\sim) [
11, Section3.5.8].
An abstracttangent
vector$\xi$_{[x]}\in T_{[x]}(\mathcal{M}/\sim)
at[x]
has aunique
element$\xi$_{x}\in \mathcal{H}_{x}
that is called itshorizontal
lift.
Endowed with the Riemannian metric(2),
thequotient
manifold\mathcal{M}/\sim
is a Riemannian submersionof \mathcal{M}. The submersion
principle
then allows to work outconcrete matrix
representations
of abstractobject
on\mathcal{M}/\sim
.Particularly,
starting
froman
arbitrary
matrix(with
appropriate
dimensions),
twolinearprojections
areneeded: thefirst
projection $\Psi$_{x}
isontothetangent
spaceT_{x}\mathcal{M}
,while the secondprojection
$\Pi$_{x}
isontothe horizontal
subspace
\mathcal{H}_{x}
. Thecomputation
costof theseprojections
isO(n_{1}r_{1}^{2}+n_{2}r_{2}^{2}+
n_{3}r_{3}^{2})
.Finally,
wepropose aRiemannian nonlinearconjugate
gradient algorithm
for(1)
thatscales wellto
large‐scale
instances.Specifically,
we usetheconjugate
gradient implemen‐
tation of
Manopt
with theingredients
described in Table??. The convergenceanalysis
of this method follows from
[11,
16,
17].
Iff(X)=\Vert P_{ $\Omega$}(X)-P_{ $\Omega$}(\mathcal{X}^{\star})\Vert_{F}^{2}/| $\Omega$|
, thenthe Riemannian
gradient
\mathrm{g}\mathrm{r}\mathrm{a}\mathrm{d}_{x}f
, which has the matrix characterization$\Psi$(\mathrm{e}\mathrm{g}\mathrm{r}\mathrm{a}\mathrm{d}_{x}\prime f)
,where
\mathrm{e}\mathrm{g}\mathrm{r}\mathrm{a}\mathrm{d}_{\mathrm{x}}f
is the Euclideangradient
off
. We show a way to compute astep‐size
guess
effectively.
The totalcomputational
cost per iterationofourproposed algorithm
isO(| $\Omega$|r_{1}r_{2}r_{3})
,where| $\Omega$|
isthe number of known entries.4
Numerical
comparisons
We show numerical
comparisons
ofourproposed algorithm
with state‐of‐the‐artalgo‐
rithms that include
TOpt
[7]
andgeomCG
[8],
forcomparisons
with Tuckerdecomposition
based
algorithms,
and HaLRTC[3],
Latent[4],
and Hard[5]
as nuclear norm minimiza‐tion
algorithms.
All simulations areperformed
in Matlab on a 2.6 GHz Intel Core i7machine with 16 GB RAM. For
specific
operations
withunfoldings
of S, we usethe mexinterfacesthat are
provided
ingeomCG.
Forlarge‐scale
instances,
ouralgorithm
isonly.
compared
withgeomCG
asotheralgorithms
cannot handle these instances. Werandomly
sampling
(OS)
ratio,
to create thetraining
set $\Omega$.Algorithms
(and
problem
instances)
are initialized
randomly,
as in[8],
and arestopped
when either the mean square error(MSE)
onthetraining
set $\Omega$ isbelow10^{-12}
orthe number of iterations exceeds 250. Wealso evaluate themean squareerror on atest set $\Gamma$, which is different from $\Omega$. Five runs
are
performed
ineach scenario.Case 1 considers
synthetic
small‐scale tensorsof size100\times 100\times 100, 150\times 150\times 150,
and 200\times 200\times 200 and rank
\mathrm{r}=(10,10,10)
areconsidered. OS is{10,
20,
30}.
The resultshows that the convergence behavior of our
proposed algorithm
is eithercompetitive
orfaster than the others.
Next,
Case 2 considerslarge‐scale
tensors of size 3000\times 3000\times3000,
5000\times 5000\times 5000,and 10000\times 10000\times 10000 and ranks\mathrm{r}=(5,5,5)
and(10,10,10).
OS is 10. Our
proposed algorithm outperforms geomCG.
Case 3 considers instanceswhere the dimensions and ranks
along
certainmodesaredifferent than others. Twocasesareconsidered. Case
(3.a)
considerstensorssize20000\times 7000\times 7000, 30000\times 6000\times 6000,
and 40000\times 5000\times 5000 with rank
\grave{\mathrm{r}}=(5,5,5)
. Case(3.b)
considers atensor of size10000\times 10000\times 10000 with ranks
(7,
6,
6),
(10,
5,
5),
and(15,
4,
4).
In all the cases, theproposed algorithm
converges faster thangeomCG. Finally,
Case4considers MovieLens‐10\mathrm{M} dataset that contains 10000054
ratings
corresponding
to 71567 users and 10681movies. We
split
the time into7‐days
wide binsresults,
andfinally,
get
a tensor of size71567\times 10681\times 731. The fraction of known entries is less than
0.002%.
Weperform
fiverandom
80/10/10‐train/validation/test
partitions.
The maximum iteration is set to 500.Our
proposed algorithm consistently
gives
lowertesterrorsthangeomCG
acrossdifferent ranks.5
Conclusion and future work
We have
proposed
apreconditioned
nonlinearconjugate
gradient algorithm
for thetensor
completion problem by exploiting
the fundamentalstructuresofsymmetry,
duetonon‐uniqueness
of Tuckerdecomposition,
andleast‐squares
of thecost function. A novelRiemannian metric is
proposed
that enablestousethe versatile Riemannianoptimization
framework. Numerical
comparisons suggest
that ourproposed algorithm
has asuperior
performance
ondifferent benchmarks.参考文献
[1]
H. Kasai and B. Mishra. Low‐ranktensorcompletion:
aRiemannian manifold pre‐conditioning approach.
InICML,
2016.[2]
E. J. Candèsand B. Recht. Exact matrixcompletion
viaconvexoptimization.
Found.[3]
J.Liu,
P.Musialski,
P.Wonka,
and.J. Ye. Tensorcompletion
forestimating missing
values in visual data. IEEE Trans. Pattern Anal. Mach.
Intell.,
35(1):208-220
, 2013.[4]
R.Tomioka,
K.Hayashi,
and H.Kashimaf
Estimationof low‐ranktensorsviaconvexoptimization.
Technicalreport,
arXivpreprint
\mathrm{a}\mathrm{r}\mathrm{X}\mathrm{i}\mathrm{v}:1010.0789, 2011.[5]
M.Signoretto, Q.
T.Dinh,
L.D.Lathauwer,
and J. A. K.Suykens. Learning
withten‐sors: aframework based on convex
optimization
andspectral regularization.
Mach.Learn.,
94(3):303-351
, 2014.[6]
T. G. Kolda and B. W. Bader. Tensordecompositions
andapplications.
SIAMRev.,
51(3):455-500
, 2009.[7]
M.Filipovič
and A. Jukič. Tucker factorization withmissing
data withapplication
tolow‐n‐ranktensor
completion.
Multidim.Syst. Sign; P.,
2013.[8]
D.Kressner,
M.Steinlechner,
and B.Vandereycken.
Low‐ranktensorcompletion by
Riemannian
optimization.
BIT Numer.Math.,
54(2):447-468
, 2014.[9]
J. Nocedal and S. J.Wright.
Numericaloptimization,
volume Second Edition.Springer,
2006.[10]
B. Mishra and R.Sepulchre.
Riemannianpreconditioning.
SIAM J.Optim.,
26(1):635-660
, 2016.[11]
P.‐A.Absil,
R.Mahony,
and R.Sepulchre. 0ptimization Algorithms
onMatrix Man‐ifold.
s. PrincetonUniversity
Press,
2008.[12]
S. T. Smith.optimization techniques
onRiemannian manifold. In A.Bloch, editor,
Hamiltonian and Gradient
Flows,
Algorthms
andControl,
volume3,
pages 113‐136.Amer. Math.
Soc., Providence, RI,
1994.[13]
A.Edelman,
T.A.Arias,
and S.T. Smith. Thegeometry
ofalgorithms
withorthog‐
onality
constraints. SIAMJ. MatrixAnal.Appl.,
20(2):303-353
, 1998.[14]
N.Boumal,
B.Mishra,
P.‐A.Absil,
and R.Sepulchre. Manopt:
aMatlabtoolbox foroptimization
onmanifolds. J. Mach. Learn.Res.,
15(1):1455‐1459,
2014.[15]
J. M. Lee. Introductiontosmoothmanifolds,
volume 218 of Graduate Texts inMath‐ematics.
Springer‐Verlag,
NewYork,
secondedition,
2003.[16]
H. Sato and T. Iwai. A new,globally
convergent
Riemannianconjugate
gradient
method.