Research Article

(1)

Volume 2010, Article ID 464815,17pages doi:10.1155/2010/464815

Research Article

Some Remarks on Diffusion Distances

Maxim J. Goldberg

¹

and Seonja Kim

²

1Theoretical and Applied Science, Ramapo College of NJ, 505 Ramapo Valley Road, Mahwah, NJ 07430, USA

2Mathematics Department, SUNY Rockland Community College, 145 College Road, Suﬀern, NY 10901, USA

Correspondence should be addressed to Maxim J. Goldberg,[email protected] Received 14 June 2010; Revised 31 July 2010; Accepted 3 September 2010

Academic Editor: Andrew Pickering

Copyrightq2010 M. J. Goldberg and S. Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

As a diﬀusion distance, we propose to use a metricclosely related to cosine similaritywhich is defined as theL² distance between twoL²-normalized vectors. We provide a mathematical explanation as to why the normalization makes diﬀusion distances more meaningful. Our proposal is in contrast to that made some years ago by R. Coifman which finds theL²distance between certainL¹ unit vectors. In the second part of the paper, we give two proofs that an extension of mean first passage time to mean first passage cost satisfies the triangle inequality; we do not assume that the underlying Markov matrix is diagonalizable. We conclude by exhibiting an interesting connection between thenormalizedmean first passage time and the discretized solution of a certain Dirichlet-Poisson problem and verify our result numerically for the simple case of the unit circle.

1. Introduction

Several years ago, motivated by considering heat flow on a manifold, R. Coifman proposed a diffusion distance—both for the case of a manifold and a discrete analog for a set of data points in Rⁿ. In the continuous case, his distance can be written as the L² norm of the difference of two specified vectors, each of which has unit L¹ norm. An analogous situation holds in the discrete case.Coifman’s distance can be successfully used in various applications, including data organization, approximately isometric embedding of data in low-dimensional Euclidean space, and so forth. See, for example, 1–3. For a unified discussion of diffusion maps and their usefulness in spectral clustering and dimensionality reduction, see4.

We see a drawback in Coifman’s diﬀusion distance in that it finds theL²norm of the distance between two L¹ unit vectors, rather than L² unit vectors. As shown by a simple example later in this paper, two vectorsrepresenting two diﬀusions, which we may want to

(2)

consider to be far apart, are actually close to each other inL², even though the angle between them is large, because they have smallL²norm, while still having unitL¹norm. Additionally, applying Coifman’s distance to heat flow inRⁿ, a factor of a power of timetremains, with the exponent depending on the dimensionn. It would be desirable not to have such a factor.

Our main motivation for this paper is to propose an alternate diffusion metric, which finds theL²distance between twoL²unit vectorswith analogous statements for the discrete case. Our distance is thus the length of the chord joining the tips, on the unit hypersphere, of twoL²normalized diffusion vectors, and is therefore based on cosine similaritysee4.4 below. Cosine similarityaffinityis popular in kernel methods in machine learning; see for example,5,6 in particular, Section 3.5.1—Document Clustering Basicsand for a review of kernel methods in machine learning,7.

In the case of heat flow on Rⁿ, our proposed distance has the property that no dimensionally dependent factor is left. Furthermore, for a general manifold, our diﬀusion distance gives, approximately, a scaled geodesic distance between two pointsxandy, when xandyare closer than√

t, and maximum separation when the geodesic distance betweenx andy, scaled by√

t, goes to infinity.

We next give two proofs that the mean first passage cost—defined later in this paper as the cost to visit a particular point for the first time after leaving a specified point—satisfies the triangle inequality.See Theorem 4.2 in8in which the author states that the triangle inequality holds for the mean first passage time.We give two proofs that do not assume that the underlying Markov matrix is diagonalizable; our proofs do not rely on spectral theory.

We calculate explicitly the normalized limit of the mean first passage time for the unit circleS¹by identifying the limit as the solution of a specific Dirichlet-Poisson problem onS¹. We also provide numerical verification of our calculation.

The paper is organized as follows. After a section on notation, we discuss R. Coifman’s diﬀusion distance for both the continuous and discrete cases in Section 3. InSection 4, we define and discuss our alternate diﬀusion distance. InSection 5, we give two proofs of the triangle inequality for mean first passage cost. We conclude the section by exhibiting an interesting connection between thenormalizedmean first passage time and the discretized solution of a certain Dirichlet-Poisson problem and verify our result numerically for the simple case ofS¹.

2. Notation and Setup

In this paper, we will present derivations for both the continuous and discrete cases.

In the continuous situation, we assume there is an underlying Riemannian manifold Mwith measured x;x, y, u, z, . . .will denote points inM. Fort ≥ 0,ρ_tx, ywill denote a kernel onM×M, withρ_tx, y≥0 for allx, y∈M, and satisfying the following semigroup property:

M

ρtx, uρs

u, y

duρ_{t s} x, y

, 2.1

for allx, y∈M, ands, t≥0. In addition, we assume the following property:

M

ρt

x, y

dx1, 2.2

(3)

for ally∈Mand allt≥0. The latter convention gives the mass preservation property

M

Ttfxdx

M

f y

dy, 2.3

where

T_tfx≡

M

ρ_t x, y

f y

dy. 2.4

We will often specialize to the case whenρ_tx, y ρ_ty, xfor allx, y ∈Mandt≥0, as in the case of heat flow. Note that whenρ_tx, yis the fundamental solution for heat flow, we haveρ0x, u δxu, where δxudenotes the Dirac delta function centered at x. We will sometimes assumeas in the case of heat flow on a compact manifoldthat there exist 0 ≤ λ₁ ≤ λ₂ ≤ · · ·,with eachλ_j corresponding to a finite dimensional eigenspace, and a complete orthonormal family ofL²functionsφ1, φ2, . . ., such that

ρt

x, y ^∞

j1

e^−λ^j^tφjxφj

y

, 2.5

for t > 0. We will also frequently use the following fact: if ρ_t is symmetric in the space variables, then for anyx, y∈M,

M

ρtu, xρt

u, y du

M

ρtx, uρt

u, y

duρ2t

x, y

, 2.6

where we have used the symmetry ofρ_tand its semigroup property.

For the discrete situation, the analog ofρtx, yis anN×NmatrixA, withAaij N i,j1, everya_ij ≥0. In keeping with the usual convention thatAis Markov if each row sum equals 1, that is,

jaij 1 for alli, the analog ofTtfx

Mρtx, yfydyisA^Tv, whereA^T is the transpose ofA, andvis anN×1 column vector. So the indexicorresponds to the second space variable inρ_t, the indexjcorresponds to the first space variable inρ_t, andtn,n1,2, . . ., corresponds to thenth power ofA. The obvious analog ofρtsymmetric in its space variables is a symmetric Markov matrixA, that is,AA^T.

For A as above, not necessarily symmetric, we think of a_ij as the probability of transitioning from state si to statesj in t 1 tick of the clock; S {s1, s2, . . . , sN}is the underlying set of states. ForXa subset of the set of statesS, theN×NmatrixP_Xwill denote the following projection: all entries ofP_Xare 0 except for diagonal entriesk, k, whens_k∈X;

the latter entries are equal to 1.

Finally,1 will denote theN×1 column vector where each entry is 1;e_iwill denote the N×1 column vector with theith component 1, and all others 0, and, for a set of statesX,X will denote the complement ofXwith respect toS.

(4)

3. A Diffusion Distance Proposed by R. Coifman

Several years ago, R. Coifman proposed a novel diﬀusion distance based on the ideas of heat flow on a manifold or a discrete analog of heat flow on a set of data pointssee, e.g,1,2for a thorough discussion. In this section, we will describe Coifman’s distance using our notation, and consider some of its good points, and what we see as some of its drawbacks.

Referring toSection 2, for the continuous case, the unweighted version of Coifman’s distance betweenx, y∈M, which we will denote bydC,tx, y, can be defined as follows:

d_C,tx, y ²≡ T_t

δ_x−δ_y , T_t

δ_x−δ_y Ttδx, Ttδx

Ttδy, Ttδy

−2

Ttδx, Ttδy

.

3.1

Here,

T_tδ_zv

M

ρ_tv, uδzuduρ_tv, z, 3.2

forz∈M. The · · · ,· · · is the usual inner product onL²M.In1, the authors consider a weighted version of 3.1 which naturally arises when the underlying kernel does not integrate to 1in each variable. In terms of data analysis, this corresponds to cases where the data are sampled nonuniformly over the region of interest. For simplicity, we are just using Coifman’s unweighted distance.

Note that we thus have d_C,tx, y ²

M

ρ_tv, xρtv, xdv

M

ρ_t v, y

dv−2

M

ρ_tv, xρt

v, y dv ρt·, x²

2 ρt

·, y²

2−2

ρt·, x, ρt

·, y .

3.3

Although Coifman’s original definition used a kernel symmetric with respect to the space variable, d_C,tx, y as given above need not be based on a symmetric ρ_t. Note that, by the defining3.1,d_C,tx, yis symmetric in xand yeven ifρ_tis not, and satisfies the triangle inequality. Ifρtis symmetric in the space variables, from2.6we see that:

d_C,t

x, y ²ρ_2tx, x ρ_2t y, y

−2ρ_2t x, y

, 3.4

a form matching one of Coifman’s formulations for the continuous case.

If, in addition toρ_tbeing symmetric in the space variables, we have that2.5holds, as in the case of heat flow, we easily see that:

dC,t

x, y ²^∞

j1

e^−2λ^j^t

φjx−φj

y₂

, 3.5

the original form proposed by Coifman. Note that the latter expression again explicitly shows thatdC,tx, yis symmetric inxandyand satisfies the triangle inequalityby considering, for example, the right-hand side as the square of a weighted distance inl².

(5)

Referring again toSection 2, for the discrete situation, where we start with a set of data pointsS {s1, s₂, . . . , s_N}, andAis a Markov matrix specifying the transition probabilities between the “states” ofS, the distance between two data pointss_iands_jis given by

d_C,1

s_i, s_j ²≡ A^T

e_i−e_j , A^T

e_i−e_j

A^Te_i, A^Te_i

A^Te_j, A^Te_j

−2

A^Te_i, A^Te_j

AA^Te_i, e_i

AA^Te_j, e_j

−2

AA^Te_i, e_j

AA^T

ii

AA^T

jj−2 AA^T

ij,

3.6

where · · ·,· · · is the usual inner product inR^N, and for a matrixB,B_ij denotes thei, j entry ofB. Again, symmetry and the triangle inequality are easily verified. IfAis symmetric,

dC,1

si, sj 2 A²

ii

A²

jj−2 A²

ij. 3.7

The “1” appearing in the subscript of dC,1si, sj refers to the fact that A¹ A is used, corresponding tot1 in the continuous case. As the diﬀusion along data points flows, after nticks of the clock, we can successively consider

dC,n

si, sj 2 Aⁿ

A^T_n

ii

Aⁿ

A^T_n

jj−2 Aⁿ

A^T_n

ij, 3.8

which, for a symmetricA, equals

A²ⁿ

ii

A²ⁿ

jj−2 A²ⁿ

ij. 3.9

An important benefit of introducing a diﬀusion distance as above can be illustrated by considering3.5. Ifρtis such that3.5holds for a complete orthonormal family{φj}, we see that astincreases, we are achieving anapproximate isometric embedding ofM into successively lower-dimensional vector spaceswith a weighted norm. More specifically, for λj >0, iftis large, the termse^−2λ^j^tφjx−φjy²are nearly 0. So, astincreases, we see that the “heat smeared” manifoldMis parametrized by only a few leadingφ_j’s. Thus, “stepping”

through higher and higher times, we are obtaining a natural near-parametrization of more and more smeared versions ofM, giving rise to a natural ladder of approximations toM.

Analogous considerations hold in the discrete situation for Asymmetric, when we easily see that the eigenvalues of A² are between 0 and 1 and decrease exponentially for A²ⁿ, asnincreasesthe “heat smeared” data points are now parametrized by a few leading eigenvectors ofA, associated to the largest eigenvalues.

See 1–3 for more discussion and examples of the natural embedding discussed above, along with illustrations of its power to organize unordered data, as well as its insensitivity to noise.

(6)

We would now like to point out what we see some drawbacks of Coifman’s distance, which led us to propose an alternative distance inSection 4.

Let us consider3.4for the case where

ρ_t x, y

4πt^−n/2e^−|x−y|²^/4t, 3.10

the fundamental solution to the heat equation inRⁿ. Then,

d_C,t

x, y ² 2−2e^−|x−y|²^/8t

8πt^n/2 . 3.11

If|x−y|²/8tis small, then to the leading order in|x−y|²/4t,

d_C,t

x, y ² 1 8πt^n/2

⎛

⎝x−y²

4t O

⎛

⎝x−y² 4t

2⎞

⎠

⎞

⎠. 3.12

Thus, if |x−y| √

t, we do recover the geodesic distance between x and y but, due to the 1/t^n/2 term in front, normalized by a power of t which depends on the dimensionn.

As pointed out by the reviewer, forRⁿ itself, the normalization does depend onn, but is simply a global change of scale, for eacht, and thus basically immaterial. Suppose, however, that the data we are considering come in two “clumps”, one of dimensionnand the other of dimensionm, withn /m. Let us also suppose these clumps are somehow joined together and, far away from the joining region, each clump is basically a flat Euclidean space of the corresponding dimension. Then, far away from the joint, heat diﬀusion in a particular clump would behave as if it were inRⁿ, respectivelyR^muntil the time that the flowing heat “hits”

the joint region. Thus, in the part of each clump that is far from the joint, the diffusion distance would be normalized differently, one normalization depending onnand the other onm. An overall change of scale would not remove this difference, thus we would not recover the usual Euclidean distance in the two clumps simultaneously, as we would like.

The second point of concern is more general in nature. In the continuous case, Coifman’s distance involves theL²distance betweenT_tδ_z, whenzx, andT_tδ_zwhenzy;

see3.1. TheL¹norm ofT_tδ_zis 1, since

M

Ttδzvdv

M

ρtv, uδzudu

dv1, 3.13

using the mass preservation assumption of2.2. For the discrete case,1^TA^Tei1, where1^T is the 1×Nvector of 1’s.

So the diﬀusion distance proposed by Coifman finds theL²resp.,l²distance between L¹ resp.,l¹normalized vectors. Let us illustrate by an example for the discrete situation, withN10,000, in which this may lead to undesired results. Without specifying the matrix

(7)

A, suppose that after some time has passed, we have the following two 1×10,000 vectors giving two diﬀerent results of diﬀusion:

v1 1

100, . . . , 1

100,0, . . . ,0

, 3.14

where the first one hundred entries are each 1/100, and the rest 9,900 entries are 0, and

v₂ 1

10,000, 1

10,000, . . . , 1 10,000

, 3.15

where each entry is 1/10,000.

Note thatv1andv2both havel¹norm 1. Now, considering two canonical basis vectors e_i^Tande^T_j,i /j, each of which hasl¹norm 1, we see thate^T_i −e^T_j, e_i^T−e^T_j 2. So, a distance of √

2 gives thein fact, maximumseparation between two completely differentl¹ unit diffusion vectors. Return tov₁ andv₂, note that v₂ corresponds to total diffusion, whilev₁ has only diffused over 1% of the entries. We would thus hope thatv1andv2would be nearly as much separated ase^T_i ande^T_j, that is, have diffusion distance not much smaller than√

2.

But a trivial calculation shows that

v1−v₂, v₁−v₂ < .1, 3.16

which seems much smaller than what we would like. The problem is that

v1−v₂, v₁−v₂ is small since thel²norm of each ofv1andv2is small, even though thel¹norm of each is 1.

In the next section, we propose a variant of the diﬀusion distance discussed in this section. Our version will find theL²resp.,l²distance between vectors which are normalized to haveL²resp.,l²norm to be 1, rather thanL¹orl¹norm 1.

4. An Alternate Diffusion Distance

In this section, we propose a new diﬀusion distance. Let us first define our alternate diﬀusion distance for the continuous case. Refer to Section 2 for the definitions of functions and operators used below.

For anyz∈M, let

ψ_zu≡ δzu

Ttδ_z₂ δzu

Mρtv, zρtv, zdv. 4.1 Then,

T_tψ_zu

M

ρ_tu, wδ_zw

Ttδz₂dw ρ_tu, z

Ttδz₂. 4.2

(8)

Note thatTtψz·hasL²norm 1:

M

T_tψ_zu ²du

M

ρ_t²u, z

Mρ²_tv, zdvdu1. 4.3 Forx, y∈M, we define our diﬀusion distance,d_2,tx, y,as follows:

d2,t

x, y ²≡ Tt

ψx−ψy

, Tt

ψx−ψy

2−2

Mρtu, xρt

u, y du ρt·, x2ρ_t

·, y

2

2−2

ρt·, x, ρt

·, y ρ_t·, x

2ρ_t

·, y

2

,

4.4

where we have used4.3. Here again, · · · ,· · · is the usual inner product onL²M. Note the analogy to3.3.

As is clear from the defining equality in4.4,d_2,tx, yis symmetric inxandyand satisfies the triangle inequality:

d_2,tx, z≤d_2,t x, y

d_2,t y, z

, 4.5

for allx, y, z∈ M. Geometrically,d2,tx, yis the length of the chord joining the tips of the unit vectorsTtψ_xandT_tψ_y. We have that

0≤d_2,t x, y

≤√

2, 4.6

for allx, y∈Mandt≥0.

Ifρ_tis symmetric in the space variables, by2.6, we have that d_2,t

x, y ²2−2 ρ2t

x, y ρ2tx, x

ρ2t

y, y. 4.7

As an example, again consider the case where ρtx, y 4πt^−n/2e^−|x−y|²^/4t, the fundamental solution to the heat equation inRⁿ. Then,

d_2,t

x, y ²2−2e^−|x−y|²^/8t. 4.8

Note that if|x−y| √ t, then d2,t

x, y ²≈2−2

1−x−y² 8t

x−y²

4t , 4.9

(9)

sod2,tx, ygivesapproximatelythe geodesic distance inRⁿ, in the “near regime” where

|x−y| √

t, and with scale√

t. Note that unlike3.12, not^n/2term appears.Also see the discussion following3.12.Also note that if|x−y|²/8tis large,d_2,tx, y≈√

2the greatest possible distance, see4.6, so for suchtthe pointsxandyarenearlymaximally separated.

Hence,d2,tx, y, for the case of heat flow inRⁿ, gives a scaled geodesic distance whenxis close toy, with 2√

tas the unit of length and near maximum separation whenxis far fromy at the scale√

t.

For any, say, compact Riemannian manifoldM, ifρtx, yis the fundamental solution to the heat equation onM, we have that

ρ_t x, y

4πt^−n/2e^−d²^x,y/4t1 Ot, ast−→0 , 4.10

where dx, y is the geodesic distance on M see9. Hence, repeating the expansion in 4.9for a compact manifoldM, withtsmall, anddx, y √

t, we have that d2,tx, y ≈ dx, y/2√

t, again recovering scaled geodesic distance.The discussion following 3.12 gives an example for which it would be preferable not to have presented a normalization factor which depends on the dimension.Exponentially decaying bounds on the fundamental solution of the heat equation for a manifoldMsee9, Chapter XII, Section 12, suggest that xandybecome nearly maximally separated, as given byd_2,tx, y, whendx, y scaled by

√tis large, just as in the Euclidean case.

In the discrete situation, where we start with a set of data pointsS {s1, s2, . . . , sN}, andAis a Markov matrix specifying the transition probabilities between the “states” ofS, forn1,2, . . . ,we let

vi,n ei

A^T_n

e_i, 4.11

wheree_iis theith canonical basis vectorseeSection 2, and · · ·,· · · is thel²vector norm.

Fors_i, s_j∈Sandn1,2, . . ., we defined_2,nsi, s_jby d_2,n

s_i, s_j ²≡ A^T_n

v_i,n−v_j,n ,

A^T_n

v_i,n−v_j,n

2−2

A^Tn

e_i, A^Tn

e_j A^Tn

e_i A^Tn

e_j 2−2

Aⁿ

A^Tn ij

Aⁿ A^T_n

ii

Aⁿ A^T_n

jj

,

4.12

where · · ·,· · · and · · ·,· · · are, respectively, the usual inner product and norm inR^N, and, for a matrixB,B_ijdenotes thei, jentry ofB.

(10)

IfAis symmetric, d2,n

si, sj 2 2−2

A²ⁿ ij

A²ⁿ_ii

A²ⁿ_jj. 4.13

As before,nrepresents thenth tick of the clock.

5. The Mean First Passage Cost Satisfies the Triangle Inequality:

An Example of Its Normalized Limit

In this section, we consider a slightly diﬀerent topic: the mean first passage costdefined belowbetween two states as a measure of separation in the discrete situation. We give two explicit proofs showing that the mean first passage cost satisfies the triangle inequalityin 8, the author states this result when all costs are equal to 1 as Theorem 4.2, but the proof is not very explicit in our opinion.

In 10–12, as well as some of the references listed therein, it is shown that the symmetrized mean first passage time and cost are metrics for mean first passage cost see, in particular,10; also, in the above sources the symmetrized mean first passage time is called the commute time. “Symmetrized” refers to the sum of the first cost time to reach a specified state from a starting state and to return back to the starting state. This symmetrization is necessary to ensure a quantity symmetric in the starting and destination states. In the sources cited above, the fundamental underlying operator is the graph Laplacian L, which, using the notation of 12, is defined as L D −W. Here, W wij is the adjacency matrix of a graph, andDis the diagonal degree matrix, with theith entry on the diagonal equaling

jw_ij. In addition to assuming the nonnegativity of thew_ij’s, the authors in the above works assume thatW is symmetric. The resulting symmetryand positive semi- definiteness ofLimplies the existence of a full set of nonnegative eigenvalues ofL, and the diagonalizability ofLis used heavily in the proofs that the commute time/cost is a distance.

In the random walk interpretation, see, for example,12, the following normalized Laplacian is relevant:Lrw I−D⁻¹W. To make a connection with the notation in the present paper, D⁻¹WA, a Markov matrix giving the transition probabilities of the random walk. Although D⁻¹W is not necessarily symmetric, it is easy to see thatD⁻¹W D^−1/2{D^−1/2WD^−1/2}D^1/2 see the discussion in12. HenceD⁻¹W, while not itself symmetric in general, is conjugate to the symmetric matrixD^−1/2WD^−1/2, and thus too has a full complement of eigenvalues.

In this section, as in the rest of the paper unless stated otherwise, we are not assuming that the Markov matrixAis symmetric or conjugate to a symmetric matrix; henceAmay not be diagonalizable i.e., A may have Jordan blocks of dimension greater than 1. We thus do not have spectral theory available to us. Furthermore, we do not wish to necessarily symmetrize the mean first passage time/cost to obtain a symmetric quantity; we are not actually going to get a distance, but will try to obtain the “most important” property of being a distance, namely, the triangle inequality.

A model example we are thinking about is the following. Suppose we have a map grid and are tracking some localized storm which is currently at some particular location on the grid. We suppose that the storm behaves like a random walk and has a certainconstant in time probability to move from one grid location to another at each “tick of the clock”

time step. We can thus model the movements of the storm by a Markov matrix A, with

(11)

thenth power of Agiving the transition probabilities after nticks of the clock. If there is no overall wind, the matrixAcould reasonably be assumed to be symmetric, and we could use spectral theory. But suppose there is an overall wind in some fixed direction, which is making it more probable for the storm to move north, say, rather than south. Then the matrix A is not symmetric; there is a preferred direction of the storm to move in, from one tick of the clock to the next; spectral theory cannot, in general, be used. Furthermore, it may not be reasonable in this situation to consider the commute time—the symmetrized mean first passage time—since we may rather want to know the expected time to reach a certain population center from the current location of the storm, and may not care about the storm’s return to the original location. Thus the mean first passage time would be the quantity of interest.

In the first part of this section, we give two proofs that the mean first passage cost/time, associated with a not-necessarily-symmetric Markov matrixA, does indeed satisfy the triangle inequality; our proofs do not rely on spectral theory. We think that satisfying the triangle inequality, while in general failing to be symmetric, is still a very useful property for a bilinear form to have.

We conclude the section by exhibiting a connection between thenormalizedmean first passage time and the discretized solution of a certain Dirichlet-Poisson problem and verify our result numerically for the simple case of the unit circle.

In this section,S {s1, s2, . . . , sN}is a finite set of states andAis a Markov matrix giving the transition probabilities between states in one tick of the clockseeSection 2.C will denote anN×N matrix with non-negative entries,C cij N

i,j1. We will think of each c_i,j as the “cost” associated with the transition from states_i to states_j. By a slight abuse of notation, for 1≤m, n≤N,Pnwill be theN×Nmatrix in which all entries are 0, except the n, nentry which is 1this corresponds toX{sn}inSection 2. Also,Pmnwill be theN×N matrix in which all entries are 0, except them, mandn, nentries each of which is 1this corresponds toX{sm, sn}inSection 2.

LetYmnbe the random variable which gives the cost accumulated by a particle starting at states_muntil its first visit to states_nafter leavings_m. In other words, if a particular path of the particle is given by the statessm, sj1, sj2, . . . , sjp, sn, the value ofYmniscmj1 cj1j2 · · · cjpn. We supposeAhas the property that for everyi, j,there exists annsuch thatAⁿ_ij >0, that is, every statej is eventually reachable from every statei. Then, as is shown in13 using slightly diﬀerent notation, we have the following formula forEYmn, which is the expected cost of going from statesmto statesn:

EYmn e^T_m I I−AI−P_n

c_pqa_pq

1, 5.1

wherecpqapq is theN×Nmatrix withp, qentry equal tocpqapq.In particular, it is shown in 13 thatI −AI−Pnis invertible and AI−Pn^k → 0, as k → ∞.See 14,15 for discussion of related expected values, and8,10–12,16–18for discussion of mean first passage times and related concepts.

We will give two proofs that the expected cost of going from one state to another satisfies the triangle inequality.

(12)

Proposition 5.1. EYij≤EYik EYkj.

We again note that this proposition, for the case all costs are 1, is stated as Theorem 4.2 in8, but we feel the proof is not very explicit.In our proofs below, we assumej /k; if jk, the inequality inProposition 5.1is immediate.

Proof. 1Our first proof is probabilistic. Let a random walker start at states_iand accumulate costs given by the matrixCas he moves from state to state. As soon as the walker reaches state sj, we obtain a sample value ofYij. Now, at this point of the walk, there are two possibilities.

Either the walker has passed through states_kbefore his first visit tos_j after leavings_i, or he has not. In the first instance, we have obtained sample values ofY_ik andY_kj along the way, andYij Yik Ykj for this simulation. In the second case, we let the walker continue until he first reachess_k, to obtain a sample value ofY_ik, and walk still more until he reachess_jfor the first time since leavings_k, thus giving a sample value ofY_kjnote that by the memoryless property, this sample value ofYkjis independent of the walker’s prior history. In the second case, we thus clearly have thatYij < Y_ik Ykj. Combining the two cases, we haveY_ij≤Y_ik Ykj. Repeating the simulation, averaging, and taking the limit as the number of simulations goes to infinity, we obtain thatEYij≤EYik EYkj.

2Our second proof is via explicit matrix computations. Let us define the following two quantities:

Q1≡e^T_i I I−A

I−P_jk cpqapq

1,

Q2≡

e_i^T I I−A

I−P_jkAPk1

e^T_k I I−A

I−P_j cpqapq

1.

5.2

SeeSection 2and the paragraphs before the statement ofProposition 5.1.Now, we have

Q1≡e^T_i I I−A

I−P_jk cpqapq

1

e^T_i ∞ m0

A

I−Pjk m cpqapq

1

≤e^T_i ∞ m0

AI−P_k^m c_pqa_pq

1

e^T_i I I−AI−P_k

cpqapq

1

EYik;

5.3

(13)

see5.1. Also,

Q₂≡

e^T_i I I−A

I−Pjk

AP_k1

e^T_k I I−A

I−Pj

c_pqa_pq

1

e^T_i I I−A

I−Pjk

AP_k1

E Y_kj

.

5.4

But

e_i^T I I−A

I−Pjk

AP_k1≤e^T_i I

I−AI−P_kAP_k1 e^T_i I

I−AI−P_kI−AI−Pk1 1,

5.5

where we have used a series expansion to show the first inequality all entries are nonnegative, and the fact thatAPk1 I−AI−Pk1, sinceA11. Thus,Q2≤EYkj.

We will finish our second proof by showing thatQ1 Q2EYij. First note that

Q₂≡

e^T_i I I−A

I−P_jkAP_k1

e^T_k I I−A

I−P_j c_pqa_pq

1 e^T_i I

I−A

I−P_jkAPk I I−A

I−P_j cpqapq

1,

5.6

using thatP_ke_ke^T_kP_k1e^T_k. Thus,

Q₁ Q₂ e^T_i I I−A

I−Pjk

I−A

I−P_j

AP_k I I−A

I−Pj

c_pqa_pq

1 e^T_i I

I−A

I−P_jk I−A

I−P_j−P_k I I−A

I−P_j c_pqa_pq

1 e^T_i I

I−A

I−P_jk I−A

I−Pjk

I I−A

I−P_j cpqapq

1

e^T_i I I−A

I−Pj

c_pqa_pq

1 E

Yij

.

5.7

Here we have used the fact thatPj PkPjkas mentioned earlier, we are assumingj /k; the triangle inequality we are proving holds trivially for the casej k.

(14)

We would like to point out that the decomposition ofEYij Q1 Q2in the second proof above is not a “miraculous” guess. We arrived at this decomposition by writingEYij as the derivative evaluated at 0 of the characteristic function Fourier transform of Y_ij see13, and breaking up the expression to be diﬀerentiated into a sum of terms: one term corresponding to the random walk going fromsitosjwithout visitingskfirst, and one term corresponding to visitings_kbefore reachings_j. After diﬀerentiation, the resulting six pieces, when suitably combined into two terms, yieldedQ1andQ2.

We conclude this section by considering certainsuitably normalizedlimiting values of the expected cost of going from states_ito states_j, for the first time after leavings_i, given by5.1. For this discussion, we will take all the costs to be identically 1, that is,cpq 1 for all p, q. Then, we see from5.1that

E Y_ij

e^T_i I I−A

I−Pj

1e_i^T I I−A

I−Pj

A1

e^T_i I I−A

I−Pj

A I−P_j

1 e^T_i I I−A

I−Pj

AP_j1

e^T_i I I−A

I−P_jA I−Pj

1 e^T_i I I−A

I−P_j I−A

I−Pj

1

e^T_i I I−A

I−Pj

A I−P_j

1 1,

5.8

where we have usedAP_j1 I−AI−P_j1, sinceA11.

Now, let us digress a little to describe a stochastic approach to solving certain boundary value problems. The description below follows very closely parts of Chapter 9 in 19. Some statements are excerpted verbatim from that work, with minor changes in some labels. The background results below are well known and are often referred to as Dynkin’s formulasee, e.g,20. We are presenting them for the reader’s convenience and will use them to exhibit an interesting connection between the mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem; see5.16.

LetDbe a domain inRⁿ, and letLdenote a partial diﬀerential operator onC²Rⁿof the form:

Lⁿ

i1

b_ix ∂

∂x_i n i,j1

a_ijx ∂²

∂x_i∂x_j, 5.9

where aijx ajix. We assume each aijx ∈ C²Dis bounded and has bounded first and second partial derivatives; also, eachb_ixis Lipschitz. SupposeLis uniformly elliptic in Di.e., all the eigenvalues of the symmetric matrixaijx are positive and stay uniformly away from 0 inD. Then, forg∈C^αD, someα >0, andgbounded, and forφ∈C∂D, the functionwdefined below solves the following Dirichlet-Poisson problem:

Lw−g inD,

x→y,x∈Dlim wx φ y

, for all regular points y∈∂D. 5.10

(15)

Regular points in this context are defined in19and turn out to be the same as the regular points in the classical sense, i.e., the pointsyon∂Dwhere the limit of the generalized Perron- Wiener-Brelot solution coincides withφy, for allφ∈C∂D.

Now we definew. We choose a square rootσx∈R^n×nof the matrix 2aijx , that is, 1

2σxσ^Tx aijx

. 5.11

Next, forbbix , letXtbe an It ˆo diﬀusion solving

dXtbXtdt σXtdBt, 5.12

whereBtis n-dimensional Brownian motion. Then, wx Ex

φXτ Ex τ 0

gXudu

!

, forx∈D, 5.13

is a solution of5.10. Here, the expected values are over paths starting fromx∈D, andτis the first exit time fromD.

Let us transfer the above discussion to, say, a compact manifoldM, rather thanRⁿ. We sampleMand let the “states”s_kbe the sample points. We construct a transition matrix Ato give a discretized version of5.12. Let >0 be the approximate separation between the sample points. Fix a sample points_j, and letDM\Bsjbe the domain inMconsisting of the complement of the closure of the ball of radiusinM, centersj. Letsibe a sample point inD. For this situation, in5.10, letφ be the 0 function, andg be the constant 1 function.

Then5.13becomes:

wx Esi

τ 0

du

!

, 5.14

τ is the first exit time fromD, that is, first visit time to theneighborhood ofsj.Compare with Proposition 8B in21which discusses the case of the Dirichlet-Poisson problem5.10 withφ 0 andg 1 for a manifold.As shown in13 with slightly diﬀerent notation, a discrete version of5.14is

e^T_i I I−A

I−Pj

"

c_pqa_pq 1

Δt, 5.15

wherec"_pj 0, allp, andc"_pq1, for allqnot equal toj, and allp. Thus,c"_pqa_pq AI−P_j. Combining5.15and5.8, we see that:

E Y_ij

Δt−Δt

e^T_i I I−A

I−Pj

A I−P_j

1

Δt≈w s_j

, 5.16

forΔtsmall.

(16)

We thus see a connection between thenormalizedmean first passage time and the solution to the Dirichlet-Poisson problem discussed above.

Let us illustrate the preceding discussion by a simple example: M S¹, the unit circle. We will considerd²/dθ², the Laplacian onS¹, and sampleS¹ uniformly. We will let the transition matrixAtake the walker from the current state to each of the two immediate neighbor states, with probability 1/2 for each. The variance is thenΔθ². SincedX_t√

2dB_t, see5.12, we must haveΔθ²2Δt, and we should useΔθ²/2 as our value ofΔtin5.16.

Using symmetry, we can takesj0, the 0 angle onS¹, without loss of generality. Let

wθ 1

2θ−2π−θ−, ≤θ≤2π−. 5.17

Note that

d²

dθ²wθ −1, < θ <2π−, w w2π− 0.

5.18

Sowis the unique solution satisfying5.10for our example, on the domainS¹\−, . To numerically confirm5.16, we ran numerical experiments in which we discretized S¹ into N equispaced points, with the transition matrix A taking a state to each of its 2 immediate neighbors with probability 1/2, and used Δθ²/2 as the value of Δt in 5.16 to calculateEYijΔt−Δt. We tooksjto be the angle 0, andsito be the closest sample point to the angle with radian measure 1, for example. Letting → 0 in5.17, we compared the value ofEYijΔt−Δtwithwθ 1/2θ2π−θ. For instance, withN 1000,s_j 0, and sithe nearest sample point to the angle with radian measure 1, the relative error is less than 0.08%. Note thatwθis, forθclose to 0, essentially a scaled geodesic distance onS¹ from our base angle 0.

6. Conclusions

The authors have presented a diﬀusion distance which usesL² unit vectors, and which is based on the well-known cosine similarity. They have discussed why the normalization may make diﬀusion distances more meaningful. We also gave two explicit proofs of the triangle inequality for mean first passage cost, and exhibited a connection between thenormalized mean first passage time and the discretized solution of a certain Dirichlet-Poisson problem.

Acknowledgments

We thank Raphy Coifman for his continuous generosity in sharing his enthusiasm for mathematics and his ideas about diﬀusion and other topics. We would also like to thank the anonymous reviewer for his/her thorough critique of this paper and many helpful suggestions.

(17)

References

1 R. R. Coifman and S. Lafon, “Diﬀusion maps,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.

2 S. S. Lafon, Diﬀusion maps and geometric harmonics, Ph.D. thesis, Yale University, 2004.

3 R. R. Coifman and M. Maggioni, “Diﬀusion wavelets,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 53–94, 2006.

4 B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis, “Difusion maps, spectral clustering and reaction coordinates of dynamical systems,” Applied and Computational Harmonic Analysis, vol. 21, no.

1, pp. 113–127, 2006.

5 M. Brand and K. Huang, “A unifying theorem for spectral embedding and clustering,” in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, C. M. Bishop and B. J. Frey, Eds., Key West, Fla, USA, 2003,http://research.microsoft.com/enus/um/cambridge/

events/aistats2003/proceedings/papers.htm.

6 Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis, Springer, New York, NY, USA, 2007.

7 T. Hofmann, B. Sch ¨olkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.

8 J. J. Hunter, “Stationary distributions and mean first passage times of perturbed Markov chains,”

Research Letters in the Information and Mathematical Sciences, vol. 3, pp. 85–98, 2002.

9 I. Chavel, Eigenvalues in Riemannian geometry, vol. 115 of Pure and Applied Mathematics, Academic Press, Orlando, Fla, USA, 1984.

10 F. Fouss, A. Pirotte, J. -M. Renders, and M. Saerens, “Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 355–369, 2007.

11 M. Saerens, F. Fouss, L. Yen, and P. Dupont, “The principal components analysis of a graph, and its relationships to spectral clustering,” in Machine Learning: ECML 2004, vol. 3201 of Lecture Notes in Artificial Intelligence, pp. 371–383, Springer, Berlin, Germany, 2004.

12 U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–

416, 2007.

13 M. Goldberg and S. Kim, “Applications of some formulasfor finite state Markov chains,” Applied and Computational Harmonic Analysis. In press.

14 J. J. Hunter, “On the moments of Markov renewal processes,” Advances in Applied Probability, vol. 1, pp. 188–210, 1969.

15 J. J. Hunter, “Generalized inverses and their application to applied probability problems,” Linear Algebra and its Applications, vol. 45, pp. 157–198, 1982.

16 J. J. Hunter, “A survey of generalized inverses and their use in stochastic modelling,” Research Letters in the Information and Mathematical Sciences, vol. 1, pp. 25–36, 2000.

17 J. J. Hunter, “Generalized inverses, stationary distributions and mean first passage times with applications to perturbed Markov chains,” Research Letters in the Information and Mathematical Sciences, vol. 3, pp. 99–116, 2002.

18 J. G. Kemeny and J. L. Snell, Finite Markov Chains, The University Series in Undergraduate Mathematics, Van Nostrand, Princeton, NJ, USA, 1960.

19 B. Øksendal, Stochastic Diﬀerential Equations: An Introduction with applications, Springer, Berlin, Germany, 6th edition, 2007.

20 D. Kulasiri and W. Verwoerd, “Stochastic Dynamics: Modeling Solute Transport in Porous Media, vol. 44 of North-Holland Series in Applied Mathematics and Mechanics, Elsevier, Amsterdam, The Netherlands, 2002.

21 K. D. Elworthy, Stochastic Diﬀerential Equations on Manifolds, vol. 70 of London Mathematical Society Lecture Note Series, Cambridge University Press, Cambridge, UK, 1982.