An Explicit Example of Leave-One-Out Cross-Validation Parameter Estimation for a Univariate Radial Basis Function


L. Bos^a·F. Polato^b

Communicated by S. De Marchi

Abstract

We give an explicit example for the selection of the shape parameter for a certain univariate radial basis function (RBF) interpolation problem.

1 Introduction

Radial basis function (RBF) interpolation is an important method for the (multivariate) interpolation of typically scattered data, and it has been much used in applications. The basic form of RBF interpolation is as follows. Given a basis function $g:\mathbb{R}^+\to\mathbb{R}$, the associated RBF interpolant of a data set $\{(x_j,y_j)\}\subset\mathbb{R}^{d+1}$ with $n$ “sites” $x_j\in\mathbb{R}^d$ and function values $y_j\in\mathbb{R}$ is the function of the form

$$s(x)=\sum_{j=1}^{n}a_j\,g(|x-x_j|)$$

such that $s(x_i)=y_i$, $1\le i\le n$ (if it exists). Typically the basis function $g$ has a so-called shape parameter, the value of which has an important effect on both the quality of the resulting interpolant and the numerical conditioning of the associated linear system to be solved. For example, for the Gaussian basis function

$$g_\lambda(x):=\exp(-\lambda\|x\|_2^2),\quad \lambda>0,$$

small $\lambda\approx0$ gives a basis function that is nearly constant in a neighbourhood of the origin, while large $\lambda\approx\infty$ gives a basis function that is, for all intents and purposes, a delta function supported at the origin.
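The two regimes of the Gaussian shape parameter can be seen numerically. The following is a minimal sketch; the function name `gaussian_basis` and the sample distances are illustrative choices, not taken from the paper.

```python
import math

def gaussian_basis(r, lam):
    """Gaussian basis exp(-lam * r**2) at distance r >= 0, with lam > 0."""
    return math.exp(-lam * r * r)

# Profile at a few distances from the origin for small, moderate and large lam.
for lam in (1e-3, 1.0, 1e3):
    profile = [gaussian_basis(r, lam) for r in (0.0, 0.1, 0.5, 1.0)]
    print(lam, [round(v, 4) for v in profile])
```

For small `lam` the profile stays close to 1 on the whole range, while for large `lam` it drops to essentially 0 away from the origin.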

A discussion of practical methods for choosing the shape parameter may be found, for example, in [5, Chapt. 17], where it may be verified that the problem of selecting the shape parameter is indeed important and, in general, rather difficult. One may also consult the monographs [4, 7] for more on the theory of RBF.

Given this typical difficulty of analyzing multivariate interpolation procedures, it is often useful to look more carefully at the univariate case for some suggestion as to how the general case might behave. The goal of this paper is to give an explicit univariate example of one of the most commonly used procedures for selecting an “optimal” shape parameter, the so-called Leave-One-Out Cross-Validation procedure, in the hope that it sheds some light on what happens more generally.

2 Leave-One-Out Cross-Validation (LOOCV)

Consider RBF interpolation with a basis function $g_\lambda:\mathbb{R}^+\to\mathbb{R}$ dependent on some shape parameter $\lambda$. For a collection of sites $X=\{x_1,\dots,x_n\}\subset\mathbb{R}^d$ we let

$$X_j:=X\setminus\{x_j\},\quad j=1,2,\dots,n,$$

i.e., $X_j$ is the set of sites with $x_j$ left out. Then, let

$$s_j(x)=\sum_{k\ne j}a^{(j)}_k\,g_\lambda(|x-x_k|)$$

be such that

$$s_j(x_k)=y_k,\quad k\ne j.$$

In other words, $s_j$ is the RBF interpolant for the sites $X_j$. We may think of the value $s_j(x_j)$ as the value predicted by the data at $X_j$ for the left-out site $x_j$, and $e_j:=y_j-s_j(x_j)$, $1\le j\le n$, measures its discrepancy with the value in the full data set.

LOOCV selects the parameter $\lambda$ to minimize the 2-norm of the vector of discrepancies $e_j$, i.e., to minimize

$$E(\lambda):=\sum_{j=1}^{n}|e_j|^2.$$
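The definition of $E(\lambda)$ can be evaluated by brute force: for each left-out site, solve the interpolation system on the remaining sites and compare the prediction with the data. The following is a minimal univariate sketch under stated assumptions: the helper names `solve` and `loocv_error`, the optional `indices` argument, and the sample data are all illustrative and not part of the paper, and the small dense solver is only meant for a handful of sites.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for c in range(n):
        # Pivot on the largest entry in column c (the diagonal may vanish).
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def loocv_error(g, sites, values, indices=None):
    """Sum of squared leave-one-out discrepancies e_j over the given indices.

    g is the basis function of the distance; indices are 0-based and
    default to all sites.
    """
    n = len(sites)
    total = 0.0
    for j in (range(n) if indices is None else indices):
        xs = sites[:j] + sites[j + 1:]
        ys = values[:j] + values[j + 1:]
        A = [[g(abs(xi - xk)) for xk in xs] for xi in xs]
        a = solve(A, ys)
        s_j = sum(ak * g(abs(sites[j] - xk)) for ak, xk in zip(a, xs))
        total += (values[j] - s_j) ** 2
    return total

# Example: Gaussian basis with shape parameter lam on five univariate sites.
sites = [0.0, 0.2, 0.45, 0.7, 1.0]
values = [math.cos(3.0 * x) for x in sites]
for lam in (1.0, 5.0, 25.0):
    g = lambda r, lam=lam: math.exp(-lam * r * r)
    print(lam, loocv_error(g, sites, values))
```

Each evaluation of $E(\lambda)$ costs $n$ linear solves in this form, which is why closed-form expressions for $s_j(x_j)$ are attractive when they are available.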

Our example will be the LOOCV procedure applied to the functions

$$g_\lambda(x):=ae^{\lambda x}+be^{-\lambda x},\quad \lambda,a,b\in\mathbb{R} \tag{1}$$

and sites

$$0=x_1<x_2<x_3<\dots<x_n=1. \tag{2}$$

RBF interpolation by such $g_\lambda$ was considered in [1], where it is shown (Theorem 1) that the determinant of the associated interpolation matrix is

$$\det\,[g_\lambda(|x_i-x_j|)]_{1\le i,j\le n}=(b-a)^{n-2}\,e^{-2\lambda\sum_{j=1}^{n}x_j}\left\{\prod_{j=1}^{n-1}\left(e^{2\lambda x_{j+1}}-e^{2\lambda x_j}\right)\right\}\left(b^2e^{2\lambda x_1}-a^2e^{2\lambda x_n}\right).$$

Naturally, we must restrict the values of $a,b\in\mathbb{R}$ so that this determinant is non-zero and hence the interpolation problem has a unique solution.

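Given this restriction, it is then shown in [1, Theorem 5] that the cardinal functions $u_k$ (i.e., those linear combinations of the $g_\lambda(|\cdot-x_i|)$ with the property that $u_k(x_i)=\delta_{ik}$) are given by

$$u_k(x)=e^{\lambda(x_k-x)}\begin{cases}\dfrac{e^{2\lambda x}-e^{2\lambda x_{k-1}}}{e^{2\lambda x_k}-e^{2\lambda x_{k-1}}} & \text{if } x\in[x_{k-1},x_k],\\[2ex] \dfrac{e^{2\lambda x}-e^{2\lambda x_{k+1}}}{e^{2\lambda x_k}-e^{2\lambda x_{k+1}}} & \text{if } x\in[x_k,x_{k+1}],\\[2ex] 0 & \text{otherwise,}\end{cases}\qquad 2\le k\le n-1,$$

$$u_1(x)=e^{\lambda(x_1-x)}\begin{cases}\dfrac{e^{2\lambda x}-e^{2\lambda x_2}}{e^{2\lambda x_1}-e^{2\lambda x_2}} & \text{if } x\in[x_1,x_2],\\[2ex] 0 & \text{otherwise,}\end{cases}$$

$$u_n(x)=e^{\lambda(x_n-x)}\begin{cases}\dfrac{e^{2\lambda x}-e^{2\lambda x_{n-1}}}{e^{2\lambda x_n}-e^{2\lambda x_{n-1}}} & \text{if } x\in[x_{n-1},x_n],\\[2ex] 0 & \text{otherwise,}\end{cases}$$

independently of the values of $a$ and $b$!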

This allows us, for convenience's sake, to take $b=-a$ with $a=1/(2\lambda)$, for which our basis function (1) becomes

$$g_\lambda(x)=\frac{\sinh(\lambda x)}{\lambda}.$$

Note that

$$\lim_{\lambda\to0}g_\lambda(x)=x$$

and the interpolation becomes one by piecewise linear functions. We will therefore let $g_0(x):=x$.

Remark 1. It is worth noting at this point that as $\lambda$ increases the cardinal functions become more and more like delta functions. Indeed, it is easy to verify that

$$\lim_{\lambda\to\infty}u_k(x)=\begin{cases}1 & \text{for } x=x_k,\\ 0 & \text{for } x\ne x_k.\end{cases}$$
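The cardinal functions quoted from [1] are simple to evaluate directly. The sketch below is one possible transcription; the names `cardinal` and `interpolant`, the 0-based index `k`, and the sample sites are illustrative assumptions. It prints the Kronecker delta property and the decay described in Remark 1.

```python
import math

def cardinal(k, x, sites, lam):
    """Cardinal function u_k(x), 0-based k, increasing sites, lam != 0."""
    xk = sites[k]

    def branch(xa):
        # e^{lam (x_k - x)} (e^{2 lam x} - e^{2 lam xa}) / (e^{2 lam x_k} - e^{2 lam xa})
        num = math.exp(2 * lam * x) - math.exp(2 * lam * xa)
        den = math.exp(2 * lam * xk) - math.exp(2 * lam * xa)
        return math.exp(lam * (xk - x)) * num / den

    if k > 0 and sites[k - 1] <= x <= xk:
        return branch(sites[k - 1])      # left neighbour interval
    if k < len(sites) - 1 and xk <= x <= sites[k + 1]:
        return branch(sites[k + 1])      # right neighbour interval
    return 0.0

def interpolant(x, sites, values, lam):
    """s(x) = sum_k y_k u_k(x)."""
    return sum(y * cardinal(k, x, sites, lam) for k, y in enumerate(values))

sites = [0.0, 0.3, 0.7, 1.0]
# Kronecker delta property u_k(x_i) = delta_ik.
print([[round(cardinal(k, xi, sites, 2.0), 12) for xi in sites] for k in range(4)])
# Remark 1: away from x_k the cardinal function decays as lam grows.
print([cardinal(1, 0.5, sites, lam) for lam in (1.0, 10.0, 50.0)])
```

On each interval the quotient reduces to $\sinh(\lambda(x-x_{k\mp1}))/\sinh(\lambda(x_k-x_{k\mp1}))$, which may be the numerically safer form when $\lambda$ is large enough for the exponentials to overflow.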

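We will actually set $x_1=0$ and $x_n=1$ and select the parameter $\lambda$ according to the LOOCV principle of minimizing

$$E(\lambda):=\sum_{j=2}^{n-1}|e_j|^2$$

where

$$e_j:=y_j-s_j(x_j),\quad 2\le j\le n-1,$$

and $s_j$ is the interpolant for the sites $X_j$, explicitly

$$0=x_1<x_2<\dots<x_{j-1}<x_{j+1}<\dots<x_n=1. \tag{3}$$

To calculate the values $s_j(x_j)$ we will use the formulas for the cardinal functions $u_k$ given above. Indeed, let $u^{(j)}_k$ be the $k$th cardinal function for the sites (3). Then

$$s_j(x)=\sum_{k=1,\,k\ne j}^{n}y_k\,u^{(j)}_k(x)$$

and, in particular, due to the compact support of the cardinal functions,

$$s_j(x_j)=y_{j-1}u^{(j)}_{j-1}(x_j)+y_{j+1}u^{(j)}_{j+1}(x_j).$$

It is easily verified that then

$$s_j(x_j)=y_{j-1}e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}+y_{j+1}e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}$$

where, as usual, we have set $h_j:=x_{j+1}-x_j$, $1\le j\le n-1$. It follows that

$$e_j:=y_j-\left\{y_{j-1}e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}+y_{j+1}e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}\right\} \tag{4}$$

and

$$E(\lambda)=\sum_{j=2}^{n-1}\left\{y_{j-1}e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}+y_{j+1}e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}-y_j\right\}^2. \tag{5}$$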
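Formulas (4) and (5) make $E(\lambda)$ cheap to evaluate, since no linear system has to be solved. The following is a minimal sketch; the names `loo_weights` and `loocv_E`, the special case `lam == 0.0` for the piecewise linear limit, and the sample sites and data are illustrative assumptions.

```python
import math

def loo_weights(lam, h_prev, h_next):
    """Weights of y_{j-1} and y_{j+1} in s_j(x_j), as in formula (4).

    h_prev = x_j - x_{j-1}, h_next = x_{j+1} - x_j; lam = 0 is the
    piecewise linear limit.
    """
    if lam == 0.0:
        return h_next / (h_prev + h_next), h_prev / (h_prev + h_next)
    den = math.exp(2 * lam * h_next) - math.exp(-2 * lam * h_prev)
    w_prev = math.exp(-lam * h_prev) * (math.exp(2 * lam * h_next) - 1.0) / den
    w_next = math.exp(lam * h_next) * (1.0 - math.exp(-2 * lam * h_prev)) / den
    return w_prev, w_next

def loocv_E(lam, sites, values):
    """E(lam) of formula (5): squared discrepancies at the interior sites."""
    total = 0.0
    for j in range(1, len(sites) - 1):
        w_prev, w_next = loo_weights(lam, sites[j] - sites[j - 1],
                                     sites[j + 1] - sites[j])
        e_j = values[j] - (w_prev * values[j - 1] + w_next * values[j + 1])
        total += e_j ** 2
    return total

# Sample data from sinh(2x) on non-uniform sites.
sites = [0.0, 0.15, 0.4, 0.55, 0.8, 1.0]
values = [math.sinh(2.0 * x) for x in sites]
for lam in (0.0, 1.0, 2.0, 3.0):
    print(lam, loocv_E(lam, sites, values))
```

Multiplying numerator and denominator by $e^{\lambda(h_{j-1}-h_j)}$ shows that the two weights are $\sinh(\lambda h_j)/\sinh(\lambda(h_{j-1}+h_j))$ and $\sinh(\lambda h_{j-1})/\sinh(\lambda(h_{j-1}+h_j))$, a form that may be preferable numerically for large $|\lambda|$.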

We note that, since the interchange of $a$ and $b$ in the definition (1) of $g_\lambda$ is equivalent to replacing $\lambda$ by $-\lambda$, the formulas for $u_k$ are invariant under this replacement of $\lambda$ by $-\lambda$. Consequently, $E(-\lambda)=E(\lambda)$ and $E(\lambda)$ is an even function.

Remark 2. In case the data comes from $y(x)=\sinh(\alpha x)$ for some $\alpha\in\mathbb{R}$, we note that then

$$y(x)=\alpha g_\alpha(x)=\alpha g_\alpha(|x-x_1|),\quad x\in[0,1],$$

i.e., $y$ is in the span of the translates $g_\alpha(|\cdot-x_j|)$ and so its interpolant is itself. Consequently $E(\alpha)=0$ and $\lambda=\alpha$ is the optimal parameter. Further, as the $u_k(x)$ do not depend on the constants $a,b$ in (1), we also have, for example, that the optimal $\lambda=\alpha$ for $y(x)=\exp(\alpha x)$.

Remark 3. The optimal $\lambda$ need not be unique. For example, if we take $n=3$ with $y_1=y_3=0$ then $E(\lambda)=(0-y_2)^2$ is constant in $\lambda$.

Remark 4. An optimal $\lambda$ may not exist. For example, if we take $n=3$ with $y_1=y_3=+1$ and $y_2=-1$ then

$$E(\lambda)=\left\{e^{-\lambda h_1}\frac{e^{2\lambda h_2}-1}{e^{2\lambda h_2}-e^{-2\lambda h_1}}+e^{\lambda h_2}\frac{1-e^{-2\lambda h_1}}{e^{2\lambda h_2}-e^{-2\lambda h_1}}+1\right\}^2.$$

As the terms are all positive, $E(\lambda)>1$ while, as is easily verified, $\lim_{\lambda\to\infty}E(\lambda)=1$ (cf. Remark 1).
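The behaviour in Remark 4 can be observed numerically: $E(\lambda)$ stays above 1 and approaches it as $\lambda$ grows, so the infimum is not attained. A small sketch follows; the function name `E_remark4` and the choice $h_1=0.4$, $h_2=0.6$ are illustrative assumptions.

```python
import math

def E_remark4(lam, h1, h2):
    """E(lam) for n = 3 with y1 = y3 = +1 and y2 = -1 (lam > 0)."""
    den = math.exp(2 * lam * h2) - math.exp(-2 * lam * h1)
    w1 = math.exp(-lam * h1) * (math.exp(2 * lam * h2) - 1.0) / den
    w3 = math.exp(lam * h2) * (1.0 - math.exp(-2 * lam * h1)) / den
    return (w1 + w3 + 1.0) ** 2

h1, h2 = 0.4, 0.6  # sites 0 < 0.4 < 1
for lam in (0.5, 2.0, 8.0, 32.0):
    print(lam, E_remark4(lam, h1, h2))
```

For very large `lam` the computed value rounds to exactly 1.0 in floating point, so the strict inequality is only visible for moderate values.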

2.1 The case of a positive concave function

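Theorem 2.1. Suppose that $y(x)\ge0$ is concave ($y''(x)\le0$) for $x\in[0,1]$. Then $\lambda=0$ is an optimal LOOCV parameter, for any set of sites (2).

Proof. The discrepancy at $x_j$ is given by $e_j$, (4). For $\lambda=0$ this becomes

$$e^{(0)}_j:=y_j-\left\{y_{j-1}\frac{h_j}{h_{j-1}+h_j}+y_{j+1}\frac{h_{j-1}}{h_{j-1}+h_j}\right\}. \tag{6}$$

We claim that for $2\le j\le n-1$,

$$e_j\ge e^{(0)}_j\quad(\ge0 \text{ since } y(x) \text{ is concave}).$$

As $E$ is even, it suffices to consider $\lambda>0$. To see the claim, first note that

$$\begin{aligned}
e_j\ge e^{(0)}_j
\iff{}& y_j-\left\{y_{j-1}e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}+y_{j+1}e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}\right\}\\
&\quad\ge y_j-\left\{y_{j-1}\frac{h_j}{h_{j-1}+h_j}+y_{j+1}\frac{h_{j-1}}{h_{j-1}+h_j}\right\}\\
\iff{}& y_{j-1}\frac{h_j}{h_{j-1}+h_j}+y_{j+1}\frac{h_{j-1}}{h_{j-1}+h_j}\\
&\quad\ge y_{j-1}e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}+y_{j+1}e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}.
\end{aligned}$$

Since, by assumption, $y_{j\pm1}\ge0$, for this it suffices to show that

$$\frac{h_j}{h_{j-1}+h_j}\ge e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}} \tag{7}$$

and

$$\frac{h_{j-1}}{h_{j-1}+h_j}\ge e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}. \tag{8}$$

To see (7), set $x:=\lambda h_{j-1}$ and $y:=\lambda h_j$. Then

$$\frac{h_j}{h_{j-1}+h_j}=\frac{y}{x+y}$$

while

$$e^{-\lambda h_{j-1}}\frac{e^{2\lambda h_j}-1}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}=\frac{e^{x+2y}-e^x}{e^{2(x+y)}-1}\le\frac{y}{x+y}=\frac{h_j}{h_{j-1}+h_j}$$

by Lemma 2.2 below. Similarly, for (8), set $x:=\lambda h_j$ and $y:=\lambda h_{j-1}$ so that

$$e^{\lambda h_j}\frac{1-e^{-2\lambda h_{j-1}}}{e^{2\lambda h_j}-e^{-2\lambda h_{j-1}}}=\frac{e^{x+2y}-e^x}{e^{2(x+y)}-1}\le\frac{y}{x+y}=\frac{h_{j-1}}{h_{j-1}+h_j},$$

again by Lemma 2.2 below. Hence $e_j\ge e^{(0)}_j\ge0$ for $2\le j\le n-1$, and so $E(\lambda)\ge E(0)$.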
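Inequalities (7) and (8), and the conclusion of Theorem 2.1, can be spot-checked numerically. The sketch below assumes the weight formulas of (4) and (6); the names `weights` and `E`, the grid of spacings, and the sample concave function $y(x)=0.1+x(1-x)$ are illustrative choices, and a finite grid is of course not a proof.

```python
import math

def weights(lam, h_prev, h_next):
    """Weights of y_{j-1}, y_{j+1} in s_j(x_j); lam = 0 gives the linear case (6)."""
    if lam == 0.0:
        return h_next / (h_prev + h_next), h_prev / (h_prev + h_next)
    den = math.exp(2 * lam * h_next) - math.exp(-2 * lam * h_prev)
    return (math.exp(-lam * h_prev) * (math.exp(2 * lam * h_next) - 1.0) / den,
            math.exp(lam * h_next) * (1.0 - math.exp(-2 * lam * h_prev)) / den)

def E(lam, sites, values):
    """Sum of squared discrepancies at the interior sites."""
    total = 0.0
    for j in range(1, len(sites) - 1):
        w_prev, w_next = weights(lam, sites[j] - sites[j - 1], sites[j + 1] - sites[j])
        total += (values[j] - w_prev * values[j - 1] - w_next * values[j + 1]) ** 2
    return total

# Inequalities (7) and (8): the lam > 0 weights never exceed the linear ones.
grid = [0.05, 0.1, 0.3, 0.5]
weights_ok = all(
    w0 >= w
    for lam in (0.5, 1.0, 4.0, 16.0) for hp in grid for hn in grid
    for w0, w in zip(weights(0.0, hp, hn), weights(lam, hp, hn))
)

# Theorem 2.1 on a sample concave, non-negative function (illustrative choice).
sites = [0.0, 0.1, 0.35, 0.5, 0.8, 1.0]
values = [0.1 + x * (1.0 - x) for x in sites]
theorem_ok = all(E(lam, sites, values) >= E(0.0, sites, values)
                 for lam in (0.25, 1.0, 3.0, 10.0))
print(weights_ok, theorem_ok)
```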

Lemma 2.2. For all $x,y>0$,

$$\frac{e^{2y+x}-e^x}{e^{2(x+y)}-1}\le\frac{y}{x+y}.$$

Proof. This holds iff

$$\begin{aligned}
&\frac{e^{2y+x}-e^x}{y}\le\frac{e^{2(x+y)}-1}{x+y}\\
\iff{}& e^x\,\frac{e^{2y}-1}{y}\le\frac{e^{2(x+y)}-1}{x+y}\\
\iff{}& \frac{e^{2y}-1}{y}\le e^{-x}\,\frac{e^{2(x+y)}-1}{x+y}\\
\iff{}& h(y)\le e^{-x}h(x+y)
\end{aligned}$$

for $h(t):=(e^{2t}-1)/t$.

Hence consider, for a fixed $y>0$,

$$f(x):=e^{-x}h(x+y).$$

We need to show that $f(x)\ge h(y)=f(0)$, $x\ge0$, i.e., that the minimum of $f$ on $[0,\infty)$ is $f(0)$. To see this we calculate

$$f'(x)=-e^{-x}h(x+y)+e^{-x}h'(x+y)=e^{-x}\{h'(x+y)-h(x+y)\}.$$

But by Lemma 2.3, $h'(x+y)\ge h(x+y)$ and so $f'(x)\ge0$ and $f$ is increasing on $[0,\infty)$.

Lemma 2.3. Let

$$h(t):=\frac{e^{2t}-1}{t}$$

(with $h(0):=\lim_{t\to0}h(t)=2$). Then, for $t\ge0$, $h'(t)\ge h(t)$.

Proof. We calculate

$$h'(t)=\frac{2te^{2t}-e^{2t}+1}{t^2}.$$

Hence

$$\begin{aligned}
h'(t)\ge h(t)
\iff{}& \frac{(2t-1)e^{2t}+1}{t^2}\ge\frac{e^{2t}-1}{t}\\
\iff{}& (2t-1)e^{2t}+1\ge t(e^{2t}-1)\\
\iff{}& (t-1)e^{2t}+1\ge-t\\
\iff{}& (t-1)e^{2t}+1+t\ge0.
\end{aligned}$$

Now, if $t\ge1$, $(t-1)\ge0$ and this latter inequality is clearly true. On the other hand, if $0\le t<1$, then $(t-1)<0$, and

$$(t-1)e^{2t}+1+t\ge0\iff1+t\ge(1-t)e^{2t}\iff\frac{1+t}{1-t}\ge e^{2t},$$

which is true by Lemma 2.4.
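The inequalities of Lemmas 2.2 and 2.3, together with the bound $e^{2t}\le(1+t)/(1-t)$ used in the last step, can be spot-checked on a grid. This is only a sanity check under illustrative choices of names (`h`, `h_prime`, `lemma22_gap`) and grid points, not a substitute for the proofs.

```python
import math

def h(t):
    """h(t) = (e^{2t} - 1) / t for t > 0 (Lemma 2.3)."""
    return math.expm1(2 * t) / t

def h_prime(t):
    """h'(t) = ((2t - 1) e^{2t} + 1) / t^2 for t > 0."""
    return ((2 * t - 1) * math.exp(2 * t) + 1.0) / (t * t)

def lemma22_gap(x, y):
    """y/(x+y) minus the left-hand side of Lemma 2.2 (should be >= 0)."""
    lhs = (math.exp(2 * y + x) - math.exp(x)) / math.expm1(2 * (x + y))
    return y / (x + y) - lhs

ts = [0.05 * k for k in range(1, 101)]            # t in (0, 5]
lemma23_ok = all(h_prime(t) >= h(t) for t in ts)
lemma22_ok = all(lemma22_gap(x, y) >= 0 for x in ts[::5] for y in ts[::5])
# The bound e^{2t} <= (1 + t)/(1 - t), checked on grid points below 1.
bound_ok = all(math.exp(2 * t) <= (1 + t) / (1 - t) for t in ts if t < 1)
print(lemma22_ok, lemma23_ok, bound_ok)
```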

Lemma 2.4. For $0\le t<1$,

$$e^{2t}\le\frac{1+t}{1-t}.$$

Proof. The Taylor series for $e^{2t}$ is

$$e^{2t}=1+\sum_{k=1}^{\infty}\frac{2^k}{k!}t^k,$$

while the Taylor series for $(1+t)/(1-t)$ is

$$\frac{1+t}{1-t}=\frac{2}{1-t}-1=2\sum_{k=0}^{\infty}t^k-1=1+\sum_{k=1}^{\infty}2t^k.$$

Now it is easy to confirm by induction that $2^k/k!\le2$, $k=1,2,\dots$. Hence, comparing Taylor series, we are done.

2.2 The case of equally spaced sites

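Here we consider the sites $x_j:=(j-1)h$, $1\le j\le n$, for $h:=1/(n-1)$. In this case the $e_j$ simplify to

$$e_j=y_j-\frac{1}{e^{\lambda h}+e^{-\lambda h}}\,(y_{j-1}+y_{j+1})=y_j-\frac12\,\mathrm{sech}(\lambda h)\,(y_{j-1}+y_{j+1})$$

and

$$E(\lambda)=\sum_{j=2}^{n-1}\left\{y_j-\frac12\,\mathrm{sech}(\lambda h)\,(y_{j-1}+y_{j+1})\right\}^2.$$

Since, as noted previously, $E(-\lambda)=E(\lambda)$, we minimize over the interval $[0,\infty)$. We easily calculate

$$\begin{aligned}
E'(\lambda)&=h\,\mathrm{sech}(\lambda h)\tanh(\lambda h)\times\sum_{j=2}^{n-1}\left\{y_j-\frac12\,\mathrm{sech}(\lambda h)\,(y_{j-1}+y_{j+1})\right\}(y_{j-1}+y_{j+1})\\
&=h\,\mathrm{sech}(\lambda h)\tanh(\lambda h)\times\left\{\sum_{j=2}^{n-1}y_j(y_{j-1}+y_{j+1})-\frac12\,\mathrm{sech}(\lambda h)\sum_{j=2}^{n-1}(y_{j-1}+y_{j+1})^2\right\}\\
&=h\,\mathrm{sech}(\lambda h)\tanh(\lambda h)\left\{A-\frac12\,\mathrm{sech}(\lambda h)\,B\right\}
\end{aligned}$$

where we have set

$$A:=\sum_{j=2}^{n-1}y_j(y_{j-1}+y_{j+1})\quad\text{and}\quad B:=\sum_{j=2}^{n-1}(y_{j-1}+y_{j+1})^2.$$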

First note that $h\,\mathrm{sech}(\lambda h)\tanh(\lambda h)=0$ iff $\lambda=0$, which is already an endpoint of our interval. The case of $B=0$ is a bit special. For then, for $\lambda>0$, $\mathrm{sgn}(E'(\lambda))=\mathrm{sgn}(A)$. Hence, then, $\lambda=0$ is the minimum if $A>0$, there is no minimum if $A<0$ and $E(\lambda)$ is constant if $A=0$.

Suppose then that $B\ne0$. Then, as $B>0$ and $\mathrm{sech}(t)\le1$, if $A\ge B/2$ then $E'(\lambda)>0$ for $\lambda>0$ and $\lambda=0$ is the unique optimal parameter.

In case $A\le0$ then $E'(\lambda)<0$ for $\lambda>0$ and again there is no minimum on $[0,\infty)$. Otherwise, in case $0<A<B/2$ there is a critical point given by

$$\mathrm{sech}(\lambda h)=2A/B,\quad\lambda=\frac1h\,\mathrm{sech}^{-1}(2A/B),$$

which must necessarily be the optimum $\lambda$.
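The case analysis above translates into a short routine. The following is a sketch under stated assumptions: the function name `optimal_lambda_equispaced` and the convention of returning `None` when no minimum exists are illustrative choices, and $\mathrm{sech}^{-1}(t)$ is computed as `acosh(1/t)`. When $B=0$ every term $y_{j-1}+y_{j+1}$ vanishes, so $A=0$ as well and $E$ is constant.

```python
import math

def optimal_lambda_equispaced(values):
    """LOOCV-optimal lam >= 0 for equally spaced sites on [0, 1].

    Returns 0.0, a positive float, or None when E has no minimum on [0, inf).
    """
    n = len(values)
    h = 1.0 / (n - 1)
    sums = [values[j - 1] + values[j + 1] for j in range(1, n - 1)]
    A = sum(values[j] * s for j, s in zip(range(1, n - 1), sums))
    B = sum(s * s for s in sums)
    if B == 0.0:
        return 0.0            # E is constant, so every lam is optimal
    if A <= 0.0:
        return None           # E'(lam) < 0 for all lam > 0: no minimum
    if 2.0 * A >= B:
        return 0.0            # E'(lam) > 0 for all lam > 0
    return math.acosh(B / (2.0 * A)) / h   # sech(lam h) = 2A/B

# Data from sinh(1.5 x): Remark 2 suggests the optimum lam = 1.5.
n = 11
ys = [math.sinh(1.5 * j / (n - 1)) for j in range(n)]
print(optimal_lambda_equispaced(ys))
```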

If we let $n\to\infty$ we may observe that for $y(x)\in C^2[0,1]$,

$$\begin{aligned}
\frac{2A}{B}&=\frac{2\sum_{j=2}^{n-1}y_j(y_{j-1}+y_{j+1})}{\sum_{j=2}^{n-1}(y_{j-1}+y_{j+1})^2}\\
&=\frac{2h\sum_{j=2}^{n-1}y_j(y_{j-1}+y_{j+1})}{h\sum_{j=2}^{n-1}(y_{j-1}+y_{j+1})^2}\\
&=\frac{4\int_0^1(y(x))^2\,dx+2h^2\int_0^1y(x)y''(x)\,dx+O(h^3)}{4\int_0^1(y(x))^2\,dx+4h^2\int_0^1y(x)y''(x)\,dx+O(h^3)}\\
&=\frac{1+\frac12h^2\int_0^1y(x)y''(x)\,dx\big/\int_0^1(y(x))^2\,dx+O(h^3)}{1+h^2\int_0^1y(x)y''(x)\,dx\big/\int_0^1(y(x))^2\,dx+O(h^3)}\\
&=1-\frac{h^2}{2}\,\frac{\int_0^1y(x)y''(x)\,dx}{\int_0^1(y(x))^2\,dx}+O(h^3).
\end{aligned}$$

In the case that

$$\int_0^1y(x)y''(x)\,dx\ge0$$

this expression is at most 1 (for $h$ sufficiently small). Assuming this to be the case and using the fact that

$$\mathrm{sech}^{-1}(t)=\log\left(\frac{1+\sqrt{1-t^2}}{t}\right),$$

we obtain that, for large $n$, the optimal parameter is then

$$\lambda=\sqrt{\frac{\int_0^1y(x)y''(x)\,dx}{\int_0^1(y(x))^2\,dx}}+O(h).$$
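The asymptotic formula can be compared with the exact critical point $\mathrm{sech}(\lambda h)=2A/B$ for increasing $n$. The sketch below uses the sample function $y(x)=1+x^2$, an illustrative choice for which $\int_0^1 y\,y''\,dx=8/3$ and $\int_0^1 y^2\,dx=28/15$, so the predicted limit is $\sqrt{10/7}$; the function name `optimal_lambda` is likewise an assumption.

```python
import math

def optimal_lambda(values):
    """Critical point sech(lam h) = 2A/B for equally spaced sites (assumes 0 < A < B/2)."""
    n = len(values)
    h = 1.0 / (n - 1)
    A = sum(values[j] * (values[j - 1] + values[j + 1]) for j in range(1, n - 1))
    B = sum((values[j - 1] + values[j + 1]) ** 2 for j in range(1, n - 1))
    return math.acosh(B / (2.0 * A)) / h

# y(x) = 1 + x^2: y'' = 2, int y y'' = 8/3, int y^2 = 28/15, ratio 10/7.
limit = math.sqrt(10.0 / 7.0)
for n in (11, 101, 1001):
    ys = [1.0 + (j / (n - 1)) ** 2 for j in range(n)]
    print(n, optimal_lambda(ys), limit)
```

The printed values should approach the limit at the rate suggested by the $O(h)$ term.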

3 Comparison with a Maximum Likelihood Estimate

In the context of Kriging, in which RBF interpolation is embedded in a statistical context, it has been suggested to use a Maximum Likelihood Estimate (MLE) for the optimal parameter. Here, the interpolation matrix $R\in\mathbb{R}^{n\times n}$ given by $R=[g_\lambda(|x_i-x_j|)]_{1\le i,j\le n}$ is interpreted as a covariance matrix for a certain family of $n$ random variables. As such $R$ is assumed to be positive definite, and then the MLE parameter turns out to be the $\lambda$ (see e.g. [6]) for which

$$m(\lambda):=|\det(R)|^{1/n}\,|y^tR^{-1}y| \tag{9}$$

is a minimum.

In the case of our example, the interpolation matrix is not positive definite (it is, however, conditionally definite on an $(n-1)$-dimensional subspace) and hence the statistical interpretation of Kriging does not directly apply. However, one may nevertheless attempt to minimize the expression (9) and compare with the LOOCV parameter. Indeed, doing so reveals an interesting relation between the two approaches. We concentrate on the case of equally spaced sites, $x_j=(j-1)h$, $1\le j\le n$, $h:=1/(n-1)$.

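First note that from Propositions 3.1 and 3.2 of [1] we have then that

$$R^{-1}=\frac{\lambda}{2\sinh(h\lambda)}\times\begin{bmatrix}
-\dfrac{\sinh((1-h)\lambda)}{\sinh(\lambda)} & 1 & 0 & \cdots & 0 & \dfrac{\sinh(h\lambda)}{\sinh(\lambda)}\\[1.5ex]
1 & -2\cosh(h\lambda) & 1 & 0 & \cdots & 0\\
0 & 1 & -2\cosh(h\lambda) & 1 & \cdots & 0\\
 & & \ddots & \ddots & \ddots & \\
0 & \cdots & 0 & 1 & -2\cosh(h\lambda) & 1\\[1.5ex]
\dfrac{\sinh(h\lambda)}{\sinh(\lambda)} & 0 & \cdots & 0 & 1 & -\dfrac{\sinh((1-h)\lambda)}{\sinh(\lambda)}
\end{bmatrix}.$$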
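The stated form of $R^{-1}$ can be checked numerically by multiplying the bracketed matrix by $R$ and comparing with $\frac{2\sinh(h\lambda)}{\lambda}I$. The sketch below assumes the basis $g_\lambda(x)=\sinh(\lambda x)/\lambda$ and equally spaced sites as in the text; the function names and the sample values of $n$ and $\lambda$ are illustrative.

```python
import math

def interpolation_matrix(n, lam):
    """R = [sinh(lam |x_i - x_j|) / lam] for the sites x_j = (j-1) h, h = 1/(n-1)."""
    h = 1.0 / (n - 1)
    return [[math.sinh(lam * abs(i - j) * h) / lam for j in range(n)] for i in range(n)]

def scaled_inverse(n, lam):
    """The bracketed matrix, i.e. (2 sinh(h lam) / lam) R^{-1}, for n >= 3."""
    h = 1.0 / (n - 1)
    M = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        M[i][i - 1] = M[i][i + 1] = 1.0
        M[i][i] = -2.0 * math.cosh(h * lam)
    corner = -math.sinh((1.0 - h) * lam) / math.sinh(lam)
    far = math.sinh(h * lam) / math.sinh(lam)
    M[0][0], M[0][1], M[0][n - 1] = corner, 1.0, far
    M[n - 1][n - 1], M[n - 1][n - 2], M[n - 1][0] = corner, 1.0, far
    return M

def max_residual(n, lam):
    """Largest entry of |M R - (2 sinh(h lam) / lam) I|."""
    h = 1.0 / (n - 1)
    R, M = interpolation_matrix(n, lam), scaled_inverse(n, lam)
    c = 2.0 * math.sinh(h * lam) / lam
    return max(
        abs(sum(M[i][k] * R[k][j] for k in range(n)) - (c if i == j else 0.0))
        for i in range(n) for j in range(n)
    )

print(max_residual(8, 2.0))  # expected to be at rounding-error level
```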

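Let $M:=\frac{2\sinh(h\lambda)}{\lambda}R^{-1}$ be the matrix appearing in the formula for $R^{-1}$ above. We calculate

$$\begin{aligned}
y^tMy={}&y_1\left\{-\frac{\sinh((1-h)\lambda)}{\sinh(\lambda)}\,y_1+y_2+\frac{\sinh(h\lambda)}{\sinh(\lambda)}\,y_n\right\}\\
&+\sum_{i=2}^{n-1}y_i\left\{y_{i-1}-2\cosh(h\lambda)\,y_i+y_{i+1}\right\}\\
&+y_n\left\{\frac{\sinh(h\lambda)}{\sinh(\lambda)}\,y_1+y_{n-1}-\frac{\sinh((1-h)\lambda)}{\sinh(\lambda)}\,y_n\right\}\\
={}&h\,y_1\left\{\frac{y_2-y_1}{h}+\frac1h\left(1-\frac{\sinh((1-h)\lambda)}{\sinh(\lambda)}\right)y_1+\frac{\sinh(h\lambda)}{h\sinh(\lambda)}\,y_n\right\}\\
&+h^2\sum_{i=2}^{n-1}\left\{y_i\,\frac{y_{i-1}-2y_i+y_{i+1}}{h^2}+2y_i^2\,\frac{1-\cosh(h\lambda)}{h^2}\right\}\\
&+h\,y_n\left\{\frac{\sinh(h\lambda)}{h\sinh(\lambda)}\,y_1+\frac{y_{n-1}-y_n}{h}+\frac1h\left(1-\frac{\sinh((1-h)\lambda)}{\sinh(\lambda)}\right)y_n\right\}.
\end{aligned}$$

Then, taking the limit as $h\to0^+$, we see that

$$\begin{aligned}
\lim_{h\to0^+}\frac1h\,y^tMy={}&y(0)\left\{y'(0)+\lambda\coth(\lambda)\,y(0)+\frac{\lambda}{\sinh(\lambda)}\,y(1)\right\}\\
&+\int_0^1y(x)y''(x)\,dx-\lambda^2\int_0^1y^2(x)\,dx\\
&+y(1)\left\{-y'(1)+\lambda\coth(\lambda)\,y(1)+\frac{\lambda}{\sinh(\lambda)}\,y(0)\right\}.
\end{aligned}$$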
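The limit formula can be compared with the discrete quadratic form for a sample function. The sketch below uses $y(x)=e^x$ (an illustrative choice, for which $y'=y''=y$ and both integrals equal $(e^2-1)/2$) and $\lambda=1.5$; the function names `quad_form_over_h` and `limit_value` are assumptions for this example.

```python
import math

def quad_form_over_h(y, lam, n):
    """(1/h) y^t M y for samples of y at x_j = (j-1) h, h = 1/(n-1)."""
    h = 1.0 / (n - 1)
    v = [y(j * h) for j in range(n)]
    corner = -math.sinh((1.0 - h) * lam) / math.sinh(lam)
    far = math.sinh(h * lam) / math.sinh(lam)
    total = v[0] * (corner * v[0] + v[1] + far * v[-1])        # first row of M
    total += sum(v[i] * (v[i - 1] - 2.0 * math.cosh(h * lam) * v[i] + v[i + 1])
                 for i in range(1, n - 1))                     # interior rows
    total += v[-1] * (far * v[0] + v[-2] + corner * v[-1])     # last row of M
    return total / h

def limit_value(y0, y1, dy0, dy1, int_y_ypp, int_y2, lam):
    """Right-hand side of the limit formula, from boundary data and two integrals."""
    coth = math.cosh(lam) / math.sinh(lam)
    csch = 1.0 / math.sinh(lam)
    return (y0 * (dy0 + lam * coth * y0 + lam * csch * y1)
            + int_y_ypp - lam ** 2 * int_y2
            + y1 * (-dy1 + lam * coth * y1 + lam * csch * y0))

# y(x) = e^x: y' = y'' = y and both integrals equal (e^2 - 1) / 2.
lam, e = 1.5, math.e
integral = (e * e - 1.0) / 2.0
target = limit_value(1.0, e, 1.0, e, integral, integral, lam)
for n in (11, 101, 1001):
    print(n, quad_form_over_h(math.exp, lam, n), target)
```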

Now, notice that $m(\lambda)$, being a non-negative value, will be minimized if for some $\lambda$, $y^tR^{-1}y=0$, or, equivalently, $y^tMy=0$. If we were to ignore the boundary terms involving the values of $y(x)$ and $y'(x)$ at $x=0,1$, this would happen (approximately) when

$$\int_0^1y(x)y''(x)\,dx-\lambda^2\int_0^1y^2(x)\,dx=0,$$

i.e., for

$$\lambda=\sqrt{\frac{\int_0^1y(x)y''(x)\,dx}{\int_0^1(y(x))^2\,dx}},$$

i.e., for essentially the same value as the LOOCV parameter in this circumstance!

However, the boundary terms do alter this optimal value of the parameter. In Figure 1 below we give the plots of the resulting interpolants for the LOOCV parameter ($\lambda=1.000658$) and the MLE estimate computed numerically ($\lambda=2.563091$), for $n=13$.

The MLE interpolant is noticeably worse. However, we emphasize that, as $R$ is not positive definite, the MLE approach is technically not applicable. It is nonetheless interesting that, apart from the boundary effects, the two approaches provide the same parameter estimate in this circumstance.

4 Conclusions

We have given a univariate example where it is possible to give explicit values for the optimal LOOCV parameter for RBF interpolation. We do not claim that this is a practical example – we give it only in the hope that it may provide some small insight into the difficult general problem of RBF parameter selection.

Acknowledgments. Work supported by the ex-60% funds of the University of Verona.

Figure 1: Left: LOOCV interpolant; Right: MLE interpolant. (Plots not reproduced; in both panels the horizontal axis runs from 0 to 1 and the vertical axis from 0.99 to 1.07.)

References

[1] L. Bos and S. De Marchi. Univariate Radial Basis Functions with Compact Support Cardinal Functions. East J. Approx., Vol. 14 (1) (2008), 69–80.

[2] L. Bos and U. Maier. On the asymptotics of points which maximize determinants of the form det[g(|x_i−x_j|)]. In “Advances in Multivariate Approximation”, Mathematical Research (W. Haussmann et al., Eds.), Vol. 107, pp. 107–128, Wiley-VCH, Berlin, 1999.

[3] L. Bos and U. Maier. On the Asymptotics of Fekete-Type Points for Univariate Radial Basis Interpolation. J. of Approx. Theory 119, 252–270 (2002).

[4] M. Buhmann. Radial Basis Functions. Cambridge U. Press (2003).

[5] G. Fasshauer. Meshfree Approximation Methods with Matlab. World Scientific Press (2007).

[6] S. Lophaven, H. B. Nielsen, and J. Sondergaard. DACE, A Matlab Kriging Toolbox. Technical Report IMM-TR-2002-12, Technical Univ. of Denmark (2002).

[7] H. Wendland. Scattered Data Approximation. Cambridge U. Press (2005).