Kernel Methods - KERNEL MATRIX COMPLETION

Figure 2.2: Embedding of data points into a feature space.

In the input space X, no linear classifier could be found to separate the two classes (blue and orange). However, when the data points were mapped into a higher dimensional spaceF, a linear classifier was

found.

product. Likewise, once the classifier has been trained, a new pointxcan be classified by (2.18) only through the inner product ofxwith the support vectors, which makes SVM very efficient to use. The SVM algorithm was originally developed by Vapnik and Chervonenkis in 1963, and gained popularity when Boser, Guyon, and Vapnik extended SVM to nonlinear classifiers in 1992, where kernel trick was applied.

2.2 Kernel Methods

For data sets with more complex pattern, a linear classifier may not be found in the original input spaceX. In such cases, the data is said to be non-linearly separable in X. However, when these data points are mapped to a higher-dimensional spaceF, referred to asfeature space, a linear classifier can be found, as in the case in Fig. 2.2.

Situations like these require learning algorithms to change the representation of the data through some user-specified nonlinear mapping function φ : X → F, and is permissible to do so since changing the data’s representation or coordinates does not change the underlying patterns or regularities in the data [47].

A class of pattern recognition algorithms, thekernel methods, maps the data into a higher-dimensional space where the patterns can be easily detected. This efficient mapping of the data in the feature space is done via a kernel function, formally defined as follows:

Definition 2.2.1 (Kernel). A kernel is a function κ such that for all x,z ∈ X satisfies

κ(x,z) =hφ(x), φ(z)i,

where φis a mapping from an input space X to an (inner product) feature spaceF φ:x7→φ(x)∈ F.

10 Chapter 2. Preliminaries Simply put, a kernel function takes the inner product of two data images under an embeddingφ.

2.2.1 The Kernel Trick

Popularly termed askernel trick, one can conveniently replace any inner product of two data images appearing in a learning algorithm, by a valid kernel function. To see this, suppose a linear classifier is being constructed in a feature spaceF for the data images under a mappingφ:x7→φ(x)∈ F. By kernel trick, the dual form in (2.13) can be rewritten as

W(α) =

i=1

α_i−1 2

i,j=1

α_iα_jy_iy_jhφ(x_i), φ(x_j)i (2.19)

i=1

α_i−1 2

i,j=1

α_iα_jy_iy_jκ(x_i,x_j), (2.20)

while the decision function in (2.18) can be expressed as

f(x) = sign X

i∈S

α⁰_iyihφ(xi), φ(x)i+b0

= sign X

i∈S

α⁰_iy_iκ(x_i,x) +b₀

, (2.21)

which corresponds to the optimal hyperplane separating the data images in the feature space.

Observe that through kernels, one does not have to explicitly compute the coor-dinates of the data in its new representation; rather, one has just be able to evaluate the corresponding kernel.

As an illustration, given the input data X = (x1, . . . ,x_`)^> ∈ R^`×n, the SVM algorithm considered in the previous section operates only through the inner product of the input:







hx₁,x1i hx₁,x2i · · · hx₁,x`i hx₂,x₁i hx₂,x₂i · · · hx₂,x_`i

... ... . .. ... hx_`,x1i hx_`,x2i · · · hx_`,x_`i







=XX^>, (2.22)

which is called the Gram matrix³ of X. Note that since all the data information needed in the algorithm is contained in this matrix, the original data can be discarded.

3The matrix is named after the Danish actuary and mathematician Jorgen P. Gram (1850–1916).

2.2. Kernel Methods 11 This is highly desirable especially when the number of objects`is much smaller than the dimensionn, which is usually the case.

Now, when the data points are mapped underφ, the information needed in (2.19) is

K =







hφ(x₁), φ(x₁)i hφ(x₁), φ(x₂)i · · · hφ(x₁), φ(x_`)i hφ(x₂), φ(x1)i hφ(x₂), φ(x2)i · · · hφ(x₂), φ(x_`)i

... ... . .. ...

hφ(x_`), φ(x₁)i hφ(x_`), φ(x₂)i · · · hφ(x_`), φ(x_`)i













κ(x1,x1) κ(x1,x2) · · · κ(x1,x_`) κ(x₂,x₁) κ(x₂,x₂) · · · κ(x₂,x_`)

... ... . .. ... κ(x_`,x₁) κ(x_`,x₂) · · · κ(x_`,x_`)







, (2.23)

which is called the kernel matrix. It can be observed that the entries of the kernel matrix can be obtained in only one operation, i. e., through a kernel function, instead of mapping the data and then computing the inner product. As a result, the mapping φcan be left implicit. Moreover, since the kernel is operating on pairs of data points, it can be viewed as a similarity measure between those two points, and hence the kernel matrix represents some generalized similarity measure between input vectors.

It was mentioned earlier that a valid kernel is needed in order to exploit the kernel trick. Fortunately, there is no need to construct the mapping explicitly to test whether or not a function constitutes a valid kernel—a necessary and sufficient condition is that for all the possible choices of the (training) data, the kernel matrix must be positive semidefinite⁴, as stated inMercer’s Theorem:

Theorem 2.2.1 (Mercer’s Theorem). A symmetric function κ(xi,xj) can be ex-pressed as an inner product

κ(x_i,x_j) =hφ(x_i), φ(x_j)i

for some φ if and only if the kernel matrix (2.23) is positive semidefinite for any x_i,x_j ∈Rⁿ, for i, j= 1, . . . , `.

This theorem serves as a powerful tool that allows us to handle kernels without caring how the mappingφnor the corresponding feature space look like. As long as the necessary condition for kernel matrices–the positive semidefiniteness–is maintained, a valid kernel is guaranteed, as well as the existence of its corresponding feature space.

As a matrix containing the kernel values for every pair of data points, the kernel matrix (as well as the Gram matrix) is positive semidefinite, and the proof is rather

4A matrixA∈R^n×nis positive semidefinite if for allα∈Rⁿ,α^>Aα≥0.

12 Chapter 2. Preliminaries straightforward. Indeed, lettingKi,j =κ(xi,xj) =hφ(x_i), φ(xj)ibe thei, j-th entry of kernel matrixK, and for anyα∈Rⁿ,

α^>Kα=

i,j=1

α_iα_jK_ij =

i,j=1

α_iα_jhφ(x_i), φ(x_j)i

* _` X

i=1

α_iφ(x_i),

j=1

α_jφ(x_j) +

i=1

α_iφ(x_i)

≥0.

2.2.2 Constructing Kernels

Given the characterizations for valid kernels, one can now create functions derived from simpler kernels without explicitly constructing the feature space. Provided that the new function created is positive semidefinite, the function is a valid kernel, and the feature space where the inner product is computed by this new function exists.

The following proposition shows the closure property of kernels under some op-erations, i. e., when one or more kernels undergo with such opop-erations, the positive semidefinite property of kernels is preserved [47]:

Proposition 2.2.1 (Closure properties). Let κ₁ and κ₂ be kernels over X × X, X ⊆Rⁿ, a∈R⁺, f(·) a real-valued function on X, φ:X 7→ F with κ3 a kernel over F × F, and B a symmetric positive semidefinite n×n matrix. Then the following functions are kernels:

(i) κ(x,z) =κ1(x,z) +κ2(x,z);

(ii) κ(x,z) =aκ1(x,z);

(iii) κ(x,z) =κ₁(x,z)κ₂(x,z);

(iv) κ(x,z) =f(x)f(z);

(v) κ(x,z) =κ₃(φ(x), φ(z));

(vi) κ(x,z) =x^>Bz.

The proofs for (i) and (ii) can be shown by utilizing Mercer’s theorem, where it is only necessary to show that the induced kernel matrix is positive semidefi-nite. Indeed, given kernel matrices K₁ and K₂ obtained by restricting κ₁ and κ₂ to{x₁, . . . ,x`|xi∈Rⁿ}, and given any α∈R^`,

2.3. Summary 13

ドキュメント内 KERNEL MATRIX COMPLETION (ページ 31-35)