
4.3 Learning Algorithms

Figure 4.3 shows an example of a two-dimensional Gaussian probability density function. The graph contains two features, $x_1$ and $x_2$, represented by the x-axis and y-axis respectively, while the z-axis indicates $p(\mathbf{x})$. The density corresponds to the covariance matrix

$$
\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}. \qquad (4.8)
$$

The shape of this graph is controlled by the covariance matrix $\Sigma$. Figure 4.3 uses a $\Sigma$ with $\sigma_1^2 \ll \sigma_2^2$, so the graph is elongated along the $x_2$ direction.
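As a concrete illustration of how the covariance matrix in Eq. (4.8) shapes the density, the short sketch below evaluates a two-dimensional Gaussian with $\sigma_1^2 \ll \sigma_2^2$; the variance values and test points are hypothetical, chosen only to reproduce the elongation along $x_2$ described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical diagonal covariance with sigma_1^2 << sigma_2^2 (Eq. 4.8).
sigma1_sq, sigma2_sq = 0.1, 2.0
mu = np.zeros(2)
cov = np.diag([sigma1_sq, sigma2_sq])
density = multivariate_normal(mean=mu, cov=cov)

# p(x) decays quickly along x1 but slowly along x2,
# so the density surface is elongated in the x2 direction.
print(density.pdf([1.0, 0.0]))   # far down the steep x1 slope
print(density.pdf([0.0, 1.0]))   # still close to the peak along x2
```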

To classify a test instance $\mathbf{x}$ in our experiments, we specified a threshold $\varepsilon$ with a value between 0 and 1. If $p(\mathbf{x})$ is less than $\varepsilon$, we decide that the test instance $\mathbf{x}$ is an anomaly, because it yields a lower probability than the normal instances. Otherwise, we decide that the test instance $\mathbf{x}$ is normal, because its probability lies within the boundary of the normal instances. For our experiments, however, we generalized $\varepsilon$ to an adaptive threshold, specified as

$$
\varepsilon = \frac{1}{(2\pi)^{f/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2}\rho^2 \right), \qquad (4.9)
$$

where $\rho$ is a parameter that determines $\varepsilon$ as a proportion of the maximum probability.

Smaller values of $\rho$ produce a higher threshold $\varepsilon$ than larger values of $\rho$. In our experiments, we varied $\rho$ between 2 and 4 on a linear scale to determine the best detection performance. We can therefore define the classification function of test data $\mathbf{x}$ as

$$
g(\mathbf{x}) =
\begin{cases}
\text{anomaly}, & \text{if } p(\mathbf{x}) < \varepsilon; \\
\text{normal}, & \text{otherwise.}
\end{cases}
\qquad (4.10)
$$
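The following is a minimal sketch of the rule in Eqs. (4.9) and (4.10), assuming the mean vector and covariance matrix are estimated from normal training data; the function names and synthetic feature values are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def fit_mnd(X_train):
    """Estimate the mean vector and covariance matrix from normal training data."""
    mu = X_train.mean(axis=0)
    sigma = np.cov(X_train, rowvar=False)
    return mu, sigma

def mnd_classify(x, mu, sigma, rho):
    """Return 'anomaly' if p(x) < epsilon (Eq. 4.10), where epsilon is the
    adaptive threshold of Eq. (4.9)."""
    f = len(mu)
    sigma_inv = np.linalg.inv(sigma)
    norm_const = 1.0 / ((2 * np.pi) ** (f / 2) * np.linalg.det(sigma) ** 0.5)
    diff = x - mu
    p_x = norm_const * np.exp(-0.5 * diff @ sigma_inv @ diff)
    epsilon = norm_const * np.exp(-0.5 * rho ** 2)   # Eq. (4.9): smaller rho -> higher epsilon
    return "anomaly" if p_x < epsilon else "normal"

# Toy usage with synthetic two-dimensional "normal traffic" features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
mu, sigma = fit_mnd(X_train)
print(mnd_classify(np.array([0.1, -0.2]), mu, sigma, rho=3.0))  # likely normal
print(mnd_classify(np.array([6.0, 6.0]), mu, sigma, rho=3.0))   # likely anomaly
```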

The main reason for selecting the MND is that this learning algorithm is one of the linear classifiers. We need a simple learning algorithm to work with our learning model, because the simpler the algorithm, the shorter the computation time. Although data in real network environments may never follow the normal distribution exactly, the MND provides a robust approximation and has many useful mathematical properties. Furthermore, because of the central limit theorem, many multivariate statistics converge to the normal distribution as the sample size increases.

4.3.2 k-Nearest Neighbor (KNN)

The k-nearest neighbor (KNN) algorithm classifies a test instance according to the classes of its $k$ nearest learning samples in the $f$-dimensional space [92]. It is quite different from the MND, which is based on the probability density of a random variable.

To become more familiar with KNN, let us consider the example of two-dimensional instances shown in Figure 4.4. Suppose we plot training instances on a two-dimensional plane as shown in Figure 4.4 (left). Each training instance has been labeled with one of two classes, the square class or the triangle class.

The test sample ($\otimes$ mark) should be classified into either the square class or the triangle class. If we set $k = 3$ (solid circle as the boundary of the 3 nearest neighbors), the algorithm assigns the test sample to the triangle class because there are 2 triangles and only 1 square inside the inner circle. If we set $k = 5$ (dashed circle as the boundary of the 5 nearest neighbors), the algorithm assigns the test sample to the square class because the number of squares is greater than the number of triangles inside the outer circle.


Figure 4.4: Examples of KNN: (left) the original KNN, (right) our modified KNN.
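To make the majority-vote rule of the original KNN concrete, the following sketch reproduces the situation of Figure 4.4 (left); the coordinates and labels are invented purely for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(x_test, X_train, y_train, k):
    """Assign x_test to the majority class among its k nearest training
    instances, using the Euclidean distance."""
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nearest_labels = y_train[np.argsort(distances)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented two-dimensional instances: two triangles close to the test point,
# three squares a little farther away, mimicking Figure 4.4 (left).
X_train = np.array([[1.0, 1.0], [1.2, 0.9],               # triangles
                    [1.5, 1.5], [1.6, 1.4], [1.4, 1.6]])  # squares
y_train = np.array(["triangle", "triangle",
                    "square", "square", "square"])
x_test = np.array([1.1, 1.0])

print(knn_classify(x_test, X_train, y_train, k=3))  # 'triangle' (2 vs 1)
print(knn_classify(x_test, X_train, y_train, k=5))  # 'square'   (3 vs 2)
```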

In the original KNN, we need to define two parameters: the constant $k$ and the distance metric. In our experiments, however, we added another parameter, the distance $D$, in order to apply KNN to unlabeled training data. To understand our modified KNN, let us use Figure 4.4 (right).

Suppose we have only one class, the square class, and intend to classify the test sample ($\otimes$ mark) as belonging to the square class or not. If we set $k = 5$ and $D = d_1$ (solid circle), the algorithm does not assign the test sample to the square class because there are only 3 squares within the distance $d_1$. On the other hand, if we set $k = 5$ and a longer distance $D = d_2$ (dashed circle), the algorithm assigns the test sample to the square class because there are 5 squares within the distance $d_2$. Therefore, three parameters are used to classify the test data in our experiments, namely the constant $k$, the distance $D$, and the distance metric.

For the distance metric, the Euclidean distance [93] is commonly used in many classification problems. In our experiments, the nearest neighbors of the data are also defined by the standard Euclidean distance. More precisely, suppose a test data instance $\mathbf{x}$ comprising $f$ features is described by the feature vector $[x_1 \ldots x_f]^T$, where $x_i$ denotes the value of the $i$-th feature of $\mathbf{x}$ and $T$ denotes the transpose of a column vector, which is a simple way to write a column vector in a single row. The Euclidean distance between two instances $\mathbf{x}$ and $\mathbf{y}$ is the length of the line segment connecting them, given by

$$
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{f} (x_i - y_i)^2}. \qquad (4.11)
$$

Even though we employ the Euclidean distance as the distance metric in our experiments, there are many other possible distance metrics. The following distance metrics are well studied:

• Mahalanobis distance [94]

• Manhattan distance [95]

• Minkowski distance [96]

• Canberra distance [97]

These metrics provide different ways to measure the distance between two instances in a high-dimensional space. Different distance metrics also have advantages and disadvantages for specific tasks; therefore, we need to carefully select a distance metric for our purpose.
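The metrics listed above are available in common scientific libraries. As an illustration (with invented feature values), the snippet below compares them on a single pair of instances using scipy.spatial.distance:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([0.2, 0.5, 0.8])
y = np.array([0.1, 0.9, 0.3])

print(distance.euclidean(x, y))        # Euclidean distance, Eq. (4.11)
print(distance.cityblock(x, y))        # Manhattan distance
print(distance.minkowski(x, y, p=3))   # Minkowski distance with p = 3
print(distance.canberra(x, y))         # Canberra distance

# The Mahalanobis distance needs the inverse covariance of the data;
# here it is estimated from a small invented sample.
sample = np.random.default_rng(1).normal(size=(50, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```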

To classify a test instance in our experiments, we specified the constant parameter $k = 3$ and varied the distance parameter. Thanks to the feature scaling step, we can vary the parameter $D$ on a logarithmic scale between $10^{-6}$ and $10^{0}$ to select the best detection performance. We defined the classification function of test data $\mathbf{x}$ as

$$
g(\mathbf{x}) =
\begin{cases}
\text{anomaly}, & \text{if fewer than } k \text{ training instances lie within the distance } D \text{ of } \mathbf{x}; \\
\text{normal}, & \text{otherwise.}
\end{cases}
\qquad (4.12)
$$

We selected KNN to work with the multi-timeline model because this algorithm does not need an explicit learning phase. In other words, all the training data are needed only in the detection phase, so the time consumed in the learning phase is constant. Moreover, the modified KNN used in our experiments does not require all the training data in the detection phase: as soon as we find $k$ training instances within the distance $D$, we immediately classify the test instance as normal, without examining the remaining training instances. In this manner, the algorithm takes a short detection time for each new instance. For this reason, KNN is highly suitable for real-time anomaly detection in network traffic. In addition, the $k$ value of KNN allows the classifier to tolerate noisy data, and network traffic is a kind of data that contains a lot of noise.
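The following is a minimal sketch of the rule in Eq. (4.12), including the early exit described above: scanning the unlabeled training data stops as soon as $k$ instances are found within the distance $D$. The function and variable names, as well as the synthetic data, are illustrative assumptions.

```python
import numpy as np

def modified_knn_classify(x_test, X_train, k, D):
    """Eq. (4.12): return 'normal' if at least k training instances lie within
    distance D of x_test, otherwise 'anomaly'."""
    count = 0
    for x in X_train:
        if np.linalg.norm(x - x_test) <= D:
            count += 1
            if count >= k:
                # Early exit: the remaining training data need not be scanned.
                return "normal"
    return "anomaly"

# Toy usage: scaled features, k = 3, distance D varied on a logarithmic scale.
rng = np.random.default_rng(2)
X_train = rng.uniform(0.0, 1.0, size=(1000, 5))   # unlabeled training data (invented)
x_new = rng.uniform(0.0, 1.0, size=5)
for D in [1e-3, 1e-2, 1e-1, 1.0]:
    print(D, modified_knn_classify(x_new, X_train, k=3, D=D))
```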