Classiﬁcation - A dissertation submitted in partial fulﬁllment of the requirements for the degr

We evaluated the generalization of three-input IDS models in terms of classification. The two-spiral problem [20] is a well-known classification benchmark for supervised learning algorithms. In particular, it has often been used with FNNs to demonstrate the performance of new architectures and algorithms. The test involves classifying data points that belong to two spirals. Typically, each spiral comprises 97 data points, so that 194 data points are plotted on the plane. Points on each spiral are located according to the following condition:

r = p(θ + 2πn) + r_o, (3.7)

where r and θ are the radius and angle, respectively; p and r_o are the parameters that determine the size of the spiral; and n is an integer that represents the number of revolutions. The standard values are p = 1/π and r_o = 0.5.
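As an illustration of (3.7), the following minimal sketch generates 97 points per spiral with the standard values p = 1/π and r_o = 0.5. The angular sampling step of π/16, the construction of the second spiral by point reflection through the origin, and the names `two_spirals` and `spiral_data` are assumptions made for this example; they are not stated in the text.

```python
import math

def two_spirals(points_per_spiral=97, p=1.0 / math.pi, r_o=0.5,
                step=math.pi / 16.0):
    """Return (x, y, label) tuples for a two-spiral data set.

    The radius follows Eq. (3.7), r = p * phi + r_o, where
    phi = theta + 2*pi*n is the accumulated angle.
    Assumptions (not from the text): the sampling step and the use of
    point reflection to obtain the second spiral.
    """
    data = []
    for i in range(points_per_spiral):
        phi = i * step                    # accumulated angle theta + 2*pi*n
        r = p * phi + r_o                 # Eq. (3.7)
        x, y = r * math.cos(phi), r * math.sin(phi)
        data.append((x, y, 1))            # point on the first spiral
        data.append((-x, -y, 0))          # mirrored point on the second spiral
    return data

spiral_data = two_spirals()
```

Under these assumptions the radius grows from 0.5 to 6.5 over three revolutions, which is consistent with the input domain [−6.5, 6.5] used later in this section.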

Generally, single-hidden-layer networks trained with standard BPL cannot reliably achieve perfect classification in the two-spiral problem. Simple gradient descent methods require numerous training iterations and therefore a considerable amount of computational time. Only FNNs with refined architectures or algorithms can achieve 100% classification with fast convergence. Using the cascade-correlation learning proposed by Fahlman and Lebiere [21], it was possible to solve the two-spiral problem with fewer training iterations and a smaller network size. Hwang et al. [22] pointed out a drawback of cascade-correlation learning, namely that it is

22 CHAPTER 3. MODELING ABILITY OF THE IDS METHOD


Figure 3.6: Spiral data separated into 13 partitions for each input.

difficult to achieve a high accuracy in regression modeling tasks. Their projection pursuit learning network exhibited good performance in both regression modeling and classification tasks. Jia and Chua [23] described the effect of input data encoding, testing binary, weighted binary, Gray, and temperature encoding schemes. These encoding schemes can rapidly increase the number of input units; it is considered that the two-spiral problem can be solved by using many input units. Álvarez [24] proposed a knowledge-based neural network in which the radius and angle were used as inputs. This takes advantage of the geometric nature of the input patterns; the use of polar coordinates is very effective in the classification of two spirals. Singh [25] used spiral data expanded into a three-dimensional coordinate space. In this case, the position of the spirals was determined by the x, y, and z coordinates; the z coordinate was equal to the mean of the x and y coordinates. Singh used eight features, including the x, y, and z coordinates, as the input vector for the FNN.

As described above, many researchers have attempted to find an efficient way of solving the two-spiral problem with FNNs by employing input encoding schemes or by preprocessing the input data. It is evident that effective input data encoding improves the performance of FNNs. In other words, a supervised learning algorithm that can easily solve complex classification problems without adopting such techniques can be regarded as an excellent algorithm. Therefore, we applied the IDS method in its standard usage to the two-spiral benchmark, in the same manner as in the previous benchmarks, without geometrically preprocessing the spiral data.

In the IDS modeling, each input domain was divided at regular intervals. The resolution of the x-y plane was 256×256 and the ink drop pattern size was set to 9. In intricate classiﬁcation tasks, it is preferable to reduce the size of the ink drop pattern in terms of the classiﬁcation rate and learning speed because the interpolation between the data points is not necessarily required. The IDS method cannot solve the two-spiral problem if the number of partitions of the input domain is insufﬁcient. First, we examined the minimum

3.5. CLASSIFICATION 23


Figure 3.7: Output of a 13-partition IDS model constructed using a two-spiral data set. (100% classiﬁcation achieved)

number of partitions that could be used to achieve perfect classification of the spiral data and found that the 13-partition IDS model achieved 100% classification. Figure 3.6 shows the spiral data points separated into 13 partitions. In this classification task, a data point was classified as a white spiral when the model output at that point was greater than 0.6 and as a black spiral when the output was less than 0.4. The remaining range, 0.4 to 0.6, was defined as an invalid output. This rule has been frequently applied to the binary output of FNNs. Figure 3.7 shows the output of the 13-partition IDS model for 64×64 input points set at regular intervals over the data plane. The cascade-correlation network is one of the FNNs that can stably achieve perfect classification in the two-spiral benchmark. We obtained the source code of the cascade-correlation learning algorithm from the CMU AI repository [26] and applied the algorithm to the two-spiral benchmark. Figure 3.8 shows the output of the cascade-correlation network. It can be observed that the IDS model represents the two spirals more clearly.
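The decision rule described above (white above 0.6, black below 0.4, invalid in between) can be sketched as follows. The function names and the string labels are hypothetical, and counting an invalid output as a misclassification in `classification_rate` is an assumption rather than something the text specifies.

```python
def classify_output(output, low=0.4, high=0.6):
    """Map a model output to a class using the 0.4/0.6 thresholds."""
    if output > high:
        return "white"
    if output < low:
        return "black"
    return "invalid"   # outputs in [0.4, 0.6] are not assigned to a spiral

def classification_rate(outputs, targets):
    """Percentage of points whose decision matches the target label.

    Assumption: an "invalid" decision never matches a target and is
    therefore counted as an error.
    """
    correct = sum(1 for o, t in zip(outputs, targets)
                  if classify_output(o) == t)
    return 100.0 * correct / len(targets)
```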

Singh [25] used spiral data that was expanded into a three-dimensional coordinate space as a new two-spiral benchmark. Figure 3.9 represents the two-spiral data points in three-dimensional coordinates. We applied the IDS method to this three-dimensional two-spiral benchmark. First, by using the constructive algorithm, we obtained a minimum structure that achieved 100% classification of the spiral data. Thus, we used the stop condition (3.3) and set e_min(n) = 0. M_min and M_max were set to 2 and 15, respectively. The minimum structure obtained from the search was m_1 = 5, m_2 = 15, and m_3 = 3.

Next, we examined the generalization of the IDS model obtained by the constructive algorithm. For the evaluation of the generalization in the three-dimensional two-spiral problem, we used the same approach as that used by Singh [25]. Eight test sets were generated by introducing an offset in the training data: 1) (x_d+δ, y_d+δ,z_d+δ), 2) (x_d+δ,y_d+δ,z_d−δ), 3) (x_d+δ,y_d−δ,z_d+δ), 4) (x_d+δ,y_d−δ,z_d−δ), 5) (x_d−δ,


Figure 3.8: Output of a cascade-correlation network. (100% classiﬁcation achieved)

Figure 3.9: The two spirals in three-dimensional coordinates. (Axes: x, y, z.)


Table 3.3: Classification Rate (%) for Each Test Set

Test set   δ = 0.1   δ = 0.2   δ = 0.3
1          99.0      90.6      80.2
2          98.5      91.7      83.3
3          97.4      92.7      80.2
4          97.9      94.8      85.9
5          96.9      92.2      84.9
6          97.4      89.6      78.1
7          99.0      94.8      87.0
8          100       96.9      88.0
ave.       98.3      92.9      83.5

y_d+δ, z_d+δ), 6) (x_d−δ, y_d+δ, z_d−δ), 7) (x_d−δ, y_d−δ, z_d+δ), and 8) (x_d−δ, y_d−δ, z_d−δ), where (x_d, y_d, z_d), with x_d ∈ X₁, y_d ∈ X₂, and z_d ∈ X₃, is a spiral data point and δ is the offset to be added. In this experiment, the input domain was defined as [−6.5, 6.5]³. When a small offset is introduced into the original spiral data points, one or two test points in each test set fall outside the input domain. We excluded such points from the calculation of the classification rate. Table 3.3 lists the classification rates for each test set. By comparison, the average rates for the test sets with offsets of 0.1, 0.2, and 0.3 in Singh's FNN [25] were 96.5%, 91.0%, and 80.5%, respectively. For this benchmark, the generalization performance of the IDS model is sufficiently high in comparison with that of Singh's FNN.
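The construction of the eight offset test sets, including the exclusion of points that leave the input domain, can be sketched as follows. The function name `offset_test_sets` and the representation of a data point as an `(x, y, z, label)` tuple are assumptions made for this example.

```python
from itertools import product

def offset_test_sets(points, delta, lo=-6.5, hi=6.5):
    """Build the eight test sets obtained by shifting each coordinate by +/-delta.

    points: list of (x, y, z, label) tuples (hypothetical representation).
    The sign order produced by product() is (+,+,+), (+,+,-), (+,-,+),
    (+,-,-), (-,+,+), (-,+,-), (-,-,+), (-,-,-), matching test sets 1 to 8.
    Shifted points outside [lo, hi]^3 are dropped.
    """
    test_sets = []
    for sx, sy, sz in product((+1, -1), repeat=3):
        shifted = [(x + sx * delta, y + sy * delta, z + sz * delta, label)
                   for x, y, z, label in points]
        inside = [pt for pt in shifted
                  if all(lo <= c <= hi for c in pt[:3])]
        test_sets.append(inside)
    return test_sets
```

Because the number of retained points can differ between test sets, a classification rate computed from these sets should be normalized by the size of each filtered set rather than by the size of the training set.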


Chapter 4

Performance Evaluation

4.1 Introduction

This chapter deals with the performance of the IDS method as a soft computing tool. In the experiments, we use Hwang's benchmark [17], described in Section 4.1.1. We compare the noise tolerance, fault tolerance, and real-time capabilities of IDS models with those of MLPs, radial basis function networks (RBFNs), and adaptive neuro-fuzzy inference systems (ANFISs). The MLP is the most popular model of neural networks. The RBFN is a variant of the ANN and is known to have superior fault tolerance and fast convergence [27]. The ANFIS is characterized by its hybrid learning: the parameters of the premise part and those of the consequent part in fuzzy inference are adjusted by the gradient descent method and the least squares method, respectively [28]. The MLP, RBFN, and ANFIS used in the experiments are described in Sections 4.1.2, 4.1.3, and 4.1.4, respectively.

4.1.1 Hwang’s Benchmark

Hwang's five-function set is used for function approximation in several studies on ANNs [19], [29]–[33]. Some of the studies that deal with this benchmark evaluate the generalization performance using both noiseless and noisy training data, and their benchmark results are based on the same conditions regarding the number of training examples, the distribution of the test data, and the SNR. Hwang's five-function set comprises the following nonlinear functions g_i : [0,1]² → ℝ.

• Simple Interaction Function:

$$g_1(x_1, x_2) = 10.391\,\bigl((x_1 - 0.4)(x_2 - 0.6) + 0.36\bigr).$$

• Radial Function:

$$g_2(x_1, x_2) = 24.234\,\bigl(r^2 (0.75 - r^2)\bigr), \qquad r^2 = (x_1 - 0.5)^2 + (x_2 - 0.5)^2.$$

• Harmonic Function:

$$g_3(x_1, x_2) = 42.659\,\bigl(0.1 + \tilde{x}_1 (0.05 + \tilde{x}_1^4 - 10\,\tilde{x}_1^2 \tilde{x}_2^2 + 5\,\tilde{x}_2^4)\bigr),$$

where $\tilde{x}_1 = x_1 - 0.5$ and $\tilde{x}_2 = x_2 - 0.5$.

• Additive Function:

$$g_4(x_1, x_2) = 1.3356\,\bigl(1.5(1 - x_1) + e^{2x_1 - 1} \sin(3\pi (x_1 - 0.6)^2) + e^{3(x_2 - 0.5)} \sin(4\pi (x_2 - 0.9)^2)\bigr).$$

• Complicated Interaction Function:

$$g_5(x_1, x_2) = 1.9\,\bigl(1.35 + e^{x_1} \sin(13 (x_1 - 0.6)^2)\, e^{-x_2} \sin(7 x_2)\bigr).$$
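For reference, the five functions listed above can be written directly in code. This is a minimal sketch: the Python names (`g1` to `g5`, `HWANG_FUNCTIONS`) are chosen for this example, and the exponents in g4 follow the reading e^(2x1−1) and e^(3(x2−0.5)) of the printed formula.

```python
import math

def g1(x1, x2):
    """Simple interaction function."""
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

def g2(x1, x2):
    """Radial function."""
    r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
    return 24.234 * (r2 * (0.75 - r2))

def g3(x1, x2):
    """Harmonic function."""
    t1, t2 = x1 - 0.5, x2 - 0.5
    return 42.659 * (0.1 + t1 * (0.05 + t1 ** 4
                                 - 10 * t1 ** 2 * t2 ** 2 + 5 * t2 ** 4))

def g4(x1, x2):
    """Additive function."""
    return 1.3356 * (1.5 * (1 - x1)
                     + math.exp(2 * x1 - 1) * math.sin(3 * math.pi * (x1 - 0.6) ** 2)
                     + math.exp(3 * (x2 - 0.5)) * math.sin(4 * math.pi * (x2 - 0.9) ** 2))

def g5(x1, x2):
    """Complicated interaction function."""
    return 1.9 * (1.35 + math.exp(x1) * math.sin(13 * (x1 - 0.6) ** 2)
                  * math.exp(-x2) * math.sin(7 * x2))

HWANG_FUNCTIONS = (g1, g2, g3, g4, g5)
```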

Figure 4.1 graphically represents these functions. For the test conditions of this benchmark, 225 randomly generated examples are used as the training set. Let $(x_l, y_l)$ be the $l$th training example. The noisy training data are generated as follows:

$$y_l = g_i(x_l) + 0.25\,\varepsilon_l, \qquad i = 1, 2, \dots, 5, \tag{4.1}$$

where $\varepsilon_l \sim N(0, 1)$ is zero-mean unit-variance Gaussian noise. The test set comprises 10000 data points uniformly distributed over the input domain, as shown in (2.13). In this benchmark, the FVU (2.12) is employed as the error measure. The model accuracy is given by the FVU calculated from the 10000 data points.
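The data generation in (4.1) can be sketched as follows. Several details are assumptions made for this example: the training inputs are drawn uniformly from [0,1]², Python's `random` module is used in place of drand48, and `fvu` uses the common definition of the fraction of variance unexplained (sum of squared errors divided by the sum of squared deviations of the targets from their mean), since (2.12) is not reproduced in this section. The helper `simple_interaction` repeats g1 only to keep the block self-contained.

```python
import random

def simple_interaction(x1, x2):
    """g1 from Hwang's function set, repeated here for self-containment."""
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

def make_training_set(g, n=225, noise_scale=0.25, seed=0):
    """Generate n examples ((x1, x2), y) with y = g(x) + noise_scale * eps.

    eps is drawn from N(0, 1) as in Eq. (4.1). Uniform sampling of the
    inputs and the seed handling are assumptions of this sketch.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = g(x1, x2) + noise_scale * rng.gauss(0.0, 1.0)
        examples.append(((x1, x2), y))
    return examples

def fvu(predictions, targets):
    """Fraction of variance unexplained (assumed form of the error measure)."""
    mean_t = sum(targets) / len(targets)
    sse = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    sst = sum((t - mean_t) ** 2 for t in targets)
    return sse / sst

training_set = make_training_set(simple_interaction)
```

With this definition, a model that always predicts the mean of the targets has an FVU of 1, and a perfect model has an FVU of 0.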

In order to generate training sets based on random numbers, we used the drand48 function implemented as a pseudo-random number generator in the FreeBSD 6.1 operating system. Figure 4.2 shows the graphical representation of the noisy data whose points were uniformly plotted.

4.1.2 Multilayer Perceptron

Single-output MLPs are described as follows:

$$\mathrm{MLP}(x) = \sum_{i=1}^{H} u_i\, f\!\left(\sum_{j=1}^{I} v_{ij} x_j - v_{i0}\right) - u_0 \tag{4.2}$$

where $H$ and $I$ are the numbers of hidden units and input units, respectively; $u_0$ is the bias of the output unit; $u_i$ is the interconnection weight between the output unit and the $i$th hidden unit; $v_{i0}$ is the bias of the $i$th hidden unit; $v_{ij}$ is the interconnection weight between the $i$th hidden unit and the $j$th input unit; and $f$ is a sigmoidal activation function. The weights and biases are initialized with small random values and trained using standard backpropagation learning [16].
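The forward computation of (4.2) can be sketched as follows. The logistic function is used for $f$ as an assumption, since the text only states that $f$ is sigmoidal; the function and argument names are likewise chosen for this example. Note that the biases enter with a negative sign, as in (4.2).

```python
import math

def sigmoid(a):
    """Logistic sigmoid (assumed form of the activation f)."""
    return 1.0 / (1.0 + math.exp(-a))

def mlp_output(x, u, u0, v, v0, f=sigmoid):
    """Single-output MLP of Eq. (4.2).

    x  : input vector of length I
    u  : output weights u_i, length H
    u0 : output bias (subtracted)
    v  : hidden weights v_ij, H rows of length I
    v0 : hidden biases v_i0 (subtracted), length H
    """
    total = 0.0
    for u_i, v_i, v_i0 in zip(u, v, v0):
        net = sum(v_ij * x_j for v_ij, x_j in zip(v_i, x)) - v_i0
        total += u_i * f(net)
    return total - u0
```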

4.1.3 Radial Basis Function Network

Single-output RBFNs with Gaussian activation functions are described as follows:

$$\mathrm{RBFN}(x) = w_0 + \sum_{i=1}^{H} w_i \exp\!\left(-\frac{\lVert x - c_i \rVert^2}{r^2}\right) \tag{4.3}$$

where $c_i$ is the center of the RBF of the $i$th hidden unit; $r$ is the radius of the RBF; and $w_i$ is the interconnection weight between the output unit and the $i$th hidden unit. $\lVert \cdot \rVert$ denotes the Euclidean norm.

For the setup conditions and training procedure of the RBFN, we followed the procedure given in [27] as shown below. The centers of the RBFs are uniformly distributed over the input domain, and their radii are ﬁxed at the same value. These parameters are not updated during training. The weights are initialized to zero and trained using the gradient descent method.
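The setup described above (fixed centers spread uniformly over the input domain, a common fixed radius, zero-initialized weights trained by gradient descent) can be sketched as follows. The regular grid layout in `grid_centers`, the squared-error loss, the learning rate, and the inclusion of $w_0$ in the update are assumptions made for this example, not details taken from [27].

```python
import math

def grid_centers(per_axis):
    """Centers on a regular per_axis x per_axis grid over [0, 1]^2 (assumed layout)."""
    ticks = [i / (per_axis - 1) for i in range(per_axis)]
    return [(a, b) for a in ticks for b in ticks]

def basis(x, c, radius):
    """Gaussian basis value exp(-||x - c||^2 / r^2) from Eq. (4.3)."""
    dist2 = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return math.exp(-dist2 / radius ** 2)

def rbfn_output(x, centers, weights, w0, radius):
    """Single-output RBFN of Eq. (4.3)."""
    return w0 + sum(w * basis(x, c, radius) for c, w in zip(centers, weights))

def gradient_step(x, target, centers, weights, w0, radius, lr=0.1):
    """One gradient-descent step on 0.5 * error^2.

    Only the weights and the bias change; centers and radius stay fixed.
    """
    phis = [basis(x, c, radius) for c in centers]
    error = rbfn_output(x, centers, weights, w0, radius) - target
    new_weights = [w - lr * error * p for w, p in zip(weights, phis)]
    return new_weights, w0 - lr * error
```

Because the centers and the radius are fixed, the output is linear in the trainable parameters, so each update only needs the basis values for the current input.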

Figure 4.1: Graphs of five functions. (Panels: g1: Simple interaction; g2: Radial; g3: Harmonic; g4: Additive; g5: Complicated interaction. Horizontal axes: x1 and x2.)

Figure 4.2: Examples of noisy data: (a) g1, (b) g4. (30×30 data points distributed at regular intervals over the input domain)

Figure 4.3: Structure of an ANFIS. (Inputs x1 and x2; fuzzy sets A11, A12, A21, and A22; Layers 1 to 5; premise parameters and consequent parameters.)

4.1.4 Adaptive Neuro-Fuzzy Inference System

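ANFISs use fuzzy rules of the first-order Sugeno type [34]. The $k$th rule is as follows:

$$\text{If } x_1 \text{ is } A_1^k \text{ and } \cdots \text{ and } x_N \text{ is } A_N^k, \text{ then } y = p_0^k + p_1^k x_1 + \cdots + p_N^k x_N. \tag{4.4}$$

$A_{ij}$ denotes the $j$th fuzzy set for the $i$th input variable. The membership function of $A_{ij}$ is given by the following formula:

$$\mu_{A_{ij}}(x) = \frac{1}{1 + \left(\left(\dfrac{x - c_{ij}}{a_{ij}}\right)^{2}\right)^{b_{ij}}}. \tag{4.5}$$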

The parameters $a_{ij}$, $b_{ij}$, and $c_{ij}$ are referred to as the premise parameters, and the parameters $p_0^k, p_1^k, \dots, p_N^k$ as the consequent parameters. The learning of ANFISs adjusts these parameters. The ANFIS has a neural-like network structure and employs a hybrid learning algorithm. Figure 4.3 shows the structure of a two-input ANFIS. In the forward pass, the premise parameters are fixed and the consequent parameters are updated by the least squares method after a batch of training examples is processed. In the backward pass, the consequent parameters are fixed and the premise parameters are updated by the gradient descent method after a batch of training examples is processed. For the learning of ANFISs, the batch mode and the incremental mode are used. The abovementioned procedure corresponds to the batch mode. In the incremental mode, the switch between the update of the premise parameters and that of the consequent parameters is performed for every training example.
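The membership function (4.5) and the evaluation of first-order Sugeno rules of the form (4.4) can be sketched as follows. The text does not spell out how rule outputs are combined, so the product of membership values as the firing strength and the firing-strength-weighted average of the rule outputs are assumptions based on the usual ANFIS formulation; the data layout of `rules` is also hypothetical. The hybrid learning procedure itself is not shown.

```python
def bell(x, a, b, c):
    """Membership function of Eq. (4.5) with premise parameters a, b, c."""
    return 1.0 / (1.0 + (((x - c) / a) ** 2) ** b)

def sugeno_output(x, rules):
    """Evaluate a set of first-order Sugeno rules for the input vector x.

    rules: list of (premise, consequent) pairs, where
      premise    = [(a, b, c), ...], one parameter triple per input variable
      consequent = [p0, p1, ..., pN]
    Assumptions: product firing strength and weighted-average output.
    """
    strengths, outputs = [], []
    for premise, p in rules:
        w = 1.0
        for x_i, (a, b, c) in zip(x, premise):
            w *= bell(x_i, a, b, c)
        strengths.append(w)
        # Rule output y = p0 + p1*x1 + ... + pN*xN, as in Eq. (4.4)
        outputs.append(p[0] + sum(p_i * x_i for p_i, x_i in zip(p[1:], x)))
    return sum(w * y for w, y in zip(strengths, outputs)) / sum(strengths)
```

The membership value equals 1 at x = c and 0.5 at x = c ± a, so a controls the width of the fuzzy set, c its center, and b the steepness of its shoulders.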
