These input data are a representation of network traffic, so they cannot process data by themselves. Therefore, we require a learning algorithm that uses the representation of input data to produce a classifier in the training process and to classify test data in the detecting process. We design the input data generically so that one can pick any machine learning algorithm to work with the multi-timeline representation. In our design, the key role of the learning algorithm is to connect the training data, the test data, the representation of input data, and the classifier. Additionally, we provide some learning algorithms for our experiments in Chapter 4.
Note that, closer to our approach, many studies have proposed techniques known as multivariate techniques. For example, Lakhina et al. [80] show that principal component analysis (PCA) can be used to separate the high-dimensional space formed by different network traffic measurements into disjoint subspaces, after which anomaly detection can be performed on these subspaces. Another study by Nychis et al. [81] proposed an entropy-based technique for anomaly detection in network traffic. The main idea of this approach is to analyze the power of multiple traffic distributions, including the number of addresses, ports, and flow sizes, and then use the entropy of these distributions in time series to detect anomalies in network traffic. The last example is the study by Kanda et al. [82].
They proposed a combination of sketches, PCA, and an entropy-based technique to detect anomalies in time series of network traffic. These multivariate techniques employ multiple values or multiple features of time series and compare test data to other time series. Multivariate techniques are still based on batch representation because they need the entire network traffic for a certain period. The multi-timeline representation, however, compares test data to historic network traffic rather than the current traffic, and does not need the entire data for detection.
We also apply the multi-timeline detection system to take the major role of representing the input data, which is crucial for generating an efficient classifier for anomaly detection. To clarify the functionality of the detector module, we explain each function in the following subsections.
[Figure 3.3 block labels: Network Traffic; Training Data; Test Data; Detector Module (Feature Extraction, Feature Scaling, Weighting Process, Classifier); Alarm]
Figure 3.3: Process connections and data flows of detector module.
3.2.1 Feature Extraction
The main function of feature extraction is to select important features of network traffic. Extracted features should be directly or indirectly associated with the expected anomalies. In our experiments, we extract features of network traffic by using information in packet headers and aggregating packets or flows on an interval basis. Table 3.2 shows fundamental features of network traffic obtained by aggregating packet header information. We can also create variations of these features by combining features; for example, the number of packets per flow over an interval can be derived from the ratio between the Packet and Flow features. Moreover, we can extract features for specific protocols or ports, such as the number of packets on HTTP port 80 or the number of ICMP flows.
Table 3.2: Fundamental features of network traffic by aggregating information of packet header.

Feature    Description
Packet     Number of packets
Byte       Sum of packet sizes
Flow       Number of flows
SrcAddr    Number of source addresses
DstAddr    Number of destination addresses
SrcPort    Number of source ports
DstPort    Number of destination ports
Alternatively, features can be represented with nominal values, such as the binary values 0 and 1, rather than by aggregating packet header information. For example, we mark an IP address or port number with 1 if it appears in an interval and with 0 if it does not, so that the classifier can treat IP addresses or port numbers that have not appeared before as anomalies. Therefore, the output of feature extraction can mix real-valued and nominal-valued features. Our experiments apply only the aggregation technique, so all features are represented as real values.
One of the feature extraction parameters is the interval value δ, which affects the detection performance, resource usage, and time consumption of the multi-timeline system. If we set a short interval value, the system can detect short-lived anomalies; however, we have to reserve more storage for per-interval information and use the processing unit more often. Conversely, if we set a long interval value, we reduce the storage needed for per-interval information; however, the system can hardly detect a short-lived anomaly. The interval value also affects time consumption, because processing time depends on the number of packets in each interval: with a short interval value, each interval contains fewer packets than with a long interval value. Moreover, the time between the occurrence of an anomaly and the alert depends on the interval value and the subsequent processing time as
\[ T(x) = \delta + d(x), \tag{3.6} \]
where T(x) is the total processing time when an anomaly occurs at interval x, δ is the interval value, and d(x) is the detecting time for interval x.
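The aggregation described in this subsection can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: the packet record layout (keys `ts`, `size`, `src`, `dst`, `sport`, `dport`, `proto`), the function names, and the definition of a flow as a 5-tuple are assumptions made for the example.

```python
from collections import defaultdict


def extract_features(packets, delta):
    """Aggregate packets into intervals of length delta (seconds).

    Each packet is a dict with the hypothetical keys
    ts, size, src, dst, sport, dport, proto.
    Returns {interval_index: {feature_name: value}} using the
    feature names of Table 3.2.
    """
    buckets = defaultdict(list)
    for p in packets:
        # Interval index of the packet: floor(ts / delta).
        buckets[int(p["ts"] // delta)].append(p)

    features = {}
    for idx, pkts in sorted(buckets.items()):
        # Assumption: a flow is identified by the usual 5-tuple.
        flows = {(p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
                 for p in pkts}
        features[idx] = {
            "Packet": len(pkts),
            "Byte": sum(p["size"] for p in pkts),
            "Flow": len(flows),
            "SrcAddr": len({p["src"] for p in pkts}),
            "DstAddr": len({p["dst"] for p in pkts}),
            "SrcPort": len({p["sport"] for p in pkts}),
            "DstPort": len({p["dport"] for p in pkts}),
        }
    return features


def packets_per_flow(interval_features):
    """Derived feature: ratio between the Packet and Flow features."""
    if interval_features["Flow"] == 0:
        return 0.0
    return interval_features["Packet"] / interval_features["Flow"]
```

A shorter `delta` produces more intervals, each holding fewer packets, which mirrors the storage and processing trade-off discussed above.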
3.2.2 Feature Scaling
Feature scaling is a process that standardizes a wide range of feature values to the same range. This process may be unnecessary for other anomaly detection techniques for network traffic; however, feature scaling, or feature normalization, is indispensable for the multi-timeline detection system, which relies mainly upon machine learning algorithms. Many learning algorithms do not function properly without feature scaling.
Even though many scaling functions have been proposed [83], their common goal is to normalize each feature component independently to the [0,1] or [-1,1] range. We recommend the following two scaling functions, which are appropriate for real-time anomaly detection.
We could normalize feature values by using the max value as
\[ x' = \frac{x}{\max(x)}, \tag{3.7} \]
or by using the range of feature as
\[ x' = \frac{x - \mu}{\max(x) - \min(x)}, \tag{3.8} \]
where x′ is the normalized value derived from the original value x, µ is the average value of x, and max(x) and min(x) are the maximum and minimum values of x, respectively. Both scaling functions have a time complexity of O(q), where q is the number of feature values, so they require very little computation time in real-time systems.
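The two scaling functions in Equations (3.7) and (3.8) can be sketched as below. The function names and the handling of degenerate inputs (an all-zero or constant feature) are assumptions added for the example; each function makes a constant number of linear passes over the q feature values.

```python
def scale_by_max(values):
    """Equation (3.7): x' = x / max(x).

    For non-negative features such as counts, the result lies in [0, 1].
    """
    m = max(values)
    if m == 0:
        # Assumption: an all-zero feature is mapped to zeros.
        return [0.0 for _ in values]
    return [v / m for v in values]


def scale_by_range(values):
    """Equation (3.8): x' = (x - mu) / (max(x) - min(x)).

    The result lies in [-1, 1].
    """
    lo, hi = min(values), max(values)
    mu = sum(values) / len(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant feature is mapped to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / span for v in values]
```

For instance, `scale_by_max([2, 4, 8])` gives `[0.25, 0.5, 1.0]`, and `scale_by_range([0, 5, 10])` gives `[-0.5, 0.0, 0.5]`.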
3.2.3 Weighting Process
Weighting is a mandatory process for the multi-timeline detection module, especially during the learning process. This process guides the learning algorithm to generate a decision function that relies on particular timelines.
Consequently, network operators can weight one or more timelines to bias the decision function of the detection system. For example, recent timelines should have a stronger influence on the classifier than older timelines.
There are various weighting techniques for different purposes that could be plugged into this module, such as those based on a sum, an integral, an average, or other calculus operations [84]. In our experiments, we adopted a weighting function that gradually favors recent timelines; we describe our weighting function in more detail in Chapter 4, System Implementation.
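To illustrate how a weighting function can favor recent timelines, the sketch below uses a geometric decay. This is only a hypothetical stand-in and is not the weighting function used in our experiments; the function names, the `decay` parameter, and the ordering convention (index 0 is the most recent timeline) are assumptions made for the example.

```python
def timeline_weights(n_timelines, decay=0.5):
    """Return normalized weights for n_timelines timelines.

    Index 0 is the most recent timeline. Each older timeline receives
    `decay` times the weight of the next more recent one (hypothetical
    geometric scheme), and the weights sum to 1.
    """
    raw = [decay ** k for k in range(n_timelines)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_history(values, weights):
    """Weighted combination of one feature observed on several timelines."""
    return sum(v * w for v, w in zip(values, weights))
```

With `decay` below 1, a value observed on the most recent timeline contributes more to the combined history than the same value observed on an older timeline, which is the bias described above.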
3.2.4 Classifier
The major role of the classifier is to distinguish between normal and anomalous network traffic. The classifier in the detector module is created by a learning algorithm from the representation of input data and the training data during the learning process. As a result, the effectiveness of the classifier depends highly upon three factors: the learning algorithm, the representation of input data, and the training set.
In the detecting process, the classifier receives test data as input and produces a label for the test data as output to the alarm system. One purpose of multi-timeline learning is to enhance the capabilities of the detector module. In addition, the detector module could be applied in various schemes to detect different types of anomalies in network systems.
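The separation between the learning process and the detecting process can be sketched with a deliberately simple stand-in classifier. The class below is hypothetical and is not one of the learning algorithms used in our experiments: it assumes that the training vectors contain only normal traffic, learns a per-feature mean and standard deviation, and labels a test vector as an anomaly when any feature deviates from the mean by more than `k` standard deviations.

```python
from statistics import mean, pstdev


class ThresholdClassifier:
    """Hypothetical stand-in for a learned classifier."""

    def __init__(self, k=3.0, eps=1e-9):
        self.k = k        # allowed deviation in standard deviations
        self.eps = eps    # floor that avoids a zero standard deviation
        self.means = []
        self.stds = []

    def fit(self, training_vectors):
        """Learning process: build the decision rule from training data."""
        columns = list(zip(*training_vectors))
        self.means = [mean(c) for c in columns]
        self.stds = [max(pstdev(c), self.eps) for c in columns]
        return self

    def predict(self, test_vector):
        """Detecting process: return a label for one test vector."""
        for x, mu, sd in zip(test_vector, self.means, self.stds):
            if abs(x - mu) > self.k * sd:
                return "anomaly"
        return "normal"
```

A typical use would be `clf = ThresholdClassifier().fit(training_vectors)` followed by `clf.predict(test_vector)`, with the returned label forwarded to the alarm system. Any learning algorithm that exposes the same two steps could take its place.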