Statistical Study - Statistical-based Detection Methodology

Chapter 4 Statistical-based Detection Methodology

4.1. Statistical Study

4.1.1. Data source definition

In general, data come from three main sources: (1) network traffic at the network system level, (2) user profiling data at the individual user, site, or group level, and (3) network configuration information. Data-related concerns mainly involve collection, sampling, sample size analysis, and reliability and validity testing. Collecting good data is a prerequisite for developing a robust model. Any sample has to be drawn randomly from the population. It is important to detect an attack or anomalous event at the earliest possible stage, which requires processing incoming data quickly and comparing them with historical data to locate any abnormalities.

Because network traffic data are usually enormous in size (for example, 74 hours of logged traffic from a large enclave network could include 344 million observations in 24 GB), analyses conducted directly on such raw data are often inefficient and may not be necessary.
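As a minimal sketch of how a random sample can be drawn from a traffic log that is too large to hold in memory, the following uses reservoir sampling. This technique is not named in the text; it is offered here only as one standard way to draw a uniform random sample in a single pass. The function name `reservoir_sample` and the use of integers as stand-ins for log records are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items drawn uniformly at random from a stream of unknown length.

    Only k items are kept in memory at any time, so the stream can be
    far larger than available RAM.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: integers stand in for logged observations.
sample = reservoir_sample(range(1_000_000), k=1000, seed=42)
```

If the stream holds fewer than `k` items, the function simply returns all of them.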

Unlike in other research areas, statistical analysis in network security faces unique challenges because rapid changes in computer and network hardware and software make both data collection and data analysis difficult. There is no gold-standard approach for data collection, and there is no gold-standard benchmark data set for testing various algorithms. Each network system has its own characteristics and requires a particular algorithm for collecting its data. Despite this, a few principles may serve as guidelines for collecting data in practice.

First, good data have to meet several basic criteria, including (1) adequate sample size for fitting statistical models, (2) stability for profiling user behavior, (3) reliability for lowering false alarm rates, (4) detectability, predictability, and discriminative power, and (5) associability.

Second, efforts should be made to ensure that the data on which the profiling is based are accurate and reliable across the different users or groups being evaluated and, when appropriate, across time. This includes standardizing the definitions of the predictor variables and, where possible, disclosing the quality of the data with regard to accuracy and reliability. Such information should be clearly described and substantiated.

When the quality of data is variable, an effort should be made to determine its impact on the profiling results.

Third, data should also be timely. Old data may not represent the current patterns of the network being profiled. In many cases, practical considerations limit the ability to acquire and analyze the data soon after an actual event occurs. Also, in some cases, data from earlier time periods may be pooled with more recent data to improve the precision of the estimates, which may compromise the ability to determine whether behavior patterns are changing over time [59].

4.1.2. Statistical modeling

A robust classification system must have good sensitivity and specificity, a low false alarm rate, a high positive alarm rate, and the ability to detect new incidents, all of which are fundamental goals in network security. Reaching such goals is a great challenge. Although the potential benefits of applying statistical techniques in network security are considerable, many intrusion detection and prevention systems continue to suffer from high false alarm rates and have difficulty identifying new attacks. Nevertheless, some basic principles and guidelines for classification and modeling are given below.

First, statistical models, particularly those intended for classification and profiling purposes, should account for particular features of the organization of the data.

Second, with regard to model performance, classification models should be evaluated by measures of discrimination, calibration, and goodness-of-fit.

The decision about what constitutes a “good” or “good enough” model will be based more on subjective considerations than on predefined criteria, but model performance will depend on the degree to which traffic characteristics contribute to the outcome and on the availability of variables that reflect factors associated with the outcome. Also, these models should be developed and validated in different samples to assess robustness, and such evaluations should be conducted repeatedly. If validation has not been performed, then that should also be reported.
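Two of the evaluation measures named above can be illustrated with a short sketch. It assumes, as one common choice not specified in the text, that discrimination is summarized by the area under the ROC curve (AUC) and calibration by the Brier score. The function names and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Discrimination: probability that a randomly chosen anomalous record (1)
    scores higher than a randomly chosen normal record (0); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Calibration: mean squared gap between predicted probability and outcome
    (0 is perfect; lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical toy data: 1 = anomalous, 0 = anomaly-free.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_prob))          # 0.75
print(brier_score(y_true, y_prob))  # approximately 0.158
```

The pairwise AUC loop is quadratic in the number of records, which is acceptable for a validation sample but would need a rank-based formulation for very large data sets.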

Third, not all models are good, and there is no single “gold-standard” model against which model performance can be compared. A stream of network traffic data with many positive predictor variables might not represent a true attack, and a normal-looking stream could represent a novel attack because of uncertain factors and users changing their behavior. In general, increasing the sensitivity reduces the false negative alarm rate, and increasing the specificity reduces the false positive alarm rate. The objective of a good statistical model is to achieve high sensitivity, high specificity, and a high correctly classified rate. To achieve this goal, the process of selecting predictive variables must consider the stability of the variables’ statistical significance.
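The relationships among these rates follow directly from the confusion matrix: sensitivity equals one minus the false negative rate, and specificity equals one minus the false positive rate. The sketch below computes them from the four cell counts. The names `Rates` and `classification_rates` and the example counts are illustrative assumptions, not values from the study.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    sensitivity: float          # TP / (TP + FN)
    specificity: float          # TN / (TN + FP)
    false_positive_rate: float  # FP / (FP + TN) = 1 - specificity
    false_negative_rate: float  # FN / (FN + TP) = 1 - sensitivity
    accuracy: float             # correctly classified rate


def classification_rates(tp: int, fp: int, tn: int, fn: int) -> Rates:
    """Compute standard rates from confusion-matrix counts."""
    positives = tp + fn  # truly anomalous records
    negatives = tn + fp  # truly anomaly-free records
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    return Rates(
        sensitivity=tp / positives,
        specificity=tn / negatives,
        false_positive_rate=fp / negatives,
        false_negative_rate=fn / positives,
        accuracy=(tp + tn) / (positives + negatives),
    )


# Hypothetical counts: 100 attacks and 1000 normal streams.
r = classification_rates(tp=80, fp=100, tn=900, fn=20)
print(r.sensitivity, r.specificity)  # 0.8 0.9
```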

Fourth, we may be able to develop a robust attack-specific model to detect a particular type of attack or abnormal event, but we cannot develop a single model that covers all types of attacks. Such a model may not exist.

Finally, model parameters should be updated frequently to take into account new attacks and changes in user behavior over time [60].

4.1.3. System Infrastructure

Both intrusion detection and prevention systems need to be constructed with a hybrid approach.

Use of a hybrid modeling approach means that a final classification decision on real-time traffic, such as anomaly-free or anomalous, should be made based on a vote among multiple classification algorithms. A hybrid model that combines more than one classification model or algorithm has shown potential for reducing false positive and false negative alarm rates, because the combined methods can maximize each other's strengths and minimize each other's weaknesses.
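A minimal sketch of such a voting scheme is shown below. The three threshold rules, their feature names (`queries_per_min`, `nxdomain_ratio`, `distinct_domains`), and the cut-off values are hypothetical placeholders for real classifiers. Labeling a tied vote as anomalous is also an assumption made here for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
Classifier = Callable[[Features], str]

NORMAL = "anomaly-free"
ANOMALOUS = "anomalous"


# Hypothetical stand-ins for independently built classifiers.
def rate_rule(x: Features) -> str:
    return ANOMALOUS if x["queries_per_min"] > 100 else NORMAL


def nxdomain_rule(x: Features) -> str:
    return ANOMALOUS if x["nxdomain_ratio"] > 0.5 else NORMAL


def diversity_rule(x: Features) -> str:
    return ANOMALOUS if x["distinct_domains"] > 50 else NORMAL


def hybrid_classify(
    x: Features,
    classifiers: Sequence[Classifier],
    tie_label: str = ANOMALOUS,  # assumption: err on the side of raising an alarm
) -> str:
    """Return the label chosen by the majority of the classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    if votes[ANOMALOUS] > votes[NORMAL]:
        return ANOMALOUS
    if votes[NORMAL] > votes[ANOMALOUS]:
        return NORMAL
    return tie_label


ensemble = [rate_rule, nxdomain_rule, diversity_rule]
observation = {"queries_per_min": 150.0, "nxdomain_ratio": 0.7, "distinct_domains": 10.0}
print(hybrid_classify(observation, ensemble))  # anomalous (two votes to one)
```

Using an odd number of classifiers avoids ties for a two-class decision, so the `tie_label` parameter matters only for even-sized ensembles.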

Uncertainty should be treated carefully and rationally. Statistics is concerned with how data change people's beliefs. A confidence level is the desired “certainty” we wish to have in drawing a conclusion about the population. Both the probability and the confidence interval provide measurements of uncertainty. Probability tells us how likely it is that the observed traffic belongs to a particular pattern, and the confidence interval tells us the margin of error for the estimate of interest. Statistical simulation techniques, such as the bootstrap and Monte Carlo methods, can be used to obtain a probability of the outcome [61, 62].
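As an illustration of the bootstrap idea, the sketch below estimates a percentile confidence interval for a statistic by resampling the observed data with replacement. The percentile method is only one of several bootstrap variants and is chosen here as an assumption. The function name `bootstrap_ci` and the sample values are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple


def bootstrap_ci(
    data: Sequence[float],
    stat: Callable[[Sequence[float]], float] = mean,
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: Optional[int] = None,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    alpha = 1.0 - confidence
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical per-minute query counts for one host.
counts = [12, 15, 9, 22, 17, 14, 30, 11, 16, 13, 19, 21, 10, 18, 25]
low, high = bootstrap_ci(counts, seed=0)
print(low, high)
```

Because the interval comes from random resampling, its endpoints vary slightly between runs unless a seed is fixed.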

The standard error can be used to calculate the confidence interval, whose width depends on the confidence level selected.
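The sketch below shows one way this calculation might look for a sample mean, assuming a normal approximation (for small samples a t-distribution critical value would be more appropriate). The function name `mean_confidence_interval` and the data values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_confidence_interval(
    data: Sequence[float], confidence: float = 0.95
) -> Tuple[float, float]:
    """Confidence interval for the mean: estimate +/- z * standard error."""
    n = len(data)
    estimate = mean(data)
    standard_error = stdev(data) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error
    return estimate - margin, estimate + margin


values = [10.0, 12.0, 14.0, 16.0, 18.0]
print(mean_confidence_interval(values, confidence=0.95))
print(mean_confidence_interval(values, confidence=0.99))  # wider interval
```

Raising the confidence level increases the critical value and therefore widens the interval around the same estimate.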

The decision of whether to use a statistical modeling approach or an alternative approach depends on many known and unknown factors, and should be gauged by the accuracy of the resulting predictions. Although in many situations an empirical, historical data-based statistical modeling approach is used as a principal tool for intrusion detection and prevention, it can also be used as a complementary tool that helps both researchers and network administrators make better evidence-based decisions.

4.2. Fundamental Statistical Roles and challenges in

Source document: Development and Evaluation of a Comprehensible DNS Query Traffic based Statistical Bot Detection System (pages 52-55)