Study on Human Related Analysis in Privacy Protected Videos



Kyushu University Institutional Repository


Ma, Chao

https://doi.org/10.15017/2534459

Publication information: Kyushu University, 2019, Doctor of Information Science, course-based doctorate. Version:


Study on Human Related Analysis in Privacy Protected Videos

Chao Ma

Department of Advanced Information Technology

Graduate School and Faculty of Information Science and Electrical Engineering

Kyushu University

June 2019


Abstract

Human related analysis using surveillance cameras is prevalent in crime prevention, condition monitoring, abnormal behavior detection, etc. Surveillance cameras are well suited to these purposes, since they are cheap and any person within the monitored area can be observed. However, privacy protection has been a persistent concern in their use, because recorded videos containing private information may be abused before effective privacy protection is applied. Because of these privacy concerns, the deployment of surveillance cameras encounters resistance from the public in many countries. Privacy protection for surveillance cameras is therefore urgently needed.

To protect privacy in surveillance videos, the most widely adopted approach is based on post processing. This approach hides the private content, manually or algorithmically, after the data containing it have been captured. However, since the data containing private information are transmitted, processed, or stored before protection is applied, there is a window of time in which privacy can leak.

To solve this problem, our lab proposed an optical level anonymous image sensing system, which hides the facial regions optically at the video capturing stage. The system first finds the facial locations in the scene using a privacy-safe thermal image. Masks corresponding to these facial locations are then generated and displayed on a spatial light modulator (SLM), which controls the light rays entering an RGB image sensor. In this way, light rays from the faces in the scene are blocked from reaching the RGB image sensor. The output of the RGB image sensor contains no facial information (the facial regions are black), so the risk of privacy leakage is avoided. As this working process shows, finding the facial regions in thermal images is essential for effective privacy protection, so improving the system requires better face detection in thermal images. Furthermore, we also intend to apply the system to a typical human activity analysis task: abnormal behavior detection. For these purposes, this thesis tackles two human related analytical issues based on privacy protected videos: 1) face detection in privacy-safe thermal videos; 2) abnormal behavior detection in facially masked RGB videos.

For the first human related analytical issue, the author proposed two approaches: 1) new feature types obtained by extending MB-LBP, which improve feature robustness and effectiveness; 2) a mixed feature training algorithm based on AdaBoost, which yields cascade classifiers that contain multiple feature types and have improved discrimination ability. The author captured a dataset of 8400 thermal images from 21 participants. Using hold-out validation, the author showed that the proposed approaches are effective. The author also conducted a field experiment and identified the factors that affect face detection performance when using thermal and RGB images.


For the second human related analytical issue, the author proposed a neural network called C3D-AE. C3D-AE consists of two steps: 1) feature extraction; 2) classification based on the extracted features. C3D-AE uses a 3D convolutional neural network called C3D for feature extraction and an autoencoder for classification. In training, the fully connected layers of the pre-trained C3D network are first removed, and the remaining network is concatenated with an autoencoder. The autoencoder is then trained on the features that the pre-trained C3D extracts from video clips with normal behaviors. In prediction, the reconstruction error of the autoencoder is compared with a threshold to detect abnormal behavior. The author performed three experiments: a hold-out validation experiment and two field experiments. In the hold-out validation experiment, the author captured a dataset from 22 participants, showed the effectiveness of C3D-AE, and compared it with other methods using videos with and without facial masks. In the field tests, the author showed that the method is applicable and robust for abnormal behavior detection in real scenarios.


List of Figures

1.1 Concept of the optical level anonymous image sensing system.

3.1 Examples demonstrating the calculation of LBP and LTP.

3.2 Examples demonstrating the parameter decision processes by using the MB-LBP and MB-LTP.

3.3 The design and calculation examples of MB-ALBP and MB-ALTP.

3.4 Examples demonstrating the parameter decision processes by using the MB-ALBP and MB-ALTP.

3.5 The relationship between the multi-block feature types.

3.6 Concept of the mixed feature training.

3.7 Settings and variations of the dataset for face detection in thermal image.

3.8 Training results of cascade classifiers and discussion.

3.9 The composition of strong classifiers trained using mixed feature pools by averaging those in the five times of hold-out validation.

3.10 Combined results of hold-out validation experiments using cascade classifiers.

3.11 Advantage of considering the absolute temperature.

3.12 Combined results of accumulated rejection rate of cascade classifiers.

3.13 The average calculation time of patches from pyramid of one image passing cascade classifiers with different number of strong classifiers.

3.14 The capturing environment for the field experiment.

3.15 The appearances of RGB images and thermal images in the 6 scenarios.

3.16 The face detection results evaluated by F-score for the 6 scenarios.

3.17 Some typical face detection results in thermal and RGB images in different scenarios.

4.1 The structure of networks.

4.2 The dataset of abnormal behavior detection for hold-out validation.

4.3 Effectiveness of C3D-AE.

4.4 ROC curves of different combinations of feature extractors and normal vector modelers.

4.5 The field test in corridor scenario.

4.6 The normalized reconstruction errors.

4.7 Field test in indoor passageway scenario.

4.8 Frames show some new elements in the passageway scenario.

4.9 Factors that affect the performance of bounding box and MHI based methods.

A.1 The hardware prototype of the optical level anonymous image sensing system.

A.2 The calibration process for the optical level anonymous image sensing system.

A.3 The software design for the optical level anonymous image sensing system.

A.4 The postures of one participant standing at 2 m for testing the privacy enhancement in the real scenario.

A.5 Privacy protection performance of the optical level anonymous image sensing system in the corridor scenario.


List of Tables

3.1 The parameter settings and descriptions for all the scenarios used for capturing.

4.1 The parameters and descriptions for all the layers in C3D network.

4.2 The performance of the C3D-AE.

4.3 Execution time of different feature extractors and one-class classifiers.

4.4 Individual abnormal behavior detection results in corridor scenario.

4.5 Individual abnormal behavior detection results in indoor passageway scenario.

4.6 Sensitivity and its drop of the abnormal behavior detection methods for the forward/backward falls without/with occlusion in indoor passageway scenario.

4.7 Specificity for the normal behaviors in indoor passageway scenario.

A.1 Components used in the LCoS camera.


Chapter 1 Introduction

1.1 Privacy Issues in Video-based Surveillance

Human related analysis using surveillance cameras is prevalent in crime prevention [1], condition monitoring [2], abnormal behavior detection [3], etc. Surveillance cameras are widely installed in stores, banks, airports, parking lots, city streets, schools, etc., since they are cheap and any person within the monitored area can be observed. Although surveillance cameras contribute greatly to modern society, privacy protection has been a persistent concern [4], because recorded videos containing private information may be abused before effective privacy protection is applied. Because of these privacy concerns, the deployment of surveillance cameras encounters resistance from the public in many countries [5]. Privacy protection for surveillance cameras is therefore urgently needed.

To solve the privacy protection problem in surveillance videos, the meaning of privacy must first be clearly defined. In [6], privacy was defined as the contents of the videos that are to be kept in confidence. It can be categorized into human-related privacy, such as facial regions, and object-related privacy, such as the number plates of vehicles. In this thesis, the discussion is confined to human-related privacy. For human-related privacy, the consensus in the privacy protection community is that facial regions are the most sensitive contents of the videos and should be hidden [7, 8, 9, 10, 11, 12, 13], and that privacy can be effectively protected by hiding them.

To effectively hide the facial regions, our lab proposed an optical level image sensing system [14]. The concept is shown in Figure 1.1. The system contains a thermal camera and an RGB camera. The thermal camera is privacy-safe, and its purpose is to find the facial regions in the scene. The RGB camera provides the output videos that system users monitor. A cold mirror lets light rays from the scene enter the thermal camera and the RGB camera simultaneously: it transmits infrared rays to the thermal camera and reflects visible light rays to the RGB camera. The visible light rays entering the RGB image sensor are controlled by a spatial light modulator (SLM). By controlling the SLM, our system blocks the light rays coming from facial regions in the scene and thus hides the facial regions at the image capturing stage, so it physically cannot capture this private information. In this sense, it is an ideal tool for human related analysis, since the output of the system is privacy protected video with the facial regions masked (facial regions are black).

Figure 1.1: Concept of the optical level anonymous image sensing system proposed in [14]. The images formed in all steps are visualized. A green dialog box is used to demonstrate the digital image generated in each step, and the image in the red dialog box is the latent optical image on the RGB image sensor plane. Red blocks are used in the spatial light modulator to indicate the blocking of the visible light rays from the facial regions, while the green blocks indicate that the spatial light modulator allows the visible light rays to enter the RGB image sensor. In all steps, no image with recognizable facial regions appears.
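The masking step can be illustrated with a small software simulation. This is only a sketch under stated assumptions: the face detector on the thermal image is assumed to have already returned bounding boxes, and the thermal image, the SLM, and the RGB sensor are assumed to share one pixel grid (the real system blocks light optically and relates the devices by calibration). The names `build_slm_mask` and `apply_mask` are hypothetical.

```python
def build_slm_mask(height, width, face_boxes):
    """Return a 2D list: 0 where light is blocked (face), 1 where it passes.

    Each box is (top, left, bottom, right) with exclusive bottom/right.
    """
    mask = [[1] * width for _ in range(height)]
    for top, left, bottom, right in face_boxes:
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                mask[y][x] = 0
    return mask


def apply_mask(rgb_image, mask):
    """Simulate the optical blocking: masked pixels reach the sensor as black."""
    return [
        [pixel if keep else (0, 0, 0) for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(rgb_image, mask)
    ]


# A 4x6 scene with one face box reported by the (assumed) thermal detector.
scene = [[(200, 180, 160)] * 6 for _ in range(4)]
mask = build_slm_mask(4, 6, [(1, 2, 3, 5)])
masked = apply_mask(scene, mask)
```

In the simulated output, only the pixels inside the detected box become black; everything else is passed through unchanged.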

When the facial regions are hidden in the video, an important question is what useful information can be extracted from these privacy protected videos for applications of human related analysis. As far as we know, however, almost all human related analysis algorithms are based on ordinary videos with clear facial regions [15]. Since we intend to apply the privacy protected videos output by our optical level anonymous image sensing system to human activity analysis, we need to find a target application where privacy protected videos are applicable and design a suitable algorithm for it. Furthermore, we need to know how the facial masks affect performance, by comparing the performance on videos with and without facial masks.

1.2 Contribution

According to the working process of the optical level anonymous image sensing system proposed by our lab, finding the facial regions in thermal images is essential for effective privacy protection. Improving the effectiveness of the system therefore requires better detection of facial regions in thermal images. Furthermore, we apply the optical level anonymous image sensing system to a typical human activity analysis task: abnormal behavior detection. This requires an algorithm that works effectively on privacy protected videos with facial masks. For these purposes, in this thesis, the author tackles two human related analytical issues based on privacy protected videos around the optical level anonymous image sensing system: 1) face detection in privacy-safe thermal videos; 2) abnormal behavior detection in facially masked RGB videos. The author addresses the two human related analytical issues as follows:

Human Related Analytical Issue 1: Face detection in thermal images.

To better detect the facial regions in thermal images, the author proposed two approaches to adapt local features to face detection in thermal images:

• New Feature Types: The author created new feature types by considering the properties of facial regions in thermal images. The new feature types were realized by extending MB-LBP features [16]. The author considered two aspects to improve the performance of the original MB-LBP features: 1) a margin for the encoding process of the response calculation in the original MB-LBP features; 2) the generally constant distribution of facial temperature. In this way, the robustness and effectiveness of the features for face detection in thermal images are improved.

• Mixed Feature Training Algorithm: The author proposed an AdaBoost- based training algorithm to build cascade classifiers containing different feature types with different advantages. The algorithm can build cascade classifiers containing number-type and/or category-type features [17]. In this way, an improved description power can be obtained.

Human Related Analytical Issue 2: Abnormal behavior detection. To detect the abnormal behaviors in the RGB videos with facial masks, the author proposed an efficient neural network:

• C3D-AE Neural Network: The author proposed a neural network based on one-class training, called C3D-AE. This neural network combines a 3D convolutional neural network for feature extraction and an autoencoder for modeling the normal behaviors. The 3D convolutional neural network used here, called C3D [18], was proposed for action recognition. The C3D is pre-trained using Sports-1M [19], which has 1 million videos. The author used the pre-trained network as a feature extractor by removing the last two fully connected layers, and concatenated it with an autoencoder to model normal behaviors. In training, the autoencoder is trained on the features extracted from videos with only normal behaviors. In prediction, the reconstruction errors of the autoencoder are compared with a threshold to discriminate between normal and abnormal behaviors. Since the autoencoder is trained on features extracted from videos with normal behaviors, features extracted from videos with abnormal behaviors cause larger average reconstruction errors.
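The decision rule described in this item can be sketched in a few lines. The sketch below is illustrative only and is not the C3D-AE implementation: the feature vectors stand in for C3D outputs, `toy_autoencoder` is a hypothetical stand-in that reconstructs every input as the mean of the normal training features, and the way the threshold is chosen (twice the largest error on normal data) is an assumption made for the example.

```python
def reconstruction_error(feature, reconstruct):
    """Mean squared error between a feature vector and its reconstruction."""
    rebuilt = reconstruct(feature)
    return sum((a - b) ** 2 for a, b in zip(feature, rebuilt)) / len(feature)


def is_abnormal(feature, reconstruct, threshold):
    """Flag a clip as abnormal when its reconstruction error exceeds the threshold."""
    return reconstruction_error(feature, reconstruct) > threshold


# Stand-ins for features extracted from clips with normal behaviors (assumption).
normal_features = [[1.0, 2.0, 3.0], [1.2, 1.8, 3.1], [0.9, 2.1, 2.9]]
mean = [sum(column) / len(column) for column in zip(*normal_features)]


def toy_autoencoder(feature):
    """Hypothetical stand-in for a trained autoencoder: always outputs the mean."""
    return mean


# Threshold choice is an assumption: twice the largest error seen on normal data.
threshold = 2 * max(reconstruction_error(f, toy_autoencoder) for f in normal_features)

normal_like = is_abnormal([1.1, 2.0, 3.0], toy_autoencoder, threshold)
very_different = is_abnormal([4.0, 0.0, 6.0], toy_autoencoder, threshold)
```

A feature close to the normal training data reconstructs well and stays below the threshold, while a feature far from it produces a large error and is flagged.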

1.3 The Structure of This Thesis

Chapter 2 describes the related works and problem settings of this thesis. Three branches of related works are introduced. First, since this thesis utilizes privacy protected videos, traditional privacy protection approaches are introduced. Second, since the author designs new feature types for face detection employing the AdaBoost algorithm, the traditional feature types based on AdaBoost are introduced. Furthermore, since the author proposes a mixed feature learning algorithm to train cascade classifiers based on the AdaBoost algorithm, the traditional approaches for feature fusion are described. Third, since the author designs a new neural network for abnormal behavior detection, the background knowledge on video based abnormal behavior detection approaches is introduced.


Chapter 3 describes the feature adaptation approaches proposed by the author. Two approaches are introduced: 1) new feature types extended from MB-LBP features; 2) a mixed feature training algorithm based on AdaBoost. In addition, the author captured a dataset and showed the effectiveness of the approaches by hold-out validation. The author also tested the proposed approaches in a field experiment.

Chapter 4 describes the proposed C3D-AE neural network and its training and prediction methods for abnormal behavior detection. The author captured a dataset from 22 participants to evaluate the neural network, showed the effectiveness of the proposed network, and compared it with other methods. To see how the facial masks affect the abnormal behavior detection performance, the author also made a comparison using samples with and without facial masks. Furthermore, the author tested the proposed method in two field tests, and showed its effectiveness and robustness under different scenarios.

Chapter 5 concludes the thesis and gives directions for future work.


Chapter 2 Related Works

2.1 Introduction

This chapter describes three branches of related works. First, since this thesis utilizes the optical level anonymous image sensing system for privacy protection, the existing privacy protection approaches are introduced, and their advantages and disadvantages are discussed. Second, since the new feature types are designed for face detection employing the AdaBoost algorithm, the traditional feature types designed for the AdaBoost algorithm are introduced. These traditional feature types are the basis that the author adapts to face detection in thermal images. Furthermore, since a mixed feature learning algorithm is proposed to train cascade classifiers based on the AdaBoost algorithm, the traditional approaches for feature fusion are described as background knowledge. Third, since the new neural network is proposed for abnormal behavior detection based on videos, the existing video based abnormal behavior detection approaches are introduced.

2.2 Privacy Protection Approaches

The access control based approach is straightforward: it permits only authorized access to the permitted part of the recorded video database. Kumar and Babu [20] proposed a database management model that can hide some frames of the videos from unauthorized accessors. However, as Bertino et al. [21] pointed out, this database management model is too simple, so its application is limited. Bertino et al. [21] proposed a more advanced content-based access control strategy so that different groups of visitors can be allowed to access different elements in the video, or the same video element at different resolutions, according to their rights. Chinomi et al. [22] proposed a system called PriSurv, in which different groups of visitors to the database can access videos with different degrees of privacy enhancement. In summary, in this approach, all the videos are stored in the database without privacy enhancement, and the privacy enhancement is applied when visitors access the database. The main problem of this approach is the risk of data breach, since the database contains the videos with private information. As the report by Verizon [23] shows, data breach is nowadays a serious problem for database management.

The post processing based approach [7, 8, 9, 10, 11] hides the private contents, especially the facial regions in the videos, by blurring, blacking out, or mosaicking them. Parvel et al. [24] compared the privacy protection performance of blurring, blacking out, and mosaicking the facial regions at different strengths. They showed the trade-off between recognizability and intelligibility of these facial hiding methods. Newton et al. [7] proposed a post processing method which keeps the necessary facial characteristics while decreasing the facial recognizability. Their method protects privacy by replacing the facial regions with k averaged facial regions. The problem of the post processing based approach is that before the videos are privacy enhanced, there is a vulnerable time in the transmission and storage of the videos [5, 14, 25]. To be specific, after being captured by the image sensors, the transmitted, processed, or stored data have not yet had privacy protection applied, and at this stage the data are vulnerable to hacking activities.

The front-end approach appeared recently with the advancement of hardware technology, especially the development of IC design and optics design. It usually adopts specially designed image sensors or optical elements to hide private content such as the facial regions. In this way, the videos are already privacy enhanced as early as their output from the image sensor. There are two ways to implement the front-end approach: 1) sensor level; 2) optical level. The sensor level approach uses specially designed image sensors to obtain privacy protection. Winkler and Rinner [26] proposed the TrustEye sensor by encapsulating an ordinary image sensor and a privacy protection circuit into one indivisible unit. In this way, they prevented direct access to the output of the image sensor. Fernandez-Berni et al. [27] reported an ultra-low-power sensor with reconfigurable focal-plane sensing-processing arrays. This reconfigurability enables the obfuscation of different image regions at the focal plane. Pittaluga et al. [25] integrated a circuit with a thresholding function into a thermal image sensor. All the pixels around the temperature range of the facial regions are darkened by this thresholding function.

The optical level approach usually uses ordinary image sensors, with specially designed optical elements placed in front of them. Pittaluga and Koppal [5] provided a set of carefully designed optical elements that can perform tasks such as people tracking and counting, motion analysis, etc., without capturing any private information.

In the front-end approach, the videos output from the image sensors are already privacy enhanced before they are sent to the host boards inside the cameras. It is a more advanced way in terms of privacy safety, power consumption, and convenience of deployment. Comparing the two implementations of the front-end approach, sensor level and optical level, we see that for the sensor level approach, once the image sensor is designed, the privacy enhancement algorithm implemented by the image processing circuit is fixed and generally hard to modify. The reason is that the design of the image processing circuit is highly related to a specific kind of privacy enhancement algorithm [28]. Furthermore, both IC design and manufacture are difficult and extremely expensive [29]. The optical level approach does not need specially designed image sensors, so it is more convenient to implement. The optical level anonymous image sensing system of our lab [14] is one of the optical level implementations.

2.3 Local Features and Fusion of Features

2.3.1 Local Features

Haar-like features were first proposed by Viola and Jones [30] in 2001 for face detection using RGB images. The response of one Haar-like feature is calculated by subtracting the sums of pixel values in neighboring rectangular regions. All the Haar-like features form a feature pool. The AdaBoost algorithm is employed to repeatedly select a feature from the feature pool for a weak classifier. Several weak classifiers are combined into a strong classifier. After several strong classifiers are built, they are chained together into a cascade classifier. Their cascade classifier is the first real-time high-performance method for face detection using RGB images [31]. Later, Reese et al. [32] showed that it is also feasible to employ a Haar-like based cascade classifier for face detection using thermal images. The limitation of the Haar-like features is that they are too simple [16]. As a consequence, the obtained cascade classifier contains too many weak classifiers, which leads to a high computational cost.
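As a minimal illustration of the response calculation, the sketch below computes a two-rectangle Haar-like response with an integral image, the summed-area table that Viola and Jones use to obtain rectangle sums in constant time. The function names and the choice of a horizontal two-rectangle layout are assumptions made for the example.

```python
def integral_image(image):
    """ii[y][x] holds the sum of image rows < y and columns < x (zero-padded)."""
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_two_rect_horizontal(ii, top, left, height, width):
    """Sum of the left rectangle minus the sum of the adjacent right rectangle."""
    left_sum = rect_sum(ii, top, left, height, width)
    right_sum = rect_sum(ii, top, left + width, height, width)
    return left_sum - right_sum


patch = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
]
ii = integral_image(patch)
response = haar_two_rect_horizontal(ii, 0, 0, 2, 2)  # 40 - 200 = -160
```

The response is a single real number, which is why Haar-like features fit directly into threshold-based weak classifiers.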

Histogram of Oriented Gradient (HOG) was first proposed by Dalal and Triggs [33] in 2005 for human body detection using RGB images. Its response is calculated as a histogram using all the pixels in image patches. The calculation of the response histogram contains two steps. First, image gradients are calculated pixel by pixel in image patches. Second, the response histogram is formed according to the gradient orientations. The HOG feature can also be employed by the AdaBoost algorithm. For example, Jia and Zhang [34] proposed a HOG-based cascade classifier. To fit the AdaBoost framework, their method uses the value of one orientation bin in the original HOG response histogram for each weak classifier. Since this thesis also uses the AdaBoost algorithm, the author likewise uses one orientation bin of the original HOG response histogram for each weak classifier, as Jia and Zhang did. The limitation of HOG is that it works efficiently only for objects with clear contours. The reason is that HOG is calculated from image gradients, which indicate the edges in images well [35].
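The two calculation steps, and the use of a single orientation bin as a number-type response, can be sketched as follows. This is a simplified illustration, not the HOG of [33] or the classifier of [34]: it assumes central-difference gradients, nine unsigned orientation bins over 0 to 180 degrees, magnitude-weighted votes, and no cell or block normalization.

```python
import math


def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Step 1: image gradient at this pixel (central differences).
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Step 2: vote into the bin of the gradient orientation.
            hist[min(int(angle / bin_width), bins - 1)] += magnitude
    return hist


def single_bin_response(patch, bin_index, bins=9):
    """Number-type response for one weak classifier: the value of one bin."""
    return orientation_histogram(patch, bins)[bin_index]


# Vertical edge: intensity changes only along x, so every vote lands in bin 0.
patch = [[0, 0, 100, 100]] * 4
hist = orientation_histogram(patch)
```

For this patch, the four interior pixels each contribute a gradient of magnitude 100 at orientation 0, so bin 0 holds 400 and all other bins are empty.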

Local Binary Pattern (LBP) was first proposed by Ojala et al. [36] in 1996 for texture analysis using RGB images. Its response is calculated as a histogram. The calculation of its response histogram contains two steps. First, each pixel in an image patch is taken as the reference, and its value is compared with those of its surrounding pixels. If the surrounding pixel is larger than the center one, the place of the surrounding pixel is labeled as 1; otherwise, it is labeled as 0. The resulting vector of 0s and 1s is the response at that specific pixel. Second, the response histogram is formed from the responses of all the pixels in the image patch. The limitation of LBP is its sensitivity to noise in uniform image regions [37]. The reason is that in uniform image regions, pixel intensities are quite similar. When even tiny image noise affects a pixel, the result of comparing that pixel with its surrounding ones may change. This causes an incorrect LBP response, and too many incorrect LBP responses finally lead to an incorrect LBP histogram. Unfortunately, facial regions contain many uniform regions [37], which makes facial analysis using the original LBP less robust.
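The two steps can be written down directly. In the sketch below, the clockwise neighbor order starting at the top-left pixel and the packing of the eight labels into one integer are assumptions of the example; the comparison itself follows the description above (1 when the neighbor is larger than the center).

```python
# Clockwise neighbor offsets (dy, dx), starting at the top-left neighbor.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, y, x):
    """8-bit LBP code of pixel (y, x): a bit is 1 where the neighbor is larger."""
    center = image[y][x]
    code = 0
    for dy, dx in NEIGHBORS:
        code = (code << 1) | (1 if image[y + dy][x + dx] > center else 0)
    return code


def lbp_histogram(image):
    """256-bin histogram of the LBP codes of all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, y, x)] += 1
    return hist


patch = [
    [9, 1, 9],
    [1, 5, 1],
    [9, 1, 9],
]
code = lbp_code(patch, 1, 1)  # bits 10101010 -> 170

# Noise sensitivity in a uniform region: one pixel off by 1 changes the code.
uniform = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
noisy = [[6, 5, 5], [5, 5, 5], [5, 5, 5]]
uniform_code = lbp_code(uniform, 1, 1)  # 0
noisy_code = lbp_code(noisy, 1, 1)      # 128
```

The last two lines show the limitation discussed above: in a uniform region, a change of a single gray level in one neighbor is enough to alter the code.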

To improve the robustness of LBP, there are two main approaches. The first approach introduces a margin around the central reference pixel value when calculating the LBP response by comparison. When tiny image noise exists, this margin keeps the response the same. A typical feature type that adopts this approach is Local Ternary Pattern (LTP) [37]. The second approach compares average pixel values in blocks of multiple pixels. In this way, the interference of image noise is canceled by averaging. A typical feature type that adopts this approach is Multi-scale Block LBP [38]. Interestingly, there are also feature types that consider both of these approaches. Multi-scale Block LTP [39] is one of them.
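The effect of the margin can be seen in a short sketch of a ternary encoding. The exact inequalities (strict comparison against center plus or minus the margin) and the neighbor order are assumptions of this example and may differ in detail from [37].

```python
# Clockwise neighbor offsets (dy, dx), starting at the top-left neighbor.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def ltp_pattern(image, y, x, margin):
    """Ternary pattern of pixel (y, x): +1 above, -1 below, 0 inside the margin."""
    center = image[y][x]
    pattern = []
    for dy, dx in NEIGHBORS:
        value = image[y + dy][x + dx]
        if value > center + margin:
            pattern.append(1)
        elif value < center - margin:
            pattern.append(-1)
        else:
            pattern.append(0)
    return pattern


clean = [[50, 50, 50], [50, 50, 50], [50, 50, 50]]
noisy = [[51, 50, 50], [50, 50, 50], [50, 50, 49]]
edge = [[90, 90, 90], [50, 50, 50], [10, 10, 10]]

clean_pattern = ltp_pattern(clean, 1, 1, margin=2)
noisy_pattern = ltp_pattern(noisy, 1, 1, margin=2)
edge_pattern = ltp_pattern(edge, 1, 1, margin=2)
```

With a margin of 2, the one-gray-level noise in `noisy` leaves the pattern identical to that of `clean`, while the real intensity step in `edge` is still encoded.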

The LBP, LTP, Multi-scale Block LBP, and Multi-scale Block LTP are calculated at all possible positions inside image patches to form response histograms. This calculation is quite time consuming for real-time face detection in video. Furthermore, these feature types cannot be employed directly by cascade classifiers, because they use histogram representations. To achieve faster face detection with the AdaBoost algorithm, Multi-Block LBP (shortened to MB-LBP) features were proposed by Zhang et al. [16]. MB-LBP also compares the average pixel value of the central multi-pixel block with those of the surrounding ones. However, MB-LBP is not represented by a histogram; the response itself is the representation. In other words, the response of a single MB-LBP feature is directly used for classification in one weak classifier of a cascade classifier.
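A single MB-LBP response can be sketched as below. The feature covers a 3x3 grid of equally sized blocks; the strict "larger than" comparison, the clockwise bit order, and the function names are assumptions of this example rather than details taken from [16].

```python
# Clockwise neighbor offsets (dy, dx) in block units, starting at the top-left.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def block_mean(image, top, left, block_h, block_w):
    """Average pixel value of one block."""
    total = sum(
        image[y][x]
        for y in range(top, top + block_h)
        for x in range(left, left + block_w)
    )
    return total / (block_h * block_w)


def mb_lbp_response(image, top, left, block_h, block_w):
    """Response of one MB-LBP feature whose 3x3 block grid starts at (top, left)."""
    means = [
        [block_mean(image, top + r * block_h, left + c * block_w, block_h, block_w)
         for c in range(3)]
        for r in range(3)
    ]
    center = means[1][1]
    code = 0
    for dy, dx in NEIGHBORS:
        code = (code << 1) | (1 if means[1 + dy][1 + dx] > center else 0)
    return code


# 6x6 image: the top two rows are warm (8), the rest is cool (2); blocks are 2x2.
image = [[8] * 6 for _ in range(2)] + [[2] * 6 for _ in range(4)]
response = mb_lbp_response(image, 0, 0, 2, 2)  # bits 11100000 -> 224
```

The returned code is used as is, without building a histogram, which is what lets one MB-LBP feature serve as the input of one weak classifier.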

The original papers introducing Multi-scale Block LBP [38] and Multi-Block LBP [16] both used the abbreviation MB-LBP. The paper introducing Multi-scale Block LTP [39] used the abbreviation MB-LTP. However, as described above, multi-block features and multi-scale block features are different. In this thesis, the author only considers the multi-block features. To keep the discussion clear and free of redundancy, from here on the author uses the short forms MB-LBP and MB-LTP for Multi-Block LBP by Zhang et al. [16] and for Multi-Block LTP, the new feature type proposed in this thesis, respectively.

From all the feature types above we can see two aspects: 1) the feature types which can be employed by cascade classifiers fall into two groups: i) number-type features, such as Haar-like or HOG; ii) category-type features, such as MB-LBP [17]. The response of a number-type feature is a real number, while that of a category-type feature is a high-dimensional vector [17]. 2) the category-type features are relatively complex, because the dimension of their responses is higher. This complexity leaves room to further adapt them to specific applications.

2.3.2 Fusion of Features

Different feature types have different description power. To enhance the feature description power, multiple feature types are used together rather than a single type. There are three main approaches to combining multiple feature types in object detection applications: concatenation, co-occurrence, and mixed feature pool. As far as our survey found, all three approaches have been used only with RGB images.

Concatenation is realized by concatenating several individual feature histograms of different feature types into a longer one. This requires that the feature types used have histogram representations. Wang et al. [40] proposed a method to concatenate the histogram of LBP and that of HOG into a HOG-LBP histogram. Jiang and Ma [41] created color and bar-shaped feature histograms, and concatenated them with the HOG histogram to obtain a feature type called HOG III. Both of these concatenated feature types are used for human body detection. They achieve improved performance compared with using only one feature type.

Co-occurrence is realized by using more than one feature simultaneously for one weak classifier in cascade classifiers. Mita et al. [42] proposed joint Haar-like features for face detection. Their joint features were implemented using two or three co-occurring Haar-like features. They showed that their joint features were better than a single Haar-like feature.

Mixed feature pool is realized by mixing two or more feature types in one feature pool for the feature selection by the AdaBoost algorithm. As a result, one strong classifier in a cascade classifier may be built from weak classifiers with different feature types. The difference from the co-occurrence approach is that one weak classifier contains only one feature. Xia et al. [43] proposed a mixed feature pool for an object-tracking application using RGB images. They mixed two number-type features, Haar-like and HOG, in their mixed feature pool. In this thesis, the author expects that the mixed feature pool can also be applied to thermal images. Furthermore, the author mixes not only number-type features but also category-type features such as MB-LBP into the mixed feature pool. The author proposed an AdaBoost-based training algorithm to build cascade classifiers with number-type and/or category-type features. The proposed training method can select features from feature pools containing both number-type and category-type features. In contrast, that of Xia et al. [43] can only select features from a feature pool with number-type features (Haar-like and HOG).
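To make the idea of selecting from a mixed pool concrete, the sketch below shows the feature selection step of one boosting round. It is an illustration under assumptions, not the training algorithm of this thesis: number-type features are handled by a threshold stump, category-type features by a per-category lookup table built from weighted votes, category responses are represented as integer codes, and all names (`train_stump`, `train_lookup`, `select_weak_classifier`, the feature names) are hypothetical.

```python
def make_stump(threshold, polarity):
    """Threshold rule for a number-type response."""
    return lambda response: polarity if response >= threshold else -polarity


def weighted_error(predict, responses, labels, weights):
    """Total weight of the samples that the rule misclassifies."""
    return sum(w for r, y, w in zip(responses, labels, weights) if predict(r) != y)


def train_stump(responses, labels, weights):
    """Best threshold and polarity for a number-type feature."""
    best = None
    for threshold in sorted(set(responses)):
        for polarity in (1, -1):
            predict = make_stump(threshold, polarity)
            error = weighted_error(predict, responses, labels, weights)
            if best is None or error < best[0]:
                best = (error, predict)
    return best


def train_lookup(responses, labels, weights):
    """Per-category weighted vote for a category-type feature."""
    votes = {}
    for r, y, w in zip(responses, labels, weights):
        votes[r] = votes.get(r, 0.0) + w * y
    table = {r: (1 if v >= 0 else -1) for r, v in votes.items()}
    predict = lambda response: table.get(response, -1)
    return weighted_error(predict, responses, labels, weights), predict


def select_weak_classifier(pool, labels, weights):
    """Pick the feature with the lowest weighted error from a mixed pool."""
    trainers = {"number": train_stump, "category": train_lookup}
    best = None
    for name, kind, responses in pool:
        error, predict = trainers[kind](responses, labels, weights)
        if best is None or error < best[1]:
            best = (name, error, predict)
    return best


labels = [1, 1, -1, -1]          # +1 face, -1 non-face
weights = [0.25] * 4             # uniform sample weights in the first round
pool = [
    ("haar_0", "number", [0.9, 0.2, 0.8, 0.1]),     # weakly informative
    ("mb_lbp_0", "category", [224, 224, 17, 3]),    # separates the classes
]
name, error, classifier = select_weak_classifier(pool, labels, weights)
```

Because both trainers return a weighted error on the same scale, features of different types compete directly, and here the category-type feature is selected. A full boosting round would then reweight the samples, which is omitted to keep the sketch short.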


2.4 Abnormal Behavior Detection

According to Al-Dhamari et al. [3], human-related abnormal behavior detection can be grouped into two categories: 1) individual abnormal behavior detection; 2) crowd abnormal behavior detection. Individual abnormal behavior detection mainly focuses on finding abnormal behaviors of individuals, such as falling down [44, 45, 46]. Crowd abnormal behavior detection takes the scene as a whole to decide whether something abnormal is happening [47, 48, 49]. These two categories have different scopes of application. For individual abnormal behavior detection, the main applications include guaranteeing the safety of impaired or elderly people in hospitals or nursing homes. For crowd abnormal behavior detection, the main applications include guaranteeing security in public places such as railway stations and airports. In this thesis, the author focuses on individual abnormal behavior detection.

To detect individual abnormal behavior, the necessary prerequisite is to define what kinds of behaviors are normal or abnormal [3]. However, it is hard to give a uniform definition of individual abnormal behavior, because the definition depends strongly on the specific application as well as the goal of the research [3]. In this thesis, the author confines the research and discussion to individual abnormal behavior related to the safety of impaired or elderly people in hospitals or nursing homes, in corridor/passageway scenarios. The goal is to detect individual abnormal behavior, especially falls.

To detect individual abnormal behaviors, the most commonly employed approach consists of two steps: 1) feature extraction; 2) training or designing a classifier using the extracted features [3]. A variety of feature types have been proposed for human action related applications using videos. For example, Laptev et al. [50] proposed a type of feature called HOG/HOF. This type of feature considers both the local appearance and the local motion by HOG and HOF features, respectively. The HOG and HOF features are concatenated into the HOG/HOF feature. Klaser et al. [51] proposed a type of feature called HOG3D. This type of feature is the conceptual generalization of the HOG feature from 2D images to 3D video volumes. For HOG3D, the gradients are calculated in 3D spatio-temporal grids rather than in the 2D grids of the original HOG. Wang et al. [52] proposed the idea of dense trajectories, which are generated by calculating the optical flow from densely sampled pixels. The HOG, HOF and MBH [52] features are then calculated along the trajectories and concatenated into a longer feature. Later, they improved their method [53] by considering camera motion as well as the post-processing applied after the HOG, HOF and MBH features are extracted. All the feature types introduced above are designed manually using feature engineering. The main problem with these manually designed feature types is that it is hard to guarantee their optimality for a specific application.

In recent years, the boom in deep learning has provided a more effective way of learning features [54]. A variety of deep neural networks have been proposed for learning spatio-temporal features from videos. For example, Ji et al. [55] extended the idea of convolutional neural networks (CNN) [56] from 2D images to 3D video volumes by proposing a deep neural network called the 3D convolutional neural network (3D CNN). Their 3D CNN has a hardwired layer as the first layer of the network. This hardwired layer calculates the gradients and optical flow fields along the horizontal and vertical directions. Further 3D convolution and pooling operations are then performed after the hardwired layer. Later, Tran et al. [18] proposed another 3D CNN structure, called C3D. Unlike the network by Ji et al., C3D does not need hardwired layers; all the weights are learned from the training data. They showed the efficiency and generality of their neural network, and released a pre-trained model for subsequent researchers. Since their neural network can extract features with powerful descriptive ability, in this thesis the author employs their pre-trained model as the feature extractor for the specific application of abnormal behavior detection.

After the features are extracted, a classifier is learned to discriminate between normal and abnormal behaviors [3]. For this purpose, one-class learning methods are commonly adopted to construct models of normal behaviors [3], since the numbers of normal and abnormal samples are typically unbalanced (in real scenes, there are many more normal samples than abnormal samples). For example, the one-class SVM [57] finds a spherical decision boundary defined by a set of support vectors chosen from the normal feature vectors. In this model, the points outside the spherical boundary are deemed abnormal. It is widely used for abnormal behavior detection in surveillance videos [58]. Hawkins et al. [59] proposed an autoencoder structure for modeling the normal feature vectors. In the training phase, they used the normal feature vectors to train a specially designed autoencoder. In the testing phase, they used the reconstruction errors to distinguish normal from abnormal samples. In this thesis, the author also adopts the autoencoder to model normal behaviors and uses the model to detect abnormal behaviors. The author also makes a comparison with the one-class SVM based method.


Chapter 3 Adapting Local Features to Face Detection in Thermal Image

3.1 Introduction

Face detection is fundamental in computer vision. Most existing work on face detection is based on RGB images, since they are easy to obtain. However, using RGB images for face detection has four main problems: 1) there is a risk of privacy leakage, since the identities of people in RGB images are easy to recognize. 2) facial appearances in RGB images are liable to change with the lighting condition [31]. For example, when the light source is on the right side of a face, that side of the facial region appears brighter than the left side, and vice versa. It is not easy for a face detection algorithm to handle facial regions with such different brightness distributions simultaneously. 3) facial appearances in RGB images sometimes become irregular because of make-up pigments [60]. For example, actors and actresses in dramas or fashion shows sometimes wear unusual make-up, which makes their facial appearances look far from those of ordinary human faces. In this situation, detecting faces is nearly impossible. 4) it is hard for almost all face detection algorithms to discriminate printed faces from real ones [61], because printed faces look the same as real faces in RGB images. Solving all of the above problems completely is quite difficult, since they are inherent in RGB images. We need a different camera type that can provide more reliable information under those adverse conditions.

There is a variety of digital cameras these days with properties different from those of RGB color cameras. A thermal camera is one of them and works in the thermal IR band (wavelengths from 8 µm to 14 µm). The light in this band is called thermal IR radiation. The strength of thermal IR radiation from an object is associated with its temperature [62]. In a thermal camera, a germanium lens images the thermal IR radiation from the scene onto a thermal image sensor. In the thermal image sensor, each pixel is a small thermal IR radiation detector made of thermal IR radiation sensitive material [62]. In this way, thermal cameras obtain the thermal distribution of the scene as thermal images. They are potential alternatives for face detection applications.

Thermal cameras can solve the problems caused by using RGB cameras. 1) using thermal cameras for face detection achieves better privacy protection, because it is harder to recognize the identities of people in thermal images [14]. 2) the temperature distribution of human faces is generally constant and higher than that of backgrounds, as measured in [63]. This advantage solves two problems: i) printed faces can be discriminated from real ones in thermal images, since printed faces have a different temperature distribution from real ones; ii) faces with unusual make-up pigments are easier to detect in thermal images than in RGB images, since make-up pigments do not affect the facial temperature distribution. As a result, a face appears the same in thermal images with or without make-up pigments. 3) thermal cameras are sensitive only to radiation in the thermal IR band, and not to visible light. This means that no matter how the ambient lighting condition changes (e.g. strength, color or direction of light), the facial appearances stay the same in thermal images. This advantage solves the problem that facial appearances change dramatically under different lighting conditions in RGB images.

In this thesis, the author used thermal images for face detection in the optical level anonymous image sensing system. To improve the face detection performance in thermal images, the author adapted the local features in two ways: 1) the author created new feature types by considering the properties of facial regions in thermal images. The author realized the new feature types by extending MB-LBP features [16]. The author considered 2 aspects based on the original MB-LBP features: i) a margin for the encoding process of the response calculation in original MB-LBP features; ii) the generally constant distribution of facial temperature. In this way, the author improved the feature robustness and effectiveness for face detection in thermal images. 2) the author proposed an AdaBoost-based training algorithm to build cascade classifiers containing different feature types with different advantages. The algorithm can build cascade classifiers containing number-type and/or category-type features [17]. In this way, an improved description power can be obtained. This chapter will describe the technical details of the two ways for adapting local features for face detection in thermal images, and discuss the results from a hold-out validation experiment and a field test.

3.2 Adaptation of Local Features

3.2.1 Extension of MB-LBP Features

In this section, the author presents the new feature types extended from MB-LBP for face detection in thermal images. First, the author extends MB-LBP to MB-LTP by considering a margin for better robustness to thermal camera noise. Second, the author extends the two feature types, MB-LBP and MB-LTP, by considering the generally constant distribution of facial temperature to improve the performance of face detection in thermal images. The author discusses all the feature types in the setting of thermal images, and uses the term pixel temperature instead of pixel value, since the pixel value represents the temperature.

From MB-LBP to MB-LTP

Figure 3.1: Examples demonstrating the calculation of MB-LBP and MB-LTP. Each rectangular region has 2×3 pixels. (a) The calculation of MB-LBP. (b) The calculation of MB-LTP with $t_{\text{MB-LTP}} = 0.3\,^{\circ}\mathrm{C}$.

In AdaBoost, MB-LBP features [16] encode the texture around a block, which is defined as a rectangular area of one or more pixels in the image. The encoding is performed by comparing the average pixel temperature of the central block, used as the reference, with those of its 8 surrounding blocks. The feature response on image patch $x_i$ is encoded by:

$$P_{\text{MB-LBP}}(x_i) = \Psi_{q=0}^{7}\big(S_{\text{MB-LBP}}(g_q - g_c)\big), \tag{3.1}$$

where $g_c$ and $g_q$ represent the average pixel temperature of the center block and that of the surrounding block with index $q$, respectively. $S_{\text{MB-LBP}}$ is the labeling function defined by:

$$S_{\text{MB-LBP}}(u) = \begin{cases} 1 & u \ge 0 \\ 0 & u < 0. \end{cases} \tag{3.2}$$

$\Psi_{q=0}^{7}$ is the operator that places the labels with indexes 0 to 7 into a sequence. Any order can be chosen for the index $q$ of the surrounding blocks when encoding. Zhang et al. [16] encoded clockwise from the top-left block, which the author of this thesis follows. Figure 3.1 (a) illustrates the encoding process of a single MB-LBP feature with an example.
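The MB-LBP encoding described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the nested-list image representation, and the choice to pack the first label into the most significant bit are assumptions made for illustration only.

```python
def block_mean(img, x, y, w, h):
    """Average temperature of the w-by-h block whose top-left pixel is (x, y)."""
    total = 0.0
    for row in range(y, y + h):
        for col in range(x, x + w):
            total += img[row][col]
    return total / (w * h)

# (column, row) offsets of the 8 surrounding blocks inside the 3x3 block grid,
# clockwise from the top-left block; the central block sits at offset (1, 1).
CLOCKWISE = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]

def mb_lbp(img, x, y, w, h):
    """Return the 8 binary labels of Eq. (3.1) and an integer packing of them."""
    g_c = block_mean(img, x + w, y + h, w, h)            # reference: central block
    labels = []
    for bx, by in CLOCKWISE:
        g_q = block_mean(img, x + bx * w, y + by * h, w, h)
        labels.append(1 if g_q - g_c >= 0 else 0)        # labeling function, Eq. (3.2)
    code = 0
    for bit in labels:
        code = (code << 1) | bit   # assumption: first label is the most significant bit
    return labels, code

# Toy 3x3 "thermal image" in degrees Celsius, one pixel per block.
toy = [[31.0, 29.0, 30.5],
       [28.0, 30.0, 32.0],
       [30.0, 27.5, 33.0]]
print(mb_lbp(toy, 0, 0, 1, 1))   # ([1, 0, 1, 1, 1, 0, 1, 0], 186)
```

Any fixed bit order works equally well for AdaBoost, since the code is only used as a category index.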

To improve the robustness of MB-LBP features to thermal camera noise, the author extends them by considering a margin around the reference. Three values are used to label the comparison result. A similar idea can be found in the work of Tan [37] and Jia et al. [39]. The difference is that their works calculate a histogram representation, while the proposed feature type is specially designed for the AdaBoost algorithm, which uses a single feature response for classification. Since three values are used for coding the comparison results, the author names the new feature type Multi-Block Local Ternary Pattern (abbreviated as MB-LTP).

The 8 surrounding blocks are encoded for image patch $x_i$ as:

$$P_{\text{MB-LTP}}(x_i) = \Psi_{q=0}^{7}\big(S_{\text{MB-LTP}}((g_q - g_c),\, t_{\text{MB-LTP}})\big). \tag{3.3}$$

The labeling function $S_{\text{MB-LTP}}((g_q - g_c),\, t_{\text{MB-LTP}})$ is defined as:

$$S_{\text{MB-LTP}}(u, t) = \begin{cases} 1 & u \ge t \\ 0 & -t < u < t \\ -1 & u \le -t. \end{cases} \tag{3.4}$$

As before, $g_c$ and $g_q$ represent the average pixel temperature of the center block and that of the surrounding block with index $q$, respectively.
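The effect of the margin can be seen in a small Python sketch that applies the binary and ternary labeling functions to the same block averages. The helper names and the temperature values are hypothetical; the functions take precomputed block averages rather than an image.

```python
def s_lbp(u):
    """Binary label of Eq. (3.2)."""
    return 1 if u >= 0 else 0

def s_ltp(u, t):
    """Ternary label of Eq. (3.4) with margin t >= 0.

    For t == 0 the first and third conditions both hold at u == 0; this sketch
    lets the first one win, which is an assumption about tie-breaking.
    """
    if u >= t:
        return 1
    if u <= -t:
        return -1
    return 0

def mb_lbp_labels(g_c, g_surround):
    return [s_lbp(g_q - g_c) for g_q in g_surround]

def mb_ltp_labels(g_c, g_surround, t):
    return [s_ltp(g_q - g_c, t) for g_q in g_surround]

g_c = 30.0                                                # central block average
clean = [31.0, 29.0, 30.1, 32.0, 33.0, 29.9, 30.0, 28.0]  # 8 surrounding averages
noisy = [31.0, 29.0, 29.9, 32.0, 33.0, 30.1, 30.0, 28.0]  # two blocks shifted by 0.2

# Small perturbations flip MB-LBP labels but leave MB-LTP labels unchanged.
print(mb_lbp_labels(g_c, clean) == mb_lbp_labels(g_c, noisy))            # False
print(mb_ltp_labels(g_c, clean, 0.3) == mb_ltp_labels(g_c, noisy, 0.3))  # True
```

Blocks whose average lies within the margin of the reference receive the label 0, so noise smaller than the margin cannot change the response.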

The margin $t_{\text{MB-LTP}}$ ($t_{\text{MB-LTP}} \ge 0$) is decided for each feature as a feature parameter in the training process by AdaBoost selection, in the same way as the position or size of the feature. Figure 3.2 demonstrates by examples how the parameters of MB-LBP and MB-LTP are decided. When training a strong classifier by AdaBoost, a feature pool with all possible parameter combinations is built for selection. The feature parameter lists in the figure show that, compared with MB-LBP, MB-LTP has one more parameter, the margin. When AdaBoost selects the best feature for the strong classifier under construction, the feature parameters of the selected MB-LBP or MB-LTP feature are decided at the same time.
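As a rough illustration of how such a feature pool could be enumerated, the sketch below lists all parameter combinations for a given detection window. It assumes that the 3×3 block grid must fit entirely inside the window and that candidate margins come from a user-supplied list; both assumptions, and the function name, are illustrative and not taken from the thesis.

```python
def mb_feature_pool(win_w, win_h, margins=None):
    """Enumerate multi-block feature parameters inside a win_w x win_h window.

    Without margins, each entry is (X, Y, W, H) as for MB-LBP.
    With margins, each entry is (X, Y, W, H, M) as for MB-LTP.
    """
    pool = []
    for w in range(1, win_w // 3 + 1):              # block width
        for h in range(1, win_h // 3 + 1):          # block height
            for x in range(win_w - 3 * w + 1):      # top-left corner, horizontal
                for y in range(win_h - 3 * h + 1):  # top-left corner, vertical
                    if margins is None:
                        pool.append((x, y, w, h))
                    else:
                        for m in margins:
                            pool.append((x, y, w, h, m))
    return pool

lbp_pool = mb_feature_pool(6, 6)
ltp_pool = mb_feature_pool(6, 6, margins=[0.1, 0.3, 0.5])   # hypothetical margins
print(len(lbp_pool), len(ltp_pool))   # 25 75
```

The MB-LTP pool is larger than the MB-LBP pool by a factor equal to the number of candidate margins, which is the cost of the extra parameter.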


Figure 3.2: Examples demonstrating the parameter decision processes for MB-LBP and MB-LTP. In the figure, X and Y represent the horizontal and vertical coordinates of the top-left point of an MB-LBP or MB-LTP feature in the image. W and H represent the width and height of a block in an MB-LBP or MB-LTP feature. M is the margin, used only for MB-LTP. (a) AdaBoost selects the best feature from the feature pool of MB-LBP features. When the best feature is selected, the parameters in the MB-LBP parameter list are also decided. (b) AdaBoost selects the best feature from the feature pool of MB-LTP features. When the best feature is selected, the parameters in the MB-LTP parameter list are also decided. Compared with MB-LBP, one more parameter, the margin, is decided for MB-LTP features.

MB-ALBP and MB-ALTP

MB-LBP and MB-LTP compare the average pixel temperature of the central block with those of the surrounding ones. This relative comparison is not always effective for thermal images. For a specific feature, many non-facial patches produce the same responses as facial patches, which means it is impossible to discriminate the facial patches from the non-facial patches by checking their responses. Figure 3.3 (a) illustrates this phenomenon by showing the responses of identically positioned MB-LBP and MB-LTP features in three patches containing different objects. For these identically positioned MB-LBP or MB-LTP features, the responses in the three patches are the same.


Figure 3.3: The design and calculation examples of MB-ALBP and MB-ALTP. (a) The responses of MB-LBP and MB-LTP with the same size and location, with $t_{\text{MB-LTP}} = 0.3\,^{\circ}\mathrm{C}$. The three images contain a male face, a hot cup and a book stand. The white, gray and black blocks in the feature represent the codes 1, 0 and -1. In this situation, MB-LBP and MB-LTP fail to discriminate the face, hot cup, and book stand. (b) The responses of MB-ALBP and MB-ALTP features with the same size and location as those in (a). The examples use the absolute temperature 30 °C as the reference for both MB-ALBP and MB-ALTP, and $t_{\text{MB-ALTP}}$ is 0.3 °C. In this situation, MB-ALBP and MB-ALTP can clearly discriminate the three objects. (c) and (d) are calculation examples of MB-ALBP and MB-ALTP, respectively.

To improve the effectiveness of MB-LBP and MB-LTP for face detection in thermal images, we should consider more properties of the facial regions in thermal images. The research by Ariyaratnam and Rood [63] showed that the facial temperature distribution is generally constant and higher than that of backgrounds, since the blood flow patterns (which are highly related to the tissue temperature) in the faces of different people are quite similar. This suggests a way to improve the discriminative abilities of MB-LBP and MB-LTP. Following this intuition, the author changes the references of both MB-LBP and MB-LTP to an absolute temperature θ. The new feature types are named Multi-Block Absolute LBP (abbreviated as MB-ALBP) and Multi-Block Absolute LTP (abbreviated as MB-ALTP), respectively. In MB-LBP or MB-LTP, feature responses are encoded using the 8 surrounding blocks; specifically, the 8 surrounding blocks are compared with the central one. In MB-ALBP or MB-ALTP, however, an absolute temperature is used as the reference for each feature. This absolute temperature does not belong to any of the nine blocks in an MB-ALBP or MB-ALTP feature. For this reason, to encode more information, all nine blocks in MB-ALBP and MB-ALTP are compared with the reference absolute temperature. The one additional label, for the central block, is placed at the very beginning, before the 8-label clockwise sequence starting from the top-left block that is used in MB-LBP or MB-LTP.

An MB-ALBP feature is encoded by an integer:

$$P_{\text{MB-ALBP}}(x_i) = \Psi_{a=0}^{8}\big(S_{\text{MB-ALBP}}(g_a,\, \theta_{\text{MB-ALBP}})\big), \tag{3.5}$$

where $g_a$ represents the average pixel temperature of the block with index $a$ among the nine blocks. $\theta_{\text{MB-ALBP}}$ is the absolute reference temperature, which is decided in the training process for each feature as one feature parameter. $\Psi_{a=0}^{8}$ encodes the nine blocks with indexes from 0 to 8 using the labeling function $S_{\text{MB-ALBP}}$:

$$S_{\text{MB-ALBP}}(\lambda, \theta) = \begin{cases} 1 & (\lambda - \theta) \ge 0 \\ 0 & (\lambda - \theta) < 0. \end{cases} \tag{3.6}$$

MB-ALTP is encoded in a similar way to MB-ALBP; the difference is that a margin is considered. The encoding process is as follows:

$$P_{\text{MB-ALTP}}(x_i) = \Psi_{a=0}^{8}\big(S_{\text{MB-ALTP}}(g_a,\, \theta_{\text{MB-ALTP}},\, t_{\text{MB-ALTP}})\big), \tag{3.7}$$

where each block has the index $a$. The margin $t_{\text{MB-ALTP}}$ and the absolute reference temperature $\theta_{\text{MB-ALTP}}$ are decided in the training process for each feature as feature parameters. $\Psi_{a=0}^{8}$ encodes the nine blocks with indexes from 0 to 8 using the labeling function:

$$S_{\text{MB-ALTP}}(\lambda, \theta, t) = \begin{cases} 1 & (\lambda - \theta) \ge t \\ 0 & -t < (\lambda - \theta) < t \\ -1 & (\lambda - \theta) \le -t. \end{cases} \tag{3.8}$$

Figure 3.3 (c) and (d) illustrate the encoding processes of MB-ALBP and MB-ALTP by examples. Because MB-ALBP and MB-ALTP consider the generally constant distribution of facial temperature, Figure 3.3 (b) shows that both of them can discriminate the face from the other two objects in the example by their different responses.
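The difference between a relative and an absolute reference can be sketched in Python as follows. The two patches below are invented for illustration: the second is the first shifted down by 10 °C, so the relative (MB-LBP) labels coincide while the absolute (MB-ALBP, MB-ALTP) labels differ. The function names and the convention of listing the central block first are assumptions consistent with the description above.

```python
def s_albp(g, theta):
    """Binary label of Eq. (3.6): compare a block average with the absolute reference."""
    return 1 if g - theta >= 0 else 0

def s_altp(g, theta, t):
    """Ternary label of Eq. (3.8) with margin t around the absolute reference."""
    diff = g - theta
    if diff >= t:
        return 1
    if diff <= -t:
        return -1
    return 0

def mb_albp_labels(g_center, g_surround, theta):
    """Nine labels: central block first, then the 8 surrounding blocks."""
    return [s_albp(g, theta) for g in [g_center] + list(g_surround)]

def mb_altp_labels(g_center, g_surround, theta, t):
    return [s_altp(g, theta, t) for g in [g_center] + list(g_surround)]

def mb_lbp_labels(g_center, g_surround):
    """Relative labels of Eq. (3.2), for comparison."""
    return [1 if g - g_center >= 0 else 0 for g in g_surround]

warm_c, warm_s = 34.0, [33.0, 34.5, 35.0, 33.5, 34.0, 32.5, 35.5, 33.0]
cool_c, cool_s = 24.0, [23.0, 24.5, 25.0, 23.5, 24.0, 22.5, 25.5, 23.0]

print(mb_lbp_labels(warm_c, warm_s) == mb_lbp_labels(cool_c, cool_s))   # True
print(mb_albp_labels(warm_c, warm_s, 30.0))        # all nine labels are 1
print(mb_albp_labels(cool_c, cool_s, 30.0))        # all nine labels are 0
print(mb_altp_labels(cool_c, cool_s, 30.0, 0.3))   # all nine labels are -1
```

A single feature with an absolute reference cannot separate every warm object from a face, which is why the reference temperature is chosen per feature during training.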


Figure 3.4 shows the parameter decision processes of MB-ALBP and MB-ALTP by examples. These processes are similar to those of MB-LBP and MB-LTP demonstrated in Figure 3.2; the differences lie in the parameters. The parameter lists in Figure 3.4 show that, in addition to the location (indicated by X and Y) and the size (indicated by W and H), MB-ALBP has the parameter RT (the reference temperature), while MB-ALTP has both M (the margin) and RT. These parameters are also decided for MB-ALBP or MB-ALTP by the process of AdaBoost selection.

Figure 3.4: Examples demonstrating the parameter decision processes for MB-ALBP and MB-ALTP. In the figure, X and Y represent the horizontal and vertical coordinates of the top-left point of an MB-ALBP or MB-ALTP feature in the image. W and H represent the width and height of a block in an MB-ALBP or MB-ALTP feature. RT represents the reference temperature, while M represents the margin. (a) AdaBoost selects the best feature from the feature pool of MB-ALBP features. When the best feature is selected, the parameters in the MB-ALBP parameter list are also decided. (b) AdaBoost selects the best feature from the feature pool of MB-ALTP features. When the best feature is selected, the parameters in the MB-ALTP parameter list are also decided.

Similar to the advantage of robustness to thermal camera noise of MB-LTP over MB-LBP, we can expect an improved robustness to thermal camera noise from MB-ALTP over MB-ALBP.


Summary of the Multi-Block Feature Types

Figure 3.5: The relationship between the multi-block feature types.

In this chapter, four different multi-block feature types are introduced: MB-LBP, MB-LTP, MB-ALBP and MB-ALTP. The latter three are the feature types proposed in this thesis, which are extended from the MB-LBP features. The relationship among the four feature types is shown in Figure 3.5.

From Figure 3.5 we can see, first, that the MB-LTP features are obtained by considering margins around the reference temperatures (average pixel temperatures of the central blocks) of the MB-LBP features. This margin is decided for each specific MB-LTP feature in the training process. We can regard the MB-LBP features as a subset of the MB-LTP features, since the MB-LTP features consider all possible margins, including the zero margin, which corresponds to the MB-LBP features.

Second, the MB-ALBP features are obtained by replacing the reference temperatures of the original MB-LBP features with absolute temperatures. This absolute temperature is decided for each specific feature in the training process. Even though MB-ALBP features are extended from MB-LBP features by considering absolute temperatures, we cannot regard MB-LBP features as a subset of MB-ALBP features, for two reasons: 1) the two have different dimensions: the MB-LBP features have 8 dimensions, while the MB-ALBP features have 9; 2) the reference temperatures of one specific MB-LBP feature on different image patches might be different, because the average pixel temperatures of the central blocks generally differ between image patches. In contrast, the reference temperature of one specific MB-ALBP feature is always the same for different image patches.

Third, for the same reason that MB-LTP features cover MB-LBP features, MB-ALTP features also cover MB-ALBP features. Furthermore, for the same reason that MB-ALBP features do not cover the MB-LBP features, MB-ALTP features do not cover the MB-LTP features.

3.2.2 Learning Mixed Features

Overview

The author proposed an AdaBoost-based training algorithm to train a cascade classifier with multiple feature types. The input for the algorithm includes a sample pool, and a feature pool. The output is a cascade classifier with a chain of strong classifiers.

With respect to the input, facial and non-facial patches in thermal images are used as positive and negative samples, respectively. Figure 3.6 (a) illustrates the sample pool. To take advantage of the different description power of different feature types, the author mixes Haar-like, HOG, and one or more feature types from the set {MB-LBP, MB-LTP, MB-ALBP, MB-ALTP} into the feature pool. The resulting mixed feature pool is illustrated in Figure 3.6 (b).

With respect to the output, a cascade classifier with a chain of strong classifiers is obtained given the input above. The construction of the cascade classifier consists of three steps: 1) preparing training samples for a strong classifier; 2) training the strong classifier by the AdaBoost algorithm using the samples prepared in 1); 3) appending the strong classifier obtained in 2) to the cascade classifier. These three steps repeat until the preset number of strong classifiers has been appended to the cascade classifier. For step 1), the samples prepared for one strong classifier, both positive and negative, should pass through all the previous strong classifiers already built in the cascade classifier. For example, the samples for training the third strong classifier must pass through the first and second strong classifiers. The negative samples passing through the previous strong classifiers are the hard samples for those classifiers, since only the positive samples are expected to pass through. The samples for training the first strong classifier are randomly selected from all the training samples, since the first one has no previous strong classifier. For step 2), in building one strong classifier, the proposed algorithm repeatedly selects the best feature from the mixed feature pool. Using the selected feature, a weak classifier is built for the strong classifier under construction. The building of a strong classifier is finished when it meets the predefined requirements on minimum detection rate (DR) and maximum false alarm rate (FAR). In this way, a strong classifier may contain multiple feature types. Figure 3.6 (d) illustrates an example strong classifier. Figure 3.6 (c) illustrates an example cascade classifier, which is composed of a chain of strong classifiers.
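The three construction steps can be sketched in Python as below. This is only an outline under stated assumptions: a strong classifier is modeled as any callable returning 0 or 1, `train_strong` is a hypothetical placeholder for the AdaBoost training of step 2), the random subsampling for the first stage is omitted, and the early stop when no samples remain is an added assumption.

```python
def passes_cascade(cascade, x):
    """True if sample x is accepted (prediction 1) by every strong classifier built so far."""
    return all(stage(x) == 1 for stage in cascade)

def prepare_stage_samples(cascade, positives, negatives):
    """Step 1: keep only the samples that pass all previous strong classifiers."""
    pos = [x for x in positives if passes_cascade(cascade, x)]
    neg = [x for x in negatives if passes_cascade(cascade, x)]   # hard negatives
    return pos, neg

def train_cascade(positives, negatives, n_stages, train_strong):
    """Repeat steps 1 to 3 until n_stages strong classifiers are appended."""
    cascade = []
    for _ in range(n_stages):
        pos, neg = prepare_stage_samples(cascade, positives, negatives)
        if not pos or not neg:       # assumption: stop when nothing is left to learn from
            break
        stage = train_strong(pos, neg)   # step 2 (placeholder for AdaBoost training)
        cascade.append(stage)            # step 3
    return cascade
```

With an empty cascade every sample passes, so the first strong classifier sees the whole sample pool in this sketch.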

Figure 3.6: Concept of the mixed feature training. (a) Sample pool example. (b) Mixed feature pool containing different feature types. (c) Resulting cascade classifier. (d) Example of a strong classifier trained by the proposed algorithm. Each feature is the best from the mixed feature pool {Haar-like Features, HOG Features, MB-LBP Features} in its selection iteration. In this way, a strong classifier may contain features of different types.

The proposed training algorithm is similar to that by Viola and Jones [30]. The main difference lies in the optimization process. In the proposed approach, different optimization methods are adopted for the number-type and category-type features, respectively, and the performance of features of the same type as well as of different types needs to be compared. In contrast, the approach by Viola and Jones employs only the optimization method for Haar-like features and selects among features of the same type. The following section describes how to build a strong classifier using the proposed approach.

Building One Strong Classifier

A strong classifier built by the AdaBoost algorithm consists of many voters, with each voter including a weak classifier and its weight. This weight is determined in the training process by AdaBoost. A weak classifier includes a feature of one feature type and the prediction function for that type. The feature and the parameters of the prediction function are also determined in the training process by AdaBoost. In the training stage, AdaBoost repeats a process of three steps: 1) assigning weights to the training samples; 2) selecting the best feature under the sample weights and determining the parameters of the prediction function for the selected feature, so that this feature and its prediction function form one weak classifier; 3) determining the weight of the weak classifier, so that this weak classifier and its weight form one voter. One pass through these three steps is defined as one iteration.

When a weak classifier classifies a sample, it first calculates the feature response, and then applies the prediction function to the response to obtain the prediction. There are two types of weak classifiers, associated with the two feature types: 1) number-type feature; 2) category-type feature.

For a weak classifier containing a number-type feature $m$, the prediction function $h_m$ on sample $x_i$ is

$$h_m(x_i) = \begin{cases} 0 & d_m R_m(x_i) < d_m T_m \\ 1 & \text{otherwise,} \end{cases} \tag{3.9}$$

where $R_m(x_i)$ represents the real-number response of the feature $m$ on the sample $x_i$, and $T_m$ is the threshold of the feature $m$. This threshold is decided in the training process. $d_m \in \{-1, 1\}$ is a directional factor indicating the direction of the inequality sign. Its value, -1 or 1, is also decided in the training process [30], since it is necessary to control the direction of the inequality.

To determine $(d_m, T_m)$ for a number-type feature $m$ in the training stage, the method in [30] is used. To be specific, the optimum $(d_m, T_m)$ is obtained by:

$$(d_m, T_m) = \arg\min_{d,\,T} \sum_{x_i \in \Omega} w_{k,i}\, \big|\varepsilon[d(R_m(x_i) - T)] - y_i\big|, \tag{3.10}$$

where $\Omega$ represents all the training samples and $w_{k,i}$ is the weight of sample $x_i$ at iteration $k$. $y_i = 0$ for negative samples and $y_i = 1$ for positive samples. $\varepsilon$ is defined as:

$$\varepsilon(\lambda) = \begin{cases} 1 & \lambda \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.11}$$
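A brute-force version of the optimization in Eq. (3.10) can be sketched in Python as follows. The sketch assumes that candidate thresholds are taken from the observed responses, which is one common choice and not necessarily the one used in [30] or in this thesis; the function names are illustrative.

```python
def eps(v):
    """Step function of Eq. (3.11)."""
    return 1 if v >= 0 else 0

def predict_number(response, d, T):
    """Prediction function of Eq. (3.9) for a number-type feature."""
    return 0 if d * response < d * T else 1

def fit_number_feature(responses, labels, weights):
    """Search (d, T) minimizing the weighted error of Eq. (3.10).

    Returns (error, d, T). Candidate thresholds are the distinct observed
    responses (an assumption of this sketch).
    """
    best = None
    for T in sorted(set(responses)):
        for d in (1, -1):
            err = sum(w * abs(eps(d * (r - T)) - y)
                      for r, y, w in zip(responses, labels, weights))
            if best is None or err < best[0]:
                best = (err, d, T)
    return best

responses = [0.2, 0.4, 0.6, 0.8]      # toy feature responses R_m(x_i)
labels = [0, 0, 1, 1]                 # y_i
weights = [0.25, 0.25, 0.25, 0.25]    # w_{k,i}
print(fit_number_feature(responses, labels, weights))   # (0.0, 1, 0.6)
```

The directional factor lets the same search handle features whose responses are larger on positives as well as features whose responses are smaller on positives.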

For a weak classifier containing a category-type feature $n$, the prediction function $h_n$ on sample $x_i$ is

$$h_n(x_i) = LUT_n(P_n(x_i)), \tag{3.12}$$

where $P_n(x_i)$ represents the response of the feature $n$ on the sample $x_i$ as a high dimensional vector. $LUT_n$ represents the lookup table operator for feature $n$. Given a high dimensional vector $P_n(x_i)$, $LUT_n$ outputs the binary prediction result 0 or 1 for a negative or positive sample, thus $LUT_n(P_n(x_i)) \in \{0, 1\}$. For all the possible responses of a category-type feature $n$, the $LUT_n$ used for prediction can be expressed by:

$$LUT_n(P_n(x_i)) = \begin{cases} a_0 & I(P_n(x_i)) = 0 \\ \;\vdots & \\ a_d & I(P_n(x_i)) = d, \end{cases} \tag{3.13}$$

where $a_0, \ldots, a_d \in \{0, 1\}$ are the binary prediction results, 0 for negative and 1 for positive samples. $I(P_n(x_i))$ is the index number of the response vector $P_n(x_i)$ among the $d + 1$ kinds of response vectors in total. For example, MB-LBP encodes the 8 blocks surrounding the center one using 0 or 1, so there are $2^8 = 256$ kinds of response vectors in total, thus $d = 255$ for MB-LBP. The function $I$ assigns an index from 0 to 255 to each kind of response vector. The purpose of this index is to differentiate those responses.

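Determining $LUT_n$ for a category-type feature in the training stage is equivalent to determining $a_0, \ldots, a_d$ in Equation (3.13) for all kinds of response vectors with indexes from 0 to $d$. To determine $LUT_n$, the method in [16] is used. To be specific, each entry $a_j$ ($j = 0, \ldots, d$) is determined as:

$$a_j = \varepsilon\left( \frac{\sum_{x_i \in \Omega} w_{k,i}\, y_i\, \delta\big(I(P_n(x_i)) - j\big)}{\sum_{x_i \in \Omega} w_{k,i}\, \delta\big(I(P_n(x_i)) - j\big)} - \frac{1}{2} \right), \tag{3.14}$$

where

$$\delta(\lambda) = \begin{cases} 1 & \lambda = 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.15}$$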
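Equation (3.14) amounts to a weighted majority vote per response index, which the following Python sketch makes explicit. The function name and the handling of response indexes that never occur in the training set (predicted as 0 here, since the ratio in Eq. (3.14) is undefined for them) are assumptions of this sketch.

```python
def fit_lut(indices, labels, weights, n_codes):
    """Build the lookup table of Eq. (3.13) from weighted training samples.

    indices[i] is I(P_n(x_i)), labels[i] is y_i, weights[i] is w_{k,i}.
    """
    pos_mass = [0.0] * n_codes   # weighted mass of positive samples per index
    all_mass = [0.0] * n_codes   # weighted mass of all samples per index
    for idx, y, w in zip(indices, labels, weights):
        all_mass[idx] += w
        pos_mass[idx] += w * y
    lut = []
    for p, s in zip(pos_mass, all_mass):
        if s > 0 and p / s - 0.5 >= 0:   # epsilon(ratio - 1/2), Eq. (3.14)
            lut.append(1)
        else:
            lut.append(0)                # includes unseen indexes (assumption)
    return lut

indices = [0, 0, 1, 1, 2]             # toy response indexes
labels = [1, 0, 1, 1, 0]
weights = [0.3, 0.1, 0.2, 0.2, 0.2]
print(fit_lut(indices, labels, weights, n_codes=4))   # [1, 1, 0, 0]
```

For MB-LBP, `n_codes` would be 256, matching $d = 255$ in the text.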

After $(d_m, T_m)$ and $LUT_n$ are decided for all number-type and category-type features in the mixed feature pool, the best feature $v$ is selected from it. The rule for selection is to minimize the error:

$$e_l = \sum_{i=1}^{r} w_{k,i}\, |h_l(x_i) - y_i|, \tag{3.16}$$

where $h_l$ is the prediction function of feature $l$ in the mixed feature pool. In minimizing $e_l$ in (3.16), the prediction functions (3.9) and (3.12) are used for number-type and category-type features, respectively. By minimizing $e_l$, the algorithm finds the best feature $v$ in the mixed feature pool as:

$$v = \arg\min_{l} e_l. \tag{3.17}$$
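Because Eq. (3.16) depends only on the binary predictions, features of different types can be compared on the same footing. The Python sketch below assumes that each candidate weak classifier has already been fitted and is represented by its list of predictions on the training samples; the feature names are hypothetical.

```python
def weighted_error(predictions, labels, weights):
    """Weighted classification error of Eq. (3.16)."""
    return sum(w * abs(h - y) for h, y, w in zip(predictions, labels, weights))

def select_best(candidates, labels, weights):
    """Return (name, error) of the candidate with minimal error, Eq. (3.17).

    candidates maps a feature name to its predictions h_l(x_i) on all samples.
    """
    best_name, best_err = None, None
    for name, preds in candidates.items():
        err = weighted_error(preds, labels, weights)
        if best_err is None or err < best_err:
            best_name, best_err = name, err
    return best_name, best_err

labels = [1, 1, 0, 0]
weights = [0.25, 0.25, 0.25, 0.25]
candidates = {
    "haar_17": [1, 0, 0, 0],     # number-type feature, one miss
    "hog_4": [1, 1, 0, 1],       # number-type feature, one false alarm
    "mb_albp_5": [1, 1, 0, 0],   # category-type feature, no errors
}
print(select_best(candidates, labels, weights))   # ('mb_albp_5', 0.0)
```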

In summary, constructing a weak classifier involves two tasks: 1) determining (d_m, T_m) or LUT_n, depending on whether the feature is a number-type or category-type feature; 2) after (d_m, T_m) and LUT_n are determined for all the features in the mixed feature pool, selecting the best feature v from the mixed feature pool. The strong classifier is built by repeatedly adding voters. The whole algorithm for building a strong classifier is described in detail as follows:

• Input:

1. $r$ training samples $\{(x_i, y_i)\}$, where $y_i = 0$ for the $s$ negative samples and $y_i = 1$ for the $t$ positive samples.

2. Mixed feature pool: $MFP = \{l\}$.

3. User-defined training parameters: minimum detection rate (DR) and maximum false alarm rate (FAR) for one strong classifier.

• Output:

1. Feature set $F = \{m\} \cup \{n\}$ and its associated voter set $U = \{(m, d_m, T_m, \alpha_m)\} \cup \{(n, LUT_n, \alpha_n)\}$, where $\{(m, d_m, T_m, \alpha_m)\}$ is for number-type features and $\{(n, LUT_n, \alpha_n)\}$ is for category-type features. $\alpha_m$ and $\alpha_n$ represent the weights of the weak classifiers with features $m$ and $n$, respectively.

2. A strong classifier built from $U$ with a trained threshold $T$. Its prediction function $H$ on sample $x_i$ is:

$$H(x_i; U) = \begin{cases} 1 & \sum_{f \in F} \alpha_f h_f(x_i) \ge T \\ 0 & \text{otherwise.} \end{cases} \tag{3.18}$$

• Step 1 Initialization:

1. $k := 1$.

2. Initialize the sample weights $w_{1,i} = \frac{1}{2t}$ and $\frac{1}{2s}$ for positive and negative samples, respectively.

3. Initialize the feature set F₁ =∅, and voter set U₁ =∅.

• Step 2 Strong Classifier Building:

1. Normalize sample weights so that their sum equals 1:

$$\tilde{w}_{k,i} \leftarrow \frac{w_{k,i}}{\sum_{j=1}^{r} w_{k,j}}. \tag{3.19}$$

2. Obtain the weak classifier with feature $v$ by optimization. The feature $v$ has the minimal error $e_v$ in the mixed feature pool:

$$\begin{cases} (v, d_v, T_v) = \arg\min_{l \in MFP} e_l, & \text{if } v \text{ is a number-type feature,} \\ (v, LUT_v) = \arg\min_{l \in MFP} e_l, & \text{if } v \text{ is a category-type feature.} \end{cases} \tag{3.20}$$

3. Determine the weight of the weak classifier with feature $v$ using

$$\alpha_v = \ln\frac{1 - e_v}{e_v}. \tag{3.21}$$

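4. $F_{k+1} = F_k \cup \{v\}$, and add the voter to the voter set $U_k$:

$$\begin{cases} U_{k+1} = U_k \cup \{(v, d_v, T_v, \alpha_v)\}, & \text{if } v \text{ is a number-type feature,} \\ U_{k+1} = U_k \cup \{(v, LUT_v, \alpha_v)\}, & \text{if } v \text{ is a category-type feature.} \end{cases} \tag{3.22}$$

5. Update the weights of all training samples for the current strong classifier: $w_{k+1,i} \leftarrow \tilde{w}_{k,i}\, \beta^{1-\lambda}$, where $\lambda = 0$ if the sample $x_i$ is correctly classified by the weak classifier with feature $v$, otherwise $\lambda = 1$, and $\beta = \frac{e_v}{1 - e_v}$.

6. $k \leftarrow k + 1$.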

• Step 3 Stop Condition Checking:

1. Check the currently built strong classifier to decide whether it is finished. The voting result of the current strong classifier built from $U_k$ on sample $x_i$ is calculated by $\sum_{f \in F_k} \alpha_f h_f(x_i)$. Sort the voting results of all training samples from small to large, and find the minimum value $T$ where the detection rate satisfies DR.