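Correction Confirmation Report

Correction approved: September 18, 2019
Correction requested: September 7, 2018

Title: Human Detection Algorithm Based on Discriminative Local Feature

Author: Jiu XU

Reported by: Shinji Kimura (木村晋二), Chair of the Doctoral Dissertation Correction Working Group, Integrated Systems Field

Confirmed by: Osamu Yoshie (吉江修)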

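In light of Article 23, Paragraph 1 of the Degree Regulations, this dissertation does not warrant revocation of the degree; however, passages requiring correction were identified. The results of confirming the corrections made by the author are reported below.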

1. Content before and after modification

(1) Location: Page I, line 2-9, paragraph 1 in section abstract. Correction type: text revision.

Details:

Human detection is one of the most active research areas in computer vision and is quite useful for many applications, such as smart vehicles and video surveillance systems. It is also an important preliminary step for content analysis, such as for behavior recognition and human tracking. Due to the variations in human body poses and clothing together with as well as those in cluttered backgrounds and environmental conditions, human detection remains challenging work. In particular, because of its huge computational workload, real-time human detection now seems the most critical step in intelligent video surveillance systems.

In computer vision, human detection is one of the most active fields of research and is core for lots of applications, such as smart vehicles and video surveillance systems. It is also an important preliminary step for content understanding, such as for behavior analysis and human tracking. Because of the variations in human body poses and clothing together with as well as those in cluttered backgrounds and environmental conditions, human detection remains challenging. In particular, due to its huge computational complexity, real-time human detection now seems to be the most critical step in intelligent video surveillance systems.

(2) Location: Page 2, line 8-11, paragraph 7 in section 1.1. Correction type: text revision.

Details:

Object detection, is an important extension of top-down based image segmentation. With the development of high-powered computers, the usability of high quality and inexpensive video cameras, and the increasing requirement for automated video analysis, object detection has generated plenty of interest.

Object detection, is an important extension of top-down based image segmentation. With the development of high-performance computers, the converge of high-resolution video surveillance cameras, and the increasing requirement for intelligent video analysis, object detection has attracted lots of interest.

(3) Location: Page 3, line 5 to page 5, line 12, all paragraphs in section 1.1.1. Correction type: text revision.

Details:

A common method is based on discriminative features. In feature-based object detection, one or multiple features are extracted to describe and model the specific objects of interest. The standardization of features is important in this method because there are lots of other problems such as the light changes, objects change in attitude, objects nonlinear distortion, noise and interference on background, in this way, searching for some powerful and robust features is quite necessary in improve the tracking performance. … (中略) …

Because of its simplicity and accuracy, the most popular edge detection approach is the Canny Edge detector.

A common method is based on discriminative features. In feature-based object detection, one or multiple features are extracted to describe and model the specific objects of interest. The standardization of features is important in this method because there are lots of other problems such as objects pose change and distortion, illumination variations, and background noise and occlusions. In this way, searching for some powerful and robust features is quite necessary in improve the tracking performance. Feature selection and object representation are highly relevant.

Generally, the most desirable attributes of visual features are their uniqueness, so that objects in the feature space can be easily identified. For instance, object edges are often utilized as features for contour-based representations, whereas color is commonly used as features for histogram-based appearance representation. On top of using a single feature as the representation, lots of tracking algorithms utilize a combination of different features. The following paragraphs introduce several common visual features:

Color: One of the most frequently-used features is color. Color space is a multi-dimensional space in which different dimensions represent different components. To be more specific, in the field of image processing, a linear combination of three main components including R (red), G (green), B (blue) is used for the color representation. However, this RGB space is not perceptually uniform, which means that the differences between the colors in RGB space do not correspond to human perception of color differences. Therefore, perceptually uniform color spaces have been proposed, such as CIELUV and HSV (Hue, Saturation, Value); however, these color spaces are noise-sensitive. All in all, the color feature is a very effective feature, so a variety of color spaces have been used for detection and tracking.

Texture: Another effective and noticeable feature is the image texture. Texture is a visual feature which reflects the homogeneity of the image. It also describes the material structure of the object surface with slowly changing or periodically changing properties. Local binary pattern (LBP) can be considered as the most popular texture feature. This feature could extract the local texture information by calculating the relationship between each pixel and its neighboring pixels in an intensity image. LBP feature has been broadly applied to face recognitions, texture classification, optical character recognition, and etc. Some other texture-based features, such as Gabor filter and Gray-Level Co-occurrence Matrices (GLCM) are also widely used.

Gradient: Image gradient is also frequently used to extract representative information in the image processing filed. The definition of image gradient is the change in direction of intensity or color in an image. Image gradient can be seen as two-dimensional image discrete function. A common way for image gradient calculation is to compute the horizontal derivative by subtracting each pixel value from the one on its right, then compute the vertical derivative by subtracting each pixel value from the one below it.

The most popular gradient-based feature is called Histogram of Oriented Gradient (HOG). The basic idea of HOG is to use gradient information to reflect the edge information of image objects and to characterize the appearance and shape of the image by the local gradient. Similar to the edge orientation histogram and shape context, HOG calculates pixel-wise statistical information of the image; however, instead of using a full image scan, HOG descriptor is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization techniques to improve performance.

Edge: An edge is a curve along the path of rapidly changing image intensities inside the image. Significant local changes in image properties are often associated with object boundaries and the purpose of edge detection is to identify these boundaries. Compared to color features, edge features are less sensitive to contrast variations, which makes the edge as a representative feature for object detection and tracking. Image edge detection greatly reduces the amount of data and eliminates information that can be considered irrelevant, while it still preserves the important structural attributes of the image. There are many ways for edge detection, but most of them can be divided into two categories, search-based and zero-crossing based.

The search-based approach detects the boundary by finding the maximum and minimum values in the first derivative of the image, often locating the boundary in the direction of the largest gradient. The zero-crossing-based approach looks for the boundary by looking for zero-crossing of the second derivative of the image, usually a zero crossing of the Laplacian zero-crossing or of a non-linear differential representation.

(4) Location: Page 5, line 18 to page 6, line 2, paragraph 2, 3 in section 1.1.2. Correction type: text revision.

Details:

Chamfer matching is an effective and widely used technique for detecting objects or parts by means of their shape. Matching process contains two steps, translating and positioning the template at various locations of the search image. A matching measure is determined by the pixel values of the search image which lie under the data pixels of the transformed template. Lower values represent higher similarity between search image and template at this location. If the average distance value of a certain position lies below a predefined threshold, the target object is considered detected.

A recent popular method appears as deformable part model based template matching. This method represents objects in terms of mixtures of deformable part models. These models are trained offline based on a discriminative method that only requires bounding boxes for the objects in an image.

Chamfer matching is an effective and broadly applied method for localizing objects.

Matching process contains two steps, placing and translating the template at potential positions of the search image. A matching criterion is determined by the pixel values of the search image which lie under the data pixels of the transformed template. Smaller values represent higher similarity between search image and template at this location. When the mean distance value of a certain position lies below a predefined threshold, the target object is considered as being found.

A recent popular method appears as deformable part model based template matching. This method represents objects in terms of mixtures of deformable part models. These models are trained offline based on a discriminative method that only requires bounding boxes for the objects in an image.
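The matching procedure quoted above (slide the template over the search image, average the distance values under the template's data pixels, accept positions whose mean falls below a threshold) can be sketched as follows. This is only an illustrative sketch: the brute-force distance map, the function names, and the strict `<` comparison are assumptions and are not taken from the dissertation.

```python
import math

def distance_map(edges):
    """Distance from every pixel to the nearest edge pixel (brute force).

    Assumption: `edges` is a non-empty binary grid with at least one edge pixel.
    """
    pts = [(x, y) for y, row in enumerate(edges) for x, v in enumerate(row) if v]
    return [[min(math.hypot(x - ex, y - ey) for ex, ey in pts)
             for x in range(len(edges[0]))]
            for y in range(len(edges))]

def chamfer_score(dist, template_pts, ox, oy):
    """Mean distance value under the template's data pixels at offset (ox, oy)."""
    total = sum(dist[oy + ty][ox + tx] for tx, ty in template_pts)
    return total / len(template_pts)

def find_matches(edges, template_pts, threshold):
    """Return every offset whose mean distance lies below the threshold."""
    dist = distance_map(edges)
    height, width = len(edges), len(edges[0])
    t_w = max(tx for tx, _ in template_pts) + 1
    t_h = max(ty for _, ty in template_pts) + 1
    return [(ox, oy)
            for oy in range(height - t_h + 1)
            for ox in range(width - t_w + 1)
            if chamfer_score(dist, template_pts, ox, oy) < threshold]
```

Smaller scores mean the template lies closer to image edges, which matches the statement that smaller values represent higher similarity.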

(5) Location: Page 6, line 8-20, paragraph 1, 2 in section 1.1.3. Correction type: text revision.

Details:

Any significant change in an image region from the background model signifies a moving object. The pixels constituting the regions undergoing change are marked for further processing.

Usually, a connected component algorithm is applied to obtain connected regions corresponding to the objects. This process is referred to as the background subtraction.

One of the most popular methods is the Gaussian Mixture Model (GMM) proposed in . A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model.

Given a background mode, in most cases, significant image variations in a region imply moving objects. All the pixels in those area is marked as potential regions; then a connect component approach is applied to acquire the connected boundary of the objects. Such kind of pipeline is referred to as the background subtraction or the foreground extraction method.

One of the most popular methods is the Gaussian Mixture Model (GMM) proposed in . GMM is a type of parametric model that the probability density is modeled as a weighted sum of multiple Gaussian components. In applications including the image segmentation and the image recognition, GMM has been widely applied for modelling the feature probability distributions.

The maximum-a-posteriori (MAP) approach is utilized for estimating the GMM parameter from well-trained prior model while statistical expectation-maximization (EM) approach is often applied for the parameter estimation of GMM from the training dataset iteratively.
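As a rough illustration of the "weighted sum of multiple Gaussian components" described above, the sketch below evaluates a one-dimensional GMM density. The function names, the restriction to scalar inputs, and the example parameters are assumptions made for illustration; parameter estimation by EM or MAP is not shown.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a single one-dimensional Gaussian component."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, weights, means, stds):
    """Weighted sum of Gaussian component densities.

    Assumption: the weights are non-negative and sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))

# Hypothetical two-component mixture evaluated at x = 0.
density = gmm_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```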

(6) Location: Page 9, line 2-7, paragraph 1 in section 1.2.1. Correction type: text revision.

Details:

For an intelligent video surveillance system, the detection of a human being is important for numerous fields like abnormal event detection, gait characterization, people counting, person identification and tracking, gender classification or fall detection of elderly people. Particularly, in recent years, due to the increase in terrorist activities and general social problems, it becomes the top most priorities for almost all the nations to provide security to their citizens.

For an intelligent video surveillance system, the localization of a human being is meaningful for numerous fields like people analysis (gender classification, behavior recognition, counting, identification, tracking), abnormal event detection or fall detection of elderly people. Particularly, in recent years, because of the raise of terrorist activities and general social problems, it becomes one of the highest priority for all the nations to provide security to their citizens.

Details:

In another case, since the rear mirrors cannot provide a full view of the scene behind the vehicle, pedestrians might be knocked down when the drivers diving in reverse. This situation becomes worse for child due to their lower height. It is reported in that compared with occupant injuries, pedestrian injuries are more severe, with a fivefold higher likelihood of death among those injured. Stable human detector can significantly reduce such kind of accident.

In another case, since the scene behind the vehicle provided by the rear mirrors is limited, pedestrians might be knocked down when the drivers diving in reverse. This situation becomes worse for child due to their lower height. It is reported in that compared with passenger injuries, pedestrian injuries are more severe, with a five-time higher death likelihood than injury rate.

Stable pedestrian detector can help to avoid the accident possibility significantly.

Details:

Toshiba has already produced a image processor named Visconti™, Fig.4. shows its applications. The powerful Visconti™ Series image recognition processor ICs are used in camera-based vision systems for advanced driver assistance systems (ADAS). Visconti changes the way people drive by detecting and recognizing target objects such as humans, faces, hands, vehicles and processes image data from multiple camera inputs frame-by-frame and then outputs results to an LCD panel. The Visconti Series can monitor and detect a wide range of driving conditions such as rear and front pedestrian detection, driver authentication and monitoring, and lane departure warning. Once again, human detection is key part for such kind of applications.

Toshiba has already produced a image processor named Visconti™, Fig. 4 shows its applications. This powerful image recognition processor is one of the first integrated chips applied to visual surveillance camera systems for advanced driver assistance systems (ADAS).

The technologies behind this processor are target objects (including face, hand, vehicle, and human) detection and recognition. The system takes the image data captured by multiple cameras as inputs and outputs the recognition results to a LCD panel. The system could provide useful assistance to the drivers for better monitoring and understanding the driving conditions, such as driver authentication and monitoring, front and rear pedestrian detection, and lane departure warning. Again, human detection is most important part for such kind of applications.

(9) Location: Page 12, line 2-15, all paragraphs in section 1.2.3. Correction type: text revision.

Details:

Consider the image content analysis system, for example, the large volume and variety of digital images currently acquired and used in different application domains has given rise to the requirement for intelligent image management and retrieval techniques. It is very impossible to search and manage these images manually due to the huge amount. In this way, there is an increasing need for the development of automated image content analysis and description techniques in order to retrieve images efficiently from large collections, based on their visual content.

Large collections of images can be found in many application domains such as journalism, advertising, entertainment, and navigation. Among these types of application, human being is no doubt to be one of the most common objects inside the images related to people daily life. For films and videos, the human detection will form an complete component of applications for automatic content management. Together with face and behavior recognition, human detection can help the investigation of some particular contents to a great extent.

Take an image content analysis system as an example, extremely large volumes of digital image are obtained and applied to various applications every day, which raises the needs of content-based intelligent image retrieval and management technologies. It is not practical and almost impossible to manually search and annotate these images because of its huge amount.

Therefore, developing automated image content-understanding systems is inevitable, and the demand keeps increasing with more and more image-processing applications.

In daily life, it is easy to find a great number of images everywhere, such as social networks, e-commerce website, online advertising and entertainment, and navigation systems. Among these applications, human beings are one of the most common objects in the images. Therefore, human detection can be a complete component of applications for automatic content management of videos. Moreover, apart from face recognition and behavior understanding, human detection can help the better investigation of some particular contents.

Details:

Recent years, although many research focus on the human detection, the detection rate is still not satisfied to the practical application. Different from the face detection system, which has already been generally used in commercial product like mobile phones or digital cameras.

In recent year, although lots of researchers have been working on human detection, the detection rate is not yet satisfied to the practical application. Different from the face detection system, which has already been generally commercialized in products like mobile phones or digital cameras.

Details:

It is hard for the system to be robust enough to be able to work in uncontrolled environment conditions (indoor and outdoor) with varying lighting (day-time and night-time) and a cluttered background, as required by most real-world applications. On the another hand, if some parts of the human bodies are occluded, the features corresponding to the occluded area are inherently noisy and will deteriorate the classification result.

It is hard for the system to be robust enough to be able to work in uncontrolled environment conditions (indoor and outdoor) with lighting variations (day-time and night-time) and a cluttered background, as essential for most applications in real-world. On the other hand, if some parts of the human bodies are occluded, the features corresponding to the occluded regions are inherently noisy and therefore the classification result will be deteriorated.

Details:

The dissertation concentrates on how to detect the human from the images and videos.

Robust features for human detection are discussed in order to improve the detection rate.

The dissertation focuses on how to localize human out of the videos and images.

Discriminative local features for human detection are studied in order to enhance the detection performance.

(13) Location: Page 15, line 7-10, paragraph 6 in section 1.4. Correction type: text revision.

Details:

Results on both INRIA and Daimler datasets show that our NRGSLBP based linear detector achieves the best detection rate compared with other linear detectors. Besides, the calculation complexity of proposed LBP based method is considerably low.

Results on both INRIA and Daimler datasets prove that our NRGSLBP based detector outperforms all with other linear detectors in terms of detection rate. Besides, the calculation workload of proposed LBP based method is considerably low.

Details:

Pedestrian detection, in particular, is widely applied in intelligent vehicle and visual surveillance systems. The primary aim of human detection is to locate the positions of human forms by using bounding boxes to distinguish them from the background images.


Pedestrian detection, in particular, is widely applied in smart vehicle and visual monitoring systems. The principal aim of detecting human is to localize the positions of human forms by using bounding boxes to distinguish them from the background images.

Details:

Part-based deformable models are parameterized by the appearance of each part of human body and a geometric model capturing spatial relationships among parts. Fig.8. gives the human model used in this method. For generative models one can learn model parameters using maximum likelihood estimation. Moreover, a general method for building cascade classifiers from part-based deformable models such as pictorial structures is described in to reduce the calculation complexity.

Part-based deformable models are parameterized by the appearance of each part of human body and a geometric model that captures the spatial relationships among parts. Fig.8. gives the human model used in this method. For such kind of generative models, the maximum likelihood estimation can be applied to learn the model parameters. In addition, a general approach for building cascade classifiers on the top of part-based deformable models is described in to reduce the calculation complexity.

Details:

Compared with the sub-windows based methods, the computation and time consumption of part based detectors is significantly larger. However, part based methods can achieve better performance, especially under occlusion situations, because the feature values with the occlusion part will not deteriorate the classification result.

In comparison to sub-windows based methods, the computational complexity of part based detectors is significantly larger. However, part based methods could achieve better performance, especially under occlusion situations. The reason is that the classification result is not deteriorated due to the feature value of the occlusion area.

(17) Location: Page 20, line 13 to page 23, line 2, all paragraphs in section 2.1.1. Correction type: text revision.

Details:

Histogram of Oriented Gradient: HOG was derived from the scale-invariant feature transform (SIFT) method and is used to describe the body’s shape information via gradient direction histogram on small pieces of an image. …(中略)…

Here T is a small value used as a threshold for difference in intensity values in order to increase the robustness of the CS-LBP method on flat image regions.

Histogram of Oriented Gradient: HOG was derived from the scale-invariant feature transform (SIFT) method and is adopted to describe the body’s shape information via gradient direction histogram on sub-block of an image. The image is first gridded into overlapping blocks with four cells apiece. For every single pixel I in (x, y) position, orientation (θ) and magnitude (m) of its gradient are computed as follows:

$$dx = I(x + 1, y) - I(x - 1, y) \tag{1}$$

$$dy = I(x, y + 1) - I(x, y - 1) \tag{2}$$

$$\theta(x, y) = \tan^{-1}(dy/dx) \tag{3}$$

$$m(x, y) = \sqrt{dx^2 + dy^2} \tag{4}$$

For each cell, HOG generates a histogram of angle distribution for the pixels in that cell. The height of each bin is the sum of magnitudes for all the pixels fall within the corresponding angle.

As there are nine directions, each cell can be represented by a 9-dimensional feature vector. In our implementations, each block has four cells; therefore each block is represented by a 36-dimensional feature vector. Note that window is resized to 64 × 128.
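The per-pixel gradient of Eqs. (1)-(4) and the magnitude-weighted cell histogram can be sketched as below. This is a minimal sketch under stated assumptions: `atan2` stands in for the inverse tangent of dy/dx so that dx = 0 does not divide by zero, the nine directions are taken to be equal-width unsigned bins over [0, π), and the image is a plain list of rows. None of these choices is confirmed by the dissertation, and block normalization is omitted.

```python
import math

def gradient_at(img, x, y):
    """Orientation and magnitude at an interior pixel (x, y), Eqs. (1)-(4)."""
    dx = img[y][x + 1] - img[y][x - 1]
    dy = img[y + 1][x] - img[y - 1][x]
    theta = math.atan2(dy, dx)  # assumption: atan2 replaces tan^-1(dy/dx)
    m = math.hypot(dx, dy)
    return theta, m

def cell_histogram(img, x0, y0, size, bins=9):
    """Sum gradient magnitudes into orientation bins for one square cell.

    Assumption: the cell lies at least one pixel inside the image border.
    """
    hist = [0.0] * bins
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            theta, m = gradient_at(img, x, y)
            theta %= math.pi  # fold to an unsigned orientation in [0, pi)
            b = min(int(theta / math.pi * bins), bins - 1)
            hist[b] += m
    return hist
```

Concatenating the histograms of the four cells in a block gives the 4 × 9 = 36 values mentioned above.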

Covariance Matrix: COV first calculates the feature for each pixel. There are eight features being considered, which are listed in the following formula:

$$\left[x,\ y,\ |I_x|,\ |I_y|,\ \sqrt{I_x^2 + I_y^2},\ |I_{xx}|,\ |I_{yy}|,\ \arctan\frac{|I_x|}{|I_y|}\right]^T \tag{5}$$

where $x$, $y$ represent the coordinates of the pixel. $I_x$, $I_y$, $I_{xx}$ and $I_{yy}$ are the first and second partial derivatives of the magnitudes. The last term represents the edge orientation. If we denote the maximum index of the target region R as S, the covariance matrix is calculated as:

$$C_R = \frac{1}{S-1}\sum_{i=1}^{S}(Z_i - \mu)(Z_i - \mu)^T \tag{6}$$

where $Z_i$ ($i = 1, \ldots, S$) is the feature vector for the pixel inside the target region, $\mu$ is the statistical mean of $Z_i$, and S is the number of these vectors. The covariance matrix is symmetrical by definition, thus for time and space efficiency, only the upper diagonal part will be calculated.
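Eq. (6) can be illustrated with a short sketch that computes only the upper triangle and mirrors it, as the text suggests. The function name and the list-of-lists representation are assumptions for illustration; the feature vectors would be the eight-dimensional vectors of Eq. (5), but any dimension works here.

```python
def covariance_matrix(features):
    """Sample covariance of S feature vectors, following Eq. (6).

    Assumption: `features` holds at least two vectors of equal length.
    """
    s = len(features)
    d = len(features[0])
    mu = [sum(z[k] for z in features) / s for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(a, d):  # upper triangle only
            acc = sum((z[a] - mu[a]) * (z[b] - mu[b]) for z in features)
            cov[a][b] = acc / (s - 1)
            cov[b][a] = cov[a][b]  # mirrored for convenience, not recomputed
    return cov
```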

Histogram of Template: Tang proposed the HOT method in . The block size and stride are set to the same parameters as for the HOG method. For all pixels within one 16 × 16 block, 8 templates (see Fig.9.) and 4 formulas are used to obtain a 32-dimensional vector with value 0 or 1.

$$I(P) > I(P1) \;\&\&\; I(P) > I(P2) \tag{7}$$

$$k == \arg\max_i \{I(P_i) + I(P1_i) + I(P2_i)\} \tag{8}$$

$$Mag(P) > Mag(P1) \;\&\&\; Mag(P) > Mag(P2) \tag{9}$$

$$k == \arg\max_i \{Mag(P_i) + Mag(P1_i) + Mag(P2_i)\} \tag{10}$$

Formula (7) checks whether the pixel value P is the largest for each template. Formula (8) means the sum of intensity values of three pixels in the template k is larger in comparison to that of the other templates. Formulas (9) and (10) are used to compute the image gradient.


Center-symmetric Local Binary Patterns: CS-LBP is a derived version of LBP. This method was initially proposed for reducing the length of the conventional LBP method and to overcome some other drawbacks, such as a poor robustness when applied to image with nearly uniform intensity regions.

As illustrated in Fig.10., different from comparing the intensity value of each neighboring pixel to its central pixel, this method compares difference between the center-symmetric pixel pairs. The CS-LBP method can be calculated by

$$CSLBP(x, y) = \sum_{i=0}^{N/2-1} S\left(P_i - P_{i+N/2}\right) 2^i, \qquad S(x) = \begin{cases} 1 & x > T \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

Here T is a tiny number, as the threshold value of the intensity difference, in order to improve the robustness of the CS-LBP method on planar image region.
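Eq. (11) can be sketched in a few lines. The function name, the ordering of the neighbor list (index i is assumed to be center-symmetric to index i + N/2), and the default threshold value are illustrative assumptions rather than values from the dissertation.

```python
def cs_lbp(neighbors, threshold=0.01):
    """CS-LBP code of one pixel from its N neighbors, following Eq. (11).

    Assumption: N is even and neighbors[i] is opposite neighbors[i + N // 2].
    """
    half = len(neighbors) // 2
    code = 0
    for i in range(half):
        # S(P_i - P_{i+N/2}) is 1 only when the difference exceeds T.
        if neighbors[i] - neighbors[i + half] > threshold:
            code |= 1 << i
    return code
```

With N = 8 neighbors the code takes one of 16 values, compared with 256 for the conventional 8-neighbor LBP, which is the length reduction mentioned above.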

(18) Location: Page 23, line 7 to page 24, line 1, paragraph 1 in section 2.2.2. Correction type: text revision.

Details:

Gradient-based methods, such as HOG and COV, only use the values of four neighboring pixels to calculate direction and magnitude. In above-stated methods, not considering the gradient information entirely has obvious drawbacks.

Gradient-based methods, such as HOG and COV, put only four neighboring pixels values into the calculation of direction and magnitude information. In above-stated methods, not considering the gradient information entirely has obvious drawbacks.

Details:

Note the loss of gradient information in CS-LBP due to its ignorance of the central pixel. It is also hard for CS-LBP to choose an adaptable threshold.

Note that due to its neglect of the central pixel in CS-LBP, some part of gradient information is missing, and selecting an adaptable threshold is also difficult for CS-LBP.

(20) Location: Page 30, line 15 to page 31, line 6, all paragraphs in section 2.4.1. Correction type: text revision.

Details:

As mentioned in Section 1, the choice of classifiers is crucial, and it significantly affects the detection result. Here we select the SVM method to train the detector. SVM is especially effective for learning from a relatively small sample in high-dimensional space. The decision rule is given by the following formula:

$$f(x) = \sum_{i=1}^{N_s} \beta_i K(x_i, x) + b \tag{19}$$

where each $x_i$ is supported by the support vector, $N_s$ is the number of support vectors, and $K(x, y)$ is the kernel function. The four major kernel functions are shown as the following formulas:

Linear: $$K(x, y) = x'y \tag{20}$$

Polynomial: $$K(x, y) = (\gamma x'y + C)^D \tag{21}$$

Radial basis function: $$K(x, y) = \exp\left(-\gamma \|x - y\|^2\right) \tag{22}$$

Sigmoid: $$K(x, y) = \tanh(\gamma x'y + C) \tag{23}$$

where

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{24}$$

For our experiments, we chose LibSVM , let all training parameters retain their default settings (Given in Table 2) and used Radial basis function (RBF) and linear kernel functions respectively.

As mentioned in Section 1, the choice of classifiers is crucial, and it significantly affects the detection result. Here we select the SVM method to train the detector because its relatively good performance to train on a small number of samples in high-dimensional feature space. The decision rule is defined as follows:

$$f(x) = \sum_{i=1}^{N_s} \beta_i K(x_i, x) + b \tag{19}$$

where $x_i$ represents support vectors, and $N_s$ represents the number of support vectors. The kernel function $K(x, x')$ measures the similarity between two vectors. Various kernels have been developed for different purposes, and below are several popular kernels:

Linear kernel: $$K(x, x') = x^T x' \tag{20}$$

Gaussian kernel (Radial basis function): $$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \tag{21}$$

Polynomial kernel: $$K(x, x') = (\gamma x^T x' + C)^d \tag{22}$$

Sigmoid kernel: $$K(x, x') = \tanh(\gamma x^T x' + C) \tag{23}$$

Where the $\tanh(x)$ function can be viewed as a scaled logistic function:

$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{24}$$

We chose LibSVM in our test, let all training parameters retain their default settings (Given in Table 2) and used Radial basis function (RBF) and linear kernel functions respectively.
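The kernels of Eqs. (20)-(23) and the decision rule of Eq. (19) can be written out directly, as in the sketch below. It is not LibSVM code: the function names and the default parameter values are hypothetical and do not reproduce the settings of Table 2.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)                                     # Eq. (20)

def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))       # Eq. (21)

def polynomial_kernel(x, y, gamma=1.0, c=0.0, d=3):
    return (gamma * dot(x, y) + c) ** d                  # Eq. (22)

def sigmoid_kernel(x, y, gamma=1.0, c=0.0):
    return math.tanh(gamma * dot(x, y) + c)              # Eq. (23)

def decision(x, support_vectors, betas, b, kernel=linear_kernel):
    """Decision value f(x) of Eq. (19) for already-trained coefficients."""
    return sum(beta * kernel(sv, x)
               for beta, sv in zip(betas, support_vectors)) + b
```

A sample would typically be labeled by the sign of `decision(...)`; training the coefficients themselves is left to the SVM library.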

(21) Location: Page 31, line 7 to page 32, line 2, paragraph 1 in section 2.4.2. Correction type: text revision.

Details:

We selected the training and testing dataset from INRIA dataset to evaluate our system.

Since this dataset is widely used for human detection, it helped us make fair comparisons with other methods. The dataset contains 1774 human annotations and 1671 human free images.

Exactly, 1208 human annotations and 1218 non-human images were used for training the detectors and the remaining images were used for testing. For positive images, left-right reflections are also used, resulting in a total of 2416 human images are applied for training. This dataset includes various kinds of samples from different points of view, colorful clothing, diverse poses, partial occlusion and sundry illumination. (See Fig.17.), making it a suitable benchmark database for human detection.

We selected the training and testing dataset from INRIA dataset to evaluate our system.

Since this dataset is commonly adopted for human detection, it helped us make fair comparisons with other methods. The dataset contains 1774 human annotations and 1671 images without human. Exactly, 1208 human annotations and 1218 human-free images were used to build the detector, and we apply the rest of the image set for validation and testing. Left-right flipping are also used as data augmentation for positive samples, resulting in a total of 2416 human images are applied for training. The data set consists of various samples under different viewing angles and lighting conditions, human body in different postures, various colors of clothing, and partial occlusion. (See Fig.17.), making it a suitable benchmark database for human detection.

Details:

Although a longer feature vector is more sparse and discriminative, it is significantly costly in terms of memory storage and classification complexity.

Although higher-dimensional features can store more sparse and differentiated information, it is significantly costly in terms of memory storage and classification complexity.

(23) Location: Page 48, line 1 to page 50, line 4, paragraph 5-10 in section 3.1.1. Correction type: text revision.

Details:

Fig.29. shows the principle process. The authors use the number of “1” pixels as the length and use the position of the middle pixel in these continuous “1” pixels as the angle from the binary code of the uniform LBP8,1 to extract a 2-dimensional histogram. Then, all the pixels vote the same weight “1” to their bins of in this 2-D histogram. For each block, the authors transform the bin values of 2-D histogram to a 1-D vector for Support Vector Machine (SVM) training.

GLBP: GLBP is proposed in as a modified version of conventional SLBP. In this intra-combined feature, the feature length is the same as S-LBP, less than other simply combined feature. By reducing the feature length, the calculations and times will be decreased. The feature extraction flow of pixel level is shown in Fig.30. In step A, the authors calculate the eight bit binary code by comparing the value of middle pixel with the values of eight neighbor pixels one by one. The binary value is “0” when the value of middle pixel is bigger, “1” when the value of neighbor pixel is bigger. Then the authors put the neighbor pixel with same binary value together to get several “1” area and “0” area. When the “1” area and “0” area appear only once, the authors define this pattern as Uniform Pattern. The process of checking a pixel is a uniform pattern or non-uniform pattern is Uniform Checking. The authors do the next step if the binary pattern of this pixel is uniform. If not, the authors ignore this pixel and calculate the next pixel.

In step B, the authors calculate the width value and the angle value for each pixel with a uniform pattern. The width value is the number of “1” values in the binary code of the pixel. Eight direction codes, 0 to 7, are defined for the directions of the eight neighbor pixels. The angle value is the direction code of the middle pixel in the “1” area of the binary code. When the width value is an even number, the authors set the angle value to the smaller of the two middle direction values, except when the middle of the “1” area lies between direction “7” and direction “0”; in that case the angle value is set to “0”.

In step C, the authors calculate the gradient value from the value of the original pixel and the values of its 4 neighbor pixels, as in the HOG method.

In step D, angle value and width value from step B are used for mapping the position of bin in GLBP Table. Then the authors add the gradient value from step C into this bin of the GLBP Table.

Fig. 29 shows the principle process. The authors use the number of “1” pixels as the length and use the position of the middle pixel in these continuous “1” pixels as the angle from the binary code of the uniform LBP8,1 to extract a 2-dimensional histogram. Following this rule, all the pixels vote with the same weight “1” into their bins in this two-dimensional histogram. For each block, the authors convert the bin values of the two-dimensional histogram into a one-dimensional vector for support vector machine (SVM) training.

GLBP: GLBP is proposed in , which is based on the conventional SLBP feature. GLBP, as an intra-combined feature, has the same dimensions as the SLBP, whose dimension is generally smaller than other simple-combined features. Therefore, it is good for real-time processing.

Fig. 30 shows an example of how the GLBP feature is calculated.

In Step A, for each pixel, the authors calculate an eight-bit binary code based on its value relative to its surrounding pixels. If the magnitude of a neighbouring pixel is smaller than that of the inspected pixel, the neighbouring pixel is assigned 1; otherwise, it is assigned 0. Neighbouring pixels with the same value are grouped into a “1” area or a “0” area. A pattern that has only one “1” area and one “0” area is defined as a uniform pattern. Only pixels whose neighbouring pixels show a uniform pattern are considered for further calculations, and the remaining pixels are ignored.

In step B, the width and angle are calculated for pixels that pass the uniform checking. The width is defined as the number of pixels in the “1” area. To calculate the angle, we assign one of eight direction codes to each neighbouring pixel. The angle is defined as the direction code that corresponds to the middle pixel of the “1” area. When the width is even, we choose the one of the two middle pixels that has the smaller direction code.

In step C, the gradient of the inspected pixel with respect to its neighbouring pixels is calculated in a similar way to the HOG method.

In step D, the authors locate each pixel in the GLBP table using its width and angle. The gradient of each pixel is added to the corresponding bin in the GLBP table.
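
Steps A to D of the corrected text can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the ordering of the direction codes in OFFSETS, the central-difference gradient, and the 7 × 8 table layout (widths 1 to 7 by angles 0 to 7) are all assumptions introduced here, and every function name is hypothetical.

```python
import math

# Neighbour offsets (dy, dx) for direction codes 0..7.
# The ordering is an assumption; the source only says eight codes are used.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def binary_code(img, y, x):
    """Step A: a neighbour smaller than the inspected pixel is assigned 1."""
    centre = img[y][x]
    return [1 if img[y + dy][x + dx] < centre else 0 for dy, dx in OFFSETS]


def is_uniform(bits):
    """Uniform checking: exactly one '1' area and one '0' area on the ring."""
    transitions = sum(bits[k] != bits[(k + 1) % 8] for k in range(8))
    return transitions == 2


def width_and_angle(bits):
    """Step B: width is the size of the '1' area, angle its middle direction."""
    width = sum(bits)
    # The '1' area starts where a 1 follows a 0 (bits[-1] wraps to bits[7]).
    start = next(k for k in range(8) if bits[k] == 1 and bits[k - 1] == 0)
    if width % 2 == 1:
        angle = (start + width // 2) % 8
    else:
        # Even width: take the smaller of the two middle direction codes.
        first = (start + width // 2 - 1) % 8
        second = (start + width // 2) % 8
        angle = min(first, second)
    return width, angle


def gradient_magnitude(img, y, x):
    """Step C: central differences over the 4 neighbours (assumed HOG-style)."""
    gx = img[y][x + 1] - img[y][x - 1]
    gy = img[y + 1][x] - img[y - 1][x]
    return math.hypot(gx, gy)


def glbp_table(img):
    """Step D: add each uniform pixel's gradient to its (width, angle) bin."""
    table = [[0.0] * 8 for _ in range(7)]  # rows: width 1..7, columns: angle
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            bits = binary_code(img, y, x)
            if not is_uniform(bits):
                continue  # non-uniform pixels are ignored
            width, angle = width_and_angle(bits)
            table[width - 1][angle] += gradient_magnitude(img, y, x)
    return table
```

Taking the minimum of the two middle codes also covers the wrap-around case: when the middle of the “1” area lies between directions 7 and 0, the minimum is 0.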

(24) Location of correction: Page 50, line 16 to page 51, line 2, paragraph 2 in section 3.1.2. Type of correction: revision of text

Details:

Another limitation of SLBP is the loss of gradient information. In the HOG method, the gradient values of pixels are voted into the histogram as weights, but in SLBP, a fixed weight is voted on the intensity map only. Pixels with large magnitude values are always crucial for classification. Thus, SLBP cannot obtain enough information from these important pixels with larger magnitude values.

Another limitation of SLBP is the loss of gradient information. In the HOG method, the gradient values of pixels are accumulated as weights into each histogram bin, while in SLBP, only the intensity map is taken into account, with a fixed scale. In image classification tasks, important information such as edges or texture can usually be learnt from pixels with larger magnitude values; however, such pixel information is partly missing in the SLBP method.

(25) Location of correction: Page 64, last line to page 65, line 8, paragraph 2 in section 4.1.1. Type of correction: revision of text

Details:

For example, in , an HOG and LBP were combined into a single feature set. In a histogram of template (HOT) was proposed by combining the gradient and textural methods, which used eight templates and four formulas to extract the local template-based binary features. A MB-HOT was proposed as its extended version. Some other methods have been considered. Specifically, a multi-level HOG was extracted in by constructing histograms of oriented gradients by summarizing the normalized response in each cell while a multi-resolution HOG was proposed in by adopting four resolutions for pedestrian detection.

For example, in , an HOG and LBP were integrated into one feature set. In a HOT was proposed by combining the gradient and textural methods, which used eight patterns with four predefined formulas to get the local template-based feature information. A MB-HOT was proposed as its extended version. Some other methods have been considered. Specifically, a multi-level HOG was extracted in by building multi-level histograms of oriented gradients which are summarized from the normalized feature vector in each cell, while a multi-resolution HOG was proposed in by adopting four resolutions for pedestrian detection.

(26) Location of correction: Page 65, second last line to page 66, line 3, paragraph 3 in section 4.1.2. Type of correction: revision of text

Details:

Then, the normalized responses are obtained by a normalization method applied over all directions in each distinct 16 × 16 block. Finally, multi-level features are extracted by constructing histograms of oriented gradients by summing the normalized response in each cell.

The figure depicts progressively smaller cell sizes from 64 × 64 to 8 × 8.

Then, the normalized responses are acquired by a normalization method applied over all directions in each distinct 16 × 16 block. Finally, by accumulating the normalized response in each cell and building histograms of oriented gradients, the multi-level features are extracted.

The figure illustrates cell sizes ranging from 64 × 64 down to 8 × 8.

Details:

Any detection window is classified hierarchically from its lowest resolution to the full resolution along the horizontal axis in resolution space while the image is downscaled in order to use a detection window of fixed size to locate objects of larger scale along the vertical axis in scale space. ‘N’ means that once a window is rejected by the classifiers it will no longer be taken into account, while ‘Y’ means that the window satisfies the classifier and is chosen as a candidate at the next higher resolution. The method can reduce computational cost since detections on different scales are independent and a huge number of windows can be eliminated at low resolution.

Any detection window is ranked along the horizontal axis from the lowest resolution to the full resolution in the resolution space, while the image is downscaled along the vertical axis in the scale space so as to localize the object with large scale using a fixed-size detection window.

‘N’ means that once a window is rejected by the classifiers it will no longer be taken into account, while ‘Y’ means that the window satisfies the classifier and is chosen as a candidate at the next step with a higher resolution. This method can reduce computational cost since detections on different scales are independent and a huge number of windows can be eliminated at the low-resolution steps.

Details:

Wang et al. also put forward an occlusion handling idea based on global and partial detectors trained using the HOG-LBP method. They construct the occlusion binary likelihood image according to the response of each block of the HOG method to the trained linear SVM. Then they segment out the possible occlusion regions and detect them with part detectors.

Wang et al. also put forward an occlusion handling proposal that uses global and partial detectors trained with the HOG-LBP method. To achieve this, they first build an occlusion binary likelihood image according to the response of each block of the HOG method to the trained linear SVM, then segment out the possible occlusion regions and detect them with part detectors.

(29) Location of correction: Page 68, last four lines, paragraph 11 in section 4.1.2. Type of correction: revision of text

Details:

Hence, at the middle level, the method not only contains 3 × 3 pixel regions, but also 6 × 6, 9 × 9, and 12 × 12 and so forth. This has more macrostructures than the original HOT method.

Hence, at the middle level, the method not only contains 3 × 3 pixel regions, but also 6 × 6, 9 × 9, and 12 × 12 and so forth, which could extract more global structure information than the standard HOT approach.

(30) Location of correction: Page 79, line 18-28, paragraphs 4-6 in section 4.2.1. Type of correction: revision of text

Details:

Boosting methods, on the other hand, are often used in cascade detectors, as they show good performance when combining a strong classifier with several weak classifiers. These detectors can save considerable detection time by discarding background windows rapidly.

In order to reduce the computation workload, some GPU-based approaches to the HOG method are proposed. Four modules (scaling, feature extraction, classification, and reduction) are applied for the detector.

The entire detection procedure can be seen in Fig. 54. Instead of going back to the raw HOG extraction method, the GPU helps to compute the features independently in advance and reuses them among sub-windows, which turns out to take even less computational cost.

Boosting methods, on the other hand, are often used in cascade detectors, as they show promising results when integrating a strong classifier with several simple and low-cost classifiers. Such detectors can considerably reduce detection cost by quickly discarding negative windows.

In order to reduce the computation workload, some GPU based approaches of HOG method are proposed. Several modules including image scaling, feature extraction, vector classification, and dimension reduction are applied for the detector.

The entire detection procedure can be seen in Fig. 54. Rather than going back to the original feature pipeline of the HOG method, the GPU helps to compute the features independently in advance and reuses them among different sub-windows, resulting in an even lower computational workload.

(31) Location of correction: Page 82, title of Fig. 56 in section 4.2.2. Type of correction: revision of text

Details:

Rapid calculation of rectangular feature using integral image.

Fast rectangular feature calculation using integral image.

(32) Location of correction: Page 80, line 7 to end of page 82, paragraphs 2-9 in section 4.2.2. Type of correction: revision of text

Details:

Integral Image: The integral image was first introduced in 1984, but was not properly introduced to the computer vision area until Viola and Jones applied it to an object detection framework in 2001 in .

The integral image is used as a quick and effective way of calculating the sum of pixel values in a given image or a rectangular subset of a grid inside the given image. It can also, or is mainly, used for calculating the average intensity within a given image.

When creating an integral image, we need to calculate a summed area. In this table, if we go to any point (x, y) then at this table entry we will come across a value. This value is the sum of all the pixel values above, to the left and of course including the original pixel value of (x, y) itself. If we want to get the value of s (x, y) in position (x, y) integral image, the value can be calculated by:

s(x, y) = i(x, y) + s(x − 1, y) + s(x, y − 1) − s(x − 1, y − 1) (34)

For example, see Fig. 55: with i(x, y) = 7, s(x − 1, y) = 14, s(x, y − 1) = 25, and s(x − 1, y − 1) = 10, the value of s(x, y) = 7 + 14 + 25 − 10 = 36.

Once you have used the equation to calculate and fill up the value inside the integral image, the task of calculating the sum of pixels in some rectangle which is a subset of the original image can be done in constant time under O(1) complexity.

In order to do this we only need to use 4 values from the integral image. We then add or subtract these 4 values to obtain the correct sum of the pixels within that region. For example, if we want to get the sum of the pixels in region D in Fig. 56, we use this equation:

sum = s(A) + s(B)−s(C)−s(D) (35)

With the help of the integral image, the cost of feature extraction can be reduced in both the training and detection parts. Here we give an example of how to apply the integral image to accelerate the feature extraction for the B-LTP method.

In B-LTP method, there are six formulas for four templates for feature calculation. Thus, for each block, the histogram requires 48 bins which means it can be replaced by 48 additional images for acceleration. Each time we calculate a value for a certain pixel, we store the value in corresponding additional image at the same time. In this way, the integral images for all the additional images can be generated. See Fig.57.

Integral Image: The integral image was first introduced in 1984, but it was not until Viola and Jones applied it to the object detection pipeline in 2001 in , that it was thoroughly introduced into the field of computer vision.

Given an intensity image, an integral image method is commonly applied as a rapid and effective approach to calculate the sum of pixel values over sub-rectangular regions of the image.

Moreover, it can also be utilized for acquiring the average intensity within that image.

A summed area table is required in order to generate an integral image. An entry in this table contains a pixel and its associated value, and this value represents the sum of every pixel value out of original image from the upper-left corner to the position of that pixel. If we want to get the value of s (x, y) in position (x, y) integral image, the value can be calculated by:

s(x, y) = i(x, y) + s(x − 1, y) + s(x, y − 1) − s(x − 1, y − 1) (34)

For example, see Fig. 55: with i(x, y) = 7, s(x − 1, y) = 14, s(x, y − 1) = 25, and s(x − 1, y − 1) = 10, the value of s(x, y) = 7 + 14 + 25 − 10 = 36.

After applying the equation and calculating all values inside the integral image, the sum of pixels in any rectangular sub-region of the original image can be computed in constant time, i.e., with O(1) time complexity.

When doing this calculation, only four values from the integral image are required. We then add or subtract these four values to obtain the correct sum of the pixels inside that sub-region. For example, if the target is to get the sum of the pixels in region D in Fig. 56, we follow this equation:

sum = s(A) + s(B)− s(C)− s(D) (35)
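
The recurrence in equation (34) and the four-lookup rectangle sum can be sketched in Python as follows. This is a minimal illustration, not the thesis code: the function names are hypothetical, the table is padded with one zero row and one zero column so that the border terms exist, and the four corners are addressed by coordinates rather than by the labels A to D of Fig. 56, whose exact placement is not reproduced here.

```python
def integral_image(img):
    """Build a summed area table following equation (34).

    The returned table has one extra zero row and column, so the entry for
    pixel (y, x) of the image is stored at s[y + 1][x + 1].
    """
    height, width = len(img), len(img[0])
    s = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            # i(x, y) + s(x - 1, y) + s(x, y - 1) - s(x - 1, y - 1)
            s[y + 1][x + 1] = img[y][x] + s[y + 1][x] + s[y][x + 1] - s[y][x]
    return s


def rect_sum(s, top, left, bottom, right):
    """Sum of the pixels in the inclusive rectangle, using four lookups."""
    return (s[bottom + 1][right + 1] - s[top][right + 1]
            - s[bottom + 1][left] + s[top][left])


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
lower_right_block = rect_sum(table, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```

Building the table costs one pass over the image; after that, each rectangle sum costs the same four lookups regardless of the rectangle size.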

Benefiting from the integral image, the cost of feature extraction can be reduced in both the training and detection parts. We give an example to illustrate how the integral image is applied to speed up the feature extraction process of the B-LTP method.

In B-LTP method, there are six formulas for four templates for feature calculation. Thus, for each block, the histogram requires 48 bins which means it can be replaced by 48 extra images for computational reduction. Thus, when we compute a value for a certain pixel, we could also retain that value to a corresponding integral map simultaneously. Therefore, we can create integral mappings for additional images in the same way. See Fig.57.
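
The idea of keeping one extra image per histogram bin can be sketched in Python as follows. This is a hedged illustration only: the B-LTP formulas themselves are not reproduced, so the input is assumed to be a list of already computed votes of the form (y, x, bin_index, value), and the function names are hypothetical.

```python
def build_bin_integrals(votes, height, width, n_bins=48):
    """Create one zero-padded integral image per histogram bin.

    votes: iterable of (y, x, bin_index, value) tuples, assumed to come from
    the per-pixel feature calculation (not shown here).
    """
    planes = [[[0.0] * (width + 1) for _ in range(height + 1)]
              for _ in range(n_bins)]
    # Store each pixel's value in the extra image of its bin.
    for y, x, bin_index, value in votes:
        planes[bin_index][y + 1][x + 1] += value
    # Turn every extra image into its integral image in place.
    for plane in planes:
        for y in range(1, height + 1):
            for x in range(1, width + 1):
                plane[y][x] += (plane[y][x - 1] + plane[y - 1][x]
                                - plane[y - 1][x - 1])
    return planes


def block_histogram(planes, top, left, bottom, right):
    """Histogram of an inclusive block: four lookups per bin."""
    return [p[bottom + 1][right + 1] - p[top][right + 1]
            - p[bottom + 1][left] + p[top][left]
            for p in planes]
```

With this layout, the histogram of any block costs a fixed number of lookups per bin, independent of the block size, which is what makes reuse across overlapping detection windows cheap.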

Details:

The efficient boosted classifiers are constructed and placed at the early stages of the cascades so as to reject many of the simple non-human samples while detecting almost all human samples.

At the same time, complicated and time-consuming boosted classifiers are placed in the later stages of the cascades to remove more complex negative samples. In this way, it is able to quickly discard simple background regions such as road and sky, while spending more time on human-like regions of the image. Only samples that can pass through all stages of the cascades are classified as human.

The efficient boosted classifiers are trained and adopted at the top parts of the cascade pipeline so as to reject many of the simple non-human samples while keeping almost all human samples. At the same time, complicated and time-consuming boosted classifiers are adopted in the bottom stages of the cascade pipeline to remove more complex negative samples. This allows simple background regions such as road and sky to be discarded efficiently, while more time is spent on human-like regions of the image. The samples that pass all the stages of the cascade pipeline are eventually identified as humans.
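
The early-rejection behaviour of such a cascade can be sketched in Python as follows. This is a toy illustration under stated assumptions: each stage is modelled as a (score function, threshold) pair, which stands in for a trained boosted classifier, and the function name is hypothetical.

```python
def run_cascade(sample, stages):
    """Pass a sample through the stages in order, cheapest stage first.

    stages: list of (score_fn, threshold) pairs. A sample is rejected at the
    first stage whose score falls below the threshold.
    Returns (is_human, number_of_stages_evaluated).
    """
    for index, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(sample) < threshold:
            return False, index  # rejected early, later stages are skipped
    return True, len(stages)
```

The second return value makes the saving visible: a window rejected by the first cheap stage never pays for the expensive later stages, and only a window that passes every stage is reported as human.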

(34) Location of correction: Page 84, line 21 to page 86, line 2, paragraphs 3-9 in section 4.2.3. Type of correction: revision of text

Details:

In CUDA, the parallel portions of an application are executed on the GPU as kernels. One kernel is executed at a time and many threads execute each kernel. The difference between CUDA and CPU threads is that the former's threads are extremely lightweight. A CUDA kernel launches a grid of thread blocks; threads within a block cooperate via shared memory, and threads in different blocks cannot cooperate. Fig. 59 gives the structure of thread blocks.

For the memory hierarchy, all threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant memory space and the texture memory space. The global, constant and texture memory spaces are optimized for different memory usages.

We first select our B-LTP method for the GPU-based implementation.

For the classification part, the classifier is stored in texture memory before the detection process is initiated. The classifier is a set of support vectors, which undergoes a training process by the SVM method as a preliminary step.

For the detection part, Fig.60. illustrates the main process to generate the GPU-based B-LTP detector.

First, in step (a) the original intensity image is loaded into the graphics card as character vectors.

In the next step (b), a gradient magnitude map is built for intensity image. In order to calculate the gradient magnitude value of each pixel, a thread block of GPU containing n × n threads is applied for each n × n pixel region. Here n depends on the block size of a feature; for example, if the block size is 16 × 16, a thread block containing 16 × 16 (256) threads is used to deal with a 16 × 16 pixel region. Six thread blocks with 256 threads are utilized for maximum 1536 parallel computations in the system.

In the CUDA platform, functions that run on the device are referred to as kernels. At each time point, only one kernel is executed, and it is executed by many GPU threads simultaneously. Compared with threads in multi-threaded CPU programming, GPU threads are extremely lightweight. To run a kernel, the CUDA platform launches a grid of thread blocks, where threads within a block share memory and there is no cross-talk between different blocks. Fig. 59 gives the structure of thread blocks.

For the memory hierarchy, all threads have access to the shared global memory. In addition, two read-only memory spaces (the constant memory space and the texture memory space) are also accessible by all threads. These three memory spaces are optimized for different kinds of purposes.

We first select our B-LTP method for the GPU-based implementation.

For the classification part, the classifier is uploaded into the texture memory before the detection process starts. The classifier is composed of a set of vectors obtained from a preliminary SVM training process.

For the detection part, Fig.60. illustrates the main process to generate the GPU-based B-LTP detector.

First, in step (a) the original intensity image is loaded into the graphics card as character vectors.

In the next step (b), a gradient magnitude map is built for intensity image. For the purpose of calculating the gradient magnitude value of each pixel, a thread block of GPU containing n × n threads is applied for each n × n pixel region. Here n depends on the block size of a feature; for example, if the block size is 16 × 16, a thread block containing 16 × 16 (256) threads is used for handling this 16 × 16 block. Six thread blocks with 256 threads are utilized for maximum 1536 parallel computations in the system.
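
The mapping from thread blocks to pixel regions in step (b) can be emulated on the CPU with the following Python sketch. It is not CUDA code and not the thesis implementation: the nested loops stand in for threads that would run in parallel on the GPU, the central-difference gradient with clamped borders is an assumption, and all function names are hypothetical.

```python
import math


def grid_dims(height, width, n=16):
    """Number of n x n thread blocks needed to cover the image (rows, cols)."""
    return -(-height // n), -(-width // n)  # ceiling division


def thread_to_pixel(block_y, block_x, thread_y, thread_x, n=16):
    """Pixel handled by one thread: block origin plus the thread offset."""
    return block_y * n + thread_y, block_x * n + thread_x


def gradient_magnitude_map(img, n=16):
    """Emulate one thread per pixel, grouped into n x n thread blocks."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    blocks_y, blocks_x = grid_dims(height, width, n)
    for block_y in range(blocks_y):
        for block_x in range(blocks_x):
            for thread_y in range(n):
                for thread_x in range(n):
                    y, x = thread_to_pixel(block_y, block_x,
                                           thread_y, thread_x, n)
                    if y >= height or x >= width:
                        continue  # thread falls outside the image
                    # Central differences, clamped at the image border.
                    gx = img[y][min(x + 1, width - 1)] - img[y][max(x - 1, 0)]
                    gy = img[min(y + 1, height - 1)][x] - img[max(y - 1, 0)][x]
                    out[y][x] = math.hypot(gx, gy)
    return out
```

With n = 16, each block holds 16 × 16 = 256 threads, so six concurrently scheduled blocks correspond to the 1536 parallel computations mentioned in the text.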

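2. Reason for correction

Improper quotations from online articles or from papers by other authors were found in the doctoral thesis, and the author was therefore required to make corrections. Specifically, (1) – (4), (6) – (10), (16), (22), (27), (32), (33) were improper quotations from online articles; (5), (11), (19), (21), (26), (28), (29) from papers by other authors; (12), (31), (34) from the doctoral thesis of a senior member of the same laboratory; and (13), (14), (16) – (18), (20), (23) – (25), (30) from co-authored papers of which the author is not the first author.

3. Reason for approving the corrections

All corrected passages are parts that describe the results of previous research by others and do not affect the main results of this thesis. The corrections are therefore accepted as appropriate.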
