
Kyushu University Institutional Repository (九州大学学術情報リポジトリ)

Recognition of Human Actions using Visual Local Features (局所特徴を用いた動画像中の人間動作認識)

Ji, Yanli (姫, 艶麗)

Department of Advanced Information Technology, Graduate School of Information Science and Electrical Engineering, Kyushu University

https://doi.org/10.15017/25185

Publication information: Kyushu University, 2012, Doctor of Engineering, course doctorate. Version:

Rights:

Recognition of Human Actions using Visual Local Features

Yanli JI

Department of Advanced Information Technology

Graduate School of Information Science and Electrical Engineering Kyushu University

August, 2012


Abstract

Recognition of human actions in videos is the process of naming actions captured by cameras, usually in the simple form of an action verb. Action recognition is an attractive research topic, and its application fields include video surveillance, human-computer interaction, sport video analysis, computer motion animation, and more. However, human action recognition remains a challenging problem for two reasons. The first is the large appearance variation in human actions, such as the variety of action classes, the different physiques of humans, and the diversity of clothing styles and colors. The second is that camera-based action recognition must overcome difficulties such as occlusion, viewpoint changes, and scale variation of the video frame.

In this thesis, the author aims to recognize human actions captured by cameras, from basic to complex situations. To this end, the author proposes a method for local feature calculation and designs a recognition system using these local features.

Furthermore, the author proposes a local-feature-based method for a more complex problem: the recognition of human interactions.

Firstly, a new local feature calculation method is proposed for human action representation. In this method, the FAST detector is extended to the spatio-temporal domain to detect feature points in videos. Then a compact descriptor is proposed which represents actions with compact, peak-kept histograms of oriented spatio-temporal gradients (CHOG3D). It is calculated in a small spatio-temporal support region around each candidate feature point in order to keep the descriptor compact, and it employs first-order gradients in the spatial and temporal orientations. In addition, it keeps the peak value of the orientation-quantized gradient, which allows CHOG3D to represent actions more precisely and to be discriminated more easily. The effectiveness of peak keeping is verified by comparison with a threshold-setting method for action recognition. The optimal parameters of CHOG3D are determined by parameter training. The local features calculated with FAST and CHOG3D are applied to action recognition using an SVM. Based on a comparison of computation cost and a performance evaluation, the compact descriptor CHOG3D performs well on human action recognition at a lower computation cost. Although CHOG3D carries less information, a sufficient number of feature points helps to overcome this disadvantage.

Secondly, a recognition system based on the self-organizing map (SOM) is proposed for human action recognition with local features. In the proposed system, the compact descriptor CHOG3D is adopted for local feature calculation to represent human actions. The SOM is then employed to train the local features and to extract key features of actions, exploiting its ability to map data into a low-dimensional space.

After training, the key features are assigned the action labels of the training data. For recognition, the k-nearest neighbor algorithm (k-NN) classifies the features of a test action sequence into action classes, and the statistics of these feature classifications determine the action class of the test sequence. We search for the optimal SOM map size for training and the proper value k for k-NN classification. With the optimal parameters, the proposed method is tested on three datasets, KTH, Weizmann and UCF sports, and the results confirm the effectiveness of the proposed recognition system.

Compared with the CHOG3D and SVM method, the SOM-based method is both more accurate and faster.

Finally, we extend our research to the recognition of complex human actions, i.e. interactive actions, and propose a contribution estimation method to improve interactive action recognition. Unlike previous algorithms that use the action information of both participants, the proposed algorithm estimates the action contribution of each participant and selects the major participant action for correct interaction recognition. To estimate contributions, we construct a contribution interaction model for each interaction category in which both participants perform major actions. We then design a method that uses these contribution interaction models to estimate the contribution of the participants and to classify interaction samples as “co-contribution” or “single-contribution” interactions. Furthermore, we determine the major action in a “single-contribution” interaction. If a given interaction is determined to be “co-contribution,” the actions of both participants are adopted for recognition; for a “single-contribution” interaction, the major action is selected for recognition. Experiments show that the method is effective for human interaction recognition and outperforms other methods.


Contents

1 Introduction 1

1.1 Application of human action recognition . . . 1

1.2 Problems in human action recognition . . . 3

1.3 Action recognition considering interaction . . . 8

1.4 Major contributions and structure of the thesis . . . 9

2 Related works and datasets 11

2.1 Related works . . . 11

2.1.1 Basic action representation and recognition . . . 11

2.1.2 Human-human interaction . . . 21

2.2 Datasets . . . 24

2.2.1 Basic human action . . . 24

2.2.2 Human-human interaction . . . 27

3 A compact descriptor CHOG3D for human action recognition 29

3.1 Introduction . . . 30

3.2 Feature point detection by extended FAST . . . 32

3.3 Descriptor CHOG3D calculation . . . 33

3.3.1 Spatio-temporal gradient and orientation quantization . . . 34

3.3.2 Descriptor calculation . . . 37

3.4 Recognition by SVM . . . 37


3.5 Descriptor CHOG3D optimization . . . 38

3.5.1 Parameter optimization . . . 38

3.5.2 Superiority of the peak kept method . . . 40

3.6 Experimental result . . . 41

3.6.1 Computation cost . . . 42

3.6.2 Recognition results . . . 43

3.6.3 Evaluation of detectors and descriptors . . . 47

3.7 Summary . . . 49

4 A SOM-based action recognition system using CHOG3D 50

4.1 Introduction . . . 50

4.2 The proposed recognition system . . . 52

4.2.1 Learning and recognition by SOM . . . 52

4.2.2 Classification by k-NN . . . 53

4.3 Experimental results . . . 54

4.3.1 Experiment steps . . . 54

4.3.2 Parameter discussion . . . 55

4.3.3 Recognition results . . . 57

4.3.4 Computation cost comparison . . . 60

4.4 Summary . . . 61

5 Contribution estimation for human interaction recognition 62

5.1 Introduction . . . 62

5.2 Contribution estimation . . . 65

5.2.1 Contribution interaction model . . . 65

5.2.2 Contribution determination . . . 69

5.3 Interaction recognition system . . . 70

5.3.1 Local feature calculation . . . 70

5.3.2 Human separation and tracking . . . 72


5.3.3 Contribution estimation realization . . . 74

5.3.4 Interaction recognition . . . 74

5.4 Experiment . . . 75

5.4.1 Evaluation of contribution estimation . . . 76

5.4.2 Interaction recognition result . . . 78

5.5 Summary . . . 82

6 Conclusions and future work 83

6.1 Conclusions . . . 83

6.2 Future work . . . 85

Acknowledgement 86

Reference 87


List of Figures

1.1 Vicon motion capture system . . . 4

1.2 Multi-camera captured human action in MuHAVi dataset [1] . . . 5

1.3 Kinect sensor and its outputs . . . 6

2.1 Moving light displays (MLD) representing human motions [2] . . . 12

2.2 Action representation by models. (a) A 2-D stick-figure model fleshed out with ribbons [3]; (b) Hierarchical 3D model based on cylindrical primitives [4]; (c) 2D marker trajectories [5]; (d) Annotation body model [6]; (e) 3D skeleton of human action [7]. . . 14

2.3 Silhouettes for tennis action recognition [8] . . . 15

2.4 Silhouette of image sequence for the calculation of motion-energy image (MEI) and motion-history image (MHI) [9] . . . 15

2.5 Silhouette used for action feature extraction [10]. . . 15

2.6 Event recognition using over-segmented spatio-temporal volumes [11]. . . 16

2.7 Grids of optical flow magnitude for human action recognition [12]. . . 16

2.8 Optical flow split into directional components for action recognition [13]. (a) original image. (b) Optical flow. (c) Separating optical flow to x, y two vectors. (d) Half-wave rectification of each component. . . 17

2.9 Local features. (a) Spatio-temporal interest points from the motion of the legs of a walking person [14]. (b) Visualization of cuboid based behavior recognition [15]. . . 19

2.10 Scale-invariant spatio-temporal interest points (Hessian3D) [16]. . . 20


2.11 A hierarchical representation of interaction “punching” [17]. . . 22

2.12 Spatio-temporal relationship matching process [18]. . . 23

2.13 Samples for different action classes (columns) in four scenarios (rows) in the KTH action dataset. . . 25

2.14 Action samples for different action classes in the Weizmann action dataset. . . 26

2.15 Action samples for different action classes in the UCF sports action dataset. . . 26

2.16 Interaction samples in the UT interaction dataset. . . 27

2.17 Interaction samples in the LIMU interaction dataset. . . 28

3.1 Overview of HOG3D calculation procedure. . . 31

3.2 Spatio-temporal channels . . . 33

3.3 Compact descriptor CHOG3D calculation. In (a) the support region is separated to l cells; one feature histogram is calculated for one cell in (b); (c) shows the quantization of gradient of one pixel point in a cell; (d) shows a view of gradient vector of the pixel. . . 35

3.4 The advantage of peak kept in descriptor calculation. . . 36

3.5 Comparison of peak kept and threshold setting method . . . 41

3.6 The recognition accuracy comparison of each action . . . 46

4.1 Results of different map size in KTH. . . 56

4.2 Results of different k values in KTH. . . 56

4.3 Confusion matrix of KTH. . . 57

4.4 Confusion matrix of Weizmann. . . 58

4.5 Confusion matrix of UCF. . . 59

5.1 The major participant actions selection for interaction recognition . . . 63

5.2 Contribution interaction model construction . . . 66

5.3 Co-contribution interaction with dissimilar actions . . . 69

5.4 Participants separation . . . 72


5.5 Definition of early time action . . . 74

5.6 Optimal value determination for threshold and vocabulary size using HoG/HoF . . . 77


List of Tables

3.1 Recognition accuracy of different parameter sets on KTH . . . 39

3.2 Calculation speed and average local feature numbers per frame . . . 42

3.3 Comparison with related methods . . . 43

3.4 Confusion matrix of the KTH dataset . . . 45

3.5 Comparison of recognition accuracy for each action . . . 45

3.6 Average recognition results of combinations . . . 47

4.1 Comparison of average recognition accuracy with other algorithms . . . 61

5.1 Contribution estimation accuracy (%) . . . 78

5.2 Recognition accuracy comparison with the state of the art (%) . . . 80

5.3 Recognition result comparison of co-contribution and single-contribution (%) . . . 81


Chapter 1

Introduction

With the development of information technology, digital cameras are widely used in our daily life. Almost all cell phones and computers now have built-in cameras. Surveillance cameras are installed along most main roads for traffic monitoring, and in stores, banks and other public places for security. They are also mounted on cars for short-range localization, pedestrian detection and parking assistance. Since cameras are used so widely, video has become the most popular and important medium for recording and transmitting information, leading to an enormous quantity of video data worldwide, and this quantity is still increasing rapidly.

Automatic analysis and understanding of videos could greatly help humans process such a large quantity of information. However, computer vision systems still fall far short of human vision. One of the major research issues in computer vision is human action recognition, because it has a wide range of applications in modern life.

1.1 Application of human action recognition

For example, human action recognition is used in video surveillance, which has many applications related to home and office security; it is also required in places such as casinos, banks, hospitals, and health care facilities. At present, the main function of video surveillance is to provide video data to human operators, who analyze the security situation of the monitored area. Although many researchers are trying to design smart surveillance systems, most of them are still at the experimental stage. One example is the IBM Smart Surveillance System1, a middleware offering for surveillance systems that provides video-based behavioral analysis capabilities, including web-based real-time alerts, event search, and event statistics. In addition, some smart surveillance products with simple functions are available, for example counting the number of customers with the TrueView People Counter2. Much work remains before computer vision techniques can replace human operators in smart surveillance.

Another application field of human action analysis is human-computer interaction. In this field, image and video input devices have played an increasingly important role in recent years, especially since Microsoft brought the Kinect to market, a motion sensing input device for the Xbox 360 video game console and Windows PCs3. Built around a webcam-style add-on peripheral for the Xbox 360 console, it enables users to control and interact with the Xbox 360 without touching a game controller, through a natural user interface using gestures and spoken commands.

Furthermore, Microsoft released the Kinect software development kit (SDK) so that users can design their own systems, and it is expected that more and more interesting systems will be developed in the future. In addition, human-robot interaction is studied around the world, for example by Carnegie Mellon University's Human-Robot Interaction research group (HRI)4, the NASA project on peer-to-peer human-robot interaction, and the research on an embodied proactive human interface between humans and robots by Professors Kurazume and Hasegawa at Kyushu University. These researchers are working enthusiastically toward intuitive and easy communication with robots through speech, gestures, and facial expressions.

Beyond the examples above, analysis of human actions is also applied in computer motion animation to produce films that attempt to simulate or approximate the look of live-action cinema with nearly photo-realistic digital character models. Medical auxiliary systems employ human action analysis for medical treatment and nursing care. In addition, human action understanding can help analyze sports video, or help correct the postures of amateurs and novice athletes.

1http://www.research.ibm.com/peoplevision/

2http://www.cognimatics.com/products/intelligent-surveillance/

3http://en.wikipedia.org/wiki/Kinect

4http://www.peopleandrobots.org

In summary, there is great demand in various fields for computer vision technology for human action analysis and recognition, and it can be expected that such technology will change our daily lives considerably in the future.

1.2 Problems in human action recognition

In computer vision, a human action is a composition of posture sequences or movements that may continue for a period of time. Action recognition is the process of naming actions captured by cameras, usually in the simple form of an action verb [19]. There are various types of human actions. In the survey paper [20], human actions are conceptually categorized into four levels depending on their complexity: gestures, actions, interactions and group activities. According to that definition, gestures are elementary movements of a person's body part and are the atomic components describing the meaningful motion of a person; "stretching an arm" and "raising a leg" are given as examples. Actions are defined as single-person activities that may be composed of multiple gestures organized temporally. To avoid confusion with the collective term "human actions," we call these single-person activities basic actions. Some basic actions repeat one or more gestures several times, such as "walk" and "run," whereas others are performed only once in a short period, for instance "sit down" and "stand up." An interaction is defined as a human activity that involves two or more persons and/or objects. As an example of a human interaction, "handshake" is performed by two persons synchronously, while "drink tea" refers to a human interacting with an object. Even more complex situations are group activities, interactions among more than two persons and objects, such as "playing football." In this thesis, we follow the definition of human actions in the survey [20]. Because the recognition of basic actions is a fundamental and still unsolved problem in human action recognition, the author first focuses on the recognition of basic actions, and then extends the proposed method to the recognition of human-human interactions.

Figure 1.1: Vicon motion capture system. (a) Marker setting; (b) joints.

Human action recognition is a challenging problem mainly for two reasons. The first is the large appearance variation in human actions, such as the variety of action classes, the different physiques of humans and the diversity of clothing styles and colors. Furthermore, camera-based action recognition needs to overcome difficulties brought by motion sensing, for instance occlusion, viewpoint changes, and scale variation of the video frame. In the past years, numerous algorithms have been proposed to solve the recognition problem, and several survey papers [19, 21–23] review these frameworks. Some researchers captured actions with markers that indicated the joints or limbs of persons, and some used more than one camera to obtain a 3D reconstruction of joints or body parts.

These studies recognized human actions under controlled conditions. Other studies designed their methods using video data captured outdoors, with scale variation and other natural conditions. In the following, we discuss the action recognition problem for videos in two situations: conditioned video data and unconditional video data.

Conditioned video data Conditioned video data is generally captured with special experimental equipment in a laboratory environment. A prominent approach attaches several markers to the human body and captures the motion information with cameras. In early theoretical work on the representation of three-dimensional actions, Marr et al. [24] proposed a body model consisting of a hierarchy of 3D cylindrical primitives. More recently, the CMU Graphics Lab has used a Vicon motion capture system to capture various human actions, and the lab provides a large action dataset that can be downloaded for free. The system takes infrared images of markers placed around the human body with 12 Vicon infrared MX-40 cameras5, and the captured images are triangulated to obtain 3D data of the human body. Figure 1.1 shows the marker setting in their capture system and a visualization of the captured data. Another way to obtain 3D motion information is to use multiple cameras. Some researchers capture human actions from different viewpoints with multiple cameras, e.g. [25] and [1], and the reconstructed 3D body joints or computed silhouettes are used for recognition. Figure 1.2 shows a multi-camera system which captures human motion from different viewpoints; the silhouettes are calculated from the multi-camera images in the MuHAVi dataset6.

Figure 1.2: Multi-camera captured human action in the MuHAVi dataset [1]. (a) Human actions captured from multiple viewpoints; (b) silhouettes of actions from different viewpoints.

5http://mocap.cs.cmu.edu/info.php

6http://dipersec.king.ac.uk/MuHAVi-MAS/

Conditioned video data is captured in restricted surroundings to satisfy the requirements of particular applications. Soon after the Kinect sensor camera was produced by Microsoft for computer games, researchers developed a much broader range of applications for it. The sensor contains an RGB camera, a depth sensor and a multi-array microphone running proprietary software. The camera captures an ordinary color image and a depth image, from which the human skeleton is calculated (see Figure 1.3). Kinect provides more information for action analysis, but it has limitations such as a restricted sensing area and support for skeleton tracking of only a few individuals.

Figure 1.3: Kinect sensor and its outputs. (a) Kinect sensor; (b) outputs of Kinect.

Based on the above considerations, we should note that the applications of conditioned video data are limited. Marker-based methods rely on heavy user interaction [19]: the user must wear special clothing with attached markers, which makes these methods unsuitable for general recognition tasks. Multi-camera motion capture fits laboratory surroundings rather than arbitrary outdoor scenes. Furthermore, locating body parts and estimating parametric body models from images remains an unresolved problem. These issues restrict the application fields of such data.

Unconditional video data Unconditional video data is captured in natural surroundings, without special equipment or restrictions on the location. The data may be captured indoors or outdoors, using ordinary cameras to record human motion, and the actors may wear clothing of any style and color. Unconditional videos include experimental data captured in relatively natural situations, e.g. surveillance in front of an entrance, sports broadcasts on TV, and even actions in movies. Without special control, we have to face the challenges of viewpoint, scale, and lighting variation, partial occlusion of humans and objects, cluttered backgrounds, etc. [26].

In early work, researchers extracted human silhouettes, contours, and the optical flow of motions from datasets with static backgrounds for action recognition. Yamato et al. [8, 27] used pixel statistics on extracted silhouette images as features to represent human actions, and Bobick et al. [9] integrated silhouettes over time into so-called motion history images (MHI) and motion energy images (MEI). Blank et al. [10, 28] directly made use of the space-time volume spanned by a silhouette sequence over time, and J. Rittscher et al. [29] extracted contours of human gestures to represent human actions. As many of the above approaches confirmed, silhouettes provide strong cues for action recognition and have the advantage of being insensitive to color, texture, and contrast changes. However, silhouette-based representations fail under self-occlusion and depend on a robust background segmentation [19].

Another efficient way to construct human motion models is based on optical flow.

The paper [12] proposed motion features obtained by accumulating flow magnitudes in a regular grid of non-overlapping bins over the detected optical flow. More recently, the studies [13, 30] employed a method which split the optical flow field into four different scalar fields (corresponding to the negative and positive, horizontal and vertical components of the flow) to represent human motions. Optical-flow-based methods do not rely on the background, so they can be used for action extraction in complex surroundings. Their disadvantage is that optical flow computation is sensitive to illumination changes and background motion.

Recently, more researchers have turned to the recognition of realistic human actions, and a series of unconditional action datasets has been published for experiments, e.g. KTH [31], Weizmann [10], YouTube video sequences [32], the UCF sports dataset [33], Hollywood movie data [34], and IXMAS [35]. To recognize realistic actions, various local features were proposed to describe human actions [15, 16, 36–40], using Bag of Words (BoW) to integrate the local features.

In addition, some systems arrange local features in special ways for action recognition [32, 41, 42]. These works showed that local features outperform previous methods for action representation. Since local features describe only local regions, they need to be arranged over a larger range to represent actions. Therefore, the quantity of local features, the feature calculation method, and the arrangement method greatly influence recognition performance.


1.3 Action recognition considering interaction

When discussing complex human action recognition, we must consider the recognition of actions that interact with realistic surroundings, with other persons or with objects.

According to the complexity of the actions and the interaction targets, the recognition of realistic actions can be discussed in three categories: interaction with objects, interaction with humans, and more complex interactions, e.g. group activities. In our work, human-human interaction is the main focus.

Human-human interaction comprises high-level human actions. It involves two persons and the actions occurring between them, sometimes involving objects. Some interactions happen between the two participants directly, such as "punch" and "hand shake," while other interactions involve objects as intermediaries to complete the activity, e.g. "hand over a book," "hand over a ball," or some other object. Recognizing these interactions is no longer as simple as recognizing single-person actions.

Various methods have been proposed to solve the interaction recognition problem. Considering that an interaction is composed of two persons' actions, interaction recognition has been realized by combining the results of single-person action recognition performed independently [17, 43–45]. The trajectories of the participants have also been analyzed for interaction detection, and interaction recognition has been realized by combining trajectory information with the actions of the persons [46]. Another method made use of the spatio-temporal relationships of local features and context features for interaction recognition [18]. Furthermore, group activity has also been analyzed based on single-person action recognition [47].

Among the existing methods, a frequently used approach for multi-person action recognition is to analyze the temporal composition of each single person's action. Using single-person actions is a natural way to recognize multi-person activities, and it is applied not only to human-human interaction recognition but also to group activity recognition. However, it is not always easy to separate a group clearly into single persons and to analyze each person's action independently, especially when occlusion exists between persons. Realistically, occlusion is an unavoidable problem in multi-person actions. Another open question in human-human interaction recognition is whether it is necessary to use all persons' actions for recognition; previous methods have not considered this problem.

1.4 Major contributions and structure of the thesis

In this thesis, the author addresses the problem of human action recognition from basic actions to complex, realistic actions. The contributions are explained in detail in the following.

At first, a local feature calculation method is proposed for human action representation. The FAST corner detector is extended to the spatio-temporal domain to extract feature points which represent the shape and motion information of human actions. For action description, a compact peak-kept histogram of oriented spatio-temporal gradients (CHOG3D) is proposed for local feature calculation. The proper dimension of CHOG3D is determined through parameter training.

The low computation cost of the compact descriptor is verified experimentally. Using an SVM as the classifier, the performance of the extended FAST detector and the CHOG3D descriptor is evaluated and compared with other detectors and descriptors. Chapter 3 introduces the calculation of the compact descriptor CHOG3D in detail.

Using the compact descriptor CHOG3D for action description, an action recognition system based on the self-organizing map (SOM) is proposed. Since the SOM provides a low-dimensional view of high-dimensional data samples and handles complex situations well, it is employed for action recognition with local features. By combining the SOM with the k-nearest neighbor (k-NN) classifier, the system recognizes actions in a simple way. In the system, human actions are described by the descriptor CHOG3D, and the SOM is adopted in feature training to extract keyword features from the local action features. After training, each neuron in the trained map is given an action label. Finally, k-NN is used to classify local features, and the statistics of the classified local features determine the action class of a test video. Experiments show that the proposed recognition system is both more accurate and faster than the recognition system with SVM; the experimental procedure is presented in Chapter 4.
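The recognition stage described above can be illustrated with the hedged sketch below: trained SOM node vectors are labeled by the training features they win, test features are classified by k-NN against the labeled nodes, and the video's class is chosen by a majority vote over its features. All function and variable names here are hypothetical; the actual system is detailed in Chapter 4.

```python
import numpy as np
from collections import Counter

def label_som_nodes(node_vectors, train_features, train_labels):
    """Assign each SOM node (rows of a NumPy array) the action label of the
    training features it wins most often; unused nodes keep the label None."""
    votes = [Counter() for _ in range(len(node_vectors))]
    for f, lab in zip(train_features, train_labels):
        best = np.argmin(np.linalg.norm(node_vectors - f, axis=1))
        votes[best][lab] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def classify_video(node_vectors, node_labels, test_features, k=5):
    """k-NN over the labeled SOM nodes, then a majority vote over the video's features."""
    labeled = [(vec, lab) for vec, lab in zip(node_vectors, node_labels) if lab is not None]
    vecs = np.array([v for v, _ in labeled])
    labs = [l for _, l in labeled]
    feature_votes = Counter()
    for f in test_features:
        nearest = np.argsort(np.linalg.norm(vecs - f, axis=1))[:k]
        winner = Counter(labs[i] for i in nearest).most_common(1)[0][0]
        feature_votes[winner] += 1
    return feature_votes.most_common(1)[0][0]
```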

In Chapter 5, the research is extended to realistic human action recognition, namely human interaction recognition. An efficient algorithm considering the contribution of the two participants is proposed, with local features adopted for action representation. Unlike previous algorithms that use both participants' actions, the proposed algorithm estimates the action contribution of the participants and chooses the actions which contribute to the interaction for recognition. The method is tested on two interaction datasets, and the recognition results show its effectiveness.

Finally, Chapter 6 summarizes the main contributions of the thesis and discusses possible future research directions.


Chapter 2

Related works and datasets

2.1 Related works

In this section, previous algorithms for action recognition and human interaction recognition are reviewed in the following two subsections. Several surveys [19–23, 48–53] describe recent progress on human action representation, tracking, classification, and related topics. Among them, J. K. Aggarwal et al. summarized methods of human action representation and recognition up to 1999 [21] and 2011 [20], respectively. Their papers cover not only the representation and recognition of basic human actions but also high-level human actions, e.g. human interactions and group activities. R. Poppe [23] and D. Weinland [19] provide detailed overviews of the action representation and classification methods proposed so far, while L. Wang [50], J. Candamo [53] and Hu [51] focus on action representation, tracking and analysis in the field of visual surveillance.

2.1.1 Basic action representation and recognition

Based on the methods for action representation, we introduce the methods for action recognition in three categories: human model based methods, spatio-temporal volume based methods and local feature based methods.


Figure 2.1: Moving light displays (MLD) representing human motions [2]

Human model based methods

Some studies constructed human models by analyzing human body parts, building a body structure on each frame of the observed video sequence; action recognition was then performed based on such models. Johansson [2] showed that humans can recognize actions merely from the motion of a few moving light displays (MLD) attached to the human body (Figure 2.1). These experiments inspired many approaches in action recognition. Figure 2.2 shows some body models constructed to represent human actions.

At the early research stage, commonly used representations were stick figures [3, 54] (Figure 2.2(a)) and 2D anatomical landmarks. Some other approaches worked directly on the trajectories of 2D/3D anatomical landmarks, e.g. head and hand trajectories [5, 55] (see Figure 2.2(c)), or recorded the 3D positions of the feet, hands and head per frame for action recognition [56]. Later, B. Yao and L. Fei-Fei [57] proposed a new random field model to encode the mutual context of objects and human poses in human-object interaction activities, where human body parts were located and represented by their spatial layout to analyze the relationship between different body parts and the object. Experiments showed that their model significantly outperformed other state-of-the-art methods.

A more complex representation for actions is the 3D human model. O'Rourke et al. [58] constructed an elaborate volumetric model consisting of 24 rigid segments and 25 joints. Marr et al. [4] proposed a human body model consisting of a hierarchy of 3D cylindrical primitives (see Figure 2.2(b)); such a model was later adopted by several more flexible body models. D. Ramanan and N. Ikizler et al. [59, 60] started from tracked patches in 2D and then lifted the 2D configurations into 3D for action recognition (Figure 2.2(d)). More recently, V. Ferrari et al. [61] and D. Ramanan [62] investigated the localization of body parts in movies by annotating 2D action information onto 3D sequences in a video (see Figure 2.2(d)). Agarwal [6] and Urtasun [7] constructed more efficient 3D human pose models using strong prior knowledge (Figure 2.2(e)).

Although human body models seem to represent human actions accurately and conveniently, we should note that locating body parts and estimating parametric body models from images remains an unresolved problem, independent of the model used (2D or 3D) [19]. Markers or special prior conditions can make model construction easier, but they limit the applicability of these methods to realistic situations.

Spatio-temporal volume based methods

Spatio-temporal volume based methods use image information such as gradients, edges, silhouettes, and optical flow to represent human actions in a spatio-temporal volume.

Some studies used global feature volumes, such as silhouettes and contours, in the spatio-temporal space for human action recognition. A typical method using silhouette sequences to represent actions was proposed by J. Yamato et al. [8] in early action recognition research. In their method, the silhouettes of a tennis player were extracted, and the ratio of foreground to background pixels within each small grid cell of the time-sequential images was calculated as the action feature (see Figure 2.3). These features were then learned by an HMM to obtain a vocabulary of tennis actions, and actions were recognized by comparing the test action with the trained vocabulary.
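A minimal sketch of this kind of mesh feature, assuming binary silhouette masks and an illustrative grid size, could look as follows; the original work then feeds such per-frame vectors to an HMM.

```python
import numpy as np

def mesh_features(silhouette, grid=(8, 8)):
    """Per-cell ratio of foreground (silhouette) pixels over a regular grid."""
    h, w = silhouette.shape
    gh, gw = grid
    feats = np.zeros(grid)
    for i in range(gh):
        for j in range(gw):
            cell = silhouette[i * h // gh:(i + 1) * h // gh,
                              j * w // gw:(j + 1) * w // gw]
            feats[i, j] = cell.mean()        # fraction of foreground pixels (0/1 mask)
    return feats.ravel()                     # one feature vector per frame
```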

Figure 2.2: Action representation by models. (a) A 2-D stick-figure model fleshed out with ribbons [3]; (b) hierarchical 3D model based on cylindrical primitives [4]; (c) 2D marker trajectories [5]; (d) annotation body model [6]; (e) 3D skeleton of human action [7].

Aaron F. Bobick et al. [9] proposed two temporal templates: a binary motion-energy image (MEI), which represents where motion has occurred in an image sequence, and a motion-history image (MHI), a scalar-valued image whose intensity is a function of the recency of motion (see Figure 2.4). The MEI and MHI were considered as two components of a temporal template, a vector-valued image in which each component of each pixel is some function of the motion at that pixel location. These templates were compared with the temporal templates of stored instances of views of known actions for action recognition.
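A minimal sketch of the MEI/MHI computation described above, assuming binary motion (frame-difference) masks as input and an illustrative duration parameter tau:

```python
import numpy as np

def mei_mhi(motion_masks, tau=30):
    """motion_masks: iterable of binary images D(x, y, t) marking motion pixels.
    MHI intensity encodes the recency of motion; MEI is the union of motion."""
    mhi = None
    for d in motion_masks:
        d = d.astype(bool)
        if mhi is None:
            mhi = np.zeros(d.shape, dtype=float)
        mhi = np.where(d, tau, np.maximum(mhi - 1, 0))   # recent motion -> tau, else decay
    mei = mhi > 0                                        # motion-energy image
    return mei, mhi / tau                                # MHI normalized to [0, 1]
```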

Instead of using statistics of silhouettes over time, M. Blank [10] developed a method which extracted volumetric silhouettes of human actions and computed various shape properties, such as local saliency, action dynamics, shape structure and orientation (Figure 2.5). These features were used directly for shape representation and classification with a nearest-neighbor procedure.


Figure 2.3: Silhouettes for tennis action recognition [8]

Figure 2.4: Silhouette of image sequence for the calculation of motion-energy image (MEI) and motion-history image (MHI) [9]

Figure 2.5: Silhouettes used for action feature extraction [10]. (a) Space-time shapes; (b) the solution to the Poisson equation on space-time shapes; (c) examples of local space-time saliency features.

Furthermore, block volumes of images in the spatio-temporal space have been employed for action representation. Ke et al. [11] proposed a technique which matched the volumetric representation of an event against over-segmented spatio-temporal video volumes for event recognition in crowded videos (Figure 2.6). Without figure/ground separation, they demonstrated shape matching in cluttered scenes with dynamic backgrounds. Volumetric and optical flow features were then matched to action templates in the form of space-time shapes. By breaking the action template into parts, the work reliably identified actions in the presence of partial occlusion and background clutter.

Figure 2.6: Event recognition using over-segmented spatio-temporal volumes [11].

Figure 2.7: Grids of optical flow magnitude for human action recognition [12].

Although optical flow is extracted locally, some methods applied it to represent global human actions. An early method was proposed in [12], where flow magnitudes were accumulated in a regular grid of non-overlapping bins for action representation, as shown in Figure 2.7; the test action was determined by matching against reference motion templates of known actions. Later, Efros et al. [13] proposed a method which split the optical flow field into four different scalar fields (corresponding to the negative and positive, horizontal and vertical components of the flow), as shown in Figure 2.8, which were matched separately during the recognition procedure.

Figure 2.8: Optical flow split into directional components for action recognition [13]. (a) Original image; (b) optical flow; (c) separation of the optical flow into x and y components; (d) half-wave rectification of each component.
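A minimal sketch of the channel splitting used in [13], assuming a dense flow field is already available from some optical flow routine (the original method additionally blurs and aggregates these channels):

```python
import numpy as np

def flow_channels(flow):
    """flow: array of shape (H, W, 2) with horizontal (u) and vertical (v) components.
    Returns the four half-wave rectified channels Fx+, Fx-, Fy+, Fy-."""
    u, v = flow[..., 0], flow[..., 1]
    fx_pos, fx_neg = np.maximum(u, 0), np.maximum(-u, 0)
    fy_pos, fy_neg = np.maximum(v, 0), np.maximum(-v, 0)
    return fx_pos, fx_neg, fy_pos, fy_neg
```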

As the above examples show, silhouettes provide strong cues for action recognition if human actions can be represented correctly by silhouettes. However, silhouette-based representations depend on a robust foreground segmentation, which is itself still a difficult problem, and silhouettes cannot represent human actions correctly under occlusion or from unfavorable viewpoints. These limitations make silhouettes a poor choice for general action recognition. Optical-flow-based representations do not depend on background subtraction, which makes them more practical than silhouettes in many settings, but optical flow estimation is easily affected by lighting changes, camera motion, image noise, and so on.

Local feature based methods

The use of local features for human action recognition is inspired by object detection and recognition. The motivation behind local-feature-based approaches is the fact that a 3D space-time volume is essentially a rigid 3D object. This implies that if a system is able to extract appropriate features describing the characteristics of each action's 3D volume, the action can be recognized by solving an object-matching problem [20]. Local-feature-based methods involve a three-step procedure: feature point detection, descriptor calculation for the detected points, and recognition using the features. In the following, we discuss the related methods of feature calculation.

Detectors and descriptors During the past decades, various local feature detectors and descriptors have been proposed for human action recognition. Among them, several 3D volume methods have proven efficient and are widely used by other researchers for human action representation.

Laptev et al. [14] were the first to propose a 3D feature detector based on a spatio-temporal extension of the Harris corner criterion [64], which is commonly used for object recognition. They proposed Space-Time Interest Points (STIP) for a sparse representation of human actions, extending the 2D point detector proposed by Harris [64] to detect interest points in a space-time volume; this detector is commonly referred to as Harris3D in recent research. Results of detecting Harris interest points in an outdoor image sequence of a walking person are illustrated in Figure 2.9(a); these features have been used to distinguish the walking person from complex backgrounds. Following that, C. Schuldt [36] first adopted STIP for interest point extraction. To describe these points, they developed and compared different descriptor types: single- and multi-scale higher-order derivatives (local jets), histograms of optical flow, and histograms of spatio-temporal gradients. These descriptors describe the surrounding motion and appearance at a given STIP position. Finally, an SVM was employed for action recognition with the local features. In their experiments, Laptev et al. reported the best results for descriptors based on histograms of optical flow and spatio-temporal gradients. For their experiments, a new database, the KTH action dataset, was published; it is discussed in Section 2.2.

Later, I. Laptev et al. [34] introduced the descriptors histograms of oriented spatial gradients (HoG) and histograms of optical flow (HoF) to characterize the local motion and appearance of human actions. The histograms are accumulated in the space-time neighborhood of detected interest points. Each local region is subdivided into an N × N × M grid of cells; for each cell, a 4-bin HoG histogram and a 5-bin HoF histogram are computed. In their framework, the descriptors HoF and HoG were combined to represent actions; the two descriptors have also been used independently in other work [65].
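As an illustration of the HoF part of this layout, the sketch below computes a 5-bin cell histogram with four orientation bins plus one bin for near-zero flow, which is one common reading of the 5-bin design; the exact binning and thresholds in [34] may differ.

```python
import numpy as np

def hof_histogram(flow_cell, n_orient=4, min_flow=1.0):
    """5-bin HoF for one cell of the N x N x M grid: four orientation bins plus
    one bin counting pixels with negligible flow (illustrative layout)."""
    u, v = flow_cell[..., 0].ravel(), flow_cell[..., 1].ravel()
    mag = np.hypot(u, v)
    hist = np.zeros(n_orient + 1)
    still = mag < min_flow
    hist[n_orient] = still.sum()                       # "no motion" bin
    ang = np.arctan2(v[~still], u[~still]) % (2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi) * n_orient).astype(int), n_orient - 1)
    np.add.at(hist, bins, mag[~still])                 # magnitude-weighted votes
    return hist
```

The per-cell HoG and HoF histograms are then concatenated over the grid into one descriptor vector per interest point.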

Dollar et al. [15] proposed a new spatio-temporal feature detector for action recognition.

Their detector is especially designed to extract space-time points of local periodic motion, obtaining a sparse distribution of interest points from a video. After the feature points are detected, their system associates a small 3D volume called a "cuboid" with each feature point (see Figure 2.9(b)). A vector of brightness gradients in the cuboid was chosen to describe the feature points. The BoW paradigm was used to model each action as a histogram of the cuboid types detected in the 3D space-time volume while ignoring their locations. They recognized facial expressions, mouse behaviors, and human activities using their method. The same local feature calculation method was employed by J. C. Niebles [66] in an unsupervised learning method for human action categories, where a probabilistic Latent Semantic Analysis (pLSA) model was used with the local features to classify action categories.

Figure 2.9: Local features. (a) Spatio-temporal interest points from the motion of the legs of a walking person [14]; (b) visualization of cuboid based behavior recognition [15].
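The BoW step mentioned above can be sketched as follows, assuming SciPy's k-means for the codebook; the vocabulary size and the clustering choice are illustrative, not those of [15].

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_codebook(train_descriptors, k=500):
    """k-means codebook over local descriptors pooled from the training videos."""
    data = np.vstack(train_descriptors).astype(float)
    codebook, _ = kmeans2(data, k, minit='points')
    return codebook

def bow_histogram(video_descriptors, codebook):
    """Represent one video as a normalized histogram of visual-word occurrences,
    ignoring where in the space-time volume each local feature was detected."""
    words, _ = vq(np.asarray(video_descriptors, dtype=float), codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```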

Scovanner et al. [40] extended the image SIFT descriptor [37] to a 3D version for action recognition. Using sub-histograms to encode local time and space information allows 3D SIFT to generalize the spatio-temporal information better than the features in previous works.

For recognition, BoW was used to integrate the features in videos, and a method to discover the relationships between spatio-temporal words was presented in order to better describe the video data.

Willems et al. [16] extended the Hessian saliency measure used for blob detection in images [67] to the spatio-temporal domain, yielding the Hessian3D detector (Figure 2.10). The method aims to detect spatio-temporal interest points that are scale-invariant (both spatially and temporally) and that densely cover the video content. An integral video structure makes the computation fast by approximating derivatives with box-filter operations. Furthermore, the Extended SURF (ESURF) descriptor was proposed by extending the SURF descriptor [39] to 3D to describe the information around the feature points. The authors divided 3D patches into a grid of local cells, and each cell was represented by a vector of weighted sums of responses of Haar wavelets sampled uniformly along the x, y and t axes.

Figure 2.10: Scale-invariant spatio-temporal interest points (Hessian3D) [16].

Using gradient information, Kläser proposed a descriptor named histograms of oriented 3D spatio-temporal gradients (HOG3D) [68]. In this method, the gradient is projected onto 20 directions, which are the normal vectors of the faces of an icosahedron. After projecting the mean gradient in the neighborhood of a feature point, the quantized gradients are combined into one histogram, which forms the descriptor of the feature point. To extract feature points, a dense sampling method, which provides a very dense distribution of interest points, was used.
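A hedged sketch of this quantization idea: the 20 projection directions can be generated as the vertex directions of a regular dodecahedron, which point toward the face centers of an icosahedron. The thresholding of weak responses and the cell aggregation of the full HOG3D pipeline are omitted here.

```python
import numpy as np

def icosahedron_face_normals():
    """20 unit vectors toward the face centers of a regular icosahedron
    (equivalently, the vertices of a regular dodecahedron)."""
    phi = (1 + np.sqrt(5)) / 2
    a, b = 1 / phi, phi
    dirs = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
    for s1 in (-1, 1):
        for s2 in (-1, 1):
            dirs += [(0, s1 * a, s2 * b), (s1 * a, s2 * b, 0), (s2 * b, 0, s1 * a)]
    dirs = np.array(dirs, dtype=float)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def quantize_gradient(mean_gradient, normals):
    """Project a mean spatio-temporal gradient (gx, gy, gt) onto the 20 directions;
    negative responses are clipped."""
    q = normals @ np.asarray(mean_gradient, dtype=float)
    return np.maximum(q, 0.0)
```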

Finally, Wang [65] evaluated currently used local feature detectors and descriptors by classifying human activities with an SVM. With the exception of SIFT3D, the methods introduced above were included in their evaluation. Based on their results, the combination of the Harris3D detector with the HoF descriptor, Harris3D with HoG/HoF, and the Cuboids detector with HOG3D performed better than the other algorithms. Following that, Kläser obtained an even better recognition result with the combination of Harris3D and HOG3D on the KTH dataset in his thesis [26]. As these results indicate, the descriptor HOG3D performed best for recognizing human actions in different surroundings.

However, HOG3D employs a complex procedure to calculate the gradient, and it uses a vector with more than 1000 elements to describe one feature point, which is a very large length for a local feature description. In addition, a dense sampling detector was used in their algorithm. The calculation cost of HOG3D is therefore high, and the length of the descriptor makes it more difficult to distinguish two vectors correctly.

A practical advantage of interest point approaches is that the feature points can be detected directly from video frames without any pre-processing. Another important advantage is that a local feature represents local region information, so it is robust against occlusion and translation in most cases. In some methods, descriptors are calculated at multiple scales [16, 34, 68], making the local features robust against scale changes as well. The detected interest points show some consistency across similar observations, but they may also include outliers.

On the downside, the detected features are usually unordered and of variable size; consequently, modeling geometrical and temporal structure with local features is difficult. Many approaches therefore pool local features into feature occurrence histograms for action representation, which describe sequences simply.

The processing of local features in action recognition is not limited to producing occurrence histograms; various other methods have been proposed in recent years. Gilbert et al. [42] introduced a hierarchical combination of features along with an efficient data mining technique to recognize actions. Tsz-Ho Yu et al. [69] presented a real-time solution that utilizes local appearance and structural information for action representation, with a kernel k-means forest classifier using pyramidal spatio-temporal relationship matching (PSRM) for classification. A. Yao [70] presented a method using a Hough transform voting framework to classify and localize human actions in video. Other methods employ SOM to project a large quantity of features onto a lower-dimensional feature set for action recognition.

2.1.2 Human-human interaction

In realistic surroundings, humans usually interact with objects, with other humans, or within a group activity. Recognizing these realistic actions is another challenge in human action recognition.

Human-human interaction refers to the actions which occur between two persons, who are here called "participants." Their actions combine to express one action meaning, so we refer to the action pair as an action category. Several methods have been proposed to recognize human interactions.

Since an interaction consists of the actions of two persons, it is natural to separate the two persons and analyze each single-person action for recognition. The earliest such method was presented by S. Park and J. K. Aggarwal [17], who used a hierarchical graphical model that unifies multiple levels of processing, from the pixel level to the event level, to represent human actions and to recognize interactions between participants (see Figure 2.11). By temporally analyzing the actions of the modeled body parts of the two participants, they recognized interactions with a dynamic Bayesian network (DBN). The framework provided a bottom-up processing in which the actions in an interaction were analyzed individually.

Figure 2.11: A hierarchical representation of the interaction "punching" [17].

Y. Du et al. [46] proposed a hierarchical durational-state dynamic Bayesian network (HDS-DBN) to represent and recognize human interactions. In their method, they extracted global and local features to represent interactions: the positions and velocities of each person/object were recorded as global features, while the bounding box surrounding the human region and the angle of inclination of the human body were used as local features. The HDS-DBN modeled the global and local features temporally. The method was tested on simple interactions, such as two persons running in opposite directions, two persons walking, meeting, chatting, and so on.

Similar to S. Park's method, Ryoo and J. K. Aggarwal [43, 44] introduced a context-free grammar (CFG) based representation scheme to represent composite actions and interactions.

Their method extracted the poses and gestures of human actions in the image sequences and then applied hierarchical processing to the multi-layer features. Bayesian networks were used to implement the pose layer, and hidden Markov models (HMMs) were implemented for the gesture layer. At the highest layer, actions and interactions were represented semantically using the CFG, and by following its production rules the system recognized composite actions and interactions.

For the "High-level Human Interaction Recognition Challenge" at ICPR 2010 [71], Waltisberg et al. [45] adopted a pedestrian detection algorithm to obtain individual action information for human interaction recognition. They modeled each actor's action using spatio-temporal voting with extracted local XYT features, forming a hierarchical system with two levels, the single-person level and the interaction level. Their algorithm showed better results than the other entries.

Figure 2.12: Spatio-temporal relationship matching process [18]. (a) Video volumes; (b) spatio-temporal feature points; (c) relationships in histograms.

Since an interaction comprises the actions of two participants, it is natural to process each person's action individually. Most previous works chose to separate the two participants, and this has been shown to benefit interaction recognition.

In contrast, some algorithms treat the interaction of the two participants as a single action category for recognition. M. S. Ryoo [18] proposed a spatio-temporal relationship matching approach which constructs spatio-temporal relationships among local features extracted from interaction videos (see Figure 2.12). Their method measures the structural similarity between two videos by computing pairwise spatio-temporal relations among local features (e.g., before and during), enabling the detection and localization of activities with complex structure. Their system not only classified simple actions (i.e., those from the KTH dataset [31]) but also recognized interaction-level activities (e.g., the UT interaction dataset [71]) from continuous videos.
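As a simple illustration of the kind of pairwise temporal predicate involved, the sketch below classifies the relation between the temporal extents of two local features; the actual predicate set and matching procedure of [18] are more elaborate.

```python
def temporal_relation(a, b):
    """a, b: (t_start, t_end) temporal extents of two local features.
    Returns a coarse Allen-style relation such as "before" or "during"."""
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a0 >= b0 and a1 <= b1:
        return "during"
    if b0 >= a0 and b1 <= a1:
        return "contains"
    return "overlaps"
```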

Recently, Ryoo [72] proposed an algorithm for activity prediction which represents an activity as an integral histogram of spatio-temporal features, efficiently modeling how the feature distribution changes over time. The algorithm was applied both to simple actions and to human interactions (e.g., the UT interaction dataset [71]). T. Yu et al. [69] applied semantic texton forests (STFs) to local space-time volumes as a powerful discriminative codebook for realistic action recognition, and tested their method on human interaction recognition (UT interaction dataset [71]), where the actions of the two participants in an interaction were treated as one action category.

These algorithms ignore the difference between the two participants' actions in an interaction category, regarding them as one atomic action, which limits their performance. Because the processing does not distinguish the two persons, such methods are also difficult to extend to more complex action situations.

2.2 Datasets

In this section, we introduce the datasets used for the experimental studies in the following chapters. Since our work mainly concentrates on basic human action recognition and human interaction recognition, we introduce datasets in these two categories.

2.2.1 Basic human action

Among the datasets presented for human action recognition, the most frequently used ones for basic human actions are the KTH dataset [31] and the Weizmann dataset [10]. In recent years, more realistic human action datasets have been introduced, such as the UCF sports dataset [33], the YouTube dataset [32] and Hollywood [34]. The UCF sports dataset collects sports videos from TV, the YouTube dataset contains realistic human actions from the web, and the videos in the Hollywood dataset come from Hollywood movies. In our research, the first three datasets are selected to test our methods, since they are the most frequently used in other studies. We therefore introduce the KTH, Weizmann and UCF sports datasets in detail.


Figure 2.13: Samples for different action classes (columns) in four scenarios (rows) in the KTH action dataset.

KTH action dataset

The KTH action dataset1 was introduced by Schuldt et al. in 2004. It contains six classes of human actions: walk, jog, run, box, hand wave, and hand clap. Each action is performed 4 or 6 times by 25 persons. The video sequences are recorded in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. In total, the dataset consists of 600 video samples. Each video generally contains more than 300 frames at a spatial resolution of 160 × 120. The background is homogeneous and static in most sequences, and the scale varies. Sample frames from the dataset are shown in Figure 2.13.

Weizmann action dataset

The Weizmann dataset was presented by M. Blank et al. [10] in 2005. Nine different action classes in the Weizmann dataset2are used in our experiments: run, walk, skip, jump-jack(jack), jump forward, jump in place(pjump), gallop sideways(side), bend and hand wave. In the dataset, each action class is performed once by nine subjects. As a result, total 81 video samples are used in our experiments, and each video is composed of near 100 frames with the spatial resolution

1http://www.nada.kth.se/cvap/actions/

2http://www.wisdom.weizmann.ac.il/ vision/SpaceTimeActions.html

(38)

Figure 2.14: Action samples for different action classes in the Weizmann action dataset.

Figure 2.15: Action samples for different action classes in the UCF sports action dataset.

Similar to the KTH dataset, the background in the videos is homogeneous and static. Some sample frames of the Weizmann dataset are shown in Figure 2.14.

UCF sport action dataset

The UCF sports dataset3 consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. It contains close to 200 video sequences at a resolution of 720×480 pixels. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. It contains nine different human action categories: diving, golf swinging, kicking, lifting, horse riding, running, skating, swing (consisting of bench swing on the pommel horse and on the floor, and side swing at the high bar) and walking.

3http://www.cs.ucf.edu/vision/public_html/


Figure 2.16: Interaction samples in the UT interaction dataset.

Each action category is performed by several subjects. Sample frames of the action categories used in our experiments are shown in Figure 2.15.

2.2.2 Human-human interaction

There are not many datasets for human-human interaction. The most frequently used one is the UT interaction dataset, which was first introduced by M. S. Ryoo et al. [18]. It was then used as a public dataset in the Contest on Semantic Description of Human Activities (SDHA), 2010.

To perform experiments on more interaction categories, we captured additional interaction data ourselves, which we name the LIMU interaction dataset.

UT interaction dataset

The UT dataset4 contains videos of six categories of human-human interactions: shake hands, point, hug, push, kick and punch. The videos in the dataset were captured in two surroundings, so they are separated into two sets. Set one was taken in a parking lot; its videos were captured with slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. Set two was taken on a lawn on a windy day; the background moves slightly (e.g., trees move), and the videos contain more camera jitter. The videos were captured at a resolution of 720×480 at 30 fps, and the height of a person in a video is about 200 pixels.

In these two sets, each interaction category is performed by ten pairs of participants.

4http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html


Figure 2.17: Interaction samples in the LIMU interaction dataset.

The interaction categories are performed by actors with more than 15 different clothing conditions in the two environments. The dataset provides both continuous executions of the 6 interactions and short videos cropped from the continuous ones. Each cropped video contains only one interaction execution, giving a total of 120 cropped video samples over the two sets.

Interaction samples of all the interaction categories in the two sets are shown in Figure 2.16.

LIMU interaction dataset

The LIMU interaction dataset5 was captured by the author in an indoor scenario. Compared with the UT interaction dataset, three additional interaction categories are included. The dataset contains nine interaction categories: hand clap, hand shake, hug, hand over, kick, pull, punch, push and touch shoulder. Each category is acted by 10 pairs of participants, giving a total of 90 interaction samples, each saved as a video containing one interaction sample. The videos are captured at a resolution of 640×480 pixels, with about 120 frames per video. Some frame samples of the LIMU interaction dataset are shown in Figure 2.17.

5http://limu.ait.kyushu-u.ac.jp/dataset/en/interaction_dataset.html


Chapter 3

A compact descriptor CHOG3D for human action recognition

As introduced in Chapter 2, local features have several advantages: they are robust against occlusion, scale change, and translation in most cases. Moreover, feature points are detected directly from video frames, requiring no pre-processing. In recent years, local features have been widely applied to human action detection, localization, recognition, and so on.

In this chapter, a new method is introduced to calculate local features for human action recognition. In the method, the FAST corner detector is extended from the spatial domain to the spatio-temporal domain to extract the shape and motion information of human actions. Then, based on the descriptor HOG3D [68], a compact peak-kept histogram of oriented spatio-temporal gradients (CHOG3D) is proposed to calculate local features around the detected feature points.

The optimal parameters of the descriptor CHOG3D are determined by training, and the proposed feature calculation method is applied to action recognition with an SVM and tested on two datasets (the KTH and Weizmann datasets).


3.1 Introduction

A series of local feature calculation methods have been presented for human action recognition. Researchers proposed spatio-temporal detectors to extract feature points which indicate the shape and motion information of human actions (e.g., the Harris3D detector [14], the Gabor detector [15], the Hessian3D detector [16] and dense sampling [68]). Local features are then calculated using descriptors in spatio-temporal volumes around the detected feature points, which represent human actions in a local volume. The frequently used descriptors include Cuboids [15], HoG/HoF [34], SIFT3D [40], ESURF [16] and HOG3D [68]. Wang [65] gave an evaluation of currently used feature point detectors and feature descriptors by classifying human activities via BoW and SVM. Kläser [26] updated some evaluation results of local features using the same method as Wang [65]. Both evaluations were performed on the KTH, UCF sports and Hollywood2 datasets. As a result of their evaluations, the descriptor HOG3D performed best for human action recognition in different surroundings.
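For reference, the evaluation protocol mentioned above (a bag-of-words encoding of local features followed by an SVM) can be sketched as follows. This is only an illustrative sketch: the use of scikit-learn, the vocabulary size of 1000 and the RBF kernel are assumptions made here, whereas the cited evaluations typically use a χ²-kernel SVM.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_bow_histograms(descriptors_per_video, n_words=1000, seed=0):
    """Cluster all local descriptors into a visual vocabulary, then represent
    each video by a normalized histogram of visual-word occurrences."""
    vocab = KMeans(n_clusters=n_words, random_state=seed).fit(
        np.vstack(descriptors_per_video))
    hists = []
    for d in descriptors_per_video:
        words = vocab.predict(d)
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return np.array(hists), vocab

# Classification: fit a non-linear SVM on the training histograms and predict
# the action class of each test video.
# clf = SVC(kernel='rbf', C=10.0).fit(train_hists, train_labels)
# predicted = clf.predict(test_hists)
```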

In the calculation of HOG3D [68], dense sampling was used to obtain feature points.

According to the evaluations provided by Wang [65] and Kläser [26], the combinations of descriptors with the dense detector achieved better recognition results than other combinations on the UCF sports and Hollywood datasets. This indicates that the dense detector performs better than sparse detectors on these two datasets. However, on the KTH dataset, the dense detector performs worse than the sparse Harris3D detector. That may be because in the KTH videos the size of human actions is small and the actions repeat several times, so sparse feature points can represent the actions correctly. In contrast, in the UCF sports and Hollywood [34] datasets, the size of actions is large and the actions do not repeat, so sparse feature points are not sufficient and dense feature points are necessary to represent the actions correctly. However, as [68] showed, dense sampling detects a large quantity of feature points on each frame, for instance, 643 feature points per frame in the Hollywood dataset. This increases the computation cost greatly, to the point that it may not be feasible on a general computer. In this chapter, we extend the FAST detector [38] spatio-temporally to detect feature points which are neither too sparse nor too dense, representing actions correctly with a lower computation cost.


Figure 3.1: Overview of HOG3D calculation procedure.

The FAST detector compares the intensity of a corner candidate with the intensities of a circle of sixteen pixels around the candidate. If a set of contiguous pixels in the circle are all darker or all brighter than the intensity of the candidate pixel by more than a threshold, the candidate pixel is detected as a corner.
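A minimal sketch of this segment test is given below, assuming a grayscale frame stored as a 2D numpy array; the radius-3 circle of sixteen pixels is standard, while the contiguity requirement of nine pixels follows the common FAST-9 setting and is an assumption here rather than a parameter taken from this thesis.

```python
import numpy as np

# Offsets (dx, dy) of the sixteen pixels on a Bresenham circle of radius 3.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, y, x, threshold=20, n_contiguous=9):
    """Segment test: (y, x) is a corner if n_contiguous circle pixels are all
    brighter than I(y, x) + threshold or all darker than I(y, x) - threshold.
    The caller must keep (y, x) at least 3 pixels away from the image border."""
    center = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    ring = ring + ring                      # duplicate so runs can wrap around
    for test in (lambda v: v > center + threshold,
                 lambda v: v < center - threshold):
        run = 0
        for v in ring:
            run = run + 1 if test(v) else 0
            if run >= n_contiguous:
                return True
    return False
```

In practice an optimized implementation such as OpenCV's FastFeatureDetector would be used instead of this per-pixel test.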

The descriptor HOG3D employed a complex procedure to calculate local features, as Figure 3.1 shows. It was calculated in a support region around a feature point, and the support region was divided into a grid of cells (Figure 3.1(a)). Each cell was represented by one gradient orientation histogram, and the descriptor of one feature point was the combination of the histograms of all cells; the number of cells and the dimension of the gradient histogram determined the dimension of the descriptor. Each cell contained a grid of mean gradients. These mean gradients were quantized using regular polyhedrons, and large values in the quantized gradient vectors were kept by setting a threshold (Figure 3.1(c)). Then a gradient histogram was calculated by summing the quantized gradients in one cell (Figure 3.1(b)). The mean gradients for descriptor calculation were computed using integral videos (Figure 3.1(d)). The descriptor used a vector with more than 1000 elements to describe one feature point in the KTH dataset, which is a huge length for a local feature. In the process of gradient orientation quantization, it is not easy to determine a proper threshold. Besides, since the descriptor employs a complex gradient calculation method and calculates descriptors in a large support region, the computation cost of HOG3D is high, and the length of the descriptor increases the difficulty of distinguishing two vectors correctly.
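As a reference for the integral-video step mentioned above, the following is a minimal sketch of how box sums (and hence mean gradients over a cell) can be obtained in constant time; the (T, H, W) array layout is an assumption made here for illustration.

```python
import numpy as np

def integral_video(v):
    """3D integral volume: S[t, y, x] holds the sum of v over the half-open box
    [0, t) x [0, y) x [0, x); a leading zero plane on each axis simplifies lookups."""
    s = v.astype(np.float64).cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return np.pad(s, ((1, 0), (1, 0), (1, 0)))

def box_sum(S, t0, t1, y0, y1, x0, x1):
    """Sum of the original volume over [t0, t1) x [y0, y1) x [x0, x1) from
    eight lookups in the integral volume (3D inclusion-exclusion)."""
    return (S[t1, y1, x1] - S[t0, y1, x1] - S[t1, y0, x1] - S[t1, y1, x0]
            + S[t0, y0, x1] + S[t0, y1, x0] + S[t1, y0, x0] - S[t0, y0, x0])

# The mean gradient of a cell is box_sum of the gradient volume divided by the
# number of voxels in the cell; one integral volume is built per gradient component.
```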


Therefore, it is necessary to propose a compact descriptor.

In this chapter, a new method is proposed to calculate local features for human action recognition. In the method, the FAST detector [38] is extended to the spatio-temporal space to detect feature points from videos. Then a compact descriptor is proposed which represents local features with a compact peak-kept histogram of oriented spatio-temporal gradients (CHOG3D). Compared with HOG3D, CHOG3D is calculated in a small support region around feature points.

It employs the first-order gradient in the spatial and temporal orientations. In addition, it keeps the peak value of the orientation-quantized gradient so that the descriptor CHOG3D can be distinguished more correctly. These modifications greatly reduce the computation and make the descriptor CHOG3D easier to distinguish. By parameter training, the optimal parameters of CHOG3D are determined. In addition, the peak-kept and threshold-setting methods are compared on action recognition to certify the efficiency of the peak-kept method. Finally, the computation cost and action recognition performance of the descriptor CHOG3D are compared with those of the descriptor HOG3D. Furthermore, the detectors Harris3D, FAST and dense sampling and the descriptors HOG3D and CHOG3D are evaluated to emphasize the contribution of the FAST detector and the CHOG3D descriptor in our algorithm.
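To make the contrast between threshold-based quantization (as in HOG3D) and the peak-kept quantization used here concrete, a hedged sketch is given below. It assumes that the quantization directions are the twenty face normals of a regular icosahedron and that keeping the peak means retaining only the single strongest projection of each gradient; the exact polyhedron, threshold and peak rule used in this thesis may differ.

```python
import numpy as np

PHI = (1 + np.sqrt(5)) / 2
# Quantization directions: the 20 face normals of a regular icosahedron
# (equivalently, the vertex directions of a dodecahedron).
_dirs = ([(sx, sy, sz) for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
         + [(0, sy / PHI, sz * PHI) for sy in (-1, 1) for sz in (-1, 1)]
         + [(sx / PHI, sy * PHI, 0) for sx in (-1, 1) for sy in (-1, 1)]
         + [(sx * PHI, 0, sz / PHI) for sx in (-1, 1) for sz in (-1, 1)])
P = np.array(_dirs, dtype=float)
P /= np.linalg.norm(P, axis=1, keepdims=True)       # 20 x 3 unit vectors

def quantize_threshold(g, cutoff=0.25):
    """HOG3D-style quantization: project the spatio-temporal gradient
    g = (gx, gy, gt) onto all directions and suppress weak projections,
    so several bins may remain non-zero."""
    g = np.asarray(g, dtype=float)
    mag = np.linalg.norm(g)
    if mag == 0:
        return np.zeros(len(P))
    q = P @ (g / mag)                # cosine similarity to each direction
    q[q < cutoff] = 0.0
    return q * mag

def quantize_peak(g):
    """Peak-kept quantization: only the single dominant direction survives,
    weighted by the gradient magnitude."""
    g = np.asarray(g, dtype=float)
    mag = np.linalg.norm(g)
    out = np.zeros(len(P))
    if mag > 0:
        out[np.argmax(P @ (g / mag))] = mag
    return out

# A cell histogram is obtained by summing the quantized vectors of all the
# gradients falling inside that cell, e.g.:
#   hist = sum(quantize_peak(g) for g in cell_gradients)
```

The peak-kept variant avoids the need to choose a quantization threshold, which, as noted above, is hard to determine for HOG3D.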

3.2 Feature point detection by extended FAST

To represent actions correctly with a low computation cost, FAST [38] is extended to the spatio-temporal space for spatio-temporal feature point detection in this section. If x, y represent the axes of the spatial plane and t is the temporal axis, a video can be regarded as a spatio-temporal space with axes x, y and t, as shown in Figure 3.2(a). Each frame in the video corresponds to an xy plane, so we define the set of frames to be the xy channel. When we arrange the video as a series of tangent planes in the horizontal orientation (as shown in Figure 3.2(b)), we obtain a sequence of xt planes, which forms the xt channel. Similarly, the yt channel is composed of yt planes, which are a series of tangent planes in the vertical orientation of the video. The spatio-temporal feature points in videos can be obtained by detecting feature points in the xy, xt and yt planes.
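A minimal sketch of how the three channels can be obtained from a video volume is shown below, assuming the video is stored as a grayscale numpy array with shape (T, H, W).

```python
import numpy as np

def spatio_temporal_channels(video):
    """Given a (T, H, W) grayscale video volume, return the planes forming the
    xy, xt and yt channels as lists of 2D arrays."""
    T, H, W = video.shape
    xy_planes = [video[t, :, :] for t in range(T)]   # ordinary frames
    xt_planes = [video[:, y, :] for y in range(H)]   # one (T, W) slice per row y
    yt_planes = [video[:, :, x] for x in range(W)]   # one (T, H) slice per column x
    return xy_planes, xt_planes, yt_planes
```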


(a) xy channel (b) xt channel (c) yt channel

Figure 3.2: Spatio-temporal channels

To obtain spatio-temporal features, FAST detection can be operated in the three channels. In the xy channel, the FAST detector extracts feature points on edges, corners and places where sharp illumination changes occur on each frame. These feature points represent the shape and edge information of humans, objects and other things in the background of the frames. From the xt and yt channels, the detector extracts feature points on edges and corners of the xt and yt planes, which are mainly produced by human motion or camera movement. Because motion information extraction is the target in our experiments, feature points are detected in the xt and yt channels, and these feature points indicate the positions where actions occur. Furthermore, to achieve more efficient and stable feature points, feature point detection is performed twice on each video.

The first time, detection is performed on the original-size video to obtain feature points at the original scale. The second time, the video is down-sampled and detection is performed on the sampled video again to obtain feature points at the second scale. The feature points that repeat across the two scales are kept for the following processing.
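The two-scale detection in the xt channel can be sketched as follows (the yt channel is analogous). This is only an illustrative sketch assuming a uint8 grayscale video array of shape (T, H, W); the FAST threshold, the down-sampling factor and the tolerance used to decide that a point repeats across the two scales are assumptions, not the parameters used in this thesis.

```python
import cv2
import numpy as np

def detect_xt_points(video, threshold=30):
    """Run 2D FAST on every xt plane of a (T, H, W) uint8 video and return a
    list of (x, y, t) feature point coordinates."""
    fast = cv2.FastFeatureDetector_create(threshold=threshold)
    pts = []
    for y in range(video.shape[1]):
        plane = video[:, y, :]                 # rows index t, columns index x
        for kp in fast.detect(plane, None):
            x, t = int(round(kp.pt[0])), int(round(kp.pt[1]))
            pts.append((x, y, t))
    return pts

def two_scale_points(video, threshold=30, factor=2, tol=2):
    """Keep only the points that are re-detected after spatial down-sampling."""
    T, H, W = video.shape
    small = np.stack([cv2.resize(video[t], (W // factor, H // factor))
                      for t in range(T)])
    full_pts = detect_xt_points(video, threshold)
    small_pts = [(x * factor, y * factor, t)   # map back to the original scale
                 for (x, y, t) in detect_xt_points(small, threshold)]
    return [p for p in full_pts
            if any(abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol
                   and abs(p[2] - q[2]) <= tol for q in small_pts)]
```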

3.3 Descriptor CHOG3D calculation

In this section, the calculation method of the descriptor CHOG3D is introduced. First of all, a spatio-temporal volume around a feature point is defined to be the support region of the point, and local features are calculated in the support regions. For descriptor calculation, the support region is divided into several cells. Supposing that the size of the support region is m × m × n,
