Human Action Recognition from 3D Videos
Using Multi-Projection-Based Approach
LE QUANG CHIEN
Doctor of Philosophy
Department of Informatics
School of Multidisciplinary Sciences
SOKENDAI (The Graduate University for
Advanced Studies)
Summary of the PhD thesis
Problem Statement and Main Contributions
Thesis title: "Human Action Recognition from 3D Videos Using Multi-Projection-Based Approach"
PhD student: LE QUANG CHIEN
Academic Supervisor: Professor Shin’ichi Satoh
Department: Department of Informatics
School: School of Multidisciplinary Sciences, The Graduate University for Advanced Studies (SOKENDAI)
Motivation
Video-based human action recognition is one of the important areas of computer vision research [1, 5, 12]. Its aim is to automatically detect and analyze actions from sensor data, such as image sequences captured by traditional RGB cameras, range sensors, or a combination of multiple sensors. During the last decade, human action recognition has been introduced in many applications. Some typical applications are shown below:
• Surveillance system
• Human-Machine Interaction
• Healthcare system
• Video annotation
Inspired by these applications, this dissertation aims to develop technologies for building an automatic action recognition system. It focuses on the actions themselves and does not explicitly consider context, such as the environment or interactions between persons or objects. Additionally, the approaches described in this dissertation are proposed only for depth data, because of the following two advantages. First, depth data enriches the information available as cues, such as the shape of the body and its motions. Second, depth data is less sensitive to problems that usually affect RGB information, such as background clutter and illumination variation. The research scope of this dissertation is described in the next section.
Problem
The work described in this thesis addresses video-based human action recognition, i.e., automatic analysis and understanding of human actions recorded on video. Hand gestures and simple actions, such as waving, clapping, and walking, are the kinds of actions considered in this work. The actions are recorded by depth sensors (e.g., Kinect), and each action is performed by one person several times. The depth maps capture actions in scenes with one fixed camera (i.e., test actions are from a seen viewpoint) and with multiple cameras (i.e., test actions can be from unseen viewpoints). Variations in viewpoint lead to large variations in the appearance of actions. This raises several questions: which source viewpoints provide the most valuable information for effectively discriminating the actions, and how can we search for these viewpoints? A further question is how source viewpoints and target viewpoints are correlated for human action recognition. Overall, our target problem focuses on three important characteristics.
First, we deal with action recognition from seen viewpoints, i.e., recognition from the same viewpoints that appear in the training data. Observing actions from seen viewpoints can lead to confusion when different actions are performed through similar movements. For example, a forward punch and a hammer action may be confused if we view them from the front, because both contain “lift arm up” and “stretch out” movements that are indistinguishable from that view. Investigating this problem is necessary because it underlies many applications, such as gesture-based interactive games [7, 6] and intelligent driver assistance systems [8, 3].
Second, we deal with action recognition from unseen viewpoints. Unlike the seen-viewpoint case, the performance of features tends to decrease dramatically as the viewpoint changes. One reason is that the same action can appear quite different when observed from different angles and may therefore be misclassified. A practical system should thus be able to recognize human actions from different unknown and, more importantly, unseen viewpoints, as required in applications such as smart environments [2, 9, 11] and video surveillance [10, 4].
Third, we deal with action recognition from arbitrary viewpoints. Unlike recognition from known viewpoints, the performance of recognition systems is compromised when the viewpoint situation is generalized. In addition, directly applying the aforementioned approaches is not effective in this case, because no information about the target viewpoints is shared. It is therefore very important to address this challenging issue in order to support more realistic applications.
Contributions
This section will point out the main contributions presented in this thesis.
• We propose the multiple projection-based greedy selection (MPGS) method to overcome the limitations of the heuristic projection-based approach. The idea is to exploit a pool of multiple viewpoints, instead of only a few typical projections such as front, side, and top, to form the final video representation. The method decomposes 3D actions into a set of 2D actions and includes a greedy selection step that searches for the optimal combination of projections. We carry out thorough experiments to evaluate the proposed method. By using a compact set of projections, the proposed method can also handle the challenge of general unconstrained motions.
• We propose a simple but effective method, called the viewpoint augmentation-based (VAB) classification method, to deal with actions observed from unseen viewpoints. This classification method is based on the discrimination power of classifiers. Specifically, we apply cross-validation on the training data to select a set of suitable viewpoints, where the selection depends on the recognition performance of the classifiers built on the corresponding viewpoints.
• We propose a new hybrid fusion method that exploits the benefits of both approaches above to build a unified action recognition framework. The proposed framework is effective and widely applicable for action recognition with arbitrary viewpoints. It uses the multiple representations proposed in the first contribution and the single representations proposed in the second contribution. The multiple representations are effective for discriminating actions from a single view, while the single representations, combined through a late fusion strategy, are flexible and widely applicable for cross-view action recognition. To adapt the multiple representations to cross-view action recognition, we propose a sliding-based matching method to find possible combinations of projections. Fusing both approaches improves the handling of the viewpoint variation challenge.
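The greedy selection step named in the first contribution can be illustrated with a minimal sketch of forward greedy subset selection. This is not the thesis implementation: the function name `greedy_select`, the callable `score_fn` (standing in for, e.g., validation accuracy of a classifier trained on the chosen projections), and the toy coverage table are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def greedy_select(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Forward greedy selection: repeatedly add the candidate projection
    that yields the highest score, and stop when no addition improves it."""
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-projection extension of the current selection.
        trials = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(trials, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no remaining projection improves the score
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
    return selected, best_score


# Hypothetical toy data: which actions each projection helps to separate.
COVERS: Dict[str, Set[str]] = {
    "front": {"wave", "clap"},
    "side": {"punch", "hammer"},
    "top": {"wave"},
    "rot30": {"clap", "punch"},
}


def toy_score(subset: List[str]) -> float:
    """Assumed stand-in for validation accuracy: number of actions covered,
    minus a small penalty per projection to discourage redundancy."""
    covered: Set[str] = set()
    for name in subset:
        covered |= COVERS[name]
    return len(covered) - 0.1 * len(subset)


selected, score = greedy_select(list(COVERS), toy_score)
# selected == ["front", "side"] for this toy score
```

In this toy setting the search stops after two projections, because adding "top" or "rot30" covers no new action and only incurs the redundancy penalty. A real score function would be far more expensive, since each evaluation would involve training and validating a classifier.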
Conclusion
In this dissertation, we focus only on exploiting a pool of multiple projections to reduce redundancy and deal with confusion. However, our approach can also handle two important challenges: general unconstrained motions and viewpoint variation. General unconstrained motions often compromise the performance of action recognition systems. We handle this challenge by decomposing each 3D action into a set of 2D actions and investigating feature representation methods for 3D video. Viewpoint variation, in turn, increases the complexity of motion representation. We deal with this challenge by exploiting augmented data generated from multiple projections and investigating the correlation between the source and target viewpoints.
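As a rough illustration of decomposing a 3D observation into several 2D views, the sketch below converts a depth frame into a point set, rotates it about the vertical axis, and projects it orthographically onto the image plane. It is only a sketch under stated assumptions: the function names, the use of pixel coordinates and raw depth values as a common 3D space, and the occupancy-set output are hypothetical and are not taken from the thesis.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Point3D = Tuple[float, float, float]
Cell = Tuple[int, int]


def depth_to_points(depth: Sequence[Sequence[float]]) -> List[Point3D]:
    """Turn a depth frame into (x, y, z) points; a depth of 0 means
    'no measurement'. Assumption: column and row indices serve as x and y."""
    return [
        (float(col), float(row), float(d))
        for row, line in enumerate(depth)
        for col, d in enumerate(line)
        if d > 0
    ]


def rotate_about_y(points: Sequence[Point3D], angle_deg: float) -> List[Point3D]:
    """Rotate points about the vertical (y) axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [
        (cos_a * x + sin_a * z, y, -sin_a * x + cos_a * z)
        for x, y, z in points
    ]


def project_xy(points: Sequence[Point3D]) -> Set[Cell]:
    """Orthographic projection onto the x-y plane: drop z and snap each
    point to an integer grid cell, returning the set of occupied cells."""
    return {(round(x), round(y)) for x, y, _ in points}


def multi_projection(
    depth: Sequence[Sequence[float]], angles_deg: Sequence[float]
) -> Dict[float, Set[Cell]]:
    """Build one 2D occupancy view per requested rotation angle."""
    points = depth_to_points(depth)
    return {
        angle: project_xy(rotate_about_y(points, angle))
        for angle in angles_deg
    }


# Tiny 2x2 frame with two valid depth measurements.
frame = [[0, 2], [3, 0]]
views = multi_projection(frame, [0, 90])
```

With a rotation of 0 degrees the view reproduces the original pixel layout, while a rotation of 90 degrees yields a side-like view in which the horizontal coordinate is the measured depth. Applying this per frame would give one 2D sequence per angle, which is the kind of pool from which projections could then be selected.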
Experimental results show that the proposed approach, which exploits multiple projections, is effective in dealing with the challenges arising from viewpoint variations. Moreover, the multi-projection-based approach often outperforms recently published state-of-the-art methods on benchmark datasets.
The structure of the thesis
This thesis is organized as follows:
Chapter 1 describes the problem statement and its importance to many realistic applications. It also introduces some challenging issues in the problem, of which viewpoint variation is the main challenge investigated in this thesis. Finally, the chapter summarizes the main contributions of the thesis in addressing this challenge.
Chapter 2 introduces some background that is related to our research. This background includes an introduction to human action recognition for depth videos and datasets. This chapter also encompasses basic knowledge about components in a general framework for human action recognition. The knowledge is necessary to re-implement our framework.
Chapter 3 presents our multiple projection-based approach for human action recognition from seen viewpoints. We first introduce the heuristic projection-based approach and some of its limitations. We then present our multiple projection-based approach, which uses a greedy method to select an optimal combination of projections. Finally, based on the selected combination, we generate multiple representations to represent actions in depth sequences.
Chapter 4 presents our discrimination-based classification for cross-view action recognition. We first introduce the naive multiple projection-based classification method for cross-view action recognition. Building on this method, we propose a new classification method that learns robust classifiers to effectively recognize actions from unseen viewpoints. In the proposed method, a late fusion method is used to predict the labels of test samples.
Chapter 5 presents our hybrid fusion method to construct a unified framework for action recognition from 3D videos. This framework combines the benefits of the preceding approaches, from action representation to classification, across different viewpoint scenarios. We also present a sliding method that adapts the multiple representations to viewpoint variations without using any knowledge of the target viewpoints.
Chapter 6 concludes this dissertation by summarizing our contributions and discussing future work.
References
[1] Thomas B Moeslund, Adrian Hilton, and Volker Krüger. A survey of advances in vision-based human motion capture and analysis. Computer vision and image understanding, 104(2):90–126, 2006.
[2] Erik Murphy-Chutorian and Mohan M Trivedi. 3d tracking and dynamic analysis of human head movements and attentional targets. In Distributed Smart Cameras, 2008. ICDSC 2008. Second ACM/IEEE International Conference on, pages 1–8. IEEE, 2008.
[3] Erik Murphy-Chutorian and Mohan Manubhai Trivedi. Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. Intelligent Transportation Systems, IEEE Transactions on, 11(2):300–311, 2010.
[4] Sangho Park and Mohan M Trivedi. Understanding human interactions with track and body synergies (tbs) captured from multiple views. Computer Vision and Image Understanding, 111(1):2–20, 2008.
[5] Ronald Poppe. A survey on vision-based human action recognition. Image and vision computing, 28(6):976–990, 2010.
[6] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.
[7] Cuong Tran and Mohan M Trivedi. Introducing “xmob”: Extremity movement observation framework for upper body pose tracking in 3d. In Multimedia, 2009. ISM’09. 11th IEEE International Symposium on, pages 446–447. IEEE, 2009.
[8] Mohan M Trivedi and Shinko Y Cheng. Holistic sensing and active displays for intelligent driver support systems. Computer, (5):60–68, 2007.
[9] Mohan Manubhai Trivedi, Kohsia Samuel Huang, and Ivana Mikić. Dynamic context capture and distributed video arrays for intelligent spaces. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 35(1):145–163, 2005.
[10] Akos Utasi and Csaba Benedek. A 3-d marked point process model for multi-view people detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3385–3392. IEEE, 2011.
[11] Alexander Waibel, Rainer Stiefelhagen, Rolf Carlson, J Casas, Jan Kleindienst, Lori Lamel, Oswald Lanz, Djamel Mostefa, Maurizio Omologo, Fabio Pianesi, et al. Computers in the human interaction loop. In Handbook of Ambient Intelligence and Smart Environments, pages 1071–1116. Springer, 2010.
[12] Daniel Weinland, Remi Ronfard, and Edmond Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.