Datasets - Related works and datasets - Human action recognition in video using local features

Related works and datasets

2.2 Datasets

activity as an integral histogram of spatio-temporal features, efficiently modeling how feature distributions change over time. The algorithm was evaluated on both simple actions and human interactions (e.g., the UT interaction dataset [71]). T. Yu et al. [69] applied semantic texton forests (STFs) to local space-time volumes as a powerful discriminative codebook for realistic action recognition. They tested their method on human interaction recognition (UT interaction dataset [71]), where the actions of the two participants in an interaction were treated as one action category.

These algorithms ignored the difference between the two participants’ actions within an interaction category and regarded them as one atomic action, which limits their performance. Because the processing did not distinguish the two persons, the methods are not well suited to extension to more complex action situations.
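To make the integral histogram representation mentioned above concrete, the following is a minimal sketch and not the cited authors' implementation. It assumes that the spatio-temporal features of each frame have already been quantized into visual-word indices; the quantization step, the vocabulary size, and the function names `integral_histogram` and `interval_histogram` are illustrative assumptions. Row t of the table stores the cumulative word counts of frames 0 to t-1, so the histogram of any frame interval is the difference of two rows.

```python
from typing import List, Sequence


def integral_histogram(frame_words: Sequence[Sequence[int]],
                       vocab_size: int) -> List[List[int]]:
    """Return a (T + 1) x vocab_size table of cumulative visual-word counts.

    frame_words[t] lists the visual-word indices observed in frame t
    (hypothetical input; any feature quantizer could produce it).
    """
    table = [[0] * vocab_size]
    for words in frame_words:
        row = table[-1][:]  # copy the running counts
        for w in words:
            row[w] += 1
        table.append(row)
    return table


def interval_histogram(table: List[List[int]], start: int, end: int) -> List[int]:
    """Histogram of frames start..end-1, obtained by subtracting two rows."""
    return [b - a for a, b in zip(table[start], table[end])]


# Toy example: 4 frames, vocabulary of 3 visual words.
frames = [[0, 2], [1], [], [2, 2, 0]]
table = integral_histogram(frames, vocab_size=3)
print(interval_histogram(table, 1, 4))  # histogram of frames 1..3 -> [1, 1, 2]
```

Once the table is built, the histogram of any interval costs one subtraction per visual word, which is what makes it cheap to track how the feature distribution changes over time.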

Figure 2.13: Samples for different action classes (columns) in four scenarios (rows) in the KTH action dataset.

KTH action dataset

The KTH action dataset¹ was introduced by Schuldt et al. in 2004. It contains six classes of human actions: walk, jog, run, box, hand wave, and hand clap. Each action is performed 4 or 6 times by 25 persons. The video sequences in the dataset are recorded in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. In total, the dataset consists of 600 video samples. Each video generally contains more than 300 frames at a spatial resolution of 160×120. The background is homogeneous and static in most sequences, and the scale varies. Sample frames of the dataset are shown in Figure 2.13.
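As a quick consistency check on these figures, the sketch below enumerates one sample per (person, action, scenario) combination, which reproduces the stated total of 600. Reading the total as 25 × 6 × 4 is an assumption made here from the numbers in the text; the tuple layout and the variable names are hypothetical and do not reflect the dataset's actual file naming.

```python
from itertools import product

# Labels taken from the description above; the indexing scheme is illustrative.
ACTIONS = ["walk", "jog", "run", "box", "hand wave", "hand clap"]
SCENARIOS = [
    "outdoors",
    "outdoors with scale variation",
    "outdoors with different clothes",
    "indoors",
]
PERSONS = range(1, 26)  # 25 persons

# One entry per (person, action, scenario) combination (assumed decomposition).
kth_samples = list(product(PERSONS, ACTIONS, SCENARIOS))
print(len(kth_samples))  # 25 * 6 * 4 = 600

# Samples per action class under the same assumption.
per_action = {a: sum(1 for _, act, _ in kth_samples if act == a) for a in ACTIONS}
print(per_action["walk"])  # 25 * 4 = 100
```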

Weizmann action dataset

The Weizmann dataset² was presented by M. Blank et al. [10] in 2005. Nine different action classes in the Weizmann dataset are used in our experiments: run, walk, skip, jump-jack (jack), jump forward, jump in place (pjump), gallop sideways (side), bend, and hand wave. In the dataset, each action class is performed once by each of nine subjects. As a result, a total of 81 video samples are used in our experiments, and each video is composed of about 100 frames at a spatial resolution of 180×144. As in the KTH dataset, the background in the videos is homogeneous and static. Some sample frames of the Weizmann dataset are shown in Figure 2.14.

¹ http://www.nada.kth.se/cvap/actions/

² http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html

Figure 2.14: Action samples for different action classes in the Weizmann action dataset.

Figure 2.15: Action samples for different action classes in the UCF sports action dataset.

UCF sports action dataset

The UCF sports dataset³ consists of a set of actions collected from various sports that are typically featured on broadcast television channels such as the BBC and ESPN. It contains close to 200 video sequences at a resolution of 720×480 pixels. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. It contains nine different human action categories: diving, golf swinging, kicking, lifting, horse riding, running, skating, swing (consisting of bench swing on the pommel horse and on the floor, and side swing at the high bar), and walking. Each action category is performed by several subjects. Sample frames of the action categories used in our experiments are shown in Figure 2.15.

³ http://www.cs.ucf.edu/vision/public_html/

Figure 2.16: Interaction samples in the UT interaction dataset.

2.2.2 Human-human interaction

There are not many datasets for human-human interaction. The most frequently used one is the UT interaction dataset, which was first introduced by M.S. Ryoo et al. [18]. It was later used as a public dataset in the Contest on Semantic Description of Human Activities (SDHA) 2010.

To perform experiments on more interaction categories, we captured our own interaction data, which we call the LIMU interaction dataset.

UT interaction dataset

The UT dataset⁴ contains videos of 6 categories of human-human interactions: shake hands, point, hug, push, kick, and punch. The videos in the dataset were captured in two environments, so they are divided into two sets. Set one was recorded in a parking lot. Its videos were taken at slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. Set two was recorded on a lawn on a windy day. Its background moves slightly (e.g., trees move), and its videos contain more camera jitter. The videos have a resolution of 720×480 at 30 fps, and the height of a person in the video is about 200 pixels.

In each of the two sets, every interaction category is performed by ten pairs of participants. The interaction categories are performed by actors with more than 15 different clothing conditions in the two environments. The dataset provides both continuous videos, in which the 6 interactions are executed in sequence, and short videos cropped from the continuous ones. Each cropped video contains only one interaction execution, giving a total of 120 cropped video samples across the two sets.

The interaction samples of all interaction categories in the two sets are shown in Figure 2.16.

⁴ http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html

Figure 2.17: Interaction samples in the LIMU interaction dataset.

LIMU interaction dataset

The LIMU interaction dataset⁵ was captured by the author in an indoor scene. Compared with the UT interaction dataset, it has three more interaction categories. The dataset contains nine interaction categories: hand clap, hand shake, hug, hand over, kick, pull, punch, push, and touch shoulder. Each category is performed by 10 pairs of participants. This gives a total of 90 interaction samples, each stored as a separate video containing one interaction. The videos are captured at a resolution of 640×480 pixels, with about 120 frames per video. Some sample frames of the LIMU interaction dataset are shown in Figure 2.17.
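The category lists and sample counts given for the two interaction datasets can be compared with a short sketch. It assumes that "shake hands" in the UT list and "hand shake" in the LIMU list denote the same interaction, and that the UT total of 120 cropped samples decomposes as 6 categories × 10 pairs × 2 sets; the normalized label `hand shake` and the variable names are illustrative choices.

```python
# Category names are taken from the text; "shake hands" (UT) is normalized
# to "hand shake" on the assumption that both denote the same interaction.
UT_CATEGORIES = {"hand shake", "point", "hug", "push", "kick", "punch"}
LIMU_CATEGORIES = {
    "hand clap", "hand shake", "hug", "hand over", "kick",
    "pull", "punch", "push", "touch shoulder",
}

shared = UT_CATEGORIES & LIMU_CATEGORIES
only_limu = LIMU_CATEGORIES - UT_CATEGORIES
only_ut = UT_CATEGORIES - LIMU_CATEGORIES

print(sorted(shared))     # categories present in both lists
print(sorted(only_limu))  # categories listed only for LIMU
print(sorted(only_ut))    # categories listed only for UT

# Sample counts: 10 pairs per category; UT has 2 sets (assumed decomposition).
ut_cropped = len(UT_CATEGORIES) * 10 * 2
limu_samples = len(LIMU_CATEGORIES) * 10
print(ut_cropped, limu_samples)  # 120 90
```

Under this reading, the net increase of three categories comes from four categories that appear only in the LIMU list and one ("point") that appears only in the UT list.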

⁵ http://limu.ait.kyushu-u.ac.jp/dataset/en/interaction_dataset.html

