Detection of Activities and Events without Explicit Categorization

全文

(1)IPSJ SIG Technical Report. Vol.2012-MPS-90 No.6 2012/9/19. Regular Paper. Detection of Activities and Events without Explicit Categorization M Y1,2,a). M M2,b). M S3,c). Abstract: We propose a method of unsupervised event detection from a video sequence that compares probability. distributions of past and current video sequence data in a hierarchical way. However, because estimation of probability distributions is known to be difficult, naive distribution comparison via probability distribution estimation tends to be unreliable in practice. To cope with this problem, we use the state-of-the-art machine learning method called density ratio estimation, which directly estimates the ratio of probability densities without density estimation. Thus, distribution comparison via density ratio estimation is more reliable than the naive approach with probability distribution estimation. Through experiments on a walking scene and a tennis match, we demonstrate the usefulness of our approach. Keywords: event detection, direct density-ratio estimation, cubic higher-order local auto-correlation. 1.. Introduction. Analysis of events and human actions is useful in various applications such as content-based video retrieval, visual surveillance, and human-computer interaction. For this reason, event detection and human action recognition have gathered a great deal of attention recently [2, 19]. However, these are challenging computer vision tasks because they contain a wide variety of backgrounds, contexts, and events [14]. Considerable variations in clothes, sizes, or postures of people, illumination conditions, occlusion conditions, and camera angels are additional sources of difficulty that make the problems even more challenging. For this reason, automatic semantic categorization of events or actions generally requires a huge amount of video data with ground truth annotations or segmentation of video data [6]. However, because gathering annotated or segmented data is costly, unsupervised approaches have been explored recently for learning human action categories [17, 20] and for detecting abnormal behavior [28]. Most event detection methods extract features such as optical flow based features [17, 28], spatio-temporal features [4, 11–13, 17, 19, 23], or static features including appearance, shape, and spatial relations of local features [7, 24]. Some unsupervised approaches, after extraction of those primitive features, utilize codebook representation [1, 18, 27] that is effective for describing and 1 2 3. a) b) c). Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama 226-8502, Japan. CANON Inc., 3-30-2 Shimomaruko, Ohta-ku, Tokyo 146-8501, Japan. Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan. [email protected] [email protected] [email protected]. ⓒ 2012 Information Processing Society of Japan. discriminating various event categories. In particular, the bagof-words representation of spatio-temporal points-of-interest has received considerable attention for the recognition of actions and events [1, 14, 18, 29]. Existing technologies for detecting semantic events can be summarized as finite state machines, statistical models (e.g., HMM [5, 25] and Bayesian networks [3]), kernel methods [4, 24, 29], and tree-structured classifiers [17]. In this paper, we regard “events” as changes or meaningful states of human activities that can be observed as visual contents or something significantly different from “normal” states. We address the problem of unsupervised detection of semantic event categories by comparing probability distributions of past and current video data in various time scales. Thus, the proposed algorithm does not involve any probabilistic generative models, nor does require any sophisticated codebook representation as in [1, 18]. However, estimation of probability distributions is known to be difficult [22], and thus distribution comparison via probability distribution estimation is unreliable in practice. To cope with this problem, we use the state-of-the-art machine learning technique called density ratio estimation [21], which avoids distribution estimation and directly estimates the ratio of probability densities. Thus, distribution comparison via density ratio estimation is more reliable than the naive approach with probability distribution estimation [10]. We compare the probability distributions of past and current sequence data in various time scales described by spatiotemporal features (more specifically, we use the cubic higher order local auto correlation; CHLAC [12]). The remainder of this paper is structured as follows. In Section 2, we describe our event detection method based on density ratio estimation. In Section 3, we report experimental results on real-world video sequences. Finally, we conclude by summarizing our contribution in Section 4.. 1.

(2) IPSJ SIG Technical Report. Vol.2012-MPS-90 No.6 2012/9/19. M. Train interval. N. M4. N4. Test interval. Train interval4. Test interval4. t T- M. T Fig. 1. 2.. T+ N. M3. N3. Train interval3. Test interval3. M2. N2. Train interval2. Test interval2. Definition of train and test intervals.. Problem formulation and proposed approach. M1. N1. Train interval1. Test interval1. M0. In this section, we formulate an event detection problem based on density ratio estimation, and describe our proposed approach. 2.1 Problem formulation Let x(t) be a d-dimensional feature vector at time t. Our task is to detect whether there exists a change point between two consecutive time intervals, respectively called the train and test intervals (see Fig.1, where M and N are respectively the number of frame in the train and test intervals). Let ptr (x(t)) and pte (x(t)) be the probability density functions of the test and train time series features, respectively. A naive approach to this problem would be to first estimate the train and test probability density functions separately from the train and test time-series data, and then evaluate the difference between train and test intervals based on the estimated probability density functions, e.g., by monitoring the ratio of probability densities: S =. TX +N−1. ptr (x(t)) , pte (x(t)). t=T −M. where µ (≥ 0) is a predetermined threshold for the detection of events. However, since non-parametric density estimation is known to be a hard problem [22], this naive approach to change point detection may not be effective. Instead, directly estimating the probability density ratio without estimating each probability density would be more promising [21]: ptr (x(t)) . pte (x(t)). In practice, choice of the length of train and test intervals may affect the change detection performance: Shorter length may be suitable for detecting micro events, while longer length may be useful for detecting macro events. In this paper, we consider hierarchical intervals as illustrated in Fig.2. This hierarchical structure makes it possible to detect a variety of events from micro to macro levels. Let S h be the score of the h-th hierarchy: Sh =. T +N h −1 X t=T −Mh. phtr (x(t)) phte (x(t)). Train interval0 Test interval0. t T – M4. T – M0 Fig. 2. T. T + N0. T + N4. Definition of train and test intervals in a hierarchy.. functions of the test and train times-series features in the h-th hierarchy. Finally, the anomaly score S is defined by S = max (S h ) . h. 2.2 Direct density-ratio estimation Here, we explain how the density ratio is directly estimated without going through density estimation. 2.2.1 Formulation of density ratio estimation Suppose that we are given a set of Ntr samples drawn independently from a probability distribution Ptr with density ptr : oNtr i.i.d. n

(3) χtr := xtri

(4)

(5) xtri ∈ <d ∼ Ptr . i=1. where ttr and tte are respectively the starting time points of the train and test intervals. Based on the value of S , we conclude that    → no event occurs,  S ≤µ    otherwise → an event occurs,. w(x(t)) =. N0. ,. where phtr (x) and phte (x) are respectively the probability density ⓒ 2012 Information Processing Society of Japan. We also suppose that another set of Nte samples drawn independently from (possibly) another probability distribution Pte with density pte : oNte i.i.d. n

(6) χte := xtej

(7)

(8) xtej ∈ <d ∼ Pte . j=1. The goal of density ratio estimation is to estimate the density ratio function ptr (x) w(x) := pte (x) from the samples χtr and χte , where we assume pte (x) > 0 for all x. 2.2.2 Unconstrained least-squares approach to density ratio estimation Here, we review a direct density-ratio estimation method called unconstrained least-squares importance fitting (uLSIF) [9]. Let us model the density ratio function w(x) by the following kernel model: e(x) = w. Ntr X. αi K(x, xtri ) = α0 k(x),. i=1. where α := (α1 , α2 , . . . , αNtr )0 are parameters to be learned from data samples, •0 denotes the transpose of a matrix or a vector, and 0 k(x) := K(x, xtr1 ), K(x, xtr2 ), . . . , K(x, xtrNtr ) are kernel basis functions. A popular choice of the kernel is the Gaussian function:. 2.

(9) IPSJ SIG Technical Report. K(x, y) = exp −. Vol.2012-MPS-90 No.6 2012/9/19. ! ||x − y||2 , 2σ2. (1). where σ (> 0) is the Gaussian width. e(x) so that the We determine the parameter α in the model w following squared-error J0 is minimized: Z 2 1 e(x) − w(x) pte (x)dx w J0 := 2 Z Z 1 e(x)2 pte (x)dx − w e(x)ptr (x)dx w = 2 Z 1 w(x)2 pte (x)dx, (2) + 2 where the last term is a constant and therefore can be safely ignored. Let us denote the first two terms by J: Z Z 1 e(x)2 pte (x)dx − w e(x)ptr (x)dx w J(α) := 2 1 = α0 Hα − h0 α, 2 where H is the Ntr × Ntr matrix defined by Z H := k(x)k(x)0 pte (x)dx, and h is the Ntr -dimensional vector defined by Z h := k(x)ptr (x)dx. 2.2.3 Empirical approximation Since J contains the expectation over unknown densities pte (x) and ptr (x), we approximate the expectations by empirical averages. Then we obtain b := J(α) =. Nte Ntr 1 X 1 X e(xtej )2 − e(xtri ) w w 2Nte j=1 Ntr i=1. 1 0c α Hα − α0b h, 2. c is the Ntr × Ntr matrix defined by where H Nte X c := 1 k(xtej )k(xtej )0 , H Nte j=1. and b h is the Ntr -dimensional vector defined by Ntr 1 X b k(xtri ). h := Ntr i=1. By including a regularization term, the uLSIF optimization problem is formulated as # " λ 1 c b := argmin α0 Hα − α0 b h + α0 α , α 2 2 α. where IN is the N-dimensional identity matrix. Finally, a density ratio estimator b w(x) is given by b 0 k(x). b w(x) := α Thanks to the simple analytic-form expression, uLSIF is computationally more efficient than alternative density-ratio estimators which involve non-linear optimization [21]. 2.2.4 Model selection by cross-validation The practical performance of uLSIF depends on the choice of the kernel function (e.g., the kernel width σ in the case of Gaussian kernel Eq.(1)) and the regularization parameter λ. Model selection of uLSIF is possible based on cross-validation with respect to the error criterion J defined by Eq.(2). Ntr More specifically, each of the sample sets χtr = {xtri }i=1 and te Nte l L χte = {x j } j=1 is divided into L disjoint subsets {χtr }l=1 and L el (x) is obtained using χtr \χltr {χlte }l=1 . Then an uLSIF solution w l and χte \χte (i.e., all samples without χltr and χlte ), and its J-value for the hold-out samples χltr and χlte is computed as l JbCV :=. 1 X 1 X el (xte )2 − l el (xtr ), w w l 2|χte | x ∈χl |χtr | x ∈χl te. tr. te. tr. where |χ| denotes the number of elements in the set χ. This prol cedure is repeated for l = 1, . . . , L, and the average of JbCV over all l is computed as 1 X bl J . JbCV := L l=1 CV L. Finally, the model (the kernel width σ and the regularization parameter λ in the current setup) that minimizes JbCV is chosen as the most suitable one. 2.2.5 Relative density-ratio estimation Depending on the condition of the denominator density pte (x), the density-ratio value ptr (x)/pte (x) can be unbounded (i.e., it can be infinity). This is actually problematic because the nonparametric convergence rate of uLSIF is governed by the supnorm of the true density-ratio function [26]: max x. ptr (x) . pte (x). To overcome this problem, relative density-ratio estimation [26] was introduced. The β-relative density ratio is defined by ptr (x) . βptr (x) + (1 − β)pte (x). c + λIN )−1b b = (H α h, tr. This is reduced to the plain density ratio if β = 0, and it tends to be smoother as β gets larger. More specifically, it can be confirmed that the β-relative density ratio is bounded above by 1/β for β > 0, even when the plain density ratio ptr (x)/pte (x) is unbounded. Thus, estimation of the relative density ratio is expected to be more reliable than that of the plain density ratio. The β-relative density ratio can be learned in the same way as the plain density ratio. Indeed, the optimization problem of a relative variant of uLSIF, called RuLSIF, is given by the same form c as uLSIF; the only difference is the definition of the matrix H:. ⓒ 2012 Information Processing Society of Japan. 3. where α0 α/2 is a regularizer and λ (≥ 0) is the regularization parameter that controls the strength of regularization. By taking the derivative of the above objective function with respect to the parameter α and equating it to zero, we can analytically obtain b as the solution α.

(10) IPSJ SIG Technical Report. Fig. 3. Vol.2012-MPS-90 No.6 2012/9/19. Example of a mask pattern [12]. (0) L = 0. (1) L = 1 for a1 = (−1, −1, −1). Mutually shifted mask patterns are eliminated. Ntr Nte X 1−β X c= β k(xtri )k(xtri )0 + k(xtej )k(xtej )0 . H Ntr i=1 Nte j=1. Thus, the RuLSIF solution can still be obtained analytically. Furthermore, RuLSIF was proved to possess an even better nonparametric convergence property than uLSIF [26]. For this reason, we decided to use RuLSIF to compute our anomaly score S. 2.3 Cubic higher-order local auto-correlation Spatio-temporal features have received a great deal of attentions and have been successfully used for event and action detection. In this paper, we adopt the spatio-temporal features called the cubic higher order local auto correlation (CHLAC) [12], which and whose extension [16] have been successfully used in action recognition. CHLAC directly handles three-dimensional data, and it possesses three useful properties [12]: additivity, shift invariance, and robustness to noise. Let f (r) be three-way data with r = (x, y, z), and the L-th order auto-correlation function is defined as Z xL (a1 , . . . , aL ) = f (r) f (r + a1 ) · · · f (r + aL )dr, (3) where al (l = 1, . . . , L) are called the displacement vectors. In Eq.(3), displacement vector al is limited to a 3 × 3 × 3 local region around r and the number of displacement vectors, L, is set to be less than or equal to 2. The CHLAC feature, the value of xL (a1 , . . . , aL ), remains the same, as long as the patterns of (r, a1 , . . . , aL ) are identical in the point configuration—such redundant features are eliminated (see Fig.3). Taking inter-frame difference and thresholding, we convert input image data into motion-image sequences composed of binary data. We use 251 mask patterns, so our CHLAC feature vector is 251-dimensional.. 3.. Experiments. In this section, we show two experimental studies on a walking scene and a tennis match to evaluate the performance of the proposed method. As shown in Fig.4, we compute CHLAC features in train and test intervals for (M0 , N0 ) = (10, 10), (M1 , N1 ) = (20, 20), (M2 , N2 ) = (30, 30), (M3 , N3 ) = (40, 40), and (M4 , N4 ) = (50, 50). 3.1 Detection of abnormal actions in walking scene In the first set of experiments, we use our in-house video sequences that record walking scenes. While people are walking (Fig.5), typical abnormal actions are that people are hit each other and/or tumble (Fig.6). ⓒ 2012 Information Processing Society of Japan. M4=50. N4=50. Train interval4. Test interval4. M3=40. N3=40. Train interval3. Test interval3. M2=30. N2=30. Train interval2. Test interval2. M1=20. N1=20. Train interval1. Test interval1. M0=10. N0=10. Train interval0 Test interval0. t T – M4 Fig. 4. T – M0. T. T + N0. T + N4. Hierarchical structure of time-scales for training and testing sequences.. Fig.7 and Fig.8 show the anomaly score obtained by the proposed method. The scene from the 200th to 300th frames has relatively small anomaly scores and this corresponds to an ordinary walking scene. On the other hand, the scene from 300th to 400th frames has relatively high anomaly scores and this corresponds to a tumbling scene. Fig.6 shows some snapshot of the walking scenes, where frame #356 corresponds to a local peak of the anomaly score. The result clearly shows that the proposed unsupervised method could detect the onset of abnormal actions and also distinguish normal actions from abnormal ones. Fig.8 compares the anomaly scores obtained by the proposed method and a reference method [12], which also uses the CHLAC feature together with the supervised subspace method for event (action) detection. The results show that both methods have peaks at almost the same frame number, which correspond to a person’s tumbling. However, a critical difference is their training methods: The reference method uses the subspace method for event detection, which is trained in a supervised manner using video data of 1000 frames. On the other hand, the proposed method is unsupervised and no pre-training is necessary. Because the proposed method does not require manual video-sequence annotation that is often costly in practice, the proposed method will be more useful than the existing method. 3.2 Detection of various actions in a tennis match Next, we show experimental results on a tennis match video (see Fig.10). In this video, various events can be observed, for example, a ball person is running in the court (Fig.11), a player bounces a ball before his service (Fig.12), and a player makes smashing shots (Fig.13 and Fig.14). Fig.15–Fig.17 show the anomaly score obtained by the proposed method. The peak between the 150th and 200th frames corresponds to a ball person’s running in the court (Fig.11), while the peak between the 300th and 400th frames corresponds to ball bouncing (Fig.12). The peak around the 1000th frame corresponds to the forehand smashing stroke (Fig.13), while the peak around the 1100th frame corresponds to the backhand smashing stroke (Fig.14). These results indicate that the proposed method can successful detect notable events as well as their onset frame.. 4.

(11) IPSJ SIG Technical Report. Vol.2012-MPS-90 No.6 2012/9/19. . . . . . . . . . . Fig. 5. An ordinary walking scene.. . . . . . . . . . . Fig. 6. A person tumbles while other people are walking.. Score. 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 0. 500. 1000. 1500. 2000. Frame number. Score. Fig. 7. Anomaly score obtained by the proposed method for the walking scene.. 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0. Fig. 9. 50. 100 150 200 250 300 350 400 450 500. Anomaly scores by the proposed method (top) and a reference method [12] (bottom) from 251th to 500th frame.. Frame number Fig. 8. Enlargement of Fig.7 from the beginning to 500th frames.. ⓒ 2012 Information Processing Society of Japan. 5.

(12) IPSJ SIG Technical Report. Vol.2012-MPS-90 No.6 2012/9/19. . . . . . . . . Fig. 10. A tennis match video.. . . . . . Fig. 11. A ball person is running in the court.. . . . . . . Fig. 12. . . . . . ⓒ 2012 Information Processing Society of Japan. . A player bounces a ball before his service.. . Fig. 13. . . A player makes a forehand smashing shot.. 6.

(13) IPSJ SIG Technical Report. Vol.2012-MPS-90 No.6 2012/9/19. . . . . . . Score. Fig. 14. . A player makes a backhand smashing shot.. training based on annotated data, the proposed algorithm requires only a small amount of data for event detection [2, 6–8, 18, 28]. This is a significant advantage in practice, because manually annotating video sequence is costly.. 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0. References 0. 200. 400. 600. 800 1000 1200 1400 1600. [1]. Frame Number. Score. Fig. 15. Anomaly score obtained by the proposed method for the tennis match video.. [3]. 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0. [4]. [5]. 50. 100 150 200 250 300 350 400 450 500 [6]. Frame Number Fig. 16. [2]. Enlargement of Fig.15 from the beginning to 500th frames.. Score. [7]. 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 800. [8] [9] [10]. 900. 1000. 1100. 1200. 1300. Frame Number Fig. 17. [11]. Enlargement of Fig.15 between the 800th and 1300th frames. [12]. 4.. Summary and conclusions. We proposed a video-based event detection method based on direct density-ratio estimation. A similar idea has already been explored in terms of change detection in time-series [10, 15], but we newly introduced a multi-scale hierarchy of train and test intervals, which mitigates the difficulty of finding appropriate time intervals. We experimentally demonstrated the usefulness of the proposed method on a walking scene and a tennis match. Compared with an existing method [12] that requires supervised preⓒ 2012 Information Processing Society of Japan. [13] [14] [15] [16]. L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, and G. Serra. Effective codebooks for human action categorization. In Proceedings of IEEE 12th International Conference on Computer Vision and Pattern Recognition, pages 506–513, 2009. L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, and G. Serra. Event detection and recognition for semantic annotation of video. Multimedia Tools and Applications, 51(1):279–302, 2011. L. Ballan, M. Bertini, A. Del Bimbo, and G. Serra. Video event classification using string kernels. Multimedia Tools and Applications, 48(1):69–87, 2010. C. Chao, H. C. Shih, and C. L. Huang. Semantics-based highlight extraction of soccer program using dbn. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1057–1060, 2005. S. Ebadollahi, L. Xie, S. F. Chang, and J. Smith. Visual event detection using multi-dimensional concept dynamics. In Proceedings of IEEE International Conference on Multimedia and Expo, pages 881–884, 2006. W. Jiang, C. Cotton, D. Ellis, and A. C. Loui. Short-term audio-visual atoms for generic video concept classification. In Proceedings of the 17th ACM international conference on Multimedia, pages 5–14, 2009. L. Jie, B. Caputo, and V. Ferrari. Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Information Processing Systems 22, pages 1168– 1176, 2009. Y. Avrithis K. Rapantzikos and S. Kollias. Spatiotemporal features for action recognition and salient event detection. Cognitive Computation, 3(1):167–184, 2011. T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009. Y. Kawahara and M. Sugiyama. Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2):114–127, 2012. Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proceedings of IEEE 11th International Conference on Computer Vision, pages 1–8, 2007. T. Kobayashi and N. Otsu. Action and simultaneous multiple-person identification using cubic higher-order local auto-correlation. In Proceedings of the 17th International Conference on Pattern Recognition, pages 741–744, 2004. I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005. T. Li, T. Mei, I.-S. Kweon, and X.-S. Hua. Contextual bag-of-words for visual categorization. IEEE Transactions on Circuits and Systems for Video Technology, 21(4):381–392, 2011. S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Technical Report 1203.0453, arXiv, 2012. T. Matsukawa and T. Kurita. Action recognition using three-way cross-correlations feature of local motion attributes. In Proceedings of 20th International Conference on Pattern Recognition, pages 1731–. 7.

(14) IPSJ SIG Technical Report. [17] [18] [19] [20] [21] [22] [23]. [24] [25] [26]. [27]. [28] [29]. Vol.2012-MPS-90 No.6 2012/9/19. 1734, 2010. K. Mikolajczyk and H. Uemura. Action recognition with motionappearance vocabulary forest. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008. J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318, 2008. A. Oikonomopoulos, I. Patras, and M. Pantic. Spatiotemporal salient points for visual recognition of human actions. IEEE Transactions on Systems, Man, and Cybernetices, 36(3):710–719, 2005. H. J. Seo and P. Milanfar. Detection of human actions from a single example. In Proceedings of IEEE 12th International Conference on Computer Vision, pages 1965–1970, 2009. M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, Cambridge, UK, 2012. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998. C. Wang, L. Zhang, and H. J. Zhang. Learning to reduce the semantic gap in web image retrieval and annotation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 355–362, 2008. D. Xu and S. F. Chang. Video event recognition using kernel methods with multilevel temporal alignment. IEEE Transactions on Pattern Analysis and machine Intelligence, 30(11):1985–1997, 2008. G. Xu, Y. F. Ma, H.J. Zhang, and S. Yang. A hmm based semantic analysis framework for sports game event detection. In Proceedings of International Conference on Image Processing, pages 25–28, 2003. M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems 24, pages 594– 602, 2011. B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interaction. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 9–16, 2010. T.-H. Yu and Y.-S. Moon. Unsupervised abnormal behavior detection for real-time surveillance using observed history. In Proceedings of IAPR Conference on Machine Vision Applications, pages 9–16, 2010. X. Zhou, X. Zhuang, S. Yan, S. F. Chang, M. Hasegawa-Johnson, and T.S. Huang. Sift-bag kernel for video event analysis. In Proceedings of the 16th ACM international conference on Multimedia, pages 229– 238, 2008.. ⓒ 2012 Information Processing Society of Japan. 8.

(15)