Future works - Background Modeling for Object Detection in Complex Environments

Chapter 6. Conclusion

Analysis of the appearance and disappearance of stationary objects based on objectness: As discussed in Section 5.2, the proposed multi-layered background model has difficulty distinguishing the appearance of a newly observed stationary object from the disappearance of a background object that existed in the initial background. It also has difficulty continuing to detect non-rigid objects, such as people waiting for a bus at a bus stop, as stationary objects. The reason is that the appearance of a stationary object is detected solely from the overlap ratio of blobs between two consecutive images. Future research will aim to solve these problems by considering additional information such as objectness [62]. Objectness [62] is a measure, computed from edges, color contrast, textures, etc., of how likely a particular region is to contain an object. Zhang et al. [63] proposed segmenting the primary object regions in each video frame using objectness. Objectness is therefore expected to help the proposed multi-layered background model handle the disappearance of background objects and non-rigid stationary objects.

Acknowledgement

I would like to express my gratitude to all those who have made this thesis possible.

First of all, I am deeply indebted to my supervisor, Prof. Richiro Taniguchi, for his invaluable advice, discussions, and encouragement, and for his trust in me. During my six years of research in the laboratory, I have learned many things from him, which will be helpful to me in my future career. I would like to thank Associate Prof. Hajime Nagahara and Associate Prof. Atsushi Shimada, who helped me very much and gave me many beneficial suggestions, not only on my research but also on personal matters. I am also greatly obliged to Prof. Vincent Charvillat for giving me the opportunity to study at his laboratory (ENSEEIHT, Toulouse, France).

Thanks to his suggestions, I was able to come up with the basic idea of the proposed background model, i.e., StSIC.

I greatly appreciate the significant contributions of Prof. Shun-ichi Kaneko and Prof. Ryo Kurazume. As my advisers, they gave me a lot of beneficial advice and suggestions on my presentation and the direction of my research. I am also grateful to all my colleagues in the LIMU and ENSEEIHT, and especially to the secretary Kiyoko Furuta.

I sincerely appreciate the support from the Japan Society for the Promotion of Science (JSPS). Thanks to the financial support from the JSPS, I could concentrate on my research for three years.

Finally, I would like to express my sincere appreciation to my parents. They have always given me continuous support and encouragement in my life and studies.

Reference

[1] M. Enzweiler and D. M. Gavrila, “Monocular Pedestrian Detection: Survey and Experiments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2179–2195, December 2009.

[2] T. Bouwmans, F. E. Baf, and B. Vachon, “Background Modeling using Mixture of Gaussians for Foreground Detection - A Survey,” Recent Patents on Computer Science, vol. 1, no. 3, pp. 219–237, November 2008.

[3] M. Cristani, M. Farenzena, D. Bloisi, and V. Murino, “Background Subtraction for Automated Multisensor Surveillance: A Comprehensive Review,” EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 1–24, February 2010.

[4] T. Bouwmans, “Recent Advanced Statistical Background Modeling for Foreground Detection: A Systematic Survey,” Recent Patents on Computer Science, vol. 4, no. 3, pp. 147–176, September 2011.

[5] R. Lienhart and J. Maydt, “An Extended Set of Haar-like Features for Rapid Object Detection,” IEEE International Conference on Image Processing (ICIP), vol. 1, pp. 900–903, 2002.

[6] S.-K. Pavani, D. Delgado, and A. F. Frangi, “Haar-like features with optimally weighted rectangles for rapid object detection,” Pattern Recognition, vol. 43, no. 1, pp. 160–172, January 2010.

[7] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893, June 2005.

[8] N. Dalal, B. Triggs, and C. Schmid, “Human Detection Using Oriented Histograms of Flow and Appearance,” 9th European Conference on Computer Vision (ECCV), vol. 2, pp. 428–441, May 2006.

[9] Q. Zhu, S. Avidan, M.-C. Yeh, and K.-T. Cheng, “Fast Human Detection Using a Cascade of Histograms of Oriented Gradients,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 1491–1498, June 2006.

[10] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, August 1997.

[11] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 246–252, June 1999.

[12] A. Shimada, D. Arita, and R. Taniguchi, “Dynamic Control of Adaptive Mixture-of-Gaussians Background Model,” IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), November 2006.

[13] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis, “Background and Foreground Modeling using Non-parametric Kernel Density Estimation for Visual Surveillance,” Proceedings of the IEEE, vol. 90, pp. 1151–1163, July 2002.

[14] T. Tanaka, A. Shimada, D. Arita, and R. Taniguchi, “A Fast Algorithm for Adaptive Background Model Construction Using Parzen Density Estimation,” IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 528–533, September 2007.

[15] S. Jabri, Z. Duric, and H. Wechsler, “Detection and location of people in video images using adaptive fusion of color and edge information,” 15th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 627–630, September 2000.

[16] M. Mason and Z. Duric, “Using histograms to detect and track objects in color video,” Applied Imagery Pattern Recognition Workshop (AIPR), pp. 154–159, October 2001.

[17] M. Heikkila, M. Pietikainen, and J. Heikkila, “A texture-based method for detecting moving objects,” 15th British Machine Vision Conference (BMVC), pp. 187–196, September 2004.

[18] M. Heikkila and M. Pietikainen, “A Texture-Based Method for Modeling the Background and Detecting Moving Objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 657–662, April 2006.

[19] Y. Satoh, S. Kaneko, Y. Niwa, and K. Yamamoto, “Robust object detection using a Radial Reach Filter (RRF),” Systems and Computers in Japan, vol. 35, no. 10, pp. 63–73, 2004.

[20] H. Yoshimura, Y. Iwai, and M. Yachida, “Object Detection with Adaptive Background Model and Margined Sign Cross Correlation,” 18th International Conference on Pattern Recognition (ICPR), vol. 3, pp. 19–23, August 2006.

[21] A. Shimada and R. Taniguchi, “Hybrid Background Model using Spatial-Temporal LBP,” IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 19–24, September 2009.

[22] T. Tanaka, A. Shimada, R. Taniguchi, T. Yamashita, and D. Arita, “Towards robust object detection: integrated background modeling based on spatio-temporal features,” 9th Asian Conference on Computer Vision (ACCV), pp. 201–212, September 2009.

[23] P. Noriega and O. Bernier, “Real Time Illumination Invariant Background Subtraction Using Local Kernel Histograms,” 17th British Machine Vision Conference (BMVC), vol. 3, pp. 979–988, September 2006.

[24] Y. Satoh and K. Sakaue, “Robust Background Subtraction based on Bi-polar Radial Reach Correlation,” IEEE Region 10 International Conference (TENCON), pp. 1–6, November 2005.

[25] K. Iwata, Y. Satoh, R. Ozaki, and K. Sakaue, “Robust Background Subtraction Based on Statistical Reach Feature Method,” IEICE Transactions on Information and Systems, vol. J92-D, no. 8, pp. 1251–1259 (in Japanese), August 2009.

[26] K. Yokoi, “Probabilistic BPRRC: Robust Change Detection against Illumination Changes and Background Movements,” IAPR Conference on Machine Vision Applications (MVA), pp. 148–151, May 2009.

[27] X. Zhao, Y. Satoh, H. Takauji, S. Kaneko, K. Iwata, and R. Ozaki, “Object detection based on a robust and accurate statistical multi-point-pair model,” Pattern Recognition, vol. 44, no. 6, pp. 1296–1311, June 2011.

[28] K. Iwata, Y. Satoh, R. Ozaki, and K. Sakaue, “Robust Background Subtraction by Statistical Reach Feature on Random Reference Points,” Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), pp. 188–192, February 2012.

[29] D. Liang, S. Kaneko, M. Hashimoto, K. Iwata, X. Zhao, and Y. Satoh, “Co-Occurrence-Based Adaptive Background Model for Robust Object Detection,” IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 401–406, August 2013.

[30] X. Tan and B. Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG), pp. 168–182, October 2007.

[31] S. Liao, G. Zhao, V. Kellokumpu, M. Pietikainen, and S. Z. Li, “Modeling Pixel Process with Scale Invariant Local Patterns for Background Subtraction in Complex Scenes,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1301–1306, June 2010.

[32] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-Time Tracking of the Human Body,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780–785, July 1997.

[33] J. Heikkila and O. Silven, “A real-time system for monitoring of cyclists and pedestrians,” IEEE Workshop on Visual Surveillance, pp. 74–81, June 1999.

[34] N. J. B. McFarlane and C. P. Schofield, “Segmentation and tracking of piglets in images,” Machine Vision and Applications (MVA), vol. 8, no. 3, pp. 187–193, May 1995.

[35] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting Moving Objects, Ghosts, and Shadows in Video Streams,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, October 2003.

[36] N. Friedman and S. Russell, “Image Segmentation in Video Sequences: A Probabilistic Approach,” Conference on Uncertainty in Artificial Intelligence, pp. 175–181, 1997.

[37] Z. Zivkovic, “Improved Adaptive Gaussian Mixture Model for Background Subtraction,” 17th International Conference on Pattern Recognition (ICPR), vol. 2, pp. 28–31, August 2004.

[38] J. Cheng, J. Yang, and Y. Zhou, “A Novel Adaptive Gaussian Mixture Model for Background Subtraction,” Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), vol. 1, pp. 587–593, June 2005.

[39] R. Tan, H. Huo, J. Qian, and T. Fang, “Traffic Video Segmentation Using Adaptive-K Gaussian Mixture Model,” International Workshop on Intelligent Computing in Pattern Analysis/Synthesis (IWICPAS), pp. 125–134, August 2006.

[40] J. Cheng, J. Yang, Y. Zhou, and Y. Cui, “Flexible background mixture models for foreground segmentation,” Image and Vision Computing, vol. 24, no. 5, pp. 473–482, May 2006.

[41] A. Elgammal, D. Harwood, and L. Davis, “Non-parametric Model for Background Subtraction,” 6th European Conference on Computer Vision (ECCV), pp. 751–767, June/July 2000.

[42] E. Monari and C. Pasqual, “Fusion of Background Estimation Approaches for Motion Detection in Non-static Backgrounds,” IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 347–352, September 2007.

[43] A. Tavakkoli, M. Nicolescu, and G. Bebis, “Automatic Statistical Object Detection for Visual Surveillance,” IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), pp. 144–148, March 2006.

[44] Z. Zivkovic and F. van der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern Recognition Letters, vol. 27, no. 7, pp. 773–780, May 2006.

[45] S. D. Cvetkovic, P. Bakker, J. Schirris, and P. H. N. de With, “Background Estimation and Adaptation Model with Light-Change Removal for Heavily Down-Sampled Video Surveillance Signals,” IEEE International Conference on Image Processing (ICIP), pp. 1829–1832, 2006.

[46] I. Codrut, G. Vasile, T. C. I., and P. Dan, “A Fast Algorithm for Background Tracking in Video Surveillance, Using Nonparametric Kernel Density Estimation,” Facta Universitatis, Series: Electronics and Energetics, vol. 18, pp. 127–144, 2005.

[47] T. Tanaka, A. Shimada, D. Arita, and R. Taniguchi, “Non-parametric Background and Shadow Modeling for Object Detection,” 8th Asian Conference on Computer Vision (ACCV), vol. 1, pp. 159–168, November 2007.

[48] S. D. Witherspoon and M. Zhang, “Negative coefficient polynomial kernel density estimation for visualization,” 18th IASTED International Conference on Modelling and Simulation (ICMS), pp. 398–403, May 2007.

[49] R. Ramezani, P. Angelov, and X. Zhou, “A Fast Approach to Novelty Detection in Video Streams using Recursive Density Estimation,” IEEE International Conference on Intelligent Systems, vol. 2, pp. 14–2–14–7, September 2008.

[50] O. Barnich and M. V. Droogenbroeck, “ViBe: a powerful random technique to estimate the background in video sequences,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 945–948, April 2009.

[51] O. Barnich and M. V. Droogenbroeck, “ViBe: A universal background subtraction algorithm for video sequences,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709–1724, June 2011.

[52] O. Barnich and M. V. Droogenbroeck, “Background subtraction: Experiments and improvements for ViBe,” IEEE Workshop on Change Detection (CDW), pp. 32–37, June 2012.

[53] M. Hofmann, P. Tiefenbacher, and G. Rigoll, “Background segmentation with feedback: The Pixel-Based Adaptive Segmenter,” IEEE Workshop on Change Detection (CDW), pp. 38–43, June 2012.

[54] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: Principle and Practice of Background Maintenance,” IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 255–261, September 1999.

[55] T. Tanaka, S. Yoshinaga, A. Shimada, R. Taniguchi, T. Yamashita, and D. Arita, “Object detection based on combining multiple background modelings,” IPSJ Transactions on Computer Vision and Applications, vol. 2, pp. 156–168, November 2010.

[56] T. Yang, Q. Pan, S. Z. Li, and J. Li, “Multiple Layer Based Background Maintenance In Complex Environment,” IEEE Symposium on Multi-Agent Security and Survivability, pp. 112–115, December 2004.

[57] J. Yao and J.-M. Odobez, “Multi-layer background subtraction based on color and texture,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8, June 2007.

[58] A. Shimada, S. Yoshinaga, and R. Taniguchi, “Maintenance of Blind Background Model for Robust Object Detection,” IPSJ Transactions on Computer Vision and Applications, vol. 3, pp. 148–159, December 2011.

[59] H. Fujiyoshi and T. Kanade, “Layered detection for multiple overlapping objects,” IEICE Transactions on Information and Systems, vol. E87-D, no. 12, pp. 2821–2827, December 2004.

[60] S. J. Koppal and S. G. Narasimhan, “Appearance Derivatives for Isonormal Clustering of Scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 8, pp. 1375–1385, August 2009.

[61] H. J. Chang, H. Jeong, and J. Y. Choi, “Active Attentional Sampling for Speed-up of Background Subtraction,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2088–2095, June 2012.

[62] I. Endres and D. Hoiem, “Category Independent Object Proposals,” 11th European Conference on Computer Vision (ECCV), pp. 575–588, September 2010.

[63] D. Zhang, O. Javed, and M. Shah, “Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 628–635, June 2013.

Appendix A

Background model using a mixture of Gaussians

Here, the author presents the details of the background modeling based on Gaussian mixture model (GMM) [12], which allows the dynamic control of GMM.

For simplicity, consider a pixel at $(x, y)$. The recent history of its features $\{X^1, \ldots, X^t\}$ can be represented by a mixture of $M$ Gaussians, as shown in Figure A.1, where $X^t$ is the pixel feature of $(x, y)$ at time $t$.

Figure A.1: Gaussian mixture model representation of a probability density function

The probability of observing the current pixel feature is

$$P(X^t) = \sum_{m=1}^{M} w_m^t \, \eta(X^t \mid \mu_m^t, \Sigma_m^t), \tag{A.1}$$

where $w_m^t$, $\mu_m^t$ and $\Sigma_m^t$ are the weight, the mean and the covariance matrix of the $m$-th Gaussian in the mixture at time $t$, respectively, and $\eta$ is the Gaussian probability density function

$$\eta(X^t \mid \mu^t, \Sigma^t) = \frac{1}{(2\pi)^{d/2} \, |\Sigma^t|^{1/2}} \exp\left( -\frac{1}{2} (X^t - \mu^t)^T (\Sigma^t)^{-1} (X^t - \mu^t) \right). \tag{A.2}$$

In the method [12], the number of Gaussian distributions $M$ is automatically controlled in response to background changes. In particular, for pixels that observe background changes, $M$ is increased by adding new distributions. Conversely, for stable pixels whose features are almost constant for a while, $M$ is decreased by eliminating or integrating distributions. The details of the automatic control of $M$ are discussed later (see Step 6).

Additionally, in the method [12], they approximate the form of the covariance matrix as

$$\Sigma_m^t = \sigma_m^t I, \tag{A.3}$$

where each component of the pixel feature is assumed to be independent and have the same variance. While this is certainly not the case, the assumption allows us to avoid a costly matrix inversion at the expense of some accuracy.
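To make the effect of this approximation concrete, the following is a minimal sketch of Eqs. (A.1) and (A.2) under the isotropic covariance of Eq. (A.3). The function names and the list-based representation are illustrative assumptions, not part of the method [12]. Because the covariance is a scalar multiple of the identity, the quadratic form reduces to a squared distance divided by that scalar, and no matrix inversion is needed.

```python
import math


def isotropic_gaussian_pdf(x, mean, sigma):
    """Eq. (A.2) with Sigma = sigma * I, as in Eq. (A.3).

    x and mean are sequences of length d; sigma is the scalar that
    multiplies the identity matrix (hypothetical argument name).
    """
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    # |Sigma|^(1/2) = sigma^(d/2), so the normalizer is (2*pi*sigma)^(d/2).
    normalizer = (2.0 * math.pi * sigma) ** (d / 2.0)
    # Sigma^(-1) = I / sigma, so the quadratic form is sq_dist / sigma.
    return math.exp(-0.5 * sq_dist / sigma) / normalizer


def mixture_pdf(x, weights, means, sigmas):
    """Eq. (A.1): weighted sum of the M component densities."""
    return sum(w * isotropic_gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, sigmas))
```

For a one-dimensional feature with zero mean and unit scalar, `isotropic_gaussian_pdf([0.0], [0.0], 1.0)` reduces to the familiar peak value of the standard normal density, 1/sqrt(2π).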

Thus, the distribution of recently observed features of each pixel in a scene is characterized by a GMM. A new pixel feature will be represented by one of the major components of the GMM and used to update the GMM. The details of the detection and update schemes are described in the following 8 steps.

Step 1: Every new pixel feature $X^t$ is checked against the existing $M$ Gaussians until a match is found. A match is defined as a feature vector within 2.5 standard deviations of a distribution.

Step 2: When a match is found for the new pixel feature in Step 1, it is regarded as the background if the matched distribution is one of the background models (described in Step 8). Otherwise, the pixel is the foreground.

Step 3: The prior weights $w_m^t$ of the $M$ Gaussians at time $t$ are updated as

$$w_m^t = (1 - \alpha) \, w_m^{t-1} + \alpha R_m^t, \tag{A.4}$$

where $\alpha$ is the learning rate and $R_m^t$ is 1 for the matched distribution and 0 for the remaining distributions. After this process, each weight $w_m^t$ is renormalized.

Step 4: The $\mu_m^t$ and $\sigma_m^t$ parameters of unmatched distributions remain the same. The parameters of the matched distribution are updated as follows:

$$\mu_m^t = (1 - \rho) \, \mu_m^{t-1} + \rho X^t, \tag{A.5}$$

$$\sigma_m^t = (1 - \rho) \, \sigma_m^{t-1} + \rho \, (X^t - \mu_m^t)^T (X^t - \mu_m^t), \tag{A.6}$$

where the second learning rate $\rho$ is defined as

$$\rho = \alpha \, \eta(X^t \mid \mu_m^t, \Sigma_m^t). \tag{A.7}$$

Step 5: If none of the $M$ distributions match the current pixel feature in Step 1, a new Gaussian distribution is added to the GMM as follows:

$$w_{M+1}^t = W, \tag{A.8}$$

$$\mu_{M+1}^t = X^t, \tag{A.9}$$

$$\sigma_{M+1}^t = \sigma_M^t, \tag{A.10}$$

where $W$ is the initial weight value for the new Gaussian. If $W$ is higher, the distribution is chosen as the background model for a long time. After this process, all the weights are renormalized.

Step 6: When the weight of the least-weighted distribution is smaller than a threshold, the distribution is deleted and the remaining weights are renormalized. In cases where the difference between the means of two Gaussians (one is $\eta_a$ and the other is $\eta_b$) is smaller than a threshold, these distributions are integrated into one Gaussian. The new weight, mean and variance of the integrated Gaussian $\eta_c$ are calculated as follows:

$$w_c^t = w_a^t + w_b^t, \tag{A.11}$$

$$\mu_c^t = \frac{w_a^t \mu_a^t + w_b^t \mu_b^t}{w_a^t + w_b^t}, \tag{A.12}$$

$$\sigma_c^t = \frac{w_a^t \sigma_a^t + w_b^t \sigma_b^t}{w_a^t + w_b^t}. \tag{A.13}$$

Step 7: The Gaussians are ordered by the value of $w/\sigma$. This value increases both as the distribution gains more evidence and as the variance decreases.

Step 8: The first $B$ distributions are chosen as the background model as follows:

$$B = \arg\min_{b} \left( \sum_{k=1}^{b} w_k^t > T \right), \tag{A.14}$$

where $T$ is a measure of the minimum portion of the data that should be accounted for by the background. If a small value for $T$ is chosen, the background model is usually unimodal. If $T$ is higher, a multi-modal distribution caused by a repetitive background motion (e.g., the movement of tree branches or leaves, the waves on water, etc.) could result in more than one feature being included in the background model. This results in a transparency effect which allows the background to accept two or more separate features.
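The eight steps above can be sketched for a single pixel as follows. This is an illustrative sketch under stated assumptions, not the implementation of [12]: the names `observe` and `background_indices`, the dictionary layout, and all default values (`alpha`, `W`, `T`, `w_min`, `merge_dist`, `var_floor`) are hypothetical. The sketch further assumes that the match test uses the Euclidean distance, that $\rho$ is evaluated with the parameters from before the update, and that $\rho$ is capped at 1 and the variance is floored to keep the recursion numerically stable; none of these details are given in the text.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def gaussian_pdf(x, mean, var):
    """Eq. (A.2) with Sigma = var * I, so no matrix inversion is needed."""
    d = len(x)
    return math.exp(-0.5 * sq_dist(x, mean) / var) / (2.0 * math.pi * var) ** (d / 2.0)


def normalize(components):
    """Rescale the weights so that they sum to one."""
    total = sum(c["w"] for c in components)
    for c in components:
        c["w"] /= total


def background_indices(components, T):
    """Steps 7-8: rank by w / sigma and keep the first B components
    whose cumulative weight exceeds T."""
    order = sorted(range(len(components)),
                   key=lambda m: components[m]["w"] / components[m]["var"],
                   reverse=True)
    chosen, cumulative = [], 0.0
    for m in order:
        chosen.append(m)
        cumulative += components[m]["w"]
        if cumulative > T:
            break
    return chosen


def observe(components, x, alpha=0.05, W=0.05, T=0.7,
            w_min=0.01, merge_dist=1.0, var_floor=1.0):
    """Update a non-empty list of components in place with feature x.
    Returns True when x is classified as foreground."""
    # Step 1: first component within 2.5 standard deviations (Euclidean, assumed).
    matched = None
    for m, c in enumerate(components):
        if math.sqrt(sq_dist(x, c["mean"])) < 2.5 * math.sqrt(c["var"]):
            matched = m
            break
    # Step 2: background only if the matched component is in the background set.
    foreground = matched is None or matched not in background_indices(components, T)
    if matched is not None:
        # Step 3, Eq. (A.4): R is 1 for the matched component, 0 otherwise.
        for m, c in enumerate(components):
            c["w"] = (1.0 - alpha) * c["w"] + alpha * (1.0 if m == matched else 0.0)
        normalize(components)
        # Step 4, Eqs. (A.5)-(A.7); rho uses the previous parameters (assumed).
        c = components[matched]
        rho = min(1.0, alpha * gaussian_pdf(x, c["mean"], c["var"]))
        c["mean"] = [(1.0 - rho) * mu + rho * xi for mu, xi in zip(c["mean"], x)]
        c["var"] = max(var_floor, (1.0 - rho) * c["var"] + rho * sq_dist(x, c["mean"]))
    else:
        # Step 5, Eqs. (A.8)-(A.10): the new component copies the last variance.
        components.append({"w": W, "mean": list(x), "var": components[-1]["var"]})
        normalize(components)
    # Step 6: delete the least-weighted component if it falls below w_min.
    if len(components) > 1:
        weakest = min(range(len(components)), key=lambda m: components[m]["w"])
        if components[weakest]["w"] < w_min:
            del components[weakest]
            normalize(components)
    # Step 6, Eqs. (A.11)-(A.13): merge the first pair whose means are close.
    for a in range(len(components)):
        for b in range(a + 1, len(components)):
            ca, cb = components[a], components[b]
            if math.sqrt(sq_dist(ca["mean"], cb["mean"])) < merge_dist:
                w = ca["w"] + cb["w"]
                ca["mean"] = [(ca["w"] * p + cb["w"] * q) / w
                              for p, q in zip(ca["mean"], cb["mean"])]
                ca["var"] = (ca["w"] * ca["var"] + cb["w"] * cb["var"]) / w
                ca["w"] = w
                del components[b]
                return foreground
    return foreground
```

Starting from a single component, for example `[{"w": 1.0, "mean": [100.0], "var": 25.0}]`, a feature near 100 is classified as background, while a feature at 200 is first reported as foreground and spawns a new component. If the value 200 keeps being observed, the weight of the new component grows through Eq. (A.4) until it enters the set selected in Step 8, at which point the same value is reported as background.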

Appendix B

Background model using kernel density estimation

Here, the author presents the details of the fast background modeling algorithm using kernel density estimation [14].

For simplicity, consider a pixel at $(x, y)$. Its probability density function (PDF) at time $t$ is estimated by kernel density estimation (KDE) using the past samples. In the fast algorithm [14], the authors employ a rectangular function as the kernel function $W$, as shown in Figure B.1, instead of the Gaussian function that is often used in KDE.

Figure B.1: Rectangular kernel used in the fast background modeling algorithm using kernel density estimation: This figure shows an example with $d = 1$ in Eq. (B.1).

The kernel is defined as

$$W(u) = \begin{cases} \dfrac{1}{h^d} & \text{if } |u| \le \dfrac{h}{2}, \\ 0 & \text{otherwise,} \end{cases} \tag{B.1}$$

where $h$ is a parameter representing the width of the kernel, i.e., a smoothing parameter, and $d$ is the dimension of the feature vector. Using this kernel and the past $S$ samples, the PDF at time $t$ is represented as

$$P^t(X) = \frac{1}{S} \sum_{i=1}^{S} W\left( |X - X^{t-i}| \right), \tag{B.2}$$

where $X^t$ is a $d$-dimensional feature vector of the pixel $(x, y)$, and $|X - X^{t-i}|$ means the chessboard distance between pixel values in $d$-dimensional space. Thus, according to Eq. (B.2), the PDF $P^t(X)$ at time $t$ is calculated by enumerating the samples whose values are inside the kernel located at $X$. In a naive implementation (i.e., one that enumerates the individual samples), the computation time is proportional to the number of samples $S$. In contrast, the fast algorithm [14] estimates the PDF at a computation cost that does not depend on $S$.
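As a point of comparison, the naive evaluation of Eq. (B.2) can be sketched as follows, assuming the rectangular kernel of Eq. (B.1) takes the value $1/h^d$ for $|u| \le h/2$ and 0 otherwise. The function names are illustrative assumptions and this is not the fast algorithm of [14]; the loop over all $S$ samples is exactly the cost that the fast algorithm avoids.

```python
def chessboard_distance(a, b):
    """Chessboard (Chebyshev) distance between two d-dimensional vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


def rect_kernel(u, h, d):
    """Eq. (B.1): 1 / h^d inside the kernel support, 0 outside."""
    return 1.0 / h ** d if u <= h / 2.0 else 0.0


def kde_naive(x, samples, h):
    """Eq. (B.2): average of the kernel over the past S samples.

    The cost grows linearly with the number of samples S.
    """
    d = len(x)
    total = sum(rect_kernel(chessboard_distance(x, s), h, d) for s in samples)
    return total / len(samples)
```

For example, with the one-dimensional samples 10, 11, 20 and 12 and a kernel width of $h = 4$, three of the four samples lie within $h/2 = 2$ of the query value 11, so `kde_naive([11.0], [[10.0], [11.0], [20.0], [12.0]], 4.0)` evaluates to 3 × (1/4) / 4 = 0.1875.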

In background modeling, the PDF $P^t(X)$ at time $t$ is estimated by referring to the pixel features $\{X^{t-1}, \ldots, X^{t-S}\}$ observed in the latest $S$ frames. Then, at time $t+1$, the updated PDF $P^{t+1}(X)$ is estimated by referring to a new sample $X^t$. Basically, the essence of PDF estimation is the accumulation of kernel estimators: when the new value $X^t$ is acquired, the kernel estimator corresponding to $X^t$ should be accumulated. At the same time, the oldest kernel estimator, corresponding to $X^{t-S}$ at frame $t-S$, should be discarded, since the length of the pixel process is constant, $S$. This idea reduces the PDF computation to the following incremental computation:

$$P^{t+1}(X) = P^t(X) + \frac{1}{S} W\left( |X - X^t| \right) - \frac{1}{S} W\left( |X - X^{t-S}| \right). \tag{B.3}$$

This equation means that when a new pixel value is observed, the PDF is updated by:

• increasing the probabilities of the pixel values that lie inside the kernel located at the new pixel value $X^t$ by $\frac{1}{S h^d}$ (see the red parts of Figure B.2),
