
Figures A.3 and A.4 show annotation examples.


The following three videos show the same scene captured from different viewpoints. You can see two people in the third-person vision, hereinafter called an actor and an observer. The first-person vision is from the actor, who is engaged in everyday activities. The second-person vision is from the observer, who monitors the actor.

[Video pane with three viewpoint tabs: First-person Vision / Second-person Vision / Third-person Vision]

Please describe the actor with a single sentence in detail, in terms of his/her state and behavior.

[Submission form: a text box ("Please write here") and the buttons (1) Spellcheck, (2) Submit, and (3) Next videos]

How to use: when you have finished annotating the videos, check the spelling (1) and submit (2). If the scene is difficult to understand, skip the video (3).

Notice (the instructions from the COCO dataset):

Describe all the important parts of the scene.

Do not start the sentences with "There is".

Do not describe unimportant details.

Do not describe things that might have happened in the future or past.

Do not describe what a person might say.

Do not give people proper names.

The sentences should contain at least 8 words.

Figure A.2: Appearance of the web user interface for annotation. The interface is composed of the video pane (top), the submission form (middle), and the instruction pane (bottom).
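Most of the notice-pane rules above are concrete enough to be checked automatically before a sentence is submitted. The short Python sketch below illustrates such a pre-submission check; it is not part of the annotation interface described here (the interface only offered a spellcheck button), and the function name check_caption, the word-counting regular expression, and the crude proper-name heuristic are assumptions introduced purely for illustration.

import re

# Hypothetical pre-submission checker that mirrors the notice-pane rules.
# This is a sketch only: the actual interface relied on the annotators and
# a spellcheck button rather than on automatic validation.

def check_caption(caption: str) -> list:
    """Return a list of rule violations for one candidate sentence."""
    problems = []
    text = caption.strip()

    # Rule: do not start the sentence with "There is".
    if re.match(r"there is\b", text, flags=re.IGNORECASE):
        problems.append('starts with "There is"')

    # Rule: the sentence should contain at least 8 words.
    words = re.findall(r"[A-Za-z']+", text)
    if len(words) < 8:
        problems.append(f"only {len(words)} words (minimum is 8)")

    # Rule: do not give people proper names. A rough heuristic: flag
    # capitalized words that are not at the start of the sentence.
    for name in re.findall(r"(?<=\s)[A-Z][a-z]+", text):
        problems.append(f'possible proper name "{name}"')

    return problems

if __name__ == "__main__":
    for sentence in ("There is a man on the bed.",
                     "A man sitting on the bed is typing on a black keyboard."):
        print(sentence, "->", check_caption(sentence) or "passes")

On the two sample sentences, the first is flagged for the "There is" opening and for containing only seven words, while the second passes all of the checks.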


[Frames from the third-person, second-person, and first-person videos]

Annotations: A man sitting on the edge of the bed is reading a book. / A person sitting on the bed is browsing a book. / A person sits on the bed in the room. / A person is reading a book on the bed. / A man browsing a book while sitting on the bed.


[Frames from the third-person, second-person, and first-person videos]

Annotations: A man sitting on the bed typing on the keyboard. / A person sitting on the bed is typing a black keyboard. / A person sitting on the bed is typing a keyboard. / A person typing a black keyboard on the bed. / A person is typing a keyboard on the bed.


[Frames from the third-person, second-person, and first-person videos]

Annotations: A man is taking something out of the refrigerator. / A standing person is reaching a hand into an opened refrigerator. / A person is looking inside an empty refrigerator. / A person is looking into an empty refrigerator. / A person standing in front of lined two refrigerators.


Figure A.3: Examples of collected annotations. Each group shows six frames from a video for each viewpoint and the corresponding annotations.


[Frames from the third-person, second-person, and first-person videos]

Annotations: Men are watching something in front of an open white refrigerator. / A person standing in front of a refrigerator is examining a small item. / A person is standing in front of an empty refrigerator. / A person standing in front of the refrigerator is holding a can. / A man standing in front of the open refrigerator is watching a can.


[Frames from the third-person, second-person, and first-person videos]

Annotations: A man is walking with a red box. / A person is passing a red box to other in a white shirt. / A person in gray is passing a red box to the other. / Two walking men are getting close to each other to pass a red cylindrical box / A man walking towards the other walking man is giving him a red cylindrical box


[Frames from the third-person, second-person, and first-person videos]

Annotations: A man in white clothes hands a green bottle. / A man in a white shirt is passing a green plastic bottle to the other in a room. / A man in a white shirt is walking in the room while holding a green bottle. / A person holding a green bottle is walking in the room. / A person with a green bottle is walking nearby a kitchen.


Figure A.4: Examples of collected annotations. Each group shows six frames from a video for each viewpoint and the corresponding annotations.
