
directions of the scanning vehicle are activated. These regions include skylines of artificial structures and roads. In the forest area and the outdoor parking, the maps show relatively uniform activations. It can be seen that the contributing features appear within local regions of a certain size and are distributed everywhere on the map.

These ubiquitous features can be considered textural objects, such as woods, cars, or distant buildings.

Moreover, the results show modality-independent characteristics with respect to the network modifications, RWMP and HCC. Focusing again on the forest area and outdoor parking classes, we can see that the side regions are ignored by the baseline VGG, whereas the modified VGG responds to the side regions, which are intrinsically continuous but spatially isolated on the map. These effects can also be seen in the other categories, although slight differences occur between the modalities. For instance, in the reflectance results on the indoor parking class, the network modification equilibrates the feature extractability between the center and the side areas. As we can see in Fig. 2.1, these two areas correspond to the left and right side views from the scanning vehicle, respectively, and thus their appearances are captured equally by the LiDAR. On the other hand, in the depth results on the indoor parking class, we observe an imbalanced distribution between the center and side areas. This is mainly due to the geometrical difference that arises while the vehicle drives along one lane of the road. It can be seen that the model learned different features even when the same type of objects appeared in these areas, which made learning depth maps with VGG+RWMP+HCC more difficult.
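As an illustration of this wrap-around behavior, the following is a minimal sketch of a horizontally circular convolution in PyTorch; the class name `HorizontallyCircularConv2d`, the kernel size, and the 64 x 512 map resolution are assumptions made for the example and do not necessarily match the exact RWMP/HCC configuration used in this chapter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontallyCircularConv2d(nn.Module):
    """Convolution that wraps the horizontal (azimuth) axis of an
    equirectangular LiDAR map so the left and right edges are treated
    as continuous, while the vertical axis uses ordinary zero padding."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        # Padding is applied manually in forward(), so the conv itself uses none.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)

    def forward(self, x):
        # Wrap-around padding along the width (azimuth) dimension only.
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")
        # Zero padding along the height (elevation) dimension.
        x = F.pad(x, (0, 0, self.pad, self.pad), mode="constant", value=0.0)
        return self.conv(x)

if __name__ == "__main__":
    # Dummy 1-channel equirectangular depth map: 64 rings x 512 azimuth bins.
    depth_map = torch.randn(1, 1, 64, 512)
    layer = HorizontallyCircularConv2d(in_ch=1, out_ch=16)
    print(layer(depth_map).shape)  # torch.Size([1, 16, 64, 512])
```

Because the padding wraps only the azimuth axis, features near the left and right borders of the map see the same neighborhood they would see if the panorama were unrolled at a different angle, which is consistent with the observation above that the modified VGG treats the spatially separated side regions as one continuous area.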

2.5 Discussions

This chapter presented a novel approach for outdoor scene classification by multimodal convolutional neural networks. In the unimodal experiment, this work used either depth or reflectance inputs and trained four types of deep models based on the popular baseline, VGG [65]. As a result, the baseline VGG showed the best result of 97.18% for the depth input, and the modified VGG showed the best result of 95.92% for the reflectance input. These results were better than those of the traditional hand-engineered features, Spin-Image, GIST, and LBP. In addition, this work visualized the learned features in the networks and verified how they are activated for the specific categories.
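For reference, the following is a minimal sketch of how such Grad-CAM visualizations [5] can be computed; it assumes a stock 3-channel torchvision VGG-16 rather than the modified LiDAR networks evaluated here, so the backbone, input shape, and layer split are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def grad_cam(model, image, target_class):
    """Grad-CAM: weight the last conv-block feature maps by the spatially
    averaged gradient of the target class score, sum the weighted maps,
    and keep only the positive evidence with ReLU."""
    feature_extractor = model.features[:-1]   # conv blocks up to the last ReLU
    last_pool = model.features[-1]            # final max-pool of the VGG trunk

    feats = feature_extractor(image)          # (1, 512, 14, 14) for 224x224 input
    feats.retain_grad()                       # keep gradients on this non-leaf tensor

    logits = model.classifier(torch.flatten(model.avgpool(last_pool(feats)), 1))
    model.zero_grad()
    logits[0, target_class].backward()

    weights = feats.grad.mean(dim=(2, 3), keepdim=True)      # channel importance
    cam = F.relu((weights * feats).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)           # normalize to [0, 1]

if __name__ == "__main__":
    model = vgg16().eval()                    # untrained backbone, for a shape demo
    image = torch.randn(1, 3, 224, 224)
    heatmap = grad_cam(model, image, target_class=0)
    print(heatmap.shape)                      # torch.Size([1, 1, 224, 224])
```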


[Figure 2.8: grid of averaged Grad-CAM maps; rows correspond to the scene classes (Coast, Forest, InP, OutP, Res, Urban) and columns to the Depth and Reflectance modalities.]

Figure 2.8: Contributing regions which the models activate to predict correct classes using Grad-CAM [5]. Coast, Forest, InP, OutP, Res, and Urban denote scene classes of coast area, forest area, indoor parking, outdoor parking, residential area, and urban area, respectively. Each class group shows Grad-CAM maps averaged on the test set, which are obtained from the baseline VGG (top row) and the proposed VGG with RWMP and HCC (bottom row). The modification works for sharpening of feature selectivity and equilibration of feature extractability between the center and the side area.

In the multimodal experiment, both depth and reflectance maps are used, and four types of multimodal models are trained. The final results showed that the decision-level fusion approaches, Softmax Average and Adaptive Fusion, brought a performance gain over the unimodal cases. In other words, we could not see remarkable improvements from jointly learning multimodal features with Early Fusion and Late Fusion. Similar findings were reported in other tasks [73]. Baltrušaitis et al. [19] pointed out that joint representation learning has difficulties with different noise levels and an insufficient amount of training data. From the qualitative experiment, we observed that the proposed mechanism could process the circular data appropriately, while the performance was slightly degraded in the depth modality. I consider that the circular mechanism is potentially crucial for future applications such as image generation tasks.
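As a point of reference, the following is a minimal sketch of the softmax-average style of decision-level fusion, assuming two already trained unimodal classifiers; `depth_net` and `refl_net` are placeholder names, and the adaptive variant would replace the fixed equal weighting with learned or confidence-dependent weights rather than the constant `w` used here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def softmax_average_fusion(depth_net, refl_net, depth_map, refl_map, w=0.5):
    """Decision-level fusion: run each unimodal network independently,
    convert logits to class probabilities, and average the probabilities.
    With w != 0.5 this becomes a simple weighted (adaptive-style) fusion."""
    p_depth = F.softmax(depth_net(depth_map), dim=1)
    p_refl = F.softmax(refl_net(refl_map), dim=1)
    p_fused = w * p_depth + (1.0 - w) * p_refl
    return p_fused.argmax(dim=1), p_fused

if __name__ == "__main__":
    # Placeholder unimodal "networks" for a shape check (6 scene classes).
    depth_net = torch.nn.Linear(64 * 512, 6)
    refl_net = torch.nn.Linear(64 * 512, 6)
    depth_map = torch.randn(2, 64 * 512)   # flattened dummy depth maps
    refl_map = torch.randn(2, 64 * 512)    # flattened dummy reflectance maps
    pred, probs = softmax_average_fusion(depth_net, refl_net, depth_map, refl_map)
    print(pred.shape, probs.shape)          # torch.Size([2]) torch.Size([2, 6])
```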

With the tremendous progress of deep learning, studies on point clouds have been shifting to graph-based approaches [74] and to more sophisticated networks that process point sets intact [75]. However, LiDAR data still has the advantage that the point clouds can be represented as a bijective map, which saves computational cost and leads to real-time applications. Remaining issues include explicit modeling of ray drops (missing points) and validation of different representations of depth maps. For the first issue, recent studies revealed that ray drops significantly affect the performance of learning models [76, 77]. In terms of the second issue, LiDAR point clouds can be projected into different 2D representations, such as a cylindrical coordinate map [78] or a Cartesian coordinate map [78, 76], while our approach employed a 1-channel equirectangular depth map. Moreover, there are some approaches to distortion-aware convolution for spherical images [79, 80].

The above techniques have the potential to improve the scene classification task.
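To make the representation question concrete, the following is a rough sketch of projecting an (x, y, z) LiDAR point cloud onto a 1-channel equirectangular depth map; the 64 x 512 resolution, the ±15° vertical field of view, and the convention of storing 0 for pixels with no return are assumptions for the example and may differ from the actual sensor and preprocessing.

```python
import numpy as np

def points_to_equirect_depth(points, h=64, w=512, v_fov=(-15.0, 15.0)):
    """Project LiDAR points (N, 3) in sensor coordinates onto a 1-channel
    equirectangular depth map: columns index azimuth, rows index elevation,
    and each pixel keeps the range of the nearest return that falls in it."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.sqrt(x ** 2 + y ** 2 + z ** 2)

    azimuth = np.arctan2(y, x)                                   # (-pi, pi]
    elevation = np.degrees(np.arcsin(z / np.maximum(rng, 1e-6)))

    # Map angles to pixel indices: left edge = -pi, top row = max elevation.
    cols = ((azimuth + np.pi) / (2.0 * np.pi) * w).astype(int).clip(0, w - 1)
    rows = ((v_fov[1] - elevation) / (v_fov[1] - v_fov[0]) * h).astype(int)
    valid = (rows >= 0) & (rows < h)

    depth_map = np.full((h, w), np.inf, dtype=np.float32)
    # Unbuffered minimum so the nearest return wins when points share a pixel.
    np.minimum.at(depth_map, (rows[valid], cols[valid]),
                  rng[valid].astype(np.float32))
    depth_map[np.isinf(depth_map)] = 0.0   # 0 marks pixels with no return (ray drop)
    return depth_map

if __name__ == "__main__":
    pts = np.random.uniform(-50.0, 50.0, size=(100000, 3))  # dummy point cloud
    print(points_to_equirect_depth(pts).shape)               # (64, 512)
```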

3 Caption Generation from Multi-perspective Images

3.1 Introduction

Motivated by the availability of consumer wearable devices, lifelogging has been attracting increasing attention. By simply attaching a wearable device to their bodies, people can easily accumulate daily records of their states, activities, or experiences as lifelog data. Accumulated data are then analyzed and organized as an indexed digital collection that people can access whenever they want to review their lifestyle. For example, people with wrist-mounted sensors such as an Apple Watch can measure the number of steps, heart rate, or multi-axis acceleration in order to analyze their activities. Such biometric data are widely used to estimate health level, stress level, and the number of calories burned, but these applications are limited to quantifying one's internal dynamics.

In contrast, lifelogging with images taken from a wearable camera such as SenseCam [81], GoPro HERO, and Narrative Clip offers us high-fidelity records of everyday visual experiences, which is specifically referred to as visual lifelogging [82].


Wearable cameras are generally placed on the wearer's chest or head to obtain a first-person perspective, such that the images cover everyday scenes showing what the wearer gazes at, reacts to, manipulates, and any other interactions throughout the day. By applying various methods of parsing image content, the pooled images are then tagged with characteristic attributes such as the wearer's activities, objects, and colors, which allows the users to explore the collection with keywords. In comparison to a non-visual lifelog, observing the wearer's social activities is easier in this case. Therefore, first-person vision has traditionally been utilized not only for visual lifelogging purposes but also for social modeling and path prediction.

In visual lifelogging, to make it easy to explore the large collection, the acquired images are cleansed and structured with semantic tags that represent the wearer's visual experiences. For example, at first, the uninformative images are filtered out [83], and the remaining ones are divided into homogeneous temporal segments [84]. Then, they are automatically indexed with predefined semantic tags such as the types of wearer's actions [85], places [86], and objects manipulated by hands [16], so that the user can search and retrieve the intended images/videos by specifying the visual characteristics in queries. The semantic tags can be extracted via various image recognition techniques such as object recognition, object detection, and semantic segmentation, which have been improving rapidly with the use of deep neural networks.

In the field of first-person vision, the principal tasks have been recognition/detection of sports actions and detection of grasped objects. In the meantime, with the progress of recent deep learning techniques, recent studies [87, 88] reported attempts to describe first-person images with natural language sentences.

Natural language descriptions not only list the visual concepts present in an image independently for each object but also represent their relationships in a natural and free form. Simultaneously, research on encoding and semantic understanding of natural language sentences is also progressing. Thus, the visual lifelogging field can evolve from a keyword-based system to a human-friendly, text-based system. That offers us an accessible interface for visual lifelogging; for example, instead of listing keywords like {“dog”, “couch”, “playing”} from a predefined word dictionary, we can simply ask for “the moment of playing with my dog on the couch” in a natural manner.

As mentioned above, conventional visual lifelogging has relied on first-person wearable cameras directly capturing the wearer's visual experiences and object manipulation history.

However, their visual information tends to be noisy or often contains meaningless frames due to the wearer's dynamic ego-motion, occlusion by hands, and unintentional fixation on a wall or ceiling, which may obscure the events of interest. To tackle this problem, many studies have proposed preprocessing such as keyframe detection to filter out such frames [82]. However, even if the camera succeeds in photographing static scenes, it is still insufficient to understand the context of the wearer's behavior from the limited forward views alone, and the recorded collection is biased toward static activities. Therefore, combining the first-person view with complementary observer viewpoints is expected to be beneficial.

This dissertation assumes a multi-perspective vision system in the “intelligent space” wherein a human and a service robot coexist. The intelligent space is a room or an area that is equipped with various sensors or cameras, and it has been widely studied in the robotics community because of its feasibility with regard to human-robot coexistence [37, 38]. Although it is difficult for a standalone robot to observe the dynamic environment and operate diverse service tasks for humans with only onboard sensors, an intelligent space enables it to expand its observation area. In this system, we can use a robot-view camera, not just the user's first-person viewpoint. The camera can move to follow the human and closely capture human behaviors and interactions. Such a view from a camera agent can be defined as the second-person viewpoint. Moreover, the typical intelligent space has embedded cameras on the wall or the ceiling to observe the comprehensive state, which are used to track the human and the robot.

This type of camera can be defined as the third-person viewpoint. These observer viewpoints can capture exocentric information such as the user's postures and place types, which are important cues that complement the first-person description.

Furthermore, this dissertation introduces the novel lifelogging concept “fourth-person vision”, which exploits the first-, second-, and third-person images described above in a complementary manner to generate accurate and detailed descriptions. The perspective term “fourth-person” was initially introduced in my previous study [89]; it is an analogy to a storyteller or a book reader who picks up unique information from multi-perspective sentences and appreciates the storylines, as illustrated in Fig. 3.1. This dissertation aims to demonstrate that the fourth-person vision system improves the accuracy of the caption generation task required in text-based visual lifelogging.


[Figure 3.1 panels: (a) Intelligent space, with the first-person view (human), second-person view (service robot), and third-person view (embedded camera); (b) Fourth-person vision, analogous to a reader of books who combines first-, second-, and third-person sentences in novels.]

Figure 3.1: The fourth-person perspective in the intelligent space. The “fourth-person” is a concept of the omniscient perspective which complementarily combines the first-, second-, and third-person information acquired in the intelligent space.

To build this concept into the task, it is necessary to handle the visual complementarity and redundancy of the multi-perspective images and to learn visual-semantic relations, as depicted in Fig. 3.2.

Therefore, this work newly designs a neural architecture to form a single natural language sentence describing the scene events from the synchronized multi-perspective images based on this concept. Through general caption evaluation schemes, this work demonstrates that the proposed method can generate sentences containing visual attributes of multi-level granularity more accurately than methods with single or double input images. To the best of my knowledge, this is the first work that focuses on multi-perspective images for improving caption generation. The main contributions can be summarized as follows.

• A novel architecture to generate a sentence from multi-perspective lifelog images capturing the same moments in a human-robot symbiotic environment.

• A novel dataset composed of synchronized multi-perspective image sequences that are annotated with multiple natural language descriptions.
