
2.3 Approach

2.3.2 Models

This section introduces unimodal models to verify the modality-specific performance and four types of multimodal models that fuse the modalities.

2.3.2.1 Unimodal

The unimodal model is based on VGG, one of the popular convolutional neural network architectures, proposed by Simonyan et al. [65]. In particular, this work uses the VGG-11 configuration described in [65]. VGG-11 has a stack of 3×3 convolution layers whose numbers of filters from input to output are {64, 128, 256, 256, 512, 512, 512, 512}. None of the convolution operations changes the spatial resolution. After the number of channels is doubled, or after two consecutive convolutions, the resolution of the output feature map is halved by 2×2 max pooling.


Figure 2.1: Samples from the MPO dataset [2]: (a) panoramic color image, (b) panoramic depth image, (c) panoramic reflectance image, (d) 3D point clouds. The panoramic depth/reflectance images are built by a cylindrical projection of the 3D point clouds and cover 360° in the horizontal direction.


Figure 2.2: Row-wise max pooling proposed by Shi et al. [4]. The maximum value is selected for each row of the input feature map.

The feature maps output by the convolutional stack are followed by three fully-connected layers and a softmax function. All the outputs from the convolution and fully-connected layers are activated by a rectified linear unit (ReLU) function, except for the final layer. The final output is a categorical distribution $p \in \mathbb{R}^K$, where $K$ is the number of defined classes and each dimension represents the probability of a specific class.

This work prepares a baseline model based on VGG-11 to verify the effectiveness of the proposed architecture. First, the last stack of fully-connected layers is shrunk to fewer layers and units. This shrinking greatly reduced the number of training parameters and resulted in fast convergence and higher performance.

Second, batch normalization [66] is applied to the pre-activated outputs of all convolution layers and all fully-connected layers except for the final layer. Accordingly, the bias parameters of the normalized layers are removed because they are canceled by the subsequent batch normalization. The final architecture of the baseline is shown in Table 2.1. This work empirically verified that the performance did not improve even when using deeper VGG models [65] or skip connections [67]. Moreover, this work introduces custom operations that consider the circular nature of panoramic images.

Horizontally-Invariant Pooling The visual concepts in the panoramic images tend to shift largely in the horizontal direction due to the yawing motion of the measurement vehicle and the installation angle of the LiDAR. To cope with this variation, this work applies the row-wise max pooling (RWMP) layer proposed by Shi et al. [4] to the feature maps before the first fully-connected layer (fc1 in Table 2.1). As depicted in Fig. 2.2, RWMP outputs the maximum value of each row of the input feature map, which makes the CNN invariant to horizontal translation.
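As a concrete illustration, the following is a minimal PyTorch sketch of the RWMP operation, assuming feature maps laid out as (batch, channels, height, width); the function name and the example shapes are illustrative, not part of the original implementation.

import torch

def row_wise_max_pool(x: torch.Tensor) -> torch.Tensor:
    # Row-wise max pooling (RWMP): keep only the maximum activation of each
    # row, collapsing the width dimension. The result no longer depends on
    # where along the horizontal axis the activation occurred.
    return x.max(dim=-1).values

# Example: applied to a 512x1x12 pool5 map, RWMP leaves a 512-dimensional
# vector per sample for the first fully-connected layer.
feat = torch.randn(8, 512, 1, 12)
print(row_wise_max_pool(feat).flatten(1).shape)  # torch.Size([8, 512])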

Table 2.1: Baseline architecture to process a 32×384 unimodal image. The numbers (k, s, p) in the Setting column denote that the incoming tensor is padded with p pixels and the operation with k×k kernels is applied with a stride of s.

Name      Layer type       Setting     Parameters    Output shape
conv1     Convolution      (3, 1, 1)   64 kernels    64×32×384
pool1     Max pooling      (2, 2, 0)                 64×16×192
conv2     Convolution      (3, 1, 1)   128 kernels   128×16×192
pool2     Max pooling      (2, 2, 0)                 128×8×96
conv3_1   Convolution      (3, 1, 1)   256 kernels   256×8×96
conv3_2   Convolution      (3, 1, 1)   256 kernels   256×8×96
pool3     Max pooling      (2, 2, 0)                 256×4×48
conv4_1   Convolution      (3, 1, 1)   512 kernels   512×4×48
conv4_2   Convolution      (3, 1, 1)   512 kernels   512×4×48
pool4     Max pooling      (2, 2, 0)                 512×2×24
conv5_1   Convolution      (3, 1, 1)   512 kernels   512×2×24
conv5_2   Convolution      (3, 1, 1)   512 kernels   512×2×24
pool5     Max pooling      (2, 2, 0)                 512×1×12
fc1       Fully-connected              128 units     128
fc2       Fully-connected              6 units       6
          Softmax                                    6
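For reference, the layer stack of Table 2.1 can be sketched in PyTorch as follows. This is a minimal sketch rather than the original implementation: the class name is illustrative, fc1 is fed the flattened pool5 map (in the proposed model the RWMP layer described above replaces the flattening), and the softmax is applied inside the forward pass only for clarity.

import torch
import torch.nn as nn

def conv_bn_relu(in_ch: int, out_ch: int) -> nn.Sequential:
    # 3x3 convolution with stride 1 and padding 1; the bias is dropped
    # because the subsequent batch normalization cancels it.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UnimodalBaseline(nn.Module):
    # Sketch of the Table 2.1 baseline for a 1x32x384 unimodal image.
    def __init__(self, in_ch: int = 1, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(in_ch, 64), nn.MaxPool2d(2),                          # conv1, pool1
            conv_bn_relu(64, 128), nn.MaxPool2d(2),                            # conv2, pool2
            conv_bn_relu(128, 256), conv_bn_relu(256, 256), nn.MaxPool2d(2),   # conv3_1/2, pool3
            conv_bn_relu(256, 512), conv_bn_relu(512, 512), nn.MaxPool2d(2),   # conv4_1/2, pool4
            conv_bn_relu(512, 512), conv_bn_relu(512, 512), nn.MaxPool2d(2),   # conv5_1/2, pool5
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 1 * 12, 128, bias=False),  # fc1 (128 units)
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),               # fc2 (6 units)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(x).flatten(1))
        return torch.softmax(logits, dim=1)  # categorical distribution p

print(UnimodalBaseline()(torch.randn(2, 1, 32, 384)).shape)  # torch.Size([2, 6])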

Circular Convolution on the 2D Plane The depth and reflectance panoramic images are continuous between their left and right edges, while a standard convolution only extracts features from local regions limited by the image boundaries. To extract features while retaining this circular structure, this work introduces an operation that circulates the convolution kernels horizontally, called horizontal circular convolution (HCC). In a standard convolution layer, a zero-padding operation is commonly used to keep the resolution of the feature maps; it fills the periphery of the incoming tensor with zeros. In contrast, the HCC layer pads the left and right edges with the values of the opposite sides and follows with a normal convolution without padding, as shown in Fig. 2.3. This operation is equivalent to circulating the kernels over the edges. All convolution layers of the baseline model are replaced with the HCC layer.
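A sketch of an HCC layer in PyTorch is shown below, assuming 3×3 kernels with stride 1 as in the baseline; the class name is illustrative, and the bias-free convolution reflects the batch normalization that follows each convolution in the baseline.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalCircularConv2d(nn.Module):
    # Horizontal circular convolution: wrap the left and right edges with the
    # values of the opposite sides, zero-pad top and bottom, then apply a
    # normal convolution without padding.
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.pad = k // 2
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=1, padding=0, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")             # left/right wrap
        x = F.pad(x, (0, 0, self.pad, self.pad), mode="constant", value=0.0)  # top/bottom zeros
        return self.conv(x)

# The output resolution matches that of a standard 3x3 convolution with padding 1.
print(HorizontalCircularConv2d(1, 64)(torch.randn(1, 1, 32, 384)).shape)  # (1, 64, 32, 384)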


Figure 2.3: Horizontal circular convolution. The left and right edges are padded with the values of the opposite sides, so the convolution output has a receptive field that extends beyond the edges.

Figure 2.4: Architectures of the multimodal models (Softmax Average, Adaptive Fusion, Early Fusion, and Late Fusion). "FCs" and "Conv" denote fully-connected layers and convolution layers, respectively.

2.3.2.2 Multimodal

Multimodal models receive both a depth image and the corresponding reflectance image as input. This section introduces four types of multimodal architectures for fusing the two modalities, as depicted in Fig. 2.4.

Softmax Average This model is a type of decision-level fusion. The visual features of the depth map and the reflectance map are learned separately on different models. In this work, each model is selected according to its performance in the unimodal case. At inference time, each of the unimodal models predicts the probabilities of the $K$ classes. Let $p_d$ and $p_r$ be the probabilities from the depth and the reflectance models, respectively. The final result $p \in \mathbb{R}^K$ is determined by the category-wise average $p = (p_d + p_r)/2$.
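The fusion rule amounts to the following short sketch (function and variable names are illustrative):

import torch

@torch.no_grad()
def softmax_average(depth_model, refl_model, x_depth, x_refl):
    # Decision-level fusion: average the K-class probabilities of the two
    # pre-trained unimodal models category by category.
    p_d = depth_model(x_depth)  # (N, K) probabilities from the depth model
    p_r = refl_model(x_refl)    # (N, K) probabilities from the reflectance model
    return 0.5 * (p_d + p_r)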

Adaptive Fusion The Adaptive Fusion model adaptively weights the modality-specific predictions with an additional gating network $g(\cdot)$ that estimates the certainties from intermediate activations of both modalities. This approach was originally proposed by Mees et al. [21] for a pedestrian detection task. Let the intermediate activation maps of the depth and the reflectance models be $r_d$ and $r_r$, respectively. In this work, the pool5 feature map in Table 2.1 is used as the input to the gating network. Each unimodal model is selected in the same fashion as in Softmax Average. In the training phase, each unimodal model is pre-trained separately and fixed, and then only the gating network is trained. The final result $p \in \mathbb{R}^K$ is determined by weighted averaging as follows:

\begin{align}
(w_d, w_r) &= g(r_d, r_r), \tag{2.1} \\
p &= \underbrace{w_d\, p_d}_{\text{Depth}} + \underbrace{w_r\, p_r}_{\text{Reflectance}}, \tag{2.2}
\end{align}

where $w_d$ and $w_r$ are the certainties estimated by the gating network $g(\cdot)$, and $p_d$ and $p_r$ are the probabilities from the depth and the reflectance models. The gating network is composed of two stacked fully-connected layers and a softmax. As in the main network, the first fully-connected layer is followed by batch normalization and ReLU activation.
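A possible realization of the gating network and the weighted average of Eqs. (2.1) and (2.2) is sketched below. The hidden width of 128 units and the concatenation of the two flattened pool5 maps are assumptions for illustration; the text above only fixes the two-layer structure with batch normalization, ReLU, and a final softmax.

import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    # Sketch of g(.): two fully-connected layers and a softmax that turn the
    # pool5 activations of the frozen depth and reflectance models into the
    # certainties (w_d, w_r), which sum to one.
    def __init__(self, feat_dim: int = 512 * 1 * 12, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden, bias=False),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
        )
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, r_d: torch.Tensor, r_r: torch.Tensor) -> torch.Tensor:
        z = torch.cat([r_d.flatten(1), r_r.flatten(1)], dim=1)
        return torch.softmax(self.fc2(self.fc1(z)), dim=1)  # (N, 2) = (w_d, w_r)

def adaptive_fusion(w: torch.Tensor, p_d: torch.Tensor, p_r: torch.Tensor) -> torch.Tensor:
    # Eq. (2.2): certainty-weighted average of the unimodal predictions.
    return w[:, 0:1] * p_d + w[:, 1:2] * p_r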

Early Fusion The Early Fusion model takes a two-channel input that is a simple stack of the two modalities. Local co-occurrence features can be learned at the pixel level since the information of both modalities is merged at the first convolution layer of the unified network. Except for the number of input channels, the layer structure is the same as the unimodal baseline. All parameters, from the input through the output, are trained end-to-end.
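In terms of the baseline sketch above, Early Fusion only changes the input (the UnimodalBaseline name refers to that earlier sketch):

import torch

# Early Fusion sketch: stack depth and reflectance as a two-channel image and
# feed it to a single network; only the first convolution layer differs from
# the unimodal baseline (two input channels instead of one).
x_depth = torch.randn(4, 1, 32, 384)
x_refl = torch.randn(4, 1, 32, 384)
x = torch.cat([x_depth, x_refl], dim=1)    # (4, 2, 32, 384)
early_fusion = UnimodalBaseline(in_ch=2)   # same layer structure, 2-channel input
print(early_fusion(x).shape)               # torch.Size([4, 6])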

Late Fusion The Late Fusion model has a separate convolutional stream for each modality and shared fully-connected layers that output the final categorical distribution. The spatial features of the two modalities are learned separately, while their global co-occurrence is learned in the shared part. Up to the last convolution layers, the architectures are the same as in the unimodal cases. At training time, to avoid the vanishing gradient problem, the model first copies the parameters of the convolution layers from the pre-trained unimodal models and then fine-tunes the entire model, including the randomly initialized fully-connected layers.
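A sketch of the Late Fusion model is given below, assuming that the two pool5 maps are flattened and concatenated before the shared fully-connected layers; the 128-unit hidden layer mirrors fc1 of the baseline, and the exact concatenation point is an assumption for illustration.

import torch
import torch.nn as nn

class LateFusion(nn.Module):
    # Late Fusion sketch: one convolutional stream per modality (copied from
    # the pre-trained unimodal models) and shared fully-connected layers.
    def __init__(self, depth_stream: nn.Module, refl_stream: nn.Module,
                 feat_dim: int = 512 * 1 * 12, num_classes: int = 6):
        super().__init__()
        self.depth_stream = depth_stream  # e.g. the pre-trained conv layers of the depth model
        self.refl_stream = refl_stream    # e.g. the pre-trained conv layers of the reflectance model
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 128, bias=False),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x_depth: torch.Tensor, x_refl: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.depth_stream(x_depth).flatten(1),
                       self.refl_stream(x_refl).flatten(1)], dim=1)
        return torch.softmax(self.classifier(z), dim=1)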

