
2.3 Approach

2.3.2 Models

This section introduces unimodal models to verify the modality-specific performance and four types of multimodal models that fuse the modalities.

2.3.2.1 Unimodal

The unimodal model is based on VGG, one of the popular convolutional neural network architectures, proposed by Simonyan et al. [65]. In particular, this work uses the VGG-11 configuration described in [65]. VGG-11 has a stack of 3×3 convolution layers whose numbers of filters from input to output are {64, 128, 256, 256, 512, 512, 512, 512}. None of the convolution operations changes the spatial resolution. After the number of channels is doubled, or after two consecutive convolutions, the resolution of the output feature map is halved by 2×2 max pooling.


Figure 2.1: Samples from the MPO dataset [2]: (a) panoramic color image, (b) panoramic depth image, (c) panoramic reflectance image, (d) 3D point clouds. The panoramic depth/reflectance images are built by a cylindrical projection of the 3D point clouds and cover 360° in the horizontal direction.


Figure 2.2: Row-wise max pooling proposed by Shi et al. [4]. The maximum value is selected for each row of the input feature map.

The feature maps output by the convolutional stack are followed by three fully-connected layers and a softmax function. All the outputs from the convolution and fully-connected layers are activated by a rectified linear unit (ReLU) function, except for the final layer. The final output is a categorical distribution $p \in \mathbb{R}^K$, where $K$ is the number of defined classes and each dimension represents the probability of a specific class.

This work prepares a baseline model based on VGG-11 to verify the effectiveness of the proposed architecture. First, the last stack of fully-connected layers is shrunk to fewer layers and units. This shrinking greatly reduced the number of training parameters and resulted in fast convergence and higher performance.

Second, batch normalization [66] is applied to the pre-activated outputs of all convolution layers and all fully-connected layers except for the final layer. Accordingly, the bias parameters of the normalized layers are removed because they are canceled by the subsequent batch normalization. The final architecture of the baseline is shown in Table 2.1. This work empirically verified that the performance did not improve even when using deeper VGG models [65] or skip connections [67]. Moreover, this work introduces custom operations that consider the circular nature of panoramic images.

Horizontally-Invariant Pooling The visual concepts in the panoramic images tend to shift largely in the horizontal direction due to the yawing motion of the measurement vehicle and the installation angle of the LiDAR. To cope with this variation, this work applies the row-wise max pooling (RWMP) layer proposed by Shi et al. [4] to the feature maps before the first fully-connected layer (fc1 in Table 2.1). As depicted in Fig. 2.2, RWMP outputs the maximum value of each row of the input feature map, which makes the CNN invariant to horizontal translation.
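As a concrete illustration, the following is a minimal PyTorch sketch of the RWMP operation, assuming feature maps laid out as (batch, channels, height, width); the function name and the example shapes are illustrative, not part of the original implementation.

import torch

def row_wise_max_pool(x: torch.Tensor) -> torch.Tensor:
    # Row-wise max pooling (RWMP): keep only the maximum activation of each
    # row, collapsing the width dimension. The result no longer depends on
    # where along the horizontal axis the activation occurred.
    return x.max(dim=-1).values

# Example: applied to a 512x1x12 pool5 map, RWMP leaves a 512-dimensional
# vector per sample for the first fully-connected layer.
feat = torch.randn(8, 512, 1, 12)
print(row_wise_max_pool(feat).flatten(1).shape)  # torch.Size([8, 512])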

Table 2.1: Baseline architecture to process a 32×384 unimodal image. The numbers (k, s, p) in the Setting column denote that the incoming tensor is padded with p pixels and the operation with k×k kernels is applied with a stride of s.

Name      Layer type       Setting     Parameters    Output shape
conv1     Convolution      (3, 1, 1)   64 kernels    64×32×384
pool1     Max pooling      (2, 2, 0)                 64×16×192
conv2     Convolution      (3, 1, 1)   128 kernels   128×16×192
pool2     Max pooling      (2, 2, 0)                 128×8×96
conv3_1   Convolution      (3, 1, 1)   256 kernels   256×8×96
conv3_2   Convolution      (3, 1, 1)   256 kernels   256×8×96
pool3     Max pooling      (2, 2, 0)                 256×4×48
conv4_1   Convolution      (3, 1, 1)   512 kernels   512×4×48
conv4_2   Convolution      (3, 1, 1)   512 kernels   512×4×48
pool4     Max pooling      (2, 2, 0)                 512×2×24
conv5_1   Convolution      (3, 1, 1)   512 kernels   512×2×24
conv5_2   Convolution      (3, 1, 1)   512 kernels   512×2×24
pool5     Max pooling      (2, 2, 0)                 512×1×12
fc1       Fully-connected              128 units     128
fc2       Fully-connected              6 units       6
          Softmax                                    6
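For reference, the layer stack of Table 2.1 can be sketched in PyTorch as follows. This is a minimal sketch rather than the original implementation: the class name is illustrative, fc1 is fed the flattened pool5 map (in the proposed model the RWMP layer described above replaces the flattening), and the softmax is applied inside the forward pass only for clarity.

import torch
import torch.nn as nn

def conv_bn_relu(in_ch: int, out_ch: int) -> nn.Sequential:
    # 3x3 convolution with stride 1 and padding 1; the bias is dropped
    # because the subsequent batch normalization cancels it.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UnimodalBaseline(nn.Module):
    # Sketch of the Table 2.1 baseline for a 1x32x384 unimodal image.
    def __init__(self, in_ch: int = 1, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(in_ch, 64), nn.MaxPool2d(2),                          # conv1, pool1
            conv_bn_relu(64, 128), nn.MaxPool2d(2),                            # conv2, pool2
            conv_bn_relu(128, 256), conv_bn_relu(256, 256), nn.MaxPool2d(2),   # conv3_1/2, pool3
            conv_bn_relu(256, 512), conv_bn_relu(512, 512), nn.MaxPool2d(2),   # conv4_1/2, pool4
            conv_bn_relu(512, 512), conv_bn_relu(512, 512), nn.MaxPool2d(2),   # conv5_1/2, pool5
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 1 * 12, 128, bias=False),  # fc1 (128 units)
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),               # fc2 (6 units)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(x).flatten(1))
        return torch.softmax(logits, dim=1)  # categorical distribution p

print(UnimodalBaseline()(torch.randn(2, 1, 32, 384)).shape)  # torch.Size([2, 6])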

Circular Convolution on the 2D Plane The depth and reflectance panoramic images are continuous between their left and right edges, while a standard convolution only extracts features from local regions limited by the image boundaries. To extract features while retaining this circular structure, this work introduces an operation that circulates the convolution kernels horizontally, called horizontal circular convolution (HCC). In a standard convolution layer, a zero-padding operation is commonly used to keep the resolution of the feature maps; it fills the periphery of the incoming tensor with zeros. In contrast, the HCC layer pads the left and right edges with the values of the opposite sides and follows with a normal convolution without padding, as shown in Fig. 2.3. This operation is equivalent to circulating the kernels over the edges. All convolution layers of the baseline model are replaced with the HCC layer.
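A sketch of an HCC layer in PyTorch is shown below, assuming 3×3 kernels with stride 1 as in the baseline; the class name is illustrative, and the bias-free convolution reflects the batch normalization that follows each convolution in the baseline.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalCircularConv2d(nn.Module):
    # Horizontal circular convolution: wrap the left and right edges with the
    # values of the opposite sides, zero-pad top and bottom, then apply a
    # normal convolution without padding.
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.pad = k // 2
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=1, padding=0, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")             # left/right wrap
        x = F.pad(x, (0, 0, self.pad, self.pad), mode="constant", value=0.0)  # top/bottom zeros
        return self.conv(x)

# The output resolution matches that of a standard 3x3 convolution with padding 1.
print(HorizontalCircularConv2d(1, 64)(torch.randn(1, 1, 32, 384)).shape)  # (1, 64, 32, 384)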


Figure 2.3: Horizontal circular convolution. The left and right edges are padded with the values of the opposite sides, so the convolution output has a receptive field that extends beyond the edges.

Figure 2.4: Architectures of the multimodal models (Softmax Average, Adaptive Fusion, Early Fusion, and Late Fusion). "FCs" and "Conv" denote fully-connected layers and convolution layers, respectively.

2.3.2.2 Multimodal

Multimodal models receive both a depth image and the corresponding reflectance image as input. This section introduces four types of multimodal architectures for fusing the two modalities, as depicted in Fig. 2.4.

Softmax Average This model is a type of decision-level fusion. The visual features of the depth map and the reflectance map are learned separately on different models. In this work, each model is selected according to its performance in the unimodal case. At inference time, each of the unimodal models predicts the probabilities of the $K$ classes. Let $p_d$ and $p_r$ be the probabilities from the depth and the reflectance models, respectively. The final result $p \in \mathbb{R}^K$ is determined by the category-wise average $p = (p_d + p_r)/2$.
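The fusion rule amounts to the following short sketch (function and variable names are illustrative):

import torch

@torch.no_grad()
def softmax_average(depth_model, refl_model, x_depth, x_refl):
    # Decision-level fusion: average the K-class probabilities of the two
    # pre-trained unimodal models category by category.
    p_d = depth_model(x_depth)  # (N, K) probabilities from the depth model
    p_r = refl_model(x_refl)    # (N, K) probabilities from the reflectance model
    return 0.5 * (p_d + p_r)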

Adaptive Fusion The Adaptive Fusion model adaptively weights the modality-specific predictions with an additional gating network $g(\cdot)$ that estimates the certainties from intermediate activations of both modalities. This approach was originally proposed by Mees et al. [21] for a pedestrian detection task. Let the intermediate activation maps of the depth and the reflectance models be $r_d$ and $r_r$, respectively. In this work, the pool5 feature map in Table 2.1 is used as the input to the gating network. Each unimodal model is selected in the same fashion as in Softmax Average. In the training phase, each unimodal model is pre-trained separately and fixed, and then only the gating network is trained. The final result $p \in \mathbb{R}^K$ is determined by weighted averaging as follows:

\begin{align}
(w_d, w_r) &= g(r_d, r_r), \tag{2.1} \\
p &= \underbrace{w_d\, p_d}_{\text{Depth}} + \underbrace{w_r\, p_r}_{\text{Reflectance}}, \tag{2.2}
\end{align}

where $w_d$ and $w_r$ are the certainties estimated by the gating network $g(\cdot)$, and $p_d$ and $p_r$ are the probabilities from the depth and the reflectance models. The gating network is composed of two stacked fully-connected layers and a softmax. As in the main network, the first fully-connected layer is followed by batch normalization and ReLU activation.
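A possible realization of the gating network and the weighted average of Eqs. (2.1) and (2.2) is sketched below. The hidden width of 128 units and the concatenation of the two flattened pool5 maps are assumptions for illustration; the text above only fixes the two-layer structure with batch normalization, ReLU, and a final softmax.

import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    # Sketch of g(.): two fully-connected layers and a softmax that turn the
    # pool5 activations of the frozen depth and reflectance models into the
    # certainties (w_d, w_r), which sum to one.
    def __init__(self, feat_dim: int = 512 * 1 * 12, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden, bias=False),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
        )
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, r_d: torch.Tensor, r_r: torch.Tensor) -> torch.Tensor:
        z = torch.cat([r_d.flatten(1), r_r.flatten(1)], dim=1)
        return torch.softmax(self.fc2(self.fc1(z)), dim=1)  # (N, 2) = (w_d, w_r)

def adaptive_fusion(w: torch.Tensor, p_d: torch.Tensor, p_r: torch.Tensor) -> torch.Tensor:
    # Eq. (2.2): certainty-weighted average of the unimodal predictions.
    return w[:, 0:1] * p_d + w[:, 1:2] * p_r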

Early Fusion The Early Fusion model takes a two-channel input that is a simple stack of the two modalities. Local co-occurrence features can be learned at the pixel level since the information of both modalities is merged at the first convolution layer of the unified network. Except for the number of input channels, the layer structure is the same as the unimodal baseline. All parameters, from the input through the output, are trained end-to-end.
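In terms of the baseline sketch above, Early Fusion only changes the input (the UnimodalBaseline name refers to that earlier sketch):

import torch

# Early Fusion sketch: stack depth and reflectance as a two-channel image and
# feed it to a single network; only the first convolution layer differs from
# the unimodal baseline (two input channels instead of one).
x_depth = torch.randn(4, 1, 32, 384)
x_refl = torch.randn(4, 1, 32, 384)
x = torch.cat([x_depth, x_refl], dim=1)    # (4, 2, 32, 384)
early_fusion = UnimodalBaseline(in_ch=2)   # same layer structure, 2-channel input
print(early_fusion(x).shape)               # torch.Size([4, 6])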

Late Fusion The Late Fusion model has a separate convolutional stream for each modality and shared fully-connected layers that output the final categorical distribution. The spatial features of the two modalities are learned separately, while their global co-occurrence is learned in the shared part. Up to the last convolution layers, the architectures are the same as in the unimodal cases. At training time, to avoid the vanishing gradient problem, the model first copies the parameters of the convolution layers from the pre-trained unimodal models and then fine-tunes the entire model, including the randomly initialized fully-connected layers.
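A sketch of the Late Fusion model is given below, assuming that the two pool5 maps are flattened and concatenated before the shared fully-connected layers; the 128-unit hidden layer mirrors fc1 of the baseline, and the exact concatenation point is an assumption for illustration.

import torch
import torch.nn as nn

class LateFusion(nn.Module):
    # Late Fusion sketch: one convolutional stream per modality (copied from
    # the pre-trained unimodal models) and shared fully-connected layers.
    def __init__(self, depth_stream: nn.Module, refl_stream: nn.Module,
                 feat_dim: int = 512 * 1 * 12, num_classes: int = 6):
        super().__init__()
        self.depth_stream = depth_stream  # e.g. the pre-trained conv layers of the depth model
        self.refl_stream = refl_stream    # e.g. the pre-trained conv layers of the reflectance model
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 128, bias=False),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x_depth: torch.Tensor, x_refl: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.depth_stream(x_depth).flatten(1),
                       self.refl_stream(x_refl).flatten(1)], dim=1)
        return torch.softmax(self.classifier(z), dim=1)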

