2.4 Experiments

2.4.3 Quantitative Analysis

2.4.3.1 Unimodal performance

This section first reports the unimodal performance of the following four types of CNNs on two modalities, depth and reflectance: the baseline VGG, VGG with either HCC layers or the RWMP layer (VGG+HCC, VGG+RWMP), and VGG with both HCC layers and the RWMP layer (VGG+RWMP+HCC). In addition, this work compares them with traditional hand-engineered approaches and another type of CNN, ResNet [67]. The 20-layer ResNet proposed in the CIFAR-10 experiment [67] is selected because its target image size is similar to ours. The performance of all the approaches is shown in Table 2.3.
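
For concreteness, the following is a minimal PyTorch sketch of how the two proposed modifications could be realized, assuming HCC denotes a convolution with circular padding along the horizontal axis (preserving the left-right continuity of the panoramic input) and RWMP a max pooling over the full horizontal extent of the feature map; the class names and tensor sizes are illustrative, not the thesis implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HCConv2d(nn.Module):
        # Sketch of an HCC layer: the left and right borders of the panorama
        # are wrapped before the convolution so that features stay continuous
        # across the image edge.
        def __init__(self, in_ch, out_ch, kernel_size=3):
            super().__init__()
            self.pad = kernel_size // 2
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size)

        def forward(self, x):
            x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")  # wrap width
            x = F.pad(x, (0, 0, self.pad, self.pad))                   # zero-pad height
            return self.conv(x)

    class RWMP(nn.Module):
        # Sketch of the RWMP layer: taking the maximum over the full
        # horizontal axis discards the absolute horizontal position.
        def forward(self, x):              # x: (N, C, H, W)
            return x.max(dim=3).values     # -> (N, C, H)

    feat = torch.randn(2, 64, 4, 12)       # 12 horizontal units -> 30 deg periodicity
    print(RWMP()(feat).shape)              # torch.Size([2, 64, 4])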

Comparison of CNNs As shown in Table 2.3, for the reflectance image, VGG+RWMP+HCC achieved the best total accuracy of 95.92%. Moreover, we can see that RWMP and HCC each improve the performance independently.

On the other hand, for the depth image, RWMP and HCC did not contribute to the accuracy; the baseline VGG achieved the best total accuracy of 97.18%. Compared to the baseline, VGG+RWMP shows a subtle drop of 0.07%, while VGG+HCC degrades by 0.29%. This indicates that the left-right continuity provided by HCC does not contribute to scene discrimination on the depth modality.

Table 2.3: Classification accuracy [%] on unimodal inputs. Coast, Forest, InP, OutP, Res, and Urban denote scene classes of coast area, forest area, indoor parking, outdoor parking, residential area, and urban area, respectively. The models with RWMP and/or HCC are proposed in this work.

    Input        Approach               Coast  Forest  InP    OutP   Res    Urban  Total
    Depth        Spin-Image [70] + SVM  65.60  86.30   81.84  86.26  82.95  64.31  79.23
                 GIST [71] + SVM        75.42  91.52   82.72  86.72  86.54  81.16  84.53
                 LBP [72] + SVM         84.25  94.93   96.41  86.86  94.58  92.71  92.00
                 ResNet [67]            90.65  94.25   97.67  95.00  97.68  97.79  95.66
                 VGG                    92.73  97.26   99.94  94.23  98.35  99.20  97.18
                 VGG + RWMP             93.74  97.53   98.98  94.45  98.35  98.35  97.11
                 VGG + HCC              92.98  96.54   99.54  93.99  98.23  98.81  96.89
                 VGG + RWMP + HCC       93.03  96.80   98.85  94.29  98.32  98.86  96.92
    Reflectance  GIST [71] + SVM        68.85  91.65   74.07  81.73  79.88  75.03  79.15
                 LBP [72] + SVM         76.66  95.67   80.18  92.74  88.44  79.53  86.19
                 ResNet [67]            90.30  96.62   94.60  93.63  96.13  96.09  94.83
                 VGG                    90.58  97.99   92.37  93.89  96.40  95.06  94.75
                 VGG + RWMP             91.13  97.63   93.29  94.67  97.93  97.37  95.74
                 VGG + HCC              91.00  98.38   94.39  94.13  97.58  95.01  95.45
                 VGG + RWMP + HCC       91.83  98.20   91.45  95.16  97.99  98.27  95.92

We will further discuss this point in Section 2.4.4. The visualization experiment reveals that the models trained on depth rely strongly on specific regions rather than on the continuous edges. Focusing on the results for each category, we can see that the best reflectance model shows higher accuracy in the forest and outdoor parking categories than the best depth model. This suggests that the accuracy could be improved by combining both modalities. The confusion matrix of the best model for each modality is shown in Fig. 2.5. Both models tend to have more errors between the coast and forest categories. One of the reasons is that the coast data have a unique characteristic in that the sky and sea regions are dropped regularly, while the wooded regions on the opposite side resemble the forest category. It can be considered that the model misclassifies coast images as forest when these unique areas are covered by crossing cars or trees.
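
For reference, row-normalized confusion matrices like those in Fig. 2.5 can be produced as in the following sketch; y_true and y_pred are placeholders for the test labels and model predictions.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    labels = ["Coast", "Forest", "InP", "OutP", "Res", "Urban"]
    y_true = np.array([0, 1, 2, 3, 4, 5, 0, 0])   # placeholder ground-truth labels
    y_pred = np.array([0, 1, 2, 3, 4, 5, 1, 0])   # placeholder predictions
    cm = confusion_matrix(y_true, y_pred, normalize="true") * 100.0
    print(np.round(cm, 2))   # row i: percentage of class i predicted as each class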

Comparison to Traditional Methods This work compares the classification results with three traditional approaches, Spin-Image [70], GIST [71], and Local Binary Patterns (LBP) [72], which extract hand-engineered features from a given image. Spin-Image is a popular technique for surface matching and 3D object recognition tasks.

(a) Depth: VGG

    Ground truth \ Prediction   Coast  Forest  InP    OutP   Res    Urban
    Coast                       92.73   4.83   0.00   0.65   1.32   0.48
    Forest                       2.38  97.26   0.00   0.21   0.15   0.00
    InP                          0.00   0.00  99.94   0.00   0.06   0.00
    OutP                         0.26   0.22   3.24  94.23   1.31   0.74
    Res                          0.80   0.29   0.00   0.51  98.35   0.05
    Urban                        0.05   0.07   0.00   0.15   0.53  99.20

(b) Reflectance: VGG + RWMP + HCC

    Ground truth \ Prediction   Coast  Forest  InP    OutP   Res    Urban
    Coast                       91.83   4.80   0.02   0.92   1.43   0.99
    Forest                       1.18  98.20   0.03   0.18   0.39   0.02
    InP                          0.00   0.00  91.45   3.94   4.59   0.02
    OutP                         0.16   0.06   2.53  95.16   2.07   0.02
    Res                          0.49   0.30   0.08   0.95  97.99   0.18
    Urban                        0.49   0.02   0.00   0.70   0.53  98.27

Figure 2.5: Confusion matrices [%] of the best models on the depth and reflectance inputs. Rows denote the ground truth and columns the prediction.

This work obtains a single Spin-Image representation from an original point cloud in a scanner-oriented viewpoint in the cylindrical coordinate system. GIST [71] was proposed specifically for the scene classification task.

Some recent studies [34, 51] on scene classification used GIST features for color and depth images. Although LBP was originally proposed for texture classification by Ojala et al. [72], it is reported that LBP is also effective for indoor place classification [55, 46]. These features are fed to a support vector machine (SVM) with a radial basis function (RBF) kernel to predict the category. Hyperparameters of the SVM are selected by a grid search over {10^-2, ..., 10^10} for the cost parameter C and {10^-10, ..., 10^0} for the kernel parameter γ. In all categories, the CNN-based approaches outperform the traditional approaches above.
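
As an illustration of this baseline, the following sketch pairs an LBP histogram feature with an RBF-kernel SVM tuned by the grid search described above; the LBP settings (P, R, uniform patterns) are assumptions for illustration, not necessarily the exact configuration used in this work.

    import numpy as np
    from skimage.feature import local_binary_pattern
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    def lbp_histogram(image, P=8, R=1):
        # Uniform LBP codes (values 0..P+1) pooled into a normalized histogram.
        codes = local_binary_pattern(image, P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        return hist

    # X: (n_samples, P + 2) LBP histograms, y: scene labels
    param_grid = {"C": [10.0 ** k for k in range(-2, 11)],       # 10^-2 .. 10^10
                  "gamma": [10.0 ** k for k in range(-10, 1)]}   # 10^-10 .. 10^0
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    # search.fit(X, y); best model available as search.best_estimator_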

Robustness Test on Horizontal Rotation In this experiment, the input images are rotated horizontally to examine the robustness of the classification accuracy.

The results are shown in Fig. 2.6. For both depth and reflectance modalities, we can see a steep drop of accuracy with the baseline VGG and VGG+RWMP models when the angle of rotation is around 90° or 270°. In this situation, the front or back regions of the image are shifted to the discontinuous image edges. Thus, it can be considered that the performance drop occurs because the critical textures are dispersed across the left and right edges at 90° and 270° rotations.

Further analysis of this point is discussed in Section 2.4.4. On the other hand, the HCC modification contributes to rotational invariance for both input modalities. The cyclic ripples seen in the accuracy depend on the size of the receptive field modified by RWMP and HCC. The RWMP layer receives feature maps with a horizontal resolution of 12 units; therefore, the receptive field of a single unit has a 30° horizontal periodicity on the image plane.
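
Since the inputs are full 360° panoramas, rotating the sensor horizontally corresponds to circularly shifting the pixel columns; a minimal sketch of this evaluation, with evaluate, model, and test_images as placeholders, follows.

    import numpy as np

    def rotate_panorama(image, angle_deg):
        # A rotation about the vertical axis reappears in a 360-degree
        # panorama as a circular shift along the width axis.
        w = image.shape[1]
        shift = int(round(angle_deg / 360.0 * w))
        return np.roll(image, shift, axis=1)

    # for angle in range(0, 361, 30):
    #     acc = evaluate(model, [rotate_panorama(img, angle) for img in test_images])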

2.4.3.2 Multimodal performance

This section reports the performance of the multimodal fusion approaches described in Section 2.3.2.2. For the Softmax Average and the Adaptive Fusion models, this work uses the models that showed the best results in the unimodal experiments: the baseline VGG for the depth input and the VGG with both HCC and RWMP for the reflectance input. The results are shown in Table 2.4.

As shown in Table 2.4, all multimodal models improved in accuracy compared to the unimodal models, except the Early Fusion model. In particular, the Softmax Average model shows the best results in the total score and in two categories, forest and urban. The Softmax Average was 0.69% better than the best depth result and 1.95% better than the best reflectance result. The Adaptive Fusion model, which is structurally similar to the Softmax Average model, shows the best results in two other categories. We can see that its total accuracy is slightly inferior to the Softmax Average accuracy. This is primarily due to the shortage of samples to train the auxiliary network sufficiently. As for the Early Fusion model, Long et al. [73] observed a drop in accuracy when evaluating their segmentation network with the early-fusion setting. They suggested that the drop could be caused by vanishing gradients, i.e., the difficulty of training the first layer that fuses the modalities. Likewise, our deep network may suffer from this problem. The Late Fusion model shows better results in a few categories compared to the unimodal cases; however, its effectiveness is low.

Among all fusion approaches, Softmax Average and Adaptive Fusion, in which the depth and reflectance maps are processed separately and the final decisions are merged, achieve effective results. In contrast, the approaches that merge modalities within the networks yield results equal to or worse than those of the unimodal models. It can be considered that the efficiency of training deep networks strongly affects the accuracies.
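
The two decision-level variants can be sketched in PyTorch as follows, assuming Softmax Average takes the arithmetic mean of the two unimodal posteriors and Adaptive Fusion weights them with a small auxiliary network; the auxiliary design shown here is an assumption for illustration, not necessarily the architecture of Section 2.3.2.2.

    import torch
    import torch.nn as nn

    def softmax_average(logits_depth, logits_refl):
        # Mean of the two unimodal softmax outputs; final class is its argmax.
        return (logits_depth.softmax(dim=1) + logits_refl.softmax(dim=1)) / 2

    class AdaptiveFusion(nn.Module):
        # A small auxiliary network predicts per-sample weights for the two
        # modality posteriors (illustrative design).
        def __init__(self, num_classes=6):
            super().__init__()
            self.aux = nn.Sequential(nn.Linear(2 * num_classes, 2), nn.Softmax(dim=1))

        def forward(self, logits_depth, logits_refl):
            p_d = logits_depth.softmax(dim=1)
            p_r = logits_refl.softmax(dim=1)
            w = self.aux(torch.cat([p_d, p_r], dim=1))   # (N, 2) modality weights
            return w[:, :1] * p_d + w[:, 1:] * p_r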

Figure 2.6: Classification robustness to horizontal rotation of an input image, plotting accuracy [%] against the angle of rotation [deg] from 0 to 360 for the VGG, VGG+HCC, VGG+RWMP, and VGG+RWMP+HCC models on (a) depth and (b) reflectance inputs. The models with HCC show less performance fluctuation under image rotation. Without HCC, the accuracies on both modalities drop when inputs are rotated by about 90° and 270°, where the front or back views are shifted to the discontinuous image edges.

Table 2.4: Comparison of unimodal and multimodal models [%]. Coast, Forest, InP, OutP, Res, and Urban denote scene classes of coast area, forest area, indoor parking, outdoor parking, residential area, and urban area, respectively.

    Input        Approach           Coast  Forest  InP    OutP   Res    Urban  Total
    Depth        VGG                92.73  97.26   99.94  94.23  98.35  99.20  97.18
    Reflectance  VGG + RWMP + HCC   91.83  98.20   91.45  95.16  97.99  98.27  95.92
    Multimodal   Softmax Average    94.27  98.38   99.58  94.91  99.12  99.56  97.87
                 Adaptive Fusion    94.59  98.20   99.77  94.85  99.19  99.37  97.62
                 Early Fusion       93.37  98.09   99.44  94.71  98.21  97.15  97.02
                 Late Fusion        93.49  97.57   99.25  94.23  98.39  98.88  97.19
