Deep Learning Representation - A Study and Extension of Representation Learning Methods for Pattern Recognition

[Figure 5-3 diagram: input elements 𝑥₁ to 𝑥₆ and shared weights 𝑤₁, 𝑤₂, 𝑤₃]

Figure 5-3: Illustration of a 1D CNN with a convolutional layer that uses a filter of size 3 at stride 1. The weights of each node are shared and applied to each window of the input.

forward and backward propagation passes, whereas a sparse convolutional node only requires calculations for each sparse connection.

Parameter Sharing

Parameter sharing, or shared weights, is the concept that each node has only one set of weights, which it applies at every position of its sliding window. By sharing the set of weights, the window acts as a convolution and functions similarly to filters in computer vision [120]. In Fig. 5-3, each color represents a shared weight applied at stride 1. Applying the convolutional filter across the visual field of the input creates a feature map. Each pixel 𝑧_𝑛,𝑖 of the feature map is calculated by:

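\begin{align}
z_{n,i} &= \varphi\left(\mathbf{w}_n^{\top}\mathbf{x}_{i,\dots,i+F} + b_n\right) \tag{5.1}\\
        &= \varphi\left(\sum_{f=0}^{F} w_{n,f}\,x_{i+f} + b_n\right), \tag{5.2}
\end{align}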

for each element 𝑖 of the feature map in node 𝑛, where 𝑓 is the index of the filter and 𝐹 + 1 is the size of the window. The shared weights are denoted as w_n, the receptive field of the filter as x_{𝑖,...,𝑖+𝐹}, and the bias as 𝑏_𝑛. In other words, 𝑧_𝑛,𝑖 is the inner product of the shared weights w_n and each window of the previous layer (𝑥_𝑖, ..., 𝑥_{𝑖+𝐹}) with some bias 𝑏_𝑛.
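As a minimal sketch of Eq. (5.2), the feature map of a single node can be computed by sliding one shared weight vector over the input. The function name `conv1d_feature_map`, the choice of ReLU for 𝜑, the absence of padding, and the example values are assumptions made for illustration only and are not taken from the text.

```python
import numpy as np

def conv1d_feature_map(x, w, b, phi=np.tanh):
    """Compute z_i = phi(sum_{f=0}^{F} w_f * x_{i+f} + b) for every valid window.

    x   : 1D input signal
    w   : shared weights of one node, length F + 1
    b   : scalar bias of the node
    phi : activation function (assumed; the default tanh is only a placeholder)
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    window = w.size                  # F + 1
    n_out = x.size - window + 1      # stride 1, no padding (assumption)
    z = np.empty(n_out)
    for i in range(n_out):
        # The same weights w are reused for every window: parameter sharing.
        z[i] = phi(np.dot(w, x[i:i + window]) + b)
    return z

# Six inputs and a size-3 filter, as in Fig. 5-3 (the values are made up).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = np.array([0.5, -1.0, 0.5])
z = conv1d_feature_map(x, w, b=0.1, phi=lambda a: np.maximum(a, 0.0))
print(z)
```

With six inputs and a window of size 3 at stride 1, the feature map has four elements, and only three weights and one bias are learned regardless of the input length.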

Pooling

Pooling is a method of downsampling feature maps. A pooling layer is commonly added between successive convolutional layers in a CNN. The primary purpose of the pooling layer is to reduce redundancy and to allow a larger receptive field in the higher layers.

Given a convolutional filter with a filter size larger than the stride, every element 𝑧_𝑛,𝑖 would contain mostly overlapping information from the previous layer. To reduce computation time in the higher layers, z_𝑛 can be downsampled by removing some of the overlapping information. Instead of increasing the stride, pooling can be used as a directed nonlinear downsampling method. For example, max pooling, the most commonly used pooling method, reduces the feature map to only the maximal value within each pooling window. Removing non-maximal values reduces the computation time of the higher layers while preserving some of the intent of the convolutional layer.
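Max pooling as described above can be sketched in a few lines. This example assumes non-overlapping square windows (stride equal to the window size) on a 2D feature map; the function name `max_pool_2d` and the example values are illustrative and not part of the original text.

```python
import numpy as np

def max_pool_2d(feature_map, pool=2):
    """Downsample a 2D feature map by keeping the maximum of each pool x pool window."""
    fm = np.asarray(feature_map, dtype=float)
    h, w = fm.shape
    h_out, w_out = h // pool, w // pool
    # Drop trailing rows and columns that do not fill a complete window.
    fm = fm[:h_out * pool, :w_out * pool]
    # Group the pixels into (pool x pool) blocks, then take the max of each block.
    blocks = fm.reshape(h_out, pool, w_out, pool)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
])
pooled = max_pool_2d(feature_map, pool=2)
print(pooled)  # [[4. 2.]
               #  [2. 7.]]
```

The 4 × 4 feature map is reduced to 2 × 2, so a following layer processes a quarter of the values while the strongest response of each window is retained.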

The second purpose of using a pooling layer is to provide slight translation invariance by increasing the receptive field of the nodes of the higher layers. In the previous example, the pixels of the feature map in the next higher convolutional layer have a receptive field the size of the convolutional filter. But since each input of the window is the maximum value of a pooling window, the receptive field in relation to the feature map of the higher convolutional layer is increased by a factor of the pooling window size. Figure 5-4 demonstrates this, where each box is the effective range of a pixel on each layer’s feature map. Furthermore, Zeiler and Fergus [190] observed that, due to pooling, the filters are effectively feature detectors and the filters of each successively higher layer learn higher-level features. For example, for character recognition, the filters of the first layer may act like edge detectors, while the second layer detects fragments of character features and the third layer learns the fully realized character features.

[Figure 5-4 panel labels: Feature Map; Max Pooling 2 × 2]

Figure 5-4: Demonstration of max pooling with a 2 × 2 pooling window.
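The growth of the receptive field caused by pooling can be checked with a short calculation. The sketch below uses the standard receptive-field recurrence for stacked sliding-window layers along one axis, which is an assumption brought in for illustration and is not stated in the text. The function name `receptive_field`, the size-3 filter, and the non-overlapping size-2 pooling are likewise illustrative choices.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of one output element.

    layers: list of (window_size, stride) pairs, ordered from the lowest
            layer to the highest. The result is measured in elements of
            the input to the first listed layer.
    """
    size, jump = 1, 1
    for window, stride in layers:
        size += (window - 1) * jump  # each extra window element adds `jump` inputs
        jump *= stride               # spacing between neighboring outputs
    return size

# Size-3 convolution applied directly to a feature map.
without_pool = receptive_field([(3, 1)])
# Size-2 max pooling (stride 2) followed by the same size-3 convolution.
with_pool = receptive_field([(2, 2), (3, 1)])
print(without_pool, with_pool)  # 3 6
```

Under these assumptions, the higher convolutional layer covers six elements of the lower feature map instead of three, which is an increase by a factor equal to the pooling window size.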

5.2.2 Object Detection with CNNs

Object detection and segmentation with convolutional models is an active field. A classic approach to object detection with CNNs is to use a sliding window as the input for the CNN [191]. This method, however, is a brute-force approach that is limited to constrained classes and does not address scale variance. More recent methods use flexible bounding boxes or a pixelwise loss with detailed object location annotations.
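The sliding-window approach can be sketched as follows. The classifier is replaced by a hypothetical stand-in (`toy_score`, the mean brightness of a patch) so that the example is self-contained; in practice a trained CNN would produce the score. The function names, window size, stride, and threshold are all assumptions for illustration and do not reproduce the method of [191].

```python
import numpy as np

def sliding_windows(image, window=(4, 4), stride=2):
    """Yield (top, left, patch) for every fixed-size window position."""
    h, w = image.shape[:2]
    win_h, win_w = window
    for top in range(0, h - win_h + 1, stride):
        for left in range(0, w - win_w + 1, stride):
            yield top, left, image[top:top + win_h, left:left + win_w]

def detect(image, score_fn, window=(4, 4), stride=2, threshold=0.5):
    """Return (top, left, score) for each window whose score exceeds the threshold."""
    hits = []
    for top, left, patch in sliding_windows(image, window, stride):
        score = float(score_fn(patch))
        if score > threshold:
            hits.append((top, left, score))
    return hits

def toy_score(patch):
    # Hypothetical stand-in for a trained CNN: mean brightness of the patch.
    return patch.mean()

image = np.zeros((8, 8))
image[4:8, 2:6] = 1.0  # a bright 4 x 4 "object"
detections = detect(image, toy_score)
print(detections)  # [(4, 2, 1.0)]
```

Every window position is scored separately, which is why the approach is brute force, and because the window size is fixed, an object much larger or smaller than 4 × 4 would not be matched, which illustrates the scale-variance limitation.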

Region-Based Methods

One solution to address scale variance is to implement a model using multi-scale region proposals. In this method, differently scaled bounding boxes within the input images are used to isolate regions of interest containing target objects [23, 24]. Girshick et al. [25] proposed Regions with CNN features (R-CNN) to generate category-independent region proposals for CNNs. There have also been improvements on the R-CNN model, such as training the bounding boxes with SVMs [25], using a multi-task loss with backpropagation in Fast R-CNN [192], and using Region Proposal Networks in Faster R-CNN [193]. These methods use warped images from the region proposals for object classification.

There are also end-to-end bounding box proposal and object recognition networks. Single Shot MultiBox Detectors (SSD) [194] combine object detection with trained bounding box proposals to annotate images with objects of different classes. You Only Look Once (YOLO) [195] networks and the recent improvement YOLO9000 [196] similarly use bounding box proposals and combine them with class probability maps to annotate natural images.

Pixel-to-Pixel Methods

Another active approach to object detection is the use of Fully Convolutional Networks (FCN) [26] for end-to-end pixelwise prediction and segmentation. FCNs use convolutional layers and pooling down to single-pixel convolutions for pixelwise classification. Combined with the idea of skip connections and unpooling, FCNs are able to determine the location of objects within an image. To achieve this pixelwise classification, FCNs require full-sized segmented ground truth. In addition, FCNs have been combined with R-CNNs for object detection [197].

Weak Supervision with CNNs

As shown in Fig. 5-5, region-based methods and pixel-to-pixel methods require detailed annotations of the objects as ground truth. While there are many established datasets with the detail required for these methods, there are many tasks without such ground truth. Thus, the methods are not applicable to every dataset, and it can be costly to acquire the supervision necessary for those tasks. To overcome this, it has been proposed to use weakly supervised labels for object detection and localization [198, 199, 200], where the weakly supervised labels consist of only image-wide classes with no details on the location of the objects. CNNs have also been used with weakly supervised labels [201, 202, 203, 204]. In one instance, Bazzani et al. [205] localize objects by monitoring the recognition accuracy of a CNN while masking regions.

[Figure 5-5 panels: Bounding Box Labels; Pixelwise Labels; Weak Labels. Example labels: “Person”, “Sky”, “Ground”, “Background”]

Figure 5-5: Examples of supervision for detection.
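The general idea of localizing an object by masking regions and monitoring the recognition output can be illustrated with a simple sketch. This is not the algorithm of Bazzani et al. [205]; it is a generic masking loop under stated assumptions: the CNN is replaced by a hypothetical scoring function (`toy_class_score`), the masked regions are non-overlapping squares, and the region whose removal causes the largest score drop is taken as the object location.

```python
import numpy as np

def masking_score_drop(image, score_fn, mask_size=2, fill=0.0):
    """Mask each non-overlapping region and record how much the class score drops."""
    base = float(score_fn(image))
    h, w = image.shape
    drops = np.zeros((h // mask_size, w // mask_size))
    for r in range(drops.shape[0]):
        for c in range(drops.shape[1]):
            masked = image.copy()
            rows = slice(r * mask_size, (r + 1) * mask_size)
            cols = slice(c * mask_size, (c + 1) * mask_size)
            masked[rows, cols] = fill  # hide one region from the classifier
            drops[r, c] = base - float(score_fn(masked))
    return drops

def toy_class_score(image):
    # Hypothetical stand-in for a CNN class score: total brightness of the image.
    return image.sum()

image = np.zeros((6, 6))
image[2:4, 4:6] = 1.0  # the "object" occupies one 2 x 2 region
drops = masking_score_drop(image, toy_class_score)
row, col = np.unravel_index(np.argmax(drops), drops.shape)
print(int(row), int(col))  # 1 2
```

Only an image-wide score is needed to produce the location estimate, which is what makes this kind of approach compatible with weak labels.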

Weakly supervised data is also commonly used for word spotting in text and documents. Word spotting is inherently a weakly supervised task in that a query-by-string (QbS) approach is taken to search image documents for text strings. Some of these methods include bag-of-features Hidden Markov Models [206] and the use of Pyramidal Histogram of Characters (PHOC) as labels, enabling QbS [207, 208].
