
Figure 3.3: Baseline model UpDown [6]. R-CNN detects salient regions, and each region is encoded into a feature vector. Attn is the attention module, which produces the vector $\hat{v}_t$ from the set of region features. Decoder produces the word probabilities $p_t$ one by one; $h_{t-1}$ is the previous state of Decoder.

3.4 Approach

In this dissertation, the task is defined as generating a single caption by jointly processing the user's first-person image, the second-person image from the service robot, and the third-person image from the embedded camera. To configure such a multi-image captioning model and to validate how much each perspective contributes to the resulting caption, this work extends the state-of-the-art image captioning model UpDown by Anderson et al. [6]. The architecture is illustrated in Fig. 3.3, and the details of the modules are shown in Fig. 3.5. UpDown first enumerates salient regions within a given image, encodes the spatial feature of each region into a fixed-size vector (Section 3.4.1), and feeds these vectors into the captioning process with an attention mechanism (Section 3.4.3). In our multi-perspective setting, the region features are extracted from each perspective and fed into the captioning process as attention candidates for decoding words. This study especially focuses on how the attention candidates from the multi-perspective images can be reorganized. To this end, this section introduces a bottom-up fusion step that clusters the salient region features to suppress repeated appearances of identical instances across multiple viewpoints (Section 3.4.2). This section first explains how the model produces a sequence of words from a set of three images, and then formulates how the model is trained with a set of reference captions.


3.4.1 Image Encoding

Suppose that we are given a set of synchronized multi-perspective images: $I_1$ from the human's egocentric viewpoint (first-person), $I_2$ from the bystander robot's viewpoint (second-person), and $I_3$ from the bird's-eye viewpoint of a fixed camera (third-person). For each perspective, a set of regions of interest (ROIs) is detected and encoded into feature vectors according to their visual attributes.

All of these features then become attention candidates for the subsequent captioning modules. We refer to the feature set as $\tilde{V} = \{\tilde{v}_1, \ldots, \tilde{v}_N\}$, where $N$ is the total number of ROIs detected from the three images $\{I_1, I_2, I_3\}$.

This work uses the Faster R-CNN [93] detector, as adopted in UpDown [6], to detect objects or salient regions as bounding boxes assigned to 1,600 object classes and 400 attribute classes of Visual Genome [94]. The detected raw regions are processed with non-maximum suppression to filter out overlapping boxes, and for each selected region, the mean-pooled feature is extracted from the penultimate layer of the object/attribute classifiers. Each feature represents high-level semantic information about a partial scene of the image.
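To make the candidate construction concrete, the following Python sketch shows how the per-perspective ROI features could be pooled into the single set $\tilde{V}$. It is a minimal illustration, and `extract_roi_features` is a hypothetical stand-in for the pre-trained Faster R-CNN encoder described above, not part of the original implementation.

```python
import numpy as np

def build_attention_candidates(images, extract_roi_features):
    """Pool ROI features from the three synchronized perspectives into
    the candidate set V~ = {v_1, ..., v_N}.

    `extract_roi_features` is a hypothetical stand-in for the pre-trained
    Faster R-CNN encoder: given one image, it returns an (n_i, D) array of
    mean-pooled features, one row per region surviving non-maximum
    suppression.
    """
    per_view = [extract_roi_features(img) for img in images]  # [I1, I2, I3]
    v_tilde = np.concatenate(per_view, axis=0)                # (N, D), N = sum of n_i
    return v_tilde
```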

3.4.2 Salient Region Clustering

This work found that even the baseline model shown in Fig. 3.4a can generate reasonable captions by simply bundling the multi-perspective feature sets and passing them to the decoding step (hereinafter called Ensemble), since the attention module can implicitly fuse correlated ROIs that respond to the same top-down signal from the decoder. However, this implicit fusion by the top-down signal may place biased weights on repeatedly occurring objects, or may fail to jointly attend the ROI features of an identical object because of subtle differences in the feature subspace.

Therefore, this work proposes a bottom-up fusion approach for the multi-perspective feature sets. As described in Section 3.4.1, the image encoder is pre-trained to classify diverse object classes and attributes, so the encoded ROI features represent well-abstracted semantics. Regarding this property, many studies have demonstrated that CNN-coded features can be reused for image retrieval tasks because semantically similar images are embedded close together in the learned feature space [95]. Similarly, this work assumes that different views of an identical object are embedded close together in the high-dimensional feature space. Therefore, the bottom-up approach first clusters the set of multi-perspective features, as shown in Fig. 3.4b.

"#!

1st 2nd 3rd

Candidates

!"#

!"#

Decoder

$!

Attn (a)Ensemble

"#!

centroids

1st 2nd 3rd

Attn Candidates

!"#

!"#

Decoder

$!

(b)KMeans

Figure 3.4: Proposed methods. (a) Ensemble is fundamentally equivalent to the baselineUpDown[6]. All salient regions intact are subject to the attention module Attn. (b) InKMeans, a clustering algorithm is applied to salient region features and thekcentroid features are subject to attend.

The features are grouped by a standard clustering algorithm such as k-means [96]. When the algorithm converges, the centroid vectors are selected as the attention candidates. This work assumes that each centroid vector averages a similar set of semantic information across multiple viewpoints. This step consistently employs the L2 distance as the inter-sample dissimilarity, so that the mean-pooled centroids remain compatible with the top-down soft attention. We refer to the reorganized set of features as $V = \{v_1, \ldots, v_M\}$. This clustering approach can easily be extended along the temporal axis so as to smooth out captions over sequential images; this dissertation reports the performance improvement obtained by this approach in Section 3.5.4.4.
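As a minimal sketch of the KMeans variant (Fig. 3.4b), the following Python snippet clusters the pooled ROI features with scikit-learn's Euclidean k-means and returns the centroids as the reorganized candidate set $V$. The number of clusters and the random seed are illustrative assumptions, not values from the original experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_region_features(v_tilde, num_clusters=36, seed=0):
    """Bottom-up fusion by k-means over the pooled multi-perspective ROI features.

    v_tilde: (N, D) array of region features from all three perspectives.
    Returns an (M, D) array of centroids V = {v_1, ..., v_M} used as the
    attention candidates in the KMeans variant (Fig. 3.4b).
    """
    # scikit-learn's KMeans minimizes squared Euclidean (L2) distance, so each
    # centroid is the mean of the features assigned to it, which keeps the
    # candidates compatible with the convex combination taken by soft attention.
    m = min(num_clusters, len(v_tilde))  # cannot request more clusters than samples
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(v_tilde)
    return km.cluster_centers_
```

In the Ensemble variant, `v_tilde` itself would be passed to the attention module; here only the $M$ centroids are handed over as candidates.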


Figure 3.5: Details of the attention module and the decoder module of the UpDown model [6].

3.4.3 Caption Generation

A caption is composed of words, where each word is a kind of symbol. In general, captioning models are trained to generate meaningful sentences by maximizing the likelihood $p(S \mid I)$ of the reference sentence $S = s_1, \ldots, s_T$ conditioned on a given image $I$. The likelihood $p(S \mid I)$ is a joint probability of the words composing the caption $S$:

\[
p(S \mid I) = \prod_{t=1}^{T} p(s_t \mid I, s_1, \ldots, s_{t-1}), \tag{3.1}
\]
where $p(s_t \mid I, s_1, \ldots, s_{t-1})$ is the probability of the word $s_t$ conditioned on the image $I$ and the previous words $s_1, \ldots, s_{t-1}$. Therefore, the parameters of the model $p$ can be trained by minimizing the following negative log-likelihood of the correct caption:

\[
L = -\sum_{t=1}^{T} \log p(s_t \mid I, s_1, \ldots, s_{t-1}). \tag{3.2}
\]
To learn $p(S \mid I)$, this work uses the decoder architecture of UpDown [6] depicted in Fig. 3.5. Let $s_t \in \mathbb{R}^K$ be the word produced by the model at timestep $t$, given as a one-hot representation in which each of the $K$ defined words is assigned a unique dimension. As mentioned above, the word $s_t$ is determined by the image and the previous words up to timestep $t-1$. The decoder models this recurrent procedure with a stack of two long short-term memories (LSTMs) [92], named Attention LSTM and Language LSTM [6]. An LSTM is a recurrent neural network whose hidden state is updated by its input step by step. For example, Attention LSTM updates the hidden state $h_{A,t}$ given the previous hidden state $h_{A,t-1}$ and an input vector $x_{A,t}$ at timestep $t$. We denote this single-step operation with the following notation:

\[
h_{A,t} = \mathrm{LSTM}_A(x_{A,t}, h_{A,t-1}). \tag{3.3}
\]
Similarly, Language LSTM updates the hidden state $h_{L,t}$ given the previous hidden state $h_{L,t-1}$ and an input vector $x_{L,t}$ at timestep $t$:

\[
h_{L,t} = \mathrm{LSTM}_L(x_{L,t}, h_{L,t-1}). \tag{3.4}
\]
For the input $x_{A,t}$ of Attention LSTM, the word $s_{t-1}$ from the previous step is first embedded with a parameter matrix $W_e$. It is then concatenated with the global image feature $\bar{v}$, which is the mean vector of $V$, and the previous hidden state $h_{L,t-1}$ of Language LSTM:

\[
x_{A,t} = [W_e s_{t-1}, \bar{v}, h_{L,t-1}]. \tag{3.5}
\]
The updated hidden state $h_{A,t}$ is then used by another part of the decoder, the attention module (Attn in Fig. 3.5). The role of the attention module is to compute the normalized weights $\alpha_t = \{\alpha_{1,t}, \ldots, \alpha_{M,t}\}$ for timestep $t$, which softly select specific elements of the candidates $V = \{v_1, \ldots, v_M\}$. In other words, the attention module learns where to focus in the image at each timestep. The weight $\alpha_{i,t}$ for $v_i$ is computed from the hidden state $h_{A,t}$ as follows:

\[
e_{i,t} = w_a^{\top} \tanh(W_v v_i + W_h h_{A,t}), \tag{3.6}
\]
\[
\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{k=1}^{M} \exp(e_{k,t})}, \tag{3.7}
\]
where $W_v$ and $W_h$ are parameter matrices, and $w_a$ is a parameter vector that produces the scalar $e_{i,t}$. The $\tanh$ function applies the hyperbolic tangent element-wise to its vector input. Using the weights $\alpha_t$, the candidates $V = \{v_1, \ldots, v_M\}$ are fused into a single attended image feature $\hat{v}_t$ by convex combination:

\[
\hat{v}_t = \sum_{i=1}^{M} \alpha_{i,t} v_i. \tag{3.8}
\]
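To make the flow of Eqs. (3.3)-(3.8) concrete, the following PyTorch sketch implements a single decoding timestep. It is an illustrative reimplementation rather than the original code; the layer sizes, the class name `UpDownStep`, the final linear-plus-softmax output layer, and the Language LSTM input $[\hat{v}_t, h_{A,t}]$ (following UpDown [6]) are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpDownStep(nn.Module):
    """One decoding timestep of the two-LSTM decoder, following Eqs. (3.3)-(3.8)."""

    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        # Embedding lookup is equivalent to multiplying W_e with the one-hot s_{t-1}.
        self.embed = nn.Embedding(vocab_size, embed_dim)                         # W_e
        self.attn_lstm = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.W_v = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim, vocab_size)  # word logits -> p_t via softmax

    def forward(self, prev_word, V, state):
        """prev_word: (B,) indices of s_{t-1};  V: (B, M, D) attention candidates."""
        (h_a, c_a), (h_l, c_l) = state
        v_bar = V.mean(dim=1)                                        # global feature
        x_a = torch.cat([self.embed(prev_word), v_bar, h_l], dim=1)  # Eq. (3.5)
        h_a, c_a = self.attn_lstm(x_a, (h_a, c_a))                   # Eq. (3.3)

        e = self.w_a(torch.tanh(self.W_v(V) + self.W_h(h_a).unsqueeze(1)))  # Eq. (3.6)
        alpha = F.softmax(e.squeeze(-1), dim=1)                      # Eq. (3.7)
        v_hat = (alpha.unsqueeze(-1) * V).sum(dim=1)                 # Eq. (3.8)

        x_l = torch.cat([v_hat, h_a], dim=1)                         # Language LSTM input
        h_l, c_l = self.lang_lstm(x_l, (h_l, c_l))                   # Eq. (3.4)
        log_p = F.log_softmax(self.out(h_l), dim=1)                  # log p(s_t | I, s_<t)
        return log_p, ((h_a, c_a), (h_l, c_l))
```

Summing the negative log-probabilities of the reference words produced by such a step over all timesteps yields the training loss of Eq. (3.2).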

