
Figure 3.3: Baseline model UpDown [6]. R-CNN detects salient regions, and each region is encoded into a feature vector. Attn is the attention module, which produces the vector $\hat{v}_t$ from the set of region features. Decoder produces the word probabilities $p_t$ one by one; $h_{t-1}$ is the previous state of Decoder.

3.4 Approach

In this dissertation, the task is defined as generating a single caption by jointly processing the user's first-person image, the second-person image from the service robot, and the third-person image from the embedded camera. To configure such a multi-image captioning model and to validate how much each perspective contributes to the resulting caption, this work extends the state-of-the-art image captioning model UpDown by Anderson et al. [6]. The architecture is illustrated in Fig. 3.3, and the details of the modules are shown in Fig. 3.5. UpDown first enumerates salient regions within a given image, encodes the spatial feature of each region into a fixed-size vector (Section 3.4.1), and feeds these vectors into the captioning process with an attention mechanism (Section 3.4.3). In our multi-perspective setting, the region features are extracted from each perspective and fed into the captioning process as attention candidates for decoding words. This study especially focuses on how the attention candidates from the multi-perspective images can be reorganized. To this end, this section introduces a bottom-up fusion step that clusters the salient region features to suppress repeated appearances of identical instances across multiple viewpoints (Section 3.4.2). This section first explains how the model produces a sequence of words from a set of three images, and then formulates how the model is trained with a set of reference captions.


3.4.1 Image Encoding

Suppose that we are given a set of synchronized multi-perspective images: $I_1$ from the human's egocentric viewpoint (first-person), $I_2$ from the bystander robot's viewpoint (second-person), and $I_3$ from the bird's-eye viewpoint of a fixed camera (third-person). For each perspective, a set of regions of interest (ROIs) is detected and encoded into feature vectors according to their visual attributes.

All of these features then become attention candidates for the subsequent captioning modules. We refer to the feature set as $\tilde{V} = \{\tilde{v}_1, \ldots, \tilde{v}_N\}$, where $N$ is the total number of ROIs detected from the three images $\{I_1, I_2, I_3\}$.

This work uses the Faster R-CNN [93] detector, as adopted in UpDown [6], to detect objects or salient regions as bounding boxes assigned to 1,600 object classes and 400 attribute classes of Visual Genome [94]. The detected raw regions are processed with non-maximum suppression to filter out overlapping boxes, and for each selected region, the mean-pooled feature is extracted from the penultimate layer of the object/attribute classifiers. Each feature represents high-level semantic information about a partial scene of the image.
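To make the candidate construction concrete, the following Python sketch shows how the per-perspective ROI features could be pooled into the single set $\tilde{V}$. It is a minimal illustration, and `extract_roi_features` is a hypothetical stand-in for the pre-trained Faster R-CNN encoder described above, not part of the original implementation.

```python
import numpy as np

def build_attention_candidates(images, extract_roi_features):
    """Pool ROI features from the three synchronized perspectives into
    the candidate set V~ = {v_1, ..., v_N}.

    `extract_roi_features` is a hypothetical stand-in for the pre-trained
    Faster R-CNN encoder: given one image, it returns an (n_i, D) array of
    mean-pooled features, one row per region surviving non-maximum
    suppression.
    """
    per_view = [extract_roi_features(img) for img in images]  # [I1, I2, I3]
    v_tilde = np.concatenate(per_view, axis=0)                # (N, D), N = sum of n_i
    return v_tilde
```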

3.4.2 Salient Region Clustering

This work found that even the baseline model shown in Fig. 3.4a can generate reasonable captions by simply bundling the multi-perspective feature sets and passing them to the decoding step (hereinafter called Ensemble), since the attention module can implicitly fuse correlated ROIs that respond to the same top-down signal from the decoder. However, this implicit fusion by the top-down signal may place biased weights on repeatedly occurring objects, or may fail to jointly attend the ROI features of an identical object because of subtle differences in the feature subspace.

Therefore, this work proposes a bottom-up fusion approach for the multi-perspective feature sets. As described in Section 3.4.1, the image encoder is pre-trained to classify diverse object classes and attributes, so the encoded ROI features represent well-abstracted semantics. Regarding this property, many studies have demonstrated that CNN-coded features can be reused for image retrieval tasks because semantically similar images are embedded close together in the learned feature space [95]. Similarly, this work assumes that different views of an identical object are embedded close together in the high-dimensional feature space. Therefore, the bottom-up approach first clusters the set of multi-perspective features, as shown in Fig. 3.4b.

"#!

1st 2nd 3rd

Candidates

!"#

!"#

Decoder

$!

Attn (a)Ensemble

"#!

centroids

1st 2nd 3rd

Attn Candidates

!"#

!"#

Decoder

$!

(b)KMeans

Figure 3.4: Proposed methods. (a) Ensemble is fundamentally equivalent to the baselineUpDown[6]. All salient regions intact are subject to the attention module Attn. (b) InKMeans, a clustering algorithm is applied to salient region features and thekcentroid features are subject to attend.

The features are grouped by a standard clustering algorithm such as k-means [96]. When the algorithm converges, the centroid vectors are selected as the attention candidates. This work assumes that each centroid vector averages a similar set of semantic information across multiple viewpoints. This step consistently employs the L2 distance as the inter-sample dissimilarity, so that the mean-pooled centroids remain compatible with the top-down soft attention. We refer to the reorganized set of features as $V = \{v_1, \ldots, v_M\}$. This clustering approach can easily be extended along the temporal axis so as to smooth out captions over sequential images; this dissertation reports the performance improvement obtained by this approach in Section 3.5.4.4.
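As a minimal sketch of the KMeans variant (Fig. 3.4b), the following Python snippet clusters the pooled ROI features with scikit-learn's Euclidean k-means and returns the centroids as the reorganized candidate set $V$. The number of clusters and the random seed are illustrative assumptions, not values from the original experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_region_features(v_tilde, num_clusters=36, seed=0):
    """Bottom-up fusion by k-means over the pooled multi-perspective ROI features.

    v_tilde: (N, D) array of region features from all three perspectives.
    Returns an (M, D) array of centroids V = {v_1, ..., v_M} used as the
    attention candidates in the KMeans variant (Fig. 3.4b).
    """
    # scikit-learn's KMeans minimizes squared Euclidean (L2) distance, so each
    # centroid is the mean of the features assigned to it, which keeps the
    # candidates compatible with the convex combination taken by soft attention.
    m = min(num_clusters, len(v_tilde))  # cannot request more clusters than samples
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(v_tilde)
    return km.cluster_centers_
```

In the Ensemble variant, `v_tilde` itself would be passed to the attention module; here only the $M$ centroids are handed over as candidates.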


Figure 3.5: Details of the attention module and the decoder module of the UpDown model [6].

3.4.3 Caption Generation

A caption is composed of words, where each word is a kind of symbol. In general, captioning models are trained to generate meaningful sentences by maximizing the likelihood $p(S \mid I)$ of the reference sentence $S = s_1, \ldots, s_T$ conditioned on a given image $I$. The likelihood $p(S \mid I)$ is a joint probability of the words composing the caption $S$:

\[
p(S \mid I) = \prod_{t=1}^{T} p(s_t \mid I, s_1, \ldots, s_{t-1}), \tag{3.1}
\]
where $p(s_t \mid I, s_1, \ldots, s_{t-1})$ is the probability of the word $s_t$ conditioned on the image $I$ and the previous words $s_1, \ldots, s_{t-1}$. Therefore, the parameters of the model $p$ can be trained by minimizing the following negative log-likelihood of the correct caption:

\[
L = -\sum_{t=1}^{T} \log p(s_t \mid I, s_1, \ldots, s_{t-1}). \tag{3.2}
\]
To learn $p(S \mid I)$, this work uses the decoder architecture of UpDown [6] depicted in Fig. 3.5. Let $s_t \in \mathbb{R}^K$ be the word produced by the model at timestep $t$, given as a one-hot representation in which each of the $K$ defined words is assigned a unique dimension. As mentioned above, the word $s_t$ is determined by the image and the previous words up to timestep $t-1$. The decoder models this recurrent procedure with a stack of two long short-term memories (LSTMs) [92], named Attention LSTM and Language LSTM [6]. An LSTM is a recurrent neural network whose hidden state is updated by its input step by step. For example, Attention LSTM updates the hidden state $h_{A,t}$ given the previous hidden state $h_{A,t-1}$ and an input vector $x_{A,t}$ at timestep $t$. We denote this single-step operation with the following notation:

\[
h_{A,t} = \mathrm{LSTM}_A(x_{A,t}, h_{A,t-1}). \tag{3.3}
\]
Similarly, Language LSTM updates the hidden state $h_{L,t}$ given the previous hidden state $h_{L,t-1}$ and an input vector $x_{L,t}$ at timestep $t$:

\[
h_{L,t} = \mathrm{LSTM}_L(x_{L,t}, h_{L,t-1}). \tag{3.4}
\]
For the input $x_{A,t}$ of Attention LSTM, the word $s_{t-1}$ from the previous step is first embedded with a parameter matrix $W_e$. It is then concatenated with the global image feature $\bar{v}$, which is the mean vector of $V$, and the previous hidden state $h_{L,t-1}$ of Language LSTM:

\[
x_{A,t} = [W_e s_{t-1}, \bar{v}, h_{L,t-1}]. \tag{3.5}
\]
The updated hidden state $h_{A,t}$ is then used by another part of the decoder, the attention module (Attn in Fig. 3.5). The role of the attention module is to compute the normalized weights $\alpha_t = \{\alpha_{1,t}, \ldots, \alpha_{M,t}\}$ for timestep $t$, which softly select specific elements of the candidates $V = \{v_1, \ldots, v_M\}$. In other words, the attention module learns where to focus in the image at each timestep. The weight $\alpha_{i,t}$ for $v_i$ is computed from the hidden state $h_{A,t}$ as follows:

\[
e_{i,t} = w_a^{\top} \tanh(W_v v_i + W_h h_{A,t}), \tag{3.6}
\]
\[
\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{k=1}^{M} \exp(e_{k,t})}, \tag{3.7}
\]
where $W_v$ and $W_h$ are parameter matrices, and $w_a$ is a parameter vector that produces the scalar $e_{i,t}$. The $\tanh$ function applies the hyperbolic tangent element-wise to its vector input. Using the weights $\alpha_t$, the candidates $V = \{v_1, \ldots, v_M\}$ are fused into a single attended image feature $\hat{v}_t$ by convex combination:

\[
\hat{v}_t = \sum_{i=1}^{M} \alpha_{i,t} v_i. \tag{3.8}
\]
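To make the flow of Eqs. (3.3)-(3.8) concrete, the following PyTorch sketch implements a single decoding timestep. It is an illustrative reimplementation rather than the original code; the layer sizes, the class name `UpDownStep`, the final linear-plus-softmax output layer, and the Language LSTM input $[\hat{v}_t, h_{A,t}]$ (following UpDown [6]) are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpDownStep(nn.Module):
    """One decoding timestep of the two-LSTM decoder, following Eqs. (3.3)-(3.8)."""

    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        # Embedding lookup is equivalent to multiplying W_e with the one-hot s_{t-1}.
        self.embed = nn.Embedding(vocab_size, embed_dim)                         # W_e
        self.attn_lstm = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.W_v = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)
        self.out = nn.Linear(hidden_dim, vocab_size)  # word logits -> p_t via softmax

    def forward(self, prev_word, V, state):
        """prev_word: (B,) indices of s_{t-1};  V: (B, M, D) attention candidates."""
        (h_a, c_a), (h_l, c_l) = state
        v_bar = V.mean(dim=1)                                        # global feature
        x_a = torch.cat([self.embed(prev_word), v_bar, h_l], dim=1)  # Eq. (3.5)
        h_a, c_a = self.attn_lstm(x_a, (h_a, c_a))                   # Eq. (3.3)

        e = self.w_a(torch.tanh(self.W_v(V) + self.W_h(h_a).unsqueeze(1)))  # Eq. (3.6)
        alpha = F.softmax(e.squeeze(-1), dim=1)                      # Eq. (3.7)
        v_hat = (alpha.unsqueeze(-1) * V).sum(dim=1)                 # Eq. (3.8)

        x_l = torch.cat([v_hat, h_a], dim=1)                         # Language LSTM input
        h_l, c_l = self.lang_lstm(x_l, (h_l, c_l))                   # Eq. (3.4)
        log_p = F.log_softmax(self.out(h_l), dim=1)                  # log p(s_t | I, s_<t)
        return log_p, ((h_a, c_a), (h_l, c_l))
```

Summing the negative log-probabilities of the reference words produced by such a step over all timesteps yields the training loss of Eq. (3.2).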

