
3.5 Experiments



Figure 3.8: CIDEr-D and SPICE scores with different methods of clustering (k-means, x-means, k-medoids, and Ensemble). The number of clusters k is swept among {4, 8, 16, 32, 64} for each method. Both metrics show the best scores at k = 32.

These results suggest that the grouped centroids still preserve the visual features, which enables us to encourage implicit joint attention across perspectives and to reflect it in the captions.
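To make the grouping concrete, the sketch below pools ROI features from the three perspectives and extracts k = 32 centroids with k-means. It is only an illustrative sketch: the function name, the 2048-dimensional feature size, and the use of scikit-learn are assumptions for illustration, not the exact pipeline used in the experiments.

```python
# Minimal sketch (not the exact pipeline): pool ROI features from the three
# perspectives and use the k = 32 cluster centroids as attention candidates.
import numpy as np
from sklearn.cluster import KMeans

def cluster_roi_features(roi_features_per_view, k=32):
    """roi_features_per_view: list of (num_rois, feat_dim) arrays, one per
    perspective (first-, second-, and third-person)."""
    pooled = np.concatenate(roi_features_per_view, axis=0)   # (total ROIs, D)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)
    return km.cluster_centers_                                # (k, D) centroids

# Stand-in features: three perspectives, 36 ROIs each, 2048-d (illustrative).
views = [np.random.rand(36, 2048).astype(np.float32) for _ in range(3)]
centroids = cluster_roi_features(views, k=32)
print(centroids.shape)  # (32, 2048)
```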

3.5.4.4 Temporal Batch Clustering

As mentioned in Section 3.4.2, the clustering approach can easily be extended along the temporal axis so as to smooth out the captions over sequential images. This section reports the performance of KMeans(1,2,3) and Ensemble(1,2,3) when the attention candidates are aggregated across consecutive frames, not only across perspectives. The number of pooled frames is swept over {1, 2, 4, 8, 16, 32} for both approaches. As seen in Fig. 3.9, both approaches improve as the number of frames increases.
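The temporal extension can be sketched analogously by pooling ROI features over a window of consecutive frames before clustering. As above, the names, shapes, and scikit-learn usage are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of temporal batch clustering (illustrative names and shapes):
# ROI features are pooled over a window of consecutive frames as well as over
# perspectives before the same k-means grouping is applied.
import numpy as np
from sklearn.cluster import KMeans

def temporal_batch_cluster(frame_window, k=32):
    """frame_window: list of T pooled frames; each frame is a list of
    (num_rois, feat_dim) arrays, one per perspective."""
    pooled = np.concatenate(
        [np.concatenate(view_feats, axis=0) for view_feats in frame_window],
        axis=0,
    )
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)
    return km.cluster_centers_  # shared attention candidates for the window

# e.g. a window of 8 frames, 3 perspectives, 36 ROIs of 2048-d each
window = [[np.random.rand(36, 2048).astype(np.float32) for _ in range(3)]
          for _ in range(8)]
candidates = temporal_batch_cluster(window, k=32)
```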


Figure 3.9: CIDEr-D and SPICE scores with temporal batch clustering (k-means and Ensemble). The number of pooled frames is investigated among {1, 2, 4, 8, 16, 32, 64}.

3.5.5 Qualitative Analysis

The first-person perspective captions describe the manipulated objects well but tend to lack contextual information, such as the type of place. In contrast, the second-person perspective captions succeed at describing where the actor is and his/her postures, which are concepts invisible from the first-person perspective. However, in some cases, the types of activities are not clear due to their visual granularity. The third-person perspective captions are more ambiguous in terms of the participants' situation, while novel objects not visible in other perspectives are described. Finally, the proposed approach generates more detailed captions about the actor and the context. For instance, in the top example in Fig. 3.10, the caption by the proposed method includes the actor's posture, location, and detailed activity, which could only be partially described in the single-perspective cases. Moreover, in the interactive cases in Fig. 3.11, we can see that the proposed method improves the third-person captions with an additional phrase about the manipulated objects derived from the first-person and/or second-person image sources. Qualitatively, we can see that the critical visual concepts mostly reside in the first- and second-person perspectives, while the third-person perspective contributes to describing the interactive scenes.

Although the verbal expression is slightly different for each perspective, we can see that the proposed method successfully produced a reasonable description that semantically integrates the three types of perspectives. It can be considered that the clustering in the ROI feature space effectively works to summarize the multi-perspective images.


[First-person / Second-person / Third-person images]
Reference: A person sitting on the bed is browsing a book.
First: A person is reading an open book with papers.
Second: A woman sitting on her bed in an room.
Third: A man and woman playing in an room with some chairs.
Proposed: A woman sitting on the bed reading an open book.

[First-person / Second-person / Third-person images]
Reference: A person sitting on the bed is typing a keyboard.
First: A person's hand on the computer keyboard.
Second: A man is sitting on his bed in the room.
Third: A young boy is standing in front of an open bed.
Proposed: A person is sitting on the bed with his keyboard.

[First-person / Second-person / Third-person images]
Reference: A person standing in front of the refrigerator is holding a can.
First: A person is holding an open refrigerator door.
Second: A man is standing in the kitchen looking.
Third: Two children standing in the living room with the tv.
Proposed: A man standing in front of an open refrigerator.

Figure 3.10: Examples of generated captions. Reference: one selected out of the five annotations. First: UpDown with the first-person image. Second: UpDown with the second-person image. Third: UpDown with the third-person image. Proposed: KMeans with all the images.


[First-person / Second-person / Third-person images]
Reference: A person is passing a red box to the other in a white shirt.
First: A young man holding an empty bottle of beer.
Second: A man standing in front of an open refrigerator.
Third: Two children are standing in a large room.
Proposed: Two children are playing with a bottle of water.

[First-person / Second-person / Third-person images]
Reference: A man in a white shirt is passing a green plastic bottle to the other in a room.
First: The man is holding a bottle of water in their hand.
Second: A woman standing in the kitchen holding something.
Third: Two men standing in a living room with one beds.
Proposed: Two men standing in a kitchen holding water.

Figure 3.11: Examples of generated captions. Reference: one selected out of the five annotations. First: UpDown with the first-person image. Second: UpDown with the second-person image. Third: UpDown with the third-person image. Proposed: KMeans with all the images.

3.5.5.2 Visualizing ROI Attention

To verify how the ROIs contribute to predicting each word, this section visualizes the attention weights α_{i,t} in Eq. 3.8, as shown in Fig. 3.12. All detected salient regions are colorized according to the corresponding weights. Figure 3.12 provides two examples, each of which has two types of results from Ensemble(1,2,3) and KMeans(1,2,3). We can see remarkable differences in predicting human-related words. The Ensemble(1,2,3) model focuses on a single region or an incorrect pair, while the KMeans(1,2,3) model successfully focuses on the same person in the second- and third-person images. The results indicate that the proposed bottom-up approach improves instance correspondence in the top-down weighting.
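For context, a typical UpDown-style decoder computes such weights with an additive attention over the region features. The snippet below is a hedged sketch of this computation; the layer names, dimensions, and exact form are illustrative assumptions and are not taken verbatim from Eq. 3.8.

```python
# Hedged sketch: additive attention over the grouped region features
# (UpDown-style), yielding per-region weights alpha_{i,t} that can be mapped
# to a colormap (yellow = large, blue = small) for each generated word.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=1024, attn_dim=512):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, attn_dim)    # projects region features
        self.w_h = nn.Linear(hidden_dim, attn_dim)  # projects the decoder state
        self.w_a = nn.Linear(attn_dim, 1)           # scores each region

    def forward(self, regions, h_t):
        # regions: (num_regions, feat_dim); h_t: (hidden_dim,) decoder state at step t
        scores = self.w_a(torch.tanh(self.w_v(regions) + self.w_h(h_t))).squeeze(-1)
        alpha = F.softmax(scores, dim=0)                  # one weight per region
        context = (alpha.unsqueeze(-1) * regions).sum(dim=0)
        return context, alpha

attn = RegionAttention()
_, alpha = attn(torch.randn(32, 2048), torch.randn(1024))
print(alpha.shape)  # torch.Size([32]) -- one weight per grouped region
```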

[Figure 3.12: attention maps over the first-person (1st), second-person (2nd), and third-person (3rd) images for Ensemble and KMeans. In the first example, Ensemble generates "a person is sitting on the bed reading her book" while KMeans generates "a woman is sitting on the bed with her sheets"; in the second example, both generate "a man standing in front of an open refrigerator".]

Figure 3.12: Top-down attention maps of KMeans and Ensemble. The strength of the attention is color-coded; yellow regions have a large weight, while blue regions have a small weight. The group of regions in each column conditions the decoder when generating the corresponding word. It can be observed that the proposed KMeans correctly discriminates instances, in contrast to Ensemble. Best viewed in color.
