
Label confusion with A2 (2/2)

Input: She_1 kept_2 a_3 cat_4

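On enumerating all spans

Memory usage

・Not an issue, because the input is a single sentence

・If needed, an upper limit on span length can be set

Computation speed

・Generally faster than BIO-based models

・BIO-based models step through the sentence with a for loop, whereas span vectors are computed with matrix operations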
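The two points above (capping span length and computing span vectors with matrix operations) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `enumerate_spans` and `span_vectors` are hypothetical, and the span feature `[h_i + h_j ; h_i - h_j]` is an assumed form that this section does not state.

```python
import numpy as np

def enumerate_spans(T, max_len=None):
    """Return (starts, ends) index arrays (0-based, inclusive) for all spans."""
    if max_len is None:
        max_len = T
    starts, ends = np.triu_indices(T)      # every pair with start <= end
    keep = (ends - starts) < max_len        # span length is end - start + 1
    return starts[keep], ends[keep]

def span_vectors(h, starts, ends):
    """h has shape (T, d). Builds all span vectors at once, with no Python loop.

    Assumed span feature (not given in this section): [h_i + h_j ; h_i - h_j].
    """
    h_i, h_j = h[starts], h[ends]           # fancy indexing gathers both endpoints
    return np.concatenate([h_i + h_j, h_i - h_j], axis=1)

T, d = 4, 3                                 # e.g. "She kept a cat" with toy 3-dim features
h = np.arange(T * d, dtype=float).reshape(T, d)

starts, ends = enumerate_spans(T)
vecs = span_vectors(h, starts, ends)
print(len(starts), vecs.shape)

starts_capped, ends_capped = enumerate_spans(T, max_len=2)
print(len(starts_capped))
```

For a four-word sentence there are only 10 spans, which is why memory is not a concern at the sentence level; with `max_len=2` the count drops to 7.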

Ensemble method

^*A.2

Inui-Suzuki Laboratory / Engineering Seminar 2019.07.05

$$h^{moe}_s = W^{moe}_s \cdot \left\{ \alpha_1 h^{1}_{(i,j)} + \alpha_2 h^{2}_{(i,j)} + \dots + \alpha_{|M|} h^{|M|}_{(i,j)} \right\}$$
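As a rough illustration of the mixture above, the sketch below combines the span vectors of several models with weights that sum to one and then applies the projection matrix. The function name `moe_span_vector` is hypothetical, and obtaining the weights from a softmax over free parameters is an assumption; the slides only require that the weights sum to one.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_span_vector(span_vecs, gate_logits, W):
    """Combine the span vectors of M models into one vector.

    span_vecs:   (M, d) array, row m is h^m_(i,j) from model m
    gate_logits: (M,) free parameters (assumption: alpha = softmax(logits))
    W:           (d_out, d) projection matrix, W^moe_s in the equation
    """
    alpha = softmax(gate_logits)   # alpha_m >= 0 and sum_m alpha_m = 1
    mixed = alpha @ span_vecs      # sum_m alpha_m * h^m_(i,j)
    return W @ mixed, alpha

span_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy vectors, M = 3
h_moe, alpha = moe_span_vector(span_vecs, np.zeros(3), np.eye(2))
print(alpha, h_moe)
```

With equal logits every model receives weight 1/3, so the result is the plain average of the three span vectors.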

Firstly, f_base calculates a base feature vector h_t for each word w_t ∈ w_{1:T}. Then, from a sequence of the base feature vectors h_{1:T}, f_span calculates a span feature vector h_s for a span s = (i, j). Finally, using h_s, f_label calculates the score for the span s = (i, j) with a label r.

Each function in Eqs. 4, 5 and 6 can arbitrarily be defined. In Section 3, we describe our functions used in this paper.

2.4 Inference

The simple argmax inference (Eq. 1) selects one span for each label. While this argmax inference is computationally efficient, it faces the following two problematic issues.

(a) The argmax inference sometimes selects spans that overlap with each other.

(b) The argmax inference cannot select multiple spans for one label.

In terms of (a), for example, when ⟨1, 3, A0⟩ and ⟨2, 4, A1⟩ are selected, a part of these two spans overlaps. In terms of (b), consider the following sentence.

He came to the U.S. yesterday at 5 p.m.
[A0: He] [A4: to the U.S.] [TMP: yesterday] [TMP: at 5 p.m.]

In this example, the label TMP is assigned to the two spans (“yesterday” and “at 5 p.m.”). Semantic role labels are mainly categorized into (i) core labels or (ii) adjunct labels. In the above example, the labels A0 and A4 are regarded as core labels, which indicate obligatory arguments for the predicate. In contrast, the labels like TMP are regarded as adjunct labels, which indicate optional arguments for the predicate. As the example shows, adjunct labels can be assigned to multiple spans.
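The two issues with argmax inference can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical and the scores are made up for illustration, chosen so that the overlapping spans from issue (a) win and the second TMP span from issue (b) is lost.

```python
def argmax_inference(scores):
    """Pick the single highest-scoring span for each label.

    scores maps (i, j, label) -> score, with 1-based inclusive word indices.
    """
    best = {}
    for (i, j, label), s in scores.items():
        if label not in best or s > best[label][1]:
            best[label] = ((i, j), s)
    return {label: span for label, (span, _) in best.items()}

def overlaps(a, b):
    """True if the inclusive spans a and b share at least one word."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical scores, not taken from any model output.
scores = {
    (1, 3, "A0"): 0.9, (1, 1, "A0"): 0.4,
    (2, 4, "A1"): 0.8, (4, 4, "A1"): 0.3,
    (6, 6, "TMP"): 0.7, (7, 9, "TMP"): 0.6,
}
pred = argmax_inference(scores)
print(pred)
print(overlaps(pred["A0"], pred["A1"]))
```

Here the A0 and A1 spans overlap on words 2 and 3, and only one of the two TMP spans can be returned, no matter how high the second one scores.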

To deal with these issues, we use a greedy search that keeps the consistency among spans and can return multiple spans for adjunct labels.

Specifically, we greedily select higher scoring labeled spans subject to two constraints.

Overlap Constraint: Any spans that overlap with the selected spans cannot be selected.

Number Constraint: While multiple spans can be selected for each adjunct label, at most one span can be selected for each core label.

As a precise description of this algorithm, we describe the pseudo code and its explanation in Appendix A.
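The greedy search with the two constraints can be sketched as below. This is not the pseudo code from Appendix A, which is not reproduced in this section; the function names are hypothetical, the scores are made up, and details such as a score threshold are left out.

```python
def overlaps(a, b):
    """True if the inclusive spans a and b share at least one word."""
    return a[0] <= b[1] and b[0] <= a[1]

def greedy_decode(scores, core_labels):
    """Select labeled spans in descending score order under two constraints.

    scores maps (i, j, label) -> score; core_labels is the set of core labels.
    """
    selected = []
    used_core = set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j, label), _ in ranked:
        # Overlap constraint: skip spans that overlap an already selected span.
        if any(overlaps((i, j), (a, b)) for a, b, _ in selected):
            continue
        # Number constraint: at most one span per core label.
        if label in core_labels:
            if label in used_core:
                continue
            used_core.add(label)
        selected.append((i, j, label))
    return sorted(selected)

# Hypothetical scores for "He came to the U.S. yesterday at 5 p.m." (words 1..9).
scores = {
    (1, 1, "A0"): 0.90, (3, 5, "A4"): 0.80, (6, 6, "TMP"): 0.70,
    (8, 9, "A4"): 0.65, (7, 9, "TMP"): 0.60, (1, 3, "A0"): 0.50,
}
result = greedy_decode(scores, core_labels={"A0", "A4"})
print(result)
```

In this run the second A4 candidate is rejected by the number constraint, the second A0 candidate by the overlap constraint, and both TMP spans are kept because TMP is an adjunct label.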

Figure 1: Overall architecture of our BiLSTM-span model.

3 Network Architecture

To compute the score for each span, we have introduced three functions (f_base, f_span, f_label) in Section 2.3. As an instantiation of each function, we use neural networks. This section describes our neural networks for each function and the overall network architecture.

3.1 BiLSTM-Span Model

Figure 1 illustrates the overall architecture of our model. The first component f_base uses bidirectional LSTMs (BiLSTMs) (Schuster and Paliwal, 1997; Graves et al., 2005, 2013) to calculate the base features. From the base features, the second component f_span extracts span features. Based on them, the final component f_label calculates the score for each labeled span. In the following, we describe these three components in detail.

Base Feature Function

As the base feature function f_base, we use BiLSTMs,

$$f_{base}(w_{1:T}, p) = \mathrm{BiLSTM}(w_{1:T}, p) .$$

There are some variants of BiLSTMs. Following the deep SRL models proposed by Zhou and Xu (2015) and He et al. (2017), we stack BiLSTMs in an interleaving fashion. The stacked BiLSTMs process an input sequence in a left-to-right manner at odd-numbered layers and in a right-to-left manner at even-numbered layers.
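The interleaving order of the stacked layers can be illustrated with a toy recurrence. This sketch only shows the alternation of directions: the running-sum `cell` is a stand-in for an LSTM cell, and the names `run_layer` and `interleaved_stack` are hypothetical, not taken from the paper or from any library.

```python
def run_layer(inputs, cell, init_state, reverse=False):
    """Run one recurrent layer over the sequence, optionally right-to-left.

    Outputs are always returned in the original word order.
    """
    seq = reversed(inputs) if reverse else inputs
    state, outputs = init_state, []
    for x in seq:
        state = cell(state, x)
        outputs.append(state)
    return outputs[::-1] if reverse else outputs

def interleaved_stack(inputs, cells, init_state=0.0):
    """Odd-numbered layers read left-to-right, even-numbered layers right-to-left."""
    h = list(inputs)
    for layer, cell in enumerate(cells, start=1):
        h = run_layer(h, cell, init_state, reverse=(layer % 2 == 0))
    return h

# Stand-in cell (assumption): a running sum instead of a real LSTM update.
def running_sum(state, x):
    return state + x

out = interleaved_stack([1.0, 2.0, 3.0], [running_sum, running_sum])
print(out)  # [10.0, 9.0, 6.0]
```

Layer 1 produces the left-to-right sums 1, 3, 6; layer 2 reads them from the right, so after two layers the first position has seen the whole sentence.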

Model m = 1: h_(3,5)


Model m = 2: h_(3,5)


…

Model m = |M|: h_(3,5)

$$\sum_{m=1}^{M} \alpha_m = 1$$

Each model is built from a different random seed

Exploiting the resulting variance improves accuracy

Mixture of Experts [Shazeer+ 2017]

Weights

$$f^{moe}_{label}(h^{moe}_s, r) = W^{moe}[r] \cdot h^{moe}_s$$
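The scoring step is a dot product between the row of the weight matrix that belongs to label r and the mixed span vector. A minimal sketch in plain Python, with a hypothetical function name and toy numbers:

```python
def label_scores(W, h, labels):
    """Score one span vector h against every label.

    W is a list of rows, one per label, so W[k] plays the role of W^moe[r]
    for r = labels[k]; each score is the dot product W[k] . h.
    """
    return {r: sum(w * x for w, x in zip(W[k], h)) for k, r in enumerate(labels)}

labels = ["A0", "A1"]
W = [[1.0, 0.0],   # toy row for A0
     [0.5, 0.5]]   # toy row for A1
h = [2.0, 4.0]     # toy span vector h^moe_s
scores = label_scores(W, h, labels)
print(scores)  # {'A0': 2.0, 'A1': 3.0}
```

Running this for every candidate span yields one score per (span, label) pair, which is the input to the decoding step.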

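Score computation

(3) Decoding with a greedy search

Overlap constraint: selected spans must not overlap

Number constraint: at most one span per core label (adjunct labels may take multiple spans)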


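Formatted score table (output scores)

Span | (1,1) | (1,2) | … | (2,2) | … | (3,3) | (3,4) | (4,4)
Words | She | She kept | … | kept | … | a | a cat | cat
A0 | .89 | .03 | … | .01 | … | .02 | .73 | .66
A1 | .67 | .02 | … | .01 | … | .01 | .88 | .72
… | … | … | … | … | … | … | … | …
DIR | .01 | … | … | .03 | .02
LOC | .02 | .01 | … | .01 | … | .03 | .12 | .18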

Exclude spans that contain the predicate

Number constraint

Overlap constraint with respect to span 3
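The first annotation, dropping every span that contains the predicate, can be sketched as a filter over the enumerated spans. The function name is hypothetical; the indices are 1-based and inclusive, as in the score table, with "kept" at position 2.

```python
def candidate_spans(T, predicate_index):
    """All spans (i, j) over T words, minus those that contain the predicate."""
    return [
        (i, j)
        for i in range(1, T + 1)
        for j in range(i, T + 1)
        if not (i <= predicate_index <= j)
    ]

# "She kept a cat": T = 4 and the predicate "kept" is word 2.
spans = candidate_spans(4, 2)
print(spans)  # [(1, 1), (3, 3), (3, 4), (4, 4)]
```

Of the 10 possible spans only 4 remain, so columns such as (1,2) and (2,2) in the table drop out before the number and overlap constraints are applied.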

