Label confusion with A2 (2/2)
Input: She(1) kept(2) a(3) cat(4)
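On enumerating all spans
Memory
・Not a problem, because the input is a single sentence
・If necessary, an upper limit on span length can be set
Computation speed
・Generally faster than BIO-based models
・BIO-based models iterate over the sentence with a for loop, whereas span vectors are computed with matrix operations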
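The memory point above can be checked with a small sketch. Enumerating every span (i, j) of a T-word sentence gives T(T+1)/2 candidates, and a cap on span length reduces that number. The function name `enumerate_spans` and the parameter `max_len` are illustrative assumptions, not names taken from the slides.

```python
def enumerate_spans(num_words, max_len=None):
    """Return all 1-indexed, inclusive spans (i, j) with i <= j.

    max_len is a hypothetical cap on span length, counted in words.
    """
    spans = []
    for i in range(1, num_words + 1):
        # Last end position allowed for a span starting at i.
        last = num_words if max_len is None else min(num_words, i + max_len - 1)
        for j in range(i, last + 1):
            spans.append((i, j))
    return spans


words = ["She", "kept", "a", "cat"]
all_spans = enumerate_spans(len(words))
short_spans = enumerate_spans(len(words), max_len=2)
print(len(all_spans))    # T(T+1)/2 = 10 spans for T = 4
print(len(short_spans))  # 7 spans of length 1 or 2
```

For a single sentence the candidate count grows only quadratically in T, which is why the slide treats memory as a minor concern.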
Ensemble method
*A.2 Inui-Suzuki Laboratory / Engineering Seminar 2019.07.05
$h^{\mathrm{moe}}_s = W^{\mathrm{moe}}_s \cdot \{\alpha_1 h^{1}_{(i,j)} + \alpha_2 h^{2}_{(i,j)} + \cdots + \alpha_{|M|} h^{|M|}_{(i,j)}\}$
Firstly, $f_{base}$ calculates a base feature vector $h_t$ for each word $w_t \in w_{1:T}$. Then, from a sequence of the base feature vectors $h_{1:T}$, $f_{span}$ calculates a span feature vector $h_s$ for a span $s = (i, j)$. Finally, using $h_s$, $f_{label}$ calculates the score for the span $s = (i, j)$ with a label $r$.
Each function in Eqs. 4, 5 and 6 can arbitrarily be defined. In Section 3, we describe our functions used in this paper.
2.4 Inference
The simple argmax inference (Eq. 1) selects one span for each label. While this argmax inference is computationally efficient, it faces the following two problematic issues.
(a) The argmax inference sometimes selects spans that overlap with each other.
(b) The argmax inference cannot select multiple spans for one label.
In terms of (a), for example, when $\langle 1, 3, \mathrm{A0} \rangle$ and $\langle 2, 4, \mathrm{A1} \rangle$ are selected, a part of these two spans overlaps. In terms of (b), consider the following sentence.
He    came  to the U.S.  yesterday  at 5 p.m.
[A0]        [   A4    ]  [  TMP  ]  [  TMP  ]
In this example, the label TMP is assigned to the two spans (“yesterday” and “at 5 p.m.”). Semantic role labels are mainly categorized into (i) core labels or (ii) adjunct labels. In the above example, the labels A0 and A4 are regarded as core labels, which indicate obligatory arguments for the predicate. In contrast, the labels like TMP are regarded as adjunct labels, which indicate optional arguments for the predicate. As the example shows, adjunct labels can be assigned to multiple spans.
To deal with these issues, we use a greedy search that keeps the consistency among spans and can return multiple spans for adjunct labels.
Specifically, we greedily select higher scoring labeled spans subject to two constraints.
Overlap Constraint: Any spans that overlap with the selected spans cannot be selected.
Number Constraint: While multiple spans can be selected for each adjunct label, at most one span can be selected for each core label.
As a precise description of this algorithm, we describe the pseudo code and its explanation in Appendix A.
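The two constraints can be sketched as a short greedy loop. This is a minimal illustration under stated assumptions, not the pseudo code from the appendix: the core label set `CORE_LABELS`, the `threshold` stopping rule, the function names, and all scores in the example are hypothetical.

```python
CORE_LABELS = {"A0", "A1", "A2", "A3", "A4", "A5"}  # assumed core label set


def overlaps(a, b):
    """True if inclusive spans a = (i, j) and b = (i', j') share a word."""
    return a[0] <= b[1] and b[0] <= a[1]


def greedy_decode(scored_spans, threshold=0.0, core_labels=CORE_LABELS):
    """Greedily pick labeled spans from (score, (i, j), label) triples.

    threshold is an assumed stopping rule: lower-scoring candidates are dropped.
    """
    selected = []
    used_core = set()
    for score, span, label in sorted(scored_spans, key=lambda t: t[0], reverse=True):
        if score < threshold:
            break  # all remaining candidates score even lower
        if any(overlaps(span, chosen) for chosen, _ in selected):
            continue  # overlap constraint
        if label in core_labels and label in used_core:
            continue  # number constraint: at most one span per core label
        selected.append((span, label))
        if label in core_labels:
            used_core.add(label)
    return selected


# Hypothetical scores for "He came to the U.S. yesterday at 5 p.m." (words 1..9).
candidates = [
    (0.50, (1, 3), "A0"),
    (0.90, (1, 1), "A0"),
    (0.70, (6, 6), "TMP"),
    (0.80, (3, 5), "A4"),
    (0.60, (7, 9), "TMP"),
    (0.45, (2, 2), "A0"),
    (0.40, (4, 5), "A1"),
    (0.10, (2, 2), "A1"),
]
result = greedy_decode(candidates, threshold=0.3)
print(result)
```

In this toy run both TMP spans survive because TMP is treated as an adjunct label, while the second A0 candidate at (2, 2) is rejected by the number constraint even though it overlaps nothing.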
[Figure 1: Overall architecture of our BiLSTM-span model.]
3 Network Architecture
To compute the score for each span, we have introduced three functions ($f_{base}$, $f_{span}$, $f_{label}$) in Section 2.3. As an instantiation of each function, we use neural networks. This section describes our neural networks for each function and the overall network architecture.
3.1 BiLSTM-Span Model
Figure 1 illustrates the overall architecture of our model. The first component $f_{base}$ uses bidirectional LSTMs (BiLSTMs) (Schuster and Paliwal, 1997; Graves et al., 2005, 2013) to calculate the base features. From the base features, the second component $f_{span}$ extracts span features. Based on them, the final component $f_{label}$ calculates the score for each labeled span. In the following, we describe these three components in detail.
Base Feature Function
As the base feature function $f_{base}$, we use BiLSTMs,
$f_{base}(w_{1:T}, p) = \mathrm{BiLSTM}(w_{1:T}, p)$.
There are some variants of BiLSTMs. Following the deep SRL models proposed by Zhou and Xu (2015) and He et al. (2017), we stack BiLSTMs in an interleaving fashion. The stacked BiLSTMs process an input sequence in a left-to-right manner at odd-numbered layers and in a right-to-left manner at even-numbered layers.
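The interleaving of directions can be illustrated with a toy sketch. It shows only the layer ordering (left-to-right at odd layers, right-to-left at even layers); the recurrent cell here is a running sum standing in for an LSTM cell, and the names `run_layer`, `interleaved_stack`, and `toy_cell` are hypothetical.

```python
def run_layer(inputs, cell, state0=0.0):
    """Run a recurrent cell over inputs from the first to the last element."""
    state, outputs = state0, []
    for x in inputs:
        state = cell(state, x)
        outputs.append(state)
    return outputs


def interleaved_stack(inputs, cells):
    """Odd-numbered layers read left-to-right, even-numbered ones right-to-left."""
    hidden = list(inputs)
    for layer_no, cell in enumerate(cells, start=1):
        if layer_no % 2 == 1:
            hidden = run_layer(hidden, cell)
        else:
            # Reverse, run, and reverse back so outputs stay in word order.
            hidden = run_layer(hidden[::-1], cell)[::-1]
    return hidden


def toy_cell(state, x):
    """Toy stand-in for an LSTM cell: a running sum (illustration only)."""
    return state + x


out = interleaved_stack([1.0, 2.0, 3.0], [toy_cell, toy_cell])
print(out)  # [10.0, 9.0, 6.0]
```

After two layers every position has seen the whole sequence: the first layer carries left context forward, and the second carries right context backward over those outputs.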
Model m = 1: h(3,5)
Model m = 2: h(3,5)
…
Model m = |M|: h(3,5)
$\sum_{m=1}^{|M|} \alpha_m = 1$
Each model is built from a different random seed. Exploiting the resulting variance improves accuracy.
Mixture of Experts [Shazeer+ 2017]
Weights
$f^{\mathrm{moe}}_{\mathrm{label}}(h^{\mathrm{moe}}_s, r) = W^{\mathrm{moe}}[r] \cdot h^{\mathrm{moe}}_s$
Score computation
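As a rough sketch of the ensemble computation, the code below forms the weighted sum of per-model span vectors and then scores a label with a dot product. Several things are assumptions made for illustration: the mixture weights are produced by a softmax (one simple way to make them sum to 1), the span-level transform applied to the weighted sum is left out, and all vectors, weights, and function names are hypothetical.

```python
import math


def softmax(xs):
    """Normalize raw values into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_span_vectors(span_vectors, alphas):
    """Weighted sum alpha_1 h^1 + ... + alpha_|M| h^|M| over the |M| models."""
    dim = len(span_vectors[0])
    return [sum(a * h[d] for a, h in zip(alphas, span_vectors)) for d in range(dim)]


def label_score(weight_row, h_moe):
    """Dot product W[r] . h for a single label r."""
    return sum(w * h for w, h in zip(weight_row, h_moe))


# Hypothetical span vectors h(3,5) from three models, and hypothetical label weights.
span_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = softmax([0.0, 0.0, 0.0])  # uniform weights, 1/3 each
h_moe = combine_span_vectors(span_vectors, alphas)
W = {"A0": [1.0, -1.0], "A1": [0.5, 0.5]}
scores = {r: label_score(row, h_moe) for r, row in W.items()}
print(alphas, h_moe, scores)
```

With uniform weights the combination is a plain average of the models; learned weights would let better models contribute more to the final score.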
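(3) Greedy decoding
Overlap constraint: selected spans must not overlap
Number constraint: at most one span per core label (multiple spans may be selected for adjunct labels)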
Formatted score table
Output: scores

Span   (1,1)  (1,2)     …  (2,2)  …  (3,3)  (3,4)  (4,4)
Text   She    She kept  …  kept   …  a      a cat  cat
A0     .89    .03       …  .01    …  .02    .73    .66
A1     .67    .02       …  .01    …  .01    .88    .72
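As a small illustration of reading the score table above, the sketch below picks the highest-scoring span for each label, which corresponds to simple per-label argmax inference. It uses only the columns that are visible in the table; the columns elided with "…" are left out, and the variable and function names are hypothetical.

```python
# Visible columns only; entries elided as "…" in the table are not included.
spans = [(1, 1), (1, 2), (2, 2), (3, 3), (3, 4), (4, 4)]
scores = {
    "A0": [0.89, 0.03, 0.01, 0.02, 0.73, 0.66],
    "A1": [0.67, 0.02, 0.01, 0.01, 0.88, 0.72],
}


def argmax_span(label):
    """Return the best-scoring visible span and its score for one label."""
    row = scores[label]
    best = max(range(len(spans)), key=lambda k: row[k])
    return spans[best], row[best]


for label in scores:
    print(label, argmax_span(label))
```

On these columns A0 goes to (1,1) "She" and A1 to (3,4) "a cat", which do not overlap; per-label argmax gives no such guarantee in general, which is the motivation for greedy decoding with constraints.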