
(1)

Online Learning

Taro Watanabe

taro.watanabe at nict.go.jp

(2)

機械翻訳について勉強したい。

I want to study about machine translation.

We want as good a translation as possible

...

(3)

infobox ) infobox buddhist 道元 は 鎌倉 時代 初期 の で あ る 。 道元 ( どう げん ) は 、 鎌倉 時代 初期 の 禅僧 。 曹洞 禅 の 開祖 曹洞 宗 の 開祖 。 その 生涯 に は kigen 名 が あ る 。 晩年 に 希玄 と い う 異称 も 用い た 。 一般 に は 宗 と 呼 ば れ る こと に よ っ て 尊称 は 高僧 が あ る 。 同宗旨 で は 高祖 と 尊称 さ れ る 。 死後 に と い っ た 仏所 伝灯 国師 、 joyo-daishi で あ る 。 は 、 仏性 伝東 国師 、 承陽 大師_ ( 僧 ) 。 一般 に は 道元 禅師 と 呼 ば れ る 。 一般 に は 道元 禅師 と 呼 ば れ る 。 また ∼ 686 の 普及 に つ い て の 修行 を tooth brushing 、 大峰 、 食事 作法 cleaning と し て 、 日本 に お い て い る 。 日本 に 歯磨き 洗面 、 食事 の 際 の 作法 や 掃除 の 習慣 を 広め た と い わ る 。

(4)

Errors

Search error: the decoder fails to find the highest-scoring translation

Model error: the highest-scoring translation is wrong

Training-data problems: the data is too small, or mismatched

(5)

f = 機械翻訳について勉強したい

Candidate e                                        log Pr(f|d,e)   log Pr(d|e)   log Pr(e)   sum
e1: I want to study about machine translation          -2              -3            -4       -9
e2: I need to master machine translation               -3              -4            -4      -11
e3: machine translation want to study                  -2              -5            -1       -8
e4: I don't want to learn anything                     -5              -2            -3      -10

If the features are simply summed, the ungrammatical e3 gets the best score (-8). What should we do?

Weight the same features, e.g. with w = (0.5, 0.4, 0.2):

e1: 0.5 x -2 + 0.4 x -3 + 0.2 x -4 = -3.0
e2: 0.5 x -3 + 0.4 x -4 + 0.2 x -4 = -3.9
e3: 0.5 x -2 + 0.4 x -5 + 0.2 x -1 = -3.2
e4: 0.5 x -5 + 0.4 x -2 + 0.2 x -3 = -3.9

Learn the weights so that the k-best list is ranked correctly.

(6)

More generally: combine multiple features h(f, d, e) log-linearly, with a weight for each:

ê = argmax_e  Σ_d exp(w·h(f, d, e)) / Σ_{e', d'} exp(w·h(f, d', e'))
  ≈ argmax_{e, d}  w·h(f, d, e)

The previous example is the special case h = (log Pr(f|d,e), log Pr(d|e), log Pr(e)) with w = (0.5, 0.4, 0.2), i.e. ê = argmax_e Pr(f|d,e)^0.5 Pr(d|e)^0.4 Pr(e)^0.2.

Optimization = learning the best w
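To make the weighted combination concrete, here is a minimal Python sketch (not part of the original slides) that scores the four candidates from slide (5) with the example weights w = (0.5, 0.4, 0.2) and picks the argmax; the feature values are the illustrative numbers from the slide, not real model scores.

```python
# Minimal sketch: pick the best candidate under a weighted log-linear model.
# Feature values are the illustrative log-probabilities from slide (5);
# w = (0.5, 0.4, 0.2) are the example weights, not learned values.

candidates = {
    "I want to study about machine translation": (-2.0, -3.0, -4.0),
    "I need to master machine translation":      (-3.0, -4.0, -4.0),
    "machine translation want to study":         (-2.0, -5.0, -1.0),
    "I don't want to learn anything":            (-5.0, -2.0, -3.0),
}
w = (0.5, 0.4, 0.2)

def score(h):
    """Weighted score w . h(f, d, e)."""
    return sum(wi * hi for wi, hi in zip(w, h))

for e, h in candidates.items():
    print(f"{score(h):5.1f}  {e}")
print("best:", max(candidates, key=lambda e: score(candidates[e])))
```

Running it reproduces the weighted scores on the slide (-3.0, -3.9, -3.2, -3.9), so the grammatical e1 now wins.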

(7)

The MT pipeline

Parallel data, large (lower quality?), e.g. kyoto-train.{ja,en}: used to build the translation model and the language model.

Parallel tuning data, (a small amount?) of high quality, e.g. kyoto-tune.{ja,en}: used to learn the weight parameters. This is the optimization problem.

The decoder combines the models using those weights.

(8)

Optimization

Assume some loss function ℓ and minimize the risk over the bitext (F, E). Since the true distribution is unknown, minimize the regularized (Ω) empirical risk instead:

ŵ = argmin_{w ∈ W}  E_{Pr(F,E)} [ ℓ(F, E; w) ]

ŵ = argmin_{w ∈ W}  ℓ(F, E; w) + Ω(w)

(9)

k-best approximate optimization

Are there approximations other than "k-best merging"?

1: procedure Learn(⟨F, E⟩)
2:   C ← {}
3:   for t = 1 ... T do
4:     decode with w^(t) and generate a k-best list
5:     merge the k-best list into C
6:     w^(t+1) ← argmin_w ℓ(F, E, C; w) + Ω(w)
7:   end for
8: end procedure

(10)

At each iteration, merge in the new k-best list and re-learn the weights. It converges, but ... to a local optimum.
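A rough Python sketch of the loop on slide (9), assuming hypothetical helpers decode_kbest(f, w, k) for the decoder and optimize(F, E, C, w) for the inner optimizer (e.g. MERT or a regularized loss minimizer); neither name is a real toolkit API.

```python
def kbest_merge_learn(F, E, decode_kbest, optimize, w, T=10, k=100):
    """Slide (9): repeatedly decode, merge the k-best lists into C, re-fit w."""
    C = [[] for _ in F]                           # accumulated k-best list per sentence
    for _ in range(T):
        for i, f in enumerate(F):
            C[i] = C[i] + decode_kbest(f, w, k)   # merge new k-best (deduplication omitted)
        w = optimize(F, E, C, w)                  # argmin of loss over merged lists + regularizer
    return w
```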

(11)

Local optima

MERT depends heavily on the initial parameter values, so it is typically run multiple times from random initializations.

... optimizer samples, we have a statistic that jointly quantifies the impact of test set effects and optimizer instability on a test set. We call this statistic s_sel. Different values of this statistic can suggest methodological improvements. For example, a large s_sel indicates that more replications will be necessary to draw reliable inferences from experiments on this test set, so a larger test set may be helpful.

To compute s_sel, assume we have n independent optimization runs which produced weight vectors that were used to translate a test set n times. The test set has ℓ segments with references R = ⟨R_1, R_2, ..., R_ℓ⟩. Let X = ⟨X_1, X_2, ..., X_n⟩ where each X_i = ⟨X_i1, X_i2, ..., X_iℓ⟩ is the list of translated segments of the test set from the i-th optimization run. For each hypothesis output X_i, we construct k bootstrap replicates by drawing segments uniformly, with replacement, from X_i, together with its corresponding reference. This produces k virtual test sets for each optimization run i. We designate the score of the j-th virtual test set of the i-th optimization run with m_ij. If m̄_i = (1/k) Σ_{j=1}^{k} m_ij, then we have:

s_i = Σ_{j=1}^{k} (m_ij − m̄_i)² / (k − 1)
s_sel = (1/n) Σ_{i=1}^{n} s_i

4.2 Comparing Two Systems

In the previous section, we gave statistics about the distribution of evaluation metrics across a large number of experimental samples (Table 1). Because of the large number of trials we carried out, we can be extremely confident in concluding that for both pairs of systems, the experimental manipulation accounts for the observed metric improvements, and furthermore, that we have a good estimate of the magnitude of that improvement. However, it is not generally feasible to perform as many replications as we did, so here we turn to the question of how to compare two systems, accounting for optimizer noise, but without running 300 replications.

We begin with a visual illustration of how optimizer instability affects test set scores when comparing two systems. Figure 1 plots the histogram of the 300 optimizer samples each from the two BTEC Chinese-English systems. The phrase-based

Figure 1: Histogram of test set BLEU scores for the BTEC phrase-based system (left) and BTEC hierarchical system (right). While the difference between the systems is 1.5 BLEU in expectation, there is a non-trivial region of overlap indicating that some random outcomes will result in little to no difference being observed.

Figure 2: Relative frequencies of obtaining differences in BLEU scores on the WMT system as a function of the number of optimizer samples. The expected difference is 0.2 BLEU. While there is a reasonably high chance of observing a non-trivial improvement (or even a decline) for 1 sample, the distribution quickly peaks around the expected value given just a few more samples.

system's distribution is centered at the sample mean 48.4, and the hierarchical system is centered at 49.9, a difference of 1.5 BLEU, corresponding to the widely replicated result that hierarchical phrase-based systems outperform conventional phrase-based systems in Chinese-English translation. Crucially, although the distributions are distinct, there is a non-trivial region of overlap, and experimental samples from the overlapping region could suggest the opposite conclusion!

To further underscore the risks posed by this overlap, Figure 2 plots the relative frequencies with which different BLEU score deltas will occur, as a function of the number of optimizer samples used. When is a difference significant? To determine whether an experimental manipulation results in a ...



(Clark et al., 2011)

Optimization

(12)

The problem of scale

Lots of data, and high-dimensional feature vectors h(f, d, e) = (..., ..., ...)

Are there efficient learning methods?

[Slide figure: the training data drawn as a large grid of examples.]

(13)

If we are approximating anyway ...

Approximate the training data instead. This can be realized with a very simple algorithm: randomly select a part of the data, update w, and repeat.

[Slide figure: the same grid of training examples, with a randomly selected subset highlighted.]

(14)

Online learning

Widely used in NLP: morphological analysis, parsing, etc.

1: procedure OnlineLearn(⟨F, E⟩ = ⟨f^(i), e^(i)⟩, i = 1 ... N)
2:   initialize w^(1)
3:   for t ∈ {1, ..., T} do
4:     ⟨F̃^(t), Ẽ^(t)⟩ ⊆ ⟨F, E⟩                                   (random sampling: a mini-batch)
5:     C̃^(t) ← GEN(F̃^(t), w^(t))                                  (decode)
6:     w^(t+1) ← argmin_{w ∈ W} ℓ(F̃^(t), Ẽ^(t), C̃^(t); w) + Ω(w)  (update)
7:   end for
8:   return w^(T+1)
9: end procedure
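A minimal Python sketch of this mini-batch loop, with hypothetical stand-ins decode_kbest for GEN and update_weights for the per-batch minimization of loss plus regularizer; it illustrates the control flow, not a particular toolkit.

```python
import random

def online_learn(F, E, decode_kbest, update_weights, w,
                 T=1000, batch_size=8, k=100):
    """Slide (14): mini-batch online learning (a sketch, not a toolkit API)."""
    data = list(zip(F, E))
    for t in range(T):
        batch = random.sample(data, batch_size)             # random mini-batch (F~, E~)
        kbests = [decode_kbest(f, w, k) for f, _ in batch]  # C~ = GEN(F~, w)
        w = update_weights(batch, kbests, w)                # loss + regularizer step
    return w
```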

(15)

The problem: BLEU

The optimization is based on ranking the k-best list, but corpus-level BLEU ≠ the average of sentence-level BLEU, so it is hard to learn correctly at the sentence level.

e1: I want to study about machine translation
e2: I need to master machine translation
e3: machine translation want to study
e4: I don't want to learn anything

(16)

Approximating BLEU

Keep the BLEU sufficient statistics of each sentence processed so far, taken from the 1-best or the oracle translation (Watanabe et al., 2007):

GEN(f^(s), w) produces candidates whose statistics c^(s)_1, ..., c^(s)_i, ..., c^(s)_K are computed; combined with the translations e^(1), ..., e^(S) already chosen for the other sentences, these approximate a corpus-level BLEU.

(17)

Approximating BLEU by decay

For sentence-level BLEU, add in a (×0.9-decayed) history of the BLEU statistics seen so far (Chiang et al., 2008):

b ← 0.9 (b + c(e))
l ← 0.9 (l + |f|)
B(e) = (l + |f|) · BLEU(b + c(e))

ê^(s) = argmax_e  −B(e) + w·h(f^(s), e)
ė^(s) = argmax_e  +B(e) + w·h(f^(s), e)
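The bookkeeping behind this decayed pseudo-document BLEU can be sketched in a few lines of Python. The class below assumes sentence-level BLEU statistics are kept as numpy vectors and that a hypothetical helper bleu_from_stats turns such a vector into a BLEU score; it illustrates the decay idea, not the exact implementation of Chiang et al. (2008).

```python
import numpy as np

class DecayedBleu:
    """Decayed pseudo-document BLEU (slide 17 idea, after Chiang et al., 2008)."""

    def __init__(self, bleu_from_stats, decay=0.9):
        self.bleu_from_stats = bleu_from_stats  # stats vector -> BLEU (hypothetical helper)
        self.decay = decay
        self.b = None      # decayed running n-gram statistics
        self.l = 0.0       # decayed running source-length total

    def score(self, stats, src_len):
        """B(e) = (l + |f|) * BLEU(b + c(e))."""
        b = stats if self.b is None else self.b + stats
        return (self.l + src_len) * self.bleu_from_stats(b)

    def update(self, stats, src_len):
        """After a sentence: b <- 0.9*(b + c(e)),  l <- 0.9*(l + |f|)."""
        self.b = self.decay * (stats if self.b is None else self.b + stats)
        self.l = self.decay * (self.l + src_len)
```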

(18)

The perceptron

1: procedure Perceptron(⟨F, E⟩ = ⟨f^(i), e^(i)⟩, i = 1 ... N)
2:   w^(1) ← 0
3:   for t ∈ {1 ... T} do
4:     ⟨f, e⟩ ∼ ⟨F, E⟩                      (sample a single sentence)
5:     c̃ ← GEN(f, w^(t))
6:     ⟨ê, d̂⟩ ← 1-best of c̃
7:     ⟨e*, d*⟩ ← oracle translation in c̃
8:     if ⟨e*, d*⟩ ≠ ⟨ê, d̂⟩ then            (penalize mistakes)
9:       w^(t+1) ← w^(t) + h(f, e*, d*) − h(f, ê, d̂)
10:    end if
11:  end for
12:  return w^(T+1), or the average (1/T) Σ_{t=2}^{T+1} w^(t)
13: end procedure
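A compact Python sketch of this procedure; gen, feats, and oracle are hypothetical callables for the decoder's k-best generation, the feature function h(f, e, d), and the oracle selection, and feature vectors are assumed to be numpy arrays.

```python
import numpy as np

def perceptron(data, gen, feats, oracle, T=5, dim=1000):
    """Sketch of the perceptron of slide (18); not a toolkit implementation."""
    w, w_sum, n = np.zeros(dim), np.zeros(dim), 0
    for _ in range(T):
        for f, e_ref in data:
            cands = gen(f, w)                                          # k-best list
            e_hat, d_hat = max(cands, key=lambda c: w @ feats(f, *c))  # 1-best
            e_star, d_star = oracle(cands, e_ref)                      # oracle translation
            if (e_star, d_star) != (e_hat, d_hat):                     # penalize mistakes
                w = w + feats(f, e_star, d_star) - feats(f, e_hat, d_hat)
            w_sum, n = w_sum + w, n + 1
    return w_sum / n   # averaged parameters (alternatively, return the final w)
```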

(19)

The perceptron (example)

For the candidates e1-e4 above, update w so that the score of e1 becomes larger than the score of e3:

argmin_w  max(0, 0 − (w·h(f, e1) − w·h(f, e3)))

(20)

Hinge loss

Again with the candidates e1-e4: update w so that the score of e1 exceeds the score of e3 by at least 1:

argmin_w  max(0, 1 − (w·h(f, e1) − w·h(f, e3)))

(21)

MIRA

Update w so that the score of e1 exceeds the score of e3 by at least the translation error, while keeping the change from the previous parameters w^(t) small:

argmin_w  (1/2) ‖w − w^(t)‖² + max(0, error(e1, e3) − (w·h(f, e1) − w·h(f, e3)))

(Crammer et al., 2006)

(22)

MIRA

A very simple update, where α^(t) adjusts the size of the step (α^(t) = 1 gives the perceptron):

Δh(f, e1, e3) = h(f, e1) − h(f, e3)

w^(t+1) ← w^(t) + α^(t) Δh(f, e1, e3)

α^(t) = min( 1, [ error(e1, e3) − w^(t)·Δh(f, e1, e3) ] / ‖Δh(f, e1, e3)‖² )

There are many other update rules as well, such as CW and AROW.
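A minimal Python version of this update, offered as a sketch: feats_good and feats_bad are the feature vectors (numpy arrays) of the better and worse translation, error is the difference in translation error between them, and C caps the step size (C = 1 matches the min(1, ·) above).

```python
import numpy as np

def mira_step(w, feats_good, feats_bad, error, C=1.0):
    """One MIRA / passive-aggressive update (slides 21-22; Crammer et al., 2006).
    Pushes the good translation's score above the bad one's by at least `error`
    while keeping the change to w small; fixing alpha = 1 gives the perceptron."""
    delta = feats_good - feats_bad                       # Δh(f, e_good, e_bad)
    loss = error - w @ delta                             # margin violation
    if loss <= 0:
        return w                                         # constraint already satisfied
    alpha = min(C, loss / max(delta @ delta, 1e-12))     # clipped step size
    return w + alpha * delta
```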

(23)

A large margin

Among the correct and incorrect translations (e1-e4 as above), select the pair with both a large difference in error and a large difference in score, e.g. the pair maximizing

error(e1, e3) − w^(t)·Δh(f, e1, e3)   or   error(e2, e3) − w^(t)·Δh(f, e2, e3)

(24)


Figure 1: Hypothetical output space of a translation model for an input sentence x^(i). Each point corresponds to a single translation/derivation output pair. Horizontal "bands" are caused by output pairs with the same translation (and hence the same cost) but different derivations. The left plot shows the entire output space and the right plot highlights outputs in the k-best list. Choosing the output with the lowest cost in the k-best list is similar to finding ⟨y+, h+⟩.

output is shown as ⟨y−, h−⟩ in Fig. 1; finding y− is often called cost-augmented decoding, which is also used to define hinge loss (§3.3).

The second form, Eq. 7, penalizes the model prediction (the penalized output is ⟨ŷ, ĥ⟩) and favors an output pair that has both high model score and low cost; this is the converse of cost-augmented decoding and therefore we call it cost-diminished decoding; the favored output is ⟨y+, h+⟩ in Fig. 1. The third form, Eq. 8, sets the favored output to ⟨y+, h+⟩ and the penalized output to ⟨y−, h−⟩. This loss underlies RAMPION. It is similar to the loss optimized by the MIRA-inspired algorithm used by Chiang et al. (2008, 2009).

Optimization. The ramp losses are continuous but non-convex and non-differentiable, so gradient-based optimization methods are not available.⁵ Fortunately, Eq. 8 can be optimized by using a concave-convex procedure (CCCP; Yuille and Rangarajan, 2002). CCCP is a batch optimization algorithm for any function that is the sum of a concave and a convex function. The idea is to approximate the sum as the convex term plus a tangent line to the concave function at the current parameter values; the resulting sum is convex and can be optimized with (sub)gradient methods.

⁵ For non-differentiable, continuous, convex functions, gradient-based methods are available, such as stochastic subgradient descent (SSD), and it is tempting to apply them here. However, non-convex functions are not everywhere subdifferentiable, and so a straightforward application of SSD may encounter problems in practice.

With our loss functions, CCCP first imputes the outputs in the concave terms in each loss (i.e., solves the negated max expressions) for the entire training set and then uses an optimization procedure to optimize the loss with the imputed values fixed. Any convex optimization procedure can be used once the negated max terms are solved; we use stochastic subgradient descent (SSD), but MIRA could be easily used instead.

The CCCP algorithm we use for optimizing the third ramp loss, which we call RAMPION, is shown as Alg. 1. Similar algorithms can easily be derived for the other ramp losses. The first step done on each iteration is to generate k-best lists for the full tuning set (line 3). We then run CCCP on the k-best lists for T iterations (lines 4-15). This involves first finding the translation to update towards for all sentences in the tuning set (lines 5-7), then making parameter updates in an online fashion with T epochs of stochastic subgradient descent (lines 8-14). The subgradient update for the ℓ2 regularization term is done in line 11 and then for the loss in line 12.⁶

Unlike prior work that targeted similar loss functions (Watanabe et al., 2007; Chiang et al., 2008; Chiang et al., 2009), we do not use a fully online algorithm such as MIRA in an outer loop because we are not aware of an online learning algorithm with theoretical guarantees for differentiable, non-convex loss functions like the ramp losses. CCCP ...

⁶ The ℓ2 regularization done here regularizes toward w^(0), not 0.

(25)

SGD

As long as the gradient can be computed, it can be used with losses other than the hinge loss.

The learning rate η is decayed as training proceeds.
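As a sketch, plain SGD with a decaying learning rate might look like this in Python; grad(example, w) is a hypothetical function returning the (sub)gradient of the per-example loss, and the 1/(1+step) decay schedule is just one common choice.

```python
def sgd(data, grad, w, T=10, eta0=0.1):
    """SGD sketch (slide 25): works for any loss whose gradient we can compute."""
    step = 0
    for _ in range(T):
        for example in data:
            step += 1
            eta = eta0 / (1.0 + step)        # learning rate decays over time
            w = w - eta * grad(example, w)   # (sub)gradient step
    return w
```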

(26)

AdaGrad

The learning rate is adjusted automatically for each feature (Duchi et al., 2011). There are also RDA, AdaDec, AdaDelta, and others.

g^(t+1) ← g^(t) + ( ∇ℓ(F̃^(t), Ẽ^(t), C̃^(t); w) )²

w^(t+1) ← w^(t) − (η₀ / √g^(t+1)) ∇( ℓ(F̃^(t), Ẽ^(t), C̃^(t); w) + Ω(w) )

(all operations element-wise)
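A per-feature AdaGrad step can be sketched as below; grad is the (sub)gradient of the regularized loss on the current mini-batch as a numpy array, g the accumulated squared gradients, and eps a small constant added only for numerical stability (an assumption, not on the slide).

```python
import numpy as np

def adagrad_step(w, g, grad, eta0=0.1, eps=1e-8):
    """One AdaGrad update (slide 26; Duchi et al., 2011).
    Each feature gets its own effective learning rate eta0 / sqrt(g)."""
    g = g + grad ** 2                          # g^(t+1) = g^(t) + grad^2 (element-wise)
    w = w - eta0 * grad / (np.sqrt(g) + eps)   # per-feature step
    return w, g
```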

(27)

Parallelization

1: procedure ParallelLearn(⟨F, E⟩)
2:   {⟨F_1, E_1⟩, ..., ⟨F_S, E_S⟩} ← Split(⟨F, E⟩)                   (split into shards)
3:   for s ∈ {1, ..., S} in parallel do
4:     w_s ← OnlineLearn(⟨F_s, E_s⟩) or BatchLearn(⟨F_s, E_s⟩)       (learn on each shard)
5:   end for
6:   w ← Mix({w_1, ..., w_S})                                        (mix)
7:   return w
8: end procedure

Because each shard is trained independently, the mixed weights are not guaranteed to be optimal for the whole data.
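A minimal parameter-mixing sketch in Python, assuming a picklable per-shard learner learn_shard(shard) that returns a numpy weight vector; simple averaging stands in for Mix, and as noted above the result is not guaranteed to be globally optimal.

```python
import numpy as np
from multiprocessing import Pool

def parallel_learn(shards, learn_shard, processes=4):
    """Parameter mixing (slide 27; cf. McDonald et al., 2010): train each shard
    independently and average the resulting weight vectors."""
    with Pool(processes) as pool:
        weights = pool.map(learn_shard, shards)   # w_s = Learn(F_s, E_s), in parallel
    return np.mean(weights, axis=0)               # Mix: simple average
```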

(28)

Synchronous updates (McDonald et al., 2010)

Shards 1-5 are trained in parallel; at every iteration (t → t + 1) the parameters are mixed.

(29)

Asynchronous updates

The parameters are updated as each shard finishes its work, without waiting for the others.

(30)

Summary

Online learning: for large-scale data and high-dimensional features

The perceptron, MIRA, SGD, AdaGrad

(31)

References

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proc. of EMNLP 2008, pages 224-233, Honolulu, Hawaii, October.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In Proc. of NAACL-HLT 2009, pages 218-226, Boulder, Colorado, June.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176-181, Portland, Oregon, USA, June.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, March.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, July.

(32)

References (continued)

Kevin Gimpel and Noah A. Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221-231, Montréal, Canada, June.

Ryan McDonald, Keith Hall, and Gideon Mann. 2010. Distributed training strategies for the structured perceptron. In Proc. of NAACL-HLT 2010, pages 456-464, Los Angeles, California, June.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of ACL 2002, pages 295-302, Philadelphia, Pennsylvania, USA, July.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proc. of EMNLP-CoNLL 2007, pages 764-773, Prague, Czech Republic, June.
