A Study of Training Initialization Methods for Probabilistic Latent Semantic Analysis in Statistical Language Models


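IPSJ SIG Technical Report, Vol.2013-SLP-97 No.6, 2013/7/26

Consideration about the Sparseness of Parameter Initialization of PLSA Language Models

大島 寛史 1,a)   川端 豪 1

1 Graduate School, Kwansei Gakuin University
a) [email protected]
ⓒ 2013 Information Processing Society of Japan

Summary: Probabilistic latent semantic analysis (PLSA) is a method for building language models that reflect the topic of a document. PLSA self-organizes from the topic unigrams it is given as initial values. This paper examines how the behavior of PLSA as a language model changes depending on whether the initial values used for statistical training contain many zeros. When the initial values contain many zeros, the adjusted perplexity is kept low and training converges at an early stage, but the dependence on the initial values is strong and the proportion of unknown words rises. When the initial values contain no zeros, the dependence on the initial values weakens, but the adjusted perplexity is several tens of percent higher than with zeros, and the number of EM steps needed for convergence increases substantially.

Abstract: PLSA (Probabilistic Latent Semantic Analysis) is a promising technology for reducing perplexity in speech recognition systems. In this method, the topic structure is self-organized as topic unigram vectors. This paper describes parameter initialization methods that take the sparseness of the topic vectors into account. Perplexity reduction experiments show that sparse initialization of the topic vectors enables faster and more accurate organization of topic clusters. In this case, however, the ratio of unknown words increases, and so does the dependence on the selection of the initial data.

Keywords: PLSA, language model, EM algorithm

1. Introduction

N-gram models are currently the standard language models for large-vocabulary continuous speech recognition [1]. Because an N-gram model estimates probabilities from the N − 1 immediately preceding words, it can reflect only a small part of the features carried by the context as a whole, such as the topic or the speaking style. Incorporating such contextual information therefore makes it possible to optimize the language model beyond its current state.

Methods for reflecting topic and style in a language model fall broadly into two types. The first collects documents related to the topic estimated from recognition results and rebuilds the language model with them as training data. Work on this approach includes collecting training data from the WWW by constructing effective search queries [2], [3].

The second estimates topics from training data in which multiple topics are mixed. Examples include a method that controls topics with a Markov model [4] and work on task adaptation using MAP estimation [5]. The advantage of this approach is that there is no need to prepare a set of documents whose topics resemble that of the target document. Probabilistic latent semantic analysis [6] is another means of modeling topics. It identifies the topic of the material to be recognized and weights the word occurrence probabilities accordingly, thereby adapting the language model to the task. Follow-up studies include methods for optimizing training [7], an examination of vocabulary partitioning [8], improvement of the speech recognition rate through speaker adaptation [9], and topic adaptation using search-term weights obtained from the WWW [10].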

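However, several problems remain, such as the strong dependence of PLSA performance on the initial values supplied for statistical training.

2. Topic Models Based on Word Frequency Distributions

An important role of the language model in speech recognition is to raise recognition performance by using linguistic information such as context and topic to narrow down the sentence patterns and vocabulary of the speech being uttered. In methods that set some topic and restrict the expected vocabulary, it is difficult to specify by hand the kinds of topics and the vocabulary sets that correspond to them. Solving this problem calls for a method that determines these elements automatically.

In this section, we take probabilistic latent semantic analysis, one such method, and observe how its behavior as a statistical language model depends on the initial values given to the EM algorithm used for its statistical training, in particular on whether those initial values contain zeros, in order to capture the changes in performance and in dependence on the initial values.

2.1 Framework of Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis (PLSA) [6] models topics on the basis of the word occurrence frequencies obtained from training data. PLSA differs from other topic-modeling methods such as k-means [11], [12] in that it remains effective even for complex material in which several topics are intermingled. Internally, PLSA holds one unigram model (a vector of word occurrence probabilities) per topic, each reflecting the characteristics of that topic. By mixing these topic unigrams appropriately, it yields a unigram optimized for the target topic (Figure 1).

Figure 1: Conceptual diagram of PLSA.

The occurrence probability P(w|h) of word w reflecting topic h in PLSA is given by Eq. (1):

$$P(w \mid h) = \sum_{z \in Z} P(w \mid z)\, P(z \mid h) \tag{1}$$

Here P(w|z) is the probability that topic unigram z assigns to word w, and P(z|h) is the mixing ratio of the topic unigrams that is optimal for the target topic h.

2.2 Learning the Topic Set and Topic Unigrams with PLSA

As internal parameters, PLSA has P(w|z), the occurrence probability of word w in topic z, and P(z|d), the mixing ratio of the topic unigrams in document d. P(z|d) can be derived from P(z) and P(d|z) by Bayes' theorem. With the per-document word counts of the training data as input, the EM algorithm is iterated to maximize the log-likelihood of Eq. (2), which learns an arbitrary number of topic unigrams:

$$l(\theta; N) = \sum_{w \in W} \sum_{d \in D} n(d, w) \log \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \tag{2}$$

where n(d, w) is the number of occurrences of word w in document d.

Statistical training uses an iterative procedure called the Tempered EM algorithm (T-EM), defined by Eqs. (3) to (6).

E-step:

$$P^{(k)}(z \mid d, w) = \frac{\{P^{(k)}(z)\, P^{(k)}(d \mid z)\, P^{(k)}(w \mid z)\}^{\beta}}{\sum_{z' \in Z} \{P^{(k)}(z')\, P^{(k)}(d \mid z')\, P^{(k)}(w \mid z')\}^{\beta}} \tag{3}$$

M-step:

$$P^{(k+1)}(w \mid z) = \frac{\sum_{d \in D} n(d, w)\, P^{(k)}(z \mid d, w)}{\sum_{w' \in W} \left\{ \sum_{d \in D} n(d, w')\, P^{(k)}(z \mid d, w') \right\}} \tag{4}$$

$$P^{(k+1)}(d \mid z) = \frac{\sum_{w \in W} n(d, w)\, P^{(k)}(z \mid d, w)}{\sum_{d' \in D} \left\{ \sum_{w \in W} n(d', w)\, P^{(k)}(z \mid d', w) \right\}} \tag{5}$$

$$P^{(k+1)}(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d, w)\, P^{(k)}(z \mid d, w)}{\sum_{w \in W} \sum_{d \in D} n(d, w)} \tag{6}$$

Alternating the E-step and the M-step produces a model that maximizes Eq. (2).

2.3 Topic Identification with PLSA

PLSA weights the topics and mixes them to suit a target document. The target document is, for example, the recognition result of the immediately preceding utterance, or the recognition candidates for the speech about to be recognized. How strongly such a document belongs to each of the topics held in the PLSA model is estimated as the mixing ratio of the topic unigrams, by optimizing that ratio for the target document through maximum-likelihood estimation. As in training, T-EM is used, here with Eqs. (7) and (8).

E-step:

$$P^{(k)}(z \mid h, w) = \frac{\{P(z)\, P^{(k)}(h \mid z)\, P(w \mid z)\}^{\beta}}{\sum_{z' \in Z} \{P(z')\, P^{(k)}(h \mid z')\, P(w \mid z')\}^{\beta}} \tag{7}$$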
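The T-EM updates of Eqs. (3) to (6) can be sketched in a few lines. The following is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function name `tem_step`, the dense nested-list layout, the toy counts, and the toy β schedule are all made up for this example.

```python
def tem_step(n, p_z, p_d_z, p_w_z, beta=1.0):
    """One tempered-EM iteration for PLSA, following Eqs. (3)-(6).

    n[d][w]     : count of word w in document d
    p_z[z]      : P(z)
    p_d_z[z][d] : P(d|z)
    p_w_z[z][w] : P(w|z)
    Returns the updated (p_z, p_d_z, p_w_z).
    """
    Z, D, W = len(p_z), len(n), len(n[0])
    acc_w = [[0.0] * W for _ in range(Z)]  # sum over d of n(d,w) P(z|d,w)
    acc_d = [[0.0] * D for _ in range(Z)]  # sum over w of n(d,w) P(z|d,w)
    for d in range(D):
        for w in range(W):
            if n[d][w] == 0:
                continue
            # E-step, Eq. (3): tempered posterior over topics
            joint = [(p_z[z] * p_d_z[z][d] * p_w_z[z][w]) ** beta
                     for z in range(Z)]
            total = sum(joint)
            if total == 0.0:
                continue  # no topic can generate this (d, w) pair
            for z in range(Z):
                r = n[d][w] * joint[z] / total
                acc_w[z][w] += r
                acc_d[z][d] += r
    # M-step, Eqs. (4)-(6)
    mass = [sum(row) for row in acc_w]  # per-topic expected count
    grand = sum(mass)                   # equals sum of n(d,w) if no pair was skipped
    safe = [m if m > 0.0 else 1.0 for m in mass]  # avoid 0/0 for an empty topic
    new_p_w_z = [[acc_w[z][w] / safe[z] for w in range(W)] for z in range(Z)]
    new_p_d_z = [[acc_d[z][d] / safe[z] for d in range(D)] for z in range(Z)]
    new_p_z = [m / grand for m in mass]
    return new_p_z, new_p_d_z, new_p_w_z


# Toy data (hypothetical): 3 documents, 4 words, 2 topics.
counts = [[4, 2, 0, 0],
          [0, 1, 3, 2],
          [1, 1, 1, 1]]
p_z = [0.5, 0.5]
p_d_z = [[1 / 3] * 3 for _ in range(2)]
p_w_z = [[0.4, 0.3, 0.2, 0.1],
         [0.1, 0.2, 0.3, 0.4]]
for beta in (1.0, 1.0, 0.9, 0.9):  # toy schedule in which beta decreases
    p_z, p_d_z, p_w_z = tem_step(counts, p_z, p_d_z, p_w_z, beta)
```

The dense triple loop keeps the correspondence with the equations visible. A practical implementation would iterate only over nonzero counts and would work with sparse arrays.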

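M-step:

$$P^{(k+1)}(h \mid z) = \frac{\sum_{w \in W} n(h, w)\, P^{(k)}(z \mid h, w)}{\sum_{z' \in Z} \left\{ \sum_{w \in W} n(h, w)\, P^{(k)}(z' \mid h, w) \right\}} \tag{8}$$

2.4 Annealing

In T-EM, an operation called annealing is applied to improve training speed and to keep training from falling into local optima. It differs from the ordinary EM algorithm in that, as in Eq. (3), the terms on the right-hand side are raised to the power β (β ≥ 0) in the E-step. With β = 1.0 it coincides with the ordinary EM algorithm. When β is smaller than 1.0 the likelihood function is smoothed, which among other effects helps prevent convergence to local optima.

T-EM varies β as the iterations proceed. The procedure for varying β is called the annealing schedule, and there are broadly two kinds. One starts β below 1.0 and increases it gradually until it reaches 1.0. This is called DAEM (Deterministic Annealing EM), and it prevents convergence to local optima in the early stage of training. The other starts β at 1.0 and decreases it as training proceeds. This is called inverse annealing, and it accelerates training and also prevents overfitting. Because the aim here is to speed up training, this paper trains with inverse annealing.

3. Examination of Initial Values in PLSA Training

PLSA training begins by fixing the number of topics L. The word unigrams of L documents drawn at random from the training data serve as the initial values of the topic unigrams. We compare the adjusted perplexity of the resulting PLSA models when the initial values are sparse unigrams containing many zeros and when every element is given some nonzero value.

3.1 Zero as an Initial Value

From Eqs. (3) and (4), if at initialization there is a word w with P^{(0)}(w|z) = 0 for a particular z, then P^{(k)}(w|z) = 0 for every k: for β > 0 the numerator of Eq. (3) vanishes for that pair (w, z), so the posterior is zero for every document, and Eq. (4) returns zero again. PLSA is therefore expected to behave differently when zeros are given to P(w|z) at initialization than when other values are given.

If the initial values contain many zeros, no computation is needed for the zero entries, so the computational cost of training can be reduced. On the other hand, the characteristics of the sampled documents are strongly reflected, so there is a concern that dependence on the initial values will increase. The unknown-word rate may also rise. Conversely, when the initial values contain no zeros, convergence may become slow or the optimum may never be reached, although the dependence on initial values is expected to improve considerably.

We therefore compared the performance of PLSA as a statistical language model, and its dependence on initial values, with and without zeros in the initial values.

3.2 Experimental Conditions

P(w|z) in Eq. (1) is initialized in the following three ways.

1. L documents are chosen at random, and the L topic unigrams are initialized from the word frequency distributions of those documents. This is done for five different random seeds.

2a. Every element of the topic unigrams obtained in 1. is floored uniformly. Specifically, starting from P_1(w|z) obtained in 1.,

$$P_{2a}(w \mid z) = p \cdot P_1(w \mid z) + (1 - p) \cdot \frac{1}{n(W)} \tag{9}$$

where n(W) is the vocabulary size and p = 10^{-6}. Flooring is applied to each of the five initializations produced in 1. with different random seeds.

2b. The values obtained in 1. are smoothed with the unigram estimated from all of the training data. Specifically, starting from P_1(w|z) obtained in 1.,

$$P_{2b}(w \mid z) = p \cdot P_1(w \mid z) + (1 - p) \cdot \mathrm{uni}_{all}(w) \tag{10}$$

where uni_all(w) is the occurrence probability of word w in the unigram built from all of the training data, and again p = 10^{-6}. Smoothing is applied to each of the five initializations produced in 1. with different random seeds.

The initial values from method 1. contain many zeros, whereas those from methods 2a. and 2b. contain none. Comparing 2a. with 2b. should also reveal how the choice of smoothing affects the experimental results.

Table 1 lists the PLSA training conditions. The training data are 972 lectures from the Corpus of Spontaneous Japanese (CSJ): the 987 real lectures in the corpus minus the 15 lectures held out for evaluation.

Table 1: PLSA training conditions

  Training data              CSJ
  Number of documents        972
  Vocabulary size            approx. 10000
  Number of latent models    50
  EM iterations              100
  Annealing schedule         inverse annealing
  Initial β                  1.0
  Final β                    0.8
  Number of β updates        4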
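The three initializations of Section 3.2 can be sketched as follows. This is an illustrative pure-Python sketch rather than the authors' code: the helper names `init_from_documents` and `interpolate` and the toy count matrix are assumptions, and only p = 10^{-6} is taken from the text.

```python
import random


def init_from_documents(counts, L, seed):
    """Method 1: pick L documents at random and use their relative word
    frequencies as the initial topic unigrams (zeros are kept)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(counts)), L)
    return [[c / sum(counts[d]) for c in counts[d]] for d in chosen]


def interpolate(p1, background, p):
    """Eqs. (9) and (10): p * P1(w|z) + (1 - p) * background(w)."""
    return [[p * x + (1.0 - p) * b for x, b in zip(row, background)]
            for row in p1]


# Toy count matrix (hypothetical): 4 documents, 4 words, no empty document.
counts = [[4, 2, 0, 0],
          [0, 1, 3, 2],
          [1, 1, 1, 1],
          [0, 0, 5, 1]]
W = len(counts[0])

p1 = init_from_documents(counts, L=2, seed=0)  # method 1 (sparse)

uniform = [1.0 / W] * W                        # 1 / n(W) in Eq. (9)
total = sum(map(sum, counts))
uni_all = [sum(doc[w] for doc in counts) / total
           for w in range(W)]                  # uni_all(w) in Eq. (10)

p = 1e-6                                       # value reported in Section 3.2
p2a = interpolate(p1, uniform, p)              # method 2a
p2b = interpolate(p1, uni_all, p)              # method 2b
```

Changing `seed` reproduces the "five different random seeds" setting. With p this small, `p2a` and `p2b` are dominated by the background distribution, so the randomly chosen documents contribute only a tiny perturbation.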

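The vocabulary consists of the roughly 10,000 words that occur at least 10 times in the training data. The number of latent models is 50, and training runs for 100 EM iterations. The annealing schedule for β in T-EM is inverse annealing, with β updated in five stages so that its final value is 0.8.

Table 2 lists the PLSA adaptation conditions used for evaluation. The test set is 15 lectures drawn at random from the 987 real lectures in CSJ. Evaluation uses 60 EM iterations, and the annealing schedule during adaptation is inverse annealing, updated in two stages, to prevent overfitting. The evaluation measure is the adjusted perplexity on the test set, computed by Eq. (11):

$$\mathrm{APP} = \{P(w_1, w_2 \ldots w_n) \cdot m^{-o}\}^{-\frac{1}{n}} \tag{11}$$

where P(w_1, w_2 ... w_n) is the probability that the word sequence w_1, w_2 ... w_n is generated, o is the number of unknown words, and m is the number of distinct unknown words.

P(d|z) and P(z) are initialized by Eqs. (12) and (13):

$$P(d \mid z) = \frac{1}{n(D)} \tag{12}$$

$$P(z) = \frac{1}{L} \tag{13}$$

where n(D) is the number of documents in the training data and L is the number of latent models.

Table 2: PLSA adaptation conditions

  EM iterations              60
  Annealing schedule         inverse annealing
  Initial β                  1.0
  Final β                    0.9
  Number of β updates        2

3.3 Experimental Results

Figure 2 shows a scatter plot of PLSA performance for each initialization method. The horizontal axis is the test set used and the vertical axis is the adjusted perplexity. For each test set, the five results from method 1. (red circles), the five from 2a. (green asterisks), and the five from 2b. (blue plus signs) are plotted. The green and blue points, however, almost coincide and appear as a single point.

Figure 2: Scatter plot of PLSA performance by initialization method.

The figure shows that method 1. gives the lowest adjusted perplexity. At the same time, the adjusted perplexity under method 1. varies by up to about 15% depending on the random seed. Method 1., whose initial values contain many zeros, thus shows strong dependence on the initial values. For 2a. and 2b., in contrast, the random seed makes almost no difference. Two explanations are possible. One is that excluding zeros at initialization weakens the dependence on initial values. The other is that excluding zeros increases the number of EM iterations needed for training, or prevents the optimum from being reached. Moreover, because the number of EM iterations in training was held constant in this experiment, any change in the speed of EM training caused by the initial values may have appeared as a difference in performance.

3.4 Training Behavior for Different Kinds of Initial Values

To observe how training behaves for initial values of different character, we kept the conditions of 3.3 but raised the number of EM iterations to 200, and observed the change in adjusted perplexity every 10 iterations.

The results are shown in Figures 3 and 4. The vertical axis is the adjusted perplexity and the horizontal axis is the number of EM training iterations. In each figure the lower points are for initialization by method 1. and the upper points for initialization by method 2b.

Observing the training behavior on the 15 documents of the test set, we found that it falls broadly into these two types. In cases like Figure 3, the adjusted perplexity decreases gently as training proceeds, which can be interpreted as an increase in the number of iterations needed when the initial values contain no zeros. In cases like Figure 4, performance does not improve however long training continues, which suggests early convergence to a local solution.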
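The adjusted perplexity of Eq. (11), the measure tracked in Figures 2 to 4, is easiest to evaluate in the log domain. The sketch below is an assumption-laden illustration: the function name and argument names are hypothetical, and the source does not say how P(w_1 ... w_n) scores unknown words or whether n counts them, so the caller is assumed to supply a log-probability and a length that already follow the intended convention.

```python
import math


def adjusted_perplexity(log_prob, n, num_unknown, unknown_types):
    """Eq. (11): APP = (P(w_1..w_n) * m**(-o)) ** (-1/n), in the log domain.

    log_prob      : natural log of P(w_1, ..., w_n)
    n             : number of words in the sequence
    num_unknown   : o, the number of unknown-word tokens
    unknown_types : m, the number of distinct unknown words
    """
    # log(m**(-o)) = -o * log(m); skip the term when there are no unknown words
    penalty = num_unknown * math.log(unknown_types) if num_unknown else 0.0
    return math.exp(-(log_prob - penalty) / n)


# Toy numbers (hypothetical): 100 words, each with probability 0.01.
lp = 100 * math.log(0.01)
app_no_unk = adjusted_perplexity(lp, n=100, num_unknown=0, unknown_types=0)
app_with_unk = adjusted_perplexity(lp, n=100, num_unknown=10, unknown_types=50)
```

With no unknown words the value reduces to the ordinary perplexity (about 100 for the toy numbers). Each unknown token multiplies the sequence probability by 1/m, so a model that maps more words to the unknown class is penalized instead of rewarded.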

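Figure 3: Difference in PLSA training progress by initialization method (1).

Figure 4: Difference in PLSA training progress by initialization method (2).

4. Conclusion

This paper observed how the behavior of probabilistic latent semantic analysis as a statistical language model changes with the way initial values are given, in particular with whether the initial values supplied for training contain zeros.

When many zeros were allowed in the initial values, training finished early and the adjusted perplexity was also good. The drawback is a large variation in performance with the random seed. This initialization was also confirmed to raise the unknown-word rate, which leaves doubts about its performance when actually brought into speech recognition. When the initial values contained no zeros, training became slower, and in some cases convergence to a local optimum was observed. With this approach the unknown-word rate was held to less than half of that with zeros.

References

[1] 北 研二: Probabilistic Language Models (in Japanese), University of Tokyo Press (1999).
[2] 梶浦泰智, 鈴木基之, 伊藤彰則, 牧野正三: A method for determining effective search queries in unsupervised task adaptation of language models using the WWW (in Japanese), IPSJ SIG Technical Report, 2006-SLP-64 (2006).
[3] 増村亮, 伊藤仁, 伊藤彰則, 牧野正三: A study of search query composition for language model adaptation using the WWW (in Japanese), IPSJ SIG Technical Report, Vol.2009-SLP-76 No.10 (2009).
[4] 長野雄, 鈴木基之, 牧野正三: Construction of language models from multiple n-gram models using an HMM (in Japanese), IPSJ Journal, Vol.J43, No.7, pp.2075-2081 (2002).
[5] 政瀧浩和, 匂坂芳典, 久木和也, 河原達也: Task adaptation of N-gram language models using MAP estimation (in Japanese), IEICE Technical Report, SP96-103 (1997).
[6] Thomas Hofmann: Probabilistic Latent Semantic Analysis, Uncertainty in Artificial Intelligence (1999).
[7] Daniel Gildea, Thomas Hofmann: Topic-Based Language Models Using EM, EuroSpeech '99, pp.2167-2170 (1999).
[8] 栗山直人, 鈴木基之, 伊藤彰則, 牧野正三: A study of training optimization and vocabulary partitioning for PLSA language models (in Japanese), IPSJ SIG Technical Report, 2006-SLP-60 (2006).
[9] 秋田祐哉, 河原達也: Language model adaptation based on PLSA of topics and speakers (in Japanese), IPSJ SIG Technical Report, 2003-SLP-49 (2003).
[10] 宮崎将隆: PLSA language models based on Term Frequency information obtained from the WWW (in Japanese), IPSJ SIG Technical Report, 2011-SLP-85, No.14 (2011).
[11] 宮本定明: Introduction to Cluster Analysis (in Japanese), Morikita Publishing (1999).
[12] 岸田和明: Techniques of document clustering (in Japanese), Library and Information Science, Vol.49, pp.33-75 (2003).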
