A Neural Language Model for Dynamically Representing the Meanings of Unknown Words and Entities in a Discourse

Full text

(1) A Neural Language Model for Dynamically Representing the Meanings of Unknown Words and Entities in a Discourse • Sosuke Kobayashi, Naoaki Okazaki, Kentaro Inui • Tokyo Institute of Technology. Tohoku University / RIKEN.

(2) RNN Language Model. • Output matrix: calculate probability • RNN matrices: encoding context • Embedding matrix: represent word meaning.
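
The three matrices on this slide can be sketched as one step of a toy language model. This is only an illustration under assumptions: a plain tanh (Elman) recurrence, random weights, and the hypothetical names `rnn_lm_step`, `w_in` and `w_rec`; the slide does not say which RNN cell the authors use.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_lm_step(word_id, h_prev, embedding, w_in, w_rec, output):
    """One step: embed the word, update the context, score the next word."""
    x = embedding[word_id]  # embedding matrix: represents word meaning
    pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, h_prev))]
    h = [math.tanh(p) for p in pre]  # RNN matrices: encode the context
    probs = softmax(matvec(output, h))  # output matrix: next-word probabilities
    return h, probs


def random_matrix(rows, cols, rng):
    """Toy initialisation; a real model would learn these weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab_size, dim = 5, 4  # assumed toy sizes
embedding = random_matrix(vocab_size, dim, rng)
w_in = random_matrix(dim, dim, rng)
w_rec = random_matrix(dim, dim, rng)
output = random_matrix(vocab_size, dim, rng)

h = [0.0] * dim
for word_id in [0, 3, 1]:  # a toy sequence of word ids
    h, probs = rnn_lm_step(word_id, h, embedding, w_in, w_rec, output)
```

Both `embedding` and `output` have one row per vocabulary word, which is why a fixed vocabulary cannot cover every word or entity.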

(3) Neural Language Model • Embedding matrix and output matrix • Cannot cover all words → unknown words • Referents differ from discourse to discourse → unknown entities.

(4) Neural Language Model • Embedding matrix and output matrix • Cannot cover all words → unknown words, e.g. mumps, ceraunomancy, … are mapped to <UNK> • Referents differ from discourse to discourse → unknown entities, e.g. “John”, “Mary”, …

(5) Dynamic Entity Representation [Kobayashi et al. NAACL 2016] • The meaning representation of an unknown cannot be obtained statically ↓ Dynamically update the meaning representation while reading the text • Infer on-the-fly meanings from context • “she contracted mumps” → mumps is a disease? • “John loves Fender” → “John” is a guitarist?

(6) Usage: Input Embedding • Language models encode context words and predict next words • Input word embeddings can be replaced • Dynamic modeling makes the context informative • “… with him, John played [???]” • With the dynamic model: “… with him, <John; guitarist> played [???]”

(7) Usage: Output Matrix • Language models encode context words and predict next words • Rows of the output matrix can be replaced • Dynamic modeling makes the target informative • “… she is a big fan of [???]” → John? Mary? • With the dynamic model: “… she is a big fan of [???]” → <guitarist>? <mother>?
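
The two usages (replacing an input embedding, replacing a row of the output matrix) can be sketched as dictionary lookups. This is a minimal sketch with made-up 2-dimensional vectors; `input_vector` and `next_token_distribution` are hypothetical names, and for brevity one `static` table stands in for both the embedding matrix and the output matrix, which are separate in a real model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def input_vector(token, static_embeddings, dynamic_vectors):
    """Input side: use the dynamic vector when the token is a tracked entity."""
    fallback = static_embeddings["<UNK>"]
    return dynamic_vectors.get(token, static_embeddings.get(token, fallback))


def next_token_distribution(hidden, static_rows, dynamic_vectors):
    """Output side: dynamic vectors act as extra rows of the output matrix."""
    rows = {**static_rows, **dynamic_vectors}
    scores = {token: dot(row, hidden) for token, row in rows.items()}
    top = max(scores.values())
    exps = {token: math.exp(s - top) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


# Assumed toy values, not taken from the slides.
static = {"music": [0.2, 0.1], "guitars": [0.5, 0.0], "<UNK>": [0.0, 0.0]}
dynamic = {"[UNK1]": [2.0, 0.0], "[UNK2]": [-1.0, 0.5]}
hidden = [1.0, 0.0]  # context state after "... she is a big fan of"

distribution = next_token_distribution(hidden, static, dynamic)
best = max(distribution, key=distribution.get)
```

With static rows alone, both entities would share the single `<UNK>` row and could not be told apart as prediction targets.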

(8) Recipe: Context Encoding • Encode context of the target word • e.g. bi-directional RNN.
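
One way to read "encode the context of the target word" with a bi-directional RNN is sketched below: a forward pass over the tokens before the target, a backward pass over the tokens after it, and a concatenation of the two final states. The tanh recurrence, the weights, and the names `rnn_final_state` and `encode_context` are assumptions for illustration, not the authors' implementation.

```python
import math


def rnn_final_state(vectors, w_in, w_rec):
    """Run a plain tanh RNN over the vectors and return the last hidden state."""
    h = [0.0] * len(w_rec)
    for x in vectors:
        h = [
            math.tanh(
                sum(a * b for a, b in zip(w_in[i], x))
                + sum(a * b for a, b in zip(w_rec[i], h))
            )
            for i in range(len(h))
        ]
    return h


def encode_context(token_vectors, target_index, forward_weights, backward_weights):
    """Encode what surrounds the target; the target token itself is skipped."""
    left = token_vectors[:target_index]  # read left to right
    right = token_vectors[target_index + 1:][::-1]  # read right to left
    forward = rnn_final_state(left, *forward_weights)
    backward = rnn_final_state(right, *backward_weights)
    return forward + backward  # concatenation of both directions


# Assumed toy weights: (input-to-hidden, hidden-to-hidden) per direction.
forward_weights = ([[0.5, -0.2], [0.1, 0.3]], [[0.1, 0.0], [0.0, 0.1]])
backward_weights = ([[-0.3, 0.4], [0.2, 0.2]], [[0.2, 0.0], [0.0, 0.2]])

sentence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]  # toy word vectors
context = encode_context(sentence, 2, forward_weights, backward_weights)
```

Because the target position is skipped, the encoding depends only on the surrounding words, so it is still available when the target itself is unknown.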


(10) Recipe: Context Merging • Merge multiple contexts where the target occurs • e.g. RNN, max-pooling.
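
Merging can be sketched for the max-pooling case named on the slide, together with the simplest alternative of keeping only the latest context. The context vectors are assumed to be already encoded, and `merge_max_pool` and `merge_only_latest` are hypothetical names; an RNN-based merge would need learned weights and is not shown.

```python
def merge_max_pool(context_vectors):
    """Element-wise maximum over all contexts in which the target occurred."""
    return [max(values) for values in zip(*context_vectors)]


def merge_only_latest(context_vectors):
    """Keep only the most recent context and ignore the earlier ones."""
    return list(context_vectors[-1])


# Assumed toy context vectors for three occurrences of the same entity.
contexts = [[0.1, -0.3], [0.4, -0.5], [-0.2, 0.9]]
pooled = merge_max_pool(contexts)
latest = merge_only_latest(contexts)
```

Max-pooling is order-independent and adds no parameters, whereas an RNN merge can weigh earlier and later occurrences differently.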



(13) Dataset for Evaluation • Dataset for language modeling built from OntoNotes • Coreferents are unified and anonymized: John, he, … → [UNK1]; Mary, she, … → [UNK2] • RAW: John loves guitars. Mary did not prefer music. But, many people are big fans of him. … • OURS: [UNK1] loves guitars. [UNK2] did not prefer music. But, many people are big fans of [UNK1]. …
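
The anonymization step can be sketched on the slide's own example. The sketch assumes that coreference clusters are already given as lists of token positions and that every mention is a single token; `anonymize` is a hypothetical helper, not the authors' preprocessing script.

```python
def anonymize(tokens, clusters):
    """Replace every mention of the k-th entity cluster with the token [UNKk]."""
    replaced = list(tokens)
    for entity_number, positions in enumerate(clusters, start=1):
        for position in positions:
            replaced[position] = f"[UNK{entity_number}]"
    return replaced


raw = (
    "John loves guitars . Mary did not prefer music . "
    "But , many people are big fans of him ."
).split()

# Assumed gold clusters: {John, him} at positions 0 and 18, {Mary} at position 4.
clusters = [[0, 18], [4]]
ours = " ".join(anonymize(raw, clusters))
```

After this step the surface form carries no information about the entity, so a model can only learn about [UNK1] from the contexts in which it appears.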

(14) Result: Language Modeling • Dynamic modeling improves perplexity • Especially when entities reappear
Perplexity (lower is better); the +Dynamic columns add dynamic representations to the baseline:
                         Baseline   +Dynamic: input only   output only   input & output
All tokens                 64.8            62.8               62.5        60.7 (4.1 below baseline)
Reappearing entities       48.0            42.4               35.9        34.0 (14.0 below baseline)
Tokens following them     128.6           109.5              129.0       106.8 (21.8 below baseline)

(15) Result: Language Modeling • Dynamic modeling improves perplexity • Especially when entities reappear • Highlight: on reappearing entities, the output-only model lowers perplexity from 48.0 (baseline) to 35.9, a gain of 12.1 • Example: “… she is a big fan of [???]” with target <John; guitarist>.

(16) Result: Language Modeling • Dynamic modeling improves perplexity • Especially when entities reappear • Highlight: on tokens following reappearing entities, the input-only model lowers perplexity from 128.6 (baseline) to 109.5, a gain of 19.1 • Example: “… <John; guitarist> [???]”.
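
For reference, perplexity over a subset of tokens (all tokens, entity tokens only, and so on) can be computed as below. This uses the standard definition, the exponential of the mean negative log-probability; the token probabilities are made up and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(probabilities):
    """exp of the average negative log-probability; lower is better."""
    total_log = sum(math.log(p) for p in probabilities)
    return math.exp(-total_log / len(probabilities))


# Assumed toy model outputs: (token, probability assigned to that token).
predictions = [
    ("[UNK1]", 0.5),
    ("loves", 0.25),
    ("guitars", 0.125),
    ("[UNK1]", 0.25),
]

all_tokens = perplexity([p for _, p in predictions])
entities_only = perplexity([p for token, p in predictions if token.startswith("[UNK")])
uniform_over_four = perplexity([0.25] * 4)  # a uniform guess among 4 options
```

Restricting the list of probabilities to entity tokens, or to the tokens right after them, gives per-subset scores of the kind reported in the rows of the results slides.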


(18) Result: Language Modeling
Table 3: Results for models with different merging functions on the test set of the Anonymized Language Modeling dataset, reported as in Table 2.
Models                   Merging function   # of parameters (to be finetuned)   (1) All     (2) Reappearing entities   (3) Following entities   (4) Non-entities
Only dynamic input       GRU-ReLU           18.9M (14.2M)                       62.8±0.3    42.4±1.1                   109.5±1.4                66.4±0.3
Only dynamic input       GRU                18.9M (14.2M)                       63.2±0.4    43.3±2.7                   111.2±0.7                66.8±0.4
Only dynamic input       Max pool.          17.3M (12.6M)                       63.6±0.4    45.0±2.6                   116.0±1.0                67.0±0.2
Only dynamic input       Only latest        17.3M (12.6M)                       64.0±0.4    44.1±1.6                   127.6±0.7                67.5±0.2
Only dynamic output      GRU-ReLU           18.9M (14.2M)                       62.5±0.3    35.9±3.7                   129.0±0.7                69.5±0.3
Only dynamic output      GRU                18.9M (14.2M)                       62.6±0.2    39.0±2.0                   121.1±8.3                69.1±0.2
Only dynamic output      Max pool.          17.3M (12.6M)                       62.2±0.4    41.1±1.9                   126.9±1.5                68.4±0.6
Only dynamic output      Only latest        17.3M (12.6M)                       64.9±0.1    49.8±1.8                   129.1±1.6                70.6±0.2
Dynamic input & output   GRU-ReLU           19.2M (14.4M)                       60.7±0.2    34.0±1.3                   106.8±0.6                67.6±0.0
Dynamic input & output   GRU                19.2M (14.4M)                       60.9±0.3    37.5±0.3                   108.9±0.8                67.2±0.4
Dynamic input & output   Max pool.          17.6M (12.9M)                       60.7±0.3    39.5±3.4                   107.5±1.3                66.8±0.8
Dynamic input & output   Only latest        17.6M (12.9M)                       63.4±0.2    47.9±4.2                   116.4±0.4                68.9±0.1
Baseline                                    12.3M (12.3M)                       64.8±0.6    48.0±2.6                   128.6±2.0                68.5±0.2
• Dynamic modeling works well for long documents: the later in a document and the more often targets occur, the larger the improvement • Organizing context is useful for long documents: the more targets occur, …
[Figure 4: Perplexity of all tokens relative to the time at which they appear in the document (x-axis: t-th token; Baseline vs. Proposed).]
[Figure 5: Perplexity of tokens following entities relative to the time at which the entity occurs (x-axis: t-th occurrence of entities; Baseline vs. Proposed).]
[Figure 6: Perplexity of entities relative to the number of antecedent entities (x-axis: # of antecedent entities; Baseline vs. Proposed).]

(19) Summary • Dynamic modeling of word vectors improves language models: for prediction of the unknowns, and for prediction of tokens following the unknowns • Future work: story generation with organizing entities; joint modeling with coreference resolution; joint modeling with character/subword vectors.

(20) Result: Cloze Test • Pseudo coreference resolution task • Solve the task by filling the blank with each candidate entity and calculating the sentence likelihood • Example: [UNK1] loves guitars. [UNK2] did not prefer music. But, many people are big fans of [???]. … • Mean Quantile (mean rank of answers) improves from .525 to .642 with dynamic modeling.
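
The scoring of this cloze test can be sketched as follows. The slide only describes Mean Quantile as the "mean rank of answers", so the exact formula here is an assumption: the quantile of an answer is taken to be the fraction of competing entities whose filled-in sentence gets a lower likelihood (1.0 is best). The log-likelihoods are made up and the function names are hypothetical.

```python
def quantile_of_answer(scores, answer):
    """Fraction of competing candidates that score below the correct entity."""
    others = [score for entity, score in scores.items() if entity != answer]
    if not others:
        return 1.0  # a single candidate is trivially ranked first
    beaten = sum(1 for score in others if scores[answer] > score)
    return beaten / len(others)


def mean_quantile(examples):
    """Average the answer quantile over all cloze questions."""
    return sum(quantile_of_answer(s, a) for s, a in examples) / len(examples)


# Each example: (sentence log-likelihood per filled-in entity, correct entity).
examples = [
    ({"[UNK1]": -10.0, "[UNK2]": -12.0, "[UNK3]": -15.0}, "[UNK1]"),
    ({"[UNK1]": -9.0, "[UNK2]": -8.0, "[UNK3]": -11.0}, "[UNK1]"),
]
score = mean_quantile(examples)
```

In the first question the answer beats both competitors and in the second it beats one of two, so the sketch averages quantiles of 1.0 and 0.5.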
