
Hierarchical Latent Words Language Models for Robust Modeling to Out Of Domain Tasks


Academic year: 2018


Copyright©2015 NTT corp. All Rights Reserved.

Ryo MASUMURA, Taichi ASAMI, Takanobu OBA, Hirokazu MASATAKI, Sumitaka SAKAUCHI and Akinori ITO

NTT Corporation, Japan; Tohoku University, Japan

Hierarchical Latent Words Language Models for Robust Modeling to Out-Of-Domain Tasks

Motivation

To construct an LM that achieves high performance not only on in-domain tasks but also on out-of-domain tasks

Overview

➢ We propose hierarchical latent words language models (h-LWLMs)

• The key advance is introducing multiple latent word spaces with a hierarchical structure into LWLMs

• It can handle linguistic phenomena not present in the training data

Results

➢ h-LWLM offers improved robustness to out-of-domain tasks

• h-LWLM is superior to the standard LWLM in terms of PPL and WER

• h-LWLM is significantly superior to the conventional n-gram LM or the recurrent neural network LM on out-of-domain tasks

2. Hierarchical latent words language models (h-LWLMs)

➢ h-LWLMs are novel LWLMs with multiple latent word spaces that are hierarchically structured

1. Latent words language models (LWLMs)

➢ LWLMs are generative models with a single latent word space [Deschacht+, 2011]

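$$
P(\boldsymbol{w}) = \int \sum_{\boldsymbol{h}} P(\boldsymbol{w} \mid \boldsymbol{h}, \Theta)\, P(\boldsymbol{h} \mid \Theta)\, P(\Theta)\, d\Theta
= \int \sum_{\boldsymbol{h}} \prod_{k=1}^{K} P(w_k \mid h_k, \Theta)\, P(h_k \mid l_k, \Theta)\, P(\Theta)\, d\Theta
$$

where $\boldsymbol{w} = w_1, \dots, w_K$ is the observed word sequence, $\boldsymbol{h} = h_1, \dots, h_K$ is the latent variable sequence, $l_k = h_{k-n+1}, \dots, h_{k-1}$ is the latent word context, and $\Theta$ is the model parameter.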

[Figure: graphical model of an LWLM, showing the latent variable sequence and the observed word sequence]

➢ A latent word is represented as a specific word selected from the entire vocabulary

➢ The number of latent words equals the number of observed words

➢ The latent space helps to efficiently increase robustness to out-of-domain tasks

• LWLM is significantly superior on out-of-domain tasks, while its performance on domain-matched tasks is comparable to conventional LMs [Masumura+ 2013]

Points

Problems

• Standard LWLM simply represents the latent space as an n-gram model of latent words

• Standard LWLM does not model the hierarchy of function and meaning of words

➢ An h-LWLM assumes that there is a latent word behind each latent word

➢ Standard LWLMs correspond to h-LWLMs with just one layer

➢ Latent words in all layers are represented as specific words selected from the entire vocabulary

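$$
P(\boldsymbol{w}) = \int \sum_{\boldsymbol{h}^{(1)}} \cdots \sum_{\boldsymbol{h}^{(M)}} \prod_{k=1}^{K} P(h_k^{(M)} \mid l_k^{(M)}, \Theta) \left[ \prod_{m=2}^{M} P(h_k^{(m-1)} \mid h_k^{(m)}, \Theta) \right] P(w_k \mid h_k^{(1)}, \Theta)\, P(\Theta)\, d\Theta
$$

where $M$ is the number of layers, $\boldsymbol{h}^{(m)} = h_1^{(m)}, \dots, h_K^{(m)}$ is the $m$-th latent variable sequence, and $l_k^{(M)} = h_{k-n+1}^{(M)}, \dots, h_{k-1}^{(M)}$ is the latent word context in the top layer. With $M = 1$ the inner product is empty and the expression reduces to the standard LWLM.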
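The generative process of an h-LWLM (a latent word n-gram in the top layer, then one emission per layer down to the observed word) can be illustrated with a small Python sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the top layer is a bigram rather than a general n-gram, one emission distribution is shared by all layers for brevity, and the tables `TOP_BIGRAM` and `SIMILAR` and the functions `emission`, `sample_from`, and `sample_sentence` are hypothetical names invented for the example.

```python
import random

# Hypothetical toy tables; a trained h-LWLM would estimate these from data.
# Top-layer latent word bigram: P(h_k^(M) | h_{k-1}^(M)).
TOP_BIGRAM = {
    "<s>": {"i": 0.5, "you": 0.5},
    "i": {"like": 1.0},
    "you": {"like": 1.0},
    "like": {"tea": 1.0},
    "tea": {"</s>": 1.0},
}

# Each word has one "similar" word it can be swapped with during emission.
SIMILAR = {"i": "you", "you": "i", "like": "love", "love": "like",
           "tea": "coffee", "coffee": "tea"}


def emission(upper):
    """P(lower | upper): keep the word with prob. 0.8, else emit the similar word.

    Sharing one emission table across layers is a simplification of this sketch.
    """
    return {upper: 0.8, SIMILAR[upper]: 0.2}


def sample_from(dist, rng):
    """Draw one key from a {item: probability} dictionary."""
    items = list(dist)
    weights = [dist[item] for item in items]
    return rng.choices(items, weights=weights, k=1)[0]


def sample_sentence(num_layers, rng, max_len=20):
    """Ancestral sampling: top layer by bigram, then layer by layer down to words."""
    layers = [[] for _ in range(num_layers)]  # layers[-1] is the top layer M
    words = []
    prev = "<s>"
    for _ in range(max_len):
        top = sample_from(TOP_BIGRAM[prev], rng)
        if top == "</s>":
            break
        layers[-1].append(top)
        current = top
        # Emit each lower latent layer from the layer directly above it.
        for m in range(num_layers - 2, -1, -1):
            current = sample_from(emission(current), rng)
            layers[m].append(current)
        # The observed word is emitted from the lowest latent layer.
        words.append(sample_from(emission(current), rng))
        prev = top
    return words, layers


words, layers = sample_sentence(num_layers=3, rng=random.Random(0))
print(words)
```

With `num_layers=1` the inner loop is skipped and the sketch behaves like a standard LWLM, which mirrors the statement that standard LWLMs correspond to h-LWLMs with one layer.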

3. Parameter inference of h-LWLMs

➢ To create the hierarchical latent word structure from training data sets, we also propose a layer-wise inference

➢ The LWLM structure is accumulated recursively by estimating the latent word sequence of an upper layer from the latent word sequence of the layer below

• In line 4, Gibbs sampling is used to estimate the latent word sequence of an upper layer from that of the lower layer

• In line 6, the t-th point-estimated parameter denotes the parameters of the LWLMs of all layers at the t-th iteration
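The layer-wise idea (estimate one latent word sequence with Gibbs sampling, then treat it as the "observed" sequence for the next layer up) can be sketched as follows. This is a minimal sketch under several assumptions and is not the algorithm listing that the line numbers above refer to: it uses a bigram latent model, additive smoothing with a constant `alpha`, a single flat word sequence as training data, and counts recomputed from scratch at every position. The names `gibbs_sweep` and `layerwise_inference` are hypothetical, and the count handling is a simplified approximation of a properly collapsed sampler.

```python
import random
from collections import Counter


def gibbs_sweep(latent, observed, vocab, rng, alpha=0.1):
    """One Gibbs sweep: resample every latent word given all the others."""
    V = len(vocab)
    n = len(observed)
    for k in range(n):
        # Counts from every position except k (recomputed for clarity, not speed).
        rest = [i for i in range(n) if i != k]
        emit = Counter((latent[i], observed[i]) for i in rest)
        emit_total = Counter(latent[i] for i in rest)
        padded = ["<s>"] + latent + ["</s>"]
        # latent[k] sits at padded[k + 1]; skip the two bigrams that touch it.
        bigrams = [(padded[i], padded[i + 1])
                   for i in range(len(padded) - 1) if i not in (k, k + 1)]
        trans = Counter(bigrams)
        ctx = Counter(prev for prev, _ in bigrams)
        prev_h, next_h = padded[k], padded[k + 2]
        weights = []
        for c in vocab:
            # Emission, incoming transition, and outgoing transition terms.
            p_emit = (emit[(c, observed[k])] + alpha) / (emit_total[c] + alpha * V)
            p_in = (trans[(prev_h, c)] + alpha) / (ctx[prev_h] + alpha * (V + 1))
            p_out = (trans[(c, next_h)] + alpha) / (ctx[c] + alpha * (V + 1))
            weights.append(p_emit * p_in * p_out)
        latent[k] = rng.choices(vocab, weights=weights, k=1)[0]
    return latent


def layerwise_inference(words, num_layers, rng, sweeps=10, alpha=0.1):
    """Stack latent layers: each layer is inferred from the one below it."""
    vocab = sorted(set(words))  # latent words come from the same vocabulary
    layers = []
    lower = list(words)
    for _ in range(num_layers):
        upper = list(lower)  # initialise the upper layer as a copy of the lower
        for _ in range(sweeps):
            gibbs_sweep(upper, lower, vocab, rng, alpha)
        layers.append(upper)
        lower = upper  # the next layer treats this one as its observations
    return layers


corpus = "i like tea you like coffee i love tea".split()
for depth, layer in enumerate(layerwise_inference(corpus, 3, random.Random(0)), 1):
    print(depth, layer)
```

In a fuller implementation, the parameter point estimates for each layer would be read off the counts after each iteration; the sketch returns only the sampled latent sequences.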

➢ Advantages

• The hierarchy in the latent space is expected to compute generative probabilities for unseen words more flexibly than non-hierarchical LWLMs

➢ Usage for natural language processing tasks

• It is impractical to apply the h-LWLM directly to natural language processing tasks

• We use an n-gram approximation technique: a large amount of text is generated by random sampling from the trained h-LWLM, and an n-gram model is constructed from the generated text
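The n-gram approximation step can be sketched in Python as counting n-grams over sampled sentences. This is only an illustration under assumptions: `toy_sampler` is a hypothetical stand-in for drawing a sentence from a trained h-LWLM, `ngram_approximation` is an invented name, and the estimates are unsmoothed maximum-likelihood ratios, whereas a practical model would apply a smoothing method.

```python
import random
from collections import Counter


def ngram_approximation(sample_sentence, num_sentences, n, rng):
    """Estimate n-gram probabilities by counting over sampled sentences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for _ in range(num_sentences):
        # Pad so that the first word and the sentence end are also predicted.
        words = ["<s>"] * (n - 1) + sample_sentence(rng) + ["</s>"]
        for i in range(n - 1, len(words)):
            context = tuple(words[i - n + 1:i])
            ngram_counts[context + (words[i],)] += 1
            context_counts[context] += 1
    # Maximum-likelihood estimate: count(context, word) / count(context).
    return {gram: count / context_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}


def toy_sampler(rng):
    """Hypothetical stand-in for sampling one sentence from a trained h-LWLM."""
    subject = rng.choice(["i", "you"])
    verb = rng.choice(["like", "love"])
    obj = rng.choice(["tea", "coffee"])
    return [subject, verb, obj]


probs = ngram_approximation(toy_sampler, num_sentences=5000, n=3,
                            rng=random.Random(0))
print(probs[("<s>", "<s>", "i")])
```

The quality of the approximation depends on how much text is sampled, which is why a large amount of generated data is used.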

[Figure: graphical model of an h-LWLM, showing the observed word sequence and the 1st, ..., (M-1)-th, and M-th latent variable sequences]

4. PPL and speech recognition experiments

➢ Our experiments assume that domain-matched training data is not available

➢ For LM training, we used the Corpus of Spontaneous Japanese (CSJ), whose domain is spontaneous lectures

Data set   Source                    Size
Train      CSJ                       2672 lectures (700M words)
Valid      CSJ                       10 lectures (20K words)
Test 1     Contact center dialogue   20K words
Test 2     Voice mail data           20M words

Decoder: VoiceRex (WFST-based)
Acoustic model: DNN-HMM with 8 layers

➢ Methods

• MKN: 3-gram LM with modified Kneser-Ney smoothing

• HPY: Hierarchical Pitman-Yor 3-gram LM

• RNN: Class-based RNNLM with 500 nodes

• LW: 3-gram approximation of h-LWLM (LW with 1 layer is the standard LWLM)

Results

➢ Perplexity (PPL)

[Figure: PPL versus number of layers (1 to 5) for MKN, HPY, RNN, LW, and LW+HPY, with panels for Valid, Test 1, and Test 2. Value labels in the figure: 188.56, 174.30, 155.24, 151.44, 144.93, 146.76, 140.39, 148.43, 156.52, 138.41, 134.57, 130.14, 129.80, 125.44, 68.08, 67.22, 68.61, 66.24, 60.78, 61.95, 62.41]

➢ Word error rate (WER) [%]

Setup    Layers   Valid   Test 1   Test 2
MKN      -        24.79   38.67    32.31
HPY      -        24.67   38.29    32.00
LW       1        24.54   36.96    30.26
LW       5        24.60   36.49    29.57
LW+HPY   1        23.62   36.49    29.76
LW+HPY   5        23.68   36.03    29.21

Valid

• PPL was not improved by the hierarchical structure in LW

• LW is comparable to MKN and HPY, and inferior to RNN, in terms of PPL

Test

• PPL improved as the number of layers in LW increased

• LW with 5 layers was superior to LW with 1 layer in terms of PPL and WER

• The best results were obtained by LW+HPY with 5 layers
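To put the reported WER figures in perspective, the relative reduction between two setups can be computed directly from the table values. The sketch below only re-uses the numbers reported in the WER table; the helper names `relative_reduction` and `compare` are hypothetical, and relative reduction is one common way to summarise such results, not a metric the poster itself reports.

```python
# WER [%] copied from the reported table: (setup, layers) -> (Valid, Test 1, Test 2).
WER = {
    ("MKN", None): (24.79, 38.67, 32.31),
    ("HPY", None): (24.67, 38.29, 32.00),
    ("LW", 1): (24.54, 36.96, 30.26),
    ("LW", 5): (24.60, 36.49, 29.57),
    ("LW+HPY", 1): (23.62, 36.49, 29.76),
    ("LW+HPY", 5): (23.68, 36.03, 29.21),
}
SPLITS = ("Valid", "Test 1", "Test 2")


def relative_reduction(baseline, system):
    """Relative WER reduction in percent (positive means the system is better)."""
    return 100.0 * (baseline - system) / baseline


def compare(baseline_key, system_key):
    """Relative WER reduction of one setup over another, per data split."""
    return {
        split: round(relative_reduction(b, s), 2)
        for split, b, s in zip(SPLITS, WER[baseline_key], WER[system_key])
    }


print(compare(("MKN", None), ("LW+HPY", 5)))
print(compare(("LW", 1), ("LW", 5)))
```

Against the MKN baseline, LW+HPY with 5 layers gives a relative WER reduction of roughly 6.8% on Test 1 and 9.6% on Test 2, while going from 1 to 5 layers in LW helps on both test sets but not on the validation set.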
