Hierarchical Latent Words Language Models for Robust Modeling to Out Of Domain Tasks


Ryo MASUMURA, Taichi ASAMI, Takanobu OBA, Hirokazu MASATAKI, Sumitaka SAKAUCHI and Akinori ITO

NTT Corporation, Japan; Tohoku University, Japan


Motivation

To construct an LM that achieves high performance not only on in-domain tasks but also on out-of-domain tasks

Overview

We propose hierarchical latent words language models (h-LWLMs)

• The key advance is introducing multiple latent word spaces with a hierarchical structure into LWLMs

• The h-LWLM can handle linguistic phenomena not present in the training data

Results

The h-LWLM offers improved robustness to out-of-domain tasks

• The h-LWLM is superior to the standard LWLM in terms of PPL and WER

• The h-LWLM is significantly superior to the conventional n-gram LM and the recurrent neural network LM in out-of-domain tasks

1. Latent words language models (LWLMs)

LWLMs are generative models with a single latent word space [Deschacht+, 2011]

$$P(\mathbf{w}) = \int \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta)\, P(h_k \mid l_k, \Theta)\, P(\Theta)\, d\Theta \approx \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta_t)\, P(h_k \mid l_k, \Theta_t)$$

Here w_k is the k-th observed word, h_k is its latent word, l_k is the context of preceding latent words, and \Theta_t is the t-th point-estimated parameter set.

[Figure: graphical model of the LWLM, with a latent variable sequence above the observed word sequence]

Points

Each latent word is represented as a specific word selected from the entire vocabulary

The number of latent words equals the number of observed words

The latent space helps to efficiently increase robustness to out-of-domain tasks

• LWLMs were significantly superior to conventional LMs in out-of-domain tasks, while performance was comparable in the domain-matched task [Masumura+ 2013]

Problems

• The standard LWLM simply represents the latent space as an n-gram model of latent words

• The standard LWLM does not model the hierarchy of function and meaning of words

2. Hierarchical latent words language models (h-LWLMs)

H-LWLMs are novel LWLMs with multiple latent word spaces that are hierarchically structured

The h-LWLM assumes that there is a latent word behind each latent word

Standard LWLMs correspond to h-LWLMs with just one layer

Latent words in all layers are represented as specific words selected from the entire vocabulary

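$$P(\mathbf{w}) = \int \prod_{k=1}^{K} \sum_{h_k^{(1)}} \cdots \sum_{h_k^{(M)}} P(w_k \mid h_k^{(1)}, \Theta) \cdots P(h_k^{(M-1)} \mid h_k^{(M)}, \Theta)\, P(h_k^{(M)} \mid l_k^{(M)}, \Theta)\, P(\Theta)\, d\Theta$$

$$\approx \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{K} \sum_{h_k^{(1)}} \cdots \sum_{h_k^{(M)}} P(w_k \mid h_k^{(1)}, \Theta_t) \cdots P(h_k^{(M-1)} \mid h_k^{(M)}, \Theta_t)\, P(h_k^{(M)} \mid l_k^{(M)}, \Theta_t)$$

Here h_k^{(m)} is the latent word of the m-th layer at position k, M is the number of layers, l_k^{(M)} is the context of preceding latent words in the M-th layer, and \Theta_t is the t-th point-estimated parameter set covering all layers.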
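The generative probability above can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' implementation: it assumes a single fixed parameter set (T = 1), a bigram model on the top latent layer, and a tiny vocabulary indexed by integers. The function name `hlwlm_likelihood` and all parameter values are hypothetical. Because each lower layer depends only on the layer directly above it at the same position, the intermediate layers can be marginalized by matrix products, and the top layer is handled by the forward algorithm.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hlwlm_likelihood(words, init, trans, layer_maps, emit):
    """P(w) for a toy h-LWLM with a bigram model on the top latent layer.

    init[i]       = P(first top-layer latent word = i)
    trans[i][j]   = P(top-layer latent word j | previous top-layer latent word i)
    layer_maps    = matrices P(lower latent word | upper latent word), top layer first
    emit[i][j]    = P(observed word j | layer-1 latent word i)
    An empty layer_maps list gives a one-layer model, i.e. a standard LWLM.
    """
    # Marginalize the intermediate layers: P(observed word | top-layer latent word).
    top_to_word = emit
    for layer_map in reversed(layer_maps):
        top_to_word = matmul(layer_map, top_to_word)

    # Forward algorithm over the top-layer latent word sequence.
    states = range(len(init))
    alpha = [init[i] * top_to_word[i][words[0]] for i in states]
    for w in words[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in states) * top_to_word[j][w]
                 for j in states]
    return sum(alpha)


# Hypothetical toy parameters for a two-word vocabulary and two latent layers.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
layer_map = [[0.9, 0.1], [0.3, 0.7]]
emit = [[0.8, 0.2], [0.1, 0.9]]
print(hlwlm_likelihood([0, 1, 1], init, trans, [layer_map], emit))
```

Averaging this quantity over several parameter sets would correspond to the sum over t in the second line of the equation.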

3. Parameter inference of h-LWLMs

To create the hierarchical latent word structure from training data sets, we also propose layer-wise inference

The LWLM structure is recursively accumulated by estimating a latent word sequence in an upper layer from the latent word sequence in the layer below it

• In line 4, Gibbs sampling is used to estimate the latent word sequence in an upper layer from that in the lower layer

• In line 6, the t-th point-estimated parameter denotes the parameters of the LWLMs of all layers at the t-th iteration
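The layer-wise idea can be sketched as a loop in which the latent sequence sampled for one layer becomes the "observed" sequence for the next layer up. The sketch below is a simplified illustration, not the poster's algorithm: it assumes a bigram latent model with additive smoothing (`alpha`), ignores the count corrections a full collapsed Gibbs sampler would apply, and does not store per-iteration parameter estimates. The names `gibbs_latent_sequence` and `layerwise_inference` are hypothetical.

```python
import random
from collections import Counter


def gibbs_latent_sequence(lower, vocab, iters=20, alpha=0.1, seed=0):
    """Sample an upper-layer latent word sequence for the sequence `lower`."""
    rng = random.Random(seed)
    n, v = len(lower), len(vocab)
    upper = list(lower)                      # start with each latent word equal to the word below
    trans = Counter(zip(upper, upper[1:]))   # counts of (previous latent, latent)
    ctx = Counter(upper[:-1])                # counts of a latent word used as a context
    emit = Counter(zip(upper, lower))        # counts of (latent, word below)
    uni = Counter(upper)                     # counts of each latent word

    def update(k, delta):
        """Add or remove the counts contributed by position k."""
        h = upper[k]
        emit[(h, lower[k])] += delta
        uni[h] += delta
        if k > 0:
            trans[(upper[k - 1], h)] += delta
        if k < n - 1:
            trans[(h, upper[k + 1])] += delta
            ctx[h] += delta

    for _ in range(iters):
        for k in range(n):
            update(k, -1)
            weights = []
            for c in vocab:
                wgt = (emit[(c, lower[k])] + alpha) / (uni[c] + alpha * v)
                if k > 0:
                    wgt *= trans[(upper[k - 1], c)] + alpha
                if k < n - 1:
                    wgt *= (trans[(c, upper[k + 1])] + alpha) / (ctx[c] + alpha * v)
                weights.append(wgt)
            upper[k] = rng.choices(vocab, weights=weights)[0]
            update(k, +1)
    return upper


def layerwise_inference(words, num_layers, **kwargs):
    """Estimate latent sequences layer by layer, from layer 1 up to layer M."""
    vocab = sorted(set(words))
    layers, lower = [], list(words)
    for _ in range(num_layers):
        upper = gibbs_latent_sequence(lower, vocab, **kwargs)
        layers.append(upper)
        lower = upper                        # the new layer is the input to the next one
    return layers


layers = layerwise_inference("a b a b c a b".split(), num_layers=3)
```

With `num_layers=1` the loop reduces to inference for a standard LWLM, matching the statement that standard LWLMs correspond to h-LWLMs with one layer.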

 Advantages

• The hierarchy in the latent space can be expected to compute generative probabilities for unseen words more flexibly than non-hierarchical LWLMs

 Usage for natural language processing tasks

• It is impractical to apply the h-LWLM directly to natural language processing tasks

• We use an n-gram approximation: a large amount of text is generated by random sampling from the trained h-LWLM, and an n-gram model is constructed from the generated data
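The n-gram approximation described above can be sketched in two steps: sample sentences from the generative model, then count n-grams in the samples. This is a minimal sketch under assumptions: the model is a hypothetical toy dictionary (`toy_model`) with a bigram top layer, sentences have a fixed length, and the n-gram estimate is unsmoothed maximum likelihood, whereas a practical system would apply smoothing. All function names are illustrative.

```python
import random
from collections import Counter, defaultdict


def sample_from(dist, rng):
    """Draw one key from a dict that maps outcomes to probabilities."""
    items = list(dist)
    return rng.choices(items, weights=[dist[x] for x in items])[0]


def generate_sentence(model, length, rng):
    """Sample words top-down: top latent word, lower latent words, then the observed word."""
    words, top = [], "<s>"
    for _ in range(length):
        top = sample_from(model["top_bigram"][top], rng)
        h = top
        for layer_map in model["layer_maps"]:   # ordered from the top layer down
            h = sample_from(layer_map[h], rng)
        words.append(sample_from(model["emit"][h], rng))
    return words


def ngram_counts(sentences, n=3):
    """Count n-grams as context -> Counter of next words."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"]
        for i in range(n - 1, len(padded)):
            counts[tuple(padded[i - n + 1:i])][padded[i]] += 1
    return counts


def ngram_prob(counts, context, word):
    """Unsmoothed maximum-likelihood estimate of P(word | context)."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0


# Hypothetical toy model with two latent layers over the vocabulary {a, b}.
toy_model = {
    "top_bigram": {"<s>": {"a": 0.5, "b": 0.5},
                   "a": {"a": 0.3, "b": 0.7},
                   "b": {"a": 0.6, "b": 0.4}},
    "layer_maps": [{"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}],
    "emit": {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.3, "b": 0.7}},
}

rng = random.Random(0)
generated = [generate_sentence(toy_model, 5, rng) for _ in range(200)]
trigram = ngram_counts(generated, n=3)
print(ngram_prob(trigram, ("<s>", "<s>"), "a"))
```

The resulting n-gram model can then be used wherever a conventional n-gram LM is expected, which is what makes the approximation practical.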

[Figure: graphical model of the h-LWLM, with the observed word sequence and the 1st through M-th latent variable sequences stacked above it]

4. PPL and speech recognition experiments

Our experiments assume that domain-matched training data is not available

Methods

For LM training, we used the Corpus of Spontaneous Japanese (CSJ), whose domain is spontaneous lectures

Train            CSJ 2672 lectures (700M words)
Valid            CSJ 10 lectures (20K words)
Test 1           Contact center dialogue (20K words)
Test 2           Voice mail data (20M words)
Decoder          VoiceRex (WFST-based)
Acoustic model   DNN-HMM with 8 layers

• MKN: 3-gram LM with modified Kneser-Ney smoothing

• HPY: Hierarchical Pitman-Yor 3-gram LM

• RNN: Class-based RNNLM with 500 nodes

• LW: 3-gram approximation of the h-LWLM (LW with 1 layer is the standard LWLM)

Results

Perplexity (PPL)

[Figure: PPL versus number of layers (1 to 5), with panels for Valid, Test 1 and Test 2, comparing MKN, HPY, RNN, LW and LW+HPY. Data labels in extraction order: 188.56, 174.30, 155.24, 151.44, 144.93, 146.76, 140.39, 148.43, 156.52, 138.41, 134.57, 130.14, 129.80, 125.44, 68.08, 67.22, 68.61, 66.24, 60.78, 61.95, 62.41]

Word error rate (WER) [%]

Setup    Layer   Valid   Test 1   Test 2
MKN      -       24.79   38.67    32.31
HPY      -       24.67   38.29    32.00
LW       1       24.54   36.96    30.26
LW       5       24.60   36.49    29.57
LW+HPY   1       23.62   36.49    29.76
LW+HPY   5       23.68   36.03    29.21

Valid

• PPL was not improved by the hierarchical structure in LW

• LW is comparable to MKN and HPY, and inferior to RNN, in terms of PPL

Test

• PPL improved as the number of layers in LW increased

• LW with 5 layers was superior to LW with 1 layer in terms of PPL and WER

• The best results were obtained by LW+HPY with 5 layers
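To put the WER figures in perspective, the relative WER reduction of LW+HPY with 5 layers over the MKN baseline can be computed from the reported values. The snippet below is a small worked calculation; the helper name `relative_reduction` is an illustrative choice, and the numbers are copied from the WER results for MKN and LW+HPY (5 layers).

```python
# WER [%] as reported for the MKN baseline and for LW+HPY with 5 layers.
wer = {
    "MKN": {"Valid": 24.79, "Test 1": 38.67, "Test 2": 32.31},
    "LW+HPY (5 layers)": {"Valid": 23.68, "Test 1": 36.03, "Test 2": 29.21},
}


def relative_reduction(baseline, system):
    """Relative error reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline - system) / baseline


for task in ("Valid", "Test 1", "Test 2"):
    gain = relative_reduction(wer["MKN"][task], wer["LW+HPY (5 layers)"][task])
    print(f"{task}: {gain:.2f}% relative WER reduction")
```

The relative gains on the two out-of-domain test sets are larger than on the in-domain validation set, which is consistent with the robustness claim.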
