Copyright©2015 NTT corp. All Rights Reserved.
Ryo MASUMURA , Taichi ASAMI , Takanobu OBA ,
Hirokazu MASATAKI , Sumitaka SAKAUCHI and Akinori ITO
NTT Corporation, Japan Tohoku University, Japan
Hierarchical Latent Words Language Models
for Robust Modeling to Out-Of Domain Tasks
Motivation
To construct an LM that achieves high performance not only on in-domain tasks but also on out-of-domain tasks
Overview
We propose hierarchical latent words language models (h-LWLMs)
Results
H-LWLMs offer improved robustness to out-of-domain tasks
• The key advance is introducing multiple latent word spaces with a hierarchical structure into LWLMs
• They can handle linguistic phenomena not present in the training data
• H-LWLM is superior to the standard LWLM in terms of PPL and WER
• H-LWLM is significantly superior to the conventional n-gram LM or the recurrent neural network LM in out-of-domain tasks
2. Hierarchical latent words language models (h-LWLMs)
H-LWLMs are novel LWLMs with multiple latent word spaces that
are hierarchically structured
1. Latent words language models (LWLMs)
LWLMs are generative models with a single latent word space
[Deschacht+, 2011]
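\[
P(\boldsymbol{w}) = \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta)\, P(h_k \mid l_k, \Theta)
\]
\[
P(\boldsymbol{w}) = \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta_t)\, P(h_k \mid l_k, \Theta_t)
\]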
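To make the marginalization over latent words concrete, here is a minimal sketch for a toy LWLM. It assumes that the latent context l_k is only the previous latent word (a 2-gram latent model, chosen here purely to keep the example small). Under that assumption the model is an HMM whose states are vocabulary words, so the sum over latent word sequences can be computed with the forward algorithm. The tables `init`, `trans` and `emit` are hypothetical numbers, not trained parameters.

```python
from itertools import product

VOCAB = ["a", "b"]

# Hypothetical toy parameters (Theta). Every distribution sums to 1.
init = {"a": 0.6, "b": 0.4}                  # P(h_1)
trans = {"a": {"a": 0.7, "b": 0.3},          # P(h_k | h_{k-1})
         "b": {"a": 0.2, "b": 0.8}}
emit = {"a": {"a": 0.9, "b": 0.1},           # P(w_k | h_k)
        "b": {"a": 0.25, "b": 0.75}}


def lwlm_prob(words):
    """P(w) via the forward algorithm, summing over all latent sequences."""
    alpha = {h: init[h] * emit[h][words[0]] for h in VOCAB}
    for w in words[1:]:
        alpha = {h: emit[h][w] * sum(alpha[p] * trans[p][h] for p in VOCAB)
                 for h in VOCAB}
    return sum(alpha.values())


def lwlm_prob_bruteforce(words):
    """Same quantity by enumerating every latent sequence (for checking)."""
    total = 0.0
    for hs in product(VOCAB, repeat=len(words)):
        p = init[hs[0]] * emit[hs[0]][words[0]]
        for k in range(1, len(words)):
            p *= trans[hs[k - 1]][hs[k]] * emit[hs[k]][words[k]]
        total += p
    return total


if __name__ == "__main__":
    sentence = ["a", "b", "b"]
    print(lwlm_prob(sentence), lwlm_prob_bruteforce(sentence))
```

The brute-force version is exponential in the sentence length and is included only to confirm that the forward recursion computes the same sum.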
[Figure: graphical model of an LWLM. A latent variable sequence (latent words h around position k) generates the observed word sequence (words w around position k).]
A latent word is represented as a specific word selected from the entire vocabulary
The number of latent words equals the number of observed words
The latent space helps to efficiently increase robustness to out-of-domain tasks
• LWLMs were significantly superior to conventional LMs in out-of-domain tasks, while performance was comparable in the domain-matched task [Masumura+ 2013]
Points
Problems
• The standard LWLM simply represents the latent space as an n-gram model of latent words
• The standard LWLM does not model the hierarchy of function and meaning of words
An h-LWLM assumes that there is a latent word behind each latent word
Standard LWLMs correspond to h-LWLMs with just one layer
Latent words in all layers are represented as specific words selected from the entire vocabulary
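\[
P(\boldsymbol{w}) = \prod_{k=1}^{K} \sum_{h_k^{(M)}} \cdots \sum_{h_k^{(1)}} P(w_k \mid h_k^{(1)}, \Theta) \cdots P(h_k^{(m-1)} \mid h_k^{(m)}, \Theta) \cdots P(h_k^{(M)} \mid l_k^{(M)}, \Theta)
\]
\[
P(\boldsymbol{w}) = \frac{1}{T} \sum_{t=1}^{T} \prod_{k=1}^{K} \sum_{h_k^{(M)}} \cdots \sum_{h_k^{(1)}} P(w_k \mid h_k^{(1)}, \Theta_t) \cdots P(h_k^{(m-1)} \mid h_k^{(m)}, \Theta_t) \cdots P(h_k^{(M)} \mid l_k^{(M)}, \Theta_t)
\]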
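For fixed parameters, the intermediate latent words can be summed out one layer at a time. Assuming each layer emits the layer below position by position, the probability of an observed word given the top-layer latent word is a product of the per-layer emission tables. The sketch below illustrates this with two hypothetical layers over a three-word vocabulary. `emit1` and `emit2` are made-up tables, and `compose` and `collapsed_emission` are illustrative helper names.

```python
VOCAB = ["i", "you", "go"]

# Hypothetical emission tables; the outer key is the conditioning latent word.
# emit1[h1][w]  = P(w_k  | h1_k)   (1st latent layer -> observed word)
# emit2[h2][h1] = P(h1_k | h2_k)   (2nd latent layer -> 1st latent layer)
emit1 = {"i":   {"i": 0.8, "you": 0.2, "go": 0.0},
         "you": {"i": 0.3, "you": 0.7, "go": 0.0},
         "go":  {"i": 0.0, "you": 0.0, "go": 1.0}}
emit2 = {"i":   {"i": 0.5, "you": 0.5, "go": 0.0},
         "you": {"i": 0.4, "you": 0.6, "go": 0.0},
         "go":  {"i": 0.1, "you": 0.1, "go": 0.8}}


def compose(upper, lower):
    """Sum out the middle word: out[a][c] = sum_b upper[a][b] * lower[b][c]."""
    return {a: {c: sum(upper[a][b] * lower[b][c] for b in VOCAB)
                for c in VOCAB}
            for a in VOCAB}


def collapsed_emission(layers):
    """layers = [emit1, emit2, ..., emitM]; returns P(w_k | top-layer h_k)."""
    table = layers[0]
    for upper in layers[1:]:
        table = compose(upper, table)
    return table


if __name__ == "__main__":
    table = collapsed_emission([emit1, emit2])
    print(table["go"])  # distribution over observed words given top-layer "go"
```

With a single layer the function returns `emit1` unchanged, which matches the statement that a standard LWLM is an h-LWLM with one layer.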
[Algorithm: layer-wise inference procedure for h-LWLMs]
To create the hierarchical latent word structure from training data sets, we also propose a layer-wise inference procedure
The LWLM structure is recursively accumulated by estimating the latent word sequence in an upper layer from the latent word sequence in the layer below
• In line 4, Gibbs sampling is used to estimate the latent word sequence in an upper layer from that in the lower layer
• In line 6, the t-th point-estimated parameter denotes the parameters of the LWLMs in all layers at the t-th iteration
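The bottom-up idea can be sketched as follows. This is a simplified illustration and not the poster's algorithm listing. It assumes a 2-gram latent context and add-alpha smoothing, uses approximate collapsed Gibbs updates, and keeps only the final sample per layer instead of collecting parameters at every iteration. The names `gibbs_layer` and `layerwise_inference` and all default values are hypothetical.

```python
import random
from collections import Counter


def gibbs_layer(lower, vocab, iters=20, alpha=0.1, seed=0):
    """Sample an upper latent word sequence given the `lower` sequence.

    Simplified: 2-gram latent context, add-alpha smoothing, approximate
    collapsed Gibbs (interactions between the two adjacent bigram counts
    are ignored).
    """
    rng = random.Random(seed)
    n, V = len(lower), len(vocab)
    upper = list(lower)  # initialise each latent word to the word below it
    emit = Counter(zip(upper, lower))       # (h_k, x_k) counts
    uni = Counter(upper)                    # h_k counts
    trans = Counter(zip(upper, upper[1:]))  # (h_{k-1}, h_k) counts
    ctx = Counter(upper[:-1])               # counts of h as a bigram context

    def bump(k, d):
        """Add d (+1 or -1) to every count that involves position k."""
        h = upper[k]
        emit[h, lower[k]] += d
        uni[h] += d
        if k > 0:
            trans[upper[k - 1], h] += d
            ctx[upper[k - 1]] += d
        if k < n - 1:
            trans[h, upper[k + 1]] += d
            ctx[h] += d

    for _ in range(iters):
        for k in range(n):
            bump(k, -1)  # remove position k from the counts
            weights = []
            for h in vocab:
                p = (emit[h, lower[k]] + alpha) / (uni[h] + alpha * V)
                if k > 0:
                    prev = upper[k - 1]
                    p *= (trans[prev, h] + alpha) / (ctx[prev] + alpha * V)
                if k < n - 1:
                    nxt = upper[k + 1]
                    p *= (trans[h, nxt] + alpha) / (ctx[h] + alpha * V)
                weights.append(p)
            upper[k] = rng.choices(vocab, weights=weights)[0]
            bump(k, +1)  # add the newly sampled latent word back
    return upper


def layerwise_inference(words, vocab, num_layers, iters=20, alpha=0.1):
    """Stack latent layers bottom-up: layer m is inferred from layer m-1."""
    layers, lower = [], list(words)
    for m in range(num_layers):
        lower = gibbs_layer(lower, vocab, iters=iters, alpha=alpha, seed=m)
        layers.append(lower)
    return layers  # layers[0] is the 1st latent sequence, layers[-1] the top


if __name__ == "__main__":
    vocab = ["a", "b", "c"]
    print(layerwise_inference("a b a b c a b".split(), vocab, num_layers=3))
```

Each call to `gibbs_layer` treats the sequence below it as observed, so the same routine serves every layer. A fuller implementation would also record the per-layer counts after each sweep to obtain the t-th parameter estimate.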
3. Parameter inference of h-LWLMs
Advantages
• The hierarchy in the latent space is expected to compute generative probabilities for unseen words more flexibly than non-hierarchical LWLMs
Usage for natural language processing tasks
• It is impractical to apply the h-LWLM directly to natural language processing tasks
• We use an n-gram approximation technique: a large amount of text data is generated by random sampling from the trained h-LWLM, and an n-gram model is constructed from the generated data
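The n-gram approximation described above can be sketched in a few lines. The example assumes a toy two-layer model with a 2-gram top-layer latent model and builds an unsmoothed 2-gram model from the samples. The experiments use 3-gram approximations, and a practical system would add smoothing. The tables `TRANS` and `EMITS` and the function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy h-LWLM: top-layer latent 2-gram model plus two emission layers.
TRANS = {"<s>": {"a": 0.5, "b": 0.5},
         "a": {"a": 0.2, "b": 0.5, "</s>": 0.3},
         "b": {"a": 0.5, "b": 0.2, "</s>": 0.3}}
EMIT2 = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}  # layer 2 -> layer 1
EMIT1 = {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.3, "b": 0.7}}  # layer 1 -> word
EMITS = [EMIT2, EMIT1]  # ordered from the top layer down to the observed words


def sample(dist, rng):
    """Draw one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_sentence(rng, max_len=20):
    """Ancestral sampling: top-layer latent word first, then down the layers."""
    words, h = [], "<s>"
    while len(words) < max_len:
        h = sample(TRANS[h], rng)
        if h == "</s>":
            break
        x = h
        for table in EMITS:
            x = sample(table[x], rng)
        words.append(x)
    return words


def train_bigram(sentences):
    """Maximum-likelihood 2-gram model estimated from the generated text."""
    counts, ctx = Counter(), Counter()
    for s in sentences:
        padded = ["<s>"] + s + ["</s>"]
        for u, v in zip(padded, padded[1:]):
            counts[u, v] += 1
            ctx[u] += 1
    return {(u, v): c / ctx[u] for (u, v), c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    generated = [sample_sentence(rng) for _ in range(5000)]
    lm = train_bigram(generated)
    print(sorted(lm.items()))
```

The resulting table can be used like any other n-gram model, which is what makes the approximation convenient for decoders that expect n-gram LMs.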
[Figure: graphical model of an h-LWLM with M layers, showing the M-th latent variable sequence, the (M-1)-th latent variable sequence, the 1st latent variable sequence and the observed word sequence]
4. PPL and speech recognition experiments
Our experiments assume that domain-matched training data is not available
• MKN: 3-gram LM with modified Kneser-Ney
• HPY: Hierarchical Pitman-Yor 3-gram LM
• RNN: Class-based RNNLM with 500 nodes
• LW: 3-gram approximation of h-LWLM (LW with 1 layer means standard LWLM)
Train: CSJ, 2672 lectures (700M words)
Valid: CSJ, 10 lectures (20K words)
Test 1: Contact center dialogue (20K words)
Test 2: Voice mail data (20M words)
Decoder: VoiceRex (WFST-based)
Acoustic model: DNN-HMM with 8 layers
Word error rate (WER) [%]
Setup    Layer  Valid  Test 1  Test 2
MKN      -      24.79  38.67   32.31
HPY      -      24.67  38.29   32.00
LW       1      24.54  36.96   30.26
LW       5      24.60  36.49   29.57
LW+HPY   1      23.62  36.49   29.76
LW+HPY   5      23.68  36.03   29.21
Methods
Valid
• PPL was not improved by the hierarchical structure in LW
• LW is comparable to MKN and HPY, and inferior to RNN in terms of PPL
Test
• PPL improved with the increase in the number of layers in LW
• LW with 5 layers was superior to 1 layer in terms of PPL and WER
• The best results were obtained by LW+HPY with 5 layers
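For reference, perplexity is the exponential of the average negative log-probability that the LM assigns to each word of the evaluation text. The helper below is a minimal sketch of that definition. The function name `perplexity` and its input format (a list of per-word probabilities) are assumptions made for illustration.

```python
import math


def perplexity(word_probs):
    """exp of the mean negative log-probability of the evaluated words."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every word has perplexity 4.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values mean that the model finds the text less surprising, which is why the PPL comparisons here treat smaller numbers as better.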
[Figure: PPL as a function of the number of layers (1 to 5) for MKN, HPY, RNN, LW and LW+HPY, with one panel each for Valid, Test 1 and Test 2]
Results
Perplexity (PPL)
For LM training, we used the Corpus of Spontaneous Japanese (CSJ), whose domain is spontaneous lectures
[Figure data labels (PPL): 188.56, 174.30, 155.24, 151.44, 144.93, 146.76, 140.39, 148.43, 156.52, 138.41, 134.57, 130.14, 129.80, 125.44, 68.08, 67.22, 68.61, 66.24, 60.78, 61.95, 62.41]