Latent Words Recurrent Neural Network
Language Models
Ryo Masumura, Taichi Asami, Takanobu Oba,
Hirokazu Masataki, Sumitaka Sakauchi, Akinori Ito
NTT Media Intelligence Laboratories, NTT Corporation
‡ Graduate School of Engineering, Tohoku University
Introduction
Focus on approaches intended to
improve language model (LM) structure
Back-off N-gram models
• Problems: locality of context information, data sparseness
→ Advanced approaches
Related Works
Discriminative models
• Maximum Entropy LM [Rosenfeld+, 1996]
• Decision Tree LM [Potamianos+, 1998]
• Random Forest LM [Xu+, 2004]
• Neural Network LM [Bengio+, 2003][Schwenk+, 2007]
• Recurrent Neural Network LM [Mikolov+, 2010]
Generative models
• Hierarchical Pitman-Yor LM [Teh, 2006]
• Bayesian Class-based LM [Su, 2011]
• Dirichlet Class LM [Chien+, 2011]
• Latent Words LM [Deschacht+, 2011]
Overview
Propose a novel advanced language model: the Latent Words Recurrent Neural Network LM
• Combine the Recurrent Neural Network LM (a discriminative model) and the Latent Words LM (a generative model) to offset the faults of each
Agenda
1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
Recurrent Neural Network LM [Mikolov+, 2010]
• Long-range information can be flexibly taken into consideration
• The context information can be represented as a continuous space vector
[Figure: observed word space; recurrent network with input(k), hidden(k-1), hidden(k), and output(k)]
Probability estimation of the k-th word:
P(W) = \prod_{k=1}^{K} P(w_k | w_{k-1}, s_{k-1}, \theta_{rnn})
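To make the factorization above concrete, here is a minimal numpy sketch of an Elman-style RNNLM step. It is illustrative only: the vocabulary size, hidden dimension, and random weights are assumptions, not the authors' setup.

```python
import numpy as np

# Minimal Elman-style RNNLM sketch: P(w_k | w_{k-1}, s_{k-1}).
V, H = 10000, 200                        # assumed vocabulary size and hidden units
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V))   # input (one-hot previous word) -> hidden
W = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden (recurrence)
O = rng.normal(scale=0.1, size=(V, H))   # hidden -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnlm_log_prob(word_ids):
    """Return log P(W) = sum_k log P(w_k | w_{k-1}, s_{k-1})."""
    s = np.zeros(H)                      # initial hidden state
    log_p, prev = 0.0, word_ids[0]       # treat the first word as given (sentence start)
    for w in word_ids[1:]:
        s = np.tanh(U[:, prev] + W @ s)  # s_k = f(w_{k-1}, s_{k-1})
        p = softmax(O @ s)               # distribution over the next word
        log_p += np.log(p[w])
        prev = w
    return log_p
```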
Latent Words LM [Deschacht+, 2011]
• Flexible attributes help us to mitigate data sparseness efficiently
• Robustly performs not only in in-domain tasks but also in out-of-domain tasks
[Figure: observed word space (w_{k-2} ... w_{k+2}) and latent variable space (h_{k-2} ... h_{k+2})]
P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lw}) P(h_k | h_{k-n+1}^{k-1}, \theta_{lw})
Transition probability distribution
• n-gram model of latent variables
• Each latent variable is expressed as a specific word that can be selected from the entire vocabulary
Emission probability distribution
• Unigram distribution for each latent variable
• Soft class modeling: each observed word can belong to every latent variable
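A toy sketch of the LWLM generative story described above. The transition and emission tables here are made-up placeholders only; a real LWLM estimates them from data, and both latent and observed symbols are drawn from the full vocabulary.

```python
import random
from collections import defaultdict

# Toy LWLM: trigram transition over latent words, unigram emission per latent word.
transition = defaultdict(lambda: {"the": 0.5, "a": 0.5})    # P(h_k | h_{k-2}, h_{k-1}), placeholder
emission = defaultdict(lambda: {"the": 0.7, "this": 0.3})   # P(w_k | h_k), placeholder

def sample(dist):
    r, acc = random.random(), 0.0
    for sym, p in dist.items():
        acc += p
        if r <= acc:
            return sym
    return sym

def generate(length, h_prev=("<s>", "<s>")):
    """Generate (latent word, observed word) pairs from the LWLM generative story."""
    out = []
    for _ in range(length):
        h = sample(transition[h_prev])   # latent word from the n-gram transition
        w = sample(emission[h])          # observed word from the soft-class emission
        out.append((h, w))
        h_prev = (h_prev[1], h)
    return out
```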
Usages for ASR
For one pass decoding
• RNNLM: N-gram approximation [Deoras+, 2011]
• LWLM: N-gram approximation [Masumura+, 2013]
• A lot of text data is generated by random sampling, and n-gram models are constructed from the generated data
• The approximated model can be used in a WFST decoder
For rescoring
• RNNLM: generative probability of the recognition hypothesis [Mikolov+, 2010]
• LWLM: Viterbi approximation [Masumura+, 2013]
• Joint probability between the optimal latent variable sequence and the recognition hypothesis
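The n-gram approximation used for one pass decoding can be sketched as follows. This is a hedged illustration: `sample_sentence` stands for any sampler over the source LM (RNNLM or LWLM) and is an assumed interface; a real system would also apply smoothing before building the WFST.

```python
from collections import Counter

def ngram_approximation(sample_sentence, num_sentences=1_000_000, n=3):
    """Approximate an LM that can only be sampled by a count-based n-gram model:
    draw many sentences, then collect n-gram counts from the generated text."""
    counts, context_counts = Counter(), Counter()
    for _ in range(num_sentences):
        sent = ["<s>"] * (n - 1) + sample_sentence() + ["</s>"]
        for i in range(n - 1, len(sent)):
            ngram = tuple(sent[i - n + 1:i + 1])
            counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    # Maximum-likelihood estimates; real systems would add smoothing (e.g. Kneser-Ney).
    return {g: c / context_counts[g[:-1]] for g, c in counts.items()}
```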
Advantages and disadvantages
RNNLMs
• Can capture long-range context information and offer strong performance
• Cannot explicitly capture the hidden relationship behind the observed word sequence since the concept of a latent variable space is not present
LWLMs
• Can consider the latent variable sequence behind the observed word sequence
• Cannot take into account the long-range relationship between latent variables since the space is modeled as a simple n-gram structure
Agenda
1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
Our idea
Combine RNNLM and LWLM to offset the faults of each
• RNNLM: can support long-range relationship modeling
• LWLM: can support latent variable space modeling
Latent Words Recurrent Neural Network LM
• Has a latent variable space, as in the LWLM, and the latent variable space is modeled using an RNNLM
[Figure: observed word space (w_{k-2} ... w_{k+2}) and latent variable space (h_{k-2} ... h_{k+2}) connected through a recurrent structure]
P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lrn}) P(h_k | h_{k-1}, s_{k-1}, \theta_{lrn})
Transition probability distribution
• RNNLM of latent variables
Emission probability distribution
• Each observed word can belong to every latent variable, as in the LWLM
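A minimal sketch of how this factorization can be evaluated for one given latent word sequence (the joint probability later used for rescoring). `emission_log_prob` and `latent_rnn_step` are hypothetical interfaces introduced only for illustration; exact P(W) would require summing over all latent sequences.

```python
def lwrnnlm_joint_log_prob(words, latents, emission_log_prob, latent_rnn_step):
    """Joint log probability of an observed word sequence and ONE latent word
    sequence under the LWRNNLM factorization:
        log P(W, H) = sum_k [ log P(w_k | h_k) + log P(h_k | h_{k-1}, s_{k-1}) ]
    `emission_log_prob(w, h)` returns log P(w | h); `latent_rnn_step(h_prev, state)`
    returns (dict of transition log-probs over latent words, new RNN state)."""
    state, h_prev, total = None, "<s>", 0.0
    for w, h in zip(words, latents):
        trans_logp, state = latent_rnn_step(h_prev, state)  # RNN over latent words
        total += trans_logp[h] + emission_log_prob(w, h)
        h_prev = h
    return total
```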
Relationship to RNNLM and LWLM
• LWRNNLM can be considered as a soft-class RNNLM with a vast latent variable space
• LWRNNLM can be considered as an LWLM that uses the RNN structure for latent variable modeling
LWRNNLM: P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lrn}) P(h_k | h_{k-1}, s_{k-1}, \theta_{lrn})
LWLM: P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lw}) P(h_k | h_{k-n+1}^{k-1}, \theta_{lw})
RNNLM: P(W) = \prod_{k=1}^{K} P(w_k | w_{k-1}, s_{k-1}, \theta_{rnn})
Parameter inference
It is impossible to estimate LWRNNLMs directly from training data, so we propose an alternative inference procedure:
1. Preliminarily train an LWLM and use it to decode the latent word assignment of the training data
2. Use the decoded latent word assignment to estimate an RNN structure for the transition probability; the emission probability is diverted from the LWLM
[Figure: inference flow starting from the training data]
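The two-step procedure might look like the following sketch. All helper functions and the `emission` attribute are hypothetical interfaces used only to illustrate the data flow, not the authors' implementation.

```python
def train_lwrnnlm(training_sentences, train_lwlm, decode_latent_words, train_rnnlm):
    """Two-step inference sketch:
    1. Train an LWLM and decode a latent-word assignment for the training data.
    2. Fit an RNNLM on the decoded latent-word sequences -> transition model;
       reuse (divert) the LWLM emission distributions unchanged."""
    lwlm = train_lwlm(training_sentences)                      # preliminary LWLM
    latent_corpus = [decode_latent_words(lwlm, s)              # step 1: latent assignment
                     for s in training_sentences]
    transition_rnn = train_rnnlm(latent_corpus)                # step 2: RNN over latent words
    emission = lwlm.emission                                   # diverted from the LWLM
    return transition_rnn, emission
```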
Usages for ASR
For one pass decoding
• N-gram approximation
• A lot of text data is generated by random sampling using the trained LWRNNLM, and n-gram models are constructed from the generated data
For rescoring
• Viterbi approximation
• Use the joint probability between the optimal latent variable sequence and the recognition hypothesis
• We sample several candidates of the latent variable sequence using the LWLM and re-evaluate them using the LWRNNLM
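A sketch of this rescoring procedure under the Viterbi-style approximation. Both helper callables are hypothetical interfaces (the second one has the signature of the joint-probability sketch above, with the model interfaces already bound); in practice the resulting LM score would be combined with the acoustic score.

```python
def lwrnnlm_rescore(hypothesis_words, sample_latents_with_lwlm,
                    joint_log_prob, num_candidates=20):
    """Draw candidate latent-word sequences for a recognition hypothesis from the
    LWLM, re-evaluate them with the LWRNNLM joint probability, and use the best
    candidate's score as the (approximate) LM score of the hypothesis."""
    best = float("-inf")
    for _ in range(num_candidates):
        latents = sample_latents_with_lwlm(hypothesis_words)   # candidate latent sequence
        score = joint_log_prob(hypothesis_words, latents)      # re-evaluate with LWRNNLM
        best = max(best, score)
    return best
```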
Agenda
1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
Experimental conditions
ASR evaluation on the Corpus of Spontaneous Japanese (CSJ)
Train: CSJ, 2,672 lectures (700M words)
Valid (In-Domain): CSJ, 10 lectures (20K words)
Test (In-Domain): CSJ, 10 lectures (20K words)
Test (Out-Of-Domain): voice mail data (20M words)
Decoder: VoiceRex (WFST-based)
Acoustic model: DNN-HMM with 8 layers
Methods
MKN3g: Modified Kneser-Ney 3-gram LM (baseline)
RNN: Recurrent neural network LM
RNN3g: 3-gram approximation of RNN
LW: Latent words LM
LW3g: 3-gram approximation of LW
LRN: Latent words recurrent neural network LM
LRN3g: 3-gram approximation of LRN
Evaluation 1: within one pass decoding (○○3g)
Evaluation 2: including n-best rescoring (RNN, LW, LRN)
Results within one pass decoding (In-domain)
[Bar chart: WER (%) for MKN3g, RNN3g, LW3g, LRN3g, RNN3g+LRN3g, LW3g+LRN3g, and ALL3g (MKN3g+RNN3g+LW3g+LRN3g); plotted values: 26.2, 24.8, 24.7, 24.7, 24.5, 24.3, 23.4]
• ALL3g achieves a 1.5 point improvement compared with MKN3g
• A single use of LRN3g was not powerful
• LRN3g combined with RNN3g or LW3g was effective compared with RNN3g or LW3g alone
Results within one pass decoding (Out-of-domain)
[Bar chart: WER (%) — MKN3g 32.3, RNN3g 32.0, LW3g 30.4, LRN3g 30.0, RNN3g+LRN3g 29.8, LW3g+LRN3g 29.7, ALL3g 29.3]
• LRN3g achieves higher performance than the conventional methods
• ALL3g (MKN3g+RNN3g+LW3g+LRN3g) performed strongly compared to MKN3g
Results including n-best rescoring
In-Domain [Bar chart: WER (%) — one pass: MKN3g 24.8, ALL3g 23.4; rescoring ALL3g with RNN, LW, and LRN combinations (ALL3g+RNN, ALL3g+LW, ALL3g+LW+LRN, ALL3g+RNN+LW+LRN) further reduces WER to 23.3-23.0]
Out-Of-Domain [Bar chart: WER (%) — one pass: MKN3g 32.3, ALL3g 29.3; rescoring ALL3g with RNN, LW, and LRN combinations further reduces WER to 29.1-28.6]
• Viterbi approximation of LW and LRN is comparable to RNN-based rescoring
• Each rescoring model was effective and has different characteristics from its n-gram approximation