Copyright©2014 NTT corp. All Rights Reserved.
Ryo MASUMURA, Taichi ASAMI, Takanobu OBA, Hirokazu MASATAKI and Sumitaka SAKAUCHI
NTT Media Intelligence Laboratories, NTT Corporation, Japan
Mixture of Latent Words Language Models for Domain Adaptation
2. N-gram mixture modeling
[Figure: graphical model of the n-gram mixture model]
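N-gram models define a probability distribution $P(w_t \mid u_t, \Theta)$ over the current word $w_t$ given its context $u_t$ and the parameter $\Theta$.
An n-gram mixture model is constructed by combining
several n-gram models trained on different sources.

$$P(\boldsymbol{w} \mid \Theta) = \sum_{\boldsymbol{m}} P(\boldsymbol{w} \mid \boldsymbol{m}, \Theta)\, P(\boldsymbol{m} \mid \Theta) = \prod_{t=1}^{T} \sum_{m_t=1}^{M} P(w_t \mid u_t, \theta_{m_t})\, P(m_t \mid \Theta)$$

Here $m_t$ is the index of the model that generates $w_t$, and $\theta_m$ is the parameter of the $m$-th n-gram model.
The mixture weights $P(m \mid \Theta)$ can be optimized on
development data with the following EM algorithm.

$$P(m \mid \Theta) = \frac{1}{T} \sum_{t=1}^{T} \frac{P(w_t \mid u_t, \theta_m)\, \hat{P}(m \mid \Theta)}{\sum_{m'=1}^{M} P(w_t \mid u_t, \theta_{m'})\, \hat{P}(m' \mid \Theta)}$$

Here $\hat{P}(m \mid \Theta)$ is the current estimate of the mixture weight and $T$ is the number of words in the development data.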
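The EM update for the mixture weights can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `em_mixture_weights`, the toy probabilities in `dev_probs`, and the fixed iteration count are all hypothetical, and each component model is assumed to be smoothed so that no word has zero probability under every model.

```python
def em_mixture_weights(component_probs, num_iters=50):
    """Estimate mixture weights by EM.

    component_probs[t][m] is assumed to hold P(w_t | u_t, theta_m),
    the probability that model m gives to the t-th development word.
    """
    num_models = len(component_probs[0])
    weights = [1.0 / num_models] * num_models  # uniform starting point
    for _ in range(num_iters):
        expected_counts = [0.0] * num_models
        for probs in component_probs:
            # E-step: posterior probability that each model produced this word.
            joint = [w * p for w, p in zip(weights, probs)]
            total = sum(joint)
            for m in range(num_models):
                expected_counts[m] += joint[m] / total
        # M-step: new weights are the averaged posteriors.
        weights = [c / len(component_probs) for c in expected_counts]
    return weights


# Hypothetical per-word probabilities from two n-gram models.
dev_probs = [[0.20, 0.01], [0.05, 0.04], [0.10, 0.02], [0.01, 0.03]]
weights = em_mixture_weights(dev_probs)
```

In this toy example the first model explains most development words better, so it receives the larger weight.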
[Figure: graphical model of the LWLM]
3. Latent words language model (LWLM)
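LWLMs are soft class-based n-gram models in which each
latent variable is associated with an observed word.

$$P(\boldsymbol{w} \mid \Theta) = \sum_{\boldsymbol{h}} P(\boldsymbol{w} \mid \boldsymbol{h}, \Theta)\, P(\boldsymbol{h} \mid \Theta) = \prod_{t=1}^{T} \sum_{h_t} P(w_t \mid h_t, \Theta)\, P(h_t \mid l_t, \Theta)$$

Here $h_t$ is the latent variable for $w_t$ and $l_t$ is its context of preceding latent variables.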
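As a rough illustration of how an LWLM scores a word, the sketch below sums over the latent variables for a single position, given a fixed latent context. The tables `transition` and `emission`, their toy values, and the function name `lwlm_word_prob` are hypothetical; a full model would also have to account for uncertainty in the latent context itself.

```python
def lwlm_word_prob(word, latent_context, emission, transition):
    """Return sum over h of P(word | h) * P(h | latent_context)."""
    return sum(
        p_h * emission[h].get(word, 0.0)
        for h, p_h in transition[latent_context].items()
    )


# Hypothetical toy tables: a bigram model over latent words.
transition = {("the",): {"talk": 0.6, "paper": 0.4}}
emission = {
    "talk": {"talk": 0.7, "lecture": 0.3},
    "paper": {"paper": 0.8, "article": 0.2},
}

# "lecture" is reachable only through the latent word "talk": 0.6 * 0.3.
p_lecture = lwlm_word_prob("lecture", ("the",), emission, transition)
```

Because every observed word is emitted from a latent word, a word such as "lecture" receives probability mass even in contexts where it was never observed directly, as long as a related latent word is likely there.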
4. LWLM mixture modeling
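LWLM mixture modeling can be considered to be the union of
an n-gram mixture model and an LWLM.
Each latent variable in an LWLM is represented as a specific word, so multiple LWLMs can share a common latent variable space.

$$P(\boldsymbol{w} \mid \Theta) = \sum_{\boldsymbol{h}} \sum_{\boldsymbol{m}} P(\boldsymbol{w} \mid \boldsymbol{h}, \boldsymbol{m}, \Theta)\, P(\boldsymbol{h} \mid \boldsymbol{m}, \Theta)\, P(\boldsymbol{m} \mid \Theta) = \prod_{t=1}^{T} \sum_{m_t=1}^{M} \sum_{h_t} P(w_t \mid h_t, \theta_{m_t})\, P(h_t \mid l_t, \theta_{m_t})\, P(m_t \mid \Theta)$$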
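The shared latent variable space can be illustrated with a small sketch in which two LWLMs are queried with the same latent context and their scores are combined with mixture weights. Everything here is an assumption made for illustration: the dictionary layout of `models`, the toy probabilities, the weights, and the function name `lwlm_mixture_word_prob`.

```python
def lwlm_mixture_word_prob(word, latent_context, models, weights):
    """Weighted sum over models m and latent words h of
    P(m) * P(h | latent_context, theta_m) * P(word | h, theta_m)."""
    total = 0.0
    for model, weight in zip(models, weights):
        latent_dist = model["transition"].get(latent_context, {})
        for h, p_h in latent_dist.items():
            total += weight * p_h * model["emission"][h].get(word, 0.0)
    return total


# Hypothetical target domain and out-of-domain LWLMs over the same latent words.
target_model = {
    "transition": {("the",): {"talk": 0.6, "paper": 0.4}},
    "emission": {
        "talk": {"talk": 0.7, "lecture": 0.3},
        "paper": {"paper": 0.8, "article": 0.2},
    },
}
out_of_domain_model = {
    "transition": {("the",): {"talk": 0.5, "paper": 0.5}},
    "emission": {
        "talk": {"talk": 0.5, "speech": 0.5},
        "paper": {"paper": 1.0},
    },
}
models = [target_model, out_of_domain_model]
weights = [0.7, 0.3]

p_talk = lwlm_mixture_word_prob("talk", ("the",), models, weights)
p_speech = lwlm_mixture_word_prob("speech", ("the",), models, weights)
```

In this toy setting "speech" never appears in the target domain model, yet it receives non-zero probability because the out-of-domain model emits it from the shared latent word "talk".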
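To optimize mixture weights, we use the Bayesian criterion
and a sampling procedure that is compatible with LWLM training.
We use Gibbs sampling to assign a latent word sequence and a model index sequence to the development set.

$$P(h_t \mid \boldsymbol{w}, \boldsymbol{h}_{-t}, \boldsymbol{m}) \propto P(w_t \mid h_t, \theta_{m_t}) \prod_{j=t}^{t+n-1} P(h_j \mid l_j, \theta_{m_j})$$

$$P(m_t \mid \boldsymbol{w}, \boldsymbol{h}, \boldsymbol{m}_{-t}) \propto P(w_t \mid h_t, \theta_{m_t})\, P(h_t \mid l_t, \theta_{m_t})\, P(m_t \mid \Theta_{-t})$$

Here the subscript $-t$ denotes all positions except $t$, and $n$ is the n-gram order of the latent word model.
Under the Bayesian criterion, the optimized mixture weights are estimated by Monte Carlo integration over multiple sampled model index sequences.

$$P(m \mid \Theta) = \frac{1}{S} \sum_{s=1}^{S} P(m \mid \Theta_s) = \frac{1}{S} \sum_{s=1}^{S} \frac{c(m; \boldsymbol{m}_s) + \alpha}{T + M\alpha}$$

Here $\boldsymbol{m}_s$ is the $s$-th sampled model index sequence, $c(m; \boldsymbol{m}_s)$ is the number of times index $m$ occurs in it, $T$ is the number of words in the development set, $M$ is the number of models, and $\alpha$ is a smoothing hyperparameter.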
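The Monte Carlo step can be sketched as follows: given several model index sequences drawn by a sampler, the weight of each model is a smoothed relative frequency averaged over the samples. This is only a sketch under assumptions. The function name `estimate_mixture_weights`, the additive smoothing constant `alpha`, and the toy samples are hypothetical, and the Gibbs sampler that would produce the sequences is not shown.

```python
from collections import Counter


def estimate_mixture_weights(index_samples, num_models, alpha=1.0):
    """Average smoothed model index frequencies over sampled sequences.

    index_samples: list of sampled model index sequences, one index per
    development word. alpha is an assumed additive smoothing constant.
    """
    weights = [0.0] * num_models
    for sample in index_samples:
        counts = Counter(sample)
        denom = len(sample) + num_models * alpha
        for m in range(num_models):
            weights[m] += (counts[m] + alpha) / denom
    # Monte Carlo average over the samples.
    return [w / len(index_samples) for w in weights]


# Three hypothetical samples over a four-word development set, two models.
samples = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
weights = estimate_mixture_weights(samples, num_models=2)
```

With these toy samples the first model is chosen more often, so its averaged weight comes out at two thirds.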
[Figure: graphical model of the LWLM mixture model]
Introduce a novel language model adaptation method
based on a mixture of latent words language models.
6. Experiments
Conventional: n-gram mixture modeling
Model merging is conducted on the observed word space.
Mixture weights are determined based on the ML criterion.
Proposed: LWLM mixture modeling
Model merging is conducted on the latent variable space.
Mixture weights are determined based on the Bayesian criterion.
1. Background
Techniques to build effective language models using limited
target domain data are needed since large amounts of
domain-specific data are often unavailable.
(1) Robust modeling: smoothing, dimensionality reduction
• Realize high performance even when the target domain data is restricted.
(2) Domain adaptation: mixture modeling
• Specialize in the target domain using out-of-domain data.
Problems of n-gram mixture modeling
• It is hard for out-of-domain LMs to offer adequate adaptation performance since the words in out-of-domain LMs often differ from those in the target domain LM.
Solution
• Realize a method in which model merging is conducted in a latent variable space that is also used for robust modeling.
5. Implementation
N-gram mixture model
Approximately expressed as a single back-off n-gram model.
A hierarchical Pitman-Yor LM is used as the n-gram LM.
LWLM and LWLM mixture model
An approximate model is constructed by randomly generating text data according to the model's stochastic process and training a standard back-off n-gram model on the generated text.
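The text generation step can be sketched as ancestral sampling: draw a latent word given the previous latent word, then draw an observed word from it. This is a simplified sketch assuming a bigram latent model; the tables, the start symbol `<s>`, and the function name `generate_text` are hypothetical, and the subsequent back-off n-gram training on the generated text is not shown.

```python
import random


def generate_text(transition, emission, num_words, seed=0):
    """Sample observed words from a toy LWLM with a bigram latent model."""
    rng = random.Random(seed)
    latent_prev = "<s>"
    words = []
    for _ in range(num_words):
        # Draw the next latent word given the previous latent word.
        latent_dist = transition[latent_prev]
        latent = rng.choices(
            list(latent_dist), weights=list(latent_dist.values())
        )[0]
        # Draw the observed word given the latent word.
        word_dist = emission[latent]
        word = rng.choices(
            list(word_dist), weights=list(word_dist.values())
        )[0]
        words.append(word)
        latent_prev = latent
    return words


# Hypothetical toy tables.
transition = {
    "<s>": {"the": 1.0},
    "the": {"talk": 0.6, "paper": 0.4},
    "talk": {"the": 1.0},
    "paper": {"the": 1.0},
}
emission = {
    "the": {"the": 0.9, "this": 0.1},
    "talk": {"talk": 0.7, "lecture": 0.3},
    "paper": {"paper": 0.8, "article": 0.2},
}
generated = generate_text(transition, emission, num_words=20)
```

A large sample generated this way would then serve as the training text for a standard back-off n-gram model, which a decoder can use directly.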
Data set      Domain          Corpus                        Size
Training 1    target domain   CSJ academic lecture          3.5M words
Training 2    out-of-domain   CSJ extemporaneous lecture    3.8M words
Development   target domain   CSJ academic lecture          28K words
Test 1        target domain   CSJ academic lecture          28K words
Test 2        out-of-domain   CSJ extemporaneous lecture    20K words

Decoder           VoiceRex (WFST-based)
Acoustic model    Context-dependent DNN-HMM, 7 hidden layers of 2048 nodes
Evaluate both robust modeling and domain adaptation
in terms of perplexity (PPL) and word error rate (WER).
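For reference, the two metrics can be computed as in the sketch below. It uses the standard definitions (perplexity as the exponential of the average negative log probability, WER as word-level edit distance divided by reference length); the function names and toy inputs are hypothetical and unrelated to the evaluation data.

```python
import math


def perplexity(word_probs):
    """exp of the average negative log probability per word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance over reference length, in percent."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(reference)


# A model that gives every word probability 1/4 has perplexity 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# One substitution and one deletion against a four-word reference: 50%.
wer = word_error_rate("a b c d".split(), "a x c".split())
```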
Model                                                          Test 1             Test 2
                                                               PPL     WER(%)     PPL      WER(%)
N-gram constructed from Training 1                             62.85   21.98      183.38   32.51
LWLM constructed from Training 1                               62.34   21.85      165.87   31.43
N-gram mixture model constructed from each training data set   64.19   20.34      178.71   26.56
LWLM mixture model constructed from each training data set     60.19   19.36      164.46   25.34
Achieve further improvement by using the out-of-domain training data in addition to the target domain training data, compared with using the target domain training data alone.
The proposed LWLM mixture modeling achieves improvements on both the target domain and the out-of-domain data compared with n-gram mixture modeling.
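Figure legends: the n-gram mixture model figure shows the model index sequence and the observed word sequence; the LWLM figure shows the latent variable sequence and the observed word sequence; the LWLM mixture model figure shows the latent variable sequence, the observed word sequence, and the model index sequence.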