1409 [Poster] pdf Recent update history Ryo Masumura: Web


Copyright©2014 NTT corp. All Rights Reserved.

Ryo MASUMURA, Taichi ASAMI, Takanobu OBA, Hirokazu MASATAKI and Sumitaka SAKAUCHI

NTT Media Intelligence Laboratories, NTT Corporation, Japan

Mixture of Latent Words Language Models for Domain Adaptation

2. N-gram mixture modeling

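[Figure: graphical model of the n-gram mixture model over the observed word sequence.]

\[
P(\boldsymbol{w} \mid \Theta)
= \sum_{\boldsymbol{m}} P(\boldsymbol{w} \mid \boldsymbol{m}, \Theta)\, P(\boldsymbol{m} \mid \Theta)
= \prod_{k=1}^{K} \sum_{m=1}^{M} P(w_k \mid u_k, \theta_m)\, P(m \mid \Theta)
\]

EM update of the mixture weights (the right-hand side uses the previous estimate):

\[
P(m \mid \Theta) = \frac{1}{K} \sum_{k=1}^{K}
\frac{P(w_k \mid u_k, \theta_m)\, P(m \mid \Theta)}
     {\sum_{m'=1}^{M} P(w_k \mid u_k, \theta_{m'})\, P(m' \mid \Theta)}
\]

Here $K$ is the number of words, $M$ is the number of component models, and $\theta_m$ is the parameter of the $m$-th n-gram model.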

N-gram models define a probability distribution $P(w \mid u, \theta)$ over the current word $w$ given context $u$ and parameter $\theta$.

• An n-gram mixture model is constructed by combining several n-gram models trained on different sources.

• The mixture weights $P(m \mid \Theta)$ can be optimized on development data with the EM algorithm.
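The EM optimization of the mixture weights can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the per-word component probabilities on the development data have already been computed, and the names `em_mixture_weights` and `component_probs` are hypothetical.

```python
def em_mixture_weights(component_probs, num_iters=50):
    """Estimate mixture weights by EM.

    component_probs[k][m] is assumed to hold P(w_k | u_k, theta_m), the
    probability that component model m gives to the k-th development word.
    Every word is assumed to get non-zero probability from some component.
    """
    num_models = len(component_probs[0])
    weights = [1.0 / num_models] * num_models  # uniform initialization
    for _ in range(num_iters):
        expected = [0.0] * num_models
        for probs in component_probs:
            # E-step: posterior responsibility of each model for this word.
            joint = [w * p for w, p in zip(weights, probs)]
            total = sum(joint)
            for m in range(num_models):
                expected[m] += joint[m] / total
        # M-step: new weight is the average responsibility.
        weights = [e / len(component_probs) for e in expected]
    return weights
```

For example, `em_mixture_weights([[0.5, 0.1]] * 10)` pushes almost all of the weight onto the first model, since it explains every development word better.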

[Figure: LWLM graphical model, with a latent variable sequence above the observed word sequence.]

3. Latent words language model (LWLM)

• LWLMs are soft class-based n-gram models in which each latent variable is associated with an observed word.

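\[
P(\boldsymbol{w} \mid \Theta)
= \sum_{\boldsymbol{h}} P(\boldsymbol{w} \mid \boldsymbol{h}, \Theta)\, P(\boldsymbol{h} \mid \Theta)
= \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta)\, P(h_k \mid l_k, \Theta)
\]

Here $h_k$ is the latent word associated with the observed word $w_k$, and $l_k$ is the context of latent words preceding $h_k$.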
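How an LWLM scores a word sequence can be illustrated with a small sketch. It assumes a bigram latent model (the latent context is just the previous latent word) and sums over all latent word sequences with the forward algorithm. The function name and the dictionary layout (`init`, `trans`, `emit`) are assumptions made for this example only.

```python
def lwlm_sentence_prob(words, latent_vocab, init, trans, emit):
    """Probability of `words` under a toy bigram LWLM.

    init[h]     : P(h) for the first latent word
    trans[g][h] : P(h | g), latent word transition
    emit[h][w]  : P(w | h), observed word given latent word
    """
    # forward[h] = P(w_1..w_k, h_k = h)
    forward = {h: init[h] * emit[h].get(words[0], 0.0) for h in latent_vocab}
    for word in words[1:]:
        forward = {
            h: emit[h].get(word, 0.0)
            * sum(forward[g] * trans[g][h] for g in latent_vocab)
            for h in latent_vocab
        }
    return sum(forward.values())


# Toy model with two latent words and two observed words.
latent_vocab = ["A", "B"]
init = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
prob = lwlm_sentence_prob(["x", "y"], latent_vocab, init, trans, emit)
```

Because each latent word emits every observed word with some probability, the model behaves as a soft class-based model: a word unseen after a given context can still receive probability through a shared latent word.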

4. LWLM mixture modeling

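\[
P(\boldsymbol{w} \mid \Theta)
= \sum_{\boldsymbol{h}} \sum_{\boldsymbol{m}} P(\boldsymbol{w} \mid \boldsymbol{h}, \boldsymbol{m}, \Theta)\, P(\boldsymbol{h} \mid \boldsymbol{m}, \Theta)\, P(\boldsymbol{m} \mid \Theta)
= \prod_{k=1}^{K} \sum_{h_k} \sum_{m=1}^{M} P(w_k \mid h_k, \theta_m)\, P(h_k \mid l_k, \theta_m)\, P(m \mid \Theta)
\]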

• LWLM mixture modeling can be viewed as the combination of an n-gram mixture model and an LWLM.

Each latent variable in an LWLM is represented as a specific word, so multiple LWLMs can share a common latent variable space.

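Gibbs sampling of the latent word $h_k$ and the model index $m_k$:

\[
P(h_k \mid \boldsymbol{w}, \boldsymbol{h}_{-k}, \boldsymbol{m}) \propto
P(w_k \mid h_k, \theta_{m_k}) \prod_{j=k}^{k+n-1} P(h_j \mid l_j, \theta_{m_j})
\]

\[
P(m_k \mid \boldsymbol{w}, \boldsymbol{h}, \boldsymbol{m}_{-k}) \propto
P(w_k \mid h_k, \theta_{m_k})\, P(h_k \mid l_k, \theta_{m_k})\, P(m_k \mid \boldsymbol{m}_{-k})
\]

Mixture weight estimate from $S$ sampled model index sequences $\boldsymbol{m}^{(1)}, \dots, \boldsymbol{m}^{(S)}$:

\[
P(m \mid \Theta)
= \frac{1}{S} \sum_{s=1}^{S} P(m \mid \boldsymbol{m}^{(s)})
= \frac{1}{S} \sum_{s=1}^{S}
\frac{c(m; \boldsymbol{m}^{(s)}) + \alpha}
     {\sum_{m'=1}^{M} c(m'; \boldsymbol{m}^{(s)}) + M\alpha}
\]

Here $n$ is the n-gram order, $c(m; \boldsymbol{m}^{(s)})$ is the number of positions assigned to model $m$ in the $s$-th sample, and $\alpha$ is a hyperparameter.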

Under the Bayesian criterion, the optimized mixture weights are estimated by Monte Carlo integration over multiple sampled model index sequences.

Gibbs sampling is used to assign a latent word sequence and a model index sequence to the development set.

• To optimize the mixture weights, we use the Bayesian criterion and a sampling procedure that is compatible with LWLM training.
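The weight estimation by Gibbs sampling and Monte Carlo averaging can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the latent word sequence is held fixed, so only model indices are resampled; a symmetric Dirichlet prior with hyperparameter `alpha` is assumed; and `likelihoods[k][m]` is a hypothetical precomputed score of development position `k` under model `m`.

```python
import random


def gibbs_mixture_weights(likelihoods, alpha=1.0, num_samples=100,
                          burn_in=20, seed=0):
    """Average smoothed model-index counts over Gibbs samples."""
    rng = random.Random(seed)
    num_positions = len(likelihoods)
    num_models = len(likelihoods[0])
    # Random initial model index for every position.
    assignment = [rng.randrange(num_models) for _ in range(num_positions)]
    counts = [0] * num_models
    for m in assignment:
        counts[m] += 1
    weight_sums = [0.0] * num_models
    denom = num_positions + num_models * alpha
    for sweep in range(burn_in + num_samples):
        for k in range(num_positions):
            counts[assignment[k]] -= 1  # remove position k from the counts
            # Score of each model: likelihood times smoothed count of others.
            scores = [likelihoods[k][m] * (counts[m] + alpha)
                      for m in range(num_models)]
            assignment[k] = rng.choices(range(num_models), weights=scores)[0]
            counts[assignment[k]] += 1
        if sweep >= burn_in:
            # Monte Carlo accumulation of the weight implied by this sample.
            for m in range(num_models):
                weight_sums[m] += (counts[m] + alpha) / denom
    return [s / num_samples for s in weight_sums]
```

Averaging over many sampled index sequences, rather than committing to a single point estimate, is what distinguishes this from the ML-based EM estimate used for n-gram mixtures.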

[Figure: LWLM mixture model graphical model, with a latent variable sequence, an observed word sequence, and a model index sequence.]

We introduce a novel language model adaptation method based on a mixture of latent words language models.

6. Experiments

• Conventional: n-gram mixture modeling

Model merging is conducted in the observed word space.

Mixture weights are determined by the maximum likelihood (ML) criterion.

• Proposed: LWLM mixture modeling

Model merging is conducted in the latent variable space.

Mixture weights are determined by the Bayesian criterion.

1. Background

• Techniques for building effective language models from limited target-domain data are needed because large amounts of domain-specific data are often unavailable.

(1) Robust modeling: smoothing, dimensionality reduction

• Realize high performance even when the target-domain data is restricted.

(2) Domain adaptation: mixture modeling

• Specialize in the target domain using out-of-domain data.

Problems of n-gram mixture modeling

• Out-of-domain LMs often fail to provide adequate adaptation performance because the words they cover often differ from those in the target-domain LM.

Solution

• Conduct model merging in a latent variable space, the same space that is used for robust modeling.

5. Implementation

• N-gram mixture model

Approximately expressed as a single back-off n-gram model.

A hierarchical Pitman-Yor LM is used as the n-gram LM.

• LWLM and LWLM mixture model

An approximate model is constructed by randomly generating text data according to the model's stochastic process and training a standard back-off n-gram model on the generated text.
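The approximation step for LWLMs can be sketched as follows: sample text from the latent-word generative process, then estimate an n-gram model from the sample. The sketch assumes a bigram latent model and, for brevity, uses unsmoothed maximum-likelihood bigram estimates where a real system would train a smoothed back-off model. All function and variable names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_text(init, trans, emit, length, rng):
    """Generate `length` observed words from a toy bigram LWLM."""
    words = []
    latent = rng.choices(list(init), weights=list(init.values()))[0]
    for _ in range(length):
        # Emit an observed word from the current latent word.
        word_dist = emit[latent]
        words.append(rng.choices(list(word_dist),
                                 weights=list(word_dist.values()))[0])
        # Move to the next latent word.
        next_dist = trans[latent]
        latent = rng.choices(list(next_dist),
                             weights=list(next_dist.values()))[0]
    return words


def train_bigram(words):
    """Maximum-likelihood bigram model P(cur | prev) from sampled text."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        pair_counts[prev][cur] += 1
    return {
        prev: {w: c / sum(counter.values()) for w, c in counter.items()}
        for prev, counter in pair_counts.items()
    }


init = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.8, "y": 0.2}, "B": {"y": 0.5, "z": 0.5}}
sampled = sample_text(init, trans, emit, 500, random.Random(0))
approx_model = train_bigram(sampled)
```

The resulting n-gram model contains no latent variables, so it can be used in a standard decoder without summing over latent words at decoding time.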

Data set      Domain          Corpus                       Size
Training 1    target domain   CSJ academic lecture         3.5M words
Training 2    out-of-domain   CSJ extemporaneous lecture   3.8M words
Development   target domain   CSJ academic lecture         28K words
Test 1        target domain   CSJ academic lecture         28K words
Test 2        out-of-domain   CSJ extemporaneous lecture   20K words

Decoder: VoiceRex (WFST-based)

Acoustic model: context-dependent DNN-HMM, 7 hidden layers of 2048 nodes

• Evaluate both robust modeling and domain adaptation in terms of perplexity (PPL) and word error rate (WER).
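For reference, the two evaluation measures can be computed as in the sketch below. It uses the standard definitions (perplexity as the exponential of the negative mean log probability, WER as word-level edit distance divided by reference length); the function names are illustrative and not taken from the evaluation tools used in the experiments.

```python
import math


def perplexity(word_probs):
    """Perplexity from per-word model probabilities (all must be > 0)."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


def word_error_rate(reference, hypothesis):
    """WER in percent via word-level Levenshtein distance."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return 100.0 * prev[-1] / len(reference)
```

A model that assigns probability 0.25 to every word has perplexity 4, and one substituted word in a three-word reference gives a WER of about 33.3%.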

                                                               Test 1            Test 2
Model                                                          PPL     WER(%)    PPL      WER(%)
N-gram constructed from Training 1                             62.85   21.98     183.38   32.51
LWLM constructed from Training 1                               62.34   21.85     165.87   31.43
N-gram mixture model constructed from each training data set   64.19   20.34     178.71   26.56
LWLM mixture model constructed from each training data set     60.19   19.36     164.46   25.34

Using the out-of-domain training data yields further improvement over using only the target-domain training data.

The proposed LWLM mixture modeling improves on n-gram mixture modeling for both the target-domain and the out-of-domain data.
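The size of the gain over n-gram mixture modeling can be read off the reported WER values as a relative reduction. The short calculation below uses only the numbers in the results above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, proposed):
    """Relative reduction in percent when going from baseline to proposed."""
    return 100.0 * (baseline - proposed) / baseline


# WER(%) of the n-gram mixture model vs. the LWLM mixture model.
reported_wer = {"Test 1": (20.34, 19.36), "Test 2": (26.56, 25.34)}
reductions = {
    name: round(relative_reduction(ngram_mix, lwlm_mix), 1)
    for name, (ngram_mix, lwlm_mix) in reported_wer.items()
}
print(reductions)
```

This gives a relative WER reduction of roughly 4.8% on Test 1 and 4.6% on Test 2.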

Figure labels: n-gram mixture model (model index sequence, observed word sequence); LWLM (latent variable sequence, observed word sequence); LWLM mixture model (latent variable sequence, observed word sequence, model index sequence).
