Latent Words Recurrent Neural Network
Language Models
Ryo Masumura, Taichi Asami, Takanobu Oba,
Hirokazu Masataki, Sumitaka Sakauchi, Akinori Ito
NTT Media Intelligence Laboratories, NTT Corporation
‡ Graduate School of Engineering, Tohoku University
Introduction
Focus on approaches intended to
improve language model (LM) structure
Back-off N-gram models
• Problems: locality of context information, data sparseness
→ Advanced approaches
Related Works
Discriminative models
• Maximum Entropy LM [Rosenfeld+, 1996]
• Decision Tree LM [Potamianos+, 1998]
• Random Forest LM [Xu+, 2004]
• Neural Network LM [Bengio+, 2003][Schwenk+, 2007]
• Recurrent Neural Network LM [Mikolov+, 2010]
Generative models
• Hierarchical Pitman-Yor LM [Teh, 2006]
• Bayesian Class-based LM [Su, 2011]
• Dirichlet Class LM [Chien+, 2011]
• Latent Words LM [Deschacht+, 2011]
Overview
Propose a novel advanced language model: the Latent Words Recurrent Neural Network LM
• Combine the Recurrent Neural Network LM (a discriminative model) and the Latent Words LM (a generative model) to offset the faults of each
Agenda
1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
Recurrent Neural Network LM [Mikolov+, 2010]
• Long-range information can be flexibly taken into consideration
• The context information can be represented as a continuous space vector
[Figure: observed word space; recurrent network with input(k), hidden(k-1), hidden(k), and output(k)]
Probability estimation of the k-th word:
P(W) = \prod_{k=1}^{K} P(w_k | w_{k-1}, s_{k-1}, \theta_{rnn})
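To make the factorization above concrete, here is a minimal numpy sketch of an Elman-style RNNLM step. It is illustrative only: the vocabulary size, hidden dimension, and random weights are assumptions, not the authors' setup.

```python
import numpy as np

# Minimal Elman-style RNNLM sketch: P(w_k | w_{k-1}, s_{k-1}).
V, H = 10000, 200                        # assumed vocabulary size and hidden units
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V))   # input (one-hot previous word) -> hidden
W = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden (recurrence)
O = rng.normal(scale=0.1, size=(V, H))   # hidden -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnlm_log_prob(word_ids):
    """Return log P(W) = sum_k log P(w_k | w_{k-1}, s_{k-1})."""
    s = np.zeros(H)                      # initial hidden state
    log_p, prev = 0.0, word_ids[0]       # treat the first word as given (sentence start)
    for w in word_ids[1:]:
        s = np.tanh(U[:, prev] + W @ s)  # s_k = f(w_{k-1}, s_{k-1})
        p = softmax(O @ s)               # distribution over the next word
        log_p += np.log(p[w])
        prev = w
    return log_p
```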
Latent Words LM [Deschacht+, 2011]
• Flexible attributes help us to mitigate data sparseness efficiently
• Robustly performs not only in in-domain tasks but also in out-of-domain tasks
[Figure: observed word space (w_{k-2} ... w_{k+2}) and latent variable space (h_{k-2} ... h_{k+2})]
P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lw}) P(h_k | h_{k-n+1}^{k-1}, \theta_{lw})
Transition probability distribution
• n-gram model of latent variables
• Each latent variable is expressed as a specific word that can be selected from the entire vocabulary
Emission probability distribution
• Unigram distribution for each latent variable
• Soft class modeling: each observed word can belong to every latent variable
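A toy sketch of the LWLM generative story described above. The transition and emission tables here are made-up placeholders only; a real LWLM estimates them from data, and both latent and observed symbols are drawn from the full vocabulary.

```python
import random
from collections import defaultdict

# Toy LWLM: trigram transition over latent words, unigram emission per latent word.
transition = defaultdict(lambda: {"the": 0.5, "a": 0.5})    # P(h_k | h_{k-2}, h_{k-1}), placeholder
emission = defaultdict(lambda: {"the": 0.7, "this": 0.3})   # P(w_k | h_k), placeholder

def sample(dist):
    r, acc = random.random(), 0.0
    for sym, p in dist.items():
        acc += p
        if r <= acc:
            return sym
    return sym

def generate(length, h_prev=("<s>", "<s>")):
    """Generate (latent word, observed word) pairs from the LWLM generative story."""
    out = []
    for _ in range(length):
        h = sample(transition[h_prev])   # latent word from the n-gram transition
        w = sample(emission[h])          # observed word from the soft-class emission
        out.append((h, w))
        h_prev = (h_prev[1], h)
    return out
```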
Usages for ASR
For one pass decoding
• RNNLM: N-gram approximation [Deoras+, 2011]
• LWLM: N-gram approximation [Masumura+, 2013]
• A lot of text data is generated by random sampling, and n-gram models are constructed from the generated data
• The approximated model can be used in a WFST decoder
For rescoring
• RNNLM: generative probability of the recognition hypothesis [Mikolov+, 2010]
• LWLM: Viterbi approximation [Masumura+, 2013]
• Joint probability between the optimal latent variable sequence and the recognition hypothesis
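The n-gram approximation used for one pass decoding can be sketched as follows. This is a hedged illustration: `sample_sentence` stands for any sampler over the source LM (RNNLM or LWLM) and is an assumed interface; a real system would also apply smoothing before building the WFST.

```python
from collections import Counter

def ngram_approximation(sample_sentence, num_sentences=1_000_000, n=3):
    """Approximate an LM that can only be sampled by a count-based n-gram model:
    draw many sentences, then collect n-gram counts from the generated text."""
    counts, context_counts = Counter(), Counter()
    for _ in range(num_sentences):
        sent = ["<s>"] * (n - 1) + sample_sentence() + ["</s>"]
        for i in range(n - 1, len(sent)):
            ngram = tuple(sent[i - n + 1:i + 1])
            counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    # Maximum-likelihood estimates; real systems would add smoothing (e.g. Kneser-Ney).
    return {g: c / context_counts[g[:-1]] for g, c in counts.items()}
```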
Advantages and disadvantages
RNNLMs
• Can capture long-range context information and offer strong performance
• Cannot explicitly capture the hidden relationship behind the observed word sequence since the concept of a latent variable space is not present
LWLMs
• Can consider the latent variable sequence behind the observed word sequence
• Cannot take into account the long-range relationship between latent variables since the space is modeled as a simple n-gram structure
Agenda
1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
Our idea
Combine RNNLM and LWLM to offset the faults of each
• RNNLM: can support long-range relationship modeling
• LWLM: can support latent variable space modeling
Latent Words Recurrent Neural Network LM
• Has a latent variable space, as in the LWLM, and the latent variable space is modeled using an RNNLM
[Figure: observed word space (w_{k-2} ... w_{k+2}) and latent variable space (h_{k-2} ... h_{k+2}) connected through a recurrent structure]
P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lrn}) P(h_k | h_{k-1}, s_{k-1}, \theta_{lrn})
Transition probability distribution
• RNNLM of latent variables
Emission probability distribution
• Each observed word can belong to every latent variable, as in the LWLM
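A minimal sketch of how this factorization can be evaluated for one given latent word sequence (the joint probability later used for rescoring). `emission_log_prob` and `latent_rnn_step` are hypothetical interfaces introduced only for illustration; exact P(W) would require summing over all latent sequences.

```python
def lwrnnlm_joint_log_prob(words, latents, emission_log_prob, latent_rnn_step):
    """Joint log probability of an observed word sequence and ONE latent word
    sequence under the LWRNNLM factorization:
        log P(W, H) = sum_k [ log P(w_k | h_k) + log P(h_k | h_{k-1}, s_{k-1}) ]
    `emission_log_prob(w, h)` returns log P(w | h); `latent_rnn_step(h_prev, state)`
    returns (dict of transition log-probs over latent words, new RNN state)."""
    state, h_prev, total = None, "<s>", 0.0
    for w, h in zip(words, latents):
        trans_logp, state = latent_rnn_step(h_prev, state)  # RNN over latent words
        total += trans_logp[h] + emission_log_prob(w, h)
        h_prev = h
    return total
```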
Relationship to RNNLM and LWLM
• LWRNNLM can be considered as a soft-class RNNLM with a vast latent variable space
• LWRNNLM can be considered as an LWLM that uses the RNN structure for latent variable modeling
LWRNNLM: P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lrn}) P(h_k | h_{k-1}, s_{k-1}, \theta_{lrn})
LWLM: P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k | h_k, \theta_{lw}) P(h_k | h_{k-n+1}^{k-1}, \theta_{lw})
RNNLM: P(W) = \prod_{k=1}^{K} P(w_k | w_{k-1}, s_{k-1}, \theta_{rnn})
Parameter inference
It is impossible to estimate LWRNNLMs directly from training data, so we propose an alternative inference procedure:
1. Preliminarily train an LWLM and use it to decode the latent word assignment of the training data
2. Use the decoded latent word assignment to estimate an RNN structure for the transition probability; the emission probability is diverted from the LWLM
[Figure: inference flow starting from the training data]
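The two-step procedure might look like the following sketch. All helper functions and the `emission` attribute are hypothetical interfaces used only to illustrate the data flow, not the authors' implementation.

```python
def train_lwrnnlm(training_sentences, train_lwlm, decode_latent_words, train_rnnlm):
    """Two-step inference sketch:
    1. Train an LWLM and decode a latent-word assignment for the training data.
    2. Fit an RNNLM on the decoded latent-word sequences -> transition model;
       reuse (divert) the LWLM emission distributions unchanged."""
    lwlm = train_lwlm(training_sentences)                      # preliminary LWLM
    latent_corpus = [decode_latent_words(lwlm, s)              # step 1: latent assignment
                     for s in training_sentences]
    transition_rnn = train_rnnlm(latent_corpus)                # step 2: RNN over latent words
    emission = lwlm.emission                                   # diverted from the LWLM
    return transition_rnn, emission
```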
Usages for ASR
For one pass decoding
• N-gram approximation
• A lot of text data is generated by random sampling using the trained LWRNNLM, and n-gram models are constructed from the generated data
For rescoring
• Viterbi approximation
• Use the joint probability between the optimal latent variable sequence and the recognition hypothesis
• We sample several candidates of the latent variable sequence using the LWLM and re-evaluate them using the LWRNNLM
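A sketch of this rescoring procedure under the Viterbi-style approximation. Both helper callables are hypothetical interfaces (the second one has the signature of the joint-probability sketch above, with the model interfaces already bound); in practice the resulting LM score would be combined with the acoustic score.

```python
def lwrnnlm_rescore(hypothesis_words, sample_latents_with_lwlm,
                    joint_log_prob, num_candidates=20):
    """Draw candidate latent-word sequences for a recognition hypothesis from the
    LWLM, re-evaluate them with the LWRNNLM joint probability, and use the best
    candidate's score as the (approximate) LM score of the hypothesis."""
    best = float("-inf")
    for _ in range(num_candidates):
        latents = sample_latents_with_lwlm(hypothesis_words)   # candidate latent sequence
        score = joint_log_prob(hypothesis_words, latents)      # re-evaluate with LWRNNLM
        best = max(best, score)
    return best
```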
Agenda
1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
Experimental conditions
ASR evaluation on the Corpus of Spontaneous Japanese (CSJ)
Train: CSJ, 2,672 lectures (700M words)
Valid (In-Domain): CSJ, 10 lectures (20K words)
Test (In-Domain): CSJ, 10 lectures (20K words)
Test (Out-Of-Domain): voice mail data (20M words)
Decoder: VoiceRex (WFST-based)
Acoustic model: DNN-HMM with 8 layers
Methods
MKN3g: Modified Kneser-Ney 3-gram LM (baseline)
RNN: Recurrent neural network LM
RNN3g: 3-gram approximation of RNN
LW: Latent words LM
LW3g: 3-gram approximation of LW
LRN: Latent words recurrent neural network LM
LRN3g: 3-gram approximation of LRN
Evaluation 1: within one pass decoding (○○3g)
Evaluation 2: including n-best rescoring (RNN, LW, LRN)
Results within one pass decoding (In-domain)
[Bar chart: WER (%) for MKN3g, RNN3g, LW3g, LRN3g, RNN3g+LRN3g, LW3g+LRN3g, and ALL3g (MKN3g+RNN3g+LW3g+LRN3g); plotted values: 26.2, 24.8, 24.7, 24.7, 24.5, 24.3, 23.4]
• ALL3g achieves a 1.5 point improvement compared with MKN3g
• A single use of LRN3g was not powerful
• LRN3g combined with RNN3g or LW3g was effective compared with RNN3g or LW3g alone
Results within one pass decoding (Out-of-domain)
[Bar chart: WER (%) — MKN3g 32.3, RNN3g 32.0, LW3g 30.4, LRN3g 30.0, RNN3g+LRN3g 29.8, LW3g+LRN3g 29.7, ALL3g 29.3]
• LRN3g achieves higher performance than the conventional methods
• ALL3g (MKN3g+RNN3g+LW3g+LRN3g) performed strongly compared to MKN3g
Results including n-best rescoring
In-Domain [Bar chart: WER (%) — one pass: MKN3g 24.8, ALL3g 23.4; rescoring ALL3g with RNN, LW, and LRN combinations (ALL3g+RNN, ALL3g+LW, ALL3g+LW+LRN, ALL3g+RNN+LW+LRN) further reduces WER to 23.3-23.0]
Out-Of-Domain [Bar chart: WER (%) — one pass: MKN3g 32.3, ALL3g 29.3; rescoring ALL3g with RNN, LW, and LRN combinations further reduces WER to 29.1-28.6]
• Viterbi approximation of LW and LRN is comparable to RNN-based rescoring
• Each rescoring model was effective and has different characteristics from its n-gram approximation