(1)

Latent Words Recurrent Neural Network Language Models

Ryo Masumura, Taichi Asami, Takanobu Oba,
Hirokazu Masataki, Sumitaka Sakauchi, Akinori Ito

NTT Media Intelligence Laboratories, NTT Corporation

Graduate School of Engineering, Tohoku University

(2)

Introduction

Focus on approaches intended to improve language model (LM) structure.

Problems of back-off N-gram models:
• Locality of context information
• Data sparseness

Advanced approaches address these problems.

(3)

Related Works

Discriminative models:
• Maximum Entropy LM [Rosenfeld+, 1996]
• Decision Tree LM [Potamianos+, 1998]
• Random Forest LM [Xu+, 2004]
• Neural Network LM [Bengio+, 2003][Schwenk+, 2007]
• Recurrent Neural Network LM [Mikolov+, 2010]

Generative models:
• Hierarchical Pitman-Yor LM [Teh, 2006]
• Bayesian Class-based LM [Su, 2011]
• Dirichlet Class LM [Chien+, 2011]
• Latent Words LM [Deschacht+, 2011]

(4)

Overview

Recurrent Neural Network LM (discriminative model) + Latent Words LM (generative model)
→ Latent Words Recurrent Neural Network LM

Propose a novel advanced language model:
• Combine RNNLM and LWLM to offset the faults of each

(5)

Agenda

1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions

(6)

Recurrent Neural Network LM [Mikolov+, 2010]

• Long-range information can be flexibly taken into consideration
• The context information can be represented as a continuous-space vector

[Figure: observed word space modeled by a recurrent network with input(k), hidden(k-1), hidden(k), and output(k) layers]

Probability estimation of the k-th word is conditioned on the previous word and the previous hidden-layer state s_{k-1}:

P(W) = \prod_{k=1}^{K} P(w_k \mid w_{k-1}, s_{k-1}, \Theta_{\mathrm{rnn}})
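To make the recurrence concrete, here is a minimal numpy sketch of an Elman-style RNNLM step computing P(w_k | w_{k-1}, s_{k-1}); the layer sizes, parameter names, and toy sentence are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of one RNNLM step and of the
# sentence probability as a product of per-word probabilities.
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 64                      # vocabulary size, hidden-layer size (assumed)
E = rng.normal(0, 0.1, (V, H))       # input word embeddings
W_hh = rng.normal(0, 0.1, (H, H))    # recurrent weights
W_ho = rng.normal(0, 0.1, (H, V))    # hidden-to-output weights

def rnnlm_step(prev_word_id, prev_state):
    """One step: returns P(. | w_{k-1}, s_{k-1}) and the new hidden state s_k."""
    s_k = np.tanh(E[prev_word_id] + prev_state @ W_hh)
    logits = s_k @ W_ho
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()             # softmax over the vocabulary
    return probs, s_k

# log P(W) = sum over k of log P(w_k | w_{k-1}, s_{k-1})
sentence = [3, 17, 42]               # toy word ids
s, logp, prev = np.zeros(H), 0.0, 0  # assume id 0 is the sentence-start symbol
for w in sentence:
    probs, s = rnnlm_step(prev, s)
    logp += np.log(probs[w])
    prev = w
print("log P(W) =", logp)
```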

(7)

Latent Words LM [Deschacht+, 2011]

• Flexible attributes help us to mitigate data sparseness efficiently
• Performs robustly not only in in-domain tasks but also in out-of-domain tasks

[Figure: latent variable space (latent words) generating the observed word space]

Transition probability distribution:
• n-gram model of latent variables
• A latent variable is expressed as a specific word that can be selected from the entire vocabulary

Emission probability distribution:
• Unigram for each latent variable
• Soft class modeling: each observed word belongs to each latent variable

P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta_{\mathrm{lw}}) \, P(h_k \mid h_{k-n+1}^{k-1}, \Theta_{\mathrm{lw}})

where h_k is the latent word at position k.
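The soft-class computation can be illustrated with a small numpy sketch; the emission table, the stand-in transition model, and the word ids are toy assumptions, not the trained LWLM from the paper.

```python
# A minimal sketch of the LWLM per-word probability:
# sum over latent words h of emission * transition (soft class modeling).
import numpy as np

rng = np.random.default_rng(1)
V = 1000                                      # latent words share the word vocabulary
emission = rng.dirichlet(np.ones(V), size=V)  # P(w | h): one unigram per latent word

def transition(context):
    """P(h | h_{k-n+1}..h_{k-1}); a random toy stand-in for a trained n-gram model."""
    return rng.dirichlet(np.ones(V))

def lwlm_word_prob(w, context):
    """P(w_k | context) = sum_h P(w_k | h) P(h | context)."""
    trans = transition(context)               # distribution over latent words
    return float(emission[:, w] @ trans)      # marginalize out the latent word

print(lwlm_word_prob(42, context=(3, 17)))
```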

(8)

Usages for ASR

For one-pass decoding:
• RNNLM: N-gram approximation [Deoras+, 2011]
• LWLM: N-gram approximation [Masumura+, 2013]
• A lot of text data is generated by random sampling, and n-gram models are constructed from the generated data (sketched below)
• The approximated model can be used in a WFST decoder

For rescoring:
• RNNLM: Generative probability of the recognition hypothesis [Mikolov+, 2010]
• LWLM: Viterbi approximation [Masumura+, 2013]
• Joint probability between the optimal latent variable sequence and the recognition hypothesis
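The n-gram approximation by random sampling can be sketched as follows; the toy source LM, the sampler, and the corpus size are illustrative assumptions, not the authors' tooling.

```python
# A minimal sketch of the n-gram approximation: generate a corpus by random
# sampling from the source LM, then estimate a count-based 3-gram model.
import random
from collections import Counter

def sample_sentence(step_fn, max_len=20):
    """Ancestral sampling: step_fn(history) returns {word: prob} for the next word."""
    sent, hist = [], ["<s>"]
    for _ in range(max_len):
        dist = step_fn(hist)
        w = random.choices(list(dist), weights=dist.values())[0]
        if w == "</s>":
            break
        sent.append(w)
        hist.append(w)
    return sent

def build_trigram_model(step_fn, num_sentences=10000):
    tri, bi = Counter(), Counter()
    for _ in range(num_sentences):
        words = ["<s>", "<s>"] + sample_sentence(step_fn) + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i-2:i+1])] += 1
            bi[tuple(words[i-2:i])] += 1
    # Maximum-likelihood trigram probabilities (a real system would smooth these)
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / max(bi[(w1, w2)], 1)

# Toy source LM standing in for a trained RNNLM / LWLM / LWRNNLM
toy_lm = lambda hist: {"hello": 0.4, "world": 0.4, "</s>": 0.2}
p = build_trigram_model(toy_lm)
print(p("<s>", "hello", "world"))
```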

(9)

Advantages and disadvantages

RNNLMs:
• Can capture long-range context information and offer strong performance
• Cannot explicitly capture the hidden relationship behind the observed word sequence, since the concept of a latent variable space is not present

LWLMs:
• Can consider the latent variable sequence behind the observed word sequence
• Cannot take into account the long-range relationship between latent variables, since the space is modeled as a simple n-gram structure

(10)

Agenda

1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions

(11)

Our idea

Combine RNNLM and LWLM to offset the faults of each:
• RNNLM can support long-range relationship modeling
• LWLM can support latent variable space modeling

(12)

Latent Words Recurrent Neural Network LM

Has a latent variable space, as does the LWLM, and the latent variable space is modeled using an RNNLM.

[Figure: latent variable space modeled by a recurrent network over latent words, generating the observed word space]

Transition probability distribution:
• RNNLM of latent variables

Emission probability distribution:
• Each observed word belongs to each latent variable, as in LWLMs

P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta_{\mathrm{lrn}}) \, P(h_k \mid h_{k-1}, s_{k-1}, \Theta_{\mathrm{lrn}})

where s_{k-1} is the hidden-layer state of the latent-word RNN.
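A minimal sketch of the LWRNNLM word probability follows. For simplicity it conditions on a fixed previous latent word, whereas the paper handles the latent-sequence uncertainty with sampling or Viterbi approximations; all parameter names and sizes are assumptions.

```python
# A minimal sketch: an RNN predicts the next *latent* word from the previous
# latent word and hidden state, and an LWLM-style emission maps the latent
# word to the observed word.
import numpy as np

rng = np.random.default_rng(2)
V, H = 1000, 64                               # vocabulary size, hidden size (assumed)
emission = rng.dirichlet(np.ones(V), size=V)  # P(w | h): reused from the LWLM
E = rng.normal(0, 0.1, (V, H))                # latent-word embeddings
W_hh = rng.normal(0, 0.1, (H, H))             # recurrent weights
W_ho = rng.normal(0, 0.1, (H, V))             # hidden-to-latent-word weights

def latent_rnn_step(prev_latent_id, prev_state):
    """Transition P(h_k | h_{k-1}, s_{k-1}) as an RNN over latent words."""
    s_k = np.tanh(E[prev_latent_id] + prev_state @ W_hh)
    logits = s_k @ W_ho
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), s_k

def lwrnnlm_word_prob(w, prev_latent_id, prev_state):
    """P(w_k | .) = sum_h P(w_k | h) P(h | h_{k-1}, s_{k-1})."""
    trans, s_k = latent_rnn_step(prev_latent_id, prev_state)
    return float(emission[:, w] @ trans), s_k

p, s = lwrnnlm_word_prob(42, prev_latent_id=7, prev_state=np.zeros(H))
print(p)
```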

(13)

Relationship to RNNLM and LWLM

LWRNNLM:

P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta_{\mathrm{lrn}}) \, P(h_k \mid h_{k-1}, s_{k-1}, \Theta_{\mathrm{lrn}})

• Can be considered as a soft-class RNNLM with a vast latent variable space
• Can be considered as an LWLM that uses the RNN structure for latent variable modeling

RNNLM:

P(W) = \prod_{k=1}^{K} P(w_k \mid w_{k-1}, s_{k-1}, \Theta_{\mathrm{rnn}})

LWLM:

P(W) = \prod_{k=1}^{K} \sum_{h_k} P(w_k \mid h_k, \Theta_{\mathrm{lw}}) \, P(h_k \mid h_{k-n+1}^{k-1}, \Theta_{\mathrm{lw}})

(14)

Parameter inference

It is impossible to estimate LWRNNLMs directly from training data, so we propose an alternative inference procedure (sketched after this list):

1. Preliminarily train an LWLM and use it to decode the latent word assignment of the training data
2. The decoded latent word assignment is used for estimating an RNN structure for the transition probability, while the emission probability is reused from the LWLM
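The two-step procedure might look like the following sketch; `lwlm.decode`, `lwlm.emission`, and `train_rnn` are hypothetical interfaces used only to show the data flow, not the authors' code.

```python
# A minimal sketch of the two-step inference: (1) decode latent word
# assignments of the training data with a pre-trained LWLM, (2) train an RNN
# over those latent words for the transition distribution, while the emission
# distribution is reused from the LWLM.

def train_lwrnnlm(training_sentences, lwlm, train_rnn):
    """
    lwlm      : pre-trained LWLM exposing `decode(sentence) -> latent word list`
                (e.g. via Gibbs sampling) and an `emission` table P(w | h)
    train_rnn : routine that fits an RNN LM on sequences of latent words
    Both interfaces are assumptions made for this sketch.
    """
    # Step 1: latent word assignment of the training data
    latent_corpus = [lwlm.decode(sent) for sent in training_sentences]

    # Step 2: RNN transition model over latent words; emission reused from LWLM
    transition_rnn = train_rnn(latent_corpus)
    return {"transition": transition_rnn, "emission": lwlm.emission}
```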

(15)

Usages for ASR

For one-pass decoding:
• N-gram approximation
• A lot of text data is generated by random sampling using the trained LWRNNLM, and n-gram models are constructed from the generated data

For rescoring:
• Viterbi approximation
• Use the joint probability between the optimal latent variable sequence and the recognition hypothesis
• We sample several candidates of the latent variable sequence using the LWLM and re-evaluate them using the LWRNNLM (see the sketch below)
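The rescoring step can be sketched as follows; `sample_latent` and `joint_logprob` are hypothetical interfaces, and the interpolation with acoustic and decoder scores is omitted.

```python
# A minimal sketch of rescoring: sample candidate latent word sequences for a
# recognition hypothesis with the LWLM, re-evaluate each candidate with the
# LWRNNLM, and keep the best joint log probability as the LM score.
import math

def rescore_hypothesis(hypothesis_words, lwlm, lwrnnlm, num_samples=20):
    """
    lwlm.sample_latent(words)            -> candidate latent word sequence (assumed)
    lwrnnlm.joint_logprob(words, latent) -> log P(words, latent) under LWRNNLM (assumed)
    """
    best = -math.inf
    for _ in range(num_samples):
        latent = lwlm.sample_latent(hypothesis_words)            # candidate assignment
        score = lwrnnlm.joint_logprob(hypothesis_words, latent)  # re-evaluate with LWRNNLM
        best = max(best, score)
    return best

def pick_best(nbest, lwlm, lwrnnlm):
    """N-best rescoring: pick the hypothesis with the best LWRNNLM score."""
    return max(nbest, key=lambda words: rescore_hypothesis(words, lwlm, lwrnnlm))
```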

(16)

Agenda

1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions

(17)

Experimental conditions

ASR evaluation on the Corpus of Spontaneous Japanese (CSJ)

Train: CSJ 2672 lectures (700M words)
Valid (In-Domain): CSJ 10 lectures (20K words)
Test (In-Domain): CSJ 10 lectures (20K words)
Test (Out-Of-Domain): Voice mail data (20M words)
Decoder: VoiceRex (WFST-based)
Acoustic model: DNN-HMM with 8 layers

(18)

Methods

MKN3g: Modified Kneser-Ney 3-gram LM (baseline)
RNN: Recurrent neural network LM
RNN3g: 3-gram approximation of RNN
LW: Latent words LM
LW3g: 3-gram approximation of LW
LRN: Latent words recurrent neural network LM
LRN3g: 3-gram approximation of LRN

Evaluation 1: within one-pass decoding (the ○○3g models)
Evaluation 2: including n-best rescoring (RNN, LW, LRN)

(19)

Results within one-pass decoding (In-domain)

[Bar chart: WER (%) for MKN3g, RNN3g, LW3g, LRN3g, RNN3g+LRN3g, LW3g+LRN3g, and ALL3g (MKN3g+RNN3g+LW3g+LRN3g), grouped into conventional, proposed, and combination methods]

• ALL3g achieves a 1.5-point improvement compared with MKN3g
• A single use of LRN3g was not powerful
• LRN3g combined with RNN3g or LW3g was effective compared with RNN3g or LW3g alone

(20)

Results within one-pass decoding (Out-of-domain)

[Bar chart: WER (%) for MKN3g, RNN3g, LW3g, LRN3g, RNN3g+LRN3g, LW3g+LRN3g, and ALL3g (MKN3g+RNN3g+LW3g+LRN3g), grouped into conventional, proposed, and combination methods]

• LRN3g achieves higher performance than the conventional methods
• ALL3g performed strongly compared to MKN3g

(21)

Results including n-best rescoring

[Two bar charts (In-Domain and Out-Of-Domain): WER (%) for MKN3g, ALL3g, ALL3g+RNN, ALL3g+LW, ALL3g+LW+LRN, and ALL3g+RNN+LW+LRN, with the one-pass and rescoring contributions shown separately]

• Each rescoring model was effective and has different characteristics from its n-gram approximation
• Viterbi approximation of LW and LRN is comparable to RNN-based rescoring

(22)

Agenda

1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions

(23)

Conclusions

Proposed a novel model, the LWRNNLM, by combining RNNLM and LWLM.

Results:
• A single use of LWRNNLM was not powerful in the in-domain task but could achieve high performance in the out-of-domain task
• LWRNNLM combined with RNNLM and LWLM was effective compared with the individual models in both one-pass decoding and rescoring

(24)

Agenda

1. Overview
2. Conventional methods: RNNLMs and LWLMs
3. Proposed methods: LWRNNLMs
4. Experiments
5. Conclusions
