Advances in Neural Information Processing Systems 8

(1)

Active Learning in Multilayer

Perceptrons

KenjiFukumizu

Information andCommunicationR&DCenter,RicohCo.,Ltd.

3-2-3,Shin-yokohama,Yokohama,222 Japan

E-mail: [email protected]

Abstract

Weproposeanactivelearningmethodwithhidden-unitreduction,

whichisdevisedsp eciallyformultilayerp erceptrons(MLP).First,

we review our activelearning metho d, and point out that many

Fisher-information-based metho dsapplied to MLPhavea critical

problem: the information matrix may be singular. To solve this

problem,wederivethesingularityconditionofaninformationma-

trix,andproposeanactivelearningtechniquethatisapplicableto

MLP.Itseectivenessisveriedthroughexperiments.

1 INTRODUCTION

Whenonetrainsalearningmachineusingasetofdatagivenbythetruesystem,its

abilitycan beimprovedifone selectsthetrainingdataactively. Inthis pap er,we

considertheproblemofactivelearningin multilayerperceptrons(MLP). First,we

reviewourmethodofactivelearning(Fukumizuelal.,1994),inwhichwepreparea

probabilitydistributionandobtaintrainingdataassamplesfrom thedistribution.

Thismethodologyleadsustoaninformation-matrix-basedcriterionsimilartoother

existingones (Fedorov,1972;Pukelsheim, 1993).

Activelearningtechniqueshaveb eenrecentlyusedwithneuralnetworks(MacKay,

1992;Cohn,1994). Ourmetho d,however,aswellasmanyotheroneshasacrucial

problem: therequiredinverseofaninformationmatrixmaynotexist(White,1989).

Weproposeanactivelearning techniquewhichisapplicableto three-layerpercep-

trons. Developing a theory on thesingularityof a Fisherinformation matrix, we

presentanactivelearningalgorithmwhichkeepstheinformationmatrixnonsingu-

lar. Wedemonstratetheeectivenessofthealgorithmthroughexperiments.

(2)

2.1 A CRITERIONOF OPTIMALITY

Wereviewthecriterionofstatisticallyoptimaltrainingdata(Fukumizuetal.,1994).

Weconsidertheregressionproblemin whichthetargetsystemmapsagiveninput

xtoyaccording to

y=f(x)+Z;

wheref(x)isadeterministicfunctionfromR L

toR M

,andZisarandomvariable

whose law is a normaldistribution N(0;

2

I

M ), (I

M

is the unit M 2M matrix).

Ourobjectiveistoestimatethetruefunctionf as accuratelyaspossible.

Letff(x;)gb eaparametricmo delforestimation. Weusethemaximumlikeliho o d

estimator(MLE)

^

fortrainingdataf(x ()

;y ()

)g N

=1

,whichminimizesthesumof

squared errors in this case. In theoretical derivations, we assume that the target

functionf isincludedin themo delandequal tof(1;

0 ).

Wemakeatrainingexamplebycho osingx ()

totry,observingtheresultingoutput

y ()

, andpairingthem. Theproblem ofactivelearning ishowto determineinput

data fx ()

g N

=1

to minimize the estimation error after training. Our approach is

a statistical one using a probability for training, r (x), and cho osing fx ()

g N

=1 as

independent samples from r(x) to minimize the exp ectation of the MSE in the

actualenvironment:

EMSE=E

f(x ()

;y ()

)g Z

Z

ky0f(x;

^

)k 2

p(yjx)dydQ

: (1)

Intheaboveequation,Qisthe environmentalprobabilitywhichgivesinputvectors

to thetruesystemin theactualenvironment,andE

f(x ()

;y ()

)g

meanstheexpec-

tation on trainingdata. Eq.(1), therefore, showsthe averageerror of thetrained

machinethatisusedasasubstituteofthetruefunctionintheactualenvironment.

2.2 REVIEW OFAN ACTIVE LEARNING METHOD

Usingstatisticalasymptotictheory,Eq.(1)isapproximatedasfollows:

EMSE= 2

+

2

N Tr

2

I(

0 )J

01

(

0 )

3

+O(N 03=2

); (2)

wherethematrixesI andJ are (Fisher)information matrixesdened by

I

ab

(x; )=

@f T

(x; )

@

a

@f(x; )

@

b

; I( )= Z

I(x; )dQ(x); J()= Z

I(x;)r (x)dx:

Theessentialpart ofEq.(2) is Tr[I(

0 )J

01

(

0

)], computed bythe unavailable pa-

rameter

0

. Wehaveprop osedapracticalalgorithminwhichwereplace

0 with

^

,

prepareafamilyofprobabilityfr (x;v)jv:paramatergtochoosetrainingsamples,

andoptimizevand

^

iteratively(Fukumizuetal., 1994).

ActiveLearning Algorithm

1. Selectaninitialtrainingdataset D

[0]

fromr(x;v

[0]

),andcompute

^

[0]

.

2. k:=1.

3. Computetheoptimalv=v

[k ]

tominimizeTr[I(

^

[k01]

)J 01

(

^

[k 01]

)].

(3)

k [k ] [k ]

D

[k 01]

and thenewdata.

5. ComputetheMLE

^

[k ]

basedonthetrainingdatasetD

[k ] .

6. k:=k+1andgoto3.

The ab ove metho d utilizes a probability to generate training data. It has the

advantage of making many data in one step compared to existing ones in which

only one data is chosen in each step, though their criterions are similar to each

other.

3 SINGULARITY OF AN INFORMATION MATRIX

3.1 A PROBLEM ON ACTIVELEARNING IN MLP

Hereafter, we focus on active learning in three-layer perceptrons with H hidden

units,N

H

=ff(x; )g. Themap f(x; )isdened by

f

i

(x;)= H

X

j=1 w

ij s(

L

X

k =1 u

jk x

k +

j )+

i

; (1iM); (3)

wheres(t)isthesigmoidalfunction: s(t)=1=(1+e 0t

).

Ouractivelearning metho d as well as manyother ones requires theinverseof an

information matrix J. The information matrix of MLP, however, is not always

invertible (White, 1989). Any statistical algorithms utilizing the inverse, then,

cannotbeapplieddirectly to MLP(Hagiwaraetal.,1993). Suchproblemsdonot

ariseinlinearmo dels,whichalmostalwayshaveanonsingularinformationmatrix.

3.2 SINGULARITY OFAN INFORMATIONMATRIX OFMLP

Thefollowingtheoremshowsthattheinformationmatrixofathree-layerp erceptron

issingularif andonly ifthenetworkhasredundant hiddenunits. Wecan deduce

thatiftheinformationmatrixissingular,wecanmakeitnonsingularbyeliminating

redundanthiddenunitswithoutchangingtheinput-outputmap.

Theorem 1 Assume r(x) is continuous and positiveat any x. Then, the Fisher

information matrix J is singular if and only if at least one of the following three

conditionsis satised:

(1)u

j :=(u

j1

;...;u

jL )

T

=0, forsome j.

(2)w

j :=(w

1j

;...;w

Mj )=0

T

,forsomej.

(3)Fordierentj

1 andj

2 ,(u

T

j1

;

j

1 )=(u

T

j2

;

j

2 )or(u

T

j1

;

j

1

)=0(u T

j2

;

j

2 ).

Therough sketchofthepro ofisshownb elow. Thecompletepro ofwillappearina

forthcomingpaper(Fukumizu,1996).

Roughsketchoftheproof. Weknoweasilythataninformationmatrixissingularif

andonlyiff

@f(x;)

@a g

a

arelinearlydependent. Thesuciencycanbeprovedeasily.

Toshowthenecessity,weshowthatthederivativesarelinearlyindep endentifnone

ofthethreeconditions issatised. Assumea linearrelation:

M

X

i=1 H

X

j=1

ij

@f

@w

ij +

M

X

i=1

i0

@f

@

i +

H

X

j=1 L

X

k =1

jk

@f

@u

jk +

H

X

j=1

j0

@f

@

j

=0: (4)

(4)

WecanshowthereexistsabasisofR ,hx ;...;x i,suchthatu

j

1x 6=0for

8

j, 8

l , and u

j

1 1x

(l)

+

j

1

6= 6(u

j

2 1x

(l)

+

j

2 ) for j

1 6= j

2 ,

8

l . We replace x in

eq.(4)byx (l)

t(t2R). Letm (l )

j :=u

j 1x

(l )

,S (l)

j

:=fz2Cjz=((2n+1) p

010

j )=m

(l )

j

; n2Zg, and D (l)

:=C0[

j S

(l )

j

. Thepointsin S (l )

j

arethe singularities

ofs(m (l)

j z+

j

). WedeneholomorphicfunctionsonD (l )

as

9 (l)

i

(z) :=

P

H

j=1

ij s(m

(l )

j z+

j )+

i0 +

P

H

j=1 P

L

k=1

jk w

ij s

0

(m (l)

j z+

j )x

(l )

k z

+ P

H

j=1

j0 w

ij s

0

(m (l)

j z+

j

); (1iM):

Fromeq.(4),wehave9 (l)

i

(t)=0forallt2R. Usingstandardargumentsonisolated

singularities ofholomorphicfunctions,we knowS (l)

j

areremovablesingularities of

9 (l)

i

(z),andnallyobtain

w

ij P

L

k=1

jk x

(l )

k

=0; w

ij

j0

=0;

ij

=0;

i0

=0:

Itiseasytosee

jk

=0. Thiscompletestheproof.

3.3 REDUCTION PROCEDURE

We intro duce the followingreduction procedure based on Theorem 1. Used dur-

ing BP training, it eliminates redundant hidden units and keeps the information

matrixnonsingular. Thecriterionofeliminationis very important,b ecauseexces-

siveelimination ofhiddenunits degradesthe approximationcapacity. Wepropose

an algorithm which doesnot increasethe mean squared error onaverage. Inthe

following,lets^

j :=s(^u

j 1x+^

j

)and"(N)=A=N fora p ositivenumb erA.

Reduction Procedure

1. If kw^

j k

2 R

(^s

j 0s(

^

j ))

2

dQ<"(N); theneliminatethejthhiddenunit,

and^

i

!^

i +w^

ij s(

^

j

)foralli.

2. If kw^

j k

2 R

(^s

j )

2

dQ<"(N); theneliminatethejthhiddenunit.

3. If kw^

j

2 k

2 R

(^s

j

2 0^s

j

1 )

2

dQ<"(N) fordierentj

1 andj

2 ,

theneliminatethej

2

thhiddenunitandw^

ij1

!w^

ij1 +w^

ij2

foralli.

4. If kw^

j2 k

2 R

(10^s

j2 0^s

j1 )

2

dQ<"(N) fordierentj

1 andj

2 ,

then eliminatethe j

2

th hidden unitand w^

ij1

! w^

ij1 0w^

ij2

; ^

i

! ^

i +

^ w

ij2

foralli.

FromTheorem1,weknowthatw^

j ,u^

j ,(^u

T

j2

;

^

j

2 )0(^u

T

j1

;

^

j

1 ),or(^u

T

j2

;

^

j

2 )+(^u

T

j1

;

^

j

1 )

can b e reduced to 0 if the information matrix is singular. Let

~

2 N

K

denote

thereduced parameter from

^

according to theaboveprocedure. The ab ovefour

conditionsare, then,givenbycalculating R

kf(x;

~

)0f(x;

^

)k 2

dQ.

We briey explain how the procedure keeps the information matrix nonsingular

anddoesnotincreaseEMSEinhighprobability. First,supposedetJ(

0

)=0,then

there exists K

0 2N

K

(K <H)such that f(x;

0

)=f(x;

K

0

) and detJ(

K

0 )6= 0

in N

K

. Theelimination ofhidden units uptoK, ofcourse, do esnotincreasethe

EMSE. Therefore, we haveonly to consider the case in which detJ(

0

) 6= 0 and

hiddenunitsareeliminated.

Suppose R

kf(x;

K

0

)0f(x;

0 )k

2

dQ>O (N 01

)foranyreducedparameter K

0 from

0

. Theprobabilityofsatisfying R

kf(x;

^

)0f(x;

~

)k 2

dQ<A=N isverysmallfor

(5)

probability. Next,suppose R

kf(x;

K

0

)0f(x;

0 )k

2

dQ=O (N 01

). Let

~

2N

K be

a reducedparameter madefrom

^

with thesame pro cedureas weobtain K

0 from

0

. Wewill showfora sucientlysmallA,

E h

R

kf(x;

^

K

)0f(x;

0 )k

2

dQ i

E h

R

kf(x;

^

)0f(x;

0 )k

2

dQ i

;

where

^

K

is MLE computed in N

K

. We write = ( (1)

; (2)

) in which (2)

is

changedto0inreduction,changingthecoordinatesystemifnecessary. TheTaylor

expansionandasymptotictheorygive

E h

R

kf(x;

^

K

)0f(x;

0 )k

2

dQ i

R

kf(x;

K

0

)0f(x;

0 )k

2

dQ+

2

N Tr[I

11 (

K

0 )J

01

11 (

K

0 )];

E h

R

kf(x;

~

)0f(x;

^

)k 2

dQ i

R

kf(x;

K

0

)0 f(x;

0 )k

2

dQ +

2

N Tr[I

22 (

K

0 )J

01

22 (

0 )];

whereI

ii andJ

ii

denotethelocalinformationmatrixesw.r.t.

(i)

(i=1;2). Thus,

E h

R

kf(x;

^

)0f(x;

0 )k

2

dQ i

0E h

R

kf(x;

^

K

)0f(x;

0 )k

2

dQ i

0E

h

R

kf(x;

~

)0f(x;

^

)k 2

dQ i

+

2

N Tr[I

22 (

K

0 )J

01

22 (

0 )]

0

2

N Tr[I

11 (

K

0 )J

01

11 (

K

0 )]+E

h

R

kf(x;

^

)0f(x;

0 )k

2

dQ i

:

Sincethesumofthelasttwotermsispositive,thel.h.sisp ositiveifE[

R

kf(x;

^

K

)0

f(x;

^

)k 2

dQ]<B=N forasucientlysmallB. Althoughwecannotknowthevalue

ofthisexpectation,wecanmaketheprobabilityofholdingthisenequalityveryhigh

bytakingasmallA.

4 ACTIVE LEARNING WITH REDUCTION

PROCEDURE

Thereduction pro cedurekeeps theinformationmatrixnonsingularandmakesthe

activelearningalgorithm applicableto MLPevenwithsurplushiddenunits.

ActiveLearning with Hidden Unit Reduction

1. SelectinitialtrainingdatasetD

0

fromr(x;v

[0]

), andcompute

^

[0]

.

2. k:=1,anddoREDUCTIONPROCEDURE.

3. Computethe optimalv =v

[k ]

to minimizeTr[I(

^

[k 01]

)J 01

(

^

[k 01]

)], using

thesteep estdescentmetho d.

4. Choose N

k

new training data from r(x;v

[k ]

) and let D

[k ]

be a union of

D

[k 01]

and thenewdata.

5. Compute the MLE

^

[k ]

based on the training data D

[k ]

using BP with

REDUCTIONPROCEDURE.

6. k:=k+1andgoto3.

TheBPwith reductionprocedureis applicablenotonly toactivelearning, but to

avarietyofstatisticaltechniquesthatrequiretheinverseofaninformationmatrix.

Wedonotdiscussitinthis pap er,however.

(6)

100 200 300 400 500 600 700 800 900 1000

The Number of Training Data

0.0000001 0.000001 0.00001

Mean Squared Error (log)

Learning Curve

200 400 600 800 1000

0 1 2 3 4

The number of hidden units

# of hidden units

200 400 600 800 1000

0.000001

Active Learning

Active Learning [Av-Sd,Av+Sd]

Passive Learning

Passive Learning [Av-Sd,Av+Sd]

Figure1: Active/PassiveLearning: f(x)=s(x)

5 EXPERIMENTS

We demonstrate theeect of the proposed activelearning algorithm throughex-

periments. Firstweusea three-layermodelwith1inputunit,3hiddenunits,and

1 output unit. The truefunction f is a MLP network with 1 hidden unit. The

information matrix is singular at

0

, then. The environmental probability, Q, is

a normal distributionN(0;4). We evaluatethe generalizationerror in the actual

environmentusingthefollowingmeansquarederrorofthefunctionvalues:

Z

kf(x;

^

)0f(x)k 2

dQ:

We set the deviation in the true system = 0:01. As a family of distributions

for trainingfr(x;v)g, a mixture mo del of4 normaldistributions is used. Ineach

step of activelearning, 100 new samples are added. A network is trained using

onlineBP,presentedwithalltrainingdata10000timesin eachstep,andoperated

thereduction procedure oncea 100 cyclesb etween5000th and10000th cycle. We

try30trainings changing theseed of randomnumb ers. Incomparison, wetraina

networkpassivelybasedontrainingsamplesgivenbytheprobabilityQ.

Fig.1showstheaveragedlearningcurvesofactive/passivelearningandthenumber

of hiddenunits in a typical learning curve. The advantage ofthe proposed active

learningalgorithmisclear. Wecan ndthat thealgorithmhasexpectedeectson

asimple,idealapproximationproblem.

Second, we apply the algorithm to a problem in which the true function is not

includedintheMLPmodel. WeuseMLPwith4inputunits,7hiddenunits,and1

outputunit. Thetruefunctionisgivenbyf(x)=erf(x

1

),whereerf(t)istheerror

function. Thegraphoftheerrorfunctionresemblesthatofthesigmoidalfunction,

whiletheynevercoincidebyanyanetransforms. Weset Q=N(0;252I

4 ). We

train a network actively/passively based on 10 data sets, and evaluate MSE's of

functionvalues. Otherconditions arethesame asthoseoftherstexperiment.

Fig.2 shows the averaged learning curves and the number of hidden units in a

typicallearningcurve. Wendthattheactivelearningalgorithmreducestheerrors

thoughthe theoreticalcondition isnot p erfectlysatised in this case. It suggests

therobustnessofouractivelearningalgorithm.

(7)

100 200 300 400 500 600 700 800 900 1000

0.00001

Learning Curve

200 400 600 800 1000

4 5 6 7 8

The number of hidden units

# of hidden units

200 400 600 800 1000

0.00001 0.0001

Active Learning Passive Learning

Figure2: Active/PassiveLearning: f(x)=erf(x

1 )

6 CONCLUSION

Wereviewstatisticalactivelearningmetho dsandp ointoutaproblemin theirap-

plicationtoMLP:therequiredinverseofaninformationmatrixdoesnotexistifthe

networkhas redundanthidden units. Wecharacterize thesingularity conditionof

aninformationmatrixandprop oseanactivelearningalgorithmwhichisapplicable

to MLP with any numb er of hidden units. The eectiveness of the algorithm is

veriedthroughcomputer simulations,even whenthetheoretical assumptionsare

notperfectlysatised.

References

D.A. Cohn. (1994)Neuralnetwork explorationusing optimalexperimentdesign.

In J. Cowanet al. (ed.), Advances in Neural Information Processing Systems 6,

679-686. SanMateo,CA:MorganKaufmann.

V.V. Fedorov. (1972)Theory ofOptimal Experiments. NY:AcademicPress.

K.Fukumizu. (1996)A RegularityConditionof theInformationMatrixofa Mul-

tilayerPerceptronNetwork. NeuralNetworks,to app ear.

K. Fukumizu, & S. Watanab e. (1994) Error Estimation and Learning Data Ar-

rangementforNeuralNetworks. Proc. IEEE Int. Conf. NeuralNetworks:777-780.

K. Hagiwara, N. Toda, & S. Usui. (1993) On the problem of applying AIC to

determinethestructure ofa layeredfeed-forwardneural network. Proc. 1993Int.

Joint Conf. Neural Networks:2263-2266.

D.MacKay. (1992)Information-basedobjectivefunctionsforactivedataselection,

NeuralComputation 4(4):305-318.

F.Pukelsheim. (1993)Optimal Design ofExperiments. NY:JohnWiley&Sons.

H. White. (1989)Learning in articial neuralnetworks: A statistical persp ective

NeuralComputation 1(4):425-464.