An efficient learning scheme for pattern classification using multi-layer feed-forward neural networks

(1)

An Ecient Learning Scheme for Pattern Classication

Us-ing Multi-layer Feed-forward Neural Networks

33

KanadKeeni y

,KenjiNakayama yy

,and HiroshiShimo daira yyy

,

SUMMARY Thisstudyhighlightsonthesubjectofweight initializationinmulti-layerfeed-forwardnetworks.Trainingdata isanalyzedandthenotionofcriticalpointisintroducedfor de-terminingtheinitialweightsandthenumberofhiddenunits. Theprop osedmetho dhasb eenappliedtoarticialdata, pub-liclyavailablecancerdatabaseandDevanagaricharacters. The exp erimental resultsofarticialdata showthatthe prop osed metho dtakesalmost1/3ofthetrainingtimerequiredfor stan-dardbackpropagation.Inordertoverifytheeectivenessofthe prop osedmetho d,standardbackpropagationwherethelearning startswithrandominitialweightshasb eenappliedtothecancer databaseandDevanagaricharacters. Theexp erimentalresults indicatethetheprop osedweightinitializationmetho dresultsin b ettergeneralization.

key words: Neural networks, pattern classication, initial weights,decisionboundary,criticalpoints

1. Intro duction

Neuralnetworksarchitectureshavesparkedofgreat in-terestinrecentyearsb ecauseoftheirintriguing learn-ingcapabilities.Severallearningalgorithmshaveb een developed for trainingthe networksand out of them Back Propagation [1] is probably most widely used. The reason for thepopularity isthe underlying sim-plicityandrelativep owerofthealgorithm. Itsp ower derivesfromthefactthatunlikeitsprecursors,thep er-ceptionlearningrule[2],andtheWidrow-Holearning rule[3],itcanb eemployedfortrainingnonlinear net-works of arbitrary connectivity. Since suchnetworks are often required for real-worldapplications, such a learning pro cedure iscritical. However thestandard back-propagationhasthefollowingdrawbacks.

1. Thelearningpro cedureistimeconsuming. Train-ingalwaysstartsfromrandominitialweightsand theweightadjustmentpro cedureisveryslow

2. Itisnotclearastohowtoconstructanappropriate networkarchitecture.Thenumberofhiddenunits

y

TheauthoriswiththeFacultyofInformationSystems andQuantitativeSciencesofNanzanUniversity

yy

TheauthoriswiththeFacultyElec. Eng.andComp. ScienceofKanazawaUniversity

yyy

TheauthoriswiththeFacultyofInformationScience ofJapanAdvancedInstituteofScienceandTechnology

33

Apartofthispap erwaspresentedattheInternational JointConferenceonNeuralNetworks(IJCNN'99)and In-ternational ConferenceonComputationalIntelligenceand

for a particular problem can onlyb edetermined bytrialanderror

In caseof 1., it iswidely known that theinitial weightsofthenetworkwouldlargelyeectthe general-izationp erformance. Forexample,twonetworkswith samearchitecture,whentrainedwithtotally dierent initialweightswouldpro ducedierentresults. Several researchershavedesignedsystemsinwhichweightsare initializedsothattheinitialactivityofthenetwork cor-resp ondstothesuccessiverules,thatmaycomefroman exp ert. Wilsonhasprop osedFastBPN[4],wherethe initialparametersaredeterminedbyestimatingthe sig-nalrankwithgenerallikelihoo dratiotest(GLRT)and thesingularvaluedecomp osition(SVD)oftheGLRT covariancematrix. Howeverthedisadvantageof their metho disthatthenumberofhiddenunitscannot ex-ceedtheinputfeaturedimension. Inordertotackle1, mostoftheresearchershavemainlyfo cusedon improv-ingtheoptimizationpro cedurebydynamically adapt-ingthelearningrates[5]-[6]. Ontheotherhandithas b eenshownin [7]that trainingdataselection isalso imp ortant.

Incaseof2.,theproblemhasb eentreatedin var-ious ways. One mostcommon approachhas b eento start with a large number of hidden units and then prunethenetworkonceitistrained[8],[9]. However, pruningdo es not alwaysimprovegeneralization. An-other strategyfor nding a minimal architecture has b eentoaddorremoveunitssequentially[10],[11].

Itisalsowellknownthatneuralnetworksdonot make anyassumption ab out theprobability distribu-tion functions of data and can solve complex prob-lems witharbitrary decisionb oundary. Therefore, it isdesirabletoinvestigatethelearningcharacteristicsof thenetworksforndinganestimateaboutthedecision b oundary. However,there hasb een nosignicant re-searchforndinginitialweightsand minimalnetwork architecturebyexploitingthecharacteristicsofdecision b oundary.

Inthepresentstudy,theab ovementioned draw-backsof backpropagation has b een carefully investi-gatedandametho dhasb eenprop osedfordetermining theinitialweightsforinputtohiddenlayerand estima-tionofhiddenunitsautomatically.

(2)

Multi-layerNeuralNetworks(MLNN). Thethird sec-tionpresentstheautomaticmetho dforgenerating ini-tialweightsforinputtohiddenlayerconnections.The fourth section describ es the database. Exp erimental resultsofarticial,realworlddataareprovidedinthe Fifthsection.Finallythelastsectionisdevotedto con-clusionandfurtherresearches.

2. Pattern classication characteristics of MLNN

Inanypatternclassicationsystem, patternmapping orpatternclassicationisequivalenttodividinganN dimensionalspacewhere thepatterns aredistributed. Incaseofmulti-layerneuralnetworks(MLNN),thisN dimensional spaceisdividedbyforminghyper-planes with the help of synaptic weights of nonlinear neu-rons. MLNNsdonotmakeanyassumptionab outthe probabilitydistribution functionsof thetrainingdata and can solvecomplex problems witharbitrary deci-sion b oundary. The degreeof freedom inplacingthe decisionb oundaryisveryhigh. Therefore,neural net-works areconsidered to b ea go o dchoicefor pattern classicationtasks.

Anothermostimp ortantasp ectofneuralnetworks islearn-ability.Incaseofsup ervisedlearning,networks can nd optimal synaptic weights through learning. However,sinceneuralnetworksarenonlinearsystems andgradientdescentisusedtondasetofweightsthat optimizethep erformanceforaparticulartask,thereis alwaysa possibilityof getting stuckinlo cal minima. Therefore,globalminimaoroptimalsolutionisnot al-waysguaranteed. Furthermore,thelearningprocessis timeconsuminganditishighlydep endentonthe prob-lemthatistob esolved.

Ifweassumethattherearenooverlapamongthe distributionoftrainingpatterns,thenpatternmapping canb ecategorizedinthefollowingclasses.

1. kXi0Xjkissmall V kY i 0Yjkissmall 2. kX i 0X j kissmall V kY i 0Y j kislarge 3. kX i 0X j kislarge V kY i 0Y j kissmall 4. kX i 0X j kislarge V kY i 0Y j kislarge Here,X i and X j b elong toclass! 1 and! 2 ,Y i andY j

arethecorresp ondingoutputvectors,andk1k standsfortheEuclideannorm. Incaseof1.,the prob-lemistomapsimilarinputvectorsinawaysuchthat thecorresp ondingoutputvectorsalsob ecomesimilar. In thesecond case, theinput vectorsare similarbut theyaretob emapp edasdierentpatternsinthe out-putspace. Thethirdcaseimpliesthattheinput pat-ternsthatarefarfromeachotherintheinputspaceare tob emapp edassimilarpatternsintheoutputspace. Finally,the4thcasemeansthattheinputpatternsare

farfromeachotherintheinputspaceandtheyareto b e mapp edas dierentpatterns intheoutput space. Now,thepatternmappingof1.,3.,and4. arenotthat dicult.However,incaseof2.,theproblemistomap thepatternsthatareverycloseintheinputspace,as dierentpatternsintheoutputspace.Inthiscaseeven thoughthesolutionexists,duetothedicultyofthe problemthetrainingprocesswouldb etimeconsuming. Therefore,thesecondtypeofpatternmappingresults inveryslowlearningandthep ossibilityofarrivingat alocalsolutionisveryhigh.

For example, if we dene the connection weight fromthei'thinputtothej'thhiddenunitasw

ij then thetotalinputandoutputofthej'thhiddenunitcan b edenedasfollows. net j = n X i=1 w ij x i + j (1) O j =(net j ) (2) (netj)= 1 1+e 0netj (3)

where, (1)istheactivationfunctionand j

isthe bias. Atthesametimethetotalinputtothek'th out-putunitandthecorresp ondingoutputcanb edened asfollows. net k = j X j =1 w jk O i + k (4) Ok=(netk) (5)

where, (1) isthe sameactivation functionas it was withthehiddenlayer.

As mentionedearlierthe similarpatterns playa criticalroleinlearning.Supp osewehavetraining pat-ternsx

1n andx

2n

thatareverycloseintheinputspace andthepatternsb elongtotheclass!

1 and!

2 resp ec-tively. Inthiscase,thenetworkoutputwouldb ecome extremelysensitive.Thisisb ecausethenetworkoutput mustchangerapidlyforasmallchangeintheinput.

Now,ifthedecisionb oundaryisfarfromthe pat-terns x

1n and x

2n

, then the corresp onding outputs wouldhavethevalueO

1n = O 2n = 0 or1. However, duringthelearning pro cess,as thedecisionb oundary approachesx

1n andx

2n

theoutputofthecorresp ond-ingpatternsapproachthesamevalue,andthelearning pro cess b ecomesextremely slow. Inthiscase, as the decisionb oundary moves closeto the pair x

1n ,x

2n or entersthe regionb etween thepair, theamountof weightcorrectionb ecomesextremelysmall.Tob e spe-cic,ifweassumeO = O =

(3)

amountofcorrectionforthen'thpattern1 n wouldb e asfollows. 1 n = n O nj ; (6) n =(t n 0O n ) f(net n ); (7) where,O nj

istheoutputofthej'thhiddenunit. Now, asthepatternsx

1n andx

2n

aresimilar,theoutputof thejthhiddenunitwouldalsob ecomesimilar,thatis

O 1nj =O 2nj ; (8) and f(net 1n ) = f(net 2n ): (9)

Inthiscase,theweightcorrectionwillb easfollows.

11n+12 n = O1nj((t1n0O1n) +(t 2n 0O 2n )) f(net 1n ): (10)

Ifitisassumedthatthetargetsofthepatternsare

t 1n

=1;t 2n

=0; (11)

andtheoutputofthepatternsare

O1n=z ;O2 n

=z0; (12)

thentheweightcorrectionwouldb ecomeasfollows.

1 1n +1 2n = O 1nj ((t 1n 0z) +(t 2n 0(z0))) f(net 1n ) =((t 1n +t 2n )02z+) f(net 1n ) =(102z+) f(net 1n ) (13)

Nowatb eginningoftraining,thedecisionb ound-ary wouldb e far fromx

1n and x

2n

and inthat case thecorrectionofsynapticweightswouldnotb esmall. However, duringthe trainingpro cess,as thedecision b oundarymovestowardsx

1n andx

2n

,becauseofthe similarityof the patternstheoutput wouldapproach thesamevalue. Themostcriticalsituationwouldtake placeaszandkkapproachthevalue0.5and0resp ec-tively.Thatis,

11n+12 n = lim z !0:5 lim !0 (102z+) f(net 1n ) = 0 (14)

Therefore,thecorrectionofweightsforthese pat-ternswouldb ecomeverysmallandasaresultthe learn-ingpro cesswouldb ecomeextremelyslow.

Ontheotherhand,ifthepatternsx 1n

andx 2n

are farfromeachotherintheinputspace,evenifthe de-cisionb oundarymovestowardsthemtheactivationof thecorresp ondingoutputswouldnotbecomethesame atthesametime. Hence,theweightcorrectionwillnot

Critical point

Decision boundary

Class2

Fig.1 Criticalp ointsanddecisionb oundary

3. Estimationofdecisionb oundary

Here,thedecisionruleistoselecttheclass corre-sp ondingtotheoutputneuronwiththelargestoutput. Forthesakeofsimplicity,thenumberofoutputunitis settotwo(two-classclassicationproblem). However, theconcept can b e hop efully extended tomulti-class classication problems. The decisionb oundary for a multi-layerfeed-forwardnetworkisdenedasfollows.

Denition1. Thedecisionb oundarybetweentwo classesinafeed-forwardneuralnetworksisthelo cusof p ointswhere b othoftheoutputneuronsproduce the sameactivation.

If wedenetheactivationoutputunitiasO i

(x) where x isan input vectorand let d(x) = O

1 (x)0 O

2

(x),thenthedecisionb oundarycanb edenedas

fxjd(x)=0g (15)

Next,thenotionofCriticalpointsisintroducedas follows.

Denition2. Thesetofcriticalp ointscontainspairsof datathatsatisfythefollowingcondition:

min k (d(p i ;q k ))=d(p i ;q j ); (16) min k (d(p k ;q j ))=d(p i ;q j ) (17) whered(p i ;q k

)denotestheEuclideandistanceb etween thevectorp

i andq

k .

Ifwedenotethesamplesinclass! 1 asp i and sam-plesinclass!2 asq j

thenforeachsampleinclass!1 andclass!2,thesetofcriticalp ointsCcanb edened as C=f(p i ;q j )jmin k (d(p i ;q k ))=d(p i ;q j ); min k (d(p k ;q j ))=d(p i ;q j );p i 2! 1 ;q j 2! 2 g (18)

Ahyper-planehastobeplacedinb etweenthepair ofcriticalp oints. Iftheco ordinateofthepairof criti-calp oints(p i ;q k )are(x i ;y k )and(u i ;v k

)thentheideal hyp erplanemustgothroughthepoint

(xi+ui) 2

; (yk+vk)

2 and theslop ez ofthe straightlinecanb e calculated fromthefollowingequation.

v k 0y k u i 0x i 2z=01 (19)

In thepresent approachinsteadof starting from scratchtheinitialweightsforthehiddenunits

(4)

connec-Since,theweightvectorsareorthogonaltothe sep-aratinghyp er-plane,theinitialweightsaregeneratedin thefollowingway. First,thepairofcriticalp ointsare determinedfromthetrainingdataasmentionedab ove. Next forall pair of criticalpoints(p

i ;q k )theweight vectorsm n

aregeneratedbythefollowingequation:

m n = p i 0q k kp i 0q i k (20)

andthebiases n

aregeneratedbythefollowing equa-tion: n =0m t n 1P=0 (p i 0q k ) t kp i 0q k k 1 (p i +q k ) 2 (21) 4. Databse

ga

u

a

0 kha

gha ca cha pa pha ba

bha

ta

da

dha na ta

tha da dha na

ya

ra

la

va

sa

sa sa

ha

kta

pta

ru

tra sra sra sta dhra

nja

tta

i

u

r

e

,

h

end

1

2

3

4

5

6

7

8

9 dva dbha

ddha

tta

pra

dba

ka

ja

ma

ksa

stha

a

0

Fig.2 SomeofthebasicDevanagaricharacters

Theprop osedmetho dhasb eenappliedtothe can-cerdatabaseobtainedfromtheUniversityofCalifornia Machine Learning Rep ository (two-class classication problem) andDevanagaricharacters(multi-class clas-sication problem). The cancer database is publicly availableand it contains699 instanceseachhaving 9 attributes. TheDevanagaridatabaseconsistsof44 dif-ferentcharacters(23consonants,13c-ccombinations, 3 vowels,2 sp ecialcharactersand3 numerals). Sixty seven Devanagari characterstakenfromthetext [12] are showninFigure. 2. Thecharactersarestoredin the formof162 16 oatingp ointimages. Adetails ab outtheDevanagaridatabsecanb efoundin[13].

5. Exp eriments

5.1 Exp erimentswitharticialdata

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1 Decision Boundary

"class1.output"

"class2.output"

"class1.orgdat"

"class2.orgdat"

"01.dat"

"10.dat"

Fig.3 Decisionb oundarygivenbytheprop osedmetho d

Inthepresentapproachthenumberofhiddenunits iskeptthesameas thenumb er ofcriticalp oints cal-culatedfromthetrainingdata. Twodimensionaldata isusedfortrainingandtesting. Fivesamplesforeach class(inthiscase2classes)wererandomlygenerated for training. The network had two input units, two outputunitsandthenumberofhiddenunitwassetto 3. Trainingwascontinueduntilthemeansquareerror reach0.001. Fortesting,10000sampleswererandomly generated and the class to which the testing sample fallsisdecidedbyconsideringthemaximumactivation oftheoutputunits. Thenetworkcouldlearnthe train-ingdatawith3hiddenunits.Thedecisionboundaryis estimatedfromtheoutputactivationofthenetworkin resp ecttothetestingsamplesasfollows.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1 Decision Boundary

"class1.output"

"class2.output"

"class1.orgdat"

"class2.orgdat"

"01.dat"

"10.dat"

Fig.4 Decisionb oundarygivenbytheConventional Back-propagation

Foreachtestingsamplecorrectlyclassiedasclass !

1

,ndthenearesttestingsamplecorrectlyclassiedas class!

2

. Thesamepro cessisrep eatedforthetesting samples classiedas class!

2

. Nowthe line connect-ingthepairsmentionedab ovemustpassthroughthe decisionb oundary sincethepair of samplescorrectly classieddierently. The decisionboundarygivenby thenetworkisshowninFigure 3,whereeachofthe

(5)

cal-In orderto evaluatethe eectivenessof theprop osed metho d another set of exp eriments were p erformed byemployingtheconventionalback-propagation algo-rithmandthedecisionb oundaryisshowninFigure 4. Next,thenetworkwastrainedwithvedierentinitial weights(weightsforhiddentooutputunitconnection) ,andtheresultissummarizedinTable 1.

Initweight Iteration Iteration Prop osedmetho d StandardBP

Seed1 6345 21279 Seed2 6341 21347 Seed3 6251 21263 Seed4 6179 21050 Seed5 6224 22019 Mean 6268 21392

Table1 Comparisonofresults

Itcanb eseeninTable 1,thattheiterations neces-saryfortheprop osedmetho dislessthan1/3ofthatof standardback-propagation. Incaseof standard back-propagation,thereisnootherwaythantocutandtry for determiningthenumberofhiddenunitsnecessary forsolvingaproblem. Incaseoftheprop osedmetho d, thenumberofhiddenunitisdeterminedautomatically.

5.2 Two-classclassicationproblem

Exp erimentswerep erformed byemployingthecancer data base. It wasdivided intoten training and ten testsetsandatenfoldcrossvalidationwasp erformed. In orderto evaluatethe eectivenessof theprop osed metho d another setof exp eriment wasp erformed by applying the standard back-propagation. The num-b er of hidden unit wasset to thenumber of critical p ointsgivenbytheprop osedmetho d. Theaverage ac-curacy rate of the prop osed metho d, standard back-propagationandtheresultsofapplyingBayesdecision (assuming normaldistributionforeachcategory)rule isshowninTable 2. Ontheotherhand,thefollowing

Metho d Accuracy(%) Prop osed 96.8 StandardBP 96.3

Bayes 95.27

Table2 Averageaccuracyrate

linear programming methods for pattern recognition: MultiSurfaceMetho d[18](MSM),RobustLinear Pro-gramming [19] (RLP), and Perturb ed Robust Linear Programming [20] (RLP-P) were also applied on the samedatasetandtheresultsaresummarizedinTable 3.

ItisclearfromTable 3thatoutofthethree meth-o dsRLPPgivestheb estaccuracy of96.6%. Nowif compared to Table. 2, it is clear that theprop osed metho dgivesslightlyb etterresultthanRLPP.Onthe otherhand,thesup eriorityoftheprop osedmethodover

MSM 92.6

RLP 96.4

RLP-P 96.6

Table3 Averageaccuracyrateoflinearprogramming meth-o ds

Bayesdecisionrulesuggeststhattheboundarygivenby theproposedmethodisreliable.

5.3 Multi-classClassicationproblem

Here the proposed metho d is applied to Devanagari charactersforevaluatingitsapplicabilitytomulti-class classicationproblem. However, a dierentapproach hasb eentakenforrepresentingtheoutputlayer. The outputlayerco dingmetho disthoroughlydiscussedin [12]. However,fortheb enetofthereadersabrief de-scriptionofthemethodisgiveninthenextsubsection.

5.3.1 Automaticco dingscheme

Split the 16 X 16 image

into four parts

Cluster each part

5 bit binary number

Feature vector

y

q

C

(n)

Map each element of q to a

_l

l

Binary number

encoder

1

2

3

4

1

2

3

4 n

n

(20−bit)

Fig.5 Automaticco dingpro cedure

One ofthe imp ortantasp ect of information pro-cessingisthewaytheinformationisrepresented. The conventionalapproachof assigningacategorytoeach cellisreasonableinthesensethat,thedistanceamong theco desisconstant. However,thiskindofapproach cannotbetakenforrepresentingalargenumb erof cat-egories. Althoughbinaryrepresentationapp earstobe reasonableinthesensethatitwouldrequirelog

2 nbits forrepresentingncategories; theyarenot wellsuited fornetworkoutputrepresentationsb ecauseofthe inter-dep endencyofcells whichmakethelearning dicult. Forexample,ifwerepresentthenumb ers7and8in bi-naryas0111and1000thenirresp ectiveofthefact thatthenumb ersareadjacentintegers,4outputunits willhavedierentvalue.Here,insteadofastraight for-ward binarycodingmetho d, a dierentapproachhas b eentakenandthemethodisasfollows. Atrsteach trainingdatax

n

(n=1;111;N)(16216pixels)is splitintofourparts(828pixels)inthefollowingway.

x n

(6)

Asaresultthefollowingsetofpatternisgeneratedfor eachparti=1;111;4. Y i =fy 1i ;y 2i ;111;y Ni g Next, eachY i

(i =1, 2, 3,4) isclustered. Herethe LBG method [14] wasapplied for clustering and the numberofclusterwassetto 16. Afterclustering, for each !

l

(l = 1;111;L) the clusterto which the part b elongs ischeckedand thenalclusterisdetermined byconsideringthemaximumnumb erofcharactersb e-longing toaspeciccluster. So,inthisway,for each category !

l

there will b e a four dimensional vector q l = (q l1 ;q l2 ;q l3 ;q l4

), where eachelementof the vec-tor will corresp ondtotheclustertowhichthe

corre-11011

11010

10111

10110

10011

10001

01111

01110

01101

01100

01011

01001 00111

00110 00011

00010

Fig.6 Co deassignmentpro cedure

sp ondingpartofthetrainingsamplebelongs.Now,each elementof vectorq

l

ismapp edtoa 5-bitbinary pat-tern. Thebinarypatternsarehandpickedandtheyare mapp edinawaysuchthatforanytwoclusterswhose parentisthesame,theEuclideandistanceb etweenthe binarypatternsb ecomesone.Thiswillsomehowensure thatthesimilarpart(q

l

)ofeachcategory! l

wouldget similarcodes. On theotherhand, neglectingthe bi-narypatternscontainingonly0'sor1's,theminimum numberof bitsrequiredtomaintainthisconstraintis 5. Thecodeassignmentpro cedureisshowninFigure 6. Finally,q

l

istransformedtoafeature vectorwhich isa20-bit co de. Inthiswayforeachtrainingsample x

n

(n=1;111;N)therewillb e afeature vector c (n) with 20 elements. (Here(n) representsthe category numberto which x

n

b elongs.) Theentire pro cessis summarizedinFigure 5.

5.3.2 Recognitionpro cess

In the present approach, training pro cess is di-videdintotwophases. Intherstphase,thenetwork is trainedwith thetraining data ina usual way. In thesecondphase,thetrainingdataisfedtothe net-workandtheoutputvectorsgivenbythenetworkare employedforformingtheprototypesforeachcategory.

Input(xn)(n=1;:::;N) ? NN ? Output t t R o=(o 1 ;o 2 ;:::;o K ) (K=dimensionofo) Testmo de Learnmo de t 9 m1= 1 N 1 X x2! 1 o m2= 1 N 2 X x2!2 o . . . m M = 1 NM X x2! M o d 1 = K X k=1 jo k 0m 1k j 2 d2= K X k=1 jok0m2kj 2 . . . dM= K X k =1 jok0mMkj 2

Fig.7 Recognitionpro cess dM

3

=mindM=)M3:recognizedcategory

m 1 = 1 N 1 X x2!1 o m 2 = 1 N 2 X x2!2 o . . . m M = 1 N M X x2!M o

Here N and o stand for the number of samples usedforeachcategoryduringtrainingandtheoutput vectorresp ectively.Theseprototypesarelaterusedfor recognition.

During evaluation, Euclidean distance has b een takenamongtheprototypesformedinthesecondphase oftrainingandtheoutputvectorgivenbythenetwork duringtest. Thewholepro cess,fromformationof pro-totyp evectorsto evaluationpro cessissummarized in Figure 7.

Herethetargetoutputsaregeneratedby employ-ingtheprop osedautomaticco dingscheme.Thissolves theproblemofoutputlayerrepresentation. Thenext issues are related to network architecture and initial weights.

Insteadofp erformingtrialanderror,thenumber ofhiddenunitisdeterminedbyemployingthe knowl-edgeofcriticalp oints.Inthiscasethestraightforward approachofsetting thenumberofhidden unittothe numberofcalculatedcriticalpointspairscouldnotb e appliedduetotheenormousnumberofcriticalp oints. Therefore,thepair ofcriticalp oints aresortedbased onthedistanceb etweeneachpair. Next, thenumber ofhiddenunitisdeterminedbyxingathresholdvalue d

m

axoverthedistance. Inthiswayif d m

axsettox resultsinuhiddenunits,thenrstucriticalpointsare employedforcalculatingtheinitialweightsandbiases.

(7)

inthesense thatthepatternsthat staycloseto each otherintheinputspacewouldlargelyeectthe learn-ingpro cess.

5.3.3 Results

Therelationofdmaxandthenumberofhidden units isgiveninTable.4.

dmax 2.0 2.2 2.4 2.5 3.0 #Hiddenunits 35 46 57 61 76 Table4 Relationofdmaxand#hiddenunit

0.50

1.00

1.50

2.00

2.50

3.00

3.50 # Hidden unit

Method1

Method2

Training epoch (X 10 )

3

35

46

57

61

76

Fig.8 Trainingep o ch(Metho d1VsMetho d2)

97.5

98

98.5

99

99.5

100

35

40

45

50

55

60

65

70

75

80 C

l

ass

ifi

ca

ti

on ra

t

e

(%)

"training-error-method1"

"training-error-method2"

Fig. 9 Classication rate (Training data : Metho d1 Vs Metho d2)

Theperformanceof theprop osedautomaticco d-ing pro cedurecould haveb eencompared tothe con-ventionaloutputlayerrepresentationwhereeachofthe output unitrepresentsa singlecategory,howeverthis

92.5

93

93.5

94

94.5

95

95.5

96

96.5

97

35

40

45

50

55

60

65

70

75

80 C

l

ass

ifi

ca

ti

on ra

t

e

(%)

#

idd

i

"test-error-method1"

"test-error-method2"

Fig. 10 Classication rate (Testingdata : Metho d1 Vs Metho d2)

if thereal worldproblemswhere thenumberof cate-gorytob erecognizedistoolarge(forexampleKanji) isconsideredthenitisunrealistictoapplythe conven-tionalmetho d that would requiremuchmorestorage and/memorythantheprop osedmetho d.

In order to evaluate the eect of the prop osed metho d, one set of exp eriments were p erformed by employingonlyautomaticco dingpro cedure(Metho d1), andanothersetofexp erimentswereperformedby em-ployingb othautomaticco ding and automatic weight initializationpro cedure(Metho d2). However,inb othof thecases,thelearningparameterswerekeptthesame. Theexp erimentaloutcomesareshowninFigure. 8,9, and10.

ItisclearfromFigure. 8,thatthenumberof train-ingep o chrequiredforthecombinationoftheprop osed metho dsislessthanthecasewhereonlyautomaticco d-ingprocedureisemployed.

In Figure 9, 10 it can be seen that the com-bined metho d results in b etter p erformance against b othtrainingandtestingdata. Nowb othofMetho d1 andMethod2gavesamep erformancefortrainingdata with57hiddenunits(Figure. 9). However,compared toMetho d1,Metho d2gaveb etterp erformancewith re-sp ecttotestingdata(Figure. 10).Thisassuresthatthe combinedmetho d(Metho d2)producesbettersolution. Therefore,theresults indicatethat theinitialweights areeectiveforfastertrainingandbettersolution.

6. Conclusion

It has b een successfully shown through exp eriments that thea priori relatedto decisionboundarycanb e employedfordetermining theinitial weightsofa net-work.Comparedtostandardback-propagationthe pro-p osedmetho dreducestrainingtime. Themethodhas

(8)

cer databaseand Devanagaricharacters. Themetho d alsodeterminesthenumberofhiddenunits automati-cally.

However, inthepresentstage thepairof critical p oints areselectedbasedon a thresholdoverthe dis-tanceb etweenthepairofcriticalp oints,whichcanb e consideredasalo calapproach.Hence,someother cri-terionforselectingthepairofcriticalp ointsincaseof complexclassb oundaryistob efurtherinvestigated.

Acknowledgement

TheauthorwouldliketothanktheEducationCenter for InformationPro cessing(ECIP)ofTohoku Univer-sityfor providingtheDevanagaridatabase. Theyare also gratefulto MasayukiKimura, SusumuHoriguchi and Kazunori Kotaniof Japan AdvancedInstituteof ScienceandTechnologyfortheirvaluablecomments.

References

[1] Rumelhart, McClelland, and the PDP Research Group, \ParallelDistributedPro cessing,"TheMITPress,1989. [2] Rosenblatt,F:TwoTheoremsofStatisticalSeparabilityin

thePerceptron;ProceedingsofaSymposiumonthe Mecha-nizationofThoughtProcess,HerMajesty'sStationary Of-ce,London,1959,421-456.

[3] Widrow, B., and Ho, M.E: Adaptive Switching Cir-cuits;InstituteofRadioEngineers,WesternElectronicShow andConvention,ConventionRecord,part4,1960,96-104. [4] DavidH.Kil, Frances B.Shin,\Patternrecognitionand

Predictionwith applicationstoSignalCharacterization," AIPPRESS,1996,pp.134-138.

[5] Riedmiller,MartinandBraun:RPROP-AfastAdaptive learningalgorithm;Technicalrep ortUniversitatKarlsruhe, 1992.

[6] Y.Riedmiller,MartinandBraun:RPROP-Afast Adap-tivelearningalgorithm;Technicalrep ortUniversitat Karl-sruhe,1992.

[7] K. Hara and K. Nakayama: Training Data Selection Metho dforNeural Networks;IEICETrans.ofInf.Syst. SocietyofJapan,Fundamentals,Vol.E81-A,No.3,pp. 374-381,March1998.

[8] M.C.Mozer,P.Smolensky:Skelitonization:Atechnique fortrimmingthefatfromanetworkviarelevance assess-ment; inAdvancesin neural informationprocessing sys-tems,1,pp.107-115,1989.

[9] J.SietsmaandDow: Neuralnetworkpruning-whyand how;Proceedingofthesecondinternationalconferenceon neuralnetworks,pp.326-333,July1988.

[10]Fahlman,E.Scott:Anempiricalstudyoflearningsp eedin backpropagationnetworks;Technicalrep ort CMU-CS-88-162,1988.

[11]T.Ash: Dynamicno decreationinbackpropagation net-works;ConnectionScience,1(4),pp.365-375,1989. [12]H. Kernand BunyiuNanjio: Saddharmapundarika,

ST.-PETERSBOURG,1912.

[13]K.Keeni,H.Shimo daira,T.Nishino,Y.Tan\Recognition ofDevanagariCharactersUsingNeuralNetworks," Trans-actionofInformationandSystemsSo cietyofJapan,IEICE TRANS.INF.&SYST.,VOL.E79-D,No.5,May1996. [14]Y.Linde,A.Buzo,andR.M.Gray:Analgorithmfor