鹿児島大学リポジトリ

(1)

AN IMPROVEMENT OF A PATTERN MATCHING ALGORITHM

BY MAKING USE OF A STATISTICAL PROPERTY OF

LANGUAGES

著者

TOGASHI Akira, KAWASAKI Hirotoki

journal or

publication title

鹿児島大学理学部紀要. 数学・物理学・化学

volume

16 page range

27-30

別言語のタイトル

言語の統計的性質を利用したパターンマッチングア

ルゴリズムの改良について

URL

http://hdl.handle.net/10232/6406

(2)

AN IMPROVEMENT OF A PATTERN MATCHING ALGORITHM

BY MAKING USE OF A STATISTICAL PROPERTY OF

LANGUAGES

著者

TOGASHI Akira, KAWASAKI Hirotoki

journal or

publication title

鹿児島大学理学部紀要. 数学・物理学・化学

volume

16 page range

27-30

別言語のタイトル

言語の統計的性質を利用したパターンマッチングア

ルゴリズムの改良について

URL

http://hdl.handle.net/10232/00007015

(3)

Rep. Fac. Sci., Kagoshima Univ., (Math., Phys., & Chem.), No.16, p.27-30, 1983

AN IMPROVEMENT OF A PATTERN MATCHING

ALGORITHM BY MAKING USE OF A

STATISTICAL PROPERTY OF LANGUAGES

Akira Togasi* and Hirotoki Kawasaki*

(Received September 9, 1983)

Abstract

In this paper we give an algorithm improving processing times for a pattern matching

● ●

problem, which is often used to information retrieval, by using statistical properties of

●

languages.

1. Introduction

The statistical properties of the natural languages have been investigated by several researchers. Especially there are many papers describing the investigation about proper-ties relating to consecutive characters in a word, and also the properproper-ties of the vocal sounds in it, since the alphabets used in the India-European language being very small sets, we can easily seek statistical properties of languages.

In this paper we consider an improvement about the number of handling times of the pattern matching to the natural language.

2. Pattern Matching

Suppose that r is a string whose length is m, n is a string whose length is n, and assume that m>n.

In this paper, we say r a text and k a pattern. Moreover suppose r and k are strings

onthe alphabet ∑-{tfi,..., as) i.e. r,7C∈∑+-∑*-w.

Then the problem whether a pattern n exists in a text z or not is said to be a pattern matching problem. And it is the problem we consider in this paper.

3. Basic Algorithm

We give one of the most primitive algorithms for the pattern matching (algorithm 1).

First, let us denote the i-th character of r and n by z{i) and x{i¥ respectively. Fig. 1 is the aowchart of algorithm 1.

(4)

A. Togasi and H. Kawasaki

Fig. 1 The flowchart of Algorithm 1.

The program of Algorithm 1.

1-1,j-l,M-l

L:IF ri)-i;)

THEN IF j-n

THEN O.K.

ELSE IF /-1

THEN

M-i-i+l

●

]-)+l

GO TO L

ELSE 2-2+1

J-j+1

GO TO L

ELSE IF i-m-n+l ¥Jm-n

THEN false

ELSE i-M+l

-1

GO TO L

Algorithm 1 starts from M-l, compares rd+M-1) with x(i). If the matching

succeeds to i- n, it turns on O.K. (i.e. k exists in r), and if the unmatching happens on the

halfway, adds one to M and does the same things. When M- m-n+l, ii the unmatching

happens, as it is impossible to increase M more, so it turns on false{n does not exist in r).

Then let TN be the number of times comparing rii+M-1) with n¥i) until it turns

on O.K. or false.

(5)

t)

m川川 umm山川日掛仙打hmRHH国肌川mmmWm川川川mm川N相川川M川仙川m川mmm川帆山川仙川矧川川仙川川仙川脚h川相川川

An Improvement of a Pattern Matching Algorithm 29

most (when it turns on O.K. or false on M-m-n+1, i- n after the unmatchings happen

on i-n from M-¥ to M-m+n), that is,

n≦TN<nX(tn-n+1).

Definition 1. Given a text r and a pattern n, we define the referring order by the number of times going through the label L of algorithm 1.

4. An improvement of the algorithm

When the text r is a string on the natural language, the appearance pattern of r is not uniform. The probability of z{i)-aj is not generally equal to that of r(i)-a/{j:^j ).

So let us assume that />,-, is the probability of z{i)-aj and pt/ that of r(i)-aj.

Then we can de丘ne an appearance pattern matrix ;

(≡

P-♪11 '-- -Pr, ● ● ＼Pis蝣-Pms inthis,triviallywehave s 2pu-l j=l[i-l,2,-,m) Given方suchthat tc-an..Mme∑+, theprobabilitythatkexistsfromk-thinris n Hbj十k-ljtj -- -･･----･･･- - -- (1) .7-1

Definition 2. Given the pattern %, we define the averaged referring order by the average of the referring order of matching by the text r which has an appearance pattern matrix P.

Moreover, suppose Tk is the averaged referring order that n exists from k-th in r.

Let us denote by F′･ the averaged referring order that the unmatching happens on 〟 - ㍗(㍗

-1, 2,..., k-1). Then we have immediately

n h-¥

F,-∑ hinpji+r-l,ijX(1 Ph+r, ih))

.7-1 (Fr is independent of k), and k-1

∴ Tk-∑Ft+n

1-1

Therefore the averaged referring order 7'of Algorithm 1 is

t n

T-∑ Tk npJ+k-i. ----･-･--･-･--･･･--･- (4)

.7-1

¥t-m-n+¥)

Now we know that T is dependent on {1, 2,..., /} which is the order M moves inthe

algorithm 1. So we can decrease 7'by changing the order 〟 moves.

Suppose that

W-{w¥w is apermutationof {1, 2,..., t}},

(6)

30 A. Togasi and H. Kawasaki

W∋W-{wi,..., Wt)

as the new order M moves, we denote the new Tk by Then Tkw is k-1 7V-2 Fwt+n一 (5) z-l So7Wis t n

Tw-∑ npj, _i,a-･- - -･･- -･･--･- (6)

j-¥

Therefore choosing w* such that

●

Tw -min T"

w<=W

as new order 〟 moves, then

qrw <^p

So we can improve the algorithm 1. We show the algorithm 2 improving the algori-thml.

The program of Algorithm 2 Search w* such that min Tw

welT

i-wi,y-i,M-¥

L: IF tU)-icU)

THEN IF j-n

THEN O.K.

ELSE IF /-1

THEN M-i

ELSE

i-i+l

/-;+!

●

GO TO L

1-1+1

●

- +1

GO TO L

ELSE IFi-t v m-n

THEN false

ELSE i - u)M^¥

7-1 GO TO L References

[ 1 ] Suen, C.Y : n-Gram Statistics for Natural Language Understanding and Test Processing, IEEE Trans. Vol. PAMI-1, No.2, April 1979

[ 2 ] Knuth, E.E., Morris, J.HJr., Pratt, V.R. : Fast Pattern Matching in Strings, SIAM J. Comput Vol.6, No. 2, pp. 323-350 (1977).

[ 3 ] Boyer, R.S. and Moore, J.S. : A Fast String Searching Algorithm, Comm. ACM, Vol. 20, No. 10,

pp. 762-772 (1977).

[4] Guibas, LJ. and Odlyzko, A.M.: A New Proof of the Linearity of the Boyer-Moore String Searching Algorithm, SIAM J. Comput., Vol. 9, No. 4, pp. 672-682 (1980).