AN IMPROVEMENT OF A PATTERN MATCHING ALGORITHM
BY MAKING USE OF A STATISTICAL PROPERTY OF
LANGUAGES
著者
TOGASHI Akira, KAWASAKI Hirotoki
journal or
publication title
鹿児島大学理学部紀要. 数学・物理学・化学
volume
16
page range
27-30
別言語のタイトル
言語の統計的性質を利用したパターンマッチングア
ルゴリズムの改良について
URL
http://hdl.handle.net/10232/6406
AN IMPROVEMENT OF A PATTERN MATCHING ALGORITHM
BY MAKING USE OF A STATISTICAL PROPERTY OF
LANGUAGES
著者
TOGASHI Akira, KAWASAKI Hirotoki
journal or
publication title
鹿児島大学理学部紀要. 数学・物理学・化学
volume
16
page range
27-30
別言語のタイトル
言語の統計的性質を利用したパターンマッチングア
ルゴリズムの改良について
URL
http://hdl.handle.net/10232/00007015
Rep. Fac. Sci., Kagoshima Univ., (Math., Phys., & Chem.), No.16, p.27-30, 1983
AN IMPROVEMENT OF A PATTERN MATCHING
ALGORITHM BY MAKING USE OF A
STATISTICAL PROPERTY OF LANGUAGES
Akira Togasi* and Hirotoki Kawasaki*
(Received September 9, 1983)
Abstract
In this paper we give an algorithm improving processing times for a pattern matching
● ●
problem, which is often used to information retrieval, by using statistical properties of
●
languages.
1. Introduction
The statistical properties of the natural languages have been investigated by several researchers. Especially there are many papers describing the investigation about proper-ties relating to consecutive characters in a word, and also the properproper-ties of the vocal sounds in it, since the alphabets used in the India-European language being very small sets, we can easily seek statistical properties of languages.
In this paper we consider an improvement about the number of handling times of the pattern matching to the natural language.
2. Pattern Matching
Suppose that r is a string whose length is m, n is a string whose length is n, and assume that m>n.
In this paper, we say r a text and k a pattern. Moreover suppose r and k are strings
onthe alphabet ∑-{tfi,..., as) i.e. r,7C∈∑+-∑*-w.
Then the problem whether a pattern n exists in a text z or not is said to be a pattern matching problem. And it is the problem we consider in this paper.
3. Basic Algorithm
We give one of the most primitive algorithms for the pattern matching (algorithm 1).
First, let us denote the i-th character of r and n by z{i) and x{i¥ respectively. Fig. 1 is the aowchart of algorithm 1.
A. Togasi and H. Kawasaki
Fig. 1 The flowchart of Algorithm 1.
The program of Algorithm 1.
1-1,j-l,M-l
L:IF ri)-i;)
THEN IF j-n
THEN O.K.
ELSE IF /-1
THEN
M-i-i+l
●]-)+l
GO TO L
ELSE 2-2+1
J-j+1
GO TO L
ELSE IF i-m-n+l ¥Jm-n
THEN false
ELSE i-M+l
-1GO TO L
Algorithm 1 starts from M-l, compares rd+M-1) with x(i). If the matching
succeeds to i- n, it turns on O.K. (i.e. k exists in r), and if the unmatching happens on the
halfway, adds one to M and does the same things. When M- m-n+l, ii the unmatching
happens, as it is impossible to increase M more, so it turns on false{n does not exist in r).
Then let TN be the number of times comparing rii+M-1) with n¥i) until it turns
on O.K. or false.
t)
m川川 umm山川日掛仙打hmRHH国肌川mmmWm川川川mm川N相 川川M川仙川m川mmm川帆山川仙川矧川川仙 川川仙川脚h川相川川
An Improvement of a Pattern Matching Algorithm 29
most (when it turns on O.K. or false on M-m-n+1, i- n after the unmatchings happen
on i-n from M-¥ to M-m+n), that is,
n≦TN<nX(tn-n+1).
Definition 1. Given a text r and a pattern n, we define the referring order by the number of times going through the label L of algorithm 1.
4. An improvement of the algorithm
When the text r is a string on the natural language, the appearance pattern of r is not uniform. The probability of z{i)-aj is not generally equal to that of r(i)-a/{j:^j ).
So let us assume that />,-, is the probability of z{i)-aj and pt/ that of r(i)-aj.
Then we can de丘ne an appearance pattern matrix ;
(≡
P-♪11 '-- -Pr, ● ● \Pis蝣-Pms inthis,triviallywehave s 2pu-l j=l[i-l,2,-,m) Given方suchthat tc-an..Mme∑+, theprobabilitythatkexistsfromk-thinris n Hbj十k-ljtj -- -・・----・・・- - -- (1) .7-1Definition 2. Given the pattern %, we define the averaged referring order by the average of the referring order of matching by the text r which has an appearance pattern matrix P.
Moreover, suppose Tk is the averaged referring order that n exists from k-th in r.
Let us denote by F′・ the averaged referring order that the unmatching happens on 〟 - ㍗(㍗
-1, 2,..., k-1). Then we have immediately
n h-¥
F,-∑ hinpji+r-l,ijX(1 Ph+r, ih))
.7-1 (Fr is independent of k), and k-1∴ Tk-∑Ft+n
1-1Therefore the averaged referring order 7'of Algorithm 1 is
t n
T-∑ Tk npJ+k-i. ----・-・--・-・--・・・--・- (4)
.7-1
¥t-m-n+¥)
Now we know that T is dependent on {1, 2,..., /} which is the order M moves inthe
algorithm 1. So we can decrease 7'by changing the order 〟 moves.
Suppose that
W-{w¥w is apermutationof {1, 2,..., t}},
30 A. Togasi and H. Kawasaki
W∋W-{wi,..., Wt)
as the new order M moves, we denote the new Tk by Then Tkw is k-1 7V-2 Fwt+n一 (5) z-l So7Wis t n
Tw-∑ npj, _i,a-・- - -・・- -・・--・- (6)
j-¥Therefore choosing w* such that
●
Tw -min T"
w<=W
as new order 〟 moves, then
qrw <^p
So we can improve the algorithm 1. We show the algorithm 2 improving the algori-thml.
The program of Algorithm 2 Search w* such that min Tw
welT
i-wi,y-i,M-¥
L: IF tU)-icU)
THEN IF j-n
THEN O.K.
ELSE IF /-1
THEN M-i
ELSEi-i+l
/-;+!
●GO TO L
1-1+1
●- +1
GO TO L
ELSE IFi-t v m-n
THEN false
ELSE i - u)M^¥
7-1 GO TO L References[ 1 ] Suen, C.Y : n-Gram Statistics for Natural Language Understanding and Test Processing, IEEE Trans. Vol. PAMI-1, No.2, April 1979
[ 2 ] Knuth, E.E., Morris, J.HJr., Pratt, V.R. : Fast Pattern Matching in Strings, SIAM J. Comput Vol.6, No. 2, pp. 323-350 (1977).
[ 3 ] Boyer, R.S. and Moore, J.S. : A Fast String Searching Algorithm, Comm. ACM, Vol. 20, No. 10,
pp. 762-772 (1977).
[4] Guibas, LJ. and Odlyzko, A.M.: A New Proof of the Linearity of the Boyer-Moore String Searching Algorithm, SIAM J. Comput., Vol. 9, No. 4, pp. 672-682 (1980).