拡張正則表現に対する近似文字列照合問題

全文

(1)アルゴリズム 86−１（２００２．９．１９）. 拡張正則表現に対する近似文字列照合問題山本博章信州大学工学部情報工学科概要拡張正則に対する近似文字列照合問題とは、長さ m の拡張正則表現 r 長さ n の文字列 x 、任意の d が与えられたときに x の接頭語 yで edit y;r d なるものすべてを見つける問題である。ここで、 edit y;r は、yと r間の最小の編集距離である。この問題を多項式時間で解くことは難しいことが知られている。本論文ではこの問題を O Nr2 m 2N n 時間かつ O Nr2 m N 領域で解くための新しいアルゴリズム u r n r で、u r および n r はそれぞれ、rに出現する共通集合演算子およを示す。ここで、Nr .. 数. ,. 0. (. ,. (. ). ). ,. (. = 2. (2. ( )+. r. ). ). (. ( )+1. (2. ( ). ). r). ( ). び補集合演算子の数を表す（これら二つの演算子は拡張演算子と呼ばれる）。このように、我々のアルゴリズムは拡張演算子の数に依存し、もしそれが定数ならば多項式時間で走る。. Approximate Pattern Matching Problem for Extended Regular Expressions Hiroaki YAMAMOTO Department of Information Engineering, Shinshu University, 4-17-1 Wakasato, Nagano-shi, 380-8553 Japan. [email protected]. Abstract. Given an extended regular expression number pre

(2) xes. d y x. r. of length. m. , a text string. x. of length. of. with. string generated by. edit y;r d r. and any. (. ). edit y;r. . Here. (. ) is the minimum edit distance between string. O Nr m 2 (2. (. and any. . It is dicult to design a polynomial time algorithm for this problem. In this paper,. O Nr2 m 2N n ur nr. we will present a new algorithm to solve the approximate pattern matching problem in time and. n y. 0, the approximate pattern matching problem for extended regular expressions is to

(3) nd all the. N ) r). space in the worst case.. Here. Nr. ur. = 2. ( ) +. nr. ( ) + 1 when. the number of intersection operators and complement operators occurring in algorithm runs in polynomial time if. ur. ( ) and. nr. r. (. (2. , respectively.. r. ). ( ) and. ). ( ) are. Thus our. ( ) are constant numbers.. 1 Introduction This paper is concerned with the approximate pattern matching problem for extended regular expressions (that is, regular expressions with intersection and complement):. r m x n d edit y;r d edit y;r f edit y;z g L r z 2L(r ) r edit y;z edit aaabb;aabbb edit aaabb;abbb O mn Om O mn= n O mn Om O mn= d+2 n expression. of length. with. ). (. .. , a text string. Here. (. is, language) generated by. of length. ) = min. and any number. (. and the edit distance. ) , where. (. Given an extended regular. 0,

(4) nd all the pre

(5) xes. ) between two strings. y. and. z. (. ) = 1 and. (. of. is de

(6) ned to be. the minimum number of insertions, deletions, and/or substitutions required to transform. example,. y x. ( ) denotes the set of strings (that. y. into. z. .. For. ) = 2.. It is widely known that the classical exact pattern matching algorithm for regular expressions runs in. (. ) time and. (. ) space, which is based on nondeterministic

(7) nite automata (NFAs for short) [1, 2, 5].. Myers [6] has improved this algorithm and has given an algorithm.. (. log. ) time and space pattern matching. Myers and Miller [7] applied the NFA-based technique to the approximate pattern matching. problem of regular expressions, and showed an. (. ) time and. (. ) space pattern matching algorithm.. Wu, Manber and Myers [9] have improved this algorithm by using the similar technique to Myers [6], and showed. (. log. ) time algorithm.. Furthermore, Wu and Manber [8] presented another algorithm. that runs fast in practice for small regular expressions. Thus the approximate pattern matching problem for regular expressions can be solved in the same time complexity as the exact pattern matching problem. As for extended regular expressions, Hopcroft and Ullman [5]

(8) rst gave the recognition algorithm based on the inductive de

(9) nition of extended regular expressions, which runs in space.. O mn3 (. ) time and. O mn2 (. ). Knight and Myers [4] have given a recognition algorithm based on an extension of the inductive. −1−.

(10) transformation from regular expressions into NFAs. Their algorithm also runs in. O mn3 (. ) time and. O mn2 (. ). space in the worst case. Recently, Yamamoto [10] has given a new automata-based recognition algorithms. His algorithm runs faster than the existing one. By using these recognition algorithm, the exact pattern matching problem for extended regular expressions can be solved in the similar complexities. On the other hand, up to now, the approximate pattern matching problem has never been studied. In this paper, we will study the approximate pattern matching problem and will present a new approximate pattern matching algorithm that is a natural extension of that in [7].. d d 2 N O Nr m ur nr r ur nr O m2 mn E qj ;i qj. running time is independent of. O Nr2 m 2N n ur nr. runs in. Nr. (. = 2. (2. ( ) +. r. ). if. ) time and. ( ) + 1 when. m. time in. and. (. ( ) and. complement operators occurring in faster our algorithm runs.. (2. ). r ) space for any number. ( ) and. . Furthermore, it runs. (. ( ) are constant numbers then it runs in polynomial. +. a dynamic programming technique, where. y. [. is a string accepted at the state. extension is as follows.. 0 in the worst case, where. , respectively. Thus, the less the number of extended operators is, the. ) time and. Om (. ) space for regular expressions.. Our technique is an extension of the technique used in [7, 9].. such that. d. ( ) are the number of intersection operators and the number of. Especially, if. n. As with [7, 9], the. is a number that can be stored in one machine-word. Our algorithm. E qj ;i edit a1 1 1 1 ai ;y. They eciently computed. ] is the minimum edit distance over all. [. ] by. (. of the NFA obtained from a regular expression.. ). Our. As in [10], we partition the parse tree of an extended regular expression into. modules by extended operators, and then transform each module into an NFA called an augmented NFA. Furthermore, we introduce a new structure called a computation modular tree for such augmented NFAs, and construct a kind of NFAs based on computation modular trees. In other words, this NFA is constructed by viewing computation modular trees as states and the accepting language exactly coincides with Then we are computing. E T;i y [. ] for every computation modular tree. modular tree by dynamic programming technique, where. edit a1 1 1 1 ai ;y l (. ) such that. E T;i T l maxf n; mg l [. Lr. ( ).. reachable from the initial computation. ] is the minimum edit distance over all. is a string accepted at the computation modular tree. Throughout the paper, as in [1, 5, 6, 7, 8, 10], we rely on an. that is, each. T. =. -bit instruction is executed in one unit of time and each. log. .. log. -bit uniform RAM,. -bit register is stored in one unit of. space, to evaluate all complexities appearing in this paper.. 2 Preliminaries Let 6 be an alphabet. The extended regular expressions over 6 are de

(11) ned as follows. 1. ;, and a (2 6) are extended regular expressions that denote the empty set, the set fg and the set fag, respectively.. De

(12) nition 1. 2. Let r1 and r2 be extended regular expressions denoting the sets R1 and R2 , respectively. Then (r1 _ r2 ), 3 (r1 r2 ), (r1 ), (r1 ^ r2 ) and (:r1 ) are also extended regular expressions that denote the sets R1 [ R2 , 3 R1R2, R1 , R1 \ R2, and R1 (= 63 0 R1), respectively. By. ur. ( ) and. :. operators. nr. ( ), we denote the number of intersection operators. occurring in an extended regular expression. operators (. ), (. ), (. ), (. ) and (. 2. If. Regular expressions are de

(13) ned. ) and (. ).. To take advantage of hierarchical structure of extended Let. be an extended regular expression.. Then the. is de

(14) ned as follows:. =. ( ,. =. , respectively), then (. =. root is labeled by are. and the number of complement. ), and semi-extended regular expressions are de

(15) ned by four. regular expressions, we introduce their parse trees. parse tree. ^. , respectively.. r1 _ r2 3r1r2 r13 r1 _ r2 r1 r2 r1 r1 ^ r2 r Pr r ;a Pr r r1 _ r2 r r1 ^ r2 r r1 r2 r r13 r :r1 _ ^ 1 3 : Pr Pr 3 : Pr. by three operators (. 1. If. r. 1 and. 2 (. ,. (. and. ,. ,. is a tree consisting of just one node labeled by. =. ,. ,. =. ,. =. , respectively), then. Pr. ;a ( ,. , respectively).. is a tree such that its. , respectively) and the left subtree and the right subtree of the root. have only. 1. 1 ), respectively. The operator \ " means concatenation.. The following proposition is widely known as the linear translation from regular expressions into NFAs (for example, see [5]).. −2−.

(16) Let r be a regular expression of length m. Then we can construct an NFA M such that M has at most 2m states and accepts the language denoted by r.. Proposition 1. 3 Modular Tree and Augmented NFAs As in [10], we design an algorithm by inductively constructing NFAs from hierarchical formation of extended regular expressions. Let. r. the parse tree of. .. r. be an extended regular expression over an alphabet 6 and let. Then, we partition. Pr. ^. by nodes labeled with intersection. and complement. Pr. :. be. into. ^ : Pr Pr ^ : ; a2 ^ : ^ : a universal leaf a negating leaf a modular leaf a module. R 0 R0 P0 r u R R Pr R a parent of R R0 a child of R a child of R at u R Pr R a root module R R a leaf module a modular tree Tr R; E R R;R0 2 E R R0 Tr Nr Tr R u R u a modular symbol R [ fu j u R g MR R MR an augmented NFA q q;u q a universal state u q a negating state u q R0 R u MR associated with u associated with the state q u u u subtrees such that (1) the root of each subtree is either a child of a node labeled with root of. by. ,. , (2) each subtree does not contain any interior nodes labeled by. ,. 6,. or. . If it is labeled by. (. and. root of. .. , then. children, then = (. .. is called. If a modular leaf. , and conversely. is the root of. , then. is called. (. of. becomes the parent of the. is called. or. is called. .. is a set of modules, (2) (. of. This time, for a state. number of transitions from such a state. is the parent of. with a new symbol. can be viewed as a regular expression over 6. is a universal leaf, and. 0. if and only if. .. .. . Then, by Proposition 1, we can construct an NFA. , then A-NFA. ). , we relabel every modular leaf. . By this relabeling,. if. If the. does not have any. from 0 in the order of the root to leaves, the left to the right.. , we denote the number of nodes of. Now, for each module. (A-NFA for short).. If a module. . It is clear that such a parent-child relationship induces. ) such that (1). We number the nodes of the modular tree. leaf. or the. Thus there are two children at each universal leaf, while one child at each negating leaf.. root of a module. By. in. . We call such a subtree. be modules in the parse tree. in. or. , (3) each leaf is labeled. , respectively), then it is called. , respectively). These leaves are also called Let. or. for a module. . We call this. such that a transition. is called. called. is a modular leaf of. if. (. ) is de

(17) ned,. is just one if any. In addition, if a module. is said to be. is called. is a negating leaf. Note that the. or. is a child of. at. . Obviously, if a modular. is universal, then two A-NFAs are associated with the modular symbol. then just one A-NFA is associated with the modular symbol. , and if it is negating,. . The following proposition is immediately. obtained from Proposition 1.. Proposition 2 Let r be an extended regular expression of length m, let R0 ; . . . ;Rl be modules produced by partitioning Pr , and let mj be the length of the subexpression of r for the module Rj . Then we can construct A-NFAs Mj for each module Rj such that the number of states of Mj is at most 2mj .. 4 Approximate Pattern Matching Algorithm We

(18) rst introduce a new structure called. a computation modular tree which is an important concept to. design a new approximate pattern matching algorithm for extended regular expressions.. 4.1 A Computation Modular Tree 3. Let. r. be an extended regular expression over an alphabet 6 and let. 6 . Then assume that the parse tree. Pr. is partitioned into. M0; ;Ml Mj j l T U; E E edges U0 M0 Uj Rj Rj. module. Let. .... be A-NFAs for each module. Then. is de

(19) ned as follows. For each A-NFA. (0. ), we introduce. A computation modular tree consists of state-nodes, and is the state-node child of. 1. l. = (. is the set of. 2. =. is a child of. be a text string in. . Here. R0. a state-node Uj , which takes a state of Mj. , which consists of pairs (. , (3) let. 1. .... Uj. 1. and. Uj u. at a modular leaf. −3−. is the root. a computation modular tree (CMT for short). ) is de

(20) ned to be a tree such that (1). for the root A-NFA. if and only if. x a1 1 1 1 an R0; ;Rl. + 1 modules. 2. U;U 0 u. U. is the set of. ) of state-nodes, (2) the root. be any state-nodes of. . In (3), if. as its value.. nodes, which. U. . Then. Uj. 2 is a. is a universal leaf, then. Rj. 1.

(21) Function BuildGraph(r). Input: an extended regular expression. r;. G = (Q; E );. Output: CMT transition graph. Step 1. Partition Pr into modules R0 ; . . . ; Rl . Step 2. Translate each module Rj (0 j l) into an A-NFA Mj . Step 3. Q := fT0 g and E := ;, where T0 is the initial CMT. Step 4. BuildCMT(G; T0 ). Step 5. Return G . 図. Rj. has another child. u. 3 at. Uj. . Hence. BuildGraph. 1: The function. Uj. U ;Uj. 1 . This time, we call the pair ( j2. 3 also becomes a child of. 3). a universal pair. If u is a negating leaf, then Rj1 has just one child Rj2 at u. This time, we call the node Uj2 a negating node. For any state-node U , let State(U ) denote the state of U . Let. q0. T. be the initial state of the root A-NFA. M0. . Then the CMT. the initial CMT. In addition, for any CMT T , T. is called. is the

(22) nal state of. M0. is called. T0. fU0 g; ;. = (. ) with. State U0 (. ) =. an accepting CMT if the state of U0. The CMT transition graph for an extended regular expression is the labeled directed graph. Q; E ; ;T0 ; Tf 6. ) such that (1). Q. is the set of CMTs reachable from. T;T 0 Tf 2 Q. each of which is an ordered pair ( (4). T0 2 Q. T 2Q. T;Tf. , there is an edge (. G. T0. , (2). E. ) of CMTs and labeled by a symbol in 6. is the initial CMT, (5). We can view. is the special CMT called. T. ) if and only if. Q. as an NFA by regarding. T0. LG (. ) accepted by. G. G. =. is the set of directed edges,. [ fg. , (3) 6 is an alphabet,. the

(23) nal CMT,. and for any CMT. is an accepting CMT. as the set of states,. E. as the transitions between states,. the initial CMT as the initial state and the

(24) nal CMT as the accepting state. language. of. .. 4.2 The CMT Transition Graph (. q0. Hence we can de

(25) ne the. in the standard way.. Now, to construct the CMT transition graph, we compute all CMTs reachable from the initial CMT by simulating the set of A-NFAs. This time, we must keep the following two conditions. 1.. Intersection condition: state. q. Let. Mj. and a modular symbol. q. transition from state. Mj Mj Mj. 1 be an A-NFA such that. to state. p. , and let. Mj. 2 and. 1 has a transition. q; (. ) =. p. for a universal. 3 be two A-NFAs associated with. is possible if and only if. 2 and. Mj. . . Then the. 3 reach the

(26) nal states at same. time. 2.. Complement condition: negating state. q. Let. Mj p. transition from state. q. to state. M Mj. q; . . , and let. Mj. p. ) = for a j1 has a transition ( 2 be the A-NFA associated with . Then the. 1 be an A-NFA such that. and a modular symbol. is possible if and only if. 2 does not reach the

(27) nal state.. BuildGraph that builds the CMT transition graph for a given extended BuildGraph

(28) rst assigns only the initial CMT invokes the procedure BuildCMT given in Fig. 2. The procedure BuildCMT computes. Now let us describe a function. regular expression. The detail is given in Fig. 1. The function. T0 G to. and then. CMTs reachable from a CMT. T. ,. T0. using the technique of depth-

(29) rst search.. To compute all the transitions from. BuildCMT invokes NextCMT. A stack Anc is used to store all CMTs on the path from T0. the CMT that is currently being processed. Then all the possible transitions from. Tf T a2. following four cases: (a) the transition to the

(30) nal CMT each state included in. T. , (c) transitions by a symbol. by a modular symbol if. T. if. T. is accepting, (b) transitions by. 6 from every state included in. T. . -move from. , (d) transitions. satis

(31) es the intersection condition or the complement condition.. NextCMT. computes cases (a), (b) and (c) in 1, 2 and 3, respectively, and computes the case (d) in 4 and 5.. r. Let. G. be the CMT transition graph generated by. . Then we have the following property for. G. to. are classi

(32) ed into the. BuildGraph for a given extended regular expression. .. −1−.

(33) Procedure BuildCMT(G ; T ). G = (Q; E ):. T:. a CMT transition graph;. a CMT; 1.. mark[T ]. :=. processed,. 2.. Push(T; Anc), /* insert T. 3.. N ext[T ] :=. 4.. Anc. */. ;,. NextCMT(G; T ),. 2 N ext[T ] such that mark[T 0 ] := unprocessed do. T0. 5. for all 6.. at the top of stack. BuildCMT(G ; T ), Pop(Anc). /* delete the top element of stack Anc */ 0. 図. 2: The Procedure. BuildCMT. (1) The number of CMTs is at most 2(2m + 1)Nr , (2) the number of outgoing edges from each CMT is at most 2Nr , (3) L(r) = L(G ). Lemma 1. a2. Q; Et Q; Et [feg. The incoming edges of each state of A-NFAs are either all edges with. G. 6. Hence so are the incoming edges of each CMT of. respectively. For any CMT transition graph. =. Q; E. G. graph induced from. Ec E 0 Et Ec. = (. G. Q. NextCMT.. by. r. e 2 Ec. be an extended regular expression and. =. (. ) =. (. is a string accepted at the CMT. ). . Furthermore, let. also de

(34) ned similarly. Then. recurrence for. [. of. 1. For. Gt. . We can compute. [. (. ). [. ] that is the minimum value over all. . Now let. ] = min. (. ) =. [. 0. (. ]. ). and. and. [. (. ). ] is. ] is de

(35) ned by the following recurrence. We are computing the following. [. [. ] =. with. [. ] = 0. ] becomes the minimum edit distance over all the substrings ending at the. .. = 0. (a). [. 0] = 0,. (b) for any CMT [. 0] =. such that. [. (. (c) for any CMT [. (d) for any CMT 2. For 1. in. ] =. ,. (b) for any CMT i. if ii. if. T T. T. [. =. (. such that. ,. , where if. is an. and 1. ) 0]. ,. [. = 1; otherwise 0,. ,. [. (. T 6 T0 E T; =. -node, then. 0] =. ) 0]. +. E L T; [. , where if. T. is an. S. -node, then. = 1;. 0].. ,. E T0;i i [. =. ) 0] +. such that. 0] = min. otherwise 0,. (a). ) has a cycle, where. according to. ] in topological order of CMTs. As in [9], if we replace. at 2-(a), then the value of. position. [. -node,. be a text string. Our algorithm

(36) rst computes. , computes. edit a1 1 1 1 ai ;y0 0 y Pre T fT j T ;T 2 Et g E T;i E T;i E T;i i x i E T0 ; T T 6 T0 E 0 T; E 0 Pre T ; T S T h T 6 T0 hL E h T; fE Pre T ; ;E h01 Pre T ; g ) such that. T0. S. -node and an. ) to be the directed acyclic. x a1 1 1 1 xn T E T;i T Pre T 0 fT 0 j T 0 ;T 2 Eg E Pre T ;i E Pre T ;i T 2P re(T ) fE T ;i g E T0 ;i i E T0;i. the CMT transition graph, and then for all CMTs (. = (. , the directed graph (. are topologically ordered from. 4.3 The Algorithm Let. Gt. , let us de

(37) ne a graph. ) such that for any. . Note that all CMTs in. while constructing. G. or all edges with a symbol. . We call such CMTs an. is an is an. where. T S. such that. T 6 T0 E 0 T;i E 0 Pre T ;i E 0 T;i fE T;i 0 =. ,. -node, then. [. ] =. -node, then. [. ] = min. = 0 if. ai. [. (. ). [. matches the label; otherwise. −5−. ],. 1] + 1. ;E Pre T ; i 0. = 1,. [. (. ). 1] +. ;E0 Pre T ;i g [. (. ). ]+1 ,.

(38) Procedure NextCMT(G ) a CMT transition graph; : a CMT; 1. if is an accepting CMT, then E := E [ f( f )g and label ( f ) with , 2. for all state-nodes in do (a) if ( ( ) ) is de

(39) ned, then for all states 2 ( ( ) ) do i. copy to , and then change ( ) to in and [ ] := [ ] [ f g, ii. Q := Q [ f g, E := E [ f( )g, label ( ) with , and if 2 then Ec := Ec [ f( )g, (b) if ( ) is a universal state, then i. copy to , an then assign ; to in , ii. if there is not nodes j1 and j2 in , then generate them, where j1 and j2 are the A-NFAs associated with ( ), iii. assign j1 and j2 to j1 and j2 , where j1 and j2 are the initial states of j1 and j2 , respectively, and [ ] := [ ] [ f g, iv. Q := Q [ f g, E := E [ f( )g, label ( ) with , and if 2 then Ec := Ec [ f( )g, (c) if ( ) is a negating state, then i. copy to , and then assign ; in , ii. if there is not node j1 in , then generate it, where j1 is the A-NFA associated with ( ), iii. assign j1 to j1 , where j1 is the initial state of j1 , and [ ] := [ ] [ f g, iv. Q := Q [ f g, E := E [ f( )g, label ( ) with , and if 2 then Ec := Ec [ f( )g, 3. for all 2 6 do (a) copy to , (b) for all state-nodes in do i. if ( ( ) ) = for a state , then change the state of to , ii. if ( ( ) ) = ;, then change the state of to ;, (c) [ ] := [ ] [ f g, Q := Q [ f g, E := E [ f( )g, label ( ) with a symbol , and if 2 then Ec := Ec [ f( )g, 4. for all universal pairs ( j1 j2 ) in do /* check the intersection condition */ (a) if j1 and j2 both have the

(40) nal states, then i. copy to , ii. assign a state to the parent of j1 and j2 ), where is the state with ( ) = such that the corresponding A-NFAs j1 and j2 ) are associated with , iii. [ ] := [ ] [ f g, Q := Q [ f g, E := E [ f( )g, label ( ) with , and if 2 then Ec := Ec [ f( )g, 5. for all negating nodes j1 in do /* check the complement condition */ (a) if j1 does not have the

(41) nal state, then i. copy to , ii. assign a state to the parent of j1 , where is the state with ( ) = such that the corresponding A-NFA j1 is associated with , iii. [ ] := [ ] [ f g, Q := Q [ f g, E := E [ f( )g, label ( ) with , and if 2 then Ec := Ec [ f( )g, 6. for all newly generated CMTs , [ ] := . ;T. G = (Q; E ):. T. T. T; T. U. T; T. S tate U ; T. . T. p. T. 0. T. State U. 0. T; T. S tate U ; . p. 0. T; T. T. 0. 0. N ext T. . T. N ext T. 0. T. 0. Anc. T; T. 0. S tate U. T. T. 0. U. U. T. U. T. 0. 0. M. M. State U. q. q. U. N ext T. T. U. q. N ext T. 0. T. T; T. q. M. M. 0. 0. T; T. 0. . T. 0. Anc. T; T. 0. State U. T. T. 0. U. U. q. T. U. T. T. 0. 0. M. q. 0. State U. M. T; T. 0. T; T. 0. N ext T. . T. 0. N ext T. T. 0. Anc. T; T. 0. a. T. T. 0. U. T. State U ; a. p. 0. p. U. State U ; a. N ext T T. 0. U. U. N ext T. T. 0. T. Anc. T; T. U. p. ;U. 0. T; T. 0. T; T. 0. a. 0. T. U. T. T. 0. p. U. U. N ext T. N ext T. T. T; T. U. U. p. 0. q; . p. . M. M. T. 0. T; T. 0. T; T. 0. . T. 0. Anc. 0. T. U. T. T. 0. p. U. U. p. M. N ext T. N ext T. T. T; T. T. q; . 0. T. 0. T; T. 0. 0. p. . mark T. 図. 0. unprocessed. 3: The Procedure. −6−. NextCMT. 0. T; T. 0. . T. 0. Anc.

(42) Algorithm APPROX(r, x, d). r , a text string x = a1 1 1 1 an , and any number d edit(a1 1 1 1 ai ; r ) such that edit(a1 1 1 1 ai ; r ) d.. Input: an extended regular expression. Output: all pairs of a position. Step 1. G :=BuildGraph(r). Step 2. InitSet(G,E ). Step 3. for i = 1 to n do EO. 1.. :=. i. and. 0.. E.. EditDist1(G ,E ,EO ,i), EditDist2(G ,E ,EO ,i),. 2. 3.. E [Tf ] d,. 4. if. then return (i; E [Tf ]).. 図. APPROX. 4: The algorithm. Procedure InitSet(G ,E ) G = (Q; E ):. E:. CMT transition graph;. array of edit distance;. E [T 0 ]. 1.. := 0,. T 2 Q in topological order do -node, then E [T ] := E [P re(T )]; otherwise E [T ]. 2. for all CMTs. T. if. is an. update :=. 3.. 4. while (a). :=. E [P re(T )] + 1,. 1,. update. = 1 do. update. := 0,. (b) for all CMTs i. if ii. if. T. 2 Q in topological order do. T is an -node, then E [T ] := E [P re(T )]; otherwise E [T ] := E [P re(T )] + 1, E [T ] has been updated then update := 1.. 図 (c) for any CMT. E h T;i [. T. T 6 T0 hL f E h Pre T ;i ;E h01 Pre T ;i g such that. ] = min [. otherwise 0, (d) for any CMT The number. L. T. InitSet. 5: The Procedure. [. =. (. ). and 1. ]. [. ,. (. ). ]. +. , where if. T 6 T0 E T;i E L T;i. such that. =. ,. [. ] =. [. T. S. is an. -node, then. = 1;. ].. is the maximum number of edges belonging to. Ec. over all loop-free paths on. G. .. The. APPROX, given in Fig. 4, computes the case of i = 0 by InitSet, the cases of 2-(a) and 2-(b) by EditDist1, and the case of 2-(c) by EditDist2. It is clear that we need E [T;i] and E [T;i 0 1] to main algorithm. E T;i in E T;i 0 E Pre T E Pre T E Pre T. compute and. [. [. (. [. ] for any 1. )] and. [. (. ET E Pre T. . Hence it is sucient to use only two arrays. 1]. Furthermore,. [. (. )] = minT 0. 0 2P re(T ) fE T g [. ]. and. [. [. (. ] and. EO T [. ] for. E T;i [. ]. 0 T 2P re(T ) fE T g. )] = min. [. 0. ] .. )] are also de

(43) ned similarly. We have the following Lemma 2 and Theorem 1.. The above recurrence correctly computes E [T;i].. Lemma 2. Given an extended regular expression r of length m, a text string x of length n and any number d 2 , theN algorithm APPROX2 correctly

(44) nd all the pre

(45) x y of x with edit y;r d in O Nr2 m 2N N LNr m n time and O Nr m space. L2 N G O Nr2 m 2N n O Nr m Nr APPROX r 2 Nr u r n r O m mn Om G Nr L Theorem 1 0. (. (2. ). r. Since. and. (. ). (. (2. ). is less than the number of CMTs in. (2. ). r ) space in the worst case.. polynomial time.. = 2. ( ) +. space for regular expressions because = 1 and. (. , the algorithm APPROX runs in. Thus if. (. is a constant number, then. In fact, if the number of extended operators occurring in. in polynomial time because. ). (2. ). r +. r). ( ) + 1. Furthermore, it runs in. (2. ). r. ) time. runs in. is a constant, then it runs. (. +. ) time and. (. ). becomes just the NFA transformed from a regular expression and. = 1.. −7−.

(46) Procedure EditDist1(G ,E ,EO , i) G = (Q; E ): 1.. E [T 0 ]. E , EO :. CMT transition graph;. arrays of edit distance;. := i,. 2. for all CMTs. T. (a) if. T. is an. (b) if. T. is an. matches. 2 Q in topological order do -node,. then. E [T ]. :=. E [P re(T )],. S -node with a symbol a, a then = 0; otherwise 1.. 図. Procedure EditDist2(G ,E ,E0 , i) G = (Q; E ): 1.. CMT transition graph;. update :=. 2. while (a). E , EO :. then. E [T ]. fEO [T ] + 1; EO [P re(T )] + ; E [P re(T )] + 1g,. := min. 6: The Procedure. where if. ai. EditDist1. arrays of edit distance;. 1,. update. = 1 do. update. := 0,. (b) for all CMTs i. if ii. if. T. 2 Q in topological order do. T is an -node, then E [T ] := E [P re(T )]; otherwise E [T ] := E [P re(T )] + 1, E [T ] has been updated then update := 1.. 図. 7: The Procedure. EditDist2. 参考文献 [1] A.V. Aho, Algorithms for

(47) nding patterns in strings, In J.V. Leeuwen, ed. Handbook of theoretical computer science, Elsevier Science Pub., 1990. [2] A. Apostolico, Z. Galil ed., Pattern Matching Algorithms, Oxford University Press, 1997. [3] R.A. Baeza-Yates and G.H. Gonnet, Fast Text Searching for Regular Expressions or Automaton Searching on Tries, J. ACM, 43,6, 915-936, 1996. [4] J.R. Knight and E.W. Myers, Super-Pattern matching, Algorithmica, 13, 1-2, 211-243, 1995. [5] J.E. Hopcroft and J.D. Ullman, Introduction to automata theory language and computation, Addison Wesley, Reading Mass, 1979. [6] G. Myers, A Four Russians Algorithm for Regular Expression Pattern Matching, J. ACM, 39,4, 430448, 1992. [7] E. Myers and W. Miller, Approximate Matching of Regular Expressions, Bull. of Mathematical Biology, 51, 1, 5-37, 1989. [8] S. Wu and U. Manber, Fast Text Searching Allowing Errors, Communications of the ACM, 35, 10, 83-91, 1992. [9] S. Wu, U. Manber and E. Myers, A Sub-Quadratic Algorithm for Approximate Regular Expression Matching, J. of Algorithm, 19, 346-360, 1995. [10] H. Yamamoto, A New Recognition Algorithm for Extended Regular Expressions, ISAAC2001 Proc., LNCS 2223, 257-267, 2001.. −8−.

(48)