Conclusion - ゲノムデータ解析への演繹データベースの応用に関する研究

In this thesis, we discussed two major topics on the application of deduc

tive database to the analysis of three-dimensional structure of protein. The results obtained through this study are as follows.

• Develovment

of

PACADE: A prototype deductive database system, which stores data on three-dimensional structure of protein, was de

veloped. Some real-world exa1nples, including hydrophobic clusters, dipole mon1ents and supersecondary structures, could be described on the system. The results of performance evaluations show that even if a database consists of large an1ount of data and a rule set including recursive rules, the exan1ples could be processed in short tin1e. In the cases of supersecondary structures, the rviagic Set method dran1atically in1proved the efficiency of query processing (in average case, approxi

nlately

1/72

for n-stranded meander rules and approximately

1/36

^for

Greek key rules). This result indicates that deductive database is, by its nature, highly applicable to flexible analysis of large genome data.

• Si·mila1 ity Search on Ded·uctive Dalabase: vVe developed a novel frame

work for a kind of approximate search, i.e. similarity search, on deduc

tive database. It is an example of ambiguous search request for highly complex entities expressed by heterogeneous data. By extending the prototype syste1n, it was enabled to search for similar structures in proteins with simple query description.

Basic idea of this framework is "recycle of derivation processes);. This

is one of the advantages of deductive database. Queries in this ex

tended system could also be optimized through the ;\.-lagic Set method

(

in case of n-strandecl meander, it cuts clown the execution tin1e to approximatelyl

/

^400). As to the rnore extended version of sinlilar

ity search function vvhich auton1atically compute a closure of indirect sirnilarity relationships among protein structures, cost of computation tended to be linear to the size of search space.

This result indicates that deductive database is suitable for managing ambiguous search requests over genon1e data.

Future works of this thesis are as follows.

• Develop-ment of a Graphical User Inter face (CUI) for PACADE: For the comfortable analysis of genon1e data, PACADE must be furnished with a high-level GUI. The present state of PACADE is still "flexi

ble search engine prototype", and G UI is essential in order to rnake PACAD E a more practical research tool for biologists.

• [( nowledge Disco·very in Ded'uctive Database: Besides flexible and high

speeded search of large amount of genome data, n1ethods for auto

n1atecl knowledge discovery from them are desired under the situation that they have been increasing exponentially. In the research area of database, studies called knowledge discovery or data ·mining in database have been active in these years. In a sense, searches of genon1e data rnight be clone in order to find out biological knowledges and the rules behind the large amount of experimental results.

In recent years, some studies in this research area become to have great importance, which are called inductive logic progra·m·ming (ILP). Typ

ically, ILP systems find out logical rules fron1 a s t of base facts, a set of facts as positive or negative examples and a set of background knowledges described in the form of logical rules. Though ILP systerns perform completely opposite con1putation in comparison with deduc

tive database systems, we believe that PAC AD have to evolve into the system in which discovery of biological knowleclges can be performed as well as deductive searches of data.

However, there are difficulties in applying knowledge discovery methods to large amount of genome data:

In some n1ethods, con1putation time tends to be exponential to the sizes of don1ains of values. This feature may be fatal because 1nany genon1e data include nun1erical values

(

integer, real, etc.

)

detern1ined through experin1ents and they often have large do

Inalns.

In the fran1ework of ILP, a user has to give positive or negative exan1ples, in the fonn of facts, to an ILP systen1. It means that, before using the system, the user must already know "what" he wants to discover. This means that aln1ost every existing ILP systen1s could not be applicable to the areas like discovery of bio

logical knowledges concerning about relationships between struc

ture and function of protein because in such an area, the knowl

edges which should be discovered are not or could not be specified clearly.

In our opinion, a knowledge discovery framework proposed by Agrawal et al.

[1]

will be effective to the first difficulty. On the other hand, as to the second difficulty, an elegant extensions of ILP fran1ework rnacle by De Raedt et al.

[

₂₀

]

will be effective as well as Agrawal's.

Bibliography

[1] Agrawal,R., Imielinski,T. and Swan1i,A.N.: .Lv'lining Association Rules between Sets of Items in Large Databases, AC.NI SIGi\!IOD, pp.207-216 (1993).

[2] Akutsu, T.: Efficient and Robust Three-Dimensional Pattern .Nlatching Algorithms Using Hashing and Dynamic Programming Techniques, in Proc. of the 27th Hawaii International Conference on Syste·m Sciences, pp.225-234 (1994).

[3] Bancilhon, F. and Ramakrishnan, R.: An Amateur's Introduction to Recursive Query Processing Strategies, in Rea dings in Adificial Intelli

gence and Databases, Eels. i\!Iylopoulos and Brodie, 1\tlorgan Kaufmann, 376-430 (1989).

[4] Barker,vV.C. George,D.G., Hunt,L.T. and Garavelli,J.S.: The PIR protein sequence database, JVucleic Acids Research, 19, pp.2231-2236 (1991).

[5] Beeri,C., and Ramakrishnan, R.: On The Power of l\!Iagic, Proc. 6th A C1vf PODS, pp.269-283, San Diego, California (1987).

[6] Bernstein,F.C., Koetzle,T.F., vVilliams,G.J.B., l\!Ieyer,E.F., Brice,l\!I.D., Rodgers,.J.R., Kennard,O., Shimanouchi,T. and Tasun1i,i\!I.: The Pro

tein Data Bank: A Computer-based Archival File for 1\tlacromolecular Structures, Journal of ivfolecular Biology, 112, pp.53.S-542 (1977).

[7] French,J .C., Jones,A.K. and Pfaltz,.J .L.: NSF \Norkshop on scientific database n1anagement, SIGi�IOD Rec., Vol.19, No.4, 32-40(1990).

[8] Burks,C., Cassidy,l\tl., Cinkosky,l\tl.J., Cumella,K.E., Gilna,P., Hayden,J.E.-D., Keen,G.l\11., Kelley,T.A., Kelly,.Yl., Kristofferson,D. and Ryals,J.: Genbank, JVucleic Acids Research, 19, pp.2221-2225(1991).

[9] Codd,E.F.: A relational model for large shared data banks, Co·m·m.

ACiYI, 13:6, pp.377-387(1970).

[10] Cocld,E.F.: Relational completeness of data base sublanguages, ibid, pp.65-98(1972).

[11] Gray,P.i\!I.D., Patton,N.vV., Kemp,G.J.L. and Fothergrill,J.E.: An object-oriented database for protein structure analysis, Protein Engi

neering, 3, pp.235-243 (1990).

[12] Hashin1oto,S.: Studies on Optimization of Similarity Search in a De

ductive Database (in Japanese), Bachelor thesis of Kyushu University, (1994).

[13] Isla1n,S.A. and Sternberg,l\tl.J .E.: A Relational Database of Protein Structures Designed for Flexible Enquiries about Conforn1ation, Protein

EngineeTing, 2, pp.431-442 (1989).

[14] Kabsh, vV. and Sander, C.: Dictionary of Protein Secondary Structure:

Pattern Recognition of Hydrogen-Bonded and Geometrical Features, in Biopoly·mers, 22, pp.2577-2637 (1983).

[1.5] Kuhara, S., i\!Iatsuo, F., Futamura, S., Fujita, A., Shinohara, T., Takagi, T., and Sakaki, Y., A.: GENAS: a database system for nucleic acid sequence analysis, in 1V1tcl. Acids Res., Vol. 12, pp.89-99 (1984).

[16] Kuhara,S., Satou,K., Furuichi,E., Takagi,T., Takehara,H. and Sakaki,Y.:

A deductive database systen1 PAC AD E for the three dimensional struc

ture of protein, in Proceedings of the 24th Hawaii International Confer

ence on Syste·m Sciences, pp.6.53-659 (1991).

[1 7] Lloyd,J. vV.: Foundations of Logic P rogra1nn1ing 2nd edn., Springer

Verlag (1987).

[18] .LVforffew,A. and Todd,S.: The Use of Prolog as a Protein Querying Lan

guage, Co·mputers and Che·mistry" 10, pp.9-14 (1986).

[19] Nicholson, H., Becldel, vV. J., and iVlatthews, B. vV., A.: nhanced protein thern1ostability from designed n1utations that interact with ^a

helix dipole, in JVature, Vol. 336, pp.651-656 (1988).

[20] De Raedt,L. and van Laer,vV.: Inductive Constraint Logic, in .Jantke et al.(Eds.) Lecture JVotes in Artificial Intelligence 997 (ALT'95), pp.80-94 (1995).

[21] Rarnakrishnan,R., Srivastava,D. and Sudarshan,S.: CORAL-Control, Relations And Logic, Proc. 18th VLDB, pp.238-250 (1992).

[22] Rawlings,C.J., Taylor,vV.R., Nyakairu,J., Fox,J. and Sternberg,Yl.J.E.:

Using Prolog to represent and reason about protein structure, in Shapiro,E. (eel.) Third International Confere nee on Logic Progra·m·ming, Springer-Verlag, Berlin, pp.536-543 (1986).

[23] Read, R. J. and Jan1es N. G., A.: Refined crystal structure of Strep

tornyces griseus Trypsin at 1.7 A resolution in J. ivfol. B1:ol., Vol. 200, pp.523-5.S1 (1988).

[24] Richardson, J .S.: The anatomy and taxonomy of protein structure, in Advances in Protein Che·mistr·y, Academic Press, pp.283-306 (1981 ).

[25] Satou,K., Furuichi,E., Takiguchi,K., Takagi T. and Kuhara,S.: A de

ductive database system FACADE for analyzing three dirnensional and secondary structures of protein, Co·mputer Applications in Bioscience, 9, pp.2.S9-265 (1993).

[26] Satou,K., Furuichi,E., Takagi,T., Kuhara,S. and Ushijima,K.: Sin1ilar Structure Search in a Deductive Database, In Proceedings of Interna

tional Sy·mposiu·m on JVext Gene1·ation Database Syste·ms and Their Ap

plications, pp.130-137 (1993).

[27] Satou,K., Furuichi,E., Kuhara,S., and Takagi,T.: Searches for Topologi

cally and Three Dimensionally Similar Structures in Proteins, Proceed

ings of Genon1e Informatics vVorkshop IV, pp.25-35 (1993).

[28] Satou,K., Furuichi,E. Takiguchi,K., Kuhara,S. and Takagi,T.: Appli

cation of a Deductive Database System FACADE toward Discovery of

Clusters of Similar Structures in Proteins, in Proc. of th e 27th Hawaii International Conference on Sysle·m Sciences, pp.160-169 (1994).

[29]

_Schwartz,_R.l\11.and Dayhoff, 1\11.0., A.: in Atlas of Protein Sequence and Slntclure, Vol. 5 (eel. Dayhoff, 1\11.0.) pp.353-358 (1972).

[30]

Takagi,T., Goto,S., Suzuki,T. and Ushijima,K.: Applicability of a De

ductive Database to CAD Systems, in Proceed·ings of Snpple·ment 7th IEEE International Conference on Data Engineering, pp.51-58(1991).

[31]

Ullman,].: Principles of Database and Knowledge-Base Systems volume I and II, COYIPUTER SCIENCE PRESS (1989).

List of Publications

1. Kuhara,S., Satou,K., Furuichi,E., Takagi,T., Takehara,H. and Sakaki,Y.:

A deductive database system FACADE for the three din1ensional struc

ture of protein, in Proceedings of the 24th Hawaii I nte7'national Con

ference on Syste·m Sciences, pp.653-659 (1991 ).

2. Satou,K., Furuichi,E., Takiguchi,K., Takagi,T. and Kuhara,S.: A de

ductive database system FACADE for analyzing three dimensional and secondary structures of protein, Co·mputer Applications in Bio. cience, 9, pp.259-265 (1993).

3. Satou,K., Furuichi,E., Takagi,T., Kuhara,S. and Ushijin1a,K.: Similar Structure Search in a Deductive Database, In Proceedings of Inter

national Sy·mpos·iu·m on JVext Generation Database Syste·ms and Their Applications, pp.130-137 (1993).

4. Satou,K., Furuichi,E., Kuhara,S. and Takagi,T.: Searches for Topologi

cally and Three Din1ensionally Sirnilar Structures in Proteins, Proceed

ings of Genome Informatics vVorkshop IV, pp.25-35 (1993).

5. Satou,K., Furuichi,E., Takiguchi,K., Kuhara,S. and Takagi,T.: Appli

cation of a Deductive Database System FACADE toward Discovery of Clusters of Similar Structures in Proteins, in Proc. of the 27th Hawaii I nlernational Conference on Syste·m Sciences, pp.160-169 (1994).

6. Nishikawa,A., Satou,K., Furuichi,E., Kuhara,S. and Ushijima,K.: Equip

ments for Statistical Analysis in a Deductive Database System -their Irnplementation and their Application to Protein Structural Analysis-, in Proc. of International Sy·mposiu·m on Advanced Database Technolo

gies and Their Integrat-ion, p p.237-244 ( 1994).

7. Nishikawa,A., Satou,K., Furuichi,E., Kuhara,S. and Ushijin1a,K.: Data Classification Component in a Deductive Database System and its Ap

plication to Protein Structural Analysis, Special Issue on Advanced Databa. ·e Technologies of IEICE Transactions on I nforrnation and Sys

len�s, Vol.E78-D, No.ll, pp.l377-l387

(1995).

8. Satou,K., Furuichi,E., Hashimoto,S., Tsukamoto,Y., Kuhara,S., Tak

agi,T. and Ushijin1a,K.: Developrnent of a Deductive Database Sys

tenl for Computing Closures of Sin1ilarity Relationships an1ong Protein Structures, Journal of Japanese Associat'io n for Artificial I ntellige nee, Vol.ll, No.3 (to appear).

ドキュメント内ゲノムデータ解析への演繹データベースの応用に関する研究 (ページ 43-53)