In this thesis, we discussed two major topics on the application of deduc
tive database to the analysis of three-dimensional structure of protein. The results obtained through this study are as follows.
• Develovment
of
PACADE: A prototype deductive database system, which stores data on three-dimensional structure of protein, was developed. Some real-world exa1nples, including hydrophobic clusters, dipole mon1ents and supersecondary structures, could be described on the system. The results of performance evaluations show that even if a database consists of large an1ount of data and a rule set including recursive rules, the exan1ples could be processed in short tin1e. In the cases of supersecondary structures, the rviagic Set method dran1atically in1proved the efficiency of query processing (in average case, approxi
nlately
1/72
for n-stranded meander rules and approximately1/36
forGreek key rules). This result indicates that deductive database is, by its nature, highly applicable to flexible analysis of large genome data.
• Si·mila1 ity Search on Ded·uctive Dalabase: vVe developed a novel frame
work for a kind of approximate search, i.e. similarity search, on deduc
tive database. It is an example of ambiguous search request for highly complex entities expressed by heterogeneous data. By extending the prototype syste1n, it was enabled to search for similar structures in proteins with simple query description.
Basic idea of this framework is "recycle of derivation processes);. This
is one of the advantages of deductive database. Queries in this ex
tended system could also be optimized through the ;\.-lagic Set method
(
in case of n-strandecl meander, it cuts clown the execution tin1e to approximatelyl/
400). As to the rnore extended version of sinlilarity search function vvhich auton1atically compute a closure of indirect sirnilarity relationships among protein structures, cost of computation tended to be linear to the size of search space.
This result indicates that deductive database is suitable for managing ambiguous search requests over genon1e data.
Future works of this thesis are as follows.
• Develop-ment of a Graphical User Inter face (CUI) for PACADE: For the comfortable analysis of genon1e data, PACADE must be furnished with a high-level GUI. The present state of PACADE is still "flexi
ble search engine prototype", and G UI is essential in order to rnake PACAD E a more practical research tool for biologists.
• [( nowledge Disco·very in Ded'uctive Database: Besides flexible and high
speeded search of large amount of genome data, n1ethods for auto
n1atecl knowledge discovery from them are desired under the situation that they have been increasing exponentially. In the research area of database, studies called knowledge discovery or data ·mining in database have been active in these years. In a sense, searches of genon1e data rnight be clone in order to find out biological knowledges and the rules behind the large amount of experimental results.
In recent years, some studies in this research area become to have great importance, which are called inductive logic progra·m·ming (ILP). Typ
ically, ILP systems find out logical rules fron1 a s t of base facts, a set of facts as positive or negative examples and a set of background knowledges described in the form of logical rules. Though ILP systerns perform completely opposite con1putation in comparison with deduc
tive database systems, we believe that PAC AD have to evolve into the system in which discovery of biological knowleclges can be performed as well as deductive searches of data.
However, there are difficulties in applying knowledge discovery methods to large amount of genome data:
In some n1ethods, con1putation time tends to be exponential to the sizes of don1ains of values. This feature may be fatal because 1nany genon1e data include nun1erical values
(
integer, real, etc.)
detern1ined through experin1ents and they often have large do
Inalns.
In the fran1ework of ILP, a user has to give positive or negative exan1ples, in the fonn of facts, to an ILP systen1. It means that, before using the system, the user must already know "what" he wants to discover. This means that aln1ost every existing ILP systen1s could not be applicable to the areas like discovery of bio
logical knowledges concerning about relationships between struc
ture and function of protein because in such an area, the knowl
edges which should be discovered are not or could not be specified clearly.
In our opinion, a knowledge discovery framework proposed by Agrawal et al.
[1]
will be effective to the first difficulty. On the other hand, as to the second difficulty, an elegant extensions of ILP fran1ework rnacle by De Raedt et al.[
20]
will be effective as well as Agrawal's.Bibliography
[1] Agrawal,R., Imielinski,T. and Swan1i,A.N.: .Lv'lining Association Rules between Sets of Items in Large Databases, AC.NI SIGi\!IOD, pp.207-216 (1993).
[2] Akutsu, T.: Efficient and Robust Three-Dimensional Pattern .Nlatching Algorithms Using Hashing and Dynamic Programming Techniques, in Proc. of the 27th Hawaii International Conference on Syste·m Sciences, pp.225-234 (1994).
[3] Bancilhon, F. and Ramakrishnan, R.: An Amateur's Introduction to Recursive Query Processing Strategies, in Rea dings in Adificial Intelli
gence and Databases, Eels. i\!Iylopoulos and Brodie, 1\tlorgan Kaufmann, 376-430 (1989).
[4] Barker,vV.C. George,D.G., Hunt,L.T. and Garavelli,J.S.: The PIR protein sequence database, JVucleic Acids Research, 19, pp.2231-2236 (1991).
[5] Beeri,C., and Ramakrishnan, R.: On The Power of l\!Iagic, Proc. 6th A C1vf PODS, pp.269-283, San Diego, California (1987).
[6] Bernstein,F.C., Koetzle,T.F., vVilliams,G.J.B., l\!Ieyer,E.F., Brice,l\!I.D., Rodgers,.J.R., Kennard,O., Shimanouchi,T. and Tasun1i,i\!I.: The Pro
tein Data Bank: A Computer-based Archival File for 1\tlacromolecular Structures, Journal of ivfolecular Biology, 112, pp.53.S-542 (1977).
[7] French,J .C., Jones,A.K. and Pfaltz,.J .L.: NSF \Norkshop on scientific database n1anagement, SIGi�IOD Rec., Vol.19, No.4, 32-40(1990).
[8] Burks,C., Cassidy,l\tl., Cinkosky,l\tl.J., Cumella,K.E., Gilna,P., Hayden,J.E.-D., Keen,G.l\11., Kelley,T.A., Kelly,.Yl., Kristofferson,D. and Ryals,J.: Genbank, JVucleic Acids Research, 19, pp.2221-2225(1991).
[9] Codd,E.F.: A relational model for large shared data banks, Co·m·m.
ACiYI, 13:6, pp.377-387(1970).
[10] Cocld,E.F.: Relational completeness of data base sublanguages, ibid, pp.65-98(1972).
[11] Gray,P.i\!I.D., Patton,N.vV., Kemp,G.J.L. and Fothergrill,J.E.: An object-oriented database for protein structure analysis, Protein Engi
neering, 3, pp.235-243 (1990).
[12] Hashin1oto,S.: Studies on Optimization of Similarity Search in a De
ductive Database (in Japanese), Bachelor thesis of Kyushu University, (1994).
[13] Isla1n,S.A. and Sternberg,l\tl.J .E.: A Relational Database of Protein Structures Designed for Flexible Enquiries about Conforn1ation, Protein
EngineeTing, 2, pp.431-442 (1989).
[14] Kabsh, vV. and Sander, C.: Dictionary of Protein Secondary Structure:
Pattern Recognition of Hydrogen-Bonded and Geometrical Features, in Biopoly·mers, 22, pp.2577-2637 (1983).
[1.5] Kuhara, S., i\!Iatsuo, F., Futamura, S., Fujita, A., Shinohara, T., Takagi, T., and Sakaki, Y., A.: GENAS: a database system for nucleic acid sequence analysis, in 1V1tcl. Acids Res., Vol. 12, pp.89-99 (1984).
[16] Kuhara,S., Satou,K., Furuichi,E., Takagi,T., Takehara,H. and Sakaki,Y.:
A deductive database systen1 PAC AD E for the three dimensional struc
ture of protein, in Proceedings of the 24th Hawaii International Confer
ence on Syste·m Sciences, pp.6.53-659 (1991).
[1 7] Lloyd,J. vV.: Foundations of Logic P rogra1nn1ing 2nd edn., Springer
Verlag (1987).
[18] .LVforffew,A. and Todd,S.: The Use of Prolog as a Protein Querying Lan
guage, Co·mputers and Che·mistry" 10, pp.9-14 (1986).
[19] Nicholson, H., Becldel, vV. J., and iVlatthews, B. vV., A.: nhanced protein thern1ostability from designed n1utations that interact with a
helix dipole, in JVature, Vol. 336, pp.651-656 (1988).
[20] De Raedt,L. and van Laer,vV.: Inductive Constraint Logic, in .Jantke et al.(Eds.) Lecture JVotes in Artificial Intelligence 997 (ALT'95), pp.80-94 (1995).
[21] Rarnakrishnan,R., Srivastava,D. and Sudarshan,S.: CORAL-Control, Relations And Logic, Proc. 18th VLDB, pp.238-250 (1992).
[22] Rawlings,C.J., Taylor,vV.R., Nyakairu,J., Fox,J. and Sternberg,Yl.J.E.:
Using Prolog to represent and reason about protein structure, in Shapiro,E. (eel.) Third International Confere nee on Logic Progra·m·ming, Springer-Verlag, Berlin, pp.536-543 (1986).
[23] Read, R. J. and Jan1es N. G., A.: Refined crystal structure of Strep
tornyces griseus Trypsin at 1.7 A resolution in J. ivfol. B1:ol., Vol. 200, pp.523-5.S1 (1988).
[24] Richardson, J .S.: The anatomy and taxonomy of protein structure, in Advances in Protein Che·mistr·y, Academic Press, pp.283-306 (1981 ).
[25] Satou,K., Furuichi,E., Takiguchi,K., Takagi T. and Kuhara,S.: A de
ductive database system FACADE for analyzing three dirnensional and secondary structures of protein, Co·mputer Applications in Bioscience, 9, pp.2.S9-265 (1993).
[26] Satou,K., Furuichi,E., Takagi,T., Kuhara,S. and Ushijima,K.: Sin1ilar Structure Search in a Deductive Database, In Proceedings of Interna
tional Sy·mposiu·m on JVext Gene1·ation Database Syste·ms and Their Ap
plications, pp.130-137 (1993).
[27] Satou,K., Furuichi,E., Kuhara,S., and Takagi,T.: Searches for Topologi
cally and Three Dimensionally Similar Structures in Proteins, Proceed
ings of Genon1e Informatics vVorkshop IV, pp.25-35 (1993).
[28] Satou,K., Furuichi,E. Takiguchi,K., Kuhara,S. and Takagi,T.: Appli
cation of a Deductive Database System FACADE toward Discovery of
Clusters of Similar Structures in Proteins, in Proc. of th e 27th Hawaii International Conference on Sysle·m Sciences, pp.160-169 (1994).
[29]
Schwartz, R.l\11. and Dayhoff, 1\11.0., A.: in Atlas of Protein Sequence and Slntclure, Vol. 5 (eel. Dayhoff, 1\11.0.) pp.353-358 (1972).[30]
Takagi,T., Goto,S., Suzuki,T. and Ushijima,K.: Applicability of a Deductive Database to CAD Systems, in Proceed·ings of Snpple·ment 7th IEEE International Conference on Data Engineering, pp.51-58(1991).
[31]
Ullman,].: Principles of Database and Knowledge-Base Systems volume I and II, COYIPUTER SCIENCE PRESS (1989).List of Publications
1. Kuhara,S., Satou,K., Furuichi,E., Takagi,T., Takehara,H. and Sakaki,Y.:
A deductive database system FACADE for the three din1ensional struc
ture of protein, in Proceedings of the 24th Hawaii I nte7'national Con
ference on Syste·m Sciences, pp.653-659 (1991 ).
2. Satou,K., Furuichi,E., Takiguchi,K., Takagi,T. and Kuhara,S.: A de
ductive database system FACADE for analyzing three dimensional and secondary structures of protein, Co·mputer Applications in Bio. cience, 9, pp.259-265 (1993).
3. Satou,K., Furuichi,E., Takagi,T., Kuhara,S. and Ushijin1a,K.: Similar Structure Search in a Deductive Database, In Proceedings of Inter
national Sy·mpos·iu·m on JVext Generation Database Syste·ms and Their Applications, pp.130-137 (1993).
4. Satou,K., Furuichi,E., Kuhara,S. and Takagi,T.: Searches for Topologi
cally and Three Din1ensionally Sirnilar Structures in Proteins, Proceed
ings of Genome Informatics vVorkshop IV, pp.25-35 (1993).
5. Satou,K., Furuichi,E., Takiguchi,K., Kuhara,S. and Takagi,T.: Appli
cation of a Deductive Database System FACADE toward Discovery of Clusters of Similar Structures in Proteins, in Proc. of the 27th Hawaii I nlernational Conference on Syste·m Sciences, pp.160-169 (1994).
6. Nishikawa,A., Satou,K., Furuichi,E., Kuhara,S. and Ushijima,K.: Equip
ments for Statistical Analysis in a Deductive Database System -their Irnplementation and their Application to Protein Structural Analysis-, in Proc. of International Sy·mposiu·m on Advanced Database Technolo
gies and Their Integrat-ion, p p.237-244 ( 1994).
7. Nishikawa,A., Satou,K., Furuichi,E., Kuhara,S. and Ushijin1a,K.: Data Classification Component in a Deductive Database System and its Ap
plication to Protein Structural Analysis, Special Issue on Advanced Databa. ·e Technologies of IEICE Transactions on I nforrnation and Sys
len�s, Vol.E78-D, No.ll, pp.l377-l387
(1995).
8. Satou,K., Furuichi,E., Hashimoto,S., Tsukamoto,Y., Kuhara,S., Tak
agi,T. and Ushijin1a,K.: Developrnent of a Deductive Database Sys
tenl for Computing Closures of Sin1ilarity Relationships an1ong Protein Structures, Journal of Japanese Associat'io n for Artificial I ntellige nee, Vol.ll, No.3 (to appear).