Prediction of protein-protein interactions and disease genes by machine learning


Annals of the Institute of Statistical Mathematics manuscript No.

(will be inserted by the editor)


Thanh Phuong Nguyen · Tu Bao Ho

Received: 14 August 2007 / Revised: 20 August 2007

Abstract This paper presents two machine learning based methods to solve two significant problems in bioinformatics: prediction of protein-protein interactions and prediction of disease genes.

Protein-protein interactions (PPI) are intrinsic to almost all cellular processes, and recent computational methods offer new opportunities to study PPI and related problems in molecular biology and medicine. We first use inductive logic programming (ILP) to predict PPI from integrative protein domain data and genomic/proteomic data. Starting with constructed, biologically significant background knowledge of more than 220,000 ground facts, we induce significant ILP rules that predict protein-protein interactions better than other methods. We then use semi-supervised learning methods to exploit PPI data for predicting disease genes. In addition to the 3,053 disease genes known in the OMIM database, we found about fifty novel putative genes that potentially cause a number of diseases.

Keywords Protein-protein interaction · integrative method · inductive logic programming · semi-supervised learning · disease gene prediction

1 Introduction

The past decade has seen tremendous growth in the amount of biomedical data. Recent research in biomedical computing and informatics is focused on the interpretation of such data. Given the wealth of data, the interpretation cannot be done manually. It requires advanced computational tools that mimic some aspects of the manual interpretation process but expedite it severalfold. Machine learning is concerned with the automatic acquisition of models from data, as well as with the usage of such models for automatic inference

Nguyen and Ho

Japan Advanced Institute of Science and Technology. E-mail: {phuong, bao}@jaist.ac.jp


and prediction. Therefore, machine learning methods have much to offer, and are currently applied to a wide variety of biomedical problems (1).

The ultimate goal of our ongoing research is to understand and predict the normal function of organisms and, more importantly, the mechanisms underlying disease. Protein-protein interactions are indispensable at almost every level of cell function: in the structure of sub-cellular organelles, muscle contraction, signal transduction, regulation of gene expression, etc. Detecting both normal and abnormal biological processes via prediction of protein-protein interactions (PPI) has emerged as a new trend, both in vitro and in silico.

With the recent blooming of public proteomic and genomic databases, numerous machine learning approaches offer a chance to study protein-protein interactions more widely and deeply. Besides methods based on a single data source, many bioinformaticians have pursued the integrative approach, which employs multiple data sources to better predict PPI. Typical work includes methods that use Bayesian networks (10), kernel methods (2), probabilistic decision trees (31), inductive logic programming (25), and a probabilistic model (22), among others. The domain-based approach to PPI prediction has also received much attention, as protein domains are believed to be the key regulators in protein-protein interactions. Typical work includes methods that use associations (23), a graph-oriented method for interacting domain profile pairs (29), and a domain-based random forest (6), among others.

The shortcoming of integrative methods is that they do not take protein domains into account, although there is evidence that the biological mechanism behind protein-protein interactions involves protein domains and their interactions (21). On the other hand, while domain-based methods all value the biological roles of protein domains in PPI prediction, most of them merely consider the co-occurrence of domains or domain pairs. To predict PPI comprehensively, it seems necessary that domain-based methods also employ genomic/proteomic features.

The first method introduced in this paper, first proposed in (19), is a novel integrative domain-based approach that uses inductive logic programming to predict protein-protein interactions. The key idea of this computational method is to integrate protein domain features and multiple genomic/proteomic features. To integrate these two kinds of features efficiently in predicting PPI, we specified two main tasks. The first is to extract as many useful domain and genomic/proteomic features related to PPI as possible. From seven popular databases, we extracted more than 220,000 ground facts of domain fusion, domain-domain interaction features and various other biologically significant genomic/proteomic features. The second is to employ inductive logic programming (ILP) with this large amount of background knowledge to infer PPI.

In addition, rich information of protein-protein interaction networks helps us to understand the processes and events related to diseases. The information contained in our genes is so critical that simple changes can lead to a severe inherited disease, make us more inclined to develop a chronic disease, or make us more vulnerable to an infectious disease. Genes having such changes to


cause some diseases are called disease-causing genes or disease genes (18). Intuitively, proteins corresponding to disease genes are disease proteins.

Research on PPI and diseases has been increasing rapidly in recent years. Disease genes were discovered from topological features in human PPI networks (30) using the k-nearest neighbor algorithm. Using a phenomic ranking of protein complexes linked to human diseases, a Bayesian model was proposed to predict new candidates in disorders (14). In (3), the authors integrated graph kernels for gene expression and human PPI to predict disease genes. In addition, some work concentrated on using PPI to discover disease genes for specific diseases, e.g. Alzheimer disease, using heuristic score functions (5), (13).

From the machine learning point of view, these previous works applied supervised learning based on known disease genes (labeled data) to predict new candidate genes causing diseases. However, the ratio of known disease genes to the total number of human genes is currently very small. It is a shortcoming if the biological information of genes close to disease genes is omitted. In this paper we present the second method, for disease gene prediction, which makes the most of semi-supervised learning by integrating data on human protein-protein interactions with various biological data extracted from multiple proteomic/genomic databases. The key idea is based on the assumption that disease genes have close biological associations with other genes whose proteins interact with the proteins of the disease genes.

We employ semi-supervised learning methods to determine the extended set of candidate proteins from human protein interaction networks, and to predict putative disease genes from the extended set. We extract various proteomic/genomic features of the protein candidates, such as protein domains, GO terms, protein keywords, and coded enzymes, to comprehensively infer disease genes.

We carried out various experiments with disease genes extracted from the OMIM (Online Mendelian Inheritance in Man) database (version 2007) (7). We ran five experiments with different sizes of labeled data, and twenty trials for each experiment, to evaluate the accuracy of the method. The prediction accuracy was 82%, which shows that the proposed method is useful for the disease gene prediction problem. About fifty potential disease proteins were predicted, and some of them have been validated in the scientific literature.

2 Prediction of Protein-Protein Interaction by ILP

In this section, we present our proposed method to predict protein-protein interactions based on domain and multiple genomic/proteomic data using ILP. The two main tasks of the method are: (1) constructing integrated background knowledge¹ of domain features and multiple genomic/proteomic features, and (2) learning PPI predictive rules by ILP from the constructed background knowledge.

¹ the terms ‘background knowledge’ and ‘ground facts’ (the second task) are used in terms

2.1 Inductive Logic Programming

Inductive logic programming is the intersection of machine learning and logic programming (17). Inductive logic programming uses logic programming as a uniform representation for examples, background knowledge and hypotheses. Given an encoding of the known background knowledge and a set of examples (positive and negative) represented as a logical database of ground facts, an ILP system derives hypotheses in the form of logical rules that entail all of the positive and none of the negative examples. The ILP schema is:

Positive examples + Negative examples + Background knowledge ⇒ Hypotheses.
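The condition in the schema can be illustrated with a minimal sketch, assuming that a hypothesis is represented as a hypothetical Python predicate over protein pairs and that background knowledge is a set of ground-fact tuples. The facts, the rule and the function names are toy assumptions; the sketch only checks whether a given rule covers all positives and no negatives, and does not show how an ILP system searches for rules.

```python
# Illustrative only: background knowledge as a set of ground facts (tuples).
# The facts and the rule are toy assumptions, not taken from the paper's data.
background = {
    ("hasddi", "a", "b"),
    ("hasddi", "c", "d"),
}


def rule(pair, facts):
    """Hypothetical rule: two proteins interact if a hasddi fact links them."""
    x, y = pair
    return ("hasddi", x, y) in facts


def is_complete_and_consistent(hypothesis, positives, negatives, facts):
    """True if the hypothesis covers every positive and no negative example."""
    covers_all_pos = all(hypothesis(e, facts) for e in positives)
    covers_no_neg = not any(hypothesis(e, facts) for e in negatives)
    return covers_all_pos and covers_no_neg


positives = [("a", "b"), ("c", "d")]
negatives = [("a", "d")]
ok = is_complete_and_consistent(rule, positives, negatives, background)
```

In practice an ILP system tolerates some noise, so the strict all-or-none check above is a simplification.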

Many ILP systems have been applied to different problems in bioinformatics. ILP is particularly suitable for bioinformatics tasks because of its ability to take background knowledge into account and to work directly with structured data (20). The ILP system GOLEM has been applied to find a predictive theory about the relationship between chemical structure and activity (12). Other central concerns of bioinformatics have been convincingly addressed by ILP, such as protein secondary structure prediction, protein fold recognition, and protein-protein interaction prediction.

2.2 Extracting domain fusion and domain-domain interaction data

Protein domains form the structural or functional units of proteins that partake in intermolecular interactions. The existence of certain domains in proteins can therefore suggest the propensity of the proteins to interact or form a stable complex to bring about certain biological functions. Domain fusion and domain-domain interaction features have important biological roles in PPI prediction (26), (21), and these two domain features are extracted in our work. Let $P$ denote the set of considered proteins $p_i$, and $D$ the set of all protein domains $d_k$ that belong to proteins $p_i$. An interacting protein pair $(p_i, p_j)$ is denoted by $p_{ij}$, and a non-interacting pair by $\neg p_{ij}$. Similarly, for a domain pair $(d_k, d_l)$, $d_{kl}$ represents an interaction and $\neg d_{kl}$ a non-interaction.

Domains of interacting proteins are more likely to fuse together than domains of non-interacting proteins. Therefore, when we find a pair of proteins that have fused domains, we can predict an interaction between them. Domain fusion data are taken from the Domain Fusion Database (26). We extracted domain fusion data for protein pairs $(p_i, p_j)$, $\forall p_i, p_j \in P$. The following predicate represents the domain fusion between two proteins:

domain_fusion(+protein, +protein, #FUSION) (1)

Note that in the ILP system used, Aleph (A Learning Engine for Proposing Hypotheses) (24), mode declarations are used to build the bottom clauses, and a simple mode type is one of the following: (1) an input variable (+), (2) an output variable (−), or (3) a constant term (#). Predicate (1) states whether two input proteins, A and B, have fused domains (valued "yes" by the constant term #FUSION). This predicate is supported by a set of ground facts $G_{domain\_fusion}$, e.g., domain_fusion(ap3m_yeast, ap3b_yeast, yes). After preprocessing, the set $G_{domain\_fusion}$ consists of 255 ground facts for protein pairs.

The assumption that proteins interact with each other through interactions of their domains is widely accepted and validated. Domain-domain interaction (DDI) data are exploited to predict PPI more reliably. We extracted DDI data from the iPfam database (http://www.sanger.ac.uk/Software/Pfam/iPfam/). iPfam is a resource that describes domain-domain interactions observed in PDB entries. The domains are defined by Pfam. When two or more domains occur in a single structure, the domains are analysed to see whether they form an interaction, and the bonds forming the interaction are calculated. We considered two features of DDI. The first feature is whether a protein pair $(p_i, p_j)$ has a domain interaction $d_{kl}$ and, if so, how many $d_{kl}$ it has. This information is formulated by the predicate:

hasddi(+protein, +protein, #DDI) (2)

The set of ground facts for this predicate, $G_{ddi}$, includes 573 ground facts, for example hasddi(jsn1_yeast, yip1_yeast, 2), hasddi(msh4_yeast, msh5_yeast, 5), etc.

The number of domain-domain interactions of a protein is one of the features that may increase or decrease the probability of its interaction with other proteins. We therefore considered the relationship between PPI and the number of DDI of each interacting partner. This relationship is represented by predicate (3).

num_ddi(+protein, #NUM_DDI) (3)

The set of ground facts of the above predicate, denoted by $G_{num\_ddi}$, contains 289 ground facts. We found that some proteins have a large number of DDI, for example num_ddi(did4_yeast, 20) or num_ddi(bud27_yeast, 39), and these proteins potentially interact with many other proteins.
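As a rough illustration of how ground facts for predicates (2) and (3) might be derived, the sketch below assumes a hypothetical mapping from proteins to their Pfam domains and a hypothetical list of interacting domain pairs. The function name, the identifiers and the counting rule for num_ddi (the number of known DDI pairs that involve any domain of the protein) are assumptions, not details given in the text.

```python
from itertools import combinations


def ddi_ground_facts(protein_domains, ddi_pairs):
    """Build hasddi/3 and num_ddi/2 ground facts as Prolog-style strings."""
    # Unordered domain pairs; a self-interaction collapses to a one-element set.
    ddi = {frozenset(pair) for pair in ddi_pairs}
    facts = []
    # hasddi(p, q, n): n interacting domain pairs between proteins p and q.
    for p, q in combinations(sorted(protein_domains), 2):
        n = sum(
            1
            for dk in set(protein_domains[p])
            for dl in set(protein_domains[q])
            if frozenset((dk, dl)) in ddi
        )
        if n > 0:
            facts.append(f"hasddi({p},{q},{n}).")
    # num_ddi(p, n): assumed here to count DDI pairs touching any domain of p.
    for p in sorted(protein_domains):
        domains = set(protein_domains[p])
        n = sum(1 for pair in ddi if pair & domains)
        if n > 0:
            facts.append(f"num_ddi({p},{n}).")
    return facts


# Toy input with made-up protein names and domain identifiers.
protein_domains = {
    "p1_yeast": ["PF00001", "PF00002"],
    "p2_yeast": ["PF00003"],
    "p3_yeast": ["PF00004"],
}
ddi_pairs = [("PF00001", "PF00003"), ("PF00002", "PF00003")]
facts = ddi_ground_facts(protein_domains, ddi_pairs)
```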

2.3 Extracting proteomic and genomic data from multiple databases

In addition to domain fusion and domain-domain interaction features, we mined genomic and proteomic data from UniProt database, CYGD database, InterPro database, Gene Ontology database, and Gene Expression database to detect useful genomic and proteomic features for PPI prediction. Table 1 shows 19 predicates corresponding to proteomic/genomic data extracted from multiple databases.


As the world's most comprehensive catalog of information on proteins, the UniProt database (http://www.pir.uniprot.org/) provides functional, structural or other categories; regions or sites of interest in the sequences (in Feature Table, FT, lines); enzymes coded (EC); and pointers to information related to entries and found in data collections other than UniProt, such as GO, PIR, PROSITE, Pfam, and InterPro.

Table 1 Predicates used as background knowledge for the various genomic data

Genomic data     Background knowledge predicate                Meaning
UniProt          keyword(+protein, #Keyword)                   A protein has a protein keyword
                 feature(+protein, #Feature)                   A protein has a protein feature
                 coded_enzyme(+protein, #EC)                   A protein has a coded enzyme
                 dr_prosite(+protein, -PROTSITE_ID)            A protein has a PROSITE annotation number
                 dr_interpro(+protein, -INTERPRO_ID)           A protein has an InterPro annotation number
                 dr_go(+protein, -GO_TERM)                     A protein has a GO term
                 dr_pfam(+protein, -PFAM_ID)                   A protein has a Pfam annotation number
                 dr_pir(+protein, -PIR_ID)                     A protein has a PIR annotation number
CYGD             subcell_cat(+protein, #SUBCELLCAT)            A protein has subcellular structures in which it is found
                 function_cat(+protein, #FUNCAT)               A protein has a certain function category
                 protein_cat(+protein, #PROTEINCAT)            A protein has a certain protein category
                 phenotype_cat(+protein, #FENCAT)              A protein has a certain phenotype category
                 complex_cat(+protein, #COMPLEXCAT)            A protein has a certain complex category
InterPro         interpro_go(+INTERPRO_ID, -GO_TERM)           Mapping of InterPro annotations and GO terms
GO               is_a(+GO_TERM, -GO_TERM)                      is_a relation between two GO terms
                 part_of(+GO_TERM, -GO_TERM)                   part_of relation between two GO terms
Gene Expression  expression(+protein, +protein, #COEFFICIENT)  Gene expression correlation coefficient of two proteins
Others           num_ppi(+protein, +protein, #NUM_PPI)         A protein has a number of protein-protein interactions
                 ig(+protein, +protein, #IG)                   Interaction generality of two proteins: the number of proteins that interact with the two considered proteins

Some examples of data extracted for these predicates are keyword(ace1_yeast, transcription_regulation), feature(ldb7_yeast, chain_chromatin_structure_remodeling_complex), coded_enzyme(uqcr1_yeast, ec1.10.2), dr_go(twoa5d_yeast, go0005935), etc. The first three predicates represent general protein features that should affect their interactions. The others give references to other databases. Data from different databases related to PPI are bound together by these predicates. We extracted 10,919 ground facts for these UniProt predicates.

The MIPS (http://mips.gsf.de/genre/proj/yeast/) Comprehensive Yeast Genome Database (CYGD) presents information on the molecular structure and functional network of the entirely sequenced, well-studied model eukaryote, the budding yeast Saccharomyces cerevisiae. A protein has more chance to interact with proteins in the same category than with proteins in different categories. The set $G_{CYGD}$ extracted from the CYGD database consists of 2,152 ground facts. Here are some examples: subcell_cat(ahc1_yeast, cytoplasm), phenotype_cat(cyk2_yeast, cell_cycle_defects), etc.

The InterPro database (http://www.ebi.ac.uk/interpro/) is a database of protein families, domains and functional sites. We considered the association between InterPro identifiers and GO terms. There are 556 ground facts.

The Gene Ontology database (http://www.geneontology.org/) has three organizing principles: molecular function, biological process and cellular component. The terms in an ontology are linked by two relationships, is_a and part_of. The two predicates for the GO database have 438 ground facts (e.g., is_a(go0000002, go0007005), part_of(go0000032, go0007047)).

Interacting proteins are often co-expressed, so this genomic feature is useful in predicting PPI. The gene expression coefficients between two proteins are taken from (10), which contains 25,000,000 pairwise coefficients for about 18,773,128 protein pairs. In our work, we randomly extracted 200,000 gene expression coefficients as ground facts of the predicate expression(+protein, +protein, #COEFFICIENT) for about 110,000 positives and negatives in the training data set.

The last two predicates represent information about the number of protein-protein interactions (with the set $G_{num\_ppi}$ of 690 ground facts) and the interaction generality of two interacting partners (with the set $G_{ig}$ of 1,718 ground facts). Interaction generality is the number of proteins that interact with both interacting partners in an interaction.
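A minimal sketch of these two quantities, assuming the interaction data are available as a hypothetical list of undirected protein pairs: the number of interactions of a protein is its degree in the network, and the interaction generality of a pair is taken here to be the number of common neighbours, following the definition above. Function and variable names are illustrative.

```python
from collections import defaultdict


def build_neighbors(interactions):
    """Map each protein to the set of proteins it interacts with."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def num_ppi(protein, neighbors):
    """Number of interactions of one protein (its degree in the network)."""
    return len(neighbors[protein])


def interaction_generality(a, b, neighbors):
    """Number of proteins that interact with both a and b."""
    return len((neighbors[a] & neighbors[b]) - {a, b})


# Toy network with made-up protein names.
interactions = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("a", "d"), ("b", "d"), ("c", "e"),
]
nb = build_neighbors(interactions)
ig_ab = interaction_generality("a", "b", nb)
```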

2.4 Constructing background knowledge for predicting PPIs

After defining 22 predicates, we extract data in the form of ground facts for these predicates from seven databases (two databases for domain features and five others for genomic and proteomic features). We denote the sets of ground facts extracted from UniProt, CYGD, InterPro, Gene Ontology, and Gene Expression by $G_{UniProt}$, $G_{CYGD}$, $G_{InterPro}$, $G_{GO}$, and $G_{expression}$, respectively. Algorithm 1 presents the procedure to extract data from multiple databases to construct background knowledge for PPI prediction.

2.5 Predicting PPI with integrative domain-based ILP

Algorithm 2 describes the integrative domain-based ILP framework for predicting PPI from multiple genomic/proteomic databases. This framework follows the common procedure of ILP methods. Step 2 and Step 3 generate the positive and negative examples $S_{interact}$ and $S_{\neg interact}$, respectively (see Subsection 4.1). In Step 4, we extract the background knowledge $S_{background}$, including both domain features and genomic/proteomic


Algorithm 1 Extracting domain feature data and genomic/proteomic feature data from multiple sources.
Input:
    Set of proteins $\{p_i\} \subseteq P$.
Output:
    Sets of ground facts $G_L = \{G_l\}$, $G_l \in \{G_{domain\_fusion}, G_{ddi}, G_{num\_ddi}, G_{UniProt}, G_{CYGD}, G_{InterPro}, G_{GO}, G_{expression}, G_{ig}, G_{num\_ppi}\}$.
1: Initialize all sets of ground facts $G_l := \emptyset$; $D := \emptyset$.
2: Extract all domains $d_k$ belonging to proteins $p_i$; $D := D \cup \{d_k\}$.
3: for each protein pair $(p_i, p_j)$
4:     for all $d_k \in p_i$ and $d_l \in p_j$
5:         if $fused(d_k, d_l) = true$ then
               $G_{domain\_fusion} := G_{domain\_fusion} \cup \{(p_i, p_j)\}$.
6:         if $\exists\, d_{kl}$ then
               $G_{ddi} := G_{ddi} \cup \{(p_i, p_j)\}$.
               Count the numbers of DDI $num\_ddi_i$ and $num\_ddi_j$ for proteins $p_i$ and $p_j$ for $G_{num\_ddi}$, respectively.
7: for each protein $p_i \in P$
8:     Extract data from the UniProt database and the CYGD database for $G_{UniProt}$ and $G_{CYGD}$, respectively.
9:     Extract mapping data between GO terms $g_i$ and InterPro identifiers $t_i$ related to $p_i$ from the InterPro database for $G_{InterPro}$; $G_{InterPro} := G_{InterPro} \cup \{(t_i, g_i)\}$.
10: for each protein $p_i \in P$
11:     for each protein $p_j \in P$
12:         Extract the relationship $r_{ij}$ between GO terms $(g_i, g_j)$ related to $(p_i, p_j)$ from the GO database; $G_{GO} := G_{GO} \cup \{r_{ij}(g_i, g_j)\}$.
13:         Extract the expression correlation coefficient $e_{ij}$ of $(p_i, p_j)$; $G_{expression} := G_{expression} \cup \{(p_i, p_j, e_{ij})\}$.
14:         Extract the interaction generality $n_{ij}$ of $(p_i, p_j)$; $G_{ig} := G_{ig} \cup \{(p_i, p_j, n_{ij})\}$.
15:         if $\exists\, p_{ij}$ then $num\_ppi_i := num\_ppi_i + 1$;
16:     $G_{num\_ppi} := G_{num\_ppi} \cup \{(p_i, num\_ppi_i)\}$.
17: return $G_L$.

Algorithm 2 An integrative domain-based ILP framework for PPI prediction
Input:
    Set of protein-protein interactions $S_{interact} = \{p_{ij}\}$.
    Number of negative examples ($\neg p_{ij}$) $N$.
    Sets of ground facts $G_{domain\_fusion}$, $G_{ddi}$, $G_{num\_ddi}$, $G_{UniProt}$, $G_{CYGD}$, $G_{InterPro}$, $G_{GO}$, $G_{expression}$, $G_{ig}$, and $G_{num\_ppi}$.
Output:
    Set of rules $R$ for protein-protein interaction prediction.
1: $R := \emptyset$.
2: Extract positive examples for the set $S_{interact}$.
3: Generate $N$ negative examples $\neg p_{ij}$; $S_{\neg interact} = \{\neg p_{ij}\}$.
4: Call Algorithm 1 to generate sets of ground facts $G_l$; $S_{background} = G_L = \{G_l\}$.
5: Run an ILP program with $S_{interact}$, $S_{\neg interact}$ and $S_{background}$ to induce rules $r$.
6: $R := R \cup \{r\}$.
7: return $R$.


features from sets of ground facts of defined predicates (see Section 2.4). In Step 5, in our experiments, the Aleph system was applied to induce rules.

Aleph is an advanced ILP system that uses a top-down ILP covering algorithm. Aleph requires three input files to construct theories: positive examples, negative examples and background knowledge. The target predicate in our work is has_int(+protein, +protein), meaning that two arbitrary proteins interact. Aleph learns from the three inputs and induces rules (hypothesized clauses) in terms of the relationships between the target predicate and the other predicates declared in the background knowledge.
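The three inputs can be pictured with a small sketch that assembles their contents as strings. It assumes Aleph's usual convention of three files sharing one stem with the extensions .b (background knowledge), .f (positive examples) and .n (negative examples); the stem, the helper name, the lower-case type names and the choice of a single background predicate are illustrative assumptions, and the mode and determination lines are not the authors' actual declarations.

```python
def aleph_inputs(stem, positives, negatives, background_facts):
    """Return a dict mapping hypothetical file names to file contents."""
    # Assumed declarations: one head mode, one body mode, one determination.
    header = [
        ":- modeh(1, has_int(+protein,+protein)).",
        ":- modeb(*, hasddi(+protein,+protein,#ddi)).",
        ":- determination(has_int/2, hasddi/3).",
    ]

    def as_examples(pairs):
        # Both example files list plain has_int/2 facts, one per line.
        return [f"has_int({a},{b})." for a, b in pairs]

    return {
        f"{stem}.b": "\n".join(header + list(background_facts)) + "\n",
        f"{stem}.f": "\n".join(as_examples(positives)) + "\n",
        f"{stem}.n": "\n".join(as_examples(negatives)) + "\n",
    }


# Toy usage: the negative pair is made up for illustration.
files = aleph_inputs(
    "ppi",
    positives=[("msh4_yeast", "msh5_yeast")],
    negatives=[("jsn1_yeast", "msh5_yeast")],
    background_facts=["hasddi(msh4_yeast,msh5_yeast,5)."],
)
```

Writing the returned strings to disk and loading them into Aleph is left out, since it depends on the local Prolog setup.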

3 Prediction of disease genes by semi-supervised learning

3.1 Semi-supervised Learning

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. SSL considers both labeled data (as in supervised learning) and unlabeled data (as in unsupervised learning). A given data set $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\}$ can always be divided into two parts. The first is the set of $l$ data points $X_l = \{x_1, \ldots, x_l\}$, which are labeled by the label set $Y_l = \{y_1, \ldots, y_l\}$, and the other is the set of $u = n - l$ data points $X_u = \{x_{l+1}, \ldots, x_n\}$, the labels of which are not known. The goal is to predict the labels of the unlabeled data. Some often-used semi-supervised learning methods include EM with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods (4).

Since labeling often requires much human labor, whereas unlabeled data is far easier to obtain, semi-supervised learning is very useful in many real-world problems, and has recently attracted an increasing number of researchers (4). In bioinformatics, SSL has also been applied to many problems with considerable results, for example, in the study of protein classification (28) and in functional genomics (15).

In this paper, we employed the Harmonic Gaussian method (32), a graph-based semi-supervised learning algorithm, in the proposed framework. Because we integrated human protein-protein interaction networks, graph-based semi-supervised learning was considered suitable for predicting disease genes.
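To give an intuition for graph-based label propagation of this kind, the sketch below uses the harmonic property: labeled nodes keep their labels, and each unlabeled node is repeatedly set to the weighted average of its neighbours until the values stop changing. This is a simplified illustration under assumed inputs (a weighted adjacency dictionary and 0/1 labels); it is not the SemiL implementation, and the function and parameter names are hypothetical.

```python
def harmonic_labels(weights, labels, n_iter=1000, tol=1e-9):
    """Propagate labels over a weighted graph by iterative averaging.

    weights: dict node -> dict neighbour -> non-negative edge weight
             (every neighbour must also be a key of `weights`).
    labels:  dict node -> known label value (e.g. 1.0 disease, 0.0 non-disease).
    """
    # Labeled nodes start at their label, unlabeled nodes at a neutral 0.5.
    f = {v: labels.get(v, 0.5) for v in weights}
    for _ in range(n_iter):
        largest_change = 0.0
        for v in weights:
            if v in labels:
                continue  # labeled nodes are clamped
            total = sum(weights[v].values())
            if total == 0:
                continue  # isolated node keeps its neutral value
            new_value = sum(w * f[u] for u, w in weights[v].items()) / total
            largest_change = max(largest_change, abs(new_value - f[v]))
            f[v] = new_value
        if largest_change < tol:
            break
    return f


# Toy chain a - b - c - d with unit weights; a is positive, d is negative.
weights = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
scores = harmonic_labels(weights, {"a": 1.0, "d": 0.0})
```

On the toy chain the unlabeled nodes end up with values that decrease with distance from the positive node, which is the behaviour one would want for proteins close to known disease proteins in a PPI network.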

3.2 Semi-supervised Learning Framework for Predicting Disease genes

Figure 1 briefly describes our semi-supervised learning framework for disease gene prediction, which uses integrated human PPI and proteomic/genomic data.

Corresponding to Figure 1, the proposed framework consists of four main tasks, as follows:


Fig. 1 Semi-supervised learning framework for disease gene prediction.

1. Identify disease genes as positives, and non-disease genes as negatives, and map them to the corresponding proteins, called disease proteins and non-disease proteins, respectively.

2. Extend the initial set of positives by extracting their interacting proteins as positive candidates from a human PPI database.

3. Extract and represent human PPI and proteomic/genomic data as feature vectors.

4. Apply a semi-supervised learning algorithm to predict disease genes.

Algorithm 3 presents in detail the algorithm for disease gene prediction using semi-supervised learning. The inputs of the algorithm are positive examples (known disease genes), negative examples (non-disease genes), the set of human protein-protein interactions, and the set of proteomic/genomic feature data. The output of the algorithm is the set of new disease genes.

In Algorithm 3, there are 12 steps corresponding to the four main tasks. Steps 1 to 3 are for the first task. By the end of Step 3, all disease proteins $p_i$ and non-disease proteins $p_i^-$ are identified by their Uniprot names, and we have the initial set of disease proteins $P$. From the human PPI network $\Omega$, Step 4 performs the second task and generates the extended set of disease proteins $P^+$, which includes the interacting proteins $p_i^+$ (positive candidates) of the disease proteins in the initial set. In Step 5, the union set $P^*$ is formed, consisting of positives, positive candidates and negatives. For the third task, Steps 6 to 9 extract various features $f_k$ from the databases OPHID, Uniprot, GO, and Pfam, and estimate the scores $score_k$ of the features for each protein. The $k$-dimension feature vectors $v_i$ integrate all feature scores and serve as the input of a semi-supervised learning algorithm. Steps 10 to 12 correspond to the fourth task, where we apply a semi-supervised learning algorithm to predict new disease genes.

In Step 10, we used the SemiL software developed by Huang et al. (8), which implements the Harmonic Gaussian method. SemiL is efficient software for solving large-scale semi-supervised learning problems. It provides various options for distance weight, hard or soft labels, normalization, etc.


Algorithm 3 Semi-supervised learning framework to predict disease genes.
Input:
    The set $G$ of disease genes $g_i$ and the set $G^-$ of non-disease genes $g_i^-$.
    The protein-protein interaction network $\Omega$.
    PPI feature and proteomic/genomic features $f_k$ extracted from databases (OPHID, Uniprot, GO, and Pfam).
Output:
    The set of new predicted disease genes $G^+$.
1: $G^+ := \emptyset$;
   $P := \emptyset$; /* $P$ is the initial set of disease proteins. */
   $P^+ := \emptyset$; /* $P^+$ is the extended set of disease proteins. */
   $P^- := \emptyset$; /* $P^-$ is the set of non-disease proteins. */
2: Map all disease genes $g_i \in G$ to the corresponding proteins $p_i$; $P := P \cup \{p_i\}$.
3: Map all non-disease genes $g_i^- \in G^-$ to the corresponding proteins $p_i^-$; $P^- := P^- \cup \{p_i^-\}$.
4: for each $p_i \in P$
       Extract all interacting proteins $p_i^+$ of the disease protein $p_i$ from $\Omega$; $P^+ := P^+ \cup \{p_i^+\}$.
5: $P^* := P^+ \cup P^- \cup P$. /* $P^*$ is the union set of all considered proteins. */
6: for each $p_i \in P^*$
7:     Extract features $f_{ppi}$, $f_{length}$, $f_{kw}$, $f_{ec}$, $f_{go}$, $f_{pfam}$ from databases OPHID, Uniprot, GO, and Pfam.
8:     Estimate the score $score_k$ of each feature $f_k$.
9:     Determine $k$-dimension feature vectors $v_i = \{v_{ik}\}$, where the dimension $v_{ik}$ corresponds to the feature $f_k$.
10: Run a semi-supervised learning algorithm on the set $P^*$ and the set of feature vectors $v_i$ to predict disease proteins.
11: Map new predicted disease proteins to their corresponding disease genes $g_i^+$; $G^+ := G^+ \cup \{g_i^+\}$.
12: return $G^+$.
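Steps 4 and 5 of Algorithm 3 can be sketched as follows, assuming proteins are plain string identifiers and the PPI network is a hypothetical list of undirected pairs. Removing known disease and non-disease proteins from the candidate set is an assumption made here so that every protein carries a single role; the algorithm as stated does not spell this out.

```python
def extend_candidates(disease, non_disease, interactions):
    """Return (positive candidates P+, union set P*) from a PPI edge list."""
    disease = set(disease)
    non_disease = set(non_disease)
    candidates = set()
    for a, b in interactions:
        # Any interaction partner of a disease protein is a positive candidate.
        if a in disease:
            candidates.add(b)
        if b in disease:
            candidates.add(a)
    # Assumption: proteins that already have a label are not candidates.
    candidates -= disease
    candidates -= non_disease
    union = disease | candidates | non_disease
    return candidates, union


# Toy usage with made-up identifiers.
p_plus, p_star = extend_candidates(
    disease={"d1"},
    non_disease={"n1"},
    interactions=[("d1", "c1"), ("c1", "c2"), ("n1", "d1")],
)
```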

3.3 Scores of genomic/proteomic features

To have a comprehensive view of the relationship between disease genes and other proteomic/genomic features, in addition to PPI features from the OPHID database, we also utilized three other databases, Uniprot, GO, and Pfam, to look for relevant and useful features for disease gene prediction. When integrating proteomic/genomic features, one difficulty that we have to overcome is representing the various data types of the feature data. Extracted data may be numerical, such as sequence length, or categorical, such as keywords and coded enzymes. By using score functions, we can represent the extracted data in the form of feature vectors.

Table 2 shows the statistics of extracted features for each data source. The first two columns are the number of records extracted according to the respective features, and the last two columns are the number of feature categories.

Table 2 Statistics for the set of all proteins considered, and the set of disease proteins with the extracted features.

Feature $f_k$   #Record                             #Category
                Whole data set   Disease proteins   Whole data set   Disease proteins
$f_{GO}$        17241            6404               2911             1817
$f_{KW}$        31465            13597              564              504
$f_{EC}$        1123             451                133              106
$f_{Pfam}$      6817             2426               1796             1413

We describe here how we define the score functions. First, the protein-protein interaction score is considered, to determine how close a protein is to disease proteins in protein interaction networks. Human protein-protein interactions are extracted from the OPHID database (http://ophid.utoronto.ca/ophid/). A protein may have many interactions; among these, the more the protein interacts with disease proteins, the more likely it is to be a disease protein. Also, a protein that is a hub of many interactions is often very important. The score $score_{ppi}$ for the PPI feature $f_{ppi}$ is defined as follows:

$$score_{ppi}(p_i) = \frac{\sum_{p_j \in P} Int(p_i, p_j)}{\sum_{p_j \in P^*} Int(p_i, p_j)} \times \frac{\sum_{p_j \in P} Int(p_i, p_j)}{Avg_{ppi}} \qquad (4)$$

where

$$Int(p_i, p_j) = \begin{cases} 1 & \text{if there is an interaction between proteins } p_i \text{ and } p_j, \\ 0 & \text{otherwise.} \end{cases}$$

$Avg_{ppi}$: the average number of protein interactions belonging to disease proteins.
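Equation 4 can be computed directly from a neighbour map, as in the sketch below. It assumes the network is given as a hypothetical dictionary from each protein to the set of its interaction partners, and it interprets $Avg_{ppi}$ as the mean number of interactions per disease protein within $P^*$; that interpretation, the zero score for proteins without interactions, and all names are assumptions.

```python
def average_disease_degree(neighbors, disease, all_proteins):
    """Assumed reading of Avg_ppi: mean interaction count of disease proteins."""
    degrees = [len(neighbors.get(d, set()) & all_proteins) for d in disease]
    return sum(degrees) / len(degrees)


def score_ppi(protein, neighbors, disease, all_proteins, avg_ppi):
    """Equation 4: (share of disease partners) * (disease partners / Avg_ppi)."""
    partners = neighbors.get(protein, set())
    with_disease = sum(1 for q in partners if q in disease)
    with_all = sum(1 for q in partners if q in all_proteins)
    if with_all == 0:
        return 0.0  # assumption: a protein without interactions scores zero
    return (with_disease / with_all) * (with_disease / avg_ppi)


# Toy network with made-up identifiers: d1 and d2 are disease proteins.
neighbors = {
    "x": {"d1", "d2", "u"},
    "d1": {"x"},
    "d2": {"x", "u"},
    "u": {"x", "d2"},
}
disease = {"d1", "d2"}
all_proteins = {"x", "d1", "d2", "u"}
avg_ppi = average_disease_degree(neighbors, disease, all_proteins)
score_x = score_ppi("x", neighbors, disease, all_proteins, avg_ppi)
```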

The UniProt database (http://www.pir.uniprot.org/) is the world's most comprehensive catalog of information on proteins; it provides functional, structural or other categories and lengths of protein sequences, and describes enzymes coded.

The following equation is the score for the sequence length feature:

$$score_{length}(p_i) = \frac{length(p_i)}{Avg_{length}} \qquad (5)$$

where

$length(p_i)$: the sequence length of a protein $p_i$.

$Avg_{length}$: the average sequence length of disease proteins.

Disease proteins may share the same keywords, and be coded by the same enzymes. Some keywords and coded enzymes were found to be common in the set of disease proteins. Thus, $score_{kw}$ and $score_{ec}$ show how probable it is that a protein is a disease protein, in terms of keywords and enzymes. In the Uniprot database, keywords are classified into 10 categories, e.g. biological process, developmental stage, disease, molecular function, etc.

Among 5,557 proteins, there are 31,465 data records extracted for keyword features, and 1,123 enzymes. These proteins share the same 564 keywords and 133 enzymes.


We proposed similar scores for the keyword feature $f_{kw}$ and the coded enzyme feature $f_{ec}$. For each protein, we extracted the corresponding keywords $kw_i$ and coded enzymes $ec_i$. The keyword and enzyme data are categorical, for example, (P05067, alzheimer disease) and (P01011, disease mutation), where P05067 and P01011 are the Uniprot names and "alzheimer disease" and "disease mutation" are their keywords; or (O75688, ec3.1.3), where O75688 is the Uniprot name and ec3.1.3 is the enzyme coded.

Since each protein may have many different keywords, each keyword $kw_i$ is assigned a significance weight, as follows:

$$w^{kw}_i = freq_{P}(kw_i) \times freq_{P^*}(kw_i),$$

where

$freq_{P}(kw_i)$: the frequency count of $kw_i$ observed in the set of disease proteins $P$.

$freq_{P^*}(kw_i)$: the frequency count of $kw_i$ observed in the set of proteins $P^*$.

Equation 6 shows the score for the keyword feature:

$$score_{kw}(p_i) = \frac{1}{\sum_{\forall kw_j \in p_i} w^{kw}_j} \qquad (6)$$

Unlike the keyword feature, each protein $p_i$ is coded by only one enzyme $ec_i$. The score for the coded enzyme feature is defined in Equation 7.

$$score_{ec}(p_i) = freq_{P}(ec_i) \times freq_{P^*}(ec_i) \qquad (7)$$

where

$freq_{P}(ec_i)$: the frequency count of $ec_i$ observed in the set of disease proteins $P$.

$freq_{P^*}(ec_i)$: the frequency count of $ec_i$ observed in the set of proteins $P^*$.

It is useful to investigate the relationship between GO terms (http://www.geneontology.org/) and disease proteins. GO terms are divided into three groups: molecular function, biological process and cellular component. The GO terms $go_i$ related to the set of proteins considered were extracted, and each of them has its own weight, defined by the following equation:

$$w^{go}_i = \frac{\#_{P}(go_i) + 1}{\#_{P^*}(go_i) + 1},$$

where

$\#_{P}(go_i)$: the count of $go_i$ observed in the set of disease proteins $P$.

$\#_{P^*}(go_i)$: the count of $go_i$ observed in the set of proteins $P^*$.

Then, the score for the GO term feature is proposed as follows:

$$\mathrm{score}_{go}(p_i) = \sum_{\forall go_j \in p_i} w^{go}_j \qquad (8)$$
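A small sketch may help to see how the add-one ratio behaves. It assumes the GO score adds up the weights of the terms annotated to a protein; the term identifiers and counts are hypothetical placeholders, not real GO data.

```python
def ratio_weight(count_disease, count_all):
    """Weight = (count in disease set P + 1) / (count in full set P* + 1)."""
    return (count_disease + 1) / (count_all + 1)

def score_go(protein_terms, disease_counts, all_counts):
    """GO score of one protein = sum of the weights of its GO terms."""
    return sum(
        ratio_weight(disease_counts.get(t, 0), all_counts.get(t, 0))
        for t in protein_terms
    )

# Hypothetical counts: term "GO:A" is frequent among disease proteins,
# term "GO:B" never occurs among them.
disease_counts = {"GO:A": 3, "GO:B": 0}
all_counts = {"GO:A": 7, "GO:B": 4}

w_a = ratio_weight(disease_counts["GO:A"], all_counts["GO:A"])  # (3+1)/(7+1)
w_b = ratio_weight(disease_counts["GO:B"], all_counts["GO:B"])  # (0+1)/(4+1)
go_score = score_go({"GO:A", "GO:B"}, disease_counts, all_counts)
```

Adding 1 to both counts keeps the ratio defined, and non-zero, for terms that never occur in the disease set.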

Protein domains are the building blocks of proteins. Disease proteins may structurally or functionally depend on their domains. The Pfam database (http://www.sanger.ac.uk/Software/Pfam/) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam domains $d_j$ of all considered proteins are extracted and scored by Equation 9:

$$\mathrm{score}_{pfam}(p_i) = \frac{\#_{P}(pfam_i) + 1}{\#_{P^*}(pfam_i) + 1}, \qquad (9)$$

where

$\#_{P}(pfam_i)$: the number of domains $d_j$ of a protein $p_i$ observed in the set of domains belonging to disease proteins.

4 Evaluation

4.1 Experiment design

4.1.1 For protein-protein prediction

We concentrate on predicting PPI for Saccharomyces cerevisiae, a budding yeast, because of the availability of Saccharomyces cerevisiae data. We carried out a comparative experimental evaluation consisting of two experiments, corresponding to protein-protein interaction prediction and domain-domain interaction prediction.

To assess the performance of our method for PPI prediction, we ran two comparative tests to demonstrate (1) the advantages of integrating multiple proteomic and genomic features in our method and (2) the advantages of the domain-based approach. First, ROC curves of 10-fold cross-validation tests were produced to compare our proposed method with other domain-based methods, in particular the AM method and the SVMs method. Second, we conducted 10-fold cross-validation tests for an ILP method that uses multiple genomic databases but no domain features, and compared those results with our method in terms of sensitivity and specificity.

For the comparative tests for PPI prediction, we used the core data of the Ito data set (9) with more than two IST hits², as positive examples, and selected at random 1000 protein pairs whose elements are in separate subcellular compartments as negative examples. Each interaction in the interaction data originally shows a pair of bait and prey ORFs (Open Reading Frames). After removing all interactions in which either the bait ORF or the prey ORF is not found in the UniProt database, we obtained 718 interacting pairs from the original 841 pairs. Subsection 4.2.1 shows the experimental results of PPI prediction.

4.1.2 For disease gene prediction

We first prepared three data sets to carry out the experiments: (i) the set of disease genes, (ii) the set of non-disease genes, and (iii) the set of protein-protein interactions. Then, we carried out experiments with various parameters to computationally evaluate the accuracy of the proposed method. Finally, we looked up newly predicted disease genes in the scientific literature to biologically verify the findings of the proposed method.

The OMIM database is a catalog of human genes and genetic disorders. In OMIM, the list of hereditary disease genes is given in the OMIM morbid map. There are 4,512 records with 3,053 unique OMIM IDs in the catalog. As shown in Algorithm 3, all 3,053 human disease genes were mapped to their disease proteins, identified by UniProt names. The mapping yielded 3,590 corresponding disease proteins. Some of these proteins have published interactions.

² IST hit means how many times the corresponding interaction was observed. The higher

Compiling a list of genes that are known not to be involved in hereditary disease is difficult. A recent study (27) showed that the human genome may contain thousands of essential genes having features that differ significantly, both from disease genes and from other genes. In the absence of a set of well-defined human essential genes, they compiled the list of ubiquitously expressed human genes (UEHG) as an approximation of essential genes. Non-disease genes belong to neither the OMIM morbid map nor the UEHG set. The genes that satisfy this condition are negative examples in our experiments. Mapping to Uniprot names, there are 723 proteins corresponding to UEHG, and 180 proteins overlapping in the set of disease proteins.

We obtained the human protein-protein interactions from the OPHID database. Among the 51,934 human protein-protein interactions stored in OPHID, there are 13,368 interactions which have at least one interacting partner belonging to the set of disease proteins. We found that 1,502 disease proteins have interactions in OPHID. Through these 13,368 interactions, the initial set of disease proteins was extended to 5,775 proteins.
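The filtering and extension step can be sketched as follows. This is a minimal sketch under an assumption: the extended set is taken to be the union of the initial disease proteins and every partner they reach through a retained interaction. The text does not spell out the exact construction, and all identifiers below are hypothetical.

```python
def interactions_touching(disease_proteins, interactions):
    """Keep interactions with at least one partner in the disease set."""
    return [
        (a, b) for a, b in interactions
        if a in disease_proteins or b in disease_proteins
    ]

def extend_protein_set(disease_proteins, interactions):
    """Disease set plus every partner met through a retained interaction
    (assumed reading of 'extended')."""
    extended = set(disease_proteins)
    for a, b in interactions_touching(disease_proteins, interactions):
        extended.add(a)
        extended.add(b)
    return extended

# Hypothetical toy network: d1 and d2 are disease proteins.
disease = {"d1", "d2"}
ppi = [("d1", "x"), ("x", "y"), ("d2", "d1"), ("z", "d2")]

kept = interactions_touching(disease, ppi)
extended = extend_protein_set(disease, ppi)
```

Here the pair ("x", "y") is dropped because neither partner is a disease protein, so "y" does not enter the extended set.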

4.2 Experiment results

4.2.1 Result of predicting protein-protein interactions

With the same positive and negative data sets, we conducted 10-fold cross-validation tests for our method, the AM method and the SVMs method. The AM method calculates an interaction probability for protein pairs based on their protein domains (23). In our experiment, the probability threshold is set to 0.05. For the SVMs method, we used SVMlight (11) with a linear kernel and default parameter values. For Aleph, we selected minpos = 2 and noise = 0, i.e. the lower bound on the number of positive examples to be covered by an acceptable clause is 2, and no negative examples may be covered by an acceptable clause. We also used the default evaluation function, coverage, which is defined as P − N, where P and N are the numbers of positive and negative examples covered by the clause.
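The effect of the two Aleph settings and of the coverage function can be illustrated with a short sketch. It does not reproduce Aleph's clause search; it only shows the acceptance test and the P − N score applied to hypothetical candidate clauses with made-up coverage counts.

```python
def is_acceptable(pos_covered, neg_covered, minpos=2, noise=0):
    """A clause must cover at least minpos positives and at most noise negatives."""
    return pos_covered >= minpos and neg_covered <= noise

def coverage(pos_covered, neg_covered):
    """Coverage evaluation function: P - N."""
    return pos_covered - neg_covered

def best_clause(candidates, minpos=2, noise=0):
    """Return the acceptable (name, P, N) candidate with the highest coverage."""
    acceptable = [c for c in candidates if is_acceptable(c[1], c[2], minpos, noise)]
    if not acceptable:
        return None
    return max(acceptable, key=lambda c: coverage(c[1], c[2]))

# Hypothetical candidates: (clause name, positives covered, negatives covered).
candidates = [("c1", 5, 0), ("c2", 9, 0), ("c3", 12, 3), ("c4", 1, 0)]
chosen = best_clause(candidates)
```

With noise = 0, clause c3 is rejected even though it covers the most positives, and c4 is rejected because it covers fewer than minpos positives.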

The ROC curves of the ILP, AM and SVMs methods with 1000 negative examples are shown in Figure 2. A ROC (Receiver Operating Characteristic) curve shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity). Sensitivity refers to the ability of the test to detect individuals who actually have the disorder. Specificity, on the other hand, means that the test is specific to the disorder being assessed and does not give a positive result because of other conditions. In our setting, sensitivity is the fraction of interacting pairs that are predicted as interacting, and specificity is the fraction of non-interacting pairs that are predicted as non-interacting.
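As a concrete illustration of these two measures, the sketch below computes sensitivity and specificity at one threshold and then collects ROC points by sweeping the threshold. The scores and labels are hypothetical; the functions assume that both classes are present.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Predict positive when score >= threshold; labels are 1 or 0."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(scores, labels):
    """One (1 - specificity, sensitivity) point per distinct score threshold."""
    points = []
    for t in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, t)
        points.append((1.0 - spec, sens))
    return points

# Hypothetical prediction scores for three positives and one negative.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]

sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
curve = roc_points(scores, labels)
```

Lowering the threshold from 0.5 to 0.3 recovers the last positive, raising sensitivity to 1, but it also lets the negative through, which drops specificity to 0.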

The ROC curve of our method is close to the left-hand border and then the top border of the ROC space, whereas the ROC curves of the AM method and the SVMs method are close to the 45-degree diagonal of the ROC space. The ROC curves demonstrate that our method performs considerably better than the AM and SVMs methods.

Fig. 2 Comparative ROC curves of ILP, SVMs and AM method with 1000 negative examples.

Fig. 3 Comparison of sensitivity and specificity of the non-domain-based method and our proposed method.

Conducting 10-fold cross-validation with various numbers of negative examples, we found (Figure 3) that our method achieved higher sensitivity, and higher or equal specificity, than the non-domain-based approach (25).

4.2.2 Result of predicting disease genes

As mentioned above, we used the SemiL software to implement the Harmonic Gaussian method (33). The weight matrices W were calculated with two different distance functions, i.e. Euclidean distance and cosine distance, and the degree of the graph was 20. The kernel was the RBF function, and the other parameters were left at their defaults.
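To give an intuition for harmonic label propagation, here is a minimal pure-Python sketch. It is not the SemiL implementation: it uses a fully connected graph with RBF weights on Euclidean distance instead of a graph of degree 20, the value of sigma is an arbitrary assumption, and the feature vectors are hypothetical.

```python
import math

def rbf_weight(x, y, sigma):
    """Gaussian (RBF) similarity between two feature vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * sigma ** 2))

def harmonic_scores(points, known, sigma=0.5, sweeps=200):
    """Propagate labels over a fully connected RBF graph.

    known maps a point index to its label (1 = disease, 0 = non-disease).
    Labeled points stay clamped; each unlabeled point is repeatedly set to
    the weighted average of all other points, which converges to the
    harmonic solution on this graph.
    """
    n = len(points)
    w = [[0.0 if i == j else rbf_weight(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]
    f = [float(known.get(i, 0.5)) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i in known:
                continue  # labeled points keep their given value
            f[i] = sum(w[i][j] * f[j] for j in range(n)) / sum(w[i])
    return f

# Hypothetical 2-D feature vectors; only points 0 and 2 are labeled.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
scores = harmonic_scores(points, known={0: 1, 2: 0})
predicted = [1 if s >= 0.5 else 0 for s in scores]
```

Each unlabeled point ends up with a score close to the label of its nearest labeled neighbour, so thresholding at 0.5 assigns point 1 to the disease class and point 3 to the non-disease class.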

From the data set, we randomly selected l data points as labeled data and used the remaining (n − l) as unlabeled data. Accuracy was then estimated by comparing the predicted labels with the true labels. For each labeled set size l tested, we performed 20 trials; the final result is the average accuracy over the 20 trials. Accuracy is defined as the ratio true positive / (true positive + false positive).
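The evaluation protocol can be sketched as below. The ratio used here follows the definition given in the text (it is the quantity more commonly called precision). The `predict` callable, the seed, and the toy labels are hypothetical stand-ins for the actual semi-supervised learner and data.

```python
import random

def accuracy_ratio(predicted, truth):
    """true positive / (true positive + false positive), as defined in the text."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0

def average_accuracy(truth, predict, n_labeled, trials=20, seed=0):
    """Average the ratio over random labeled/unlabeled splits."""
    rng = random.Random(seed)
    indices = list(range(len(truth)))
    results = []
    for _ in range(trials):
        labeled = set(rng.sample(indices, n_labeled))
        unlabeled = [i for i in indices if i not in labeled]
        known = {i: truth[i] for i in labeled}
        predicted = predict(known, unlabeled)  # labels for the unlabeled points
        results.append(accuracy_ratio(predicted, [truth[i] for i in unlabeled]))
    return sum(results) / trials

# Hypothetical data and a hypothetical "oracle" predictor used only to
# exercise the protocol; a real run would plug in the learner instead.
truth = [1, 0] * 10

def oracle_predict(known, unlabeled):
    return [truth[i] for i in unlabeled]

single = accuracy_ratio([1, 1, 0, 1], [1, 0, 0, 1])
averaged = average_accuracy(truth, oracle_predict, n_labeled=5)
```

Fixing the seed makes the 20 random splits reproducible, which helps when comparing different labeled set sizes l.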

We chose sets of disease genes, non-disease genes, and protein-protein interactions similar to those used in the work of Xu and Li (30), and our method provided higher accuracy on these similar data. In (30), accuracy ranged from 74% to 76%; our accuracy ranged from 78% to 82%. In future work, we would like to reproduce the same experiments as in (30) for a comparative evaluation.

When the size of the labeled data is small (10% of the data set), semi-supervised learning obtained a non-trivial accuracy of 78%. When the labeled data make up at least half of the data set, accuracy is over 80%. This demonstrates that even with a very low percentage of labeled data, semi-supervised learning can still predict disease genes with high accuracy.

5 Discussion

5.1 About protein-protein interactions

The experimental results have shown that the ILP approach can predict PPI and DDI with high sensitivity and specificity. Furthermore, the rules induced by ILP encouraged us to explore many interesting reciprocal biological relationships among protein-protein interactions, protein domains, and other genomic/proteomic features related to protein-protein interactions. Analysing our results against information in the biological literature, we found that ILP-induced rules could be further applied to related studies in biology. Studying the PPI prediction rules related to domain-domain interaction information, we found many interesting rules. For example, the following rules show that if two proteins have domains belonging to domain databases such as PROSITE or InterPro and these domains interact with each other, the proteins may interact:

has_int(A,B) :- dr_prosite(B,C), dr_prosite(A,C), ddi(A,B,yes) with 43 positives covered

has_int(A,B) :- dr_interpro(B,C), dr_interpro(A,C), ddi(A,B,yes) with 90 positives covered.
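How such a rule is matched against background facts, and how its covered positives are counted, can be illustrated with a small sketch. It follows the literal body of the PROSITE rule (both proteins share an entry C, and ddi(A,B,yes) holds); the facts, protein names and entry names are hypothetical, and ILP systems such as Aleph do this matching through Prolog rather than Python.

```python
def covers_has_int(a, b, dr_prosite, ddi):
    """True when A and B share a PROSITE entry C and ddi(A, B, yes) holds."""
    entries_a = {c for (p, c) in dr_prosite if p == a}
    entries_b = {c for (p, c) in dr_prosite if p == b}
    return bool(entries_a & entries_b) and (a, b) in ddi

def positives_covered(positive_pairs, dr_prosite, ddi):
    """Number of positive examples for which the rule body succeeds."""
    return sum(1 for a, b in positive_pairs if covers_has_int(a, b, dr_prosite, ddi))

# Hypothetical background facts: dr_prosite(protein, entry) and ddi(A, B, yes).
dr_prosite = {("p1", "PS_X"), ("p2", "PS_X"), ("p3", "PS_Y")}
ddi = {("p1", "p2"), ("p1", "p3")}

positives = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
covered = positives_covered(positives, dr_prosite, ddi)
```

Only the pair (p1, p2) is covered: (p1, p3) has a DDI but no shared entry, and (p2, p3) has neither.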

The large number of positives covered by these rules confirms why domain-domain interactions are considered key factors in predicting PPI. Considering the group of proteins that may be required for the production of pyridoxine (vitamin B6), namely sno1_yeast, snz3_yeast, snz1_yeast, and snz2_yeast, we found that each pair in this group has an interaction that satisfies the rule

has_int(A,B) :- ig(A,B,C), C = 1, ddi(A,B,yes), function_cat(B, cell_rescue_defense_and_virulence).

This rule means that an interaction between protein A and protein B may occur if the proteins satisfy three conditions: first, they interact with the same protein; second, they have at least one DDI; third, one of them is categorized in the function catalogue 'cell rescue defense and virulence'. PPI plays an important role in drug design, so such rules and their supporting evidence are expected to help us discover interesting relationships between PPI, DDI and protein function in pharmaceuticals. The two most popular rules are

has_int(A,B) :- dr_go(B,C), part_of(C,D), domain_fusion(A,B,yes)

has_int(A,B) :- dr_go(B,C), dr_go(A,C), domain_fusion(A,B,yes)

The first one covers 199 positives and the second one covers 217 positives. Both of these rules combine GO term and domain fusion information. According to the second rule, if two proteins share a GO term and their domains are fused in another protein, an interaction may occur.

Our induced rules with large numbers of positives indicate that if a pair of proteins, A and B, are located in the same subcellular compartment, protein A potentially interacts with protein B. There are 216 covered positives for 'nucleus compartment', 284 for 'cytoplasm compartment', and 15 for 'mitochondria compartment'. Surprisingly, however, among the induced rules we found a rule with 37 positives that shows two proteins in different subcellular locations interacting:

has_int(A,B) :- subcell_cat(B,nucleus), subcell_cat(A,cytoplasm), function_cat(A,transcription).

This phenomenon could occur when there is a translocation or post-translational modification of proteins in different subcellular compartments.

Analysing the DDI prediction rules, we discovered some interesting associations between DDI and other domain and protein features.

Regarding the motif compound feature of domains, we found that the more motifs a domain has, the more interactions it has with other domains. This means that domains with many conserved motifs tend to interact with others. The interactions of these domains play an important role in forming stable domain-domain interactions in particular, and protein-protein interactions in general (16). For example, if one domain has eight motifs and the other belongs to proteins categorized in the protein synthesis function category, the two domains interact. This rule covers 23 positives:

interact_domain(A,B) :- prints(B,C), motif_compound(C, compound(8)), function_category(A, protein_synthesis).

We expect that the combination of inductive rules of ILP will be very useful for understanding PPI and DDI in particular, and protein structures, protein functions, and biological processes in general.

5.2 About disease genes

In addition to the computational evaluation, we looked for biological evidence to support our method, and we found some interesting evidence when verifying the novel potential disease proteins. As noted in (27), ubiquitously expressed human genes, also known as housekeeping genes, should be regarded as the most severe "disease" genes. Among the 50 newly predicted disease proteins, 6 correspond to UEHG genes: nherf_human, ddx3x_human, tyy1_human, 1433t_human, ctbp1_human, and spta2_human.

Hepatitis C virus (HCV) core influences the expression of host genes. Ddx3x_human (ATP-dependent RNA helicase DDX3X) acts as a cofactor for XPO1-mediated nuclear export of incompletely spliced HIV-1 Rev RNAs, and is also involved in HIV-1 replication. This protein interacts specifically with hepatitis C virus core protein, resulting in a change in intracellular location.

Protein tyy1_human acts as a repressor in the absence of adenovirus E1A protein, but as an activator in its presence. Adenoviruses, a group of viruses that infect the membranes (tissue linings) of the respiratory tract, the eyes, the intestines, and the urinary tract, account for about 10% of acute respiratory infections in children and are a frequent cause of diarrhea.

Protein trrap_human is a highly conserved 434 kDa protein that interacts specifically with the c-Myc N terminus and has homology to the ATM/PI3-kinase family. Trrap_human (related to the gene trrap) also interacts specifically with the E2F-1 transactivation domain. Expression of transdominant mutants of the protein trrap_human or of antisense RNA blocks c-Myc- and E1A-mediated oncogenic transformation. TRRAP was therefore suggested to be an essential cofactor for both the c-Myc and E1A/E2F oncogenic transcription factor pathways.

Though accuracy is the most common evaluation measure in the disease gene prediction problem, other measures, such as sensitivity and specificity or the area under the ROC curve, should also be used for evaluation in future work. Many semi-supervised learning algorithms have been proposed, and each of them is suited to particular problems. In this paper, we proposed a general semi-supervised learning framework for disease gene prediction. In addition to the Harmonic Gaussian algorithm already applied, we may also investigate other algorithms that may achieve better results.

6 Conclusion

We have presented an integrative domain-based approach that uses ILP and multiple genome databases to predict protein-protein interactions, and an approach to disease gene prediction based on semi-supervised learning (SSL). The experimental results demonstrated that our proposed methods could produce comprehensible rules and perform well in comparison with other methods. In future work, we would like to further investigate the biological significance of the novel PPIs obtained by our method, and apply the ILP approach to other important tasks.

References

1. Pierre Baldi and Søren Brunak. Bioinformatics - The Machine Learning Approach. The MIT Press, 2001.

2. Asa Ben-Hur and William Stafford Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(suppl1):i38–46, 2005.

3. K.M. Borgwardt and H. Kriegel. Graph kernels for disease outcome prediction from protein-protein interaction networks. In Pacific Symposium on Biocomputing, volume 12, pages 4–15, 2007.

4. O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2006.

5. J.Y. Chen, C. Shen, and A.Y. Sivachenko. Mining Alzheimer disease relevant proteins from integrated protein interactome data. In Pacific Symposium on Biocomputing, volume 11, pages 367–378, 2006.

6. X.W. Chen and M. Liu. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics, 21(24):4394–4400, 2005.

7. A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res, 33 Database Issue, January 2005.

8. T. M. Huang and V. Kecman. SemiL, software for solving semi-supervised learning problems, 2004.

9. T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. In Proc. Natl. Acad. Sci. USA 98, pages 4569–4574, 2001.

10. Ronald Jansen, Haiyuan Yu, Dov Greenbaum, Yuval Kluger, Nevan J. Krogan, Sambath Chung, Andrew Emili, Michael Snyder, Jack F. Greenblatt, and Mark Gerstein. A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data. Science, 302(5644):449–453, 2003.

11. T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.

12. R.D. King, S. Muggleton, R.A. Lewis, and M.J.E. Sternberg. Drug design by machine learning: The use of inductive logic programming to model the structure activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. of the National Academy of Sciences of the USA, 89(23):11322–11326, 1992.

13. M. Krauthammer, C. A. Kaufmann, T. C. Gilliam, and A. Rzhetsky. Molecular triangulation: Bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. PNAS, 101(42):15148–15153, 2004.

14. K. Lage, O. E. Karlberg, Z. M. Størling, P. Í. Ólason, A. G. Pedersen, O. Rigina, A. M. Hinsby, Z. Tümer, F. Pociot, N. Tommerup, Y. Moreau, and S. Brunak. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature Biotechnology, 25(3):309–316, March 2007.

15. Mark-A. Krogel and Tobias Scheffer. Multi-relational learning, text mining, and semi-supervised learning for functional genomics: Special issue: Data mining lessons learned. Machine Learning, 57(1-2):61+, 2004.

16. H. S. Moon, J. Bhak, K.H. Lee, and D. Lee. Architecture of basic building blocks in protein and domain structural interaction networks. Bioinformatics, 21(8):1479–1486, 2005.

17. Stephen Muggleton. Inductive Logic Programming. Academic Press, 1992.

18. NCBI. Genes and disease. National Library of Medicine (US), NCBI., 2007.

19. T. P. Nguyen and T. B. Ho. Combining domain fusions and domain-domain interactions to predict protein-protein interactions. In 7th International Workshop on Data Mining in Bioinformatics (BIOKDD '07), pages –, 2007.

20. D. Page and M. Craven. Biological applications of multi-relational data mining. In SIGKDD Explorations, volume 5, pages 69–79, 2003.

21. T. Pawson, M. Raina, and N. Nash. Interaction domains: from simple binding events to complex cellular behavior. FEBS Letters, 513(1):2–10, 2002.

22. D. R. Rhodes, S. A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette, S. Kalyana-Sundaram, D. Ghosh, A. Pandey, and A. M. Chinnaiyan. Probabilistic model of the human protein-protein interaction network. Nat Biotech, 23(8):1087–0156, 2005. http://www.nature.com/nbt/journal/v23/n8/suppinfo/nbt1103_S1.html.

23. E. Sprinzak and H. Margalit. Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology, 311(4):681–692, 2001.

24. A. Srinivasan, 1993. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.

25. T.N. Tran, K. Satou, and T.B. Ho. Using inductive logic programming for predicting protein-protein interactions from multiple genomic data. In PKDD, pages 321–330, 2005.

26. K. Truong and M. Ikura. Domain fusion analysis by applying relational algebra to protein sequence and domain databases. BMC Bioinformatics, 4(16):1–10, 2003.

27. Z. Tu, L. Wang, M. Xu, X. Zhou, T. Chen, and F. Sun. Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics, 7(31), 2006.

28. J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, and W. S. Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, August 2005.

29. J. Wojcik and V. Schachter. Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics, 17(suppl1):S296–305, 2001.

30. Jianzhen Xu and Yongjin Li. Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics, 22(22):2800–2805, 2006.

31. L.V. Zhang, S.L. Wong, O.D. King, and F.P. Roth. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 5(38), 2004.

32. D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16:321–328, 2004.

33. X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine