Protein Complex Prediction Using Random Walks Under Domain Structural Constraints


IPSJ SIG Technical Report, Vol.2012-MPS-90 No.15, 2012/9/19

Morihiro Hayashida¹, Peiying Ruan¹, Tatsuya Akutsu¹

Abstract: Many proteins aggregate to form complexes that play important roles in cellular systems. Many methods for predicting protein complexes from protein-protein interaction (PPI) networks have been developed based on the observation that proteins in a complex tend to interact with each other; that is, dense subgraphs of PPI networks are regarded as complexes. Macropol et al. proposed a random walk method that starts from a restart protein and repeatedly adds a new protein based on random walk distances. In this technical report, we introduce domain structural constraints, namely that one domain interacts with at most one other domain, and propose modified random walks using second-order Markov chains derived from these constraints.

1. Introduction

Protein complexes are involved in various biological functions such as gene expression regulation and enzymatic catalysis. Therefore, many prediction methods such as MCL [1], MCODE [2], RNSC [3], COACH [4], RRW [5], and NWE [6] have been developed. These methods often try to find dense subgraphs in protein-protein interaction networks because proteins contained in a complex are considered likely to interact with each other. Macropol et al. proposed repeated random walks (RRW), which start from a protein and repeatedly add a new protein according to random walk distances calculated from steady state probabilities [5]. Ozawa et al. proposed a verification method for protein complexes using the domain structural constraint that one domain interacts with at most one domain [7], and Zhao et al. improved this verification method under the same constraint [8]. Here, a domain is a part of a protein and is regarded as a structural and functional unit.
We introduce such constraints, derive second-order Markov chains, and propose modified random walks with restarts.

¹ Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
© 2012 Information Processing Society of Japan

2. Method

In this section, we briefly review the random walks with restarts, or repeated random walks, proposed by Macropol et al. [5] for predicting protein complexes from protein-protein interaction networks, and propose modified random walks with restarts under domain structural constraints. Let $G(V, E)$ be an edge-weighted graph with a set $V$ of vertices corresponding to proteins and a set $E$ of edges corresponding to interactions between proteins, where each edge $(v_i, v_j) \in E$ has weight $w_{ij} > 0$. Our purpose is to find a set of protein complexes (sets of vertices) from $G$.

2.1 Random Walks with Restarts

The random walk method [5] has restart vertices, and repeatedly adds a new vertex based on random walk distances until the ratio of the distance to that in the previous iteration becomes smaller than a cutoff value. The algorithm is defined as follows.

Algorithm RepeatedRandomWalk
Input:  network G(V, E) with weight w_ij for edge (v_i, v_j)
        restart probability α (0 < α < 1)
        maximum (minimum) size of protein complex s_max (s_min)
        early cutoff value λ (0 < λ < 1)
        overlap threshold θ (0 < θ < 1)
Output: a set of protein complexes D

D := ∅
for every v_i ∈ V
    C := {v_i}, D0 := ∅, x0 := 0
    while |C| < s_max
        x := RandomWalk(G, C, α)
        v_j := argmax_{v_j ∈ V−C} x_{v_j}
        if x_{v_j} ≥ λ · x0
            C := C ∪ {v_j}, x0 := x_{v_j}
            if |C| ≥ s_min
                D0 := D0 ∪ {C}
        else break
    D := D ∪ RemoveOverlap(D0, θ)
return RemoveOverlap(D, θ)

In this algorithm, RemoveOverlap sorts protein complexes in order of decreasing significance, calculated as $1 - \sqrt{|C|} \cdot \mathrm{score}(C)$, where $\mathrm{score}(C)$ denotes the average of random walk distances from a vertex in $C$ to another vertex in $C$.
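
The expansion loop of the listing above can be sketched in Python as follows. This is a minimal sketch rather than the RRW implementation of [5]: it assumes a dense symmetric weight matrix `W` with no isolated vertices, takes RandomWalk to be the restart fixed point $x = \alpha b_C + (1 - \alpha) P x$ solved by power iteration, and omits RemoveOverlap. The names `random_walk` and `expand_from`, the tolerance, and the toy network are all hypothetical.

```python
import numpy as np

def random_walk(W, C, alpha, tol=1e-12, max_iter=10_000):
    """Power iteration for x = alpha * b_C + (1 - alpha) * P x (assumed solver)."""
    n = W.shape[0]
    # Column-stochastic P: P[i, j] = w_ij / sum_k w_kj (assumes no isolated vertex).
    P = W / W.sum(axis=0, keepdims=True)
    b = np.zeros(n)
    b[list(C)] = 1.0 / len(C)  # uniform restart over the current complex C
    x = b.copy()
    for _ in range(max_iter):
        x_new = alpha * b + (1.0 - alpha) * (P @ x)
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

def expand_from(W, seed, alpha, s_min, s_max, lam):
    """Inner while-loop of RepeatedRandomWalk for a single start vertex."""
    C, candidates, x_prev = [seed], [], 0.0
    while len(C) < s_max:
        x = random_walk(W, C, alpha)
        outside = [v for v in range(W.shape[0]) if v not in C]
        if not outside:
            break
        j = max(outside, key=lambda v: x[v])  # closest vertex outside C
        if x[j] < lam * x_prev:               # early cutoff
            break
        C.append(j)
        x_prev = x[j]
        if len(C) >= s_min:
            candidates.append(frozenset(C))
    return candidates

# Toy network (hypothetical): two triangles joined by one weak edge.
W = np.zeros((6, 6))
for u, v, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[u, v] = W[v, u] = w
candidates = expand_from(W, seed=0, alpha=0.5, s_min=2, s_max=3, lam=0.6)
print(candidates[-1])
```

Each candidate is stored as a `frozenset` so that the collection holds complexes as elements, which is how the pseudocode's D0 is read here.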

RemoveOverlap then deletes the less significant of two complexes $C_1$ and $C_2$ if the ratio $|C_1 \cap C_2| / \min\{|C_1|, |C_2|\}$ is more than the threshold $\theta$. In addition, algorithm RandomWalk finds a steady state vector $x$ for the vertices $V$ such that

$$x = \alpha b_C + (1 - \alpha) P x, \tag{1}$$

where $\alpha$ denotes the restart probability, the $i$-th element of $b_C$ is $b_{C(i)} = 1/|C|$ if $v_i \in C$ and 0 otherwise, and $P$ denotes the transition probability matrix with $P_{ij} = w_{ij} / \sum_{(v_k, v_j) \in E} w_{kj}$. The NWE method [6] successfully improved prediction accuracy by using $b_{C(i)} = \sum_{(v_i, v_k) \in E} w_{ik} / \sum_{v_j \in C} \sum_{(v_j, v_k) \in E} w_{jk}$ if $v_i \in C$.

2.2 Second-order Markov Chains on PPI Networks

We assume that one domain interacts with at most one domain. Consider the case in which domain $D_a$ in protein $P_i$ can interact with domain $D_b$ in protein $P_j$ and with domain $D_c$ in protein $P_k$, and a random walker has moved from $P_j$ to $P_i$. Then, domains $D_a$ and $D_b$ are regarded as having interacted with each other. Therefore, domain $D_a$ cannot interact with any other domain, which means that the walker cannot move from $P_i$ to $P_k$ using $D_c$.

Figure 1 shows an example of a PPI network with five proteins $P_1, \ldots, P_5$, where solid lines between domains indicate that these domains can interact with each other, and simultaneously indicate that there exist interactions between the proteins. Suppose that a random walker has moved from $P_5$ to $P_1$. Then, domains $D_7$ and $D_1$ are regarded as interacting, so $D_1$ cannot interact with domains $D_4$ and $D_5$. This means that the walker cannot move from $P_1$ to $P_3$ or $P_4$. The probabilities that the walker moves from $P_1$ to $P_2$, $P_3$, $P_4$, and $P_5$ are therefore $w_{12}/(w_{12} + w_{15})$, 0, 0, and $w_{15}/(w_{12} + w_{15})$, respectively.

Fig. 1 Illustration of random walks under domain structural constraints (proteins P1 to P5, domains D1 to D7).

Let $X_t$ be a random variable taking the vertex where a random walker is at time $t$. In the existing methods RRW and NWE, the first-order Markov property $P(X_t \mid X_{t-1}, \ldots, X_1) = P(X_t \mid X_{t-1})$ holds. However, under the domain structural constraint, this property does not always hold. To satisfy the domain structural constraint completely, we cannot assume any Markov property. In the previous example we considered only the probability of the walker's next move given $X_t = P_1$ and $X_{t-1} = P_5$; strictly, we should also consider $X_{t-2}, X_{t-3}, \ldots$, that is, the path before the walker arrived at $P_5$. In this technical report, we consider only second-order Markov chains.

For Eq. (1), we can write the $i$-th element as follows.

$$P(X_t = v_i) = \alpha b_{C(i)} + (1 - \alpha) \sum_{(v_i, v_j) \in E} P(X_t = v_i \mid X_{t-1} = v_j) \, P(X_{t-1} = v_j). \tag{2}$$

For second-order Markov chains under the domain structural constraint, we have the following representation corresponding to random walks with restarts.

$$P(X_t = v_i, X_{t-1} = v_j) = \alpha P(X_t = v_i \mid X_{t-1} = v_j) \, b_{C(j)} + (1 - \alpha) \sum_{(v_j, v_k) \in E} P(X_t = v_i \mid X_{t-1} = v_j, X_{t-2} = v_k) \cdot P(X_{t-1} = v_j, X_{t-2} = v_k). \tag{3}$$

Let $Y$ be a steady state matrix in this second-order Markov chain. Then, $Y_{ij}$ represents $P(X_t = v_i, X_{t-1} = v_j)$, and we have

$$Y = \alpha P B_C + (1 - \alpha) \sum_j Q^{(j)} f_j(Y'), \tag{4}$$

where $B_C$ denotes the diagonal matrix whose diagonal entries $B_{C(ii)}$ are $b_{C(i)}$, $P$ is the same transition matrix as in Eq. (1), $Q^{(j)}$ denotes the transition probability matrix with $Q^{(j)}_{ik} = P(X_t = v_i \mid X_{t-1} = v_j, X_{t-2} = v_k)$, $Y'$ denotes the transpose of $Y$, and $f_j(Y)$ denotes the matrix in which the $j$-th column is the same as that of $Y$ and the other columns are zero. Entry $(i, j)$ of the right-hand side of Eq. (4) is $\alpha P_{ij} b_{C(j)} + (1 - \alpha) \sum_k Q^{(j)}_{ik} Y_{jk}$, which is Eq. (3). As in RRW and NWE, we can obtain the steady state $Y$ by repeatedly evaluating the right-hand side of Eq. (4) until $Y$ converges. Thus, in the RepeatedRandomWalk algorithm, our method replaces RandomWalk, which finds a steady state vector, with a procedure that finds the steady state matrix $Y$ of Eq. (4).
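
The iteration on Eq. (4) can be illustrated with a small NumPy sketch. It is not the authors' implementation and rests on several assumptions: each protein-protein edge is annotated with the single domain it uses on each endpoint, a step back to the previous protein is always allowed, and only the four edges incident to P1 that the text mentions are modelled. The edge weights, the domain labels on P2 to P5, and the names `build_chains` and `steady_state` are hypothetical.

```python
import numpy as np

def build_chains(n, edges):
    """edges: (u, v, weight, domain_on_u, domain_on_v), one tuple per interaction."""
    W = np.zeros((n, n))
    dom = {}  # dom[(j, i)] = domain of protein j used by the edge (j, i)
    for u, v, w, du, dv in edges:
        W[u, v] = W[v, u] = w
        dom[(u, v)], dom[(v, u)] = du, dv
    # First-order P[i, j] = w_ij / sum_k w_kj (assumes no isolated vertex).
    P = W / W.sum(axis=0, keepdims=True)
    # Second-order Q[j, i, k] = P(X_t = v_i | X_{t-1} = v_j, X_{t-2} = v_k).
    Q = np.zeros((n, n, n))
    for (j, k), d_in in dom.items():
        for i in range(n):
            # The domain of v_j bound to v_k is occupied: only edges using a
            # different domain of v_j, or the step back to v_k, stay available.
            if W[i, j] > 0 and (i == k or dom[(j, i)] != d_in):
                Q[j, i, k] = W[i, j]
        Q[j, :, k] /= Q[j, :, k].sum()
    return P, Q

def steady_state(P, Q, b, alpha, tol=1e-12, max_iter=10_000):
    """Iterate Y <- alpha * P B_C + (1 - alpha) * sum_j Q^(j) f_j(Y')."""
    restart = alpha * P * b  # scales column j by b[j], i.e. P @ diag(b)
    Y = restart.copy()
    for _ in range(max_iter):
        # Entry (i, j) of the sum is sum_k Q[j, i, k] * Y[j, k].
        Y_new = restart + (1.0 - alpha) * np.einsum("jik,jk->ij", Q, Y)
        if np.abs(Y_new - Y).sum() < tol:
            return Y_new
        Y = Y_new
    return Y

# Indices 0..4 stand for P1..P5; weights are illustrative only.
edges = [(0, 1, 1.0, "D2", "D3"), (0, 2, 2.0, "D1", "D4"),
         (0, 3, 3.0, "D1", "D5"), (0, 4, 4.0, "D1", "D7")]
P, Q = build_chains(5, edges)
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # restart set C = {P1}
Y = steady_state(P, Q, b, alpha=0.3)
print(Q[0, :, 4])  # transition probabilities from P1 after arriving from P5
```

With these inputs, the column `Q[0, :, 4]` puts weight only on P2 and P5, in the proportions $w_{12}/(w_{12} + w_{15})$ and $w_{15}/(w_{12} + w_{15})$ of the Figure 1 example. How a per-vertex score is read off $Y$ is not specified in the text; summing each row of `Y` to get a marginal over vertices is one possible choice.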
3. Conclusion

We introduced domain structural constraints that one domain interacts with at most one domain, approximately derived second-order Markov chains, and proposed modified random walks with restarts under the constrained condition.

References

[1] Enright, A., Dongen, S. and Ouzounis, C.: An efficient algorithm for large-scale detection of protein families, Nucleic Acids Research, Vol. 30, pp. 1575–1584 (2002).
[2] Bader, G. and Hogue, C.: An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinformatics, Vol. 4, p. 2 (2003).
[3] King, A., Pržulj, N. and Jurisica, I.: Protein complex prediction via cost-based clustering, Bioinformatics, Vol. 20, pp. 3013–3020 (2004).
[4] Wu, M., Li, X., Kwoh, C. and Ng, S.: A core-attachment based method to detect protein complexes in PPI networks, BMC Bioinformatics, Vol. 10, p. 169 (2009).
[5] Macropol, K., Can, T. and Singh, A.: RRW: repeated random walks on genome-scale protein networks for local cluster discovery, BMC Bioinformatics, Vol. 10, p. 283 (2009).
[6] Maruyama, O. and Chihara, A.: NWE: Node-weighted expansion for protein complex prediction using random walk distances, Proteome Science, Vol. 9, No. Suppl 1, p. S14 (2011).
[7] Ozawa, Y., Saito, R., Fujimori, S., Kashima, H., Ishizaka, M., Yanagawa, H., Miyamoto-Sato, E. and Tomita, M.: Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions, BMC Bioinformatics, Vol. 11, p. 350 (2010).
[8] Zhao, Y., Hayashida, M., Nacher, J., Nagamochi, H. and Akutsu, T.: Protein complex prediction via improved verification methods using constrained domain-domain matching, pp. 394–406 (2012).
