Graduate School of Fundamental Science and Engineering Waseda University

Doctoral Thesis Synopsis

Thesis Theme

Sequence-based Prediction of Protein Functional Sites

Applicant Name

Chun FANG

Department of Computer Science and Engineering, Research on Parallel and Distributed Architecture

October, 2013

With the rapid development of sequencing technology, many proteins of unknown function are deposited in various protein databases every year. Studying the functions of these proteins is important not only for understanding life activities but also for elucidating the mechanisms of many diseases and providing targets for drug design. Therefore, extracting functional information from protein sequences has become one of the most urgent tasks in the post-genomic era.

Although experimental methods identify protein functional sites with high accuracy, they are expensive and time-consuming, so protein function annotation lags far behind protein sequence determination. Computational methods, which can greatly reduce research costs and shorten the study period, are therefore indispensable for guiding experimental analysis. They also make in-depth analysis of large datasets possible.

So far, a number of computational methods have been explored for predicting protein functional sites. These methods can be categorized into three groups: a) methods that focus on molecular docking with known protein structures; b) methods that are based on protein sequences; c) methods that are based on hybrid features of protein structures and sequences. Because the structures of most proteins are not available, structure-based methods cannot be generally used. Therefore, sequence-based methods have attracted much attention. These methods use the amino acid sequence alone for prediction, and their accuracy and reliability do not depend on prior information about the proteins. With the rapid increase in the number of protein sequences, the sequence-based approach is considered ideal and can be widely used for identifying protein functional sites.

The input features of previous sequence-based methods can also be categorized into three groups: (i) the direct output of the position-specific scoring matrix (PSSM), which contains the evolutionary conservation information of residues; (ii) a combination of the PSSM with other sequence features, including residue distances and physicochemical properties; (iii) a combination of the PSSM with predicted structural features, such as predicted secondary structure, predicted solvent accessibility, predicted disorder probabilities, predicted dihedral angles, and predicted B-factors. Although these methods successfully predict certain functional sites of proteins, they have some potential shortcomings:

Problem 1: Using multiple features easily results in a high-dimensional feature space and leads to overfitting to noisy data in machine learning. Some features contribute nothing to the classifier's predictive power and may even impede the prediction.

Problem 2: Incorporating predicted results from other predictors greatly increases the complexity of the algorithm and reduces computational speed. Many studies did not even provide Web services because of the complexity of their design.

Problem 3: A method that adopts predicted features inherits the limitations of the predictors that produce them. Some of those predictors can handle only a single sequence at a time and cannot perform batch processing; others restrict the length of the input sequence (for example, to fewer than 700 residues). These restrictions inconvenience users. Moreover, the predictive accuracy of those predictors is itself limited, which inevitably affects the performance of any predictor that employs them.

Problem 4: The PSSM is a sparse matrix and contains many redundant features. Every value in it is calculated independently, without incorporating the dependency of a central residue on its surrounding neighbors.

To solve the above problems, this thesis pursues the following goals:

Goal 1: To solve Problem 1, the first goal is to reduce the high dimensionality of the feature space and to adopt a more effective feature-encoding method that can combine more predictive features into fewer dimensions. This can also decrease the computing time.

Goal 2: To solve Problems 2 and 3, the second goal is to simplify the design of predictors and to avoid using predicted results from other classifiers as input.

Goal 3: To solve Problem 4, the third goal is to remove noisy features and strengthen predictive features to enhance accuracy, for example by modifying the PSSM to reflect the detailed conservation patterns of functional sites rather than using the PSSM directly for prediction.

We have proposed several novel methods that are simpler and more efficient than existing methods for identifying protein functional sites. The contributions of this thesis are as follows:

1) Proposing a novel condensed PSSM-based method for ligand-binding site prediction (Chapter 3)

The PSSM has been widely used in predicting protein functional sites. However, it is 20-dimensional and contains many redundant features. The Kidera factors were reported to carry information on almost all physical properties of the 20 amino acids, but they need appropriate weighting coefficients to express the properties of residues.

We have developed a novel method, named KSPSSMpred, which integrates the PSSM and the Kidera factors into a 10-dimensional condensed matrix (KSPSSM) that is used instead of the traditional PSSM for ligand-binding prediction.

Flavin adenine dinucleotide (FAD) was chosen as a representative ligand for the study. Compared with five other feature-based methods on a well-prepared benchmark dataset, KSPSSMpred significantly outperformed them, achieving an AUC of 0.903 (0.054 to 0.195 higher than the others). This result demonstrates that KSPSSM can enrich the PSSM with information on 188 physical properties of residues and reduce the feature dimensions by 50% without losing the effective information contained in the PSSM.
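As a rough illustration of how a 20-column PSSM could be condensed to 10 columns, the sketch below multiplies an L x 20 PSSM by a 20 x 10 property table, so that the PSSM scores act as weighting coefficients for the factors. The synopsis does not state how KSPSSMpred combines the two matrices, so the matrix product is an assumption, and the function name `condense_pssm` and the toy factor table are hypothetical (the table does not contain the published Kidera values).

```python
from typing import List

Matrix = List[List[float]]


def condense_pssm(pssm: Matrix, factors: Matrix) -> Matrix:
    """Project an L x 20 PSSM onto a 20 x k property table, giving L x k.

    Assumption: the condensation is a plain matrix product; the synopsis
    does not specify the actual KSPSSM formula.
    """
    if not factors or any(len(row) != len(factors) for row in pssm):
        raise ValueError("each PSSM row needs one score per row of the factor table")
    k = len(factors[0])
    return [
        [sum(row[a] * factors[a][j] for a in range(len(factors))) for j in range(k)]
        for row in pssm
    ]


N_AMINO_ACIDS, N_FACTORS = 20, 10

# Toy stand-in for the 20 x 10 Kidera table (NOT the published values):
# amino acid a contributes only to factor a % 10.
toy_factors = [
    [1.0 if a % N_FACTORS == j else 0.0 for j in range(N_FACTORS)]
    for a in range(N_AMINO_ACIDS)
]

# Toy PSSM for a 3-residue sequence; every row holds the scores 0..19.
toy_pssm = [[float(a) for a in range(N_AMINO_ACIDS)] for _ in range(3)]

condensed = condense_pssm(toy_pssm, toy_factors)
print(len(condensed), len(condensed[0]))  # 3 10
```

With a real PSSM and the published factor table substituted for the toy inputs, each residue would be described by 10 values instead of 20, which matches the 50% reduction in feature dimensions mentioned above.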

2) Proposing a novel method based on contextual local evolutionary conservation for identifying molecular recognition features (MoRFs) in disordered proteins (Chapter 4)

Because MoRF regions include both highly conserved and highly variable residues and usually evolve more rapidly than ordered proteins, the standard PSSM is ineffective when used directly for identifying MoRFs. We have developed a novel predictor for identifying MoRFs, called “MFSPSSMpred” (Masked, Filtered and Smoothed PSSM Predictor), whose prediction quality exceeds that of existing sequence-based methods, which adopt many predicted features from other classifiers as input. First, a masking method calculates the average local conservation score of residues within a masking window in the PSSM. Then, scores below the average are filtered out. Finally, a smoothing method incorporates the features of the flanking regions of each residue to prepare the feature set for prediction. This procedure filters out noise (low conservation scores) and enhances strongly conserved features, thereby distinguishing MoRF residues from general non-MoRF residues.

When compared with other existing methods on the same datasets, MFSPSSMpred achieves the best performance while using the fewest input features. In addition, when tested on an independent dataset related to membrane proteins, MFSPSSMpred significantly outperformed the state-of-the-art predictor “MoRFpred”.
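The masking, filtering, and smoothing steps of MFSPSSMpred can be sketched as three passes over a PSSM. This is a minimal sketch under stated assumptions, not the thesis implementation: the window sizes, the use of the global mean of the masked scores as the filter threshold, zeroing as the meaning of "filtered out", and summation as the smoothing rule are all assumptions, and every function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def window_rows(matrix: Matrix, i: int, half_width: int) -> Matrix:
    """Rows within half_width of row i, truncated at the sequence ends."""
    return matrix[max(0, i - half_width): i + half_width + 1]


def mask(pssm: Matrix, half_width: int) -> Matrix:
    """Masking: replace each score by its column mean over the local window."""
    out = []
    for i in range(len(pssm)):
        rows = window_rows(pssm, i, half_width)
        out.append([sum(col) / len(rows) for col in zip(*rows)])
    return out


def filter_below_mean(matrix: Matrix) -> Matrix:
    """Filtering: zero every score below the mean of all scores (assumed threshold)."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return [[v if v >= mean else 0.0 for v in row] for row in matrix]


def smooth(matrix: Matrix, half_width: int) -> Matrix:
    """Smoothing: add the scores of flanking residues to each residue."""
    out = []
    for i in range(len(matrix)):
        rows = window_rows(matrix, i, half_width)
        out.append([sum(col) for col in zip(*rows)])
    return out


def mfs_pssm(pssm: Matrix, mask_half_width: int = 1, smooth_half_width: int = 1) -> Matrix:
    """Apply masking, filtering, and smoothing in that order."""
    masked = mask(pssm, mask_half_width)
    return smooth(filter_below_mean(masked), smooth_half_width)


# Toy two-column "PSSM": a conserved stretch in column 0, weak background in column 1.
toy_pssm = [[0.0, 1.0], [0.0, 1.0], [9.0, 1.0], [9.0, 1.0], [9.0, 1.0]]
features = mfs_pssm(toy_pssm)
for row in features:
    print(row)
```

In this toy input the weak background column is removed entirely by the filter, while the contiguous conserved stretch is reinforced by the smoothing pass, which is the qualitative behavior the paragraph above describes.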

3) Testing and demonstrating that the method based on contextual local evolutionary conservation can also be used for ATP-binding site prediction in ordered proteins (Chapter 5)

To test whether the “MFSPSSMpred” method proposed for identifying MoRFs can also identify functional sites in ordered proteins, we used it to predict ATP-binding residues. A corresponding predictor called “ClCLpred” has been developed. We compared it with existing methods on two separate datasets that have been used in previous studies. Experimental results show that the performance of “ClCLpred” exceeds that of all existing sequence-based methods, which incorporate various predicted structural features as input. This study indicates that: i) the most effective features for predicting functional sites are embedded in the sequence itself; ii) the most important factor contributing to accurate predictions is residue conservation; and iii) local evolutionary conservation enables accurate prediction of ligand-binding sites directly from the protein sequence.

4) Analyzing the conservation patterns of some functional residues and their influence on identifying protein functional sites (Chapter 6)

To utilize the PSSM more effectively, the conservation patterns of three kinds of functional sites were analyzed as examples: the NAD-binding site, the catalytic residues in enzymes, and the MoRFs in disordered proteins. We found that different functional sites show different conservation patterns: some are conserved in a linear context, some are mingled with highly variable residues, and others seem to be conserved independently. To extract these patterns effectively, three PSSM-based methods were compared: the standard PSSM, the smoothed-PSSM, and the masked-smoothed-PSSM. The three methods were then applied to identifying the three kinds of functional sites. Experimental results show that, although all three methods are based on the same feature (the PSSM of the protein sequence), each is suited to a different pattern of functional sites: the standard PSSM method is suited to functional sites that are highly but independently conserved; the smoothed-PSSM method is suited to functional sites that are usually clustered together and highly conserved in a linear context; and the masked-smoothed-PSSM method is suited to functional sites of disordered proteins that are highly conserved in a linear context but also mingled with highly variable residues. The results suggest that, when using the PSSM to predict protein functional sites, modifying the PSSM to reflect the detailed conservation patterns of the functional sites would greatly facilitate the prediction.

The outline of the thesis is as follows.

Chapter 1 introduces the background and importance of computational methods for identifying protein functional sites, and describes the research motivation and organization of this thesis.

Chapter 2 describes some concepts and definitions related to the computational methods in identifying protein functional residues, which are used throughout this thesis.

Chapter 3 introduces a novel condensed PSSM-based method for ligand-binding site prediction.

Chapter 4 describes a novel method which is based on contextual local evolutionary conservation for identifying MoRFs in disordered proteins.

Chapter 5 tests and demonstrates that the prediction method introduced in Chapter 4 is also applicable to identifying ATP-binding sites in ordered proteins.

Chapter 6 compares three kinds of PSSM-based methods in identifying three representative functional sites which have significantly different conservation patterns, and shows the necessity and importance of analyzing conservation patterns in identifying protein functional sites.

Chapter 7 concludes this thesis and outlines potential future work.

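List of Research Achievements (application for the degree of Doctor of Engineering, Waseda University)

Name: Chun FANG

Journal papers and international conference papers (title, publication venue, publication date, and authors including the applicant)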

[1] Chun Fang, Tamotsu Noguchi, Daisuke Tominaga and Hayato Yamana, “MFSPSSMpred: Identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation,” BMC Bioinformatics, 14:300, October 2013.

[2] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “Condensing position-specific scoring matrixs by the Kidera factors for ligand-binding prediction,” International Journal of Data Mining and Bioinformatics, Accepted.

[3] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “SCPSSMpred: A General Sequence-based Method for Ligand-binding Site Prediction,” IPSJ Transactions on Bioinformatics, Vol. 6, 35–42, July 2013.

[4] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “Analyzing Conservation Patterns and Its Influence on Identifying Protein Functional Sites,” In Proc. of the 6th International Conference on Bioinformatics and Computational Biology (BICOB2014), Las Vegas, USA, March 2014.

[5] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “Sequence-based Predication of Molecular Recognition Features in Disordered Proteins,” Journal of Medical and Bioengineering, Vol. 2, No. 2, 110–114, June 2013. In Proc. of the 2nd International Conference on Bioinformatics and Computational Biology (ICBCB2013), Beijing, China, April 2013.

[6] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “Prediction of FAD Binding Residues with Combined Features from Primary Sequence,” In Proc. of the Computer Science and Information Technology (ICBCB2012), Kuala Lumpur, Malaysia, April 2012.

Presentations

[1] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “Identifying functional site of disordered proteins,” The 5th Forum on Data Engineering and Information Management (DEIM2013), Fukushima, Japan, March 2013.

[2] Chun Fang, Tamotsu Noguchi and Hayato Yamana, “Identifying Molecular Recognition Features in Disordered Proteins,” Bioinformatics week in Odaiba 2013 (BiWO2013), Tokyo, Japan, September 2013.
