Chapter 3 Condensing Position-specific Scoring Matrix by the Kidera Factors for Ligand-binding Prediction
3.2. Proposed methods
3.2.1. Data set
We collected the data following the procedure of Mishra et al. [8]. Because the ‘Supersite’ database [36], from which Mishra et al. obtained the PDB IDs of their FAD-interacting proteins, has since been updated, and because some of the data released on the website of [8] could not be used owing to errors, we rebuilt the data set. First, 868 Protein Data Bank (PDB) IDs of FAD-related proteins were extracted from ‘Supersite’ [36]. Second, 1,815 FAD-binding protein chains were selected using Ligand Protein Contact (LPC) [37]. Finally, chains shorter than 50 residues were discarded, and redundant chains with sequence similarity above 40%, as clustered by CD-HIT [38], were removed. The remaining 191 protein chains, which contain 5,662 FAD-interacting residues (FIRs) and 73,680 non-FAD-interacting residues (non-FIRs), were used in this study.
3.2.2. Sequence features description of FAD-binding proteins
• Continuous binding residues analysis
To confirm whether FAD-binding sites cluster closely together in sequence, we counted the runs of consecutive binding residues in the 191 FAD-binding protein chains. As shown in Figure 3.1, only 27% of binding sites occur in isolation; the others occur in runs of consecutive residues. The lengths of these linear binding regions range from 1 to 9 residues.
Figure 3.1 Statistics of continuous FAD-binding residues
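The run-length statistics described above can be sketched as follows. This is a minimal illustration, assuming each chain is available as a per-residue list of 0/1 labels (1 = FIR); the function names and the toy labels are hypothetical and are not taken from the data set.

```python
from collections import Counter
from itertools import groupby


def binding_run_lengths(labels):
    """Return the lengths of maximal runs of consecutive binding residues.

    `labels` is a per-residue sequence of 0/1 flags (1 = FIR).
    """
    return [sum(1 for _ in run) for is_fir, run in groupby(labels) if is_fir]


def run_length_distribution(chains):
    """Count how many binding regions of each length occur across all chains."""
    counts = Counter()
    for labels in chains:
        counts.update(binding_run_lengths(labels))
    return counts


# Toy labels for two hypothetical chains (illustrative only).
toy_chains = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0],
]
distribution = run_length_distribution(toy_chains)

# Fraction of binding residues that occur alone (runs of length 1).
total_sites = sum(length * n for length, n in distribution.items())
isolated_fraction = distribution[1] / total_sites
```

Applied to the labels of the 191 chains, `distribution` would give the histogram of Figure 3.1 and `isolated_fraction` the share of binding sites that appear alone.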
• Composition analysis of three regions
We analyzed the amino acid composition of the 191 FAD-binding sequences. Residues were divided into three classes: FIRs, residues located within the 15-residue flanking regions of FIRs, and general non-FIRs.
Figure 3.2 shows the composition difference between the binding regions (including the flanking regions) and the general non-binding regions, computed as the percentage in the binding regions minus the percentage in general non-FIRs. The figure demonstrates that the composition difference of some residues changes markedly as the position approaches the binding site.
Figure 3.2 Composition difference between binding regions and general non-FIRs. Position 0 means the binding site, L means the left flanking side, and R means the right flanking side.
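One possible way to compute the per-position composition difference is sketched below. It rests on assumptions not stated in the text: a flanking position is counted whether or not it is itself an FIR, general non-FIRs are the residues outside every flanking window, and each chain is given as a hypothetical (sequence, labels) pair.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def composition(residues):
    """Return the percentage of each standard amino acid among `residues`."""
    counts = Counter(residues)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: (100.0 * counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def composition_difference(chains, flank=15):
    """Composition at each offset from an FIR minus that of general non-FIRs.

    `chains` is a list of (sequence, labels) pairs, labels being 0/1 flags.
    Returns {offset: {amino_acid: percentage difference}}, offset 0 = FIR.
    """
    by_offset = {k: [] for k in range(-flank, flank + 1)}
    general = []
    for seq, labels in chains:
        near = set()  # positions inside any binding region (FIR plus flanks)
        for i, flag in enumerate(labels):
            if not flag:
                continue
            for k in range(-flank, flank + 1):
                pos = i + k
                if 0 <= pos < len(seq):
                    by_offset[k].append(seq[pos])
                    near.add(pos)
        general.extend(aa for pos, aa in enumerate(seq) if pos not in near)
    background = composition(general)
    result = {}
    for k, residues in by_offset.items():
        comp = composition(residues)
        result[k] = {aa: comp[aa] - background[aa] for aa in AMINO_ACIDS}
    return result


# Toy chain with a single FIR (the glycine) and a 1-residue flank.
toy = [("AGSAAAA", [0, 1, 0, 0, 0, 0, 0])]
diff = composition_difference(toy, flank=1)
```

Negative offsets correspond to the left (L) side and positive offsets to the right (R) side of Figure 3.2.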
• Analysis of 10 physicochemical properties of three regions
We next analyzed the differences between the binding regions and general non-FIRs with respect to ten physicochemical properties: hydrophobic, polar, small, proline, tiny, aliphatic, aromatic, positive, negative and charged. Figure 3.3 shows the difference in physicochemical propensity between the binding regions and general non-FIRs, computed as the property percentage in the binding regions minus the property percentage in general non-FIRs. Within the 8-residue flanking regions, the differences for many physicochemical properties fluctuate markedly, whereas in the outlying regions all differences approach zero. This indicates that the properties of the flanking regions differ substantially from those of general non-FIR regions, that is, the properties of neighboring residues significantly influence ligand-binding behavior.
Since FIRs are highly context-dependent in the protein sequence, we assume that an FIR and its flanking regions can be treated together as a short linear binding region. We therefore adopt a smoothing method that incorporates the information of neighboring residues into each central residue.
Figure 3.3 Differences in physicochemical properties between flanking regions and general non-FIRs. Position 0 means the binding site, L means the left flanking side and R means the right flanking side.
3.2.3. Kidera factors
One of the ultimate goals of studying protein sequences is to understand how information about function is encoded in the one-dimensional sequence of residues [18]. To elucidate such information by parameterization, Kidera et al. performed several multivariate statistical analyses on 188 physical properties of the 20 amino acids. They concluded that the properties of residues could be expressed as a sum of ten orthogonal factors with appropriate weighting coefficients [18]. These ten orthogonal factors are called the ‘Kidera factors’.
The ‘Kidera factors’ [18] were reported to carry information on almost all properties of the 20 standard amino acids. They are listed in Table 3.1. For a given physical property, the value $p_j$ for the $j$th amino acid in a sequence can be expressed by formula (3.1):

$$p_j = a_1 f_{1,j} + a_2 f_{2,j} + \cdots + a_{10} f_{10,j} + \mu_j \tag{3.1}$$

where $(a_1, a_2, \ldots, a_{10})$ are the 10 weighting coefficients, $(f_{1,j}, f_{2,j}, \ldots, f_{10,j})$ are the 10 Kidera factors of the $j$th amino acid, and $\mu_j$ is a constant.
Table 3.1 The Kidera factors*

| Amino acid | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | -1.56 | -1.67 | -0.97 | -0.27 | -0.93 | -0.78 | -0.2 | -0.08 | 0.21 | -0.48 |
| R | 0.22 | 1.27 | 1.37 | 1.87 | -1.7 | 0.46 | 0.92 | -0.39 | 0.23 | 0.93 |
| N | 1.14 | -0.07 | -0.12 | 0.81 | 0.18 | 0.37 | -0.09 | 1.23 | 1.1 | -1.73 |
| D | 0.58 | -0.22 | -1.58 | 0.81 | -0.92 | 0.15 | -1.52 | 0.47 | 0.76 | 0.7 |
| C | 0.12 | -0.89 | 0.45 | -1.05 | -0.71 | 2.41 | 1.52 | -0.69 | 1.13 | 1.1 |
| Q | -0.47 | 0.24 | 0.07 | 1.1 | 1.1 | 0.59 | 0.84 | -0.71 | -0.03 | -2.33 |
| E | -1.45 | 0.19 | -1.61 | 1.17 | -1.31 | 0.4 | 0.04 | 0.38 | -0.35 | -0.12 |
| G | 1.46 | -1.96 | -0.23 | -0.16 | 0.1 | -0.11 | 1.32 | 2.36 | -1.66 | 0.46 |
| H | -0.41 | 0.52 | -0.28 | 0.28 | 1.61 | 1.01 | -1.85 | 0.47 | 1.13 | 1.63 |
| I | -0.73 | -0.16 | 1.79 | -0.77 | -0.54 | 0.03 | -0.83 | 0.51 | 0.66 | -1.78 |
| L | -1.04 | 0 | -0.24 | -1.1 | -0.55 | -2.05 | 0.96 | -0.76 | 0.45 | 0.93 |
| K | -0.34 | 0.82 | -0.23 | 1.7 | 1.54 | -1.62 | 1.15 | -0.08 | -0.48 | 0.6 |
| M | -1.4 | 0.18 | -0.42 | -0.73 | 2 | 1.52 | 0.26 | 0.11 | -1.27 | 0.27 |
| F | -0.21 | 0.98 | -0.36 | -1.43 | 0.22 | -0.81 | 0.67 | 1.1 | 1.71 | -0.44 |
| P | 2.06 | -0.33 | -1.15 | -0.75 | 0.88 | -0.45 | 0.3 | -2.3 | 0.74 | -0.28 |
| S | 0.81 | -1.08 | 0.16 | 0.42 | -0.21 | -0.43 | -1.89 | -1.15 | -0.97 | -0.23 |
| T | 0.26 | -0.7 | 1.21 | 0.63 | -0.1 | 0.21 | 0.24 | -1.15 | -0.56 | 0.19 |
| W | 0.3 | 2.1 | -0.72 | -1.57 | -1.16 | 0.57 | -0.48 | -0.4 | -2.3 | -0.6 |
| Y | 1.38 | 1.48 | 0.8 | -0.56 | 0 | -0.68 | -0.31 | 1.03 | -0.05 | 0.53 |
| V | -0.74 | -0.71 | 2.04 | -0.4 | 0.5 | -0.81 | -1.07 | 0.06 | -0.46 | 0.65 |

* These values are standardized. F1: α-helix or bend-structure preference-related; F2: bulk-related; F3: β-structure preference-related; F4: hydrophobicity-related; F5 to F10: mixtures of several physical properties. Please see Kidera’s paper [18] (Kidera et al., 1985) for details.
3.2.4. Prediction model
In our previous studies [39], we found that combining PSSM features with physicochemical properties outperformed combining PSSM features with structural information (e.g., solvent accessibility) in ligand-binding prediction. Furthermore, we found that using the physicochemical properties of amino acids to condense a standard PSSM can reduce redundant features and improve prediction performance [40].
The scores in a PSSM can be used as weights for the Kidera factors to express the properties of the residues in a sequence, for three reasons: 1) each score in a PSSM reflects the substitution frequency of a given amino acid at a given position of the sequence; 2) in protein evolution, amino acid substitutions depend predominantly on physical properties, i.e., substitutions between amino acids with similar physical properties are more frequent than between dissimilar ones [41]; 3) integrating the Kidera factors with the PSSM combines the information on almost all residue properties carried by the Kidera factors with the information contained in the PSSM into a new matrix, which is only 10-dimensional.
Based on the above analyses, we combined the PSSM with the ‘Kidera factors’ to develop a modified PSSM on which our prediction model is built. The modified PSSM incorporates the evolutionary information contained in PSSMs, the 188 physicochemical properties of residues summarized by the ‘Kidera factors’, and the contextual information of residues in the protein sequence. The prediction model is shown in Figure 3.4; each part is described in detail in the following sections.
Figure 3.4 Prediction model of KSPSSMpred
3.2.5. Procedure of preparing feature sets
• Evolutionary information
Evolutionary information is obtained from PSSMs, which are generated by running PSI-BLAST [42] against the NCBI non-redundant (nr) database [43] for three iterations with an e-value of 0.001. The evolutionary information for each residue is encapsulated in a 20-dimensional vector, so the PSSM of a protein with N residues has size 20 × N. Here, 20 corresponds to the 20 naturally occurring amino acids that index each row vector of the PSSM, and N is the length of the protein.
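To make the shape of this evolutionary information concrete, the sketch below reads the 20 scores per residue from PSSM text. It assumes the ASCII layout commonly written by PSI-BLAST, in which each data line starts with a position number and a one-letter residue code followed by 20 integer scores in the column order A R N D C Q E G H I L K M F P S T W Y V. The function name and the two toy rows are hypothetical.

```python
def parse_ascii_pssm(text):
    """Extract the 20 scores of each residue from ASCII PSSM text.

    Assumes each data line starts with a position number and a one-letter
    residue code, followed by at least 20 integer scores. Header and
    footer lines do not match this pattern and are skipped.
    """
    residues, scores = [], []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) >= 22 and tokens[0].isdigit() and len(tokens[1]) == 1:
            residues.append(tokens[1])
            scores.append([int(t) for t in tokens[2:22]])
    return "".join(residues), scores


# Two made-up rows in the assumed layout (not real PSI-BLAST output).
TOY_PSSM = """\
Last position-specific scoring matrix computed
            A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    1 M    -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
    2 K    -1  2  0 -1 -3  1  1 -2 -1 -3 -3  5 -2 -3 -1  0 -1 -3 -2 -3
"""

sequence, pssm = parse_ascii_pssm(TOY_PSSM)
```

Each entry of `pssm` is the 20-dimensional row vector of one residue, so a protein of length N yields N such vectors.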
• Smoothing the standard PSSM
The statistics in Figure 3.1 show that more than 70% of FIRs occur in consecutive runs, indicating that binding sites are strongly affected by neighboring residues. To incorporate the dependency of a central residue on its surrounding neighbors, we adopted a previously reported smoothing method [44] inspired by smoothing techniques in image processing [45]. First, to handle the N-terminus and C-terminus of a protein sequence, m ZERO vectors (m is an odd number) were appended to the head and to the tail of the standard PSSM profile, where 2m+1 is the size of the smoothing sliding window. The smoothing sliding window was then used to incorporate the evolutionary information from the upstream and downstream residues. The row vector of each amino acid residue Ci was smoothed according to formula (3.2).
$$\mathrm{Smoothing\_C}_i = \frac{1}{2m+1} \sum_{j=i-m}^{i+m} \mathrm{PSSM\_C}_j, \quad (i = 1, \ldots, N) \tag{3.2}$$

where PSSM_Cj represents the score in the original PSSM, Smoothing_Ci represents the score in the smoothed PSSM, N is the length of the sequence, and 2m+1 is the smoothing-window size. Figure 3.5 illustrates an example of a smoothed PSSM profile. For amino acid ‘Q’, if the smoothing-window size 2m+1 is 7, the first column of the vector is smoothed to [2 + 1 + 0 + 1 + 0 + (−1) + (−2)]/7 = 0.14.
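The smoothing step of formula (3.2) can be sketched in a few lines. This is a minimal illustration, assuming the PSSM is held as a list of per-residue score rows; the function name is hypothetical, and the toy input reproduces only the single-column example for ‘Q’ given above.

```python
def smooth_pssm(pssm, m):
    """Average each PSSM row over a window of 2m+1 rows.

    m zero rows are added before the first and after the last residue so
    that windows at the N-terminus and C-terminus are well defined.
    """
    n_cols = len(pssm[0])
    zero = [0.0] * n_cols
    padded = [zero] * m + [list(row) for row in pssm] + [zero] * m
    window = 2 * m + 1
    smoothed = []
    for i in range(len(pssm)):
        rows = padded[i:i + window]  # original rows i-m .. i+m
        smoothed.append([sum(col) / window for col in zip(*rows)])
    return smoothed


# One-column toy PSSM holding the seven scores of the worked example.
toy_column = [[2], [1], [0], [1], [0], [-1], [-2]]
smoothed = smooth_pssm(toy_column, m=3)
center_value = smoothed[3][0]  # window covers all seven rows
```

Rounded to two decimals, `center_value` matches the 0.14 of the worked example; rows near the termini are averaged together with the appended zero vectors.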
• Condensing the smoothed PSSM by Kidera factors
After smoothing, we use the Kidera factors to condense the smoothed PSSM. The smoothed PSSMs are divided into sliding windows of size m. Each window is a matrix $S_{i,j}$ $(i = 1, \ldots, m;\ j = 1, \ldots, 20)$, where j indexes the 20 standard amino acids. Each feature is calculated according to formula (3.3):

$$F_{i,p} = \sum_{j=1}^{20} S_{i,j}\, f_{p,j}, \quad (i = 1, \ldots, m;\ p = 1, \ldots, 10) \tag{3.3}$$

where $f_{p,j}$ is the pth Kidera factor of amino acid j (each amino acid has 10 Kidera factors). Finally, each value in the condensed and smoothed PSSM is scaled to the range [-1, 1] according to a certain ratio.
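The condensation of formula (3.3) amounts to multiplying each 20-dimensional row by the 20 × 10 table of Kidera factors. The sketch below uses the values of Table 3.1 and assumes the PSSM columns follow the same amino acid order as the table (A R N D C Q E G H I L K M F P S T W Y V). Since the scaling ratio is not specified in the text, division by the largest absolute value is used here purely as an assumption; the function names are hypothetical.

```python
# Kidera factors from Table 3.1, one row per amino acid (F1..F10).
KIDERA = [
    [-1.56, -1.67, -0.97, -0.27, -0.93, -0.78, -0.20, -0.08, 0.21, -0.48],  # A
    [0.22, 1.27, 1.37, 1.87, -1.70, 0.46, 0.92, -0.39, 0.23, 0.93],  # R
    [1.14, -0.07, -0.12, 0.81, 0.18, 0.37, -0.09, 1.23, 1.10, -1.73],  # N
    [0.58, -0.22, -1.58, 0.81, -0.92, 0.15, -1.52, 0.47, 0.76, 0.70],  # D
    [0.12, -0.89, 0.45, -1.05, -0.71, 2.41, 1.52, -0.69, 1.13, 1.10],  # C
    [-0.47, 0.24, 0.07, 1.10, 1.10, 0.59, 0.84, -0.71, -0.03, -2.33],  # Q
    [-1.45, 0.19, -1.61, 1.17, -1.31, 0.40, 0.04, 0.38, -0.35, -0.12],  # E
    [1.46, -1.96, -0.23, -0.16, 0.10, -0.11, 1.32, 2.36, -1.66, 0.46],  # G
    [-0.41, 0.52, -0.28, 0.28, 1.61, 1.01, -1.85, 0.47, 1.13, 1.63],  # H
    [-0.73, -0.16, 1.79, -0.77, -0.54, 0.03, -0.83, 0.51, 0.66, -1.78],  # I
    [-1.04, 0.00, -0.24, -1.10, -0.55, -2.05, 0.96, -0.76, 0.45, 0.93],  # L
    [-0.34, 0.82, -0.23, 1.70, 1.54, -1.62, 1.15, -0.08, -0.48, 0.60],  # K
    [-1.40, 0.18, -0.42, -0.73, 2.00, 1.52, 0.26, 0.11, -1.27, 0.27],  # M
    [-0.21, 0.98, -0.36, -1.43, 0.22, -0.81, 0.67, 1.10, 1.71, -0.44],  # F
    [2.06, -0.33, -1.15, -0.75, 0.88, -0.45, 0.30, -2.30, 0.74, -0.28],  # P
    [0.81, -1.08, 0.16, 0.42, -0.21, -0.43, -1.89, -1.15, -0.97, -0.23],  # S
    [0.26, -0.70, 1.21, 0.63, -0.10, 0.21, 0.24, -1.15, -0.56, 0.19],  # T
    [0.30, 2.10, -0.72, -1.57, -1.16, 0.57, -0.48, -0.40, -2.30, -0.60],  # W
    [1.38, 1.48, 0.80, -0.56, 0.00, -0.68, -0.31, 1.03, -0.05, 0.53],  # Y
    [-0.74, -0.71, 2.04, -0.40, 0.50, -0.81, -1.07, 0.06, -0.46, 0.65],  # V
]


def condense(smoothed_pssm):
    """Map each 20-dimensional row to 10 dimensions, as in formula (3.3)."""
    return [
        [sum(row[j] * KIDERA[j][p] for j in range(20)) for p in range(10)]
        for row in smoothed_pssm
    ]


def scale_to_unit_range(matrix):
    """Divide by the largest absolute value so entries lie in [-1, 1].

    This particular ratio is an assumption made for illustration.
    """
    peak = max(abs(v) for row in matrix for v in row)
    if peak == 0:
        return [list(row) for row in matrix]
    return [[v / peak for v in row] for row in matrix]


# Toy smoothed rows: weight on A only, then on A and G together.
row_a = [1.0] + [0.0] * 19
row_ag = [1.0] + [0.0] * 6 + [1.0] + [0.0] * 12
condensed = condense([row_a, row_ag])
scaled = scale_to_unit_range(condensed)
```

A row with weight on alanine only reduces to the Kidera factors of alanine, which shows how the 20 substitution scores act as weights on the factor table. Condensing a window row by row and stacking the results gives the feature matrix of that window.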
The procedure of preparing feature sets for the predictor is shown in Figure 3.5.
Figure 3.5 Procedure of preparing feature sets for KSPSSMpred
3.2.6. SVM and 5-fold cross-validation
The identification of FIRs can be treated as a binary classification problem: determining whether or not a given residue interacts with FAD. Our prediction model was trained with the LIBSVM software package developed by Chang and Lin [34-35]. The radial basis function (RBF) kernel was adopted to construct the SVM classifier, and the grid search method [35] was used to find the best values of the parameters c and g. 5-fold cross-validation was used to evaluate the performance of the developed models: the patterns were randomly divided into five sets, four of which were used for training the prediction model while the remaining set was used for testing. The process was repeated until each set had been used once for testing.
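The 5-fold partitioning described above can be sketched as follows. This is a minimal illustration in plain Python rather than the LIBSVM tooling itself; the function name, the seed, and the exponentially spaced grids for c and g are assumptions, since the values actually searched are not given here.

```python
import random


def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples are shuffled once and dealt into k folds; each fold serves as
    the test set exactly once while the other k-1 folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# Hypothetical search grids for the SVM parameters c and g.
C_GRID = [2 ** e for e in range(-5, 16, 2)]
G_GRID = [2 ** e for e in range(-15, 4, 2)]

splits = list(five_fold_splits(23))
```

In a grid search, every (c, g) pair from the two grids would be evaluated on all five splits, and the pair with the best average performance would be kept.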