Ying
Wang‡
ab,
Xiaowei
Chen‡
c,
Zhi-Ping
Liu‡
d,
Qiang
Huang
b,
Yong
Wang
b,
Derong
Xu
c,
Xiang-Sun
Zhang
*b,
Runsheng
Chen
*c and
Luonan
Chen
*d
aSchool of Science, Dalian Jiaotong University, Dalian 116028, China
bNational Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. E-mail: zxs@amt.ac.cn
cLaboratory of Bioinformatics and Noncoding RNA, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China. E-mail: crs@sun5.ibp.ac.cn
dKey Laboratory of Systems Biology, SIBS-Novo Nordisk Translational Research Centre for PreDiabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China. E-mail: lnchen@sibs.ac.cn
First published on 11th October 2012
Protein–RNA interactions are fundamentally important in understanding cellular processes. In particular, non-coding RNA–protein interactions play an important role to facilitate biological functions in signalling, transcriptional regulation, and even the progression of complex diseases. However, experimental determination of protein–RNA interactions remains time-consuming and labour-intensive. Here, we develop a novel extended naïve-Bayes-classifier for de novo prediction of protein–RNA interactions, only using protein and RNA sequence information. Specifically, we first collect a set of known protein–RNA interactions as gold-standard positives and extract sequence-based features to represent each protein–RNA pair. To fill the gap between high dimensional features and scarcity of gold-standard positives, we select effective features by cutting a likelihood ratio score, which not only reduces the computational complexity but also allows transparent feature integration during prediction. An extended naïve Bayes classifier is then constructed using these effective features to train a protein–RNA interaction prediction model. Numerical experiments show that our method can achieve the prediction accuracy of 0.77 even though only a small number of protein–RNA interaction data are available. In particular, we demonstrate that the extended naïve-Bayes-classifier is superior to the naïve-Bayes-classifier by fully considering the dependences among features. Importantly, we conduct ncRNA pull-down experiments to validate the predicted novel protein–RNA interactions and identify the interacting proteins of sbRNA CeN72 in C. elegans, which further demonstrates the effectiveness of our method.
Fortunately, the recently accumulated protein–RNA experimental data make it possible to analyze the characteristics of RNA binding and identify protein–RNA interactions. For instance, some researchers have analyzed and discussed the structural and chemical characteristics of RNA binding residues,14–18 and have compared those proteins that bind different functional classes of RNA targets.17 Jones et al.14 analyzed RNA–protein interactions at a structure level, mainly based on chemical and physical properties of RNA-binding residues and the distribution of atom–atom contacts in the complexes. Ellis et al.18 presented a statistical analysis of the large-scale dataset of protein–RNA complexes and made a comparison of properties for binding residues bound to different functional classes of RNA. On the other hand, there have been other literatures for analyzing and discussing protein–RNA interactions from structural, functional, and other aspects.19,20
Some methods synthetically analyzed the dominating factors for determining protein–RNA interactions in allusion to different groups of data.15–17 Most of the existing works on protein–RNA interactions focus on the analysis and discussion of protein–RNA complexes. Recently, computational approaches with regard to protein–RNA interactions have been proposed but they are mainly presented to predict either RNA-binding sites based on protein information21–28 or RNA-binding proteins,29–31 rather than the prediction of protein–RNA interactions. For example, Han et al.30 proposed an SVM-based machine learning method to distinguish the RNA-binding from non-RNA-binding proteins utilizing protein primary sequences. To date, the prediction of the RNA–protein binding residues has been extensively studied, but there have been few works for protein–RNA interaction prediction,32–34 which is essential to reconstruct the protein–RNA relevant network. Pancaldi and Bahler32 used more than 100 different properties of mRNA and protein to construct features to make predictions by RF and SVM classifiers. The method has many limits for the present data. Muppirala et al.33 also employed RF and SVM classifiers to predict protein–RNA interactions. In addition, Bellucci et al.34 utilized several kinds of physicochemical properties of protein and RNA to analyze and predict the interaction properties for individual residues.
The fact that there are many functional large non-coding RNAs (ncRNAs) inside cells drives the recent studies to explore the functional diversity and mechanistic role of these large ncRNAs. The precise mechanism by which ncRNAs function remains poorly understood. One urgent task is to explore the interaction between ncRNAs and proteins. The functional importance of many ncRNA-protein interactions for correct transcriptional regulation has been demonstrated in small scale experiments. For example, several ncRNAs were discovered to be required for the correct localization of chromatin proteins to genomic DNA targets. Thus, it is in pressing need to computationally predict ncRNA–protein interaction in a large scale manner. Importantly, long non-protein-coding RNAs (lncRNAs) are a novel class of RNA molecules that exhibit rising interest within the research community in recent years. These interactions are known to be closely related to the progression of complex diseases, and their function reaches from signaling, different ways of cellular modifications to scaffolding functions. In this study, we will particularly focus on non-coding RNA–protein (including lncRNA–protein) interaction prediction.
Computational approaches for predicting protein–RNA interactions by using small training data have become highly desirable. In particular, there is an urgent need for presenting a computational method to predict protein–RNA interaction based only on the primary sequences of RNA and protein. Hence, we aim to develop a new machine learning based method to distinguish protein–RNA interaction pairs from non-protein–RNA interaction pairs. Specifically to meet the challenge of small number of experimental data on protein–RNA interactions, we proposed a new Bayesian classifier to efficiently predict protein–RNA interactions based on supervised learning. We presented two kinds of Bayesian classification methods to identify protein–RNA interactions. One is a Naive Bayesian (NB) classification method, and the other is an Extended Naive Bayesian (ENB) classification method. Both of them just use primary sequences of RNA and protein as input, without requiring any structure-derived information. We show that the NB classifier is a fast and effective learning approach for predicting protein–RNA interactions, which follows the assumption of the independence between the features. On the other hand, the ENB classifier takes the feature dependency into account, and thus is able to offer the accurate prediction with correlated features. Numerical experiments show that our method achieved the prediction accuracy of 0.77 even although we currently used only small number of protein–RNA interaction pairs to train the classifiers. We demonstrated that our classifier performs better than other machine-learning-based models for predicting protein–RNA interactions. To verify the computational prediction on unknown protein–RNA interactions, we conducted non-coding RNA pull-down experiments for sbRNA CeN72 in C. elegans, which further validated the effectiveness of our method.
The 20 amino acid residues were categorized into seven classes according to their physicochemical properties, and then the dimensions of the vector space were dramatically reduced, thereby overcoming the overfitting problem.35 The conjoint triad was proposed to express the properties of one amino acid and its vicinal amino acids.35 In this work, we employed the similar strategies to encode the residues. Different from that in the protein–protein interaction, 20 amino acid residues were classified into four classes {‘DE’}, {‘HRK’}, {‘CGNQSTY’}, and {‘AFILMPVW’} according to their charge and polarity of side chains in the protein–RNA interaction.26,30 To consider one amino acid and its vicinal amino acids, the definition of a conjoint triad is employed to denote a unit of three continuous amino acids.35 Accordingly, all the triads can be classified into 4 × 4 × 4 classes. For any two triads, if the residues in the corresponding positions belong to the same class, then the two triads are thought to be one class of triads. The latter ‘triads’ refer to one class of triads.
We mainly used three descriptors for a pair of protein and RNA sequence. From the biological perspective, these features indicate their interaction propensity. First is the composition of amino acid triads. One class of triads corresponds to one feature, and hence there are 4 × 4 × 4 features of triads. Triad composition is calculated by implementing the normalized frequency of the triads in the protein sequence.35 Second is the composition of k-nucleotide acids (simply k-NAs). k-NAs refer to a unit of k continuous nucleotide acids. k-NAs composition is calculated in the same way as the triad composition. Third is the composition of triads and k-NAs combination. The interaction of RNA and protein might be mainly attributed to the interaction of amino acid residues and nucleotide acid residues from the biochemical viewpoint. Thus, the combination of triads and k-NAs is also presented as a kind of feature descriptor. Totally, the features for the three descriptors are 4k + 4 × 4 × 4 + 4 × 4 × 4 × 4k in all. There might be some overlaps in these features to some extent. All the features are combined to form the feature vector, which is approximately a 4 × 4 × 4 × 4k dimensional vector. A high dimensional feature vector has resulted in high computational complexity. However, a small number of features may suffice to represent their relationship with RNA–protein interaction. To overcome the difficulty, a feature selection method was proposed to decrease the dimension of the feature vector and extract valuable features, which is described in the next section.
This score measures the enrichment of combination (xi, yj) in the positive relative to negative samples. In essence the above score is the likelihood ratio scores, which can be regarded as a measure of the usefulness of single combination feature (xi, yj). A likelihood ratio score much larger than 1 indicates that the feature is a good predictor for protein–RNA interaction. Likewise, a likelihood ratio score less than 1 indicates that the evidence is anti-predictive for protein–RNA interaction. Likelihood ratios of different features are directly comparable. According to the measured F(xi, yj), the enriched features can be extracted from all the features. Fig. 1 shows that 500 features are extracted from all the 4 × 4 × 4 × 43 = 4096 features when k = 3.
![]() | ||
Fig. 1 Selecting predictive features by thresholding the likelihood ratio based score. The 500 features (red points) were selected from all the 4096 triads and 3-KAs combination features (red points and blue points) for k = 3. The red points refer to the 500 selected features from all the combination features. The red points above the first dash line indicate the features enriched in the positive samples, and those below the second dash line indicate the features enriched in the negative samples. |
In Fig. 1, the horizontal axis corresponds to 4096 combination features, and the vertical axis corresponds to the score of the likelihood ratio for each feature. The two dashed lines divide all the features into three subsections, i.e. those above the first line denote the features enriched in the positive sample set, and those below the second line denote the features enriched in the negative sample set. The extracted 500 features are these enriched features discriminatively in the positive and negative sample sets. All of these features are available in Tables S2 and S3 (ESI†). We observed that the extracted features by feature selection scheme showed significant preferences for different classes of amino acid residues in the positive samples and negative samples, which implies that the prominent residues in the positive samples have significant tendency to basic and polar amino acids, and those in the negative samples have main tendency to acidic and non-polar residues. This was expected when the overall negative charge of RNA is considered.
The NB classifier classifies the samples according to the probability ratio:
![]() | (1) |
![]() | (2) |
Given a pair of hypothetical RNA and protein sequences, as shown in Table 1, there are two classes of triads {ARR}, {RRK, RKR, KRR, RRR} and two 2-nucleotide features {UA}, {UU}. The amino acid residues ‘R’ and ‘K’ in the above triads fall into one class of amino acids, whereas residue ‘A’ falls into another class in the four classes of residues. From above, it is obvious that the residues in the first position of ‘ARR’ and ‘RRK’ belong to different classes, hence the two triads ‘ARR’ and ‘RRK’ belong to different classes of triads. If the residues in the corresponding positions belong to the same class respectively, the two triads are defined as one class of triads. For example, the four triads in the second brace belong to one class of triads due to the residues in the corresponding positions coming from the same class. The triads–2-NAs combinations correspond to four features, namely {ARR-UA}, {RRK-UA, RKR-UA, KRR-UA, RRR-UA}, {ARR-UU} and {RRK-UU, RKR-UU, KRR-UU, RRR-UU}. The above ones included in one brace belong to one combination feature. For the first two combination features {ARR-UA} and {RRK-UA, RKR-UA, KRR-UA, RRR-UA}, only two elements in the first position belong to different classes, and two elements in the remaining four positions belong to the same class, thereby the degree of similarity between the two features is computed as (5 − 1)/5 = 0.8. Similarly, the similarity degree of {ARR-UA} and {RRK-UU, RKR-UU, KRR-UU, RRR-UU} is (5 − 2)/5 = 0.6. The more similar features are, the stronger the correlation between them is. The similar features might have similar function, and hence they are deemed to belong to one kind of feature. In the above example, the similarity degree achieves 0.8, that shows that the two features are more similar, and then the two features are combined into a new group of combination feature {ARR-UA, RRK-UA, RKR-UA, KRR-UA, RRR-UA}.
Amino acid sequence → A R R K R R R R R R K |
Nucleotide acid sequence → U U U U A |
Two kinds of triads: {ARR}, {RRK, RKR, KRR, RRR} |
Two 2-nucleotide acids: {UA}, {UU}{UA}, {UU} |
Given the high computational complexity and a small scale training data, it is impossible to learn the optimal Bayesian network structure from training data. In this paper, we rely on prior knowledge to pre-determine the Bayesian network structure.
To determine the Bayesian network structure, we maximally utilize the available correlation relationships among features and meanwhile maximize the number of conditional independencies among features. We recombine the selected features in the NB model into a group of new features by clustering based on the correlation between them. Suppose there exist k non-overlapping groups of the total m features, i.e., G1,G2,⋯,Gk, Gi ∩ Gj = ∅, ∀ i,j. We use naïve Bayes to integrate features in group level, where two groups of sequence features are assumed to be conditionally independent given protein–RNA interaction. In this case, the likelihood ratio of the combined groups is equal to the product of the likelihood ratios for an individual group. The feature integration inside each individual group is achieved via a full Bayesian Network, where none of the features are conditionally independent. In this case, we estimate all possible combinations of features values. The structure is then fixed during training and testing. This way, the computational complexity of the problem is dramatically reduced, and only a small training set is required.
![]() | (3) |
![]() | (4) |
As an example, we assume X = (x1,x2,x3,x4) to denote a 4-dimensional feature vector corresponding to four features. And only the features related to x2 and x3 have strong correlation. Due to the correlation between the two features, the formula (1) can be computed in the following way.
![]() | (5) |
In addition to the above measures, we also used the receiver operating characteristic (ROC) curve42 and area under the ROC curve (AUC)43 to assess the performance of NB and ENB classifiers. The different thresholds of probability are used to plot the ROC curve, by which we can visualize how the thresholds change SE and SP. AUC calculates the area under an ROC curve and its range of change is between 0 and 1. The maximum value 1 represents a perfect prediction. For a random guess, the value of AUC is close to 0.5. To further test the ability of NB and ENB classifiers to predict ncRNA–protein interactions, we also utilized the RPI369 and the RPI2241 as the training sample set, and the NPInter dataset as the testing sample set. We trained the two classifiers on the RPI369 and RPI2241 datasets, and independently tested the prediction performance of the two classifiers on the NPInter dataset, respectively.
Sample set | # Of selected features | ACC | SE | SP | PRE | MCC |
---|---|---|---|---|---|---|
NPInter | 500 | 0.7432 | 0.4262 | 0.9016 | 0.6842 | 0.3810 |
1000 | 0.7423 | 0.3552 | 0.9358 | 0.7345 | 0.3730 | |
2000 | 0.7368 | 0.3333 | 0.9385 | 0.7305 | 0.3569 | |
RPI369 | 500 | 0.7326 | 0.3089 | 0.9444 | 0.7355 | 0.3442 |
1000 | 0.7498 | 0.3659 | 0.9417 | 0.7584 | 0.3947 | |
2000 | 0.7715 | 0.3821 | 0.9661 | 0.8494 | 0.4598 | |
RPI2241 | 500 | 0.7378 | 0.4496 | 0.8819 | 0.6556 | 0.3721 |
1000 | 0.7326 | 0.4174 | 0.8902 | 0.6552 | 0.3545 | |
2000 | 0.7574 | 0.4429 | 0.9147 | 0.7220 | 0.4180 |
In our computations, we found that the SE of prediction is mostly at a low level of 40%, compared to the level of 90% for SP. The lower prediction accuracy for positive samples is very likely resulted from the limiting positive samples. Few samples in a positive set tend to be less adequate in representing all kinds of positive samples, which might partially contribute to inaccurate prediction by the supervised learning. The fact that the sample data come from multiple different organisms likely makes it more difficult for a statistical classification system to unambiguously predict the interaction of RNA and protein, which is another possible reason for low accuracy for the positives. But, with the further accumulation of experimental data for protein–RNA interactions, this problem can be considerably alleviated in future. The sensitivity can be increased at the expense of a decrease in specificity by varying the threshold of classification θ. The change trend of SE and SP can be obviously seen in the ROC curve shown in Fig. 2.
![]() | ||
Fig. 2 (a)–(c) indicate the ROC curves of the prediction results obtained from an NB classifier and an ENB classifier, separately in NPInter, RPI369 and RPI224. The blue dash line denotes the ROC curve from the NB classifier and the red solid line denotes the ROC curve from the ENB classifier. |
On this non-redundant dataset of RNA–protein interactions, the NB classifier performs well by only utilizing the primary sequence information. With the parameter k = 3 and the number of extracted features reaching 2000, the NB classifier achieved an overall accuracy of 0.74, 0.77 and 0.76, with a correlation coefficient of 0.36, 0.46 and 0.42, a specificity of 0.94, 0.97 and 0.92, and a precision of 0.73, 0.85 and 0.72 for NPInter, RPI369 and RPI2241 dataset, respectively.
Sample set | # Of selected features | Cluster | ACC | SE | SP | PRE | MCC |
---|---|---|---|---|---|---|---|
NPInter | 1000 | 91 | 0.7760 | 0.4728 | 0.9277 | 0.7656 | 0.4668 |
RPI369 | 1000 | 465 | 0.7500 | 0.3497 | 0.9500 | 0.7772 | 0.3955 |
RPI2241 | 1000 | 386 | 0.7403 | 0.3889 | 0.9158 | 0.6980 | 0.3693 |
The computational results (data not shown) from the two classifying models indicate that the overall accuracy has improved in a way when k is from 2 to 3, but less improved or even declined when k is from 3 to 4. The best performance was obtained with k = 3. The similar change in accuracy for the two classifiers seems to conclude that the length of nucleotide acids has effects on the prediction of protein–RNA interaction, and that 3-nucleotide acid feature may be deemed as a functional unit in the interaction event. Moreover, we also constructed a balanced dataset with equal number of positive samples and negative samples. The prediction results on this dataset are shown in Table S4 (ESI†). We found that it is slightly different from those of Tables 2 and 3. For example, the overall accuracies have decreased obviously when the number of training samples decreased. The results also provided evidence for the importance of the number of the training samples.
Fig. 2(a)–(c) shows the ROC curves of the prediction results by the NB classifier, and the ENB classifier in NPInter, RPI369 and RPI224, respectively. From the figures, we observed that the ROC curve of the ENB classifier is slightly higher than that of the NB classifier for the RPI369 and the NPInter. For the RPI2241, two ROC curves have nearly the identical change. The trivial change of ROC curves indicates the two classifiers have similar prediction performance on the whole. However, the computational complexity for the NB classifier has increased more promptly with the increase of sample data compared to the ENB classifier. Namely, the ENB model is more efficient than the NB model meanwhile keeping the better prediction performance. The AUC of the ENB classifier achieved ∼0.78, ∼0.68, ∼0.72, compared to ∼0.74, ∼0.64, ∼0.71 for the NPInter, the RPI369 and the RPI2241.
All of the computational results demonstrate that our ENB classifier not only considers the correlation between features, but also yields a significant improvement in computational efficiency over the NB classifier. In NB, we employed a feature selection process to extract important features distinguishing protein–RNA interactions from non-interactions. Compared to NB, the relationship of these features are considered simultaneously in ENB. It will have large applicable scope because there is no assumption of the dependency and redundancy in these features. The competitive results of the NB with feature selection and ENB also provided evidence for the importance of feature relationship in the prediction. Our presented feature selection scheme and extended NB classification will also shed light on other similar researches.
In our method, we encoded proteins and RNAs only by sequence-based information. Compared to the limited structure information available, the easy availability of sequence information will benefit the flexibility and applicability of our method in the prediction. One research direction is to evaluate the potential improvement of prediction accuracy by employing more valuable information.
Tables 4 and 5 show the prediction results for the NB classifier and the ENB classifier on the NPInter dataset. In our prediction, we separately computed the number of correctly predicted ncRNA–protein pairs for different organisms that were shown in the two tables. The results showed that our ENB model predicted more accurate for the Drosophila melanogaster, Mus musculus and Saccharomyces cerevisiae than the other organisms for the two training sets RPI369 and RPI2241. The results indicate that ENB method out-performed NB method when trained whether on the RPI369 or the RPI2241 dataset. Concretely, when trained on the RPI369 dataset, the ENB classifier achieved an accuracy of ∼67%, compared to ∼62% for the NB classifier. And when trained on the RPI2241dataset, the ENB classifier achieved an accuracy of ∼79%, compared to ∼66% for the NB classifier. The prediction results further proved the effectiveness and efficiency of our proposed ENB method.
Organism | Total RNA–protein pairs | Pairs predicted by NB classifier (%) | Pairs predicted by ENB classifier (%) |
---|---|---|---|
Caenorhabditis elegans | 3 | 3(100%) | 1(33%) |
Drosophila melanogaster | 26 | 13(50%) | 19(74%) |
Escherichia coli | 25 | 13(52%) | 17(68%) |
Homo sapiens | 148 | 93(63%) | 77(52%) |
Mus musculus | 46 | 30(65%) | 40(87%) |
Saccharomyces cerevisiae | 119 | 76(64%) | 89(75%) |
Total | 367 | 228(62%) | 243(67%) |
Organism | Total RNA–protein pairs | Pairs predicted by NB classifier (%) | Pairs predicted by ENB classifier (%) |
---|---|---|---|
Caenorhabditis elegans | 3 | 1(33%) | 1(33%) |
Drosophila melanogaster | 26 | 23(88%) | 25(96%) |
Escherichia coli | 25 | 12(48%) | 15(60%) |
Homo sapiens | 148 | 84(57%) | 91(62%) |
Mus musculus | 46 | 28(61%) | 37(80%) |
Saccharomyces cerevisiae | 119 | 94(79%) | 118(99%) |
Total | 367 | 242(66%) | 287(79%) |
We selected 30 ncRNAs (17 sbRNAs and 13 snlRNAs) and 759 proteins to predict interactions in worms.44 The list of these proteins were shown in Table S5 (ESI†). For the nc11 (sbRNA CeN72) in C. elegans, we predicted 115 pairs of interactions when trained on the RPI369 and 207 pairs of interactions on the RPI2241, which were shown in Tables S6 and S7 (ESI†), respectively. To verify our prediction results, we performed the ncRNA pull-down experiments to identify the interacting proteins of sbRNA CeN72 in C. elegans. The detailed experimental steps are described in Materials and methods. Our biological experiments were repeated for 7 times, and 51 proteins were identified to interact with CeN72. We found that 7 predicted interacting proteins on the RPI369 (O44572, P41937, P49041, P52013, Q18943, Q20140 and Q93615) and 10 ones on the RPI2241 (P27639, P34460, P34690, P41937, P49405, P52013, P91306, Q18625, Q18885 and Q22508) were verified in the experiments. The detailed information of these proteins was listed in Table S5 (ESI†). The 7 predicted interacting proteins on the RPI 369 were from different protein families and had low sequence similarity. And the 10 proteins on the RPI2241 had low sequence identity, except P34690 and P41937, which were both members of the tubulin family. The results of the wet experiment provided evidence for the effectiveness of our method of de novo prediction of protein–RNA interaction only by sequence information.
To meet the challenge of a small set of well-constructed gold-standard dataset for ncRNA–protein interaction, we utilized graphical models such as Bayesian networks45,46 to capture the pattern in the training data. Specifically we adopted a transparent feature integration strategy which is especially important in our case, where the gold-standard data are scarce. Our extended naive Bayesian network structure is pre-chosen by carefully selecting predictive features and considering their possible conditional dependence relationships. As a major advantage, we only need to learn Bayesian network parameters during training. In addition, our method is particularly suitable for predicting long non-coding RNA–protein interactions. The longer the RNA sequence is, the more accurate the feature importance and correlation estimation are. In this case, the parameters in the Bayesian network can be reasonably estimated using the small set of gold-standard positive data. This fact enables our method to study functional roles for long non-coding RNAs, a novel class of RNA molecules, that exhibit rising interest within the research community in recent years.47,48
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c2mb25292a |
‡ These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2013 |