NPred: QSAR classification model for identifying plant based naturally occurring anti-cancerous inhibitors

Kanika Dhiman and Subhash Mohan Agarwal*
Bioinformatics Division, Institute of Cytology and Preventive Oncology, I-7, Sector-39, Noida-201301, India. E-mail: smagarwal@yahoo.com

Received 30th January 2016, Accepted 6th May 2016

First published on 9th May 2016


Abstract

The prediction of naturally occurring plant based compounds as anticancer agents is key to developing new chemical entities in therapeutic oncology. Therefore, in the present study various machine learning techniques, viz. Naive Bayesian classifier (NB), sequential minimal optimization (SMO), instance based learner (IBK) and random forest (RF), have been used to develop models of the relationship between the chemical structures of plant based natural compounds and their anti-cancerous inhibition activity. These models were trained, tested and validated using 549 active and 424 inactive compounds deposited in the NPACT database. We observe that the random forest based model using 881 PubChem fingerprints showed the best performance, with a Matthews correlation coefficient (MCC) of 0.54 and an accuracy of 77.6% in five-fold cross-validation and an MCC of 0.35 with an accuracy of 68.4% on an independent external validation set. Also, a frequency-based feature selection method was used to identify the fingerprints whose occurrence percentages differ between the active inhibitor dataset and the inactive dataset. We find that nearly all of the top 10 fingerprints (FP797, FP818, FP12, FP179, FP3, FP143, FP712, FP704, FP334 and FP711) are present in vincristine, vinblastine and paclitaxel, three therapeutic drugs that are derived from natural products and used clinically as anticancer agents. Finally, we have also developed a web server, NPred, to predict the potential of natural compounds as anticancer agents and thus help researchers working in this area. We expect that the results of this study will pave the way for identifying and designing novel natural products as cancer growth inhibitors.


Introduction

Natural products are a continuous source of promising compounds for new drug discovery.1,2 In the area of cancer, it is reported that ∼49% of therapeutic agents developed are either natural products or directly derived therefrom.3–5 Since finding new potential leads from naturally occurring compounds is highly desirable, researchers have continuously screened various plant based compounds for anti-cancer activity using a variety of in vitro cellular assay systems. A range of bioactive compounds along with their inhibitory activity data is thus available in the public domain, which can be harnessed to find the relationship between their chemical structures and anti-cancerous inhibition activity.4 Additionally, it has been well established that designing inhibitors computationally using structure and ligand based approaches can be useful in accelerating drug design.6–10 The number of active inhibitors is growing, yet there is still a need to speed up the drug discovery process so that physicians have new effective drug candidates against cancer in their armory.

As naturally occurring scaffolds are considered superior due to their druggable nature, optimal interaction with biological macromolecules and reduced toxicity issues, natural products remain an indispensable source for anticancer drug discovery.11,12 Thus, to speed up the drug discovery process in the area of cancer it is necessary to develop computational methods that can predict the likelihood of a naturally occurring molecule being an anticancer agent. QSAR is one such computational method that has recently been applied to successfully predict the biological activity of molecules.13,14 Additionally, several investigators have developed web service applications based on QSAR models in different research areas,7,8,15–18 which not only allows wider dissemination but also facilitates further advances in the field. Therefore, we have undertaken a study to develop QSAR models for predicting the probability of naturally occurring plant based compounds being anti-cancerous leads. In our recent work, we have compiled a central resource named the Naturally Occurring Plant-based Anti-cancer Compound-Activity-Target database (NPACT; http://crdd.osdd.net/raghava/npact/)4 that gathers the information related to experimentally validated plant-derived natural compounds exhibiting anti-cancerous activity. In the present study, we have used a large dataset of ∼1000 diverse molecules from NPACT to understand structure–activity relationships and to develop classification-based prediction models. We developed the models using various machine-learning techniques (e.g., IBK, SMO and random forest) and used their distinguishable features to predict the inhibition potential of a naturally occurring molecule. We also identified important fingerprints that play a significant role in exhibiting inhibitory potential. To our knowledge, there is no such prediction method available in the public domain. We have also developed a web-based platform (http://smagarwal.in/npred/) that is freely available to the scientific community.

Methodology

Dataset

We obtained 1030 anticancer compounds along with their half-maximal inhibitory concentrations (IC50) from the NPACT database.4 These compounds are diverse in nature and belong to various structural scaffolds. Based on the inhibition activity, compounds were assigned as inhibitors (active molecules) if their IC50 was less than 10 μM and as non-inhibitors (inactive molecules) if their IC50 was greater than 10 μM. As the IC50 values were obtained from various studies/assays, we checked for molecules with conflicting IC50 values. Only a few molecules had multiple IC50 values; we removed the 57 compounds that had reported IC50 values both below and above 10 μM. Thus, the final dataset contains 549 inhibitors (the positive, active set) and 424 non-inhibitors (the negative, inactive set). To evaluate the performance of the developed models, we divided this dataset into a training set and a validation set: the training set consisted of 90% of the data and the remaining 10% was kept as the external validation set.
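
The labelling and splitting steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the record layout (compound identifier plus IC50 in μM), the treatment of an IC50 of exactly 10 μM as inactive, and the use of a simple seeded random split are all assumptions made for the example.

```python
import random

ACTIVITY_CUTOFF_UM = 10.0  # IC50 threshold stated in the text


def label_compounds(ic50_records):
    """Map compound id -> 'active' / 'inactive', dropping conflicting compounds.

    ic50_records: iterable of (compound_id, ic50_in_uM) pairs; a compound may
    appear several times if it was measured in several assays.
    Assumption: an IC50 of exactly 10 uM is treated as inactive.
    """
    calls = {}
    for compound_id, ic50 in ic50_records:
        calls.setdefault(compound_id, set()).add(ic50 < ACTIVITY_CUTOFF_UM)

    labels = {}
    for compound_id, seen in calls.items():
        if len(seen) == 1:  # every measurement falls on the same side of the cutoff
            labels[compound_id] = "active" if True in seen else "inactive"
    return labels


def split_train_validation(compound_ids, validation_fraction=0.1, seed=0):
    """Hypothetical random 90/10 split; returns (train_ids, validation_ids)."""
    ids = sorted(compound_ids)
    random.Random(seed).shuffle(ids)
    n_validation = round(len(ids) * validation_fraction)
    return ids[n_validation:], ids[:n_validation]
```

A compound measured at 2 μM in one assay and 30 μM in another would be dropped by `label_compounds`, which mirrors the removal of conflicting compounds described in the text.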

Descriptor calculation

Chemical descriptors are the representative features of a chemical molecule that are responsible for its activity. In this study, we have used 881 PubChem based binary fingerprints calculated using PaDEL software.19 The complete details of the 881 PubChem fingerprints along with their descriptions are available from the PubChem website (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt).

Feature selection

As selection of significant descriptors is crucial for QSAR modeling, feature selection was carried out to eliminate highly correlated descriptors (multicollinearity) and to remove insignificant descriptors. In this study, we have used the Weka attribute selection filter for variable selection. The attribute selection filter uses an evaluator, which assesses a subset of attributes, and a search algorithm, which searches the space of attribute subsets for candidates to be assessed by the evaluator. The CfsSubsetEval module implemented in Weka was used as the evaluator for significant descriptor selection and the best-first algorithm as the search algorithm. We used the default settings of both for selection of the important variables.7
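
To give a feel for what a correlation-based subset evaluator rewards, the sketch below computes the standard correlation-based feature selection (CFS) merit heuristic for a candidate subset: subsets score highly when their features correlate with the class but not with each other. It is a simplified illustration and not Weka's implementation; the function name is hypothetical and the correlations are taken as precomputed inputs, whereas Weka derives its own correlation measure internally.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a feature subset.

    feature_class_corr: one correlation per feature in the subset (feature vs class).
    feature_feature_corr: correlations for every pair of features in the subset
                          (empty when the subset holds a single feature).
    merit = k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff))
    """
    k = len(feature_class_corr)
    mean_cf = sum(abs(r) for r in feature_class_corr) / k
    if feature_feature_corr:
        mean_ff = sum(abs(r) for r in feature_feature_corr) / len(feature_feature_corr)
    else:
        mean_ff = 0.0
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)
```

With two equally informative features, the merit is higher when they are uncorrelated than when they are fully redundant, which is why such an evaluator tends to shrink a set of 881 fingerprints to a small, non-redundant subset.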

Fingerprint frequency and selection

We have used our previously developed frequency-based approach for identification of fingerprints that exhibit a difference in the active and inactive datasets.8 Briefly, for each descriptor or fingerprint, the frequency is calculated in active and inactive molecules using eqn (1) and (2).
 
$$FA_i = \frac{1}{N_A} \sum_{j=1}^{N_A} D_{ij} \qquad (1)$$

$$FI_i = \frac{1}{N_I} \sum_{j=1}^{N_I} D_{ij} \qquad (2)$$
where $FA_i$ and $FI_i$ represent the mean of the ith fingerprint in active (A) and inactive (I) molecules, respectively, $N_A$ and $N_I$ are the total numbers of molecules in the active and inactive datasets, respectively, and $D_{ij}$ is the value of the ith fingerprint for the jth molecule (either 0 or 1); the sum in each equation runs over the molecules of the corresponding dataset.

Finally, we compute a fingerprint score (FS) of each fingerprint using eqn (3).

 
$$FS_i = FA_i - FI_i \qquad (3)$$
where $FS_i$ is the inhibitory score of the ith fingerprint.

A positive FS for a fingerprint means that it is more prevalent in active molecules, while a negative FS means that it is more prevalent in inactive molecules; the larger the magnitude, the stronger the preference.
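
The frequency-based score of eqn (1)–(3) amounts to a per-bit difference of means. The following is a minimal sketch under the assumption that each molecule is represented as a list of 0/1 fingerprint bits; the function name and input layout are chosen for illustration only.

```python
def fingerprint_scores(active_fps, inactive_fps):
    """Return FS_i = FA_i - FI_i for every fingerprint bit i.

    active_fps / inactive_fps: lists of equal-length 0/1 lists, one per molecule.
    FA_i and FI_i are the fractions of active and inactive molecules in which
    bit i is set.
    """
    n_bits = len(active_fps[0])

    def mean_per_bit(rows):
        return [sum(row[i] for row in rows) / len(rows) for i in range(n_bits)]

    fa = mean_per_bit(active_fps)
    fi = mean_per_bit(inactive_fps)
    return [a - b for a, b in zip(fa, fi)]
```

Multiplying the returned values by 100 gives percentage differences of the kind used to rank fingerprints by how much more often they occur in active than in inactive molecules.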

Classification

We have used various machine learning classifiers implemented in the WEKA package.20 We used Bayes, NaiveBayes, Sequential Minimal Optimization (SMO), Instance Based Learner (IBK) and Random Forest (RF) classifiers to develop models and to test the performance.

Performance evaluation

The performance of the developed models has been evaluated using a five-fold cross-validation technique. In this method, the training set is randomly divided into five subsets, and training and testing are done five times in such a way that one subset is used at a time for testing while the remaining four subsets are used for training. To avoid any bias in the prediction model, an independent external validation set is also used for evaluation. The fitness of the predicted models is assessed using various parameters such as sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC).8 Further, it has been recommended that the applicability domain (APD) must be checked for the validated model so that the predictions can be considered reliable.13,21,22 Briefly, the APD is defined as:
$$\mathrm{APD} = \langle d \rangle + Z\sigma$$
where $\langle d \rangle$ and $\sigma$ are the average and standard deviation, respectively, of all distances between all training compounds and test compounds, while $Z$ was chosen equal to 0.5 (default value). Finally, the prediction for a test set compound is treated as reliable if the distance to its nearest neighbor in the training set is less than the predefined applicability domain (APD) threshold. We have used the Enalos domain similarity node in KNIME23 to evaluate the applicability domain of the proposed model.22,24
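
As a rough illustration of the applicability-domain check, the sketch below computes a threshold of the form mean distance plus Z times the standard deviation and tests whether a query compound lies within it. It is not the Enalos node's implementation: the use of Euclidean distance, the use of all pairwise distances among training compounds, and the function names are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def apd_threshold(train, z=0.5):
    """APD = <d> + Z * sigma over pairwise distances of the training vectors."""
    dists = [
        euclidean(train[i], train[j])
        for i in range(len(train))
        for j in range(i + 1, len(train))
    ]
    return mean(dists) + z * pstdev(dists)


def in_domain(query, train, threshold):
    """True if the query's nearest training neighbour is closer than the threshold."""
    return min(euclidean(query, t) for t in train) < threshold
```

A prediction would then be reported as reliable only when `in_domain` returns True for the query compound.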

Results

Classification model development based on 881 PubChem fingerprints

We developed classification models for predicting the inhibitory potential of plant based naturally occurring compounds using 881 PubChem fingerprints and various algorithms/techniques, namely Bayes, Naive Bayes, SMO, IBK and random forest. These models were evaluated using five-fold cross-validation. We observed that the model based on the random forest algorithm using 100 trees performed best among the classifiers, achieving 81.58% sensitivity, 72.44% specificity and 77.6% accuracy with an MCC of 0.54 and an AUC of 0.85 (Table 1).
Table 1 Five-fold cross validation performance of different machine-learning classifiers using 881 PubChem fingerprints
Classifier      Sensitivity (%)   Specificity (%)   Accuracy (%)   MCC    AUC
Bayes           73.68             53.54             64.91          0.28   0.67
NaiveBayes      68.02             66.14             67.20          0.34   0.69
SMO             76.52             69.29             73.37          0.46   0.72
IBK             83.20             60.10             73.14          0.45   0.80
Random forest   81.58             72.44             77.60          0.54   0.85
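
The measures reported in Table 1 follow from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical, and the counts in the usage note are arbitrary illustrative numbers, not values from this study.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/fn: actives predicted as active/inactive.
    tn/fp: inactives predicted as inactive/active.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `classification_metrics(tp=80, fn=20, tn=70, fp=30)` gives a sensitivity of 0.80, a specificity of 0.70, an accuracy of 0.75 and an MCC of about 0.50. Unlike accuracy, the MCC accounts for all four cells of the confusion matrix, which makes it informative when the active and inactive classes are of unequal size.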


Further, to validate the models developed using the training dataset, we tested the performance on an independent validation dataset, which was not used for model development. We find that IBK and SMO achieved sensitivities of 80.00% and 83.64%, respectively, while random forest achieved a similar sensitivity (78.18%) with a higher specificity (55.81%, compared with 51.16% for both IBK and SMO) (Table 2).

Table 2 Performance of models developed on the training dataset and applied to the validation dataset
Classifier      Sensitivity (%)   Specificity (%)   Accuracy (%)   MCC    AUC
Bayes           76.36             48.84             64.29          0.26   0.64
NaiveBayes      74.55             60.47             68.37          0.35   0.69
SMO             83.64             51.16             69.39          0.37   0.67
IBK             80.00             51.16             67.35          0.33   0.74
Random forest   78.18             55.81             68.37          0.35   0.71


Classification based on Weka selected PubChem fingerprints

Next, we performed attribute selection using the WEKA package and obtained 24 fingerprints out of 881 fingerprints. It needs to be noted that we only used the training dataset for selection of attributes while the independent dataset was not considered. We also examined the 24 best fingerprints obtained using the attribute selection filter in WEKA and listed them in Table 3.
Table 3 Description of 24 PubChem fingerprints selected using the attribute selection filter of Weka
PubChem FP Id   Description
2               ≥16H
3               ≥32H
12              ≥16C
13              ≥32C
116             ≥1 saturated or aromatic carbon-only ring size 3
143             ≥1 any ring size 5
145             ≥1 saturated or aromatic nitrogen-containing ring size 5
147             ≥1 unsaturated non-aromatic carbon-only ring size 5
149             ≥1 unsaturated non-aromatic heteroatom-containing ring size 5
193             ≥3 saturated or aromatic carbon-only ring size 6
209             ≥5 saturated or aromatic heteroatom-containing ring size 6
245             ≥1 unsaturated non-aromatic carbon-only ring size 9
341             C(∼C)(∼C)(∼O)
374             C(∼H)(∼H)(∼H)
417             C#C
686             O=C–C–C–C–O
688             C–C–C–C–C–C–C
696             C–C–C–C–C–C–C–C
698             O–C–C–C–C–C–C–C
704             O=C–C–C–C–C–C–C
706             O=C–C–C–C–C(=O)–C
779             CC1CCC(N)CC1
797             CC1CC(C)CCC1
818             CC1C(C)CCCC1


We then performed five-fold cross-validation on the training dataset created using only the fingerprints listed above. We observed that in the case of the Bayes classifier the performance increased from 0.28 to 0.33 MCC (Table 4), whereas for three classifiers, i.e. SMO, IBK and random forest, the performance decreased from 0.46, 0.45 and 0.54 MCC to 0.45, 0.43 and 0.43, respectively (Table 4). We found that both IBK and random forest achieved similar accuracy (72%) and 0.43 MCC, but random forest showed a higher AUC value of 0.79.

Table 4 Five-fold cross validation performance of different classifiers using only 24 fingerprints
Classifier      Sensitivity (%)   Specificity (%)   Accuracy (%)   MCC    AUC
Bayes           78.74             53.02             67.54          0.33   0.69
NaiveBayes      79.55             52.76             67.89          0.34   0.73
SMO             78.74             66.14             73.26          0.45   0.75
IBK             78.14             64.57             72.23          0.43   0.78
Random forest   76.11             66.67             72.00          0.43   0.79


Again, for independent evaluation of the models, we used the training dataset of the 24 best fingerprints for developing models and checked their accuracy on the independent validation dataset. We found that SMO, IBK and random forest performed better on the independent dataset than the other classifiers. SMO and IBK achieved sensitivity values of 83.64% and 87.27% and MCC values of 0.61 and 0.56, respectively, while random forest achieved 85.45% sensitivity and 0.63 MCC (Table 5). A comparison of the different classifiers on the complete as well as the best fingerprint datasets indicated better performance of random forest, followed by SMO and IBK.

Table 5 Performance of the model developed on the training dataset and applied to the validation dataset using the 24 best fingerprints selected by Weka
Classifier      Sensitivity (%)   Specificity (%)   Accuracy (%)   MCC    AUC
Bayes           81.82             53.49             69.39          0.37   0.67
NaiveBayes      80.00             58.14             70.41          0.39   0.69
SMO             83.64             76.74             80.61          0.61   0.84
IBK             87.27             67.44             78.57          0.56   0.84
Random forest   85.45             76.74             81.63          0.63   0.87


Assessment of applicability domain (APD)

After model validation, the APD of our model was also tested to confirm whether a given prediction is reliable or not. We find that the APD value calculated for all compounds in the test set was 7.35, based on the equation provided in the Methods section. We then checked the domain of applicability for the external validation set. The values for the entire validation dataset fell within the range 0–6.85, i.e. inside the domain of applicability, and hence the model predictions were considered reliable (ESI Table S1).

Analysis of frequently occurring fingerprints using a frequency-based approach

We have also calculated the difference in the frequency of occurrence of fingerprints in the positive and negative datasets in order to identify the features that can discriminate between active and inactive molecules and provide information regarding bioactive inhibitors. Using the fingerprint selection approach, we observe that the top 10 PubChem fingerprints are FP797 (CC1CC(C)CCC1), FP818 (CC1C(C)CCCC1), FP12 (≥16C), FP179 (≥1 saturated or aromatic carbon-only ring size 6), FP3 (≥32H), FP143 (≥1 any ring size 5), FP712 (C–C(C)–C(C)–C), FP704 (O=C–C–C–C–C–C–C), FP334 (C(∼C)(∼C)(∼C)(∼C)) and FP711 (C–C(C)(C)–C–C), which show at least a 16% difference in their prevalence (Table 6). The top-most fingerprints are FP797 and FP818, corresponding to 1,3-dimethylcyclohexane and 1,2-dimethylcyclohexane, which show a difference of 25% and 23%, respectively.
Table 6 Top 10 fingerprints which are dominant in the inhibitor dataset as compared to non-inhibitors
PubChem FP Id   Description                                        Positive%   Negative%   Difference%
797             CC1CC(C)CCC1                                       68.49       43.40       25.09
818             CC1C(C)CCCC1                                       59.20       36.08       23.11
12              ≥16C                                               91.62       70.75       20.87
179             ≥1 saturated or aromatic carbon-only ring size 6   48.82       28.07       20.75
3               ≥32H                                               52.28       32.31       19.97
143             ≥1 any ring size 5                                 58.29       39.15       19.14
712             C–C(C)–C(C)–C                                      69.22       51.65       17.57
704             O=C–C–C–C–C–C–C                                    71.77       54.95       16.81
334             C(∼C)(∼C)(∼C)(∼C)                                  48.27       31.84       16.43
711             C–C(C)(C)–C–C                                      48.27       31.84       16.43


To further evaluate and understand the importance of these fingerprints we determined whether they are present in the three well documented naturally occurring plant based drugs used clinically i.e. vincristine, vinblastine and paclitaxel. We find that in all the three drugs 9 out of the top 10 fingerprints are present (Table 7). This demonstrates that these structural fragments are important and contribute to determining the anti-cancerous inhibitory bioactivity of naturally occurring molecules.

Table 7 The presence (✓) and absence (✗) of the 10 fingerprints in the three clinically used drugs derived from natural products
PubChem FP Id Known natural product anticancer drugs
Vincristine Vinblastine Paclitaxel
797
818
12
179
3
143
712
704
334
711


Web server

In this study, we have implemented the model in the form of a user-friendly webserver for predicting the anti-cancerous activity potential of a naturally occurring compound derived from plants. The webserver, named NPred, is available at http://smagarwal.in/npred/. Its front end has been developed using PHP and HTML, and it is hosted in a Linux environment. The user can either paste or upload a list of compounds in the SMILES format for virtual screening. Once the structures are provided, the result is generated by clicking the submit button. At the back end, the server generates the PubChem fingerprints using the PaDEL package and predicts the probability score using the random forest algorithm implemented in the WEKA package. The output is then presented on a separate HTML page, which shows the classification of each queried compound as either inhibitor or non-inhibitor along with a probability score. A screenshot of the webserver and the results page is also presented for better understanding (Fig. 1). Additionally, we have provided a link through which the complete datasets, including the training, test and validation datasets, can be downloaded. Although it is a requisite to compare the performance of a newly developed model with an existing one, it is not possible for us to do so as, to our knowledge, no such previous study has been undertaken using such a large dataset. In recent years it has been indicated that models which are made available for use as open access tools or webservers will enhance rapid development.25 Therefore, we have made efforts to make this server available so that it can assist investigators in scanning small molecule libraries and identifying structures of interest to biologists.
Fig. 1 Snapshot of the NPred webserver along with an example and the options for pasting the molecules or uploading the files. It also shows the result page displaying the probability scores and classifications of the queried compounds.

Conclusion

In recent years, the interest of pharma companies as well as researchers has shifted towards naturally occurring or naturally derived inhibitors in their quest for new chemical entities (NCEs). Also, it is known that a large majority of drugs that are used in cancer treatment are derived from natural products, so there is an inclination towards identifying new plant based natural molecules as anticancer agents. Thus, in the present study efforts have been made to develop a classification model using the largest available dataset of the chemical structures of natural inhibitors of plant origin and their anti-cancerous activity data. We have used our previously published dataset of diverse molecules from the NPACT database to develop a robust and accurate prediction model using PubChem fingerprints. We find that the random forest model with 881 PubChem fingerprints exhibits the best performance. In addition, analysis of the fingerprints shows that the PubChem fingerprints FP797, FP818, FP12, FP179, FP3, FP143, FP712, FP704, FP334 and FP711 are more prevalent in the active inhibitor dataset than in the inactive dataset. Further analysis revealed that these fingerprints are also present in the therapeutic natural product drugs vincristine, vinblastine and paclitaxel, signifying the importance of these fingerprints in anti-cancerous activity. Additionally, a freely available web server named NPred (http://smagarwal.in/npred/) has been designed for the prediction of plant based natural products as anti-cancer inhibitors. Overall, this study will be useful in designing novel anti-cancerous molecules as it will provide prior information regarding their inhibition potential.

References

  1. G. M. Cragg and D. J. Newman, Biochim. Biophys. Acta, 2013, 1830, 3670–3695.
  2. J. W. Li and J. C. Vederas, Science, 2009, 325, 161–165.
  3. D. J. Newman and G. M. Cragg, J. Nat. Prod., 2012, 75, 311–335.
  4. M. Mangal, P. Sagar, H. Singh, G. P. Raghava and S. M. Agarwal, Nucleic Acids Res., 2013, 41, D1124–D1129.
  5. A. L. Harvey, R. Edrada-Ebel and R. J. Quinn, Nat. Rev. Drug Discovery, 2015, 14, 111–129.
  6. I. S. Yadav, P. P. Nandekar, S. Srivastavaa, A. Sangamwar, A. Chaudhury and S. M. Agarwal, Gene, 2014, 539, 82–90.
  7. J. S. Chauhan, S. K. Dhanda, D. Singla, S. M. Agarwal and G. P. Raghava, PLoS One, 2014, 9, e101079.
  8. H. Singh, S. Singh, D. Singla, S. M. Agarwal and G. P. Raghava, Biol. Direct, 2015, 10, 10.
  9. I. M. Kapetanovic, Chem.-Biol. Interact., 2008, 171, 165–176.
  10. A. Mohamed, C. H. Nguyen and H. Mamitsuka, Briefings Bioinf., 2016, 17, 309–321.
  11. Z. Xiao, S. L. Morris-Natschke and K. H. Lee, Med. Res. Rev., 2016, 36, 32–91.
  12. M. Mangal, M. I. Khan and S. M. Agarwal, Anti-Cancer Agents Med. Chem., 2016, 16, 138–159.
  13. V. D. Mouchlis, G. Melagraki, T. Mavromoustakos, G. Kollias and A. Afantitis, J. Chem. Inf. Model., 2012, 52, 711–723.
  14. M. Akhtar and P. V. Bharatam, Chem. Biol. Drug Des., 2012, 79, 560–571.
  15. G. Melagraki and A. Afantitis, Curr. Top. Med. Chem., 2015, 15, 1827–1836.
  16. G. Melagraki and A. Afantitis, RSC Adv., 2014, 4, 50713–50725.
  17. C. Y. Shao, B. H. Su, Y. S. Tu, C. Lin, O. A. Lin and Y. J. Tseng, Bioinformatics, 2015, 31, 1869–1871.
  18. H. Gonzalez-Diaz, C. R. Munteanu, L. Postelnicu, F. Prado-Prado, M. Gestal and A. Pazos, Mol. BioSyst., 2012, 8, 851–862.
  19. C. W. Yap, J. Comput. Chem., 2011, 32, 1466–1474.
  20. E. Frank, M. Hall, L. Trigg, G. Holmes and I. H. Witten, Bioinformatics, 2004, 20, 2479–2481.
  21. S. Zhang, A. Golbraikh, S. Oloff, H. Kohn and A. Tropsha, J. Chem. Inf. Model., 2006, 46, 1984–1995.
  22. A. Afantitis, G. Melagraki, P. A. Koutentis, H. Sarimveis and G. Kollias, Eur. J. Med. Chem., 2011, 46, 497–508.
  23. M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kotter, T. Meinl, P. Ohl, C. Sieb, K. Thiel and B. Wiswedel, The Konstanz Information Miner, Studies in Classification, Data Analysis, and Knowledge Organization, ed. C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker, GfKl: Springer, 2007, pp. 319–326.
  24. G. Melagraki, A. Afantitis, H. Sarimveis, O. Igglessi-Markopoulou, P. A. Koutentis and G. Kollias, Chem. Biol. Drug Des., 2010, 76, 397–406.
  25. I. V. Tetko, J. Comput.-Aided Mol. Des., 2012, 26, 135–136.

Footnote

Electronic supplementary information (ESI) available. See DOI: 10.1039/c6ra02772e

This journal is © The Royal Society of Chemistry 2016