Open Access Article
Yayuan Peng, Jiye Wang, Zengrui Wu,* Lulu Zheng, Biting Wang, Guixia Liu, Weihua Li and Yun Tang*
Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China. E-mail: ytang234@ecust.edu.cn; zengruiwu@ecust.edu.cn
First published on 25th January 2022
Drug–target interaction (DTI) plays a central role in drug discovery, and how to predict DTIs quickly and accurately is a key issue. Traditional structure-based and ligand-based methods have inherent deficiencies; hence, it is necessary to develop a new method for DTI prediction that relies neither on crystal structures of protein targets nor on the quantity and diversity of known ligands. In this study, we collected 40 898 DTIs with Kd values from ChEMBL 27 to develop a prediction method. Through data standardization, SMOTE sampling and pipeline techniques, the Morgan-PSSM-SVM model (MPSM-DTI) was demonstrated to be the best of 30 models in both ten-fold cross-validation (F1 = 85.55 ± 0.46%, R = 84.89 ± 0.62% and P = 86.24 ± 0.81%) and test set validation (F1 = 85.11%, R = 84.34% and P = 85.90%). The results on two external validation sets indicated that the MPSM-DTI model had satisfactory generalization capability and could be used for target prediction of new compounds. Specifically, the F1, P and R values were 83.27%, 85.21% and 81.41% in external validation set 1 and 86.45%, 87.50% and 85.42% in external validation set 2. Using the latest literature evidence, we collected 100 new DTIs of eight GPCR targets to demonstrate that MPSM-DTI could predict compounds for protein targets without known ligands or crystal structures. Compared with other DTI prediction methods, our method reached considerable accuracy and addressed the dilemma of DTI prediction for brand-new protein targets. Furthermore, we proposed the pipeline encapsulation technique, which avoids data leakage and improves the generalization ability of the model. The source code of the method is available at https://github.com/pengyayuan/MPSM-DTI.
The traditional methods for DTI prediction are mainly divided into two categories:1 structure-based and ligand-based. In structure-based methods, molecular docking tools are widely used to find new ligands for a protein with a three-dimensional (3D) structure, or to identify new protein targets with 3D structures for a known drug. In ligand-based methods, pharmacophore search and similarity search in 3D shapes, substructures and physicochemical properties are usually employed.2–5 Although these traditional methods have succeeded in many cases, they still have inherent deficiencies. For structure-based methods, 3D structures of targets are a must. However, most potential targets have no known 3D structures yet; for example, only 60 of the roughly 800 GPCRs (G protein-coupled receptors) have been structurally determined, which means that structure-based methods cannot be applied directly to targets without 3D structures.6,7 For ligand-based methods, it is impossible to search for new ligands of targets that have no known ligands. Therefore, it is urgent to develop novel methods for DTI prediction.8
Recently, a new class of methods, network-based methods, has been developed for DTI prediction. These methods do not rely on the 3D structures of targets. Instead, they utilize large numbers of known DTIs together with chemogenomic data to construct a DTI network for the prediction of potential DTIs. For example, Wu et al. developed the network-based inference methods SDTNBI and bSDTNBI to predict new DTIs by introducing ligand substructure information into a known DTI network, which can be applied to target prediction for new chemical entities outside the DTI network.9–12 However, these methods cannot find potential ligands for new targets outside the network.
Meanwhile, machine learning methods have also been used in DTI prediction. For example, Lee et al. extracted local residue patterns from protein sequences to predict novel DTIs using a convolutional neural network,13 which showed that protein sequences can offer useful information for DTI prediction. Mahmud et al. developed the iDTi-CSsmoteB webserver to predict DTIs based on PubChem fingerprints and various protein sequence features using XGBoost and oversampling techniques.14 However, the data quality of the above methods was unsatisfactory because the negative data were selected arbitrarily, as in several other studies.15–17 Some of them used random non-positive DTIs as negative samples, but non-positive DTIs are not necessarily negative; they simply have not been validated yet, and some might prove positive after validation. Therefore, it is important to construct predictive models using high-quality data.
In this study, we developed a machine learning model for DTI prediction using chemical structures and protein sequences as features. The pipeline technique was used to encapsulate feature standardization, SMOTE sampling and the machine learning estimator, which avoids overfitting and improves model generalization. The whole workflow is shown in Fig. 1. In brief, over 40 000 DTIs with dissociation constant (Kd) values were collected from various sources. Five types of molecular fingerprints and descriptors were calculated with PaDEL-Descriptor and RDKit. The protein sequence features were extracted through PSI-BLAST and the POSSUM toolkit. Thirty prediction models were built from five machine learning methods and six feature representation approaches, among which the Morgan-PSSM-SVM model (MPSM-DTI) was validated as the best one. In case studies, the MPSM-DTI model exhibited satisfactory capability in DTI prediction.
Fig. 1 Overview of the workflow to construct the prediction model, including data preparation, feature extraction and model construction.
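As an illustration of the feature extraction step described above, the sketch below pairs an RDKit Morgan fingerprint with a precomputed PSSM-based protein vector (e.g. one exported from the POSSUM toolkit). The function names, the 1024-bit fingerprint size and the 400-dimensional placeholder vector are assumptions for this example, not taken from the published code.

```python
# A minimal sketch of the drug-target feature representation, assuming the
# PSSM-based protein descriptors were generated beforehand (e.g. with
# PSI-BLAST and POSSUM) and stored as fixed-length vectors.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles, radius=2, n_bits=1024):
    """Compute a binary Morgan fingerprint for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(list(fp), dtype=np.int8)

def dti_feature(smiles, pssm_vector):
    """Concatenate the drug fingerprint with the protein PSSM feature vector."""
    return np.concatenate([morgan_fingerprint(smiles), pssm_vector])

# One drug-target pair as a single feature vector; the 400-dimensional
# zero vector stands in for a real PSSM descriptor.
pair = dti_feature("CC(=O)Oc1ccccc1C(=O)O", np.zeros(400))
print(pair.shape)  # (1424,)
```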
The SMILES of all drugs were imported into Pipeline Pilot Client (version 2017 R2) to remove chemicals with incorrect structures, followed by a series of steps including removing salts and inorganics, standardizing SMILES and discarding molecules with molecular weight >1200 Da or <200 Da. Duplicated data were then removed. To ensure clean data, ambiguous DTIs, i.e. interactions reported as both positive and negative, were also removed. For proteins, if the protein sequence was not available in UniProt, the corresponding interactions were deleted as well. After that, the whole data set was divided into a training set and a test set in a ratio of 8 : 2.
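The published cleaning was performed in Pipeline Pilot; the sketch below is only a rough open-source approximation of the same steps with RDKit, pandas and scikit-learn. The toy DataFrame, column names and random seed are assumptions, not the original ChEMBL export.

```python
# Illustrative cleaning and splitting steps: drop invalid structures, strip
# salts, apply the 200-1200 Da molecular-weight filter, deduplicate, remove
# ambiguous pairs and split 8:2.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.SaltRemover import SaltRemover
from sklearn.model_selection import train_test_split

dti = pd.DataFrame({
    "smiles": [
        "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1",  # imatinib
        "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1",           # celecoxib
        "CC(=O)Oc1ccccc1C(=O)O.[Na+]",                                  # aspirin sodium salt
        "not_a_smiles",                                                 # invalid structure
    ],
    "uniprot_id": ["P00519", "P35354", "P23219", "P00533"],
    "label": [1, 1, 0, 1],
})

remover = SaltRemover()

def standardize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                      # drop molecules with invalid structures
    mol = remover.StripMol(mol)          # strip salts / counter-ions
    if not 200 <= Descriptors.MolWt(mol) <= 1200:
        return None                      # molecular-weight filter
    return Chem.MolToSmiles(mol)         # canonical SMILES for deduplication

dti["smiles"] = dti["smiles"].map(standardize)
dti = dti.dropna(subset=["smiles"]).drop_duplicates(["smiles", "uniprot_id", "label"])

# Remove ambiguous pairs reported as both positive and negative
n_labels = dti.groupby(["smiles", "uniprot_id"])["label"].transform("nunique")
dti = dti[n_labels == 1]

# Random 8:2 split into training and test sets
train, test = train_test_split(dti, test_size=0.2, random_state=42)
```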
Data for external validation set 1 were gathered from BindingDB (accessed in June 2020)22 and the IUPHAR/BPS Guide to PHARMACOLOGY (accessed in June 2020).23 All data were prepared in the same way as those in the training and test sets, and duplicates with the training and test sets were removed to keep external validation set 1 independent.
To evaluate the capability of predicting targets for new compounds, external validation set 2 was prepared, in which the DTIs did not duplicate those in the training and test sets and the compounds were brand new. Furthermore, to verify whether the model could predict compounds for entirely new targets, experimentally validated DTIs were gathered from recent publications, in which the proteins were completely new compared with those in the training and test sets.
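To keep the external sets independent, one possible implementation is to drop every drug–target pair that already occurs in the training or test data; the small helper below follows the illustrative column names used in the previous sketch.

```python
# A hypothetical helper for removing overlap between an external set and the
# training/test data (column names follow the earlier illustrative sketch).
import pandas as pd

def drop_overlap(external: pd.DataFrame, *known_sets: pd.DataFrame) -> pd.DataFrame:
    known_pairs = set()
    for df in known_sets:
        known_pairs.update(zip(df["smiles"], df["uniprot_id"]))
    mask = [pair not in known_pairs
            for pair in zip(external["smiles"], external["uniprot_id"])]
    return external[mask]

# Usage (DataFrames assumed): external_set1 = drop_overlap(external_set1, train, test)
```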
The external validation sets are independent of the training set and test set. The external validation data were divided into four different groups to assess the ability of the classifiers to predict new DTIs for new compounds and new proteins. The statistics of the DTI samples are shown in Table 1.
| Data set | Nd | NT | NP | NN | Total samples | Data sources |
|---|---|---|---|---|---|---|
| Training set | 7445 | 888 | 13 858 | 18 859 | 32 717 | ChEMBL |
| Test set | 2268 | 720 | 3641 | 4719 | 8180 | ChEMBL |
| External validation set 1 | 987 | 625 | 1152 | 869 | 2021 | BindingDB, IUPHAR/BPS Guide to PHARMACOLOGY |
| External validation set 2 | 853 | 604 | 1014 | 818 | 1832 | BindingDB, IUPHAR/BPS Guide to PHARMACOLOGY |

Nd: number of drugs; NT: number of targets; NP: number of positive DTI samples; NN: number of negative DTI samples.
The following performance metrics were used to assess each prediction model: precision (P), recall (R) and F1, defined as follows:
P = TP / (TP + FP) (1)

R = TP / (TP + FN) (2)

F1 = 2 × P × R / (P + R) (3)

where TP, FP and FN denote the numbers of true positive, false positive and false negative predictions, respectively.
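These three indicators can be computed directly with scikit-learn, as in the short example below (the label arrays are toy data, not results from this study).

```python
# Computing F1, precision and recall for a binary classifier's predictions.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # experimentally known labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (toy data)

print(f"P  = {precision_score(y_true, y_pred):.4f}")  # TP / (TP + FP)
print(f"R  = {recall_score(y_true, y_pred):.4f}")     # TP / (TP + FN)
print(f"F1 = {f1_score(y_true, y_pred):.4f}")         # 2*P*R / (P + R)
```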
After data preparation, 40 898 DTI samples with Kd values were obtained in total, among which the numbers of positive and negative DTIs were 17 320 and 23 578, respectively. The ratio of negative to positive data is approximately 1.36 : 1, so the data set is somewhat imbalanced. All DTIs were then split randomly into a training set and a test set in a ratio of 8 : 2. The training set contained 7445 drugs, 888 targets, 13 858 positive interactions and 18 859 negative interactions. The test set contained 2268 drugs, 720 targets, 3641 positive interactions and 4719 negative interactions. To better evaluate the model's generalization, we gathered 2021 DTIs with Kd values from BindingDB and the IUPHAR/BPS Guide to PHARMACOLOGY to serve as external validation set 1, which contained 1152 positives and 869 negatives. External validation set 2 comprised 1832 DTIs with Kd values, of which 1014 were positive and 818 negative. The details of all the data sets are summarized in Table 1.
In addition, 100 DTIs collected from the latest literature were used in the case study and are summarized in Table S1,† covering eight functional GPCR proteins involved in several principal biological pathways and complex diseases.
Through the pipeline approach, feature standardization, SMOTE sampling and the machine learning estimator were encapsulated into a single estimator. The advantage of the pipeline approach is that it avoids data leakage. Fig. 2 shows the ΔF1, ΔP and ΔR values of the 30 models with and without pipeline encapsulation, where ΔF1, ΔP and ΔR stand for the differences in F1, P and R between test set validation and ten-fold cross-validation. From Fig. 2, we can see that all the ΔF1, ΔP and ΔR values of models without the pipeline strategy were much larger than those with it. Larger ΔF1, ΔP and ΔR values indicate that a model achieves better performance in ten-fold cross-validation but poorer performance in test set validation, i.e. overfitting. Therefore, the pipeline strategy is effective in avoiding data leakage and overfitting.
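A minimal sketch of this pipeline encapsulation with imbalanced-learn and scikit-learn is shown below: the scaler, SMOTE and SVM are wrapped into one estimator, so that during cross-validation both scaling and resampling are fitted only on the training folds and never see the held-out fold. The feature matrix, labels and hyperparameters here are placeholders, not the published settings.

```python
# Pipeline encapsulation of standardization, SMOTE and the SVM classifier.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))            # stand-in for Morgan + PSSM features
y = (rng.random(300) < 0.4).astype(int)   # imbalanced toy labels

model = Pipeline([
    ("scale", StandardScaler()),   # fitted on training folds only
    ("smote", SMOTE(random_state=0)),  # oversampling applied only at fit time
    ("svm", SVC(kernel="rbf", C=1.0)),
])

scores = cross_val_score(model, X, y, cv=10, scoring="f1")
print(f"10-fold CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the sampler and scaler live inside the pipeline, `cross_val_score` re-fits them within each fold, which is exactly what prevents information from the validation fold leaking into the training procedure.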
As for the chemical features, Fig. 3 shows that models with FP4-PSSM performed worse than the others, indicating that FP4 could not represent chemical structures well. Meanwhile, models with Descriptor-PSSM, KR-PSSM, MACCS-PSSM, Morgan-PSSM and PubChem-PSSM exhibited comparable performance, with Morgan-PSSM performing slightly better. In particular, the Morgan-PSSM-SVM model (MPSM-DTI) performed the best among all 30 models, with ten-fold cross-validation results of F1 = 85.55 ± 0.46%, R = 84.89 ± 0.62% and P = 86.24 ± 0.81%.
Besides ten-fold cross-validation, test set validation was also employed to compare the different models. Fig. 4 displays the test set validation results, and the detailed values of the evaluation indicators for all 30 models are shown in Table S5.† The results of test set validation were similar to those of ten-fold cross-validation, and the F1, R and P scores of most models again exceeded 80%. From Fig. 4, we can see that all SVM models performed better than the Bagging, DT, GBDT and k-NN models. Furthermore, MPSM-DTI was again the best of the 30 models, with test set validation results of F1 = 85.11%, R = 84.34% and P = 85.90%.
Before the assessment, to check whether the external data sets lay within the applicability domain of the model, PCA was performed to reduce the dimensionality of the chemical and protein features of all four data sets into a 3D chemical space. As shown in Fig. 5A, the feature distributions of the four data sets overlapped well in the 3D space after PCA dimensionality reduction, indicating that the two external data sets were suitable for assessing the generalization capability of the model. The evaluation results of the four data sets with the MPSM-DTI model are shown in Fig. 5B and Table 2. From Fig. 5B, we can see that external validation sets 1 and 2 achieved results quite similar to those of ten-fold cross-validation and test set validation. Specifically, for external validation set 1, the F1, P and R scores were 83.27%, 85.21% and 81.41%, respectively, while for external validation set 2 the three evaluation indicators were somewhat higher, with F1 = 86.45%, P = 87.50% and R = 85.42%.
| Evaluation indicator (%) | Ten-fold cross-validation | Test set validation | External validation set 1 | External validation set 2 |
|---|---|---|---|---|
| F1 | 85.55 ± 0.46 | 85.11 | 83.27 | 86.45 |
| P | 86.24 ± 0.81 | 85.90 | 85.21 | 87.50 |
| R | 84.89 ± 0.62 | 84.34 | 81.41 | 85.42 |
From the above analysis, it is clear that the MPSM-DTI model achieved good generalization capability on two different external sets. Moreover, the results on external validation set 2 show that the model performed well on known DTIs involving new compounds.
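As a sketch of the PCA-based applicability-domain check mentioned above, the snippet below projects the drug–protein feature matrices of the four data sets onto three principal components so their distributions can be compared in 3D. The random matrices, their dimensions and the choice to fit the projection on all four sets stacked together are illustrative assumptions, not the published procedure.

```python
# PCA projection of the four data sets into a shared 3D chemical space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
feature_sets = {
    "training set": rng.normal(size=(500, 1424)),
    "test set": rng.normal(size=(125, 1424)),
    "external validation set 1": rng.normal(size=(100, 1424)),
    "external validation set 2": rng.normal(size=(90, 1424)),
}

# Fit the scaler and 3-component PCA on all sets stacked together,
# then project each set separately for a 3D scatter comparison.
stacked = np.vstack(list(feature_sets.values()))
scaler = StandardScaler().fit(stacked)
pca = PCA(n_components=3).fit(scaler.transform(stacked))

for name, X in feature_sets.items():
    coords = pca.transform(scaler.transform(X))
    print(name, coords.shape)  # (n_samples, 3) coordinates for plotting
```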
Table 3 summarizes the predictive results of the MPSM-DTI model for the eight new GPCR targets. The MPSM-DTI model achieved a high recall, with 90 correct predictions among the 100 experimentally validated DTIs. Fig. 6 illustrates the results as DTI networks: all the DTIs of DHCR7, HTR1F, LTB4R, GPER1 and PTGIR were predicted correctly, while a small portion of the DTIs of CYSLTR2, GRIK3 and S1PR5 were predicted incorrectly. The detailed prediction results are listed in Table S1† and the SMILES of all compounds are shown in Table S2.†
| No. | Target name | Correctly predicted DTIs | Total DTIs | Recall |
|---|---|---|---|---|
| 1 | DHCR7 | 7 | 7 | 100% |
| 2 | HTR1F | 14 | 14 | 100% |
| 3 | LTB4R | 14 | 14 | 100% |
| 4 | CYSLTR2 | 11 | 15 | 73.33% |
| 5 | GRIK3 | 5 | 8 | 62.50% |
| 6 | GPER1 | 10 | 10 | 100% |
| 7 | PTGIR | 16 | 16 | 100% |
| 8 | S1PR5 | 12 | 15 | 80.00% |
In comparison with other similar models, the MPSM-DTI model has several advantages. First, data quality was ensured by gathering first-hand DTI data, whereas in some reported DTI prediction models the threshold used to discriminate positive from negative DTI data was often inappropriate, and unconfirmed interactions were sometimes treated as negative data, which can lead to inaccurate models and misleading predictions.14,17 Second, the MPSM-DTI model can predict targets for new compounds outside the DTI network; the results on external validation set 2 show that it correctly predicted potential targets for brand-new compounds. Third, the MPSM-DTI model can predict compounds for new targets outside the DTI network; in the case study, it correctly predicted 90 percent of the DTIs for the eight new GPCR targets, a relatively strong performance. In principle, our MPSM-DTI model can predict potential ligands for any new target as long as its sequence is available.
At present, there are several published methods for the prediction of DTIs, such as SwissTarget,3 SDTNBI,27 bSDTNBI,27 and ChemMapper.5 These methods are widely used as free webservers in drug discovery. SwissTarget is a ligand-based method for target prediction, built on a combination of 2D and 3D similarity against a library of 370 000 known actives.3 ChemMapper is also a ligand-based approach, based on the concept that compounds sharing high 3D similarity may have similar target association profiles.4,5 SDTNBI and bSDTNBI are two network-based methods for target prediction. SDTNBI uses network-based inference to recommend targets for compounds, relying on resource propagation over the substructure–drug–target network,9 while bSDTNBI is an upgraded version of SDTNBI with three additional parameters to adjust the network weights.10 The SDTNBI and bSDTNBI methods can be accessed via the NetInfer webserver (http://lmmd.ecust.edu.cn/netinfer/).27
Compared with these published methods, our MPSM-DTI model showed better prediction accuracy, with a higher recall rate than the others. Fig. 7 displays the prediction results of these methods, including MPSM-DTI, SwissTarget, SDTNBI (top 20), SDTNBI (top 50), bSDTNBI (top 20), bSDTNBI (top 50) and ChemMapper. All predictions were performed through the corresponding webservers, and the correct and false recall numbers were counted with Python scripts. From Fig. 7, MPSM-DTI achieved the best results, with 90% correctness for the aforementioned eight GPCR targets; SwissTarget ranked second with 61% correctness. bSDTNBI outperformed SDTNBI, in line with previous studies,7,10,20 and whether the top 20 or top 50 predictions of bSDTNBI were considered did not greatly affect the results. The detailed prediction results of these methods for each DTI are shown in Table S1,† and the SMILES of the 100 compounds are listed in Table S2.† Readers interested in any of these targets can use those data in their own studies.
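For readers who want to reproduce such tallies, the small pandas sketch below shows how per-target correct/total counts and recall (as reported in Table 3) can be computed from a table of predictions; the column names and example rows are hypothetical, not the files in the ESI.

```python
# Counting correct predictions and recall per target from a prediction table.
import pandas as pd

preds = pd.DataFrame({
    "target": ["DHCR7", "DHCR7", "HTR1F", "CYSLTR2", "CYSLTR2"],
    "predicted_positive": [True, True, True, True, False],  # all rows are known true DTIs
})

summary = (
    preds.groupby("target")["predicted_positive"]
    .agg(correct="sum", total="count")                     # correct and total DTIs per target
    .assign(recall=lambda df: 100 * df["correct"] / df["total"])
)
print(summary)
```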
The MPSM-DTI model exhibits two further advantages. First, MPSM-DTI can predict potential ligands for new protein targets, especially those without crystal structures or known ligands, which the other methods cannot do. Second, MPSM-DTI runs very fast and needs only a few seconds, whereas ChemMapper takes much longer (usually more than 24 hours) because it identifies potential compounds via 3D similarity calculations, and SwissTarget takes 5–10 minutes after a query for a small molecule is submitted.
Nevertheless, there is still room to improve MPSM-DTI. For example, we did not use deep learning methods to construct the model, because we lack sufficiently large DTI data sets and the computational resources needed for large-scale training. At present, deep learning does not improve model performance in our setting but consumes far more computational resources than ordinary machine learning methods. Meanwhile, a webserver would make the model easier for others to use, for instance for virtual screening or lead discovery against targets without known ligands or crystal structures. Even so, we believe MPSM-DTI could be of significant value for drug discovery and development.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1dd00011j
This journal is © The Royal Society of Chemistry 2022