Open Access Article
Xiao Li
*ab,
Yaojie Chena,
Xinrui Songa,
Yuan Zhanga,
Huanhuan Lia and
Yong Zhao*ab
aBeijing Beike Deyuan Bio-Pharm Technology Co. Ltd., 7 Fengxian road, Beijing 100094, China. E-mail: lixiao@bcc.ac.cn; lixiao1688@163.com; Tel: +86-10-5934-1890
bBeijing Key Laboratory of Cloud Computing Key Technology and Application, Beijing Computing Center, Beijing Academy of Science and Technology, 7 Fengxian road, Beijing 100094, China. E-mail: zhaoyong@bcc.ac.cn; Fax: +86-10-5934-1855; Tel: +86-10-5934-1764
First published on 20th February 2018
Drug-induced liver injury (DILI), caused by drugs, herbal agents or nutritional supplements, is a major issue for patients and the pharmaceutical industry. It has been a leading cause of clinical trials failure and withdrawal of FDA approval. In this research, we focused on in silico estimation of chemical DILI potential on humans based on structurally diverse organic chemicals. We developed a series of binary classification models using five different machine learning methods and eight different feature reduction methods. The model, developed with the support vector machine (SVM) and the MACCS fingerprint, performed best both on the test set and external validation. It achieved a prediction accuracy of 80.39% on the test set and 82.78% on external validation. We made this model available at http://opensource.vslead.com/. The user can freely predict the DILI potential of molecules. Furthermore, we analyzed the difference of distributions of 12 key physical–chemical properties between DILI-positive and DILI-negative compounds and 20 privileged substructures responsible for DILI were identified from the Klekota–Roth fingerprint. Moreover, since traditional Chinese medicine (TCM)-induced liver injury is also one of the major concerns among the toxic effects, we evaluated the DILI potential of TCM ingredients using the MACCS_SVM model developed in this study. We hope the model and privileged substructures could be useful complementary tools for chemical DILI evaluation.
There is a great deal of interest worldwide in developing fast and accurate experimental and computational approaches to evaluate the risk of DILI.2,15 Since biological and chemical experimentations are too time-consuming and expensive, in silico techniques, such as quantitative structure–activity relationship (QSAR), have been widely used to reduce the cost of the chemical risk assessment. Compared with experimental methods, QSAR models are applicable to virtual molecules even before they are isolated or synthesized.16 Cheng et al.17 reported the first QSAR model for DILI prediction in 2003. They built a training set containing 382 drug and drug-like compounds with dose–response data and developed a classifier with recursive partitioning trees. The final models were applied to a set of 54 compounds collected after the models were created. In total, 81% of the compounds were classified correctly by the ensemble method. In 2014, Chen et al.18 and Ekins19 reviewed the QSAR-based models for human hepatotoxicity. These models were trained with very different toxicity data using various modeling approaches.15,20–26 Several models were developed in the next few years. In 2016, Mulliner et al.27 developed a systematic approach to construct interrelated models for hepatotoxicity with a general applicability scope from a repository of 3712 compounds and associated human and animal hepatotoxicity data. To our knowledge, this is the largest existing dataset for DILI modeling. Zhang et al.28 reported an excellent study focused on predicting the risk of DILI in humans. They developed the models using a comprehensive data set containing 1317 compounds (785 DILI-positive compounds and 532 DILI-negative compounds). In total, 88 compounds collected from a benchmark DILI database-Liver Toxicity Knowledge Base (LTKB) were used for model validation and some key substructure patterns correlated with drug-induced liver toxicity were also identified as structural alerts. More recently, Kotsampasakou et al.1 presented a DILI prediction model generated with Random Forest and 2D molecular descriptors on a dataset of 966 compounds. They considered the quality of the training data and carefully curated the datasets for DILI both with respect to the chemical structures and for their class labels (DILI positive, DILI negative). Most of the published QSAR models suffer from low statistical performance or small data sets and the usefulness of these models was restricted because of poor availability.29–31 Moreover, considering the species specificity in chemical toxicity between rodent animals and human beings, it should be more useful to develop the models based on reliable data on human DILI.
In the present study, we focused on the following tasks: (1) the development of human DILI models based on structurally diverse organic chemicals; (2) the analysis of the difference in structural characteristics between the DILI positive and negative chemicals; (3) the application of the QSAR model to predict the DILI potential of TCM components.
:
1. The training set contained 1731 compounds and the test set contained 413 compounds. In order to further evaluate the predictive ability of the models, an external validation set were extracted from the study reported by Ivanov et al.32 After pretreatment and removal of duplicates, the external validation set contained 151 compounds. Our datasets for model training and validation were larger and more structurally diverse than most of the previous studies.
Kow (A
log
P), molecular solubility (log
S), the number of hydrogen bond acceptors (nHBA), the number of hydrogen bond donors (nHBD), log
D, molecular surface area (MSA), molecular polar surface area (MPSA), molecular fractional polar surface area (MFPSA), the number of rotatable bonds (nRTB), the number of rings (nR) and the number of aromatic rings (nAR). These properties formed a set of molecular descriptors and were used as a feature reduction method for model building.
Recently, molecular fingerprints have also been widely used in chemical toxicity prediction because the fingerprint sequences perform effectively on symbolizing chemical fragments and the procedure is highly efficient.36–38 Herein, we calculated seven types of commonly used fingerprints, including the Estate fingerprint (Estate, 79 bits), CDK fingerprint (FP, 1024 bits), CDK extended fingerprint (Extended, 1024 bits), Klekota–Roth fingerprint (KRFP, 4860 bits), MACCS keys (MACCS, 166 bits), PubChem fingerprint (PubChem, 881 bits) and Substructure fingerprint (SubFP, 307 bits).
All the descriptors and fingerprints were calculated by the open source software package PaDEL-Descriptor,38 which has both a graphical user interface and command line interfaces. PaDEL-Descriptor can work on all major platforms (Windows, Linux, MacOS), and supports more than 90 different molecular file formats. It is a useful addition to the currently available molecular descriptor calculation software and it has been commonly used in drug discovery.
SVM is an excellent kernel-based tool for classification and regression introduced by Vapnik et al.39 The purpose of SVM training is to find an optimal hyperplane, which could discriminate molecules from different categories.28 SVM maps the input vectors into a higher dimensional feature space by using a kernel function.23 Herein, we used the Gaussian radial basis function (RBF) kernel. The parameters C and g for RBF kernel were sought with a grid search method based on 5-fold cross-validation.
The kNN algorithm is a method for classifying objects based on the closest training examples in the feature space. It is a type of instance-based learning or lazy learning, where the function is only approximated locally and all computation is deferred until classification. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors. In this study, the closeness was measured by Euclidean distance metrics and the parameter of k = 5 was used.
The NB classifier is based on Bayes' theorem with independent assumptions between predictors. A NB model is easy to build, with no complicated iterative parameter estimations, which makes it particularly useful for very large datasets.
C4.5 DT is an algorithm developed by Ross Quinlan that generates Decision Trees (DT), which can be used for classification problems. It improves Quinlan's earlier ID3 algorithm by dealing with both continuous and discrete attributes, missing values and pruning trees after construction. RF is an ensemble learning method for classification and regression that is operated by constructing a multitude of decision trees at training time. The mode of the classes output by individual trees is taken as the overall output.
The parameters of C4.5 DT, RF and NB algorithms were default in the Orange toolbox.
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
In addition, we plotted the receiver operating characteristic (ROC) curve, which was used to graphically present the model behavior. The ROC curve can show the separation ability of a binary classification model by iteratively setting the possible threshold of classification.36 The values of the area under the ROC curves (AUC) were also computed. AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. The principle is that if the plot has a surface area of 1, the classifier is perfect and if the area equals 0.5, the classifier is useless and random.34
We identified the privileged substructures responsible for DILI with substructure fragment analysis50 methods based on KRFP fingerprints. If a substructure presented more frequently in DILI-positive chemicals than DILI-negative chemicals, this substructure would be regarded as a privileged substructure. The frequency of a substructure is defined as follows:
![]() | (5) |
| Data sets | DILI-positive | DILI-negative | Total |
|---|---|---|---|
| Training set | 980 | 751 | 1731 |
| Test set | 270 | 143 | 413 |
| External validation set | 88 | 63 | 151 |
| Total | 1338 | 957 | 2295 |
It is well known that the diversity of chemical structures is a key issue for global model building. QSAR models based on a relatively small dataset or homologous compounds always resulted in poor generalization abilities. In this study, we applied the radar chart to explore the chemical space of the entire data set as shown in Fig. 1. The MW values ranged from 43.07 to 994.19, the nHBA values ranged from 0 to 35, the nHBD values ranged from 0 to 14, the A
log
P values ranged from −12.21 to 18.77, and the log
S values ranged from −20.78 to 4.58. These data suggested that the 2295 compounds in our data set covered a sufficiently large chemical space. We also calculated the Tanimoto similarity index55 based on the ECFP-4 fingerprint, which has been widely used to evaluate similarities among chemicals. The average Tanimoto similarity index was 0.128, indicating that the chemical structures in our data set were evidently diverse. We plotted the heat map of the Tanimoto similarity index of 100 randomly filtered molecules as shown in Fig. 2.
Furthermore, the data sets distributed in the chemical space defined by molecular weight (MW) and Ghose–Crippen log
Kow (A
log
P) were analyzed. The distribution scatter diagram is presented in Fig. 3, which illustrated that the test and validation sets shared a similar chemical space with the training set.
| Models | Q | SE | SP | AUC | MCC |
|---|---|---|---|---|---|
| Estate_kNN | 0.7167 | 0.7333 | 0.6853 | 0.7300 | 0.4047 |
| Estate_SVM | 0.7240 | 0.8667 | 0.4545 | 0.7633 | 0.3556 |
| Estate_RF | 0.7215 | 0.7741 | 0.6224 | 0.7826 | 0.3922 |
| Estate_NB | 0.6925 | 0.7370 | 0.6084 | 0.7334 | 0.3382 |
| Estate_CT | 0.6804 | 0.7074 | 0.6294 | 0.7164 | 0.3259 |
| Extend_NN | 0.7022 | 0.6741 | 0.7552 | 0.7740 | 0.4090 |
| Extend_SVM | 0.7482 | 0.8481 | 0.5594 | 0.8090 | 0.4261 |
| Extend_RF | 0.7264 | 0.6926 | 0.7902 | 0.8309 | 0.4600 |
| Extend_NB | 0.6174 | 0.5333 | 0.7762 | 0.7157 | 0.2978 |
| Extend_CT | 0.6513 | 0.6370 | 0.6783 | 0.6720 | 0.3005 |
| FP_kNN | 0.7119 | 0.6815 | 0.7692 | 0.7808 | 0.4294 |
| FP_SVM | 0.7893 | 0.8667 | 0.6434 | 0.8630 | 0.5247 |
| FP_RF | 0.6877 | 0.6185 | 0.8182 | 0.8401 | 0.4164 |
| FP_NB | 0.6174 | 0.5407 | 0.7622 | 0.7129 | 0.2907 |
| FP_CT | 0.6949 | 0.6963 | 0.6923 | 0.6812 | 0.3726 |
| MACCS_kNN | 0.7651 | 0.7333 | 0.8252 | 0.8243 | 0.5332 |
| MACCS_SVM | 0.8039 | 0.8815 | 0.6573 | 0.8578 | 0.5568 |
| MACCS_RF | 0.6828 | 0.6370 | 0.7692 | 0.8271 | 0.3866 |
| MACCS_NB | 0.6465 | 0.5667 | 0.7972 | 0.7566 | 0.3487 |
| MACCS_CT | 0.6901 | 0.7037 | 0.6643 | 0.7162 | 0.3544 |
| Pubchem_kNN | 0.7240 | 0.7148 | 0.7413 | 0.7794 | 0.4368 |
| Pubchem_SVM | 0.7748 | 0.8444 | 0.6434 | 0.8341 | 0.4957 |
| Pubchem_RF | 0.7264 | 0.6741 | 0.8252 | 0.8036 | 0.4751 |
| Pubchem_NB | 0.6562 | 0.5963 | 0.7692 | 0.7298 | 0.3485 |
| Pubchem_CT | 0.6610 | 0.6519 | 0.6783 | 0.6694 | 0.3151 |
| SubFP_kNN | 0.7191 | 0.7333 | 0.6923 | 0.7756 | 0.4111 |
| SubFP_SVM | 0.7651 | 0.8704 | 0.5664 | 0.8438 | 0.4624 |
| SubFP_RF | 0.7312 | 0.7519 | 0.6923 | 0.8111 | 0.4310 |
| SubFP_NB | 0.6973 | 0.6815 | 0.7273 | 0.7742 | 0.3902 |
| SubFP_CT | 0.7482 | 0.7593 | 0.7273 | 0.7691 | 0.4708 |
| KRFP_kNN | 0.7119 | 0.7074 | 0.7203 | 0.7848 | 0.4099 |
| KRFP_SVM | 0.7821 | 0.8926 | 0.5734 | 0.8234 | 0.5001 |
| KRFP_RF | 0.7119 | 0.7778 | 0.5874 | 0.7934 | 0.3646 |
| KRFP_NB | 0.6780 | 0.6556 | 0.7203 | 0.7298 | 0.3581 |
| KRFP_CT | 0.6634 | 0.6889 | 0.6154 | 0.6619 | 0.2937 |
| MD_kNN | 0.7288 | 0.7444 | 0.6993 | 0.7827 | 0.4294 |
| MD_SVM | 0.7094 | 0.8963 | 0.3566 | 0.7282 | 0.3060 |
| MD_RF | 0.6780 | 0.9667 | 0.1329 | 0.6948 | 0.1884 |
| MD_NB | 0.6247 | 0.7370 | 0.4126 | 0.6397 | 0.1533 |
| MD_CT | 0.6659 | 0.6852 | 0.6294 | 0.6645 | 0.3029 |
MCC is a single performance measure, which is less influenced by imbalanced data. Considering the apparent imbalance between positive and negative data in this study, we paid special attention to the MCC and Q values. According to the values of Q and MCC, ten models (MACCS_SVM, FP_SVM, KRFP_SVM, Pubchem_SVM, MACCS_kNN, SubFP_SVM, Extend_SVM, SubFP_CT, SubFP_RF and MD_kNN) were regarded as top performance models with the best predictive results (with Q > 72.88% and MCC > 0.4294) on the test data.
| Models | Q | SE | SP | AUC | MCC |
|---|---|---|---|---|---|
| MACCS_SVM | 0.8278 | 0.9318 | 0.6825 | 0.8880 | 0.6470 |
| FP_SVM | 0.7483 | 0.8750 | 0.5714 | 0.8303 | 0.4754 |
| KRFP_SVM | 0.7417 | 0.8523 | 0.5873 | 0.8326 | 0.4606 |
| Pubchem_SVM | 0.7881 | 0.9091 | 0.6190 | 0.8289 | 0.5625 |
| MACCS_kNN | 0.7616 | 0.8068 | 0.6984 | 0.8194 | 0.5077 |
| SubFP_SVM | 0.7616 | 0.8864 | 0.5873 | 0.8045 | 0.5044 |
| Extend_SVM | 0.7219 | 0.8523 | 0.5397 | 0.7604 | 0.4174 |
| SubFP_CT | 0.6755 | 0.7159 | 0.6190 | 0.7060 | 0.3342 |
| SubFP_RF | 0.7351 | 0.7841 | 0.6667 | 0.7950 | 0.4530 |
| MD_kNN | 0.6556 | 0.6818 | 0.6190 | 0.7233 | 0.2986 |
It is not surprising that SVM algorithm performed best on DILI model building. It is well known that SVM algorithm is the most suitable tool for small sample, nonlinear and high dimensional pattern problems. With appropriately chosen kernel and corresponding parameters, this algorithm could provide a good out-of-sample generalization even when the training sample has some bias. Moreover, all the features in the fingerprints were saved to avoid the loss of information in our study. The high dimensional samples highlighted the advantages of SVM on high dimensional pattern problems. This also resulted in the SVM displaying a better performance than other machine learning methods for model building. This result is in agreement with our previously published study, which also concluded that the SVM is a good classification algorithm for chemical toxicity prediction.34,50,51,56,57
Eight different patterns, including 7 fingerprint types and a set of molecular descriptors containing 12 key physical–chemical properties, were used for molecular characterization. The fingerprints performed better than the molecular descriptors. Only one top-performing model was developed with molecular descriptors, while 9 top-performing models were developed with fingerprints. The MACCS and SubFP fingerprints accounted for half of the models with the best performance. In particular, the model built with MACCS and SVM achieved the best predictive results both on the test set and external validation. MACCS and SubFP fingerprints contained a great deal of structural information as they were generated based on the well-defined structural fragments dictionary. In consideration of this result, MACCS and SubFP fingerprints were recommended as the preferential attributes for chemical DILI model building.
MW and MSA were regarded as simple estimations of molecular size and complexity. The MW distribution was between 43.07 and 994.19 with a mean of 347.53. The mean values of MW were 343.88 and 352.63 for DILI-positive and DILI-negative compounds, respectively, with a p-value of 0.008. This indicates that DILI-negative chemicals are likely to have higher MW value. MSA was distributed from 92.56 to 722.37 with a mean of 293.68. The mean values of MSA were 329.59 and 279.02 for DILI-positive and DILI-negative compounds, respectively, with a p-value of 0.006. A
log
P and log
D represent the lipophilicity of a compound. In our dataset, A
log
P was distributed between −12.21 and 18.77 with a mean of 2.01 and log
D was distributed between −23.78 and 18.77 with a mean of 1.38. The mean values of A
log
P were 2.03 and 1.99 for DILI-positive and DILI-negative compounds and the mean values of log
D were 1.42 and 1.33 for DILI-positive and DILI-negative compounds, respectively. The p-values between the mean A
log
P and log
D for the DILI-positive and DILI-negative chemicals were 0.160 and 0.145, respectively, indicative of no or low significant difference. Hydrogen bonding ability was commonly represented by nHBA and nHBD. As shown in Fig. 3, the mean values of nHBA were 5.08 and 4.90 for DILI-positive and DILI-negative compounds, respectively, with a p-value of 0.436. The mean values of nHBD were 1.96 and 2.12 for DILI-positive and DILI-negative compounds, respectively. The value of nHBD of chemicals indicated high significant difference of the mean nHBA of DILI-positive and DILI-negative chemicals with the lower p-value of 0.007. Moreover, the values of log
S and value of nR of chemicals indicated a high significant difference of the mean nR of DILI-positive and DILI-negative chemicals with the low p-values of 0.019 and 3.21 × 10−4, respectively. Moreover, the p-values between the mean nR, nAR, MPSA and MFPSA for the DILI-positive and DILI-negative chemicals were 0.408, 0.232, 0.126 and 0.257, respectively, indicative of no or low significant difference.
The results indicated that these physical and chemical properties have a weak differentiating effect on DILI potential although no individual property can be a key factor for chemical DILI potential. In fact, DILI is a complex chemical and biological process comprising numerous steps. It is very difficult to explain the DILI mechanism using individual or several simple chemical descriptors. This should be the primary reason why the models based on MD did not perform well.
In these privileged substructures, six fragments are fluorine-containing groups. The carbon–fluorine bond is metabolically stable in general and fluorine usually acts as a bioisostere of the hydrogen atom. The presence of fluorine atoms always results in an extreme increase of the drug lipophilicity, which could enrich the intracellular concentration of hepatotoxic drugs.28 In total, 11 fragments are amine or nitro derivatives (two substructures, no. 11 and no. 13, are both fluorine-containing groups and amine derivatives). They could bind to proteins through covalent bonds with cysteine residues via Michael addition reactions and result in DILI.58
These identified privileged substructures could be regarded as structural alerts responsible for DILI and we hope that they could be used to predict the DILI potential of new compounds.
679 compounds isolated from 8445 TCM plants. After preparation, 21
518 unique chemicals were extracted. Before the DILI prediction of TCM components, an applicability domain (AD) experiment was performed. The AD analysis was performed on the basis of the ranges distances approach using the Ambit Discovery software (version 0.04, available free of cost at http://ambit.sourceforge.net/download_ambitdiscovery.html). It was observed that 17
276 compounds were predicted to be in the domain of the MACCS_SVM model. In total, 10
745 (62.20%) TCM components were predicted as DILI-positive.
These compounds are present in 5678 TCM plants. In total, 227 predicted DILI-positive compounds are present in more than 10 TCM plants. Among them, stigmasterol is the most frequently detected ingredient, which is present in 99 TCM plants. Moreover, another nine TCM compounds, namely, quercetin, gallic acid, oleanolic acid, scopoletin, berberine, kaempferol, ursolic acid, palmatine and eugenol are present in more than 40 TCM plants. There are 140 TCM plants containing more than 20 predicted DILI-positive compounds. Morus alba contains the most predicted DILI-positive compounds; a total of 104 constituents present in Morus alba. In addition, another 12 TCM plants, namely, Zingiber officinale, Ligusticum chuanxiong, Salvia miltiorrhiza, Houttuynia cordata, Artemisia annua L., Angelica sinensis, Schisandra chinensis (Turcz.) Baill., Tripterygium wilfordii, Taxus baccata, Panax notoginseng, Artemisia capillaries and Panax ginseng C. A. Mey. contain more than 50 predicted DILI-positive compounds.
The compounds predicted as DILI-positive and the TCM plants containing a larger number of these components should be paid close attention to avoid DILI in humans.
The structural characteristics of the DILI-positive chemicals and DILI-negative chemicals were also analyzed. The distributions of the 12 key physicochemical properties showed more or less difference between DILI-positive and DILI-negative compounds. Furthermore, 20 substructures identified from KRFP fingerprints are present far more frequently in DILI-positive compounds than DILI-negative compounds. Thus, these substructures could be regard as structural alerts responsible for DILI. In addition, we predicted the DILI potential of ingredients in TCM using the MACCS_SVM model developed in this study.
We hope that the in silico models and structural alerts could provide critical information and be useful tools for predicting DILI potential of new chemicals.
| This journal is © The Royal Society of Chemistry 2018 |