Xianchao Panab,
Li Chaob,
Sujun Qub,
Shuheng Huangb,
Li Yangab and
Hu Mei*ab
aKey Laboratory of Biorheological Science and Technology, Ministry of Education, Chongqing University, Chongqing 400044, China. E-mail: meihu@cqu.edu.cn; Fax: +86-23-65112677; Tel: +86-23-65102507
bCollege of Bioengineering, Chongqing University, Chongqing 400044, China
First published on 28th September 2015
CYP1A2, an important member of the cytochromes P450 (CYPs) superfamily, is involved in the metabolism or bioactivation of many clinical drugs and precarcinogens. Thus, accurate prediction of CYP1A2 inhibitors is of great importance in early drug discovery and cancer prevention. In this study, a dataset of more than 12000 structurally diverse compounds was used to develop prediction models by a support vector machine (SVM). By combining two types of fragment descriptors, i.e. Molecular Hologram and MACCS descriptors, an improved radial basis function (RBF)-based SVM model was obtained, of which the accuracies (ACCs), sensitivities (SENs), specificities (SPEs), and Matthews correlation coefficients (MCCs) were 90.95%, 92.40%, 89.70%, 0.8191 for 6396 training samples, and 83.14%, 85.17%, 81.41%, 0.6638 for 6395 test samples, respectively. The prediction capability of the SVM model obtained was further validated by an independent dataset of 2581 samples with geometric mean (G-mean) based accuracy of 70.67%. The results indicate that the combination of the two types of fragment descriptors is an extremely efficient method for eliciting the key structural features of CYP inhibitors, and thus can be employed to large-scale virtual screening of inhibitors of CYP isoforms.
In human liver, CYP1A2 accounts for about 13% of total CYP content5 and metabolizes a variety of clinical drugs (e.g. clozapine, ropivacaine, olanzapine, theophylline, and terbinafine).2,6 In the past decade, in silico methods in particular quantitative structure–activity relationship (QSAR) have been increasingly attractive for prediction of potential CYP1A2 inhibitors and associated DDIs.7–14 However, the extrapolation capabilities of the available prediction models are restricted by small datasets and limited structural diversities.
In 2009, Veith et al. determined the AC50 values (half-maximal activity concentration) of 17143 compounds against 5 CYP isoforms (1A2, 2C9, 2C19, 2D6, and 3A4) by quantitative high throughput screening (qHTS) technique.15 Based on this large dataset, various prediction models of CYP inhibitors have been developed by support vector machine (SVM), decision tree (DT), k-nearest neighbor (k-NN), naïve Bayes (NB), and random forest (RF), respectively.16–19 One of the most interesting works comes from Cheng et al.,16 who established combined classifiers for predicting CYP inhibitors by using SVM, C4.5DT, NB, and k-NN algorithms. For CYP1A2 isoform, the 5-fold cross-validation (CV) accuracies of the combined classifiers are approximately 81% for 12
099 training samples (5663 inhibitors/6436 noninhibitors), and the prediction accuracies for 2804 test samples (1752 inhibitors/1052 noninhibitors) range from 70% to 73%.16 This is the first time, to the best of our knowledge, that highly predictive models have been constructed based on this large dataset. However, the method of combined classifiers is somewhat complex and time-consumed for large-scale virtual screening.
Herein, a strategy of combination of Molecular Hologram and MACCS has been successfully applied to construct prediction models for CYP1A2 inhibitors. The results showed that a predictive RBF (radial basis function)-SVM model was achieved with the accuracies of 90.95% for 6396 training samples and 83.14% for 6395 test samples. The prediction capability of the RBF-SVM model was further validated by an independent dataset of 2581 samples with geometric mean based accuracy of 70.67%. Taken together, the strategy of the combined fragment descriptors provides an extremely simple, accurate, and efficient approach for predicting CYP1A2 inhibitors and can be further applied for predicting inhibitors of other CYP isoforms.
According to this criterion, 13256 compounds with inhibitor/non-inhibitor labels of CYP1A2 were extracted from the database, of which 6000 compounds were determined as inhibitors and 7256 as non-inhibitors. Prior to further analysis, the molecular structures were pretreated by Verify 2D module of Sybyl 8.1.20 After removing inorganic compounds, counter ions, duplicates, salts, and mixtures, the resulting 12
791 compounds (designated as Dataset I) were then randomly split into a training set containing 6396 compounds and a test set containing 6395 compounds.
An independent validation dataset containing 8465 compounds (4446 inhibitors/4019 noninhibitors) was collected from another PubChem BioAssay database (AID: 410). After filtering structures as mentioned above and removing duplicated structures to Dataset I, 2581 qualified compounds (designated as Dataset II) were obtained. According to the SLN strings of molecules, the duplicate compounds have been removed by using ligand preparation tools in Sybyl 8.1.20 The statistical descriptions of the training, test and validation sets are shown in Table 1. The PubChem ID, SMILES, and inhibitor/non-inhibitor labels of all 15372 samples are listed in Table S1.†
Datasets | Type | No. of inhibitors | No. of noninhibitors | Sum | Ratio (inhibitors/noninhibitors) |
---|---|---|---|---|---|
Dataset I (AID: 1851) | Training set | 2948 | 3448 | 6396 | 0.8550![]() ![]() |
Test set | 2947 | 3448 | 6395 | 0.8547![]() ![]() |
|
Dataset II (AID: 410) | Validation set | 1774 | 807 | 2581 | 2.1983![]() ![]() |
Total | 7669 | 7703 | 15![]() |
Molecular Hologram description was carried out by Hologram QSAR (HQSAR) module of Sybyl 8.1 package.20,23 A Molecular Hologram is an array containing counts of molecular fragments. As depicted in Fig. 1, molecules are first broken into pre-defined structural fragments (including branched, cyclic, and overlapping fragments). Then, each unique fragment is assigned a specific large integer by means of cyclic redundancy check (CRC) algorithm. Each integer corresponds to a bin in an integer array of fixed length L. Bin occupancies are incremented according to the fragments generated. Thus, all generated fragments are hashed into array bins in the range 1 to L. This array is so-called Molecular Hologram, and bin occupancies are the hologram descriptors, which contain topological and compositional molecular information. The generation of Molecular Hologram is mainly determined by 3 parameters: fragment size, fragment distinction, and hologram length.
![]() | ||
Fig. 1 Generation of Molecular Hologram.19 |
MACCS descriptors, also called 166-bit MDL keys, use a dictionary that consists of 166 pre-defined substructure fragments.24 For a molecule, each bit represents the presence or absence of a certain atom type, bond type, atom environment, group, or property. If a specified substructure is presented in a given molecule, the corresponding bit is set to ‘1’; conversely, it is set to ‘0’. Thus, each molecule is described as a binary string to represent structural features. The substructure dictionary of MACCS keys is freely available in OpenBabel (http://openbabel.org/).
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
Here, TP (true positives) is the number of inhibitors predicted as inhibitors, TN (true negatives) the number of non-inhibitors predicted as non-inhibitors, FP (false positives) the number of noninhibitors predicted as inhibitors, and FN (false negatives) the number of inhibitors predicted as noninhibitors. The value of MCC ranges from −1 to 1. A value of 1 indicates perfect agreement between predicted and observed classes, whereas −1 indicates the worst possible prediction.
The distribution in the first two principal components of the samples is shown in Fig. 2. It can be seen that both the training and test samples cover the most chemical space within 95% confidence interval and that only a minority of the samples are outside the 95% confidence interval. This indicates that the training and test samples have similar chemical distributions. Besides, no significant difference is detected in the distributions between the inhibitors and the noninhibitors.
The best 14 PLS-DA models established by different parameter combinations are shown in Table 2. It can be seen that no significant difference in the overall performance is observed among the 14 models. Herein, model 8 with the highest ACCs and MCCs for both the training and test sets is selected as the best PLS-DA model, of which the fragment size is 4–7, the fragment distinction is A/B/Ch/DA, and the hologram length is 401 bins. Thus, the Molecular Hologram descriptors of model 8 were used for the following SVM modeling.
Model | Fragment distinctiona | Hologram length | Training set | Test set I | ||||||
---|---|---|---|---|---|---|---|---|---|---|
SEN (%) | SPE (%) | ACC (%) | MCC | SEN (%) | SPE (%) | ACC (%) | MCC | |||
a A: atom types; B: bond types; C: connectivity; Ch: chirality; H: hydrogens; DA: H-bond donor and acceptor. | ||||||||||
1 | A/B/C | 401 | 76.29 | 81.12 | 78.89 | 0.5748 | 73.97 | 78.86 | 76.61 | 0.5288 |
2 | A/B/C/H | 401 | 79.65 | 75.41 | 77.36 | 0.5488 | 78.52 | 72.45 | 75.25 | 0.5082 |
3 | A/B/C/Ch | 401 | 75.20 | 82.05 | 78.89 | 0.5745 | 72.38 | 79.32 | 76.12 | 0.5186 |
4 | A/B/C/H/Ch | 353 | 80.33 | 75.00 | 77.45 | 0.5516 | 79.30 | 70.74 | 74.68 | 0.4994 |
5 | A/C/Ch/DA | 401 | 74.49 | 82.08 | 78.58 | 0.5681 | 72.11 | 80.63 | 76.70 | 0.5300 |
6 | A/B/C/DA | 257 | 72.76 | 81.96 | 77.72 | 0.5506 | 69.63 | 79.15 | 74.76 | 0.4907 |
7 | A/B/C/Ch/DA | 401 | 74.86 | 81.24 | 78.30 | 0.5625 | 71.19 | 77.44 | 74.56 | 0.4872 |
8 | A/B/Ch/DA | 401 | 76.97 | 81.67 | 79.50 | 0.5871 | 75.26 | 78.60 | 77.06 | 0.5385 |
9 | A/B/H/Ch | 401 | 79.82 | 76.65 | 78.11 | 0.5630 | 78.15 | 73.06 | 75.40 | 0.5105 |
10 | A/B/H/DA | 401 | 77.78 | 77.87 | 77.83 | 0.5554 | 77.06 | 74.91 | 75.90 | 0.5182 |
11 | A/B/H/Ch/DA | 401 | 78.22 | 78.02 | 78.11 | 0.5612 | 77.33 | 75.09 | 76.12 | 0.5227 |
12 | A/B/C/H/Ch/DA | 401 | 79.27 | 75.93 | 77.47 | 0.5504 | 78.08 | 72.80 | 75.23 | 0.5072 |
13 | A/C/H/Ch/DA | 307 | 77.68 | 78.92 | 78.35 | 0.5651 | 74.55 | 75.90 | 75.28 | 0.5037 |
14 | A/C/H/DA | 401 | 77.88 | 78.07 | 77.99 | 0.5584 | 75.98 | 74.88 | 75.39 | 0.5072 |
Model | Description method | Training set | Test set | ||||||
---|---|---|---|---|---|---|---|---|---|
SEN (%) | SPE (%) | ACC (%) | MCC | SEN (%) | SPE (%) | ACC (%) | MCC | ||
a RBF kernel, C = 30.8022, γ = 1.0000.b Linear kernel, C = 39.5508.c RBF kernel, C = 16.4872, γ = 1.3591.d Linear kernel, C = 37.1545. | |||||||||
RBF-SVMa | Molecular Holograms | 85.31 | 79.09 | 81.96 | 0.6421 | 82.59 | 73.72 | 77.81 | 0.5620 |
Linear-SVMb | Molecular Holograms | 83.21 | 78.74 | 80.80 | 0.6176 | 80.52 | 74.07 | 77.04 | 0.5444 |
RBF-SVMc | MACCS | 84.70 | 79.90 | 82.11 | 0.6441 | 84.19 | 76.54 | 80.06 | 0.6056 |
Linear-SVMd | MACCS | 83.31 | 80.48 | 81.79 | 0.6361 | 82.80 | 77.18 | 79.77 | 0.5979 |
Just as expected, the overall prediction performance of SVM models significantly increases after introducing the combined descriptors (Table 4). Especially for the RBF-SVM model, the ACCs and MCCs are strikingly high for both the training (90.95%, 0.8191) and test sets (83.14%, 0.6638). The results indicate that the combination of the two fragment descriptors with different types can effectively enhance the prediction performance. In comparison with earlier researches on CYP1A2 inhibitors, our RBF-SVM model is clearly more predictive and simple (Table 4).
Modeling methods | No. of training samples | Training set | No. of test samples | Test set | ||||||
---|---|---|---|---|---|---|---|---|---|---|
SEN (%) | SPE (%) | ACC (%) | MCC | SEN (%) | SPE (%) | ACC (%) | MCC | |||
a RBF kernel; C = 42.1016, γ = 2.2408.b Linear kernel, C = 50.7842.c The optimal combined model designed by Cheng et al.16 were based on SVM and k-NN algorithms and the performance was evaluated by 5-fold cross-validation.d The presented performance of the SVM model18 was obtained based on the training and the test sets assembled by random sampling strategy in the study.e The accuracy for the training set was evaluated by 7-fold cross-validation, and the accuracy for the test set was measured by the area under the curves (AUC) of the receiver operating characteristic (ROC).17f ASNN: Associative neural networks.26 | ||||||||||
RBF-SVM in this studya | 6396 | 92.40 | 89.70 | 90.95 | 0.8191 | 6395 | 85.17 | 81.41 | 83.14 | 0.6638 |
Linear-SVM in this studyb | 6396 | 87.25 | 83.56 | 85.26 | 0.7060 | 6395 | 84.32 | 78.71 | 81.30 | 0.6284 |
Cheng et al. combined model IIc | 12![]() |
80.00 | 82.50 | 81.30 | 0.6260 | 2804 | — | — | 72.00 | — |
Su et al. SVMd | 10![]() |
— | — | — | — | 2559 | 86.80 | 74.00 | 79.80 | — |
Sun et al. RBF-SVMe | 7208 | — | — | 87.50 | — | 7128 | 0.93 (AUC) | |||
Novotarskyi et al. ASNNf | 3745 | — | — | — | — | 3741 | 0.827 | 0.827 | 0.827 | 0.6530 |
Recently, Lapins et al.25 developed a unified proteochemometric (PCM) model successfully for predicting inhibitors of five major CYP isoforms, i.e. CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4. Based on the signature information of the inhibitors and amino acid property composition and transition information in the CYP primary sequences, high cross-validated accuracies (78–85%) and prediction accuracies for an external dataset (79–88%) were obtained by using SVM, k-nearest neighbor, and random forest classifiers. For CYP1A2 inhibitors, however, it can be seen from the ROC curve that high sensitivity for the external validation dataset can be achieved only at a cost of low specificity. In comparison with our models, the PCM models established on about 80000 atomic signatures and amino acid property transition information is somewhat complex and less interpretable.
In order to further validate the prediction and extrapolation capabilities of the SVM models, an independent validation set (Dataset II) containing 2581 diverse compounds (1774 inhibitors and 807 noninhibitors) was introduced. As observed in Table 5, the two SVM models with different kernel functions achieve modest ACCs (∼65%). Although the SENs of both models are somewhat limited, the SPEs and ACCs are still satisfying. The low SENs may be explained by the different experimental protocols and labeling methods applied for the two databases. For this imbalanced dataset, the geometric mean (G-mean) based ACC was also introduced to measure the predictive power. It can be seen that the G-mean of ∼70% is relatively high for this imbalanced dataset.
Model | Description method | SEN (%) | SPE (%) | ACC (%) | MCC | G-meana (%) |
---|---|---|---|---|---|---|
a ![]() |
||||||
RBF-SVM | Combined | 57.33 | 87.11 | 66.64 | 0.4156 | 70.67 |
MACCS | 53.27 | 83.89 | 62.84 | 0.3494 | 66.85 | |
Molecular Holograms | 54.90 | 84.76 | 64.24 | 0.3719 | 68.22 | |
Linear-SVM | Combined | 54.40 | 86.12 | 64.32 | 0.3809 | 68.45 |
MACCS | 52.03 | 85.01 | 62.34 | 0.3498 | 66.51 | |
Molecular Holograms | 52.82 | 83.02 | 62.26 | 0.3371 | 66.22 |
Furthermore, according to the weight coefficients of variables in the RBF-SVM model, variable screening was also performed. The results showed that no significant improvement was observed with the decreased number of fragment descriptors (Fig. 3).
![]() | ||
Fig. 3 The performance of the combined descriptors based RBF-SVM model for the (a) training set (b) test set and (c) validation set. |
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c5ra17196b |
This journal is © The Royal Society of Chemistry 2015 |