Wei Zhou†
abc,
Yanjun Fan†a,
Xunhui Caia,
Yan Xianga,
Peng Jianga,
Zhijun Daia,
Yuan Chena,
Siqiao Tana and
Zheming Yuan*abc
aHunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, P. R. China. E-mail: mengrzhou@163.com; zhmyuan@sina.com; Fax: +86-731-8461-8163
bHunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha 410128, P. R. China
cHunan Provincial Engineering & Technology Research Center for Biopesticide and Formulation Processing, Hunan Agricultural University, Changsha 410128, P. R. China
First published on 2nd November 2016
The environmental protection agency thinks that quantitative structure–activity relationship (QSAR) analysis can better replace toxicity tests. In this paper, we developed QSAR methods to evaluate the narcosis toxicities of 50 phenol analogues. We first built multiple linear regression (MLR), stepwise multiple linear regression (SLR) and support vector regression (SVR) models using five descriptors and three different partitions, and the optimal SVR models with all three training-test partitions had the highest external prediction ability, about 10% higher than the models in the literature. Second, to identify more effective descriptors, we applied two in-house methods to select descriptors with clear meanings from 1264 descriptors calculated by the PCLIENT software and used them to construct the MLR, SLR and SVR models. Our results showed that our best SVR model (Rpred2 = 0.972) significantly increased 16.55% on the test set, and the appropriate partition presented the better stability. The different partitions of the training-test datasets also supported the excellent predictive power of the best SVR model. We further evaluated the regression significance of our SVR model and the importance of each single descriptor of the model according to the interpretability analysis. Our work provided a valuable exploration of different combinations among data partition, descriptor selection and model and a useful theoretical understanding of the toxicity of phenol analogues, especially for such a small dataset.
Traditionally, the quantitative structure–activity relationship (QSAR) has been perceived as a means of establishing a correlation between chemical structure and biological activity, and it contributes to explaining how structural descriptors determine biological activity.5,8–10 In particular, an acceptable QSAR model is considered as a rapid and cost-effective alternative to experimental evaluation. The mathematical methods for constructing QSAR include linear and nonlinear methods that solve regression and classification problems in data structure.11 Baker et al.12 have reported a QSAR study of the toxicity of alkylated or halogenated phenols. In that study, they developed a simple model for phenol toxicity based on physicochemical properties without external prediction. Based on the same dataset, Xu et al.13 calculated descriptors directly from the phenol molecular structures and used two approaches, multiple linear regression (MLR) analysis and feed-forward neural networks, to develop a model. Guo and Xu14 also used the same approaches to demonstrate that neural networks were better than MLR in prediction. However, regardless of neural networks or MLR, traditional modelling methods based on the principle of empirical minimization have many flaws. The support vector machine (SVM) based on statistical theory and minimum structure risk is not only helpful to preferably solve certain problems, such as the small-sample, nonlinear, dimension disaster and local minimum problems, but is also helpful to provide a strong generalization ability.15 SVM includes support vector classification and support vector regression (SVR), and SVR is more applicable to develop QSAR models.16
There is a widespread concern over big datasets, but small datasets are inevitable in many fields. On the basis of our previous work,17 we further evaluate the effectiveness and application of methods in coping with small datasets. We found out the literature dataset is a typical one, and several different groups continuously researched it using their different methods. So, we developed the SVR models using the same toxicity dataset in the present work. To select more appropriate descriptors, we employed two methods developed in-house based on SVR: the worst descriptor elimination multi-roundly (WDEM)18 and the high-dimensional descriptors selection nonlinearly (HDSN).19 We further utilized SVR to build more effective models based on the obtained descriptors, and used MLR and stepwise multiple linear regression (SLR) to build the reference model. Our analysis will offer some theoretical references to predict the activities of several phenol compounds.
No. | Compound | log![]() |
No. | Compound | log![]() |
---|---|---|---|---|---|
1 | Phenol | −0.431 | 26 | 2,5-Dichlorophenol | 1.128 |
2 | 2,6-Difluorophenol | 0.396 | 27 | 2,3-Dichlorophenol | 1.271 |
3 | 2-Fluorophenol | 0.248 | 28 | 4-Chloro-2-methyl phenol | 0.700 |
4 | 4-Fluorophenol | 0.017 | 29 | 4-Chloro-3-methylphenol | 0.795 |
5 | 3-Fluorophenol | 0.473 | 30 | 2,4-Dichlorophenol | 1.036 |
6 | 4-Methylphenol | −0.192 | 31 | 3-tert-Butylphenol | 0.730 |
7 | 3-Methylphenol | −0.062 | 32 | 4-tert-Butylphenol | 0.913 |
8 | 2-Chlorophenol | 0.277 | 33 | 3,5-Dichlorophenol | 1.562 |
9 | 2-Bromophenol | 0.504 | 34 | 2-Phenylphenol | 1.094 |
10 | 4-Chlorophenol | 0.545 | 35 | 2,4-Dibromophenola | 1.403 |
11 | 3-Ethylphenol | 0.229 | 36 | 2,3,6-Trimethylphenol | 0.418 |
12 | 2-Ethylphenol | 0.176 | 37 | 3,4,5-TrimethylphenoI | 0.930 |
13 | 4-Bromophenol | 0.681 | 38 | 2,4,6-Trichlorophenol | 1.695 |
14 | 2,3-Dimethylphenol | 0.122 | 39 | 4-Chloro-3,5-dimethylphenol | 1.203 |
15 | 2,4-Dimethylphenol | 0.128 | 40 | 4-Bromo-2,6-dichlorophenol | 1.779 |
16 | 2,5-Dimethylphenol | 0.009 | 41 | 2,4,5-TrichlorophenoI | 2.100 |
17 | 3,4-Dimethylphenol | 0.122 | 42 | 4-Bromo-6-chloro-2-methylphenol | 1.277 |
18 | 3,5-Dimethylphenol | 0.113 | 43 | 4-Bromo-2,6-dimethylphenol | 1.278 |
19 | 3-Chloro-4-fluorophenol | 0.842 | 44 | 2,4,6-Tribrmophenol | 2.050 |
20 | 2-Chloro-5-methylphenol | 0.640 | 45 | 2-tert-Butyl-4-methylphenol | 1.297 |
21 | 4-Iodophenol | 0.854 | 46 | 4-Chloro-2-isopropyl-5-methylphenol | 1.862 |
22 | 3-Iodophenol | 1.118 | 47 | 6-tert-Butyl-2,4-dimethylphenol | 1.245 |
23 | 2-Isopropylphenol | 0.803 | 48 | 2,6-Diphenylphenol | 2.113 |
24 | 3-Isopropylphenol | 0.609 | 49 | 2,4-Dibromo-6-phenylphenol | 2.207 |
25 | 4-Isopropylphenol | 0.473 | 50 | 2,6-Di-tert-butyl-4-methylphen | 1.788 |
Group | Group of descriptors | Count | Group | Group of descriptors | Count |
---|---|---|---|---|---|
1 | Constitutional descriptors | 48 | 13 | RDF descriptors | 150 |
2 | Topological descriptors | 119 | 14 | 3D-MoRSE descriptors | 160 |
3 | Walk and path counts | 47 | 15 | WHIM descriptors | 99 |
4 | Connectivity indices | 33 | 16 | GETAWAY descriptors | 197 |
5 | Information indices | 47 | 17 | Functional group counts | 121 |
6 | 2D autocorrelations | 96 | 18 | Atom-centered fragments | 120 |
7 | Edge adjacency indices | 107 | 19 | Charge descriptors | 14 |
8 | BCUT descriptors | 64 | 20 | Molecular properties | 28 |
9 | Topological charge indices | 21 | 21 | ET-state indices | >300 |
10 | Eigenvalue-based indices | 44 | 22 | ET-state properties* | 3 |
11 | Randic molecular profiles | 41 | 23 | GSFRAG descriptor | 307 |
12 | Geometrical descriptors | 74 | 24 | GSFRAG-L descriptor | 886 |
Total: >3000 |
Firstly, we screened all the descriptors coarsely and nonlinearly by HDSN.19 In this method, the original training set was arranged as (y, x), including n samples and m descriptors. A K*m controlled matrix with the number of K such as 500 was generated in this paper. The matrix consists of an equal number of randomly positioned 1s and 0s per column, representing whether the descriptor in that column was included in the modelling or not. Then a tenfold cross-validation with SVR on the training set was conducted for each row of the matrix using only the descriptors corresponding to 1 s in the row. The importance of a descriptor was judged based on contrasting the prediction accuracy of two sets of models, one set with the descriptor included and the other with the descriptor excluded. The accuracy was defined as the mean square error (MSE) of the constructed model. Because the controlled matrix was random, the optimal descriptors subset might be a different result in each implementation. We conducted the descriptor selection 20 times on the training set and obtained 20 sets of optimal descriptors.
Then we further screened each set of descriptors carefully using the worst descriptor elimination multi-roundly method (WDEM).18 The matrix was arranged as (yi, xij), i = 1,2, …, n, j = 1,2,…, m′, including n samples and m′ descriptors. In the first round, the original MSE0 was computed by leave-one-out (LOO) cross-validation with SVR, and then the corresponding MSEj were obtained by successively wiping out the jth descriptor. When min (MSEj) ≤ MSE0, the corresponding descriptor would be removed and entered into the next round of screening instead of ending the filter.
We assessed the predictive capacity of the models based on MSE and the squared predictive correlation coefficient (Rpred2) values calculated by the following equations:
![]() | (1) |
![]() | (2) |
Generally, researchers considered that an outstanding QSAR model should have a lower strong MSE24 and higher Rpred2 (>0.5).25
SVR | MLR | SLR | |||||||
---|---|---|---|---|---|---|---|---|---|
t = 0 | t = 1, d = 2 | t = 1, d = 3 | t = 2 | t = 3 | |||||
a Notes: the bold indicated the best results in each case. | |||||||||
1st partition | External | MSE | 0.099 | 80.517 | 442.585 | 0.323 | 40.172 | 0.772 | 0.326 |
Rpred2 | 0.925 | −59.663 | −332.450 | 0.757 | −29.266 | 0.551 | 0.920 | ||
Internal | MSE | 0.022 | 0.031 | 0.029 | 0.022 | 0.044 | 0.317 | 0.146 | |
Rpred2 | 0.937 | 0.910 | 0.916 | 0.937 | 0.874 | 0.711 | 0.938 | ||
2nd partition | External | MSE | 0.040 | 0.051 | 0.099 | 0.039 | 0.056 | 0.276 | 0.207 |
Rpred2 | 0.913 | 0.890 | 0.786 | 0.915 | 0.879 | 0.834 | 0.907 | ||
Internal | MSE | 0.027 | 0.131 | 0.039 | 0.025 | 0.045 | 0.241 | 0.157 | |
Rpred2 | 0.939 | 0.697 | 0.910 | 0.943 | 0.895 | 0.865 | 0.943 | ||
3rd partition | External | MSE | 0.065 | 0.067 | 0.052 | 0.064 | 0.070 | 0.377 | 0.238 |
Rpred2 | 0.904 | 0.900 | 0.922 | 0.905 | 0.895 | 0.788 | 0.915 | ||
Internal | MSE | 0.025 | 0.137 | 0.027 | 0.025 | 0.026 | 0.302 | 0.154 | |
Rpred2 | 0.939 | 0.667 | 0.935 | 0.939 | 0.936 | 0.777 | 0.942 |
These results indicated that SVR with a suitable kernel function was a powerful technique for a given set of low-dimensional descriptor space, and that kernel function selection would be effective for more accurate predictions.
Indices | SVR1 | SVR2 | SVR3 | SVR4 | SVR5 | ||
---|---|---|---|---|---|---|---|
a SR, screening rounds; ON, obtained number; BKF, best kernel function. | |||||||
1st partition | HDSN | SR | 9 | 7 | 9 | 9 | 6 |
ON | 27 | 20 | 21 | 21 | 30 | ||
WDEM | SR | 11 | 4 | 7 | 6 | 10 | |
ON | 16 | 16 | 14 | 15 | 20 | ||
Models | BKF | t = 3 | t = 2 | t = 0 | t = 2 | t = 2 | |
Rpred2 | 0.959 | 0.955 | 0.949 | 0.941 | 0.939 | ||
2nd partition | HDSN | SR | 7 | 10 | 4 | 8 | 8 |
ON | 25 | 19 | 40 | 26 | 23 | ||
WDEM | SR | 8 | 8 | 10 | 8 | 10 | |
ON | 17 | 11 | 30 | 18 | 13 | ||
Models | BKF | t = 2 | t = 0 | t = 2 | t = 3 | t = 2 | |
Rpred2 | 0.977 | 0.972 | 0.968 | 0.965 | 0.960 | ||
3rd partition | HDSN | SR | 5 | 10 | 9 | 9 | 5 |
ON | 39 | 22 | 23 | 23 | 40 | ||
WDEM | SR | 13 | 5 | 7 | 7 | 5 | |
ON | 26 | 17 | 16 | 16 | 35 | ||
Models | BKF | t = 1, d = 2 | t = 2 | t = 1, d = 2 | t = 1, d = 2 | t = 2 | |
Rpred2 | 0.996 | 0.987 | 0.973 | 0.967 | 0.967 |
Not every model makes good sense, so we only analyzed the effective models whose Rpred2 values were greater than or equal to 0.5. In particular, we only chose the model with the best kernel function but not every kernel function for every repetition. Therefore, we got 20 models for every partition. The retention probability of the effective SVR models referred to the ratio of the effective models to the 20 times repetition. The statistical result showed that (1) in the case of three training-test partitions, the probabilities of effective models from SLR increased by 5% to 15% over the ones from MLR, and the ones from SVR increased by 5% to 35%; (2) the probabilities of effective models in the two literature partitions (the 2nd partition and the 3rd one) were higher. The appearance suggested that the SVR method can construct more effective models than the others and the literature can provide important guidance on selecting the training-test partition (Table 6).
1st partition | 2nd partition | 3rd partition | |
---|---|---|---|
SVR | 95% | 100% | 100% |
SLR | 75% | 90% | 100% |
MLR | 60% | 80% | 95% |
To compare the external predictive ability of SLR, MLR and SVR in more detail, we discussed the changed trend of the retention models. We conducted trend analysis through the Rpred2 values of the retention models, since the models with higher Rpred2 values had stronger generalization ability. In order to observe the general trend of the retention models with different partitions and different methods, we sorted the retention models according to their Rpred2 values and then drew Fig. 1. From this it appeared that: (1) with three partitions, SVR had more advantages than the other two methods in predictive ability, and the advantages were not only in the amount of the retention model but also in the assessed value; (2) after putting the best SVR models for every set of descriptors together observing the three partitions, the top 15 models all performed similarly well, and the third partition presented the best stability.
![]() | ||
Fig. 1 Analysis of the modeling advantages of the first (A), the second (B), the third (C) and the best (D) classification. |
Because of the varying quantity of retention models with the different training-test partitions and different modelling methods, direct comparison of retention model trends could not demonstrate the pros and cons of the methods objectively and comprehensively. Therefore, we further analyzed the significant test of Rpred2 difference for all the retention models in each partition statistically comparing the modelling methods. The result indicated that the difference between SVR and MLR reached a significant level in the third partition, and the difference between SVR and SLR reached a highly significant level in the first and the third partitions (Table 7 and Fig. 2).
1st partition | 2nd partition | 3rd partition | ||||
---|---|---|---|---|---|---|
SVR | MLR | SVR | MLR | SVR | MLR | |
a Notes: ** represented highly significant; * represented significant. | ||||||
MLR | 0.126 | 0.058 | 0.017* | |||
SLR | 0.004** | 0.191 | 0.180 | 0.179 | 0.010** | 0.315 |
Many previous research results have indicated that SVR has better generalization but poorer interpretability in some nonlinear fields. So, aiming at this weak point, our research group established a complete set of an interpretability system for SVR based on F-tests in the previous studies.26 According to the interpretability analysis of the SVR2_2 model, we obtained the significance of the regression model and the importance of single indicators based on SVR and the F-test. The results showed that the nonlinear regression of the SVR2_2 model (Rpred2 = 0.972) was highly significant because its F-value (4461.81) was greater than F0.05(11, 33).
We listed the obtained descriptors of phenols compounds from the SVR2_2 model in Table 8, and the algorithms supported the indication that all these descriptors played the most important roles in describing the narcosis activity of these compounds. For all the descriptors in the SVR2_2 model, the analysis of single-factor effects showed that the narcosis activity was highly significantly negative correlated with BLTD48 values, but highly significantly positively correlated with 10 other descriptor values (Table 8 and Fig. 4).
Group name | Descriptor name | F-Value |
---|---|---|
a *, 0.01 < p < 0.05; **, p < 0.01; F0.05(1, 10) = 4.96; F0.01(1, 10) = 10.04; F0.05(1, 25) = 4.24; F0.01(1, 25) = 7.77; F0.05(26, 18) = 2.13; F0.01(26, 18) = 2.97; F0.05(11,33) = 2.09; F0.01(11, 33) = 2.84. | ||
Molecular properties | BLTD48: Verhaar model of Daphnia base-line toxicity from MLOGP (mmol l−1) | 19![]() |
Geometrical descriptors | G(O..Cl): sum of geometrical distances between O..Cl | 4461.810** |
GETAWAY descriptors | H7e: H autocorrelation of lag 7/weighted by atomic Sanderson electronegativities | 2839.560** |
GETAWAY descriptors | HTm: H total index/weighted by atomic masses | 2477.360** |
GETAWAY descriptors | R5v+: R maximal autocorrelation of lag 5/weighted by atomic van der Waals volumes | 2430.070** |
Topological descriptors | PHI: Kier flexibility index | 1345.930** |
Information indices | CIC3: complementary information content (neighborhood symmetry of 3-order) | 1327.350** |
3D-MoRSE descriptors | Mor16ez: 3D-MoRSE − signal 16/weighted by atomic Sanderson electronegativities | 1140.060** |
RDF descriptors | RDF040p: radial distribution function − 4.0/weighted by atomic polarizabilities | 1082.480** |
Walk and path counts | SRW01: self-returning walk count of order 01 (number of non-H atoms, nSK) | 1017.870** |
2D autocorrelations | MATS3e: Moran autocorrelation − lag 3/weighted by atomic Sanderson electronegativities | 626.600** |
In previous studies, BLTD48,27 G(O..Cl),28 H7e,29 HTm,30,31 PHI,32–34 CIC3,35–37 Mor16ez,38,39 RDF040p,21 SRW01 (ref. 40) and MATS3e21,41 have been reported as modeling descriptors, but the fifth important descriptor R5v+ and this combination of descriptors have never been mentioned. In our SVR2_2 model, the most important descriptor BLTD48 is one of the molecular properties, and the toxicity has decreased with increasing its structure in phenolic compounds;27 the second important descriptor G(O..Cl) is one of the geometrical descriptors, and has been used to predict the tertiary structure of α-glucosidase and inhibition properties of N-(phenoxydecyl) phthalimide derivatives;28 the third important descriptor H7e is one of the GETAWAY descriptors and has been selected by the genetic algorithm to model the HIV-1 RT inhibitory activity and MT4 blood toxicity;29 the fourth important descriptor HTm is also one of the GETAWAY descriptors and has been selected to model the design of potent antimalarial bisbenzamidines30 and to predict activity coefficients at infinite dilution of hydrocarbons in aqueous solutions.31 Particularly, HTm provides information on the degree of interaction between all the molecule atoms, determined by the atomic masses of every individual atom in the molecule;30 the sixth important descriptor PHI is one of the topological descriptors and has been found to improve the prediction ability.33,34 The PHI is a measure of molecular flexibility. When the molecular flexibility of a molecule increases, the resistance of molecule against changes increases and its molecular diffusivity decreases;32 the seventh important descriptor CIC3 is one of the information indices, and has been developed for predicting anti-HCV activity of thiourea derivatives36 and nonlinear relationships between retention time and molecular structure of peptides originating from proteomes.37 CIC3 depicts the topological features of atoms based on neighborhood environment;35 the eighth important descriptor Mor16ez is a 3D-molecule representation of structures based on electron diffraction (3D-MoRSE) descriptor weighted by electronegativity, and has been used to develop QSAR models.38,39 It illustrates the role of geometry of the peptide molecules and their electrical diffraction properties during the interaction with the binding site of the receptor;38 the ninth important descriptor RDF040p is one of the RDF descriptors and has been selected to model as an important feature;21 the tenth important descriptor SRW01 is one of the walk and path counts, and has been selected as a feature that weighted most heavily at the ends of PC1 of physico-chemical space;40 the last important descriptor MATS3e is classified in the Moran autocorrelation descriptors, and it is calculated using the average value of a special property of a molecule and the number of vertex pairs at a topological distance. Therefore, the MATS3e has the length three in the lag (the prescribed length or topological distance that connects a pair of atoms) and bear the atomic Sanderson electronegativities as the weighting scheme.21,41
These results might help to explain how the descriptors could determine the narcosis activity of phenols, and to improve and exploit the physical and chemical technology for environmental chemical pollutants. Based on the above results, it is confirmed that we can construct some ideal QSAR models that should be capable of accurately predicting several groups of the desired property for a newly synthesized or a hypothetical molecule. In addition, the descriptors selection methods and the modelling techniques from our studies cannot only be applied to develop and improve the control of phenols, but also be applied to all the fields of small molecules modelling.
Footnote |
† These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2016 |