N. S. Hari Narayana Moorthy*,
Silvia A. Martins,
Sergio F. Sousa,
Maria J. Ramos and
Pedro A. Fernandes*
REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, s/n, Rua do Campo Alegre, 4169-007 Porto, Portugal. E-mail: hari.moorthy@fc.up.pt; hari.nmoorthy@gmail.com; pafernan@fc.up.pt; Fax: +351-220-402-506; Tel: +351-220-402-506
First published on 3rd November 2014
In this work, we have developed a set of classification models to categorise organic molecules with respect to their solvation free energies using different machine learning approaches (decision tree, random forest and support vector machine). The solvation free energies of the molecules (experimental values obtained from the literature) were split into highly favourable (<−3 kcal mol−1) and less favourable (>−3 kcal mol−1) values; −3 kcal mol−1 was set as the threshold value for the classification model development. The MACCS fingerprint along with a set of physicochemical descriptors such as atom count, topology, vdW surface area (volsurf) and subdivided surface area contributed to the classification models. The validation studies using test set and 10-fold cross-validation methods yielded statistical parameters such as accuracy, sensitivity and specificity of >90%. The sum of ranking differences (SRD) analysis reveals that the support vector machine models rank comparatively well, while the models containing MACCS fingerprints are ranked as good models in all approaches. The MACCS fingerprints indicate that the presence of halogen atoms causes less favourable solvation free energies. By contrast, the presence of polar atoms/groups and some functional groups such as heteroatoms, double bonded branched aliphatic chains, CN, N–C–C–O, NCO, >1 heterocyclic atoms, OCO, etc. leads to highly favourable solvation free energies. The results derived from these investigations can be used along with some quantitative models to predict the solvation free energies of organic molecules and to design novel molecules with acceptable solvation free energies.
Protein–ligand binding and the transport of drugs across membranes are closely connected to the solvation free energy, as it is an important component of the binding free energy. Molecules that are important in the chemical, biological and pharmaceutical sciences are usually polyfunctional (e.g., drug molecules). The exposure or protection of chemical groups from solvent influences the binding process and involves the thermodynamic process of ligand desolvation. Therefore, the determination of ΔGsolv is a valuable objective in the study of chemical/biochemical processes, and it has been pursued since the beginning of computer-aided drug design.4,5 The ΔGsolv of a molecule is an important thermodynamic property that is affected by the groups that constitute the molecule and by the physicochemical features of the molecule.2 Earlier experiments demonstrated that the contributions of the electrostatic and nonpolar parts of a molecule determine its solvation free energy.6–11 The nonpolar contribution is usually modelled as proportional to the solvation surface area. The electrostatic term dominates the total solvation free energy of the molecule, although it does not always indicate a high affinity.12 This shows that the physicochemical features of molecules cause variations in their free energies of solvation. Hence, an analysis was carried out to investigate the important physicochemical properties and topological features responsible for the free energy of solvation. Furthermore, classification (qualitative) analysis with different machine learning approaches was used to categorise the molecules based on their solvation free energies, because quantitative analysis requires more computational cost, more time and more precise experimental values.
The reported quantitative models for solvation free energy prediction were developed using substantial computing power.13,14 To simplify the analysis, initial qualitative models developed with the same data set can support the development of quantitative models in less time and with greater precision.
Machine learning is a field of artificial intelligence used to extract characteristics of interest from a data set whose underlying probability distribution is unknown. Machine learning focuses on prediction based on known properties learned from the training data.15 These methods use different algorithms for classification and are evaluated based on their generalisation capability, which is their ability to successfully apply the learned knowledge to unseen data. Generally, supervised and unsupervised machine learning methods are available for classification analysis.16 In the present study, we have used supervised machine learning methods to classify the solvation free energies of organic molecules as highly favourable or less favourable.
The random forest, proposed by Breiman (2001), is an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time. Each tree is built from a different bootstrap sample of the data, and random forests also change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best split among a subset of predictors randomly chosen at that node.27,28
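The two sources of randomness described above can be sketched in a few lines. This is only an illustration, not the software used in the study: the function names, the fixed seed, the ten stand-in molecules and the choice of three candidate predictors out of nine are all assumptions made for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one bootstrap sample per tree."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(n_features, n_candidates, rng):
    """Pick the random subset of predictor indices examined at one node."""
    return sorted(rng.sample(range(n_features), n_candidates))


rng = random.Random(0)      # fixed seed so the sketch is repeatable (assumption)
rows = list(range(10))      # stand-in indices for 10 training molecules
sample = bootstrap_sample(rows, rng)
subset = candidate_features(n_features=9, n_candidates=3, rng=rng)
print("bootstrap sample:", sample)
print("predictors tried at this node:", subset)
```

A standard decision tree would instead examine all nine predictors at every node and would be trained on the full training set.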
Support vector machines are supervised learning models with associated learning algorithms for classification; they are based on the concept of decision planes that define decision boundaries. A decision plane separates sets of objects with different class memberships.29
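As a minimal sketch of what a decision plane does, assuming the simplest (linear) case, a molecule described by a descriptor vector x is assigned to a class according to the sign of w·x + b. The weights, bias and descriptor values below are hypothetical, and the models in this study may well rely on a kernel rather than a purely linear plane.

```python
def decision_value(weights, bias, x):
    """Signed score of x relative to the plane w.x + b = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x):
    """Assign a class from the side of the plane on which x falls."""
    if decision_value(weights, bias, x) >= 0:
        return "highly favourable"
    return "less favourable"


# Hypothetical plane in a two-descriptor space (values are illustrative only).
weights, bias = [1.0, -1.0], 0.0
print(classify(weights, bias, [2.0, 1.0]))
print(classify(weights, bias, [1.0, 2.0]))
```

Training a support vector machine amounts to choosing the plane that separates the two classes with the largest margin; that optimisation step is not shown here.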
Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
---|---|---|---|---|
a_nH | MACCSFP49 | nN | MACCSFP134 | nN |
a_nN | MACCSFP88 | SHdsCH | MACCSFP161 | nHBAcc_Lipinski |
KierA3 | MACCSFP103 | ETA_AlphaP | a_nN | nHBDon |
Lip_acc | MACCSFP104 | nHBAcc_Lipinski | Lip_acc | MACCSFP134 |
Lip_don | MACCSFP107 | nHBDon | Lip_don | MACCSFP161 |
SlogP_VSA0 | MACCSFP121 | TopoPSA | SlogP_VSA0 | SlogP_VSA0 |
SMR_VSA1 | MACCSFP127 | SMR_VSA7 | SMR_VSA7 | – |
Vsurf_CW3 | MACCSFP134 | Vsurf_EWmin1 | Vsa_pol | – |
Vsurf_W2 | MACCSFP139 | Vsurf_W2 | Vsurf_A | – |
– | MACCSFP151 | Vsurf_Wp3 | Vsurf_CW1 | – |
– | MACCSFP156 | – | Vsurf_CW2 | – |
– | MACCSFP157 | – | – | – |
– | MACCSFP161 | – | – | – |
In this analysis, solvation free energy values <−3 kcal mol−1 were considered to be highly favourable, while those >−3 kcal mol−1 were classified as less favourable. The results derived from all three approaches are provided in Tables 2 and 3. The analysis was also performed with other threshold values (<−1 kcal mol−1 and >−1 kcal mol−1); unfortunately, that data set did not provide a balanced number of compounds classified as having highly favourable and less favourable solvation free energies. The abovementioned threshold value (<−3 kcal mol−1 and >−3 kcal mol−1) yielded better classification models with significant statistical parameters; hence these models are discussed herein. The physicochemical descriptors contributing to the models were able to classify the solvation free energies of molecules as highly favourable and less favourable.
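The labelling rule and the class-balance check described above can be sketched as follows. The ΔGsolv values are hypothetical and only illustrate how moving the threshold changes the class sizes; assigning a value exactly equal to the threshold to the less favourable class is an assumption, since the text does not say how such a value is treated.

```python
THRESHOLD = -3.0  # kcal/mol, the threshold used for the reported models


def label(dg_solv, threshold=THRESHOLD):
    """Label one solvation free energy (kcal/mol) against the threshold."""
    # Assumption: a value exactly at the threshold counts as less favourable.
    return "highly favourable" if dg_solv < threshold else "less favourable"


def class_counts(values, threshold=THRESHOLD):
    """Count molecules per class to check how balanced the split is."""
    counts = {"highly favourable": 0, "less favourable": 0}
    for value in values:
        counts[label(value, threshold)] += 1
    return counts


# Hypothetical solvation free energies, not taken from the data set.
values = [-6.2, -2.1, -3.4, -1.5, 0.8]
print(class_counts(values, threshold=-3.0))
print(class_counts(values, threshold=-1.0))
```

With these made-up values the −3 kcal mol−1 threshold splits the molecules 2 to 3, whereas −1 kcal mol−1 splits them 4 to 1, which mirrors the kind of imbalance mentioned in the text.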
Model No | Dataset | Total | Decision tree: Total P (TP/FN) | Decision tree: Total N (TN/FP) | Support vector machine: Total P (TP/FN) | Support vector machine: Total N (TN/FP) | Random forest: Total P (TP/FN) | Random forest: Total N (TN/FP) |
---|---|---|---|---|---|---|---|---|
1 | Training | 241 | 129 (128/1) | 112 (111/1) | 129 (122/7) | 112 (110/2) | 129 (129/0) | 112 (112/0) |
1 | Test 30% | 72 | 41 (40/1) | 31 (31/0) | 41 (39/2) | 31 (30/1) | 41 (41/0) | 31 (31/0) |
1 | Test 40% | 96 | 53 (51/1) | 43 (43/0) | 53 (51/2) | 43 (42/1) | 53 (52/1) | 43 (42/1) |
1 | 10-fold | 241 | 129 (124/5) | 112 (109/3) | 129 (122/7) | 112 (110/2) | 129 (123/6) | 112 (109/3) |
2 | Training | 241 | 129 (122/7) | 112 (110/2) | 129 (122/2) | 112 (110/2) | 129 (121/8) | 112 (111/1) |
2 | Test 30% | 72 | 41 (39/2) | 31 (30/1) | 41 (39/2) | 31 (30/1) | 41 (39/2) | 31 (30/1) |
2 | Test 40% | 96 | 53 (51/2) | 43 (42/1) | 53 (51/2) | 43 (42/1) | 53 (51/2) | 43 (42/1) |
2 | 10-fold | 241 | 129 (122/7) | 112 (110/2) | 129 (122/7) | 112 (110/2) | 129 (121/8) | 112 (111/1) |
3 | Training | 241 | 129 (127/2) | 112 (110/2) | 129 (122/7) | 112 (110/2) | 129 (127/2) | 112 (112/0) |
3 | Test 30% | 72 | 41 (41/0) | 31 (30/1) | 41 (39/2) | 31 (30/1) | 41 (40/1) | 31 (30/1) |
3 | Test 40% | 96 | 53 (53/0) | 43 (42/1) | 53 (51/2) | 43 (42/1) | 53 (53/0) | 43 (42/1) |
3 | 10-fold | 240 | 129 (127/2) | 111 (109/2) | 129 (123/6) | 112 (107/2) | 129 (125/4) | 112 (109/3) |
4 | Training | 241 | 129 (129/0) | 112 (110/2) | 129 (122/7) | 112 (110/2) | 120 (120/0) | 112 (112/0) |
4 | Test 30% | 72 | 41 (39/2) | 31 (30/1) | 41 (39/2) | 31 (30/1) | 41 (40/1) | 31 (31/0) |
4 | Test 40% | 96 | 53 (52/1) | 43 (40/3) | 53 (51/2) | 43 (42/1) | 53 (52/1) | 43 (41/2) |
4 | 10-fold | 241 | 129 (124/5) | 112 (106/6) | 129 (122/7) | 112 (110/2) | 129 (123/6) | 113 (110/3) |
5 | Training | 241 | 129 (127/2) | 112 (109/3) | 129 (122/7) | 112 (110/2) | 129 (129/0) | 112 (112/0) |
5 | Test 30% | 72 | 41 (39/2) | 31 (30/1) | 41 (39/2) | 31 (30/1) | 41 (41/0) | 31 (30/1) |
5 | Test 40% | 96 | 53 (51/2) | 43 (41/2) | 53 (51/2) | 43 (42/1) | 53 (52/1) | 43 (43/0) |
5 | 10-fold | 241 | 129 (125/4) | 112 (103/9) | 129 (122/7) | 112 (110/2) | 129 (125/4) | 112 (107/5) |
Total P: total positives, total N: total negatives, TP: true positives, TN: true negatives, FP: false positives, FN: false negatives.
Model No | Data set | Specificity (DT) | Specificity (SV) | Specificity (RF) | Sensitivity (DT) | Sensitivity (SV) | Sensitivity (RF) | Accuracy (DT) | Accuracy (SV) | Accuracy (RF) | Precision (DT) | Precision (SV) | Precision (RF) | G-mean (DT) | G-mean (SV) | G-mean (RF) | F-measure (DT) | F-measure (SV) | F-measure (RF) | MCC (DT) | MCC (SV) | MCC (RF) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Training | 0.99 | 0.98 | 1.00 | 0.99 | 0.94 | 1.00 | 0.99 | 0.96 | 1.00 | 0.99 | 0.98 | 1.00 | 0.99 | 0.96 | 1.00 | 0.99 | 0.96 | 1.00 | 0.98 | 0.93 | 1.00 |
1 | Test 30% | 0.97 | 0.98 | 1.00 | 1.00 | 0.94 | 1.00 | 0.99 | 0.96 | 1.00 | 1.00 | 0.97 | 1.00 | 0.99 | 0.96 | 1.00 | 0.99 | 0.96 | 1.00 | 0.97 | 0.92 | 1.00 |
1 | Test 40% | 0.98 | 0.98 | 0.98 | 1.00 | 0.95 | 0.98 | 0.99 | 0.97 | 0.98 | 1.00 | 0.98 | 0.98 | 0.99 | 0.97 | 0.98 | 0.99 | 0.97 | 0.98 | 0.98 | 0.94 | 0.96 |
1 | 10-fold | 0.96 | 0.98 | 0.95 | 0.98 | 0.94 | 0.98 | 0.97 | 0.96 | 0.96 | 0.98 | 0.98 | 0.98 | 0.97 | 0.96 | 0.96 | 0.97 | 0.96 | 0.96 | 0.93 | 0.93 | 0.93 |
2 | Training | 0.94 | 0.98 | 0.93 | 0.98 | 0.94 | 0.99 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.99 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.93 | 0.93 | 0.93 |
2 | Test 30% | 0.94 | 0.98 | 0.94 | 0.98 | 0.94 | 0.98 | 0.96 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.92 | 0.92 | 0.92 |
2 | Test 40% | 0.95 | 0.98 | 0.95 | 0.98 | 0.95 | 0.98 | 0.97 | 0.97 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.94 | 0.94 | 0.94 |
2 | 10-fold | 0.94 | 0.98 | 0.93 | 0.98 | 0.94 | 0.99 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.99 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.93 | 0.93 | 0.93 |
3 | Training | 0.98 | 0.98 | 0.98 | 0.98 | 0.94 | 1.00 | 0.98 | 0.96 | 0.99 | 0.98 | 0.98 | 1.00 | 0.98 | 0.96 | 0.99 | 0.98 | 0.96 | 0.99 | 0.97 | 0.93 | 0.98 |
3 | Test 30% | 1.00 | 0.98 | 0.97 | 0.98 | 0.94 | 0.98 | 0.99 | 0.96 | 0.97 | 0.98 | 0.97 | 0.97 | 0.98 | 0.96 | 0.97 | 0.99 | 0.96 | 0.98 | 0.97 | 0.92 | 0.94 |
3 | Test 40% | 1.00 | 0.98 | 1.00 | 0.98 | 0.95 | 0.98 | 0.99 | 0.97 | 0.99 | 0.98 | 0.98 | 0.98 | 0.99 | 0.97 | 0.99 | 0.99 | 0.97 | 0.99 | 0.98 | 0.94 | 0.98 |
3 | 10-fold | 0.98 | 0.96 | 0.96 | 0.98 | 0.95 | 0.98 | 0.98 | 0.95 | 0.97 | 0.98 | 0.96 | 0.98 | 0.98 | 0.95 | 0.97 | 0.98 | 0.96 | 0.97 | 0.97 | 0.91 | 0.94 |
4 | Training | 1.00 | 0.98 | 1.00 | 0.98 | 0.94 | 1.00 | 0.99 | 0.96 | 1.00 | 0.98 | 0.98 | 1.00 | 0.99 | 0.96 | 1.00 | 0.99 | 0.96 | 1.00 | 0.98 | 0.93 | 1.00 |
4 | Test 30% | 0.94 | 0.98 | 0.97 | 0.98 | 0.94 | 1.00 | 0.96 | 0.96 | 0.99 | 0.97 | 0.97 | 1.00 | 0.96 | 0.96 | 0.99 | 0.96 | 0.96 | 0.99 | 0.92 | 0.92 | 0.97 |
4 | Test 40% | 0.98 | 0.98 | 0.98 | 0.95 | 0.95 | 0.96 | 0.96 | 0.97 | 0.97 | 0.94 | 0.98 | 0.96 | 0.96 | 0.97 | 0.97 | 0.96 | 0.97 | 0.97 | 0.92 | 0.94 | 0.94 |
4 | 10-fold | 0.95 | 0.98 | 0.95 | 0.95 | 0.94 | 0.98 | 0.95 | 0.96 | 0.96 | 0.95 | 0.98 | 0.98 | 0.95 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.91 | 0.93 | 0.93 |
5 | Training | 0.98 | 0.98 | 1.00 | 0.98 | 0.94 | 1.00 | 0.98 | 0.96 | 1.00 | 0.98 | 0.98 | 1.00 | 0.98 | 0.96 | 1.00 | 0.98 | 0.96 | 1.00 | 0.96 | 0.93 | 1.00 |
5 | Test 30% | 0.94 | 0.98 | 1.00 | 0.98 | 0.94 | 0.98 | 0.96 | 0.96 | 0.99 | 0.97 | 0.97 | 0.98 | 0.96 | 0.96 | 0.98 | 0.96 | 0.96 | 0.99 | 0.92 | 0.92 | 0.97 |
5 | Test 40% | 0.95 | 0.98 | 0.98 | 0.96 | 0.95 | 1.00 | 0.96 | 0.97 | 0.99 | 0.96 | 0.98 | 1.00 | 0.96 | 0.97 | 0.99 | 0.96 | 0.97 | 0.99 | 0.92 | 0.94 | 0.98 |
5 | 10-fold | 0.96 | 0.98 | 0.96 | 0.93 | 0.94 | 0.96 | 0.95 | 0.96 | 0.96 | 0.93 | 0.98 | 0.96 | 0.94 | 0.96 | 0.96 | 0.95 | 0.96 | 0.97 | 0.89 | 0.93 | 0.92 |
DT: decision tree, SV: support vector machine, RF: random forest.
All the developed classification models were validated by 10-fold cross-validation and test set methods. In the test set method, 30% and 40% of the molecules in the data set were used as test sets to validate the models. The classification performance of the models constructed through all the methods was assessed with the confusion matrix. These models correctly classified about 95% or more of the molecules as having highly favourable or less favourable solvation free energies. The true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) of the classified molecules are provided in Table 2. The statistical parameters such as sensitivity, specificity, precision and negative predictive values calculated from the confusion matrix are >0.9 for all the models for the training set, the test sets and the 10-fold cross-validation analyses. MCC measures the quality of a classification model on a scale from −1 to +1. An MCC value of 0 indicates an average or random prediction, −1 indicates the worst prediction, and +1 represents a perfect prediction.32,33 An MCC value above 0.4 is considered to be predictive in classification studies. In our analysis, the MCC values range from 0.89 to 1.00 across all the models and methods (Table 3), revealing that the models are significant. In the random forest method, some models produce MCC values of 1, suggesting that those models classified the molecules perfectly. The G-mean is a statistical parameter that measures the overall performance of models and is used to check the balance of the predicted solvation free energy categories of the molecules. The models in this study yielded G-mean values ≥0.94, indicating that the models predict the solvation free energies of the molecules in a balanced way. Based on the discussed statistical parameters, the models provided significant accuracy (≥0.95) with respect to all the studied classification methods.
These results reveal that the models developed using the physicochemical descriptors and fingerprints accurately classified the solvation free energies of the molecules as highly favourable or less favourable.
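The statistical parameters discussed above follow directly from the four confusion-matrix counts. The sketch below uses the standard textbook definitions of each parameter, which is an assumption because the formulas are not spelled out in the text, and applies them to the decision tree training counts reported for Model 1 (TP = 128, FN = 1, TN = 111, FP = 1). The function name is hypothetical.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix statistics (assumes both classes are non-empty)."""
    sensitivity = tp / (tp + fn)            # fraction of positives recovered
    specificity = tn / (tn + fp)            # fraction of negatives recovered
    precision = tp / (tp + fp)              # fraction of predicted positives that are correct
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "g_mean": g_mean,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Model 1, decision tree, training set counts from the confusion table.
metrics = classification_metrics(tp=128, fn=1, tn=111, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these counts give 0.99 for every parameter except the MCC, which comes out at 0.98, in agreement with the values reported for that model.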
Model no. (original) | Model code | SRD | %p, x 〈SRD〉 = x (lower) | %p, x 〈SRD〉 = x (upper) |
---|---|---|---|---|
SV-3 | 8 | 68 | 6.46 × 10^−8 | 7.69 × 10^−8 |
DT-2 | 2 | 93 | 6.01 × 10^−6 | 7.22 × 10^−6 |
SV-1 | 6 | 93 | 6.01 × 10^−6 | 7.22 × 10^−6 |
SV-2 | 7 | 93 | 6.01 × 10^−6 | 7.22 × 10^−6 |
SV-4 | 9 | 93 | 6.01 × 10^−6 | 7.22 × 10^−6 |
SV-5 | 10 | 93 | 6.01 × 10^−6 | 7.22 × 10^−6 |
DT-1 | 1 | 97 | 1.18 × 10^−5 | 1.41 × 10^−5 |
RF-2 | 12 | 100 | 1.96 × 10^−5 | 2.28 × 10^−5 |
DT-5 | 5 | 117 | 2.73 × 10^−4 | 3.20 × 10^−4 |
RF-3 | 13 | 127 | 1.14 × 10^−3 | 1.32 × 10^−3 |
RF-5 | 15 | 151 | 2.43 × 10^−2 | 2.75 × 10^−2 |
RF-4 | 14 | 152 | 2.75 × 10^−2 | 3.06 × 10^−2 |
DT-4 | 4 | 158 | 5.40 × 10^−2 | 6.00 × 10^−2 |
RF-1 | 11 | 183 | 0.63 | 0.70 |
XX1 | – | 209 | 4.94 | 5.25 |
DT-3 | 3 | 227 | 13.08 | 13.76 |
Q1 | – | 240 | 24.61 | 25.61 |
Med | – | 262 | 49.91 | 51.15 |
Q3 | – | 284 | 74.25 | 75.26 |
XX19 | – | 315 | 94.70 | 95.02 |
DT: decision tree, SV: support vector machine, RF: random forest, XX1: first icosaile (5%), Q1: first quartile, Med: median, Q3: last quartile, XX19: last icosaile (95%).
To compare the performance of each method (and model), the sum of ranking differences (SRD) was calculated for all the models. This procedure ranks models, methods, techniques, etc. against a reference using scaled (calculated) variables, and it provides a refined scale for ranking even when the differences among the results (methods, models, etc.) are very small. The smaller the SRD value, the closer the ranking produced by a model is to the reference and the better the model; a larger SRD value indicates a poorer model. The results derived from the SRD analysis are provided in Table 4 and graphically represented in Fig. 1. These results reveal that the support vector machine models generally yielded the lowest SRD values and therefore ranked among the best models. It is interesting that the models developed with only fingerprint descriptors also provided low SRD values in all the studied approaches. Hence, it is important to investigate the kind of fingerprints (substructures, atoms, groups, etc.) present in molecules with highly favourable and less favourable solvation free energies in order to design novel molecules with appropriate solvation free energies.
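The core of an SRD calculation can be sketched as follows: rank the objects according to the reference, rank them according to each model, and sum the absolute rank differences. This is a simplified illustration with hypothetical numbers; it assumes ties share their average rank and it omits the comparison with random rankings that produces the %p values in the ranking table.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for j in range(pos, end + 1):
            result[order[j]] = average
        pos = end + 1
    return result


def srd(model_values, reference_values):
    """Sum of absolute differences between model ranks and reference ranks."""
    model_ranks = ranks(model_values)
    reference_ranks = ranks(reference_values)
    return sum(abs(m - r) for m, r in zip(model_ranks, reference_ranks))


reference = [1.0, 2.0, 3.0, 4.0]       # hypothetical reference values
model_a = [10.0, 20.0, 30.0, 40.0]     # same ordering as the reference
model_b = [40.0, 30.0, 20.0, 10.0]     # reversed ordering
print(srd(model_a, reference))  # 0.0
print(srd(model_b, reference))  # 8.0
```

A model that orders the objects exactly as the reference does scores zero, and the score grows as the two orderings diverge.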
Interestingly, the presence of halogen atoms is associated with higher, i.e. less favourable, solvation free energies (>−3 kcal mol−1), and molecules with long aliphatic chains have higher solvation free energies than those containing aromatic or aliphatic rings. These results demonstrate that the presence of these substructures contributes to the variation in solvation free energies.
The electrotopological state descriptors are designated by an E-state symbol that comprises three parts. The first part is “S,” which denotes the sum of the E-state values of all atoms of the same type in the molecule. The second part is a string representing the bond types associated with that atom (“s” for single bond, “d” for double, “t” for triple and “a” for aromatic). Finally, the third part is a symbol for the set of atoms in the hydride group, such as CH3, CH2, OH, Br, or NH. SHdsCH is the hydrogen atom-type electrotopological state index for =CH– groups.34–36 Another descriptor present in this category is ETA_AlphaP, one of the extended topochemical atom (ETA) indices, which are topological descriptors derived from the modification and refinement of the topologically arrived unique (TAU) scheme parameters of the 1980s.37,38
$$\mathrm{P\_VSA}_k = \sum_i v_i\,\delta\!\left(P_i \in [a_{k-1}, a_k)\right), \qquad k = 1, 2, 3, \ldots, n \tag{1}$$
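Eqn (1) can be read as a binning rule: each atom contributes its surface area to the one range that contains its property value. The sketch below assumes that v_i is the van der Waals surface area contribution of atom i, that P_i is its atomic property value (for example its SlogP or SMR contribution), and that the ranges are half-open, [a_{k−1}, a_k). The function name and the numbers are hypothetical.

```python
def p_vsa(areas, properties, boundaries):
    """Return [P_VSA_1, ..., P_VSA_n] for boundaries a_0 < a_1 < ... < a_n."""
    bins = [0.0] * (len(boundaries) - 1)
    for area, prop in zip(areas, properties):
        for k in range(1, len(boundaries)):
            # delta(...) equals 1 only for the range that contains prop
            if boundaries[k - 1] <= prop < boundaries[k]:
                bins[k - 1] += area
                break
    return bins


# Hypothetical three-atom example with three property ranges.
areas = [10.0, 5.0, 2.5]          # v_i, surface area per atom
properties = [0.10, 0.60, 0.15]   # P_i, property value per atom
boundaries = [0.0, 0.2, 0.5, 1.0]
print(p_vsa(areas, properties, boundaries))  # [12.5, 0.0, 5.0]
```

Atoms whose property value falls outside all the ranges contribute to no descriptor in this sketch.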
The Vsurf_Wp descriptors describe the polar volume (either polarisability and dispersion forces or hydrogen bond acceptor–donor regions) of the molecule and are calculated at eight different energy levels (−0.2, −0.5, −1.0, −2.0, −3.0, −4.0, −5.0 and −6.0 kcal mol−1); this volume may be defined as the molecular envelope accessible to solvent (water) molecules. Other Vsurf descriptors such as Vsurf_A and Vsurf_EWmin1 represent the amphiphilic moment and the lowest hydrophilic energy of the molecule, respectively.
Our analysis was performed with easily calculable descriptors and freely available modelling tools. Earlier reports on the prediction of solvation free energies presented quantitative models that used costly computational algorithms.13,14 The results obtained from our study are significant and can be improved with more sophisticated methods and algorithms; the resulting models can be used along with quantitative studies to reduce computing power and time consumption. Furthermore, this study supports the further development of quantitative models for the prediction of the solvation free energies of organic molecules and aids in the design of novel molecules with acceptable solvation free energies.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c4ra07961b
This journal is © The Royal Society of Chemistry 2014