Classification study of solvation free energies of organic molecules using machine learning techniques

N. S. Hari Narayana Moorthy*, Silvia A. Martins, Sergio F. Sousa, Maria J. Ramos and Pedro A. Fernandes*
REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, s/n, Rua do Campo Alegre, 4169-007 Porto, Portugal. E-mail: hari.moorthy@fc.up.pt; hari.nmoorthy@gmail.com; pafernan@fc.up.pt; Fax: +351-220-402-506; Tel: +351-220-402-506

Received 1st August 2014, Accepted 3rd November 2014

First published on 3rd November 2014


Abstract

In this work, we have developed a set of classification models to categorise organic molecules with respect to their solvation free energies using different machine learning approaches (decision tree, random forest and support vector machine). The solvation free energies of the molecules (experimental values obtained from the literature) were split into highly favourable (<−3 kcal mol−1) and less favourable (>−3 kcal mol−1) values; −3 kcal mol−1 was set as the threshold value for the classification model development. The MACCS fingerprint along with a set of physicochemical descriptors such as atom count, topology, vdW surface area (volsurf) and subdivided surface area contributed to the classification models. The validation studies using test set and 10-fold cross-validation methods provide statistical parameters such as accuracy, sensitivity and specificity with values above 90%. The sum of ranking difference (SRD) analysis reveals that the support vector machine models are comparatively significant, while the models containing MACCS fingerprints are ranked as good models in all approaches. The MACCS fingerprints indicate that the presence of halogen atoms causes less favourable solvation free energies. However, the presence of polar atoms/groups and some functional groups such as heteroatoms, double bonded branched aliphatic chains, C=N, N–C–C–O, NCO, >1 heterocyclic atoms, OCO, etc. causes highly favourable solvation free energies. The results derived from these investigations can be used along with some quantitative models to predict the solvation free energies of organic molecules and to design novel molecules with acceptable solvation free energies.


Introduction

Solvent accessible surface area is an important analytical tool used by biologists to characterise the hydrophobic and/or hydrophilic nature of the exposed molecular surface. These surface area properties are used to calculate the solvation free energies of the molecules.1 The majority of biological processes take place in solution. Thus, solvation effects are an essential part in the analysis of reactions that occur in the liquid phase, with water being the solvent par excellence. Solvation free energy (ΔGsolv) is the amount of energy necessary to transfer a molecule from the gas phase to a solvated environment.2,3

Protein–ligand binding and the transport of drugs across membranes are closely connected to the solvation free energy, as it is an important component of binding free energy. Molecules that are important in the chemical, biological and pharmaceutical sciences are usually polyfunctional (e.g., drug molecules). The exposure or protection of chemical groups from solvent influences the binding process and involves the thermodynamic process of ligand desolvation. Therefore, the determination of ΔGsolv is a valuable objective with significance in the study of chemical/biochemical processes that has been pursued since the beginning of computer-aided drug design.4,5 The ΔGsolv of a molecule is an important thermodynamic property that is affected by the groups that constitute the molecule and the physicochemical features of the molecule.2 Earlier experiments demonstrated that the contributions of the electrostatic and nonpolar parts of a molecule determine its solvation free energy.6–11 The nonpolar contribution is usually modelled as proportional to the solvation surface area. The electrostatic term dominates the total solvation free energy of the molecule, although it does not always indicate a high affinity.12 This showed that the physicochemical features of molecules cause variations in their free energies of solvation. Hence, an analysis was carried out to investigate the important physicochemical properties and topological features responsible for the free energy of solvation. Furthermore, classification analysis (qualitative analysis) was used to categorise the molecules based on their solvation free energies using different machine learning approaches (quantitative analysis requires more computational cost, more time, precise experimental activities, etc.). 
The reported quantitative models for solvation free energy prediction required considerable computing power.13,14 To simplify the analysis, initial qualitative models developed with the same data set can support the development of quantitative models in less time and with more precision.

Machine learning is a field of artificial intelligence used to extract characteristics of interest from a data set where their underlying probability distribution is unknown. Machine learning focuses on prediction based on known properties learned from the training data.15 These methods use different algorithms for classification and are evaluated based on their generalisation capability, which is their ability to successfully apply the learned knowledge to unseen data. Generally, supervised and non-supervised machine learning methods are available for classification analysis.16 In the present study, we have used some supervised machine learning methods to classify the solvation free energies of organic molecules as highly favourable and less favourable.

Computational methods

Data set

A data set comprising 241 organic molecules and their experimental solvation free energies was retrieved from the literature (Table S1).3,4,12,13,17–22 The Molecular Operating Environment (MOE) software was used to calculate the physicochemical descriptors of the molecules. The semi-empirical MOPAC program with the Austin Model 1 (AM1) Hamiltonian and an RMS gradient of 0.05 was used to optimise the molecules for the calculation of volume surface (Volsurf) descriptors.23,24 Additionally, 2D descriptors of the molecules were calculated using the PaDEL software.25

MACCS fingerprints

MACCS fingerprints for the data set compounds were calculated using PaDEL software. The 166 MACCS structural keys (fingerprints) are computed from the molecular graph and represent a list of keys (fragments/substructures) present in the molecules.

Machine learning methods

In this study, the support vector machine, random forest and decision tree approaches were used for the classification analysis with the help of Weka software.26 Decision tree is one of the predictive modelling approaches used in statistics, data mining and machine learning. It is a classification method that predicts the value of a dependent attribute (variable) through the given values of the independent (input) attributes (variables). Decision tree classifies instances by sorting them down the tree from the root to some leaf node.
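The root-to-leaf sorting described above can be sketched in a few lines. This is a minimal illustration only: the descriptor names are taken from Table 1, but the tree structure and the split thresholds below are hypothetical and are not the trees obtained with Weka in this study.

```python
def classify(tree, instance):
    """Sort one instance down the tree from the root to a leaf; return the leaf label."""
    node = tree
    while "label" not in node:
        # Go left when the descriptor value is at or below the split threshold.
        branch = "left" if instance[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


# Hypothetical toy tree (invented thresholds, for illustration only).
TOY_TREE = {
    "feature": "Lip_don", "threshold": 0.5,
    "left": {
        "feature": "Lip_acc", "threshold": 1.5,
        "left": {"label": "less favourable"},
        "right": {"label": "highly favourable"},
    },
    "right": {"label": "highly favourable"},
}

print(classify(TOY_TREE, {"Lip_don": 0, "Lip_acc": 0}))
print(classify(TOY_TREE, {"Lip_don": 1, "Lip_acc": 0}))
```

Each internal node tests a single independent variable, so the path taken by a molecule can be read as a short list of descriptor conditions.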

Random forests, proposed by Breiman (2001), is an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time. Random forests change how the classification or regression trees are constructed using different bootstrap samples. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best split among a subset of predictors randomly chosen at that node.27,28
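The two sources of randomness mentioned above (a bootstrap sample per tree and a random subset of predictors per node) can be illustrated as follows. This is a sketch under stated assumptions, not the Weka implementation: the helper names, the seed and the subset size of 2 are chosen only for illustration, and tree growing itself is omitted.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree is grown on one such sample."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(feature_names, m, rng):
    """Pick m predictors at random; only these compete for the best split at a node."""
    return rng.sample(feature_names, m)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-ins for ten molecules
features = ["a_nH", "a_nN", "KierA3", "Lip_acc", "Lip_don", "SlogP_VSA0"]

sample = bootstrap_sample(rows, rng)
subset = candidate_features(features, 2, rng)
print(sample)
print(subset)
```

Because rows are drawn with replacement, some molecules typically appear more than once in a sample while others are left out of that tree entirely.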

Support vector machines are supervised learning models with associated learning algorithms for classification study; they are based on the concept of decision planes that define decision boundaries. A decision plane separates a set of objects with different classes of memberships.29

Sum of ranking difference (SRD)

The ranking analysis was performed using the software CRRN_DNA and SRDrep (SRD with ties) (downloadable from: http://aki.ttk.mta.hu/srd or http://goliat.eik.bme.hu/%7Ekollarne/CRRN). The calculated performance parameters (statistical parameters) such as specificity, sensitivity, precision, accuracy, G-mean, F-measure and Matthews correlation coefficient (MCC) were used for the SRD analysis of the developed models.30,31

Results and discussion

Classification models

Classification models with the ability to predict the solvation free energy of molecules were developed based on a data set comprised of 241 molecules using different approaches such as decision tree, random forest and support vector machine. The physicochemical descriptors calculated from MOE and PaDEL software and the MACCS fingerprints of the molecules were used as independent variables in the classification studies. There are five classification models for each approach (algorithm), and each model was comprised of different descriptors: MOE descriptors in model 1, fingerprints in model 2, PaDEL descriptors in model 3, fingerprint and MOE descriptors in model 4 and all the descriptors (MOE, PaDEL and fingerprints) in model 5. Initially, the descriptor pool was reduced using stepwise regression and principal component analysis. The pruned descriptors were used for the classification studies, and the descriptors contributing to the models are provided in Table 1. Many classification models were constructed with different descriptors and approaches because a single model/method does not give the best result for any data set; hence, multiple models/methods are needed to construct classification models and compare their results.
Table 1 Physicochemical and fingerprint descriptors contributing to each model
Model 1: a_nH, a_nN, KierA3, Lip_acc, Lip_don, SlogP_VSA0, SMR_VSA1, Vsurf_CW3, Vsurf_W2
Model 2: MACCSFP49, MACCSFP88, MACCSFP103, MACCSFP104, MACCSFP107, MACCSFP121, MACCSFP127, MACCSFP134, MACCSFP139, MACCSFP151, MACCSFP156, MACCSFP157, MACCSFP161
Model 3: nN, SHdsCH, ETA_AlphaP, nHBAcc_Lipinski, nHBDon, TopoPSA
Model 4: MACCSFP134, MACCSFP161, a_nN, Lip_acc, Lip_don, SlogP_VSA0, SMR_VSA7, Vsurf_EWmin1, Vsurf_W2, Vsurf_Wp3
Model 5: nN, nHBAcc_Lipinski, nHBDon, MACCSFP134, MACCSFP161, SlogP_VSA0, SMR_VSA7, Vsa_pol, Vsurf_A, Vsurf_CW1, Vsurf_CW2


In this analysis, solvation free energy values <−3 kcal mol−1 were considered to be highly favourable, while those >−3 kcal mol−1 were classified as less favourable. The results derived from all three approaches are provided in Tables 2 and 3. The analysis was also performed with other threshold values (<−1 kcal mol−1 and >−1 kcal mol−1); unfortunately, that data set did not provide a balanced number of compounds classified as having highly favourable and less favourable solvation free energies. The abovementioned threshold value (<−3 kcal mol−1 and >−3 kcal mol−1) yielded better classification models with significant statistical parameters; hence these models are discussed herein. The physicochemical descriptors contributing to the models were able to classify the solvation free energies of molecules as highly favourable and less favourable.
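The effect of the threshold on class balance can be checked with a short script. The values below are hypothetical and are not taken from Table S1; the treatment of a value exactly equal to the threshold is also an assumption, since the text only specifies < and >.

```python
from collections import Counter

THRESHOLD = -3.0  # kcal/mol, the cut-off used in the text


def solvation_class(dg_solv, threshold=THRESHOLD):
    """Return the class label for one solvation free energy (kcal/mol)."""
    # Assumption: a value exactly equal to the threshold counts as less favourable.
    return "highly favourable" if dg_solv < threshold else "less favourable"


def class_counts(values, threshold):
    """Count how many values fall into each class for a given threshold."""
    return Counter(solvation_class(v, threshold) for v in values)


# Hypothetical solvation free energies, for illustration only.
example_values = [-6.3, -4.9, -3.1, -2.5, -0.8, 1.9]

print(class_counts(example_values, -3.0))
print(class_counts(example_values, -1.0))
```

On this toy list the −3 kcal mol−1 threshold gives an even split, whereas −1 kcal mol−1 shifts more values into the highly favourable class, which mirrors the balance argument made above.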

Table 2 Confusion matrix of the classification models (a)
Columns: Model no.; Data set; Total; Decision tree Total P (TP/FN); Decision tree Total N (TN/FP); Support vector machine Total P (TP/FN); Support vector machine Total N (TN/FP); Random forest Total P (TP/FN); Random forest Total N (TN/FP)
(a) Total P: total positives, total N: total negatives, TP: true positives, TN: true negatives, FP: false positives, FN: false negatives.
1 Training 241 129 (128/1) 112 (111/1) 129 (122/7) 112 (110/2) 129 (129/0) 112 (112/0)
Test 30% 72 41 (40/1) 31 (31/0) 41 (39/2) 31 (30/1) 41 (41/0) 31 (31/0)
Test 40% 96 53 (51/1) 43 (43/0) 53 (51/2) 43 (42/1) 53 (52/1) 43 (42/1)
10-fold 241 129 (124/5) 112 (109/3) 129 (122/7) 112 (110/2) 129 (123/6) 112 (109/3)
2 Training 241 129 (122/7) 112 (110/2) 129 (122/2) 112 (110/2) 129 (121/8) 112 (111/1)
Test 30% 72 41 (39/2) 31 (30/1) 41 (39/2) 31 (30/1) 41 (39/2) 31 (30/1)
Test 40% 96 53 (51/2) 43 (42/1) 53 (51/2) 43 (42/1) 53 (51/2) 43 (42/1)
10-fold 241 129 (122/7) 112 (110/2) 129 (122/7) 112 (110/2) 129 (121/8) 112 (111/1)
3 Training 241 129 (127/2) 112 (110/2) 129 (122/7) 112 (110/2) 129 (127/2) 112 (112/0)
Test 30% 72 41 (41/0) 31 (30/1) 41 (39/2) 31 (30/1) 41 (40/1) 31 (30/1)
Test 40% 96 53 (53/0) 43 (42/1) 53 (51/2) 43 (42/1) 53 (53/0) 43 (42/1)
10-fold 240 129 (127/2) 111 (109/2) 129 (123/6) 112 (107/2) 129 (125/4) 112 (109/3)
4 Training 241 129 (129/0) 112 (110/2) 129 (122/7) 112 (110/2) 120 (120/0) 112 (112/0)
Test 30% 72 41 (39/2) 31 (30/1) 41 (39/2) 31 (30/1) 41 (40/1) 31 (31/0)
Test 40% 96 53 (52/1) 43 (40/3) 53 (51/2) 43 (42/1) 53 (52/1) 43 (41/2)
10-fold 241 129 (124/5) 112 (106/6) 129 (122/7) 112 (110/2) 129 (123/6) 113 (110/3)
5 Training 241 129 (127/2) 112 (109/3) 129 (122/7) 112 (110/2) 129 (129/0) 112 (112/0)
Test 30% 72 41 (39/2) 31 (30/1) 41 (39/2) 31 (30/1) 41 (41/0) 31 (30/1)
Test 40% 96 53 (51/2) 43 (41/2) 53 (51/2) 43 (42/1) 53 (52/1) 43 (43/0)
10-fold 241 129 (125/4) 112 (103/9) 129 (122/7) 112 (110/2) 129 (125/4) 112 (107/5)


Table 3 Statistical parameters calculated through classification analysis (a)
Columns: Model no.; Data set; then Specificity, Sensitivity, Accuracy, Precision, G-mean, F-measure and MCC, each reported as three values in the order DT, SV, RF
(a) DT: decision tree, SV: support vector machine, RF: random forest.
1 Training 0.99 0.98 1.00 0.99 0.94 1.00 0.99 0.96 1.00 0.99 0.98 1.00 0.99 0.96 1.00 0.99 0.96 1.00 0.98 0.93 1.00
Test (30%) 0.97 0.98 1.00 1.00 0.94 1.00 0.99 0.96 1.00 1.00 0.97 1.00 0.99 0.96 1.00 0.99 0.96 1.00 0.97 0.92 1.00
Test (40%) 0.98 0.98 0.98 1.00 0.95 0.98 0.99 0.97 0.98 1.00 0.98 0.98 0.99 0.97 0.98 0.99 0.97 0.98 0.98 0.94 0.96
10-fold 0.96 0.98 0.95 0.98 0.94 0.98 0.97 0.96 0.96 0.98 0.98 0.98 0.97 0.96 0.96 0.97 0.96 0.96 0.93 0.93 0.93
2 Training 0.94 0.98 0.93 0.98 0.94 0.99 0.96 0.96 0.96 0.98 0.98 0.99 0.96 0.96 0.96 0.96 0.96 0.96 0.93 0.93 0.93
Test 30% 0.94 0.98 0.94 0.98 0.94 0.98 0.96 0.96 0.96 0.97 0.97 0.97 0.96 0.96 0.96 0.96 0.96 0.96 0.92 0.92 0.92
Test 40% 0.95 0.98 0.95 0.98 0.95 0.98 0.97 0.97 0.97 0.98 0.98 0.98 0.97 0.97 0.97 0.97 0.97 0.97 0.94 0.94 0.94
10-fold 0.94 0.98 0.93 0.98 0.94 0.99 0.96 0.96 0.96 0.98 0.98 0.99 0.96 0.96 0.96 0.96 0.96 0.96 0.93 0.93 0.93
3 Training 0.98 0.98 0.98 0.98 0.94 1.00 0.98 0.96 0.99 0.98 0.98 1.00 0.98 0.96 0.99 0.98 0.96 0.99 0.97 0.93 0.98
Test 30% 1.00 0.98 0.97 0.98 0.94 0.98 0.99 0.96 0.97 0.98 0.97 0.97 0.98 0.96 0.97 0.99 0.96 0.98 0.97 0.92 0.94
Test 40% 1.00 0.98 1.00 0.98 0.95 0.98 0.99 0.97 0.99 0.98 0.98 0.98 0.99 0.97 0.99 0.99 0.97 0.99 0.98 0.94 0.98
10-fold 0.98 0.96 0.96 0.98 0.95 0.98 0.98 0.95 0.97 0.98 0.96 0.98 0.98 0.95 0.97 0.98 0.96 0.97 0.97 0.91 0.94
4 Training 1.00 0.98 1.00 0.98 0.94 1.00 0.99 0.96 1.00 0.98 0.98 1.00 0.99 0.96 1.00 0.99 0.96 1.00 0.98 0.93 1.00
Test 30% 0.94 0.98 0.97 0.98 0.94 1.00 0.96 0.96 0.99 0.97 0.97 1.00 0.96 0.96 0.99 0.96 0.96 0.99 0.92 0.92 0.97
Test 40% 0.98 0.98 0.98 0.95 0.95 0.96 0.96 0.97 0.97 0.94 0.98 0.96 0.96 0.97 0.97 0.96 0.97 0.97 0.92 0.94 0.94
10-fold 0.95 0.98 0.95 0.95 0.94 0.98 0.95 0.96 0.96 0.95 0.98 0.98 0.95 0.96 0.96 0.96 0.96 0.96 0.91 0.93 0.93
5 Training 0.98 0.98 1.00 0.98 0.94 1.00 0.98 0.96 1.00 0.98 0.98 1.00 0.98 0.96 1.00 0.98 0.96 1.00 0.96 0.93 1.00
Test 30% 0.94 0.98 1.00 0.98 0.94 0.98 0.96 0.96 0.99 0.97 0.97 0.98 0.96 0.96 0.98 0.96 0.96 0.99 0.92 0.92 0.97
Test 40% 0.95 0.98 0.98 0.96 0.95 1.00 0.96 0.97 0.99 0.96 0.98 1.00 0.96 0.97 0.99 0.96 0.97 0.99 0.92 0.94 0.98
10-fold 0.96 0.98 0.96 0.93 0.94 0.96 0.95 0.96 0.96 0.93 0.98 0.96 0.94 0.96 0.96 0.95 0.96 0.97 0.89 0.93 0.92


All the developed classification models were validated by 10-fold cross-validation and test set methods. In the test set method, 30% and 40% of the molecules in the data set were used as test sets to validate the models. The classification performance of the models constructed through all the methods was assessed with the confusion matrix. These models correctly classified approximately 95% or more of the molecules as having highly favourable or less favourable solvation free energies. The true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) of the classified molecules are provided in Table 2, and the statistical parameters derived from them are provided in Table 3. The statistical parameters such as sensitivity, specificity, precision and negative predictive values calculated from the confusion matrix are >0.9 for all the models in response to the training set, the test set and the 10-fold cross-validation analyses. MCC measures the quality of a classification model on a scale from −1 to +1. An MCC value of 0 indicates an average or random prediction, −1 indicates the worst prediction, and +1 represents a perfect prediction.32,33 An MCC value above 0.4 is considered to be predictive in classification studies. In our analysis, the MCC values are ≥0.89 for all the models (through all the methods; Table 3), revealing that the models are significant. In the random forest method, some models produce MCC values of 1, suggesting that those models classified the molecules perfectly. The G-mean is a statistical parameter that measures the overall performance of models and is used to check the balance of the predicted solvation free energy categories of the molecules. The models in this study yielded G-mean values ≥0.94, indicating that the models predict the solvation free energies of the molecules in a balanced way. Based on the discussed statistical parameters, the models provided significant accuracy (≥0.95) with respect to all the studied classification methods. These results reveal that the developed models using the physicochemical descriptors and fingerprints accurately classified the solvation free energies of the molecules as highly favourable or less favourable.
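The statistical parameters discussed above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of G-mean (geometric mean of sensitivity and specificity) and F-measure (harmonic mean of precision and sensitivity); the text does not give these formulas, so they are assumed here. The counts are those listed in Table 2 for the decision tree, model 1, training set (TP/FN = 128/1, TN/FP = 111/1).

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Compute the usual binary classification statistics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Assumed standard definitions (not spelled out in the text):
    g_mean = sqrt(sensitivity * specificity)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "precision": precision,
        "g_mean": g_mean,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Decision tree, model 1, training set (counts from Table 2).
metrics = classification_metrics(tp=128, fn=1, tn=111, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values agree with the corresponding row of Table 3 (0.99 for every parameter except MCC, which is 0.98).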

Table 4 Sum of ranking difference (SRD) and p% interval of the variables of the classification analysesa
Ranking results %p
Model no. (original) Model code SRD x 〈SRD〉 = x
a DT: decision tree, SV: support vector machine, RF: random forest, XX1—first icosaile (5%), Q1—first quartile, Med—median, Q3—last quartile, XX19—last icosaile (95%).
SV-3 8 68 6.46 × 10−8 7.69 × 10−8
DT-2 2 93 6.01 × 10−6 7.22 × 10−6
SV-1 6 93 6.01 × 10−6 7.22 × 10−6
SV-2 7 93 6.01 × 10−6 7.22 × 10−6
SV-4 9 93 6.01 × 10−6 7.22 × 10−6
SV-5 10 93 6.01 × 10−6 7.22 × 10−6
DT-1 1 97 1.18 × 10−5 1.41 × 10−5
RF-2 12 100 1.96 × 10−5 2.28 × 10−5
DT-5 5 117 2.73 × 10−4 3.20 × 10−4
RF-3 13 127 1.14 × 10−3 1.32 × 10−3
RF-5 15 151 2.43 × 10−2 2.75 × 10−2
RF-4 14 152 2.75 × 10−2 3.06 × 10−2
DT-4 4 158 5.40 × 10−2 6.00 × 10−2
RF-1 11 183 0.63 0.70
  XX1 209 4.94 5.25
DT-3 3 227 13.08 13.76
  Q1 240 24.61 25.61
  Med 262 49.91 51.15
  Q3 284 74.25 75.26
  XX19 315 94.70 95.02


In order to compare the performance of each method (and model), the SRD values were calculated for all the models. This ranking is intended to compare models, methods, techniques, etc. with some scaled (calculated) variables. SRD values provide a refined scale for ranking, even when the differences among the results (methods, models, etc.) are very small. An SRD value close to zero indicates a better model (possessing dissimilar variable values), whereas a larger SRD value indicates similarity of the variables. The results derived from the SRD analysis are provided in Table 4 and graphically represented in Fig. 1. These results reveal that the support vector machine approach provided significant results, and these models ranked as better models. It is interesting that the models developed with only fingerprint descriptors provided better SRD values (compared to the other studied approaches). Hence, it is important to investigate the kind of fingerprints (substructures, atoms, groups, etc.) present in molecules with highly favourable and less favourable solvation free energies in order to design novel molecules with appropriate solvation free energies.
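The core of an SRD calculation can be sketched as follows: the rows (here, performance parameters) are ranked within each model column and within a reference column, and the absolute rank differences are summed. This is a simplified illustration and not the CRRN_DNA or SRDrep software; the use of the row-wise mean as the reference, the toy numbers and the function names are all assumptions, and the comparison with random rankings (the p% values in Table 4) is not reproduced.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def srd(column, reference):
    """Sum of absolute differences between the ranks of a column and of the reference."""
    pairs = zip(average_ranks(column), average_ranks(reference))
    return sum(abs(a - b) for a, b in pairs)


# Hypothetical performance values: one list per model, one entry per parameter.
table = {
    "model_A": [0.99, 0.94, 0.96, 0.93],
    "model_B": [0.98, 0.98, 0.97, 0.95],
    "model_C": [0.93, 0.99, 0.96, 0.97],
}
n_rows = 4
# Assumed reference: the row-wise mean over all models.
reference = [sum(col[i] for col in table.values()) / len(table) for i in range(n_rows)]

for name, column in table.items():
    print(name, srd(column, reference))
```

In this toy case model_B reproduces the reference ordering most closely and therefore receives the smallest SRD value.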


Fig. 1 SRD-CRRN results of the classification models. 1–5 = decision tree models 1–5; 6–10 = support vector machine models 1–5; 11–15 = random forest models 1–5; XX1—first icosaile (5%), Q1—first quartile, Med—median, Q3—last quartile, XX19—last icosaile (95%).

MACCS fingerprint analysis

In order to understand the structure–property relationship, the MACCS fingerprint was calculated for all the molecules based on their molecular structures. The frequency of appearance for each fingerprint (substructure) in the molecules with highly favourable and less favourable solvation free energies was calculated. This identifies the important substructures/functional groups/atoms responsible for the observed (increased and decreased) variation in the solvation free energies of the molecules. Some substructures are present in both kinds of molecules (those with highly favourable and less favourable solvation free energies), while a limited number of fingerprints are specific to one kind of molecule. The details of those fingerprints present in the molecules with different solvation free energies (threshold) are graphically represented in Fig. 2. The graphs show that the molecules with solvation free energies of <−5 kcal mol−1 specifically exhibited the following substructures: heteroatoms, double-bonded branched aliphatic chains, C=N, N–C–C–O, NCO, >1 heterocyclic atoms and OCO. These substructures are absent in molecules with solvation free energies of >−5 kcal mol−1.
Fig. 2 Graphical representation of the frequency of fingerprints on the molecules.

Interestingly, the presence of halogen atoms is associated with increased solvation free energies (>−3 kcal mol−1), and molecules with aliphatic long chains have higher solvation free energies than those containing aromatic or aliphatic rings. These results demonstrate that the presence of these substructures results in the variation in solvation free energies.
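The frequency analysis described in this section amounts to counting, for each class, how often a given key is set. A minimal sketch is given below; the 4-bit fingerprints and class labels are hypothetical stand-ins for the 166-key MACCS vectors, and the function name is illustrative.

```python
def key_frequency(fingerprints, labels, key_index):
    """Fraction of molecules in each class whose fingerprint has the given key set."""
    counts, totals = {}, {}
    for bits, lab in zip(fingerprints, labels):
        totals[lab] = totals.get(lab, 0) + 1
        counts[lab] = counts.get(lab, 0) + bits[key_index]
    return {lab: counts[lab] / totals[lab] for lab in totals}


# Hypothetical bit vectors (1 = substructure present), for illustration only.
fingerprints = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
labels = ["highly favourable", "highly favourable", "less favourable", "less favourable"]

print(key_frequency(fingerprints, labels, 0))
print(key_frequency(fingerprints, labels, 2))
```

A key whose frequency is high in one class and zero in the other, such as key 0 in this toy example, is the kind of class-specific substructure highlighted in Fig. 2.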

Description of the contributing descriptors

The classification models generated in this analysis possess descriptors from different categories to classify the molecules based on their solvation free energies. These descriptors are categorised below.
Atom count descriptors (a_nH, a_nN and nN). These atom count descriptors count the numbers of nitrogen and hydrogen atoms in the molecules.24
Topological descriptors (KierA3, SHdsCH and ETA_AlphaP). The KierA3 descriptor describes the shapes of the molecules using the third alpha shape index. It is calculated as $(s-1)(s-3)^2/p_3^2$ for odd $n$ and $(s-3)(s-2)^2/p_3^2$ for even $n$, where $s = n + \alpha$. The Kier and Hall kappa molecular shape indices compare the minimal and maximal molecular graphs and are intended to capture different aspects of molecular shape.24
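The KierA3 expression can be written as a small function. The interpretation of the symbols follows the usual Kier definitions, which the text does not spell out and which are therefore assumptions here: n is the number of atoms in the hydrogen-suppressed graph, p3 is the number of paths of length 3, and alpha is the atom-size correction term. The input values are hypothetical.

```python
def kier_a3(n, p3, alpha):
    """Third alpha shape index: odd-n and even-n forms with s = n + alpha."""
    if p3 == 0:
        raise ValueError("p3 must be non-zero (no paths of length 3 in the graph)")
    s = n + alpha
    if n % 2 == 1:
        return (s - 1) * (s - 3) ** 2 / p3 ** 2
    return (s - 3) * (s - 2) ** 2 / p3 ** 2


# Hypothetical inputs, for illustration only.
print(kier_a3(n=5, p3=2, alpha=0.0))  # odd n
print(kier_a3(n=6, p3=3, alpha=0.0))  # even n
```

With alpha set to zero the function reduces to the unmodified third kappa index, which makes the role of the correction term easy to inspect.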

The electrotopological state descriptors are designated by the E-state symbol that comprises three parts. The first part is “S,” which is the sum of the E-state values of all atoms of the same type in the molecule. The second part is a string representing the bond types associated with that atom (“s” for single bond, “d” for double, “t” for triple and “a” for aromatic). Finally, the third part is a symbol for the set of atoms in the hydride group, such as CH3, CH2, OH, Br, or NH. The SHdsCH is the hydrogen atom-type electrotopological state index for =CH– groups.34–36 Another descriptor present in this category is ETA_AlphaP, one of the extended topochemical atom (ETA) indices, which are topological descriptors derived from the modification and refinement of the topochemically arrived unique (TAU) scheme parameters of the 1980s.37,38

Polar descriptors (Lip_acc, Lip_don, nHBAcc_Lipinski, nHBDon, TopoPSA and Vsa_pol). These descriptors explain the number of hydrogen bond acceptor and hydrogen bond donor atoms/groups present in the molecules. The TopoPSA and Vsa_pol descriptors describe the polar properties on the van der Waals (vdW) surface area of the molecules.24
Subdivided surface area descriptors (SlogP_VSA0, SMR_VSA1 and SMR_VSA7). The subdivided surface area descriptors are based on an approximate accessible vdW surface area (VSA) calculation (in Å²) for each atom, v_i, along with another atomic property, P_i (either partition coefficient or molar refractivity). The v_i values are calculated using a connection table approximation. The properties (P_i) of small molecules can be calculated as the sum of the contributions of each of the atoms in the molecule as follows:

\[ P\_\mathrm{VSA}_k = \sum_i v_i \, \delta\big(P_i \in (a_{k-1}, a_k)\big), \qquad k = 1, 2, 3, \ldots, n \tag{1} \]

where a_0 < a_k < a_n are interval boundaries such that (a_0, a_n) bounds all values of P_i in any molecule, and δ equals 1 when its condition holds and 0 otherwise. Each VSA type descriptor can be characterised as the amount of surface area with P in a certain range. SlogP_VSA and SMR_VSA descriptors describe the partition coefficient and molar refractivity, respectively, on the vdW surface areas of the molecules. These are defined to be the sum of the v_i over all atoms i whose P_i falls in the specified range, where P_i denotes the contribution to partition coefficient or molar refractivity for atom i as calculated in the SlogP or SMR descriptor.24,39
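Eqn (1) is a binning of per-atom surface areas by a per-atom property. The sketch below illustrates it with hypothetical areas, property contributions and interval boundaries; none of the numbers are MOE values. The treatment of the upper boundary as inclusive is an assumption made so that every atom falls into exactly one bin.

```python
def p_vsa(v, p, boundaries):
    """Sum each atomic area v_i into the bin whose interval contains P_i.

    boundaries = [a_0, a_1, ..., a_n]; bin k (1-based) covers a_{k-1} < P_i <= a_k.
    """
    bins = [0.0] * (len(boundaries) - 1)
    for v_i, p_i in zip(v, p):
        for k in range(1, len(boundaries)):
            # Assumption: the upper boundary of each interval is inclusive.
            if boundaries[k - 1] < p_i <= boundaries[k]:
                bins[k - 1] += v_i
                break
    return bins


# Hypothetical per-atom data, for illustration only.
areas = [10.0, 5.0, 7.5, 2.5]      # v_i
contribs = [-0.3, 0.1, 0.1, 0.4]   # P_i
bounds = [-0.5, 0.0, 0.25, 0.5]    # a_0 ... a_3

result = p_vsa(areas, contribs, bounds)
print(result)
```

When the boundaries enclose every P_i, the bins sum to the total surface area, so the descriptors partition the surface rather than double-count it.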
Volsurf descriptors (Vsurf_CW1, Vsurf_CW2, Vsurf_CW3, Vsurf_W2, Vsurf_Wp3, Vsurf_EWmin1 and Vsurf_A). The Vsurf descriptors depend on the structural connectivity and the conformation (dimensions are measured in Å) of the molecules. They generally describe the hydrophobic and hydrophilic properties mediated by surface properties such as shape, electrostatic interaction, hydrogen bonding and hydrophobicity. The Vsurf_CW descriptor describes the capacity factor of the molecules and is calculated at different energy levels. It provides information on the amount of hydrophilic regions per unit surface area.40,41

The Vsurf_Wp descriptor describes the polar volume (either polarisability and dispersion forces or hydrogen bond acceptor–donor regions) of the molecule and is calculated at eight different energy levels (−0.2, −0.5, −1.0, −2.0, −3.0, −4.0, −5.0 and −6.0 kcal mol−1); this descriptor may be defined as the molecular envelope accessible by solvent (water) molecules. Other Vsurf descriptors such as Vsurf_A and Vsurf_EWmin1 represent the amphiphilic moment and lowest hydrophilic energy of the molecule, respectively.

MACCS fingerprints. The MACCS fingerprints explain the presence or absence of particular functional groups, atoms or fragments in different molecules. These contributing fingerprints provide information on the following structural aspects of the compounds: MACCSFP49 (charge on the molecule), MACCSFP88 (presence of sulphur atoms), MACCSFP103 (presence of chlorine atoms), MACCSFP104 (hetero atom with hydrogen and connected with CH2 through any other atom), MACCSFP107 (halogen atom connected with branched atoms (any atom (any atom) + any atom)), MACCSFP121 (nitrogen-containing heterocycles), MACCSFP127 (any atom + ring bond + any atom + non ring bond connected with O2), MACCSFP134 (halogens), MACCSFP139 (hydroxyl group), MACCSFP151 (–NH group), MACCSFP156 (N connected with branched atom as any atom (any atom) + any atom), MACCSFP157 (C–O) and MACCSFP161 (nitrogen atom).

Conclusion

In conclusion, all the developed models yielded values of about 0.9 or higher for statistical parameters such as sensitivity, specificity, MCC, accuracy and G-mean. The frequency of appearance of MACCS fingerprints in the molecules explained the substructures/groups/atoms responsible for the change in the solvation free energies of the molecules. Multiple methods and models are reported in the study because no single method or model gives the best result for every data set. The SRD values showed that all the models have similar performances in dataset classification; however, the support vector machine showed slightly better performance than other methods.

Our analysis was performed with easily calculable descriptors and freely available modelling tools. Earlier reports of models for the prediction of solvation free energies presented quantitative models that used costly computational algorithms.13,14 The results obtained from our study are significant and can be improved with sophisticated methods and algorithms, which will be used along with other quantitative studies to reduce the computing power and time consumption. Furthermore, this study supports the further development of quantitative models for the prediction of the solvation free energies of organic molecules and aids in the design of novel molecules with acceptable solvation free energies.

Acknowledgements

N. S. H. N. Moorthy is grateful to the Fundação para a Ciência e a Tecnologia (FCT), Portugal, for a Postdoctoral Grant (SFRH/BPD/44469/2008).

References

  1. L. Cavallo, J. Kleinjung and F. Fraternali, Nucleic Acids Res., 2003, 31, 3364–3366.
  2. P. F. B. Gonçalves and H. Stassen, Pure Appl. Chem., 2004, 76, 231–240.
  3. S. Lee, K. H. Cho, C. J. Lee, G. E. Kim, C. H. Na, Y. In and K. T. No, J. Chem. Inf. Model., 2010, 51, 105–114.
  4. R. C. Rizzo, T. Aynechi, D. A. Case and I. D. Kuntz, J. Chem. Theory Comput., 2006, 2, 128–139.
  5. D. S. Palmer, V. P. Sergiievskyi, F. Jensen and M. V. Fedorov, J. Chem. Phys., 2010, 133, 044104, DOI:10.1063/1.3458798.
  6. B. Honig and A. Nicholls, Science, 1995, 268, 1144–1149.
  7. M. K. Gilson and B. Honig, Proteins, 1998, 4, 7–18.
  8. C. J. Cramer and D. G. Truhlar, Chem. Rev., 1999, 99, 2161–2200.
  9. C. J. Cramer and D. G. Truhlar, Science, 1992, 256, 213–217.
  10. D. Sitkoff, K. A. Sharp and B. Honig, J. Phys. Chem., 1994, 98, 1978–1988.
  11. R. Luo, J. Moult and K. Gilson, J. Phys. Chem. B, 1997, 101, 11226–11236.
  12. J. Wang, W. Wang, S. Huo, M. Lee and P. A. Kollman, J. Phys. Chem. B, 2001, 105, 5055–5067.
  13. V. N. Viswanadhan, A. K. Ghose, U. C. Singh and J. J. Wendoloski, J. Chem. Inf. Comput. Sci., 1999, 39, 405–412.
  14. L. Bernazzani, C. Duce, A. Micheli, V. Mollica, A. Sperduti, A. Starita and M. R. Tine, J. Chem. Inf. Model., 2006, 46, 2030–2042.
  15. I. H. Witten, E. Frank and M. A. Hall, Data mining: Practical machine learning tools and techniques, Morgan Kaufmann, Burlington, MA, 2011.
  16. J. Han and M. Kamber, Data mining: concepts and techniques, Morgan Kaufmann Publishers, San Francisco, 2001.
  17. S. Cabani, P. Gianni, V. Mollica and L. Lepori, J. Solution Chem., 1981, 10, 563–595.
  18. R. Wolfenden, L. Andersson, P. M. Cullis and C. C. G. Southgate, Biochemistry, 1981, 20, 849–855.
  19. E. Gallicchio, L. Y. Zhang and R. M. Levy, J. Comput. Chem., 2002, 23, 517–529.
  20. W. L. Jorgensen, J. P. Ulmschneider and J. Tirado-Rives, J. Phys. Chem. B, 2004, 108, 16264–16270.
  21. A. V. Marenich, C. J. Cramer and D. G. Truhlar, J. Phys. Chem. B, 2009, 113, 4538–4543.
  22. E. O. Purisima, C. R. Corbeil and T. Sulea, J. Chem. Theory Comput., 2010, 6, 1622–1637.
  23. MOE 2012, Chemical Computing Group Inc., Montreal, H3A 2R7, Canada, 2012.
  24. A. Lin, QuaSAR-descriptors, Chemical Computing Group Inc., Montreal, H3A 2R7, Canada, 2002.
  25. C. W. Yap, J. Comput. Chem., 2011, 32, 1466–1474.
  26. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, SIGKDD Explorations, 2009, 11, 10–18.
  27. L. Breiman, Mach. Learn., 2001, 45(1), 5–32.
  28. L. Breiman, Mach. Learn., 1996, 24(2), 123–140.
  29. C. Cortes and V. Vapnik, Mach. Learn., 1995, 20(3), 273–297.
  30. K. Héberger and K. Kollár-Hunek, J. Chemom., 2011, 25, 151–158.
  31. K. Kollár-Hunek and K. Héberger, Chemom. Intell. Lab. Syst., 2013, 127, 139–146.
  32. A. Jurik, R. Reicherstorfer, B. Zdrazil and G. F. Ecker, Mol. Inf., 2013, 32, 415–419.
  33. K. M. Thai and G. F. Ecker, Bioorg. Med. Chem., 2008, 16, 4107–4119.
  34. N. S. H. N. Moorthy, M. J. Ramos and P. A. Fernandes, J. Enzyme Inhib. Med. Chem., 2011, 26(6), 755–766.
  35. N. S. H. N. Moorthy, M. J. Ramos and P. A. Fernandes, Chemom. Intell. Lab. Syst., 2011, 109, 101–112.
  36. N. S. H. N. Moorthy, M. J. Ramos and P. A. Fernandes, Lett. Drug Des. Discovery, 2011, 8, 14–25.
  37. K. Roy and R. N. Das, J. Hazard. Mater., 2013, 254–255, 166–178.
  38. K. Roy and G. Ghosh, Chemosphere, 2009, 77(7), 999–1009.
  39. N. S. H. N. Moorthy, S. F. Sousa, M. J. Ramos and P. A. Fernandes, J. Enzyme Inhib. Med. Chem., 2011, 26(6), 777–791.
  40. N. S. H. N. Moorthy, M. J. Ramos and P. A. Fernandes, RSC Adv., 2011, 1, 1126–1136.
  41. N. S. H. N. Moorthy, M. J. Ramos and P. A. Fernandes, SAR QSAR Environ. Res., 2012, 23, 521–536.

Footnote

Electronic supplementary information (ESI) available. See DOI: 10.1039/c4ra07961b

This journal is © The Royal Society of Chemistry 2014