Classification study of solvation free energies of organic molecules using machine learning techniques†
Abstract
In this work, we developed a set of classification models that categorise organic molecules by their solvation free energies using three machine learning approaches: decision tree, random forest and support vector machine. The experimental solvation free energies of the molecules, obtained from the literature, were split into highly favourable (<−3 kcal mol⁻¹) and less favourable (>−3 kcal mol⁻¹) classes, with −3 kcal mol⁻¹ set as the threshold for classification model development. The MACCS fingerprint, along with a set of physicochemical descriptors such as atom count, topology, vdW surface area (volsurf) and subdivided surface area, contributed to the classification models. Validation using a test set and 10-fold cross-validation gave accuracy, sensitivity and specificity values above 90%. The sum of ranking differences (SRD) analysis shows that the support vector machine models rank comparatively better, while models containing MACCS fingerprints are ranked as good models in all approaches. The MACCS fingerprints indicate that the presence of halogen atoms leads to less favourable solvation free energies. In contrast, the presence of polar atoms/groups and of features such as heteroatoms, double-bonded branched aliphatic chains, CN, N–C–C–O, NCO, >1 heterocyclic atoms, OCO, etc. leads to highly favourable solvation free energies. The results of these investigations can be used alongside quantitative models to predict the solvation free energies of organic molecules and to design novel molecules with acceptable solvation free energies.
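To make the class definition and the reported validation statistics concrete, the following is a minimal sketch, not the authors' code. It labels molecules against the −3 kcal mol⁻¹ threshold and computes accuracy, sensitivity and specificity from a confusion matrix. Several things here are assumptions for illustration: the function names `label` and `classification_metrics`, the choice of "highly favourable" as the positive class, the handling of a value exactly equal to the threshold (the abstract does not specify it), and all numerical values, which are hypothetical rather than data from the study.

```python
THRESHOLD = -3.0  # kcal/mol, the cut-off stated in the abstract


def label(dg):
    """Return 1 (highly favourable) if dg < THRESHOLD, else 0 (less favourable).

    Assumption: a value exactly equal to the threshold is treated as
    less favourable; the abstract leaves this case unspecified.
    """
    return 1 if dg < THRESHOLD else 0


def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for binary labels.

    Assumption: class 1 (highly favourable) is the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical solvation free energies (kcal/mol), for illustration only.
experimental = [-6.2, -4.5, -3.4, -0.8, 1.1, -2.1]
# Hypothetical class predictions from some classifier.
predicted_labels = [1, 1, 0, 0, 0, 1]

true_labels = [label(v) for v in experimental]
metrics = classification_metrics(true_labels, predicted_labels)
print(true_labels)
print(metrics)
```

In a 10-fold cross-validation setting, the same three statistics would be computed on each held-out fold and then summarised; the sketch above covers only the per-set calculation.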