Gonçalo Carrera and João Aires-de-Sousa*
REQUIMTE, CQFB, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal. E-mail: jas@fct.unl.pt; Fax: +351 21 2948550
First published on 3rd December 2004
Regression trees were built from an initial pool of 1085 molecular descriptors, calculated with the DRAGON software for 126 pyridinium bromides, to predict the melting point. A single tree with 9 nodes distributed over 5 levels was derived in less than 2 min and showed very good correlation between estimated and experimental values (R2 = 0.933, RMS = 12.61 °C). A number n of new trees were grown sequentially, each without the descriptors selected by the previous trees, and combining the predictions of the n trees (an ensemble of trees) resulted in higher accuracy. A 3-fold cross-validation with the optimum number of trees (n = 4) yielded an R2 value of 0.822. A counterpropagation neural network trained with the variables selected by the first tree achieved reasonable results (R2 = 0.748). In a test set of 9 new pyridinium bromides, all the low melting point cases were successfully identified.
Traditionally, chemists have relied on chemical intuition to predict such properties. But in some cases, very subtle features have a tremendous impact on the property. Fig. 1 illustrates such behaviour related to the melting point.
Fig. 1 Examples of similar chemical structures of pyridinium bromides with highly diverse melting points.
In such cases, computational tools can analyse experimental data to uncover hidden relationships between aspects of the chemical structure and the melting point, and propose a predictive model.
Chemical structures can be numerically encoded by molecular descriptors. Thousands of molecular descriptors have been proposed covering features from constitution to geometry to electronic properties, and they can be easily calculated with available software. Then, relationships between molecular descriptors and the property being predicted can be obtained computationally. Models like neural networks (NN),3 decision trees,4 ensembles of trees,5 partial least squares (PLS)6 and multiple linear regressions (MLR) are among the most popular. Some linear relationships between molecular descriptors and melting points of ionic liquids were reported in the past.7,8
The most important features of models obtained with computational tools are usually their flexibility to model multiple mechanisms of action, their capability to deal with high-dimensional data, and their level of predictive accuracy.
Neural networks and decision trees are highly flexible and able to account for multiple mechanisms of action, while linear methods are more rigid. On the other hand, NNs are not as efficient at dealing with large sets of variables. Time-consuming genetic algorithms9,10 are often used to obtain more compact and efficient sets of variables for MLR and neural networks. Decision trees are fast and efficient for high-dimensional data, but usually show relatively low prediction accuracy. Accuracy can be significantly improved with ensembles of trees (forests).11 Classification and regression trees (CART) have recently gained much interest in chemoinformatics, particularly for QSAR studies based on large pools of available molecular descriptors.5,11–13
In this paper we report our studies using decision trees and ensembles of decision trees (forests) to model the melting points of a series of 126 pyridinium bromides, as a fast, flexible alternative to linear regressions. Additionally, the potential of the molecular descriptors chosen by the trees to train counterpropagation neural networks (CPG NNs) was investigated.
In this study, regression trees were grown with the R program, version 1.9.0, using the RPART library with the default parameters.22 A tree was obtained by submitting the 126 molecules to RPART, each molecule represented by 1085 molecular descriptors. The variables chosen for building this tree (the global tree) were further evaluated in an additional 3-fold cross-validation procedure. This step employed exactly the same 3 subsets A, B, and C as in the Katritzky et al. study.7 The cross-validation trees were allowed to grow using (a) only the variables selected for the global tree, and (b) only the objects belonging to 2 of the subsets. At the end, predictions were obtained for the third subset (not used for growing the tree).
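The recursive partitioning behind RPART chooses, at each node, the descriptor and threshold that most reduce the squared error of the resulting child nodes. The following is a minimal Python sketch of that core split search (an illustration of the CART splitting rule, not the RPART implementation itself); the toy data and variable names are invented for the example:

```python
def sse(ys):
    # sum of squared deviations from the mean
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(X, y):
    """Exhaustively find the (feature, threshold) pair that minimises the
    summed squared error of the two child nodes -- the CART splitting rule."""
    best = None
    n_features = len(X[0])
    for j in range(n_features):
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    return best  # (error, feature index, threshold)

# toy example: feature 0 cleanly separates low from high property values
err, feat, thr = best_split([[1, 7], [2, 3], [10, 5], [11, 9]],
                            [5.0, 6.0, 50.0, 51.0])
```

Growing a full tree amounts to applying this search recursively to each child node until a stopping criterion (such as RPART's default complexity and minimum-node-size parameters) is met.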
Fig. 2 Training and prediction procedures in a CPG NN that learns relationships between molecular descriptors and melting points.
Before the training, the weights are randomly generated. During the training, each object activates the neuron with the most similar input weights to the molecular descriptors of that object—this is the winning neuron. The weights of the winning neuron are adjusted to become even closer to the input vector (molecular descriptors) and the neighbouring neurons are also corrected to approach the input object, with the correction depending on the distance to the winning neuron. The corresponding output weight is also changed to become closer to the experimental value of the property being predicted. The network is trained with all the objects from the training set several times until a predefined number of cycles is reached. Afterward, the network is able to make predictions for new compounds—Fig. 2 right. When a new object (molecule) is submitted, the network activates the winning neuron, and uses the corresponding output weight as the predicted value, in our case the predicted melting point.
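The training loop just described (winner selection, neighbourhood correction, output-weight adjustment) can be sketched as a minimal one-dimensional counterpropagation network in Python. This is an illustration only: the grid size, decay schedules, seed, and toy data below are arbitrary choices, not those of the JATOON-based software used in the paper:

```python
import random

def train_cpg(inputs, targets, n_neurons=5, cycles=200, lr0=0.5, span0=2):
    random.seed(0)  # reproducible illustration
    dim = len(inputs[0])
    # before the training, the weights are randomly generated
    W = [[random.random() for _ in range(dim)] for _ in range(n_neurons)]
    out = [random.random() for _ in range(n_neurons)]
    for c in range(cycles):
        lr = lr0 * (1 - c / cycles)                     # learning rate decays
        span = max(0, round(span0 * (1 - c / cycles)))  # neighbourhood shrinks
        for x, t in zip(inputs, targets):
            # winning neuron: input weights most similar to the descriptors
            win = min(range(n_neurons),
                      key=lambda k: sum((W[k][d] - x[d]) ** 2 for d in range(dim)))
            for k in range(max(0, win - span), min(n_neurons, win + span + 1)):
                f = lr / (1 + abs(k - win))         # correction shrinks with distance
                for d in range(dim):
                    W[k][d] += f * (x[d] - W[k][d])  # move input weights toward object
                out[k] += f * (t - out[k])           # move output weight toward property

    return W, out

def predict(W, out, x):
    # a new object activates its winning neuron; the output weight is the prediction
    win = min(range(len(W)),
              key=lambda k: sum((W[k][d] - x[d]) ** 2 for d in range(len(x))))
    return out[win]

# illustrative usage on two toy clusters of 1-D "descriptors"
W, out = train_cpg([[0.0], [0.1], [0.9], [1.0]], [10.0, 10.0, 100.0, 100.0])
```

After training, objects from the low cluster activate neurons whose output weights have converged near 10, and objects from the high cluster activate neurons near 100.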
In this investigation, CPG NNs were trained using the descriptors selected for the first decision tree. Training and cross-validation were performed with the same subsets as for the decision trees. The molecular descriptors were linearly scaled between 0 and 1. For some experiments (see Results) each descriptor i was multiplied by a factor L − li, where L is the number of levels in the first tree, and li is the distance of the node defined by descriptor i to the root node. CPG NNs were implemented with in-house developed software based on JATOON Java applets.23,24 The networks (with 12 × 12 neurons for the training set A + B + C and 10 × 10 neurons for the cross-validation experiments) were trained over 100 cycles, using an initial learning span of 5 neurons.
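The preprocessing described above (linear 0–1 scaling followed by multiplication with L − li) can be expressed compactly; the function below is a sketch of that stated transformation, with invented example values:

```python
def scale_and_weight(columns, levels, L):
    """Scale each descriptor column linearly to [0, 1], then multiply it by
    (L - li): descriptors used near the root of the tree weigh more."""
    weighted = []
    for col, li in zip(columns, levels):
        lo, hi = min(col), max(col)
        scaled = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]
        weighted.append([s * (L - li) for s in scaled])
    return weighted

# a root-level descriptor (li = 0) in a 5-level tree gets a factor of 5
w = scale_and_weight([[0.0, 5.0, 10.0]], [0], 5)
```

A descriptor used at the deepest level (li = L − 1) would correspondingly be scaled down to a factor of 1.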
Fig. 3 Regression tree model obtained for the estimation of melting points. Terminal nodes are represented with rectangles while intermediate nodes are represented with ovals. The number of objects reaching a node, as well as the average of their melting points, is displayed inside each node. The rules are displayed near the corresponding nodes, using names for the descriptors as defined in Table 1.
Fig. 4 Predicted values vs. experimental melting points (mp) for the set of 126 pyridinium bromides. Predictions were obtained by a regression tree grown with 126 objects described by 1085 variables.
Table 1 Molecular descriptors selected by the regression tree

Descriptor | Description
---|---
G2 | Geometrical descriptor. G2 is the gravitational index (bond-restricted): G2 = Σb (mi mj / rij2)b, where the sum runs over all bonds, mi and mj are the masses of the bonded atoms, and rij is the corresponding bond length.
RDF105u | RDF descriptor,20 accounts for the abundance of pairs of atoms at a distance of 10.5 Å.
HATSe | GETAWAY descriptor,21 accounts for the influence of individual atoms on the total shape of the molecule combined with electronegativities. HATSe is the leverage-weighted total index weighted by atomic Sanderson electronegativities.
R6P | GETAWAY descriptor,21 R6P is the R autocorrelation of lag 6 weighted by atomic polarizabilities.
De | WHIM descriptor,19 a geometrical descriptor calculated on the projections of the atoms along principal axes. De is the total accessibility index weighted by atomic Sanderson electronegativities.
piPC07 | Topological descriptor, a 2D descriptor independent of the conformation. piPC07 is the molecular multiple path count of order 7.
GATS2v | Geary autocorrelation, a 2D descriptor independent of the conformation, obtained by the summation of all products of the terminal atoms of all paths of length (lag) 2, weighted by van der Waals volumes.
HATS7v | GETAWAY descriptor,21 accounts for the influence of individual atoms on the total shape of the molecule combined with van der Waals volumes. HATS7v is the leverage-weighted autocorrelation of lag 7 weighted by atomic van der Waals volume.
H0p | GETAWAY descriptor,21 accounts for the influence of individual atoms on the total shape of the molecule combined with atomic polarizabilities. H0p is the H autocorrelation of lag 0 weighted by atomic polarizabilities.
Table 2 Statistical parametersa obtained for the training set (A + B + C) and in 3-fold cross-validation (CV)

Model | | A + B + C (training set) | A (CV) | B (CV) | C (CV)
---|---|---|---|---|---
Decision tree | RMS | 12.61 | 22.13 | 28.27 | 26.07
 | R2 | 0.933 | 0.794 | 0.662 | 0.711
 | R2* | 0.933 | 0.795 | 0.708 | 0.714
 | ē | 10.07 | 15.95 | 21.76 | 18.61
 | SP | 0.940 | 0.871 | 0.798 | 0.818
CPG NN (trained with weighted descriptors) | RMS | 24.41 | 29.21 | 23.97 | 31.50
 | R2 | 0.748 | 0.640 | 0.757 | 0.578
 | R2* | 0.762 | 0.661 | 0.783 | 0.592
 | ē | 19.67 | 24.29 | 19.38 | 26.80
 | SP | 0.871 | 0.789 | 0.901 | 0.754
Ensemble of 3 trees | RMS | 11.41 | 18.79 | 22.06 | 25.42
 | R2 | 0.945 | 0.851 | 0.794 | 0.725
 | R2* | 0.946 | 0.855 | 0.809 | 0.737
 | ē | 9.420 | 15.33 | 16.82 | 20.06
 | SP | 0.937 | 0.925 | 0.873 | 0.858
Ensemble of 4 trees | RMS | 10.82 | 17.11 | 20.32 | 23.67
 | R2 | 0.950 | 0.877 | 0.825 | 0.772
 | R2* | 0.951 | 0.882 | 0.836 | 0.762
 | ē | 8.727 | 13.73 | 15.22 | 18.64
 | SP | 0.954 | 0.942 | 0.886 | 0.861
Ensemble of 8 trees | RMS | 9.910 | 18.17 | 18.70 | 21.96
 | R2 | 0.958 | 0.861 | 0.852 | 0.795
 | R2* | 0.960 | 0.870 | 0.880 | 0.812
 | ē | 8.140 | 14.55 | 14.88 | 17.65
 | SP | 0.962 | 0.934 | 0.913 | 0.910
Multiple linear regression (MLR)7 | RMS | 23.41 | 27.10 | 25.53 | 24.05
 | R2 | 0.788 | 0.769 | 0.779 | 0.757
 | R2* | 0.768 | n.a. | n.a. | n.a.
 | ē | 18.07 | n.a. | n.a. | n.a.
 | SP | 0.871 | n.a. | n.a. | n.a.

a RMS is the root mean square of the errors; R2 is the goodness-of-fit correlation coefficient; R2* is the square of the Pearson correlation coefficient; ē is the mean error; and SP is the Spearman correlation coefficient.
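The statistics reported in the table can be computed as in the generic Python sketch below. Note one assumption: ē is implemented here as the mean absolute error, which may differ from the authors' exact definition of "mean error"; the Spearman sketch also omits tie handling:

```python
import math

def rms(y, p):
    # root mean square of the errors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def r2(y, p):
    # goodness-of-fit coefficient: 1 - SSE/SST
    m = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, p))
    sst = sum((a - m) ** 2 for a in y)
    return 1 - sse / sst

def pearson(y, p):
    # Pearson correlation coefficient; R2* is its square
    my, mq = sum(y) / len(y), sum(p) / len(p)
    num = sum((a - my) * (b - mq) for a, b in zip(y, p))
    den = math.sqrt(sum((a - my) ** 2 for a in y) *
                    sum((b - mq) ** 2 for b in p))
    return num / den

def spearman(y, p):
    # Pearson correlation computed on the ranks (no tie handling in this sketch)
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(y), ranks(p))

def mean_abs_error(y, p):
    # assumed interpretation of e-bar: mean absolute error
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)
```

With experimental values y and predictions p as lists of floats, these functions reproduce the five criteria used throughout Table 2.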
Fig. 5 Influence of the size of the forest on the quality of mp predictions, growing the trees with the whole data set or in a 3-fold cross-validation procedure.
Fig. 6 Predicted vs. experimental melting points (mp) for the set of 126 pyridinium bromides. Predictions were obtained by averaging predictions from 4 regression trees grown using the 126 objects.
Fig. 7 Predicted vs. experimental melting points (mp) for the set of 126 pyridinium bromides with a 3-fold cross-validation procedure on an ensemble of 4 regression trees.
Table 3 R2 values obtained by CPG NNs trained with the descriptors selected by the first tree, unweighted and weighted

 | Tree descriptors, R2 | Tree descriptors (weighted), R2
---|---|---
Prediction for the training set (A + B + C) | 0.653 | 0.748
CV—prediction for test set A | 0.475 | 0.640
CV—prediction for test set B | 0.627 | 0.757
CV—prediction for test set C | 0.406 | 0.578
Table 4 Experimental and predicted melting points (°C) for the test set of 9 new pyridinium bromides

Pyridinium bromide | Experimental mp/°C | Decision tree | Ensemble of 4 trees
---|---|---|---
1-(2-Oxo-butyl)-pyridinium | 180 (ref. 29) | 138 | 122
1-(3-Methyl-2-oxo-butyl)-pyridinium | 149 (ref. 29) | 138 | 139
1-Isobutoxycarbonylmethyl-pyridinium | 87 (ref. 29) | 85 | 84
1-(2-Dodecanoyloxy-ethyl)-pyridinium | 80.5 (ref. 30) | 44 | 44
1-(2-Hexadecanoyloxy-ethyl)-pyridinium | 80 (ref. 30) | 44 | 44
1-(2-Icosanoyloxy-ethyl)-pyridinium | 81 (ref. 30) | 44 | 44
3,4-Dihydroxy-2-(1-hydroxy-ethyl)-1-methyl-pyridinium | 142 (ref. 31) | 125 | 108
1-Benzyl-3-(2-oxo-oxazolidine-3-carbonyl)-pyridinium | 109 (ref. 32) | 85 | 84
1-Benzyl-3-dimethylcarbamoyl-pyridinium | 158.5 (ref. 32) | 85 | 84
Fig. 8 shows the output layer of the CPG NN trained with the weighted descriptors for the 126 objects. The neurons were coloured according to their output values, and the 9 new objects were mapped onto the surface. The activated neurons are identified by the experimental mp of the mapped objects.
Fig. 8 Mapping of the 9 new objects onto the output layer of a CPG NN (see text). The labels shown in the map correspond to experimental mp values of the nine objects. Low mp objects were correctly identified by exciting neurons in the blue region.
The descriptors involved in the right main branch of the tree are more difficult to interpret. Most of the selected descriptors are 3D descriptors, particularly GETAWAY descriptors, which are calculated from a single molecular conformation. This reinforces our initial proposal that 3D structures generated by CORINA are useful, and that working with a single conformation is realistic since the melting point can be viewed as a property of a solid. GETAWAY descriptors account for the influence of individual atoms on the shape of the molecule, and are often combined with atomic properties such as electronegativities, polarizabilities or van der Waals volumes.21 GETAWAY descriptors have been successful for the prediction of melting points of polycyclic aromatic hydrocarbons and polychlorinated biphenyls.27 The other 3D descriptor is a WHIM descriptor, namely the total accessibility index weighted by atomic Sanderson electronegativities. WHIM descriptors were designed to capture 3D information regarding molecular size, shape, and symmetry and atom distribution with respect to the molecule as a whole.19 The De descriptor is a compactness descriptor, which encodes information about the distribution of electronegativities along the three axes of the molecule. Another size directional WHIM index was selected in a QSPR model to predict melting points of polychlorinated biphenyls.28
The piPC07 descriptor isolates a group of 12 compounds with relatively low melting points (average = 83 °C). This descriptor counts paths of order 7 and at the same time is influenced by multiple bonds. In fact, most of the 12 isolated compounds have no multiple bonds outside the pyridinium ring, while most of the 25 compounds in the right child node have a functional group with a multiple bond.
The accuracy of the predictions was significantly improved by combining several trees in an ensemble—Fig. 5 and Table 2. We chose the ensemble with 4 trees as the optimum, since after that point no significant improvement in predictive ability can be observed. This forest model accurately fits the experimental data (R2 = 0.950, RMS = 10.82 °C), and performs well in cross-validation tests (R2 = 0.877, 0.825, 0.772, respectively for sets A, B, and C). Interpretation of these results should take into account the difficulty of modelling melting points of organic salts, and the fuzziness associated with such data—differences of 10–15 °C in the reported mp for the same compound from different authors are not uncommon in the literature. An R2 value of 0.950 compares to the value of 0.788 obtained by Katritzky et al.7 and the value of 0.790 obtained by Eike et al.8 This better fitting of the entire set (after the entire set was used to develop the model) may be due to the higher flexibility of the decision trees and the larger number of descriptors employed. A more significant comparison can be obtained with cross-validation tests using the same subsets as Katritzky et al. (Table 2)—the ensembles of trees yielded slightly better correlations than the linear methods. However, it must be emphasized that it is not possible to rigorously compare the two methods without an external set, even if the procedures here described for variable selection and cross-validation were designed to follow as much as possible those described in the previous studies. Essentially the quality of predictions by trees and MLR are not very different.
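The sequential construction of the ensemble (each new tree grown without the descriptors already selected by the previous trees, with the predictions then averaged) can be sketched as follows. Here grow_tree is a hypothetical placeholder for any single-tree learner that reports which descriptors it used; the dummy learner in the usage example is purely for illustration:

```python
def grow_forest(grow_tree, X, y, n_trees):
    """Sequentially grow n_trees; each new tree is denied the descriptors
    already used by the earlier trees, so it must find alternative splits.
    grow_tree(X, y, banned) must return (predict_fn, used_descriptor_indices)."""
    banned, trees = set(), []
    for _ in range(n_trees):
        predict_fn, used = grow_tree(X, y, banned)
        trees.append(predict_fn)
        banned |= set(used)
    return trees

def forest_predict(trees, x):
    # the ensemble prediction is the plain average over the individual trees
    return sum(t(x) for t in trees) / len(trees)

# hypothetical single-tree learner: each "tree" predicts a constant and
# claims one new descriptor index (invented purely for this example)
def dummy_grow(X, y, banned):
    k = len(banned)
    return (lambda x, k=k: float(k + 1)), {k}

trees = grow_forest(dummy_grow, None, None, 4)
```

Because later trees cannot reuse earlier descriptors, each tree contributes partly independent information, which is what makes the averaged prediction more accurate than any single tree.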
Although the ensemble of 4 trees employs 35 descriptors (9 descriptors in the first tree, 9 in the second tree, 8 in the third tree, and 9 in the fourth tree), this number overstates the real amount of information encoded by the set of descriptors. In fact, the architectures of the 4 trees overlap to a large extent, and the 4 trees use descriptors which are in most cases highly correlated, i.e. the descriptors do not change much from tree to tree, even if they are not exactly the same. The left main branch of the trees (leading to low mp predictions) is particularly similar in all the 4 trees, with almost the same content in the terminal nodes (it is constructed by descriptors related to molecular size). This explains why the ensemble predicts the same value for all the low melting point compounds—all of them get the same prediction in all trees.
Interestingly, the second tree exhibits two nodes under the right main branch defined by information content descriptors of order 1 (1CIC and 1IC). These two descriptors describe the connectivity of a molecule in terms of symmetry. Information content descriptors were selected in both the Katritzky et al. and Eike et al. works, and an inverse relationship between the melting points and the information content descriptors was also observed. This can be interpreted as low-symmetry molecules, which have low coordination ability, tending to have low melting points.7
An additional experiment has shown that exclusion of highly intercorrelated descriptors from the initial pool (with Pearson correlation coefficient above 0.8) does not lead to improved predictions.
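A common way to perform such filtering is a greedy pass that keeps a descriptor only if its absolute Pearson correlation with every previously kept descriptor stays at or below the cut-off. The sketch below illustrates the idea (it is not necessarily the exact procedure used by the authors, and the example columns are invented):

```python
import math

def pearson_r(a, b):
    # plain Pearson correlation coefficient between two descriptor columns
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den

def drop_correlated(columns, threshold=0.8):
    # greedily keep a column only if |r| with every kept column <= threshold
    kept = []
    for col in columns:
        if all(abs(pearson_r(col, k)) <= threshold for k in kept):
            kept.append(col)
    return kept

cols = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]
kept = drop_correlated(cols)  # the second column (r = 1 with the first) is removed
```

Note that the result of a greedy pass depends on the order of the columns, which is one reason such pre-filtering does not necessarily improve tree-based models: the trees already pick one representative from each correlated group on their own.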
Tree construction can also be seen as a very fast method for selecting relevant variables. However, descriptors that are successfully employed by trees are not necessarily successful for CPG NNs, because the two methods use the descriptors in radically different ways. In trees, descriptors are used sequentially, some to make decisions involving all the objects, others only in small branches for a few objects. By contrast, CPG NNs always use all the descriptors simultaneously for all the objects.
CPG NNs were trained with the 9 descriptors selected by the first tree, and reasonable results were achieved, particularly when the descriptors were weighted according to the level at which they are used in the tree. The predictions were not as accurate as with the trees, but were comparable to the MLR models by some criteria.
Beyond predictive accuracy, CPG NNs are interesting because they provide information regarding the topology of the data. Mapping an object into a CPG NN also allows for retrieval of other objects that were mapped onto the same region, and that can to some extent explain and support the prediction. Visualisation of the output weights (Fig. 8) reveals several regions on the surface corresponding to different ranges of melting points. It is a representation of the combined relationship between the selected descriptors and the melting point of pyridinium bromides. When applied to the test set of 9 new compounds, the network mapped those with lower melting points onto dark blue neurons, and one salt with low melting point (87 °C) onto a light blue neuron. This means the network would correctly identify the salts with highest potential as ionic liquids.
The decision tree and the ensemble also recognized the low melting point objects in the test set (Table 4). The accuracy of other predictions varied. The wrong assignment of a low melting point (80–85 °C) to a salt with mp = 158 °C is the worst prediction in the set. However, for this application, a false positive (false prediction of an ionic liquid) is certainly preferable to a false negative (missing a compound with ionic liquid properties).
Footnote
† Electronic supplementary information (ESI) available: ensemble of trees grown and list of initial descriptors. See http://www.rsc.org/suppdata/gc/b4/b408967g/ |
This journal is © The Royal Society of Chemistry 2005 |