Lionello Pogliani*ab and
Jesus Vicente de Julián-Ortizab
aMOLware SL, c/Burriana 36-3, 46005, Valencia, Spain
bUnidad de Investigación de Diseño de Farmacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, 46100 Burjassot, València, Spain. E-mail: liopo@uv.es; jejuor@uv.es
First published on 3rd September 2014
A new type of index, the mean molecular connectivity index (MMCI), based on nine different concepts of mean, is proposed to model eleven properties of organic solvents, together with molecular connectivity indices (MCI), experimental parameters and random variables. Two modeling methodologies are used to test the different descriptors: the multilinear least-squares (MLS) methodology and the Artificial Neural Network (ANN) methodology. The top three quantitative structure–property relationships (QSPR) for each property are chosen with the MLS method. The indices of these three QSPRs were then used to train the ANNs, which selected the best training sets of indices to estimate the evaluation sets of compounds. The best ANN relationships for most properties are of the semiempirical type and include MMCI, MCI and experimental parameters. Refractive index, RI, viscosity, η, and surface tension, γ, prefer a semiempirical relationship made of MCI and an experimental parameter only. In our previous study, with no MMCI, random variables contributed to the semiempirical relationships for two properties at the ANN level (MS and El); here the use of MMCIs removes the contribution of such variables. Most of the MMCIs that improve the model of the properties are valence-delta-dependent (δv), that is, they encode both the hydrogen atom contribution and the core electrons of higher-row atoms.
Nine definitions of the mean of two numbers can be found in the literature (see Appendix), and these definitions are used here to define a new type of index, the mean molecular connectivity index (MMCI). These new indices are used together with molecular connectivity indices, experimental parameters, and random variables to build optimal semiempirical quantitative structure–property relationships (QSPR) for eleven properties of a set of organic solvents. These properties were recently modeled5 with a semiempirical set of descriptors that encompassed only the molecular connectivity indices (MCI), empirical parameters, and random variables. The cited work5 emphasized the advantage of using ANN for modeling purposes. The two main aims of the present study are (i) to test the usefulness of the new MMCI and (ii) to test the related usefulness of the ANN methodology.
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
Here, i (and j) label the N atoms of a hydrogen-depleted molecule, ij denotes two atoms directly bonded through a σ bond, and p = N, even if other values are possible. It should be underlined that N for the studied molecules (see Table 1) is not large. The reader may notice, among other similarities, that the Lehmer mean, LM, for p = 2 equals the symmetric mean, SM.
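The LM/SM coincidence mentioned above can be checked directly from the two-number definitions given in the Appendix (eqn (A5) and (A8)). The short derivation below is written for two generic positive numbers a and b, not for the bond-summed indices of eqn (1)–(9):

```latex
\[
\mathrm{LM}(p) = \frac{a^{p} + b^{p}}{a^{p-1} + b^{p-1}}
\quad\Longrightarrow\quad
\mathrm{LM}(2) = \frac{a^{2} + b^{2}}{a + b} = \mathrm{SM}.
\]
```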
Solvents | M | Tb | ε | d | RI | FP | η | γ | UV | μ | MS | El |
---|---|---|---|---|---|---|---|---|---|---|---|---|
a (°) externally validated compounds. Italicized bold values: test compounds used in ANN-MLP calculations. | ||||||||||||
(°) Acetone | 58.1 | 329 | 20.7 | 0.791 | 1.359 | 256 | 0.32 | 23.46 | 330 | 2.88 | 0.46 | 0.43 |
(°) Acetonitrile | 41.05 | 355 | 37.5 | 0.786 | 1.344 | 278 | 0.37 | 28.66 | 190 | 3.92 | 0.534 | 0.50 |
Benzene | 78.1 | 353 | 2.3 | 0.84 | 1.501 | 262 | 0.65 | 28.22 | 280 | 0 | 0.699 | 0.27 |
Benzonitrile | 103.1 | 461 | 25.2 | 1.010 | 1.528 | 344 | 1.241 | 38.79 | ||||
1-Butanol | 74.1 | 391 | 17.1 | 0.810 | 1.399 | 308 | 2.95 | 24.93 | 215 | |||
(°) 2-Butanone | 72.1 | 353 | 18.5 | 0.805 | 1.379 | 270 | 0.40 | 23.97 | 330 | 0.39 | ||
Butyl acetate | 116.2 | 398 | 5.0 | 0.882 | 1.394 | 295 | 0.73 | 24.88 | 254 | |||
CS2 | 76.1 | 319 | 2.6 | 1.266 | 1.627 | 240 | 0.37 | 31.58 | 380 | 0 | 0.532 | |
CCl4 | 153.8 | 350 | 2.2 | 1.594 | 1.460 | 0.97 | 26.43 | 263 | 0 | 0.691 | 0.14 | |
Cl-benzene | 112.6 | 405 | 5.6 | 1.107 | 1.524 | 296 | 0.80 | 32.99 | 287 | |||
1Cl-butane | 92.6 | 351 | 7.4 | 0.886 | 1.4024 | 267 | 0.35 | 23.18 | 225 | |||
CHCl3 | 119.4 | 334 | 4.8 | 1.492 | 1.446 | 0.57 | 26.67 | 245 | 1.01 | 0.740 | 0.31 | |
Cyclohexane | 84.2 | 354 | 2.0 | 0.779 | 1.426 | 255 | 1.00 | 24.65 | 200 | 0 | 0.627 | 0.03 |
(°) Cyclopentane | 70.1 | 323 | 2.0 | 0.751 | 1.400 | 236 | 0.47 | 21.88 | 200 | 0.629 | ||
1,2-DiCl-benzene | 147.0 | 453 | 9.9 | 1.306 | 1.551 | 338 | 1.32 | 295 | 2.50 | 0.748 | ||
1,2-DiCl-ethane | 98.95 | 356 | 10.4 | 1.256 | 1.444 | 288 | 0.79 | 31.86 | 225 | 1.75 | ||
DiCl-methane | 84.9 | 313 | 9.1 | 1.325 | 1.424 | 0.44 | 27.20 | 235 | 1.60 | 0.733 | 0.32 | |
N,N-DiM-acetamide | 87.1 | 438 | 37.8 | 0.937 | 1.438 | 343 | 268 | 3.8 | ||||
N,N-DiM-formamide | 73.1 | 426 | 36.7 | 0.944 | 1.431 | 330 | 0.92 | 268 | 3.86 | |||
1,4-Dioxane | 88.1 | 374 | 2.2 | 1.034 | 1.422 | 285 | 1.54 | 32.75 | 215 | 0.45 | 0.606 | |
Ether | 74.1 | 308 | 4.3 | 0.708 | 1.353 | 233 | 0.24 | 16.95 | 215 | 1.15 | 0.29 | |
Ethyl acetate | 88.1 | 350 | 6.0 | 0.902 | 1.372 | 270 | 0.45 | 23.39 | 260 | 1.8 | 0.554 | 0.45 |
(°) Ethyl alcohol | 46.1 | 351 | 24.3 | 0.785 | 1.360 | 281 | 1.20 | 21.97 | 210 | 1.69 | 0.575 | |
Heptane | 100.2 | 371 | 1.9 | 0.684 | 1.387 | 272 | 19.65 | 200 | 0.00 | |||
Hexane | 86.2 | 342 | 1.9 | 0.659 | 1.375 | 250 | 0.33 | 17.89 | 200 | 0.00 | ||
2-Methoxyethanol | 76.1 | 398 | 16.0 | 0.965 | 1.402 | 319 | 1.72 | 30.84 | 220 | |||
(°) Methyl alcohol | 32.0 | 338 | 32.7 | 0.791 | 1.329 | 284 | 0.60 | 22.07 | 205 | 1.70 | 0.530 | 0.73 |
(°) 2-Methylbutane | 72.15 | 303 | 1.8 | 0.620 | 1.354 | 217 | ||||||
4-Me-2-pentanone | 100.2 | 391 | 13.1 | 0.800 | 1.396 | 286 | 334 | |||||
2-Me-1-propanol | 74.1 | 381 | 17.7 | 0.803 | 1.396 | 310 | ||||||
2-Me-2-propanol | 74.1 | 356 | 10.9 | 0.786 | 1.387 | 277 | 19.96 | 1.66 | 0.534 | |||
DMSO | 78.1 | 462 | 46.7 | 1.101 | 1.479 | 368 | 2.24 | 42.92 | 268 | 3.96 | ||
(°) Nitromethane | 61.0 | 374 | 35.9 | 1.127 | 1.382 | 308 | 0.67 | 36.53 | 380 | 3.46 | 0.391 | |
1-Octanol | 130.2 | 469 | 10.3 | 0.827 | 1.429 | 354 | 10.62 | 27.10 | ||||
(°) Pentane | 72.15 | 309 | 1.8 | 0.626 | 1.358 | 224 | 0.23 | 15.49 | 200 | 0.00 | ||
3-Pentanone | 86.1 | 375 | 17.0 | 0.853 | 1.392 | 279 | 24.74 | |||||
(°) 1-Propanol | 60.1 | 370 | 20.1 | 0.804 | 1.384 | 288 | 2.26 | 23.32 | 210 | |||
(°) 2-Propanol | 60.1 | 356 | 18.3 | 0.785 | 1.377 | 295 | 2.30 | 20.93 | 210 | 0.63 | ||
Pyridine | 79.1 | 388 | 12.3 | 0.978 | 1.510 | 293 | 0.94 | 36.56 | 305 | 2.2 | 0.611 | 0.55 |
TetraCl-ethylene | 165.8 | 394 | 2.3 | 1.623 | 1.506 | 0.90 | 0.802 | |||||
(°) Tetra-hydrofuran | 72.1 | 340 | 7.6 | 0.886 | 1.407 | 256 | 0.55 | 215 | 1.75 | 0.35 | ||
Toluene | 92.1 | 384 | 2.4 | 0.867 | 1.496 | 277 | 0.59 | 27.93 | 285 | 0.36 | 0.618 | 0.22 |
1,1,2-TriCl, triFEthane | 187.4 | 321 | 2.4 | 1.575 | 1.358 | 0.69 | 230 | 0.02 | ||||
2,2,4-TriMe-pentane | 114.2 | 372 | 1.9 | 0.692 | 1.391 | 266 | 0.50 | 215 | 0.01 | |||
o-Xylene | 106.2 | 417 | 2.6 | 0.870 | 1.505 | 305 | 0.81 | 29.76 | ||||
p-Xylene | 106.2 | 411 | 2.3 | 0.866 | 1.495 | 300 | 0.65 | 28.01 | ||||
(°) Acetic acid | 60.05 | 391 | 6.15 | 1.049 | 1.372 | 27.10 | 1.2 | 0.551 | ||||
Decaline | 138.2 | 465 | 2.2 | 0.879 | 1.476 | 0.681 | ||||||
DiBr-methane | 173.8 | 370 | 7.8 | 1.542 | 2.497 | 39.05 | 1.43 | 0.935 | ||||
1,2-DiCl-ethylen (Z) | 96.9 | 334 | 9.2 | 1.284 | 1.449 | 1.90 | 0.679 | |||||
(°) 1,2-DiCl-ethylen (E) | 96.9 | 321 | 2.1 | 1.255 | 1.446 | 0 | 0.638 | |||||
1,1-DiCl-ethylen | 96.9 | 305 | 4.7 | 1.213 | 1.425 | 1.34 | 0.635 | |||||
Dimethoxymethane | 76.1 | 315 | 2.7 | 0.866 | 1.356 | 0.611 | ||||||
(°) Dimethylether | 46.1 | 249 | 5.0 | |||||||||
Ethylene carbonate | 88.1 | 511 | 89.6 | 1.321 | 1.425 | 4.91 | ||||||
(°) Formamide | 45.0 | 484 | 109 | 1.133 | 1.448 | 57.03 | 3.73 | 0.551 | ||||
(°) Methylchloride | 50.5 | 249 | 12.6 | 0.916 | 1.339 | 1.87 | ||||||
Morpholine | 87.1 | 402 | 7.3 | 1.005 | 1.457 | 0.631 | ||||||
Quinoline | 129.2 | 510 | 9.0 | 1.098 | 1.629 | 42.59 | 2.2 | 0.729 | ||||
(°) SO2 | 64.1 | 263 | 17.6 | 1.434 | 1.6 | |||||||
2,2-TetraCl-ethane | 167.8 | 419 | 8.2 | 1.578 | 1.487 | 35.58 | 1.3 | 0.856 | ||||
TetraMe-urea | 116.2 | 450 | 23.1 | 0.969 | 1.449 | 3.47 | 0.634 | |||||
TriCl-ethylen | 131.4 | 360 | 3.4 | 1.476 | 1.480 | 0.734 |
Replacing δ throughout these definitions with (i) the valence delta, δv, (ii) the intrinsic I-state indices, and (iii) the electrotopological S-state indices (see ref. 5–7 and Appendix), it is possible to obtain three further subsets: the valence MMCI, {AMv, GMv, HMv, RMv, SMv, UMv, HoMv, LMv, StMv}, the I-state MMCI, {AMI, GMI, HMI, RMI, SMI, UMI, HoMI, LMI, StMI}, and the E-state MMCI, {AME, GME, HME, RME, SME, UME, HoME, LME, StME}. The basic notions of delta, valence delta, and I- and S-indices belong to the origins of molecular connectivity theory and are based on graph concepts.1–3,8–10 As some S values for highly electropositive atoms can be negative, the S values are rescaled to avoid imaginary S-state MMCIs (see ref. 5). Summing up, we have thirty-six MMCIs: the nine δ-based indices plus the three subsets of nine. Other MMCI can be derived following different types of bonding and branching, as suggested by Kier and Hall,9 but for our present purpose these are enough. To model our eleven properties we will also use thirty MCI (see Table 2 in ref. 5) and fifty random variables, rn1–rn50 (where 0 < rn < 1). The five experimental variables, {M, Tb, ε, d, RI}, of Table 1 will also be used as indices throughout the present calculations, i.e., in some cases they will show up on the right-hand side of the modeling relationships, and the relationship will then be labeled semiempirical. The total number of independent variables is thus 121 (36 + 30 + 50 + 5). The best relationship for each property might then encompass these four different types of indices: MMCIs, MCIs, experimental variables, and random variables.
The MMCI have been obtained with an in-house Visual Basic program that uses both adjacency and distance matrices6 and runs on a PC. The number of indices in each of the present multilinear relationships equals the number of indices in the corresponding relationship of ref. 5, which obeyed the Topliss–Costello rule:11 the ratio of data points to the number of variables should be greater than or equal to five, and the relationship should provide a correlation coefficient r > 0.84 (r2 > 0.70).
The multilinear least-squares procedure of Statistica 8 is used to find the best relationship for the training compounds of Table 1, which is then used to evaluate the left-out compounds [those with (°) in Table 1]. It should be underlined that, in principle, the experimental values of the evaluated points are treated as unknown, and they have to be estimated from the predictive relationship obtained with the training points. This comparison shows how far the estimated values of the evaluated points deviate from the true values, and how symmetrically the residuals (deviations) are placed around the zero line in a residual plot.12 The overall quality of the model for each property, that is, r2, s, and N (here the number of compounds), is obtained with an EXCEL spreadsheet by plotting the observed property (P) vs. the calculated one, Pclc. The quality of the training regression equation is also given by the q2 leave-one-out statistic5 (Table 2).
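The acceptance rule quoted above reduces to two numeric checks. The helper below is a minimal sketch; the function name and the choice of which N to count (training compounds only, or all compounds) are assumptions made here for illustration, since the text does not specify them.

```python
def topliss_costello_ok(n_points: int, n_variables: int, r: float) -> bool:
    """Return True when points/variables >= 5 and r > 0.84 (i.e. r2 > 0.70)."""
    return n_points / n_variables >= 5 and r > 0.84


# Illustrative check using the training figures reported for eqn (10) in Table 2:
# 45 training compounds, 6 descriptors, r2 = 0.963.
eqn10_ok = topliss_costello_ok(45, 6, 0.963 ** 0.5)
```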
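The fitting and leave-one-out steps described above can be sketched in plain Python. This is not the Statistica 8 procedure: the normal-equation solver, the function names, the definition q2 = 1 − PRESS/SST, and the synthetic data are all assumptions introduced for illustration.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_mls(X, y):
    """Multilinear least squares with an intercept, via the normal equations."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    AtA = [[sum(a[i] * a[j] for a in A) for j in range(k)] for i in range(k)]
    Aty = [sum(a[i] * yi for a, yi in zip(A, y)) for i in range(k)]
    return solve(AtA, Aty)


def predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))


def r2_fit(X, y, coef):
    """r2 = 1 - SSE/SST for the fitted (training) points."""
    mean = sum(y) / len(y)
    sse = sum((yi - predict(coef, row)) ** 2 for row, yi in zip(X, y))
    return 1.0 - sse / sum((yi - mean) ** 2 for yi in y)


def q2_loo(X, y):
    """Leave-one-out q2: each point is predicted by a model fitted without it."""
    mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_mls(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    return 1.0 - press / sum((yi - mean) ** 2 for yi in y)


# Synthetic stand-in data (NOT the solvent data of Table 1).
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(45)]
y = [2.0 + 1.5 * a - 0.7 * b + 0.3 * c + random.gauss(0, 0.1) for a, b, c in X]
coef = fit_mls(X, y)
r2, q2 = r2_fit(X, y, coef), q2_loo(X, y)
```

Because every left-out point is predicted by a model that never saw it, q2 cannot exceed the training r2, which is why it is the more conservative of the two quality measures.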
δv-type† | Regression equations |
---|---|
a * with no outliers to allow comparison with previous results5 (present are better). † for the meaning of po and ppo see Appendix 2. | |
δvppo(0.5) | Tb = 183.6 + 1.807ε + 5.556RMv + 7.02GME − 41.59D + 102.02 ¹χv + 8.241TΣ/M (10) |
(8.3, 0.1, 1.0, 0.6, 4.4, 8.4, 1.01) | |
N(TR) = 45, q2 = 0.948, r2 = 0.963, s = 10; N(T) = 62, r2 = 0.874, s = 20 | |
Excluded strong outliers in EV: formamide | |
δvpo(50) | ε = −8.668 + 0.145Tb − 0.069M − 5.934HoME + 4.802Dv + 17.14 ¹ψI + 13.07TΣ/M (11) |
(5.3, 0.01, 0.01, 1.0, 0.8, 3.4, 3.0) | |
N(TR) = 44, q2 = 0.905, r2 = 0.941, s = 2.5; N(T) = 60, r2 = 0.937, s = 2.8 | |
Excluded strong outliers: ethylencarbonate (TR), HAc & formamide (EV) | |
δvppo(2) | d = −1.840 + 0.001Tb + 0.592AMv − 0.031Dv + 0.239 ⁰χv + 1.737TψI (12) |
(0.1, 0.0001, 0.01, 0.002, 0.01, 0.1) | |
N(TR) = 45, q2 = 0.981, r2 = 0.986, s = 0.03; N(T) = 60, r2 = 0.953, s = 0.06 | |
Excluded strong outliers in EV: formamide, MeOH; (N = 62, r2 = 0.906, s = 0.08*) | |
δvppo(1) | RI = 1.287 + 0.0007Tb − 0.131HoMI + 0.011M − 0.479 ¹χ + 0.071Dv − 0.080Δ (13) |
(0.03, 0.0001, 0.005, 0.0003, 0.02, 0.002, 0.006) | |
N(TR) = 45, q2 = 0.970, r2 = 0.983, s = 0.02; N(T) = 61, r2 = 0.979, s = 0.02 | |
δvpo(−0.5) | FP = −75.22 + 0.873Tb + 21.26d + 7.018AMv − 1.112GMv + 13.72rn41 (14) |
(8.2, 0.02, 4.1, 0.6, 0.07, 2.8) | |
N(TR) = 29, q2 = 0.986, r2 = 0.992, s = 3.1; N(T)= 41, r2 = 0.967, s = 6.4 | |
δvppo(2) | γ = −14.25 + 0.153Tb + 3.467RI + 2.345GMI + 0.475 ⁰ψId − 0.902SψE (15) |
(2.3, 0.01, 1.2, 0.2, 0.09, 0.05) | |
N(TR) = 29, q2 = 0.953, r2 = 0.977, s = 1.1; N(T) = 40, r2 = 0.865, s = 3.0 | |
Excluded strong outlier in EV: methanol | |
δvpo(5) | UV = −776.0 + 682.0 RI − 35.44HM + 7.259HMv + 27.69D (16) |
(68, 40, 5.6, 0.8, 6.2) | |
N(TR) = 25, q2 = 0.928, r2 = 0.955, s = 9.1; N(T) = 33, r2 = 0.919, s = 12 | |
Excluded strong outlier: 4-Me-2-Pentanone (TR); 2-butanone, MeCl, nitromethane (EV) | |
δvppo(50), φ = 0, 1 | μ = 0.0311 + 0.043ε + 0.327HMv − 0.293SME + 3.317 ¹χ + 0.188Σ (17) |
(0.1, 0.003, 0.04, 0.03, 0.3, 0.02) | |
N(TR) = 24, q2 = 0.939, r2 = 0.984, s = 0.2; N(T) = 34, r2 = 0.897, s = 0.4 | |
Excluded strong outlier in EV: formamide | |
δvpo(50) | −χ × 10⁶ = 0.231 + 0.004M − 0.008LME − 0.004StME − 0.137 ¹ψI + 0.252 ¹ψEs (18) |
(0.03, 0.0003, 0.001, 0.0006, 0.02, 0.05) | |
N(TR) = 23, q2 = 0.842, r2 = 0.945, s = 0.02; N(T) = 31, r2 = 0.911, s = 0.03 | |
Excluded strong outlier in EV: nitromethane | |
δvpo(5) | El = −1.479 + 0.006Tb + 0.332AMv − 0.021SME − 0.166rn12 (19) |
(0.1, 0.0003, 0.01, 0.001, 0.03) | |
N(TR) = 15, q2 = 0.945, r2 = 0.986, s = 0.02; N(T) = 20, r2 = 0.831, s = 0.1 | |
Pentane and tetrahydrofurane ∈{TR} |
Our previous ANN study5 has shown that, as a rule, ANN models fit the data better than the MLS ones, and for this reason the three best sets of MLS descriptors of similar quality for the training set of compounds have been passed on to the ANN method. Additionally, the ANN program chooses a small set (20%) of test compounds (underlined and bold compounds in Table 1) from among the training compounds to achieve rapid convergence and to avoid overtraining.
ANN methods, which are capable of performing regression and data validation, carry out both tasks in a non-parametric way that makes no assumption regarding the relationship between y and x, where y = f(x). This means that the function Property = f(indices) is not known a priori. In short, a non-parametric model is a kind of black box that tries to discover a mathematical function that approximates the relationship between the indices and the property well enough. It uses highly flexible transfer functions with adaptable parameters that can model a wide spectrum of functional relationships.13 ANN results were obtained with the built-in utility of Statistica 8, the multilayer perceptron neural network (MLP). The ANN-MLP network used here has a three-layered feedforward architecture with unidirectional full connections between successive layers and with error backpropagation (or backprop). The three layers are input units – hidden units – output units, which correspond to indices – hidden units – 1 (one), where the only output unit or neuron is the targeted property. The connections between the units (here two sets of links: input-hidden and hidden-output) carry the weights that determine the values assigned to the nodes. Additional weights are assigned to the bias values that act as node value offsets. The weights, which are adjusted by the training process, are initially random and are passed to all nodes of the following layer. The training process is iterative and each iteration is called an epoch. The weights are slightly varied in each epoch to minimize the sum-of-squares error function, $E = \sum_{i=1}^{N} (P_i^{\mathrm{clc}} - P_i)^2$, where $P_i^{\mathrm{clc}}$ (clc = calculated) is the ith predicted value (network output) of the property and $P_i$ is the corresponding target value. This function is the sum of the squared differences between the predicted outputs and the targets over the entire training set of N points (compounds). The number of hidden nodes in Statistica 8 is set, by default, between 3 and 11. For UV, MS, and El this number is set between 3 and 10.
The final weight values for a single property of, for instance, a [5–7–1] network could fill an entire page. For this reason Table 3 gives, as in our previous work, only the sensitivity values, which result from the sensitivity analysis that rates the importance of the model's input variables. The activation functions available for both hidden and output nodes in Statistica 8 are: identity (i), logistic sigmoid (l), hyperbolic tangent (t), sine (s), and exponential (e). The activation functions and the neuronal architecture are given in Table 3 together with the statistics, r2 and s, for each property, which were obtained with an EXCEL spreadsheet by plotting the observed P vs. the calculated Pclc ANN-MLP values.
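The three-layer architecture and the sum-of-squares error described above can be sketched as follows. This is an illustrative forward pass only, not the Statistica 8 implementation: the function names, the tanh/identity activation pair, the random weights and the input values are assumptions made for the example.

```python
import math
import random


def n_weights(n_in, n_hidden, n_out=1):
    """Adjustable weights of a fully connected three-layer MLP, biases included."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def forward(x, W1, b1, W2, b2, hidden=math.tanh, output=lambda z: z):
    """One forward pass: indices -> hidden units -> single property output."""
    h = [hidden(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return output(sum(w * hi for w, hi in zip(W2, h)) + b2)


def sos_error(targets, outputs):
    """Sum of squared differences between calculated and target values."""
    return sum((p_clc - p) ** 2 for p, p_clc in zip(targets, outputs))


# Random initial weights for a hypothetical [5-7-1] network.
random.seed(1)
W1 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(7)]
b1 = [random.uniform(-1, 1) for _ in range(7)]
W2 = [random.uniform(-1, 1) for _ in range(7)]
b2 = random.uniform(-1, 1)

x = [0.2, -0.1, 0.5, 0.0, 0.3]  # made-up, already scaled index values
p_clc = forward(x, W1, b1, W2, b2)
```

Under this counting a [5–7–1] network carries (5 + 1) × 7 + (7 + 1) × 1 = 50 adjustable weights, which illustrates why only sensitivity values, and not the weights themselves, are tabulated.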
MLP | δv (type) − (descriptors) → property |
---|---|
a * activation functions (in parenthesis): e = exponential, i = identity, l = logistic, t = tanh, s = sin. | |
6-3-1 | δvppo(0.5) − (ε, RMv, GME, D, 1χv, TΣ/M) → Tb (20) |
(e, t)* | (12.9, 17.63, 149.0, 86.34, 24.26, 2.443) |
30 | N(Tr) = 38, r2 = 0.968, s = 9.4, N(Tr + 7Te + 16 EV) = 61, r2 = 0.909, s = 17 |
0.001/0.001 | Excluded strong outlier in EV |
6-3-1 | δvpo(50) − (Tb, M, HoME, Dv, 1ψI, TΣ/M) → ε (21) |
(l, i) | (53.26, 2.150, 240.0, 108.1, 24.37, 2.646) |
54 | N(Tr) = 38, r2 = 0.985, s = 2.0, N(Tr+ 7Te + 16 EV) = 61, r2 = 0.984, s = 2.5 |
0.0002/0.0003 | Excluded strong outliers in EV: acetonitrile, and HAc |
5-6-1 | δvppo(1) − (Tb, AMv, Dv, 0χv, TψI) → d (22) |
(e, t) | (5.100, 138.9, 111.7, 74.96, 61.29) |
37 | N(Tr) = 36, r2 = 0.992, s = 0.03, N(Tr + 9Te + 15 EV) = 60, r2 = 0.992, s = 0.03 |
0.0004/0.0001 | Excluded strong outliers in EV: formamide, and Me–Cl |
5-5-1 | δvppo (5) − (Tb, RI, GMI, 1χv, TΣ/M) → FP (23) |
(t, i) | (133.3, 3.009, 4.263, 5.809, 2.443) |
13 | N(Tr) = 22, r2 = 0.990, s = 3.3, N(Tr + 7Te + 12 EV) = 41, r2 = 0.979, s = 5.1 |
0.0004/0.001 | |
4-5-1 | δvppo(2) − (RI, HM, GMv, Δ) → UV (25) |
(e, t) | (20.90, 138.9, 186.1, 19.98) |
29 | N(Tr) = 20, r2 = 0.969, s = 7.7, N(Tr + 5Te + 8 EV) = 33, r2 = 0.936, s = 11 |
0.0009/0.0004 | Excluded strong outliers: 4M2-pentanone in Tr, nitromethane, 2-butanone, and acetone in EV |
5-5-1 | δvppo(5) [φ = 0, 1] − (ε, LME, 1χd, SψE, 0ψEd) → μ (26) |
(l, s) | (89.93, 98.12, 69.91, 127.5, 212.0) |
53 | N(Tr) = 19, r2 = 0.989, s = 0.1, N(Tr + 5Te + 10 EV) = 34, r2 = 0.937, s = 0.3 |
0.0004/0.00006 | Excluded strong outliers in EV: MeOH |
5-5-1 | δvpo(0.5) − (M, LME, StME, 1ψI, 1ψEs) → −χ × 10⁶ (27) |
(t, e) | (79.09, 57.12, 15.52, 38.21, 8.276) |
33 | N(Tr) = 19, r2 = 0.968, s = 0.02, N(Tr + 4Te + 8 EV) = 31, r2 = 0.932, s = 0.03 |
0.0009/0.0006 | Excluded strong outliers in EV: nitromethane |
4-9-1 | δvpo(0.5) − (ε, LMv, 1ψE, Δ) → El (28) |
(e, l) | (37.46, 247.3, 489.7, 18.30) |
33 | N(Tr) = 12, r2 = 0.995, s = 0.01, N(Tr + 3Te + 5 EV) = 20, r2 = 0.932, s = 0.06 |
0.0002/0.00004 | Pentane and tetrahydrofurane belong here to {TR} |
Statistica 8 allows one to set only the number of networks to train and retain (100/40), without taking into account the number of training cycles/epochs. The network is optimized with the BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm, which ensures a fast convergence rate.14 Table 3 gives the number of epochs reported for each run, even if the actual number of cycles used to train the model might be greater. As the number of epochs is not definitive (it can exceed the given number), it cannot be regarded as a reliable parameter.
It is not rare for a model to fit the training data exceedingly well and thereby overfit, giving exceedingly poor values for the externally evaluated compounds. The split into training (here 80%) and test (here 20%) sets normally avoids overfitting, because the network is repeatedly trained for a number of cycles only as long as the test error decreases; as soon as the test error increases again, the training is halted.
As already explained in the previous section, the ANN-MLP methodology further subdivides the training compounds into training (80%) and test compounds (20%, underlined and bold in Table 1). In the model for the dipole moments, all indices were multiplied by a two-valued symmetry indicator variable, which is zero for symmetric molecules (with μ = 0 in Table 1) and 1 otherwise. Due to PC limitations the entire space of {MMCI, MCI, Rn, ExpPar} could not be searched for the best descriptor. The search was therefore done in two steps: (i) a search for the best descriptor within the set {MMCI, MCI, Exp.Par.}, i.e., best (MMCI, MCI, Exp.Par.), and then (ii) a search for the best descriptor within the set {best (MMCI, MCI, Exp.Par.), Rn}.
Table 3 shows the ANN-MLP results: the first column describes the MLP architecture, with the abbreviations of the activation functions for the hidden and output layers, the number of epochs, and the training and test errors. The second column shows the best set of indices of the ANN-MLP method, the values of the sensitivity analysis for the indices (2nd line), and the statistical parameters for the training compounds [N(Tr)] and for all compounds [N(Tr + nTe + nEV)] (3rd line). Notice that the training (TR) set of Table 2 is subdivided into training (Tr) and test (Te) sets throughout the ANN-MLP calculations of Table 3.
The reader can notice that viscosity, η, is absent from Table 2, as the MMCI contribute no model equation of improved quality relative to the one given in ref. 5, where MMCI were not considered. For the same reason refractive index, RI, viscosity, η, and surface tension, γ, are absent from Table 3.
For six properties out of eleven (RI, FP, γ, UV, μ, El) the ANN-MLP methodology does not choose the best descriptive equation of the MLS method. In general, the overall model quality of ANN-MLP (training + test + evaluation, Table 3) improves over that of MLS (training + evaluation, Table 2).
This confirms our previous findings.5 Let us now compare the MLS results of the present Table 2 with the corresponding results of Table 3 (MLS for −χ × 10⁶ and El) and Table 4 (MLS for all other properties) of ref. 5. On the whole, semiempirical equations with one or more MMCI fare better than the corresponding semiempirical equations with no MMCI. The only exception, as already noted in the previous paragraph, is the viscosity, η, whose semiempirical equation has no use for MMCI. The training semiempirical relationship for the magnetic susceptibility (−χ × 10⁶) with no MMCI (ref. 5) shows a better quality, but it also includes three random variables, while the corresponding present training semiempirical regression with MMCI + MCI has no use for random variables.
Turning to the overall model ability (training plus evaluated properties) of the MLS regressions, things change a bit: Tb, FP, γ, and μ show a preference for relationships with MCI only (Table 4 of ref. 5); UV and −χ × 10⁶ show no notable improvement with MMCI; while for the dielectric constant, ε, r2 prefers MCI only (ref. 5) but s improves when MMCI are added. Only for the remaining three properties, d, RI, and El, is there a clear preference for model equations that include MMCI.
Comparison of the ANN-MLP results of Table 3 with the corresponding results of Table 5 in ref. 5 shows that, more often than not, training semiempirical equations with MMCI + MCI fare better. There are some exceptions: the already cited cases of η, RI, and γ, and that of −χ × 10⁶. Comparison of the overall description (i.e., training plus evaluated points) shows improvements with nearly the same exceptions: η, RI, and γ. The overall description of −χ × 10⁶ improves consistently, while for UV and El the model quality is rather similar. All in all, mean molecular connectivity indices (MMCI) are useful both to improve model quality and to get rid of the random variables.
The model plots for the different properties improve relative to the previous ones.5 To give the reader an idea of the improvement, two model plots for the density, d, are shown here. Fig. 1 shows on the left side the plot obtained in ref. 5 with ANN-MLP-{Tb, M, 0χv, χtv, 1ψIs}, and on the right side the plot obtained with the present ANN-MLP-{Tb, M, AMv, Dv, TψI}, where AMv is an MMCI.
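The stopping criterion described above (keep training while the test error decreases, halt as soon as it rises) can be sketched with a toy one-weight model. The callback-based interface, the learning rate and the two target values are assumptions for illustration and do not reproduce Statistica 8 internals.

```python
def train_with_early_stopping(step, test_error, max_epochs=1000):
    """Run `step()` once per epoch and halt as soon as the test error rises.

    Returns the epoch at which training stopped and the lowest test error seen.
    """
    best = test_error()
    for epoch in range(1, max_epochs + 1):
        step()
        err = test_error()
        if err > best:          # test error went up: stop training
            return epoch, best
        best = err
    return max_epochs, best


# Toy model with a single weight w: the training set is best fitted by w = 2.0,
# while the test set is best fitted by w = 1.5 (both values are made up).
state = {"w": 0.0}


def step():
    state["w"] += 0.2 * (2.0 - state["w"])   # gradient step on (w - 2.0)**2


def test_error():
    return (state["w"] - 1.5) ** 2


stopped_epoch, best_test_error = train_with_early_stopping(step, test_error)
```

In this toy run the weight keeps moving towards the training optimum, so the test error first falls and then rises again; a practical implementation would also restore the weights of the best epoch, which this sketch omits.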
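The symmetry indicator used for the dipole moment model can be illustrated with a few lines of code. The function name and the index values below are hypothetical; only the rule itself (factor 0 for symmetric molecules, 1 otherwise) comes from the text.

```python
def apply_symmetry_indicator(indices, is_symmetric):
    """Multiply every index by phi = 0 (symmetric molecule) or phi = 1 (otherwise)."""
    phi = 0 if is_symmetric else 1
    return [phi * value for value in indices]


# Made-up index values for two molecules.
symmetric_case = apply_symmetry_indicator([1.2, 3.4, 0.7], is_symmetric=True)
polar_case = apply_symmetry_indicator([1.2, 3.4, 0.7], is_symmetric=False)
```

With every index set to zero, the descriptors contribute nothing for symmetric molecules and the model output no longer depends on them.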
Fig. 1 Plot for density (d) obtained in our previous ANN study5 (left) and in the present study (right). Full circles: training compounds; crosses: evaluation compounds; empty squares: test compounds.
The asymmetry detected in the residual plots for the evaluated points (more deviations are located on one side of the zero line than on the other) is not as drastic as in ref. 5, but it continues to show up. This detail, which is probably due to the fact that higher-order regressions are needed to model the present properties, does not thwart the predictive character of the present relationships.
Before closing this section, let us add some words about the descriptive quality of relationships made either with MMCI or with MCI alone:15 RI, FP, UV, and El are advantageously described with equations made of pure MMCI indices, μ is indifferent to the type of indices, while Tb, d, ε, η, γ, and MS (−χ × 10⁶) are advantageously described with pure MCI descriptors. Needless to say, the present semiempirical equations perform much better.
The improvement brought about by the new mean indices is a good hint that strategies to find new descriptors, even if not always successful, are always worth trying.16–18 Thirteen out of thirty-six MMCI show up in our semiempirical equations side-by-side with MCI and experimental parameters. The MCI are nearly double the MMCI in number, and the five experimental parameters are always a good help. Concerning the MMCIs, AMv shows up four times, LME thrice, HM, HMv, GMI, GME, GMv, SME, RMv, HoME, and StME twice, and HoMI and LMv only once. A brief look at these indices shows that they are mainly δv-dependent (the only exception being HM), either directly or through the intrinsic I-state and electrotopological S-state indices. This means not only that they depend on properties of general graphs, but also that both the hydrogen contribution and the complete graph contribution for the core electrons (see Appendix) play an important role in the descriptive quality of the MMCI. Notice that in eqn (13) and (17) the simple and seminal Randić 1χ index1 shows up, underlining that the corresponding more complex 1χv index2,9 does not cover all the properties.
The index HoME brings about the greatest improvement in the model of a property, the dielectric constant, ε: with no MMCIs we had N(T) = 61, r2 = 0.933, s = 3.9,5 while with HoME we have N(T) = 61, r2 = 0.984, s = 2.5. MMCI help to reduce the importance of random variables, while MMCI together with the ANN-MLP calculations have no use for them.
Concerning the different types of configurations of the MMCI and MCI indices due to the different types of valence delta, δvpo/ppo(n), it is possible to notice that in the MLS calculations half of the properties prefer the ppo configuration, while in the ANN-MLP calculations five out of eight properties prefer this configuration. Concerning the n values of δvpo/ppo(n), practically all values show up. Thus, general and complete graph characteristics such as multiple bonds and core electrons of heteroatoms (here chlorine and bromine), as well as the hydrogen contribution, are an important factor in modeling studies.
Arithmetic mean: $\mathrm{AM} = (a + b)/2$ | (A1) |
Geometric mean: $\mathrm{GM} = (ab)^{1/2}$ | (A2) |
Harmonic mean: $\mathrm{HM} = 2/(a^{-1} + b^{-1})$ | (A3) |
Root mean square: $\mathrm{RM} = [(a^{2} + b^{2})/2]^{1/2}$ | (A4) |
Symmetric mean: $\mathrm{SM} = (a^{2} + b^{2})/(a + b)$ | (A5) |
Unsymmetrical mean: $\mathrm{UM} = [b - a + (a^{2} - 2ab + 5b^{2})^{0.5}]/2$ | (A6) |
Hölder mean: $\mathrm{HoM}(p) = (a^{p} + b^{p})^{1/p}/2$ | (A7) |
Lehmer's mean: $\mathrm{LM}(p) = (a^{p} + b^{p})/(a^{p-1} + b^{p-1})$ | (A8) |
Stolarsky's mean: $\mathrm{StM}(p) = [(a^{p} - b^{p})/(pa - pb)]^{1/(p-1)}$ | (A9) |
The reader can notice that Stolarsky's mean has a minus instead of a plus sign in the denominator. The plus sign in eqn (9) was introduced to avoid the undefined value zero/zero that would arise whenever δi = δj (a = b in A9) had we held on to the minus sign.
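The two-number means of eqn (A1)–(A9) are easy to evaluate numerically. The sketch below transcribes them literally as printed above (including the Hölder form with the factor 1/2 outside the root, which differs from the usual power mean); the function name, the dictionary layout and the sample values a = 1, b = 3, p = 2 are illustrative assumptions. It covers the plain means only, not the bond-summed MMCI of eqn (1)–(9).

```python
import math


def means(a, b, p=2.0):
    """Nine two-number means as printed in eqn (A1)-(A9).

    Assumes a, b > 0, a != b (StM is 0/0 otherwise) and p not in {0, 1}.
    """
    return {
        "AM": (a + b) / 2,
        "GM": math.sqrt(a * b),
        "HM": 2 / (1 / a + 1 / b),
        "RM": math.sqrt((a * a + b * b) / 2),
        "SM": (a * a + b * b) / (a + b),
        "UM": (b - a + math.sqrt(a * a - 2 * a * b + 5 * b * b)) / 2,
        "HoM": (a ** p + b ** p) ** (1 / p) / 2,
        "LM": (a ** p + b ** p) / (a ** (p - 1) + b ** (p - 1)),
        "StM": ((a ** p - b ** p) / (p * a - p * b)) ** (1 / (p - 1)),
    }


m = means(1.0, 3.0, p=2.0)
```

For p = 2 the Lehmer and symmetric means coincide, and calling `means` with a = b raises a division-by-zero error in the Stolarsky term, which is the undefined case that the plus sign of eqn (9) is said to avoid.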
![]() | (A10) |
δv(ps) is the valence of a vertex in a chemical pseudograph (or general graph), which allows multiple bonds and self-connections (or loops). Normally, in chemical graph theory, simple graphs (with no multiple bonds or loops) and general graphs (or pseudographs) are hydrogen-depleted. Parameter p (= 1, 2, 3, 4, …) is the order of a complete graph, Kp, and r is its regularity (r = p − 1). A complete graph is a graph in which every pair of vertices is adjacent. The first-order complete graph, K1, is just a vertex and is usually used to encode second-row atoms. Parameter q in eqn (A10) is two-valued: q = 1 or p. Generally, two representations (or configurations) for δv are useful (see Tables 2 and 3): δvpo(n), where q = 1 and p = odd, and δvppo(n), where q = p and p = odd. The number n that appears in the two deltas is the value of the exponent n of fδ (eqn (A10)). It quantifies the importance of the hydrogen perturbation: the higher the n value, the lower the importance of the perturbation. The values of n used here, which generate different sets of indices, are n = −0.5, 0.5, 1, 2, 5, 50. This parameter could be used as a fine-tuning optimization variable, something like (but not quite) Randić's variable chi index,21,22 which was proposed as an alternative way of characterizing heteroatoms in molecules. The fractional hydrogen perturbation parameter fδ, which encodes the depleted hydrogen atoms, is defined as follows.
$f_{\delta} = 1 - \delta^{v}(ps)/\delta^{v}_{m}(ps) = n_{H}/\delta^{v}_{m}(ps)$ | (A11) |
$I = (\delta^{v} + 1)/\delta$, $S = I + \Sigma\Delta I$, with $\Delta I = (I_{i} - I_{j})/r_{ij}^{2}$ | (A12) |
This journal is © The Royal Society of Chemistry 2014 |