QSPR with descriptors based on averages of vertex invariants. An artificial neural network study

Lionello Pogliani*ab and Jesus Vicente de Julián-Ortizab
aMOLware SL, c/Burriana 36-3, 46005, Valencia, Spain
bUnidad de Investigación de Diseño de Farmacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, 46100 Burjassot, València, Spain. E-mail: liopo@uv.es; jejuor@uv.es

Received 10th July 2014, Accepted 2nd September 2014

First published on 3rd September 2014


Abstract

A new type of indices, the mean molecular connectivity indices (MMCI), based on nine different concepts of mean, is proposed to model eleven properties of organic solvents, together with molecular connectivity indices (MCI), experimental parameters and random variables. Two modeling methodologies are used to test the different descriptors: the multilinear least-squares (MLS) methodology and the artificial neural network (ANN) methodology. The top three quantitative structure–property relationships (QSPR) for each property are chosen with the MLS method. The indices of these three QSPRs are then used to train the ANNs, which select the best training set of indices to estimate the evaluation set of compounds. The best ANN relationships for most properties are of the semiempirical type and include MMCI, MCI and experimental parameters. Refractive index, RI, viscosity, η, and surface tension, γ, prefer a semiempirical relationship made of MCI and an experimental parameter only. In our previous study with no MMCI, random variables contributed to the semiempirical relationships of two properties at the ANN level (MS and El); here the use of MMCIs removes the contribution of such variables. Most of the MMCIs that help to improve the model of the properties are valence-delta-dependent (δv), that is, they encode both the hydrogen atom contribution and the core electrons of higher-row atoms.


1. Introduction

Molecular connectivity became a fully grown branch of chemical graph theory with Randić1 and Kier and Hall,2 and nearly a quarter of a century later Todeschini and Consonni3 were able to write a magnum opus on descriptors. In it they elegantly stated that “a descriptor is the final result of a logico-mathematical procedure, which transforms an information, encoded within a symbolic representation of an event, into useful numbers”. Descriptors are critical in QSAR/QSPR modeling studies; thus, finding new and useful ones is an important task for those working in the field.4

Nine definitions of the mean of two numbers can be found in the literature (see Appendix), and these definitions are used here to define a new type of indices, the mean molecular connectivity indices (MMCI). These new indices are used together with molecular connectivity indices, experimental parameters, and random variables to build optimal semiempirical quantitative structure–property relationships (QSPR) for eleven properties of a set of organic solvents. These properties were recently modeled5 with a semiempirical set of descriptors that encompassed only the molecular connectivity indices (MCI), empirical parameters, and random variables. The cited work5 emphasized the advantage of using ANN for modeling purposes. The two main aims of the present study are to test (i) the usefulness of the new MMCI indices and (ii) the related usefulness of the ANN methodology.

2. Computational tools

The mean molecular connectivity indices (MMCI) are defined as follows.
 
image file: c4ra06484d-t1.tif(1)
 
image file: c4ra06484d-t2.tif(2)
 
image file: c4ra06484d-t3.tif(3)
 
image file: c4ra06484d-t4.tif(4)
 
image file: c4ra06484d-t5.tif(5)
 
image file: c4ra06484d-t6.tif(6)
 
image file: c4ra06484d-t7.tif(7)
 
image file: c4ra06484d-t8.tif(8)
 
image file: c4ra06484d-t9.tif(9)

Here, i (and j) labels the N atoms of a hydrogen-depleted molecule, ij denotes two atoms directly bonded through a σ bond, and p = N, although other values are possible. It should be stressed that N for the studied molecules (see Table 1) is not large. The reader may notice, among other similarities, that the Lehmer mean, LM, for p = 2 equals the symmetric mean, SM.

Table 1 Eleven properties of organic solvents plus their molar mass M (g mol$^{-1}$): Tb, boiling point (K); ε, dielectric constant; d, density (at 20 °C ± 5 °C relative to water at 4 °C, g cm$^{-3}$); RI, refractive index (20 °C); FP, flash point (K); η, viscosity (cP, 20 °C; 1 at 25 °C, 2 at 15 °C); γ, surface tension (mN m$^{-1}$ at 25 °C); UV, UV cutoff values (nm); μ, dipole moment in debye (1 D = 10$^{-18}$ esu cm = 3.3356 × 10$^{-30}$ C m); MS, magnetic susceptibility (−χ × 10$^{6}$, in emu mol$^{-1}$, 1 emu = 1 cm$^{3}$; temperatures cover a range from 15 °C to 32 °C); and El, elutropic value (silica)a

| Solvents | M | Tb | ε | d | RI | FP | η | γ | UV | μ | MS | El |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (°) Acetone | 58.1 | 329 | 20.7 | 0.791 | 1.359 | 256 | 0.32 | 23.46 | 330 | 2.88 | 0.46 | 0.43 |
| (°) Acetonitrile | 41.05 | 355 | 37.5 | 0.786 | 1.344 | 278 | 0.37 | 28.66 | 190 | 3.92 | 0.534 | 0.50 |
| Benzene | 78.1 | 353 | 2.3 | 0.84 | 1.501 | 262 | 0.65 | 28.22 | 280 | 0 | 0.699 | 0.27 |
| Benzonitrile | 103.1 | 461 | 25.2 | 1.010 | 1.528 | 344 | 1.241 | 38.79 | | | | |
| 1-Butanol | 74.1 | 391 | 17.1 | 0.810 | 1.399 | 308 | 2.95 | 24.93 | 215 | | | |
| (°) 2-Butanone | 72.1 | 353 | 18.5 | 0.805 | 1.379 | 270 | 0.40 | 23.97 | 330 | | | 0.39 |
| Butyl acetate | 116.2 | 398 | 5.0 | 0.882 | 1.394 | 295 | 0.73 | 24.88 | 254 | | | |
| CS2 | 76.1 | 319 | 2.6 | 1.266 | 1.627 | 240 | 0.37 | 31.58 | 380 | 0 | 0.532 | |
| CCl4 | 153.8 | 350 | 2.2 | 1.594 | 1.460 | | 0.97 | 26.43 | 263 | 0 | 0.691 | 0.14 |
| Cl-benzene | 112.6 | 405 | 5.6 | 1.107 | 1.524 | 296 | 0.80 | 32.99 | 287 | | | |
| 1Cl-butane | 92.6 | 351 | 7.4 | 0.886 | 1.4024 | 267 | 0.35 | 23.18 | 225 | | | |
| CHCl3 | 119.4 | 334 | 4.8 | 1.492 | 1.446 | | 0.57 | 26.67 | 245 | 1.01 | 0.740 | 0.31 |
| Cyclohexane | 84.2 | 354 | 2.0 | 0.779 | 1.426 | 255 | 1.00 | 24.65 | 200 | 0 | 0.627 | 0.03 |
| (°) Cyclopentane | 70.1 | 323 | 2.0 | 0.751 | 1.400 | 236 | 0.47 | 21.88 | 200 | | 0.629 | |
| 1,2-DiCl-benzene | 147.0 | 453 | 9.9 | 1.306 | 1.551 | 338 | 1.32 | | 295 | 2.50 | 0.748 | |
| 1,2-DiCl-ethane | 98.95 | 356 | 10.4 | 1.256 | 1.444 | 288 | 0.79 | 31.86 | 225 | 1.75 | | |
| DiCl-methane | 84.9 | 313 | 9.1 | 1.325 | 1.424 | | 0.44 | 27.20 | 235 | 1.60 | 0.733 | 0.32 |
| N,N-DiM-acetamide | 87.1 | 438 | 37.8 | 0.937 | 1.438 | 343 | | | 268 | 3.8 | | |
| N,N-DiM-formamide | 73.1 | 426 | 36.7 | 0.944 | 1.431 | 330 | 0.92 | | 268 | 3.86 | | |
| 1,4-Dioxane | 88.1 | 374 | 2.2 | 1.034 | 1.422 | 285 | 1.54 | 32.75 | 215 | 0.45 | 0.606 | |
| Ether | 74.1 | 308 | 4.3 | 0.708 | 1.353 | 233 | 0.24 | 16.95 | 215 | 1.15 | | 0.29 |
| Ethyl acetate | 88.1 | 350 | 6.0 | 0.902 | 1.372 | 270 | 0.45 | 23.39 | 260 | 1.8 | 0.554 | 0.45 |
| (°) Ethyl alcohol | 46.1 | 351 | 24.3 | 0.785 | 1.360 | 281 | 1.20 | 21.97 | 210 | 1.69 | 0.575 | |
| Heptane | 100.2 | 371 | 1.9 | 0.684 | 1.387 | 272 | | 19.65 | 200 | | | 0.00 |
| Hexane | 86.2 | 342 | 1.9 | 0.659 | 1.375 | 250 | 0.33 | 17.89 | 200 | | | 0.00 |
| 2-Methoxyethanol | 76.1 | 398 | 16.0 | 0.965 | 1.402 | 319 | 1.72 | 30.84 | 220 | | | |
| (°) Methyl alcohol | 32.0 | 338 | 32.7 | 0.791 | 1.329 | 284 | 0.60 | 22.07 | 205 | 1.70 | 0.530 | 0.73 |
| (°) 2-Methylbutane | 72.15 | 303 | 1.8 | 0.620 | 1.354 | 217 | | | | | | |
| 4-Me-2-pentanone | 100.2 | 391 | 13.1 | 0.800 | 1.396 | 286 | | | 334 | | | |
| 2-Me-1-propanol | 74.1 | 381 | 17.7 | 0.803 | 1.396 | 310 | | | | | | |
| 2-Me-2-propanol | 74.1 | 356 | 10.9 | 0.786 | 1.387 | 277 | | 19.96 | | 1.66 | 0.534 | |
| DMSO | 78.1 | 462 | 46.7 | 1.101 | 1.479 | 368 | 2.24 | 42.92 | 268 | 3.96 | | |
| (°) Nitromethane | 61.0 | 374 | 35.9 | 1.127 | 1.382 | 308 | 0.67 | 36.53 | 380 | 3.46 | 0.391 | |
| 1-Octanol | 130.2 | 469 | 10.3 | 0.827 | 1.429 | 354 | 10.62 | 27.10 | | | | |
| (°) Pentane | 72.15 | 309 | 1.8 | 0.626 | 1.358 | 224 | 0.23 | 15.49 | 200 | | | 0.00 |
| 3-Pentanone | 86.1 | 375 | 17.0 | 0.853 | 1.392 | 279 | | 24.74 | | | | |
| (°) 1-Propanol | 60.1 | 370 | 20.1 | 0.804 | 1.384 | 288 | 2.26 | 23.32 | 210 | | | |
| (°) 2-Propanol | 60.1 | 356 | 18.3 | 0.785 | 1.377 | 295 | 2.30 | 20.93 | 210 | | | 0.63 |
| Pyridine | 79.1 | 388 | 12.3 | 0.978 | 1.510 | 293 | 0.94 | 36.56 | 305 | 2.2 | 0.611 | 0.55 |
| TetraCl-ethylene | 165.8 | 394 | 2.3 | 1.623 | 1.506 | | 0.90 | | | | 0.802 | |
| (°) Tetra-hydrofuran | 72.1 | 340 | 7.6 | 0.886 | 1.407 | 256 | 0.55 | | 215 | 1.75 | | 0.35 |
| Toluene | 92.1 | 384 | 2.4 | 0.867 | 1.496 | 277 | 0.59 | 27.93 | 285 | 0.36 | 0.618 | 0.22 |
| 1,1,2-TriCl, triFEthane | 187.4 | 321 | 2.4 | 1.575 | 1.358 | | 0.69 | | 230 | | | 0.02 |
| 2,2,4-TriMe-pentane | 114.2 | 372 | 1.9 | 0.692 | 1.391 | 266 | 0.50 | | 215 | | | 0.01 |
| o-Xylene | 106.2 | 417 | 2.6 | 0.870 | 1.505 | 305 | 0.81 | 29.76 | | | | |
| p-Xylene | 106.2 | 411 | 2.3 | 0.866 | 1.495 | 300 | 0.65 | 28.01 | | | | |
| (°) Acetic acid | 60.05 | 391 | 6.15 | 1.049 | 1.372 | | | 27.10 | | 1.2 | 0.551 | |
| Decaline | 138.2 | 465 | 2.2 | 0.879 | 1.476 | | | | | | 0.681 | |
| DiBr-methane | 173.8 | 370 | 7.8 | 1.542 | 2.497 | | | 39.05 | | 1.43 | 0.935 | |
| 1,2-DiCl-ethylen (Z) | 96.9 | 334 | 9.2 | 1.284 | 1.449 | | | | | 1.90 | 0.679 | |
| (°) 1,2-DiCl-ethylen (E) | 96.9 | 321 | 2.1 | 1.255 | 1.446 | | | | | 0 | 0.638 | |
| 1,1-DiCl-ethylen | 96.9 | 305 | 4.7 | 1.213 | 1.425 | | | | | 1.34 | 0.635 | |
| Dimethoxymethane | 76.1 | 315 | 2.7 | 0.866 | 1.356 | | | | | | 0.611 | |
| (°) Dimethylether | 46.1 | 249 | 5.0 | | | | | | | | | |
| Ethylene carbonate | 88.1 | 511 | 89.6 | 1.321 | 1.425 | | | | | 4.91 | | |
| (°) Formamide | 45.0 | 484 | 109 | 1.133 | 1.448 | | | 57.03 | | 3.73 | 0.551 | |
| (°) Methylchloride | 50.5 | 249 | 12.6 | 0.916 | 1.339 | | | | | 1.87 | | |
| Morpholine | 87.1 | 402 | 7.3 | 1.005 | 1.457 | | | | | | 0.631 | |
| Quinoline | 129.2 | 510 | 9.0 | 1.098 | 1.629 | | | 42.59 | | 2.2 | 0.729 | |
| (°) SO2 | 64.1 | 263 | 17.6 | 1.434 | | | | | | 1.6 | | |
| 2,2-TetraCl-ethane | 167.8 | 419 | 8.2 | 1.578 | 1.487 | | | 35.58 | | 1.3 | 0.856 | |
| TetraMe-urea | 116.2 | 450 | 23.1 | 0.969 | 1.449 | | | | | 3.47 | 0.634 | |
| TriCl-ethylen | 131.4 | 360 | 3.4 | 1.476 | 1.480 | | | | | | 0.734 | |

a (°) externally validated compounds. Italicized bold values: test compounds used in ANN-MLP calculations.


Replacing δ throughout these definitions with (i) the valence delta, δv, (ii) the intrinsic I-state indices, and (iii) the electrotopological S-state indices (see ref. 5–7 and Appendix), it is possible to obtain three further subsets: the valence MMCI, {AMv, GMv, HMv, RMv, SMv, UMv, HoMv, LMv, StMv}, the I-state MMCI, {AMI, GMI, HMI, RMI, SMI, UMI, HoMI, LMI, StMI}, and the E-state MMCI, {AME, GME, HME, RME, SME, UME, HoME, LME, StME}. The basic notions of delta, valence delta, and I- and S-indices belong to the origins of molecular connectivity theory and are based on graph concepts.1–3,8–10 To avoid imaginary S-state MMCIs, as some S values for highly electropositive atoms can be negative, the S values are rescaled (see ref. 5). Summing up, we have thirty-six MMCIs: the nine original ones plus the three derived subsets of nine. Other MMCI can be derived following different types of bonding and branching, as suggested by Kier and Hall,9 but for our present purpose these are enough. To model our eleven properties we will also use thirty MCI (see Table 2 in ref. 5) and fifty random variables, rn1–rn50 (where 0 < rn < 1). The five experimental variables of Table 1, {M, Tb, ε, d, RI}, will also be used as indices throughout the present calculations, i.e., in some cases they will show up on the right-hand side of the modeling relationships, and the relationship will then be labeled semiempirical. The final number of independent variables thus sums to 121 (36 MMCI + 30 MCI + 50 random variables + 5 experimental variables). The best relationship for each property might then encompass four different types of indices: MMCIs, MCIs, experimental variables, and random variables.

The MMCI have been obtained with an in-house Visual Basic program that uses both adjacency and distance matrices6 and runs on a PC. The number of indices in each of the present multilinear relationships equals the number of indices in the corresponding relationship of ref. 5, which obeyed the Topliss–Costello rule:11 the ratio of data points to the number of variables should be greater than or equal to five, and the relationship should provide a correlation coefficient r > 0.84 ($r^2$ > 0.70).
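The rule quoted above can be written as a short check. The following is only a sketch in Python: the function name and its parameters are hypothetical, and it assumes that the number of variables counts the indices of a relationship without the intercept.

```python
def satisfies_topliss_costello(n_points, n_variables, r,
                               min_ratio=5.0, min_r=0.84):
    """Check the two conditions of the rule as stated in the text.

    n_points    -- number of data points (compounds) in the regression
    n_variables -- number of indices (assumed here to exclude the intercept)
    r           -- correlation coefficient of the relationship
    """
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    ratio = n_points / n_variables
    return ratio >= min_ratio and r > min_r


# Illustrative call: 45 training compounds and six indices, with r taken
# as the square root of r2 = 0.963 (the training figures of eqn (10)).
print(satisfies_topliss_costello(45, 6, 0.963 ** 0.5))
```

With fewer compounds the same six-index relationship would fail the ratio test, which is why the properties with the smallest data sets use fewer indices.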

The multilinear least-squares procedure of Statistica 8 is used to find the best relationship for the training compounds of Table 1, which is then used to evaluate the left-out compounds [those with (°) in Table 1]. It should be stressed that in principle the experimental values of the evaluated points are unknown, and they have to be estimated from the predictive relationship obtained with the training points. The residuals then show how much the estimated evaluation points deviate from the true values, and how symmetrically the residuals (deviations) are placed around the zero line in a residual plot.12 The overall quality of the model for each property, that is, $r^2$, s, and N (here the number of compounds), is obtained with an Excel spreadsheet by plotting the observed property (P) vs. the calculated one, Pclc. The quality of the training regression equation is also given by the $q^2$ leave-one-out statistic5 (Table 2).
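As an illustration of the leave-one-out statistic, the sketch below refits a multilinear least-squares model with each compound left out in turn and accumulates the squared prediction errors (PRESS). It is not the Statistica 8 procedure: the helper names are hypothetical, the data are toy values, and $q^2$ is assumed to be 1 − PRESS/SS, with SS taken about the mean of all observed values.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mls(rows, y):
    """Coefficients [c0, c1, ...] of y = c0 + sum(ci * xi) via normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate the fitted relationship for one compound."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


def q2_leave_one_out(rows, y):
    """q2 = 1 - PRESS / SS, refitting the model with each point left out."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    press = 0.0
    for i in range(len(y)):
        coeffs = fit_mls(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coeffs, rows[i])) ** 2
    return 1.0 - press / ss_total


# Toy data (not from Table 1): one index, nearly linear property.
toy_rows = [[0.0], [1.0], [2.0], [3.0], [4.0]]
toy_y = [2.1, 4.9, 8.2, 10.8, 14.1]
print(q2_leave_one_out(toy_rows, toy_y))
```

Because every left-out point is predicted by a model that never saw it, $q^2$ is never larger than the training $r^2$ of a least-squares fit with an intercept, which matches the pattern of the values reported in Table 2.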

Table 2 The best MLS results for ten out of the eleven properties. 1st column: $\delta^v$ type5 for the valence-dependent indices. 2nd column: relationships and statistical results for the training, N(TR), and training plus evaluation compounds, N(T). In the last line, the excluded strong outliers (those with residuals > 3s)a

| $\delta^v$-type† | Regression equations |
|---|---|
| $\delta^v_{ppo}(0.5)$ | $T_\mathrm{b} = 183.6 + 1.807\,\varepsilon + 5.556\,\mathrm{RMv} + 7.02\,\mathrm{GME} - 41.59\,D + 102.02\,{}^{1}\chi^{v} + 8.241\,{}^{T}\Sigma/M$ (10) |
| | (8.3, 0.1, 1.0, 0.6, 4.4, 8.4, 1.01) |
| | N(TR) = 45, $q^2$ = 0.948, $r^2$ = 0.963, s = 10; N(T) = 62, $r^2$ = 0.874, s = 20 |
| | Excluded strong outliers in EV: formamide |
| $\delta^v_{po}(50)$ | $\varepsilon = -8.668 + 0.145\,T_\mathrm{b} - 0.069\,M - 5.934\,\mathrm{HoME} + 4.802\,D^{v} + 17.14\,{}^{1}\psi_{I} + 13.07\,{}^{T}\Sigma/M$ (11) |
| | (5.3, 0.01, 0.01, 1.0, 0.8, 3.4, 3.0) |
| | N(TR) = 44, $q^2$ = 0.905, $r^2$ = 0.941, s = 2.5; N(T) = 60, $r^2$ = 0.937, s = 2.8 |
| | Excluded strong outliers: ethylencarbonate (TR), HAc & formamide (EV) |
| $\delta^v_{ppo}(2)$ | $d = -1.840 + 0.001\,T_\mathrm{b} + 0.592\,\mathrm{AMv} - 0.031\,D^{v} + 0.239\,{}^{0}\chi^{v} + 1.737\,{}^{T}\psi_{I}$ (12) |
| | (0.1, 0.0001, 0.01, 0.002, 0.01, 0.1) |
| | N(TR) = 45, $q^2$ = 0.981, $r^2$ = 0.986, s = 0.03; N(T) = 60, $r^2$ = 0.953, s = 0.06 |
| | Excluded strong outliers in EV: formamide, MeOH; (N = 62, $r^2$ = 0.906, s = 0.08*) |
| $\delta^v_{ppo}(1)$ | $\mathrm{RI} = 1.287 + 0.0007\,T_\mathrm{b} - 0.131\,\mathrm{HoMI} + 0.011\,M - 0.479\,{}^{1}\chi + 0.071\,D^{v} - 0.080\,\Delta$ (13) |
| | (0.03, 0.0001, 0.005, 0.0003, 0.02, 0.002, 0.006) |
| | N(TR) = 45, $q^2$ = 0.970, $r^2$ = 0.983, s = 0.02; N(T) = 61, $r^2$ = 0.979, s = 0.02 |
| $\delta^v_{po}(-0.5)$ | $\mathrm{FP} = -75.22 + 0.873\,T_\mathrm{b} + 21.26\,d + 7.018\,\mathrm{AMv} - 1.112\,\mathrm{GMv} + 13.72\,\mathrm{rn}_{41}$ (14) |
| | (8.2, 0.02, 4.1, 0.6, 0.07, 2.8) |
| | N(TR) = 29, $q^2$ = 0.986, $r^2$ = 0.992, s = 3.1; N(T) = 41, $r^2$ = 0.967, s = 6.4 |
| $\delta^v_{ppo}(2)$ | $\gamma = -14.25 + 0.153\,T_\mathrm{b} + 3.467\,\mathrm{RI} + 2.345\,\mathrm{GMI} + 0.475\,{}^{0}\psi_{Id} - 0.902\,{}^{S}\psi_{E}$ (15) |
| | (2.3, 0.01, 1.2, 0.2, 0.09, 0.05) |
| | N(TR) = 29, $q^2$ = 0.953, $r^2$ = 0.977, s = 1.1; N(T) = 40, $r^2$ = 0.865, s = 3.0 |
| | Excluded strong outlier in EV: methanol |
| $\delta^v_{po}(5)$ | $\mathrm{UV} = -776.0 + 682.0\,\mathrm{RI} - 35.44\,\mathrm{HM} + 7.259\,\mathrm{HMv} + 27.69\,D$ (16) |
| | (68, 40, 5.6, 0.8, 6.2) |
| | N(TR) = 25, $q^2$ = 0.928, $r^2$ = 0.955, s = 9.1; N(T) = 33, $r^2$ = 0.919, s = 12 |
| | Excluded strong outlier: 4-Me-2-Pentanone (TR); 2-butanone, MeCl, nitromethane (EV) |
| $\delta^v_{ppo}(50)$, φ = 0, 1 | $\mu = 0.0311 + 0.043\,\varepsilon + 0.327\,\mathrm{HMv} - 0.293\,\mathrm{SME} + 3.317\,{}^{1}\chi + 0.188\,\Sigma$ (17) |
| | (0.1, 0.003, 0.04, 0.03, 0.3, 0.02) |
| | N(TR) = 24, $q^2$ = 0.939, $r^2$ = 0.984, s = 0.2; N(T) = 34, $r^2$ = 0.897, s = 0.4 |
| | Excluded strong outlier in EV: formamide |
| $\delta^v_{po}(50)$ | $-\chi \times 10^{6} = 0.231 + 0.004\,M - 0.008\,\mathrm{LME} - 0.004\,\mathrm{StME} - 0.137\,{}^{1}\psi_{I} + 0.252\,{}^{1}\psi_{Es}$ (18) |
| | (0.03, 0.0003, 0.001, 0.0006, 0.02, 0.05) |
| | N(TR) = 23, $q^2$ = 0.842, $r^2$ = 0.945, s = 0.02; N(T) = 31, $r^2$ = 0.911, s = 0.03 |
| | Excluded strong outlier in EV: nitromethane |
| $\delta^v_{po}(5)$ | $\mathrm{El} = -1.479 + 0.006\,T_\mathrm{b} + 0.332\,\mathrm{AMv} - 0.021\,\mathrm{SME} - 0.166\,\mathrm{rn}_{12}$ (19) |
| | (0.1, 0.0003, 0.01, 0.001, 0.03) |
| | N(TR) = 15, $q^2$ = 0.945, $r^2$ = 0.986, s = 0.02; N(T) = 20, $r^2$ = 0.831, s = 0.1 |
| | Pentane and tetrahydrofurane ∈ {TR} |

a * with no outliers to allow comparison with previous results5 (present are better). † for the meaning of po and ppo see Appendix 2.
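The outlier criterion named in the caption of Table 2 (residuals larger than 3s) can be sketched as follows. The function name and the numbers in the example are hypothetical, and the criterion is assumed to apply to the absolute value of the residual.

```python
def flag_strong_outliers(names, observed, calculated, s):
    """Return the names of compounds whose absolute residual exceeds 3*s.

    s is the standard deviation of the estimate of the regression.
    """
    return [name
            for name, p, p_clc in zip(names, observed, calculated)
            if abs(p - p_clc) > 3 * s]


# Hypothetical values, not taken from Table 1 or Table 2.
print(flag_strong_outliers(["A", "B", "C"],
                           [350.0, 400.0, 484.0],
                           [345.0, 410.0, 440.0],
                           s=10.0))
```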


Our previous ANN study5 has shown that, as a rule, ANN models fit the data better than the MLS ones, and this is the reason why the three best sets of MLS descriptors, of similar quality, for the training set of compounds have been passed on to the ANN method. Additionally, the ANN program chooses a small set (20%) of test compounds (underlined and bold compounds in Table 1) from among the training compounds to achieve rapid convergence and to avoid overtraining.

ANN methods, which are capable of performing regression and data validation, carry out both tasks in a non-parametric way that makes no assumption regarding the relationship between y and x, where y = f(x). This means that the function Property = f(indices) is not known a priori. In short, a non-parametric model is a kind of black box that tries to discover the mathematical function that approximates the relationship between the indices and the property well enough. It uses highly flexible transfer functions with adaptable parameters that can model a wide spectrum of functional relationships.13 ANN results were obtained with the built-in utility of Statistica 8, the multilayer perceptron neural network (MLP). The ANN-MLP network used here has a three-layered feedforward architecture with unidirectional full connections between successive layers and with error backpropagation (or backprop). The three layers are: input units – hidden units – output units, which correspond to: indices – hidden units – 1 (one), where the only output unit or neuron is the targeted property. The connections between the units (here two sets of links: input-hidden and hidden-output) are the weights that determine the values assigned to the nodes. There are additional weights assigned to the bias values, which act as node value offsets. The weights that are adjusted by the training process are initially random and are passed to all nodes of the following layer. The training process is iterative and each iteration is called an epoch. The weights are slightly varied in each epoch to minimize the sum-of-squares error function, $E = \sum_{i=1}^{N} (P_i^{\mathrm{clc}} - P_i)^2$, where $P_i^{\mathrm{clc}}$ (clc = calculated) is the ith predicted value (network output) of the property and $P_i$ is the corresponding target value. This function is the sum of the squared differences between the predicted outputs and the targets, taken over the entire training set of N points (compounds).
The number of hidden nodes in Statistica 8 is set, by default, between 3 and 11. For UV, MS, and El this number is set between 3 and 10. This means that the final weight values for a single property of, for instance, a [5–7–1] network could fill an entire page. In Table 3 are given, as in our previous work, only the sensitivity values, which result from the sensitivity analysis that rates the importance of the models' input variables. The activation functions for both hidden and output nodes in Statistica 8 are: identity (i), logistic sigmoid (l), hyperbolic tangent (t), sine (s), and exponential (e). The detailed activation functions, together with the neuronal architecture, are given in Table 3 along with the statistics, $r^2$ and s, for each property, which were obtained with an Excel spreadsheet by plotting the observed P vs. the calculated Pclc ANN-MLP values.
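To make the architecture concrete, the sketch below builds a [5–7–1] network of the kind described above, counts its weights (bias weights included), and evaluates a sum-of-squares error. It is a minimal illustration in plain Python, not the Statistica 8 implementation: the function names, the uniform initialization range, and the choice of a tanh hidden layer with an identity output are assumptions, and no training step is shown.

```python
import math
import random


def init_mlp(n_inputs, n_hidden, seed=0):
    """Random initial weights; element 0 of each weight list is the bias weight."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def count_weights(hidden, output):
    """Total number of adjustable weights, bias weights included."""
    return sum(len(w) for w in hidden) + len(output)


def forward(hidden, output, x, hidden_act=math.tanh, output_act=lambda z: z):
    """Feedforward pass: indices -> hidden units -> one output (the property)."""
    h = [hidden_act(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
         for w in hidden]
    return output_act(output[0] + sum(wi * hi for wi, hi in zip(output[1:], h)))


def sum_of_squares_error(hidden, output, inputs, targets):
    """Sum over all compounds of (calculated - target)^2."""
    return sum((forward(hidden, output, x) - t) ** 2
               for x, t in zip(inputs, targets))


hidden, output = init_mlp(n_inputs=5, n_hidden=7)
print(count_weights(hidden, output))  # (5 + 1) * 7 + (7 + 1) * 1 weights
```

Training would adjust these weights epoch by epoch to lower the error; only the sensitivity values, not the weights themselves, are reported in Table 3.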

Table 3 ANN-MLP results for eight out of eleven properties. 1st column: the MLP architecture, the abbreviation for the activation functions for the hidden and output layers, the number of epochs, and the training and test errors; 2nd column: indices of the ANN relations, sensitivity values for the indices, and statistical parameters for the training (Tr), and training plus test (Te) and evaluation (EV) compoundsa

| MLP | $\delta^v$ (type) − (descriptors) → property |
|---|---|
| 6-3-1 | $\delta^v_{ppo}(0.5)$ − (ε, RMv, GME, D, $^{1}\chi^{v}$, $^{T}\Sigma/M$) → Tb (20) |
| (e, t)* | (12.9, 17.63, 149.0, 86.34, 24.26, 2.443) |
| 30 | N(Tr) = 38, $r^2$ = 0.968, s = 9.4, N(Tr + 7Te + 16 EV) = 61, $r^2$ = 0.909, s = 17 |
| 0.001/0.001 | Excluded strong outlier in EV: SO2 |
| 6-3-1 | $\delta^v_{po}(50)$ − (Tb, M, HoME, $D^{v}$, $^{1}\psi_{I}$, $^{T}\Sigma/M$) → ε (21) |
| (l, i) | (53.26, 2.150, 240.0, 108.1, 24.37, 2.646) |
| 54 | N(Tr) = 38, $r^2$ = 0.985, s = 2.0, N(Tr + 7Te + 16 EV) = 61, $r^2$ = 0.984, s = 2.5 |
| 0.0002/0.0003 | Excluded strong outliers in EV: acetonitrile, and HAc |
| 5-6-1 | $\delta^v_{ppo}(1)$ − (Tb, AMv, $D^{v}$, $^{0}\chi^{v}$, $^{T}\psi_{I}$) → d (22) |
| (e, t) | (5.100, 138.9, 111.7, 74.96, 61.29) |
| 37 | N(Tr) = 36, $r^2$ = 0.992, s = 0.03, N(Tr + 9Te + 15 EV) = 60, $r^2$ = 0.992, s = 0.03 |
| 0.0004/0.0001 | Excluded strong outliers in EV: formamide, and Me–Cl |
| 5-5-1 | $\delta^v_{ppo}(5)$ − (Tb, RI, GMI, $^{1}\chi^{v}$, $^{T}\Sigma/M$) → FP (23) |
| (t, i) | (133.3, 3.009, 4.263, 5.809, 2.443) |
| 13 | N(Tr) = 22, $r^2$ = 0.990, s = 3.3, N(Tr + 7Te + 12 EV) = 41, $r^2$ = 0.979, s = 5.1 |
| 0.0004/0.001 | |
| 4-5-1 | $\delta^v_{ppo}(2)$ − (RI, HM, GMv, Δ) → UV (25) |
| (e, t) | (20.90, 138.9, 186.1, 19.98) |
| 29 | N(Tr) = 20, $r^2$ = 0.969, s = 7.7, N(Tr + 5Te + 8 EV) = 33, $r^2$ = 0.936, s = 11 |
| 0.0009/0.0004 | Excluded strong outliers: 4M2-pentanone in Tr, nitromethane, 2-butanone, and acetone in EV |
| 5-5-1 | $\delta^v_{ppo}(5)$ [φ = 0, 1] − (ε, LME, $^{1}\chi_{d}$, $^{S}\psi_{E}$, $^{0}\psi_{Ed}$) → μ (26) |
| (l, s) | (89.93, 98.12, 69.91, 127.5, 212.0) |
| 53 | N(Tr) = 19, $r^2$ = 0.989, s = 0.1, N(Tr + 5Te + 10 EV) = 34, $r^2$ = 0.937, s = 0.3 |
| 0.0004/0.00006 | Excluded strong outliers in EV: MeOH |
| 5-5-1 | $\delta^v_{po}(0.5)$ − (M, LME, StME, $^{1}\psi_{I}$, $^{1}\psi_{Es}$) → $-\chi \times 10^{6}$ (27) |
| (t, e) | (79.09, 57.12, 15.52, 38.21, 8.276) |
| 33 | N(Tr) = 19, $r^2$ = 0.968, s = 0.02, N(Tr + 4Te + 8 EV) = 31, $r^2$ = 0.932, s = 0.03 |
| 0.0009/0.0006 | Excluded strong outliers in EV: nitromethane |
| 4-9-1 | $\delta^v_{po}(0.5)$ − (ε, LMv, $^{1}\psi_{E}$, Δ) → El (28) |
| (e, l) | (37.46, 247.3, 489.7, 18.30) |
| 33 | N(Tr) = 12, $r^2$ = 0.995, s = 0.01, N(Tr + 3Te + 5 EV) = 20, $r^2$ = 0.932, s = 0.06 |
| 0.0002/0.00004 | Pentane and tetrahydrofurane belong here to {TR} |

a * activation functions (in parenthesis): e = exponential, i = identity, l = logistic, t = tanh, s = sin.


Statistica 8 allows one to set only the number of networks to train and retain (100/40), without taking into account the number of training cycles/epochs. The network is optimized with the BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm, which ensures a fast convergence rate.14 Table 3 gives the number of epochs reported for each run, even if the actual number of cycles used to train the model might be greater. As the number of epochs is not definitive, it cannot be taken as a reliable parameter (it can exceed the given number).

It is not rare for a model to fit the training data exceedingly well and thereby overfit, giving exceedingly poor externally evaluated values. The choice of training (here 80%) and test (here 20%) sets normally avoids overfitting, because the network is repeatedly trained for a number of cycles only as long as the test error is decreasing; as soon as the test error increases again, the training is halted.
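The stopping rule described above can be sketched in a few lines. This is a simplified illustration rather than the Statistica 8 logic: the function name is hypothetical, and training is assumed to stop at the first increase of the test error, with the weights of the preceding epoch retained.

```python
def early_stopping_epoch(test_errors):
    """Return the index of the last epoch before the test error first increases.

    test_errors -- test-set error recorded after each epoch, in order.
    """
    if not test_errors:
        raise ValueError("at least one epoch is required")
    for epoch in range(1, len(test_errors)):
        if test_errors[epoch] > test_errors[epoch - 1]:
            return epoch - 1
    return len(test_errors) - 1


# Hypothetical error trace: the test error starts to rise after epoch 2.
print(early_stopping_epoch([0.9, 0.5, 0.3, 0.35, 0.2]))
```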

3. Studied properties

The eleven properties of organic solvents are listed in Table 1. The source of the experimental values is given in ref. 7. Compounds with (°) in Table 1 build the evaluation set of compounds while the remaining compounds build the training set used to find out, with a full combinatorial least-squares regression, the best descriptors.

As we already explained in the previous section the ANN-MLP methodology further subdivides the training compounds into training (80%) and test compounds (20%, underlined and bold in Table 1). Concerning the model for dipole moments, all indices were multiplied by a two-valued symmetry indicator variable which is zero for symmetric molecules (with μ = 0 in Table 1) and 1 otherwise. Due to PC limitations the entire space of {MMCI, MCI, Rn, ExpPar} could not be searched for the best descriptor. The search was done in two different ways: (i) search for the best descriptor within the set {MMCI, MCI, Exp.Par.}, i.e., best (MMCI, MCI, Exp.Par.), and, finally, (ii) search for the best descriptor within the set {best (MMCI, MCI, Exp.Par.), Rn}.
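A rough count shows why the search was split into two stages, and how the symmetry indicator acts on the indices of the dipole moment model. The sketch below is illustrative only: the subset size of six is an assumption (the size of the largest relationships in Table 2), and the function names are hypothetical.

```python
from math import comb

# Sizes of the four groups of independent variables given in the text.
N_MMCI, N_MCI, N_EXP, N_RANDOM = 36, 30, 5, 50


def n_subsets(n_variables, subset_size):
    """Number of distinct index subsets a full combinatorial search must fit."""
    return comb(n_variables, subset_size)


def apply_symmetry_indicator(indices, is_symmetric):
    """Multiply all indices by phi: 0 for symmetric molecules (mu = 0), 1 otherwise."""
    phi = 0 if is_symmetric else 1
    return [phi * value for value in indices]


subset_size = 6  # assumed size, for illustration
full_space = n_subsets(N_MMCI + N_MCI + N_EXP + N_RANDOM, subset_size)
stage_one = n_subsets(N_MMCI + N_MCI + N_EXP, subset_size)
print(full_space, stage_one, full_space / stage_one)
print(apply_symmetry_indicator([1.2, 3.4], is_symmetric=True))
```

Leaving the fifty random variables out of the first stage shrinks the number of candidate subsets considerably; the random variables are then tested only against the best descriptor found in that stage.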

4. Results

Table 2 shows the best relationships and their statistical parameters obtained with the stepwise multilinear least-squares (MLS) search procedure. The quality of each training equation is also characterized by the errors (±Δci) of the regression parameters ci (given in vector form in parentheses, without the ± signs). Each training equation (obtained without the (°) compounds of Table 1) has then been applied to model the evaluated points of Table 1 (those with (°)).

Table 3 shows the ANN-MLP results: the 1st column describes the MLP architecture with the abbreviation for the activation functions for the hidden and output layers, the number of epochs, and the training and test errors. The second column shows the best set of indices of the ANN-MLP method, the values of the sensitivity analysis for the indices (2nd line), the statistical parameter for the training [N(Tr)], and for all compounds [N(Tr + nTe + nEV)] (3rd line). Notice that the training (TR) set of Table 2 throughout the ANN-MLP calculations of Table 3 is subdivided into training (Tr) and test (Te) sets.

The reader can notice that viscosity, η, is absent from Table 2, as the MMCI contribute no model equation of better quality than the one given in ref. 5, where MMCI were not considered. For the same reason refractive index, RI, viscosity, η, and surface tension, γ, are absent from Table 3.

5. Discussion

Tables 2 and 3 show that the best regression equations (relationships) are always of semiempirical type, i.e., composed of MCI, MMCI, experimental parameters, and, in two MLS cases (FP, and El in Table 2), of a random variable also. ANN-MLP calculations, instead, show no preference for random variables.

For six properties out of eleven (RI, FP, γ, UV, μ, El) the ANN-MLP methodology does not choose the best descriptive equation of the MLS method. In general, the overall ANN-MLP model quality (training + test + evaluation, Table 3) improves over the MLS model quality (training + evaluation, Table 2).

This confirms our previous findings.5 Let us now compare the MLS results of the present Table 2 with the corresponding results of Table 3 (MLS for −χ × 10$^{6}$ and El) and Table 4 (MLS for all other properties) of ref. 5. On the whole, semiempirical equations with one or more MMCI fare better than the corresponding semiempirical equations with no MMCI. The only exception, as already mentioned, is the viscosity, η, whose semiempirical equation has no use for MMCI. The training semiempirical relationship for the magnetic susceptibility (−χ × 10$^{6}$) with no MMCI (ref. 5) shows a better quality, but it also includes three random variables, while the corresponding present training semiempirical regression with MMCI + MCI has no use for random variables.

Going over to the overall model ability (training plus evaluated properties) of the MLS regressions things change a bit: Tb, FP, γ, and μ, show preference for relationships with MCI only (Table 4 of ref. 5), UV, and −χ × 106 show no interesting improvement with MMCI, while for the ε (dielectric constant) r2 prefers MCI only (ref. 5) but s improves when MMCI are added. Only for the remaining three properties, d, RI, and El there is a clear preference for model equations that include MMCI.

Comparison of the ANN-MLP results of Table 3 with the corresponding results of Table 5 in ref. 5 shows that, more often than not, training semiempirical equations with MMCI + MCI fare better. There are some exceptions: the already cited cases of η, RI, and γ, and that of −χ × 10$^{6}$. Comparison of the overall description (i.e., training plus evaluated points) shows instead improvements with nearly the same exceptions: η, RI, and γ. The overall description of −χ × 10$^{6}$ improves consistently, while for UV and El the model quality is rather similar. All in all, mean molecular connectivity indices (MMCI) are useful both to improve model quality and to get rid of the random variables.

The model plots for the different properties improve relative to the previous ones.5 To give the reader an idea of the improvement, two model plots for the density, d, are shown here. Fig. 1 shows on the left side the plot obtained in ref. 5 with ANN-MLP-{Tb, M, $^{0}\chi^{v}$, $\chi_{t}^{v}$, $^{1}\psi_{Is}$}, and on the right side the plot obtained with the present ANN-MLP-{Tb, M, AMv, $D^{v}$, $^{T}\psi_{I}$}, where AMv is an MMCI.


Fig. 1 Plot for density (d) obtained in our previous ANN study5 (left) and in the present study (right). Full circles: training compounds; crosses: evaluation compounds; empty squares: test compounds.

The asymmetry detected in the residual plots for the evaluated points (more deviations are located on one side of the zero line than on the other) is not as drastic as in ref. 5, but it continues to show up. This detail, which is probably due to the fact that higher-order regressions are needed to model the present properties, does not thwart the predictive character of the present relationships.

Before closing this section let us add some words about the descriptive quality of relationships made either with MMCI or with MCI alone:15 RI, FP, UV, and El are advantageously described with equations made of pure MMCI indices, μ is indifferent to the type of indices, while, Tb, d, ε, η, γ, and MS ( −χ × 106) are advantageously described with pure MCI descriptors. Needless to say the present semiempirical equations perform much better.

6. Conclusion

E. Bright Wilson once remarked (cited in ref. 20): “it is always worthwhile to explore a region which is really new. Unexpected results can generally be relied upon under these circumstances”. Of the two main findings of the present work, one is unexpected while the other is partially unexpected: (i) the new indices proposed here are really useful, and (ii) the ANN methodology gives rise to better estimations than the normal least-squares methods. Even if the latter was already known from ref. 5, the unexpected finding is that the quality of ANN calculations can be improved if they are allowed to choose, with the aid of a combinatorial search algorithm, the best subset of indices. The present ANN computations rely on prior combinatorial least-squares calculations that choose the first, second, and third best subsets of indices. These three subsets are then passed on to the ANN, which chooses the optimal subset of indices; this subset is usually better than, and different from, the very best one chosen with the least-squares method. The message is that coupling ANN with a combinatorial search algorithm could help to improve the modeling relationships. The present paper, even if it is not a study of the details and complexity of ANN, nonetheless suggests how to improve it.

The improvement brought about by the new mean indices is a good hint that strategies to find new descriptors, even if not always successful, are always worth trying.16–18 Thirteen out of thirty-six MMCI show up in our semiempirical equations side-by-side with MCI and experimental parameters. MCI are nearly twice as numerous as the MMCI, and the five experimental parameters are always a good help. Concerning the MMCIs, AMv shows up four times, LME thrice, HM, HMv, GMI, GME, GMv, SME, RMv, HoME, and StME twice, and HoMI and LMv only once. A brief look at these indices shows that they are mainly δv-dependent (the only exception being HM), either directly or through the intrinsic I-state and electrotopological S-state indices. This means not only that they depend on properties of general graphs, but also that both the hydrogen contribution and the complete graph contribution for the core electrons (see Appendix) play an important role in the descriptive quality of the MMCI. Notice that in eqn (13) and (17) the simple and seminal Randić $^{1}\chi$ index1 shows up, underlining how the corresponding more complex $^{1}\chi^{v}$ index2,9 does not cover all the properties.

The index HoME brings about the greatest improvement in the model of a property, the dielectric constant, ε: with no MMCIs we had N(T) = 61, $r^2$ = 0.933, s = 3.9,5 while with HoME we have N(T) = 61, $r^2$ = 0.984, s = 2.5. MMCI help to reduce the importance of random variables, while MMCI together with the ANN-MLP calculations have no use for them.

Concerning the different types of configurations of the MMCI & MCI indices due to the different types of valence delta, δvpo/ppo(n), it is possible to notice that in MLS calculations half of the properties prefer the ppo configuration, while in ANN-MLP calculations five out of eight properties prefer this configuration. Concerning the n values of δvpo/ppo(n) practically all values show up. Thus, general and complete graph characteristics like multiple bonds and core electrons of heteroatoms (here chlorine and bromine), as well as the hydrogen contribution, are an important factor in modeling studies.

Appendix

1. The original means

In the literature19 the following nine definitions of means between numbers a and b can be found.

Arithmetic mean: $\mathrm{AM} = (a + b)/2$ (A1)

Geometric mean: $\mathrm{GM} = (ab)^{1/2}$ (A2)

Harmonic mean: $\mathrm{HM} = 2/(a^{-1} + b^{-1})$ (A3)

Root mean square: $\mathrm{RM} = [(a^{2} + b^{2})/2]^{1/2}$ (A4)

Symmetric mean: $\mathrm{SM} = (a^{2} + b^{2})/(a + b)$ (A5)

Unsymmetrical mean: $\mathrm{UM} = [b - a + (a^{2} - 2ab + 5b^{2})^{0.5}]/2$ (A6)

Hölder mean: $\mathrm{HoM}(p) = (a^{p} + b^{p})^{1/p}/2$ (A7)

Lehmer's mean: $\mathrm{LM}(p) = (a^{p} + b^{p})/(a^{p-1} + b^{p-1})$ (A8)

Stolarsky's mean: $\mathrm{StM}(p) = [(a^{p} - b^{p})/(pa - pb)]^{1/(p-1)}$ (A9)

The reader can notice that Stolarsky's mean has a minus instead of a plus sign in the denominator. The plus sign in eqn (9) was introduced to avoid the undefined value zero/zero that would arise whenever δi = δj (a = b in A9) had we held on to the minus sign.
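The nine definitions can be checked numerically with a short script. The following is only a minimal sketch, not the code used in this study: the function names are hypothetical, the inputs are assumed to be positive numbers, (A6) is read with b − a as its leading term, and the Hölder mean is transcribed exactly as printed in (A7). The Stolarsky function keeps the original minus signs, so it raises an error for a = b, which is the zero/zero case discussed above.

```python
import math


def am(a, b):
    """Arithmetic mean, eqn (A1)."""
    return (a + b) / 2


def gm(a, b):
    """Geometric mean, eqn (A2)."""
    return math.sqrt(a * b)


def hm(a, b):
    """Harmonic mean, eqn (A3)."""
    return 2 / (1 / a + 1 / b)


def rm(a, b):
    """Root mean square, eqn (A4)."""
    return math.sqrt((a**2 + b**2) / 2)


def sm(a, b):
    """Symmetric mean, eqn (A5)."""
    return (a**2 + b**2) / (a + b)


def um(a, b):
    """Unsymmetrical mean, eqn (A6), assuming the leading term is b - a."""
    return (b - a + math.sqrt(a**2 - 2 * a * b + 5 * b**2)) / 2


def hom(a, b, p):
    """Hoelder mean, transcribed as printed in eqn (A7)."""
    return (a**p + b**p) ** (1 / p) / 2


def lm(a, b, p):
    """Lehmer's mean, eqn (A8)."""
    return (a**p + b**p) / (a ** (p - 1) + b ** (p - 1))


def stm(a, b, p):
    """Stolarsky's mean, eqn (A9), with the original minus signs."""
    if a == b or p == 1:
        # a == b gives 0/0; p == 1 gives an undefined exponent 1/(p - 1).
        raise ValueError("Stolarsky's mean (A9) is undefined for a == b or p == 1")
    return ((a**p - b**p) / (p * a - p * b)) ** (1 / (p - 1))


if __name__ == "__main__":
    a, b = 2.0, 8.0  # illustrative values only
    for name, value in [
        ("AM", am(a, b)), ("GM", gm(a, b)), ("HM", hm(a, b)),
        ("RM", rm(a, b)), ("SM", sm(a, b)), ("UM", um(a, b)),
        ("HoM(1)", hom(a, b, 1)), ("LM(2)", lm(a, b, 2)),
        ("StM(2)", stm(a, b, 2)),
    ]:
        print(f"{name:7s} = {value:.4f}")
```

With p = 2 the Lehmer mean coincides with the symmetric mean (A5), and the Stolarsky mean coincides with the arithmetic mean (A1), which gives a quick consistency check on the transcription.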

2. The valence delta

All χ, ψ, Δ, Σ, and TΣ/M indices employed in the present study are defined in Table 2 of ref. 5. Here we will only define some concepts that help to understand Tables 2 and 3. The δv number used throughout the present and previous work5 is defined in the following way.
 
[Equation (A10), the definition of δv: image c4ra06484d-t11.tif, not reproduced here]

δv(ps) is the valence of a vertex in a chemical pseudograph (or general graph), which allows multiple bonds and self-connections (or loops). Normally, in chemical graph theory both simple graphs (with no multiple bonds and loops) and general graphs (or pseudographs) are hydrogen-depleted. Parameter p (= 1, 2, 3, 4, …) is the order of a complete graph, Kp, and r is its regularity (r = p − 1). A complete graph is a graph where every pair of vertices is adjacent. The first-order complete graph, K1, is just a vertex, and it is usually used to encode second-row atoms. Parameter q in eqn (A10) is two-valued: q = 1 or p. Generally, two representations (or configurations) for δv are useful (see Tables 2 and 3): δvpo(n), where q = 1 and p = odd, and δvppo(n), where q = p and p = odd. The number n that appears in the two deltas is the exponent n of fδ in eqn (A10). It quantifies the importance of the hydrogen perturbation: the higher the value of n, the lower the importance of the perturbation. The values of n used here, each generating a different set of indices, are n = −0.5, 0.5, 1, 2, 5, 50. This parameter could be used as a fine-tuning optimization variable, something like (but not quite) Randić's variable chi index,21,22 which was proposed as an alternative way of characterizing heteroatoms in molecules. The fδ fractional hydrogen perturbation parameter, which encodes the depleted hydrogen atoms, is defined in the following way.

 
$f_\delta = 1 - \delta^v(\mathrm{ps})/\delta^v_m(\mathrm{ps}) = n_H/\delta^v_m(\mathrm{ps})$ (A11)
δvm(ps) is the maximal δv(ps) value a heteroatom (a vertex) can have in a hydrogen-depleted chemical pseudograph when all bonded hydrogen atoms are substituted by heteroatoms, and nH equals the number of hydrogen atoms bonded to a heteroatom. For completely substituted heteroatoms, fδ = 0 as δvm(ps) = δv(ps) (i.e., nH = 0). In hydrocarbons δv(ps) = δ, which is the delta number in simple chemical graphs with no multiple bonds and loops. In this case: $\delta^v = (1 + f_\delta^n)\delta$ (for p = 1). For quaternary carbons fδ = 0 and δv = δ.
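As a worked illustration of eqn (A11) and of the hydrocarbon special case (p = 1), the sketch below computes fδ and δv for saturated carbon atoms. It is an assumption-laden example rather than the authors' code: the function names are hypothetical, δvm(ps) = 4 is assumed for a saturated carbon, and the fδ = 0 case is returned directly as δv = δ, as stated in the text.

```python
def hydrogen_perturbation(n_h, delta_v_max):
    """f_delta = n_H / delta_v_max(ps), eqn (A11)."""
    return n_h / delta_v_max


def delta_v_hydrocarbon(delta, n_h, n, delta_v_max=4):
    """delta_v = (1 + f_delta**n) * delta, the p = 1 hydrocarbon case.

    delta        vertex degree in the simple hydrogen-depleted graph
    n_h          number of hydrogen atoms bonded to the carbon
    n            exponent of f_delta (e.g. 0.5, 1, 2, 5, 50)
    delta_v_max  assumed maximal delta_v(ps); 4 for a saturated carbon
    """
    f = hydrogen_perturbation(n_h, delta_v_max)
    if f == 0:
        # Fully substituted (quaternary) carbon: delta_v = delta.
        return float(delta)
    return (1 + f**n) * delta


if __name__ == "__main__":
    # (label, delta, n_H) for the four saturated carbon types.
    carbons = [("CH3", 1, 3), ("CH2", 2, 2), ("CH", 3, 1), ("C", 4, 0)]
    for n in (0.5, 1, 2, 5, 50):
        row = ", ".join(
            f"{label}: {delta_v_hydrocarbon(d, h, n):.4f}" for label, d, h in carbons
        )
        print(f"n = {n:>4}: {row}")
```

Running the loop shows the trend described in the text: as n grows, fδ raised to the power n vanishes and δv approaches δ, so the hydrogen perturbation loses importance.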

3. The intrinsic I-state and electrotopological S-state indices

The I- and E-state indices (in ψE,I: E means electrotopological, and I intrinsic) are related to δv in the following way,10
 
$I = (\delta^v + 1)/\delta$, $S = I + \Sigma\Delta I$, with $\Delta I = (I_i - I_j)/r_{ij}^2$ (A12)
rij counts the atoms in the minimum path length separating atoms i and j, which equals the graph distance, dij + 1; ΣΔI incorporates the information about the influence of the remainder of the molecular environment.
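Eqn (A12) can be illustrated on a small hydrogen-depleted graph. The sketch below is a hypothetical example, not the program used in this work: it takes a three-atom chain C–C–O (an ethanol-like skeleton) and, as an assumption, uses illustrative δ and δv input values (1, 2, 1 and 1, 2, 5) rather than values derived from eqn (A10). Graph distances dij are obtained by breadth-first search and rij = dij + 1.

```python
from collections import deque


def graph_distances(adjacency, start):
    """Breadth-first search: graph distance from `start` to every vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def e_state(adjacency, delta, delta_v):
    """Return (I, S) per atom following eqn (A12).

    I_i = (delta_v_i + 1) / delta_i
    S_i = I_i + sum_j (I_i - I_j) / r_ij**2, with r_ij = d_ij + 1
    """
    n_atoms = len(adjacency)
    intrinsic = [(delta_v[i] + 1) / delta[i] for i in range(n_atoms)]
    s_values = []
    for i in range(n_atoms):
        dist = graph_distances(adjacency, i)
        perturbation = sum(
            (intrinsic[i] - intrinsic[j]) / (dist[j] + 1) ** 2
            for j in range(n_atoms)
            if j != i
        )
        s_values.append(intrinsic[i] + perturbation)
    return intrinsic, s_values


# Illustrative inputs (assumed): chain C(0)-C(1)-O(2).
adjacency = [[1], [0, 2], [1]]
delta = [1, 2, 1]    # simple-graph vertex degrees
delta_v = [1, 2, 5]  # assumed valence deltas for CH3, CH2, OH

if __name__ == "__main__":
    intrinsic, s_values = e_state(adjacency, delta, delta_v)
    for i, (i_val, s_val) in enumerate(zip(intrinsic, s_values)):
        print(f"atom {i}: I = {i_val:.3f}, S = {s_val:.3f}")
```

Because each ΔI term appears once with each sign, the S values of a molecule sum to the same total as the I values; the perturbation only redistributes the intrinsic values among the atoms.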

References

  1. M. Randić, J. Am. Chem. Soc., 1975, 97, 6609–6615.
  2. L. B. Kier, L. H. Hall, W. J. Murray and M. Randić, J. Pharm. Sci., 1975, 64, 1971–1974.
  3. R. Todeschini and V. Consonni, Molecular Descriptors for Chemoinformatics, 2nd edn, Wiley-VCH, Weinheim, 2000.
  4. Topological Indices and Related Descriptors in QSAR and QSPR, ed. J. Devillers and A. T. Balaban, Gordon and Breach, UK, 1999.
  5. L. Pogliani and J. V. de Julián-Ortiz, RSC Adv., 2013, 3, 14710–14721.
  6. R. García-Domenech, J. Gálvez, J. V. de Julián-Ortiz and L. Pogliani, Chem. Rev., 2008, 108, 1127–1169.
  7. L. Pogliani, J. Comput. Chem., 2010, 31, 295–307.
  8. L. B. Kier and L. H. Hall, J. Pharm. Sci., 1981, 70, 583–589.
  9. L. B. Kier and L. H. Hall, Molecular Connectivity in Structure–Activity Analysis, Wiley, NY, 1986.
  10. L. B. Kier and L. H. Hall, Molecular Structure Description. The Electrotopological State, New York, Academic Press, 1999.
  11. J. G. Topliss and R. J. Costello, J. Med. Chem., 1972, 15, 1066–1069.
  12. E. Besalu, J. V. de Julian-Ortiz and L. Pogliani, MATCH Commun. Math. Comput. Chem., 2006, 55, 281–286.
  13. J. Zupan and J. Gasteiger, Neural Networks in Chemistry and Drug Design: An Introduction, 2nd edn, Wiley-VCH, Weinheim, 1999.
  14. E. Castillo, B. Guijarro-Berdiñas, O. Fontenla-Romero and A. Alonso-Betanzos, J. Mach. Learn. Res., 2006, 7, 1159–1182.
  15. L. Pogliani and J. V. de Julián-Ortiz, Int. J. Chem. Model., 2014, 6, in press.
  16. R. García-Domenech, J. V. de Julián-Ortiz, M. J. Duart, J. M. García-Torrecillas, G. M. Antón-Fos, I. Ríos-Santamarina, C. de Gregorio Alapont and J. Gálvez, SAR QSAR Environ. Res., 2001, 12, 237–254.
  17. M. J. Duart, G. M. Antón-Fos, J. V. de Julián-Ortiz, R. Gozalbes, J. Gálvez and R. García-Domenech, Int. J. Pharm., 2002, 246, 111–119.
  18. L. Pogliani, J. V. Julian-Ortiz and E. Besalu, Int. J. Chem. Model., 2013, 5, 295–302.
  19. Wolfram MathWorld: http://mathworld.wolfram.com/.
  20. M. Randić, Chem. Rev., 2003, 103, 3449–3605, 3470.
  21. M. Randić, Chemom. Intell. Lab. Syst., 1991, 10, 213–227.
  22. M. Randić, J. Mol. Graphics Modell., 2001, 20, 19–35.

This journal is © The Royal Society of Chemistry 2014