Jelena Antanasijević,*a Davor Antanasijević,b Viktor Pocajt,a Nemanja Trišović a and Katalin Fodor-Csorba c

a University of Belgrade, Faculty of Technology and Metallurgy, Karnegijeva 4, 11120 Belgrade, Serbia. E-mail: jantanasijevic@tmf.bg.ac.rs
b University of Belgrade, Innovation Center of the Faculty of Technology and Metallurgy, Karnegijeva 4, 11120 Belgrade, Serbia
c Wigner Research Centre for Physics, Institute for Solid State Physics and Optics of the Hungarian Academy of Sciences, P.O. Box 49, H-1525 Budapest, Hungary
First published on 9th February 2016
Accelerating progress in the discovery of new bent-core liquid crystal (LC) materials with enhanced features relies on an understanding of the structure–property relationships that underlie the formation of LC phases. The aim of this study was to develop a model for the prediction of the LC behaviour of five-ring bent-core systems using a QSPR approach that combines dimension reduction techniques (e.g. genetic algorithms) for the selection of molecular descriptors with decision trees, multivariate adaptive regression splines (MARS) and artificial neural networks (ANN) as classification methods. A total of 27 models based on separate pools of calculated molecular descriptors (2D; 2D and 3D) and published experimental outcomes were evaluated. Overall, the results suggest that the acquired ANN LC classifiers are usable for the prediction of LC behaviour. The best of these models showed high accuracy and precision (91% and 97%, respectively). Since the best classifier successfully captures trends in a homologous series, it can be used not only to screen new bent-core structures for potential LCs, but also to estimate the influence of structural modifications on LC phase formation and to evaluate LC phase stability.
Extensive efforts have been made to determine the relationship between the mesomorphic properties of bent-core liquid crystals and their molecular structure.6 Although some general understanding of the matter has been established,7–10 designing LC molecular structures with favourable properties remains a great challenge for chemists, particularly because those molecules need to exhibit LC behaviour at lower temperatures. It should also be noted that the mesophase behaviour of bent-core compounds is more sensitive to structural modifications than that of calamitic ones, with respect to the number of rings, the type and orientation of the connecting groups, the substituents on the central and outer rings, and the length of the terminal chains.7 In addition, the synthesis of bent-core LCs is often complex, expensive and time-consuming; the use of statistical classification techniques may therefore help to reduce the fraction of synthesized bent-core molecules that do not exhibit LC properties.
Although there are various studies related to the prediction of particular LC properties,11–15 only a limited number of quantitative structure–property relationship (QSPR) models for the prediction of liquid crystallinity can be found in the literature.16–19 In those papers, the LC behaviour of ferrocene derivatives, copolyethers, polyazomethines and calamitic compounds was predicted using different statistical methods, mainly artificial neural networks (ANNs).
Because of the complex relationship between the bent-core structure and its LC behaviour, the use of nonlinear classifiers is required to achieve an accurate prediction. In this study, decision trees (DT), multivariate adaptive regression splines (MARS) and three different ANN architectures, namely the supervised Kohonen network (SKNN), counter-propagation neural network (CPNN) and probabilistic neural network (PNN), were applied for the prediction of the LC behaviour of five-ring systems (Fig. 1). In addition, feature selection (FSL), genetic algorithms (GA) and principal component analysis (PCA) were employed as dimension reduction methods for the selection of descriptors.
The selection of proper molecular descriptors is a difficult and target-dependent task that can be handled using dimension reduction methods.29 Therefore, after the initial descriptor removal, a further dimension reduction was carried out using feature selection, genetic algorithms and principal component analysis. It should be noted that this dimension reduction step is necessary only in the case of the ANN models; decision trees and MARS are capable of selecting the most important descriptors during model training, and the subsets of descriptors selected by DT and MARS were therefore also used for the development of ANN models. The Statistica30 feature selection routine was utilized to select the 20 most significant descriptors, based on the chi-square statistic and p-value (significance) computed for each descriptor. PCA was also performed in Statistica30 by extracting the principal components (PCs) with eigenvalues higher than 1. Genetic algorithm descriptor selection was performed using the NeuroShell 2 Genetic Adaptive module,31 applying the input smoothing factors (ISFs) (see section Artificial neural networks) as a sensitivity tool. After this ISF sensitivity analysis, the minimum number of incorrect classifications (MNIC) from the PNN training was used as the measure of subset quality; subsets with fewer descriptors but equivalent MNIC values were favoured in the process of reduction.
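As an illustration, the FSL and PCA reduction routes can be reproduced with open-source tools. The sketch below uses scikit-learn in place of Statistica; the matrix X and labels y are random placeholders, not the study's dataset.

```python
# Sketch of the FSL (chi-square ranking) and PCA (eigenvalue > 1) steps
# with scikit-learn in place of Statistica; X and y are placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.random((250, 360))            # compounds x descriptors (illustrative)
y = rng.integers(0, 2, 250)           # 1 = LC, 0 = NLC

# Feature selection: keep the 20 descriptors with the largest chi-square
# statistic (chi2 requires non-negative inputs, hence the min-max scaling).
X_pos = MinMaxScaler().fit_transform(X)
fsl = SelectKBest(chi2, k=20).fit(X_pos, y)
X_fsl = fsl.transform(X_pos)

# PCA: retain only the principal components whose eigenvalue exceeds 1
# (Kaiser criterion), computed on standardized descriptors.
X_std = StandardScaler().fit_transform(X)
n_pc = int((PCA().fit(X_std).explained_variance_ > 1.0).sum())
X_pca = PCA(n_components=n_pc).fit_transform(X_std)

print(X_fsl.shape, X_pca.shape)       # e.g. (250, 20) (250, n_pc)
```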
The considered pools of descriptors, the applied dimension reduction and classification techniques, and all 27 models developed in the present study are presented schematically in Fig. 2.
Several DT algorithms have been used in practice: Chi-Squared Automatic Interaction Detector (CHAID), Quick, Unbiased, Efficient Statistical Trees (QUEST), and Classification and Regression Trees (CART). Among them, CART, a non-parametric binary recursive tree structure developed by Breiman et al.,34 was adopted for this study. CART is an efficient tree induction method for large data sets, and it has been used both as a classification35 and as a feature selection method.36 The DT was built by splitting the root node into two child nodes, which were then split repeatedly until the terminal nodes were reached. Each split was evaluated using the Gini measure as an impurity function.37 To avoid overfitting, the obtained DT was pruned at the end of the categorization process. The pruning procedure generates a sequence of smaller trees, based on the cost-complexity parameter, and selects the DT with the highest accuracy.36
In this study, the CART implementation in Statistica (C&RT module) was used for DT model development. A 5-fold cross-validation was performed in order to obtain a stable tree with the smallest overall misclassification rate.
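A minimal scikit-learn analogue of this C&RT step might look as follows; the Gini criterion, cost-complexity pruning and 5-fold cross-validation mirror the description above, while the data are placeholders.

```python
# CART sketch: Gini splits, cost-complexity pruning, and 5-fold CV to pick
# the pruning strength; X and y stand in for the descriptors and labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((250, 20))
y = rng.integers(0, 2, 250)

# Candidate cost-complexity parameters from the pruning path of a full tree.
path = DecisionTreeClassifier(criterion="gini",
                              random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_acc = 0.0, -np.inf
for alpha in np.unique(path.ccp_alphas):
    tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha,
                                  random_state=0)
    acc = cross_val_score(tree, X, y, cv=5).mean()   # 5-fold CV accuracy
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

pruned_tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=best_alpha,
                                     random_state=0).fit(X, y)
```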
MARS approximates the modelled response using pairs of one-sided truncated (hinge) basis functions defined around a knot t:

$$(x - t)_+ = \max(0,\, x - t) \qquad (1)$$

$$(t - x)_+ = \max(0,\, t - x) \qquad (2)$$
A knot marks the border of a data region where the behaviour of the function changes significantly, and it delimits a pair of basis functions; contiguous plane surfaces are thus built by summing up the basis functions (Bm) with suitable coefficients (am):40
$$f(\mathbf{x}) = a_0 + \sum_{m=1}^{M} a_m B_m(\mathbf{x}) \qquad (3)$$
For interaction terms, each basis function is a product of hinge functions over Km descriptors:

$$B_m(\mathbf{x}) = \prod_{k=1}^{K_m} \max\bigl(0,\, s_{km}\,(x_{v(k,m)} - t_{km})\bigr) \qquad (4)$$

where s_km = ±1 and v(k,m) indexes the descriptor used in the k-th factor.
The search for the best MARS model comprises forward selection and backward elimination. In the forward stepwise procedure, pairs of basis functions are added as long as they improve the performance of the MARS model; such a model is often overly complex and overfitted. During backward elimination, the model is pruned by removing redundant basis functions using generalized cross-validation (GCV).41 The GCV is the mean squared residual error divided by a penalty that depends on the model complexity.42 Further details on MARS can be found elsewhere.32,38
For classification purposes, MARS can be implemented in two ways: (1) pairwise classification, with the output coded as 0 or 1, is handled as a regression problem, and (2) the classification of more than two classes needs to be performed using a hybrid of MARS called POLYMARS.43
In this study, the first technique was adopted and the MARS models were produced in Statistica using the MARSpline routine. The developed MARS models had a maximum of 40 basis functions, allowed backward pruning, and used a GCV penalty of 2. To determine the optimal order of interaction of the spline basis functions, models with the order of interaction restricted to 2, 3 and 4 were compared. A lower-order model with accuracy similar to that of a higher-order one was favoured, as suggested by Zhang and Singer.44
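The pairwise-classification scheme can be sketched with the open-source py-earth package as a stand-in for Statistica's MARSpline, so the settings only approximate those described above; a regression is fitted to the 0/1 class label and the prediction is thresholded at 0.5. The data are illustrative.

```python
# Pairwise classification with MARS via py-earth (a MARSpline substitute):
# regress on the 0/1 label and threshold at 0.5.
import numpy as np
from pyearth import Earth

rng = np.random.default_rng(0)
X = rng.random((250, 50))
y = rng.integers(0, 2, 250).astype(float)   # 1 = LC, 0 = NLC

models = {}
for degree in (2, 3, 4):                    # compared orders of interaction
    mars = Earth(max_terms=40, max_degree=degree, penalty=2.0)
    models[degree] = mars.fit(X, y)

# Favour the lowest-order model with comparable accuracy (Zhang and Singer).
pred = (models[2].predict(X) > 0.5).astype(int)
```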
In this work, the SKNN and CPNN models were created using the Kohonen and CP-ANN MATLAB toolbox 3.6 (ref. 49) released by the Milano Chemometrics and QSAR Research Group, while the PNN models were created using the NeuroShell 2 software.31
SKNN is based on the self-organizing map (SOM) learning algorithm developed by Kohonen.50,51 The SOM is a single-layered network; this (Kohonen) layer is often visualized as a square or hexagonal toroidal space consisting of a grid of N² neurons, where N is the number of neurons on each side. Each neuron contains as many weights as there are inputs. The weights of each neuron are updated on the basis of the input vectors for a certain number of passes (epochs). Both N and the number of epochs must be defined by the user.52
SKNN consists of an input and an output map, which are 'glued' together to form a combined input–output map that is updated according to the SOM training procedure. After training, the input and output maps are decoupled. The topological formation of the combined input–output map is performed in a supervised way, since both the input and output values are used explicitly during SKNN training. The class of a new sample is predicted by locating the winning neuron in the input map and then reading the class of the corresponding neuron in the output map; the maximum value of this neuron's weight vector determines the actual class membership.53
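A compact NumPy sketch of this "glue and decouple" idea is given below. It is a simplification (the neighbourhood function and toroidal topology of a full SOM are omitted) and is not the Milano toolbox implementation.

```python
# Supervised Kohonen sketch: train on concatenated input-output vectors,
# then decouple the maps for prediction. Simplified; no neighbourhood update.
import numpy as np

def train_sknn(X, Y, n=12, epochs=350, lr=0.5, seed=0):
    """X: (samples, p) inputs; Y: (samples, classes) one-hot class labels."""
    rng = np.random.default_rng(seed)
    XY = np.hstack([X, Y])                        # combined input-output vectors
    W = rng.random((n * n, XY.shape[1]))          # n x n grid, flattened
    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)       # decaying learning rate
        for v in XY[rng.permutation(len(XY))]:
            bmu = np.argmin(((W - v) ** 2).sum(axis=1))   # winning neuron
            W[bmu] += alpha * (v - W[bmu])
    return W[:, :X.shape[1]], W[:, X.shape[1]:]   # decoupled input/output maps

def predict_sknn(W_in, W_out, x):
    bmu = np.argmin(((W_in - x) ** 2).sum(axis=1))        # locate in input map
    return int(np.argmax(W_out[bmu]))             # class with the largest weight
```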
CPNN can also be considered an extension of SOMs, but it combines characteristics of both supervised and unsupervised learning. The theoretical concept of the CPNN was founded by Hecht-Nielsen.54 CPNN consists of two layers: an input layer (the Kohonen layer), which performs the mapping of the input data, and an output layer (the Grossberg layer) that serves as a "pointing device"55 and whose neurons have as many weights as there are classes to be learned. In contrast to the learning in the Kohonen layer, the correct response is needed for the correction of the weights in the Grossberg layer, so this learning is performed in a supervised manner.56 At the end of CPNN training, each neuron of the Kohonen layer can be assigned to a class on the basis of the output weights, and all samples placed in that neuron are automatically assigned to the corresponding class.57 The class of a new sample is estimated following the same procedure as for SKNN.
The overfitting of both SKNN and CPNN is prevented by optimizing the architecture, in terms of the number of neurons in the output layer and the number of epochs, using genetic algorithms, as described by Ballabio et al.52 For this purpose, the following GA fitness function is used:
$$F = \mathrm{acc}_v\,(1 - |\mathrm{acc}_v - \mathrm{acc}_t|) \qquad (5)$$

where acc_v and acc_t are the classification accuracies obtained on the validation and training sets, respectively.
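In code, eqn (5) is a one-liner; acc_v and acc_t are the validation and training accuracies passed in by the GA driver.

```python
def ga_fitness(acc_v, acc_t):
    """Eqn (5): reward high validation accuracy while penalizing a large
    train-validation gap, i.e. overfitted map sizes or epoch counts."""
    return acc_v * (1.0 - abs(acc_v - acc_t))
```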
PNN, introduced by Specht,58 is a one-pass, feed-forward, supervised learning neural network consisting of four layers: input, pattern, summation and decision layers. PNN approximates the Bayes classifier, with the class-conditional probabilities estimated using Parzen's window approach.59 In a binary classification problem, PNN predicts the class of a sample using the Bayes decision rule:
$$h_k c_k f_k(\mathbf{x}) > h_m c_m f_m(\mathbf{x}) \qquad (6)$$

where h is the prior probability of a class, c is the cost of its misclassification and f(x) is its class-conditional probability density function (PDF).
The PDF of class k is estimated from its n_k learning samples using a Parzen window W with width σ:

$$f_k(\mathbf{x}) = \frac{1}{n_k\,\sigma^{p}} \sum_{i=1}^{n_k} W\!\left(\frac{\mathbf{x} - \mathbf{x}_{ki}}{\sigma}\right) \qquad (7)$$

which, with the Gaussian kernel used in PNN, becomes:

$$f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}\,n_k} \sum_{i=1}^{n_k} \exp\!\left[-\frac{(\mathbf{x} - \mathbf{x}_{ki})^{\mathrm{T}}(\mathbf{x} - \mathbf{x}_{ki})}{2\sigma^{2}}\right] \qquad (8)$$
In eqn (8), p is the number of inputs, while σ is the so-called smoothing factor, the only adjustable parameter that needs to be optimized during PNN training; σ represents the width of the Gaussian curve calculated for each PDF. One of the major issues associated with PNN is the selection of the optimal smoothing factor.62 In this study, genetic algorithms were used to search for the optimal σ. When a GA is used, besides the overall σ, so-called individual smoothing factors (ISFs) are also calculated for each input. An ISF quantifies the importance of a given input to the model, and the ISFs were therefore used as a sensitivity tool.
During PNN training, the learning dataset is used to set the network weights, while the validation data are used to determine the optimal smoothing factor and the corresponding ISFs.
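The following NumPy sketch shows a Gaussian-kernel PNN in the sense of eqns (6)–(8), assuming equal priors and misclassification costs and a single overall smoothing factor (the study additionally optimizes σ and the per-input ISFs with a GA).

```python
# Minimal Gaussian-kernel PNN: pattern layer stores the learning set, the
# summation layer averages the Parzen kernels per class, and the decision
# layer applies eqn (6) with equal priors h and costs c (assumed here).
import numpy as np

class PNN:
    def __init__(self, sigma=0.3):
        self.sigma = sigma                        # smoothing factor

    def fit(self, X, y):
        self.patterns = {k: X[y == k] for k in np.unique(y)}  # pattern layer
        return self

    def _pdf(self, Xk, x):                        # eqn (8): class PDF estimate
        p = Xk.shape[1]
        d2 = ((Xk - x) ** 2).sum(axis=1)          # squared distances to patterns
        norm = (2.0 * np.pi) ** (p / 2.0) * self.sigma ** p
        return np.exp(-d2 / (2.0 * self.sigma ** 2)).mean() / norm

    def predict(self, x):                         # eqn (6) with h_k c_k = h_m c_m
        f = {k: self._pdf(Xk, x) for k, Xk in self.patterns.items()}
        return max(f, key=f.get)

rng = np.random.default_rng(0)
pnn = PNN(sigma=0.3).fit(rng.random((200, 12)), rng.integers(0, 2, 200))
print(pnn.predict(rng.random(12)))
```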
The performance of the classifiers was evaluated using the accuracy (acc), precision (Pr), recall (r) and geometric mean (Gmean), calculated from the elements of the confusion table given below:

$$\mathrm{acc} = \frac{a + d}{a + b + c + d} \qquad (9)$$

$$\mathrm{Pr} = \frac{a}{a + c} \qquad (10)$$

$$r = \frac{a}{a + b} \qquad (11)$$

$$G_{\mathrm{mean}} = \sqrt{\frac{a}{a + b} \cdot \frac{d}{c + d}} \qquad (12)$$
Actual class | Predicted LC | Predicted NLC
---|---|---
LC | a | b
NLC | c | d
Accuracy gives the percentage of LCs and NLCs correctly classified, while precision gives the percentage of correctly classified LCs among all compounds classified as LCs. Recall represents the probability of identifying compounds that exhibit LC phases. In addition to acc, Gmean is used, since acc can be misleading when the classes are unequally represented in the training set. Gmean has two distinctive properties: it is independent of the distribution of examples between classes, and it is nonlinear. The latter means that the "cost" of misclassifying each positive example increases the more often positive examples are misclassified.63
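These four metrics follow directly from the confusion table; as a check, the sketch below reproduces the DT prediction-set values reported later in Table 2 (a = 33, b = 4, c = 5, d = 2 gives 80%, 87%, 89% and 50%).

```python
def classification_metrics(a, b, c, d):
    """Eqns (9)-(12) from the confusion table above: a = LCs predicted LC,
    b = LCs predicted NLC, c = NLCs predicted LC, d = NLCs predicted NLC."""
    acc = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    g_mean = (recall * d / (c + d)) ** 0.5        # sqrt(sensitivity x specificity)
    return acc, precision, recall, g_mean

# Reproduces the DT prediction-set row of Table 2:
print([round(100 * m) for m in classification_metrics(33, 4, 5, 2)])  # [80, 87, 89, 50]
```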
The root node was split using the JGI9 descriptor, which is related to the charge transfer between pairs of atoms and therefore indicates charge transfer over the molecule. The 30 compounds with higher JGI9 values were further split by the SM1_Dzp descriptor into two terminal nodes: the first with an LC ratio of 72% and the second containing only NLC compounds. The SM1_Dzp descriptor is the molecular spectral moment of order 1 of the Barysz matrix weighted by polarizability. The Barysz matrix is a symmetric weighted distance matrix that accounts simultaneously for the presence of heteroatoms and multiple bonds in a molecule; its spectral moment of order 1 equals the sum of the matrix eigenvalues.23
The remaining 220 compounds were split further using MPC9 and then the MLFER_E descriptor, which resulted in one LC and two NLC terminal nodes. In this way, the root node, containing 82% LCs, was "purified" to an LC ratio of 93%. MPC9 is a molecular path count of order 9, a topological descriptor that counts the total number of paths of length m (here 9) in the molecule; the length m of a path is the number of edges along it and is called the path order.23 MLFER_E, which quantifies the excess molar refraction, is one of the molecular linear free energy relation (MLFER) descriptors. The excess molar refraction represents polarizability contributions from n- and π-electrons and can be calculated from the refractive index and the characteristic molecular volume.64
The pairwise correlation coefficients among these descriptors have an average value of 0.25, a minimum of 0.05 (between JGI9 and MPC9) and a maximum of 0.47 (between SM1_Dzp and MLFER_E).
The classification results obtained for the prediction set are presented in Table 2. The compounds in the prediction dataset were classified with an accuracy of 80% and a precision of 87%, while the Gmean had a value of only 50%. The low Gmean indicates significant misclassification of NLC compounds by the DT model, which is evident from the confusion table (Table 2).
Des. type | acc (%) | Pr (%) | r (%) | Gmean (%) | Actual class | Predicted LC | Predicted NLC
---|---|---|---|---|---|---|---
2D | 80 | 87 | 89 | 50 | LC | 33 | 4
 | | | | | NLC | 5 | 2
The basis functions for each of the two MARS models are listed in Table 3, while the corresponding coefficients are presented in the ESI (eqn (S1) and (S2)†). The values of the performance metrics and the corresponding confusion tables for both MARS models are shown in Table 4. The 2&3D-MARS model has higher accuracy and an excellent recall of 95%; it was thus able to predict almost all LCs in the prediction set. Both MARS models have low Gmean values, similar to the DT model, because of the substantial misclassification of NLC compounds.
Basis function | 2D-MARS model | 2&3D-MARS model
---|---|---
B 1 | max(0; 6.10 × 10−1 − AVP-0) | max(0; AVP-0 − 6.10 × 10−1) |
B 2 | max(0; 2.09 × 102 − MPC10) | max(0; 6.10 × 10−1 − AVP-0) |
B 3 | max(0; MPC10 − 2.09 × 102) | max(0; MPC10 − 2.09 × 102) |
B 4 | max(0; WPOL − 9.50 × 101) | max(0; WPOL − 9.50 × 101) |
B 5 | max(0; MDEC-12 − 4.39) | max(0; E3s − 3.20 × 10−1) |
B 6 | max(0; 9.50 × 101 − WPOL) | max(0; 3.20 × 10−1 − E3s) |
B 7 | max(0; 4.39 − MDEC-12) | max(0; 3.75 × 10−1 − RPCS) |
B 8 | max(0; BCUTp-1h − 9.46) | max(0; ETA_EtaP_F − 1.10) |
B 9 | max(0; 9.46 − BCUTp-1h) | max(0; 1.10 − ETA_EtaP_F) |
B 10 | max(0; SCH-7 − 6.60 × 10−1) | max(0; 3.84 − IC2) |
B 11 | max(0; 6.60 × 10−1 − SCH-7) | max(0; IC2 − 3.84) |
B 12 | max(0; 2.11 − VP-6) | max(0; MLFER_BO − 1.63) |
B 13 | max(0; MIC5 − 4.29 × 101) | max(0; geomDiameter − 4.49 × 101) |
B 14 | max(0; 4.29 × 101 − MIC5) | max(0; MLFER_L − 2.53 × 101) |
B 15 | max(0; VE1_Dt − 3.58 × 10−2) | max(0; WNSA-2 + 2.05 × 103) |
B 16 | max(0; MLFER_S − 3.66) | max(0; −2.05 × 103 − WNSA-2) |
B 17 | max(0; 3.66 − MLFER_S) | max(0; VP-4 − 5.90) |
B 18 | max(0; SpAbs_Dzp − 1.75 × 103) | max(0; 2.53 × 101 − MLFER_L) |
B 19 | max(0; 1.75 × 103 − SpAbs_Dzp) | max(0; 5.90 − VP-4) |
B 20 | max(0; VE3_Dzs + 4.72 × 101) | max(0; VE3_Dzv + 8.84) |
B 21 | max(0; AVP-0 − 6.10 × 10−1) | max(0; −8.84 − VE3_Dzv) |
B 22 | max(0; VE3_Dzv + 8.84) | max(0; P2m − 3.23 × 10−1) |
B 23 | max(0; −8.84 − VE3_Dzv) | max(0; VE3_Dt + 1.62 × 101) |
B 24 | max(0; VE2_Dt − 2.81 × 10−4) | max(0; 1.47 × 102 − ETA_Eta_R) |
B 25 | max(0; 2.81 × 10−4 − VE2_Dt) | max(0; 8.75 × 10−2 − VC-5) |
B 26 | max(0; 7.32 − MDEC-12) | max(0; −3.57 × 101 − VE3_Dze) |
B 27 | max(0; VPC-4 − 1.52) | N/A |
Des. type | acc (%) | Pr (%) | r (%) | Gmean (%) | Actual class | Predicted LC | Predicted NLC
---|---|---|---|---|---|---|---
2D | 80 | 89 | 86 | 61 | LC | 32 | 5
 | | | | | NLC | 4 | 3
2&3D | 84 | 88 | 95 | 52 | LC | 35 | 2
 | | | | | NLC | 5 | 2
A list of the descriptors used in the MARS models, with short descriptions, is presented in the ESI (Table S2†). The pairwise correlation coefficients among the descriptors of the 2D-MARS model have an average value of 0.36, while the average for the descriptors of the 2&3D-MARS model is 0.34.
Among the selected 2D descriptors, MPC was described in the previous section (Prediction of LCs using decision tree). Three new MLFER descriptors are used in the MARS models: MLFER_S quantifies dipolarity/polarizability, MLFER_L is the solute gas–hexadecane partition coefficient and MLFER_BO represents the overall solute hydrogen bond basicity. The descriptors labelled VP-4, VP-6, AVP-0, SCH-7, VPC-4 and VC-5 are topological descriptors that carry information about the connectivity of the atoms in a molecule; they are referred to as connectivity indices and are calculated using the Chi operator. These descriptors can take into account the presence of heteroatoms in a molecule, as well as double and triple bonds, molecular size, degree of branching and flexibility.
The next group comprises descriptors calculated from the Barysz matrix, namely the logarithmic coefficient sums of the last eigenvector weighted by van der Waals volumes (VE3_Dzv), by I-state (VE3_Dzs) or by Sanderson electronegativities (VE3_Dze), and the graph energy weighted by polarizabilities (SpAbs_Dzp). Descriptors determined from the detour matrix (also known as the matrix of maximal topological distances), i.e. the coefficient sum of the last eigenvector (VE1_Dt) and its average (VE2_Dt) and logarithmic (VE3_Dt) values, were also used in the models.
The Wiener polarity number65 (WPOL) is equal to the number of bonds around which free rotations can take place. Moreover, it is related to the flexibility and steric properties of molecules.
Information content (IC) descriptors are based on the calculation of equivalence classes from the molecular graph. Among them, the IC indices of neighbourhood symmetry also take into account neighbour degree and edge multiplicity. The Modified Information Content index (MIC) is the IC index weighted by the corresponding atomic masses of all atoms in the molecule. The MDEC-12 descriptor counts the molecular distance edges between all primary and secondary carbons.
BCUTs (Burden–CAS–University of Texas eigenvalues) are extensions of the Burden descriptors, which are based on a combination of the atomic numbers of each atom and a description of the nominal bond types for adjacent and non-adjacent atoms. The BCUT descriptors expand the number and types of atomic features that can be considered and also provide a greater variety of proximity measures and weighting schemes. The result is a whole-molecule descriptor that has proved useful for measuring molecular diversity and related tasks.66
The last two selected 2D descriptors (ETA_Eta_R and ETA_EtaP_F) belong to the group of extended topochemical atom (ETA) indices. ETA_Eta_R is a composite index that considers both bonded and non-bonded interactions and describes the overall topological environment of a molecule relative to its size. ETA_EtaP_F is a functionality index that accounts for the presence of heteroatoms and multiple bonds.67
As mentioned above, five descriptors derived from the 3D molecular geometry were chosen by the 2&3D-MARS model. One of them is the geometric diameter (geomDiameter), defined as the maximum geometric eccentricity in a molecule; it represents the longest geometric distance between two atoms in the molecule. Two others (WNSA-2 and RPCS) are charged partial surface area descriptors that combine shape and electronic information to characterize molecules and therefore encode features responsible for polar interactions between molecules; WNSA-2 is related to the negatively charged surface area, while RPCS is related to the positively charged one. Finally, the descriptors labelled E3s and P2m are WHIM (Weighted Holistic Invariant Molecular) descriptors, which provide relevant 3D information regarding molecular size, shape, symmetry and atom distribution with respect to invariant reference frames.
More details on the descriptors briefly presented in this and the next section are available in the literature.23
A list of the descriptors selected by FSL and GA is presented in Table 5, and short descriptions are provided in the ESI (Table S3†). Using PCA, 17 PCs were extracted from the pool of 2D descriptors and 24 PCs from the pool of 2&3D descriptors, both with a cumulative variance of 98%. The eigenvalue of each PC, along with the corresponding variance, is presented in Fig. 4.
Descriptor group | 2D-FSL | 2D-GA | 2&3D-GA
---|---|---|---
Barysz matrix | SM1_Dzi; SM1_Dzs | VR2_Dzs | SM1_Dzi; SM1_Dzp; EE_Dzi; EE_Dzm; EE_Dzv |
BCUT | BCUTp-1h | BCUTp-1h | BCUTc-1h |
Carbon types | C1SP3 | C3SP2 | |
Chi chain | SCH-6; SCH-7; VCH-6; VCH-7 | ||
Chi cluster | VC-5 | VC-3 | |
Chi path cluster | VPC-4; VPC-5 | VPC-6 | |
Path count | piPC5; TpiPC | piPC7; TpiPC | MPC8 |
Extended topochemical atom | ETA_dAlpha_B; ETA_dPsi_A | ETA_dAlpha_B; ETA_BetaP | ETA_Beta; ETA_Beta_ns_d |
Molecular distance edge | MDEC-13 | ||
Molecular linear free energy relation | MLFER_E | ||
Topological distance matrix | VE3_D | ||
Topological charge | GGI6; GGI9; JGI9 | ||
Information content | SIC3 | ||
Constitutional descriptor | Mare | ||
WHIM a | | | Dp; L1s
a 3D descriptors.
The FSL produced the same set of 20 2D descriptors for both considered pools of descriptors. The application of GA yielded a set of 11 2D descriptors selected from the corresponding 2D pool, and a set of 10 2D and two 3D descriptors chosen from the 2&3D pool. The pairwise correlations among the descriptors selected by FSL have an average value of 0.44, while the average pairwise correlation coefficients among the descriptors selected by GA from the 2D and 2&3D pools are 0.26 and 0.28, respectively.
The relatively high average pairwise correlation among the descriptors selected by FSL is consistent with the fact that FSL measures the significance of single descriptors, one by one, and, in contrast to GA, does not select the "best" combination of descriptors.
Only descriptor types that have not been mentioned previously are described in this section. Carbon-type descriptors characterize carbon connectivity in terms of hybridization: C1SP3 is the number of singly bound carbons bound to one other carbon, while C3SP2 is the number of doubly bound carbons bound to three other carbons. The descriptors labelled piPC5, piPC7 and TpiPC are conventional bond-order ID number descriptors belonging to the path count group; the ID number is a molecular weighted path sum that accounts for multiple bonds in the molecule. One of the selected descriptors, the mean atomic Allred–Rochow electronegativity (Mare), scaled on the carbon atom, is a constitutional descriptor. Different Estrada indices calculated from the Barysz matrix were selected by the GAs. The Estrada indices encode information on the complexity of molecular graphs and are also used to describe characteristic physicochemical features of complex systems; they are based on the exponential function and consider both positive and negative eigenvalues at the same time, without compensation effects.
At this point, it can be summarized that the selected descriptors encode information about molecular geometry, polarity, flexibility, intermolecular interactions and the distribution of electronic charge. Each of these features affects molecular packing and hence the formation and properties of bent-core LC phases. Considering that molecular packing is determined by a sensitive balance between many competing factors, a variety of descriptors is required for a satisfactory prediction of LC behaviour.
Descriptor type | Selection | N_D a | SKNN out. layer | SKNN epochs | CPNN out. layer | CPNN epochs
---|---|---|---|---|---|---
2D | DT | 4 | 12 × 12 | 350 | 8 × 8 | 200
2D | MARS | 15 | 12 × 12 | 350 | 10 × 10 | 400
2D | GA | 11 | 12 × 12 | 200 | 12 × 12 | 300
2D | FSL | 20 | 12 × 12 | 300 | 12 × 12 | 350
2D | PCA | 360 (17) b | 12 × 12 | 350 | 12 × 12 | 250
2&3D | MARS | 18 | 12 × 12 | 200 | 12 × 12 | 250
2&3D | GA | 12 | 10 × 10 | 500 | 12 × 12 | 350
2&3D | PCA | 501 (24) b | 12 × 12 | 400 | 12 × 12 | 350

a N_D – number of descriptors. b Number of PCs.
The PNN architecture parameters, i.e. the numbers of neurons in each layer, depend solely on the features of the training dataset. More precisely, the number of neurons in the input layer corresponds to the number of descriptors, while the pattern layer has as many neurons as there are compounds in the learning set. The number of neurons in the summation layer equals the number of classes, and the decision layer has only one neuron in the case of binary classification. Since the same learning set was used, all PNN models had 200 pattern neurons, 2 summation neurons and 1 decision neuron, while the number of input neurons varied from 4 to 24 to match the number of descriptors used. GAs were employed to determine the optimal value of the smoothing factor during PNN training; their parameters are presented in Table 6. The values of the performance metrics and the corresponding confusion tables for the ANN models are shown in Tables 8–10. The performance of the majority of ANN models was good, with 2D-FSL-SKNN, 2&3D-GA-CPNN and 2&3D-GA-PNN performing better than the others and achieving accuracies higher than 90%.
Model | acc (%) | Pr (%) | r (%) | Gmean (%) | Actual class | Predicted LC | Predicted NLC
---|---|---|---|---|---|---|---
2D-DT | 80 | 87 | 89 | 50 | LC | 33 | 4
 | | | | | NLC | 5 | 2
2D-MARS | 80 | 87 | 89 | 50 | LC | 33 | 4
 | | | | | NLC | 5 | 2
2D-GA | 86 | 90 | 95 | 64 | LC | 35 | 2
 | | | | | NLC | 4 | 3
2D-FSL | 91 | 92 | 97 | 75 | LC | 36 | 1
 | | | | | NLC | 3 | 4
2D-PCA | 68 | 81 | 81 | 0 | LC | 30 | 7
 | | | | | NLC | 7 | 0
2&3D-MARS | 84 | 88 | 95 | 52 | LC | 35 | 2
 | | | | | NLC | 5 | 2
2&3D-GA | 84 | 89 | 92 | 63 | LC | 34 | 3
 | | | | | NLC | 4 | 3
2&3D-PCA | 77 | 89 | 84 | 60 | LC | 31 | 6
 | | | | | NLC | 4 | 3
Model | acc (%) | Pr (%) | r (%) | Gmean (%) | Actual class | Predicted LC | Predicted NLC
---|---|---|---|---|---|---|---
2D-DT | 82 | 89 | 89 | 62 | LC | 33 | 4
 | | | | | NLC | 4 | 3
2D-MARS | 82 | 87 | 92 | 51 | LC | 34 | 3
 | | | | | NLC | 5 | 2
2D-GA | 82 | 89 | 89 | 62 | LC | 33 | 4
 | | | | | NLC | 4 | 3
2D-FSL | 86 | 92 | 92 | 72 | LC | 34 | 3
 | | | | | NLC | 3 | 4
2D-PCA | 75 | 84 | 86 | 35 | LC | 32 | 5
 | | | | | NLC | 6 | 1
2&3D-MARS | 86 | 90 | 95 | 64 | LC | 35 | 2
 | | | | | NLC | 4 | 3
2&3D-GA | 91 | 95 | 95 | 82 | LC | 35 | 2
 | | | | | NLC | 2 | 5
2&3D-PCA | 80 | 85 | 92 | 36 | LC | 34 | 3
 | | | | | NLC | 6 | 1
Model | acc (%) | Pr (%) | r (%) | Gmean (%) | Actual class | Predicted LC | Predicted NLC
---|---|---|---|---|---|---|---
2D-DT | 75 | 91 | 78 | 67 | LC | 29 | 8
 | | | | | NLC | 3 | 4
2D-MARS | 86 | 97 | 86 | 86 | LC | 32 | 5
 | | | | | NLC | 1 | 6
2D-GA | 84 | 94 | 86 | 79 | LC | 32 | 5
 | | | | | NLC | 2 | 5
2D-FSL | 80 | 97 | 78 | 82 | LC | 29 | 8
 | | | | | NLC | 1 | 6
2D-PCA | 70 | 85 | 78 | 47 | LC | 29 | 8
 | | | | | NLC | 5 | 2
2&3D-MARS | 66 | 87 | 70 | 55 | LC | 26 | 11
 | | | | | NLC | 4 | 3
2&3D-GA | 91 | 97 | 92 | 89 | LC | 34 | 3
 | | | | | NLC | 1 | 6
2&3D-PCA | 77 | 86 | 86 | 50 | LC | 32 | 5
 | | | | | NLC | 5 | 2
A detailed evaluation of ANN classifiers is presented in the next section.
Fig. 5 Accuracy of models obtained using different classification and dimension reduction techniques. The dimension reduction techniques are arranged by ascending number of descriptors.
It can be seen from Fig. 5 that the ANN models based on DT or MARS descriptor selection in most cases (7/9) perform the same as or better than the corresponding DT or MARS models. However, the best ANN classifiers were created using the descriptor sets obtained by FSL and GA, which are dedicated selection techniques.
The models based on PCs were outperformed by all other models, while the models based on 2D descriptors proved to be less accurate than the corresponding models created with both 2D and 3D descriptors.
The average performance of the classifiers created with descriptors selected using different techniques is presented in Fig. 6. The division of the classifiers with respect to the dimension reduction technique used, which emerges from Fig. 5, is even more obvious in Fig. 6. Considering overall performance, the obtained models can be divided into three groups: (1) PCA-based, (2) DT/MARS-based and (3) FSL/GA-based. A difference of about 5% in acc and Pr between the groups can be observed; among the best-performing models, the FSL-based ANN models have better precision, while the GA-based ones are slightly more accurate.
Fig. 7 Predicted LC behaviour of the test compounds by the best DT, MARS and ANN models based on: (a) 2D and (b) 2&3D descriptor sets.
Since the two best 2&3D ANN models have the same accuracy, GA-PNN was regarded as the better one, owing to its higher Gmean value (89%).
Although the 2D-DT model demonstrates inferior predictive power compared with the models presented in Fig. 7, it can be considered very convenient for practical use. The simplicity of the DT approach allows the model to be executed even in a spreadsheet environment (Microsoft Excel and similar) by applying the if-then rules obtained from the DT, as sketched below. Thus, during molecular design, new structures can easily be screened for potential LC behaviour and the influence of structural units varied in homologous series can be quickly evaluated. The fact that the DT model uses only 2D descriptors further favours its usage.
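As an illustration of this spreadsheet-style use, the if-then skeleton of the tree described earlier (root split on JGI9, then SM1_Dzp, MPC9 and MLFER_E) can be written as below. The split descriptors follow the text, but the thresholds t1–t4 and the split orientations are hypothetical placeholders; the fitted values must be read off the actual tree.

```python
# If-then form of the DT; thresholds t1-t4 and orientations are placeholders.
def dt_classify(JGI9, SM1_Dzp, MPC9, MLFER_E, t1=0.0, t2=0.0, t3=0.0, t4=0.0):
    if JGI9 > t1:                          # the 30 higher-JGI9 compounds
        return "LC" if SM1_Dzp > t2 else "NLC"   # 72% LC node vs. pure NLC node
    if MPC9 > t3:                          # the remaining 220 compounds
        return "LC" if MLFER_E > t4 else "NLC"   # one LC and one NLC node
    return "NLC"                           # second NLC terminal node
```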
In the case of the best 2D and 2&3D models (2D-FSL-SKNN and 2&3D-GA-PNN), the prediction results exhibit four misclassifications each. The 2D-FSL-SKNN gave three false positives (i.e. an NLC classified as LC) and one false negative (i.e. an LC classified as NLC), while the 2&3D-GA-PNN predicted one false positive and three false negatives. None of the false predictions is common to both ANN models.
The output map for the 2D-FSL-SKNN is presented in the ESI (Fig. S6†). As can be seen, compounds 74 and 267 (P10 and P37 in Fig. S6†) are classified with a probability of 50%.
This means that the model simply does not have enough information, given the available training set and selected descriptors, to make a confident prediction of the LC behaviour of those molecules. Compound 92 is most likely misclassified as LC by the 2D-FSL-SKNN model because, in terms of the values of the selected descriptors, it is more similar to the LC compound 93 than to the other compounds of the same homologous series, which are NLCs.
The NLC compound 128, which is misclassified as LC by the 2&3D-GA-PNN model with a probability of 81% (according to the PNN output, Table S4 in the ESI†), can be considered an outlier, since all other compounds from the same homologous series are actually LCs. This is also supported by the fact that all models except the 2D-FSL-SKNN misclassified this compound. The other three LC compounds, namely 166, 212 and 243, were classified as such with probabilities of 47%, 19% and 6%, respectively.
A further decrease in the misclassification rate could be achieved by using a more balanced dataset with an increased number of compounds, especially NLC ones. Unfortunately, there is a deficit of reported NLC structures in the literature, since published papers contain series of compounds most of which are liquid crystals. Since LCs are often synthesised in series of 5 to 10 compounds, from a practical point of view it is only necessary to identify the LC series among the candidate series confidently, whereas the classifiers are actually trained for a more complex task, i.e. to predict the individual "losers" within a "winning" series.
Fig. 8 The pLC trend in a homologous series as the number of carbons (n) in the terminal chain is increased.
To highlight the pLC trend, the pLC values for compounds other than those in the prediction set were inter- and extrapolated: for the NLC compounds pLC was set to 0%, while for the LC compounds pLC was estimated to be >50% according to the observed trend. Fig. 8 shows the pLC trends, observed as the chain length is increased by the addition of carbons to the chain end, for four series of compounds that contain two or more molecules from the prediction set.
In the case of the series of compounds 284–294, the GA-PNN model captured the well-known effect of LC phase stabilization with increasing terminal chain length. Longer terminal chains result in increased lateral attractive forces, which stabilize the LC phases.68 Therefore, bent-core compounds with longer terminal chains have higher pLC values, because they are more likely to be LCs than shorter-chained ones.
For the series of compounds 5–15 and 133–143, the observed pLC trend shows no influence of the terminal chain length. This is probably related to the fact that in those homologous series the length of only one terminal chain is varied, while the second one has a fixed length (a long dodecyloxy chain). Apparently, the presence of the dodecyloxy chain maintains the stability of the LC phase, and therefore all homologues exhibit LC behaviour with the same or similar pLC values.
Compounds from the homologous series 94–101 form dark conglomerate (DC) mesophases. For this particular series, it was determined that the homologues with medium alkyl chain lengths are the most stable and that upon further chain elongation the DC phases become unstable. For example, the crystallization of compound 100 takes place immediately after the formation of the DC phase, while the longest homologue (101) does not exhibit a DC mesophase.69 This behaviour is well captured by the 2&3D-GA-PNN model: pLC decreases from the medium homologues to the longest one.
Overall, the results suggest that each tested ANN architecture (SKNN, CPNN and PNN) is usable for the prediction of LC behaviour. In particular, the 2D-FSL-SKNN and 2&3D-GA-PNN models proved to be practical and effective tools for LC prediction, demonstrating a high accuracy of 91% and precisions of 92% and 97%, respectively.
Finally, the analysis of the ability of the best classifier (2&3D-GA-PNN) to capture trends in homologous series showed that this model is capable of predicting the stability of a potential LC compound as a function of the PNN output probability. Chemists can therefore use the proposed PNN approach: (1) to screen new bent-core structures in the quest for new LCs, (2) to estimate the stability of the LC mesophase, and (3) to quantify the influence of structural modifications on LC phase formation and stability. Although the created models do not provide an understanding of the LC phase formation mechanism itself, they represent a rational and practical approach for the prediction of liquid crystallinity with high accuracy.
Further research is planned to extend the proposed approach to the prediction of the particular type of mesophase, as well as to develop regression models for the prediction of the transition temperatures of the LC phases of various five-ring bent-core systems.
Footnote
† Electronic supplementary information (ESI) available: Molecular structure and liquid crystal behaviour of the modelled five-ring bent-core compounds. Basic terms of molecular descriptors. The LC equations of the created 2D and 2&3D MARS models. List of descriptors used in the MARS models with short descriptions. Short descriptions of the descriptors selected using feature selection and genetic algorithms. The SKNN and CPNN optimization results. Prediction of LC behaviour of the external test set compounds for all 27 created models. The 2D-FSL-SKNN output map. The 2&3D-GA-PNN output with probabilities. See DOI: 10.1039/c5ra20775d
This journal is © The Royal Society of Chemistry 2016 |