M. Salahinejad
Environmental Laboratory, NSTRI, Tehran, Iran. E-mail: salahinejad@gmail.com
First published on 20th February 2015
In this study, a series of classification models was developed to categorise organic solvents with respect to their ability to disperse single-walled carbon nanotubes (SWNTs). The organic solvents were classified as solvents or nonsolvents based on their ability to disperse the SWNTs. Various feature selection techniques combined with different classifier algorithms, namely linear and quadratic discriminant analysis (LDA and QDA), decision trees (random forest and J48), neural networks and support vector machines (SVMs), were explored on a data set of structurally diverse organic solvents. Physicochemical descriptors such as partial charges, volsurf descriptors (the volumes and surfaces of grid points at different energy levels), subdivided surface areas and some shape descriptors contributed to the classification models. Validation studies using a test set, leave-one-out and 10-fold cross-validation provide statistical parameters such as specificity, sensitivity, accuracy, the Matthews correlation coefficient and the kappa index to evaluate the developed classification models. The sum of ranking differences (SRD) procedure reveals that the random forest classifier built on the descriptors selected by the wrapper feature selection method is the best classification model, while the SVM-, MLP- and QDA-based models rank as good models. The structural features and electrostatic interactions of solvent molecules play a significant role in discriminating good solvents from nonsolvents for SWNT dispersion.
Stable dispersion with the aid of surfactants, biomolecules12 and organic polymers13,14 is the most common "solubilization" method for SWNTs in aqueous and organic media. However, these procedures tend to degrade the SWNTs' electronic properties15 and make complete removal of the solubilizing agents from the nanotubes difficult.16 Thus, direct dispersion of nanotubes in a proper organic solvent has potential advantages, including the ability to remove the solvent by evaporation and to devise suitable purification and dispersion methods.
Attempts to identify the optimal solvent properties have been based on solubility parameters such as the Hildebrand or Hansen parameters17–21 or on surface energy.17,22 Ham et al. explored the relation between the Hansen solubility parameters and the dispersion state of SWNTs in various organic solvents.21 The organic solvents were classified into three groups, in which the SWNTs were dispersed, swollen or sedimented, based on the dispersive component of the solubility parameter. Bergin et al. measured the dispersibility of SWNTs in a range of organic solvents and analysed it in terms of the Hansen and surface energy solubility parameters of the solvents.23 The organic solvents were classified as solvents and nonsolvents, where nonsolvents are defined as solvents with effectively zero SWNT dispersibility. However, it was concluded that neither the Hansen nor the surface energy solubility parameters were sufficient to distinguish between solvents and nonsolvents or to predict the dispersion state of SWNTs in different organic solvents.
In a previous attempt, we investigated the application of quantitative structure–property relationship (QSPR) models to predict the dispersibility of SWNTs in various organic solvents.24,25 This work aims to develop in silico classification models that can be used to classify organic solvents based on their SWNT dispersibility and to explore the structural features important for the dispersion of carbon nanotubes. Various feature selection and classification techniques were compared as chemometric tools for the difficult problem of predicting the dispersion state of SWNTs in organic solvents.
The Kennard–Stone (KS) algorithm was applied to split the data set into training and test sets. The KS method is usually performed on the matrix of molecular descriptors (X) and uses the Euclidean distance to select the most representative objects. In the modified KS method, KS(Xy), the response vector (y) is added as an additional column to the descriptor matrix (X). This modification helps to distribute the samples evenly within both the descriptor and response spaces and enhances the influence of the response on the splitting results.28 A training set of 48 compounds was used to build and adjust the parameters of the classification models, and the remaining molecules (14 compounds) were used as a test set to evaluate the prediction ability of the classification models.
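The selection loop of the KS algorithm is compact enough to sketch directly. Below is a minimal Python version under stated assumptions: `X` is the descriptor matrix and `y` the 0/1 dispersion class, both placeholders rather than the study's actual data.

```python
# Minimal Kennard-Stone sketch: greedily pick the samples that best cover
# descriptor space, starting from the two most distant points.
import numpy as np

def kennard_stone(X, n_train):
    """Return indices of the n_train most representative samples."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)  # two most distant samples
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Next sample: the one farthest from its nearest already-selected neighbour.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        k = remaining[int(np.argmax(d_min))]
        selected.append(k)
        remaining.remove(k)
    return np.array(selected)

# KS(Xy) variant: append the response as an extra column so the split also
# spans the response space (scaling X and y beforehand is advisable).
# Xy = np.hstack([X, y.reshape(-1, 1)])
# train_idx = kennard_stone(Xy, 48)   # 48 training compounds, 14 left for testing
```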
- Correlation-based feature selection (Cfs) subset evaluator (CfsSubsetEval) with the best-first search method, which evaluates the worth of a subset of features by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of descriptors that are highly correlated with the class while having low inter-correlation are preferred. The best-first method searches the space of descriptor subsets by greedy hill-climbing augmented with a backtracking facility.29
- Relief-F attribute evaluator (ReliefFAttributeEval), which uses instance-based learning to assign a relevance weight to each feature, such that each feature's weight reflects its ability to distinguish among the class values.29
- Information gain (InfoGain) attribute evaluator (InfoGainAttributeEval), which selects attributes by measuring their information gain with respect to the class.30
- Wrapper attribute subset evaluator (WrapperSubsetEval), which uses the classification method itself to measure the importance of a feature set31 (see the sketch after this list).
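These evaluator names match those in the Weka toolkit. For readers working in Python, a rough scikit-learn analogue of one filter and one wrapper selector is sketched below; `X`, `y` and the subset size of 6 are illustrative assumptions, not the study's settings.

```python
# Filter vs wrapper feature selection, sketched with scikit-learn.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

# Filter: rank descriptors by mutual information with the class
# (conceptually close to Weka's InfoGainAttributeEval).
mi = mutual_info_classif(X, y, random_state=0)
filter_subset = np.argsort(mi)[::-1][:6]            # keep the 6 most informative

# Wrapper: let a classifier score candidate subsets via cross-validation
# (greedy forward search here, instead of Weka's configurable search).
selector = SequentialFeatureSelector(
    RandomForestClassifier(random_state=0),
    n_features_to_select=6, direction="forward", cv=10)
selector.fit(X, y)
wrapper_subset = np.where(selector.get_support())[0]
```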
Discriminant analysis (DA), a well-known and classic method among traditional classifiers, performs dimensionality reduction by maximizing the between-class variance and minimizing the within-class variance.32 Linear discriminant analysis (LDA) can only form linear boundaries, while quadratic discriminant analysis (QDA) separates the class regions by quadratic boundaries.33
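Both DA variants are available off the shelf; the following minimal sketch assumes `X_train`, `y_train`, `X_test` and `y_test` from the KS split above.

```python
# Linear vs quadratic discriminant boundaries on the same training data.
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))
```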
Decision tree (DT) classifiers are effective and powerful classification tools that take the form of a tree structure in which non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.34 The random forest (RF) classifier uses a collection of decision trees to improve the classification rate, while the J48 tree algorithm uses a divide-and-conquer strategy, recursively splitting a root node into two partitions of child nodes.35
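As an illustration, both tree-based classifiers can be approximated as below. Note that scikit-learn's decision tree implements CART rather than C4.5, so it is only an analogue of J48, and the hyperparameters are placeholders.

```python
# Random forest (ensemble of trees) and a single entropy-based, lightly
# pruned decision tree as a stand-in for Weka's J48 (C4.5).
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
j48_like = DecisionTreeClassifier(criterion="entropy",  # information-gain splits, as in C4.5
                                  ccp_alpha=0.01,       # mild cost-complexity pruning
                                  random_state=0).fit(X_train, y_train)
```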
Feed-forward multilayer networks, or multilayer perceptrons (MLPs), and radial basis function networks (RBFNs) are two of the most widely used neural network classifiers; both are trained with an activation function that associates input vectors with a corresponding target vector.36 The MLP uses one or more hyperplanes to separate the classes in the input space, whereas RBFNs take a local approach, modelling the separate class distributions by localized radial basis functions. Support vector machine (SVM) classification is based on the concept of separating hyperplanes that define decision boundaries.37,38
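The remaining classifiers can be sketched in the same way. scikit-learn provides no RBF network, so an RBF-kernel SVM stands in here for localized radial basis boundaries; this differs from a true RBFN, and all hyperparameters are illustrative.

```python
# MLP and SVM sketches; descriptor scaling matters for both.
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=0)).fit(X_train, y_train)
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)
```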
The sensitivity (Sn), specificity (Sp) and accuracy (ACC) were calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

Sn = TP/(TP + FN) (1)

Sp = TN/(TN + FP) (2)

ACC = (TP + TN)/(TP + TN + FP + FN) (3)
The Matthews correlation coefficient (MCC) and Cohen's kappa are two statistics used to validate the predictive performance of classification models. The MCC is computed as:
MCC = (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)] (4)
The MCC ranges from +1 to −1, where +1 represents a perfect prediction and −1 an inverse prediction. The kappa statistic compares the observed accuracy with the expected accuracy, defined as the accuracy that would be expected by chance alone.39,40 The kappa coefficient, k, is computed as:
k = (Po − Pe)/(1 − Pe) (5)

where Po is the observed agreement (the overall accuracy) and Pe is the agreement expected by chance.
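All five statistics of eqns (1)–(5) follow directly from the binary confusion matrix; a self-contained sketch (`y_true` and `y_pred` are assumed, with 1 = solvent):

```python
# Compute Sn, Sp, ACC, MCC and kappa from a binary confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def classification_stats(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                                   # eqn (1); x100 for %
    sp = tn / (tn + fp)                                   # eqn (2)
    acc = (tp + tn) / (tp + tn + fp + fn)                 # eqn (3)
    mcc = (tp * tn - fp * fn) / np.sqrt(                  # eqn (4)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    n = tp + tn + fp + fn
    po = acc                                              # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2  # chance agreement
    kappa = (po - pe) / (1 - pe)                          # eqn (5)
    return sn, sp, acc, mcc, kappa
```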
In order to evaluate the robustness of the classification models, leave-one-out cross-validation (LOO-CV) and 10-fold cross-validation were performed.
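Both protocols are one-liners in scikit-learn; the random forest here is an illustrative choice, not necessarily the study's configuration.

```python
# Leave-one-out and stratified 10-fold cross-validated accuracy.
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
loo_acc = cross_val_score(clf, X_train, y_train, cv=LeaveOneOut()).mean()
cv10_acc = cross_val_score(clf, X_train, y_train,
                           cv=StratifiedKFold(n_splits=10, shuffle=True,
                                              random_state=0)).mean()
```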
Classification models able to predict the SWNT dispersibility of organic solvents were developed using physicochemical descriptors calculated with the MOE software as independent variables. Since no single model or method gives the best result for every data set, multiple classification models with different approaches were constructed and compared. Three filter feature selection methods, which are classifier-independent techniques based on specific criteria such as correlation (Cfs), distance (ReliefF) and information (InfoGain), and one wrapper method (Wrapper), which is based on the performance of a particular classifier, were examined for selecting the most effective subset of features. Table 1 lists the abbreviations and brief descriptions of the features selected by each method.
Table 1 Selected descriptors and the corresponding subset evaluators

| Selected descriptor | Description | Subset evaluator |
|---|---|---|
| a_nO | Number of oxygen atoms | ReliefF |
| BCUT_SLOGP_2 | BCUT descriptor using atomic contributions to logP (octanol/water) | Wrapper |
| CASA+ | Positive charge weighted surface area | InfoGain, Cfs |
| GCUT_PEOE_0 | GCUT descriptor | Wrapper |
| PEOE_VSA_FNEG | Fractional negative van der Waals surface area | Wrapper |
| PEOE_VSA_FPOL | Fractional polar van der Waals surface area | Cfs |
| PEOE_VSA_FPPOS | Fractional positive polar van der Waals surface area | Wrapper |
| PEOE_VSA_POL | Total polar van der Waals surface area | ReliefF |
| PEOE_VSA_PPOS | Total positive polar van der Waals surface area | ReliefF |
| PEOE_VSA+4 | Sum of vi where qi is in the range (0.20, 0.25) | ReliefF |
| Q_VSA_FPOS | Fractional positive van der Waals surface area | InfoGain, Cfs |
| Q_VSA_NEG | Total negative van der Waals surface area | Wrapper |
| SMR_VSA1 | Sum of atomic molar refractivity with polarities in the range 0.11 to 0.26 | Cfs |
| Std_dim2 | Standard dimension 2 | Wrapper |
| vsurf_CW1 | Capacity factor of order 1 | Wrapper |
| vsurf_CW6 | Capacity factor of order 6 | InfoGain, Cfs |
| vsurf_EWmin1 | Lowest hydrophilic energy | InfoGain |
The features selected by the aforementioned FS methods were passed to different classification techniques, namely the DA, DT, RBFN, MLP and SVM classifiers. Tables 2–4 give the statistical performance of the resulting classification models for each feature selection technique and classifier. The statistical parameters Sp, Sn, ACC and MCC were calculated from the confusion matrices of the training and test sets and from the LOO and 10-fold cross-validation for all developed models. Here, Sn measures the ability of a classification model to correctly recognize solvent compounds as solvents, and Sp measures its ability to identify nonsolvent compounds as nonsolvents (both in percent). All the developed classification models were validated by LOO and 10-fold cross-validation and with the test set.
Table 2 Statistical results of the discriminant analysis (LDA and QDA) classification models^a

| Subset evaluator | Classifier | Train Sp | Train Sn | Train ACC | Train MCC | Train k | Test Sp | Test Sn | Test ACC | Test MCC | Test k | LOO ACC | LOO MCC | 10-fold ACC | 10-fold MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cfs | LDA | 70.83 | 60.71 | 62.50 | 0.25 | 0.25 | 50.00 | 50.00 | 57.14 | 0.13 | 0.13 | 56.25 | 0.13 | 56.25 | 0.13 |
| InfoGain | LDA | 70.83 | 65.38 | 66.67 | 0.33 | 0.33 | 50.00 | 50.00 | 57.14 | 0.13 | 0.13 | 56.25 | 0.13 | 60.42 | 0.21 |
| ReliefF | LDA | 75.00 | 78.26 | 77.08 | 0.54 | 0.54 | 83.33 | 71.43 | 78.57 | 0.58 | 0.57 | 77.08 | 0.54 | 79.17 | 0.58 |
| Wrapper | LDA | 75.00 | 69.23 | 70.83 | 0.42 | 0.42 | 83.33 | 50.00 | 57.14 | 0.23 | 0.19 | 60.42 | 0.21 | 66.67 | 0.33 |
| Cfs | QDA | 75.00 | 78.26 | 77.08 | 0.54 | 0.54 | 75.00 | 60.00 | 62.50 | 0.26 | 0.42 | 52.08 | 0.04 | 58.33 | 0.12 |
| InfoGain | QDA | 83.33 | 60.61 | 64.58 | 0.31 | 0.29 | 70.83 | 58.62 | 58.33 | 0.19 | 0.57 | 52.08 | 0.04 | 60.42 | 0.21 |
| ReliefF | QDA | 79.17 | 79.17 | 79.17 | 0.58 | 0.58 | 75.00 | 75.00 | 75.00 | 0.50 | 0.71 | 75.00 | 0.50 | 77.08 | 0.54 |
| Wrapper | QDA | 83.33 | 90.91 | 87.50 | 0.75 | 0.75 | 62.50 | 60.00 | 60.42 | 0.21 | 0.16 | 60.42 | 0.21 | 58.33 | 0.17 |

^a Sp: specificity (%), Sn: sensitivity (%), ACC: accuracy (%), MCC: Matthews correlation coefficient, k: kappa coefficient, LOO-CV: leave-one-out cross-validation.
Table 3 Statistical results of the decision-tree (RF and J48) classification models^a

| Subset evaluator | Classifier | Train Sp | Train Sn | Train ACC | Train MCC | Train k | Test Sp | Test Sn | Test ACC | Test MCC | Test k | LOO ACC | LOO MCC | 10-fold ACC | 10-fold MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cfs | RF | 95.83 | 79.31 | 85.42 | 0.72 | 0.71 | 100.00 | 75.00 | 85.71 | 0.75 | 0.72 | 79.17 | 0.58 | 81.25 | 0.63 |
| InfoGain | RF | 95.83 | 76.67 | 83.33 | 0.69 | 0.67 | 100.00 | 66.67 | 78.57 | 0.65 | 0.59 | 68.75 | 0.38 | 70.83 | 0.43 |
| ReliefF | RF | 70.83 | 100.00 | 85.42 | 0.74 | 0.71 | 83.33 | 71.43 | 78.57 | 0.58 | 0.57 | 66.67 | 0.33 | 72.92 | 0.46 |
| **Wrapper** | **RF** | **100.00** | **92.31** | **95.83** | **0.92** | **0.92** | **100.00** | **60.00** | **71.43** | **0.55** | **0.47** | **68.75** | **0.38** | **70.83** | **0.42** |
| Cfs | J48 | 83.33 | 86.96 | 85.42 | 0.71 | 0.71 | 83.33 | 62.50 | 71.43 | 0.46 | 0.44 | 77.08 | 0.54 | 68.75 | 0.38 |
| InfoGain | J48 | 100.00 | 72.73 | 81.25 | 0.67 | 0.63 | 83.33 | 50.00 | 57.14 | 0.23 | 0.19 | 75.00 | 0.53 | 72.92 | 0.48 |
| ReliefF | J48 | 58.33 | 100.00 | 79.17 | 0.64 | 0.58 | 66.67 | 80.00 | 78.57 | 0.56 | 0.55 | 70.83 | 0.44 | 66.67 | 0.34 |
| Wrapper | J48 | 100.00 | 62.50 | 81.25 | 0.67 | 0.63 | 66.67 | 57.14 | 64.29 | 0.29 | 0.29 | 64.58 | 0.29 | 62.50 | 0.25 |

^a Sp: specificity (%), Sn: sensitivity (%), ACC: accuracy (%), MCC: Matthews correlation coefficient, k: kappa coefficient, LOO-CV: leave-one-out cross-validation. The Wrapper-RF model (bold) was ranked best by the SRD analysis.
Table 4 Statistical results of the SVM (LibSVM), RBFN and MLP classification models^a

| Subset evaluator | Classifier | Train Sp | Train Sn | Train ACC | Train MCC | Train k | Test Sp | Test Sn | Test ACC | Test MCC | Test k | LOO ACC | LOO MCC | 10-fold ACC | 10-fold MCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cfs | LibSVM | 83.33 | 90.91 | 87.50 | 0.75 | 0.75 | 83.33 | 83.33 | 85.71 | 0.71 | 0.71 | 81.25 | 0.63 | 81.25 | 0.63 |
| InfoGain | LibSVM | 66.67 | 72.73 | 70.83 | 0.42 | 0.42 | 83.33 | 62.50 | 71.43 | 0.46 | 0.44 | 62.50 | 0.25 | 60.42 | 0.21 |
| ReliefF | LibSVM | 70.83 | 100.00 | 85.42 | 0.74 | 0.71 | 83.33 | 62.50 | 71.43 | 0.46 | 0.44 | 54.17 | 0.16 | 75.00 | 0.50 |
| Wrapper | LibSVM | 95.83 | 92.00 | 93.75 | 0.88 | 0.88 | 83.33 | 55.56 | 64.29 | 0.34 | 0.31 | 58.33 | 0.17 | 64.58 | 0.29 |
| Cfs | RBFN | 79.17 | 100.00 | 89.58 | 0.81 | 0.79 | 83.33 | 71.43 | 78.57 | 0.58 | 0.57 | 77.08 | 0.54 | 75.00 | 0.51 |
| InfoGain | RBFN | 75.00 | 87.50 | 79.17 | 0.59 | 0.58 | 100.00 | 66.67 | 78.57 | 0.65 | 0.59 | 56.25 | 0.13 | 62.50 | 0.25 |
| ReliefF | RBFN | 70.83 | 94.44 | 83.33 | 0.69 | 0.67 | 83.33 | 83.33 | 85.71 | 0.71 | 0.71 | 68.75 | 0.38 | 66.67 | 0.33 |
| Wrapper | RBFN | 95.83 | 100.00 | 97.92 | 0.96 | 0.96 | 50.00 | 50.00 | 57.14 | 0.13 | 0.13 | 58.33 | 0.17 | 60.42 | 0.21 |
| Cfs | MLP | 79.17 | 82.61 | 81.25 | 0.63 | 0.63 | 83.33 | 71.43 | 78.57 | 0.58 | 0.57 | 77.08 | 0.55 | 77.08 | 0.54 |
| InfoGain | MLP | 65.22 | 62.50 | 64.58 | 0.29 | 0.29 | 50.00 | 60.00 | 64.29 | 0.26 | 0.26 | 60.42 | 0.21 | 62.50 | 0.25 |
| ReliefF | MLP | 75.00 | 85.71 | 81.25 | 0.63 | 0.63 | 83.33 | 71.43 | 78.57 | 0.58 | 0.57 | 56.25 | 0.46 | 75.00 | 0.50 |
| Wrapper | MLP | 70.83 | 70.83 | 70.83 | 0.42 | 0.42 | 66.67 | 50.00 | 57.14 | 0.17 | 0.16 | 56.25 | 0.56 | 60.42 | 0.21 |

^a Sp: specificity (%), Sn: sensitivity (%), ACC: accuracy (%), MCC: Matthews correlation coefficient, k: kappa coefficient, LOO-CV: leave-one-out cross-validation.
Sum of ranking differences (SRD) values were calculated for all classification models in order to compare the performance of the methods. The SRD analysis was carried out with simulated random numbers in conjunction with the theoretical distribution of SRD values, the comparison of ranks with random numbers (CRRN) procedure. Table 5 provides the SRD values and p% intervals for the classification models, while Fig. 1 displays the SRD-CRRN test results for the data matrix of Table 5 (a computational sketch of the SRD calculation follows Fig. 1). As shown in Fig. 1 and Table 5, the RF classifier based on the variables selected by the wrapper feature selection method gave the best ranking, with the smallest SRD (the smaller the SRD, the better the model). As highlighted in bold in Table 3 and labeled in Fig. 1, the Wrapper-RF model gave accuracies of more than 95% and 71% in the training and test sets, respectively, and reasonable cross-validation results.
Table 5 SRD ranking of the classification models and p% intervals from the CRRN test^a

| Model | Model code | SRD | p% (lower) | p% (upper) |
|---|---|---|---|---|
| Wrapper-RF | 12 | 10 | 7.45 × 10⁻⁵ | 1.73 × 10⁻⁴ |
| Cfs-MLP | 21 | 14 | 3.88 × 10⁻⁴ | 8.44 × 10⁻⁴ |
| Wrapper-QDA | 8 | 18 | 1.78 × 10⁻³ | 3.67 × 10⁻³ |
| Cfs-RBF | 25 | 18 | 1.78 × 10⁻³ | 3.67 × 10⁻³ |
| Wrapper-SVM | 20 | 20 | 3.67 × 10⁻³ | 7.35 × 10⁻³ |
| Cfs-QDA | 5 | 22 | 7.35 × 10⁻³ | 1.43 × 10⁻² |
| InfoGain-RF | 10 | 22 | 7.35 × 10⁻³ | 1.43 × 10⁻² |
| ReliefF-RF | 11 | 22 | 7.35 × 10⁻³ | 1.43 × 10⁻² |
| Cfs-J48 | 13 | 22 | 7.35 × 10⁻³ | 1.43 × 10⁻² |
| ReliefF-SVM | 19 | 24 | 1.43 × 10⁻² | 2.71 × 10⁻² |
| ReliefF-RBF | 27 | 24 | 1.43 × 10⁻² | 2.71 × 10⁻² |
| Wrapper-LDA | 4 | 26 | 2.71 × 10⁻² | 5.00 × 10⁻² |
| Cfs-SVM | 17 | 26 | 2.71 × 10⁻² | 5.00 × 10⁻² |
| Wrapper-J48 | 16 | 28 | 5.00 × 10⁻² | 8.98 × 10⁻² |
| InfoGain-SVM | 18 | 28 | 5.00 × 10⁻² | 8.98 × 10⁻² |
| InfoGain-MLP | 22 | 28 | 5.00 × 10⁻² | 8.98 × 10⁻² |
| InfoGain-QDA | 6 | 32 | 0.16 | 0.27 |
| Cfs-RF | 9 | 32 | 0.16 | 0.27 |
| Wrapper-MLP | 24 | 32 | 0.16 | 0.27 |
| ReliefF-QDA | 7 | 34 | 0.27 | 0.44 |
| ReliefF-MLP | 23 | 34 | 0.27 | 0.44 |
| InfoGain-J48 | 14 | 36 | 0.44 | 0.72 |
| ReliefF-J48 | 15 | 36 | 0.44 | 0.72 |
| InfoGain-RBF | 26 | 36 | 0.44 | 0.72 |
| Wrapper-RBF | 28 | 36 | 0.44 | 0.72 |
| Cfs-LDA | 1 | 38 | 0.72 | 1.13 |
| InfoGain-LDA | 2 | 38 | 0.72 | 1.13 |
| ReliefF-LDA | 3 | 44 | 2.60 | 3.80 |
| XX1 | | 46 | 4.61 | 5.47 |
| Q1 | | 58 | 24.45 | 27.12 |
| Med | | 66 | 48.78 | 52.08 |
| Q3 | | 74 | 73.59 | 76.22 |
| XX19 | | 84 | 94.77 | 95.59 |

^a XX1: first icosaile (5%), Q1: first quartile, Med: median, Q3: last quartile, XX19: last icosaile (95%).
Fig. 1 SRD-CRRN results of the classification models. XX1: first icosaile (5%), Q1: first quartile, Med: median, Q3: last quartile, XX19: last icosaile (95%). The label numbers indicate the model codes given in Table 5.
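The core SRD computation is simple to reproduce. The sketch below assumes a matrix whose rows are validation statistics and whose columns are models, uses the row-wise average as the reference ranking (one common choice), and omits the CRRN randomization test.

```python
# Sum of ranking differences: rank the rows by each model's values and sum
# the absolute rank differences from the reference (row-average) ranking.
import numpy as np
from scipy.stats import rankdata

def srd(performance):
    """performance: rows = validation statistics, columns = models."""
    ref_ranks = rankdata(performance.mean(axis=1))   # consensus reference ranking
    return np.array([np.abs(rankdata(performance[:, j]) - ref_ranks).sum()
                     for j in range(performance.shape[1])])

# Smaller SRD = closer to the consensus ranking = better model.
```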
The LDA classifier, which can only learn linear boundaries, showed the worst performance among the developed classification models, while QDA, RBFN and SVM, which can also learn nonlinear boundaries and are therefore more flexible, performed better and gave significant results.
A deeper inspection of Tables 2–5 reveals that the wrapper method yielded classification models with better performance and significant statistical parameters. In this study, the wrapper FS technique was coupled with a genetic algorithm as a random search method for all classifier procedures. As mentioned above, the wrapper method relies on the performance of a particular classifier to measure the importance of a feature set; hence it generally outperforms filter methods, in which the feature subset is selected according to a specific criterion.
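A minimal sketch of this wrapper-GA combination is given below; the population size, mutation rate, generation count and 10-fold fitness measure are illustrative placeholders rather than the settings used in this study.

```python
# Genetic-algorithm wrapper: binary masks over descriptors evolve, with the
# cross-validated accuracy of the masked classifier as the fitness.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    if not mask.any():
        return 0.0
    return cross_val_score(RandomForestClassifier(random_state=0),
                           X[:, mask], y, cv=10).mean()

def ga_wrapper(X, y, pop_size=20, generations=30, p_mut=0.05):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5                    # random initial masks
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                         # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut                   # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[int(np.argmax(scores))]                       # best descriptor mask
```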
As indicated in Table 1, different kinds of descriptors are involved in the classification models generated to discriminate between solvents and nonsolvents for SWNT dispersion. Most of the selected descriptors are partial charge descriptors, which depend on the partial charge of each atom of a chemical structure, such as the fractional polar van der Waals surface area (PEOE_VSA_FPOL), the fractional positive van der Waals surface area (Q_VSA_FPOS), the total polar van der Waals surface area (PEOE_VSA_POL) and the positive charge weighted surface area (CASA+). The selected features vsurf_CW1, vsurf_CW6 and vsurf_EWmin1 are volsurf descriptors, which depend on the structural connectivity and the conformation of the molecules.48,49 The vsurf_CW descriptors describe the capacity factor of a molecule at different energy levels and reflect its hydrophilicity per unit surface area, while vsurf_EWmin1 represents the lowest hydrophilic energy of a molecule. SMR_VSA1, a subdivided surface area descriptor, is a van der Waals surface area (VSA) descriptor49 characterized by atomic molar refractivity in a given range and describes the polarizability of a molecule. BCUT_SLOGP_2 is a BCUT descriptor using atomic contributions to logP (octanol/water); BCUT descriptors encode atomic properties relevant to intermolecular interactions and are calculated from the distance and adjacency matrices.50 Std_dim2 is the second largest standardized dimension and depends on the structural connectivity and conformation of a molecule. Finally, a_nO, an atom count descriptor, represents the number of oxygen atoms.
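The descriptors in Table 1 come from MOE, but several have close open-source analogues in RDKit (the PEOE_VSA and SMR_VSA series, BCUT2D descriptors and simple atom counts), which allows a rough, freely available sketch of the descriptor calculation; the vsurf/volsurf descriptors have no direct RDKit counterpart. The molecule below is an illustrative example only.

```python
# Rough RDKit analogues of some MOE descriptors from Table 1.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CN1CCCC1=O")   # N-methyl-2-pyrrolidone (NMP)
a_nO = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == "O")
print("a_nO analogue:", a_nO)
print("SMR_VSA1:", Descriptors.SMR_VSA1(mol))    # molar-refractivity-binned VSA
print("PEOE_VSA6:", Descriptors.PEOE_VSA6(mol))  # partial-charge-binned VSA
```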
The importance of partial charge descriptors, which are dominated by electrostatic interactions, in the classification models implies a role for the electronic properties of organic solvents in the dispersibility of SWNTs. The contribution of the volsurf, subdivided surface area and shape descriptors to the developed models highlights the effect of the structural features of solvent molecules on their ability to disperse SWNTs.
As a first report on the classification of organic solvents based on their SWNT dispersibility, simple molecular descriptors and freely available classification packages were used to develop the classification models. One important challenge in in silico modeling is developing a reliable and predictive model from a limited number of experimental data on nanomaterials.51 The influence of training sample size on classification performance, and the hypothesis that the variance of classification learning decreases as training set size increases, have been examined and confirmed by many studies.52–54 We examined different feature selection techniques combined with various classifier algorithms to mitigate the effect of the small sample size on classification difficulty. Nevertheless, the results obtained in this study are significant and could be improved with a larger sample size and more sophisticated classifier methods.
Footnote

† Electronic supplementary information (ESI) available: A complete list of the simplified molecular input line entry specification (SMILES) strings and molecular structures of the organic solvents used for the classification models, with their SWNT dispersion states. See DOI: 10.1039/c5ra01261a

This journal is © The Royal Society of Chemistry 2015