Development of classification and regression models for Vibrio fischeri toxicity of ionic liquids: green solvents for the future

Rudra Narayan Das and Kunal Roy *
Drug Theoretics and Cheminformatics Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India. E-mail: kunalroy_in@yahoo.com; kroy@pharma.jdvu.ac.in; Web: http://sites.google.com/site/kunalroyindia/ Fax: +91-33-2837-1078; Tel: +91 98315 94140

Received 13th March 2012 , Accepted 5th July 2012

First published on 25th July 2012


Abstract

Ionic liquids (ILs) have important industrial applications due to their unconventional and encouraging properties making them unique chemical species with extremely lowered vapour pressure, high thermal stability, and enhanced solvation characteristics. They are considered as “green solvents” chiefly due to their task-specificity and minimal release into the environment. The assessment of toxicity of ILs to living ecosystems has received considerable attention in recent years. Development of predictive quantitative structure–toxicity relationship (QSTR) models for ionic liquids can help in designing derivatives with a reduced toxicity profile, thereby making them greener and eco-friendlier. The present study attempts to develop a classification model as well as a regression model to capture specific structural information of ionic liquids responsible for their toxic manifestation to Vibrio fischeri. The models were developed using various two-dimensional chemical descriptors along with dummy variables and subjected to rigorous statistical validation employing multiple strategies, whereas other models of ILs for Vibrio fischeri toxicity reported so far have not employed such strict tools. The classification model has been characterized by acceptable Wilk's λ statistics, pharmacological distribution diagram assessment, and receiver operating characteristics (ROC) analysis parameters. The regression models have been judged according to the OECD guidelines, and the best model showed encouraging external predictivity (R2pred = 0.739). The toxicity of ionic liquids to V. fischeri was found to be related to branching, molecular size, and solvation entropy of cations along with a lipophilicity contribution of the anions.


Introduction

The modern growth of chemical industrialization has been an alarming issue for exerting toxicity to the living ecosystem in a multidimensional way. The presence of a huge number of diverse chemicals in the environment necessitates the assessment of their harmful effects. The search for alternative chemicals providing safety and sustainability is a real challenge to be addressed in this century. Now-a-days the use and deployment of green technologies as well as green chemistry1 (principles presented in Table S1 in the ESI) is an essential issue to protect the environment from potential hazardous effects of chemicals. Ionic liquids (ILs), comprising of cations and anions, are among the latest additions to the world of chemical compounds and deployed to a wide field of chemical applications. ILs are liquid in nature below 100 °C temperature2 and those liquid at room temperature are termed as room temperature ionic liquids (RTILs). They are considered as green chemicals due to their uniqueness in terms of low vapour pressure,3 enhanced thermal and chemical stability and wide solvation characteristics.4 However, extensive researches in the past years have opened up a multitude of highly interesting and attractive application opportunities in the fields of electrochemistry,5 analytical usage,6 production of biodiesel,7 cleaning technology, extraction and separation technology,8 and even in the recovery of nuclear waste.9 Because of offering enough flexibility in design only by changing the cation–anion combination, ILs are also used as “designer solvents”.10,11 The environmental benevolence shown by the ILs over the conventional organic solvents may be attributed to their extreme low vapour pressure allowing lesser release of volatile organic compounds (VOCs)12 as well as their non-inflammability contributing to reduced chances of accidents in the case of fast and exothermic reactions.

Although ionic liquids potentially correspond to environmental acceptability, there is still a dearth of specific knowledge regarding their behaviour and interaction with the living ecosystem. Apart from the advantages obtained through reduced vapour pressure, the ionic liquids must be subjected to special consideration regarding a sustainable synthesis13i.e., maximum utilization of reaction material (green chemistry principle 2), lowered toxicity profile and harmlessness (green chemistry principle 3), and restricted environmental diligence i.e., easy breakdown into harmless products (green chemistry principle 10).1 Aqueous solubility of many ionic liquids poses the danger of their release and thereby exertion of untoward effects in aquatic ecosystems. The high chemical and thermal stability also highlights the problem of bioaccumulation of these agents. Hence, proper assessment of the toxicological profile of ionic liquids is necessary.14,15 The present study focuses on the predictive modelling of toxicity of ionic liquids against aquatic bacterium, Vibrio fischeri.

Quantitative structure–activity relationship (QSAR) can sufficiently provide a rational basis for the design of suitable ionic liquids with the intended purpose of use. In this study, we have focused on the prediction of toxicity, i.e., quantitative structure–toxicity relationship (QSTR), of diverse ionic liquid species. The practise of QSAR technique for the prediction of activity, toxicity, or physicochemical properties is robustly supported with compounds’ detailed chemistry and interaction with biological system on a strong background of statistical validation.16 Presently, the use of predictive QSAR models is widely accepted by different international regulatory authorities including the Organization for Economic Co-operation and Development (OECD), which has proposed guidelines for developing QSAR models to serve as an alternative strategy of toxicity testing17 on a reasonable basis.

Toxicity data for ionic liquids to different reference aquatic invertebrate organisms namely Vibrio fischeri (bacteria),18Pseudokirchneriella subcapitata (green algae),19Daphnia magna (crustacean),20 vertebrate organisms namely Danio rerio (zebrafish),21Rana nigromaculata (frog),22 and aquatic plant Lemna minor (duckweed)23 are now available in the literature. Here, we have focussed on developing QSTR models on Vibrio fischeri toxicity endpoint which is widely employed for toxicity studies for different environmental issues,24 namely screening of surface or ground waters, wastewaters, soil or sediment extract monitoring, as well as for toxicity testing of chemical compounds.25

Attempts to develop predictive mathematical models on Vibrio fischeri toxicity data for ionic liquids have been quite a few compared to other classes of hazardous chemicals. Couling et al.26 developed a four parameter equation containing E-state indices and surface area descriptors against V. fischeri toxicity of ionic liquids. Bruzzone et al.27 reported a QSTR model on V. fischeri toxicity data of a class of 33 ILs employing the CODESSA treatment. In another study, Viboud et al.28 conducted a QSTR analysis on the V. fischeri toxicity data of 54 ILs using multifactorial component analysis and reported the importance of the planarity of the cation ring on toxicity. A multiple regression model was developed by Luis et al.29 involving V. fischeri toxicity values of 96 ILs and the effect of different structural groups were reported. A number of other studies also exist in the literature where the authors have reported structural correlation of ionic liquids e.g. effect of alkyl chain length of cations with V. fischeri toxicity without developing any true mathematical model.23,30 In none of the above mentioned QSTR studies, strict external validation strategies employing multiple tools have been employed. However, in this present study, we have externally validated the developed models using several metrics so as to judge the reliability and predictivity of the models on untested new ionic liquid agents with greater confidence.

In the present study, we have attempted to develop predictive QSTR regression models as well as a classification model on Vibrio fischeri toxicity data of ionic liquids (collected from the literature)31 in order to derive a structural relationship of the compounds with the toxicity. The objective of the work has been to develop an initial classification model for ionic liquids using linear discriminant analysis32,33 as a primary screening tool enabling preliminary assessment of toxicity towards Vibrio fischeri and finally to develop a robust mathematical regression model to quantitatively predict the toxicity and to derive specific information regarding the contribution of different structural components towards the toxicity. By using the information derived from both the studies, it will be possible to design newer ionic liquids with less toxicity potential to the living ecosystem (specifically Vibrio fischeri), hence making them ‘Greener alternatives’ to other conventional solvents. We have developed and validated the regression models by strictly following the OECD guidelines for QSAR model development.

Materials and methods

The dataset and descriptors

The ecotoxicity data for 148 ionic liquids corresponding to the logarithmic value of 15 and 30 min EC50 (log EC50 in μmol L−1) of bioluminescence bacterium Vibrio fischeri was collected from the literature.31 The whole dataset comprises of a varying combination of 30 anionic and 64 cationic species (presented in Table S2 in the ESI).31 In the present study, we have used 147 compounds by omitting one (MIM) that was not a part of the 94 ionic species and we converted the ecotoxicity values into their corresponding negative logarithmic units, i.e. pEC50 (−log EC50). Considering the structures of both cations and anions, various two dimensional descriptors34–37 computed using Dragon (version 6)38 software (running on a Fujitsu workstation) along with dummy variables (showing presence and/or absence of a specific ion) have been used to develop predictive models. The categorical list of the calculated descriptors is shown in Table S3 in the ESI.

Development of models

We have developed classification based as well as regression based QSAR models to encode suitable structural information on the V. fischeri toxicity of the studied ionic liquids. Employing the k-means clustering approach, the dataset was divided into a training set of 110 compounds for model development and a test set of 37 compounds for determining external predictive ability of the developed model. However, few compounds lacking numerical toxicity values were removed only for regression model development. Only the potential descriptors obtained via a thinning procedure were employed for the construction of models. We have considered toluene (pEC50 = 3.670 mol L−1 for Vibrio fischeri)39 as the reference toxic solvent for the development of the classification model. For the development of the QSAR based classification model, linear discriminant analysis (LDA) has been employed as a supervised classification tool.32 Regression models40–46 were derived by using multiple linear regression (MLR)42 and partial least squares (PLS)46 as the statistical techniques and Stepwise method and Genetic Function Approximation (GFA)43–45 techniques as the chemometric tools for feature selection. The procedure of dataset division, the methodology for descriptor thinning and the performed statistical analyses have been detailed in the ESI.

Software used for model development

MINITAB47 software has been used for carrying out stepwise multiple linear regression (MLR) and the corresponding partial least squares (PLS) analysis, while k-means cluster division has been performed using SPSS48 in a windows operating system computer supplied with Pentium 4 processor. Cerius 2 (version 4.10)49 software, running under IRIX 6.5 operating system on a Silicon Graphics workstation, has been employed for genetic function approximation analysis. The linear discriminant analysis was carried out using the discriminant function module in STATISTICA50 software.

Validation of the developed models

The present study involves computation of several statistical metrics following multiple validation strategies to determine the reliability of performed analyses. The quality of the regression equations were characterized by the determination coefficient R2,51 the adjusted R2 parameter (Ra2),51 the leave-one-out cross-validation parameter Q2 (internal validation),52 as well as the external predictivity metric R2pred.53 We have additionally reported the more stricter rm2 metric for internal (ugraphic, filename = c2tx20020a-t1.gif, Δrm2(LOO)), external (ugraphic, filename = c2tx20020a-t2.gif, Δrm2(test)) as well as overall validation (ugraphic, filename = c2tx20020a-t3.gif, Δrm2(overall)), developed by the present authors’ group.54–56 The genetic model was subjected to Y-randomization operation and another metric cRp2 has been reported.57

The measure of goodness of fit of the discriminant classification model was characterized by Wilk's lambda (λ) statistics58 and Mahalanobis distance measure59 with the respective p-values. Other validation measures for the classification analysis include sensitivity (or true positive rate), specificity, precision, accuracy and F-measure criteria60 for the training as well as test set observations. The chi-square (χ2) parameter (higher value signifying greater separability) has been additionally determined for judging the quality of achieved discrimination. The popular receiver operating characteristics (ROC)60–64 analysis was performed on the obtained sensitivity and specificity parameters and the result have been validated employing the area under the ROC curve measure. We have also computed two more recently described validation parameters for the ROC analysis namely the ROC graph Euclidean distance (ROCED) and the ROC graph Euclidean distance corrected with Fitness Function (FIT(λ)) (ROCFIT).65 Other than these, the Matthews correlation coefficient (MCC)66 has been reported. We have additionally determined the pharmacological distribution diagram (PDD)67 with the aim of providing a visual representation of the obtained classification between toxic and non-toxic compounds. Another validation parameter ρ (ratio between the numbers of compounds in the training set to the final descriptor in the model) was reported for the classification model. The theory of the validation parameters can be found at the ESI.

Results and discussion

Results obtained from the discriminant analysis

Model developed using best discriminant function. Considering the toxicity of toluene as the threshold, we obtained a priori class of 101 toxic (pEC50 > 3.670 mol L−1) and 46 non-toxic (pEC50 < 3.670 mol L−1) compounds, both containing training as well as test set compounds. A pool of 33 indicator variables and 31 other various two-dimensional descriptors, selected by stepwise MLR technique, was used as input to build the discriminant model using the training set compounds. The pool of 64 variables was further subjected to stepwise selection during execution of LDA, and the best discriminant function obtained consisted of 12 variables characterized by an appreciable composite Wilk's λ value of 0.298, and F (Fischer's statistics) value of 19.0244 (df = 12, 97). Table 1 shows the best twelve descriptors selected by the discriminant technique with their corresponding coefficient values and individual Wilk's λ values. It is interesting to note that the range of Wilk's λ value for individual 12 descriptors possesses a minimum value of 0.317 and a maximum value of 0.400 which are quite satisfactory signifying that the descriptors are of good discriminatory property. Eqn (1) is also characterized by an encouraging χ2 value of 123.422 (df = 12) and canonical regression coefficient (R) value of 0.838, which are significantly high and hence corresponding to good quality of classification model. The ρ value for the discriminant model is also appreciable (ρ = 9.167) showing the number of descriptors in final model is about one-ninth to the number of training set compounds. The discriminant equation (eqn (1)) has been used to calculate the discriminant function (DF) value for all compounds from which the pharmacological distribution diagram (PDD) was developed for training and test set compounds.
 
ugraphic, filename = c2tx20020a-t4.gif(1)
Table 1 The best twelve descriptors obtained from the LDA model (at a priori probability level of 50%) with their corresponding coefficients, Wilk's λ values, F-values and p-values
Variables Coefficient Wilk's λ F-remove (df = 1, 97) p-Value
MSD(cation) 6.658 0.400 33.022 0.0000
FeCl4 1.226 0.348 16.353 0.0001
SaasC(cation) 1.555 0.346 15.440 0.0002
MOC2MIM 0.988 0.333 11.288 0.0011
C4MMPy −0.857 0.317 6.033 0.0158
MLOGP(anion) 1.274 0.337 12.771 0.0006
Σα(cation) −4.597 0.345 15.279 0.0002
C4(CH3)2N-Py 0.807 0.318 6.579 0.0118
Cap 1.462 0.361 20.387 0.0000
C8MIM 0.663 0.314 5.035 0.0271
12OSO3 1.017 0.329 10.128 0.0020
C6MPy 0.653 0.311 4.226 0.0425
Constant −3.638      


The compounds were further classified into three groups based on their individual discriminant function scores. Those showing DF values greater than 0.5 were assigned as toxic, those with less than 0, i.e., negative values were assigned as non-toxic, and those with values between 0 and 0.5 were assigned in the group of undetermined toxicity. In all the cases the term toxicity corresponds to the V. fischeri toxicity of the ionic liquids with respect to toluene as a threshold. This treatment was performed separately on priorly defined higher toxic and lower toxic groups maintaining the training and test set division of compounds. Hence, from this treatment of grouping, we get the results for training and test set compounds separately for the a priori defined more toxic compounds (46 observations), and less toxic compounds (101 observations). Eqn (1) is characterized by eight indicator parameters of which five are for cationic and three are for anionic species, and four two-dimensional descriptors, of which three belonging to the cationic part and one for the anionic part of the analyzed ionic liquids. The cationic indicator variable C4MMPy and extended topochemical atom (ETA) parameter Σα(cation) for cations bear negative coefficients. Presence of the former parameter C4MMPy, the butyldimethylpyridinium moiety, in an ionic liquid will reduce its toxicity, whereas the ETA parameter Σα(cation) signifies that increase in molecular bulk of cations causes reduction of V. fischeri toxicity of ionic liquids. On the other hand, we have ten more parameters in the best discriminant equation (eqn (1)) reflecting their positive impact in raising toxicity of ILs to V. fischeri. Occurrence of the cationic fragments methoxyethylmethylimidazolium (MOC2MIM), butyl(dimethylamino)pyridinium (C4(CH3)2N-Py), octylmethylimidazolium (C8MIM), and hexylmethylpyridinium (C6MPy); anionic fragments tetrachloroferrate(III) (FeCl4), caprylate (Cap), and o-dodecylsulfate (12OSO3) in structures of ionic liquids cause increase in the toxicity. Three additional topological, connectivity, and thermodynamic parameters in eqn (1) show their positive impact towards toxicity of ILs. MSD(cation) (mean square distance), a parameter derived from the distance matrix, decreases with increased branching in compounds; hence higher values of MSD, i.e., reduced branchedness of the cations has a propensity of raising toxicity. SaasC(cation), an electrotopological state parameter which represents an aromatic carbon atom attached to two other aromatic carbons and one non-hydrogen atom (ugraphic, filename = c2tx20020a-u1.gif, the C-atom shown in dot), refers to the aromaticity of carbon atoms. Here, it indicates that carbon atom of such type in cations will positively contribute towards toxicity. MLOGP(anion), the Moriguchi octanol–water partition coefficient, indicates the proportional relationship of lipophilicity of anions towards toxicity ILs to V. fischeri. The structural information provided by MSD and Σα for cations in eqn (1) correspond to each other. Toxicity of ionic liquids to V. fischeri increases when the value of MSD rises and that of Σα reduces. Now, increase of MSD refers to reduced branching, whereas reduction in values of Σα refers to reduced molecular bulk. The hydrophobicity parameter MLOGP for the anionic fragment is directly related to the toxicity; hence, increase in hydrophobicity of anions causes increase in toxicity. The cationic indicator parameters MOC2MIM and C8MIM explain the effect of imidazolium moiety, and C4(CH3)2N-Py, C6MPy, and C4MMPy show the importance of pyridinium nucleus. Hence, from the present study, it is evident that V. fischeri toxicity is explainable in terms of presence of heterocyclic ring with nitrogen atom.

A contribution plot was developed, shown in Fig. 1, for the best discriminating descriptors (eqn (1)) by taking the product of their average values with their corresponding coefficients. From the plot, it can be observed that the cationic indicator parameters MOC2MIM, C4MMPy, C4(CH3)2N-Py, C8MIM, C6Mpy and the anionic parameter 12SO3 share almost equal contributions towards toxic and non-toxic groups, and the other parameters show their effects on both the groups. The ETA parameter Σα(cation) seems to be more important for negative contribution to the toxic group. The variable MSD(cation) bears a marked positive contribution for the toxic class, whereas, the cationic electrotopological state atom parameter SaasC(cation), the anionic indicator variable Cap, FeCl4 and the lipophilicity parameter MLOGP(anion) show more positive contributions towards the toxic group than a negative effect on the non-toxic group. This plot (Fig. 1) demonstrates the marked importance of molecular branching (MSD(cation)) and molecular bulk (Σα(cation)) for predicting the toxicity of ionic liquids towards V. fischeri.


Contribution plot of the variables present in the classification model (eqn (1)).
Fig. 1 Contribution plot of the variables present in the classification model (eqn (1)).

A pharmacological distribution diagram (PDD) was developed between the expectancy values and discriminant function ranges in the form of a frequency distribution diagram. We have identified the training and test sets separately within the priorly defined toxic and non-toxic groups. Fig. 2 shows that the bars representing those of toxic groups and non-toxic groups lie in the ranges of positive and negative values of DF respectively with less degree of overlap, and the maximum values of expectancy for toxic and non-toxic groups lie in the different sides of DF = 0; this infers a good separation between the two classes. The plot also explains that among the classified toxic chemicals, the maximum toxicity value is obtained at DF = 5 for the training set and DF = 6 for the test set and for the non-toxic class, the maximum value is obtained at DF = −4 for both the training and test sets.


Pharmacological distribution diagram between the expectancy values and discriminant function (DF) intervals for the training and test sets.
Fig. 2 Pharmacological distribution diagram between the expectancy values and discriminant function (DF) intervals for the training and test sets.

Additional validation parameters namely undetermined toxicity, false toxicity/non-toxicity, overall accuracy, and adjusted accuracy, calculated for the training as well as test sets for both the toxic and non-toxic groups have been indicated in Table S2 in the ESI.

Receiver operating characteristics (ROC) analysis on discriminant model. The developed discriminant model showed encouraging quality for different classification validation parameters namely sensitivity, specificity, accuracy, precision and F-measure for both the training and test set compounds as shown in Table 2. It can be observed that the model is more predictive for the non-toxic entries (with respect to toluene) of the test set than for the toxic ones as shown by a specificity value of 86.36% and sensitivity value of 60%.
Table 2 Statistical measures obtained from the ROC analysis
Division of compounds N Sensitivity (%) Specificity (%) Accuracy (%) Precision (%) F-measure (%)
Training set 110 87.10 96.20 93.64 90.00 88.52
Test set 37 60.00 86.36 75.68 75.00 66.67


We have generated a ROC plot separately for the training and test set compounds. From Fig. 3, it can be observed that all the points lie in the upper left triangle of the ROC space for both the training and test sets, and hence the performance of the model is assumed to be better than random classifier. In case of the training set, the points are more closely aligned to 100% sensitivity as the curve rises upward and the curve possesses an AUC value of 0.976 square units which is highly appreciable (Fig. 3). For the test set, the points are near the liberal zone and also carry an acceptable value of AUC of 0.852 square units.


Receiver operating characteristics (ROC) curve of the classification model for the training and test set compounds.
Fig. 3 Receiver operating characteristics (ROC) curve of the classification model for the training and test set compounds.

The ROCED parameter was computed using the Euclidean distance values of the training and test set results. The parameter ROCED bears a value of 0 for perfect classifier and 4.5 for random classifier; a value within 2.5 was proposed as good classifier by Pérez-Garrido et al.65 Our model showed a value of 1.021 for ROCED, which corresponds to a good quality of the ROC analysis. ROCFIT was calculated by dividing the ROCED with Wilk's λ value, and an acceptable value of 3.426 was obtained. The Matthews correlation coefficient (MCC) usually varies from −1 to +1 referring to an inverse classification to a perfect classification respectively, whereas a value of 0 corresponds to random classification performance. The present study also showed an acceptable value for the Matthews correlation coefficient (MCC); a value of 0.841 for the training and 0.486 for the test set. The training set shows a near perfect classifier value of MCC (0.841) which is better than that for the test set, where the value is still well above a random classification instance (0.486).

Results obtained from the regression analysis

We developed two predictive models following PLS and MLR algorithms. The details of the PLS model have been discussed in the ESI while that of the MLR model obtained from GFA (spline) technique (the best regression model obtained from this study) is described below. Definitions of selected variables appearing in both the models have been presented in Table S5 in the ESI.
Compliance with the OECD guidelines. We have thoroughly followed and attempted to justify our regression model in the light of the proposed OECD guidelines.68,69 The present study involves a dataset of ionic liquids with specific toxicity endpoint to Vibrio fischeri31 and that complies with first principle of OECD (i.e., a defined endpoint). The techniques employed for model development i.e., stepwise MLR, PLS and GFA method are very direct, transparent and reproducible and thereby obeys the unambiguity criterion of OECD principle 2. We also determined the domain of applicability defining the chemical space (OECD principle 3). Various statistical metrics describing the goodness-of-fit (R2, Ra2), robustness of the training set (Q2,ugraphic, filename = c2tx20020a-t5.gif, Δrm2(LOO)) and predictivity of the test set (R2pred,ugraphic, filename = c2tx20020a-t6.gif, Δrm2(test)) have been reported as described in OECD principle 4. Finally, corresponding to the OECD principle 5 (a mechanistic interpretation), we have attempted to explain the effect of the descriptors appeared.
Statistical quality of the model and interpretation of the variables. The model developed using GFA (spline) operation is shown in eqn (2). Eqn (2) is capable of explaining 65.10% of the predicted and 67.60% of the explained variance of the training set.
 
ugraphic, filename = c2tx20020a-t7.gif(2)

The external predictivity metric R2pred is also highly encouraging and it accounts for 73.90% predicted variance for the test set compounds. All the rm2 metric values are well in the acceptable region. The average rm2 (ugraphic, filename = c2tx20020a-t8.gif) value for external validation (ugraphic, filename = c2tx20020a-t9.gif= 0.623) of this model is higher than that for internal validation (ugraphic, filename = c2tx20020a-t10.gif), thereby explaining better external predictivity of the model. Theugraphic, filename = c2tx20020a-t11.gifvalue (0.554) is also acceptable for describing the overall predictive performance of the model. The Δrm2 values for external (Δrm2(test) = 0.156) and overall validation (Δrm2(overall) = 0.195) are acceptable excepting internal validation (Δrm2(LOO) = 0.209), where it is just above the proposed tolerance limit of 0.2. Additionally, the value of the metric for randomization, cRp2, is also greater than the threshold value of 0.5. The results of process and model randomization tests are impressively characterized by cRp2 values of 0.634 and 0.669 respectively. Fig. 4 indicates the observed toxicity values in the abscissa plotted against calculated (training set) and/or predicted (test set) toxicity values in the ordinate in the form of a scatter plot using eqn (2). The diagram shows uniform scattering of test and training set chemicals.


Observed vs. calculated (training set)/predicted (test set) toxicity plot for the regression model (eqn (2)).
Fig. 4 Observed vs. calculated (training set)/predicted (test set) toxicity plot for the regression model (eqn (2)).

Any negative value of a spline function appearing in a GFA equation indicates zero value i.e., no contribution of the variable. All the variables in eqn (2) have been ranked according to their standardized coefficients in a descending order referring most prior variable appearing first and so forth. Eqn (2) shows presence of five descriptors, three among them correspond to the effect of cations and the remaining two describe the effect of anions on V. fischeri toxicity. The most significant descriptor, as ranked by the standardized coefficient values, is the spline form of ‘GMTI’, indicating that a value of ‘GMTI’ for the cations less than 2305 will be negatively contributing towards toxicity. The parameter specifically accounts for better measure of branching for cyclic structure than acyclic ones.34 Increased branching causes reduction in ‘GMTI’ values. Hence, considering the negative impact of the spline term (<2305−GMTI>) it can be inferred that toxicity will be reduced with raised branchedness among cationic groups up to a ‘GMTI’ value less than 2305. The next important descriptor is the valence connectivity index of fifth order (5χv)34 indicating branching among cationic groups with five bond fragments. The parameter actually refers to the effect of molecular size, because smaller cationic groups having less than five contiguous bond fragments will not have its effect, unlike larger length cyclic or acyclic cations where values of the parameter 5χv(cation) can be calculated. Being a valence connectivity index, the parameter 5χv also shares information from heteroatoms and multiple bonds. A lower value of 5χv chiefly represents reduced molecular size that can be attributed to reduced chain length. As 5χv bears a negative coefficient, toxicity of ILs will decrease if 5χv increases i.e., if the molecular size is raised in terms of acyclic straight or branched chain mostly in the form of side chain substituents to cations where the opportunity of presence of five and more contiguous bonds lies. Next, we have the extended topochemical atom (ETA) index [η′]local for anions that corresponds to the contribution of locally bonded atoms.35–37 Here, local boding refers to the atoms connected with single covalent bonds, i.e., contributions of atoms separated by two or more covalent bonds and any non-bonded interactions are not taken in consideration for this specific index. It represents the local molecular topological nature and local branching for each atom and then sums for all atoms, and its value increases as branching rises. Hence, because of the positive coefficient of [η′]local, reduction in branching of anionic groups will contribute to the reduced toxicity of ionic liquids. The variable ‘NsOH’ is an electrotopological state atom index count of hydroxyl (–OH) group. Here, the negative coefficient of ‘NsOH’ represents that reduction of V. fischeri toxicity value can be achieved with an increased number of hydroxyl group in cations. This reduction of toxic effects with the increase of –OH groups can be attributed to raised hydrophilicity of the cations due to hydroxyl groups, which hinders them to pass through biological membranes. The final parameter 4χs(anion) gives information on the anionic groups. 4χs is the solvation connectivity index of fourth order bond fragment and it describes the solvation entropy and dispersion interactions of molecules34 in solution phase. It also describes branching characteristics for fourth order bond fragments, and such long fragments make 4χs to be related to molecular size as well as molecular surface area. Now, as this parameter is weighted by principal quantum number term, it can be considered as a measure of electronegativity that controls the hydrophilicity of a molecule. Here, it is negatively related to toxicity; hence, a higher value signifying higher electronegativity will reduce toxicity. If electronegativity of a molecule is higher, it will get hydrophilic because of greater separation of molecular charge, and its penetration across biological membrane reduces.

We used the leverage method to define the AD using eqn (2).70Fig. 5 shows the Williams plot,71 where three standard deviation (σ) unit of standardized cross-validated residual value on both positive and negative sides of zero has been identified for detection of any response outlier compounds. The calculated critical HAT value in this study is 0.200. From Fig. 5 it can be observed that the training set compound with serial number 114 possesses standardized cross-validated residual value greater than 3σ unit, but it lies within the critical HAT (h*) value of 0.2. This compound may have erroneous observed toxicity value and it can be identified as an outlier due to wrong value of response variable, i.e., a response outlier. Compounds with serial numbers 75, 85, 139 and 140 belonging to the training set are characterized by high leverage values, but they lie within the assumed 3σ limit of the ordinate. These chemicals are not outliers, but influence the applicability domain to span over larger area and hence can be termed as influential chemicals. The test set compound with serial number 137 is within the 3σ limit of the ordinate, but possess higher leverage value (h > h*). As we can see from Fig. 5 that compound 137 (serial number) is surrounded by other four influential training set chemicals (serial numbers 75, 85, 139 and 140), its prediction can be assumed to be somewhat reliable or dependable.


Williams plot showing the domain of applicability of the regression model (eqn (2)).
Fig. 5 Williams plot showing the domain of applicability of the regression model (eqn (2)).

Discussion on the classification and regression models

The results obtained from the QSAR based classification (discriminant) and the regression models are comparable to each other. The classification model comprises of eight indicator variables that are representative of the structural features of the respective cationic or anionic groups of selected dataset required to perform good discrimination; the corresponding structures are presented in Fig. S3 in the ESI. Apart from that, the discriminant function DF has been correlated with four other two-dimensional variables which confer similar type of conclusion as observed in the regression model. The mean square distance parameter (MSD) refers that increasing branching among cations causes reduction in Vibrio fischeri toxicity of ionic liquids which is similar to the inference obtained from the spline term of the variable GMTI (<2305−GMTI>) in the regression model. In the latter case, a specified range of the descriptor GMTI is obtained for cations which may be very useful for designing future compounds. The regression model parameters 4χs and NsOH explained the hydrophilicity of cations, with the former specifically related to the entropy of solvation and dispersion interaction between solute and solvent. For both the descriptors, an increase of hydrophilicity has been found to be related with reduced toxicity for the cationic groups. The 4χs variable specifically applies to those cations having four or greater bond length molecular fragments. On the other hand, the classification model shows that the hydrophobicity parameter MLOGP for anions is positively related to toxicity signifying a reduction in toxicity with reduction in hydrophobicity of anions. Hence, increase in hydrophilicity has been identified as determined factor for both cases of anions (classification model) and cations (regression model) which confers the theoretical possibility of lowering toxicity due to attenuation of penetration of compounds through biological membranes. The regression model (eqn (2)) parameter 5χv(cation) confers that raised molecular size of cations will contribute to lowered toxicity and the discriminant variable Σα(cation) predicted that toxicity is reduced with increase in molecular bulk of cations which in turn denotes increase in molecular size. Hence, these parameters corroborate each other and 5χv specifically identifies the increase of molecular fragment in a fashion of five or more contiguous bond lengths. The compounds which were excluded in QSAR analysis due to lack of exact experimental data, have been predicted using eqn (2) shown in Table S2 in the ESI.

Comparison with previous work

Comparison of the present work has been performed with other previously reported studies on modeling toxicity of ionic liquids to Vibrio fischeri. Bruzzone et al.27 developed a QSTR model to predict V. fischeri toxicity of a class of ILs based on halides employing the CODESSA treatment. The authors considered the cationic structures of 33 ILs for model development and reported a solvent phase model with R2 = 0.765, R2cv = 0.739, and a gas phase model with R2 = 0.745, R2cv = 0.717; however, no external validation was reported. Luis et al.72 developed a model for 43 ILs using group contribution method, and reported R2 and Ra2 values of 0.925 and 0.907 respectively; no external validation was made in this study as well. Luis et al.29 worked on a dataset comprising of 9 types of cationic and 17 anionic species and reported a regression model showing 15 variables for 96 compounds, characterized by a R2 value of 0.924 and Ra2 value of 0.911 with no external prediction. Another QSAR model for V. fischeri toxicity was developed by Couling et al.26 with 28 compounds showing a value of R2 = 0.887, but the author reported that the model failed in producing good predictive results for test set compounds. Guerra and Irabien31 developed PLS-DA classification models for 148 ionic liquids using presence or absence of the ionic species only as variables. The authors reported a model with sensitivity of 100% for both the training and test sets, and specificity values of 94.6% and 90% for the training and test sets respectively considering toluene as reference toxic chemical. Viboud et al.28 adopted a multifactorial component analysis (MCA) approach on V. fischeri toxicity values of 54 ionic liquids and reported correlation coefficient values after separately plotting several parameters namely total number of carbons (R2 = 0.861), total alkyl carbons (R2 = 0.934) etc. against IL toxicity.

The present work involves development of a classification as well as a regression model for the prediction of toxicity of ILs to Vibrio fischeri. The dataset31 contains diverse 147 ionic liquids comprising of 30 anions and 64 cations. The most important aspect of our work is that we have performed rigorous validation of both the classification and regression models by applying multiple strategies and in both the cases encouraging results have been obtained for both external validation (compounds not used for model development) and internal validation (compounds used for model development). The classification model is characterized by highly acceptable sensitivity and specificity values of 87.10% and 96.20% for the training set, and 60% and 86.36% for the test set respectively. The values are in a good state of comparison with the results reported in the source dataset.31 The training set showed better overall accuracy and precision values (93.64% and 90% respectively) than the test set (75.68% and 75%) (as mentioned in Table 2). The discriminant model was also explained as of good quality in terms of Wilk's λ statistics (0.298). The best regression model reported in the present study contains much more number of compounds than most of the previous reports. The training set comprises of 90 compounds and, compared to previous reports, we noted additional validation parameters like Q2 (leave-one-out cross-validation) and rm2 metrics besides the quality parameters (R2 or Ra2). The model is characterized by good external predictivity with an appreciable R2pred value of 0.718 for 36 compounds, which has not been reported for the previous models. The rm2 metrics73 were determined for the training, test, as well as the overall sets. Moreover we have performed randomization for the genetic process and for the model as well to obviate the possibility of chance correlation of the model; the process showed encouraging values of the metric cRp2 (Eqn (2)). It is obvious that a true comparison cannot be performed among studies where the dataset composition varies. However, prediction of results must be assessed in terms of higher statistical stability and reliability. The models developed by us consist of appreciable number of observations and they have undergone a good amount of statistical tests, and hence can be considered as statistically more reliable, predictive, robust to other proposed models on the toxicity of ionic liquids to Vibrio fischeri.

Conclusion

Toxicity due to chemicals is an integrated part of modern development in industrialization. The search for less toxic (if not non-toxic) eco-friendly chemicals is an issue of much importance. Room temperature ionic liquids have proved to be good alternative solvents to exploit a wide variety of industrial operations as ‘Green solvents’ due to their minimal release to the environment. However, they may show toxicity to different organisms of the ecosystem in case of accidental release to the environment. By employing the QSAR approach, it is possible to design suitable ionic liquids with reduced toxicity profile. The present study demonstrates an approach of modeling Vibrio fischeri toxicity of a large number of diverse ionic liquids, and interprets the structural factors responsible for such toxic manifestations. By using the developed regression and classification models, toxicity of ionic liquids may be predicted and this information may help in prioritization ranking of ILs for any toxicity determination experimentation. In this way, one may also reduce the amount of animal experimentation. In the present study, the developed models have undergone extensive statistical validation and have found to be acceptable in terms of robustness and predictivity. In addition to that, we have attempted to justify our regression model in terms of the proposed OECD guidelines for QSAR model development. The best regression model has also been subjected to comparison studies with some earlier developed models with varying dataset composition using the same endpoint; the result reported by us bears more statistical confidence especially with respect to external validation which has not been focused on in other studies. The best regression model in this study chiefly depicts an inverse relationship of toxicity with branching of cations up to a certain value (evidenced by the topological parameter GMTI), raised molecular size and presence of hetero atoms i.e. increased electronegativity on cations (shown by 5χv), presence of hydroxyl groups on cations (indicated by NsOH) and solvation entropy of anionic fragment (presented by 4χs). The ETA index [η′]local additionally shows the positive impact of local topological contribution of anionic fragments towards toxicity. Reduction of ionic liquid toxicity with raised branching is also observed from the classification model (eqn (1)). Apart from that, eqn (1) also finds positive relationship of cationic aromaticity (shown by SaasC) and anionic lipophilicity (shown by MLOGP) with toxicity, and a negative relationship of volume of cation with toxicity. Any untested new ionic liquid analogue falling within the defined applicability domain of the model, can be successfully predicted for its possible toxic effects. All these structural information along with the indicator parameters mentioned can be highly useful for designing new suitable ionic liquid. Most importantly, the model reported in this present study involves simple two-dimensional topological and other connectivity indices which are easily calculable using less computation time by avoiding any sort of energy minimization operations. Hence, the models proposed in this study may be used satisfactorily to predict toxicities of ionic liquids to Vibrio fischeri in the design process of ‘greener ionic-liquids’.

Acknowledgements

Financial assistance to KR in the form of a major research proposal from the Council of Scientific and Industrial Research (CSIR), New Delhi and a fellowship to RND from the University Grants Commission (UGC) are thankfully acknowledged.

References

  1. P. A. Anastas and J. C. Warner, Green Chemistry: Theory and Practice, Oxford University Press, Oxford, 1998 Search PubMed.
  2. M. J. Earle, J. M. S. S. Esperanc, M. A. Gilea, J. N. C. Lopes, L. P. N. Rebelo, J. W. Magee, K. R. Seddon and J. A. Widegren, Nature, 2006, 439, 831–834 CrossRef CAS.
  3. D. J. Couling, R. J. Bernot, K. M. Docherty, J. K. Dixon and E. J. Maginn, Green Chem., 2006, 8, 82–90 RSC.
  4. A. Stark and K. R. Seddon, in Kirk-Othmer Encyclopaedia of Chemical Technology, ed. A. Seidel, John Wiley & Sons Inc., New Jersey, 2007, vol. 26, pp. 836–920 Search PubMed.
  5. M. C. Buzzeo, R. G. Evans and R. G. Compton, ChemPhysChem, 2004, 5, 1106–1120 CrossRef CAS.
  6. J. F. Liu, G. B. Jiang and J. A. Jönsson, TrAC, Trends Anal. Chem., 2005, 24, 20–27 CrossRef CAS.
  7. M. J. Earle, N. V. Plechkova and K. R. Seddon, Pure Appl. Chem., 2009, 81, 2045–2057 CrossRef CAS.
  8. J. F. Huang, H. M. Luo, C. D. Liang, D. E. Jiang and S. Dai, Ind. Eng. Chem. Res., 2008, 47, 881–888 CrossRef CAS.
  9. S. Ho Ha, R. N. Menchavez and Y.-M. Koo, Korean J. Chem. Eng., 2010, 27, 1360–1365 Search PubMed.
  10. M. Freemantle, Chem. Eng. News, 1998, 76, 32–37,  DOI:10.1021/cen-v076n013.p032 , Accessed May 2012.
  11. N. V. Plechkova and K. R. Seddon, in Methods and Reagents for Green Chemistry: An Introduction, ed. P. Tundo, A. Perosa and F. Zecchini, Wiley-Interscience, 2007, ch. 5, pp. 105–130 Search PubMed.
  12. J. Ranke, S. Stolte, R. Störmann, J. Arning and B. Jastorff, Chem. Rev., 2007, 107, 2183–2206 CrossRef CAS.
  13. M. Deetlefs and K. R. Seddon, Green Chem., 2010, 12, 17–30 RSC.
  14. T. P. T. Pham, C.-W. Cho and Y.-S. Yun, Water Res., 2010, 44, 352–372 CrossRef.
  15. R. F. M. Frade and C. A. M. Afonso, Hum. Exp. Toxicol., 2010, 29, 1038–1054 CrossRef CAS.
  16. K. Roy, Expert Opin. Drug Discovery, 2007, 2, 1567–1577 Search PubMed.
  17. OECD Quantitative Structure–Activity Relationships Project [(Q)SARs]. http://www.oecd.org/document/23/0,3746,en_2649_34377_33957015_1_1_1_1,00.html, Accessed May 2012.
  18. D. J. Couling, R. J. Bernot, K. M. Docherty, J. K. Dixon and E. J. Maginn, Green Chem., 2006, 8, 82–90 RSC.
  19. M. Matzke, S. Stolte, K. Thiele, T. Juffernholz, J. Arning, J. Ranke, U. W. Biermann and B. Jastorff, Green Chem., 2007, 9, 1198–1207 RSC.
  20. M. T. Garcia, N. Gathergood and P. J. Scammells, Green Chem., 2005, 7, 9–14 RSC.
  21. C. Pretti, C. Chiappe, I. Baldetti, S. Brunini, G. Monni and L. Intorre, Ecotoxicol. Environ. Saf., 2009, 72, 1170–1176 CrossRef.
  22. X. Y. Li, J. Zhou, M. Yu, J. J. Wang and Y. C. Pei, Ecotoxicol. Environ. Saf., 2009, 72, 552–556 CrossRef CAS.
  23. S. Stolte, M. Matzke, J. Arning, A. [B with combining umlaut]oschen, W.-R. Pitner, U. Welz-Biermann, B. Jastorff and J. Ranke, Green Chem., 2007, 9, 1170–1179 RSC.
  24. A. Romero, A. Santos, J. Tojo and A. Rodŕıguez, J. Hazard. Mater., 2008, 151, 268–273 CrossRef CAS.
  25. S. M. Steinberg, E. J. Poziomek, W. H. Engelmann and K. R. Rogers, Chemosphere, 1995, 30, 2155–2197 CrossRef CAS.
  26. D. J. Couling, R. J. Bernot, K. M. Docherty, J. K. Dixon and E. J. Maginn, Green Chem., 2006, 8, 82–90 RSC.
  27. S. Bruzzone, C. Chiappe, S. E. Focardi, C. Pretti and M. Renzid, Chem. Eng. J., 2011, 175, 17–23 CrossRef CAS.
  28. S. Viboud, N. Papaiconomou, A. Cortesi, G. Chatel, M. Draye and D. Fontvieille, J. Hazard. Mater., 2012, 215–216, 40–48 Search PubMed.
  29. P. Luis, A. Garea and A. Irabien, J. Mol. Liq., 2010, 152, 28–33 CrossRef CAS.
  30. S. P. M. Ventura, C. S. Marques, A. A. Rosatella, C. A. M. Afonso, F. Gonc-alves and J. A. P. Coutinho, Ecotoxicol. Environ. Saf., 2012, 76, 162–168 Search PubMed.
  31. M. Alvarez-Guerra and A. Irabien, Green Chem., 2011, 13, 1507–1516 RSC.
  32. P. Mitteroecker and F. Bookstein, Evolutionary Biology, 2011, 38, 100–114 Search PubMed.
  33. K. V. Mardia, J. T. Kent and J. M. Bibby, Multivariate analysis, Academic Press, London, 1979 Search PubMed.
  34. R. Todeschini and V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, 2000 Search PubMed.
  35. K. Roy and G. Ghosh, Internet Electron. J. Mol. Des., 2003, 2, 599–620 Search PubMed.
  36. K. Roy and R. N. Das, J. Hazard. Mater., 2010, 183, 913–922 Search PubMed.
  37. K. Roy and R. N. Das, SAR QSAR Environ. Res., 2011, 22, 451–472 Search PubMed.
  38. Dragon ver. 6 is a software from TALETE srl, Italy, 2010 http://www.talete.mi.it/products/dragon_description.htm, Accessed May 2012.
  39. K. L. E. Kaiser and V. S. Palabrica, Water Poll. Res. J. Canada, 1991, 26, 361–431 Search PubMed.
  40. P. P. Roy, J. T. Leonard and K. Roy, Chemom. Intell. Lab. Syst., 2008, 90, 31–42 Search PubMed.
  41. B. Everitt, S. Landau and M. Leese, Cluster Analysis, Arnold, London, 2001 Search PubMed.
  42. R. B. Darlington, Regression and Linear Models, McGraw-Hill, New York, 1990 Search PubMed.
  43. J. Holland, Adaptation in Artificial and Natural Systems, University of Michigan Press, Ann Arbor, MI, 1975 Search PubMed.
  44. J. Friedman, Multivariate Adaptive Regression Splines, Technical Report No. 102, Laboratory for Computational Statistics, Department of Statistics, Stanford University, Stanford, CA, Nov 1988 (revised Aug 1990).
  45. D. Rogers and A. J. Hopfinger, J. Chem. Inf. Comput. Sci., 1994, 34, 854–866 CrossRef CAS.
  46. S. Wold, in Chemometric Methods in Molecular Design, ed. H. van de Waterbeemd, VCH, Weinheim, 1995, pp. 195–218 Search PubMed.
  47. MINITAB is a statistical software of Minitab Inc., USA; software available at http://www.minitab.com/en-US/default.aspx, Accessed May 2012.
  48. SPSS is a statistical software of SPSS Inc., USA. http://www.spss.com, Accessed May 2012.
  49. Cerius 2 version 4.10 is a product of Accelrys Inc., San Diego, CA, USA, 2005; software available at http://www.accelrys.com/, Accessed May 2012.
  50. STATISTICA is a statistical software of STATSOFT Inc., USA; software available at http://www.statsoft.com/, Accessed May 2012.
  51. G. W. Snedecor and W. G. Cochran, Statistical Methods, Oxford & IBH, New Delhi, 1967 Search PubMed.
  52. D. M. Hawkins, S. C. Basak and D. Mills, J. Chem. Inf. Comput. Sci., 2003, 43, 579–586 CrossRef CAS . http://dx.doi.org/10.1021/ci025626i, Accessed May 2012.
  53. G. Schuurmann, R. U. Ebert, J. Chen, B. Wang and R. Kuhne, J. Chem. Inf. Model., 2008, 48, 2140–2145 CrossRef . http://dx.doi.org/10.1021/ci800253u, Accessed May 2012.
  54. P. P. Roy, S. Paul, I. Mitra and K. Roy, Molecules, 2009, 14, 1660–1701 Search PubMed.
  55. I. Mitra, P. P. Roy, S. Kar, P. K. Ojha and K. Roy, J. Chemom., 2010, 24, 22–33 CrossRef CAS.
  56. P. K. Ojha, I. Mitra, R. N. Das and K. Roy, Chemom. Intell. Lab. Syst., 2011, 107, 194–205 Search PubMed.
  57. R. Todeschini, Milano Chemometrics, Italy (Personal Communication), 2010.
  58. S. S. Wilks, Biometrika, 1932, 24, 471–494.
  59. P. C. Mahalanobis, Proc. Natl. Acad. Sci. U. S. A., 1936, 12, 49–55 Search PubMed.
  60. T. Fawcett, Pattern Recognit. Lett., 2006, 27, 861–874 CrossRef.
  61. J. P. Egan, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception, Academic Press, New York, 1975 Search PubMed.
  62. A. P. Bradley, Pattern Recognit., 1997, 30, 1145–1159 Search PubMed.
  63. R. P. Sheridan, S. B. Singh, E. M. Fluder and S. K. Kearsley, J. Chem. Inf. Comput. Sci., 2001, 41, 1395–1406 Search PubMed.
  64. D. A. Pearlman and P. S. Charifson, J. Med. Chem., 2001, 44, 502–511 CrossRef CAS.
  65. A. Pérez-Garrido, A. M. Helguera, F. Borges, M. N. D. S. Cordeiro, V. Rivero and A. G. Escudero, J. Chem. Inf. Model., 2011, 51, 2746–2759 Search PubMed.
  66. B. W. Matthews, Biochim. Biophys. Acta, 1975, 405, 442–451 Search PubMed.
  67. J. Gálvez, R. G. Domenech, C. de Gregorio-Alapont, J. V. de Julián-Ortiz and L. Popa, J. Mol. Graphics Modell., 1996, 14, 272–276 Search PubMed.
  68. J. Jaworska, M. Comber, C. Auer and C. J. van Leeuwen, Environ. Health Perspect., 2003, 111, 1358–1360 CrossRef.
  69. OECD Environment Health and Safety Publications Series on Testing and Assessment No. 69, Guidance Document on the Validation of (Quantitative) Structure–Activity Relationship [(Q)SAR] Models. http://www.oecd.org/officialdocuments/displaydocumentpdf/?cote=env/jm/mono(2007)2&doclanguage=en, Accessed May 2012.
  70. A. C. Atkinson, Plots, Transformations and Regression, Clarendon Press, Oxford, 1985 Search PubMed.
  71. P. Gramatica, QSAR Comb. Sci., 2007, 26, 694–701 CrossRef CAS.
  72. P. Luis, I. Ortiz, R. Aldaco and A. Irabien, Ecotoxicol. Environ. Saf., 2007, 67, 423–429 CrossRef CAS.
  73. K. Roy, I. Mitra, S. Kar, P. K. Ojha, R. N. Das and H. Kabir, J. Chem. Inf. Model., 2012, 52, 396–408 Search PubMed.

Footnote

Electronic supplementary information (ESI) available. See DOI: 10.1039/c2tx20020a

This journal is © The Royal Society of Chemistry 2012