Multicomponent ionic liquid CMC prediction

I. E. Kłosowska-Chomiczewska a, W. Artichowicz b, U. Preiss c and C. Jungnickel *a
aDepartment of Colloid and Lipid Science, Faculty of Chemistry, Gdańsk University of Technology, Narutowicza St. 11/12, Gdańsk 80-233, Poland. Tel: +48 58 347 2469
bDepartment of Hydraulic Engineering, Faculty of Civil and Environmental Engineering, Gdańsk University of Technology, Narutowicza St. 11/12, Gdańsk 80-233, Poland
cInterdisciplinary Centre for Advanced Materials Simulation (ICAMS), Ruhr-Universität Bochum, Universitätsstraße 150, Bochum 44780, Germany

Received 25th July 2017, Accepted 6th September 2017

First published on 11th September 2017

We created a model to predict the CMC of ILs based on 704 experimental values published in 43 publications since 2000. Our model was able to predict the CMC of a variety of ILs in binary or ternary systems in the presence of salt or alcohol. The molecular volume of the IL (Vm), solvent-accessible surface (Ŝ), solvation enthalpy (ΔsolvG), concentration of salt (Cs) or alcohol (Ca) and their molecular volumes (Vm,s and Vm,a, respectively) were chosen as descriptors, and kernel support vector machine (KSVM) and evolutionary algorithm (EA) regression were used to create the models. The data were split into training and validation sets (80:20) and subjected to bootstrap aggregation. KSVM provided the better fit, with an average R2 of 0.843 and MSE of 0.608, whereas EA resulted in an R2 of 0.794 and MSE of 0.973. The sensitivity analysis showed that Vm and Ŝ have the highest impact on IL micellization in both binary and ternary systems; surprisingly, however, in the presence of alcohol Vm becomes insignificant. Whether a descriptor stabilizes or destabilizes micelles depends on the additives. Previous attempts at modelling the CMC of ILs were generally limited to small numbers of ILs in simplified (binary) systems. Here, in contrast, we show successful prediction of the CMC over a range of different systems (binary and ternary).


Ionic liquids (ILs) are low-temperature molten salts which have received much attention in the past decade due to their tunable properties, and thereby their possible environmentally friendly applications. This tunability is the key to their success, as the addition or removal of a moiety or a carbon can significantly alter their environmental fate or technical properties.1 They are usually composed of large, asymmetric cations containing, for example, nitrogen groups (ammonium, imidazolium, pyridinium, piperidinium, pyrrolidinium, etc.) and a wide spectrum of anions, ranging from small inorganic (Cl, Br, F, etc.) to bulky organic (mandelates, prolinates, benzoates, etc.).

The cation and anion combinations result in a myriad2,3 of possible compounds. This creates the challenge of finding the correct compound for the correct application. To address this challenge, accurate prediction of the physicochemical properties of ILs is required. A number of attempts have been published to design ILs with a specified melting point,4–7 solubility,8,9 surface composition,10 surface tension,11 heat capacity,5,12 cloud point,13,14 density,12,15,16 viscosity,15–18 conductivity,16,18 and hydrophobicity.9,19 In addition, to reduce environmental impact, parameters such as toxicity,20–30 biodegradation,31 and soil sorption32–34 have been predicted.

One of the phenomenological parameters which is often chosen is the critical micelle concentration (CMC). This parameter provides information about a wide variety of other properties such as molar solubilization ratio, sorption and toxicity. In addition, the CMC is an easily measured, and often cited parameter, which makes it a perfect target value for prediction.

Previous attempts have been made to model and predict the CMC of ionic surfactants and ILs. Barycki et al.35 created a model to predict the CMC based on a dataset of only 59 ILs in 2016. The three chosen descriptors were GEometry, Topology, and Atom-Weights AssemblY (GETAWAY) descriptors, which resulted in an R2 of 0.959. The authors incorrectly stated that theirs was the first attempt to predict the CMC of ionic liquids, when in fact we had already done so in 2009. In that publication we developed a model based on molecular volume, solvent-accessible surface area, and various interaction enthalpies determined by COSMO-RS, using 36 ILs, with a resulting R2 of 0.994.36 Vishnyakov et al. presented a model to predict the CMC of non-ionic surfactants in binary solutions using dissipative particle dynamics simulations.37 Kardanpour et al. developed a model to predict the CMC of gemini surfactants; the data set included 94 CMC values, and topological, geometrical, functional-group and WHIM descriptors were used. The final equation, created using a wavelet neural network, contained 12 descriptors in the optimized model, and the highest R2 was 0.994.38 Jalali-Heravi et al. used multiple linear regression to model the CMC of cationic surfactants based on 30 literature CMC values for alkyltrimethylammonium and alkylpyridinium salts, using topological (Balaban and Randić indices), electronic (total energy of the molecule) and molecular structure descriptors (volume of the tail of the molecule, maximum distance between atoms, and surface area) together with a stepwise regression method.
The highest R2 of the models after cross-validation was 0.955.39 In another attempt, Roy and Kabir developed a model to predict the CMC of non-ionic surfactants in aqueous solutions based on 54 CMC values, using extended topochemical atom (ETA) and non-ETA indices as descriptors, and stepwise multiple linear regression (MLR), genetic function approximation (GFA) and partial least squares (PLS) as chemometric tools. PLS made it possible to avoid inter-correlation among the descriptors. The coefficient of determination R2 after external validation for the best ETA + non-ETA PLS model was 0.986.40 Huibers et al. created a model to predict the CMC of anionic surfactants (sodium alkyl sulfates and sodium sulfonates) based on 119 literature values. An R2 of 0.942 was obtained for a multiple linear model with three descriptors based on molecular topology and constitution.41 Katritzky et al. generated a CMC prediction model for 50 cationic surfactants (35 quaternary ammonium salts and 15 quaternary pyridinium salts) using molecular descriptors related to the size and charge of the hydrophobic tail and to the size of the head. They used best multilinear regression and a heuristic algorithm to determine the best multilinear models (mean R2 after cross-validation of 0.978), and a nonlinear artificial neural network to develop nonlinear regression models (mean R2 after cross-validation of 0.979).42 Huibers et al. used three topological descriptors (the size of the hydrophobic group, the size of the hydrophilic group, and the structural complexity of the hydrophobic group) to create a model to predict the CMC based on values for 77 nonionic surfactants.
Multiple linear regression analyses carried out with the heuristic algorithm resulted in an R2 of 0.984.43 In 2007, Gad used structural, topological and thermodynamic descriptors (namely molecular weight, hydrophobic-to-hydrophilic fragment molecular weight ratio, polarizability, log P, energy of hydration, surface area, and dipole moment) to create a model to predict the CMC based on 50 CMC values for nonionic surfactants. The models were created with principal component analysis (PCA) and multiple linear regression (MLR), with an R2 of 0.9889.44 Yuan et al. used four electronic, spatial and thermodynamic descriptors to create a model to predict the CMC based on 37 literature values for nonionic surfactants. As chemometric tools the authors used stepwise multiple linear regression analysis, multiple simple linear model analysis, multiple linear regression, and genetic function approximation analysis, giving an R2 of 0.990.45 They also created a similar model based on 37 anionic surfactants, and obtained an R2 of 0.996.46

All these models have the disadvantage of being built on limited data, usually no more than 100 data points. They typically rely on a single software package to calculate a plethora of descriptors, from which the subset with the most predictive value is eventually chosen. In addition, they have limited applicability because they always refer to binary solutions (i.e. IL + water).

Therefore, in 2013 we (Preiss and Jungnickel) extended the models for the first time to also include the effects of, in this case, salts. These ternary systems were effectively modelled with an R2 of 0.859 using 151 data points.47 However, we were still using descriptors based on quantum chemical calculations. In the paper of Cho et al., we (Preiss and Jungnickel) used a poly-parameter linear free energy relationship based on the Abraham equation. This approach is simpler than the previous ones because it does not require any quantum chemical calculations. The prediction had an R2 of 0.9949 for the IL/water binary system. The disadvantage, however, is the necessity of determining the Abraham descriptors experimentally for each compound.

The prediction of data for a single binary or ternary system is easier because, within one system, a given descriptor is generally responsible for a single effect. However, in systems in which additional components (salt or alcohol) are present, the magnitude of a descriptor's impact may change. Therefore, the aim of this paper was to compare several approaches to predicting the CMC of a large set of ILs (704 data points) over a wide range of conditions. Here we present a multi-methodological approach to predict the CMC of ILs in binary (IL and water) or ternary systems (IL and water with a monovalent inorganic salt or an alcohol), as shown in Fig. 1. We present several models that allow the prediction of the CMC based on an IL's molecular volume (Vm), solvent-accessible surface (Ŝ), solvation enthalpy (at infinite dilution, ΔsolvG), and temperature. For ternary systems, additional input variables are taken into account, namely the concentration of salt (Cs) or alcohol (Ca) and their molecular volumes (Vm,s and Vm,a, respectively).

Fig. 1 Schematic showing novelty of research, usually CMC predictions are based on single systems.


Data were collected from papers published over the last 16 years (since 2000). Each publication year was searched separately using the search terms “ionic liquid” and “CMC”. Google Scholar was used as the search engine, as it provides thorough coverage of terms.48 All experimental papers reporting a measured CMC were taken into account. As a result, a total of 43 publications were found, providing 704 data points for modelling. A multitude of methods for IL CMC determination were accepted, including tensiometry, conductometry, nuclear magnetic resonance spectroscopy, spectroscopy, small-angle neutron scattering, potentiometry, calorimetry, and turbidimetry. Only singly charged ILs were taken into consideration; ILs with multivalent ions or multiple charges (zwitterionic) were excluded.

Based on our previous experience,36,47 the following descriptors were considered: molecular volume of the IL (Vm, as a sum of the anion and cation, or separately), solvent-accessible surface (Ŝ, as a sum of the anion and cation, or separately), solvation enthalpy (ΔsolvG), concentration of salt (Cs) or alcohol (Ca) and their molecular volumes (Vm,s and Vm,a, respectively). The variables CMC, Cs, Ca and T were taken from the papers, while the other input variables were calculated for the purposes of this paper. The range of each of the parameters used for modelling is presented in Table 1. The complete data are given in Table S1 (ESI).

Table 1 Range and diversity of the parameters used to create the models to predict the CMC of ILs in the different systems. The input parameters clearly span a wide range and are well dispersed, and thus the resulting models have wide applicability

Parameter    CMC, mM   Vm, nm3   Ŝ, nm2    ΔsolvG, kJ mol−1   Cs, mM    Vm,s, nm3   Ca, mM    Vm,a, nm3   T, K
Min. value   0.01      199.04    221.36    −710.65            0.10      100.15      20.49     69.12       278.15
Max. value   2200      1150.50   1004.07   −324.14            1000.00   229.22      1085.34   134.96      328.15
Mean value   32.01     461.46    453.26    −531.24            190.28    142.52      203.99    93.84       300.08
SD           120.53    114.14    96.18     60.64              230.85    35.02       160.41    25.80       5.84

The program Molconvert was used for conversion between names and chemical structures.49 To obtain a reasonable initial number of descriptors, each molecule was optimized in the gas phase with MOPAC201650 using PM6-DH+ and the PRECISE keyword.51,52 A vibrational analysis was performed to ensure the absence of a transition state.53 To retain consistency with comparable prediction models,54,55 a COSMO geometry optimization in the virtually ideal electrical conductor (εr = 999) using the same method, but not taking molecular symmetry into account, was then appended;56 the solvent-accessible (COSMO) surface Ŝ, the molecular volume Vm and the free solvation enthalpy in an ideal electric conductor ΔsolvG were taken directly from the final output.

Selection of training and validation data

The data collected covered three types of systems: IL + water (binary system), IL + water + salt (ternary salt system), and IL + water + alcohol (ternary alcohol system). To make sure that all of these systems were represented equally in both the training and validation sets, we sorted the data for each system separately according to increasing molecular volume of the ILs. Each system's data were then divided into three subgroups so as to include ILs with low, medium and high molecular volume in both the training and validation sets. We then randomly picked 80% of the data from each of the three subgroups of the three systems, in ten repetitions, to create ten different training sets for building the model. Each time, the remaining 20% of the data formed one of ten validation sets corresponding to the ten training sets (Fig. 2). The 80:20 split of the data into training and validation sets follows other prediction attempts.7,20,57 Bootstrap aggregation was used to give the ternary systems the same weighting as the binary data; the input data were replicated 4.8 times for the IL + water + salt system and 7.34 times for the IL + water + alcohol system. A comparison of the data before and after bootstrap aggregation is shown in Fig. S1 (ESI).
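The splitting and weighting scheme above can be sketched in a few lines of code. This is a minimal illustration, not the authors' actual script: the record key "Vm" is an assumed name, and the fractional replication factors (4.8 and 7.34) are rounded to whole copies for simplicity.

```python
import random

def tercile_split(records, frac_train=0.8, seed=0):
    """Stratified 80/20 split for one system: sort by molecular volume
    (key "Vm", an illustrative assumption), cut into low/medium/high
    terciles, and draw 80% of each tercile into the training set."""
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r["Vm"])
    k = len(ordered) // 3
    terciles = [ordered[:k], ordered[k:2 * k], ordered[2 * k:]]
    train, valid = [], []
    for group in terciles:
        picked = rng.sample(group, int(round(frac_train * len(group))))
        train.extend(picked)
        valid.extend(r for r in group if r not in picked)
    return train, valid

def bootstrap_aggregate(binary, salt, alcohol):
    """Replicate the ternary records so each system carries a weight
    comparable to the binary data (the paper uses factors 4.8 and 7.34;
    they are rounded to whole copies in this sketch)."""
    return list(binary) + list(salt) * 5 + list(alcohol) * 7
```

Repeating `tercile_split` with ten different seeds reproduces the ten training/validation pairs described above.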
Fig. 2 Schematic representation of the data splitting and randomization used.

Computational modelling

Two approaches to constructing regression models were used. The first is parametric regression, in which an expression is fitted to the provided data; the expression is constructed by an evolutionary algorithm, for which the Eureqa software (v1.24.0, build 9367) was used (hereafter EA). The second is kernel support vector machine (KSVM) regression, a method based on statistical learning theory, in which a non-parametric regression function is built from the provided data. In this work the LIBSVM implementation of KSVM was used.

The EA generates clusters of equations for a given target expression using a modified evolutionary algorithm.58 The software performed a minimum of 5 × 109 formula evaluations for each target expression, attempting in each generation to find an equation with optimal complexity and coefficient of determination R2. Due to the extensive data set, 80% of the data was used for training and 20% for validation. To keep the regression model interpretable, only basic mathematical operators (addition, subtraction, multiplication) were allowed. The equations providing the highest R2 were recorded. The end point of the EA calculations was consistently set to 150 000 generations for all sets and systems, since beyond this point no significant improvement in R2 was observed, as shown in Fig. 3.
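Since Eureqa is proprietary, the idea of evolving interpretable expressions restricted to +, − and × can be illustrated with a deliberately minimal sketch (random tree generation plus truncation selection; real EA software additionally uses crossover, mutation and complexity penalties):

```python
import random

# The three operators permitted in the EA runs, keeping equations interpretable.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def random_expr(rng, variables, depth=2):
    """Grow a random expression tree over the allowed operators."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return rng.choice(variables)          # a descriptor, e.g. "Vm"
        return round(rng.uniform(-2.0, 2.0), 2)   # a numeric constant
    op = rng.choice(sorted(OPS))
    return (op, random_expr(rng, variables, depth - 1),
                random_expr(rng, variables, depth - 1))

def evaluate(expr, row):
    if isinstance(expr, tuple):
        return OPS[expr[0]](evaluate(expr[1], row), evaluate(expr[2], row))
    return row[expr] if isinstance(expr, str) else expr

def mse(expr, data, target):
    return sum((evaluate(expr, r) - r[target]) ** 2 for r in data) / len(data)

def evolve(data, variables, target, generations=300, pop_size=40, seed=1):
    """Keep the fittest half each generation; refill with fresh candidates."""
    rng = random.Random(seed)
    population = [random_expr(rng, variables) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda e: mse(e, data, target))
        population = population[:pop_size // 2] + [
            random_expr(rng, variables) for _ in range(pop_size - pop_size // 2)]
    return min(population, key=lambda e: mse(e, data, target))
```

On toy data such as y = x1 + x2, this loop typically recovers the generating expression within a few hundred generations.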

Fig. 3 Influence of the mean number of generations calculated with the EA on the coefficient of determination R2, with 95% confidence intervals (CI). As can be seen, R2 stabilizes after 150 000 generations, which was thus set as the end point for the EA calculations.

Support vector machine regression

Support vector machine is a method based on statistical learning theory, introduced by Vapnik and co-workers.59–63 Combined with kernel mapping of the feature space, this technique has become very popular, as it is suitable for building robust and efficient regression models of multidimensional non-linear relationships.

KSVM handles multidimensional non-linear relationships very well due to its use of feature-space mapping. This mapping creates a new feature space in which the non-linear relationships may become linear, or close to linear. Formally, the mapping is done with a function Φ(x). However, due to the use of the “kernel trick” and the method of Lagrange multipliers, the resulting KSVM formulation does not require explicit knowledge of the mapping function, only its scalar product k(xa,xb) = Φ(xa)·Φ(xb). Thus the mapping is performed by means of this scalar product.64 A detailed discussion of kernel functions in KSVM can be found, for example, in Hoffmann et al.65 By default the Gaussian kernel k(xa,xb) = exp(−γ‖xa − xb‖2) is usually used, in which γ is the kernel parameter.
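The Gaussian kernel above is a one-liner; the point of the kernel trick is that it returns the scalar product of the two points in the mapped feature space without ever constructing Φ explicitly. A minimal sketch:

```python
import math

def gaussian_kernel(xa, xb, gamma):
    """Gaussian (RBF) kernel k(xa, xb) = exp(-gamma * ||xa - xb||^2),
    i.e. the scalar product of the mapped points, computed without
    building the mapping Phi itself (the "kernel trick")."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xa, xb))
    return math.exp(-gamma * sq_dist)

def kernel_matrix(points, gamma):
    """Pairwise kernel values for a set of points: symmetric, with a
    unit diagonal since every point has zero distance to itself."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]
```

The kernel matrix is all a kernel method ever needs; the descriptors enter the model only through these pairwise values.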

The idea of KSVM regression is to find a function f(x) which deviates by at most some arbitrarily chosen value ε from the data provided as the training set (Fig. 2). Additionally, the sought function is supposed to be as flat as possible; from a practical point of view, this property delivers robustness against perturbations in the observed data. However, a function which is flat enough and does not exceed the allowed error tolerance may not exist. To overcome this problem it is useful to allow some deviations exceeding the assumed level ε and to penalize them. For this purpose, slack variables ξ and ξ* are introduced, along with the concept of a soft-margin loss function.66 This situation is depicted in Fig. 4. The slack variables are not present explicitly in the resulting optimization problem due to the application of the method of Lagrange multipliers.
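The soft-margin loss described above is the standard ε-insensitive loss; a minimal sketch makes the role of the slack variable explicit:

```python
def epsilon_insensitive_loss(y_true, y_pred, eps):
    """Soft-margin loss used in KSVM regression: deviations inside the
    epsilon tube cost nothing; beyond it, only the excess (the slack
    variable xi) is penalized, growing linearly with the deviation."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

A prediction 0.05 away from the target with ε = 0.1 therefore incurs zero loss, while one 0.3 away is charged only for the 0.2 that protrudes from the tube.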

The sought regression function, expressed by means of the Lagrange multipliers, is obtained by solving the following optimization problem:64

maximize over αi, αi*:  −(1/2) Σi,j=1…p (αi − αi*)(αj − αj*)k(xi,xj) − ε Σi=1…p (αi + αi*) + Σi=1…p yi(αi − αi*)(1)

subject to the constraints 0 ≤ αi ≤ C, 0 ≤ αi* ≤ C (i = 1,…,p) and Σi=1…p (αi − αi*) = 0, where αi and αi* are the Lagrange multipliers assigned to the ith data point. In eqn (1), p denotes the number of points in the training set and yi is the observed value of the explained variable. The regularization coefficient C > 0 determines the trade-off between the flatness of the regression function and the extent to which deviations larger than ε are tolerated.59,67 The higher the value of C, the more sensitive the model is to outliers in the data; a value of C equal to infinity would not allow any errors greater than the assumed ε.

The regression function has the following form:

f(x) = Σi=1…p (αi − αi*)k(xi, x) + b(2)

where b is the intercept, x = [x1,…,xn]T is an n-dimensional point, T denotes transposition, and n is the number of features (the dimensionality of the space).

Vectors α and α* are the solution of eqn (1). If one of the values αi, αi* lies strictly between 0 and C, then the corresponding point is a support vector, that is, a point in the training dataset at which the margin is built. Example support vectors are highlighted in red in Fig. 4.
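Once the multipliers are known, evaluating eqn (2) is a weighted sum of kernel values. The sketch below uses a linear kernel and made-up multipliers purely for illustration:

```python
def svr_predict(x, support_vectors, alpha, alpha_star, b, kernel):
    """Evaluate the regression function of eqn (2):
    f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b.
    Points with alpha_i - alpha_i* = 0 drop out of the sum, so only
    the support vectors actually contribute to the prediction."""
    return b + sum((a - a_s) * kernel(sv, x)
                   for sv, a, a_s in zip(support_vectors, alpha, alpha_star))

def linear(u, v):
    """Plain dot-product kernel, used here only to keep the toy
    example easy to verify by hand."""
    return sum(ui * vi for ui, vi in zip(u, v))
```

With support vectors [1] and [2], coefficients (α − α*) of 2 and −0.5, and b = 0.5, the prediction at x = [3] is 2·3 − 0.5·6 + 0.5 = 3.5.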

Fig. 4 The concept of the regression function, the margin of tolerance (ε), slack variables (ξ) and support vectors (red).

For practical use of KSVM regression, a kernel has to be chosen and its parameters provided. Moreover, it is necessary to provide the regularization coefficient C and the width of the error tolerance margin ε. Many methods for parameter selection have been proposed; however, to the authors' knowledge, none of them is suitable for all possible applications of KSVM. There are two popular approaches. The first is parameter determination by optimization algorithms such as grid search or other optimization techniques. A drawback of this approach is that it can lead to model overfitting, which results in poor prediction accuracy, and it requires the solution of an additional optimization problem with respect to at least three variables (the kernel parameter, ε and C), which is often very time consuming. In the considered case of CMC regression, this approach did not provide acceptable results. The second approach is to choose the parameters arbitrarily using hints based on the data; examples of such an approach are given by Cherkassky and Mulier67 and Cherkassky et al.68

SVM regression methodology

Data rescaling is a necessary step when using the KSVM technique; therefore all descriptors were rescaled to the range [−1, 1].

In this work the LIBSVM implementation of KSVM was used.69 The kernel of choice was the Gaussian kernel, as this is usually the best choice for non-linear data.70 The parameters were chosen arbitrarily, as suggested in ref. 69 and 70. The kernel parameter was set to the reciprocal of the number of descriptors, γ = 1/n, which is the default value suggested by the LIBSVM developers as suitable for most cases. The value of the regularization parameter was chosen according to ref. 68, which suggests using the range of the dependent variable y as the value of C. The error margin tolerance was set to ε = 0.1, which is the default value used in LIBSVM.
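The rescaling and the heuristic parameter choices described above are simple enough to sketch directly (function and dictionary names are illustrative, not LIBSVM's API):

```python
def rescale_to_unit(values):
    """Linearly map one descriptor column onto [-1, 1], the rescaling
    applied to all descriptors before KSVM training."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate constant column
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def default_svr_params(n_descriptors, y):
    """The heuristic choices described in the text: gamma = 1/n (the
    LIBSVM default), C = range of the dependent variable (following
    Cherkassky et al.), epsilon = 0.1 (the LIBSVM default)."""
    return {"gamma": 1.0 / n_descriptors,
            "C": max(y) - min(y),
            "epsilon": 0.1}
```

These values can then be passed to any SVR implementation; the point of the heuristic route is that no additional optimization problem has to be solved.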

Sensitivity for EA and KSVM

The sensitivity analysis, both for the EA and KSVM, was conducted as previously described by Kłosowska-Chomiczewska et al.71
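The cited procedure is not reproduced here, but a generic one-at-a-time sensitivity sketch conveys the idea: each descriptor is perturbed around a baseline point and the resulting change in the model output is recorded, its sign indicating a positive or negative influence. The ±5% perturbation and the function names are illustrative assumptions.

```python
def sensitivity(model, baseline, names, delta=0.05):
    """One-at-a-time sensitivity sketch: nudge each descriptor up and
    down by a fraction delta and report the central-difference slope of
    the model output. The sign shows whether the descriptor raises or
    lowers the predicted value (here, the CMC)."""
    out = {}
    for name in names:
        up, down = dict(baseline), dict(baseline)
        up[name] = baseline[name] * (1 + delta)
        down[name] = baseline[name] * (1 - delta)
        out[name] = (model(up) - model(down)) / (2 * delta * baseline[name])
    return out
```

Applied to an EA equation or a KSVM model in place of `model`, this yields one signed sensitivity value per descriptor, which can then be averaged over the ten training sets.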

Results and discussion

We used two different regression methods to find a correlation between the CMC of ILs and a variety of descriptors. KSVM is known for its good properties in finding regression functions;60,72,73 however, it does not give explicit information about the relationships between the descriptors and the explained variable y.

The primary question of the paper, whether a global model can be applied to predict the CMC in a variety of complex systems, can be answered affirmatively. An exemplary result is shown in Fig. 5. Both regression methods yielded satisfactory predictive models. A summary of the results for all data sets is presented in Table 2.

Fig. 5 Exemplary prediction of the CMC for a random set using the 704 original data points. The results were obtained with two methods, KSVM (R2 of 0.867) and EA (R2 of 0.805). Both methodologies provide an adequate fit for the IL in water, water + salt, and water + alcohol systems. It is interesting to note that both methods also underpredict the CMC (slope < 1).
Table 2 Comparison of the regression methods for models of all three systems (IL + water, IL + water + salt, and IL + water + alcohol) after bootstrap aggregation. Shown are the mean coefficient of determination (R2), mean squared error (MSE) and maximal error, each with the confidence interval (CI) at the 95% confidence level. It can be observed that in this case KSVM generally performs better than the EA

Regression method                      Mean R2         Mean MSE        Mean max. error
Evolutionary algorithm (EA)            0.794 ± 0.021   0.973 ± 0.534   2.788 ± 0.498
Kernel support vector machine (KSVM)   0.843 ± 0.013   0.608 ± 0.050   3.039 ± 0.231

KSVM also gave the better fit, with a mean coefficient of determination of 0.843, whereas the evolutionary algorithm (EA) resulted in a mean R2 of 0.794. The difference in the obtained results can be explained by the mathematical foundations of the applied methods. KSVM unfolds the non-linear relationships hidden in the data and handles the problem in a holistic way. Unlike EA models, the KSVM model does not fit a specific mathematical expression to the data but has a generalized form designed to handle such problems. KSVM regression models are therefore less vulnerable to overfitting than EA models, providing a very robust regression tool with the ability to give accurate predictions for previously unseen data.

When comparing these results with previously published models of the CMCs of ILs, the fit of our model is satisfactory (R2 of 0.843 and 0.794 for KSVM and EA, respectively), but lower than those found in the literature (R2 of 0.942–0.996).35–46 This is due to the diversity of the ILs taken into account (Table 1) compared with other CMC prediction attempts, in which the authors usually build models on small samples35–46 and focus on a very specific sub-group of surfactants, e.g. only alkyltrimethylammonium and alkylpyridinium salts39 or quaternary ammonium and pyridinium salts42 for cationic surfactants, and only sodium alkyl sulfates and sodium sulfonates for anionic surfactants.41 With such limited diversity in the data, these models will generally produce a better fit, since the molecules are already similar. In addition, our models are the first to predict the CMC over a range of different systems (binary and ternary), whereas all previous models address only binary systems. When analyzing the sensitivity of each descriptor on the final result, similarities between KSVM and EA may be observed, as shown in Fig. 6.

Fig. 6 Comparison of the sensitivity toward different variables for models created with the evolutionary algorithm (EA) and kernel support vector machine (KSVM). The method of calculation of the sensitivity (A) and % positive/negative influence (B) is described by Kłosowska-Chomiczewska et al.71 It can be seen that in both approaches Vm and Ŝ dominate the effect, with Vm being mostly positive (for EA) and Ŝ always negative. However, no single clearly attributable effect may be observed. Error bars represent the 95% CI.

The surface area (Ŝ) has in each case the highest, and always negative, impact on the CMC in both models. This may be interpreted as follows: the higher the surface area, the larger the water cage surrounding the molecule, the higher the entropic penalty per molecule, and thus the lower the CMC. For Vm, no clear positive or negative effect is observed in either EA or KSVM. This is because each descriptor acts differently in each system; since this model is a summation over all systems, the sensitivity of the descriptors represents an average of all the effects. Comparing the EA and KSVM results, the sensitivities are of similar magnitude and follow similar trends, which shows that the descriptors have similar effects on the CMC with either numerical method, independent of the numerical path.

To determine the effect of each descriptor in each system (IL + water, IL + water + salt, and IL + water + alcohol), we repeated the EA modelling for the individual systems. This sheds light on the dominant mechanisms responsible for the micellization of ILs. The results of this analysis are displayed in Fig. 7. It should be noted that in this case not only the optimal solution was taken into account, but the top three solutions of the EA.

Fig. 7 Comparison of EA sensitivity analysis for the binary system (IL + water), and ternary systems (IL + water + salt, and IL + water + alcohol). The dominant descriptors were Vm, Ŝ, ΔsolvG, and in the case of IL + water + salt, also the Cs. It is noteworthy that the effect of the molecular volume of the IL is reduced in the IL + water + alcohol system.

From the sensitivity analysis (Fig. 7A and B) we may elucidate the dominant mechanisms responsible for micellization in each system. In the IL + water system, Vm and Ŝ have the highest and mostly negative impact on the CMC, whereas the influence of ΔsolvG is minor and usually positive. We may therefore conclude that micelles are stabilized mostly by chain/chain interactions (−36% Vm) and the avoidance of hydration of the IL molecules (−91% Ŝ). At the same time, Vm contributes some destabilization of the micellization process, namely through steric hindrance (+64% Vm; these are largely the long-chained ionic liquids). Additionally, the process is hindered by the interaction of the ILs with water, described by ΔsolvG, whose effect is positive (+93% ΔsolvG), which corresponds to the description given by Varfolomeev.74

The addition of salt to the system has a moderately strong (Fig. 7A) but completely negative effect on the CMC (−100% Cs, Fig. 7B), which agrees well with earlier descriptions of the effect of salt on the CMC.75 Moreover, it dramatically changes the influence of Vm: in the presence of salt, the stabilizing effect of the chain/chain interactions is no longer prominent, as the charge shielding by the salt dominates. The strongly positive influence of Vm on the CMC now relates to steric hindrance; that is, the bigger the molecule, the more difficult it is to fit into the micelle, and therefore the higher the CMC. In parallel, the effect of Ŝ in the presence of salt becomes more pronounced (Ŝ sensitivity 2.71%, Fig. 7A) but remains completely negative, while the influence of ΔsolvG remains similar, though less positive (+86% ΔsolvG), with the same justification as for the IL + water system.

The most interesting effect was that of alcohol on the micellization of ILs. The Vm term is smaller, indicating that chain/chain interactions and steric hindrance are less relevant to the process. At the same time, the influences of Ŝ and ΔsolvG shift to completely negative. The latter is especially interesting, as strong interactions between IL molecules and the solvent no longer destabilize the micelles in this system. This can be explained by the alcohol acting as a cosurfactant, incorporating between IL molecules in the micelles and changing the curvature of the micelles,76–78 thereby making the formation of aggregates easier.

These findings mirror the common understanding of the effects of salts and alcohols on surfactant micellization.79–82 That these effects are so visible in the equations highlights that the models not only allow the prediction of the CMC, but also provide insight into the underlying mechanism of micellization.

Finally, in order to prove the robustness of our calculations, we validated the model built for both ternary systems (IL + water + salt and IL + water + alcohol) on the data for the less complicated binary system (IL + water). That is, all the ternary data were used as the training set, and the binary data were used as the validation set. This time the EA performed better, giving an R2 of 0.576, while the R2 obtained with KSVM was 0.566 (as shown in Fig. 8A). However, both fits are considered satisfactory.83

Fig. 8 Results of prediction of the CMC of the binary system (IL + water) using the model created for the ternary systems (IL + water + salt and IL + water + alcohol). The results were obtained with two methods, KSVM (R2 of 0.566 and 0.620) and EA (R2 of 0.576 and 0.669), for coupled (A) and uncoupled (B) IL data. Both methodologies provide an adequate fit, while at the same time underpredicting the CMC (slope < 1).

The justification for this type of calculation is that the contribution of each of the descriptors of ILs (Vm, Ŝ and ΔsolvG), should have the same effect on the CMC, with or without the additives of salt and alcohol. In essence, this experiment is analogous to taking a 3D plane, and projecting it onto a 2D surface. The essence of the curvature should be maintained, and the minimum that the methodologies find, should also be similar. Using the EA we can see a relatively good fit of the binary data using a model trained on ternary data. The reduced R2 compared to the overall model shown in Fig. 3 is due to the fact that the ternary data was much more scarce, and less training data was used (N = 229 for ternary data training, and N = 475 for binary data validation, as compared to N = 563 for training and N = 141 for validation for the overall model). To improve the fit, it was attempted to “uncouple” the ions. That is, the ionic descriptors were taken not as a sum of the cation and anion, but instead the cationic and anionic contribution were modelled separately. As can be seen in Fig. 6B, the effect of coupling or decoupling of the ions of the ionic liquids have some influence of the quality of fit with EA, and have smaller influence on the coefficient of determination for KSVM. This is expected, since the fitting of the binary data is in effect a reduction in complexity of the ternary system, and the salt, or alcohols terms with the EA, would simply cancel or be set to zero, and thus the only remaining terms in the equations are those of the binary system. In the case of KSVM this is not the same as feature space mapping is used. In such a case if some variable is included in the learning set but all values are equal to zero the space is not reduced, but it is assigned to some region in mapped feature space. 
Thus, when a non-zero value of such a descriptor appears in unseen data, the projection can be invalid, as the mapping for that value was never explicitly created, and the prediction can be very poor. In such cases EA models perform better, as they have better extrapolation abilities than KSVM, although some reports on extending KSVM models to such applications do exist. The fact that both the KSVM and EA regression approaches were capable of predicting these effects correctly indicates the robustness of the descriptors, and highlights that the projection of a multidimensional solution onto a lower-dimensional space is successful as well.
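The projection experiment described above can be sketched in code. The following is a minimal illustration (not the authors' implementation): an RBF-kernel support vector regressor is trained on "ternary" descriptor vectors, and then reused on "binary" points whose additive columns (Cs, Ca) are simply set to zero. The descriptor names and the toy data are assumptions made for illustration only.

```python
# Sketch of the ternary-to-binary projection: train a kernel SVR on descriptor
# vectors that include salt/alcohol terms, then predict with those terms zeroed.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Columns: Vm, S_hat, dGsolv, Cs (salt conc.), Ca (alcohol conc.) -- toy values.
X_ternary = rng.uniform(0.1, 1.0, size=(200, 5))
# A toy log(CMC)-like target in which the additive terms shift the response.
y = 2.0 * X_ternary[:, 0] - X_ternary[:, 1] + 0.5 * X_ternary[:, 3] - 0.3 * X_ternary[:, 4]

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X_ternary, y)

# "Binary" systems: identical IL descriptors, additive columns forced to zero.
X_binary = X_ternary.copy()
X_binary[:, 3:] = 0.0
pred = model.predict(X_binary)

# The kernel maps the all-zero additive columns to a region of feature space
# that was not covered during training, so accuracy can degrade -- the
# extrapolation weakness of KSVM discussed in the text.
print("predictions on projected binary data:", pred[:5])
```

An explicit symbolic model, as produced by the EA, would instead drop the Cs and Ca terms outright when they are zero, which is why its extrapolation to binary systems is more graceful.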


Previous attempts to model the CMC of ILs have always been restricted to simple binary systems with limited numbers and groups of ILs. Successfully modelling these is not difficult, because of the simplicity of the system and the small number of compounds. In this work, we have shown for the first time that it is possible to combine various systems and still produce viable fits. It is possible to predict the CMC of ILs not just in water, but in the presence of any salt or alcohol as well. The global model was able to predict the CMC with an MSE of 0.608 and an R2 of 0.843 for KSVM, and with an MSE of 0.973 and an R2 of 0.794 for EA, which is very satisfactory considering that such a large and diverse set of ILs was used for the first time. The sensitivity analysis showed that the micellization of ILs mirrors the mechanisms governing the micellization of conventional surfactants. To highlight the functionality of the models, we also showed for the first time that a model for global prediction of the CMC (in ternary systems) can serve to predict the CMC of binary systems, indicating that a projection of a multidimensional solution onto a lower-dimensional space is successful as well. The fit, although lower, was still satisfactory,83 with R2 of 0.620 and 0.669 for KSVM and EA, respectively. Modelling phenomenological parameters such as the CMC using decades of literature data yields consistent insights into the mechanisms governing the phenomenon, regardless of the system.79–82
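The evaluation protocol summarized above (80/20 train/validation splits with aggregation over resamples, reporting average R2 and MSE) can be sketched as follows. This is an illustrative stand-in, not the authors' code: the synthetic data, descriptor names, and the use of repeated random splits in place of the paper's exact bootstrap-aggregation scheme are all assumptions.

```python
# Sketch of the evaluation loop: repeated 80/20 splits of a descriptor matrix,
# fitting an RBF-kernel SVR each time and averaging R^2 and MSE on validation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                  # Vm, S_hat, dGsolv, Cs, Ca (toy)
y = X @ np.array([1.5, -1.0, 0.4, 0.6, -0.3]) + rng.normal(scale=0.2, size=300)

r2s, mses = [], []
for seed in range(20):                         # 20 resampled splits
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    pred = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr).predict(X_va)
    r2s.append(r2_score(y_va, pred))
    mses.append(mean_squared_error(y_va, pred))

print(f"average R2 = {np.mean(r2s):.3f}, average MSE = {np.mean(mses):.3f}")
```

Averaging metrics over resampled splits, rather than reporting a single split, guards against an accidentally favourable partition of the 704 literature values.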

Conflicts of interest

There are no conflicts to declare.

Notes and references

  1. J. Ranke, S. Stolte, R. Störmann, J. Arning and B. Jastorff, Chem. Rev., 2007, 107, 2183–2206.
  2. N. Canter, Tribol. Lubr. Technol., 2005, 61, 15.
  3. R. D. Rogers and K. R. Seddon, Science, 2003, 302, 792–793.
  4. D. M. Eike, J. F. Brennecke and E. J. Maginn, Green Chem., 2003, 5, 323–328.
  5. C. P. Fredlake, J. M. Crosthwaite, D. G. Hert, S. N. Aki and J. F. Brennecke, J. Chem. Eng. Data, 2004, 49, 954–964.
  6. A. R. Katritzky, A. Lomaka, R. Petrukhin, R. Jain, M. Karelson, A. E. Visser and R. D. Rogers, J. Chem. Inf. Comput. Sci., 2002, 42, 71–74.
  7. C. Yan, M. Han, H. Wan and G. Guan, Fluid Phase Equilib., 2010, 292, 104–109.
  8. M. G. Freire, C. M. Neves, S. P. Ventura, M. J. Pratas, I. M. Marrucho, J. Oliveira, J. A. Coutinho and A. M. Fernandes, Fluid Phase Equilib., 2010, 294, 234–240.
  9. C.-W. Cho, U. Preiss, C. Jungnickel, S. Stolte, J. Arning, J. Ranke, A. Klamt, I. Krossing and J. Thöming, J. Phys. Chem. B, 2011, 115, 6040–6050.
  10. C. Kolbeck, T. Cremer, K. Lovelock, N. Paape, P. Schulz, P. Wasserscheid, F. Maier and H.-P. Steinruck, J. Phys. Chem. B, 2009, 113, 8682–8688.
  11. R. L. Gardas and J. A. Coutinho, Fluid Phase Equilib., 2008, 265, 57–65.
  12. U. P. Preiss, J. M. Slattery and I. Krossing, Ind. Eng. Chem. Res., 2009, 48, 2290–2296.
  13. P. D. Huibers, D. O. Shah and A. R. Katritzky, J. Colloid Interface Sci., 1997, 193, 132–136.
  14. Y. Ren, H. Liu, X. Yao, M. Liu, Z. Hu and B. Fan, J. Colloid Interface Sci., 2006, 302, 669–672.
  15. J. Jacquemin, P. Husson, A. A. Padua and V. Majer, Green Chem., 2006, 8, 172–180.
  16. J. M. Slattery, C. Daguenet, P. J. Dyson, T. J. Schubert and I. Krossing, Angew. Chem., 2007, 119, 5480–5484.
  17. G. Yu, D. Zhao, L. Wen, S. Yang and X. Chen, AIChE J., 2012, 58, 2885–2899.
  18. K. Tochigi and H. Yamamoto, J. Phys. Chem. C, 2007, 111, 15989–15994.
  19. C.-W. Cho, J. Ranke, J. Arning, J. Thöming, U. Preiss, C. Jungnickel, M. Diedenhofen, I. Krossing and S. Stolte, SAR QSAR Environ. Res., 2013, 24, 863–882.
  20. Y. Zhao, J. Zhao, Y. Huang, Q. Zhou, X. Zhang and S. Zhang, J. Hazard. Mater., 2014, 278, 320–329.
  21. M. I. Hossain, B. B. Samir, M. El-Harbawi, A. N. Masri, M. A. Mutalib, G. Hefter and C.-Y. Yin, Chemosphere, 2011, 85, 990–994.
  22. B. Peric, J. Sierra, E. Martí, R. Cruañas and M. A. Garau, Ecotoxicol. Environ. Saf., 2015, 115, 257–262.
  23. C.-W. Cho, J.-S. Park, S. Stolte and Y.-S. Yun, J. Hazard. Mater., 2016, 311, 168–175.
  24. D. J. Couling, R. J. Bernot, K. M. Docherty, J. K. Dixon and E. J. Maginn, Green Chem., 2006, 8, 82–90.
  25. F. Yan, S. Xia, Q. Wang and P. Ma, J. Chem. Eng. Data, 2012, 57, 2252–2257.
  26. K. Roy, R. N. Das and P. L. Popelier, Chemosphere, 2014, 112, 120–127.
  27. S. Bruzzone, C. Chiappe, S. Focardi, C. Pretti and M. Renzi, Chem. Eng. J., 2011, 175, 17–23.
  28. J. S. Torrecilla, J. Palomar, J. Lemus and F. Rodríguez, Green Chem., 2010, 12, 123–134.
  29. F. Yan, Q. Shang, S. Xia, Q. Wang and P. Ma, J. Hazard. Mater., 2015, 286, 410–415.
  30. F. Yan, S. Xia, Q. Wang and P. Ma, Ind. Eng. Chem. Res., 2012, 51, 13897–13901.
  31. Y. Yu, X. Lu, Q. Zhou, K. Dong, H. Yao and S. Zhang, Chem. – Eur. J., 2008, 14, 11174–11182.
  32. W. Mrozik, C. Jungnickel, T. Ciborowski, W. R. Pitner and P. Stepnowski, in 5th International Conference on Oils & Fuels for Sustainable Development, AUZO 2008, ed. J. Hupka, A. Tonderski, R. Aranowski and C. Jungnickel, Gdansk, Poland, 2008.
  33. W. Mrozik, C. Jungnickel, T. Ciborowski, W. R. Pitner, J. Kumirska, Z. Kaczyński and P. Stepnowski, J. Soils Sediments, 2009, 9, 237–245.
  34. W. Mrozik, J. Nichthauser and P. Stepnowski, Pol. J. Environ. Stud., 2008, 17, 383–388.
  35. M. Barycki, A. Sosnowska and T. Puzyn, J. Colloid Interface Sci., 2017, 487, 475–483.
  36. U. Preiss, C. Jungnickel, J. Thöming, I. Krossing, J. Łuczak, M. Diedenhofen and A. Klamt, Chem. – Eur. J., 2009, 15, 8880–8885.
  37. A. Vishnyakov, M.-T. Lee and A. V. Neimark, J. Phys. Chem. Lett., 2013, 4, 797–802.
  38. Z. Kardanpour, B. Hemmateenejad and T. Khayamian, Anal. Chim. Acta, 2005, 531, 285–291.
  39. M. Jalali-Heravi and E. Konouz, J. Surfactants Deterg., 2003, 6, 25–30.
  40. K. Roy and H. Kabir, Chem. Eng. Sci., 2012, 73, 86–98.
  41. P. D. Huibers, V. S. Lobanov, A. Katritzky, D. Shah and M. Karelson, J. Colloid Interface Sci., 1997, 187, 113–120.
  42. A. R. Katritzky, L. M. Pacureanu, S. H. Slavov, D. A. Dobchev, D. O. Shah and M. Karelson, Comput. Chem. Eng., 2009, 33, 321–332.
  43. P. D. Huibers, V. S. Lobanov, A. R. Katritzky, D. O. Shah and M. Karelson, Langmuir, 1996, 12, 1462–1470.
  44. E. A. Mahmoud Gad, J. Dispersion Sci. Technol., 2007, 28, 231–237.
  45. S. Yuan, Z. Cai, G. Xu and Y. Jiang, Colloid Polym. Sci., 2002, 280, 630–636.
  46. S. Yuan, Z. Cai, G. Xu and Y. Jiang, J. Dispersion Sci. Technol., 2002, 23, 465–472.
  47. U. P. Preiss, P. Eiden, J. Łuczak and C. Jungnickel, J. Colloid Interface Sci., 2013, 412, 13–16.
  48. J. Brophy and D. Bawden, Aslib Proceedings, Emerald Group Publishing Limited, 2005, vol. 57, p. 498.
  49. C. v.5.11.4, Molecule File Converter Molconvert, accessed 2017-04-23.
  50. J. J. Stewart, MOPAC2016, Stewart Computational Chemistry, 2016.
  51. J. J. Stewart, J. Mol. Model., 2007, 13, 1173–1213.
  52. M. Korth, J. Chem. Theory Comput., 2010, 6, 3808–3816.
  53. M. J. Dewar and G. P. Ford, J. Am. Chem. Soc., 1977, 99, 7822–7829.
  54. W. Beichel, U. P. Preiss, S. P. Verevkin, T. Koslowski and I. Krossing, J. Mol. Liq., 2014, 192, 3–8.
  55. U. P. Preiss and M. I. Saleh, J. Pharm. Sci., 2013, 102, 1970–1980.
  56. A. Klamt and G. Schüürmann, J. Chem. Soc., Perkin Trans. 2, 1993, 799–805.
  57. R. Bini, C. Chiappe, C. Duce, A. Micheli, R. Solaro, A. Starita and M. R. Tiné, Green Chem., 2008, 10, 306–309.
  58. V. Aryadoust, Psychol. Test Assess. Model., 2015, 57, 301.
  59. C. Cortes and V. Vapnik, Mach. Learn., 1995, 20, 273–297.
  60. V. Vapnik, S. E. Golowich and A. Smola, Advances in Neural Information Processing Systems, 1997, pp. 281–287.
  61. B. E. Boser, I. M. Guyon and V. N. Vapnik, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, p. 144.
  62. B. Schölkopf and H. A. Mallot, Adaptive Behavior, 1995, 3, 311–348.
  63. B. Schölkopf, C. Burges and V. Vapnik, Artificial Neural Networks—ICANN 96, 1996, pp. 47–52.
  64. V. N. Vapnik and V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
  65. T. Hofmann, B. Schölkopf and A. J. Smola, Ann. Stat., 2008, 1171–1220.
  66. B. Schölkopf, A. J. Smola, R. C. Williamson and P. L. Bartlett, Neural Comput., 2000, 12, 1207–1245.
  67. V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory, and Methods, Wiley, New York, 1998.
  68. V. Cherkassky and Y. Ma, Artificial Neural Networks—ICANN 2002, 2002, p. 82.
  69. C.-C. Chang and C.-J. Lin, ACM Transactions on Intelligent Systems and Technology (TIST), 2011, vol. 2, p. 27.
  70. G. C. Cawley and N. L. Talbot, J. Mach. Learn. Res., 2010, 11, 2079–2107.
  71. I. Kłosowska-Chomiczewska, K. Mędrzycka, E. Hallmann, E. Karpenko, T. Pokynbroda, A. Macierzanka and C. Jungnickel, J. Colloid Interface Sci., 2017, 488, 10–19.
  72. V. N. Vapnik and S. Kotz, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
  73. V. Vapnik, Nonlinear Modeling, Springer, 1998, pp. 55–85.
  74. M. A. Varfolomeev, A. A. Khachatrian, B. S. Akhmadeev, B. N. Solomonov, A. V. Yermalayeu and S. P. Verevkin, J. Solution Chem., 2015, 44, 811–823.
  75. U. P. Preiss, P. Eiden, J. Łuczak and C. Jungnickel, J. Colloid Interface Sci., 2013, 412, 13–16.
  76. C. Rodriguez-Abreu, K. Aramaki, Y. Tanaka, M. A. Lopez-Quintela, M. Ishitobi and H. Kunieda, J. Colloid Interface Sci., 2005, 291, 560–569.
  77. S. Chen, D. F. Evans, B. Ninham, D. Mitchell, F. D. Blum and S. Pickup, J. Phys. Chem., 1986, 90, 842–847.
  78. W. M. Gelbart, W. E. McMullen, A. Masters and A. Ben-Shaul, Langmuir, 1985, 1, 101–103.
  79. H. Heerklotz and R. M. Epand, Biophys. J., 2001, 80, 271–279.
  80. E. Dutkiewicz and A. Jakubowska, Colloid Polym. Sci., 2002, 280, 1009–1014.
  81. S. R. Raghavan, G. Fritz and E. W. Kaler, Langmuir, 2002, 18, 3797–3803.
  82. S. C. Owen, D. P. Chan and M. S. Shoichet, Nano Today, 2012, 7, 53–65.
  83. A. L. Edwards, The Correlation Coefficient, in An Introduction to Linear Regression and Correlation, 1976, vol. 4, pp. 33–46.


Electronic supplementary information (ESI) available: Complete dataset of ILs used for prediction, comparison of data with and without bootstrap aggregation, and EA equations. See DOI: 10.1039/c7cp05019d
Both authors are equal contributing first authors.

This journal is © the Owner Societies 2017