Open Access Article
Mood Mohan*a, Michelle K. Kidderc and Jeremy C. Smith*ab
aBiosciences Division and Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA. E-mail: moodm@ornl.gov; smithjc@ornl.gov
bDepartment of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, Tennessee 37996, USA
cManufacturing Science Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-6201, USA
First published on 28th October 2025
The prospect of using artificial intelligence (AI) to accurately screen very large databases of compounds for multiple properties has yet to be realized. Here, we explore this possibility using ionic liquids (ILs), which offer unique physicochemical properties and excellent tunability, making them highly versatile solvents for various research applications. Screening millions of potential ILs for the best performance in specific tasks with experimental methods alone is, however, impractical. Further, traditional physics-based computational chemistry is hindered by high computational cost. To address this challenge, we leverage a natural language processing (NLP)-based molecular embedding technique with advanced machine learning (ML) models to predict seven key IL properties: viscosity, density, ionic conductivity, surface tension, melting temperature, toxicity, and water solubility. Comprehensive datasets for these properties are obtained, and NLP featurization with Mol2vec is then compared with other featurization techniques such as 2D Morgan fingerprints and 3D quantum chemistry-derived sigma profiles. NLP-based featurization exhibited the best predictive performance, achieving the highest R2 and lowest RMSE values for all the studied IL properties. Further, we present case studies of how ILs might be screened using combined property criteria for practical cases (lignocellulosic biomass processing, CO2 capture, and optimal electrolytes for batteries) by screening a novel database of ∼10.6 million generated feasible ILs. The results introduce NLP as a powerful tool for engineering designer solvents with desirable properties for task-specific applications.
Green foundation
1. Our work advances green chemistry by using AI and natural language processing to predict critical properties of ionic liquids, enabling rapid screening of over 10.6 million candidates. This avoids time- and resource-intensive experimental testing, supporting safer solvent design with minimal environmental impact. The approach identifies task-specific ILs for CO2 capture and biomass processing while reducing waste, energy use, and hazardous materials in solvent discovery.
2. We predicted seven key properties of ILs and identified optimal candidates using AI/ML. Our model achieved R2 > 0.9 for most properties and revealed strong correlations between IL properties, enabling selection of safer solvents. This significantly reduces chemical waste, resource use, and trial-and-error experimentation, making solvent discovery faster, cleaner, and more sustainable.
3. The work could be greener by integrating metrics such as biodegradability, synthetic accessibility, and life cycle impacts into the screening pipeline. Future models could prioritize ILs from renewable feedstocks and include ecosystem toxicity data. Integrating this framework with experimental validation workflows would enable a closed-loop, AI-guided platform for sustainable solvent discovery, further aligning with green chemistry principles.
Over the past decades, researchers have developed a range of computational methods to predict the thermodynamic properties of ILs. These methods include equations of state (EoS) (PC-SAFT: perturbed-chain statistical associating fluid theory),13 group contribution (GC) methods,14 the conductor-like screening model for real solvents (COSMO-RS),15 other quantum chemistry (QC) approaches,16 and molecular dynamics (MD) simulations.17–19 Among these, EoS methods possess a strong theoretical foundation for calculating thermodynamic properties, but their application is often limited by the complexity of estimating the required model parameters.20 QC and MD calculations provide detailed atomistic and molecular insights into the behavior of ILs, but their high computational costs restrict their use in large-scale screening.18,19,21 Further, the COSMO-RS framework has emerged as a versatile quantum chemical method for calculating thermodynamic properties of ILs and deep eutectic solvents (DESs), but generating COSMO files remains computationally demanding, and the resulting predictions are sometimes only qualitative rather than quantitative.11,19,22
To address the above issues, and driven by advances in AI and machine learning (ML), quantitative structure–property relationship (QSPR) models have emerged as a powerful tool for molecular property prediction, leveraging data-driven techniques to uncover complex patterns and relationships within molecular datasets.23–26 The award of the 2024 Nobel Prize in Chemistry for the AlphaFold methods underscores the transformative potential of AI/ML across scientific disciplines, particularly in chemistry and materials science, by enabling predictive models that significantly reduce the time and resources required for complex research processes. Recent studies have demonstrated that COSMO-RS-derived sigma (σ) profiles can be effectively used to develop ML models for accurately predicting thermodynamic and physicochemical properties of ILs, DESs, and organic solvents.11,22,25,27–30 However, the generation of σ-profiles for molecules is computationally intensive and impractical for large chemical libraries. As an alternative, other featurization techniques such as Morgan fingerprints, atomic counts, molecular descriptors, and functional group estimations are gaining attention because they require less effort and computational time.17,31,32 Although these approaches reduce computational cost, their predictive accuracy often remains limited compared to quantum-chemistry-based descriptors.11,25,28–30
In recent years, natural language processing (NLP) methods have introduced a new paradigm for molecular representation.31,33 Because molecular string notations such as SMILES (simplified molecular input line entry system)34 encode chemical structures as sequences of characters, they can be treated analogously to text. Embedding techniques such as Mol2vec, inspired by Word2vec, are able to learn high-dimensional vector representations of molecular substructures that capture both local and global chemical environments. These embeddings can then be used as features in ML models, providing rich structural information without the computational burden of quantum calculations. Recent applications of Mol2vec have demonstrated promising predictive performance for small-molecule properties, but its potential for accurately modeling the diverse and complex chemical space of ILs has not been fully realized.
In this work, we leverage NLP-based Mol2vec embeddings combined with the CATBoost algorithm to predict seven critical IL properties: viscosity, surface tension, ionic conductivity, density, melting temperature, toxicity, and water solubility. The selection of these seven physicochemical properties was guided by three key considerations: (i) fundamental relevance to IL performance across several industrial processes (e.g., biomass processing, CO2 capture, battery electrolytes, and chemical synthesis), (ii) coverage of fundamental property classes (transport properties, thermodynamic and phase behavior, and environmental/biological interactions), and (iii) availability of high-quality experimental datasets. Other properties such as solvent polarity or cation–anion coulombic interaction energies are indeed important, but they are not as consistently available in large datasets, and their effects are largely reflected in the selected measurable properties. For example, polarity influences surface tension and water activity, whereas coulombic interactions affect viscosity and ionic conductivity. We compile large and diverse experimental datasets and compare Mol2vec with traditional featurization techniques, including Morgan fingerprints,35 atom counts, and COSMO-RS-derived σ-profiles.25 Finally, we demonstrate the practical utility of our approach through high-throughput screening of approximately 10.6 million systematically generated ILs, identifying potential candidates with desirable physicochemical properties, and potentially enhanced performance, for specific tasks of societal importance. This work establishes NLP-based molecular embeddings as a powerful and computationally efficient technique for the accelerated discovery and design of task-specific ionic liquids.
| IL property | Number of ILs | Number of cations | Number of anions | Data points | Exp. T (K) | Exp. P (kPa) | Property range | Source of data |
|---|---|---|---|---|---|---|---|---|
| Surface tension, σ (mN m−1) | 370 | 121 | 70 | 2663 | 263.08–533.2 | 101.325 | 16.9–76.2 | Mohan et al.29,30 and NIST38 |
| Viscosity, ln(η) (mPa s) | 967 | 419 | 172 | 11 721 | 278.15–353.15 | 60–950 000 | −0.03–13.85 | Mohan et al.11 |
| Ionic conductivity, κ (S m−1) | 414 | 180 | 85 | 5700 | 208.15–528.55 | 100–101.325 | 1.7 × 10−7–14.54 | Song et al.36 |
| Density, ρ (kg m−3) | 1687 | 906 | 341 | 52 278 | 90–573 | 81.5–300 000 | 780–2150 | NIST38 |
| Melting temperature, Tm (K) | 3076 | 1763 | 205 | 3076 | — | 101.325 | 177.15–292.15 | Feng et al.42 |
| Activity of water, γwIL | 168 | 108 | 45 | 3578 | 288.05–433.15 | 80–101.33 | 0.028–9.36 | NIST38 |
| Toxicity, log10 EC50 | 312 | 141 | 49 | 312 | — | — | −0.24 to 4.9 | Zhong et al.61 |
For all ILs in the dataset, SMILES strings were generated using the OPSIN method,39 and those SMILES unavailable in OPSIN were retrieved from PubChem40 or ChemDraw 3D.41 SMILES is a widely used notation for representing molecular structures, in which each compound has a specific canonical SMILES. For example, the canonical SMILES for 1,3-dimethylimidazolium acetate is CN1C=C[N+](=C1)C·CC(=O)[O−], where the dot (·) separates the cation and the anion. The notation [N+] indicates the positively charged nitrogen of the cation, while [O−] represents the negatively charged oxygen of the anion. Several studies in the literature36,42 have developed ML models using erroneous SMILES representations, such as charge imbalances in IL SMILES. In our work, we corrected these charge imbalances to ensure accurate IL featurization and ML model development. A detailed discussion of the erroneous IL SMILES can be found in the SI (Section S1).
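The charge-balance check mentioned above can be illustrated with a minimal sketch. It assumes that formal charges appear only inside bracket atoms (as in standard SMILES) and that ions are separated by an ASCII dot. The helper names `fragment_charge` and `is_charge_balanced` are hypothetical and are not the authors' curation code.

```python
import re

# Bracket atoms such as [N+], [O-], or [Fe+3] carry the formal charges in SMILES.
BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
# A charge is a run of '+' or '-' signs, optionally followed by a digit count.
CHARGE = re.compile(r"(\++|-+)(\d*)")


def fragment_charge(smiles: str) -> int:
    """Sum the formal charges written in the bracket atoms of a SMILES string."""
    total = 0
    for atom in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(atom)
        if match is None:
            continue  # neutral bracket atom, e.g. [nH]
        signs, digits = match.groups()
        sign = 1 if signs[0] == "+" else -1
        magnitude = int(digits) if digits else len(signs)
        total += sign * magnitude
    return total


def is_charge_balanced(il_smiles: str) -> bool:
    """Return True if the ion charges in a dot-separated IL SMILES sum to zero."""
    return sum(fragment_charge(part) for part in il_smiles.split(".")) == 0


# 1,3-dimethylimidazolium acetate, and a variant with a neutral acid instead of the anion.
print(is_charge_balanced("CN1C=C[N+](=C1)C.CC(=O)[O-]"))  # True
print(is_charge_balanced("CN1C=C[N+](=C1)C.CC(=O)O"))     # False
```

A check of this kind only flags candidates for correction; deciding which ion is wrong still requires looking at the structure.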
In addition, to further improve robustness and quantify predictive uncertainty, we trained multiple CATBoost models with different random seeds and applied the Virtual Ensembles (VE) method.54 Each model generates multiple VE predictions, and the aggregated mean and variance provide both accurate point estimates and calibrated prediction intervals for the IL properties. This ensemble–VE framework enhances the reliability of IL property predictions without the computational cost of training a large number of independent models, and it also helps to limit overfitting of the ML models. A schematic overview of the ML framework for predicting IL properties using various input featurization techniques is shown in Fig. 1c.
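As an illustration of how the aggregated mean and variance can be turned into an interval, the following sketch combines a handful of predictions for one IL. The numbers are hypothetical stand-ins for the outputs of the seeded models and their virtual ensemble members, the Gaussian multiplier of 1.96 is an assumption, and `aggregate_ensemble` is not part of CATBoost.

```python
from math import sqrt
from statistics import fmean, pvariance


def aggregate_ensemble(predictions, z=1.96):
    """Combine the predictions for one IL into a mean, a variance, and an interval.

    predictions: list of floats, one per (seed, virtual-ensemble member).
    z: Gaussian multiplier for the interval (1.96 for ~95%), assumed here.
    """
    mean = fmean(predictions)
    variance = pvariance(predictions, mu=mean)
    half_width = z * sqrt(variance)
    return mean, variance, (mean - half_width, mean + half_width)


# Hypothetical surface-tension predictions (mN/m) for a single IL.
predictions = [45.0, 46.0, 44.0, 45.0]
mean, variance, (low, high) = aggregate_ensemble(predictions)
print(f"mean={mean:.2f}, variance={variance:.2f}, interval=({low:.2f}, {high:.2f})")
```

The spread across members reflects model (epistemic) uncertainty only, which is why such intervals are usually calibrated against held-out data before being reported.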
Model performance was evaluated on the independent test set using root-mean-squared error (RMSE) and mean absolute error (MAE) to quantify the prediction error, along with the coefficient of determination (R2) to measure agreement between predicted and experimental IL property values. Lower RMSE and MAE together with a higher R2 indicate greater prediction accuracy and better model reliability. The final RMSE, MAE, and R2 values are reported for the testing dataset.
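For reference, the three metrics can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed here because the text does not spell them out, and the example values are hypothetical.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental and predicted values.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(rmse(experimental, predicted), mae(experimental, predicted), r2(experimental, predicted))
```

Because RMSE squares the residuals, a single large miss (as in the last point here) raises RMSE more than MAE, which is why the two are reported together.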
In this study, we did not use manual or random data splitting. Instead, we employed iSIM-based stratified, medoid, and outlier sampling techniques for data splitting, which cluster ILs using Morgan fingerprints and Tanimoto similarity and then draw representative points from each cluster. This acts as a scaffold-like split, reducing the chance that highly similar ILs (or duplicates in chemical space) dominate a single split and thereby mitigating leakage relative to random splitting. Further, the datasets are small for many IL properties (for example, surface tension: 370 ILs; toxicity: 312 ILs; activity of water: 168 ILs; viscosity: 967 ILs; and ionic conductivity: 414 ILs) and strongly imbalanced across IL classes. Imidazolium-based ILs constitute the major fraction (∼60%), with the remainder belonging to the pyridinium, pyrrolidinium, ammonium, phosphonium, and sulfonium families. To ensure that the model was trained across IL classes, we used the iSIM-based sampling technique to maintain structural diversity across the training, validation, and testing sets. This reduces bias toward the dominant classes and improves generalizability across IL families. The dataset was split into 70% for training, 10% for validation, and 20% for testing. The training set was used for model development, the validation set for fine-tuning and hyperparameter optimization, and the test set for evaluating the model on unseen data.

Fig. 2–6 show comparisons of the different featurization techniques for each of the properties examined. The metrics are tabulated in Table 2. The main conclusion from these plots is that CATBoost with the Mol2vec NLP featurization method outperforms the other approaches for each of the properties studied, with very high predictive power.
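The idea behind the cluster-wise 70/10/20 split described above can be sketched as follows. This is a simplified stand-in, not the iSIM procedure itself: cluster labels are assumed to be available already (here, hypothetical IL family names), members are drawn at random within each cluster rather than by medoid or outlier sampling, and the function names are illustrative.

```python
import random
from collections import defaultdict


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not bits_a and not bits_b:
        return 1.0
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def stratified_split(cluster_labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices so that every cluster contributes to each subset."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)

    train, valid, test = [], [], []
    for members in by_cluster.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        train += members[:n_train]
        valid += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Hypothetical, imbalanced dataset dominated by one IL family.
labels = ["imidazolium"] * 60 + ["pyridinium"] * 20 + ["phosphonium"] * 20
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Splitting within clusters keeps the minority families represented in the test set, which a purely random split of a small, imbalanced dataset does not guarantee.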
| IL property | Feature | Data set | R2 | MAE | RMSE |
|---|---|---|---|---|---|
| σ, mN m−1 | Atom count | Training | 0.974 | 0.921 | 1.520 |
| σ, mN m−1 | Atom count | Testing | 0.884 | 1.865 | 2.601 |
| σ, mN m−1 | Morgan FP | Training | 0.980 | 0.950 | 1.355 |
| σ, mN m−1 | Morgan FP | Testing | 0.880 | 1.986 | 2.644 |
| σ, mN m−1 | Sigma profiles | Training | 0.992 | 0.526 | 0.863 |
| σ, mN m−1 | Sigma profiles | Testing | 0.951 | 1.057 | 1.668 |
| σ, mN m−1 | Mol2vec | Training | 0.999 | 0.105 | 0.170 |
| σ, mN m−1 | Mol2vec | Testing | 0.990 | 0.407 | 0.755 |
| ln(η), mPa s | Atom count | Training | 0.970 | 0.193 | 0.301 |
| ln(η), mPa s | Atom count | Testing | 0.928 | 0.235 | 0.363 |
| ln(η), mPa s | Morgan FP | Training | 0.991 | 0.114 | 0.167 |
| ln(η), mPa s | Morgan FP | Testing | 0.974 | 0.148 | 0.218 |
| ln(η), mPa s | Sigma profiles | Training | 0.996 | 0.069 | 0.111 |
| ln(η), mPa s | Sigma profiles | Testing | 0.978 | 0.115 | 0.201 |
| ln(η), mPa s | Mol2vec | Training | 0.998 | 0.049 | 0.080 |
| ln(η), mPa s | Mol2vec | Testing | 0.987 | 0.084 | 0.151 |
| κ, S m−1 | Atom count | Training | 0.989 | 0.121 | 0.185 |
| κ, S m−1 | Atom count | Testing | 0.958 | 0.149 | 0.255 |
| κ, S m−1 | Morgan FP | Training | 0.993 | 0.100 | 0.149 |
| κ, S m−1 | Morgan FP | Testing | 0.969 | 0.140 | 0.220 |
| κ, S m−1 | Mol2vec | Training | 0.997 | 0.059 | 0.095 |
| κ, S m−1 | Mol2vec | Testing | 0.987 | 0.087 | 0.142 |
| ρ, kg m−3 | Atom count | Training | 0.993 | 6.689 | 14.267 |
| ρ, kg m−3 | Atom count | Testing | 0.995 | 6.119 | 12.201 |
| ρ, kg m−3 | Morgan FP | Training | 0.994 | 6.320 | 12.928 |
| ρ, kg m−3 | Morgan FP | Testing | 0.993 | 6.997 | 13.815 |
| ρ, kg m−3 | Mol2vec | Training | 0.999 | 2.067 | 5.952 |
| ρ, kg m−3 | Mol2vec | Testing | 0.999 | 2.246 | 5.742 |
| log10 EC50 | Mol2vec | Training | 0.935 | 0.197 | 0.261 |
| log10 EC50 | Mol2vec | Testing | 0.880 | 0.315 | 0.376 |
| Tm, K | Mol2vec | Training | 0.946 | 14.39 | 18.57 |
| Tm, K | Mol2vec | Testing | 0.720 | 30.43 | 39.87 |
| γWIL | Mol2vec | Training | 0.999 | 0.0034 | 0.008 |
| γWIL | Mol2vec | Testing | 0.990 | 0.017 | 0.063 |
Fig. 2(a–d) illustrates the correlation between experimental and predicted surface tension of ILs in the training, validation, and testing sets for the different featurization techniques using the CATBoost model. The parity plots show that the ML model performs relatively poorly on the test sets when using atom-count, sigma-profile, and Morgan FP features, yielding relatively low R2 (0.880–0.951) and high RMSE (1.668–2.644 mN m−1), even though CATBoost performs well on the training sets with these features. CATBoost with Mol2vec featurization demonstrates excellent predictive performance on both the training and test sets, achieving a high R2 value of 0.990 and a low RMSE of 0.755 mN m−1 on the test set (see Table 2). To further assess model reliability, Virtual Ensemble (VE)-based uncertainty quantification was applied, with 95% prediction intervals (PIs) shown as vertical error bars in the plots. Among the featurization techniques, the Mol2vec-based model exhibits the narrowest and most consistent uncertainty bounds, indicating high predictive capability and lower epistemic uncertainty; it also achieved 95% VE–PI coverage after applying a calibration scaling factor. Lemaoui et al. (2024)59 developed ML and deep learning (DL) models to predict the surface tension of ILs using σ-profiles, and the performance of their DL model is comparable to that of our models based on the Morgan FP and atom-count featurization techniques. In summary, based on the ML performance, the ranking of these featurization techniques is as follows: Morgan FP < atom count < σ-profile < Mol2vec. To further illustrate model performance, Fig. S2(a–d) presents the residual deviations between the experimental and ML-predicted surface tensions. Mol2vec exhibits the lowest residual deviations, confirming that this model provides the most accurate predictions with the smallest errors.
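The notion of PI coverage and of a calibration scaling factor can be made concrete with a short sketch. It shows one common way to choose such a factor (the smallest multiplier that brings empirical coverage on a held-out set up to the target); whether the authors used exactly this rule is not stated, so the procedure, the function names, and the data are all assumptions.

```python
from math import ceil


def coverage(y_true, means, sds, scale=1.0, z=1.96):
    """Fraction of experimental values inside mean +/- scale * z * sd."""
    hits = sum(abs(y - m) <= scale * z * s for y, m, s in zip(y_true, means, sds))
    return hits / len(y_true)


def calibration_scale(y_true, means, sds, target=0.95, z=1.96):
    """Smallest scale factor whose intervals reach the target coverage."""
    ratios = sorted(abs(y - m) / (z * s) for y, m, s in zip(y_true, means, sds))
    n_required = ceil(target * len(ratios))  # points that must fall inside
    return ratios[n_required - 1]


# Hypothetical held-out data: experimental values, ensemble means, ensemble SDs.
experimental = [10.0, 20.0, 30.0, 40.0]
ensemble_mean = [10.5, 19.0, 30.2, 43.0]
ensemble_sd = [0.5, 0.5, 0.5, 0.5]

raw = coverage(experimental, ensemble_mean, ensemble_sd, z=2.0)
scale = calibration_scale(experimental, ensemble_mean, ensemble_sd, z=2.0)
calibrated = coverage(experimental, ensemble_mean, ensemble_sd, scale=scale, z=2.0)
print(raw, scale, calibrated)
```

A scale factor well above 1 signals that the raw ensemble spread underestimates the true error, so narrow uncalibrated intervals should not be read as evidence of accuracy on their own.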
Fig. 2(e–h) illustrates the correlation between experimental and CATBoost-predicted IL viscosity in the training, validation, and testing sets for the different featurization methods. Apart from atom count (Fig. 2e), all the investigated featurization techniques demonstrated excellent predictive accuracy. Among these, Mol2vec outperformed Morgan FPs, achieving higher R2 values and lower RMSE and MAE, and this model also improved IL viscosity prediction relative to recently published models.11,59 For example, Lemaoui et al. (2024)59 developed a DL model using COSMO-RS-derived σ-profiles to predict IL viscosity, but its performance (R2 = 0.907 and RMSE = 0.477 mPa s) was somewhat weaker than that of the present study. The standard deviation (SD) of the viscosities predicted using Mol2vec is also the closest to the SD of the experimental IL viscosities. To further analyze prediction errors, residuals (the difference between experimental and predicted values) were plotted against experimental viscosity (Fig. S2(e–h)). The residual plots indicate that Mol2vec again yields lower residual deviations than the other featurization approaches.
We also developed predictive CATBoost models for the ionic conductivity of ILs using three featurization techniques: atom count, Morgan FPs, and Mol2vec (Fig. 3(a–c)). For all featurization techniques, the correlation between experimental and ML-predicted ionic conductivities is excellent, but the Mol2vec-based model again performed marginally better. Table 2 presents the performance metrics for the investigated models, ranked by predictive accuracy. The Mol2vec-based model, with the highest performance, has R2 = 0.987, MAE = 0.087 S m−1, and RMSE = 0.142 S m−1 on the test set. In comparison, Chen et al. (2024)60 developed two ML models using σ-profile featurization for predicting IL ionic conductivity, but their models exhibited relatively weak performance with an R2 of 0.77 on the test set. Recently, Song et al. (2024)36 developed four ML models to predict ionic conductivity using graph neural network (GNN) featurization and reported excellent predictive accuracy, with high R2 values of 0.97–0.99 and low RMSEs on the test set. Our model using the NLP-based Mol2vec featurization compares favorably with these previous attempts to predict ionic conductivity, demonstrating its effectiveness in capturing structure–property relationships.
The predictive performance of the CATBoost model using three different featurization techniques for IL density prediction is depicted in Fig. 3(d–f). The NLP-based featurization technique exhibits a strong correlation with experiments. In contrast, the Morgan FP and atom-count techniques showed relatively weaker predictive performance, yielding R2 and RMSE values of 0.99 and 12.20–13.81 kg m−3, respectively, on the test set. Among the three approaches, the Mol2vec featurization demonstrated the highest accuracy, achieving R2, MAE, and RMSE values of 0.999, 2.25 kg m−3, and 5.74 kg m−3, respectively. Fig. S3(d–f) evaluates the relative deviation distribution for the Mol2vec model, revealing that most data points fall within a ±200 kg m−3 range, predominantly concentrated between ±100 kg m−3. This distribution suggests minimal prediction bias compared to the other two featurization techniques.

Finally, the toxicity (log10 EC50) of ionic liquids (leukemia rat cell line IPC-81) was also predicted. For IL toxicity, the ML model uses only structural information encoded through Mol2vec embeddings generated from canonical SMILES. These embeddings, which capture both local and global substructural chemical features, are used as input to train ML models that predict experimental EC50 values, and the results are presented in Fig. 4. Toxicity data for 312 ILs were sourced from Zhong et al. (2024),61 in which the log10 EC50 values range from −0.24 to 4.9. The ML model again demonstrated good predictive accuracy, with R2 = 0.880 and RMSE = 0.376 on the test set. Its R2 is slightly higher than that of the model reported by Zhong et al. (2024)61 (R2 = 0.859). The vertical error bars in Fig. 4 represent 95% prediction intervals, capturing the spread of model predictions. Notably, the PI widths are relatively large for certain ILs, indicating higher uncertainty in toxicity predictions compared to other properties.
As discussed, Mol2vec features are generated by summing the vectors of molecular substructures identified from SMILES strings, which provide a chemically meaningful yet highly abstract representation. Unlike traditional featurization techniques (e.g., σ-profiles, Morgan FPs, and atom count), understanding the importance of functional groups or local chemical environments in Mol2vec-based models is challenging.
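The summation step can be illustrated with a toy example. In the sketch below, the substructure identifiers and the three-dimensional vectors are made up for illustration (real Mol2vec identifiers are derived from Morgan substructures and the vectors are learned), unseen substructures are assumed to contribute a zero vector, and the ion pair is embedded by summing over the substructures of both ions, which is an assumption about how the two ions are combined.

```python
def embed_molecule(substructure_ids, embedding_table, dim):
    """Sum the embedding vectors of a molecule's substructure identifiers."""
    vector = [0.0] * dim
    for sub_id in substructure_ids:
        # Unseen substructures contribute a zero vector in this sketch.
        sub_vec = embedding_table.get(sub_id, [0.0] * dim)
        vector = [v + s for v, s in zip(vector, sub_vec)]
    return vector


# Hypothetical 3-dimensional embeddings keyed by made-up substructure IDs.
toy_table = {
    "ring_N+": [1.0, 0.0, 2.0],
    "methyl": [0.5, 0.5, 0.0],
    "carboxylate": [0.0, 1.0, -1.0],
}

cation_subs = ["ring_N+", "methyl", "methyl"]
anion_subs = ["carboxylate", "methyl"]

# Embed the ion pair by summing over the substructures of both ions.
il_vector = embed_molecule(cation_subs + anion_subs, toy_table, dim=3)
print(il_vector)
```

Because each component of the final vector mixes contributions from many substructures, an importance score assigned to a component cannot be traced back to a single functional group, which is the interpretability difficulty noted above.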
We also performed an error analysis to identify the structural classes responsible for the largest deviations in melting temperature predictions. We observed that ILs containing sulfonate (O=S(=O)([O−])) and sulfone (O=S=O) anions exhibited higher residuals when combined with piperidinium, pyrrolidinium, and imidazolium-based cations. Furthermore, we retrained the CATBoost model by augmenting the original Mol2vec embeddings with two classes of physically meaningful descriptors: (i) atom-count features (11 elemental descriptors) and (ii) RDKit-derived molecular descriptors capturing properties such as topology, polarity, and hydrogen-bonding capacity. This hybrid input featurization results in slightly improved performance, increasing R2 from 0.713 to 0.758 and reducing the MAE and RMSE from 30.43 K and 39.87 K to 27.64 K and 36.85 K, respectively (see Fig. S4).
However, despite these refinements, the accuracy of the regression model leaves significant room for improvement. We therefore explored a classification-based approach using the Mol2vec features and the CATBoost method. The ILs were categorized into two classes based on their Tm values: those with Tm below 300 K were classified as “liquid”, and those above 300 K were classified as “solid”. As shown in Fig. 5(b–d), we evaluated the classification model's performance using metrics such as accuracy, the confusion matrix, and the ROC/AUC curve. On the test set, the model achieves an accuracy of 0.844 and an ROC curve with an AUC of 0.88 (Fig. 5b), and the confusion matrix (Fig. 5b) indicates robust classification performance. Additionally, precision, recall, and F1-score metrics were evaluated (Fig. 5d), demonstrating balanced performance across the two classes and underscoring the reliability of the predictions. Relative to the regression model, the classification model is the more reliable of the two, which can be attributed to the simpler prediction task: the model only has to place each IL on one side of a threshold rather than reproduce its exact melting temperature. The classification approach therefore offers a more dependable prediction of the phase behavior of ILs at a given temperature, providing valuable insights for applications where knowing whether an IL is liquid or solid is critical. We also developed a classification model using hybrid featurization (Mol2vec, atom counts, and RDKit molecular descriptors); however, its performance was weaker than that of the model using only Mol2vec features.
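The 300 K labeling rule and the reported classification metrics can be sketched as follows. The melting temperatures are hypothetical, an IL with Tm exactly at 300 K is assigned to “solid” (an assumption, since the text does not specify this case), and for simplicity the predicted labels are obtained by thresholding predicted temperatures rather than from a trained classifier.

```python
def label_phase(tm_kelvin, threshold=300.0):
    """Label an IL as 'liquid' if its melting temperature is below the threshold."""
    return "liquid" if tm_kelvin < threshold else "solid"


def binary_metrics(y_true, y_pred, positive="liquid"):
    """Accuracy, precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical experimental and predicted melting temperatures (K).
tm_experimental = [250.0, 280.0, 310.0, 350.0, 295.0, 400.0]
tm_predicted = [260.0, 305.0, 320.0, 290.0, 270.0, 380.0]

true_labels = [label_phase(t) for t in tm_experimental]
pred_labels = [label_phase(t) for t in tm_predicted]
print(binary_metrics(true_labels, pred_labels))
```

In this toy case two of the six ILs are misclassified even though several temperature errors exceed 20 K, which illustrates why a two-class task can be easier than regression: only errors that cross the threshold count.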
We employed the Mol2vec featurization method to predict γWIL. Fig. 6b illustrates the correlation between ML predicted and experimental γWIL values, demonstrating excellent predictive performance on the test set, with low RMSE and MAE values of 0.063 and 0.017, respectively, and a high R2 value of 0.99. In comparison, Paduszyński (2016)65 developed various traditional ML models to predict the activity of molecular solvents in ILs, achieving an RMSE of 0.205 with a feed-forward neural network (FFNN), which was less accurate than our NLP-based approach. Further, we also explored a classification approach using the same Mol2vec featurization in combination with the CATBoost method to categorize the ILs. For this, the ILs were divided into two classes based on their γWIL values: those with γWIL below 1 were classified as “hydrophilic”, and those above 1 were classified as “hydrophobic”. Fig. 6(c and d) shows a comprehensive evaluation of the classification model's performance using metrics such as accuracy, ROC/AUC curves, and the confusion matrix. On the test set the model achieves an accuracy of 0.997 with an ROC curve yielding an AUC value of 1.0, signifying excellent performance. The confusion matrix (Fig. 6c) indicates a robust classification performance with accurate predictions. Moreover, the precision, recall, and F1-score metrics show balanced performance across both classes, reinforcing the reliability of the predictions (Fig. 6d).
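To make the AUC figure easier to interpret, the sketch below computes it directly from its probabilistic definition: the chance that a randomly chosen “hydrophobic” IL receives a higher score than a randomly chosen “hydrophilic” one, with ties counted as one half. The activity values and classifier scores are hypothetical, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning more likely positive.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical activity values; values above 1 are labeled hydrophobic (1).
gamma_values = [0.2, 0.8, 1.5, 3.0]
labels = [1 if g > 1.0 else 0 for g in gamma_values]
scores = [0.1, 0.6, 0.4, 0.9]  # hypothetical classifier scores
print(roc_auc(labels, scores))
```

An AUC of 1.0 therefore means that every hydrophobic IL in the test set was scored above every hydrophilic one, which is a statement about ranking and does not by itself fix the decision threshold.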
| IL property | Model | Featurization | No. of ILs | Data points | R2 | RMSE | Reference |
|---|---|---|---|---|---|---|---|
| σ, mN m−1 | XGBoost | σ-Profiles | 360 | 2524 | 0.963 | 1.716 | Mohan et al.30 |
| σ, mN m−1 | DL | σ-Profiles | 579 | 6599 | 0.931 | 2.251 | Lemaoui et al.59 |
| σ, mN m−1 | CATBoost | Mol2vec | 370 | 2663 | 0.990 | 0.755 | Present study |
| ln(η), mPa s | QSPR | Norm indexes | 832 | 9238 | 0.910 | — | Liu et al.66 |
| ln(η), mPa s | CATBoost | σ-Profiles | 967 | 11 721 | 0.984 | 0.200 | Mohan et al.11 |
| ln(η), mPa s | DL | σ-Profiles | 2026 | 25 243 | 0.907 | 0.477 | Lemaoui et al.59 |
| ln(η), mPa s | CATBoost | Mol2vec | 967 | 11 721 | 0.987 | 0.151 | Present study |
| κ, S m−1 | XGBoost | GNN | 414 | 5700 | 0.988 | 0.180 | Song et al.36 |
| κ, S m−1 | XGBoost | σ-Profiles | 242 | 2168 | 0.870 | — | Chen et al.60 |
| κ, S m−1 | CATBoost | Mol2vec | 414 | 5700 | 0.987 | 0.142 | Present study |
| ρ, kg m−3 | DL | σ-Profiles | 2100 | 40 860 | 0.993 | 14.360 | Lemaoui et al.59 |
| ρ, kg m−3 | CATBoost | Mol2vec | 1687 | 52 278 | 0.999 | 5.742 | Present study |
| log10 EC50 | CATBoost | Mol2vec | 332 | 332 | 0.880 | 0.376 | Present study |
| log10 EC50 | CATBoost | C-MF | 332 | 332 | 0.859 | 0.338 | Zhong et al.61 |
| log10 EC50 | SVM | 2D descriptors | 355 | 355 | 0.927 | 0.288 | Wang et al.82 |
| log10 EC50 | MLR | 2D descriptors | 304 | 304 | 0.77 | 0.51 | Sosnowska et al.83 |
| Tm, K | GNN | GNN | 3080 | 3080 | 0.760 | 37.06 | Feng et al.42 |
| Tm, K | Transformer CNN | SMILES | 3073 | 3073 | 0.66 | 45.0 | Makarov et al.63 |
| Tm, K | DL | σ-Profiles | 1145 | 1145 | 0.875 | 17.45 | Lemaoui et al.59 |
| Tm, K | CATBoost | Mol2vec | 3080 | 3080 | 0.720 | 39.87 | Present study |
| γWIL | CATBoost | Mol2vec | 168 | 3578 | 0.990 | 0.063 | Present study |
| γWIL | FFANN | GC | 53 | 399 | 0.921 | 0.205 | Paduszyński65 |
| γWIL | LSSVM | Critical features | 53 | 318 | 0.999 | 0.018 | Benimam et al.84 |
Liu et al. (2023),66 Mohan et al. (2024),11 and Lemaoui et al. (2024)59 developed ML models to predict IL viscosities at different temperatures and pressures. Liu et al. (2023)66 and Lemaoui et al. (2024)59 developed ML models based on the norm index and σ-profile featurization techniques and reported an R2 of 0.91 and an RMSE of 0.477 mPa s, respectively, i.e., lower predictive performance than in our study. It is also important to mention that Lemaoui et al. (2024)59 compiled a dataset for surface tension and viscosity that is 2–2.6 times larger than that of the present study. This increase in data size arises because the authors did not remove duplicate and triplicate IL data points. The inclusion of these redundant data points, which had larger experimental deviations, together with an inability to generate stable IL conformers, resulted in the developed model underperforming in predicting surface tension and viscosity.59 Chen et al. (2023)67 developed an IL transfer learning of representations model (ILTransR) based on a language model to predict IL viscosities. However, this also exhibited weaker predictive performance (MAE = 0.17 mPa s) than the present study (MAE = 0.09 mPa s). Furthermore, norm index-based models, which reduce ILs to global dimensionless parameters such as size (NS), polarity (NP), and symmetry (NQ), have also been used to predict temperature-dependent IL properties such as density,66 viscosity,66 and surface tension.68 Although these indices provide simple and computationally inexpensive descriptors, they compress complex structural information into a few coarse values. This coarse-grained representation overlooks the influence of local substructures, branching, charge localization, and specific cation–anion interactions that strongly affect dynamic properties like ionic conductivity.
Song et al. (2024)36 employed a deep learning-based GNN featurization approach to predict the ionic conductivity of 414 ILs across various temperature and pressure conditions, achieving high R2 and low RMSE values (Table 3), indicating excellent predictive performance. Similarly, Chen et al. (2024)60 developed several ML models to predict IL ionic conductivity using σ-profiles as input features; however, this approach yielded poorer performance, with an R2 value of 0.773. In contrast, our NLP-based featurization technique resulted in excellent predictive capability for IL ionic conductivity, as evidenced by high R2 and lower RMSE values (Table 3). In our previous work, we demonstrated that COSMO-RS-derived σ-profiles can indeed yield accurate ML models for IL properties such as viscosity, surface tension, and speed of sound.11,29,30 In those studies, we performed a conformer analysis of the ILs and used their most stable conformers for input featurization. In contrast, Chen et al.60 reported weaker performance for ionic conductivity when using σ-profiles, partly because their approach did not explicitly include conformer analysis, a key limitation of σ-profile featurization. Sigma profiles and other quantum chemistry-derived features are highly sensitive to the molecular conformer: even a small conformational change can lead to large deviations in the electronic descriptors, and thus in the ML predictions.
Our Mol2vec-based CATBoost models improve IL property predictions because they avoid dependence on conformers and also capture molecular details differently. Mol2vec embeddings represent molecules through their local substructures and chemical context (such as headgroup type, chain branching, and anion functionalities), which are important for ion mobility and other IL properties. Unlike physics-derived descriptors, Mol2vec encodes cation and anion in the same vector space, allowing the ML model to capture patterns related to their combined structural features without the need for manual feature engineering.
Lemaoui et al. (2024)59 developed nine ML models for predicting IL density using σ-profiles as inputs, achieving relatively poor performance compared to our approach, with an RMSE value of 14.36 kg m−3. In recent years, various IL melting temperature prediction models have been developed, employing diverse methodologies with varying degrees of success. Feng et al. (2024)42 and Makarov et al. (2022)63 utilized deep learning techniques based on GNNs and CNNs to predict Tm. Feng et al. (2024)42 obtained an R2 value of 0.76 and an RMSE of 37.06 °C, whereas Makarov et al. (2022)63 achieved an R2 value of 0.66 and an RMSE of 45 K. Additionally, Lemaoui et al. (2024)59 developed a deep learning model for the Tm of 1145 ILs with σ-profiles as input features, and reported an R2 value of 0.875 and an RMSE of 17.45 K. In comparison, our study covers more Tm data, encompassing 3080 unique ILs with large structural and chemical diversity. This extensive dataset allows our model to capture a wider range of structural variations and to predict melting temperatures with greater accuracy and generalizability.
To analyze relationships among IL properties, we calculated a correlation matrix between the properties; the results are shown in Fig. 8. It is interesting to note that density is positively correlated with surface tension and toxicity, with Pearson correlation coefficients of 0.62–0.63. As proposed by MacLeod,74 surface tension is directly proportional to the density, reflecting stronger cohesive interactions among tightly packed ions. As the temperature increases, molecules move farther apart, weakening intermolecular forces. According to van der Waals, these forces decrease with the fourth power of intermolecular distance.74 As a result, the surface tension of a liquid is a function of the distance between the molecules and thus depends on density. A negative correlation was observed between viscosity and ionic conductivity, in line with the Stokes–Einstein equation, indicating a strong relationship between these two properties.75 Further, the molar volume of ILs shows a negative correlation with surface tension. This observation is in line with experiments: as the alkyl chain length of an IL increases, the surface tension decreases,29 and a longer alkyl chain corresponds to a higher molar volume. Gardas and Coutinho proposed an inverse relationship between surface tension and molar volume in ILs.76 In addition, a positive correlation is observed between surface tension and toxicity, with a Pearson correlation coefficient of 0.57. Higher surface tension indicates strongly interacting ions that can disrupt biological membranes and increase toxicity.
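A property correlation matrix of this kind can be sketched in a few lines. The example below is a minimal, dependency-free illustration of the Pearson coefficient applied pairwise to property columns; the function names and the toy numbers are hypothetical and are not taken from the datasets used in this work.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(table):
    """Pairwise Pearson coefficients for a dict of property columns."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}


# Hypothetical toy values for five ILs (illustration only).
toy_properties = {
    "density": [1.05, 1.10, 1.20, 1.30, 1.45],
    "surface_tension": [31.0, 30.0, 36.0, 40.0, 47.0],
    "ln_viscosity": [5.1, 4.2, 4.6, 3.9, 4.4],
}

matrix = correlation_matrix(toy_properties)
for row_name, row in matrix.items():
    print(row_name, {col: round(r, 2) for col, r in row.items()})
```

The matrix is symmetric with ones on the diagonal, so in practice only the upper or lower triangle needs to be inspected or plotted.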
For case studies, ILs were screened for their potential applications in lignocellulosic biomass processing, carbon capture, and electrolytes for lithium- and sodium-ion batteries. Table S3 summarizes the desirable IL property ranges required for effective biomass pretreatment, high CO2 capture, and use as an electrolyte in battery research. For lignocellulosic biomass pretreatment, the key characteristics of ILs include low viscosity, moderate surface tension and ionic conductivity, being liquid at room temperature, and low toxicity. Therefore, based on Table S3, the optimal design criteria for ILs in biomass applications are as follows: ln(η) < 5 mPa s; σ: 30–45 mN m−1; κ < 0.8 S m−1; log10 EC50 > 2.1; hydrophilic IL; synthesizability score (SA score) < 6; and Tm < 298.15 K (liquid). Based on these criteria, 1937 ionic liquids were initially identified, and 206 of these were selected on the basis of polarity for further evaluation using COSMO-RS calculations. For this, COSMO-RS was employed to calculate the logarithmic activity coefficient (ln(γ)) and excess enthalpy (HE, kJ mol−1) of lignin, cellulose, and hemicellulose in ILs at 298.15 K (Fig. S7–S9); details of the COSMO-RS methodology used for calculating ln(γ) and HE can be found in our previous publications.4,15,77,78 These two properties are critical, as low ln(γ) and HE indicate strong interactions between the IL and lignin, enhancing lignin solubility and facilitating effective biomass fractionation. First, ln(γ) and HE were calculated for well-studied ILs from the literature, including 1-ethyl-3-methylimidazolium acetate ([EMIM][OAc]), 1-ethyl-3-methylimidazolium chloride ([EMIM]Cl), 1-allyl-3-methylimidazolium chloride ([AMIM]Cl), and choline lysinate ([Ch][Lys]).4,79–81 The corresponding COSMO-RS values are provided in Table S4. We primarily focused on lignin solubility because complete lignin removal remains the main challenge for effective fractionation, and solvents capable of dissolving lignin are urgently needed. Based on these COSMO-RS results, an additional selection criterion for ILs targeting lignin solvation was established: ln(γ)lignin < −8 and HE < −5.0 kcal mol−1. From the set of 206 ILs, 57 ILs with lower ln(γ) and HE for lignin and the other stated desirable IL properties were thus retained (Fig. S7). Notably, phosphonium-based ILs dominate this pool of promising candidates. The top-predicted ILs, exhibiting lower ln(γ) and HE than the literature-reported ILs, include: tetraethylphosphonium (E)-2-methylbut-2-enoate, tetramethylphosphonium pyrrolidine-2-carboxylate, triethyl(2-methoxyethyl)phosphonium (E)-2-methylbut-2-enoate, (1,3-dimethylimidazolidin-2-ylidene)-methylazanium propanoate, and trimethyl(propyl)phosphonium acetate (Fig. 9a).
For carbon capture, reference properties of ILs were obtained from the literature and are listed in Table S3. The key selection criteria are: ln(η) < 4.5 mPa s; σ < 45 mN m−1; κ: 0.1–0.5 S m−1; hydrophilic IL; SA score < 6; and liquid at ambient conditions. Based on these criteria, 3986 ILs were retained. A number of interesting ILs were found based on the (methylsulfonyl)acetonitrile ([MSA]−) anion. Recent work by Qiu et al. (2024)69 explored the potential of [MSA]-based ILs for CO2 capture, demonstrating a cascade insertion mechanism of two CO2 molecules via consecutive C–C and O–C bond formation with [MSA]−. In our screening, [MSA]− formed ILs with various cations that exhibit desirable properties for CO2 capture. Furthermore, cyano (–C≡N), carboxylate, borate, and Tf2N-derived ILs are also potential candidates for CO2 capture. The chemical structures of a few selected ILs are illustrated in Fig. S10.
Finally, ILs were also assessed for their potential suitability in lithium- and sodium-ion batteries. The conventional Li-ion battery electrolyte, LP30, consists of 1 M LiPF6 in a 1:1 mixture of ethylene carbonate (EC) and dimethyl carbonate (DMC), with an ionic conductivity of 1.26 at 298.15 K. To serve as a viable electrolyte additive or replacement for LP30, ILs must exhibit an ionic conductivity greater than 1.5 S m−1. The criteria for selecting ILs as electrolytes thus comprised: ln(η) < 5 mPa s; κ > 1.5 S m−1; log10 EC50 > 2.0; SA score < 6; and liquid at room temperature. Based on these parameters, 117 IL candidates were identified. The top predicted ILs share dicyanamide anions paired with imidazolium, triazolium, pyridinium, ammonium, and sulfonium cations, and the chemical structures of the top six are depicted in Fig. 9b.
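The three sets of screening criteria described above can be written as simple filters over a table of predicted properties. The sketch below is illustrative only: the field names, the candidate records, and their property values are hypothetical, the σ and κ ranges are treated as inclusive, and "liquid" is interpreted as Tm < 298.15 K for all three applications, which is an assumption for the carbon capture and battery cases.

```python
ROOM_T = 298.15  # K; "liquid" is interpreted as Tm below this value (assumption)

# One predicate per application, using the thresholds quoted in the text.
# Keys: ln_eta (ln viscosity), sigma (mN/m), kappa (S/m), log_ec50,
# hydrophilic (bool), sa (synthesizability score), tm (K).
CRITERIA = {
    "biomass": lambda il: (
        il["ln_eta"] < 5 and 30 <= il["sigma"] <= 45 and il["kappa"] < 0.8
        and il["log_ec50"] > 2.1 and il["hydrophilic"]
        and il["sa"] < 6 and il["tm"] < ROOM_T
    ),
    "co2_capture": lambda il: (
        il["ln_eta"] < 4.5 and il["sigma"] < 45 and 0.1 <= il["kappa"] <= 0.5
        and il["hydrophilic"] and il["sa"] < 6 and il["tm"] < ROOM_T
    ),
    "battery": lambda il: (
        il["ln_eta"] < 5 and il["kappa"] > 1.5 and il["log_ec50"] > 2.0
        and il["sa"] < 6 and il["tm"] < ROOM_T
    ),
}


def screen(candidates, application):
    """Return the names of candidates that satisfy every criterion."""
    keep = CRITERIA[application]
    return [il["name"] for il in candidates if keep(il)]


# Hypothetical predicted properties for three made-up candidates.
candidates = [
    {"name": "IL-A", "ln_eta": 4.0, "sigma": 38.0, "kappa": 0.3,
     "log_ec50": 2.5, "hydrophilic": True, "sa": 3.0, "tm": 280.0},
    {"name": "IL-B", "ln_eta": 3.0, "sigma": 50.0, "kappa": 2.0,
     "log_ec50": 2.3, "hydrophilic": False, "sa": 4.0, "tm": 260.0},
    {"name": "IL-C", "ln_eta": 6.0, "sigma": 40.0, "kappa": 0.2,
     "log_ec50": 3.0, "hydrophilic": True, "sa": 2.5, "tm": 270.0},
]

for application in CRITERIA:
    print(application, screen(candidates, application))
```

Keeping each application's thresholds in one place makes it straightforward to tighten or relax a single criterion and see how the size of the retained pool changes.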
In summary, our ML model with NLP featurization offers a reliable and efficient tool for predicting desirable IL properties and enables large-scale high-throughput screening, paving the way for precision carbon capture and energy storage applications. In future studies, we will expand the ML framework to include additional physicochemical properties, facilitating the design and generation of optimal ILs for diverse research applications, and will experimentally validate the ML-predicted results.
Furthermore, we generated ∼10.61 million novel ILs by systematically combining 7200 cations and 1474 anions, and predicted their seven critical physicochemical properties using Mol2vec featurization with the pre-trained ML models. Notably, among the predicted properties, surface tension exhibited the strongest positive linear correlation with IL toxicity, with a Pearson correlation coefficient of 0.81. Further, density shows a positive correlation with both surface tension and toxicity, with correlation coefficients ranging from 0.66 to 0.68; this observation is in line with the theoretically proposed MacLeod equation, which establishes a direct proportionality between surface tension and density.74 Finally, we demonstrated the capability of the Mol2vec-based ML model in high-throughput screening of ILs for specific topical research applications, identifying promising IL candidates with desirable properties for lignocellulosic biomass processing, CO2 capture, and electrolytes. These findings highlight the advantages of data-driven NLP approaches and pave the way for their integration into experimental high-throughput screening pipelines for chemical and materials discovery.
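The size of the generated library follows directly from the pairing scheme: 7200 × 1474 = 10 612 800, i.e. ∼10.61 million cation–anion combinations. A minimal sketch of such an exhaustive pairing is shown below; the function name and the placeholder ion labels are hypothetical, and any feasibility filtering applied to the real library is not reproduced here.

```python
from itertools import islice, product


def enumerate_pairs(cations, anions):
    """Lazily yield every (cation, anion) combination without storing them all."""
    return product(cations, anions)


# Counts quoted in the text.
n_cations, n_anions = 7200, 1474
n_pairs = n_cations * n_anions
print(f"{n_pairs:,} cation-anion pairs")

# Placeholder labels stand in for the real ion structures.
cations = [f"cation_{i}" for i in range(n_cations)]
anions = [f"anion_{j}" for j in range(n_anions)]

# Peek at the first three pairs; the full product is never materialized.
first_pairs = list(islice(enumerate_pairs(cations, anions), 3))
print(first_pairs)
```

Iterating lazily keeps memory use flat, so property predictions can be made in batches rather than holding all ∼10.61 million pairs in memory at once.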
Supplementary information (SI): Relative deviations of ML models for the IL properties, IL property datasets, numbers of input features in the featurization techniques, COSMO-RS predicted activity coefficients and excess enthalpies of biomass in ILs, correlations between experimental vs. ML IL properties, and screened ILs for CO2 capture. See DOI: https://doi.org/10.1039/d5gc02803e.
This journal is © The Royal Society of Chemistry 2025