Open Access Article
Jiechen Guoab,
Yifan Chaiab,
Cancan Hongab,
Hao Liuab,
Lijing Xiea,
Tianle Wangab,
Jingpeng Chena,
Ge Songa,
Zonglin Yi*a and
Fangyuan Su*a
aShanxi Key Laboratory of Carbon Materials, Institute of Coal Chemistry, Chinese Academy of Sciences, Taiyuan, 030001, China. E-mail: sufangyuan@scxicc.ac.cn
bUniversity of Chinese Academy of Sciences, Beijing 100049, China
First published on 13th January 2026
Electrolytes with low melting points (MPs), high boiling points (BPs), and high dielectric constants (ε) can effectively mitigate performance degradation in lithium-ion batteries (LIBs) under low-temperature conditions. However, the lack of systematic experimental data on electrolyte properties poses a significant challenge to traditional design approaches. To address this limitation, we developed a machine learning workflow that integrates data acquisition using large language models, model construction, and interpretability analysis, aiming to predict key molecular properties, with a focus on MPs, BPs and ε. We constructed a multi-source database, LiElectroDB, that contains over 150 000 electrolyte molecules relevant to LIBs. The prediction models demonstrate strong performance across all three properties, achieving a coefficient of determination (R2) of 0.8864 and a root mean square error (RMSE) of 23.3 K for the MP and an R2 of 0.9608 and an RMSE of 14.3 K for the BP using the XGBoost algorithm, and an R2 of 0.8718 and an RMSE of 6.7 for ε using an artificial neural network. To further uncover structure–property relationships, t-SNE and SHAP are employed to analyze the molecular features contributing to thermal behavior at a microscopic level. Finally, by integrating molecular neighborhood search with high-throughput screening, nine candidate molecules are identified as promising low-temperature electrolytes for LIBs. This work provides an efficient and generalizable framework for the design of low-temperature electrolytes in LIBs.
Recently, machine learning (ML) has emerged as a powerful tool to accelerate material design and property prediction. In the field of LIBs, high-throughput experimentation and data-driven modelling enable the development of key components, such as cathodes,12,13 anodes,14,15 and electrolytes.16–18 The application of molecular simulation and ML to screen ideal electrolytes facilitates the advancement of next-generation low-temperature LIBs. In recent years, various ML algorithms have been employed—such as random forests, support vector machines, and neural networks—to predict key electrolyte properties including ionic conductivity,18–20 the electrochemical stability window,21–23 and viscosity.24–26 However, these purely data-driven approaches often suffer from limited generalization ability and the lack of interpretability. To address these limitations, researchers have proposed the knowledge–data dual-driven Knowledge-based Property prediction Integration (KPI) framework to predict the critical thermophysical properties of electrolyte molecules including melting points (MPs), boiling points (BPs), and flash points (FPs).27 KPI is specifically designed for applications requiring a wide temperature range and high-safety battery operation. KPI integrates expert domain knowledge with data-driven learning. This integration not only enhances predictive accuracy but also significantly improves model interpretability and generalization. Leveraging molecular neighbourhood search and high-throughput virtual screening, KPI successfully identified 29 promising electrolyte candidates with desirable safety and thermal properties, demonstrating its effectiveness in accelerating electrolyte discovery. However, due to the significant differences in the datasets used across various studies, the existing models still exhibit insufficient generalization capabilities across different datasets or under specific operating conditions, such as low-temperature environments (Fig. 1a).
Fig. 1 Overall workflow of machine learning. (a) Background. (b) Data collection. (c) Preparation of training sets. (d) Feature engineering. (e) Model training and validation.
In this work, we propose a modelling method that integrates multi-level chemical knowledge to achieve highly precise prediction of the key physical properties of low-temperature electrolytes, including MPs, BPs and ε. First, we constructed LiElectroDB, a structurally diverse LIB electrolyte database containing more than 150 000 molecules, assembled through multi-source data integration and LLM-assisted extraction. To fulfil the modelling requirements of different properties, XGBoost28 (XGB) and neural network-based modelling strategies are designed to uncover the relationships between molecular structures and target properties. By embedding multi-level chemical knowledge, including chemical composition, structural characteristics, and electronic descriptors, the models not only achieve high predictive accuracy but also exhibit enhanced interpretability. Additionally, by combining molecular neighbourhood search with high-throughput screening strategies, nine promising molecules are identified as low-temperature electrolyte candidates for LIBs. This workflow establishes a scalable paradigm for electrolyte discovery and demonstrates significant potential to accelerate molecular screening, reduce experimental costs, and provide actionable insights to guide future electrolyte design.
000 unique molecules. To ensure model stability and generalization, strategies such as data augmentation and feature engineering are introduced (Fig. 1d). Several ML models based on various algorithms are systematically evaluated. Through five-fold cross-validation, the performance of each model is compared on the same task. The XGB-based model is ultimately selected for predicting MPs and BPs, while an artificial neural network (ANN)31 is used for predicting ε (Fig. 1e). t-Distributed stochastic neighbour embedding (t-SNE)32 and Shapley additive explanations (SHAP) analyses are further applied to provide interpretability. By integrating molecular neighbourhood exploration with high-throughput screening, nine candidate molecules are identified as promising low-temperature electrolytes for LIB applications. This workflow covers data construction, preprocessing, model optimization, and result interpretation, with the goal of offering reliable data support and interpretable modelling strategies for molecular property prediction.
000 molecules, and is further extended to include common organic solvents and representative inorganic compounds based on literature data23,34–37 and external databases. Simplified molecular input line entry system (SMILES)38 strings are used as unique identifiers, and the PubChem API is employed to standardize molecular structures and eliminate duplicates. LLMs are also used to assist in extracting data from textual sources, with all retrieved information manually verified. This hybrid approach enables efficient scaling while maintaining data accuracy. Ultimately, the multi-source fusion database LiElectroDB is constructed (Fig. 1b), providing a structurally diverse foundation for downstream modelling.
To build a high-quality training set, we obtain property parameters such as MP, BP, and ε from authoritative sources, including NIST and PubChem. During data collection, SMILES strings serve as a unified primary key to retrieve structured chemical information via the PubChem API. An LLM (GPT-4-turbo) is further applied to extract physical properties (MPs, BPs, and ε) for molecules in LiElectroDB (Fig. 1c). All extracted data are manually annotated to ensure accuracy and serve as the basis for subsequent model training and evaluation. Information from SMILES strings is integrated as the primary input, and three types of molecular characteristics are additionally extracted using RDKit39 in Python 3.11.7: chemical composition, molecular structure, and electronic structure (Fig. 2a and Tables S1–S3).
Fig. 2 Feature engineering. (a) Feature extraction. (b) Pearson correlation heatmaps. (c) Feature importance ranking based on XGBoost.
To ensure physical plausibility, chemical diversity, and model stability, we restrict the molecular weight to 0–300 and the number of heavy atoms to 0–30 (Fig. S1). The final dataset includes 1251 MP data points, 1502 BP data points, and 895 ε data points. The resulting chemical space primarily consists of organic molecules containing C, H, O, N, S, and P, with diverse linear, cyclic, and aromatic scaffolds and a broad distribution of polar functional groups (ethers, carbonyls, amines, nitriles, sulfones, and alcohols). The property ranges—MP (50–450 K), BP (300–650 K), and ε (1–60)—are well aligned with electrolyte-relevant regimes, supporting reliable model training within the applicability domain.
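The size restriction described above can be sketched as a simple filter. This is a minimal illustration, assuming the molecular weight and heavy-atom count have already been computed for each molecule; the `Molecule` class, the `in_applicability_domain` helper, and the example entries are hypothetical and are not taken from the original workflow.

```python
from dataclasses import dataclass

@dataclass
class Molecule:
    smiles: str
    mol_weight: float   # molecular weight, assumed to be precomputed
    heavy_atoms: int    # number of non-hydrogen atoms, assumed to be precomputed

# Upper limits taken from the text: molecular weight up to 300, up to 30 heavy atoms.
MAX_MOL_WEIGHT = 300.0
MAX_HEAVY_ATOMS = 30

def in_applicability_domain(mol: Molecule) -> bool:
    """Return True if the molecule falls inside the size window used for modelling."""
    return (0 < mol.mol_weight <= MAX_MOL_WEIGHT
            and 0 < mol.heavy_atoms <= MAX_HEAVY_ATOMS)

# Hypothetical example entries (values rounded).
candidates = [
    Molecule("C1COCO1", 74.08, 5),   # 1,3-dioxolane
    Molecule("COCCOC", 90.12, 6),    # 1,2-dimethoxyethane
    Molecule("C" * 40, 563.1, 40),   # long alkane, outside both limits
]
kept = [m for m in candidates if in_applicability_domain(m)]
```

Here the two small ethers pass the filter, while the C40 alkane is rejected on both criteria.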
We note that certain sparsely sampled regions, such as highly fluorinated systems, large polycyclic aromatics, and molecules containing less common heteroatoms, carry higher predictive uncertainty. Molecules with extremely high polarity or very limited conformational flexibility are likewise under-represented. These constitute the limitations of our model. Overall, the curated structural and property diversity provides a robust basis for generalizable and physically meaningful predictions across conventional organic electrolyte candidates.
Kernel density estimation (KDE)40 analysis is conducted on the numerical distributions of MP, BP, and ε to verify that the dataset is divided rationally and remains representative. The MP and BP datasets are randomly divided into training and test sets at a ratio of 4:1, while the ε dataset is randomly divided into training, validation, and test sets at a ratio of 8:1:1 (Fig. S2–S4). The distribution curves of each subset are generally consistent with that of the overall dataset. Small differences appear in the smoothness and local peaks of the curves, but they are minimal, indicating that the partitioning retains the representativeness and diversity of the original data and avoids oversampling or bias. This consistency ensures that model training, validation, and testing are conducted under similar distribution conditions, which supports the robustness and generalization ability of the model.
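The random partitioning can be illustrated with a short standard-library sketch. The function name `split_indices`, the fixed seed, and the rounding rule are assumptions made for this example; the paper does not state how the split was implemented.

```python
import random

def split_indices(n, ratios, seed=0):
    """Shuffle range(n) and cut it into consecutive parts proportional to `ratios`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last part takes the remainder so that every index is used exactly once.
        end = n if i == len(ratios) - 1 else start + round(n * ratio / total)
        parts.append(indices[start:end])
        start = end
    return parts

# Dataset sizes from the text: 1251 MP points (4:1) and 895 ε points (8:1:1).
mp_train, mp_test = split_indices(1251, (4, 1))
eps_train, eps_val, eps_test = split_indices(895, (8, 1, 1))
```

Fixing the seed keeps the partition reproducible, which makes it easier to compare the KDE curves of the subsets against the full dataset.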
First, missing values in the dataset are detected and processed. Since missing values are inevitable in real-world data collection, using the data directly for modelling affects the accuracy and stability of the model. Therefore, the mean imputation method is applied to fill missing values in numerical features. Specifically, the SimpleImputer class from the sklearn.impute module41 is employed to replace missing values in each column with the mean of the available values.
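To make the imputation step concrete, the following is a pure-Python sketch of column-wise mean imputation. It illustrates the idea only and is not the scikit-learn implementation used in the paper; `impute_column_means` is a hypothetical helper, and the sketch assumes every column has at least one observed value.

```python
import math

def impute_column_means(rows):
    """Replace NaN entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))  # assumes at least one observed value
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]

nan = float("nan")
features = [[1.0, 10.0],
            [nan, 20.0],
            [3.0, nan]]
filled = impute_column_means(features)
```

In this toy matrix the missing entry in the first column becomes 2.0 and the missing entry in the second column becomes 15.0.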
Next, low-variance feature selection is performed on the completed dataset. In modelling, low-variance features typically fluctuate minimally across samples and provide limited explanatory power for the target variable, potentially introducing redundant information. Based on this, the VarianceThreshold class from the sklearn.feature_selection module is applied to remove features with variance below 0.01. This step helps to reduce feature dimensionality and improve both training speed and model generalization.
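The variance filter can be sketched in plain Python as follows. This is an illustration of the idea, not the scikit-learn class itself: it assumes the population variance is used and that features are kept only when their variance exceeds the 0.01 threshold quoted in the text. The feature names and values are hypothetical.

```python
from statistics import pvariance

def low_variance_filter(rows, names, threshold=0.01):
    """Keep only the columns whose population variance exceeds `threshold`."""
    keep = [j for j in range(len(names))
            if pvariance([row[j] for row in rows]) > threshold]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names

# Hypothetical feature matrix: one constant column, one nearly constant, one informative.
names = ["is_organic", "rare_flag", "NumRotatableBonds"]
rows = [[1.0, 0.0, 5.0],
        [1.0, 0.1, 7.0],
        [1.0, 0.0, 9.0]]
reduced_rows, reduced_names = low_variance_filter(rows, names)
```

Only `NumRotatableBonds` survives here, since the other two columns barely change across samples.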
Furthermore, to address potential multicollinearity among features, we computed the Pearson correlation coefficients for all feature pairs and visualized the results using a correlation heatmap (Fig. S5–S10). Feature pairs with absolute correlation coefficients exceeding a predefined threshold were considered strongly correlated. In such cases, only one feature from each pair was retained for subsequent modelling. This procedure was adopted to prevent highly correlated features from jointly influencing the model, which could lead to unstable parameter estimates and potential overfitting. Fig. 2b shows Pearson correlation heatmaps of the engineered features.42
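One way to implement this pruning is a greedy pass over the features, sketched below with the standard library only. The threshold of 0.9, the greedy keep-first rule, and the example values are assumptions for illustration; the paper states only that a predefined threshold was used and that one feature of each strongly correlated pair was retained. The sketch also assumes no column is constant, which holds after the variance filter.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep a feature only if |r| <= threshold with every feature already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical values; MolWt and HeavyAtomMolWt are almost perfectly correlated.
columns = {
    "MolWt": [74.1, 90.1, 88.1, 118.1, 102.1],
    "HeavyAtomMolWt": [68.0, 80.0, 80.0, 108.0, 92.0],
    "NumRotatableBonds": [0, 3, 1, 0, 2],
}
selected = drop_correlated(columns)
```

With these values `HeavyAtomMolWt` is dropped because it is nearly collinear with `MolWt`, while `NumRotatableBonds` is retained. Which member of a pair is kept depends on feature order, so in practice the ordering would need to reflect domain priorities.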
Finally, feature importance is assessed using a preliminary XGB model (Fig. S11–S13). Importance scores are derived from the internal gain-based metric, which quantifies the total information gain a feature contributes across all tree splits during model construction. Features with low importance scores are removed to reduce dimensionality and mitigate noise, except in the case of the ANN model, where such importance-based filtering is not directly applicable due to its non-tree-based architecture. Based on the above-mentioned feature engineering process, a representative subset of core features is selected as the final feature set (Fig. 2c), resulting in 17 features for melting-point prediction, 18 features for boiling-point prediction, and 61 features for dielectric constant prediction.
Table 1 Performance of the evaluated models for MP prediction
| Model | Training set R2 | Training set RMSE (K) | Test set R2 | Test set RMSE (K) | Test set (cross-validation) R2 | Test set (cross-validation) RMSE (K) |
|---|---|---|---|---|---|---|
| DT | 0.9028 | 21.60 | 0.7805 | 32.36 | 0.7664 | 33.46 |
| BG | 0.9368 | 17.41 | 0.8498 | 26.77 | 0.8402 | 27.67 |
| RF | 0.9795 | 9.92 | 0.8722 | 24.69 | 0.8642 | 25.51 |
| ET | 0.9530 | 15.02 | 0.8467 | 27.04 | 0.8432 | 27.41 |
| ADBR | 0.7329 | 30.46 | 0.7617 | 28.00 | 0.7017 | 37.81 |
| GBR | 0.9814 | 9.45 | 0.8546 | 26.34 | 0.8611 | 25.19 |
| XGB | 0.9795 | 9.91 | 0.8856 | 23.41 | 0.8864 | 23.30 |
| ANN | 0.9447 | 11.83 | 0.8193 | 22.90 | 0.8463 | 19.84 |
Table 2 Performance of the evaluated models for BP prediction
| Model | Training set R2 | Training set RMSE (K) | Test set R2 | Test set RMSE (K) | Test set (cross-validation) R2 | Test set (cross-validation) RMSE (K) |
|---|---|---|---|---|---|---|
| DT | 0.9581 | 15.10 | 0.8998 | 21.54 | 0.8635 | 26.85 |
| BG | 0.9695 | 12.89 | 0.9316 | 17.79 | 0.9187 | 20.73 |
| RF | 0.9896 | 7.51 | 0.9461 | 15.80 | 0.9359 | 18.40 |
| ET | 0.9815 | 10.05 | 0.9403 | 16.62 | 0.9379 | 18.11 |
| ADBR | 0.8077 | 27.52 | 0.7552 | 28.10 | 0.7938 | 33.03 |
| GBR | 0.9813 | 10.08 | 0.9440 | 16.10 | 0.9398 | 18.19 |
| XGB | 0.9787 | 10.78 | 0.9550 | 14.43 | 0.9608 | 14.30 |
| ANN | 0.9751 | 11.48 | 0.9323 | 18.46 | 0.9513 | 15.90 |
The XGB regression model is optimized to improve the accuracy of MP and BP predictions. Key hyperparameters are tuned using grid search combined with five-fold cross-validation (Fig. S11 and S20). Model performance is evaluated using cross-validated RMSE and R2 scores. The MP model achieves an R2 of 0.8864 and an RMSE of 23.3 K under five-fold cross-validation (Table 1). The BP model performs better still, reaching an R2 of 0.9608 and an RMSE of 14.3 K (Table 2). In contrast, ε depends more strongly on electronic descriptors (including dipole-related features, partial charge distributions, and energy-state indices), which introduce complex higher-order nonlinear relationships. The ANN captures such nonlinear mappings in high-dimensional spaces more effectively and achieves the highest predictive accuracy among all evaluated models, with an R2 of 0.8718 and an RMSE of 6.7 under five-fold cross-validation (Table 3). Implemented in Keras, the network adopts a simple yet effective architecture with two fully connected hidden layers and a single output node. The input layer receives a 61-dimensional feature vector generated from molecular features, covering the electronic properties, molecular structure, and chemical composition. Each hidden layer contains 300 neurons with ReLU activation to enhance non-linear representation and mitigate vanishing gradients. To improve generalisation and prevent overfitting, each hidden layer is followed by a dropout layer with a rate of 10%. Weights are initialized from a random normal distribution, and L2 regularization (λ = 0.001) is applied to limit model complexity. The model is trained using the mean squared error (MSE) loss function and optimized with the Adam algorithm to ensure stable and efficient convergence.
The training and validation loss curves show rapid initial convergence and sustained stability during training epochs, with minimal differences between training and validation losses (Fig. S39).
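As a quick check on the size of the network described above, the number of trainable parameters can be worked out from the layer widths. The sketch assumes every dense layer carries a bias term (the Keras default) and that dropout adds no trainable parameters; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 61 input features, two hidden layers of 300 neurons, one output neuron.
layer_sizes = [61, 300, 300, 1]
total_params = dense_param_count(layer_sizes)
# 61*300 + 300 = 18 600; 300*300 + 300 = 90 300; 300*1 + 1 = 301
```

Under these assumptions the network has 109 201 trainable parameters for 895 ε data points, which helps explain why dropout and L2 regularization are applied.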
Table 3 Performance of the evaluated models for ε prediction
| Model | Training set R2 | Training set RMSE | Test set R2 | Test set RMSE | Test set (cross-validation) R2 | Test set (cross-validation) RMSE |
|---|---|---|---|---|---|---|
| DT | 0.5330 | 13.18 | 0.4715 | 13.14 | 0.5007 | 13.46 |
| BG | 0.8704 | 6.94 | 0.7022 | 9.87 | 0.7421 | 9.67 |
| RF | 0.9687 | 3.41 | 0.7285 | 9.42 | 0.7826 | 8.88 |
| ET | 0.9725 | 3.20 | 0.5927 | 11.54 | 0.7662 | 9.21 |
| ADBR | 0.5878 | 12.38 | 0.3345 | 12.43 | 0.3471 | 15.12 |
| GBR | 0.9856 | 2.31 | 0.7538 | 8.97 | 0.7904 | 8.17 |
| XGB | 0.9090 | 5.81 | 0.6743 | 10.32 | 0.7428 | 8.95 |
| ANN | 0.9779 | 2.85 | 0.8870 | 7.91 | 0.8718 | 6.70 |
To further benchmark the model performance, a classical group contribution (GC) baseline was implemented using a Joback-type correlation. A validation subset of 47 structurally diverse molecules was selected from the curated database, covering linear, cyclic, aromatic, and heteroatom-containing species (O, N, S, and halogens), as well as functional groups such as ethers and carbonyls. For each molecule, the GC-estimated MP and BP were computed and compared against the corresponding experimental measurements and ML predictions. For both properties, the ML models consistently achieved significantly lower RMSE and mean absolute error (MAE) than the GC baseline, demonstrating their superior accuracy and generalizability (Tables S7 and S8). Fig. 3 compares the predicted and actual values for the three final models.
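For reference, the error metrics used in this comparison follow their standard definitions, sketched below in plain Python. The temperature values are made up for illustration and are not data from the paper.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical boiling points in K: every prediction is off by exactly 10 K.
y_true = [200.0, 250.0, 300.0, 350.0]
y_pred = [210.0, 240.0, 310.0, 340.0]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

In this example RMSE and MAE are both 10 K and R2 is 0.968. RMSE exceeds MAE whenever the errors are unevenly distributed, so reporting both gives a sense of how much a few large misses contribute.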
Fig. 3 Predicted versus actual values for (a) the MP, (b) the BP using the XGBoost algorithm, and (c) ε using an artificial neural network.
In the t-SNE plot based on MPs and BPs (Fig. 4a and b), distinct clustering patterns emerge across the three structural types. Aromatic compounds form a compact cluster in the left region, reflecting the rigidity and symmetry conferred by aromatic rings and conjugated systems, which contribute to higher and more consistent MPs (BPs).49,50 Linear compounds are dispersed across the right side of the plot with weak clustering, attributed to variations in chain length, branching, and functional groups, resulting in widely scattered thermal properties.51,52 Cyclic compounds occupy the lower-central region, showing intermediate clustering behaviour. Although lacking aromatic stabilization, their ring-induced rigidity still imparts some packing regularity. It can also be observed that the boundaries between the three structural types are less distinct.
Aromatic compounds exhibit broader dispersion and overlap with cyclic compounds due to the influence of polar substituents (e.g., hydroxyl and carboxyl groups), which enhance hydrogen bonding and increase the variability of MPs and BPs (Fig. 4a and b). Linear compounds remain on the right with increased density but fuzzy borders, reflecting the complex nonlinear interactions among chain length, polarity, and branching.53 These observations suggest that while structure–property associations are evident, MPs and BPs offer limited discriminative power for structural classification in certain cases. Violin plots further support the observed clustering patterns, revealing significant differences in MP and BP distributions across structural classes (Fig. S41 and S42). Aromatic compounds exhibit narrow and high-centered distributions, reflecting the inherent rigidity and symmetry of their conjugated ring systems. In contrast, linear compounds show broad and multimodal distributions, indicative of substantial structural diversity and corresponding variability in thermal properties. Cyclic compounds exhibit intermediate behaviour in both spread and central tendency. Overall, the distribution of MPs and BPs is closely linked to molecular polarity and functional group composition. Strongly polar compounds (e.g., carboxylic acids and phenols) tend to exhibit high MPs and BPs due to enhanced intermolecular interactions. Compounds of moderate polarity (e.g., aldehydes and ketones) occupy an intermediate range, while nonpolar molecules (e.g., hydrocarbons and ethers) cluster at the lower end of the spectrum. The inclusion of heteroatoms such as nitrogen and sulfur further amplifies polarity differences, broadening the overall property distribution.
For ε, in contrast, the clustering distinction among molecular structural types is significantly weakened (Fig. 4c). The overall distribution is highly scattered, with substantial overlap across aromatic, linear, and cyclic compounds and no distinct boundaries. Aromatic compounds are loosely scattered in the upper region of the map, while linear and cyclic structures extend across the entire region. This distribution indicates that ε is primarily influenced by electronic structure, especially charge distribution, conjugation effects, and polar substituents, rather than by the molecular backbone or topology.54–56 Linear compounds typically contain a wide range of polar functional groups and exhibit a particularly broad spread in ε, ranging from nonpolar alkanes (ε ≈ 1) to highly polar amines and carboxylic acids (ε > 30). Cyclic compounds exhibit similarly diffuse distributions, falling between the two extremes and contributing to the overall overlap. Violin plots further confirm this trend: oxygen-rich, highly polar compounds (e.g., alcohols, carboxylic acids, and phenols) consistently show elevated ε (30–50), while nonpolar species such as aromatics and alkanes cluster in the low-permittivity region (ε = 1–3). Heteroatom-containing compounds (e.g., amides and sulfones) exhibit wide variability due to their diverse polar characteristics (Fig. S43). These results indicate that ε primarily captures the electronic responsiveness of molecules, with limited correlation to structural symmetry or geometry.
Fig. 5 Interpretation of the machine learning models for the MP (a) and the BP (b) using the SHAP algorithm.
In the MP model (Fig. 5a), chemical composition features play a dominant role with heavy-atom molecular weight (HeavyAtomMolWt) emerging as the single most influential feature (mean |SHAP| = 0.26), while the impact of the number of heteroatoms (Het, 0.10) and the ratio of oxygen atoms to carbon atoms (O/C, 0.04) is also pronounced. Critically, the O/C ratio shows structure-dependent behaviour: in highly polar, oxygen-rich scaffolds, it tends to increase the MP, whereas in other contexts, it may exert a negative influence. For BP prediction (Fig. 5b), chemical-composition features exert an even more decisive influence: three of the five most predictive variables encode molecular size and elemental constitution, namely the number of heavy atoms (Heavy, 0.20), hydrogen-bond donor count (0.18), and HeavyAtomMolWt (0.15). Elevated HeavyAtomMolWt values are associated with positive SHAP contributions, implying that larger and heavier molecules are assigned higher predicted BPs. The analogous behaviour of hydrogen bond donor counts underscores the pivotal role of hydrogen bonding in enhancing cohesive interactions. This comparative analysis reveals that the BP is governed primarily by global molecular properties, whereas the MP is more sensitively modulated by specific functional group interactions.
In MP prediction (Fig. 5a), structural features account for ≈38.5% of explanatory power. The maximum number of atoms of the ring (Max Ring Size, 0.09) and the number of aromatic rings (NumAromaticRings, 0.08) confer moderate positive contributions, suggesting that extended ring systems and aromaticity enhance structural rigidity and thus the MP. The number of rotatable bonds (NumRotatableBonds, 0.05) shows a weaker negative contribution, consistent with the notion that melting involves the disruption of crystal packing, which is a process less sensitive to conformational flexibility than vaporization. For BP prediction (Fig. 5b), structural features contribute approximately 30–35% to the predictability. Here, increased NumRotatableBonds (0.09) yields more pronounced negative SHAP values, indicating that highly flexible molecules (e.g., long-chain alkanes) more readily access conformational freedom, thereby lowering the BP. Conversely, Max Ring Size and NumAromaticRings exert positive contributions, attesting to the beneficial influence of molecular rigidity and π–π stacking on elevating the BP.
In MP prediction (Fig. 5a), electronic features account for ≈29% of explanatory power. The maximum value of the electron state exponent for all atoms (MaxEStateIndex, 0.20) and the minimum partial charge of an atom (MinPartialCharge, 0.17) emerge among the most influential features, underscoring the pronounced role of localized extreme charge sites in intensifying intermolecular electrostatic attraction. This dominant electronic contribution highlights the exceptional relevance of electronic properties to crystal stability, as optimal lattice packing necessitates precise electrostatic complementarity. By comparison, electronic structure features contribute modestly (≈19%) to global BP predictability while still furnishing critical local interpretability for polar architectures (Fig. 5b). Positive correlations observed for MaxEStateIndex and average electronegativity (AvgX) indicate that enhanced electron delocalization or polarity strengthens intermolecular electrostatic interactions, thereby elevating the BP.
Taken together, these patterns define a distinct importance profile for each model. In the MP model, chemical-composition features (≈32.5%) drive variations by modulating polarity and intermolecular interaction strength, electronic structure features collectively contribute ≈29%, and structural features account for the remaining ≈38.5%. This configuration indicates that MP changes are governed primarily by the intensity of intermolecular forces, jointly determined by molecular composition and electronic structure (Fig. 5a).
The BP model presents a fundamentally distinct hierarchy: chemical-composition variables dominate (≈58% of explained variance), primarily encoding molecular size and hydrogen bond capacity. Structural features contribute 23% with aromaticity and ring architecture being paramount, while electronic structural features contribute ≈19% (Fig. 5b). This contrast signifies that the BP is determined primarily by global molecular attributes, whereas the MP is governed by electronic and structural considerations.
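Category-level percentages of this kind can be obtained by summing the mean |SHAP| values of the features in each group and normalising by the total. The sketch below shows that aggregation under stated assumptions: the feature-to-category mapping is simplified, the numerical values are purely illustrative, and the paper does not state that its percentages were computed in exactly this way.

```python
def category_shares(mean_abs_shap, category_of):
    """Sum mean |SHAP| per feature category and express each sum as a percentage."""
    totals = {}
    for feature, value in mean_abs_shap.items():
        category = category_of[feature]
        totals[category] = totals.get(category, 0.0) + value
    grand_total = sum(totals.values())
    return {category: 100.0 * value / grand_total
            for category, value in totals.items()}

# Illustrative mean |SHAP| values (hypothetical, not the values reported in Fig. 5).
mean_abs_shap = {
    "HeavyAtomMolWt": 0.30,
    "NumHDonors": 0.20,
    "NumRotatableBonds": 0.25,
    "MaxEStateIndex": 0.25,
}
category_of = {
    "HeavyAtomMolWt": "composition",
    "NumHDonors": "composition",
    "NumRotatableBonds": "structure",
    "MaxEStateIndex": "electronic",
}
shares = category_shares(mean_abs_shap, category_of)
```

With these illustrative numbers, composition accounts for 50% of the total attribution and the other two categories for 25% each.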
In summary, based on the t-SNE clustering, violin plot distributions, and SHAP feature importance analysis, this study identifies the key structural features associated with excellent low-temperature electrolyte performance. First, molecules with high conformational flexibility, as reflected by higher NumRotatableBonds values and the scattered t-SNE distribution of linear structures, tend to have lower melting points (MPs) and boiling points (BPs). Second, moderately polar functional groups, such as ethers and carbonyl units, can enhance the dielectric constant (ε) while avoiding the excessive hydrogen bonding that elevates MPs and BPs, a trend supported by the SHAP contributions of the O/C ratio and MinPartialCharge. Third, electronic features such as distributed partial charges and moderate electronic state indices, captured by MaxEStateIndex and MinPartialCharge, facilitate low melting transitions by weakening structural stability. In contrast, rigid aromatic and macrocyclic systems, which cluster in the high MP/BP region in t-SNE plots and exhibit positive SHAP contributions, are generally detrimental to low-temperature electrolyte performance due to enhanced molecular rigidity and strong cohesive interactions. These insights provide practical structural guidance for designing electrolyte molecules with improved low-temperature performance.
Fig. 6 Workflow for identifying low-temperature electrolyte candidates. (a) Local exploration around DOL and DMS. (b) High-throughput screening. (c) Nine candidate electrolyte molecules.
To further evaluate their practical relevance, the nine screened candidate molecules were compared with existing experimental data. The deviations fall within the RMSE range of the model (Tables S4 and S5). We have also compared the candidate molecules with widely used low-temperature solvents, such as 1,3-dioxolane (DOL), 1,2-dimethoxyethane (DME), and 2-methyltetrahydrofuran (MeTHF). These molecules are generally consistent with the criteria we established, which confirms the rationality of our screening process and demonstrates the potential of the nine selected molecules as low-temperature electrolytes. Although this work focuses on large-scale ML-based screening, further validation is planned in future studies, focusing primarily on the candidate molecules with available CAS numbers. Beyond the primary screening properties, we have reported the HOMO and LUMO values of the nine electrolyte molecules (Table S4) and compared them with those of commercial low-temperature electrolyte molecules to gain further insights into their electrochemical stability (Table S5). The results confirm the adequacy of the electrochemical window for the nine screened molecules. In future studies, we intend to incorporate HOMO and LUMO descriptors into the large-scale screening workflow.
000 molecules from multiple sources. The structure–property relationships for MPs, BPs, and ε are systematically analysed. XGB models are employed for MP and BP prediction due to their effectiveness in feature selection and segmented fitting, achieving R2 values of 0.8864 and 0.9608 and RMSEs of 23.3 K and 14.3 K, respectively, under five-fold cross-validation. In contrast, an ANN is adopted to model ε, which shows strong nonlinearity with respect to molecular polarity and electron distribution; it achieves an R2 of 0.8718 and an RMSE of 6.7. t-SNE visualization reveals that the distributions of the MP and BP are closely related to molecular polarity and functional group composition, while ε primarily reflects electronic response characteristics. SHAP analysis further confirms that the BP depends on global molecular features such as size and hydrogen-bonding capacity, whereas the MP is influenced by intermolecular interactions. These insights not only validate the predictive models but also provide actionable guidance for rational electrolyte design. Finally, by combining molecular neighbourhood search with high-throughput screening, nine candidate molecules are identified as promising low-temperature electrolytes for lithium-ion batteries. Overall, this work establishes an efficient, interpretable, and generalizable data-driven framework that accelerates molecular screening and reduces experimental cost in the rational design of advanced electrolytes for low-temperature conditions.
All other relevant data are available from the corresponding author upon reasonable request.
This journal is © The Royal Society of Chemistry 2026